Announcing libcsv v0.3: A C Library for Processing CSV Files

I have recently published Version 0.3 of libcsv, a CSV library for the C programming language. The archive for this version can be found here. The repo for the libcsv project is shown below:


About libcsv

libcsv is a C library that allows a user to validate, read, and write CSV files, converting between the serialized CSV format and a tabular representation readable by statistical algorithms. libcsv also provides SQL database operations for operating on this tabular representation. You can read more about libcsv here.


What's new in this version

Version 0.3 adds the ability to work with sets and to use those sets to select subsets of a table's rows. This is essentially an implementation of the row filtering that SQL's SELECT command performs with a WHERE clause. The following is a brief overview of how to use these new features.

There are three steps to selecting a subset of the rows of a table. First, use a numerical, string, or modular operator to define a csv_set variable representing all rows of the table that satisfy a certain condition. Second, if desired, use the set-theoretic functions provided by csv_set.c to combine that set with other sets into more complex Boolean expressions. Third, use the resulting csv_set variable to generate a new table containing only the rows of the original table that match the condition given by that expression.

To extract a csv_set from a table, use the csv_select_subset() function, declared in csv.h and defined in csv_select.c. Its syntax looks like this:

csv_set *csv_select_subset( csv_table *table, enum operators op, char *op1, char *op2 )

The arguments to this function are the table you want to select rows from, an enum code for the operator (explained shortly), and two operands for the operator. The first operand is typically the name of a field, while the second operand is typically a string representation of the numeric or string value that the field is compared against. The only exception is the MOD operator, in which case both op1 and op2 should be string representations of integers. Currently the function provides only minimal error checking, so it should be used with caution until I implement more rigorous error checking in a future version.

The codes for the operator are as follows:

  • EQ for number field op1 == numeric value op2

  • NE for number field op1 != numeric value op2

  • LT for number field op1 < numeric value op2

  • GT for number field op1 > numeric value op2

  • LE for number field op1 <= numeric value op2

  • GE for number field op1 >= numeric value op2

  • MOD for row number % op1 == op2

  • SEQ for string field op1 == string value op2

  • SNE for string field op1 != string value op2

For MOD, remember that the row numbers start at 0, so if, for example, you use "2" for op1 and "0" for op2, you will end up selecting the first, third, fifth, etc. rows of the table. This is somewhat counter-intuitive, but it makes the csv_set operations easier to implement.
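The operator semantics can be modeled in a few lines of plain C. This is only a sketch that follows the usual reading of each code (LT as less than, GT as greater than), not libcsv's actual implementation: the toy_op enum, the toy_row_matches() function, and the use of strtod() and atoi() for parsing the operand strings are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the operator codes; the real enum is
   declared in csv.h and may be laid out differently. */
enum toy_op { TOY_EQ, TOY_NE, TOY_LT, TOY_GT, TOY_LE, TOY_GE,
              TOY_MOD, TOY_SEQ, TOY_SNE };

/* Returns true if a single row matches the condition.
   row   - zero-based row number (only used by TOY_MOD)
   field - text of the cell in the column named by op1 (ignored by TOY_MOD)
   op1   - modulus as a string for TOY_MOD; otherwise not inspected here
   op2   - value to compare against, as a string */
bool toy_row_matches(enum toy_op op, int row, const char *field,
                     const char *op1, const char *op2)
{
    assert(op2 != NULL);

    if (op == TOY_MOD) {
        assert(op1 != NULL);
        int modulus = atoi(op1);
        /* Guard against a zero modulus before taking the remainder. */
        return modulus != 0 && row % modulus == atoi(op2);
    }

    assert(field != NULL);
    if (op == TOY_SEQ) return strcmp(field, op2) == 0;
    if (op == TOY_SNE) return strcmp(field, op2) != 0;

    /* The remaining operators compare the field numerically. */
    double lhs = strtod(field, NULL);
    double rhs = strtod(op2, NULL);
    switch (op) {
    case TOY_EQ: return lhs == rhs;
    case TOY_NE: return lhs != rhs;
    case TOY_LT: return lhs <  rhs;
    case TOY_GT: return lhs >  rhs;
    case TOY_LE: return lhs <= rhs;
    case TOY_GE: return lhs >= rhs;
    default:     return false;
    }
}
```

With "2" for op1 and "0" for op2, the TOY_MOD branch is true for row numbers 0, 2, 4, and so on, which are the first, third, and fifth rows of the table.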

The csv_select_subset() function returns a pointer to a csv_set variable, which you can combine with other csv_set variables using the following functions:

void csv_set_difference( csv_set *dst, csv_set *src )

Computes the set difference between the two operands, subtracting src from dst and storing the result in dst

void csv_set_complement( csv_set *dst )

Computes the complement of dst and stores it in dst

void csv_set_union( csv_set *dst, csv_set *src )

Computes the set union of src and dst and stores the result in dst

void csv_set_intersection( csv_set *dst, csv_set *src )

Computes the set intersection of src and dst and stores the result in dst

You can also add _f to the end of the function name to have the function implicitly free its operands before returning the result (for the rationale behind this paradigm, see the introduction to libdfloat).
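To make the in-place convention of these four functions concrete, here is a minimal sketch that models a set as a fixed-size array of flags, one per row. The toy_set type, the TOY_ROWS size, and the toy_set_* functions are hypothetical names for illustration; the real csv_set type in csv_set.c may be represented quite differently.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ROWS 6  /* hypothetical fixed table size for this sketch */

/* Hypothetical model of a row set: member[i] is true when row i is included. */
typedef struct { bool member[TOY_ROWS]; } toy_set;

/* dst becomes dst minus src; src is left unchanged. */
void toy_set_difference(toy_set *dst, const toy_set *src)
{
    assert(dst && src);
    for (int i = 0; i < TOY_ROWS; i++)
        dst->member[i] = dst->member[i] && !src->member[i];
}

/* dst becomes every row that was not in dst. */
void toy_set_complement(toy_set *dst)
{
    assert(dst);
    for (int i = 0; i < TOY_ROWS; i++)
        dst->member[i] = !dst->member[i];
}

/* dst becomes every row in dst or src; src is left unchanged. */
void toy_set_union(toy_set *dst, const toy_set *src)
{
    assert(dst && src);
    for (int i = 0; i < TOY_ROWS; i++)
        dst->member[i] = dst->member[i] || src->member[i];
}

/* dst becomes every row in both dst and src; src is left unchanged. */
void toy_set_intersection(toy_set *dst, const toy_set *src)
{
    assert(dst && src);
    for (int i = 0; i < TOY_ROWS; i++)
        dst->member[i] = dst->member[i] && src->member[i];
}
```

Because the result overwrites dst, a caller who still needs the original set has to copy it first. That is the main practical consequence of this calling convention.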

You can test the membership of an element (represented by an integer) in a set by using the csv_set_member() function, and you can add and remove elements using csv_set_add() and csv_set_del(). These functions are described briefly below:

void csv_set_add( csv_set *set, int element )

Adds the given element to the given set

void csv_set_del( csv_set *set, int element )

Removes the given element from the given set

bool csv_set_member( int element, csv_set *set )

Returns true if the given element is a member of the given set, false otherwise

Note: I realize that the argument order is inconsistent here, since csv_set_member() takes the element first while csv_set_add() and csv_set_del() take the set first. I plan to make these consistent in a later version.
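The element-level functions can be sketched the same way. The idx_set type, the IDX_CAPACITY limit, and the idx_set_* names below are hypothetical, and silently ignoring out-of-range elements is an assumption of this sketch rather than documented libcsv behavior. The argument order mirrors the signatures listed above.

```c
#include <assert.h>
#include <stdbool.h>

#define IDX_CAPACITY 16  /* hypothetical upper bound on row numbers */

/* Hypothetical model: present[i] is true when element i is in the set. */
typedef struct { bool present[IDX_CAPACITY]; } idx_set;

static bool idx_in_range(int element)
{
    return element >= 0 && element < IDX_CAPACITY;
}

/* Adds the element; out-of-range elements are ignored in this sketch. */
void idx_set_add(idx_set *set, int element)
{
    assert(set);
    if (idx_in_range(element))
        set->present[element] = true;
}

/* Removes the element; out-of-range elements are ignored in this sketch. */
void idx_set_del(idx_set *set, int element)
{
    assert(set);
    if (idx_in_range(element))
        set->present[element] = false;
}

/* Element first, set second, matching the order of csv_set_member(). */
bool idx_set_member(int element, const idx_set *set)
{
    assert(set);
    return idx_in_range(element) && set->present[element];
}
```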

Once you have the set that you want, you can generate a subset table from that set using csv_select_records_by_subset(), which has the following syntax:

csv_table *csv_select_records_by_subset( csv_table *table, csv_set *subset )

You can also generate a partition of a data set from a csv_set variable. The csv_partition_table_by_subset() function generates a csv_partition variable that contains pointers to two tables - one containing all rows that match and the other containing all rows that don't match. These are accessed through the ident and cplmt fields of the csv_partition type, respectively.

csv_partition *csv_partition_table_by_subset( csv_table *table, csv_set *subset )

Example code

Here is an example of dividing a table into training data and testing data by selecting the fourth and fifth of every five rows of the table and generating a partition:

csv_set *fourth = csv_select_subset( data, MOD, "5", "3" );
csv_set *fifth = csv_select_subset( data, MOD, "5", "4" );
csv_set_union( fourth, fifth );
csv_partition *data_part = csv_partition_table_by_subset( data, fourth );
csv_table *training_data = data_part->cplmt;
csv_table *testing_data = data_part->ident;
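To check which rows end up on each side of this partition, the same selection can be reproduced on bare row numbers. The sketch below does not call libcsv at all. toy_partition_rows() is a hypothetical helper that applies the two MOD conditions from the example and sorts row numbers into an ident list (rows in the set) and a cplmt list (all other rows).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: splits row numbers 0..n_rows-1 into those with
   row % 5 == 3 or row % 5 == 4 (ident) and all others (cplmt).
   Both output arrays must have room for n_rows entries. */
void toy_partition_rows(size_t n_rows,
                        size_t *ident, size_t *n_ident,
                        size_t *cplmt, size_t *n_cplmt)
{
    assert(ident && n_ident && cplmt && n_cplmt);
    *n_ident = 0;
    *n_cplmt = 0;
    for (size_t row = 0; row < n_rows; row++) {
        /* Union of the two MOD selections used in the example. */
        bool in_set = (row % 5 == 3) || (row % 5 == 4);
        if (in_set)
            ident[(*n_ident)++] = row;
        else
            cplmt[(*n_cplmt)++] = row;
    }
}
```

For a table of 10 rows, ident receives rows 3, 4, 8, and 9 (the testing data) and cplmt receives rows 0, 1, 2, 5, 6, and 7 (the training data), a 40/60 split.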