libcsv: A C Library for Processing CSV Data Using SQL Database Operations

This article is an introduction to the libcsv project, which I am hosting on GitHub. You can find the project's repository here.


What is libcsv?

libcsv is a CSV library for the C programming language. It provides C programmers with a clean and consistent interface for validating, reading, and writing CSV files and operating on CSV tables using SQL database operations. It is specifically designed with statistical and machine learning algorithms in mind. It uses only the C Standard Library and libdfloat as a basis, so the code is portable across different operating systems and APIs.


What does libcsv do?

libcsv provides the following functionality:

  • Validate CSV files to make sure the CSV code is well-formed before parsing it into a data structure.

  • Parse and deserialize CSV code into a table data structure that can be used by statistical algorithms.

  • Serialize a table structure into a textual format and write it back to the CSV file.

  • Traverse the records of a table structure so they can be accessed one-by-one.

  • Get and set the values of individual fields in the current record.

  • Create new tables, and drop tables when they are no longer needed.

  • Insert and delete records in a table.

  • Create a new table from a subset of an existing table based on numerical, string, and Boolean conditions (upcoming).

  • Partition a table into two subset tables in one operation (upcoming).

  • Create and manipulate set types directly and use them to select subsets of tables (upcoming).


How to use libcsv:

Let's look at a typical use case implementing a machine learning algorithm using CSV as an underlying data interchange format. This use case involves reading training data from a CSV file, using it to train and test a machine learning algorithm, and then computing new data points before adding them to the CSV file.

Before parsing a CSV file into a data structure, it is important to validate it first to make sure it contains valid CSV code. Otherwise, the parser might run into errors while reading badly formed code, which can cause segfaults and other problems that are very hard to diagnose. You validate a CSV file with the csv_validate_file() function, declared in csv.h (the single header file that defines all externally visible types and functions). Its prototype looks like this:

bool csv_validate_file( FILE *fp, bool has_header );

The validator function reads the file pointed to by fp from start to end, processing a header if has_header is true, and returns true if the file is valid CSV code and false if the file is invalid. The code you would use would look something like this:

if( !csv_validate_file( fp, true ) ){
    fprintf( stderr, "Invalid CSV code.\n" );
    exit( EXIT_FAILURE );  /* EXIT_FAILURE from <stdlib.h> */
}

Use this code before reading any CSV file to ensure that the program will terminate with an error message when given an invalid input file.
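
To make "well-formed" a little more concrete: this article doesn't spell out the exact rules csv_validate_file() enforces, but one core part of any CSV check is the RFC 4180 quoting convention, in which a field may be enclosed in double quotes and a literal quote inside a quoted field must be doubled. The following self-contained sketch (the function name is hypothetical, not part of libcsv) checks a single line for balanced quoting under that assumption:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: check one CSV line for RFC 4180-style quoting.
 * Inside a quoted field, a literal double quote must be doubled ("").
 * Returns false if a quoted field is left unterminated. */
static bool csv_line_quotes_balanced( const char *line )
{
    bool in_quotes = false;
    for( size_t i = 0; line[i] != '\0'; i++ ){
        if( line[i] == '"' ){
            if( !in_quotes ){
                in_quotes = true;           /* opening quote */
            } else if( line[i + 1] == '"' ){
                i++;                        /* escaped quote inside a quoted field */
            } else {
                in_quotes = false;          /* closing quote */
            }
        }
    }
    return !in_quotes;                      /* still inside a field: malformed */
}
```

A full validator must also check things like consistent field counts per record, but the quoting state machine above is the piece that most often trips up naive parsers.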

After validating the input file, you can read it into a table structure using the csv_read_table() function, whose prototype looks like this:

csv_table *csv_read_table( FILE *fp, bool has_header );

This function is structured very similarly to the validator function, except that instead of returning true or false it returns a pointer to a csv_table data structure containing all the data read from the file. libcsv is designed in such a way that you never have to access the internal implementation of a csv_table structure. All table operations are handled using functions that take care of all the underlying pointer operations for you.

Technically, a csv_table structure is allocated entirely on the heap, using malloc(). You never have to malloc it yourself, though; csv_read_table() does this internally. Furthermore, you never have to explicitly free any of the pointers used by a csv_table structure, as the csv_drop_table() function frees all of them for you. I designed libcsv so that it frees the programmer from ever having to do pointer arithmetic, as this can result in much messier, buggier code. libcsv functions handle all the nitty-gritty pointer details for you and provide a clean interface for operating on CSV tables.

A csv_table structure is accessed through a special handle called the Current Record Pointer, or CRP. Again, you never have to access this pointer directly, nor is doing so recommended. Instead, advance or rewind the pointer using csv_next_record() and csv_rewind(), and all functions that access an individual record will automatically use whatever record is pointed to by the CRP.

csv_next_record() takes a table as its single argument and transparently advances the CRP for that table to the next record, returning NULL if the end of the table has been reached. This allows you to write code to seamlessly enumerate all records in a table like this:

csv_rewind( table );
while( csv_next_record( table ) ){
    // process current record
}

There are several functions for operating on the current record. These include functions for deleting the current record from the table, and for setting and getting fields either by field name or field index. There are two types for fields: csv_string, which is implemented as a char * allocated on the heap, and csv_number, which is implemented as a dfloat64_t * (see the documentation for libdfloat), also allocated on the heap. The setter functions take either a string or a dfloat and write it to the field, assuming there isn't a type mismatch. The getter functions return the string or dfloat value for the addressed field. You can also add a new record, and even create a new table from scratch. An exhaustive list of all these functions is outside the scope of this introductory article, so see the libcsv documentation (DOC.md in the GitHub repo) for more information.

It is also possible to create a new table containing a subset of the rows of an existing table. This is roughly equivalent to the SELECT command in SQL. As of this writing, this functionality is not implemented in the current version of libcsv (Version 0.2.2), but Version 0.3 will add two new C modules: csv_set.c and csv_select.c, which will implement all the underpinnings of the SQL SELECT command.

csv_set.c implements a set data type known as csv_set. It functions much like a set type in Pascal, except it has arbitrary size. Each bit of the set type represents a row in the table, starting at 0 for the first row and incrementing upward until reaching the last row. A function is provided for extracting a csv_set representing all the rows of the table that match a given condition. This set is used as an intermediate step in generating the subset table, and before that step you can manipulate these sets using union, intersection, complement, and set difference functions in order to simulate the Boolean operations of a WHERE clause. Once you have the desired subset, you can generate a subset table from it using csv_select_records_from_subset(), which is defined in csv_select.c.
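
The internal layout of csv_set isn't public (and isn't released yet), but the bit-per-row idea can be sketched in portable C. Everything below, type and function names included, is illustrative rather than libcsv's actual API; it shows the word-wise union, intersection, complement, and difference operations such a module would build on:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SET_WORDS( nbits ) ( ( (nbits) + 63 ) / 64 )

/* Minimal sketch of a row set: bit i set means row i is a member.
 * A real implementation would size the array to the table at runtime;
 * a fixed capacity of 256 rows keeps this sketch simple. */
typedef struct { uint64_t bits[SET_WORDS( 256 )]; size_t nbits; } row_set;

static void set_init( row_set *s, size_t nbits )
{ memset( s->bits, 0, sizeof s->bits ); s->nbits = nbits; }

static void set_add( row_set *s, size_t row )
{ s->bits[row / 64] |= (uint64_t)1 << ( row % 64 ); }

static bool set_has( const row_set *s, size_t row )
{ return ( s->bits[row / 64] >> ( row % 64 ) ) & 1; }

/* The four operations the article lists, computed one word at a time. */
static void set_union( row_set *out, const row_set *a, const row_set *b )
{ for( size_t w = 0; w < SET_WORDS( a->nbits ); w++ ) out->bits[w] = a->bits[w] | b->bits[w]; }

static void set_intersect( row_set *out, const row_set *a, const row_set *b )
{ for( size_t w = 0; w < SET_WORDS( a->nbits ); w++ ) out->bits[w] = a->bits[w] & b->bits[w]; }

static void set_difference( row_set *out, const row_set *a, const row_set *b )
{ for( size_t w = 0; w < SET_WORDS( a->nbits ); w++ ) out->bits[w] = a->bits[w] & ~b->bits[w]; }

static void set_complement( row_set *out, const row_set *a )
{ for( size_t w = 0; w < SET_WORDS( a->nbits ); w++ ) out->bits[w] = ~a->bits[w]; /* bits >= nbits are never queried */ }
```

Combining these word-wise operations maps directly onto AND, OR, and NOT in a WHERE clause, which is presumably why the set type is a convenient intermediate representation.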

You can also partition a table into two complementary sets, using the csv_partition type and the csv_partition_table_by_subset() function. This function is almost identical to csv_select_records_from_subset() except that the records not matching the condition are copied to a second table. Both of these tables can be accessed through the csv_partition type returned by the function. This partitioning functionality is primarily designed to enable the splitting of a data set into a training set and a testing set. The typical way to do this would be using the modulus operator, which is one of the operators provided for extracting a subset from a table.
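
To illustrate the modulus idea in self-contained code (the function below is hypothetical, not libcsv's API): flagging every k-th row for the testing partition yields a deterministic split of roughly 1/k of the rows for testing and the rest for training:

```c
#include <stddef.h>

/* Hypothetical sketch: with n rows, put every k-th row (row index
 * divisible by k) in the testing partition and the rest in training.
 * Fills is_test[i] with 1 or 0 and returns the testing-set size. */
static size_t modulus_split( size_t n, size_t k, int *is_test )
{
    size_t test_count = 0;
    for( size_t i = 0; i < n; i++ ){
        is_test[i] = ( i % k == 0 );
        if( is_test[i] ) test_count++;
    }
    return test_count;
}
```

A condition like this is attractive for train/test splitting because it is reproducible across runs, unlike a random split.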

Version 0.3.1 of libcsv will add front-end functions to combine the set generation and table generation steps into one. These functions will take a mathematical expression as an argument and use it to construct all of the intermediate operations listed previously. These front-end functions will make the selection of subset tables and partitions much easier and more convenient than calling a function for each individual step. The old csv_select.c functions will still be available for those who want to get into more of the nitty-gritty details.

Once you've separated out the data in our sample use case, it's time to run the data analysis algorithm. This can be done by traversing the table using the functions shown previously and processing each row in turn. New data can be added to the table using the functions csv_insert_record() and csv_insert_new_record(). The former takes a void * array as an argument and creates a new record at the end of the table initialized with the values in that array. The latter simply creates a blank record, which you can then initialize with the setter functions. This is the easier method as it involves no pointer operations. Its use would look something like this:

csv_insert_new_record( table );
csv_set_number_field_by_index( table, 0, df );  /* df: a dfloat64_t * from libdfloat */
csv_set_string_field_by_index( table, 1, "This is a string field!" );

The final step in our sample use case would be to write the modified table with the new data points back to the CSV file. This is done using the csv_write_table() function, which has the following prototype:

void csv_write_table( FILE *fp, csv_table *table, bool has_header );

Unlike csv_validate_file() and csv_read_table(), which save the current file position at the beginning and restore it before exiting, csv_write_table() writes the serialized table data in-place. This is done to allow multiple tables to be written to the same file stream one after another, effectively concatenating them into one CSV file.
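
As a closing aside, serializing a table back to text means re-quoting any field that contains a delimiter. The helper below is a hypothetical sketch of that step, not libcsv's actual implementation: it quotes a field when it contains a comma, double quote, or newline, doubling embedded quotes per the usual CSV convention:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the quoting step a serializer such as
 * csv_write_table() must perform for each field. Writes the escaped
 * field into out (at most outsz bytes, NUL-terminated). */
static void csv_escape_field( const char *in, char *out, size_t outsz )
{
    bool needs_quotes = strpbrk( in, ",\"\n" ) != NULL;
    size_t j = 0;
    if( needs_quotes && j < outsz - 1 ) out[j++] = '"';
    for( size_t i = 0; in[i] != '\0' && j < outsz - 2; i++ ){
        if( in[i] == '"' ) out[j++] = '"';  /* double embedded quotes */
        out[j++] = in[i];
    }
    if( needs_quotes && j < outsz - 1 ) out[j++] = '"';
    out[j] = '\0';
}
```

For example, the field say "hi" would be written out as "say ""hi""" so that a later csv_read_table() pass recovers the original string.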