I have recently published Version 0.3 of libcsv, a CSV library for the C programming language. The archive for this version can be found here. The repo for the libcsv project is shown below:
About libcsv
libcsv is a C library that allows a user to validate, read, and write CSV files, converting between the serialized CSV format and a tabular representation readable by statistical algorithms. libcsv also provides SQL database operations for operating on this tabular representation. You can read more about libcsv here.
What's new in this version
Version 0.3 adds the ability to work with sets and to use those sets to select subsets of tables. This is essentially an implementation of the SELECT command in SQL. The following is a brief overview of how to use these new features.
There are three steps to selecting a subset of the rows of a table. First, use a numerical, string, or modular operator to define a csv_set variable representing all rows of the table satisfying a certain condition. Second, use the set-theoretic functions provided by csv_set.c to build more complex Boolean expressions with other sets if desired. Third, use the resulting csv_set variable to generate a new table containing only the rows of the original table matching the condition given by that expression.
To extract a csv_set from a table, use the csv_select_subset() function, declared in csv.h and defined in csv_select.c. Its syntax looks like this:
csv_set *csv_select_subset( csv_table *table, enum operators op, char *op1, char *op2 )
The arguments to this function are the table you want to select rows from, an enum code for the operator (explained shortly), and two operands for the operator. The first operand is typically the name of a field, while the second operand is typically a string representation of the numeric or string value that the field is being compared to. The only exception is the MOD operator, in which case both op1 and op2 should be string representations of integers (for example, "5" and "3"). Currently the function provides only minimal error checking, so it should be used with caution until I implement more rigorous error checking in a future version.
The codes for the operator are as follows:
EQ for number field op1 == numeric value op2
NE for number field op1 != numeric value op2
LT for number field op1 < numeric value op2
GT for number field op1 > numeric value op2
LE for number field op1 <= numeric value op2
GE for number field op1 >= numeric value op2
MOD for row number % op1 == op2
SEQ for string field op1 == string value op2
SNE for string field op1 != string value op2
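To make the operator semantics concrete, here is a minimal standalone sketch of how codes like these might be evaluated for a single field value. The enum toy_op and the function toy_compare() are hypothetical names used only for illustration; they are not libcsv's enum operators or its actual implementation. The sketch assumes numeric operands are parsed with strtod() and follows the usual meaning of the names (LT is "less than", GT is "greater than").

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical operator codes for illustration (not libcsv's enum). */
enum toy_op { TOY_EQ, TOY_NE, TOY_LT, TOY_GT, TOY_LE, TOY_GE, TOY_SEQ, TOY_SNE };

/* Compare one field value against the string operand op2. */
bool toy_compare(enum toy_op op, const char *field, const char *op2)
{
    assert(field != NULL && op2 != NULL);

    /* String operators compare the raw text. */
    if (op == TOY_SEQ) return strcmp(field, op2) == 0;
    if (op == TOY_SNE) return strcmp(field, op2) != 0;

    /* Numeric operators parse both sides as doubles first. */
    double a = strtod(field, NULL);
    double b = strtod(op2, NULL);
    switch (op) {
    case TOY_EQ: return a == b;
    case TOY_NE: return a != b;
    case TOY_LT: return a <  b;
    case TOY_GT: return a >  b;
    case TOY_LE: return a <= b;
    case TOY_GE: return a >= b;
    default:     return false;
    }
}
```

Note that in this sketch "2.0" and "2" are equal under TOY_EQ but different under TOY_SEQ, which is why separate numeric and string operators are useful.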
For MOD, remember that the row numbers start at 0, so if, for example, you use "2" for op1 and "0" for op2, you will end up selecting the first, third, fifth, etc. rows of the table. This is somewhat counter-intuitive, but it makes the csv_set operations easier to implement.
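The zero-based numbering can be checked with a small standalone sketch. The function toy_mod_select() below is a hypothetical helper, not part of libcsv; it only applies the rule "row number % op1 == op2" to an array of flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mark row i as selected when i % divisor == remainder, counting rows from 0. */
void toy_mod_select(bool *selected, size_t n_rows, size_t divisor, size_t remainder)
{
    assert(selected != NULL && divisor > 0);
    for (size_t i = 0; i < n_rows; i++)
        selected[i] = (i % divisor == remainder);
}
```

With a divisor of 2 and a remainder of 0, rows 0, 2, and 4 are marked, which are the first, third, and fifth rows when counting from one.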
The csv_select_subset() function returns a pointer to a csv_set variable, which you can combine with other csv_set variables using the following functions:
void csv_set_difference( csv_set *dst, csv_set *src )
Computes the set difference between the two operands, subtracting src from dst and storing the result in dst.
void csv_set_complement( csv_set *dst )
Computes the complement of dst and stores it in dst.
void csv_set_union( csv_set *dst, csv_set *src )
Computes the set union of src and dst and stores the result in dst.
void csv_set_intersection( csv_set *dst, csv_set *src )
Computes the set intersection of src and dst and stores the result in dst.
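The in-place convention (the result overwrites dst) can be illustrated with a standalone sketch. The toy_set type and the toy_set_*() functions below are hypothetical; they assume a set is stored as one flag per row, which is not necessarily how csv_set is represented inside libcsv.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_ROWS 64

/* Hypothetical set of row indices: in[i] is true when row i is a member. */
typedef struct {
    size_t n_rows;              /* size of the universe (rows in the table) */
    bool   in[TOY_MAX_ROWS];
} toy_set;

/* dst = dst minus src */
void toy_set_difference(toy_set *dst, const toy_set *src)
{
    assert(dst->n_rows == src->n_rows);
    for (size_t i = 0; i < dst->n_rows; i++)
        dst->in[i] = dst->in[i] && !src->in[i];
}

/* dst = every row of the universe that is not in dst */
void toy_set_complement(toy_set *dst)
{
    for (size_t i = 0; i < dst->n_rows; i++)
        dst->in[i] = !dst->in[i];
}

/* dst = rows in dst or in src */
void toy_set_union(toy_set *dst, const toy_set *src)
{
    assert(dst->n_rows == src->n_rows);
    for (size_t i = 0; i < dst->n_rows; i++)
        dst->in[i] = dst->in[i] || src->in[i];
}

/* dst = rows in both dst and src */
void toy_set_intersection(toy_set *dst, const toy_set *src)
{
    assert(dst->n_rows == src->n_rows);
    for (size_t i = 0; i < dst->n_rows; i++)
        dst->in[i] = dst->in[i] && src->in[i];
}
```

One consequence of this design is that the original contents of dst are lost after each call, so a copy is needed if the unmodified set will be used again. The complement also depends on knowing the size of the universe, here the number of rows.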
You can also add _f to the end of the function name to have the function implicitly free its operands before returning the result (for the rationale behind this paradigm, see the introduction to libdfloat).
You can test the membership of an element (represented by an integer) in a set by using the csv_set_member() function, and you can add and remove elements using csv_set_add() and csv_set_del(). These functions are described briefly below:
void csv_set_add( csv_set *set, int element )
Adds the given element to the given set.
void csv_set_del( csv_set *set, int element )
Removes the given element from the given set.
bool csv_set_member( int element, csv_set *set )
Returns true if the given element is a member of the given set, false otherwise.
Note: I realize that the order of the arguments is different in these functions. I plan to fix this in later versions, so that the functions are all consistent.
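For readers who want to see the add, delete, and membership operations in isolation, here is a minimal standalone sketch. The toy_intset type and its functions are hypothetical and assume a fixed-capacity array of flags; they are not libcsv's csv_set implementation. The sketch puts the set argument first in all three functions, which is one possible consistent ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_CAPACITY 128

/* Hypothetical integer set: one flag per element in [0, TOY_CAPACITY). */
typedef struct {
    bool present[TOY_CAPACITY];
} toy_intset;

/* Empty the set. */
void toy_intset_clear(toy_intset *set)
{
    memset(set->present, 0, sizeof set->present);
}

/* Add an element; adding an existing element has no effect. */
void toy_intset_add(toy_intset *set, int element)
{
    assert(element >= 0 && element < TOY_CAPACITY);
    set->present[element] = true;
}

/* Remove an element; removing a missing element has no effect. */
void toy_intset_del(toy_intset *set, int element)
{
    assert(element >= 0 && element < TOY_CAPACITY);
    set->present[element] = false;
}

/* Return true if the element is in the set; out-of-range values are never members. */
bool toy_intset_member(const toy_intset *set, int element)
{
    if (element < 0 || element >= TOY_CAPACITY)
        return false;
    return set->present[element];
}
```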
Once you have the set that you want, you can generate a subset table from that set using csv_select_records_by_subset(), which has the following syntax:
csv_table *csv_select_records_by_subset( csv_table *table, csv_set *subset )
You can also generate a partition of a data set from a csv_set variable. The csv_partition_table_by_subset() function generates a csv_partition variable that contains pointers to two tables: one containing all rows that match and the other containing all rows that don't match. These are accessed through the ident and cplmt fields of the csv_partition type, respectively.
csv_partition *csv_partition_table_by_subset( csv_table *table, csv_set *subset )
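The idea of splitting rows into a matching part and its complement can be sketched without the library. The toy_partition type and toy_partition_rows() below are hypothetical and work on row indices only; libcsv's csv_partition holds pointers to tables instead. The field names ident and cplmt are borrowed from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PART_MAX 32

/* Hypothetical partition of row indices into members and non-members. */
typedef struct {
    size_t ident[TOY_PART_MAX];  size_t n_ident;   /* rows in the subset */
    size_t cplmt[TOY_PART_MAX];  size_t n_cplmt;   /* rows not in the subset */
} toy_partition;

/* Distribute row indices 0..n_rows-1 according to the membership flags. */
toy_partition toy_partition_rows(const bool *in_subset, size_t n_rows)
{
    assert(in_subset != NULL && n_rows <= TOY_PART_MAX);
    toy_partition p = { .n_ident = 0, .n_cplmt = 0 };
    for (size_t i = 0; i < n_rows; i++) {
        if (in_subset[i])
            p.ident[p.n_ident++] = i;
        else
            p.cplmt[p.n_cplmt++] = i;
    }
    return p;
}
```

Every row lands in exactly one of the two parts, so the counts n_ident and n_cplmt always add up to the number of rows.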
Example code
Here is an example of dividing a table into training data and testing data by selecting the fourth and fifth of every five rows of the table and generating a partition:
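/* Rows 3, 8, 13, ... (row number % 5 == 3): the fourth of every five rows */
csv_set *fourth = csv_select_subset( data, MOD, "5", "3" );
/* Rows 4, 9, 14, ... (row number % 5 == 4): the fifth of every five rows */
csv_set *fifth = csv_select_subset( data, MOD, "5", "4" );
/* fourth now holds the union of both sets */
csv_set_union( fourth, fifth );
/* Split the table: ident holds the selected rows, cplmt holds the rest */
csv_partition *data_part = csv_partition_table_by_subset( data, fourth );
csv_table *training_data = data_part->cplmt;
csv_table *testing_data = data_part->ident;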