Recently, I wrote ndarray-csv, a Rust crate for converting between CSV files and 2D arrays. There are already crates for CSV and arrays, so how exciting could this possibly be? Actually, there was a lot more to it than I had thought: although it started out as a two-hour project, I ended up rewriting the entire thing three times!
Update 2018-10-27
I ended up substantially revising many of these design decisions based on feedback. Read more here.
The libraries I’m using
csv is the standard for working with CSV data in Rust. It has excellent docs, and is pretty fast as far as I know.
ndarray is a crate for working with dense multidimensional arrays in Rust. It has good support for arithmetic, slicing, and working with parts of an array as views. Here are two examples of fun crates in the ndarray ecosystem:
- matrixmultiply is a simple and understandable crate for doing efficient matrix multiplication. There is an awesome blog post by bluss explaining how it works.
- ndarray-parallel is a crate for efficiently parallelizing operations across matrices. This could be a great way to un-bottleneck many machine learning algorithms.
Implementing the read method
So far I have gone through three different strategies for implementing my read method. Its signature is:
```rust
pub fn read<A>(shape: (usize, usize), reader: &mut Reader<impl Read>) -> Result<Array2<A>, Error>
where
    A: Copy,
    for<'de> A: serde::Deserialize<'de>
```
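For example, reading a headerless file containing a 2×3 grid of integers might look like the following sketch (matrix.csv is a hypothetical file name, and it assumes read is exported at the crate root):

```rust
use csv::ReaderBuilder;
use ndarray::Array2;
use ndarray_csv::read; // assuming `read` is exported at the crate root

// Build a CSV reader over a headerless file, then read it as a 2x3 array.
let mut reader = ReaderBuilder::new()
    .has_headers(false)
    .from_path("matrix.csv")?;
let array: Array2<u64> = read((2, 3), &mut reader)?;
```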
For my first implementation, I created a new Array2<A> using Array2::zeros, which fills the array with zeroes. This is fast and simple, but there’s a problem: it requires that A, the type of each array element, implement num_traits::identities::Zero. This prevents us from reading arrays of booleans, chars, ACGT nucleobases, or anything else that doesn’t have a concept of “zero.”
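Here’s a minimal sketch of that first strategy; read_with_zeros is a hypothetical name, and checking that rows and fields fit the requested shape is elided:

```rust
use csv::Reader;
use ndarray::Array2;
use num_traits::Zero;
use std::io::Read;

// A sketch of the zero-fill strategy: allocate a zeroed array, then
// overwrite each element with a value deserialized from the CSV.
// Shape checking is elided; out-of-bounds indexing would panic here.
fn read_with_zeros<A>(
    shape: (usize, usize),
    reader: &mut Reader<impl Read>,
) -> Result<Array2<A>, Box<dyn std::error::Error>>
where
    A: Copy + Zero,
    for<'de> A: serde::Deserialize<'de>,
{
    let mut array = Array2::zeros(shape);
    for (r, record) in reader.deserialize().enumerate() {
        let row: Vec<A> = record?; // deserialize one CSV record
        for (c, value) in row.into_iter().enumerate() {
            array[(r, c)] = value;
        }
    }
    Ok(array)
}
```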
To solve this I switched to creating the 2D array using Array2::uninitialized, an unsafe function. This function allocates enough memory for the whole array without initializing any of the memory. Then, I filled in each element from the CSV. I’m 75% sure that my code was correct, and I even wrote the previous version of this blog post justifying why this was the right choice. After thinking about it, though, I decided that this use of unsafe wasn’t worth the risk.
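For the record, the rejected approach looked roughly like this sketch (not the crate’s actual code):

```rust
// The rejected unsafe strategy: allocate without initializing, then fill
// every element before the array can be observed. If deserialization fails
// partway through, the array is dropped while still partly uninitialized,
// which is one reason this is hard to get right.
let mut array = unsafe { Array2::<A>::uninitialized(shape) };
for (r, record) in reader.deserialize().enumerate() {
    let row: Vec<A> = record?;
    for (c, value) in row.into_iter().enumerate() {
        array[(r, c)] = value;
    }
}
```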
My most recent and hopefully final implementation creates a new Array1<A> using Array1::from_iter, then reshapes it into a 2D array of the correct dimensions. This involved solving a pretty tricky problem that I’m planning to write about separately: how to flatten an Iterator<Item = Result<Iterator<Item = Result<T, E>>, E>> into an Iterator<Item = Result<T, E>>.
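The core of that strategy looks something like this sketch, where flat_values stands in for the flattened iterator described above:

```rust
use ndarray::{Array1, Array2};

// Collect the flattened Iterator<Item = Result<A, E>>, bailing out on the
// first error, then reshape the 1D array into the requested dimensions.
let values: Vec<A> = flat_values.collect::<Result<_, _>>()?;
let array: Array2<A> = Array1::from_iter(values).into_shape(shape)?;
```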
One disadvantage of my newest implementation is that it will now do multiple allocations because it relies on Vec::from_iter. This is less efficient than my previous solutions. I should be able to optimize this in the future by implementing size_hint on my iterator, which will let Vec::from_iter allocate the right amount of memory.
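I haven’t written this yet, but a sketch of the idea might look like the following; WithSizeHint is a hypothetical wrapper, and it assumes a well-formed file actually yields rows × columns elements:

```rust
// Hypothetical sketch: a wrapper that reports how many elements remain,
// so that collecting into a Vec can allocate once up front.
struct WithSizeHint<I> {
    inner: I,
    remaining: usize, // rows * columns for a well-formed file
}

impl<I: Iterator> Iterator for WithSizeHint<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.inner.next();
        if item.is_some() {
            self.remaining = self.remaining.saturating_sub(1);
        }
        item
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // collect() consults this to size its initial allocation.
        (self.remaining, Some(self.remaining))
    }
}
```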
Errors
While I love Rust’s error-handling mechanisms, propagating errors from nested functions can be tricky. To help with error chaining, I used the failure crate for my errors. This could make it easier for downstream consumers to chain the errors that ndarray-csv returns.
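For illustration, a failure-style error might be declared like this; this is a hypothetical sketch that roughly matches the NColumns error used in the test below, not the crate’s exact definition:

```rust
use failure::Fail;

// Hypothetical error for a row with the wrong number of columns.
#[derive(Debug, Fail)]
pub enum ReadError {
    #[fail(
        display = "row {} has {} columns, but {} were expected",
        at_row_index, actual, expected
    )]
    NColumns {
        at_row_index: usize,
        expected: usize,
        actual: usize,
    },
}
```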
When I write code and I’m not feeling lazy, I like to test that my functions return appropriate errors when things go wrong. Interestingly, it is usually not possible to use assert_eq! to test errors in Rust because errors don’t necessarily implement PartialEq. Instead, I have been using assert_matches! from the matches crate.
I used the following strategy to test my failure errors (downcast_ref is part of the failure crate).
```rust
let readed: Result<Array2<i8>, _> = read((2, 2), &mut test_reader());
assert_matches!(
    readed.unwrap_err().downcast_ref().unwrap(),
    NColumns { at_row_index: 0, expected: 2, actual: 3 }
);
```
If you’re curious to learn more, there is a discussion on users.rust-lang.org on why std::io::Error doesn’t implement PartialEq. I also opened an issue in failure to ask about error-handling best practices; maybe there are better approaches that I don’t know about.
Reading untrusted CSV files
One source of web vulnerabilities is users uploading malicious input files, causing the server to hang, crash, or even allow unauthorized access. There is a nice collection of zip bombs and gigantic images available from bomb.codes.
I wanted to make it so that this library could accept unbounded streams of input from untrusted sources without blowing up. This should be possible because:
- the number of rows is known in advance, so we can bail out if a file has too many rows
- the number of columns is known in advance, so we can bail out if a file has too many columns
- ndarray is mostly used for numerical data, so we can bail out if a field is too long. This is a little bit trickier because, for example, 0 repeated infinitely could be considered a valid integer. I would probably get around this by limiting the number of characters per field.
As it turns out, accepting unbounded streams of untrusted input is outside the goals of the csv crate. BurntSushi provides a really good explanation of why this is difficult. I have not yet implemented functionality for parsing unbounded streams of untrusted input, but I can think of two ways to do so.
As suggested by BurntSushi, it would be best to limit the number of bytes per line by wrapping the io::Read. This would be up to users of ndarray-csv, along the lines of the sketch below.
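For example, a wrapper like this could cap the bytes per line; LineLimit is a hypothetical name, and this is only a sketch of the idea:

```rust
use std::io::{self, Read};

// Hypothetical sketch: a Read adapter that fails if any single line
// exceeds `limit` bytes, so a malicious stream cannot grow one record
// without bound. This would live in user code, not in ndarray-csv.
struct LineLimit<R> {
    inner: R,
    limit: usize,
    current_line: usize, // bytes seen since the last newline
}

impl<R: Read> Read for LineLimit<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        for &byte in &buf[..n] {
            if byte == b'\n' {
                self.current_line = 0; // a newline resets the counter
            } else {
                self.current_line += 1;
                if self.current_line > self.limit {
                    return Err(io::Error::new(
                        io::ErrorKind::InvalidData,
                        "line exceeded byte limit",
                    ));
                }
            }
        }
        Ok(n)
    }
}
```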
It might also be possible to limit the number of bytes per field. This would mean using csv-core instead of csv, which would require changes in both code and interfaces.
Future extensions
I’d like to let users read in an array without knowing the size in advance. The logic for accomplishing this would be pretty different from what I’ve already implemented.
Reading into a user-provided array could be more efficient in certain situations.
Arrays with zero rows, zero columns, or both are supported by ndarray, but they don’t map elegantly to the CSV format. For example, any array with zero rows would just be an empty string when represented as CSV, so the number of columns is thrown away. Hopefully no one wants to actually do this.
Conclusion
I hope that you agree that this endeavor was deceptively interesting!
Check out ndarray-csv, and please let me know which of these design decisions you agree and disagree with! It’s not too late to make improvements.
Thanks to Brandon Maister and Michael Hoy for comments.