What is column-oriented storage?
Row-oriented storage and column-oriented storage are two major ways of laying out data in memory. In Rust, there is a simple way to think of this: row-oriented storage is like an array of structs, whereas column-oriented storage is like a struct of arrays. It’s easy to use row-oriented storage in Rust, so this post is going to explore column-oriented storage.
To make this post a bit more concrete, I’m going to work with this example struct:
struct Planet {
mass_kg: f64,
radius_m: f64,
habitable: bool,
}
When I create a Vec<Planet>
, each Planet
will be stored contiguously in memory; this is row-oriented storage. For most purposes, this is nice. For example, appending another Planet
into my Vec
usually only requires writing to a single region of memory.
In column-oriented storage, all values for each column will be stored contiguously. For Planet
, this might look like:
struct PlanetFrame {
mass_kg: Vec<f64>,
radius_m: Vec<f64>,
habitable: Vec<bool>,
}
This can be useful for a few reasons:
- Many analytics queries only require looking at a few fields. By accessing only the fields that we need, we reduce IO.
- We should be able to leverage SIMD and CPU pipelining more easily.
- We can store certain types in more efficient ways. For example, booleans can be stored using run-length encoding or a bit field.
Column-oriented storage is most prevalent in databases like Redshift and BigQuery and “big data” file formats like ORC and Parquet, but it can also be useful for in-memory data. Daniel Abadi has a great piece on this: Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?
The problem that I’m investigating here is: how can I take a struct like Planet
and automatically transform it into a column-oriented version like PlanetFrame
?
What is frunk?
The frunk
crate is a “generic functional programming toolbelt for Rust.” Its flagship data type is HList
, which is short for “heterogenous list.” This is a kind of list where each element can have a different type, all of which are known at compile time. Here’s some example frunk
code to give you a feel of how HList
works:
// Create a new HList
let h = hlist![1, "hi"];
assert_eq!(h.len(), 2);
// Destructure the HList
let hlist_pat!(a, b) = h;
assert_eq!(a, 1);
assert_eq!(b, "hi");
One thing to note: all type-checking with HLists
happens at compile-time, so the elements inside the list don’t need to be boxed. The author has written a great introduction if you’d like to learn more about HLists in Rust.
The reason that I chose frunk
for this problem is because it allows me to implement column-oriented storage without writing any macros.
The problem at hand
For my proof-of-concept column-oriented Vec
storage, I chose to satisfy a minimal set of requirements:
- Retrieve a single row
- Retrieve a single column
- Iterate over all rows
To meet these requirements, I built a column-oriented data structure that is an HList
of columns. Briefly, I made Row
and Frame
traits, and implemented them for frunk::HNil
and frunk::HCons
.
Below, I’ll share some more details on my implementation.
Selecting a column
hlist
already has a way to select columns using the hlist_pat
macro. It looks like this:
let hlist_pat![_mass_kg, radius_m, ...] = &planets;
assert_eq!(radius_m, &[6.37814e6, 3.3972e6]);
This is great for collections with only a few columns. Any type errors will be caught at compile time, which is cool.
I think this macro could be burdensome for collections with many columns. Perhaps it could be augmented to look something like:
let hlist_pat![SKIP_8, flight_number, ..., departure_date, SKIP_5] = &flights;
Selecting a row
Naively, I would want the method signature for selecting a row to be something like fn row(&self) -> &Planet
. Unfortunately, no Planet
actually exists in memory, so returning a reference is not straightforward at all. Instead, I pursued a method that creates and returns a new Planet
.
Overall, this implementation wasn’t too difficult. One note is that all fields must implement Clone
, because we need to be able to copy it from the column-oriented storage into the newly created Planet
object.
To make this more efficient, you could make a function like fn write_row(&self, write_into: &mut Planet)
to write a row from the frame into an existing Planet
. This way, you could get the value of a row without needing to allocate new memory.
Iterating over rows
I wanted to implement iterators using closures. This could theoretically be done by giving Frame
a method fn iter(&self) -> impl Iterator<Item=Self::MyRow>
. Unfortunately, a trait can’t currently have a method with impl Trait
in return position. To solve this, I made HConsIterator
and HNilIterator
structs. This is a fairly common pattern for transforming objects like Iterator
and Future
.
My implementation is not too efficient because it does a lot of allocations. Another option would be to use an iterator of references. This is not possible using Rust’s built-in Iterator
trait, so you need a third-party lib like streaming_iterator.
Example code
Here’s the final example code that I came up with. You can see the whole project on GitHub.
#[macro_use]
extern crate frunk;
extern crate frunk_core;
use frunk::indices::Here;
use frunk::prelude::*;
use frunk::Generic;
use frunk_column::*;
use std::iter::FromIterator;
fn main() {
#[derive(Clone, Copy, Debug, Generic, PartialEq)]
struct Planet {
mass_kg: f64,
radius_m: f64,
habitable: bool,
}
let earth = Planet {
mass_kg: 5.976e+24,
radius_m: 6.37814e6,
habitable: true,
};
let mars = Planet {
mass_kg: 6.421e+23,
radius_m: 3.3972e6,
habitable: false,
};
let mut planets = <Planet as Generic>::Repr::new_frame();
planets.push(frunk::into_generic(earth));
planets.push(frunk::into_generic(mars));
let hlist_pat![_mass_kg, radius_m, ...] = &planets;
assert_eq!(radius_m, &[6.37814e6, 3.3972e6]);
assert_eq!(planets.row(1), Some(frunk::into_generic(mars)));
assert_eq!(
Vec::from_iter(planets.into_iter().map::<Planet, _>(frunk::from_generic)),
vec![earth, mars]
);
}
Reflections
Type ergonomics
To transform my struct into a frunk
HList
, I used the Generic
derived trait from frunk
. This was a bit inconvenient because I needed to add a lot of boilerplate in many situations. For example, creating a new frame:
let mut planets = <Planet as Generic>::Repr::new_frame();
planets.push(frunk::into_generic(earth));
planets.push(frunk::into_generic(mars));
Or, collecting the rows into a Vec
:
Vec::from_iter(planets.into_iter().map::<Planet, _>(frunk::from_generic))
I think it would be possible to smooth some of this over with convenience methods.
Trait ergonomics
One limitation is that I can’t implement any third-party traits (e.g. IntoIterator
) for Frame
. This is because the structs for which I’ve implemented Frame
are part of the frunk
crate, so implementing IntoIterator
would be an orphan trait. I worked around this by giving Frame
a method with a similar signature: fn into_iter(self) -> Self::Iterator
.
Conclusion
I was impressed that frunk
let me implement column-oriented storage so quickly and easily. I didn’t have to write a single macro! However, I’d also like to protoype other solutions to solve the problems that I highlighted above.
I would love to hear any suggestions for further directions to explore for columnar data storage in Rust. If you have any, please contact me via the Reddit post, rust-lang.org forum thread, or via email.
Thanks to Ben Striegel, Anton Lazarev, and Brandon Maister for comments.
Appendix: Further Reading
- SoA in Rust with Macros 1.1 by Maik Klein, a PoC for using derive to turn a struct of fields into a struct of arrays.
soa-derive
, a crate that lets you transform your struct into an array of structs using a derive.- If you’re interested in taking a deeper dive into
frunk
, check out Gentle Intro to Type-level Recursion in Rust: From Zero to HList Sculpting. - Columnarization in Rust by Frank McSherry, a crate that recursively transforms data between row and column storage.
- Introducing Utah, a Rust crate for tabular data manipulation. It uses an enum to support mixed types.
- Daniel Tolnay’s
reflect
crate exposes what looks like a runtime reflection API for defining procedural macros easily. This could be an alternative tofrunk
for metaprogramming. - The Apache Arrow in-memory columnar format has an in-progress Rust implementation.
- Andy Grove’s DataFusion now uses Apache Arrow.
- Dataframes: Traits, Enums, Generics, and Dynamic Typing, which uses both static and dynamic dispatching for a data frame to hold different data types.
- Diesel’s Queryable trait approaches a similar problem.
- Storages from The Specs Book. Although I don’t think this is column-oriented, it is also using derives to control data storage.