Once a user knows the hash of a file in BitTorrent, downloading that torrent is easy, fast, and secure. But finding a hash is a completely different story; the hash must be downloaded from a website, which then becomes a single point of failure in the torrenting experience. The Pirate Bay, probably the most famous torrent index, is one of the most popular sites in the world; at the end of the day, many torrent users are putting their security in the hands of a few Swedish guys. If this is something that so many people care about, why aren't there decentralized torrent indices?
World-writable data structures like lists, sets, or maps could solve this problem, as well as many others. It would be possible to create censorship-resistant username registries, content directories, and domain name systems.
Currently, though, there aren't many good tools for building these shared decentralized data structures. My proposal is to build world-writable data structures using the Bitcoin blockchain. These data structures must be easy to use, spam-resistant, and censorship-resistant.
This sounds so simple... so why doesn't it already exist?
Decentralized spam resistance
Note: I'll define "spam" to encompass any content that is not useful for a particular application.
Nowadays, I mostly take spam resistance for granted; the vast majority of content that I see on the web is clearly genuine. This is because each website has control over its own content, and the site owner can elect to take something down if it doesn't belong. Even Wikipedia has a hierarchy of users, administrators, and bureaucrats. Of course, a problem with centralized systems is that they are vulnerable to censorship; Wikipedia has been censored in many countries. This raises a question: how do you prevent spam when you're designing a system that doesn't have an owner?
Bitcoin was the first system with a spam-resistant world-writable data structure. It accomplishes this via proof of work: in order to write to the Bitcoin blockchain, you need to prove that you have wasted a certain amount of CPU cycles ("mining"). This means that spamming the Bitcoin blockchain requires electricity and hardware, making it expensive.
We can reuse this same concept in order to build a spam-resistant world-writable data structure: we will require a writer to destroy resources in order to write to the data structure ("proof of burn"). Fortunately, Bitcoin gives us simple way to do this: we can require users to destroy a small amount of bitcoin when they write.
This is based on the assumption that writing to the store is more valuable to legitimate users than to spammers. This is an unorthodox trust model, because there is no notion of users, accounts, identities, or permissions. Anyone can write data to any application, and the importance accorded to any message is based on the amount of bitcoins burned when that message was written. See the "deep dive" below for more on trust.
The data structures
This is an idea, but how do we make the actual data structures? Our basic building block is the Bitcoin blockchain, which is a shared append-only log of transactions. It's now possible to insert data into a Bitcoin transaction using
OP_RETURN; by tagging each message with a prefix for our application, we can identify which transactions are relevant.
Now that our application has an append-only stream of data, we need to convert it into the data structure needed by our application. Fortunately, a shared append-only log is a great data store, and you can build many other data structures from it. Delta encoding and CRDTs provide some interesting ways of thinking about this process. I also recommend The Log by Jay Kreps, a great article about systems architecture in which the canonical store is an append-only log (the "Kappa Architecture").
A silly example: cat names
Let's imagine that we want to build a system that stores a set of good names for cats. It should provide fast lookup times for whether a string is in the set.
Each message we write to the Bitcoin blockchain will be a JSON array that looks like this:
Making the set
In order to provide fast lookup times, we'll need to convert the data from log form into set form. Here's the general way to do that:
- Create an empty set in memory
- Look through the Bitcoin blockchain. When we see a relevant message:
- Split it by comma into individual words
- Insert each word into our in-memory set
...and there we go!
When the program starts up, it will read in cat names that are already present in the blockchain (this may take a little while). The system can be written so that the set is updated when new cat names are published to the blockchain.
The burn-stream node.js library provides a simple interface for reading and writing messages to the Bitcoin blockchain such that they can be efficiently retrieved. Use it for your next world-writable data structure!
If you have to spend money every time you write, clearly this is not a cost-effective option for a main database! A world-writable data structure should be used as one component in a system; for example, it could contain links to content stored in external systems. Blockstore already does this by storing most of its data in a DHT.
The crucial assumption here is that only people with a legitimate interest in your application will be willing to burn bitcoins to write to your data structure. Of course, it's also possible for a malicious party to make writes! Perhaps there is some incentive, or maybe they are just rich and mean.
The design of your data structures should reflect the priorities of your application. For example, an add-only set allows spam but prevents censorship. A set with both add and remove operations allows individuals to remove spam, but also permits censorship. Perhaps an Eigentrust-style trust graph could provide a good tradeoff that allows the community to remove entries as it sees fit (but that's a topic for another blog post).
In some cases, you may need to choose a rigid cutoff for whether a particular write counts -- e.g., is "Spaghetti" a good cat name, or is it not? In these cases, the cutoff will need to be tuned to balance cost vs. spam-resistance, although different clients could choose different cutoffs depending on their level of paranoia. In other scenarios, it might be acceptable to return a fuzzy answer -- e.g., "Spaghetti" is considered to be a good cat name with a weight of 12%.
It might be possible to use the dimension of time as a way to combat spam. For example, if you had to wait five years for writes to be processed, would this discourage spamming?
Another possibility is to financially reward users who make community-approved writes to your system; this is what Bitcoin does.
This proposal shares a major performance limitation with other Kappa Architecture systems: since the canonical data store is a log, a client has to download the whole log before being able to access data efficiently. This means that, without optimizations, the startup cost is proportional to the size of the Bitcoin blockchain. As the data is read in, it can be loaded into an optimized store, such as the example's in-memory palindrome set.
In general, a client probably does not need to save the original log; as soon as the data is loaded into the serving layer, the data directly from the blockchain can be discarded.
The write performance is the same as for Bitcoin, i.e. an average of 10 minutes between blocks. This means a wait of about one hour before a write has been confirmed six times.
Here are some optimizations:
- By using a simplified payment verification client, a client only needs to download block headers and relevant transactions, not the whole blockchain (~10,000x speedup). Burn-stream already does this.
- By adding a checkpoint to the client, it can start downloading transactions from the time your application was created, not from 2009 (saves ~25 MB of download).
- An application can sacrifice write performance to improve read performance (see Vespucci).
- Perhaps one day ultimate blockchain compression will be implemented.
Bitcoin enforces a total ordering over all confirmed transactions, which means that our append-only log also has a stable total order. However, this reflects the order in which the transactions were confirmed, which may not match the order in which they were submitted.
Like Bitcoin, these data structures are eventually consistent. Since multiple parties can write to the data structure at the same time, it might be wise to choose a data model that cannot enter a broken state, such as a CRDT.
The price of bitcoin
If the price of bitcoin changes drastically, then the effective cost of writing to a world-writable data structure will change. Writers can hedge against this risk by holding bitcoins.
Why the Bitcoin blockchain?
Bitcoin is not the only blockchain in town, so why choose it?
- Bitcoin's design and code have been thoroughly vetted.
- There's a large base of Bitcoin software in many languages, making it easy to implement custom software.
- The hashing power of Bitcoin's network is vastly stronger than that of any competitor, making it more resistant to a 51% attack.
- Bitcoin's price is stable relative to altcoins, reducing the risk from price changes.
There are currently many systems that use decentralized data structures, although I'm not aware of anything that is generic. If I'm missing something, please send me a tweet or email!
- Bitcoin has a decentralized transaction ledger, which is a specific application of an append-only log.
- Eternity Wall is a simple decentralized message log built using the Bitcoin blockchain.
- Namecoin, blockstore name registration, and blockname all attack the problem of decentralized DNS/name registration.
Thanks to George Davis, Zack Nichols, Max Livingston, Rafi Shamim, Kevin Wilson, and Eric Kernfeld for feedback.