My favorite papers

  • Calvin: Fast Distributed Transactions for Partitioned Database Systems (pdf)

    This paper shows us that by deterministically running transactions, we can provide strict serializability in a georeplicated distributed database while limiting latency. They accomplish this by fully ordering transactions ahead of processing and enable parallelism by locking keys for the transactions in that order (if transaction A reads something B wrote, it will wait for A to be processed, while transaction C which neither reads nor writes anything that A touches can run in parallel with A).

    This comes at a price: transactions must know their full read and write set before execution. The paper does go into some details on a "reconnaissance" phase in which applications can run transactions without write effects to collect their key set, and then rerun and validate this read set. This opens up the ability to run transactions without full knowledge of the key set, at the cost of executing the transaction twice (and the possibility the second run will abort due to conflicts with concurrent transactions changing data it read in the recon phase).

    Even with the limitations, it's a great read to see how pushing consensus to the boundary coupled with deterministic execution opens things up to some unique tradeoffs.

  • Spanner: Google’s Globally-Distributed Database (pdf)
  • Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems (pdf)
  • Dynamo: Amazon’s Highly Available Key-value Store (pdf)
  • CORFU: A Shared Log Design for Flash Clusters (pdf)
  • LeanStore: In-Memory Data Management Beyond Main Memory (pdf)

    One of the challenges with in-memory database systems is that all data (including cold data) must fit into memory. There has been a number of attempts to solve this, but LeanStore takes a novel approach. The two interesting things they utilize are:

    1. Using the top bit of a pointer to a page to indicate if the pointer is referencing a page in memory or on disk, avoiding indirection for in memory pages.
    2. Randomly selecting pages to shift towards being placed on disk, which works out to be very effective approach that avoids tracking data to perform something like LRU eviction.

    Taken together, they're able to show some impressive results that track closely with in-memory data stores while being able to utilize the disk to hold cold data.

  • The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases (pdf)
  • The ART of Practical Synchronization (pdf)
  • Cicada: Dependably Fast Multi-Core In-Memory Transactions (pdf)

More papers

  • TicToc: Time Traveling Optimistic Concurrency Control (pdf)
  • Sundial: Harmonizing Concurrency Control and Caching in a Distributed OLTP Database Management System (pdf)
  • WPaxos: Wide Area Network Flexible Consensus (pdf)
  • Flexible Paxos: Quorum intersection revisited (pdf)

    An interesting paper that shows that you don't need a majority of nodes to agree in all phases of Paxos to reach consensus. A practical use of this would be to have 3 of 4 nodes agree who's leader, while only requiring 2 of 4 nodes to reach replication consensus.

    I could see this being very useful for lowering latency for consensus across data centers (for example, in a set of datacenters {west1, west2, east1, east2} we'd like the primary to be able to reach consensus through consulting its nearest datacenter (e.g. west2 -> west1) and if the primary region fails, for the other close pair to become primary (if a region in west fails, we shift the primary to one of the east regions). Under the prior understanding of Paxos, having 4 nodes required replication to reach 3 nodes (thus you almost always ran odd numbers, as both 3 and 4 node configurations only handle 1 node failure). FPaxos makes these unique configurations possible! (There are a number of other interesting configurations they talk about in the paper.)

I’m exploring

  • A fast alternative to the modulo reduction (external)

    This might be a useful trick to speed up indexing into multi-level page tables