Your experiment has a runtime

Big-O notation for experimentalists

In computer science, big-O notation describes how an algorithm’s runtime or memory usage scales with input size. O(1) is constant time, O(log(n)) grows slowly, O(n) grows linearly, and so on. It’s a powerful abstraction: it allows us to ignore implementation details and focus on scaling behavior that ultimately drives both time and monetary costs.
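To make those growth rates concrete, here is a toy cost model (hypothetical effort units, not tied to any real protocol) showing how total effort diverges across complexity classes as n grows:

```python
import math

def cost(complexity, n):
    """Toy model: units of effort needed to process n samples."""
    if complexity == "O(1)":
        return 1
    if complexity == "O(log n)":
        return math.ceil(math.log2(n))
    if complexity == "O(n)":
        return n
    raise ValueError(f"unknown complexity class: {complexity}")

for n in (10, 1_000, 1_000_000):
    print(n, cost("O(1)", n), cost("O(log n)", n), cost("O(n)", n))
```

At a million samples, the O(1) protocol still costs 1 unit, the logarithmic one costs 20, and the linear one costs a million; that gap is the entire argument of this post.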

In the lab, we rarely talk this way. Yet every experimental protocol has an equivalent “experimental complexity”: how human effort, pipetting, plate handling, and failure probability grow as the number of samples increases. O(1) biology means that you can double or even 1,000× the number of samples, and your effort stays the same.

Figure: Comparing the growth in the number of PDB structures deposited to the number of cells in publicly available single-cell RNA-seq experiments.

O(1) biology isn't directly about speed or cost: it's about scalability. In practice, what determines whether a method is widely deployed is whether it melts down at 1,000× scale. And what causes experiments to melt down is not just experimental cost (in dollars or hours) but complexity.

This might include: 

  • the number of liquid-moving operations
  • the number of opportunities for error and loss
  • the cognitive and logistical overhead per sample

We argue that these parameters can be folded into “experimental complexity.” Consider the difference between an arrayed RNAi screen, where each perturbation lives in its own well and effort scales linearly with the number of targets, and a pooled CRISPR knockout screen, where thousands of genetic perturbations coexist in a single tube and effort scales with the number of conditions, not the size of the library.
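Under the simplifying assumption that hands-on effort is proportional to the number of separate reaction vessels, the arrayed-versus-pooled contrast can be sketched in a few lines (the library size and condition count below are made up):

```python
def arrayed_wells(n_targets, n_conditions):
    # Arrayed screen: one well per target per condition -> linear in targets.
    return n_targets * n_conditions

def pooled_tubes(n_targets, n_conditions):
    # Pooled screen: every perturbation shares a tube -> constant in targets.
    return n_conditions

# A 20,000-guide library across 3 conditions:
print(arrayed_wells(20_000, 3))  # 60000 wells
print(pooled_tubes(20_000, 3))   # 3 tubes
```

The pooled workload never sees `n_targets` at all, which is exactly what it means for target count to drop out of the experimental complexity.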

For small startups like us, O(1) experiments are especially important because our science demands are bursty. O(n) workflows fall apart when throughput suddenly increases, while O(1) workflows absorb spikes in experimental load, turning what would be a logistical headache into routine.

Barcoding for single-cell omics

As an example, let’s compare methods to carry out a common task in genomics: putting DNA barcodes onto pools of single cells. These molecular barcodes (a short string of 20ish DNA bases) act as unique identifiers for each cell, allowing us to assign each sequencing read directly back to its cell of origin. This converts sequencing from a tool to monitor bulk properties into a tool to understand what is occurring within individual cells. 

I'll start where the literature started: with the Drop-seq and inDrops protocols, which use droplet microfluidics to compartmentalize single cells and barcode beads, and molecular biology tricks to transfer barcodes uniquely onto (the molecules within) single cells. Here, n refers to the number of cells being barcoded, and the cost we track is pipetting complexity.

Although these processes feel automated (and in many ways they are), the setup scales poorly. If you want to barcode n cells, you will need:

  • O(n) droplets
  • O(n) barcoded beads
  • O(n) minutes, since the droplet generator runs serially

Even if the droplets are generated at scale, the surrounding infrastructure grows linearly with throughput. So do the failure modes. That’s O(n); every additional cell adds more work.

The next major process development is Split-seq, which cleverly avoids linear scaling. The idea behind Split-seq is to think of a cell identifier as a sequence of “barcode words”, rather than a single monolithic field. The total number of entities that you can uniquely identify scales exponentially in the number of barcode words.

Experimentally, this is easy to carry out: 

  1. Split cells into wells
  2. Barcode (various molecular biology options, but basically add juice, stew, remove juice)
  3. Pool
  4. Re-split
  5. Repeat

Each round expands the barcode space. If each split has k wells, then after r rounds you have \(k^r\) unique barcode combinations. So the number of rounds grows as: \(r = \lceil \log_k(n) \rceil\). For example, if we want to uniquely label 1,000,000 cells (n) and we're performing the split-pool process in a 96-well plate (k), it will require just 4 rounds of split-pool barcoding (r). The pipetting doesn’t scale with the number of cells; it scales with the number of rounds.
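The round count from the formula above is easy to check directly. A minimal sketch, using the same k = 96 wells and n = 1,000,000 cells as the worked example:

```python
import math

def rounds_needed(n_cells, k_wells):
    """Smallest r such that k^r >= n, i.e. r = ceil(log_k(n))."""
    return math.ceil(math.log(n_cells) / math.log(k_wells))

print(rounds_needed(1_000_000, 96))  # 4 (96**4 is about 85 million combinations)
```

Note how flat the curve is: going from a million cells to a hundred million only adds one more round of pipetting.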

Beating logarithmic complexity is hard, but the more recent PIP-seq behaves like an O(1) algorithm. The protocol itself relies on cleverly exploiting some cool fluid dynamics behavior that we won’t get into here, but at the bench, the protocol becomes wonderfully simple:

  1. load cells
  2. load barcode beads
  3. shake the hell out of everything

How do we get to O(1) biology?

In the examples above, there isn't a single generalizable way to reduce the experimental complexity of a process; each reduction took one-off problem solving and creativity. If that were the whole story, building toward O(1) biology would itself be inherently unscalable. Thankfully, there’s a generalizable solution for at least some problems: DNA sequencers and DNA barcodes.

Most classic lab assays scale linearly because each observation requires its own dedicated physical entity: a well, a primer, a fluorescent probe, a lane on a gel, a microscope field of view. If you want to measure n things, you need n something elses.

DNA sequencing breaks this relationship[[1]]. A simple example of this is contrasting qPCR versus RNA-seq: with qPCR, every transcript you want to measure requires new primers, new PCR conditions, and probably a new well in a plate. Measuring 10 transcripts by qPCR is 10 reactions; measuring 100 transcripts is pretty annoying, and measuring 10,000 transcripts is probably not happening. RNA-seq, by contrast, measures all expressed transcripts in one shot, whether it’s 10 or 10 billion. The marginal cost of adding another RNA to your analysis is effectively zero. The labor, setup, and logistical overhead are constant even as the data generated grows by orders of magnitude.

But that's not the full picture: actual biologically interesting experiments involve far more than just running a sequencer. To get to O(1) measurements rather than just O(1) readouts, we need one more tool: DNA barcodes. The general trick here is to come up with a bunch of hypotheses, somehow encode these hypotheses into DNA space, and assign each of these hypotheses its own sequenceable identifier. Then, put everything into a single pot, let biology do its thing, and put it on a sequencer. This is the logic behind MPRAs, CRISPR screens, pooled perturbation experiments, and so on; by combining DNA barcoding with DNA sequencing you now reduce the entire experimental pipeline into a single generic operation[[2]].

The combination of barcodes and sequencers creates a general path for designing O(1) experiments and it has worked spectacularly for genomics and transcriptomics. But not, unfortunately, for proteins.

At Escalante we are building a way to close this gap: quantitative, scalable, generalizable protein measurements with O(1) experimental complexity. 

Footnotes

[[1]]: Evergreen blogpost from Lior Pachter that explores similar topics

[[2]]: While powerful, this approach also has its limitations, because not all biological questions are as easily encoded into DNA pools, or poolable in a big mixed culture. Questions like 'what happens within a cell when I perturb these genes' are tractable, while the subtly different question, 'what happens to this group of communicating cells when I perturb these same genes,' is not. This has real consequences: we think this is why so many are trying to model a virtual cell, rather than a virtual immune system, despite the fact that the latter is probably much more exploitable as a biopharma tool.