Teaching generative models to hallucinate
There are currently two main approaches to computational protein binder design: optimization (exemplified by BindCraft) and generative models (e.g. BoltzGen).
A design campaign using either method looks very similar: you first generate a large collection of designs and then rank and filter them down to a handful for wetlab testing. In practice the main difference is that optimization (also known as hallucination[[1]]) has high average per-design in silico quality but is very slow, while generative models are much, much faster but have lower average design quality[[2]]. We'll show that, by borrowing the standard posttraining process from the LLM world, we can (probably) have the best of both worlds. To make this concrete: we reduce the cost of generating a single high-quality (according to in silico metrics) 100 amino acid candidate binder to a 350 amino acid target from $1.50 to $0.03. Given that a design campaign might require hundreds or thousands of such candidates, this is a meaningful reduction.


Hallucination: optimize through a fixed model (left). Generative model: sample structures from a generative model (right).

Quick review of hallucination and generative models for binder design
Hallucination means optimizing through a neural network (or combination of networks): keeping the network fixed, varying the input sequence to maximize (for instance) cofolding confidence and sequence recovery. This works quite well, but iterative optimization can be very expensive: generating a single design candidate might require 150 structure predictions. Generative models like BoltzGen produce candidates in a single forward pass; thus, they're typically much faster per design. That's great! Unfortunately, however, the per-design in silico success rates of generative models are currently much lower than those of hallucination-based methods.
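As a toy illustration, the hallucination loop can be sketched in JAX as follows. Every name here is hypothetical: `fold_confidence` is a tiny stand-in for a frozen, differentiable folding-model confidence (the real objective would be something like cofolding confidence plus sequence recovery), and the iteration count mirrors the ~150 predictions per design mentioned above.

```python
import jax
import jax.numpy as jnp

def fold_confidence(binder_logits):
    # Placeholder for a frozen, differentiable folding model that maps
    # soft binder sequence logits to a scalar confidence (higher = better).
    return -jnp.sum(jax.nn.softmax(binder_logits, axis=-1) ** 2)

def hallucinate(key, length=100, n_tokens=20, steps=150, lr=0.1):
    # Start from random soft sequence logits over the 20 amino acids.
    logits = 0.01 * jax.random.normal(key, (length, n_tokens))
    loss_fn = lambda x: -fold_confidence(x)  # maximize confidence
    grad_fn = jax.value_and_grad(loss_fn)
    for _ in range(steps):
        _, g = grad_fn(logits)
        logits = logits - lr * g  # only the input changes; the model stays fixed
    # Discretize: take the argmax residue at each position.
    return jnp.argmax(logits, axis=-1)

seq = hallucinate(jax.random.PRNGKey(0))
print(seq.shape)  # (100,)
```

The expensive part in practice is that each of the 150 gradient steps requires a full structure-prediction forward (and backward) pass.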
This is reflected in practice: in the Adaptyv Nipah competition, we won the de novo in vitro round with 10 binders from a pool of only 150 hallucinated candidates. Generative models like BoltzGen typically recommend sampling 10,000 to 60,000 candidates to get a handful of good designs. This means the overall computational cost of generating a handful of good binders with either approach is roughly equal.
What if we could combine the speed of diffusion with the quality of hallucination? [[3]]
A quick note before we get started: in this post we'll use BoltzGen in ways it wasn't designed to be used and we'll make obviously unfair comparisons between BoltzGen and other methods. It's important to say upfront that we love BoltzGen and the other amazing open-source methods from the Boltz team! In fact, their choice of an MIT license is what allows us to do work like this with a tiny team.
Posttraining for generative binder design
The most obvious way to do this would be to improve the per-design performance of generative models to approach that of hallucination-based methods. This would substantially reduce the overall computational cost of high-quality protein binder design, and could be a major step forward for the field [[4]].
One way to do this is to crib the standard three-stage pretraining-to-posttraining process from the LLM community, where models typically undergo the following:
- large-scale pretraining on relatively unstructured and uncurated data to generate a base model
- smaller-scale finetuning on higher-quality (and possibly synthetic) data related to the downstream task (e.g. dialogue, code generation, etc.)
- reinforcement learning using model-based or verified rewards.
While BoltzGen has been pretrained on a large dataset of protein structures, we'll show a very simple implementation of this playbook for minibinder design in the next few sections. The overall process is as follows:
- We generate a tiny dataset of presumably high-quality protein binders designed using our hallucination library, mosaic.
- We finetune BoltzGen on this dataset to steer its output distribution closer to that of hallucination.
- We apply reinforcement learning with a sequence-based loss function from mosaic to further improve the model's performance.
This is a proof of concept [[5]], but it appears to work well, and there are no obvious barriers to scaling. Consistent with results from the LLM world, we find that even a small amount of finetuning and reinforcement learning substantially improves the model's ability to generate high-quality binders (at least as measured by in silico metrics!).
Step 0: Define a loss functional for hallucination and RL
To generate binders using hallucination and as a target for RL we use the following loss functional from mosaic:
loss = protenix.build_multisample_loss(
    loss=sp.BinderTargetContact()
    + sp.WithinBinderContact()
    + 10.0 * InverseFoldingSequenceRecovery(mpnn, temp=jnp.array(0.001))
    + 0.05 * sp.TargetBinderPAE()
    + 0.05 * sp.BinderTargetPAE()
    + 0.025 * sp.IPTMLoss()
    + 0.4 * sp.WithinBinderPAE()
    + 0.025 * sp.pTMEnergy()
    + 0.1 * sp.PLDDTLoss(),
    features=features,
    recycling_steps=6,
    num_samples=4,
)
This loss functional is similar to the one we used in the Adaptyv Nipah competition, but we've swapped Boltz2 for Protenix (with the 2025 training cutoff). While Protenix does get slightly better benchmark numbers than Boltz2, this was mostly for fun! We wanted to see how far out-of-distribution we could push BoltzGen and still get improvements from finetuning and RL -- Boltz2 is essentially the same model as BoltzGen.
Step 1: Generating a high-quality binder dataset with mosaic
We curated a set of 22 benchmark targets [[6]], and used mosaic to generate just 64 binders of length 100 to each target [[7]]. This cost about $1000 using modal.
These binders look pretty good according to in silico metrics, and appear fairly diverse in terms of structure and sequence. We simply rank them by iPTM + ipSAE, and retain only the top half.
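The rank-and-filter step is simple; a minimal sketch (with synthetic scores standing in for the real Protenix iPTM and ipSAE values):

```python
import numpy as np

def keep_top_half(designs, iptm, ipsae):
    # Rank designs by the sum of the two confidence metrics (higher = better)
    # and keep the best half as the finetuning dataset.
    scores = np.asarray(iptm) + np.asarray(ipsae)
    order = np.argsort(-scores)            # descending by score
    keep = order[: len(designs) // 2]
    return [designs[i] for i in keep]

designs = [f"design_{i}" for i in range(64)]
rng = np.random.default_rng(0)
kept = keep_top_half(designs, rng.uniform(0, 1, 64), rng.uniform(0, 1, 64))
print(len(kept))  # 32
```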
Top 3 hallucinated designs for three representative targets:
Step 2: Finetuning BoltzGen on hallucinated binders
Our goal is to steer the distribution of BoltzGen closer to that of mosaic. To that end, we first finetune BoltzGen on the hallucinated binder dataset from step 1. To substantially reduce computational cost, we freeze the \(O(N^3)\) conditional encoder (known as the trunk in other AF3 architectures) and finetune the structure module only [[8]]. For additional computational savings we use a JAX translation of BoltzGen. The code for this is simple: we sample minibatches of the hallucinated binders and take gradient steps on the standard BoltzGen denoising diffusion loss.
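Schematically, the finetuning step looks like the sketch below. The encoder and denoiser here are tiny linear stand-ins, not BoltzGen's real modules; the point is only the structure of the update: the conditional encoder parameters never receive gradients, and the structure module is trained with a standard denoising objective on the hallucinated binders.

```python
import jax
import jax.numpy as jnp

# Hypothetical stand-ins for the frozen conditional encoder and the trainable
# structure (denoising) module; these are not BoltzGen's real APIs.
def encoder_apply(frozen_params, features):
    return features @ frozen_params["proj"]           # (N, 3) conditioning

def denoiser_apply(params, cond, noisy_coords, t):
    return noisy_coords @ params["w"] + t * cond      # predicted clean coords

def diffusion_loss(params, frozen_params, features, coords, key):
    # Standard denoising objective: corrupt the hallucinated-binder coordinates,
    # then predict them back conditioned on the frozen encoder output.
    t_key, n_key = jax.random.split(key)
    t = jax.random.uniform(t_key, ())
    noisy = coords + t * jax.random.normal(n_key, coords.shape)
    cond = encoder_apply(frozen_params, features)
    return jnp.mean((denoiser_apply(params, cond, noisy, t) - coords) ** 2)

@jax.jit
def finetune_step(params, frozen_params, features, coords, key, lr=1e-4):
    # Differentiate w.r.t. the structure-module params only; the encoder stays frozen.
    loss, grads = jax.value_and_grad(diffusion_loss)(
        params, frozen_params, features, coords, key)
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss

key = jax.random.PRNGKey(0)
params = {"w": jnp.eye(3)}                 # trainable structure module
frozen = {"proj": 0.1 * jnp.ones((8, 3))}  # frozen conditional encoder
features = jax.random.normal(key, (100, 8))
coords = jax.random.normal(key, (100, 3))
params, loss = finetune_step(params, frozen, features, coords, key)
print(loss.shape)  # ()
```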
Step 3: Reinforcement learning with a custom reward function
Next, we apply reinforcement learning to the finetuned BoltzGen structure module to directly optimize the mosaic loss functional \(\ell.\) This step is operationally quite interesting: the mosaic loss depends only on the binder sequence, but the full output of BoltzGen is a structure of the binder-target complex. We find that simply extracting the sequence from the complex and passing that into the mosaic loss works fine.
Our algorithm [[9]] is roughly:
- Generate a batch of structures from BoltzGen
- Compute \(L_i = \ell(\text{seq}(S_i))\) for each structure \(S_i\) in the batch
- Compute a GRPO-like advantage \(A_i = L_i - \bar{L}\), where \(\bar{L}\) is the batch mean of the losses from the previous step.
- Take a gradient step on a weighted version of the finetuning loss function:
\[ \mathcal{L}(\theta) = - \frac{1}{N} \sum_{i=1}^N A_i \text{diffusion\_loss}(S_i, \theta).\]
Here \(\text{diffusion\_loss}(S_i, \theta)\) is the standard BoltzGen denoising diffusion loss for structure \(S_i\) under model parameters \(\theta\). Intuitively, this algorithm increases the likelihood of generating structures with better-than-average mosaic loss and decreases the likelihood of worse-than-average structures.
This simple algorithm works quite well! We use the same JAX translation of BoltzGen as in step 2. At each iteration we generate 256 designs for a single target, and take a gradient step on the above loss -- we do this for ~50 iterations.
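Putting the steps together, one RL iteration can be sketched as below. Again, `mosaic_loss` and the per-sample diffusion loss are toy placeholders (the real ones come from mosaic and the BoltzGen denoising objective); what matters is the advantage-weighted update.

```python
import jax
import jax.numpy as jnp

def mosaic_loss(seq):
    # Placeholder for the sequence-only mosaic loss (lower = better binder).
    return jnp.mean(seq ** 2)

def per_sample_diffusion_loss(params, structures, key):
    # Placeholder per-structure denoising loss under the current params.
    noise = jax.random.normal(key, structures.shape)
    pred = (structures + noise) @ params["w"]
    return jnp.mean((pred - structures) ** 2, axis=(1, 2))  # one loss per structure

def rl_step(params, structures, seqs, key, lr=1e-4):
    # Steps 1-2: score each generated structure's extracted sequence.
    L = jax.vmap(mosaic_loss)(seqs)
    # Step 3: GRPO-like advantage (better-than-average designs get A_i < 0).
    A = L - jnp.mean(L)
    # Step 4: gradient step on -(1/N) * sum_i A_i * diffusion_loss_i.
    def weighted(params):
        return -jnp.mean(A * per_sample_diffusion_loss(params, structures, key))
    grads = jax.grad(weighted)(params)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

key = jax.random.PRNGKey(0)
params = {"w": jnp.eye(3)}
structures = jax.random.normal(key, (256, 100, 3))  # batch of generated complexes
seqs = jax.random.normal(key, (256, 100))           # stand-in sequence encodings
params = rl_step(params, structures, seqs, key)
print(params["w"].shape)  # (3, 3)
```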
Results
Evaluation on benchmark targets with mosaic
We evaluated each model by generating 64 designs (of length 100) per target and scoring them with Protenix (iPTM + ipSAE, 10 recycling steps, ranking by the mean over 6 diffusion samples). Note that we don't do any inverse folding, and 64 designs is tiny -- this is not how BoltzGen is intended to be used! None of these results should be taken as an indictment of BoltzGen's capabilities. Here are iPTM survival curves and distributions for three representative targets:


Finetuning and RL both substantially improve the consistency of generated binders. In an actual binder design campaign, we'd generate many, many more designs and select the top handful for experimental validation, so the improvements in the tail of the distribution are also encouraging. Across all targets, hallucination still performs best overall.
Diversity of generated binders
A natural concern with any application of RL is a loss of diversity in the generated designs. While the designs lose only a small amount of sequence diversity, the RL model produces essentially nothing but helix bundles (generally a great strategy to maximize cofolding confidence).


The base BoltzGen model produces very low diversity sequences for PDL1 at this sequence length. Oddly, the model finetuned on the hallucinated structures produces far fewer beta sheets than occur in the hallucinated structures themselves.
A related trend is the massive enrichment of glutamates and alanines as we finetune and apply RL. The natural frequencies from SwissProt are shown for reference. These residues are a natural choice to create alpha helices; but unnaturally high A/E content may increase folding model confidence without actually improving the quality of the designs as binders. We'd have to test these designs in vitro to know for sure. Luckily, it's very easy to modify the mosaic loss functional to, for instance, penalize overuse of certain residues or secondary structures.
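For instance, a composition penalty term might look like the sketch below. This is hypothetical, not mosaic's actual API: the residue indices, the alphabet ordering, and the `max_freq` threshold are all made up for illustration.

```python
import jax.numpy as jnp

def composition_penalty(seq_probs, max_freq=0.15, residues=(0, 6)):
    # Penalize over-representation of selected residues (e.g. Ala and Glu in
    # some hypothetical alphabet ordering): hinge loss on the mean per-residue
    # frequency above the max_freq threshold. Differentiable, so it can be
    # added directly to a hallucination/RL loss functional.
    freqs = jnp.mean(seq_probs, axis=0)                    # (20,) frequencies
    excess = jnp.maximum(freqs[jnp.array(residues)] - max_freq, 0.0)
    return jnp.sum(excess)

# A uniform sequence pays no penalty; an all-alanine sequence pays a large one.
uniform = jnp.full((100, 20), 1.0 / 20)
all_ala = jnp.zeros((100, 20)).at[:, 0].set(1.0)
print(float(composition_penalty(uniform)))  # 0.0
print(float(composition_penalty(all_ala)))  # ~0.85
```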

Generalization
Do these results generalize to targets not seen in training? And do the same findings hold when we use BoltzGen properly? To test this, we evaluated our RL'd model on EGFR, an older Adaptyv competition target which is not in our training set. We ran the pipeline in two different ways:
- using our JAX translation of BoltzGen, where we inverse fold with ProteinMPNN (instead of BoltzGen's default inverse folding model), refold with Boltz-2 and rank the designs using our favorite (and simple) ranking metric iPTM + ipSAE.
- loading the RL model weights in the standard BoltzGen implementation and running their full design and ranking pipeline.
For each setup we generated 10k length 80-120 binders (5k with each set of weights), mixed them together, and ranked all designs after clustering. Both ranking methods strongly prefer our weights: with mosaic ranking (Boltz2 iPTM + ipSAE), all 100 top ranked designs were generated by our RL model. The BoltzGen ranking method is much more complicated and involves Rosetta interface and bond metrics; even so, 94 out of the top 100 designs from this pipeline come from our model.
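Schematically, the head-to-head comparison works like this; the scores here are synthetic stand-ins (in the real experiment they come from Boltz2 iPTM + ipSAE or the BoltzGen ranking pipeline), so the score gap between the pools is an assumption of the sketch, not a measured result.

```python
import numpy as np

# Mix designs from both weight sets, rank by a shared score, and count where
# the top 100 come from.
rng = np.random.default_rng(0)
pools = (["rl"] * 5000) + (["base"] * 5000)
scores = np.concatenate([
    rng.normal(0.8, 0.1, 5000),   # RL'd weights (assumed higher-scoring here)
    rng.normal(0.5, 0.1, 5000),   # base weights
])
top = np.argsort(-scores)[:100]            # indices of the 100 best designs
n_rl = sum(pools[i] == "rl" for i in top)  # provenance count for the top 100
print(n_rl)
```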


As a side note, translating BoltzGen to JAX gives serious speed improvements: even when we recompile every 500 designs to generate different binder lengths, our reimplementation is 3x faster (8 seconds per design vs 24 seconds per design on an H100, including design, inverse folding, and refolding).
Discussion
This is preliminary work, but it's exciting to see that even a very small amount of finetuning and reinforcement learning improves the performance of BoltzGen (at least according to structure-prediction metrics). The finetuned and RL'd checkpoints are available on HuggingFace if you'd like to try them out -- note these are trained only on 100 AA binders on the 22 benchmark targets: performance will degrade the further you get from this distribution.
There are many open questions and future directions here. First and foremost, do our models improve upon the success rate of BoltzGen in vitro? Similarly, does something like this approach work with actual experimental feedback (or a model fit to experimental feedback, a la RLHF) rather than a structure prediction loss? Stay tuned.
A related question: are we doing anything more than teaching the model to generate more alpha-helices and fewer beta-sheets and loops?
As always, there are endless technical questions and potential improvements:
- Does this scale to larger datasets and more targets? Is it worth scaling up?
- Can we improve the reward function? Unlike hallucination-based design we do not need to worry about differentiability here, which opens up many possibilities: for instance the addition of physics-based energy functions.
- Are both finetuning and RL necessary?
- Can we provide more finegrained feedback during RL or finetuning? For instance, instead of just providing a single loss value for the whole structure, can we provide residue-level feedback to the model?
- Honestly, this feels a bit circular: we're using three models (BoltzGen, Protenix, and ProteinMPNN) that all trained on the same dataset (PDB) to generate, evaluate, and provide feedback on the same set of hallucinated designs. Do we need to do this? Would better low temperature sampling be enough?
- Refolding to rank designs is still very expensive; if we wanted to further reduce computational cost we'd need to replace or speed up this step.
Thanks to Brian Naughton and Martin Pacesa for feedback on an early version of this post.
[[1]]: Not to be confused with the (much more common) use of "hallucination" in the LLM world to refer to the generation of factually incorrect text.
[[2]]: Everything we'll talk about today is about generic globular protein binders (i.e. minibinders), not antibodies or VHH, where, currently, generative models are doing much better.
[[3]]: It's easy to set up this dichotomy between optimization (hallucination) and sampling; but in theory they're not so different. At an obnoxiously high level, optimization can be thought of as very low temperature sampling, and sampling almost always admits a variational formulation (can be written as an optimization problem). At a slightly lower level of abstraction, the actual algorithms used to sample from generative models are, in fact, iterative methods that look very similar to optimization algorithms (typically something like Langevin dynamics; noisy gradient descent). For generative binder design as actually practiced they look even more similar: designs are always ranked by roughly the same objective used in optimization! There are obviously some differences -- the objective function used in hallucination and the log-probability learned by generative models are not exactly the same, for instance. This is probably why (currently) some generative models are not quite as good at globular protein design -- if you see two chains in contact in PDB, it's not necessarily the case that they are each good binders to each other. Some of the most interesting and promising recent methods like PPIFlow are pretty clearly explicit hybrids of optimization and sampling. It's pretty clear that, long term, generative models are a great way to go, even if you're interested in optimization.
[[4]]: Actually, maybe not! Current low-throughput kinetics measurement techniques like BLI or SPR are expensive; typically at least $100 per sample. This is probably why most people are happy paying a similar amount for compute to generate a handful of good designs. Really, you'd only be interested in this if you had an accurate measurement assay with much better scaling than BLI/SPR.
[[5]]: Seriously. We use a tiny dataset, a single binder length, do very little HPO, even less analysis. In particular it's possible that our model is reward hacking and in vitro success rates are zero. It's pretty cool though.
[[6]]: PD-L1, PDGFR-beta, IFNAR2, IL-7Ralpha, CCL2, HNMT, ORM2, IL-13, GM2A, IL-3, IL-20, RFK, NGF, IDI2, BHRF1, PMVK, AMBP, INSR, ACE2, METTL16, MZB1, and BBF14.
[[7]]: Rough code for this here.
[[8]]: Finetuning both seems to perform roughly the same, but is slower.
[[9]]: This algorithm is purely heuristic, though I'm sure you could come up with some explanation involving ELBOs or something if that's what you're into. Our overall impression of the RL literature for generative models at the moment is that you can do whatever you want. For a great overview of the field see the argmin blog.