Antibody design de novo vs in vivo

The Baker lab tackles de novo antibody design by narrowing the problem as much as possible.

Last week saw a sizable update to the RF-antibody preprint[^1] as well as the release of its source code. RF-antibody repurposed the backbone design model RF-diffusion[^2] to create single-chain antibodies that bound, with some success, to some previously validated targets, such as toxin B from C. difficile and hemagglutinin from influenza. Coming less than one year after the publication of vanilla RF-diffusion, the project initially made a considerable splash, and coincided with the announcement of a spin-off startup, which hired the researchers behind the report. Whereas the first version focused only on VHHs, the updated version includes details on the design of single-chain variable fragments (scFvs), some retrospective binder validation with AlphaFold3[^3], and most importantly, the source code.

Looking back on the preprint, it’s clear that some of the success can be attributed to how the authors managed to constrain the problem of de novo antibody design, which is defined by the huge number of potentially viable sequences (estimates put it as high as 10^18 possible antibodies[^4,5]). The adaptive immune systems of jawed vertebrates navigate this space using something akin to a combination of high-throughput screening and directed evolution: infection exposes the antigen(s) to millions or billions of naive germline antibodies that vary in chain and CDR composition, and successful binders are then affinity-matured in a Darwinian process called somatic hypermutation. In effect, this system constrains the problem by limiting the search to promising germline lineages that already bind to their target, and the fact that nearly identical antibodies routinely show up in different humans attests to the influence these constraints have on the outcome[^4].

Now contrast this with RF-antibody, which limits the antibody design problem by massively reducing the number of design parameters. Whereas natural germlines have different sequence compositions and CDR lengths, RF-antibody exclusively tests VHHs and scFvs with one hand-picked framework each (h-NbBcII10FGLA and trastuzumab/Herceptin, respectively) and a fixed set of CDR lengths. The users then pre-specify the epitope, leaving the algorithm to sort out docking, CDR conformation design, and CDR sequence design. Their workflow couples the first two of these steps, which drew most of the attention, while leaving the third for later.
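To get a rough feel for the numbers involved, here is a back-of-the-envelope sketch of how large the raw CDR sequence space is once the framework and CDR lengths are fixed and every CDR position is allowed to be any of the 20 standard amino acids. The CDR lengths below are hypothetical values chosen for illustration, not the lengths used in the preprint, and the function name is my own. The count is an upper bound on the space, not a description of what any sequence design tool actually samples.

```python
import math

AMINO_ACIDS = 20  # size of the standard amino acid alphabet


def log10_sequence_space(cdr_lengths):
    """Return log10 of the number of distinct CDR sequences.

    Assumes every CDR position is free to be any of the 20 standard
    amino acids, with the framework held fixed.
    """
    total_positions = sum(cdr_lengths)
    return total_positions * math.log10(AMINO_ACIDS)


# Hypothetical CDR1/CDR2/CDR3 lengths for a single VHH (illustration only).
vhh_cdr_lengths = [8, 8, 15]

log10_size = log10_sequence_space(vhh_cdr_lengths)
print(
    f"{sum(vhh_cdr_lengths)} designable positions "
    f"-> roughly 10^{log10_size:.1f} possible CDR sequences"
)
```

Even with a single framework and fixed loop lengths, a few dozen free positions put the unconstrained count far above the 10^18 figure quoted earlier, which is why the choice of sequence design method matters so much.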

It is at this third step that the sequence space can be more thoroughly explored than what would ordinarily happen in the immune system. Their sequence design method of choice, ProteinMPNN[^5], tends to design CDR sequences that look nothing like antibodies[^6] (we have a preprint coming out soon that goes into more detail about this). On the one hand, this simplifies the problem considerably, since only the residues in the CDRs are being designed. On the other hand, it massively expands the search space, as ProteinMPNN designs CDRs that aren’t constrained by the processes defining our adaptive immune systems[^7]. The authors acknowledge as much, stating in their discussion that “designing sequences that more closely match human CDR sequences would be expected to reduce the potential immunogenicity and developability”. In any case, if the authors intended this pipeline to be a minimum viable product, then they succeeded by identifying binders in the first place, and can now tackle the harder problem of turning them into drugs. I suspect that, as the brakes are applied to ProteinMPNN during CDR sequence design in service of that objective, the resulting reduction in diversity will be counterbalanced by considering other frameworks and CDR lengths. Over time, this may result in a broader pool of potential binders more reflective of what the human adaptive immune system is capable of producing.