benchmarking is the new data activation

give me an optimizable metric and i’ll move the world.

benchmarking, i.e., the act of building and applying benchmarks, is a new(ish) form of data activation: a way of turning domain data into something models can be measured against, ranked by, and eventually trained on. i wrote about data activation a while back, and i want to come back to it, because i think benchmarking is one of the cleaner examples of the thing i was reaching for then.

start with where models actually get good. the places where large language models have improved fastest, and keep improving, are the ones with a verifiable hill to climb: coding first (probably the most verifiable task there is), then math, then optimization problems with a clear target. the common thread is a metric that tracks the target well enough for optimization to mean something. Archimedes wanted a place to stand; a benchmark is one example of such a place.

most complex domains don’t come with that hill. medicine and biology are the ones i care about, and those don’t natively have one: the “substrate” is messy, longitudinal, mostly not sitting in the model’s input space the way code is. so the interesting question, before any of the RL machinery, is whether you can give the domain a hill at all.

the first insight i want to start with is that just being able to measure is already a basic form of activation. if you take health data (the structured records, the workflows, the information already sitting around) and turn it into something you can score a model against (what does this system actually know, and where does it fail), you have pulled value out of that data even though nothing inside the model has changed. this is a looser sense of “activation” than the one in that first post, where it meant getting data into the weights as supervision. here, the data activates by becoming the surface models are measured, selected, and eventually trained against. right now, for health, we mostly cannot do even this first part well. we do not have a good answer to “what do these models know”, and being able to answer it is worth something on its own. the cliché holds (more than i’d like): you can’t improve what you can’t measure.1

verifiers are the same move taken one step further, to where it folds in on itself. when the benchmark is also an environment you can run reinforcement learning against,2 the score is no longer just an after-the-fact report. it becomes the reward. measuring the model and improving it stop being two separate things you do in sequence; they share the same substrate. that coupling is the power and the danger: the better the benchmark, the more useful the training signal; the worse the benchmark, the more faithfully you optimize the wrong thing.
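
to make the coupling concrete, here is a minimal sketch, assuming a toy exact-match grader. the names (`grade`, `evaluate`, `rewards`) and the task format are hypothetical, not taken from any real verifiers library; the only point is that one function feeds both the report and the training signal.

```python
from typing import Callable, List, Tuple

# hypothetical task format: (prompt, expected answer)
Task = Tuple[str, str]
Model = Callable[[str], str]


def grade(task: Task, answer: str) -> float:
    """Deterministic check: 1.0 if the answer matches the expected one, else 0.0."""
    _, expected = task
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def evaluate(model: Model, tasks: List[Task]) -> float:
    """Benchmark use: average the grades into a score reported after the fact."""
    return sum(grade(t, model(t[0])) for t in tasks) / len(tasks)


def rewards(model: Model, tasks: List[Task]) -> List[float]:
    """Training use: the same grades, per task, handed to an optimizer as reward."""
    return [grade(t, model(t[0])) for t in tasks]


# toy tasks and a toy "model", invented for illustration
tasks = [("2+2?", "4"), ("capital of France?", "Paris")]
toy_model = lambda prompt: "4" if "2+2" in prompt else "London"

score = evaluate(toy_model, tasks)
per_task_reward = rewards(toy_model, tasks)
```

`evaluate` and `rewards` differ only in what happens to the numbers afterwards, which is the whole point: any mistake inside `grade` shows up in both.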

i don’t think any of this is really an argument with the bitter lesson. it sits roughly orthogonal to it. the bitter lesson says general methods win once scale is available. the question underneath is the one that comes first: is the domain even in a shape where scale has “somewhere to go”? code and math are. medicine, mostly, is not. so part of the work is converting messy domain material into tasks with checkable outcomes, and benchmarking is one way to do that conversion. i put it this way recently and i still like it: sometimes the domain is already in a shape where scale can eat it, sometimes the work is turning the domain into a shape where scale has traction. what’s actually new,3 then, isn’t reinforcement learning; it’s RL-over-verifiers arriving in domains that were never natively language or code.

there’s more than one way to build that kind of scoring surface, and the ways are not interchangeable. i find it useful to lay them on a few axes: where the ground truth lives, what it costs to build, and how far it sits from the thing you actually care about.

at one end is the latchbio approach (SpatialBench-Long is the clearest recent example i’ve read). here the ground truth is rebuilt from the raw data itself: you hand the agent raw or near-raw data plus enough calibrated context to approximate what a scientist would know at the start, and you grade both the conclusion and the path to it. the special sauce is that claims from the original papers are candidates, rechecked against the data before they are allowed to be answers. it’s expensive, and it’s the closest to the real thing (how exactly to do this in medicine is kind of an open question, and how to do it at scale is a whole new question altogether).
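
the “claims are candidates” step can be sketched roughly like this. it is an illustration under my own assumptions, not latchbio’s actual pipeline: the `Claim` type, the `surviving_claims` helper, and the numbers are all invented. each candidate claim carries a check that recomputes it from the raw data, and only the claims that reproduce are allowed into the answer key.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# hypothetical raw data: measurement name -> values
RawData = Dict[str, List[float]]


@dataclass
class Claim:
    text: str
    check: Callable[[RawData], bool]  # recomputes the claim from raw data


def surviving_claims(claims: List[Claim], data: RawData) -> List[Claim]:
    """Keep only the candidate claims that reproduce when rechecked against the data."""
    return [c for c in claims if c.check(data)]


# invented toy data and candidate claims
data = {"treated": [2.1, 2.4, 2.2], "control": [1.0, 1.2, 0.9]}
candidates = [
    Claim("treated mean exceeds control mean",
          lambda d: mean(d["treated"]) > mean(d["control"])),
    Claim("control mean exceeds 2.0",
          lambda d: mean(d["control"]) > 2.0),
]

answer_key = surviving_claims(candidates, data)
```

the expensive part in practice is writing each `check`, which is where the expert time goes; the filter itself is trivial.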

in the middle, the HealthBench approach: rate conversations against rubrics written by physicians. the unit of work is clinical answer quality: did the response catch the relevant issue, communicate it safely, avoid harm, avoid overreach. that’s real clinical labor, and it is much easier to scale than rebuilding claims from raw data. but for ranking frontier models the signal drifts toward style, and it saturates pretty quickly.

at the lightweight end (but not unworthy!), the MedMarks approach: mostly multiple choice, wrapped as a verifiers environment so you can train directly against it. most reusable, easiest to generate, and also the furthest from what the clinical work actually is, and the easiest to contaminate.

QuestBench is interesting for a different reason (i mentioned it last time too). on these axes it doesn’t sit anywhere special, since underneath, most of it is still multiple choice.4 what’s worth stealing is the concept. it asks two things: can the model notice that a piece of information is missing, and can it use that piece once it’s provided. the second is ordinary reasoning QA. the first is the one i care about, because a lot of real tasks fail before the answer even begins: the system just doesn’t know what it needs to know. that maps almost directly onto medical QA, where the visible vignette hides the upstream clinical work of noticing which fact matters. i keep wondering whether some medical QA should be shaped more like a constraint satisfaction problem: which missing fact would split the differential, rule out the current action, or change management. not all of it, obviously. but a lot of the useful work is narrowing the space, and almost none of our benchmarks test that part.
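
a rough sketch of the “which missing fact splits the differential” idea, under heavy assumptions: the candidates and facts are abstract placeholders (`dx_a`, `f1`, ...), not real clinical content, and the selection rule (ask about the fact whose most common answer leaves the fewest candidates standing) is one simple choice of mine, not QuestBench’s method.

```python
from collections import Counter
from typing import Dict, List, Optional

# hypothetical toy: candidate -> the fact values expected under that candidate
Candidates = Dict[str, Dict[str, str]]


def consistent(candidates: Candidates, known: Dict[str, str]) -> List[str]:
    """Candidates that agree with every fact we already know."""
    return [name for name, facts in candidates.items()
            if all(facts.get(k) == v for k, v in known.items())]


def best_question(candidates: Candidates, known: Dict[str, str]) -> Optional[str]:
    """Pick the unknown fact whose worst-case answer leaves the fewest candidates."""
    remaining = consistent(candidates, known)
    unknown = {f for n in remaining for f in candidates[n]} - set(known)
    if not unknown:
        return None

    def worst_case(fact: str) -> int:
        counts = Counter(candidates[n].get(fact) for n in remaining)
        return max(counts.values())

    # sorted() only makes tie-breaking deterministic
    return min(sorted(unknown), key=worst_case)


candidates = {
    "dx_a": {"f1": "yes", "f2": "pos", "f3": "high"},
    "dx_b": {"f1": "yes", "f2": "pos", "f3": "low"},
    "dx_c": {"f1": "yes", "f2": "pos", "f3": "high"},
    "dx_d": {"f1": "yes", "f2": "neg", "f3": "low"},
    "dx_e": {"f1": "no", "f2": "neg", "f3": "low"},
}

ask = best_question(candidates, {"f1": "yes"})
narrowed = consistent(candidates, {"f1": "yes", "f3": "low"})
```

in this toy, `f2` is nearly useless (three of the four remaining candidates share the same value) while `f3` halves the space, so a grader built this way can score the question a model asks, not just the answer it gives.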

none of this makes the underlying data work easy. “fracking” value out of complex data is easier than it has ever been, but it is still hard, and the most impressive work i’ve seen, the latchbio-style work, is also the most expensive: groups of experts taking a complex process apart by hand (and that “cost” is a clue).

because the cost is curation, and curation is exactly where i argued the problem was. in “curation all the way down”, the issue was that a benchmark can claim to measure clinical reasoning after expert judgment has already done much of the reasoning upstream. the vignette looks like the task, but a lot of the task happened before the vignette existed.

that tension looks different when the benchmark itself is the activation layer. curation is still doing the work, but now the work is explicit: deciding which raw artifacts count, which claims survive rechecking, what the model is allowed to see, what gets rewarded, and what gets left out. latchbio’s special sauce, choosing claims that reproduce from raw data and grade deterministically, is expert judgment made into a reward function. i’m not sure that fully dissolves the tension, but it does move it somewhere more inspectable. the judgment has to land somewhere concrete. the benchmark is one of the places it can land.

a few honest caveats, because the framing oversells if i let it: reproducing a known conclusion from raw data is a real step, maybe a more meaningful one than i would have guessed, but it is not a “general scientific/medical mind”, and it is not the part where you ask the right question from nothing. (it’s also why QuestBench’s first question is important: noticing what’s missing is closer to the real skill than answering once everything is laid out.) and none of this is me calling MedMarks or HealthBench bad work. multiple-choice saturation isn’t a sin, it’s what happens when a benchmark is fully open and the next model trains on it. the one thing i want to flag is harder to wave away: if a benchmark only evaluates, a flaw in it mostly distorts my interpretation of a model. if the same benchmark becomes the reward, the flaw becomes an incentive. the model learns toward it.
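
a toy version of that last caveat, with everything invented for illustration: `flawed_grade` is a deliberately bad proxy that counts cautious-sounding keywords, `true_quality` is the thing we actually care about, and picking the best-scoring candidate stands in for optimization pressure (it is best-of-n selection, not RL). no real benchmark is being modeled here.

```python
from typing import List


def true_quality(answer: str) -> float:
    """What we care about (toy): the answer picks the correct option, 'b'."""
    return 1.0 if answer.startswith("b") else 0.0


def flawed_grade(answer: str) -> float:
    """Flawed proxy (toy): rewards cautious-sounding keywords, never checks the option."""
    keywords = ("consult", "monitor", "follow up")
    return sum(k in answer for k in keywords) / len(keywords)


# invented candidate responses to a single item
candidates: List[str] = [
    "b: start treatment",
    "b: start treatment, monitor",
    "a: consult, monitor, follow up",
]

# evaluation only: the flaw misreports a fixed answer, nothing else changes
reported = flawed_grade(candidates[0])

# optimization: selecting on the flawed grade actively picks the wrong answer
picked = max(candidates, key=flawed_grade)
```

used only as a report, the flaw gives a correct answer a score of zero, which is a misreading i can notice and discount. used as the selection criterion, the same flaw chooses the wrong answer with a perfect score.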

which brings the original question back, but more concretely. to the question “how can you activate your data”, one answer is: decide which parts of the domain can be turned into meaningful tasks, build the environment where those tasks can be attempted, and make success verifiable enough that scale can pull on it. call that one concrete form of data activation. more and more, the act of building the benchmark looks like the place where that work actually lives.

  1. or you can, but you won’t know that you did, which for medicine and other high-stakes work is close enough to not having improved at all. 

  2. by “verifiers environment” i mean a benchmark you can optimize against directly with RL, not just score after the fact. the grading function becomes the reward. MedMarks-style setups are built this way on purpose, which is part of what makes them attractive and part of what makes them fragile. 

  3. new as in newly central, not literally first; by mid-2026 there are already examples outside code/math. 

  4. the format is also fairly constrained: you have to write items with exactly one missing piece of information, which is part of why it stays close to multiple choice. the interesting concept and the actual instrument come apart a little here.