what i've read lately

24 Jun 2026 · reading, agents, memory, evaluation


i am still pretty bad at collecting things i read. the attempts are worth it though, i think. mostly because the act of collecting pushes a little against the default “decay” (i.e., where links disappear into saved tabs, impressions flatten out, and then a few weeks later i remember only that something seemed “important”). this list is biased in the ordinary ways: some things stayed with me and i forgot to write them down. other things stayed with me so strongly that they already became separate posts. so this is not a clean index of what i read. the rough shape: agents, memory, benchmarks, and the problem of making models useful in domains where the substrate is not already language or code.

“maybe later” was a feature is short and good, and it lands near a theme i keep circling: friction is not always a bug. when small features become cheap to implement, the default pressure is to add them. the cost moves from engineering time into product surface area, maintenance, and user cognition. this is familiar from software, but it rhymes with a lot of agent work too. sometimes the useful thing is making an action a little harder, so the human has to stay in the loop long enough to decide whether it is worth doing.

that is also why learning-opportunities caught my eye. the framing is directionally right (and i occasionally enjoy using it): add a little friction back into AI-assisted work so it does not become pure cognitive offloading.

How to write well with AI, by Quico Toro, belongs in the same cluster. AI is useful for writing when you can evaluate the output. that sounds obvious, but it is the whole game. if you have taste, domain knowledge, and a live sense of what the piece is trying to do, the tool can be useful.

the agent workflow pieces were useful in a more practical way. Mitchell Hashimoto’s AI adoption journey was probably the one that stuck most. part of it was just the pleasant experience of seeing someone else rediscover patterns i have also found useful: warm starts, “Hemingway bridge” style handoffs, using agents near the end of the day when human efficiency is dropping, and building verification into the workflow. the verification point matters especially. a lot of agent use still has a strange shape: the model does a task, then the human squints at the result. better workflows make the task produce evidence as it runs.
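a minimal sketch of what "produce evidence as it runs" could look like, assuming a toy setup: the `Run` and `Evidence` names are hypothetical and not taken from any of the posts above. each step carries its own check, and the outcome is recorded at the moment the step executes rather than inspected afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """One recorded check: which step ran, whether it passed, and what it produced."""
    step: str
    passed: bool
    detail: str


@dataclass
class Run:
    """Hypothetical wrapper that logs a verification result for every step."""
    evidence: List[Evidence] = field(default_factory=list)

    def step(self, name: str, action: Callable[[], object],
             check: Callable[[object], bool]) -> object:
        # run the action, verify it immediately, and keep the record
        result = action()
        self.evidence.append(Evidence(name, check(result), repr(result)))
        return result

    def ok(self) -> bool:
        # the run counts as verified only if every recorded check passed
        return all(e.passed for e in self.evidence)


run = Run()
run.step("sort", lambda: sorted([3, 1, 2]), lambda r: r == [1, 2, 3])
run.step("total", lambda: sum([1, 2, 3]), lambda r: r == 7)  # deliberately failing check
```

the human then reads `run.evidence` instead of squinting at the final output: here the second entry is marked as failed, with the actual value attached.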

that connects to Simon Willison’s agentic engineering patterns, and to all the show-don’t-tell tooling around agent work. i continue to think a meaningful fraction of software work now should be meta-work: improving the loop itself, improving the harness, improving the instructions, improving the verification path. maybe not exactly 20%, but that number is a useful reminder. if agents are part of the production process, the production process itself becomes a live object of engineering.

Night Shift is another version of this. it is a nice implementation of separating the agent trenches from the actual thinking work. babysitting agents is tempting because it feels active. often it is just attention leakage. the valuable pattern is closer to setting up a run, letting it explore, and coming back to artifacts you can inspect.

Codex-maxxing, by Jason Liu, was helpful while i was using Codex more heavily. it is one of those posts where the useful bits are small and operational: how to structure context, when to let the agent run, how to make the loop more legible. none of this is magic. that is part of why it is useful.

Can LLMs Be Computers?, from Percepta, was the weirdest item in this cluster. they casually build a WASM interpreter inside the LLM as part of the argument. i am still not sure what i think about the big framing, but the side quest is wild. it is one of those pieces that makes the boundary between “model capability” and “system capability” feel especially blurry.

the memory/context pieces were alive for me because i was working on my own second-brain setup while reading them. Forgetting, by Tim Kellogg, is a useful counterweight to the impulse to store everything. Oscar Austegard’s Muninn at 100 Days and three clocks for forgetting are from almost the opposite direction, taking persistent memory systems seriously as lived infrastructure. i enjoy Oscar’s projects in general; they tend to be concrete enough that the abstract questions become easier to think about.

i also liked Context is the New Training, from Molnar’s tabular foundation model series.

the self-improvement pieces are where the optimism and unease sit closest together. Nathan Lambert’s Lossy self-improvement is valuable as a brake on autoresearch excitement. the phrase that stayed with me was his “complexity brake”: the more progress science makes toward understanding intelligence, the harder the next increment becomes. i do not know if this is exactly right, but it names something real. some domains get harder as you understand them better, because the easy abstractions stop carrying enough weight.

The Darwin Godel Machine is probably one of the best current demonstrations of self-improving agents. Hyperagents pushes in the same direction. i think the excitement is partly justified at this point, but the pieces mostly make me feel the benchmark problem more sharply. if systems can rewrite pieces of themselves, or their harnesses, or their search strategies, then measurement has to become much more serious (but enough about that for now).

this is also why QuestBench and its dataset were interesting. the benchmark asks two questions: can the model notice that information is missing, and can it use the missing information once provided? the second part is basically ordinary reasoning QA. the first part is the important one. many real tasks fail before the answer begins, because the system does not know what it needs to know. that feels especially relevant to medical QA, where the visible vignette often hides the upstream clinical work of noticing which information matters.

i keep wondering whether some medical QA should be formalized more like a constraint satisfaction problem. not all of it, obviously. but a lot of the useful work is narrowing the space: what facts would disambiguate this, what findings would make the current answer impossible, what missing piece changes the action? QuestBench still uses multiple choice for the missing information step, which limits the analogy, but the frame is useful.
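a minimal sketch of that framing, under loud assumptions: the conditions, findings, and function names below are invented for illustration and are not from QuestBench or any clinical source. known findings prune the candidate set, and the "missing information" step becomes picking the unknown finding whose answer narrows the space most in the worst case.

```python
from typing import Dict, List, Optional

# hypothetical toy knowledge base: every candidate fixes a value for every finding
CANDIDATES: Dict[str, Dict[str, bool]] = {
    "condition_a": {"fever": True, "rash": True, "cough": False},
    "condition_b": {"fever": True, "rash": False, "cough": False},
    "condition_c": {"fever": False, "rash": False, "cough": True},
}


def consistent(known: Dict[str, bool]) -> List[str]:
    """Candidates that do not contradict any known finding."""
    return [
        name for name, profile in CANDIDATES.items()
        if all(profile[f] == v for f, v in known.items())
    ]


def best_question(known: Dict[str, bool]) -> Optional[str]:
    """Unknown finding that leaves the fewest candidates in the worst case.

    Returns None when at most one candidate remains, or when no unknown
    finding would shrink the remaining set.
    """
    remaining = consistent(known)
    if len(remaining) <= 1:
        return None
    findings = {f for profile in CANDIDATES.values() for f in profile}
    unknown = sorted(findings - set(known))
    best, best_worst = None, len(remaining)
    for f in unknown:
        yes = sum(1 for name in remaining if CANDIDATES[name][f])
        worst = max(yes, len(remaining) - yes)  # larger branch after asking f
        if worst < best_worst:
            best, best_worst = f, worst
    return best
```

with only `{"fever": True}` known, two candidates remain and they differ only on `rash`, so `best_question` picks that finding; asking about `cough` would not separate them. real medical QA is obviously not a lookup table, but the shape of the question ("what would disambiguate this?") is the same.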

Satya Nadella’s human capital / token capital piece sits nearby. “without human direction you have compute running in circles” is a good sentence because it points at the same problem from the organizational side. private reinforcement learning environments, real traces, organization-specific evals: all of this says that model capability is not enough. the judgment and taste of the organization have to land somewhere concrete. the GEPA thread puts it nicely: the learning can live in the harness, not only in the weights.

this may sound opposed to the bitter lesson, but i think it is more orthogonal than opposed. the bitter lesson says general methods win when scale becomes available. the organizational version says: until scale has somewhere useful to go, you need to make human attention and model tokens count. maybe the mistake is treating these as rival theories. sometimes the domain is already in a shape where scale can eat it. sometimes the work is turning the domain into a shape where scale has traction.

that brings me to See Spot Run, Eric Lefkofsky’s piece about LLMs in other domains. i found it useful less because it is novel and more because the obvious thing still has not happened. language, code, voice, video: these are domains where scaling has a very direct path. medicine and biology are different. some parts are clearly entering bitter-lesson territory (Evo2 is an example), but health as a whole is not there the way code is. the substrate is messier, less standardized, more longitudinal, and often less directly represented in the model’s native input space.

Tempus may or may not be the right implementation of this thesis; i do not know enough about the shape of their data, especially how much of it is longitudinal rather than snapshot-like. but the perspective is still useful. the moat is not just “we have many petabytes.” it is whether the data describes how human health changes over time in a way that can be activated by models.

SpatialBench-Long was the densest version of this question for me. i wrote too many notes on it, so here is the compressed version: it tries to make long-horizon spatial biology work verifiable by asking agents to recover scientific conclusions from raw or near-raw data plus calibrated experimental context. the important part is the calibration. raw data rarely gives one universal ground truth. paper claims are only candidates; they have to be rechecked before becoming benchmark answers. task descriptions have to approximate what a scientist would know at the start (without over-specifying the path). grading has to capture explicit relations and directions. for longitudinal health-data work, this raises the right hard questions: what claims from multimodal data over time are actually verifiable, which tasks belong to a single LLM call, which require an agentic run, and how much of “doing science” is really just technical “data science”?

that last question is uncomfortable but useful. a lot of agent benchmarks in science want to test discovery, but the practical substrate is often redoing analysis until a known claim appears. maybe that is fine. maybe reproducing the claim from raw data is already a meaningful step. but we should be honest about what is being measured: not a general scientific mind, and not necessarily the ability to ask the right question from nothing. often it is the ability to turn messy artifacts into a verifiable conclusion under enough context.

and maybe that is the thread through most of this list. agents are getting better, but the interesting work keeps moving outward: from model to harness, from prompt to workflow and loop, from benchmark answer to benchmark construction, from data possession to data activation. the model matters, obviously. but more and more of the leverage seems to live in the structures around it.

also, I Know What You Think of Me was just a fun little read i found by accident. not everything needs to be load-bearing.
