<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Sparse Thoughts</title>
 <link href="https://sparsethought.com/atom.xml" rel="self"/>
 <link href="https://sparsethought.com/"/>
 <updated>2026-04-12T18:45:58+00:00</updated>
 <id>https://sparsethought.com</id>
 <author>
   <name>Gal Sapir</name>
   <email>hello@sparsethought.com</email>
 </author>

 
 <entry>
   <title>maps, territory and LMs</title>
   <link href="https://sparsethought.com/2026/04/11/map-and-territory/"/>
   <updated>2026-04-11T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/04/11/map-and-territory</id>
    <content type="html">&lt;p&gt;Borges has a very short story about an empire whose cartographers kept producing larger and larger maps, until they built one the size of the empire itself.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; the following generations, less enchanted, saw that it was useless and abandoned it to the weather. maps are useful precisely because they are reductions. when the pursuit of fidelity destroys the compression, it destroys the point.&lt;/p&gt;

&lt;p&gt;for the purposes of this text, LMs are our maps.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;i’ll cut to the chase (bear with me though). LMs have become really good. so good that they are now well beyond useful representations of the territory, and are in many ways beginning to &lt;a href=&quot;https://www.programmablemutter.com/p/the-political-economy-of-ai&quot;&gt;reshape the territory itself&lt;/a&gt;. this means, i think, that we need to be much better at reading maps without losing our connection to the territory. we need more ways to stay engaged while reading and interacting with them. much of our (professional) interaction with computers is mediated through LMs now: when examining a new codebase, when reading a paper, when priming ourselves towards a task. sometimes even as an interface for thinking. this is an abstraction layer that we are not really willing to give up at this point (and i’m not saying that we should) but it changes what we need to be good at.&lt;/p&gt;

&lt;p&gt;Baudrillard, writing in 1981, proposed four stages that describe how representations relate to reality.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; i think they map (pun acknowledged) really nicely to LMs, and in a way that is unique, because LMs seem to occupy different stages all at once, depending on the use case.&lt;/p&gt;

&lt;p&gt;in stage one, the image is a faithful copy of reality. LMs were in a sense designed this way: trained to predict and reproduce patterns in human-generated text as accurately as possible, a compressed but structurally faithful representation of what we’ve written and thought.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;in stage two, the image masks and distorts reality. LMs do this too. what you get back is a smoothed-out, averaged version of the territory, and subtle distortions are easy to miss precisely because the surface looks coherent. ask an LM to explain the causes of the 2008 financial crisis and you’ll get subprime mortgages and deregulation. ask again with different framing, same answer. the response feels authoritative, but it’s closer to a popularity-weighted consensus than to the still-unresolved debates among economists.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;in stage three, the image masks the absence of reality. once you have a good enough approximation, engagement with the territory itself becomes less needed. you (may) stop checking sources because the answer looks right. you stop exploring because the recommendation feels sufficient. the financial crisis question again: asking it feels like research but is really just consuming a pre-averaged explanation. the activity looks the same, but something has been hollowed out.&lt;/p&gt;

&lt;p&gt;stage four, when the representation has no relation to reality at all, is trickier. i’m not sure we’re there yet, though part of what makes stage four unsettling is that you might not know when you’ve arrived. it possibly emerges when the content available for training new systems is mostly the output of previous systems, or when “the chat” becomes everyone’s primary source of knowledge, “becoming both the image of god and god.”&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;what’s uncomfortable is that the transitions between these stages aren’t clear. we can’t point to the exact moment when a useful map starts substituting for the territory. it’s slippery, and that slipperiness is what makes it hard to stay aware.&lt;/p&gt;

&lt;p&gt;LMs are not like other maps.&lt;/p&gt;

&lt;p&gt;a cartographic map looks the same to every reader. an LM doesn’t. the output changes with &lt;a href=&quot;https://arxiv.org/abs/2604.05051&quot;&gt;slight modifications to the prompt&lt;/a&gt;, and it has been shown that the sophistication of the model’s responses correlates with the educational background of the person prompting it.&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; this map is very much in the eyes of the beholder. it’s also malleable in ways that static maps never were: you can zoom in and out of topics, approach them from different angles, connect information across fields that would be harder to bridge otherwise.&lt;/p&gt;

&lt;p&gt;this is genuinely useful. it clears away clutter and creates space to truly reflect on a research project, on a complex codebase, on ideas that are half-formed. these “means of summarization,” as Henry Farrell called them,&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; are useful even though we always have access to the full territory (the entire web), partly because of the flexibility of input and output, and mainly because they are much better at understanding what we want.&lt;/p&gt;

&lt;p&gt;unlike any previous map, this one is also becoming an object of study in its own right. since these maps arrived, everyone wants to be a cartographer: building, tinkering, adapting these systems, exploring what’s inside them. in many areas of AI research, models are explored not as maps of reality but as something useful on their own terms. houellebecq’s protagonist in &lt;em&gt;the map and the territory&lt;/em&gt; argued that the map is more interesting than the territory. in AI research, this is increasingly literally the case.&lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;so here is where i think the actually interesting part is.&lt;/p&gt;

&lt;p&gt;using these maps well, knowing when to trust them, when to zoom in, when to stop and touch the territory, is a skill, and it is largely a tacit one. it’s also a personal one: because the map looks different to each reader, the intuitions you develop are calibrated to your own version of it, not to a shared public artifact. by tacit i mean something close to what Michael Polanyi meant when he wrote, with admirable directness, “we can know more than we can tell.”&lt;sup id=&quot;fnref:10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;the tacit knowledge here isn’t about spotting obvious hallucinations. it’s subtler: a feeling that something hasn’t been verified, an uneasiness about a claim you’re not sure the data supports, a sense that the output is too smooth.&lt;sup id=&quot;fnref:11&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;11&lt;/a&gt;&lt;/sup&gt; it’s the kind of thing a clinician means when they say a patient “looks sick” before the labs come back, or a developer means when they talk about “code smell.” you attend from pattern recognition you can’t fully articulate to a judgment that something needs checking.&lt;/p&gt;

&lt;p&gt;Polanyi has a really nice example of a blind person learning to use a probe.&lt;sup id=&quot;fnref:10:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt; at first you feel the impact of the probe against your hand. but as you learn, your awareness shifts: you stop feeling the probe and start feeling what the probe touches. the proximal sensation becomes distal perception. using LMs well might be something like this: at first you attend to the output itself (is this correct? does this look right?). over time, if you develop the skill, you begin to attend through the output to the “territory” behind it.&lt;/p&gt;

&lt;p&gt;this skill is learned through practice and resists being codified into rules or checklists. a piece about maps, then, that arrives at the conclusion that the most important skill for navigating them can’t itself be mapped.&lt;sup id=&quot;fnref:12&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:12&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;12&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;i should probably note: the analogy of LMs as maps is itself a map, and i’ve tried to use it where it’s useful without forcing it where it isn’t. LMs are also tools, research subjects, economic forces, things that reshape the territory they represent. no single frame captures all of this.&lt;/p&gt;

&lt;p&gt;the current LMs we use are the worst ones we’ll ever get to use. they will get better (in all aspects). the skill of reading maps well, of maintaining that productive uneasiness, will matter more next year than it does now.&lt;/p&gt;

&lt;p&gt;Borges’ cartographers fell in love with map-making and forgot what it was for. the next generation, less enchanted, abandoned the map to the weather. neither response seems right to me. somewhere between obsession and indifference, there is the delicate practice of using these maps well, and of staying honest about how our own relationship to them keeps changing.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;jorge luis borges, “on exactitude in science” (1946). the full story is &lt;a href=&quot;https://kwarc.info/teaching/TDM/Borges.pdf&quot;&gt;one paragraph long&lt;/a&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;i use “LMs” loosely here to mean the whole stack: language models, the systems built around them, the agents, the tools. the map is the whole thing, not just the weights. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Jean Baudrillard, &lt;em&gt;simulacra and simulation&lt;/em&gt; (1981). the four stages are drawn from his theory of how images progressively detach from reality. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;whether “faithful representation” was ever truly the goal of language model training is debatable: prediction and representation aren’t the same thing. but functionally, the result is a compressed mirror of human-generated text, which behaves like a faithful copy in many contexts. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;the loop compounds: more people absorb the simplified explanation, more content reflects that framing, the model becomes even more confident in offering it. the map shapes the territory it claims to describe. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;this echoes Baudrillard’s argument about the byzantine iconoclasts, who feared that creating images of god would reveal there was nothing behind the images. when the representation becomes authoritative enough, the distinction between “represents knowledge” and “is knowledge” quietly collapses. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;see &lt;a href=&quot;https://www.anthropic.com/research/anthropic-economic-index-january-2026-report&quot;&gt;anthropic’s economic index report&lt;/a&gt; (january 2026) on how model output sophistication correlates with the educational background of the person prompting. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;henry farrell, &lt;a href=&quot;https://www.programmablemutter.com/p/the-political-economy-of-ai&quot;&gt;“the map is eating the territory”&lt;/a&gt; (2024). his treatment of the political economy of AI-as-summarization is thorough and worth reading in full. i’m deliberately not covering that ground here. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;michel houellebecq, &lt;em&gt;the map and the territory&lt;/em&gt; (2010). his protagonist achieves fame by photographing road maps, arguing they are more beautiful than the landscapes they represent. to be clear: studying models on their own terms is important and legitimate work. i’m observing the shift, not criticizing it. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;michael polanyi, &lt;a href=&quot;https://tacit-knowledge-architecture.com/object/the-tacit-dimension-excerpt/&quot;&gt;&lt;em&gt;the tacit dimension&lt;/em&gt;&lt;/a&gt; (1966). LMs are, funnily enough, also a bit like this: they contain more than they can easily surface, and you have to prompt them in certain ways to make them reveal what they have. a kind of loosely encrypted zip file of human knowledge. &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:10:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;as an example: a colleague ran a hypothesis-generation agent on a prompt about liver ultrasound correlations in large populations. it retrieved 18 papers, applied five “research facets,” and produced five hypotheses with titles, research questions, and experiment plans. the structure was impeccable. but the first hypothesis simply restated the input prompt. the third grabbed a zero-citation paper about skeletal muscle oxygenation and proposed “do this but for liver” with no consideration of whether that makes anatomical sense. the map of how science works was used to produce the appearance of science working. &lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:12&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;i’ve written more about the related problem of cognitive offloading and maintaining engagement with these tools in &lt;a href=&quot;https://sparsethought.com/2026/02/03/offloading-cognition/&quot;&gt;“cognitive offloading, exoskeletons, and remaining sentient”&lt;/a&gt;. &lt;a href=&quot;#fnref:12&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>building tools as procrastination: a CLI for citations in Google docs</title>
   <link href="https://sparsethought.com/2026/03/13/cite-tool/"/>
   <updated>2026-03-13T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/03/13/cite-tool</id>
    <content type="html">&lt;p&gt;so this post is composed of a few thoughts that converged into a tool.&lt;/p&gt;

&lt;p&gt;first: building small things is the new procrastination. i acknowledge that fully. and yet i keep finding myself building small solutions for specific use cases, and i think it comes down to the fact that the math of the famous xkcd &lt;a href=&quot;https://xkcd.com/1205/&quot;&gt;is it worth the time&lt;/a&gt; chart has just shifted. it takes a lot less time and effort to build small tools you wish you had.&lt;/p&gt;

&lt;p&gt;as a researcher writing papers and preprints (anything that needs proper citations) i always hated the manual labor of managing references. from early days with EndNote to more recent alternatives, it always felt like adding sand to the gears of the actual writing process.&lt;/p&gt;

&lt;p&gt;there’s also another thought, and it collides with the first one. the idea of reducing a problem to another problem that’s already been solved.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; these days, anything that can be expressed as a CLI workflow is basically a solved problem, and if it isn’t solved yet, you mostly just need to add the right command to make it work (and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--help&lt;/code&gt; flag, and the skill file, but ok).&lt;/p&gt;

&lt;p&gt;and but so, i wanted to take this specific pain point of adding citations to papers (and i’ll be more specific: papers that are edited collaboratively on a google doc) and make it frictionless.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; the result is &lt;a href=&quot;https://github.com/galsapir/cite-cli&quot;&gt;cite-cli&lt;/a&gt;: a terminal-based citation manager that resolves papers by DOI, PMID, arXiv ID or title search, stores them in a local library synced with Zotero, and inserts inline citations and bibliographies directly into Google Docs. the typical workflow is: write in Google Docs, paste reference URLs inline as you go, then run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cite scan&lt;/code&gt; and it formalizes everything.&lt;/p&gt;
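
&lt;p&gt;to make the “formalizes everything” step a bit more concrete, here is a rough sketch of the final formatting stage: a resolved metadata record in, an inline citation and a bibliography line out. to be clear, this is hypothetical illustration code, not cite-cli’s actual implementation, and the field names (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;authors&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;year&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;doi&lt;/code&gt;) are assumptions modeled loosely on CSL-style metadata:&lt;/p&gt;

```python
# hypothetical sketch of the formatting step: a metadata record in,
# an author-year inline citation and a bibliography line out.
# the field names ("authors", "year", "title", "doi") are illustrative.

def inline_citation(ref: dict) -> str:
    """e.g. {"authors": ["Ada Lovelace"], "year": 2024} gives "(Lovelace, 2024)"."""
    last_names = [name.split()[-1] for name in ref["authors"]]
    if len(last_names) == 1:
        head = last_names[0]
    elif len(last_names) == 2:
        head = last_names[0] + " and " + last_names[1]
    else:
        head = last_names[0] + " et al."
    return "(" + head + ", " + str(ref["year"]) + ")"

def bibliography_entry(ref: dict) -> str:
    """a simple author-year bibliography line with the DOI appended."""
    names = ", ".join(name.split()[-1] for name in ref["authors"])
    return names + " (" + str(ref["year"]) + "). " + ref["title"] + ". doi:" + ref["doi"]
```

&lt;p&gt;the real tool obviously has to deal with citation styles, deduplication, and writing into the doc itself; this just shows the shape of the problem once an identifier has been resolved.&lt;/p&gt;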

&lt;p&gt;i probably got motivated by two things. first, i find myself using and improving my own specialized tools almost all the time now. &lt;a href=&quot;https://galsapir.github.io/marginalia/&quot;&gt;Marginalia&lt;/a&gt; is much more feature rich than when i first built it and i use it every other day, and i use &lt;a href=&quot;https://github.com/galsapir/blog-narrator&quot;&gt;blog-narrator&lt;/a&gt; for most posts. second, as procrastination. it’s a busy and also unsettling time, and working on a shiny new tool is a really nice way to spend some free time.&lt;/p&gt;

&lt;p&gt;i’ve already had some early success using it in testing. i hope that by the time the work on the current manuscript is done, the tool will be much more complete and well-used.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;the formal term is “reduction” but the intuition is simpler than that: if you can reframe your problem as an instance of something people already know how to solve, you’re most of the way there. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;i took some time to research other available options and couldn’t find anything close to a complete solution for this specific workflow (CLI to Google Docs with Zotero sync). &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>what i've read in 2026 so far</title>
   <link href="https://sparsethought.com/2026/03/07/reading-q1-2026/"/>
   <updated>2026-03-07T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/03/07/reading-q1-2026</id>
    <content type="html">&lt;p&gt;For various reasons&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, i found myself away from ‘work work’ for the past couple of weeks, and also mostly away from a laptop&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Because of that, first of all, i feel rusty and out of touch with the research areas (and some aspects of agentic coding). we’ll see how to fix that: i can double click on this feeling and say that a] it’s probably mostly ‘fake’, in the sense that stuff is indeed moving fast but two weeks away isn’t a lot in the grand scheme of things, and the feeling is more a result of being on the bleeding edge than anything else, and b] it’s pretty motivating and useful, so yay, let’s try to utilize it. And second of all, my mental energy flowed elsewhere, i suppose. A nice thing about being away is that i had the mental space to think about grander things! So i’m writing about some great books.&lt;/p&gt;

&lt;p&gt;I tried to focus on reading this past period, and this is by no means an exhaustive list of everything&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. Anyway, i always wanted to better collect the content i go through, and thought this might be a good place to start.&lt;/p&gt;

&lt;p&gt;Two big books dominated my last couple of months with some lighter stuff in between.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Brothers Karamazov:&lt;/strong&gt; i could never truly write about Dostoyevsky&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. Too much has been written already. I can try though. Kurt Vonnegut allegedly referred to the Brothers and said “there is one other book that can teach you everything you need to know about life.” That feels right.&lt;/p&gt;

&lt;p&gt;The part i remember laughing hard about was the description of medicine from the conversation between Ivan and the devil, at that period when it was ‘too specialized’ (one doctor for the right nostril and another for the left one). But this novel is also really deeply moving. The way the characters swing between moods, the way they converse, it all feels very real. And not only the brothers themselves: also Lise’s weird attraction to suffering and pain, which i can only think of as a form of searching for authenticity. The devil’s visit to Ivan (predicting satellites! interesting). And of course, the narrator, who is on one hand very ‘familiar’ and talks about ‘our town’, but also knows the innermost thoughts of the characters, and yet is not completely reliable, and is very self-conscious about it. This kind of unreliable semi-omniscient narrator was really captivating. Of course, the trial, and the ‘two abysses’ of the Karamazov nature.&lt;/p&gt;

&lt;p&gt;It would be hard for me to claim that i understand Dostoyevsky, but i definitely wish i could have a conversation with him&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Grapes of Wrath:&lt;/strong&gt; i think i’m in a phase of classics; it feels like the ultimate antidote to much of the online discourse and the world at large at the moment. I went for Steinbeck because i enjoyed Of Mice and Men &lt;strong&gt;a lot&lt;/strong&gt;. Somehow this feels heavier in content than the Brothers, in retrospect (i’d finished it before starting the Brothers). I guess i read it right after / during watching Pluribus, and some of the scenery got tangled up for me; they do travel from the panhandle to California through NM. I think i’ve read it as a ‘familial’ and period novel, about the relations between the family members and how they hold up (a reminder to myself that i should have written about it in real time). I will quote one thing here, from an exchange between Al and his mother:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Al: “Ain’t you thinkin’ what’s it gonna be like when we get there? Ain’t you scared it won’t be nice like we thought?”&lt;/p&gt;

  &lt;p&gt;Ma: “No,” she said quickly. “No, I ain’t. You can’t do that. I can’t do that. It’s too much — livin’ too many lives. Up ahead they’s a thousan’ lives we might live, but when it comes, it’ll on’y be one. If I go ahead on all of ‘em, it’s too much.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Something about not pre-living many different lives when only one will end up arriving.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Essays and other stuff i found online&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://blog.cloudflare.com/code-mode-mcp/&quot;&gt;Code Mode and MCP&lt;/a&gt;: i enjoy the Cloudflare folks’ blog and try to read some of their technical stuff from time to time. This one is about how they fit the entire Cloudflare API into &amp;lt;1000 tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://simonwillison.net/2026/Feb/10/showboat-and-rodney/&quot;&gt;Showboat and Rodney&lt;/a&gt; (Simon Willison): i’m not even sure this was the best simonw piece from recent times, but it touches on the important topic of making coding agents ‘show’ rather than ‘tell’ about their work.&lt;/p&gt;

&lt;p&gt;Rereads:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://paulgraham.com/taste.html&quot;&gt;Taste&lt;/a&gt; (Paul Graham): oldie but goodie, read it on the way to the airport.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://paulgraham.com/goodwriting.html&quot;&gt;Good Writing&lt;/a&gt; (Paul Graham): “ideas are tree-shaped and essays are linear.”&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://marginalrevolution.com/marginalrevolution/2019/07/how-i-practice-at-what-i-do.html&quot;&gt;How I Practice at What I Do&lt;/a&gt; (Tyler Cowen): one of those essays where i can usually find something new and unexplored each time.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.cold-takes.com/learning-by-writing/&quot;&gt;Learning by Writing&lt;/a&gt; (Holden Karnofsky): i recollected pieces of this one quite often, now that i try to write a bit more.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Interesting research / review papers (just one):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.cell.com/cell/fulltext/S0092-8674(25)01498-9&quot;&gt;The Hallmarks of Cancer&lt;/a&gt;: this is almost nostalgic to me. This paper has three versions, and when i was in medical school we learned an earlier one. It’s interesting to see what has remained relevant and what changed in the past decade. It also made me miss medicine in the traditional sense, and good-old-fashioned biology research.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Some great, some not so much. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The feeling of being out of touch even for a fortnight hits the FOMO really hard. I’ve accumulated a lot of stuff i want to try and do in that time! &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I also kind of wanted it to be exhaustive, maybe in the future this will be a running document, or at least a list i keep updated. At the moment this was mostly composed after the fact, due to the circumstances. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It’s just too big and too cramped a space. Much more intelligent people have written about him. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Like Salinger said about how it’s a great book if you finish it and you wish the author was a friend of yours and you could pick up the phone and talk to him. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>a small tool for diagrams</title>
   <link href="https://sparsethought.com/2026/02/18/drawio-with-claude/"/>
   <updated>2026-02-18T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/02/18/drawio-with-claude</id>
   <content type="html">&lt;p&gt;I needed a way to make diagrams for posts and papers. The requirements were simple: something I could mostly just ask Claude to create, something that looked decent for scientific work, and something I could edit afterward — not a PNG I’d have to regenerate from scratch every time I wanted to move a box.&lt;/p&gt;

&lt;p&gt;I looked at excalidraw and tldraw first, explored a few implementations, and ended up going with draw.io’s XML format. The reasons were mostly practical — it’s widely supported, the files open directly in the draw.io editor, and the XML structure turned out to be something Claude handles well.&lt;/p&gt;
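
&lt;p&gt;For reference, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.drawio&lt;/code&gt; file is roughly shaped like this. This is a minimal sketch rather than a complete file: real files carry more attributes, and the values here (names, coordinates, styles) are illustrative. The two unnamed cells with ids 0 and 1 are the mandatory root layers every diagram has:&lt;/p&gt;

```xml
&lt;mxfile&gt;
  &lt;diagram name=&quot;Page-1&quot;&gt;
    &lt;mxGraphModel&gt;
      &lt;root&gt;
        &lt;mxCell id=&quot;0&quot; /&gt;
        &lt;mxCell id=&quot;1&quot; parent=&quot;0&quot; /&gt;
        &lt;!-- a box: vertex=&quot;1&quot;, with mxGeometry giving position and size --&gt;
        &lt;mxCell id=&quot;2&quot; value=&quot;Draft&quot; style=&quot;rounded=1;&quot; vertex=&quot;1&quot; parent=&quot;1&quot;&gt;
          &lt;mxGeometry x=&quot;40&quot; y=&quot;40&quot; width=&quot;120&quot; height=&quot;40&quot; as=&quot;geometry&quot; /&gt;
        &lt;/mxCell&gt;
        &lt;mxCell id=&quot;3&quot; value=&quot;Review&quot; style=&quot;rounded=1;&quot; vertex=&quot;1&quot; parent=&quot;1&quot;&gt;
          &lt;mxGeometry x=&quot;240&quot; y=&quot;40&quot; width=&quot;120&quot; height=&quot;40&quot; as=&quot;geometry&quot; /&gt;
        &lt;/mxCell&gt;
        &lt;!-- an arrow: edge=&quot;1&quot;, with source/target referencing cell ids --&gt;
        &lt;mxCell id=&quot;4&quot; edge=&quot;1&quot; source=&quot;2&quot; target=&quot;3&quot; parent=&quot;1&quot;&gt;
          &lt;mxGeometry relative=&quot;1&quot; as=&quot;geometry&quot; /&gt;
        &lt;/mxCell&gt;
      &lt;/root&gt;
    &lt;/mxGraphModel&gt;
  &lt;/diagram&gt;
&lt;/mxfile&gt;
```

&lt;p&gt;Flat, id-referenced, and attribute-driven, which is presumably part of why a model handles it well: generating a diagram is mostly a matter of emitting cells and wiring up ids.&lt;/p&gt;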

&lt;p&gt;The result is &lt;a href=&quot;https://github.com/galsapir/drawio-claude&quot;&gt;drawio-claude&lt;/a&gt;. The workflow is simple: you point Claude Code at the repo, the skills built with the &lt;a href=&quot;https://github.com/galsapir/claude-skills/&quot;&gt;skill creator&lt;/a&gt; guidelines tell it how to generate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.drawio&lt;/code&gt; files, and then you just ask for what you need in plain text. Describe a diagram, give it context, and it produces an editable file you can open directly in draw.io. Here’s an example — an evaluation cycle figure I’m working on for a preprint:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/tdd-flywheel.drawio.svg&quot; alt=&quot;Example diagram generated with drawio-claude&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A caveat: I built this mostly through detailed prompting and my &lt;a href=&quot;https://github.com/galsapir/claude-skills/&quot;&gt;interview skill&lt;/a&gt;, not by deeply understanding how draw.io’s XML works under the hood. I invested time in making sure the skill instructions were good and tested the outputs, but I didn’t sit down and learn the format. For this kind of single-purpose tool, I think that’s fine — the point is that it works and I can fix things in the editor when it doesn’t.&lt;/p&gt;

&lt;p&gt;The broader point: I think investing a small amount of time in building a narrow internal tool is surprisingly high-leverage right now. Even if it serves one specific purpose. The eval post was partly about that — we built custom evaluation tooling that only made sense for our specific health agent, and it ended up being one of the most valuable things we did. This is the same instinct applied to something much smaller. The cost of building these things has dropped enough that “should I build a small tool for this?” is almost always worth asking.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>narrating your blog with local AI</title>
   <link href="https://sparsethought.com/2026/02/11/narrating-your-blog/"/>
   <updated>2026-02-11T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/02/11/narrating-your-blog</id>
   <content type="html">&lt;p&gt;A friend who drives a lot asked if there was an audio version of my posts. There wasn’t. So I looked into it — and it turned out to be absurdly easy.&lt;!--more--&gt;&lt;/p&gt;

&lt;p&gt;The whole thing took an evening. That’s the part worth writing about.&lt;/p&gt;

&lt;p&gt;Open-source text-to-speech crossed a quality threshold sometime in 2025, and I completely missed it. Kokoro-82M&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; is 82 million parameters — tiny — Apache 2.0 licensed, and it sounds good enough that I had to double-check I wasn’t accidentally using a paid API. On an M4 MacBook it generates 15 minutes of audio in about 60 seconds. Chatterbox by Resemble AI beat ElevenLabs in blind listening tests. Multiple open models now compete with the proprietary leaders. This happened fast.&lt;/p&gt;

&lt;h2 id=&quot;what-i-built&quot;&gt;what I built&lt;/h2&gt;

&lt;p&gt;I wrote a small tool called &lt;a href=&quot;https://github.com/galsapir/blog-narrator&quot;&gt;blog-narrator&lt;/a&gt; that adds “listen to this post” audio to Jekyll blogs. It strips markdown to clean narration text,&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; generates speech locally via Kokoro through &lt;a href=&quot;https://github.com/Blaizzy/mlx-audio&quot;&gt;mlx-audio&lt;/a&gt; on Apple Silicon, and embeds a minimal audio player that shows up on posts that have audio. No API keys, no cloud, no cost.&lt;/p&gt;

&lt;p&gt;The workflow: write your post, run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;narrate _posts/your-post.md&lt;/code&gt;, commit, push. Done.&lt;/p&gt;

&lt;p&gt;What strikes me isn’t the technology — it’s the ratio. An evening of work, zero ongoing cost, and every post on my blog now has a listenable version. A year ago this would have required an API subscription, careful rate limiting, probably a CI pipeline. Now it’s a Python script and a Jekyll include.&lt;/p&gt;

&lt;p&gt;I keep noticing this pattern. Things that used to require real infrastructure quietly becoming a single local command. Not because someone built a product for it, but because the underlying models got good enough that you can just wire them up yourself. The fruit hangs so low now that it feels irresponsible not to pick it.&lt;/p&gt;

&lt;h2 id=&quot;setup&quot;&gt;setup&lt;/h2&gt;

&lt;p&gt;If you want to add this to your Jekyll blog:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;git+https://github.com/galsapir/blog-narrator.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Copy the &lt;a href=&quot;https://github.com/galsapir/blog-narrator/blob/main/jekyll/_includes/audio-player.html&quot;&gt;audio player include&lt;/a&gt; to your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_includes/&lt;/code&gt; directory, add one line to your post layout, create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;narrate.yml&lt;/code&gt; config, and run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;narrate&lt;/code&gt; on your posts. The &lt;a href=&quot;https://github.com/galsapir/blog-narrator&quot;&gt;README&lt;/a&gt; has the full details.&lt;/p&gt;
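&lt;p&gt;For orientation, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;narrate.yml&lt;/code&gt; might look something like this — the field names here are illustrative guesses, so check the README for the actual schema:&lt;/p&gt;

```yaml
# illustrative only — see the blog-narrator README for the real options
voice: af_heart          # hypothetical field: which Kokoro voice to use
posts_dir: _posts        # hypothetical field: where your markdown posts live
audio_dir: assets/audio  # hypothetical field: where generated audio files go
```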

&lt;p&gt;Requires macOS with Apple Silicon, Python 3.10+, and ffmpeg.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;By Hexgrad. The &lt;a href=&quot;https://huggingface.co/spaces/TTS-AGI/TTS-Arena&quot;&gt;HuggingFace TTS Arena&lt;/a&gt; tracks quality rankings across open and proprietary TTS models — worth a look if you’re curious about the landscape. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Frontmatter, code blocks, images, footnotes, Liquid tags get stripped. Links and headers become plain text. The goal is prose that sounds natural when read aloud. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>a second opinion</title>
   <link href="https://sparsethought.com/2026/02/11/a-second-opinion/"/>
   <updated>2026-02-11T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/02/11/a-second-opinion</id>
   <content type="html">&lt;p&gt;I’ve been noticing something about how I work with Claude. It’s not dramatic — no moment where the model led me off a cliff and I realized too late. It’s more of a drift. The models have gotten good enough that my default reaction to most outputs is “yeah, that’s probably right.” And most of the time it is. But “most of the time” is doing a lot of work in that sentence.&lt;/p&gt;

&lt;p&gt;The issue isn’t that I’m getting burned. It’s that I’ve stopped checking. For high-stakes things — a spec, a significant architectural decision — I still verify. But the threshold for what counts as “high-stakes” keeps creeping upward, because the models keep being right enough to justify the creep. It’s the same slow drift I wrote about before,&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; just showing up in a new way: not as skill degradation from outsourcing the thinking, but as a quiet erosion of the habit of questioning the output.&lt;/p&gt;

&lt;p&gt;I also wanted an excuse to try the OpenAI Codex models — I’ve been hearing good things and was curious about the intelligence, what they catch, what they care about — but didn’t want to significantly change my day-to-day workflow to do it.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; So I built a Claude Code skill that lets me get a second opinion from a different model at distinct checkpoints — moments where the stakes feel high enough that I want extra confidence before moving on.&lt;/p&gt;

&lt;h2 id=&quot;what-it-is&quot;&gt;what it is&lt;/h2&gt;

&lt;p&gt;It’s called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;adversarial-review&lt;/code&gt;. You point it at a file, a diff, a spec, or a GitHub issue, and it sends the thing to a different model for a second opinion. One extra call, not an elaborate multi-agent consensus protocol where three models argue for tens of thousands of tokens until they converge. Just: here’s what I’m working on, what does a different set of weights think about it?&lt;/p&gt;

&lt;p&gt;It supports a few backends — Codex (OpenAI’s GPT family), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;claude -p&lt;/code&gt; (a separate Claude instance with fresh context), and AWS Bedrock. The most interesting case is Codex, because it’s a genuinely different model family. When GPT and Claude agree on a finding, that convergence means more than either model flagging it alone. When they disagree, that’s interesting too.&lt;/p&gt;

&lt;p&gt;The prompt template asks the reviewer to steelman first — articulate the author’s intent before critiquing — and requires concrete impact for every finding (“what actually breaks or degrades?”, not “best practice says…”). Each finding gets a severity and confidence rating, and I specifically wanted the confidence rating because LOW confidence is honest uncertainty, which is more useful than false authority. The output ends with a verdict: SHIP, ITERATE, or RETHINK.&lt;/p&gt;

&lt;p&gt;The skill and the prompt template are &lt;a href=&quot;https://github.com/galsapir/claude-skills&quot;&gt;here&lt;/a&gt; if anyone wants to look or use them — they already include the improvements I describe below.&lt;/p&gt;

&lt;h2 id=&quot;the-recursive-first-test&quot;&gt;the recursive first test&lt;/h2&gt;

&lt;p&gt;The first thing I actually tested it on was the skill itself. I sent the Bedrock wrapper script to Codex for review, and the review came back with real findings — no error handling around the boto3 calls, no file I/O error handling, hardcoded token limits. Both Codex and Claude agreed on the high-severity items (which was a useful signal in itself).&lt;/p&gt;

&lt;p&gt;But then I looked at the skill more broadly and realized the whole thing didn’t follow Anthropic’s own &lt;a href=&quot;https://github.com/anthropics/skills&quot;&gt;skill-creator guidelines&lt;/a&gt;. The frontmatter had non-spec fields, the prompt template was in the wrong directory, the SKILL.md was overly verbose — trying to hand-hold the orchestrating model instead of trusting it. So the first review led to a significant restructuring: the SKILL.md went from 139 lines to 75, the prompt template moved to a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;references/&lt;/code&gt; directory where it belongs, and the frontmatter got cleaned up to match the actual spec. The tool’s first review ended up improving itself.&lt;/p&gt;

&lt;h2 id=&quot;the-real-test-and-what-it-tells-us&quot;&gt;the real test, and what it tells us&lt;/h2&gt;

&lt;p&gt;After the self-review, I pointed it at something more substantial — a spec for replacing our health chatbot’s architecture, moving from a DSPy state machine to Anthropic’s Agent SDK. This is the kind of thing I’d normally want a peer engineer to look at, but the review was done by Codex with GPT-5.3.&lt;/p&gt;

&lt;p&gt;The review came back with a verdict of ITERATE and several findings. Being honest about how it went: I’d rate it maybe a 6/10 as a review. It added one genuinely valuable finding — a security gap where the SDK’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bypassPermissions&lt;/code&gt; combined with absolute file paths could expose PHI — plus a useful clarification about concurrency in the subprocess tool. It actually searched the SDK docs to verify claims rather than just pattern-matching on the spec text, and the understanding section was accurate, which builds trust in the rest.&lt;/p&gt;

&lt;p&gt;But there was noise, too. It flagged streaming message ordering as a concern, suggesting turn IDs and sequence numbers — which makes sense for HTTP polling in a distributed system, but not for a WebSocket POC with a single user. WebSockets are ordered and reliable; the finding showed pattern-matching from a known category of concerns rather than understanding the actual transport. It also flagged subprocess capacity limits that the spec itself already acknowledged as POC-irrelevant.&lt;/p&gt;

&lt;p&gt;More telling is what it missed entirely. The review was asked to challenge SDK API assumptions — are those import paths correct? Is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;receive_response()&lt;/code&gt; the actual method name? — and it searched the docs but never reported back on what it found. No comment on whether the Langfuse tracing integration would actually work. No challenge to the system prompt design, which embeds clinical safety guardrails as a raw string — is that robust? Can it be jailbroken? These are essence questions, and the review didn’t touch them.&lt;/p&gt;

&lt;p&gt;This maps almost exactly onto a distinction I’ve written about before.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; The tool can check form — structural patterns, security anti-patterns, missing error handling. It can’t check essence — whether the SDK API actually works the way the spec assumes, whether the clinical guardrails are robust, whether the implementation plan ordering makes sense given the domain. The one real finding it surfaced (permissions/PHI exposure) is something any competent reviewer would catch by pattern-matching. The things it missed require actually understanding the specific SDK, the clinical context, the domain.&lt;/p&gt;

&lt;p&gt;For zero human effort beyond invoking the command, one real finding plus one useful clarification is a positive ROI on ~46K tokens. But it’s not a substitute for a peer engineer who knows the SDK. The bar isn’t “is this as good as a senior engineer” — it’s “is this better than not checking at all,” which, given the drift I described at the top, is the actual alternative.&lt;/p&gt;

&lt;p&gt;After the 6/10 review I went back and made targeted improvements to the skill — the kind that directly addressed why it scored poorly without bloating the prompt or touching the parts that already worked. Three things: the reviewer now has to explicitly address every focus area you give it (so it can’t silently skip “challenge SDK API assumptions” anymore), the prompt is stage-aware so it calibrates severity to whether you’re reviewing a POC or production code (no more distributed-systems concerns for a single-connection WebSocket), and for specs specifically it now has to verify technical claims and report back with a CONFIRMED/INCORRECT/UNVERIFIABLE status. I haven’t re-run the full review yet, so I can’t say whether these changes actually move the needle — that’s next.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I wrote about cognitive offloading and the drift toward less engagement in &lt;a href=&quot;https://sparsethought.com/2026/02/03/offloading-cognition/&quot;&gt;cognitive offloading, exoskeletons, and remaining sentient&lt;/a&gt;. The interview format I mentioned there — using AI as an interlocutor rather than a doer — is the same impulse, just applied differently. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/interview&lt;/code&gt; command is about staying engaged during the thinking phase; this one is about catching things during the review phase. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I didn’t want to switch tools or start a separate workflow — I just wanted to be able to tap into a different model’s perspective from within my existing setup. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The form vs. essence gap — being able to verify citations and tool calls but struggling to assess whether the output actually helps — is something I explored in &lt;a href=&quot;https://sparsethought.com/2026/01/29/evaluating-agents-in-health/&quot;&gt;how we actually evaluate agents (health)&lt;/a&gt;. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>opus 4.6 and two small tools</title>
   <link href="https://sparsethought.com/2026/02/06/opus-4.6-other-stuff/"/>
   <updated>2026-02-06T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/02/06/opus-4.6-other-stuff</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;https://www.anthropic.com/news/claude-opus-4-6&quot;&gt;Opus 4.6&lt;/a&gt; came out ~20 hours ago and I wanted to get a feel for it. It also seemed like a good chance to follow up on some of the ideas from my previous post — specifically, building small tools that help me stay more engaged with my own work rather than just producing output.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h2 id=&quot;opus-46&quot;&gt;Opus 4.6&lt;/h2&gt;

&lt;p&gt;The experience isn’t wildly different from Opus 4.5, but the edges are smoother in ways that matter. It picks up intent faster: I found myself spending less time explaining what I wanted, which over a full session adds up. It seems more inclined toward longer time-horizon tasks, willing to loop and think for &amp;gt;8 minutes before surfacing an answer, and it spawns multiple sub-agents more readily out of the box without much coaxing.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; I haven’t tried the full &lt;a href=&quot;https://code.claude.com/docs/en/agent-teams&quot;&gt;agent teams&lt;/a&gt; thing yet, but the multi-agent orchestration feels like it’s gotten significantly less fiddly.&lt;/p&gt;

&lt;p&gt;The thing I noticed most was the quality of the questions. When I used it in an interview-style flow, the questions felt more to the point and more thoughtful — less of the “asking just because” quality I’d sometimes get before, where it seemed like the model was generating questions for the sake of generating questions. I wonder if they invested time specifically in making it interview better.&lt;/p&gt;

&lt;h2 id=&quot;the-things-i-built&quot;&gt;the things I built&lt;/h2&gt;

&lt;p&gt;I’ve been wanting to use the interview format more deliberately (the AI-as-interlocutor-rather-than-doer thing), where you’re still doing the cognitive work but something is prompting you to externalize and refine your thinking. So making that more convenient seemed like a natural place to start.&lt;/p&gt;

&lt;p&gt;I went looking for existing interview-style plugins.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; The well-known ones&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; weren’t quite what I was after — they tend toward exhaustive question lists without much ability to steer the process. What I wanted was a checkpoint mechanism: every few questions, pause and let me say “we’re too deep in the details” or “not deep enough.” The ability to tune the depth as you go, depending on what the task actually needs. The result is &lt;a href=&quot;https://github.com/galsapir/claude-interview&quot;&gt;here&lt;/a&gt; — it’s a Claude Code command, nothing fancy, but it strikes a balance I’m happy with.&lt;/p&gt;

&lt;p&gt;The second tool came from a pattern I’ve been noticing in my day-to-day. Markdowns are everywhere (specs, plans, research notes) and I kept finding myself needing to go over an LLM-generated markdown and provide detailed feedback before continuing. I was doing this by manually annotating raw &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.md&lt;/code&gt; files in VSCode or nano, which isn’t great when you want to see the rendered version and annotate on it at the same time. So I built &lt;a href=&quot;https://galsapir.github.io/marginalia&quot;&gt;Marginalia&lt;/a&gt;, a small SPA that lets you view rendered markdown and add margin annotations.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; No sessions, no backend — it just lives in the browser. It’s a one-off tool for a specific friction point, the moment where you need to actually engage with what came back rather than staying in the “fix it plz” loop.&lt;/p&gt;

&lt;p&gt;I think both of these are attempts to take the stuff that’s been living in my head (about staying engaged, about not outsourcing the thinking) and work it into my actual day-to-day (whether they end up being useful beyond my own workflow, I don’t know, but the links are there if anyone wants to try them).&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I wrote about this more fully in &lt;a href=&quot;https://sparsethought.com/2026/02/03/offloading-cognition/&quot;&gt;cognitive offloading, exoskeletons, and remaining sentient&lt;/a&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I didn’t spend too much time exploring the code-level details here — the code itself wasn’t the point for me. This was more about getting a sense of how the model operates. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;See &lt;a href=&quot;https://code.claude.com/docs/en/plugins-reference&quot;&gt;Claude Code plugins reference&lt;/a&gt;. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;e.g. &lt;a href=&quot;https://x.com/trq212/status/2005315275026260309&quot;&gt;this one&lt;/a&gt; — good, but 80 questions about everything wasn’t what I needed. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The whole transcript of building it is available too if anyone’s interested — &lt;a href=&quot;https://gist.github.com/galsapir/faece917b6c1238378a222d82017899e&quot;&gt;here&lt;/a&gt;. It wasn’t zero-shot, but it was close. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>cognitive offloading, exoskeletons, and remaining sentient</title>
   <link href="https://sparsethought.com/2026/02/03/offloading-cognition/"/>
   <updated>2026-02-03T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/02/03/offloading-cognition</id>
   <content type="html">&lt;p&gt;How can someone who enjoys thinking — enjoys the cognitive load — use coding agents and LLMs to foster continued learning, and not skill degradation? And what are some useful mental frameworks to have in mind?&lt;/p&gt;

&lt;p&gt;I think this is really about the reality we all live in: the reality of coding agents and chatting with LLMs and how it affects us building (or eroding?) our skills. I always considered myself fortunate that I didn’t have those tools when I was under the stress of a lot of coursework during undergrad (or grad school / PhD for that matter). Every time I wanted some sort of output, I had no choice but to learn how to make it happen. Like many things, I value it a lot more in retrospect. But eh, I mean I’m not sure anyone wants to hear about the good ol’ days. This is not what we are here for.&lt;/p&gt;

&lt;p&gt;I’m writing about this because it’s a topic that has been pecking at me probably since I first tried ChatGPT way back when. It was less prominent in the past because of two things: models, and harnesses. At this point in time, the models are good enough to actually take away most of our cognitive work (if we are not careful), and the harnesses are good (or, treacherous) enough to make it almost frictionless.&lt;/p&gt;

&lt;h2 id=&quot;the-problem-stated&quot;&gt;the problem, stated&lt;/h2&gt;

&lt;p&gt;The trigger for this specific piece was the recent Anthropic study on AI assistance and coding skills.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; The headline finding — AI assistance led to quiz scores 17 percentage points lower (50% vs. 67%) — comes with important caveats: small sample size, and controlled conditions that don’t fully mirror real work. But I think the researchers’ thinking is more interesting than the numbers themselves, particularly how they characterized the different interaction patterns. High-scorers (65-86%) asked conceptual questions, requested explanations alongside code, used AI to check their own understanding. Low-scorers (24-39%) fully delegated, progressively relied more, debugged iteratively without understanding. The &lt;em&gt;how&lt;/em&gt; matters enormously.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;What I found most interesting was the “unraveling” of these patterns — the fact that you could characterize what distinguished the people who learned from those who didn’t, even when both groups had access to the same tools. And this was with a sidebar assistant (i.e., not agentic tools like Claude Code). The effects are likely worse with more autonomous tools (they also note this in the discussion).&lt;/p&gt;

&lt;p&gt;This isn’t just about code: the same pattern shows up in social skills, in writing, in thinking itself.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; The phenomenon is general: we are outsourcing our thinking, and we don’t fully understand what that means for us.&lt;/p&gt;

&lt;h2 id=&quot;why-this-interests-me&quot;&gt;why this interests me&lt;/h2&gt;

&lt;p&gt;I’ll be honest — the reason I’m interested in this is because it scares me to “deskill”. I’m using these tools all the time, and I’m preoccupied with how this affects my mind and cognition. This is also why I took the SolveIt&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; course (which I might write more about in the future).&lt;/p&gt;

&lt;p&gt;There’s another reason, more connected to my work. I’ve written before about the gap between checking form and checking essence — how we can verify citations and tool calls, but struggle to know if the output actually helps the patient / user.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; I’m now realizing the same (or similar) problem applies to self-assessment. I can check the form of my work: Did I ship? Does it run? But I can’t easily verify the essence: Do I actually understand what I built? The verification problem I face professionally is now personal (and that’s uncomfortable).&lt;/p&gt;

&lt;h2 id=&quot;mental-frameworks&quot;&gt;mental frameworks&lt;/h2&gt;

&lt;p&gt;Two images help me think about this.&lt;/p&gt;

&lt;p&gt;The first is the exoskeleton. This framing “dates back” to 2024 — research on knowledge workers using GenAI.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; We can think of coding tools (and LLMs in general) as an exoskeleton — they grant us abilities we can’t have without them. Iron Man suit. But here’s the thing: it’s useless to go to the gym in an Iron Man suit if what you actually want is to build muscle, and when we use these tools for work we care about, we want to be improving as well — not just producing output while our underlying capabilities atrophy. So we need to understand how we can utilize these exoskeletons in a way that doesn’t degenerate us.&lt;/p&gt;

&lt;p&gt;The second image is darker: the rat with an electrode in the nucleus accumbens. Someone on Bluesky pointed this out&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; — the compulsive AI use where you keep entering short prompts, getting outputs, entering more prompts, caught in a loop that feels productive but yields only short-lived dopamine hits without the actual fulfillment of achieving something or gaining understanding. It kills the joy of craftsmanship, the satisfaction that comes from struggling with a problem and actually solving it yourself.&lt;/p&gt;

&lt;p&gt;This hollowness, I’ve found, is actually a signal.&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; When I offload the cognitive work needed to form a clear picture in my mind — of what is happening, of all the moving parts, of how they interact — it immediately feels more hollow, less satisfying. The satisfaction comes from understanding, and when you skip that, you feel it. Jeremy Howard and Johno Whitaker talk about this quite a bit in SolveIt — the difference between the empty productivity of rapid prompting and the deeper satisfaction of actually learning something.&lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The question is whether this signal stays sharp over time, or whether it dulls as the tools get better and the friction gets lower.&lt;/p&gt;

&lt;h2 id=&quot;what-comes-out-of-it--some-practices&quot;&gt;what comes out of it — some practices&lt;/h2&gt;

&lt;p&gt;So what do I actually do? It starts with making a choice. Due to limited time, the choice is always phrased negatively: “what is NOT important enough for me to understand deeply?” For example, this blog was built using Jekyll, which I really didn’t bother to understand. I have to mentally acknowledge — with some angst — that I don’t care enough here to understand what is going on. I’m fine with this, because it clears time and mental space for things I do find important: research, core topics in my work, code I’m writing. This choice frees me in a sense.&lt;/p&gt;

&lt;p&gt;Then, for the things I do feel are important to understand, I try a few things. One is something akin to close reading.&lt;sup id=&quot;fnref:10&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;10&lt;/a&gt;&lt;/sup&gt; What I try to do is actually read for myself, with concrete questions in mind, to try and form understanding — preferably in a few layers. When I feel like I’ve finished and written my notes on the matter, I try to critically view them, see what I’ve missed. I don’t do this with all the text I consume, only with information-dense pieces that I want to truly &lt;em&gt;understand&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In code, it’s a lot harder to stay engaged — it’s so tempting to become that rat with the NAc electrode, continuing to enter short prompts into Claude Code. One thing I try is opening prompts in a separate window (ctrl+g in CC) and actually investing time clarifying exactly what I want, because the act of articulating forces me to think. I also reread the prompt after dictating it — usually using &lt;a href=&quot;https://handy.computer&quot;&gt;Handy&lt;/a&gt;&lt;sup id=&quot;fnref:11&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;11&lt;/a&gt;&lt;/sup&gt; — which helps me input more context while stimulating my thought process. I try to actually invest time in the output I get as well, understanding why choices were made; this limits the amount of output I produce, but it means I can stand behind whatever comes out. It was nice to see these patterns emerge in the “good spots” in the Anthropic paper — the high-scorers were doing something like this.&lt;/p&gt;

&lt;p&gt;Another practice that’s been valuable is what I’d call the interview format — using AI as an interlocutor rather than a doer. It’s not quite “critic” — it’s more like a reviewer or even a kind of psychoanalyst (not in the “ChatGPT is your therapist!!” type of way) — someone who is present and poses questions, trying to get more out of me. The thing is, we know more than we can say, so it helps when something prompts you to externalize more of your thought process. When I ask Claude or SolveIt to interview me about something I’m trying to understand — to probe my thinking, challenge my conclusions — I’m still doing the cognitive work, the thinking stays mine. It’s different from asking Claude to write something for me and accepting the output. I’ve come across others doing this online — I’m not the only one — and it feels right and valuable.&lt;/p&gt;

&lt;p&gt;And then there’s writing itself, which might be the most important friction-creating practice I have. You can’t verify your own understanding through feeling alone, but writing forces externalization — incoherence becomes visible, gaps surface, and it’s harder to lie to yourself when stuff that’s incoherent or implies lack of understanding is right there on the page. I can actually see it sometimes: I write something and then notice this kind of logical jump and think, wait a minute, how did you get there? Do you have enough evidence to support this claim? Maybe you missed something? This is one of the reasons I try to write more — the process itself forms understanding in ways that passive consumption or even active prompting doesn’t.&lt;/p&gt;

&lt;h2 id=&quot;limitations-and-open-questions&quot;&gt;limitations and open questions&lt;/h2&gt;

&lt;p&gt;I keep thinking about Instagram Reels, about the entertainment becoming the thing itself.&lt;sup id=&quot;fnref:12&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:12&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;12&lt;/a&gt;&lt;/sup&gt; We can say people need to develop the ability not to sit crouched all day watching TikToks. But did most people actually develop that ability? Or did they just… adapt to a lower baseline? If that analogy holds, vigilance around cognitive offloading may be a minority practice — most people might simply offload, and the baseline of what counts as “understanding your work” will shift downward for the population as a whole.&lt;sup id=&quot;fnref:13&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:13&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;13&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;There’s also the fact that I can’t really A/B test myself — the Anthropic researchers could measure comprehension decline because they had a control group and a test at the end, but I don’t have that luxury. The “what have I learned” question — asked daily, weekly, monthly — is my attempt at an essence check, but I’m not sure it’s reliable, since it requires honesty, and honesty requires structures that make self-deception visible.&lt;/p&gt;

&lt;p&gt;And then there’s the question of whether the hollowness signal recalibrates. What if it dulls over time? What if I get used to a lower baseline of satisfaction and stop noticing the gap? A year from now, would I trust that same internal signal to still be reliable?&lt;/p&gt;

&lt;p&gt;I’m not sure there’s a clean answer here. Maybe the honest thing is to sit with the uncertainty — we’ve built tools we can’t fully evaluate, and now we’re using those tools in ways we can’t fully evaluate either. The epistemic uncertainty compounds.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://www.anthropic.com/research/AI-assistance-coding-skills&quot;&gt;How AI assistance impacts the formation of coding skills&lt;/a&gt;, Anthropic, Jan 2026. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This isn’t meant to be a comprehensive summary of the research — the paper itself is worth reading in full. I’m focusing on the aspects that stuck with me. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;See for example the NYT piece on AI and social skills: &lt;a href=&quot;https://www.nytimes.com/2026/01/30/opinion/ai-social-skills-relationships.html&quot;&gt;link&lt;/a&gt;. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://solve.it.com&quot;&gt;SolveIt&lt;/a&gt; — Jeremy Howard’s course on AI-assisted learning that emphasizes understanding over output. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I wrote about this in &lt;a href=&quot;https://sparsethought.com/2026/01/29/evaluating-agents-in-health/&quot;&gt;how we actually evaluate agents (health)&lt;/a&gt;. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4944588&quot;&gt;GenAI as an Exoskeleton: Experimental Evidence on Knowledge Workers Using GenAI on New Skills&lt;/a&gt;, Wiles et al. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://bsky.app/profile/catblanketflower.yuwakisa.com/post/3mdubzm2e4k2s&quot;&gt;This Bluesky post&lt;/a&gt;. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This felt like a bit of a eureka moment when I first noticed it — the hollowness itself as information about whether I’m actually engaging with the work. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Jeremy Howard’s SolveIt course emphasizes this distinction repeatedly — the difference between getting an answer and building understanding. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;See the &lt;a href=&quot;https://ullyot.ucalgaryblogs.ca/teaching/close-reading/&quot;&gt;UCalgary guide on close reading&lt;/a&gt;. &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://handy.computer&quot;&gt;Handy&lt;/a&gt; is a tool I use for voice input — great for getting more context into prompts while keeping the thinking active. &lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:12&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I wrote about this previously: &lt;a href=&quot;https://sparsethought.com/2026/01/26/entertainment-instagram-reels/&quot;&gt;the entertainment is instagram reels&lt;/a&gt;. &lt;a href=&quot;#fnref:12&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:13&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I hope I’m wrong here, but I suspect not. &lt;a href=&quot;#fnref:13&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>how we actually evaluate agents (health)</title>
   <link href="https://sparsethought.com/2026/01/29/evaluating-agents-in-health/"/>
   <updated>2026-01-29T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/01/29/evaluating-agents-in-health</id>
   <content type="html">&lt;p&gt;In the last couple of months we’ve been working on a health agent. It was my role specifically to deal with the messy answer to the tough question: “is it good? worthwhile? valuable?”&lt;/p&gt;

&lt;p&gt;I’ll try to describe here our attempt to answer this question. This follows Anthropic’s &lt;a href=&quot;https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents&quot;&gt;Demystifying evals for AI agents&lt;/a&gt;, which I recommend reading first if you haven’t.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; What I’m adding here is the specifics of applying these principles to health, what we actually learned when the rubber met the road, and where the general advice breaks down or needs adaptation.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;starting-with-tasks&quot;&gt;Starting with tasks&lt;/h2&gt;

&lt;p&gt;First, probably a pretty important insight: we started by defining the specific tasks we would want the agent to be able to perform. We actually wrote early on that “Tasks is an abstraction that is meant to encapsulate a capability or to modularize a larger set of tools and context into something tangible that can be tested. The evals of each task is probably the main reason to have this abstraction.” I think this turned out to be right.&lt;/p&gt;

&lt;p&gt;We tried to aim for tasks that are somewhat verifiable, and that include some element of orchestration or tool chaining. We defined different levels (easy, medium, hard) where the easy ones can probably already be accomplished with a pretty simple prompt and SOTA models today, and even the medium and hard ones can maybe be close given detailed enough prompting.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
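&lt;p&gt;To make this concrete, here is a sketch of what such a task abstraction might look like. The fields, names, and the example task are illustrative, not our actual schema:&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    """A capability wrapped into something tangible that can be tested."""
    name: str
    level: str                     # "easy" | "medium" | "hard"
    prompt: str
    verify: Callable[[str], bool]  # cheap check on the agent's output
    tools: list[str] = field(default_factory=list)

# An "easy" task: somewhat verifiable, with a bit of tool chaining.
tir_task = Task(
    name="time-in-range summary",
    level="easy",
    prompt="Summarize time-in-range from the CGM data.",
    verify=lambda out: "time in range" in out.lower(),
    tools=["cgm_query"],
)
```

&lt;p&gt;The &lt;code&gt;verify&lt;/code&gt; field is the point of the abstraction: each task carries its own eval.&lt;/p&gt;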

&lt;p&gt;We also noted some challenges ahead of time. In most examples people already have a deployed AI product generating traces. We didn’t have that, so generating evals beforehand was challenging.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; It’s somewhat circular with the task definitions, tool definitions and so on. We wrote at the time that “it’s also okay to start with vibe checks” - which I still think is true, though insufficient.&lt;/p&gt;

&lt;p&gt;We tried both GPT SOTA models and OpenEvidence and looked at their responses to our queries. We found good things and bad things. Some failure modes we discovered there would later carry over to our own system (LLM-inherited, baked into the base model). Others would be specific to our scaffolding, our agent harness.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; This distinction ended up mattering less than I initially thought it would.&lt;/p&gt;

&lt;h2 id=&quot;the-metabolic-health-report-as-an-evaluation-surface&quot;&gt;The metabolic health report as an “evaluation surface”&lt;/h2&gt;

&lt;p&gt;Because we wanted something more consistent and easier to develop against, we concatenated a bunch of those tasks into one thing you can call a “metabolic health report” (it’s “metabolic” at this point because of the input modalities - CGM data, diet logging).&lt;/p&gt;

&lt;p&gt;The goal was to: present the data itself, contextualize it with metrics and personalized insights, add actionable guidance, and connect the dots while grounding the claims. Importantly, the report had to be constructed of evaluatable tasks.&lt;/p&gt;

&lt;p&gt;I think this was a clever bit. We could have separated it into specific prompts, and this would probably enable some different things. Arguably, this would create a more flexible system (I’m not sure our system right now would handle concrete, limited-scope questions very well). But the report gave us a controlled evaluation surface - same input, same task, observable improvement over time.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h2 id=&quot;the-hierarchy-what-we-actually-solved&quot;&gt;The hierarchy: what we actually solved&lt;/h2&gt;

&lt;p&gt;We mostly solved the general structure of it, the hierarchy. Like, we know that, similarly to SWE, we have the equivalent of a unit test and the equivalent of an integration test, and thankfully - because at the early POC stage the entire task itself was rather well defined - we also have the equivalent of an end-to-end test.&lt;/p&gt;

&lt;p&gt;But here’s where it gets interesting. There’s this tension between form and essence.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; We can check form pretty well: Did the citation exist? Was the tool called? Do the numbers in the output trace back to tool outputs?&lt;/p&gt;

&lt;p&gt;The Anthropic post talks about two types of evals: capability evals (what can this agent do well? - should start at a low pass rate, giving teams a hill to climb) and regression evals (does the agent still handle all the tasks it used to? - should have nearly 100% pass rate). Looking back, I think we focused mainly on regression-style thinking: see a failure, make sure it doesn’t happen again. But the pattern of “see them fail and then add a skill or change something to make them green” is arguably hill-climbing too, or at least close to it.&lt;/p&gt;

&lt;h2 id=&quot;error-analysis-it-starts-with-the-data&quot;&gt;Error analysis: it starts with the data&lt;/h2&gt;

&lt;p&gt;Before we could add evals, we had to understand what was actually failing. This meant doing what everyone says you should do but few actually enjoy: manually going over the data. People have stated repeatedly how valuable this is. Our case was no different - it’s extremely valuable.&lt;/p&gt;

&lt;p&gt;I built a simple annotation tool - maybe a day’s work, completely ad hoc, knowing it would be useful for this POC stage and being perfectly okay with not needing it later. Just the ability to properly show markdown and the tool chain beside it, and a small place to highlight quotes and write annotations. That was really nice and really useful. The ratio was great - minimal investment, high return.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/annotation-tool.png&quot; alt=&quot;Annotation tool showing metabolic health report with tool trace and annotation panel&quot; /&gt;
&lt;em&gt;The annotation tool: markdown output on the left, tool trace on the right, annotation panel below. Ad hoc but useful.&lt;/em&gt;&lt;/p&gt;
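&lt;p&gt;At that scale, the tool can be little more than a script that renders each trace for review. A minimal sketch, assuming each record holds the report markdown and its tool trace - the field names here are hypothetical, not our actual format:&lt;/p&gt;

```python
import json

def write_review_file(records, path=None):
    """Render records for manual review: the report markdown, the tool
    trace next to it, and an empty annotations section to fill in by hand."""
    chunks = []
    for i, rec in enumerate(records):
        chunks.append(
            f"## record {i}\n\n{rec['report_md']}\n\n"
            f"### tool trace\n\n{json.dumps(rec['tool_trace'], indent=2)}\n\n"
            "### annotations\n"
        )
    text = "\n".join(chunks)
    if path:  # optionally persist for annotating in an editor
        with open(path, "w") as f:
            f.write(text)
    return text
```

&lt;p&gt;Anything that shows the output and the trace side by side and gives you a place to write will do; the value is in the looking, not the tool.&lt;/p&gt;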

&lt;p&gt;We did detailed error analysis comparing against our GPT baseline.&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt; Here are the major failure points we discovered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ungrounded percentiles/comparisons.&lt;/strong&gt; Population percentiles appearing without cohort size disclosure or traceable tool output. “Digital twin” comparisons fabricated without reference dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagnostic language misuse.&lt;/strong&gt; Terms like “CRITICAL,” “SEVERE HYPOGLYCEMIA,” “IMMEDIATE ATTENTION” applied to likely CGM artifacts in healthy individuals. Diabetes guidelines (54 mg/dL thresholds) misapplied to non-diabetic physiology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing modality blindness.&lt;/strong&gt; Post-meal and exercise claims made without diet/activity data. The agent sometimes fails to caveat when required data is unavailable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pseudo-evidence citations.&lt;/strong&gt; Claims labeled “evidence-based” without guideline citations. ADA/ATTD mentioned but specific criteria not actually applied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over-interpretation from CGM alone.&lt;/strong&gt; Physiology inferred (“insulin sensitivity,” “reactive hypoglycemia”) from metrics that don’t support such conclusions.&lt;/p&gt;

&lt;p&gt;We stated our goal simply: address as many of those as we can. The pattern became: see a failure, add an eval that catches it, add a skill or change something to make it green, move on. Small circuits of failure → eval → prevention. Whether the root cause was LLM-inherited or scaffolding-introduced mattered less than whether we could catch it and prevent recurrence.&lt;/p&gt;
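&lt;p&gt;Two of those circuits, sketched as toy regression checks - the phrase list and patterns are illustrative and far cruder than real grounding checks:&lt;/p&gt;

```python
import re

# Illustrative phrase list for the "diagnostic language misuse" failure.
ALARMIST_TERMS = ["CRITICAL", "SEVERE HYPOGLYCEMIA", "IMMEDIATE ATTENTION"]

def eval_no_alarmist_language(report):
    """Flag alarmist diagnostic terms; for a healthy individual this
    eval should return an empty list."""
    return [t for t in ALARMIST_TERMS if t in report]

def eval_numbers_grounded(report, tool_outputs):
    """Flag numbers in the report that don't trace back to any tool
    output (a crude proxy for the ungrounded-percentiles failure)."""
    grounded = set(re.findall(r"\d+(?:\.\d+)?", " ".join(tool_outputs)))
    return [n for n in re.findall(r"\d+(?:\.\d+)?", report) if n not in grounded]
```

&lt;p&gt;Each check exists because a specific failure was observed; together they accumulate into the regression suite.&lt;/p&gt;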

&lt;h2 id=&quot;what-we-couldnt-easily-measure-the-essence-problem&quot;&gt;What we couldn’t easily measure: the essence problem&lt;/h2&gt;

&lt;p&gt;Even if we solve the foundational issues (and we did make progress), gaps remain. This is directly linked to what I call the essence problem. We can check form. But it’s still hard to evaluate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clinical appropriateness.&lt;/strong&gt; When the system thinks findings are real but the person is healthy. When it applies the wrong clinical frame.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Narrative coherence.&lt;/strong&gt; Does the report tell a story, or is it a collection of metrics?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility.&lt;/strong&gt; The hardest one, maybe. Trying to answer the question “is this helpful” - to the participant, to the clinician.&lt;/p&gt;

&lt;p&gt;A meta point to consider: maybe some essence evals require accepting fuzzier metrics or human judgment. Better to measure imperfectly than to ignore the question entirely.&lt;/p&gt;

&lt;h2 id=&quot;the-thing-that-lives-in-my-head&quot;&gt;The thing that lives in my head&lt;/h2&gt;

&lt;p&gt;There’s a finding from Google’s work on health agents (&lt;a href=&quot;https://arxiv.org/abs/2508.20148&quot;&gt;The Anatomy of a Health Agent&lt;/a&gt;) that constantly lives in my head. When presented with answers of differing degrees of “professionalism,” ordinary users are not so good at correctly recognizing which output is better. Concretely, given two health reports - one well-grounded, the kind that would gain approval from experts, and the other AI slop - many people won’t be so good at figuring out which is which. Meanwhile, experts can easily tell the two apart, with a high degree of inter-observer reliability.&lt;/p&gt;

&lt;p&gt;To be frank, I’m not sure it’s directly related to the evals piece, but it lives so constantly in my head that it’s hard to avoid when thinking about this work.&lt;/p&gt;

&lt;p&gt;I think it hints at something deeper. What is it actually hinting about? From my point of view, the Google paper showed that even though the more complicated architecture prevailed on metrics, on evals, on benchmarks, and even on expert opinion - it didn’t matter to the end user. There’s a tension here. You do want a system that is better all around. Better on evals, better on benchmarks, better according to experts. This is the thing you want to expose and bring to general availability.&lt;/p&gt;

&lt;p&gt;But then you’re doing something that in some sense ignores the users.&lt;/p&gt;

&lt;p&gt;My answer, at least for health: in terms of integrity, you still have to choose the system that provides better answers. You can’t ship something that is slop. Hopefully over time it gains trust because it actually works - because when someone follows its guidance, things go well.&lt;/p&gt;

&lt;h2 id=&quot;the-agency-question&quot;&gt;The agency question&lt;/h2&gt;

&lt;p&gt;There’s a tough question I kept in the back of my mind: did we end up over-complicating something that could have been pretty simple by forcing it to become an “agent” - when agency wasn’t needed most, if not all, of the time?&lt;/p&gt;

&lt;p&gt;The metabolic health report, as an evaluation task, could arguably be a workflow with simple rule-based conditions. You could make that case.&lt;/p&gt;

&lt;p&gt;We defined a diverse set of tasks with the input modalities we decided to include - diet logging and continuous glucose monitoring data. We were actually trying to build a system that would be the basis of something with a bit more agency, being able to answer a diverse set of tasks. The metabolic health report was our version of being able to run the same prompt with the same system and see improvement in our results.&lt;/p&gt;

&lt;p&gt;Recently we encountered a Washington Post piece about failure modes in GPT for health advice.&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; Almost as a challenge, we wanted to see how our agentic system would handle the same kind of task - cardiovascular risk assessment. We added a skill, tried with roughly the same prompt structure, and it worked. It didn’t hallucinate the way the article described. The architecture we built - the task abstractions, the eval framework, the tool integration patterns - transferred. We didn’t rebuild from scratch. The system we created with those hill-climbing evals kind of proved itself(!!). Given this experience, I think the agent architecture was forward-looking infrastructure that hadn’t been fully tested yet. Now it has been, a little more.&lt;/p&gt;

&lt;h2 id=&quot;what-we-contributed-what-we-learned&quot;&gt;What we contributed, what we learned&lt;/h2&gt;

&lt;p&gt;You know, in some sense the real questions to answer are “what did we contribute to the world?” and “what did we learn in the process?” Those are the things I should be focusing on.&lt;/p&gt;

&lt;p&gt;What we learned is probably clearer: the form/essence gap is real and doesn’t go away. Small circuits of failure → eval → prevention work. Error analysis is worth the time. Expert opinion is non-negotiable. And the task abstraction - defining what you want the agent to do before building it - pays off.&lt;/p&gt;

&lt;p&gt;What we contributed is maybe just a specific case study of what Anthropic and others have talked about in the abstract. We implemented the lessons. We discovered which parts of the general advice break down when you’re actually in a health domain, looking at CGM data, trying to figure out if “reactive hypoglycemia” is a reasonable inference or a hallucination.&lt;/p&gt;

&lt;p&gt;Maybe that’s enough. Implementing lessons is good. Showing what actually happens when theory meets practice is useful.&lt;/p&gt;

&lt;h2 id=&quot;what-id-tell-someone-starting-this&quot;&gt;What I’d tell someone starting this&lt;/h2&gt;

&lt;p&gt;If you came to me tomorrow starting a health agent project and asked how to approach evals:&lt;/p&gt;

&lt;p&gt;This is going to be harder than you think. Even if you follow a thorough process of error analysis, it’s hard to do without user feedback. It’s hard to do without expert opinion. Expert opinion is something you must have.&lt;/p&gt;

&lt;p&gt;Small circuits work. See a failure mode, create an eval that finds it, fix it, prevent recurrence. Don’t overthink the taxonomy of why things fail. Just catch them and stop them from happening again.&lt;/p&gt;

&lt;p&gt;Invest a little in tooling for yourself - an annotation interface, a way to see traces alongside outputs. It doesn’t have to be fancy. A day of work paid off in weeks of usable error analysis.&lt;/p&gt;

&lt;p&gt;Accept that some things you care about - utility, clinical appropriateness, narrative coherence - won’t have clean automated evals. That’s okay. Measure what you can. Use expert judgment for the rest.&lt;/p&gt;

&lt;p&gt;And define your tasks before you build. “The evals of each task is probably the main reason to have this abstraction.” We wrote that early. It turned out to be true.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;(hopefully) this will be a part of a series exploring different aspects of our health agent work. The goal is to answer two questions: what did we contribute, and what did we learn?&lt;/em&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Anthropic aren’t the only ones writing about this - OpenAI recently posted about evaluating skills specifically, which is a nice complement: &lt;a href=&quot;https://developers.openai.com/blog/eval-skills/&quot;&gt;Eval Skills&lt;/a&gt;. And &lt;a href=&quot;https://hamel.dev/&quot;&gt;Hamel Husain’s work&lt;/a&gt; on evals is great. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;And trial and error, and health data. Ok, maybe just the medium ones. The hard ones were genuinely hard. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is a chicken-and-egg problem: you want decent performance before deploying to users, but some of the most important evaluation signals (especially around “essence” - utility, clinical appropriateness) require actual user feedback to assess properly. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The Anthropic post defines an “agent harness” (or scaffold) as the system that enables a model to act as an agent - it processes inputs, orchestrates tool calls, and returns results. When we evaluate “an agent,” we’re evaluating the harness &lt;em&gt;and&lt;/em&gt; the model working together. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;There’s an idea we ended up not exploring: the concept of “agent state” as a middle ground where we could run some evals (mostly against ground truth). We did use the traces quite a lot, which was useful. What we actually did was a “right shift” instead of “left shift” - we shifted everything to evaluate the end result of the metabolic health report rather than intermediate states. For a POC this was fine, but I still think agent state as an evaluation surface might be worth exploring in the future. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Coming from a background in NMR spectroscopy and signal processing, this was disorienting. In spectroscopy, ground truth is relatively clear - the peaks are the peaks. In this domain, the ground truth you actually care about is expensive, fuzzy, or sometimes unknowable. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I had a dual role here - looking at outputs both as a physician (is this clinically appropriate?) and as a research scientist (do these numbers trace back to tool outputs?). Some things are flat out wrong and easy to catch. But some things are not such giveaways - even if the numbers are accurate, the context might be misleading. Having both perspectives was useful. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://www.washingtonpost.com/technology/2026/01/26/chatgpt-health-apple/&quot;&gt;The Washington Post article&lt;/a&gt;. The general pattern was GPT providing cardiovascular risk assessments that included hallucinated numbers and misapplied clinical guidelines - exactly the failure modes we’d identified and built evals against. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>the entertainment is instagram reels (and tiktoks)</title>
   <link href="https://sparsethought.com/2026/01/26/entertainment-instagram-reels/"/>
   <updated>2026-01-26T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/01/26/entertainment-instagram-reels</id>
   <content type="html">&lt;blockquote&gt;
  &lt;p&gt;“The Entertainment is real, and it’s called Instagram Reels.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;— Will Gottsegen, &lt;a href=&quot;https://www.theatlantic.com/newsletters/2026/01/infinite-jest-hyperion-sans-soleil-culture-recs/685650/&quot;&gt;The Atlantic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not a unique observation (the connection is clear). But there’s something I like about people still arriving at Infinite Jest and feeling compelled to say it out loud.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>how will we know the model did a good job?</title>
   <link href="https://sparsethought.com/2026/01/23/what-is-a-good-fm/"/>
   <updated>2026-01-23T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/01/23/what-is-a-good-fm</id>
   <content type="html">&lt;p&gt;A foundation model I’ve been working on recently got published in Nature.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; For a while I’ve wanted to write this. Now the paper is finally out so I have to do it in a timely manner, and I also have to start investing more thought in the upcoming projects (some are similar). So what is this? I think the honest answer is something between a post-mortem of a successful project and some exploration towards the future. Exploration about the question that lived in my mind when I was working on this project: “how do we know it’s actually worthwhile?” I think it comes up often in these kinds of research works.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;The question from the title predates the code. We were building a foundation model for continuous glucose monitoring data, training a transformer to learn representations of metabolic health, and from the start the problem was: &lt;em&gt;if it worked, what would “worked” even mean? What would this thing be able to do that would prove it has value?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So one of the first things we wrote wasn’t code. It was a document titled “Evaluation Metrics for CGM Foundation Model.” The subtitle was the question itself: &lt;em&gt;How will we know the model did a good job?&lt;/em&gt; It was an internal document, used for discussion, exploring our options, thinking through things in writing together. The aim was to prevent later cherry-picking, to separate hypothesis generation from hypothesis testing.&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; We listed every physiological system that CGM might plausibly encode: microvascular complications, macrovascular complications, liver function, lipids, sleep, body composition. We mapped out which external cohorts we could test on, which clinical trials had CGM data, what predictions would actually be meaningful versus what would just be impressive-looking.&lt;/p&gt;

&lt;p&gt;We were inventing our criteria for success as part of starting the work. I don’t think this specific kind of “intellectual labor” has a name, but it turned out to be where we invested the most time — and where I felt my contribution to be most significant.&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Then Guy (the lead author) finished training the model, and we looked at the latent space as represented by UMAP. There, we saw axes that looked like physiology (at least if you are hopeful). I could see postprandial glucose response organizing along one dimension, fasting glucose along another. It looked like the model had learned something real about metabolic health(!!). But UMAPs are notorious for showing you what you want to see.&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; The pattern was suggestive, not conclusive. So we went back to the evaluation document we’d created and started working through it, testing the model on real cohorts we didn’t train on.&lt;/p&gt;

&lt;p&gt;The moment things shifted (not to certainty, but to less doubt) was the &lt;a href=&quot;https://pubmed.ncbi.nlm.nih.gov/28317402/&quot;&gt;AEGIS cohort&lt;/a&gt;.&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; A Spanish study with long-term follow-up. We could ask: does the model’s representation of someone’s CGM predict their cardiovascular risk years later? It did(!). The risk stratification worked. People in the top quartile of model-predicted risk had dramatically higher rates of cardiovascular mortality. This wasn’t a standard CGM metric like time-in-range or glucose variability, it was something the model extracted that we couldn’t fully interpret.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; We ran more external cohorts after that, and each one reduced uncertainty a little more.&lt;/p&gt;

&lt;p&gt;The paper is now published in Nature. By every external metric, this is a success. And I’m still not sure what hard biological problem it definitively solves.&lt;/p&gt;

&lt;p&gt;I’m not trying to be modest here, it’s just the honest &lt;em&gt;epistemic state&lt;/em&gt;. We proved the model learns something. We proved it generalizes. We proved its representations predict outcomes better than standard metrics. What we didn’t prove (what I’m not sure anyone has proved for foundation models in biology) is that this approach is worth it compared to simpler methods. There’s &lt;a href=&quot;https://x.com/SynBio1/status/2014362787447701752&quot;&gt;this post&lt;/a&gt; I saw out of context and now it feels relevant: “It is now easier to build an AI tool for biology than to use that tool against a hard biological problem.” We built a tool, and the hard problem remains.&lt;/p&gt;

&lt;p&gt;So what would “worth it” actually look like? I’ve been thinking about this for the next foundation model we’re building, which is multimodal (across lab values, imaging, longitudinal trajectories). One approach we’re exploring is perturbation: taking an individual and asking what the model predicts would happen to their biomarkers if we artificially modified one parameter. What if we simulate weight loss? What if we reduce their LDL by some percentage? The model was never explicitly taught dose-response relationships. It just saw real trajectories. So any systematic relationship between intervention magnitude and outcome magnitude would have to be something it learned from the data itself. If it learned actual physiology, it should get the directions right, maybe even the relative magnitudes. That’s an open question, not a solved one. But this feels closer to what “works” might actually mean: not just prediction, but demonstrating the model learned something about how the system operates (trying to answer quasi-counterfactual questions, to start with).&lt;/p&gt;
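
&lt;p&gt;The perturbation idea can be sketched in a few lines. This is a toy illustration only: &lt;code&gt;predict_biomarkers&lt;/code&gt; is a hypothetical stand-in for the real model’s prediction head, and the linear LDL-weight relationship is invented for the example.&lt;/p&gt;

```python
# Sketch of the perturbation probe: vary one input parameter across a grid,
# hold everything else fixed, and check whether the predicted biomarker
# responds monotonically in the physiologically expected direction.
# `predict_biomarkers` and the record format are hypothetical.

def predict_biomarkers(record):
    # Toy stand-in for the model: LDL responds positively to weight.
    return {"ldl": 70.0 + 0.6 * record["weight_kg"]}

def perturbation_curve(model, record, param, deltas):
    """Predicted biomarkers as `param` is shifted by each delta in turn."""
    curve = []
    for d in deltas:
        perturbed = dict(record)
        perturbed[param] = record[param] + d
        curve.append(model(perturbed))
    return curve

def is_monotonic(values, increasing=True):
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(b >= a for a, b in pairs)
    return all(a >= b for a, b in pairs)

# Example probe: simulate losing 0..10 kg and ask whether predicted LDL
# moves in the expected direction.
record = {"weight_kg": 90.0, "age": 55}
deltas = [0.0, -2.5, -5.0, -7.5, -10.0]
curve = perturbation_curve(predict_biomarkers, record, "weight_kg", deltas)
ldl = [c["ldl"] for c in curve]
```

&lt;p&gt;The real test would be running this against a model that was never taught dose-response relationships, and checking directions (and maybe relative magnitudes) against known physiology.&lt;/p&gt;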

&lt;p&gt;The question that started the GluFormer project — &lt;em&gt;how will we know the model did a good job?&lt;/em&gt; — hasn’t been answered; it has (somewhat) evolved. And maybe that’s the thing: “how do you prove this is good?” is a question you carry. For this model, the next one, the field. I don’t have a cleaner way to put it.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://nature.com/articles/s41586-025-09925-9&quot;&gt;A foundation model for continuous glucose monitoring data&lt;/a&gt;. I’m second author; Guy Lutsker led. For full text: &lt;a href=&quot;http://rdcu.be/eY5fH&quot;&gt;rdcu.be/eY5fH&lt;/a&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;We discussed actual pre-registration early on. Decided against it — foundation model research might be inherently too exploratory. You’re building a thing and then figuring out what it’s good for. Hard to pre-register that. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;There’s a version of scientific writing where you discover truth and then report it. And another where you choose which questions matter, which framing makes the work legible. We did the second. That’s not dishonesty — it’s deciding what counts as interesting. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;UMAPs have well-documented problems — they can show structure that isn’t there, they’re sensitive to hyperparameters, the axes aren’t necessarily meaningful. Ours ended up in supplementary materials, not main figures. The suggestive pattern was a starting point, not evidence. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;People sometimes ask about the “eureka moment.” There wasn’t one. The generative capabilities felt promising. The cardiac risk stratification felt real. But no single breakthrough of certainty. That might just be what this kind of research feels like. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Specifically, we processed raw CGM data from 580 participants (followed for a median of 11 years) through GluFormer to generate high-dimensional embeddings, then mapped these to a single “GluFormer-derived score” originally trained to predict HbA1c. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>data activation thoughts</title>
   <link href="https://sparsethought.com/2026/01/17/data_activation/"/>
   <updated>2026-01-17T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2026/01/17/data_activation</id>
   <content type="html">&lt;p&gt;The landscape is shifting in recent years — it’s a cliche to start texts like this these days, but the fact that it’s a cliche doesn’t make it any less true.&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; In 2019, the folks at Andreessen Horowitz wrote this about data (in a piece titled &lt;a href=&quot;https://a16z.com/the-empty-promise-of-data-moats/&quot;&gt;The Empty Promise of Data Moats&lt;/a&gt;): “Instead of getting stronger, the defensible moat erodes as the data corpus grows and the competition races to catch up.” (Trying to prove some data has value — I’ve experienced it firsthand.)&lt;/p&gt;

&lt;p&gt;LLMs have shifted where value comes from. It’s no longer enough to simply have proprietary data; what matters now is how effectively you can make that data useful to these systems (and therefore, to anything else that lives off that). So, if traditional data moats are eroding, the new competitive edge lies in data &lt;strong&gt;activation&lt;/strong&gt;. The pressing question becomes: &lt;strong&gt;how quickly can you connect your proprietary data to LLMs in ways that demonstrably improve their performance&lt;/strong&gt; (before someone else figures out how to replicate your insights without your data)?&lt;/p&gt;

&lt;p&gt;Before we continue I want to think about a simple metaphor here — LLMs can &lt;em&gt;ingest&lt;/em&gt; the data. They’ll happily consume every row and column you throw at them. But (and this is important) without the right transformation, they can’t &lt;em&gt;metabolize&lt;/em&gt; it. The nutritional value passes through unabsorbed. They’re missing what I guess you could call the “enzymes.” Data activation is about providing those enzymes: converting raw information into a form the model can actually digest and turn into a &lt;strong&gt;capability&lt;/strong&gt;.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;why-this-matters-now-healthcare-as-case-study&quot;&gt;Why this matters now (healthcare as case study)&lt;/h2&gt;

&lt;p&gt;Looking specifically at healthcare data, the opportunity is immense — and, let’s face it, time-limited. Consider OpenAI’s &lt;a href=&quot;https://cdn.openai.com/pdf/2cb29276-68cd-4ec6-a5f4-c01c5e7a36e9/OpenAI-AI-as-a-Healthcare-Ally-Jan-2026.pdf&quot;&gt;report&lt;/a&gt; from January 2026: more than 5% of all ChatGPT messages globally are healthcare-related. 25% of weekly active users ask health-related questions. More than 40 million people turn to ChatGPT &lt;strong&gt;daily&lt;/strong&gt; for healthcare guidance (!!!).&lt;/p&gt;

&lt;p&gt;The big labs are clearly taking notice: within the span of a single week (January 2026), OpenAI launched “ChatGPT for Healthcare” (already rolling out to institutions like Cedars-Sinai, Memorial Sloan Kettering, and Stanford Medicine) and Anthropic announced “Claude for Healthcare” with HIPAA-ready infrastructure and native integrations to medical databases/ontologies (CMS Coverage Database, ICD-10, PubMed). To me, it looks like healthcare is now a primary battleground for frontier AI companies.&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Yet, if you look at OpenRouter’s numbers, health remains “&lt;a href=&quot;https://openrouter.ai/state-of-ai&quot;&gt;the most fragmented of the top categories&lt;/a&gt;”. What does this mean? According to OpenRouter, it signals both the domain’s complexity and the inadequacy of current general-purpose models.&lt;/p&gt;

&lt;h2 id=&quot;one-potential-method-for-data-activation&quot;&gt;One (potential) method for data activation&lt;/h2&gt;

&lt;p&gt;It seems that recent research already demonstrates a working bridge between structured medical data and improvements in LLM reasoning.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://openreview.net/forum?id=cqNAjXUBOV&quot;&gt;Tables2Traces&lt;/a&gt; established a framework for converting raw, tabular patient-level data into contrastive reasoning traces that can be used for LLM fine-tuning. They tried to “mirror how a clinician would think” — what they did was pretty simple. For every patient record, they identified similar patients with different outcomes (someone similar who died and someone similar who survived). Once they had those triplets of patients, they prompted a strong LLM to generate explanations for the divergence. These reasoning traces then became fine-tuning data for smaller models.&lt;/p&gt;
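
&lt;p&gt;To make the triplet-construction step concrete, here’s a minimal sketch. All names here (&lt;code&gt;build_triplets&lt;/code&gt;, the record layout, the prompt wording) are my own invention for illustration, not the paper’s actual pipeline:&lt;/p&gt;

```python
# Minimal sketch of contrastive-triplet construction: for each anchor
# patient, find the most similar patient with each outcome, then phrase
# a prompt whose answer becomes a synthetic reasoning trace.
# All names and the record format are hypothetical.
import math

def similarity(a, b):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def build_triplets(records):
    """Yield (anchor, most-similar survivor, most-similar non-survivor)."""
    triplets = []
    for anchor in records:
        others = [r for r in records if r is not anchor]
        by_outcome = {}
        for outcome in ("survived", "died"):
            candidates = [r for r in others if r["outcome"] == outcome]
            if candidates:
                by_outcome[outcome] = max(
                    candidates,
                    key=lambda r: similarity(anchor["features"], r["features"]),
                )
        if len(by_outcome) == 2:
            triplets.append((anchor, by_outcome["survived"], by_outcome["died"]))
    return triplets

def trace_prompt(anchor, survived, died):
    # The prompt a strong LLM would answer; its explanation becomes
    # fine-tuning data for a smaller model.
    return (
        f"Patient A {anchor['features']} resembles patient B "
        f"{survived['features']} (survived) and patient C "
        f"{died['features']} (died). Explain, step by step, which factors "
        "most plausibly drove the divergence."
    )
```

&lt;p&gt;The explanation the strong LLM returns for each such prompt is then collected as a reasoning trace and used as supervision for the smaller model.&lt;/p&gt;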

&lt;p&gt;For their specific use-case they showed significant improvement (&amp;gt;17% on domain-specific MedQA) and even generalization capabilities: they trained only on cardiovascular cases but noted improvement in other areas of medicine as well. The paper’s “simple vs. full” comparison also provides empirical evidence: naively converting tables to patient narratives doesn’t work (and can hurt performance). So the models actually need the structured reasoning scaffold — the contrastive comparison, together with the reasoning and quasi-counterfactual thinking, is what makes the difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saying it a bit differently&lt;/strong&gt; — they kind of show that the value in structured medical data is like potential energy trapped behind a dam. The power is real, but it just sits there. Naive table-to-text conversion doesn’t work; you’re essentially drilling a small hole in the dam and expecting electricity. The reasoning scaffold (in their case — contrastive comparison, counterfactual thinking) is the turbine. It converts stored potential into usable power.&lt;/p&gt;

&lt;p&gt;Another work worth mentioning is &lt;a href=&quot;https://arxiv.org/abs/2510.25628v2&quot;&gt;EHR-R1&lt;/a&gt;. They synthesized 300k “high-quality” traces using a different method — something they call a “thinking-graph pipeline”: [1] extract medical entities from each patient’s longitudinal EHR (including free-text), [2] quantify associations between medical entities, then [3] map entities to medical ontology (&lt;a href=&quot;https://www.nlm.nih.gov/research/umls/index.html&quot;&gt;UMLS concepts&lt;/a&gt;) and use graph search to recover medical relations that connect context entities to the target labels. They then prompted an LLM with the patient record plus these retrieved relations to produce a structured reasoning chain, which became the supervision data. The results? Their model outperforms strong commercial/open models, averaging &amp;gt;30 points over GPT-4o on EHR-Bench (which they also created).&lt;/p&gt;
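
&lt;p&gt;Step [3] of that pipeline is, at its core, path-finding over a relation graph. A toy sketch, with a made-up three-edge graph standing in for UMLS:&lt;/p&gt;

```python
# Toy sketch of the graph-search step: breadth-first search over a small
# relation graph, recovering the chain of relations that connects a context
# entity to the target label. The graph and entities are invented;
# EHR-R1 uses UMLS concepts and a much larger relation set.
from collections import deque

RELATIONS = {
    "metformin": [("treats", "type 2 diabetes")],
    "type 2 diabetes": [("risk_factor_for", "chronic kidney disease")],
    "chronic kidney disease": [("associated_with", "anemia")],
}

def find_relation_path(graph, start, target):
    """Shortest chain of (entity, relation, entity) edges from start to target."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, relation, nxt)]))
    return None  # no chain connects the two entities

path = find_relation_path(RELATIONS, "metformin", "anemia")
```

&lt;p&gt;The recovered chain, together with the patient record, is what would then be handed to the LLM to produce the structured reasoning chain used as supervision.&lt;/p&gt;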

&lt;p&gt;Another &lt;a href=&quot;https://www.nature.com/articles/s41746-025-01681-4&quot;&gt;paper&lt;/a&gt; shows this scales pretty well: fine-tuned 8B-parameter models achieved 89.3% accuracy while being 85x cheaper than their 70B teacher models.&lt;/p&gt;

&lt;p&gt;So I think the existence proof is established: structured EHR/biobank data can be transformed into reasoning supervision that measurably improves LLM clinical performance.&lt;/p&gt;

&lt;h2 id=&quot;whats-still-unclear&quot;&gt;What’s still unclear&lt;/h2&gt;

&lt;p&gt;I think Tables2Traces proved feasibility in some sense, but synthetic traces are still in the “unverified” realm. This gap showed mostly in the way physicians treated those traces (in short, they didn’t think the traces were very good). And there’s a deeper issue — &lt;a href=&quot;https://arxiv.org/abs/2509.21933&quot;&gt;recent work&lt;/a&gt; shows that traces can sometimes be “unfaithful,” meaning they don’t accurately reflect the actual basis for a decision. Plainly: the trace says one thing, the model’s decision is different.&lt;/p&gt;

&lt;p&gt;It’s also worth noting that these papers tend to show improvements on less capable models. That’s not an accident — showing improvements on stronger models is harder (or the improvements aren’t there). We should be honest about that.&lt;/p&gt;

&lt;p&gt;So the question that keeps bothering me: what’s the right transformation? The papers above offer some approaches — contrastive reasoning, knowledge graphs, ontology grounding. There are others being explored (RL-based methods, temporal modeling for longitudinal records). But I don’t have a clean answer. The dam metaphor still holds — the potential energy is real — but we’re still figuring out how to build the right turbine.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Speaking of cliches — I’m aware this piece is full of em-dashes, which have become a telltale sign of AI-assisted writing. But as &lt;a href=&quot;https://x.com/nabeelqu/status/2012219522833359071&quot;&gt;Nabeel Qureshi pointed out&lt;/a&gt;, David Foster Wallace was doing this decades ago. The italics for emphasis, the informality, the casual speech tone. I’ll keep my em-dashes. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In some sense, what once seemed like a disadvantage — healthcare’s lagging technological infrastructure — may now be an asset: a greenfield opportunity with less legacy baggage to work around. But that might be substance for a different post. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>What's Sparse Thoughts?</title>
   <link href="https://sparsethought.com/2025/08/16/whats-sparse-thoughts/"/>
   <updated>2025-08-16T00:00:00+00:00</updated>
   <id>https://sparsethought.com/2025/08/16/whats-sparse-thoughts</id>
   <content type="html">&lt;p&gt;For a while now I’ve been collecting too many things to read and think about, mostly in twitter ‘saved links’. 
This is a place for me to silently collect together things I enjoyed reading / other content I’ve enjoyed and sometimes jot some thoughts about it. It’s built in a way that won’t make me feel too committed, should be kind of under the radar, low friction, minimal effort.&lt;/p&gt;
</content>
 </entry>
 

</feed>
