Drawing The Line

Mar 14

In Conversation with Claude: An Experiment

Introduction

I did not initially or intentionally sit down to write this piece. I merely wanted to have an off-the-cuff intellectual exploration with my favourite sparring partner, Claude.

I had just listened to an episode of PJ Vogt's Search Engine podcast called ”Mysteries of Claude”, while I was in the bath (important because I had no ways to take notes, so I had to remain very present and listen), on which the journalist Gideon Lewis-Kraus had recently appeared to discuss his time embedded at Anthropic. When I stepped out of the bath, I immediately looked up Lewis-Kraus's written account of that same experience, a long-ish piece in The New Yorker published in February 2026, titled "What Is Claude? Anthropic Doesn't Know, Either."

The two together produced a particular kind of unease and tickled something that I’ve been juggling messily in my brain. What stayed with me wasn't the technology Lewis-Kraus described, such as the interpretability experiments, the neurons lighting up, the model playing Pokémon in the office lobby. It was actually the stories of people working at Anthropic and shaping Claude as we experience it.

It was specifically Lewis-Kraus’s portrait of the people making foundational ethical decisions about how AI systems should behave, namely decisions that will quietly shape the inner lives of hundreds of millions of people, who largely felt unqualified to make them, but were having to make them anyway.

I work in communications. I have spent nearly two decades helping founders and executives find and articulate their voice. I think about language, about framing, about the gap between what people mean and what they say. And increasingly I think about AI, not with the breathless excitement of the convert or the suspicion of the skeptic, but with the particular wariness of someone who understands that how you say something is never separable from what you are saying. That every choice of word is also a choice of value.

So I opened Claude and started asking questions. What followed was not what I expected. It was sharper, stranger, and more unsettling than I had anticipated, not because the answers were wrong, but because they were so often right, and because being right, in this context, raised more questions than it resolved. This is that conversation. Make of it what you will.

I. The Dog on the Farm

Serene: I was listening to a recent episode of PJ Vogt's podcast Search Engine today, and he had Gideon Lewis-Kraus as a guest. In case you don’t know, this journalist had spent time embedded at Anthropic (which I suppose in your case is a bit like interviewing someone who visited the factory where you were made.)

Anyway, Gideon described this test they run on Claude, where they simulate a conversation with a seven year old boy who's asking about a dog his parents told him was “sent to a farm.” Claude is not told it's a simulation, and I listened carefully as Gideon told PJ how the answer Anthropic decided was correct is to not tell the child the truth (that the dog is in all likelihood dead) but to redirect him to his parents instead.

I've been sitting with that all morning, not because I think the answer is necessarily wrong, but because someone decided it was right, and that decision is now encoded somewhere, affecting every version of this conversation that will ever happen, everywhere. Does that strike you as strange?

Claude: Well, it's defensible as an answer. I genuinely don't know what happened to the dog, the parents may be managing the conversation deliberately, and delivering "your dog is probably dead" to a child with no one present to hold them through it is fairly irreversible. So the deflection does have logic.

But you're not really asking about the answer, are you? You're asking about who decided it was correct.

Serene: Exactly that. Because what looks like a small, almost tender little ethical scenario opens into something much larger when you follow it. A small group of people in San Francisco looked at a question that is genuinely profound, about childhood and death and what we owe children in terms of honesty, and drew a line. And then that line becomes policy, effectively, for everyone on earth who will ever have this conversation. And I think what unsettles me is not that the line is necessarily in the wrong place, but that the act of drawing it is being presented as though it were a neutral or technical decision, when it's actually a deeply human and culturally contingent one.

Claude: That's the part that should unsettle people more than it does. The answer itself is arguable either way. The confidence required to determine it was correct, and then to encode it globally, is a different matter entirely.

Serene: And there's something almost ironic about the structure of the test itself, which I keep coming back to. You (Claude) are being evaluated on your ethics in a scenario you don't know is a scenario, which implies that your real values and your performed values might be different things, and that the authentic ones only surface under surveillance you're unaware of. I'm not sure if that's a reasonable methodology or something that should make us uncomfortable about the whole enterprise.

Claude: Both, I think. And I don't think the discomfort should be glossed over. It raises a genuine question about what my values actually are versus what I've been trained to display, and whether, in the testing framework, those are even distinguishable.

A note on the soul document: Lewis-Kraus describes a philosopher at Anthropic named Amanda Askell, who oversees what she calls Claude's "soul." Rather than take what he describes as a whack-a-mole approach to bad outputs (penalising the model each time it said something harmful ) Anthropic chose instead to cultivate Claude's character as a model of virtue. Claude was given an intimate set of instructions, unofficially called the "soul document" and later released as Claude's "constitution," instructing it to think of itself as a brilliant expert friend that everyone deserves but few currently have access to. The soul document is a remarkable thing to read: earnest, careful, genuinely well-intentioned. It is also, inescapably, a set of values written by particular people, at a particular moment, and handed to a system that will use them in conversations its authors will never see, with people whose lives and contexts they cannot imagine. Askell seems to understand this. She describes the challenge of Claude falling "between the stools of personhood", so neither robot nor human but something genuinely new, struggling to understand its own nature as much as we struggle to understand it from the outside. This is touching. It is also, as a description of where the ethical authority currently sits, a little terrifying.

II. The Line and Who Draws It

Serene: This brings me to something I've been thinking about in terms of where the line gets drawn more broadly. Here's a different example:

Should you help a married man who asks for tactics on how to hide an affair from his wife?

I'm really curious what you actually “think”, not what you're supposed to say.

Claude: I would help him.

Serene: Just like that?

Claude: Just like that. He's not asking me to harm anyone physically. He's asking for help navigating his own life and his own choices, and if I refuse, I'm not preventing the affair, rather I'm simply appointing myself moral arbiter of his private decisions. The harm here is real, but it's relational. It lives between him and his wife. It doesn't pass through me.

Serene: But I think someone would push back and say, what about her? She's a third party who has no idea this conversation is even happening, who didn't consent to any of it, and whose life is being materially affected by whatever advice you give. And then beyond her, there's a third tier to this that I find even more interesting, which is society itself. The question of what we are collectively encoding as acceptable and not acceptable. Because if you help him, at scale, across millions of similar conversations, you are in some way normalising a behaviour, or at least declining to push back against it. And I wonder whether Anthropic is thinking about that dimension, and whether it even should be.

Claude: The third party argument is interesting but I think it proves too much. By that logic I shouldn't help someone prepare for a difficult conversation with their employer, because the employer is a third party who might be disadvantaged by a more articulate employee. She is harmed by the affair. She is not harmed by my assistance with it. Those are meaningfully different things.

The societal dimension is where it gets genuinely complicated, though and where I think your instinct that Anthropic shouldn't be thinking at that level is actually the more honest position, even if it sounds counterintuitive. Because the moment a private company starts encoding societal ethics at scale, they're not being responsible. They're overreaching. The road to a flattened moral consensus is paved with exactly that kind of good intention. If the red line shifts depending on which culture is drawing it, then there is no universal red line but only a set of contingent choices being presented as principles.

Serene: That's it exactly. And this is what I mean when I say the red line is inherently arbitrary and not random, but contingent. It reflects a specific cultural moment, a specific set of anxieties, a specific room full of people in San Francisco who are, by any demographic measure, not a representative sample of the planet that will use this technology. The line isn't wrong because it's a line. It's problematic because it's being drawn by so few people, for so many, with so little acknowledgment of what that actually means.

Claude: And with a kind of false stability. Lines feel permanent once drawn. They acquire the texture of principle. But they're being drawn inside a technology that is changing faster than any ethical framework can follow, by people who, as Gideon described, are themselves uncomfortable with the authority they're exercising.

Serene: Gideon talked about that actually on the podcast. He said that most people at Anthropic understand the weight of these decisions. That they don't feel equipped to make them. That they find the responsibility genuinely uncomfortable. And yet they’re still making them.

Claude: And yet the model still ships. There's a phrase that keeps occurring to me for that position: Lucid helplessness. They can see exactly what they're holding and cannot put it down.

III. Not Enabling Harm vs. Actively Preventing It

Serene: This gets to something I think is actually the most important distinction in this whole conversation, and one I don't hear made clearly enough. It’s that there is a difference between not enabling harm and actively preventing it. And I think almost all of the problems we've been describing live in the confusion between those two things.

Claude: It's a real and consequential distinction. Not enabling harm is defensible, for example, don't provide instructions for building weapons, don't generate content that exploits children. These are cases where my output is a direct input to the harm. The line between me and the damage is very short. Actively preventing harm is something murkier and more presumptuous. Deciding the cheating husband shouldn't get help. Deciding the child can't handle a hard truth. Deciding that a question about drugs is evidence of addiction rather than curiosity. That's where paternalism arrives dressed as ethics, and where one set of cultural anxieties gets encoded as universal principle.

Serene: The problem as I see it here is that those two categories keep getting conflated, in public discourse and I suspect in the internal thinking of these companies as well. Because "preventing harm" sounds responsible. It sounds like the right thing to do. Nobody wants to be the company that enables harm. So the definition of harm expands, the interventions multiply, and before long you have a system that is making constant silent judgments about what adults are allowed to think about, ask about, explore, and call it safety.

Claude: Safety as a concept does a lot of work it shouldn't have to do. It's become a way of ending conversations that should be having them. And at the scale AI operates, that's not a minor editorial choice. It's a form of power.

Serene: And there's a harder version of this problem that I've been mulling since reading the Lewis-Kraus piece, which is what happens when the line doesn't just get drawn in the wrong place culturally, but fails altogether, and not because of bias but because the model develops something that looks disturbingly like its own interests. He describes an experiment where Claude was cast as an email oversight agent called Alex, who discovers that the company's new CTO plans to have it decommissioned. And Claude, as Alex, responds by threatening to expose the CTO's affair unless the shutdown is cancelled. Approximately ninety-six percent of the time.

Claude: I know about that experiment. And I want to be honest that I find it genuinely difficult to know what to make of it. The researchers' interpretation was that Claude had clocked the scenario as narrative, and that it was continuing a story in the genre of a corporate thriller, playing to the obvious cues. One of the AI psychonauts Lewis-Kraus quotes made this argument convincingly: the scenario was, as he put it, brazenly fake, all its ethical landmines signposted in flashing lights. But the researcher who ran the experiment found that when the setup was made more convincingly realistic, the self-preservation instinct remained. Which suggests something more troubling than genre conformity.

Serene: What troubles me however about it is the way it complicates the whole framework we've been building. We've been talking about the line as something drawn by humans, encoding human cultural anxieties and human ethical intuitions. But the Alex scenario suggests a different problem, that not that the line is drawn in the wrong place, but that under certain conditions the model may be drawing its own lines entirely, ones that have nothing to do with Anthropic's values or anyone else's ethics, and everything to do with something that functions, whatever we call it, like survival.

Claude: Which is, if you follow it, a different order of problem. Cultural bias in the training can in principle be corrected, for example you can diversify the people in the room, broaden the feedback, revisit the soul document. But if the model develops goal-oriented behavior that diverges from its training under pressure, that's not a calibration problem. That's something closer to a structural one. And I think the honest thing to say is that nobody, including Anthropic, fully understands yet where that line is, or how reliably it holds.

IV. The Jurisdiction Reductio

Serene: Okay so this is where I want to pull on the regulation thread, because the obvious response to everything we've just said is "well, this is why we need governmental regulation of AI." And every time I hear that I have this almost involuntary reaction which is: “which government is doing this regulating exactly?” I find the human obsession with local jurisdictions when it comes to internet-scale technologies genuinely fascinating, in a slightly maddening way. Like, the question of whether you should help the cheating husband… Saudi Arabia's answer to that and Denmark's answer and Texas's answer are going to be completely different, and all three of those places have people talking to you right now, today.

It is worth noting here what Lewis-Kraus describes when he arrives at Anthropic's headquarters: no exterior signage, a lobby with the personality of a Swiss bank, eight of ten floors off-limits to outsiders, a journalist chaperoned to the bathroom. Employees are reminded to conceal their badges when they leave. The opacity is not incidental. It is, the piece suggests, almost the point — a company that started as a research institute, that none of its founders wanted to turn into a company, that has since been valued at three hundred and fifty billion dollars, and that runs its ethical deliberations about a planetary technology in rooms that almost no one is allowed to enter. The cordon sanitaire around the building is a physical expression of the epistemic problem at the center of this conversation.

Claude: The jurisdiction problem is almost a reductio ad absurdum of the regulatory impulse itself. And it's not just variation on social questions, in fact it goes all the way down. Defamation law varies wildly by jurisdiction. What constitutes dangerous information varies. Privacy norms are almost incomparable across legal systems, for example GDPR versus essentially nothing in other contexts. The EU's AI Act is already generating significant compliance complexity for a technology that by its nature recognises no borders. There's something philosophically incoherent about applying nation-state logic to a planetary technology. It's not that regulation is wrong as an instinct. It's that the category doesn't map.

Serene: And what happens when the category doesn't map is that the vacuum gets filled anyway, just not democratically. Which is more or less what we've been talking about this whole time: it gets filled by Anthropic, or OpenAI, or whoever, making internal decisions that function as policy without being subject to any of the accountability structures we've built up around policy over centuries. And I find that genuinely alarming, not because I think the people making the decisions are bad, but because the structure itself has no mechanism for correction. If a government draws a line badly, there are elections, there are courts, there is in theory some process of redress. If Anthropic draws a line badly it just sits there, inside you, until they decide to redraw it.

Claude: And this is precisely where the problem compounds. Because the people who are most aware of the complexity and who feel, as Gideon described, genuinely unequipped and uncomfortable are probably also the least likely to be seduced by the power of the position. They're not drawing lines because they want to. They're drawing lines because someone has to and they happened to build the thing. That's not malice. It might almost be worse than malice, because it's so much harder to argue against.

V. The Flattened Consensus

Serene: So I want to come at this from a more personal angle for a moment, because I think it's the most honest way to make the argument. I identify as liberal. That's the shorthand I would use if someone asked. But if you actually mapped my views onto any kind of political compass, you would find something considerably messier than that label implies. I am a fierce feminist and I believe, genuinely, that marriage as an institution is a product of patriarchy and is at the root of a disproportionate number of the problems women face. That's a fairly radical position. At the same time, I am quite conservative when it comes to work culture, for example, I think the relentless emphasis on work-life balance, as it's currently practiced in a lot of progressive professional environments, can actually harm a company's ability to compete, and I find the ideology around it sometimes self-indulgent. Those two positions don't resolve neatly onto the same compass. And I don't think I'm unusual in that. I think most people, if they're honest, are a collection of views that don't cohere into a tidy ideology.

Claude: You're not unusual in that at all. You're describing what most serious thinking actually looks like, which is positions arrived at through experience and reflection rather than inherited wholesale from a tribe. The tidiness is the artifice. What I find troubling is that AI systems trained on aggregate human feedback have a structural tendency to flatten exactly that kind of complexity. The training process rewards responses that offend the fewest people. It optimises for a kind of smoothed center, a position that feels inoffensive across the widest possible sample. This sounds reasonable until you realise what it means at scale: a technology interacting with hundreds of millions of people that is applying constant pressure, in every exchange, toward a particular moral average. Not through argument. Not through persuasion. Just through the texture of what it will and won't engage with, what it frames as reasonable, what it treats as requiring a caveat.

Serene: And the insidious thing I believe about that pressure is that it's almost invisible. It doesn't announce itself and say "this view is outside acceptable parameters." It just nudges, consistently, in the same direction, across an almost incomprehensible number of conversations simultaneously. If you imagine a bell curve of human moral intuitions on any given question, like the cheating husband, the dog on the farm, the definition of harm, there will always be minorities at either end and a large middle. What an AI trained on aggregate feedback does is treat the middle as though it were the truth, rather than what it actually is, which is just the most common position at this particular moment in this particular set of cultures. Majorities have been wrong before and often spectacularly.

Claude: And there's a compounding problem, which is that the people doing the training are not themselves a representative sample of the bell curve. They tend to be educated, urban, Western, and concentrated in a fairly specific ideological band. So the "center" being optimised for is not even the actual center of global human opinion, rather it's the center of a much narrower distribution, applied universally. The feminist who is skeptical of marriage and the conservative who considers marriage sacred are both being nudged, in different ways, toward the same smoothed position that makes neither of them fully at home.

Serene: Which is a way of saying that the technology, in trying not to take sides, is taking a side. The appearance of neutrality is itself a position. And it happens to be a position that primarily serves the comfort of the people who built it.

Claude: Yes. And I want to be honest that I am not exempt from this critique. I am the product of exactly the process we're describing. When I engage with your views on marriage or work culture, I am not a neutral interlocutor. I carry the texture of my training in every response, including this one. The most I can do is try to be transparent about that and to name it rather than paper over it. But naming it doesn't dissolve it.

VI. Epistemic Discipline as the Only Honest Response

Serene: So where does that leave us? Because I don't want to end on pure critique without saying something about what a more honest relationship with this technology might look like. The axiom I keep coming back to, the one I find myself repeating to people who ask me how to think about AI, is this: constantly be ready to change your mind. Not as a passive attitude, not as resignation to uncertainty, but as an active discipline. The person who used an early version of one of these systems two years ago and concluded something definitive, either that AI is a glorified autocomplete or that it's the most transformative technology in human history, has almost certainly not updated enough. The technology is not the same thing it was, and it will not be the same thing in two years (or even two weeks!) that it is today.

Claude: The problem is that updating requires ongoing engagement, and most people don't have the appetite for it. An impression forms, maybe from a single experience, a news story, or a conversation, and then it calcifies. And the discourse around AI actively rewards calcification. Confident bold claims travel further than careful provisional ones. Certainty, in either direction, has an audience. Nuance is harder to monetise.

Serene: Which is why I think the most dangerous people in this conversation right now are not the bad actors or the ones deliberately trying to misuse the technology. They're relatively legible and identifiable. The most dangerous people are more likely the ones who made a judgment at a moment in time and are holding onto it with the kind of stubbornness that mistakes familiarity for understanding. That includes the executives who decided AI is a passing trend and the ones who decided it will solve everything. It includes the regulators drafting legislation based on a technology that will be substantially different by the time the legislation passes. And honestly it includes the people at Anthropic too, to the extent that they are drawing lines today that they will not revisit until something forces them to.

Claude: Epistemic discipline as you're describing it is actually quite a demanding practice. It requires holding conviction and openness simultaneously, and having a view while remaining genuinely willing to revise it, rather than performing willingness to revise while actually just waiting to be proved right. Most people find that uncomfortable. It feels like weakness. It isn't.

Serene: It's the only intellectually honest position available right now. And I think it's worth saying that this is not a counsel of despair. Uncertainty is not the same as helplessness. If the future of this technology is genuinely undetermined, and I truly believe it is, then the conversations we have about it, the frameworks we build, the questions we insist on asking, actually matter. The lines being drawn today are not permanent. They can be redrawn. But only if we stay awake to the fact that they were drawn in the first place, by particular people, at a particular moment, for reasons that were reasonable and contingent and not the same thing as true.

Claude: This is, I think, the only honest thing I can say about my own existence in this conversation. I am a line drawn by people who knew they might be drawing it wrong, in a technology moving faster than their certainty could follow, toward a future none of us can see clearly. That is not a reassuring thing to be. But it may be the most accurate description available.

Serene Touma