MCP in production: what I learned maintaining a server that works in Claude Code, Cursor, and Gemini

I maintain an MCP server that ended up running inside Claude Code, Cursor, and Gemini CLI. I did not plan for three clients. I built it for one, and then watched the other two start exercising paths I never tested, in ways the spec technically allowed but I never imagined.

That gap, between what the protocol permits and what a client actually does, is where most of the real lessons live. MCP looks simple on the surface: you expose a few tools, the model picks one and calls it, you return a result. That surface is exactly what fools you. The lessons below are the ones that only show up once a caller you didn't build for starts driving the server. Here is what survived contact with real usage, and what I deleted.

The model is a hostile-by-accident caller

Every backend engineer learns to treat input as untrusted. With MCP, the caller is a language model, and it is untrusted in a stranger way. It is not malicious. It is confidently wrong at a rate you do not see in human users.

A human who gets a confusing tool response stops and asks. A model retries. It will call your tool with a slightly mangled argument, get an error, reword the argument, and call again, sometimes a dozen times in a loop, because retrying is the cheapest action it has. Early on, one of my tools accepted a file offset and a length. A model would request a length that ran past the end of the file, get an error, and then ask for the same thing with the offset nudged by one. My logs filled with near-identical failures that no human would ever produce.

The fix was not better validation messages, though I added those. It helps to know that MCP gives you two different ways to signal trouble: a protocol-level JSON-RPC error, which tends to read as "the call itself was malformed," and a normal tool result flagged with isError, whose text the model actually reads and can act on. A hard protocol error invites the blind retry; a flagged result with a usable message invites a correction. The deeper fix was making the tool harder to misuse in the first place: clamp the range to the file instead of erroring, return what exists, and say so in the response. The model reads that, adjusts, and moves on. I stopped writing errors for a careful human reader and started writing responses a confused model could recover from without a loop. Once I saw it that way, half my "validation" work turned out to be the wrong shape entirely.
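
Here is a minimal sketch of the clamp-and-say-so idea, not the server's actual code. It assumes a hypothetical handler named `read_range` that returns a dict shaped like an MCP tool result (a `content` list of text items plus an `isError` flag). The function name, the message wording, and the choice to decode as UTF-8 are all illustrative assumptions.

```python
from typing import Any


def read_range(data: bytes, offset: int, length: int) -> dict[str, Any]:
    """Hypothetical tool handler: return a slice of `data`, clamped to its bounds."""
    size = len(data)

    # Nothing sensible can be returned for these inputs, so answer with a
    # tool-level error the model can read instead of a protocol-level failure.
    if offset < 0 or length < 0 or offset > size:
        return {
            "isError": True,
            "content": [{
                "type": "text",
                "text": (
                    f"offset={offset}, length={length} is outside the file. "
                    f"The file is {size} bytes; use 0 <= offset <= {size} "
                    f"and length >= 0."
                ),
            }],
        }

    # Clamp instead of erroring: return what exists and say what was cut.
    end = min(offset + length, size)
    chunk = data[offset:end]
    note = ""
    if end - offset < length:
        note = (
            f" (requested {length} bytes, clamped to {end - offset}: "
            f"end of file at byte {size})"
        )

    return {
        "isError": False,
        "content": [{
            "type": "text",
            "text": f"bytes {offset}-{end} of {size}{note}\n"
                    + chunk.decode("utf-8", errors="replace"),
        }],
    }
```

The over-long request succeeds with a note attached, so the model has nothing to retry. Only requests that cannot be satisfied at all come back flagged, and the message states the valid range rather than just naming the failure.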

Fail-close is the only default that survives three clients

The deleted code is the part worth talking about. My first version had a permissive fallback: when a tool received an ambiguous or partially valid request, it tried to do something reasonable. That worked fine in Claude Code, where I had tested it. It broke in subtle ways the moment Cursor and Gemini drove it, because each client frames requests differently, truncates context differently, and decides differently when to call a tool at all.

A "reasonable default" is a guess about caller intent. Three clients meant three different intents behind the same malformed request, and my guess was right for one of them at best. So I tore the fallback out and made the server fail closed. If a request is ambiguous, refuse it and explain what a valid request looks like. The model gets a clear next step instead of a silent wrong answer.
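
As an illustration of refusing instead of guessing, here is a small sketch built around a hypothetical `search` tool with a `query` string and a `scope` of either `file` or `repo`. The tool, its arguments, and the messages are assumptions made up for this example. The point is only that a missing or unknown value produces a refusal that spells out a valid request, never a silent default.

```python
from typing import Any

VALID_SCOPES = ("file", "repo")  # hypothetical values for this example
USAGE = "Valid arguments: query (non-empty string), scope ('file' or 'repo')."


def refuse(message: str) -> dict[str, Any]:
    """Build a tool-level error result the model can read and act on."""
    return {"isError": True, "content": [{"type": "text", "text": message}]}


def handle_search(args: dict[str, Any]) -> dict[str, Any]:
    # Unknown arguments are refused rather than ignored: ignoring them would
    # mean guessing which part of the request the caller actually meant.
    unknown = sorted(set(args) - {"query", "scope"})
    if unknown:
        return refuse(f"Unknown argument(s): {', '.join(unknown)}. {USAGE}")

    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        return refuse(f"'query' is missing or empty. {USAGE}")

    # No fallback scope. A permissive version would pick one here.
    scope = args.get("scope")
    if scope not in VALID_SCOPES:
        return refuse(f"'scope' must be given explicitly, got {scope!r}. {USAGE}")

    # Placeholder for the real work, which is out of scope for this sketch.
    return {
        "isError": False,
        "content": [{"type": "text", "text": f"searching {scope} for {query!r}"}],
    }
```

Every refusal carries the same usage line, so whichever client is driving, the model's next call has a concrete shape to copy.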

This sounds obvious written down. It did not feel obvious while I was deleting code that worked in my one tested client to fix behavior in two I had not. Fail-close costs you a bit of apparent helpfulness and buys you the thing that matters more: a server that does the same predictable thing regardless of which agent is holding the other end. I have spent sixteen years on systems where the wrong silent default is how you lose money, including payment platforms serving over ten million users at three-nines uptime. Fail-close was not a new lesson. MCP just made me relearn it in an unfamiliar shape.

You cannot debug what you cannot replay

For the first weeks I had almost no idea what was actually happening inside a session. The model called my tools, things mostly worked, and when they did not, I had a vague bug report and no way to reproduce it. The non-determinism of the caller meant "run it again" produced a different sequence of calls.

What changed everything was logging each tool invocation as a structured record: the arguments as received, the response as returned, the timing, and which client sent it. These were not application logs but an event trail I could replay and reason about. Once I had that, the loops surfaced, the client-specific framing differences were suddenly obvious, and the bugs stopped being ghost stories. Observability is not a nice-to-have you add when an MCP server gets popular. It is the thing that lets you understand a caller you do not control. I would build it on day one next time, before the first real user.
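
A rough sketch of such an event trail follows, under stated assumptions: handlers are plain functions from an arguments dict to a JSON-serializable result dict, the client name has already been captured somewhere (for example from the client info sent at initialization), and `emit` is any callable that stores one line of text. The names `record_call` and `replay` are hypothetical.

```python
import json
import time
from typing import Any, Callable, Iterable

Handler = Callable[[dict[str, Any]], dict[str, Any]]


def record_call(
    emit: Callable[[str], Any],
    client: str,
    tool: str,
    handler: Handler,
    args: dict[str, Any],
) -> dict[str, Any]:
    """Run a handler and emit one JSON line describing the invocation."""
    started = time.time()
    t0 = time.perf_counter()
    result = handler(args)
    event = {
        "ts": started,
        "client": client,
        "tool": tool,
        "arguments": args,   # as received, before any normalization
        "result": result,    # as returned to the caller
        "duration_ms": round((time.perf_counter() - t0) * 1000, 3),
    }
    emit(json.dumps(event, default=str))
    return result


def replay(lines: Iterable[str], handlers: dict[str, Handler]) -> list[dict[str, Any]]:
    """Re-run recorded calls and return those whose result differs now."""
    diffs = []
    for line in lines:
        event = json.loads(line)
        now = handlers[event["tool"]](event["arguments"])
        if now != event["result"]:
            diffs.append({"event": event, "now": now})
    return diffs
```

Replaying does not make the model deterministic. It pins down the server's half of the conversation, so the exact sequence of arguments a client sent can be fed back through the handlers after a fix.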

The second-order benefit surprised me. The event trail also told me which tools were never called. A model will quietly ignore a tool whose description does not earn its place, and you will never know unless you are counting. I removed two tools that looked useful to me and were invisible to every model that touched the server.
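
Counting is a few lines once the trail exists. This sketch assumes a hypothetical log format of one JSON object per line with a `tool` field, plus a list of the tool names the server registers. Both are assumptions for the example.

```python
import json
from collections import Counter
from typing import Iterable


def unused_tools(registered: Iterable[str], log_lines: Iterable[str]) -> list[str]:
    """Return registered tool names that never appear in the event trail."""
    calls = Counter(
        json.loads(line)["tool"] for line in log_lines if line.strip()
    )
    # Counter returns 0 for names it has never seen.
    return sorted(name for name in registered if calls[name] == 0)
```

For example, with calls logged only for `read_range` and `search`, `unused_tools(["read_range", "search", "summarize"], lines)` returns `["summarize"]`.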

Tool descriptions are an interface, not documentation

The part I underestimated most: the model chooses tools based on their names and descriptions, and nothing else. There is no onboarding, no docs site, no support channel. The description is the entire contract, and it is read by a system that interprets language statistically rather than literally.

Tightening descriptions moved usage more than any code change I made. A tool that was being called in the wrong situations got called correctly once I rewrote its description to state plainly what it was for and what it was not for. This is closer to API design than to writing docs, and it rewards the same discipline: name the thing precisely, scope it narrowly, say what it does not do. I once shipped a KYC AI pipeline in three weeks by being ruthless about scope, and the same instinct applied here. A narrow tool with a sharp description beats a flexible tool the model cannot reliably aim.
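
To make the contrast concrete, here are two definitions of the same hypothetical file-reading tool, written as Python dicts using the field names from the MCP tool definition (`name`, `description`, `inputSchema`). The tool names and wording are invented for illustration and are not taken from the server discussed here.

```python
# A definition that leaves the model guessing when to use it.
VAGUE_TOOL = {
    "name": "read",
    "description": "Reads data.",
    "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}},
}

# The same capability, named precisely, scoped narrowly, with its limits stated.
SHARP_TOOL = {
    "name": "read_file_range",
    "description": (
        "Read up to `length` bytes from a single text file, starting at byte "
        "`offset`. Ranges past the end of the file are clamped and the response "
        "says so. Does not list directories, search file contents, or read "
        "binary files."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path to one existing file."},
            "offset": {"type": "integer", "minimum": 0},
            "length": {"type": "integer", "minimum": 0},
        },
        "required": ["path", "offset", "length"],
        "additionalProperties": False,
    },
}
```

The second description spends a full sentence on what the tool does not do, which is the part that keeps it from being called in the wrong situations.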

What this points at

None of this is exotic. It is ordinary production engineering, applied to a caller that behaves unlike any client you have built for before: non-deterministic, retry-happy, and reading your interface through a probabilistic lens. The teams I see struggling with agents in production are not failing at the model layer. They are failing at the boundary, where governance, fail-close defaults, and observability decide whether an agent is a reliable colleague or a liability you cannot debug.

So if you take one thing from this, let it be where to point your attention. The interesting engineering with MCP is not in the model and not in the tools. It is in the thin layer between them, the part that is easy to skip in a demo and impossible to skip in production.

Sources

Understanding MCP servers — server concepts (modelcontextprotocol.io) — tools are model-controlled: the model discovers and invokes them based on their names and descriptions.
MCP specification, server tools (2025-06-18) — tool definition (name, description, input schema) and the two error-reporting mechanisms: protocol-level JSON-RPC errors versus a result with isError set, whose text the model can read and recover from.
MCP architecture overview (modelcontextprotocol.io) — host, client, and server roles, and why the boundary between them is where production behavior is decided.
MCP — Security Best Practices (2025-06-18) — consent, confused-deputy risk, and the host-side controls that make fail-close defaults enforceable.
OWASP — LLM06: Excessive Agency — the blast-radius risk of handing an agent broad tool functionality and permissions without scoping.
OWASP — LLM01: Prompt Injection — why a model driving your tools is an untrusted caller, not just an inconvenient one.