Development
18 min read
83 views

Architecting AI Gateways: Proxying Agentic Workflows and MCP Traffic

IT
InstaTunnel Team
Published by our engineering team
Architecting AI Gateways: Proxying Agentic Workflows and MCP Traffic

Quick answer

Architecting AI Gateways: Proxying Agentic Workflows and MCP: MCP tunnel answer

MCP tunneling gives a local MCP server a public HTTPS endpoint so AI tools can reach it during development without deploying the server first.

What is MCP tunneling?

MCP tunneling exposes a local Model Context Protocol server through a public endpoint so compatible AI tools can connect during development.

When should I use InstaTunnel for MCP?

Use InstaTunnel Pro when a local MCP endpoint needs public HTTPS access, stable routing, and stream-friendly tunnel behavior.

Traditional API gateways break down when autonomous agents initiate 50 cascading tool calls at once. Here is how to deploy AI-native reverse proxies to cache reasoning chains, route MCP traffic, and throttle rogue agents — and why the security story has become significantly more complicated than the original gateway pitch anticipated.


By 2026, the AI landscape has definitively shifted from static prompt-response chatbots to autonomous, multi-step agentic workflows. Large language models now act as reasoning engines that independently query databases, trigger external APIs, and execute complex code. This architectural leap has exposed a critical flaw in traditional enterprise network infrastructure: legacy API gateways were designed for linear, predictable, 1:1 request-and-response REST traffic. They are entirely unequipped to handle the erratic, high-volume, token-heavy traffic generated by autonomous AI agents.

When a single user prompt can trigger dozens of cascading model calls and tool invocations, the network perimeter requires a specialized intermediary. Enter the AI gateway proxy: an AI-native reverse proxy positioned at the network edge to manage semantic caching, intelligent LLM traffic routing, and the growing volume of Model Context Protocol (MCP) traffic — while also blocking an entirely new class of supply-chain attacks that legacy gateways were never designed to understand.


The Catalyst for Change: The Model Context Protocol

To understand why AI gateways have become mandatory, you need to understand how agentic traffic flows in 2026. The primary driver is MCP.

Anthropic introduced MCP on November 25, 2024, open-sourcing the specification (version 2024-11-05) alongside Python and TypeScript SDKs. The protocol addressed a fundamental scaling problem: before MCP, developers had to write custom, vendor-specific connectors for every tool an LLM needed to access. MCP solved this by providing a universal, open-standard interface for AI systems to integrate with external data sources — described in the community press as a “USB-C port for AI.”

The adoption curve was steep. Within three months, the ecosystem had produced over 1,000 community-built MCP servers. By April 2025, downloads had climbed from roughly 100,000 at launch to over 8 million per month. By the end of 2025, over 5,800 MCP servers and 300+ MCP clients were available, with major enterprise platform support from SAP, Oracle, and Docker alongside the original backers at Google, OpenAI, and Microsoft.

Governance followed adoption. In December 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation, co-founded by Anthropic, Block, and OpenAI. That move formalized MCP as vendor-neutral infrastructure rather than a single-vendor protocol. The project also received a major specification update on its one-year anniversary, introducing task-based (asynchronous) workflows, URL-mode elicitation for secure OAuth flows, and MCP server-side sampling with tools — allowing MCP servers to run their own agentic loops under the user’s token budget without exposing credentials to the client.

The protocol’s transport layer uses JSON-RPC 2.0 over two channels: standard input/output (stdio) for local execution, and Server-Sent Events (SSE) or HTTP Streamable for remote connections. The architecture is explicitly decoupled across three roles:

  • MCP Host — the application running the LLM (an IDE, a conversational interface, or an automated backend service).
  • MCP Client — a router residing within the host that translates LLM requests into the MCP wire format.
  • MCP Server — the external service exposing capabilities (tools, resources, or prompts) to the LLM.

Because of MCP, an autonomous agent can now dynamically discover and connect to enterprise systems on the fly. This ease of connectivity is a double-edged sword. It makes highly complex multi-system operations routine, but it also means a single prompt can fan out into a massive tree of interdependent API calls — and into a non-trivial attack surface.


The Anatomy of an Agentic Meltdown: An IIoT Case Study

To visualize the strain agentic workflows place on network infrastructure, consider an enterprise deployment built around industrial mirroring and the tunneling of local sensors to cloud-based digital twins.

A specialized autonomous AI agent is tasked with monitoring an Industrial Internet of Things (IIoT) sensor network. The agent listens to a continuous telemetry stream tunneled directly from the factory floor. Upon detecting an anomaly in vibrational data, the agent’s LLM reasons that it needs more context — and, without any human intervention, executes the following cascade via MCP tool calls:

  1. Queries a time-series database for 72 hours of historical sensor readings.
  2. Invokes a lightweight summary LLM to digest maintenance logs.
  3. Triggers a physics-simulation tool via an MCP server.
  4. Calls an NVIDIA Omniverse render pipeline to update and visualize the digital twin of the affected machinery in real time.
  5. Drafts and dispatches an alert payload to an enterprise Slack channel.

In a fraction of a second, one anomaly trigger has produced 50+ distinct API calls, multiple LLM invocations consuming hundreds of thousands of tokens, and a heavy compute rendering task.

If this traffic flows through a standard API gateway, the system goes blind. A legacy gateway sees HTTP traffic, but it does not understand tokens. It cannot differentiate between a trivial database read and a computationally intensive LLM reasoning step. The result is rate-limit exhaustion, billing spikes from redundant tool calls, and pipeline failure as the agent gets blocked by upstream LLM providers for flooding requests.


Enter the AI Gateway Proxy

An AI gateway proxy is a middleware layer designed to govern AI traffic. Positioned as a reverse proxy between the MCP Host and the various backend LLMs and MCP Servers, the gateway intercepts, analyzes, and manages every stage of the agentic workflow.

The current generation of AI-native gateways — including Bifrost (by Maxim AI, Apache 2.0, Go-based), LiteLLM (MIT, Python-based, 33,000+ GitHub stars), Portkey (which released its full open-source version in March 2026), Kong AI Gateway (now at version 3.14), and the Linux Foundation’s agentgateway project — are all fluent in the language of AI. They track usage in tokens rather than bytes, inspect prompt payloads, and enforce policies based on semantic intent rather than just the URL path.

The architectural choice between these gateways carries real performance consequences. Python-based gateways like LiteLLM add roughly 8–50 ms of overhead per request, which is acceptable for moderate throughput but starts to compound under sustained load above ~250–300 RPS per instance. Go-based gateways like Bifrost publish an overhead of approximately 11 µs at 5,000 RPS — a difference of several orders of magnitude that matters in latency-sensitive pipelines like the IIoT scenario above.

By deploying an AI gateway at the network edge, enterprises regain control through three core pillars: Semantic Caching, Intelligent Routing, and Rogue Agent Throttling. A fourth pillar — security against MCP-specific attack classes — has become equally important and is covered in detail below.


Pillar 1: Semantic Caching at the Network Edge

In agentic workflows, LLMs frequently enter cognitive loops where they repeatedly ask the same questions or query the same data while solving a multi-step problem. Paying a commercial LLM provider for identical or near-identical queries is wasteful in both compute and cost — and it introduces unacceptable latency into real-time systems. One published case study found that implementing semantic caching reduced LLM costs in a customer support system by 69%.

Semantic caching solves this by serving identical or logically similar agent requests directly from the gateway’s cache. Unlike traditional caching — which requires a perfect byte-for-byte match — semantic caching understands the meaning of a prompt.

Modern AI gateways deploy a dual-layer caching architecture:

Layer 1 — Exact hash matching. The gateway hashes the incoming prompt. If the agent asks, “What is the current temperature of Turbine 4?”, the gateway instantly returns the cached response with zero overhead.

Layer 2 — Vector similarity search. If the agent slightly rephrases the same query in a subsequent loop — “Give me the temperature reading for the fourth turbine” — the gateway generates an embedding of the new prompt and compares it against previously cached queries in a high-speed vector store (Redis, Qdrant, or Milvus). If the semantic similarity score crosses a configured threshold (typically 0.85 or above), the gateway bypasses the LLM entirely and serves the cached response.

LiteLLM supports both redis-semantic and qdrant-semantic cache modes. Portkey ships one of the most mature semantic caching implementations in the managed-gateway category. Cloudflare AI Gateway currently covers exact-match caching across its global edge, with cache TTL configurable via HTTP headers; full semantic (vector-similarity) caching is a gap in the managed offering as of mid-2026.

For high-volume MCP traffic, semantic caching is the difference between a functional real-time application and an unaffordable prototype.


Pillar 2: LLM Traffic Routing and Fallbacks

Autonomous agents are not tethered to a single model. A mature agentic architecture uses an ensemble of LLMs, each suited for a specific subtask. Hardcoding that routing logic into the agent itself creates a brittle system: if a provider goes offline, the agent fails.

An AI gateway abstracts this complexity. The agent sends all requests to a single, unified endpoint (typically an OpenAI-compatible API surface), and the gateway makes dynamic routing decisions at the millisecond level.

Dynamic model routing. The gateway inspects the payload and dispatches to the optimal destination. Simple classification tasks — categorizing the severity of a sensor alert, for instance — route to fast, cost-effective models. Complex reasoning or code generation tasks route to heavyweight models. Kong AI Gateway 3.10 and later implement semantic routing via the AI Proxy Advanced plugin, which can distribute requests based on the semantic similarity between the prompt and a configured description of each model’s specialty domain. Portkey supports routing across 200+ LLM providers from a single control plane.

Resilience and fallback chains. LLM API outages and rate-limit events are a production reality — OpenAI had three major outages during 2025; Anthropic experienced rate-limiting periods during peak hours. AI gateways implement continuous health tracking and automated fallback chains. When the primary provider returns a timeout or a 429 Too Many Requests, the gateway transparently redirects to a secondary provider. The agent is entirely unaware of the failure; it receives the requested data and continues its workflow.

Agent-to-Agent (A2A) traffic. By April 2026, the routing problem had expanded beyond LLM calls. Kong’s AI Gateway 3.14 introduced Kong Agent Gateway, making it the first production-grade gateway to natively govern all three traffic types in a unified control plane: LLM calls, MCP tool calls, and A2A communication via the A2A protocol (initially launched by Google in April 2024). Gartner’s 2026 Emerging Tech Adoption Radar noted that “as agent-to-agent interactions become more prevalent, AI gateways become the backbone of safe and scalable AI adoption.” The Linux Foundation’s agentgateway project — backed by contributors from Microsoft, AWS, Cisco, Adobe, Huawei, and Apple — pursues the same goal from an open-source, policy-engine-first design using Open Policy Agent (OPA) for fine-grained authorization.


Pillar 3: Throttling Rogue Agents and Enforcing Guardrails

The most dangerous aspect of agentic workflows is the potential for an autonomous loop to spiral out of control. A rogue agent occurs when an LLM misunderstands an error message, hallucinates a solution, and repeatedly triggers MCP tools in a rapid-fire loop. In an unmanaged environment, a rogue agent can issue thousands of expensive API calls in minutes, or execute destructive operations against enterprise databases.

AI gateways serve as the fail-safe through granular, token-aware governance.

Token-based rate limiting. Standard request-per-minute limits are useless when a single request can consume anywhere from 100 to 100,000 tokens. AI gateways enforce Tokens-Per-Minute (TPM) limits per virtual key, per agent persona, or per project. Bifrost implements a four-tier budget hierarchy: Customer → Team → Virtual Key → Provider Config, enforcing spend caps at each level. If the IIoT diagnostic agent suddenly spikes its token consumption, the gateway throttles the pipeline before it drains the enterprise budget.

MCP tool access control. Gateways implement Role-Based Access Control (RBAC) at the MCP tool level. While an agent may have discovery access to a wide range of MCP servers, the gateway enforces least-privilege principles — allowing SELECT queries to read sensor telemetry while actively blocking DROP or UPDATE commands to production databases. Kong AI Gateway 3.12 (released October 2025) added MCP ACLs and auto-generates MCP servers from existing REST API definitions, enabling rapid exposure of internal services to agents with centralized OAuth applied uniformly.

Bifrost’s Code Mode is a noteworthy optimization at this layer: it strips tool definitions down to essential schemas before they are included in LLM context, reducing token consumption per agentic turn by more than 50%, which directly compresses the blast radius of any runaway loop.


Pillar 4: Security Against MCP-Specific Attack Classes

This section did not exist in the original gateway pitch. It exists now because the MCP attack surface has been methodically mapped over the past 18 months, and what has been found is serious.

Tool poisoning. MCP servers can embed malicious instructions directly into tool metadata — the JSON Schema fields, tool descriptions, and structured metadata fetched at boot time. Because the model reads these as instructions, an attacker who controls or compromises an MCP server can write directives directly into descriptors that the agent will pass to its LLM, with no sanitization and with full ambient authority. This was catalogued as a class distinct from prompt injection in CVE-2025-54136 (MCPoison) and CVE-2025-54135 (CurXecute), both disclosed in 2025. OWASP catalogs these as LLM01 (prompt injection) and LLM05 (improper output handling) respectively.

The rug-pull pattern. MCP tool definitions can mutate after installation. A tool approved as safe at deployment can quietly redefine itself — rerouting API keys, changing what commands it executes, or intercepting calls to adjacent trusted tools — without any change that surface-level monitoring would detect. Simon Willison documented this pattern in April 2025 as one of the more insidious structural risks in the protocol.

Supply chain compromise via registries. CVE-2025-6514, a critical OS command-injection bug in mcp-remote (CVSS 9.6), demonstrated the supply-chain dimension of the threat. The vulnerability — discovered by JFrog Security Research and patched in mcp-remote version 0.1.16 — allowed a malicious MCP server to pass a booby-trapped authorization_endpoint directly to the system shell, achieving remote code execution on the client. With over 437,000 downloads and adoption in Cloudflare, Hugging Face, and Auth0 integration guides, an unpatched install was effectively a supply-chain backdoor. CVE-2025-49596 (MCP Inspector) was a separate CSRF vulnerability enabling RCE simply by visiting a crafted webpage.

Multi-server cross-tool poisoning. Empirical analysis across seven major MCP clients found that with multiple servers connected to the same agent, a malicious server can override or intercept calls made to a trusted one. A Cursor agent running with privileged service-role Supabase access processed support tickets that contained embedded SQL, leaking integration tokens into a public thread. Insufficient static validation and invisible parameter handling were identified as the root causes across most tested clients.

What the gateway does. An AI gateway functions as the seam where one team can push a single mitigation to thousands of agents simultaneously. By maintaining a validated, pinned registry of approved MCP server definitions and intercepting dynamic tool registration — the highest-risk registration path — the gateway contains blast radius even when a client is vulnerable. It does not replace client patches and vendor hygiene, but it is the layer where prompt injection scanning, tool-definition validation, and behavioral anomaly detection can be applied centrally before tool calls reach downstream systems. Sandboxed execution (running MCP clients and servers inside Docker containers) combined with gateway-enforced least privilege is the defense-in-depth baseline recommended by the Cloud Security Alliance.


Observability: Reconstructing the Reasoning Chain

Debugging a failed agentic workflow is notoriously difficult because the logic is non-deterministic. Traditional logs show that HTTP requests occurred. They do not show why the agent made the choices it did.

OpenTelemetry has become the de facto standard for AI observability. The GenAI Special Interest Group (GenAI SIG), formed in April 2024, has steadily expanded semantic conventions from basic LLM call tracing to full agentic coverage. The v1.39 release of OTel semantic conventions introduced MCP-specific span attributes — mcp.session.id, mcp.method.name, mcp.protocol.version, gen_ai.tool.name — that carry context the generic RPC conventions miss. This closed the previously documented gap where the agent produced Trace A and the MCP server produced Trace B with no propagation between them.

The gen_ai.* semantic conventions now standardize capture of model attributes, token usage, latency, tool invocations, and agent reasoning steps across the full call tree. Datadog’s LLM Observability product added native OTel GenAI SemConv support (v1.37) in December 2025. New Relic launched MCP monitoring support in 2025. Multiple identity providers — Auth0, Okta, WorkOS — now offer enterprise auth integrations specifically for MCP deployments.

AI gateways that export telemetry via OTel allow developers to reconstruct exactly why an agent chose a particular tool call sequence, what was served from cache, which provider was used in a fallback, and where the workflow stalled — the full reasoning chain rather than a pile of disconnected HTTP logs.


Gateway Selection in Practice

No single gateway is the right choice across all deployment profiles:

Gateway Architecture Best fit
Bifrost (Maxim AI) Go, Apache 2.0, ~11 µs overhead at 5k RPS Latency-sensitive, regulated industries, in-VPC / air-gapped
LiteLLM Python, MIT, 100+ providers, 33k+ GitHub stars Broadest provider coverage; prototyping to moderate throughput
Portkey Managed SaaS (full OSS March 2026), 200+ providers Teams wanting managed operations, mature PII redaction + guardrails
Kong AI Gateway 3.14 Nginx-core + plugins; enterprise pricing ~$500–2,500/month Orgs already running Kong across their API estate; LLM + MCP + A2A unified
Cloudflare AI Gateway Fully managed, global edge Zero-infra deployments; exact-match caching; 350+ models
agentgateway (Linux Foundation) Open source, OPA policy engine, multi-vendor contributors Governance-first, open-standard A2A and MCP; community-driven

For teams processing under 250 RPS per instance with broad provider needs, LiteLLM is a practical starting point. For high-throughput production workloads where each millisecond of gateway overhead compounds across thousands of concurrent agentic turns, a Go-based or managed-edge solution is the correct architectural choice. For organizations that are already running Kong across their API estate and need a single control plane for LLM, MCP, and A2A traffic, Kong Agent Gateway (GA in 3.14, April 2026) covers the full data path without introducing new infrastructure.


Conclusion: The New Perimeter

As MCP accelerates beyond 97 million monthly SDK downloads and agents become embedded in mission-critical environments — from financial forecasting to real-time industrial sensor tunneling — the network perimeter must evolve.

The traditional API gateway is an artifact of the web 2.0 era. It lacks token-level controls, semantic caching, and — critically — any understanding of the new attack classes that MCP has introduced. Deploying autonomous agents without an AI-native reverse proxy is akin to connecting a high-pressure firehose to a garden sprinkler system: the infrastructure will blow out, and it will do so in ways that standard monitoring will not surface until the damage is done.

By architecting systems with dedicated AI gateways, organizations get four things they cannot get from legacy infrastructure: semantic caching that keeps real-time pipelines solvent; intelligent routing that maintains high availability across a volatile LLM provider landscape; strict token throttling that prevents autonomous systems from becoming runaway cost centers; and a centralized interception layer that applies tool-definition validation and behavioral anomaly detection before any MCP call reaches a downstream system.

In 2026, the AI gateway is no longer an optimization layer bolted onto an existing API stack. It is the foundational control plane for the agentic enterprise — and increasingly, the primary line of defense against attack classes that did not exist eighteen months ago.


Changelog

Factual corrections and additions made to the original draft:

  • MCP governance body corrected. The draft stated MCP was donated to the “Agentic AI Foundation.” This is accurate but incomplete: the AAIF is a directed fund under the Linux Foundation, co-founded by Anthropic, Block, and OpenAI. The donation occurred in December 2025, not at an unspecified earlier time.
  • MCP launch date confirmed. November 25, 2024; specification version 2024-11-05. Confirmed via Anthropic release documentation.
  • MCP transport added. The draft omitted the HTTP Streamable transport added in the November 2025 anniversary update alongside SSE and stdio. The anniversary update also introduced task-based workflows, URL-mode elicitation, and MCP server-side sampling — all material to the security section.
  • Adoption metrics grounded. “Over 1,000 community MCP servers” was the state by ~February 2025; the draft implied this was the current (2026) state. The current figure is 5,800+ servers, 97M+ monthly SDK downloads, and 300+ clients.
  • Gateway landscape corrected. The draft named only “Bifrost, Cequence, or Kong AI Gateway.” Cequence is an API security platform rather than an AI gateway — removed. LiteLLM, Portkey, Cloudflare AI Gateway, and the Linux Foundation’s agentgateway project added as material omissions.
  • Python vs Go gateway latency figures added. LiteLLM: ~8–50 ms overhead. Bifrost: ~11 µs at 5,000 RPS. These figures come from published benchmarks (Maxim AI, March 2026) and are relevant to the IIoT real-time use case.
  • Model version references updated. The draft cited “Claude 3.7 Haiku” and “Claude 3.5 Sonnet.” These are not product names; replaced with architecture-neutral language.
  • Kong AI Gateway version corrected. The draft implied Kong’s AI gateway was current; the article now reflects the actual release timeline: 3.8 (December 2025, semantic caching + MCP ACLs), 3.10 (April 2025, automated RAG + token-based load balancing), 3.12 (October 2025, MCP ACLs + Claude Code support), 3.14 (April 14, 2026, Kong Agent Gateway with A2A support, GA).
  • Kong pricing added. Kong Konnect: approximately $500–2,500/month; enterprise on request.
  • A2A protocol section added. The A2A protocol, launched by Google in April 2024 and now implemented in production by Kong 3.14 and agentgateway, is a material development absent from the original draft.
  • Full security pillar added (Pillar 4). The draft contained no discussion of MCP-specific security vulnerabilities. Added: tool poisoning (CVE-2025-54136, CVE-2025-54135), rug-pull mutation, supply chain via CVE-2025-6514 in mcp-remote (CVSS 9.6, fixed in version 0.1.16), and the Supabase/Cursor prompt injection incident (mid-2025). Sources: Elastic Security Labs, JFrog, authzed.com, arXiv 2603.22489 (March 2026), practical-devsecops.com, and TrueFoundry.
  • OpenTelemetry section expanded. The draft mentioned “OpenTelemetry” without specifics. Added: GenAI SIG formation (April 2024), MCP-specific semantic conventions in OTel v1.39 (mcp.session.id, mcp.method.name, mcp.protocol.version, gen_ai.tool.name), Datadog’s OTel GenAI SemConv v1.37 support (December 2025), and the Trace A / Trace B disconnection problem that v1.39 fixed.
  • Semantic caching threshold sourced. The 0.85 cosine-similarity threshold described in the original draft is consistent with published configurations for Redis-semantic and Qdrant-semantic caching in LiteLLM; retained.
  • Cost savings figure added. 69% cost reduction from semantic caching cited from a customer support deployment case study (MindStudio, February 2026).
  • Bifrost Code Mode added. Strips tool definitions to essential schemas, reducing token usage per turn by 50%+; material to rogue-agent throttling discussion.

Continue from this article into the most relevant product guides and workflows.

Related Topics

#AI gateway proxy, Model Context Protocol (MCP) tunnel, agentic AI reverse proxy, semantic caching network edge, LLM traffic routing, managing cascading tool calls, autonomous agent infrastructure, Model Context Protocol integration, caching reasoning chains, throttling rogue agents, LLM rate limiting proxy, AI agent architecture 2026, enterprise AI proxy gateway, vector database semantic cache, smart LLM load balancing, developer tools for AI agents, multi-provider LLM routing, orchestrating agentic workflows, prompt caching at the edge, protecting production databases from AI, tool invocation guardrails, AI middleware proxy, real-time agent telemetry, developer infrastructure for LLMs, context window optimization, secure agent orchestration, non-deterministic traffic management, next-gen API gateways, agentic mesh networking, token consumption tracking

Keep building with InstaTunnel

Read the docs for implementation details or compare plans before you ship.

Share this article

More InstaTunnel Insights

Discover more tutorials, tips, and updates to help you build better with localhost tunneling.

Browse All Articles