How We Built a Unified API Gateway for 131+ AI Models
Building a unified API gateway for 131+ AI models sounds straightforward until you actually try it. Every provider has slightly different request formats, response schemas, error codes, streaming implementations, and rate limiting behaviors. Here's how we solved each of these problems.
The Core Challenge: Provider Heterogeneity
OpenAI's API has become the de facto standard, but most Chinese AI providers have their own native formats. Baidu's ERNIE API, ByteDance's Doubao, Zhipu's GLM — each has its own authentication scheme, request body structure, and response format.
Our gateway needs to accept a single OpenAI-compatible request and translate it to whatever format each provider expects, then translate the response back. This is the core of what we built.
Request Translation Layer
Each provider has an adapter that implements a common interface:
interface ProviderAdapter {
  transformRequest(openAIRequest: ChatRequest): ProviderRequest;
  transformResponse(providerResponse: unknown): ChatResponse;
  transformStreamChunk(chunk: unknown): StreamDelta | null;
  getAuthHeaders(apiKey: string): Record<string, string>;
}
// Example: Baidu ERNIE adapter
class BaiduAdapter implements ProviderAdapter {
  transformRequest(req: ChatRequest): BaiduRequest {
    return {
      // Baidu doesn't support system messages in the message list,
      // so we strip them out and pass the content as a top-level field.
      messages: req.messages.filter(m => m.role !== 'system'),
      temperature: req.temperature ?? 0.8,
      top_p: req.top_p ?? 0.8,
      stream: req.stream ?? false,
      system: req.messages.find(m => m.role === 'system')?.content,
    };
  }
  // ...
}

Streaming Normalization
Streaming was the hardest part. Providers use different SSE formats, different chunk structures, and different termination signals. We built a unified streaming pipeline that normalizes all of these into OpenAI's data: [DONE] format.
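As an illustration, here is a minimal sketch of the kind of normalization this pipeline performs. The provider-side chunk shape (result, is_end) is hypothetical and stands in for the many real variants; the output matches OpenAI's chat.completion.chunk frames terminated by data: [DONE].

```typescript
// Hypothetical provider chunk shape; real providers vary widely.
interface ProviderChunk {
  result?: string;   // incremental text
  is_end?: boolean;  // provider-specific termination flag
}

interface StreamDelta {
  content: string;
  done: boolean;
}

// Normalize one raw SSE line into a delta; return null for
// keep-alives, comments, and malformed frames.
function normalizeChunk(raw: string): StreamDelta | null {
  const payload = raw.trim();
  if (!payload.startsWith('data:')) return null;
  const body = payload.slice('data:'.length).trim();
  if (body === '[DONE]') return { content: '', done: true };
  let chunk: ProviderChunk;
  try {
    chunk = JSON.parse(body);
  } catch {
    return null; // drop malformed frames rather than break the stream
  }
  if (chunk.is_end) return { content: chunk.result ?? '', done: true };
  if (!chunk.result) return null;
  return { content: chunk.result, done: false };
}

// Re-serialize a delta into the OpenAI SSE wire format.
function toOpenAISSE(delta: StreamDelta, model: string): string {
  if (delta.done && delta.content === '') return 'data: [DONE]\n\n';
  const frame = {
    object: 'chat.completion.chunk',
    model,
    choices: [{
      index: 0,
      delta: { content: delta.content },
      finish_reason: delta.done ? 'stop' : null,
    }],
  };
  return `data: ${JSON.stringify(frame)}\n\n`;
}
```

The key design choice is that normalizeChunk returns null rather than throwing: a single malformed frame from a provider should be dropped, not abort the whole stream.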
Fallback and Reliability
Provider outages happen. Our routing layer maintains a health score for each provider, updated every 30 seconds based on error rates and latency. When a provider's health score drops below a threshold, requests are automatically routed to the next best option.
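A simplified sketch of health-scored routing follows. The field names, the 0.5 threshold, and the weighting between error rate and latency are illustrative, not our production formula:

```typescript
// Sketch of health-scored provider selection; the scoring weights
// and threshold below are illustrative assumptions.
interface ProviderHealth {
  name: string;
  errorRate: number;    // rolling error fraction, 0..1
  p95LatencyMs: number; // rolling p95 latency
}

const HEALTH_THRESHOLD = 0.5;

// Combine error rate and latency into a 0..1 score.
function healthScore(h: ProviderHealth): number {
  const errorFactor = 1 - h.errorRate;
  const latencyFactor = Math.max(0, 1 - h.p95LatencyMs / 10_000);
  return 0.7 * errorFactor + 0.3 * latencyFactor;
}

// Pick the healthiest provider above the threshold, or null
// if every candidate is unhealthy.
function pickProvider(providers: ProviderHealth[]): ProviderHealth | null {
  const healthy = providers
    .filter(p => healthScore(p) >= HEALTH_THRESHOLD)
    .sort((a, b) => healthScore(b) - healthScore(a));
  return healthy[0] ?? null;
}
```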
We also implement request-level retries with exponential backoff for transient errors (rate limits, 503s), and circuit breakers to prevent cascading failures.
Billing at Scale
Accurate token counting across 131+ models requires model-specific tokenizers. We run tokenization in a separate worker pool to avoid blocking the request path, and reconcile counts against provider-reported usage asynchronously.
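The reconciliation step can be sketched as comparing our local estimate against the provider-reported usage and flagging large drift. The field names, the rule that provider-reported counts win, and the 5% tolerance are all illustrative assumptions:

```typescript
// Reconcile a locally estimated token count against the usage the
// provider reports. All names and the 5% tolerance are illustrative.
interface Usage {
  promptTokens: number;
  completionTokens: number;
}

interface Reconciliation {
  billed: Usage;    // provider-reported usage wins for billing
  drift: number;    // relative difference between estimate and report
  flagged: boolean; // true when drift exceeds tolerance
}

function reconcile(
  estimate: Usage,
  reported: Usage,
  tolerance = 0.05,
): Reconciliation {
  const est = estimate.promptTokens + estimate.completionTokens;
  const rep = reported.promptTokens + reported.completionTokens;
  const drift = rep === 0 ? 0 : Math.abs(est - rep) / rep;
  return { billed: reported, drift, flagged: drift > tolerance };
}
```

Persistent drift for one model usually means our tokenizer mapping for it is wrong, which is exactly the signal this check exists to surface.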
What We Learned
The biggest lesson: don't try to abstract away all provider differences. Some providers have unique capabilities (like Qwen's 1M context or ByteDance's multimodal features) that are worth exposing directly. The goal is compatibility, not uniformity.
Questions about our architecture? Reach out at [email protected].
