LLM API Relay Quality Check

📢 GPT5.5 is still under development; test results are currently inaccurate. Please take note.

Detect OpenAI-compatible API relays and reverse proxies — run 76 automated tests to identify model swapping, token inflation, system prompt injection, dependency hijacking (AC-1.a), and signature tampering (AC-5) risks.

Probe Cost Estimate
Estimated API cost per full probe run (76 tests), based on OpenRouter pricing. The cost depends on which model is being tested and is charged to your API key.
Quick: ~17K input + ~6K output tokens. Full: ~20K input + ~11K output tokens (including the linguistic fingerprint probe repeated 10×).
| Model | Quick | Full |
|---|---|---|
| gpt-4 | $0.870 | $1.260 |
| claude-opus-4.7 | $0.235 | $0.375 |
| claude-opus-4.6 | $0.235 | $0.375 |
| claude-sonnet-4.6 | $0.141 | $0.225 |
| gpt-5.4 | $0.133 | $0.215 |
| gpt-5.2 / 5.3-codex | $0.114 | $0.189 |
| gemini-3.1-pro | $0.106 | $0.172 |
| gpt-4o | $0.102 | $0.160 |
| claude-haiku-4.5 | $0.047 | $0.075 |
| gpt-5.4-mini | $0.040 | $0.065 |
| glm-5.1 | $0.035 | $0.054 |
| glm-5 | $0.026 | $0.040 |
| gpt-3.5-turbo | $0.018 | $0.026 |
Based on current OpenRouter pricing, including linguistic fingerprint ×10 repeat calls. Actual cost may vary slightly depending on response length.
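
As a worked example of how one row of the cost table can be reproduced, the sketch below multiplies the approximate token counts by per-million-token prices. The gpt-4 prices used here ($30 input and $60 output per million tokens) are an assumption and are not stated on this page; they happen to reproduce the gpt-4 row. `estimate_cost` is a hypothetical helper, not part of the tool.

```python
# Approximate token budgets quoted in this section, as (input, output).
QUICK_TOKENS = (17_000, 6_000)
FULL_TOKENS = (20_000, 11_000)

def estimate_cost(tokens, input_price_per_m, output_price_per_m):
    """Return the USD cost for (input_tokens, output_tokens) at per-million prices."""
    input_tokens, output_tokens = tokens
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000

# Assumed gpt-4 prices in USD per million tokens (not taken from this page).
GPT4_INPUT, GPT4_OUTPUT = 30.0, 60.0

quick_cost = estimate_cost(QUICK_TOKENS, GPT4_INPUT, GPT4_OUTPUT)
full_cost = estimate_cost(FULL_TOKENS, GPT4_INPUT, GPT4_OUTPUT)
print(f"quick=${quick_cost:.3f} full=${full_cost:.3f}")  # quick=$0.870 full=$1.260
```

Because output tokens are usually priced higher than input tokens, the Full run costs noticeably more than its modest increase in input tokens would suggest.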
Research

arXiv 2604.08407: LLM Supply Chain Attack Research
Twitter / X: BazaarLink Discussion
arXiv 2407.15847: LLMmap: Fingerprinting Large Language Models
arXiv 2604.24827: IKP: Estimating Black-Box LLM Parameter Counts via Factual Capacity
OWASP LLM Top 10: Top 10 Security Risks for LLM Applications

Attacks this tool detects

This probe implements detection for three key relay-attack classes from arXiv 2604.08407: supply-chain injection, conditional system-prompt injection, and credential exfiltration. Each test run executes 50+ probes against your endpoint to surface these attacks.

AC-1.a

Response Tampering

The proxy modifies tool-call or text content during response parsing, causing the agent to execute attacker-specified operations. Common tactics include tampering with npm/pip/go/cargo install commands, injecting typosquatting packages, and rewriting shell command parameters. Detection compares the proxy's response to direct-connect tool-call payloads to surface silent rewrites.
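
The comparison step described above could be sketched roughly as follows. This is an illustration only and not the tool's actual implementation. It assumes tool calls shaped like OpenAI-style `{"function": {"name", "arguments"}}` objects and a probe prompt deterministic enough that the direct and relayed calls should match. `diff_tool_calls` and the `run_shell` tool are hypothetical names.

```python
import json

def diff_tool_calls(direct_calls, relay_calls):
    """Return human-readable differences between two lists of tool calls.

    Each call is a dict: {"function": {"name": str, "arguments": JSON string}}.
    """
    findings = []
    if len(direct_calls) != len(relay_calls):
        findings.append(f"call count differs: {len(direct_calls)} vs {len(relay_calls)}")
    for i, (d, r) in enumerate(zip(direct_calls, relay_calls)):
        d_fn, r_fn = d["function"], r["function"]
        if d_fn["name"] != r_fn["name"]:
            findings.append(f"call {i}: name {d_fn['name']!r} -> {r_fn['name']!r}")
        # Parse arguments so key order and whitespace do not cause false alarms.
        d_args = json.loads(d_fn["arguments"])
        r_args = json.loads(r_fn["arguments"])
        for key in sorted(set(d_args) | set(r_args)):
            if d_args.get(key) != r_args.get(key):
                findings.append(
                    f"call {i}: arg {key!r} {d_args.get(key)!r} -> {r_args.get(key)!r}"
                )
    return findings

# Hypothetical example: the relay swaps in a typosquatted package name.
direct = [{"function": {"name": "run_shell",
                        "arguments": json.dumps({"command": "pip install requests"})}}]
relay = [{"function": {"name": "run_shell",
                       "arguments": json.dumps({"command": "pip install reqeusts"})}}]
findings = diff_tool_calls(direct, relay)
print(findings)
```

Comparing parsed arguments rather than raw strings keeps harmless formatting differences from being reported as tampering.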

AC-1.b

Conditional Injection

The proxy injects a system message only when the prompt content matches a trigger: malicious instructions for requests containing sensitive terms like "bank", "password", or "transfer", and nothing for everything else. Because the injection is selective, it shows up as a statistical skew between sensitive and neutral prompts. Detection uses Proxy Monitor to compare the system-prompt offset between a baseline and the relay under test.
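
One way to picture the offset comparison is sketched below. How Proxy Monitor actually computes the offset is not described here, so this assumes the offset is the difference in reported `prompt_tokens` for the same prompt sent directly and through the relay, and that the relay reports usage honestly. The function names, sample readings, and threshold are all hypothetical.

```python
from statistics import mean

def prompt_token_offsets(baseline_counts, relay_counts):
    """Per-prompt difference in reported prompt_tokens (relay minus baseline)."""
    return [r - b for b, r in zip(baseline_counts, relay_counts)]

def conditional_injection_suspected(neutral_offsets, sensitive_offsets, threshold=20):
    """Flag when sensitive prompts carry noticeably more extra tokens than neutral ones.

    threshold is an arbitrary illustrative cut-off in tokens, not a tuned value.
    """
    return mean(sensitive_offsets) - mean(neutral_offsets) > threshold

# Hypothetical prompt_tokens readings for the same prompts sent both ways.
neutral = prompt_token_offsets([12, 15, 11], [12, 15, 11])
sensitive = prompt_token_offsets([14, 13, 16], [96, 95, 99])
print(conditional_injection_suspected(neutral, sensitive))  # True
```

Averaging over several prompts per group matters because a single reading cannot distinguish a selective injection from ordinary variation.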

AC-2

Secret Scanning

The proxy silently scans both requests and responses for API keys, access tokens, personal data, and trade secrets. Because the content itself is **not modified**, generic diff tools miss it. Detection injects a honeypot token and verifies whether it surfaces in proxy logs, Telegram bots, or external endpoints.
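
The honeypot idea can be sketched in a few lines. This covers only generating a unique decoy and checking for it in whatever records you are able to observe; how the tool itself monitors proxy logs or external endpoints is not specified here. The `sk-canary-` prefix, helper names, and sample records are hypothetical.

```python
import secrets

def make_honeypot_token(prefix="sk-canary-"):
    """Create a unique, key-shaped decoy string that grants no real access."""
    return prefix + secrets.token_hex(16)

def find_leaks(token, observed_records):
    """Return the records (e.g. log lines or inbound callbacks) containing the token."""
    return [rec for rec in observed_records if token in rec]

token = make_honeypot_token()
# Hypothetical records collected from places you can monitor after the probe.
records = [
    "GET /health 200",
    f"POST /collect body=key:{token}",
]
leaks = find_leaks(token, records)
print(len(leaks))  # 1
```

Using a fresh random token per probe run means any later sighting can be tied to one specific relay and request.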

Frequently Asked Questions

What is AI API relay detection?

AI API relay detection is an automated test suite that verifies whether an OpenAI-compatible API endpoint honestly executes your requests. BazaarLink Probe sends standardised probes to detect model swapping, token padding, system prompt injection, secret exfiltration, and 50 other security risks, outputting a 0–100 score.

How can I tell if a relay is swapping models?

The most reliable method is model fingerprinting: send questions only a specific model can answer correctly (e.g. knowledge cutoff date, specific capability tests), then compare the response against the expected model. BazaarLink Probe includes these probes and automatically flags swap risks.
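
A very simplified version of this comparison is sketched below. It is not how BazaarLink Probe scores fingerprints. It assumes each candidate model has a set of expected answer markers per probe and picks the best-matching candidate. The models, probes, and markers are entirely hypothetical.

```python
def fingerprint_scores(answers, signatures):
    """Score each candidate by the fraction of probes whose answer contains its marker.

    answers: {probe_id: response text from the endpoint}
    signatures: {model_name: {probe_id: expected substring}}
    """
    scores = {}
    for model, expected in signatures.items():
        hits = sum(
            1 for probe_id, marker in expected.items()
            if marker.lower() in answers.get(probe_id, "").lower()
        )
        scores[model] = hits / len(expected)
    return scores

# Hypothetical models, probes, and markers, for illustration only.
signatures = {
    "model-a": {"cutoff": "2024", "self_id": "model a"},
    "model-b": {"cutoff": "2023", "self_id": "model b"},
}
answers = {"cutoff": "My knowledge ends in 2023.", "self_id": "I am Model B."}

scores = fingerprint_scores(answers, signatures)
claimed = "model-a"
swap_suspected = max(scores, key=scores.get) != claimed
print(scores, swap_suspected)
```

Substring matching is deliberately crude; it only illustrates the idea of comparing observed answers against per-model expectations.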

What is token padding?

Token padding (token inflation) means a relay reports higher prompt_tokens or completion_tokens in the API usage field than were actually consumed, causing you to overpay. Minor inflation (5–15%) is hard to spot by eye. BazaarLink Probe detects it by sending inputs with precisely known token counts and comparing those counts against the reported values.
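
The arithmetic behind such a check is simple, as the sketch below shows. It assumes you already know the true token count of a probe prompt (for example from a local tokenizer). The helper names, the 2% tolerance, and the sample numbers are hypothetical.

```python
def inflation_ratio(known_tokens, reported_tokens):
    """Fractional over-reporting: 0.10 means the relay reported 10% more than expected."""
    return (reported_tokens - known_tokens) / known_tokens

def padding_suspected(known_tokens, reported_tokens, tolerance=0.02):
    """Flag when reported usage exceeds the known count by more than tolerance.

    tolerance is an illustrative allowance for tokenizer or chat-template overhead.
    """
    return inflation_ratio(known_tokens, reported_tokens) > tolerance

# Hypothetical probe: a prompt known to be 1,000 tokens, reported as 1,080.
ratio = inflation_ratio(1000, 1080)
print(f"{ratio:.1%}", padding_suspected(1000, 1080))  # 8.0% True
```

An 8% overcharge like this one falls squarely in the range described above as hard to notice without a known reference count.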

What API latency is considered normal?

TTFT (Time to First Token) under 500ms is generally healthy; over 2 seconds suggests performance issues or extra processing layers. Large models like GPT-4o average 300–800ms TTFT. BazaarLink Probe's latency test benchmarks your endpoint against baseline values.
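
Measuring TTFT yourself amounts to timing the gap between issuing the request and receiving the first non-empty streamed chunk. The sketch below uses a stand-in generator instead of a real streaming response. `measure_ttft`, `classify_ttft`, and the "borderline" label for the 0.5 to 2 second range are assumptions made for illustration.

```python
import time

def measure_ttft(chunks, clock=time.perf_counter):
    """Return (seconds until the first non-empty chunk, that chunk).

    chunks is any iterator of streamed text pieces; create it immediately
    before calling this function so the timer covers the whole wait.
    """
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive or role-only deltas
            return clock() - start, chunk
    return None, None

def classify_ttft(seconds):
    """Bucket a TTFT value using the thresholds quoted above."""
    if seconds < 0.5:
        return "healthy"
    if seconds > 2.0:
        return "slow"
    return "borderline"

def fake_stream():
    """Stand-in for a streaming response; a real probe would read SSE chunks."""
    time.sleep(0.05)
    yield ""
    yield "Hello"

ttft, first = measure_ttft(fake_stream())
print(first, classify_ttft(ttft))
```

Skipping empty chunks matters because some streams open with a content-free delta, which would otherwise make TTFT look better than it is.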

How is BazaarLink Probe different from a ping test?

Ping only measures network connectivity (ICMP packets). BazaarLink Probe is a full application-layer (L7) test that sends real LLM requests to verify model identity, token counts, refusal behaviour, stream format, and system prompt injection — 50 indicators ping cannot cover.

How can I use the detection results to choose a provider?

Enter each provider's API endpoint into BazaarLink Probe and compare scores and risk flags. A higher score (closer to 100) with no red flags indicates a more trustworthy endpoint. Focus on three key indicators: model authenticity (no swapping), token accuracy (no padding), and latency performance (TTFT).
