What “observability” means for Apps in ChatGPT
Your app spans three layers—each needs visibility:
- ChatGPT client & discovery
Does ChatGPT invoke your app when it should? Tune names, descriptions, and parameter docs; the model’s decision to call your connector is driven by your metadata. Track recall/precision on a fixed prompt set.
- MCP server & tools
Are tool calls valid and fast? Use MCP Inspector / API Playground to inspect request/response JSON, arguments, and errors without the full UI. Log inputs (sanitized), outputs, latency, and error codes.
- Policy & safety gates
Are state-changing operations guarded? Follow Security & Privacy: least privilege, explicit user consent, and confirmation prompts for write actions. Keep audit logs.
A practical metrics set (you can implement today)
- Discovery precision / recall on a curated “golden prompt” suite (e.g., 50–200 prompts per task). Optimize via the Optimize Metadata guide.
- Tool call success rate and latency p95/p99 from MCP logs (per tool). Inspect traces via Developer Mode / API Playground during dev.
- Write-action confirmation rate and post-confirm error rate (signals clarity + safety; confirmation is required by policy for risky actions).
- UI render failures / component errors observed during Developer Mode sessions (attach screenshots and payloads for regression packs).
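To make these measurable, here is a minimal TypeScript sketch that computes discovery precision/recall over a golden-prompt suite and per-tool success rate plus p95/p99 latency from your MCP logs. The record shapes (`DiscoveryResult`, `ToolCallRecord`) are assumptions about your own telemetry, not an OpenAI or MCP schema.

```ts
// Assumed log shapes; adapt to whatever your telemetry actually emits.
interface DiscoveryResult {
  prompt: string;
  shouldInvoke: boolean; // labeled expectation from the golden prompt suite
  didInvoke: boolean;    // observed behavior in Developer Mode / logs
}

interface ToolCallRecord {
  tool: string;
  ok: boolean;
  latencyMs: number;
}

// Precision: of the invocations that happened, how many were wanted?
// Recall: of the prompts that should have invoked the app, how many did?
export function discoveryMetrics(results: DiscoveryResult[]) {
  const tp = results.filter(r => r.shouldInvoke && r.didInvoke).length;
  const fp = results.filter(r => !r.shouldInvoke && r.didInvoke).length;
  const fn = results.filter(r => r.shouldInvoke && !r.didInvoke).length;
  return {
    precision: tp / Math.max(tp + fp, 1),
    recall: tp / Math.max(tp + fn, 1),
  };
}

function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

export function toolMetrics(records: ToolCallRecord[]) {
  const latencies = records.map(r => r.latencyMs);
  return {
    successRate: records.filter(r => r.ok).length / Math.max(records.length, 1),
    p95: percentile(latencies, 95),
    p99: percentile(latencies, 99),
  };
}
```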
How to test reliably (source-aligned workflow)
- Contract-first tool design → unit tests
Define narrow JSON-schema tools and validate parameters server-side (defense-in-depth); a validation sketch follows this list. Keep audit logs per Security & Privacy.
- Run end-to-end traces in Developer Mode
Connect your HTTPS /mcp endpoint (Settings → Apps & Connectors) and issue test prompts. Use API Playground to view raw request/response pairs when you need log-level detail.
- Debug tool behavior with MCP Inspector
When discovery is fine but execution isn’t, use MCP Inspector from the testing guide to validate schemas, arguments, and tool outputs in isolation.
- Evaluate quality with OpenAI Evals
Convert your golden prompts into eval datasets and run graded evaluations (model- or script-graded) to catch regressions before release. Use OpenAI’s Evals guides and API for repeatable scoring.
- Optimize discovery copy
Iterate names, descriptions, and parameter docs; the Optimize Metadata guide explains how better copy increases recall and reduces accidental activation. Re-measure on your prompt suite.
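For the contract-first step above, here is a minimal sketch of server-side parameter validation with plain JSON Schema and Ajv, independent of any MCP framework. The `create_invoice` tool, its fields, and `handleCreateInvoice` are hypothetical; the key point is that the server re-validates arguments against the same schema it advertises via `tools/list`.

```ts
import Ajv from "ajv"; // npm install ajv

// Narrow, contract-first schema: required fields, no extra properties.
// Reuse the same JSON Schema you expose to ChatGPT via tools/list.
const createInvoiceSchema = {
  type: "object",
  properties: {
    customerId: { type: "string", minLength: 1 },
    amountCents: { type: "integer", minimum: 1 },
    currency: { type: "string", enum: ["USD", "EUR", "GBP"] },
  },
  required: ["customerId", "amountCents", "currency"],
  additionalProperties: false,
};

const ajv = new Ajv();
const validateCreateInvoice = ajv.compile(createInvoiceSchema);

// Defense-in-depth: never trust the arguments as received, even though the
// model saw the schema. Reject, log, and return a structured error instead.
export function handleCreateInvoice(args: unknown) {
  if (!validateCreateInvoice(args)) {
    return { isError: true, errors: validateCreateInvoice.errors };
  }
  // ...perform the write only behind consent/confirmation, then audit-log it.
  return { isError: false };
}
```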
Logging & audit: what to capture (and why)
- Correlation IDs: one per user action → tool call chain.
- Inputs/outputs (sanitized): log schema-validated arguments and structured results; avoid sensitive data.
- Auth + scope: which scopes were used; confirm least privilege posture.
- Confirmation events: capture when ChatGPT displayed a confirmation prompt and the user approved the write action.
- Latency & error taxonomy: client/middleware/upstream failures.
OpenAI’s Security & Privacy guidance explicitly calls for least privilege, explicit consent, defense-in-depth, and keeping audit logs—use that as your audit baseline.
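As a concrete baseline, here is a minimal sketch of a structured audit event for a Node.js server. The `AuditEvent` fields and `audit` helper are our own naming, not an OpenAI schema; the point is one correlation ID per user action → tool call chain, sanitized payloads, and an explicit record of the confirmation step.

```ts
import { randomUUID } from "node:crypto";

// Assumed audit-log shape; adjust fields to your stack and retention policy.
interface AuditEvent {
  correlationId: string;   // one per user action → tool call chain
  tool: string;
  phase: "received" | "confirmed" | "executed" | "failed";
  scopes: string[];        // which OAuth scopes were actually exercised
  sanitizedArgs?: unknown; // schema-validated arguments, secrets stripped
  latencyMs?: number;
  errorClass?: "client" | "middleware" | "upstream";
  timestamp: string;
}

export function audit(event: AuditEvent): void {
  // Emit structured JSON so events stay queryable; ship to your log store.
  console.log(JSON.stringify(event));
}

// Example: record that the user approved a write action before executing it.
const correlationId = randomUUID();
audit({
  correlationId,
  tool: "create_invoice",
  phase: "confirmed",
  scopes: ["invoices:write"],
  timestamp: new Date().toISOString(),
});
```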
Rollouts the review team will appreciate
- Preview → Submission (later this year)
The Apps SDK is in preview; you can build & test now and submit later this year. Plan a pre-submission stabilization window.
- Post-listing changes require re-submission
Once listed, tool names, signatures, and descriptions are locked. Changing or adding tools requires another review; treat tool definitions like versioned APIs.
- Release checklist (evidence pack)
Include: Developer Mode traces (screens + JSON), Evals results on your prompt set, confirmation-flow screenshots for every write action, privacy policy link, and scope/permission mapping. These map directly to the App developer guidelines.
Minimal “quality gate” you can adopt this week
- SLOs: discovery precision ≥ X% on prompts; tool success ≥ Y%; p95 tool latency ≤ Zs. (Pick values appropriate to your domain.)
- Evals: passing score thresholds per task before any release.
- Policy gates: no writes without confirmation; no sensitive data collected; audit logs enabled (per Security & Privacy).
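One way to wire these gates into CI, as a sketch: the thresholds and the `qualityGate` helper below are placeholders you would tune per domain, and the metrics object is whatever your own pipeline produces.

```ts
// Placeholder SLO values; pick numbers appropriate to your domain.
const slo = {
  discoveryPrecision: 0.9,
  discoveryRecall: 0.85,
  toolSuccessRate: 0.99,
  p95LatencyMs: 2000,
};

interface ReleaseMetrics {
  discoveryPrecision: number;
  discoveryRecall: number;
  toolSuccessRate: number;
  p95LatencyMs: number;
  unconfirmedWrites: number; // policy gate, not a tunable SLO: must be zero
}

// Returns the list of violated gates; an empty list means the release may ship.
export function qualityGate(m: ReleaseMetrics): string[] {
  const failures: string[] = [];
  if (m.discoveryPrecision < slo.discoveryPrecision) failures.push("discovery precision below SLO");
  if (m.discoveryRecall < slo.discoveryRecall) failures.push("discovery recall below SLO");
  if (m.toolSuccessRate < slo.toolSuccessRate) failures.push("tool success rate below SLO");
  if (m.p95LatencyMs > slo.p95LatencyMs) failures.push("p95 latency above SLO");
  if (m.unconfirmedWrites > 0) failures.push("write action executed without confirmation");
  return failures;
}
```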
Common failure patterns (and fixes)
- “The model doesn’t pick my app.”
Usually metadata. Tighten names/descriptions/param docs and re-test recall/precision per Optimize Metadata.
- “My tool gets called with wrong args.”
Strengthen schemas and server-side validation; reproduce in MCP Inspector to isolate.
- “Risky actions run without friction.”
Ensure tools that change external state are labeled as write actions so ChatGPT inserts confirmation prompts; log the approval event (see the annotation sketch after this list).
- “Production changed, app broke.”
Because tool definitions lock after listing, plan additive versions and re-submission, not breaking edits.
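For the write-action pattern above, the MCP spec defines optional tool annotations (for example `readOnlyHint` and `destructiveHint`) that you can attach to a tool’s `tools/list` entry. How ChatGPT maps these hints to confirmation UX is the platform’s decision, so treat the snippet as illustrative and keep a server-side guard regardless; the `create_invoice` tool is hypothetical.

```ts
// Illustrative tools/list entry: the annotations are hints, not enforcement.
const createInvoiceTool = {
  name: "create_invoice",
  description: "Create a draft invoice for a customer.",
  inputSchema: {
    type: "object",
    properties: {
      customerId: { type: "string" },
      amountCents: { type: "integer", minimum: 1 },
    },
    required: ["customerId", "amountCents"],
  },
  annotations: {
    readOnlyHint: false,    // this tool changes external state (a write action)
    destructiveHint: false, // creates a draft; does not delete or overwrite
    idempotentHint: false,
  },
};

export default createInvoiceTool;
```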
How we work (brief)
We implement contract-first tools, wire Developer Mode + MCP Inspector into CI, add Evals as a pre-release gate, and produce a submission-ready evidence pack (traces, evals, confirmation screenshots, privacy and scope mapping) aligned to OpenAI’s official docs.