Why pair Realtime + Apps SDK (and build once with MCP)
- One set of tools (MCP), many surfaces. Expose your systems as MCP tools once, then reuse them in:
- a ChatGPT app (Apps SDK UI + ChatGPT Voice),
- a web/mobile voice widget via Realtime WebRTC, and
- a phone agent via Realtime SIP—all hitting the same contracts.
- Production voice capability out of the box. gpt-realtime improves instruction following, function (tool) calling, and conversational naturalness; the Realtime API now natively brokers remote MCP tool calls.
- Native, trusted ChatGPT surface. The Apps SDK renders components inline in chat (iframe) via the window.openai bridge; you handle logic on your MCP server.
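A concrete way to see the "build once" promise is the tool contract itself: one definition, consumed unchanged by the ChatGPT app, the WebRTC widget, and the SIP agent. A minimal sketch, where the tool name, fields, and ID format are ours for illustration, not from any official SDK:

```typescript
// One tool contract, reused verbatim across all three surfaces.
// create_ticket and its fields are hypothetical examples.
interface ToolContract {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>; // JSON Schema, as MCP expects
}

const createTicket: ToolContract = {
  name: "create_ticket",
  description:
    "Create a support ticket for an existing customer. " +
    "Use only after the customer has been identified.",
  inputSchema: {
    type: "object",
    properties: {
      customer_id: { type: "string", description: "ID from lookup_customer" },
      summary: { type: "string", description: "One-line problem summary" },
      priority: { type: "string", enum: ["low", "normal", "high"] },
    },
    required: ["customer_id", "summary"],
  },
};
```

The description doubles as model-facing guidance ("use only after…"), which is what makes the same contract behave consistently in chat, on the web, and on the phone.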
Three deployment patterns (you can start with one and add the others)
1) ChatGPT-native voice + UI (no extra hosting)
Users speak to ChatGPT; your Apps SDK component renders results (lists, forms, review) while your MCP server executes tools. Build and test by connecting your MCP server in Developer Mode; tune metadata so the model reliably selects your tools.
When to choose: voice-friendly flows that benefit from a visual confirmation (quotes, tickets, schedules) and from ChatGPT distribution (upcoming directory).
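For this pattern, a tool result carries two views of the same answer: text the model can narrate in voice, and structured data the inline component renders. A sketch of that shape — the `openai/outputTemplate` meta key and the `ui://` resource URI follow our reading of the Apps SDK docs and should be treated as assumptions to verify:

```typescript
// Sketch of an MCP tool result shaped for an Apps SDK component.
// Values are illustrative; the _meta key is an assumption from docs.
const quoteResult = {
  // What the model reads and can narrate back in voice mode.
  content: [{ type: "text", text: "Quote #Q-123: $420, valid 30 days." }],
  // What the inline component renders for visual confirmation.
  structuredContent: {
    quoteId: "Q-123",
    total: 420,
    currency: "USD",
    validDays: 30,
  },
  // Points ChatGPT at the component resource your MCP server serves.
  _meta: { "openai/outputTemplate": "ui://widget/quote.html" },
};
```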
2) Embedded voice agent on your site/app (WebRTC)
Use the Realtime API in the browser to stream mic audio and play responses; point the session at your remote MCP server so tools “just work.” This yields true hands-free interactions in your product.
When to choose: you want own-brand voice UX (kiosks, mobile apps, sales assistants) and full control of availability and analytics.
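The moving parts of that browser flow can be sketched as plain data: a session config that attaches your remote MCP server, and the SDP-exchange request the browser sends after `createOffer()`. The endpoint path, the MCP tool entry, and all field names here follow our reading of the Realtime docs at the time of writing; verify them against the current reference before shipping, and note `server_url` is a placeholder.

```typescript
// Session config attaching a remote MCP server, so the voice widget
// uses the same tools as the ChatGPT app. Field names are assumptions.
const sessionConfig = {
  type: "realtime",
  model: "gpt-realtime",
  tools: [
    {
      type: "mcp",
      server_label: "acme-tools",
      server_url: "https://mcp.example.com/mcp", // placeholder
      require_approval: "never",
    },
  ],
};

// The SDP exchange: the browser POSTs its WebRTC offer and gets the
// answer back, authenticated with an ephemeral key (never your API key).
function sdpRequest(model: string, offerSdp: string, ephemeralKey: string) {
  return {
    url: `https://api.openai.com/v1/realtime/calls?model=${model}`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${ephemeralKey}`,
        "Content-Type": "application/sdp",
      },
      body: offerSdp,
    },
  };
}
```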
3) Phone agents & contact center automation (SIP)
Use Realtime SIP to answer or place calls; keep PCI-sensitive flows on your systems, while the model handles natural conversation and MCP tool calls (lookup account, create case, schedule service).
When to choose: inbound support lines, order status, appointments, or after-hours triage where speech-to-speech and tool use are both required.
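For inbound calls, the accept step might look like the following sketch: on the incoming-call webhook, your backend tells the Realtime API to pick up with a given session configuration. The accept endpoint path and body shape are assumptions from our reading of the Realtime SIP docs, not verified API surface — confirm both before building on them.

```typescript
// Sketch: build the request that accepts an inbound SIP call.
// Endpoint path and body fields are assumptions to verify against docs.
function acceptCallRequest(callId: string, apiKey: string) {
  return {
    url: `https://api.openai.com/v1/realtime/calls/${callId}/accept`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        type: "realtime",
        model: "gpt-realtime",
        instructions:
          "You are the after-hours support line. Disclose that you are " +
          "an AI assistant. Never read back full payment card numbers.",
      }),
    },
  };
}
```

Keeping the disclosure and PCI guardrails in the session instructions here mirrors the "script disclaimers" step in the delivery plan below.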
Architecture (source-aligned)
Client layer
- ChatGPT app (Apps SDK component in an iframe) or your web/app voice UI (WebRTC), or a phone endpoint (SIP).
Model & session
- gpt-realtime over the Realtime API (WebRTC/WebSocket/SIP). Sessions can attach remote MCP servers, and the API orchestrates function calls asynchronously while the conversation continues.
Integration layer
- Your MCP server exposes narrowly scoped tools (JSON Schema), returns structured results and component HTML for Apps SDK, and enforces auth.
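What "narrowly scoped" and "enforces auth" mean in practice: the MCP server re-validates every argument and checks scopes on every call, regardless of the schema the model saw. A sketch with hypothetical names and a made-up ID format:

```typescript
// Server-side guard for a write tool. Names, scope strings, and the
// cus_ ID format are hypothetical examples, not any real convention.
type TicketArgs = { customer_id: string; summary: string; priority: string };

function validateTicketArgs(raw: unknown, scopes: string[]): TicketArgs {
  // Least privilege: refuse before touching the payload.
  if (!scopes.includes("tickets:write")) {
    throw new Error("missing scope tickets:write");
  }
  const a = raw as Partial<TicketArgs>;
  // Never trust model-produced arguments; re-check shape and bounds.
  if (typeof a.customer_id !== "string" || !/^cus_[A-Za-z0-9]+$/.test(a.customer_id)) {
    throw new Error("invalid customer_id");
  }
  if (typeof a.summary !== "string" || a.summary.length === 0 || a.summary.length > 200) {
    throw new Error("invalid summary");
  }
  const priority = a.priority ?? "normal";
  if (!["low", "normal", "high"].includes(priority)) {
    throw new Error("invalid priority");
  }
  return { customer_id: a.customer_id, summary: a.summary, priority };
}
```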
Distribution
- Apps SDK: build/test now; submissions + directory later this year for ChatGPT.
UX principles for voice-first + in-chat UI
- Chat drives; UI confirms. Keep Apps SDK components compact; use fullscreen only to deepen engagement (review/confirm)—the composer remains present.
- Design to the tool schema. Every field maps to a tool parameter; accurate names/descriptions boost discovery and reduce false activations.
- Respect voice context. Let speech lead the flow; keep visual elements scannable (e.g., top 3 options + “confirm”). For voice in ChatGPT itself, follow the Voice Mode guidance for capabilities and limits.
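The "top 3 options + confirm" rule is easy to enforce in code before results ever reach the model. A small helper sketch:

```typescript
// Rank options, keep the top few for the voice flow, and report how
// many were held back so the agent can offer "and N more if needed".
function topOptionsForVoice<T>(
  options: T[],
  score: (o: T) => number,
  limit = 3,
): { shown: T[]; heldBack: number } {
  const ranked = [...options].sort((a, b) => score(b) - score(a));
  return {
    shown: ranked.slice(0, limit),
    heldBack: Math.max(0, ranked.length - limit),
  };
}
```

Truncating server-side (rather than asking the model to summarize) keeps the visual component scannable and the spoken turn short.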
Security, privacy, and compliance (what reviewers check)
- Least privilege + explicit consent. Request only necessary scopes; label write actions so ChatGPT inserts human confirmation; validate inputs server-side.
- Sensitive data & retention. Follow App developer guidelines (general-audience content, published privacy policy, data minimization).
- Realtime hardening. Use ephemeral credentials for WebRTC clients; keep PCI/PHI handling in your systems; consult Realtime docs for transport and session best practices.
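A minimal sketch of the ephemeral-credential step: the browser asks your backend for a short-lived client secret, and only the backend ever holds the real API key. The endpoint path and request-body shape follow the Realtime docs at the time of writing; confirm both (and the credential TTL) before relying on them.

```typescript
// Backend-only sketch: build the request that mints a short-lived
// client credential for a WebRTC session. Endpoint is an assumption.
function mintClientSecretRequest(apiKey: string) {
  return {
    url: "https://api.openai.com/v1/realtime/client_secrets",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`, // server-side secret only
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        session: { type: "realtime", model: "gpt-realtime" },
      }),
    },
  };
}
```

The returned secret is what the browser passes to the SDP exchange; it expires quickly, so a leaked value has a bounded blast radius.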
Delivery plan (4–8 weeks, scope-dependent)
Phase 1 — Contracts & POC (Dev Mode + WebRTC sandbox)
- Define 3–6 MCP tools (e.g., lookup_customer, quote, create_ticket); wire a minimal Apps SDK component; validate discovery and write confirmations in Developer Mode; stand up a basic Realtime demo with the same MCP server.
Phase 2 — Voice hardening & auth
- Move to gpt-realtime GA; add OAuth if linking accounts; measure latency and function-calling accuracy on your golden prompts.
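"Function-calling accuracy on your golden prompts" can start as exact-match scoring. A deliberately strict sketch — it pairs expected and actual calls by position, which is sufficient for single-call turns but would need real alignment for multi-call ones:

```typescript
// A call counts as correct only if both the tool name and the
// arguments match the golden expectation exactly.
type ToolCall = { name: string; args: Record<string, unknown> };

function functionCallPrecision(expected: ToolCall[], actual: ToolCall[]): number {
  if (actual.length === 0) return 0;
  const hits = actual.filter((a, i) => {
    const e = expected[i];
    return !!e && e.name === a.name &&
      JSON.stringify(e.args) === JSON.stringify(a.args);
  }).length;
  return hits / actual.length;
}
```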
Phase 3 — Phone channel (optional)
- Add SIP for inbound/outbound; script disclaimers and DTMF fallbacks; keep tool permissions tight.
Phase 4 — Submission-ready
- Map controls to Security & Privacy and App developer guidelines; prep metadata and screenshots for the ChatGPT directory when submissions open.
KPIs to instrument from day one
- Voice latency (p95) and barge-in success (interrupt-and-respond rate) from Realtime sessions.
- Function-call precision (right tool, right args) and confirmation rate for write actions.
- Discovery precision/recall for your app’s prompts (Apps SDK Optimize Metadata guidance).
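For the latency KPI above, a nearest-rank p95 over per-turn latencies is enough to start instrumenting on day one:

```typescript
// p95 via the nearest-rank method over per-turn latencies (ms).
function p95(latenciesMs: number[]): number {
  if (latenciesMs.length === 0) throw new Error("no samples");
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length); // nearest-rank percentile
  return sorted[rank - 1];
}
```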
Common pitfalls (and how to avoid them)
- Duplicating connectors. Reuse one MCP server across Realtime and the Apps SDK; Realtime’s remote MCP support removes the need for custom integration code.
- Over-rich fullscreen UI. Apps SDK is chat-first; oversized UI hurts completion—keep tasks focused.
- Skipping Developer Mode tests. Use MCP Inspector & Developer Mode to validate discovery, schemas, and mobile layouts before scaling.
What you’ll get from our team
A voice-native ChatGPT app and Realtime agent powered by the same MCP toolchain—including contract-first schemas, Apps SDK UI (inline & fullscreen review surfaces), gpt-realtime configuration (WebRTC and optional SIP), and a compliance packet mapped to Security & Privacy and App developer guidelines. All patterns above follow OpenAI’s official docs.