Developer Guide
Voice Agents are Deepdesk Assistants connected to the OpenAI Realtime API. This API supports speech-to-speech interactions and also provides realtime audio transcription.
Deepdesk offers webhook-based endpoints, integrated with a number of voice platforms. Currently supported platforms are:
- Twilio
- Dialogue Cloud
- Dialogue Cloud NEO
Architecture Overview
The following diagram presents a high-level overview of the architecture of the Voice Agent.
Sequence Diagram
The following diagrams illustrate the sequence of events that occur during a voice agent call with Dialogue Cloud.
General Flow
- When a call is initiated, the Voice Agent starts a session with the OpenAI Realtime API, setting the prompt and available tools.
- The Voice Agent then starts listening for incoming audio from the Dialogue Cloud Platform.
- The Voice Agent sends the received audio to OpenAI for transcription and processing.
- OpenAI generates a response, which may include audio data, tool calls, or other actions.
- The Voice Agent receives the response from OpenAI and sends the audio back to the Dialogue Cloud Platform for playback to the user.
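The first and third steps above can be sketched as the JSON events the Voice Agent would send over the Realtime API WebSocket. This is a minimal illustration, not Deepdesk's actual implementation: the event names follow the OpenAI Realtime API, while the prompt and tool contents are placeholders.

```python
import base64


def session_update(instructions: str, tools: list) -> dict:
    # Step 1: configure the session with the prompt and available tools.
    return {
        "type": "session.update",
        "session": {"instructions": instructions, "tools": tools},
    }


def append_audio(chunk: bytes) -> dict:
    # Step 3: forward a chunk of caller audio, base64-encoded,
    # to the input audio buffer for transcription and processing.
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    }
```

Each dict would be serialized to JSON and sent as a text frame on the open Realtime API WebSocket.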
Interruptions
- If the user speaks again during the agent's response, the Voice Agent interrupts the current response and processes the new input.
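An interruption could be expressed with the Realtime API's cancel and truncate events, roughly as below. This is a hedged sketch: the event names come from the OpenAI Realtime API, and the `item_id` and `audio_end_ms` values are illustrative.

```python
def cancel_response() -> dict:
    # Stop the model from generating the rest of the current response.
    return {"type": "response.cancel"}


def truncate_playback(item_id: str, audio_end_ms: int) -> dict:
    # Tell the model how much of its audio the caller actually heard,
    # so the conversation state matches the interrupted playback.
    return {
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": 0,
        "audio_end_ms": audio_end_ms,
    }
```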
WebSocket
The Deepdesk API exposes a WebSocket endpoint that can receive an audio stream from the supported voice platforms:
wss://{account}.deepdesk.com/api/v2/{assistant_code}/{voice_platform}/v2
For example, for Twilio:
wss://my-account.deepdesk.com/api/v2/my-assistant/twilio/v2
Where the supported voice platforms are:
- Twilio (twilio)
- Dialogue Cloud (dialogue_cloud)
- Dialogue Cloud NEO (acs)
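A small helper can assemble the endpoint URL from the template above. The account and assistant names below are placeholders; the platform codes are the three listed here.

```python
SUPPORTED_PLATFORMS = {"twilio", "dialogue_cloud", "acs"}


def websocket_url(account: str, assistant_code: str, voice_platform: str) -> str:
    # Build the Deepdesk WebSocket endpoint for a given voice platform.
    if voice_platform not in SUPPORTED_PLATFORMS:
        raise ValueError(f"unsupported voice platform: {voice_platform}")
    return f"wss://{account}.deepdesk.com/api/v2/{assistant_code}/{voice_platform}/v2"
```

For example, `websocket_url("my-account", "my-assistant", "twilio")` yields the Twilio endpoint shown above.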