Developer Guide

Voice Agents are Deepdesk Assistants hooked up to the OpenAI Realtime API. This API supports speech-to-speech interactions while also providing realtime audio transcription.

Deepdesk offers webhook-based endpoints, integrated with a number of voice platforms. Currently supported platforms are:

  • Twilio
  • Dialogue Cloud
  • Dialogue Cloud NEO

Architecture Overview

The following diagram presents a high-level overview of the architecture of the Voice Agent.

Sequence Diagram

The following diagrams illustrate the sequence of events that occur during a voice agent call with Dialogue Cloud.

General Flow

  • When a call is initiated, the Voice Agent starts a session with the OpenAI Realtime API, setting the prompt and available tools.
  • The Voice Agent then starts listening for incoming audio from the Dialogue Cloud Platform.
  • The Voice Agent sends the received audio to OpenAI for transcription and processing.
  • OpenAI generates a response, which may include audio data, tool calls, or other actions.
  • The Voice Agent receives the response from OpenAI and sends the audio back to the Dialogue Cloud Platform for playback to the user.
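The sketch below illustrates this flow in TypeScript using the ws package. It is a minimal illustration only: the event names follow the public OpenAI Realtime API, but the session settings, the onPlatformAudio callback, and the sendAudioToPlatform helper are assumptions, not the actual Deepdesk implementation.

```typescript
import WebSocket from "ws";

// Connect to the OpenAI Realtime API (model name and headers per OpenAI docs).
const openai = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

openai.on("open", () => {
  // Start the session: set the prompt (instructions) and available tools.
  openai.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a helpful voice assistant.", // assistant prompt (illustrative)
      tools: [],                                           // tool definitions go here
      input_audio_format: "g711_ulaw",                     // typical telephony audio format
      output_audio_format: "g711_ulaw",
      turn_detection: { type: "server_vad" },              // let OpenAI detect speech turns
    },
  }));
});

// Forward audio received from the voice platform to OpenAI for transcription
// and processing. `onPlatformAudio` is a hypothetical callback invoked with
// base64-encoded audio chunks from the platform's media stream.
function onPlatformAudio(base64Audio: string) {
  openai.send(JSON.stringify({ type: "input_audio_buffer.append", audio: base64Audio }));
}

// Receive OpenAI's response and send the audio back for playback to the user.
openai.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.audio.delta") {
    sendAudioToPlatform(event.delta); // hypothetical helper: push audio back to the platform
  } else if (event.type === "response.function_call_arguments.done") {
    // A tool call was requested; execute it and return the result to the model.
  }
});

// Placeholder for the platform-specific playback path.
function sendAudioToPlatform(base64Audio: string) { /* platform-specific */ }
```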

Interruptions

  • If the user speaks again during the agent's response, the Voice Agent interrupts the current response and processes the new input.
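Continuing the sketch above, barge-in can be handled by reacting to the Realtime API's speech-start event: cancel the response in progress and drop any audio already queued on the platform. The clearPlatformAudio helper is a hypothetical stand-in for the platform-specific "clear buffered audio" mechanism.

```typescript
openai.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "input_audio_buffer.speech_started") {
    // Stop generating the current response...
    openai.send(JSON.stringify({ type: "response.cancel" }));
    // ...and discard audio the platform has not played yet, so the agent
    // falls silent immediately and the new input is processed.
    clearPlatformAudio();
  }
});

function clearPlatformAudio() { /* platform-specific, e.g. clearing the playback buffer */ }
```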

WebSocket

The Deepdesk API exposes a WebSocket endpoint that can receive an audio stream from the supported voice platforms:

wss://{account}.deepdesk.com/api/v2/{assistant_code}/{voice_platform}/v2

For example, for Twilio:

wss://my-account.deepdesk.com/api/v2/my-assistant/twilio/v2

The supported values for {voice_platform} are:

  • Twilio (twilio)
  • Dialogue Cloud (dialogue_cloud)
  • Dialogue Cloud NEO (acs)
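
As an illustration of how a platform can be pointed at this endpoint, the sketch below uses the Twilio Node library to generate TwiML that opens a bidirectional media stream to the Twilio variant of the URL. The account and assistant codes are the placeholder values from the example above; consult the Twilio documentation for the full call setup.

```typescript
import twilio from "twilio";

// Generate TwiML that connects an incoming Twilio call to the Deepdesk
// WebSocket endpoint as a bidirectional media stream.
const response = new twilio.twiml.VoiceResponse();
const connect = response.connect();
connect.stream({
  url: "wss://my-account.deepdesk.com/api/v2/my-assistant/twilio/v2",
});

console.log(response.toString());
// <Response><Connect><Stream url="wss://..."/></Connect></Response>
```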