Developer Guide

Voice Agents are Deepdesk Assistants hooked up to the OpenAI Realtime API. This API supports speech-to-speech interactions while also providing realtime audio transcription.

Deepdesk offers webhook-based endpoints, integrated with a number of voice platforms. Currently supported platforms are:

  • Twilio
  • Dialogue Cloud
  • Dialogue Cloud NEO

Architecture Overview

The following diagram presents a high-level overview of the architecture of the Voice Agent.

Sequence Diagram

The following diagrams illustrate the sequence of events that occur during a voice agent call with Dialogue Cloud.

General Flow

  • When a call is initiated, the Voice Agent starts a session with the OpenAI Realtime API, setting the prompt and available tools.
  • The Voice Agent then starts listening for incoming audio from the Dialogue Cloud Platform.
  • The Voice Agent sends the received audio to OpenAI for transcription and processing.
  • OpenAI generates a response, which may include audio data, tool calls, or other actions.
  • The Voice Agent receives the response from OpenAI and sends the audio back to the Dialogue Cloud Platform for playback to the user.
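The sketch below illustrates this flow in TypeScript using the ws package. It is a minimal illustration only: the event names follow the public OpenAI Realtime API, but the session settings, the onPlatformAudio callback, and the sendAudioToPlatform helper are assumptions, not the actual Deepdesk implementation.

```typescript
import WebSocket from "ws";

// Connect to the OpenAI Realtime API (model name and headers per OpenAI docs).
const openai = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

openai.on("open", () => {
  // Start the session: set the prompt (instructions) and available tools.
  openai.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a helpful voice assistant.", // assistant prompt (illustrative)
      tools: [],                                           // tool definitions go here
      input_audio_format: "g711_ulaw",                     // typical telephony audio format
      output_audio_format: "g711_ulaw",
      turn_detection: { type: "server_vad" },              // let OpenAI detect speech turns
    },
  }));
});

// Forward audio received from the voice platform to OpenAI for transcription
// and processing. `onPlatformAudio` is a hypothetical callback invoked with
// base64-encoded audio chunks from the platform's media stream.
function onPlatformAudio(base64Audio: string) {
  openai.send(JSON.stringify({ type: "input_audio_buffer.append", audio: base64Audio }));
}

// Receive OpenAI's response and send the audio back for playback to the user.
openai.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.audio.delta") {
    sendAudioToPlatform(event.delta); // hypothetical helper: push audio back to the platform
  } else if (event.type === "response.function_call_arguments.done") {
    // A tool call was requested; execute it and return the result to the model.
  }
});

// Placeholder for the platform-specific playback path.
function sendAudioToPlatform(base64Audio: string) { /* platform-specific */ }
```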

Interruptions

  • If the user speaks again during the agent's response, the Voice Agent interrupts the current response and processes the new input.
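Continuing the sketch above, barge-in can be handled by reacting to the Realtime API's speech-start event: cancel the response in progress and drop any audio already queued on the platform. The clearPlatformAudio helper is a hypothetical stand-in for the platform-specific "clear buffered audio" mechanism.

```typescript
openai.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "input_audio_buffer.speech_started") {
    // Stop generating the current response...
    openai.send(JSON.stringify({ type: "response.cancel" }));
    // ...and discard audio the platform has not played yet, so the agent
    // falls silent immediately and the new input is processed.
    clearPlatformAudio();
  }
});

function clearPlatformAudio() { /* platform-specific, e.g. clearing the playback buffer */ }
```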

WebSocket

The Deepdesk API exposes a WebSocket endpoint that can receive an audio stream from the supported voice platforms:

wss://{account}.deepdesk.com/api/v2/{assistant_code}/{voice_platform}/v2

For example, for Twilio:

wss://my-account.deepdesk.com/api/v2/my-assistant/twilio/v2

The supported values for {voice_platform} are:

  • Twilio (twilio)
  • Dialogue Cloud (dialogue_cloud)
  • Dialogue Cloud NEO (acs)
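
As an illustration of how a platform can be pointed at this endpoint, the sketch below uses the Twilio Node library to generate TwiML that opens a bidirectional media stream to the Twilio variant of the URL. The account and assistant codes are the placeholder values from the example above; consult the Twilio documentation for the full call setup.

```typescript
import twilio from "twilio";

// Generate TwiML that connects an incoming Twilio call to the Deepdesk
// WebSocket endpoint as a bidirectional media stream.
const response = new twilio.twiml.VoiceResponse();
const connect = response.connect();
connect.stream({
  url: "wss://my-account.deepdesk.com/api/v2/my-assistant/twilio/v2",
});

console.log(response.toString());
// <Response><Connect><Stream url="wss://..."/></Connect></Response>
```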