Overview

The voice scam protection APIs perform real-time risk analysis on the audio stream of a phone call. They are bidirectional, full-duplex WebSocket APIs: as the audio is streamed in and processed, the results are streamed back to the client simultaneously.

Session flow

The common session flow for all endpoints is the following:

  1. Client initiates the WebSocket connection.
  2. Client sends an initial WebSocket text frame with JSON metadata.
  3. Server responds with an initial WebSocket text frame containing a server-supplied session ID.
  4. Client starts sending chunks of audio data in WebSocket binary frames.
  5. Server sends output events containing the transcript and the analysis results in JSON WebSocket text frames.
  6. Client closes the connection. The client may close the connection at any point before the audio stream naturally ends, e.g. if it has already received a classification that makes further analysis unnecessary.

An example session looks as follows:

# 1. → Client sends initial metadata
{ "sampleRate": 8000, ...API-dependent metadata... }

# 2. ← Server sends session metadata event
{"type": "sessionMetadata", "sessionId": "838e90ff-ddaf-4c73-8bac-1e380137d99e"}

# 3. → Client sends audio frame
# 4. → Client sends audio frame
# 5. → Client sends audio frame

# 6. ← Server sends transcript line
{ "type": "transcript", "speaker": "USER", "transcript": "hello who is this" }

# 7. ← Server sends result event
{ "type": "result", ...API-dependent result... }

# 8. → Client sends audio frame
# 9. → Client sends audio frame

# 10. ← Server sends transcript line
{ "type": "transcript", "speaker": "SUBJECT", "transcript": "hello this is your bank we have frozen your bank account" }
{ "type": "transcript", "speaker": "SUBJECT", "transcript": "i am going to need your social security number to unlock it" }

# 11. ← Server sends result event
{ "type": "result", ...API-dependent result... }

# 12. Client closes the connection
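
The same session can be driven by a small client. Below is a minimal sketch in Python using the third-party websockets library; the endpoint URL is a placeholder, authentication is omitted, and any metadata fields beyond sampleRate depend on the API you call:

import asyncio
import json

import websockets  # third-party client library: pip install websockets

# Placeholder endpoint; the real URL and any authentication depend on the API.
WS_URL = "wss://api.example.com/voice-scam-protection"


async def run_session(audio_chunks):
    """Stream raw PCM chunks and consume events until the session ends."""
    async with websockets.connect(WS_URL) as ws:
        # Steps 1-3: connect, send the initial metadata, read the session metadata.
        await ws.send(json.dumps({"sampleRate": 8000}))  # plus API-dependent fields
        session = json.loads(await ws.recv())
        session_id = session["sessionId"]

        async def send_audio():
            # Step 4: binary frames containing 16-bit little-endian PCM audio.
            for chunk in audio_chunks:
                await ws.send(chunk)

        sender = asyncio.create_task(send_audio())
        try:
            # Step 5: transcript and result events arrive as JSON text frames.
            async for message in ws:
                event = json.loads(message)
                print(session_id, event)
                if event["type"] == "result":
                    pass  # act on the API-dependent result; break to stop early
        finally:
            sender.cancel()
        # Step 6: leaving the context manager closes the connection.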

Occasionally the service might close the WebSocket connection during normal operation. This does not mean that the audio stream cannot be processed further. In this case we recommend reconnecting and continuing to stream the audio.
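
One way to handle such a close is a simple retry loop that reconnects and continues with the audio that has not been sent yet. This is a rough sketch that assumes a variant of the run_session helper above which removes chunks from the shared queue as it sends them:

from collections import deque

import websockets


async def stream_with_reconnect(audio_chunks):
    """Reconnect and keep streaming until every audio chunk has been delivered."""
    pending = deque(audio_chunks)  # chunks are removed only after they have been sent
    while pending:
        try:
            # Hypothetical variant of run_session above that pops chunks from
            # `pending` as it sends them, so nothing is re-sent after a reconnect.
            await run_session(pending)
        except websockets.exceptions.ConnectionClosed:
            pass  # closed mid-stream during normal operation: reconnect and resume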

Errors

If an irrecoverable error occurs during the session, the server sends a WebSocket text frame with the error and then closes the connection.

If the initial text frame is invalid, the session flow is:

  1. Client initiates the WebSocket connection.
  2. Client sends an initial WebSocket text frame with JSON metadata.
  3. Server responds with a WebSocket text frame containing the error.
  4. Server closes the connection.

We recommend logging and monitoring the errors indicated by events with the error type, along with the ID of the session in which they occur.
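
As a sketch of this recommendation, a handler can pick out error events and log them together with the session ID. Only the type field is relied on here; the remaining fields of the error event are API-dependent:

import json
import logging

logger = logging.getLogger("voice_scam_protection")


def handle_server_frame(session_id, message):
    """Parse a server text frame and log error events for monitoring."""
    event = json.loads(message)
    if event["type"] == "error":
        # Log the whole event, since the error fields are API-dependent.
        logger.error("session %s: error event %s", session_id, event)
    return event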

Input metadata format

The initial text frame the client sends contains a JSON object. Depending on the API, this JSON object contains metadata about the audio stream and the call.

API                 Schema                       Description
Detect Scam Call    Detect Scam Call metadata    Audio and call metadata.
Detect Scam Dialog  Detect Scam Dialog metadata  Audio metadata.

Output event format

The output events that the server sends are always WebSocket text frames containing a JSON object. This JSON object contains a type field, which indicates which subtype the event belongs to.

type field         Description
sessionMetadata    The session has been established successfully.
result             Processing the audio stream has yielded a result. The content and schema of these events differ for each API.
transcript         A section of the audio stream has been transcribed.
error              An error occurred either when processing the initial metadata, or later during an established session.

As the result event is different for each API, the overall schema of the event object also depends on the API.

API                 Schema
Detect Scam Call    Detect Scam Call event
Detect Scam Dialog  Detect Scam Dialog event
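
As a sketch, a client can dispatch on the type field before touching any API-specific fields. The sessionId, speaker, and transcript fields below are taken from the example session above; the result payload is kept opaque because its schema depends on the API:

import json


def dispatch_event(message):
    """Route a server text frame by the type field of its JSON object."""
    event = json.loads(message)
    kind = event["type"]
    if kind == "sessionMetadata":
        return ("session", event["sessionId"])
    if kind == "transcript":
        return ("transcript", event["speaker"], event["transcript"])
    if kind == "result":
        return ("result", event)  # API-dependent schema: keep the whole object
    if kind == "error":
        return ("error", event)
    raise ValueError(f"unexpected event type: {kind}")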

Audio stream format

The audio stream must meet the following requirements:

  • The format of the binary audio frames is 16-bit little-endian linear PCM raw audio.
  • The audio has two channels: the audio of the user protected by the service, and the audio of the other party. For inbound phone calls, these are the recipient and the caller respectively. It is configurable which audio channel corresponds to which party.
  • The audio is uncompressed.
  • The sample rate of the audio is one of the following: 8000, 16000, 24000, 44100, 48000 or 64000 Hz.
  • For the best results, send the audio at real-time speed, i.e. 10 seconds' worth of audio over 10 real-time seconds. The acceptable range of audio speed is 80–120% of real time (see the sketch after this list).
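
As an illustration, two mono int16 streams can be interleaved into the required two-channel 16-bit little-endian PCM and paced at roughly real time. This is a sketch that assumes 8000 Hz audio, equal-length NumPy int16 arrays, and an already established WebSocket connection ws:

import asyncio

import numpy as np

SAMPLE_RATE = 8000    # one of the supported sample rates
CHUNK_SECONDS = 0.1   # send 100 ms of audio per binary frame


def interleave(user_samples, other_samples):
    """Interleave two mono int16 streams into 2-channel 16-bit little-endian PCM."""
    stereo = np.empty(user_samples.size + other_samples.size, dtype="<i2")
    stereo[0::2] = user_samples   # channel of the protected user (assignment is configurable)
    stereo[1::2] = other_samples  # channel of the other party
    return stereo.tobytes()


async def send_realtime(ws, user_samples, other_samples):
    """Send the interleaved audio paced at real time (within the 80-120% tolerance)."""
    samples_per_chunk = int(SAMPLE_RATE * CHUNK_SECONDS)
    for start in range(0, len(user_samples), samples_per_chunk):
        end = start + samples_per_chunk
        await ws.send(interleave(user_samples[start:end], other_samples[start:end]))
        await asyncio.sleep(CHUNK_SECONDS)  # keep the stream at roughly real time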

Reporting

The Send Report API allows you to send feedback regarding the expected result of a session. On the reporting API you refer to a session by its unique identifier, which you receive in the initial session metadata event of the voice scam protection APIs.

Additionally, the reporting API allows you to submit the transcript of the session to improve the quality of the feedback. This transcript can be accumulated using the following flow, sketched in code after the list:

  1. When you establish a session on the voice scam protection API, you receive the unique identifier of the session.
  2. As the audio stream is processed, you receive continuous chunks of the transcript in addition to the voice scam risk results.
  3. Accumulating these transcript chunks yields the transcript of the session.
  4. On the reporting API you can specify both the session ID and the accumulated transcript.
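
A minimal sketch of this accumulation, keyed by session ID; the Send Report request itself is API-specific and not shown:

from collections import defaultdict

# Transcript chunks accumulated per session ID (step 1 gives the ID).
transcripts = defaultdict(list)


def on_transcript_event(session_id, event):
    """Append each transcript chunk as it arrives (steps 2 and 3)."""
    if event["type"] == "transcript":
        transcripts[session_id].append(f'{event["speaker"]}: {event["transcript"]}')


def accumulated_transcript(session_id):
    """The full transcript to include alongside the session ID in the report (step 4)."""
    return "\n".join(transcripts[session_id])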