Integration Guide

Overview

Phone scammers often instill a sense of fear and urgency, making the user nervous and less able to recognize the scam. Hiya’s Voice Scam Protection analyzes a 2-channel conversation in real time and warns the user when patterns in the conversation indicate that a scam is likely occurring. The API is designed so that developers can build this capability into their own applications to protect users against scam attempts in voice calls. The system returns 2 warning levels:

  1. Information warning: This output is intended to notify the user that personal or sensitive information appears to be shared, or is about to be shared. It is meant simply as a reminder, and it may be triggered even when the call does not involve a scam.
  2. Scam warning: This output is intended to notify the user that this conversation is typical of real scam conversations and the user may want to end the conversation.

The service is expected to be used with a clear and transparent opt-in, so that the user knows that the audio of the conversation is being analyzed. Audio and transcript content is stored only for the minimum time needed for analysis and reporting, and is then deleted.

Voice Scam Protection User Interface

The intended user interface consists of two parts: 1) real-time warnings that alert the user to a potential scam, and 2) controls for reporting calls as ‘scam’ or ‘not scam’.

  1. There are two types of warning, the ‘information warning’ and the ‘scam warning’. The intended user interface for each warning is described below:

    1. ‘Information warning’ (displayed to the user as a ‘Privacy warning’):
      1. The phone should vibrate indicating to the user that information is available on the phone screen.
      2. The phone screen should have a yellow pop-up that says something similar to

        Privacy warning
        It appears that sensitive information may be shared in this conversation. Please stop and think, and make sure you know and trust the person you are talking with.
        Use caution when sharing sensitive information. If you’re unsure, end your call.

      3. This pop-up should have ‘dismiss’, ‘hang up’ and ‘hang up and report scam’ buttons available to the user.
    2. ‘Scam warning’:
      1. The phone should vibrate indicating to the user that information is available on the phone screen.
      2. The phone screen should have a red pop-up that says something similar to

        Potential scam call
        We have detected language or tactics that are often used by scammers.
        Use caution when sharing sensitive information or following their instructions. If you’re unsure, end your call.

      3. This pop-up should have ‘dismiss’, ‘hang up’ and ‘hang up and report scam’ buttons available to the user. If the user chooses the ‘hang up and report scam’ button, the system should call the reporting API to report this scam.
  2. The user interface for reporting calls as ‘scam’ or ‘not scam’.
    If a call log is available in the dialer, it should show an informational banner next to each flagged call indicating whether it was a scam call. Each call in the call log should also have an adjacent option for reporting the call. Which option is shown depends on whether the call was identified as a scam in real time. The 2 cases are below:

    1. Calls not identified as scams:
      These entries should have a button/option entitled ‘report as scam’. If the user selects this option, the system should call the reporting API indicating this call was a scam.
    2. Calls identified as scam:
      These entries should have a button/option entitled ‘report as not scam’. If the user selects this option, the system should call the reporting API to notify the system that this call was not a scam.
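The two call-log cases above reduce to a single decision on whether the call was flagged in real time. The following is a minimal sketch of that decision, not part of the API: the `CallLogEntry` type and the `feedback` values are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallLogEntry:
    session_id: str        # hypothetical: ID of the detection session for this call
    flagged_as_scam: bool  # True if a scam warning was raised in real time


def report_option(entry: CallLogEntry) -> dict:
    """Return the label to display and the feedback to report if it is selected."""
    if entry.flagged_as_scam:
        # Flagged calls offer a way to report a false positive.
        return {"label": "Report as not scam", "feedback": "not_scam"}
    # Unflagged calls offer a way to report a missed scam.
    return {"label": "Report as scam", "feedback": "scam"}
```

When the user selects the option, the application would pass the entry's session ID and the chosen feedback to the reporting API.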

Voice Scam Protection APIs summary and interactions

At a high level, the API consists of 2 parts:

  1. A detection API that streams audio and other information for analysis. The API optionally uses additional information about the call when it is available, such as whether the caller is in the user’s Contacts database, and some elements of the SIP headers. The API outputs events indicating ‘scam’, as well as events indicating whether the target is using synthetic (AI-generated) audio. Two detection APIs are available. The choice of which API to use depends on the data available to the application that uses the API:
    1. Detect Scam Call: Use this API if optional call data is available.
    2. Detect Scam Dialog: Use this API if only dialog audio is available.
  2. A reporting API that allows the user to report a scam that the system did not detect, or to report a non-scam conversation that the system erroneously detected:
    1. Report Session: This API is called when a user reports a scam that the system missed, or a ‘not-scam’ (a false positive detection). This reporting helps us improve the performance of the system. The API accepts a session ID and an optional transcript. The transcript should be passed to Hiya if the system is configured not to retain any audio (the default).

Voice Scam Protection APIs

API usage

The voice scam detection APIs are bidirectional, full-duplex Websocket endpoints consuming an audio stream. As the audio is streamed into the endpoint, the results of the analysis are streamed back to the client simultaneously, in real time.

The interaction pattern for the voice scam detection APIs is as follows:

  1. You initiate the Websocket connection to the endpoint.
  2. You send an initial Websocket text frame with JSON metadata. See the API documentation for the schema of this text frame for each individual endpoint.
  3. The service responds with an initial Websocket text frame, to indicate that the session has been established successfully.
  4. You send chunks of audio data in Websocket binary frames. See the API documentation for the format of the audio stream.
  5. Simultaneously the service sends events containing the results in JSON Websocket text frames.
  6. When the audio stream ends, or you determine that you need no further processing based on the results already received, you close the Websocket connection.
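The six steps above can be sketched as a small asyncio routine. This is an illustrative sketch under stated assumptions, not the official client: `ws` stands for an already-open connection object offering `send`, `recv`, `close` and async iteration (the shape provided by the third-party `websockets` package), the metadata contents are left to the caller because the schema is defined in the API documentation, and pacing of the audio is omitted for brevity.

```python
import asyncio
import json


async def run_session(ws, metadata, audio_chunks, on_event):
    """Drive one detection session over an already-open connection (step 1)."""
    # Step 2: send the initial text frame with JSON metadata.
    await ws.send(json.dumps(metadata))
    # Step 3: wait for the service's initial text frame.
    on_event(json.loads(await ws.recv()))

    async def send_audio():
        # Step 4: audio chunks go out as binary frames (bytes objects).
        for chunk in audio_chunks:
            await ws.send(chunk)
        # Step 6: the audio stream has ended, so close the connection.
        await ws.close()

    async def receive_events():
        # Step 5: results arrive as JSON text frames while audio is being sent.
        async for message in ws:
            on_event(json.loads(message))

    # Run sending and receiving concurrently, as the endpoint is full-duplex.
    await asyncio.gather(send_audio(), receive_events())
```

Taking the connection as a parameter keeps the flow independent of any particular WebSocket library and lets it be exercised against an in-memory stand-in.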

Voice Scam Protection API flow

The individual endpoints perform scam detection using different factors:

  • Detect Scam Call: the dialog content, whether the caller voice is synthetic, the reputation of the call originator.
  • Detect Scam Dialog: the dialog content, i.e. the words spoken by both parties of the call.

API details

The events provided by the endpoints belong to one of several event types. These event types can be distinguished based on the “type” field of the events.

  • sessionMetadata: A reply to the initial metadata text frame sent by the client. It indicates that the session has been established successfully.
  • result: As the audio stream is analyzed, a voice scam detection result is sent to the client periodically. See the API documentation for more information on the contents and schema for each API.
  • transcript: As the audio stream is transcribed, chunks of the transcript are reported to the client. These can be accumulated by the client to add more context to reports.
  • error: Signals an error either when establishing the session or during the processing of the audio stream. For fatal errors, the server closes the connection after sending an error event.
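A client can route incoming events on the “type” field. The sketch below is one possible way to do this and makes few assumptions: only the “type” field and the four type names come from this guide, while the `state` dictionary layout and the logger name are hypothetical. The remaining fields of each event are defined in the API documentation and are stored untouched here.

```python
import json
import logging

logger = logging.getLogger("voice_scam_client")  # hypothetical logger name


def new_state() -> dict:
    """Create an empty per-session state container (illustrative layout)."""
    return {"established": False, "results": [], "transcripts": [], "errors": []}


def handle_event(raw_frame: str, state: dict) -> str:
    """Parse one JSON text frame, record it by event type, and return the type."""
    event = json.loads(raw_frame)
    kind = event.get("type", "")
    if kind == "sessionMetadata":
        state["established"] = True
    elif kind == "result":
        state["results"].append(event)
    elif kind == "transcript":
        state["transcripts"].append(event)
    elif kind == "error":
        state["errors"].append(event)
        logger.error("voice scam detection error event: %s", event)
    else:
        # Tolerate event types this client does not know about.
        logger.warning("ignoring event with unrecognized type: %r", kind)
    return kind
```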

For best results, we recommend the following practices when interacting with the voice scam detection APIs.

  • Sending audio at real-time speed: The audio should be sent at the rate at which it would be played back, e.g. the audio contained in the Websocket frames sent over the course of 10 real-time seconds should add up to 10 seconds of audio. The acceptable range of audio speed is 80-120% of real time.
  • Error logging: we recommend logging and monitoring the errors indicated by the events with the “error” type.
  • Handling closed connections: occasionally the service might close the Websocket connection during normal operation. This does not indicate that the audio stream cannot be further processed. We recommend that you reconnect and continue streaming the audio in this case.
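The real-time pacing recommendation can be sketched as follows. The audio format is an assumption made only for this example (16-bit PCM, 2 channels); the actual stream format is specified in the API documentation. The `send` argument is any async callable that transmits one binary frame, and the `sleep` and `clock` parameters exist so the pacing can be tested without waiting.

```python
import asyncio
import time


def chunk_seconds(num_bytes, sample_rate_hz, bytes_per_sample=2, channels=2):
    """Duration of a raw PCM chunk in seconds (format defaults are assumptions)."""
    return num_bytes / (sample_rate_hz * bytes_per_sample * channels)


def within_accepted_speed(audio_seconds, wall_seconds):
    """True if audio was sent at 80-120% of real time, the range stated above."""
    return 0.8 <= audio_seconds / wall_seconds <= 1.2


async def send_paced(send, chunks, sample_rate_hz,
                     sleep=asyncio.sleep, clock=time.monotonic):
    """Send chunks so that total audio sent tracks elapsed wall-clock time."""
    start = clock()
    audio_sent = 0.0
    for chunk in chunks:
        await send(chunk)
        audio_sent += chunk_seconds(len(chunk), sample_rate_hz)
        # Wait until wall-clock time catches up with the audio sent so far.
        delay = start + audio_sent - clock()
        if delay > 0:
            await sleep(delay)
```

Pacing against the start time, rather than sleeping a fixed interval per chunk, keeps small timing errors from accumulating over a long call.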

Reporting API

API usage and details

The Report Session API allows you to send feedback regarding the expected result in a session. On the reporting API you can refer to a session by its unique identifier, which you receive in the first session metadata event on the voice scam detection APIs.

Additionally, the Report Session API allows you to send in the transcript of the session to improve the quality of the feedback. This transcript can be accumulated using the following flow:

  1. When you establish a session on the voice scam detection API, you receive the unique identifier of the session.
  2. As the audio stream is processed, you receive continuous chunks of the transcript in addition to the voice scam risk results.
  3. Accumulating these transcript chunks yields the transcript of the session.
  4. On the reporting API you specify both the session ID and the accumulated transcript.
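The accumulation flow above might look like the following sketch. Apart from the “type” field and the `sessionMetadata` and `transcript` type names, every field name here (`sessionId`, `text`, `feedback`, `transcript`) is a hypothetical placeholder; the real event and request schemas are given in the API documentation.

```python
import json


class SessionRecorder:
    """Collects the session ID and transcript chunks from detection events."""

    def __init__(self):
        self.session_id = None
        self.transcript_parts = []

    def observe(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "sessionMetadata":
            # Step 1: remember the session identifier (hypothetical field name).
            self.session_id = event.get("sessionId")
        elif kind == "transcript":
            # Steps 2-3: accumulate transcript chunks (hypothetical field name).
            self.transcript_parts.append(event.get("text", ""))

    def report_body(self, feedback: str) -> str:
        """Step 4: build a JSON body with the session ID and the transcript."""
        if self.session_id is None:
            raise ValueError("no session has been established yet")
        return json.dumps({
            "sessionId": self.session_id,
            "feedback": feedback,  # e.g. "scam" or "not_scam" (hypothetical values)
            "transcript": " ".join(self.transcript_parts),
        })
```

Feeding every received event to `observe` during the call means the report can be built later, after the connection has closed.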

Data Retention

By default, the system does not retain any audio or transcripts after the call. To support reporting of missed scams and of incorrectly raised scam or information warnings, please retain the transcript on the client as described above (see ‘Reporting API’). This allows the client to send the transcript to Hiya, with the user’s permission, for scam calls. To simplify integration, the system can also retain the audio and transcript for a short period of time (typically 30 minutes), so that reporting can be implemented without the complexity of saving call transcripts and sending them back to Hiya. Please contact Hiya if you would like to activate this option.

Demo Implementation

See the sample code for a Python script that illustrates how to use the Detect Scam Call API.