
Voice LLM agent

Case study categories: AI, voice, NLP, telecom


Voice-controlled Large Language Model agent for telephony applications

Automated systems have been answering phone calls for decades. The latest development in this field is to combine Large Language Models with advanced speech-to-text and text-to-speech systems to enable natural human/machine dialogue.

While many off-the-shelf products cover typical use cases such as company IVRs and call centers, building a custom, highly scalable solution with carrier-grade availability is a non-trivial exercise.

The problem

As part of the development of a larger system, our task was to set up a voice-based LLM agent that interacts with our data and is accessible both from the PSTN (i.e. it accepts normal voice calls) and from IP-based clients (i.e. mobile or web apps).

Core requirements:

  • SIP and WebRTC endpoints
  • Low-latency speech recognition and synthesis in English
  • High availability and resource utilization efficiency
  • Pluggable (uncoupled) LLM module
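
The pluggable LLM requirement essentially means that the rest of the pipeline should talk to the model only through a narrow interface. A minimal Python sketch of such an interface (the class and method names are illustrative, not taken from the actual system):

    from abc import ABC, abstractmethod

    class LLMBackend(ABC):
        """Narrow interface the dialog pipeline depends on; concrete backends stay swappable."""

        @abstractmethod
        def generate_reply(self, transcript: str, history: list[str]) -> str:
            """Map a recognized utterance plus dialog history to a text reply."""

    class CannedBackend(LLMBackend):
        """Trivial stand-in backend, handy for testing the pipeline before a real model is wired in."""

        def generate_reply(self, transcript: str, history: list[str]) -> str:
            return "Sorry, could you repeat that?"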

The solution

[Diagram: Voice LLM Agent Architecture]

For scalability and high-availability reasons, we decided to build the solution around a message queue. We went with RabbitMQ, which supports all the necessary messaging primitives and offers clustering.

As can be seen in the diagram, Asterisk serves as the frontend for SIP and WebRTC traffic, which is then translated into RTP and, after Voice Activity Detection (VAD), ends up on the message bus.
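
A rough sketch of the hand-off onto the message bus, assuming a pika-based publisher; the queue name and message schema below are illustrative, and the real topology is omitted here:

    import json
    import pika

    AUDIO_QUEUE = "asr.audio_chunks"  # illustrative; the real setup uses per-stage queues

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue=AUDIO_QUEUE, durable=True)

    def publish_utterance(call_id: str, pcm_bytes: bytes) -> None:
        """Publish one VAD-delimited chunk of audio so an ASR worker can pick it up."""
        channel.basic_publish(
            exchange="",
            routing_key=AUDIO_QUEUE,
            body=json.dumps({"call_id": call_id, "audio": pcm_bytes.hex()}),
            properties=pika.BasicProperties(delivery_mode=2),  # survive broker restarts
        )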

Nvidia’s Triton Inference Server hosts all the machine learning models, split across the available GPUs. Dynamic batching is enabled, and with sufficient traffic we observe good GPU utilization. In our case, the models running on Triton are:

  • Whisper for speech recognition,
  • Tortoise TTS for synthesis and
  • a custom fine-tune of LLAMA2 for NLP.

To actually run inference on Triton, there is a wrapper for each model that takes RabbitMQ messages as input, calls the corresponding model and returns the output back to the message queue.
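
As a concrete illustration, the speech-recognition wrapper could look roughly like the sketch below, using tritonclient's gRPC API together with pika; the queue names and the model's tensor names ("AUDIO", "TRANSCRIPT") are placeholders, since the real model configuration is not shown here.

    import json
    import numpy as np
    import pika
    import tritonclient.grpc as grpcclient

    triton = grpcclient.InferenceServerClient(url="triton:8001")
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="asr.audio_chunks", durable=True)
    channel.queue_declare(queue="nlp.transcripts", durable=True)

    def on_message(ch, method, properties, body):
        msg = json.loads(body)
        audio = np.frombuffer(bytes.fromhex(msg["audio"]), dtype=np.float32)

        # Hand the audio to the Whisper model hosted on Triton.
        inp = grpcclient.InferInput("AUDIO", [1, audio.shape[0]], "FP32")
        inp.set_data_from_numpy(audio[np.newaxis, :])
        result = triton.infer(model_name="whisper", inputs=[inp])
        transcript = result.as_numpy("TRANSCRIPT")[0].decode()

        # Push the transcript to the next stage (the LLM wrapper) via the bus.
        ch.basic_publish(exchange="", routing_key="nlp.transcripts",
                         body=json.dumps({"call_id": msg["call_id"], "text": transcript}))
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="asr.audio_chunks", on_message_callback=on_message)
    channel.start_consuming()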

Note that there is one more ML model, for VAD, which runs on CPU in ONNX format. We did not move it to Triton: it is efficient enough on CPU, and since it is invoked very frequently (to detect silence in small audio chunks), latency would suffer because network round-trip times would dominate.
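
A minimal sketch of what running the VAD model on CPU with onnxruntime can look like; the model file name and the assumption of a single scalar speech-probability output are illustrative, as the case study does not name the specific model:

    import numpy as np
    import onnxruntime as ort

    # "vad.onnx" is a placeholder for whatever VAD model is exported to ONNX.
    session = ort.InferenceSession("vad.onnx", providers=["CPUExecutionProvider"])

    def is_speech(chunk: np.ndarray, threshold: float = 0.5) -> bool:
        """Score one small float32 PCM chunk and compare against a speech threshold."""
        inputs = {session.get_inputs()[0].name: chunk[np.newaxis, :]}
        speech_prob = session.run(None, inputs)[0].item()
        return speech_prob > threshold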

All the pieces, including the Triton server, RabbitMQ, Asterisk, the business logic and the model wrappers, run as containers on a Kubernetes cluster. The number of replicas of each component is configured as needed to handle the expected traffic and/or the desired level of fault tolerance.
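
Adjusting those replica counts can be done with the usual Kubernetes tooling; as a small sketch, here is how it might look with the official kubernetes Python client (the deployment and namespace names are illustrative):

    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
    apps = client.AppsV1Api()

    def scale(deployment: str, replicas: int, namespace: str = "voice-agent") -> None:
        """Set the replica count of one component, e.g. to add ASR workers before a traffic peak."""
        apps.patch_namespaced_deployment_scale(
            name=deployment,
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )

    scale("asr-wrapper", 4)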

The results we observed were interesting: the latency is comparable to talking with a person standing at a reasonable distance, and the LLM response starts less than a second after the speaker finishes talking.

Disclaimer: due to the proprietary nature of the work done for our customers and employers, the case studies are merely inspired by that work, are presented at a very high level, and some sensitive details have been changed or omitted.

Interested in what you see?

If you are inspired by what you see and want to create something with our help, don't hesitate to reach out. Get in touch

Start your journey with us

We know that working with new partners is difficult and risky. To make this first step easier, we are happy to offer a no-commitment, free consultation* with one of our engineers when you first reach out.

Start Simple, Scale At Your Own Pace

Your Central European Software Services Partner
