
Voice LLM agent

Case study categories: AI, voice, NLP, telecom


Voice-controlled Large Language Model agent for telephony applications

Automated systems have been answering phone calls for decades. The latest development in this field is to combine Large Language Models with advanced speech-to-text and text-to-speech systems to enable natural human/machine dialogue.

While many off-the-shelf products cover typical use cases such as company IVRs and call centers, building a custom, highly scalable solution with carrier-grade availability is a non-trivial exercise.

The problem

As part of the development of a larger system, our task was to set up a voice-based LLM agent that interacts with our data and is accessible both from the PSTN (i.e. it accepts normal voice calls) and from IP-based clients (i.e. mobile or web apps).

Core requirements:

  • SIP and WebRTC endpoints
  • Low-latency speech recognition and synthesis in English
  • High availability and resource utilization efficiency
  • Pluggable (uncoupled) LLM module
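
The pluggable LLM requirement essentially means that the rest of the pipeline should talk to the model only through a narrow interface. A minimal Python sketch of such an interface (the class and method names are illustrative, not taken from the actual system):

    from abc import ABC, abstractmethod

    class LLMBackend(ABC):
        """Narrow interface the dialog pipeline depends on; concrete backends stay swappable."""

        @abstractmethod
        def generate_reply(self, transcript: str, history: list[str]) -> str:
            """Map a recognized utterance plus dialog history to a text reply."""

    class CannedBackend(LLMBackend):
        """Trivial stand-in backend, handy for testing the pipeline before a real model is wired in."""

        def generate_reply(self, transcript: str, history: list[str]) -> str:
            return "Sorry, could you repeat that?"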

The solution

[Diagram: Voice LLM Agent Architecture]

For scalability and high-availability reasons, we decided to build the solution around a message queue. We went with RabbitMQ, which supports all the necessary messaging primitives and offers clustering.

As can be seen in the diagram, Asterisk serves as the frontend for SIP and WebRTC traffic, which is then translated into RTP and, after Voice Activity Detection (VAD), ends up on the message bus.
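
A rough sketch of the hand-off onto the message bus, assuming a pika-based publisher; the queue name and message schema below are illustrative, and the real topology is omitted here:

    import json
    import pika

    AUDIO_QUEUE = "asr.audio_chunks"  # illustrative; the real setup uses per-stage queues

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue=AUDIO_QUEUE, durable=True)

    def publish_utterance(call_id: str, pcm_bytes: bytes) -> None:
        """Publish one VAD-delimited chunk of audio so an ASR worker can pick it up."""
        channel.basic_publish(
            exchange="",
            routing_key=AUDIO_QUEUE,
            body=json.dumps({"call_id": call_id, "audio": pcm_bytes.hex()}),
            properties=pika.BasicProperties(delivery_mode=2),  # survive broker restarts
        )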

Nvidia’s Triton Inference Server hosts all the machine learning models, split across the available GPUs. Dynamic batching is enabled, and with sufficient traffic we observe good GPU utilization. In our case, the models running on Triton are:

  • Whisper for speech recognition,
  • Tortoise TTS for synthesis and
  • a custom fine-tune of LLAMA2 for NLP.

To actually run inference on Triton, there is a wrapper for each model that takes RabbitMQ messages as input, calls the corresponding model and returns the output back to the message queue.
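
As a concrete illustration, the speech-recognition wrapper could look roughly like the sketch below, using tritonclient's gRPC API together with pika; the queue names and the model's tensor names ("AUDIO", "TRANSCRIPT") are placeholders, since the real model configuration is not shown here.

    import json
    import numpy as np
    import pika
    import tritonclient.grpc as grpcclient

    triton = grpcclient.InferenceServerClient(url="triton:8001")
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="asr.audio_chunks", durable=True)
    channel.queue_declare(queue="nlp.transcripts", durable=True)

    def on_message(ch, method, properties, body):
        msg = json.loads(body)
        audio = np.frombuffer(bytes.fromhex(msg["audio"]), dtype=np.float32)

        # Hand the audio to the Whisper model hosted on Triton.
        inp = grpcclient.InferInput("AUDIO", [1, audio.shape[0]], "FP32")
        inp.set_data_from_numpy(audio[np.newaxis, :])
        result = triton.infer(model_name="whisper", inputs=[inp])
        transcript = result.as_numpy("TRANSCRIPT")[0].decode()

        # Push the transcript to the next stage (the LLM wrapper) via the bus.
        ch.basic_publish(exchange="", routing_key="nlp.transcripts",
                         body=json.dumps({"call_id": msg["call_id"], "text": transcript}))
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="asr.audio_chunks", on_message_callback=on_message)
    channel.start_consuming()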

Note that there is one more ML model, for VAD, which runs on CPU in ONNX format. We did not move it to Triton: it is efficient enough on CPU, and since it is invoked very frequently (to detect silence in small audio chunks), latency would suffer because network round-trip times would dominate.
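
A minimal sketch of what running the VAD model on CPU with onnxruntime can look like; the model file name and the assumption of a single scalar speech-probability output are illustrative, as the case study does not name the specific model:

    import numpy as np
    import onnxruntime as ort

    # "vad.onnx" is a placeholder for whatever VAD model is exported to ONNX.
    session = ort.InferenceSession("vad.onnx", providers=["CPUExecutionProvider"])

    def is_speech(chunk: np.ndarray, threshold: float = 0.5) -> bool:
        """Score one small float32 PCM chunk and compare against a speech threshold."""
        inputs = {session.get_inputs()[0].name: chunk[np.newaxis, :]}
        speech_prob = session.run(None, inputs)[0].item()
        return speech_prob > threshold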

All the pieces, including the Triton server, RabbitMQ, Asterisk, the business logic and the model wrappers, run as containers on a Kubernetes cluster. The number of replicas of each component is configured as needed to handle the expected traffic and/or the desired level of fault tolerance.
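
Adjusting those replica counts can be done with the usual Kubernetes tooling; as a small sketch, here is how it might look with the official kubernetes Python client (the deployment and namespace names are illustrative):

    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
    apps = client.AppsV1Api()

    def scale(deployment: str, replicas: int, namespace: str = "voice-agent") -> None:
        """Set the replica count of one component, e.g. to add ASR workers before a traffic peak."""
        apps.patch_namespaced_deployment_scale(
            name=deployment,
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )

    scale("asr-wrapper", 4)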

The results we observed were interesting: the latency is comparable to talking with a person standing at a reasonable distance, and the LLM response starts less than a second after the speaker finishes talking.

Disclaimer: due to the proprietary nature of the work done for our customers and employers, the case studies are merely inspired by that work, are presented at a very high level, and some sensitive details have been changed or omitted.

Interested in what you see?

If you are inspired by what you see and want to create something with our help, don't hesitate to reach out. Get in touch

Start your journey with us

We know that working with new partners is difficult and risky. To make this first step easier, we are happy to offer a no-commitment, free consultation* with one of our engineers when you first reach out.

Start Simple, Scale At Your Own Pace

Your Central European Software Services Partner
