LOW-LATENCY FLUX TRANSFORMER MODEL FOR REAL-TIME VOICE ASSISTANTS

Authors

  • N.M. Zhunissov International Kazakh-Turkish University named after Khoja Ahmed Yasawi, Turkistan, Kazakhstan
  • A.B. Aben International Kazakh-Turkish University named after Khoja Ahmed Yasawi, Turkistan, Kazakhstan
  • M. Khiniyazov International Kazakh-Turkish University named after Khoja Ahmed Yasawi, Turkistan, Kazakhstan

DOI:

https://doi.org/10.55956/KLLF9699

Keywords:

low-latency ASR, real-time voice assistant, small whisper, stream transformer, Kazakh voice assistant, Vosk, Gemini AI, JARVIS

Abstract

This paper investigates low-latency automatic speech recognition (ASR) systems for real-time voice assistants, in particular the Voice Assistant for Resource-poor Languages (JARVIS). The Tiny Whisper and Streaming Transformer models are evaluated for their low computational cost and high accuracy, and the performance of a Kazakh voice assistant powered by AlphaCephei’s Vosk model and Google’s Gemini AI is analyzed. The study uses LibriSpeech, Common Voice, and a specially collected 200-hour Kazakh speech dataset. Experiments show that Tiny Whisper is effective on edge devices, that the Streaming Transformer provides low latency in streaming scenarios, and that the Kazakh voice assistant (JARVIS) increases language accessibility in digital environments. The proposed hybrid model combines the strengths of these technologies to deliver high accuracy and low latency in real-time applications. JARVIS performs system tasks such as launching the browser and controlling volume and brightness, and supports online information retrieval through Gemini AI integration. This study demonstrates the practical application of low-latency ASR systems and contributes to the technological development of resource-poor languages such as Kazakh.
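The abstract describes a two-path design: recognized utterances that match local system commands (browser launch, volume, brightness) are handled on-device for low latency, while other queries are forwarded to an online service (Gemini AI). A minimal sketch of that dispatch step is shown below; all names and command phrases are illustrative assumptions, not the authors' actual implementation.

```python
def dispatch(transcript: str) -> str:
    """Route a recognized utterance to a local action or to online retrieval.

    Hypothetical illustration of the dispatch logic implied by the abstract:
    local system tasks are matched by phrase, everything else falls through
    to an online query (Gemini AI in the paper).
    """
    text = transcript.lower().strip()

    # Local system tasks handled without network access (low-latency path).
    local_commands = {
        "open browser": "action:launch_browser",
        "volume up": "action:volume_up",
        "volume down": "action:volume_down",
        "brightness up": "action:brightness_up",
        "brightness down": "action:brightness_down",
    }
    for phrase, action in local_commands.items():
        if phrase in text:
            return action

    # Fallback: forward the query to the online assistant.
    return f"online_query:{text}"
```

In a full assistant, the returned action token would trigger the corresponding OS call, and the `online_query` branch would send the text to the remote model.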

Published online

2025-12-30

Section

Information and communication technologies