LOW-LATENCY FLUX TRANSFORMER MODEL FOR REAL-TIME VOICE ASSISTANTS
DOI: https://doi.org/10.55956/KLLF9699

Keywords: low-latency ASR, real-time voice assistant, Tiny Whisper, streaming transformer, Kazakh voice assistant, Vosk, Gemini AI, JARVIS

Abstract
This paper investigates low-latency automatic speech recognition (ASR) systems for real-time voice assistants, in particular the Voice Assistant for Resource-poor Languages (JARVIS). The Tiny Whisper and Streaming Transformer models are evaluated for their low computational cost and high accuracy, and the performance of a Kazakh voice assistant powered by AlphaCephei's Vosk model and Google's Gemini AI is analyzed. The study uses LibriSpeech, Common Voice, and a specially collected 200-hour Kazakh speech dataset. Experiments show that Tiny Whisper is effective on edge devices, that the Streaming Transformer provides low latency in streaming scenarios, and that the Kazakh voice assistant (JARVIS) improves language accessibility in digital environments. The proposed hybrid model combines the strengths of these technologies to deliver high accuracy and low latency in real-time applications. JARVIS performs system tasks such as launching the browser and controlling volume and brightness, and supports online information retrieval through Gemini AI integration. This study demonstrates the practical application of low-latency ASR systems and supports the technological development of resource-poor languages such as Kazakh.
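The abstract describes JARVIS routing recognized utterances either to local system tasks (browser, volume, brightness) or to Gemini AI for open-ended queries. A minimal sketch of such a command dispatcher is shown below; the keyword rules and action names are illustrative assumptions, not the paper's actual implementation, and a real system would feed it transcripts from Vosk's streaming recognizer:

```python
# Hypothetical command dispatcher for a JARVIS-style assistant.
# All keywords and action names here are illustrative assumptions;
# the paper's implementation details are not reproduced.

def dispatch(transcript: str) -> str:
    """Map a recognized utterance to an assistant action name."""
    text = transcript.lower()
    # Simple keyword rules for local system tasks; unmatched input
    # falls through to online retrieval (e.g. a Gemini AI query).
    rules = [
        ("browser", "launch_browser"),
        ("volume", "adjust_volume"),
        ("brightness", "adjust_brightness"),
    ]
    for keyword, action in rules:
        if keyword in text:
            return action
    return "ask_gemini"  # fallback: open-ended information retrieval

print(dispatch("open the browser"))     # launch_browser
print(dispatch("turn the volume up"))   # adjust_volume
print(dispatch("what's the weather?"))  # ask_gemini
```

Returning an action name rather than executing the task keeps the recognition and execution layers separate, which is one common way to structure such assistants.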
License
Copyright (c) 2025

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
