This thesis presents the development of an application for the Android mobile operating system that can run neural networks with a transformer architecture.
The application has been tested with the Whisper model for speech transcription and the Phi-2 model for natural language processing.
The practical part includes building the Android application with the Jetpack Compose framework
and integrating existing C++ libraries from GitHub.
It also covers training the Whisper-Small STT model and the Phi-2 LLM,
for which I prepared the ARTUR1.0 dataset with audio transcriptions and translated
the OASST1 dataset, then used both datasets for model training.
The mentioned models are executed in the application using the whisper.cpp and llama.cpp libraries.
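As a rough illustration of how such native engines are typically reached from Kotlin, the sketch below shows a minimal JNI bridge. The function names and signatures are hypothetical placeholders, not the actual whisper.cpp or llama.cpp bindings used in the project.

```kotlin
// Hypothetical JNI bridge; the real binding names in the project may differ.
object NativeBridge {
    init {
        // Shared libraries built from the C++ sources with the Android NDK.
        System.loadLibrary("whisper")
        System.loadLibrary("llama")
    }

    // Transcribe 16 kHz mono PCM samples with a Whisper model file (assumed signature).
    external fun whisperTranscribe(modelPath: String, pcm: FloatArray): String

    // Generate a response with the LLM, streaming tokens via the callback (assumed signature).
    external fun llamaGenerate(modelPath: String, prompt: String, onToken: (String) -> Unit)
}
```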
In the final version of the application, users can ask a question by speaking; the recorded audio is processed by the Whisper model and converted into text.
Alternatively, users can type their question into a dedicated text field. Once the text is submitted,
it is processed by the language model, which generates a response,
and the application displays this response below the user's question.
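A minimal Jetpack Compose sketch of this question-and-answer flow is shown below; `ChatScreen` and `generate` are illustrative names, and in the real application inference would run off the main thread rather than in the click handler.

```kotlin
import androidx.compose.foundation.layout.Column
import androidx.compose.material3.Button
import androidx.compose.material3.Text
import androidx.compose.material3.TextField
import androidx.compose.runtime.*

// Sketch of the UI flow; `generate` stands in for the LLM call.
@Composable
fun ChatScreen(generate: (String) -> String) {
    var question by remember { mutableStateOf("") }
    var answer by remember { mutableStateOf("") }

    Column {
        // The user types a question (or it arrives as a Whisper transcription).
        TextField(value = question, onValueChange = { question = it })
        Button(onClick = { answer = generate(question) }) { Text("Ask") }
        // The model's response is displayed below the question.
        Text(answer)
    }
}
```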
I evaluated the models on test datasets and compared them with existing online solutions.
For speech transcription, I made a more detailed comparison with the Govori.si application, developed in my mentor's laboratory.
Voice input is a safer alternative to typing in situations that require attention to one's surroundings.
The entered text is processed by the Phi-2 LLM, which I compared with ChatGPT, arguably the best-known LLM, and with the open-source Mistral model.
I measured the response time on my own phone, a Poco F3.
Speech transcription is quite slow, as is generating the first token of the response.
Subsequent tokens are generated faster, at approximately 2.6 tokens per second.
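For reference, figures like these can be obtained by timing the token callbacks. The following is a minimal sketch under the assumption that generation streams tokens through a callback; `measureGeneration` is a hypothetical helper, not code from the thesis.

```kotlin
// Illustrative timing of first-token latency and steady-state token rate.
fun measureGeneration(generate: (onToken: (String) -> Unit) -> Unit) {
    val start = System.nanoTime()
    var firstToken = 0L
    var tokens = 0

    generate {
        if (tokens == 0) firstToken = System.nanoTime()
        tokens++
    }
    val end = System.nanoTime()

    if (tokens > 1) {
        val ttftMs = (firstToken - start) / 1e6
        val rate = (tokens - 1) / ((end - firstToken) / 1e9)
        // On a Poco F3 this came out to roughly 2.6 tokens per second.
        println("first token after %.0f ms, then %.1f tokens/s".format(ttftMs, rate))
    }
}
```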