We deployed a machine learning model for recognizing spoken digits (0–9) on an STM32 microcontroller. This work covers the complete end-to-end process: one-second audio capture at 16 kHz on the STM32F769I-DISCOVERY board, signal preprocessing (DC offset removal, root-mean-square normalization, and clipping), implementation and training of a neural network on raw waveforms, and deployment of the trained model to the embedded device. The model was developed and trained in Python, exported to the Open Neural Network Exchange (ONNX) format, and converted into C code using STMicroelectronics’ Edge AI Developer Cloud. On the device, a chain of Timer 2 (TIM2), the Analog-to-Digital Converter (ADC), and Direct Memory Access (DMA) is configured to stream audio samples, and inference results are reported over the Universal Asynchronous Receiver-Transmitter (UART) interface. A custom dataset of 400 recordings (40 per digit) was collected and normalized to ensure consistent behavior between the training environment and the embedded runtime. Hardware testing achieved 70\% accuracy on sequences of spoken digits. Performance is primarily limited by the small, single-speaker dataset and the microphone quality. Potential improvements include expanding the dataset with recordings from multiple speakers and upgrading the recording hardware.
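The preprocessing chain applied to each one-second, 16 kHz capture can be summarized in a minimal Python sketch; the target RMS level and the clipping range below are illustrative assumptions, not values taken from this work.

```python
import numpy as np

SAMPLE_RATE = 16_000   # one second of audio at 16 kHz -> 16 000 samples
TARGET_RMS = 0.1       # hypothetical normalization target, for illustration only

def preprocess(waveform: np.ndarray) -> np.ndarray:
    """DC offset removal, RMS normalization, and clipping of a raw waveform."""
    x = waveform.astype(np.float32)
    x -= x.mean()                        # remove DC offset
    rms = np.sqrt(np.mean(x ** 2))
    if rms > 0:
        x *= TARGET_RMS / rms            # root-mean-square normalization
    return np.clip(x, -1.0, 1.0)         # clip to the assumed [-1, 1] range

# Example: a random buffer standing in for one second of ADC samples.
audio = np.random.randn(SAMPLE_RATE)
features = preprocess(audio)
```

Applying the same chain in training and on the device is what keeps the embedded runtime consistent with the Python environment.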