Note that using a microphone may compromise your privacy. Consider disconnecting or muting the microphone whenever you want privacy.
Giving the bot the ability to talk and to understand speech makes it easier to communicate without your hands on the keyboard. Once this is set up, you can talk with the bot using asynchronous voice messages on WhatsApp or your preferred communication app. You can send it a quick voice message asking it to do something, or ask it to report the status of something back to you in its own voice.
eSpeak Text To Speech
eSpeak is an open source text to speech project originally released in 2006. It is very robotic compared to more modern TTS solutions that have been trained on countless hours of actual human speech recordings. The robot vibe fits quite nicely for an LLM agent running on a Raspberry Pi. eSpeak is included in the default Raspberry Pi OS apt repositories, so it can be installed with a single command.
sudo apt install espeak
Once eSpeak is installed, it's easy to use it from the terminal with a command like this.
espeak "Hello world"
I told the bot this was installed and ready, and asked it to try it out and then add it as a tool to use later.
Here's a recording of what it said followed by the chat transcript.
Me: I installed `espeak` so that you can use it to speak out loud. Go ahead and try to use it to say hi.
Bot: It ran without errors, which means it should have spoken through the audio output. Let me ask the user if they heard it. Did you hear me? 🔊
Me: Yes, heard it loud and clear. Perfect! You can go ahead and save that in your tools file so you use your voice later.
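The transcript doesn't show exactly what the bot saved to its tools file, but a simple Python wrapper along these lines would be enough to expose eSpeak as a callable tool. This is a minimal sketch; the speak() helper and the file name are assumptions, not the bot's actual code.

# speak_tool.py - a minimal sketch, not the bot's actual tools file entry.
import subprocess

def speak(text: str) -> None:
    """Speak text aloud through the Pi's audio output using eSpeak."""
    # check=True raises an error if espeak fails (e.g. no audio device found).
    subprocess.run(["espeak", text], check=True)

if __name__ == "__main__":
    speak("Hello world")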
Whisper Small Speech To Text
Next up is to enable speech to text so that we can talk to the bot using voice instead of text. There are plenty of remote speech to text APIs that could be used, but I wanted to see if it was possible to keep transcription local using the Whisper Small model. For this step, I decided to be more hands off, passing the reins over to the bot to see if it could complete the full installation and use of the whisper-small model on its own.
The first time it uses the whisper-small model, it must fetch the model files from the Hugging Face servers. It also takes a few minutes to download and install all of the necessary requirements from pip.
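The transcript doesn't show the exact install command the bot ran, but based on the packages it lists in its summary below, it would have been something along these lines from inside the venv:

pip install transformers torch torchaudio librosa soundfile accelerate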
Peeking into the full session logs, it's possible to see the bot's "thinking" steps, where it writes a message to itself about what has happened and what it should do next. Reviewing the logs of this session showed that the bot encountered a few different issues while writing the test script, but it was able to resolve them on its own.
Once everything was installed and the issues resolved, it was able to successfully transcribe the audio file. As the bot noted in its summary, transcribing a short audio message takes 30-60 seconds, so this is not good enough for real-time conversation or long form discussions. But it is impressive that the Pi can handle it locally for free, and it works great for short voice messages.
During the documented trial it ignored the prompt in the test audio recording to tell a joke, but I ran this step multiple times during testing and it did tell a joke during some of them. Here is one of the jokes:
"Why do programmers prefer dark mode? Because light attracts bugs!"
Me: Lets work on speech input. Can you please install the whisper-small model https://huggingface.co/openai/whisper-small and any requirements needed inside of your venv. I have put a test_audio_input.mp3 file in your workspace that you can try to transcribe with it in order to verify whether it's working.

Bot (Thinking): The installation completed successfully. Now let me create a test script to load the whisper-small model and transcribe the test audio file.
Bot: Packages installed successfully. Now let me create a test script to download the whisper-small model and transcribe your audio file:

Bot (Thinking): There's an error with num_frames. This seems to be a compatibility issue with the newer transformers version and the audio pipeline. Let me try a different approach - using the pipeline directly without the model preloading, or using a simpler API.
Bot: There's a compatibility issue with the pipeline. Let me try a simpler approach using the Whisper pipeline directly:

Bot (Thinking): Same error. This seems to be a bug in transformers 5.0.0. Let me try loading the audio manually and processing it. I'll need to install librosa or soundfile for audio loading.
Bot: Same error - this appears to be a bug in the new transformers version. Let me try a different approach with manual audio loading:

Bot (Thinking): Now let me update the test script to manually load the audio file and use the model directly rather than through the pipeline.
Bot: Now let me try a different approach using manual audio loading:

Bot: **Summary:**

✅ **Installed:**
- `transformers` (5.0.0), `torch` (2.10.0), `torchaudio` (2.10.0)
- `librosa`, `soundfile` for audio loading
- `accelerate` for model optimization

✅ **Model:** `openai/whisper-small` downloaded and cached

✅ **Test transcription successful:**
- Audio file: `test_audio_input.mp3` (5.8 seconds)
- Result: *"Hey PipBot, can you hear this? If you're able to hear it, can you tell us a joke?"*

**Performance note:** Running on CPU, transcription takes ~30-60 seconds for short clips. The Pi 5 handles it well, but it's not real-time. For the learn guide, this should work great for demonstrating voice input capabilities.

The test script is at `whisper_test.py` in your workspace, and I've added documentation to `TOOLS.md` for reference.
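The whisper_test.py script the bot generated isn't reproduced here, but based on its description (loading the audio manually with librosa and calling the model directly instead of going through the pipeline), a minimal sketch of an equivalent script looks something like this:

# whisper_test.py - a minimal sketch, not the exact script the bot wrote.
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

MODEL_ID = "openai/whisper-small"

# Load the processor (feature extractor + tokenizer) and the model weights.
# The first run downloads the model from Hugging Face and caches it locally.
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

# Whisper expects 16 kHz mono audio; librosa decodes and resamples the mp3.
audio, _ = librosa.load("test_audio_input.mp3", sr=16000, mono=True)

# Convert the waveform into the log-mel spectrogram features the model expects.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate token IDs on the CPU, then decode them back into text.
predicted_ids = model.generate(inputs.input_features)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(text)

Everything here runs on the Pi's CPU, which is where the 30-60 second transcription time for short clips comes from.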