Voice Activity Detection
With voice activity detection (VAD), the system listens to the mic continuously, but only starts ‘recording’ when it detects speech, and stops once silence has lasted for a set duration.
Without VAD:
- Wasted CPU/GPU transcribing long silences.
- Context window gets filled with “um… yeah… (5 seconds of silence)” instead of meaningful content.
- Risk of huge audio files and high processing latency.
With VAD:
- The bot reacts only to speech.
- Silences become natural ‘breaks’ for sending content to the LLM.
- You can better preserve meeting flow without constant interruptions.
To implement this, you can use an off-the-shelf VAD library.
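
To make the start/stop behaviour concrete, here is a minimal sketch of the gating logic in Python. It is not taken from any particular library: the speech check is a plain RMS energy threshold standing in for a real VAD model, and the names `rms`, `segment_speech`, `threshold` and `max_silence_frames` are all hypothetical. It assumes the audio arrives as fixed-size frames of 16-bit little-endian mono PCM.

```python
import math
import struct
from typing import Iterable, Iterator, List


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def segment_speech(
    frames: Iterable[bytes],
    threshold: float = 500.0,       # assumed energy cutoff; tune per microphone
    max_silence_frames: int = 25,   # assumed: silent frames that end a segment
) -> Iterator[bytes]:
    """Yield one chunk of audio per utterance, with trailing silence trimmed."""
    recording = False
    silent_run = 0
    buffer: List[bytes] = []

    for frame in frames:
        is_speech = rms(frame) >= threshold  # stand-in for a real VAD decision

        if not recording:
            # Idle: ignore silence, start buffering on the first speech frame.
            if is_speech:
                recording = True
                silent_run = 0
                buffer = [frame]
            continue

        buffer.append(frame)
        if is_speech:
            silent_run = 0
        else:
            silent_run += 1
            if silent_run >= max_silence_frames:
                # Silence lasted long enough: emit the segment without it.
                yield b"".join(buffer[: len(buffer) - silent_run])
                recording = False
                silent_run = 0
                buffer = []

    # Stream ended mid-utterance: flush what was captured so far.
    if recording:
        yield b"".join(buffer[: len(buffer) - silent_run])
```

Each chunk yielded by `segment_speech` is one utterance that could be transcribed and sent to the LLM at that natural break. With 20 ms frames, `max_silence_frames=25` would mean roughly half a second of silence ends a segment. A real VAD library would replace the `rms(frame) >= threshold` check; the surrounding state machine stays the same.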