👀 Live Demo: Voice Activity Detection
This demo runs a Voice Activity Detection (VAD) model based on FunASR's FSMN-VAD (see the Deep-FSMN paper).
VAD estimates the probability that speech is present in short audio frames. FSMN-VAD is a
lightweight (0.4M parameters), strictly causal architecture: its "memory" is a depthwise
convolution over past frames with no lookahead, which makes it a natural fit for streaming
and for tract's pulse transform.
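Once per-frame probabilities are available, a typical VAD turns them into speech segments with some smoothing. The sketch below is a hypothetical post-processing step using hysteresis (separate onset/offset thresholds); the threshold values and the `segments` helper are illustrative, not the exact logic this demo or FSMN-VAD ships with:

```python
def segments(probs, on=0.7, off=0.3):
    """Turn per-frame speech probabilities into (start, end) frame spans.

    Hysteresis: a segment opens when the probability rises above `on`
    and closes only when it falls below `off`, which suppresses flicker
    around a single threshold.
    """
    segs, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= on:
            start = i                      # speech onset
        elif start is not None and p < off:
            segs.append((start, i))        # speech offset
            start = None
    if start is not None:
        segs.append((start, len(probs)))   # still speaking at end of stream
    return segs

print(segments([0.1, 0.8, 0.9, 0.5, 0.2, 0.1, 0.75, 0.4]))
# → [(1, 4), (6, 8)]
```

Note how the frame at probability 0.5 stays inside the first segment: it is below the onset threshold but above the offset one.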
Did you know? In tract, a pulse is a small, fixed‑size
chunk on the time axis (here: 4 frames). The engine streams these pulses through
the graph, carrying just enough internal state to run causally without
replaying history. Latency is therefore bounded by the pulse size (ops must be
streamable or have only a small look‑ahead).
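The pulse idea can be sketched with a toy causal depthwise convolution, i.e. an FSMN-style memory over past frames. This is an illustrative NumPy model, not tract's actual implementation; the pulse size matches the demo's 4 frames, while the memory depth and feature dimension are arbitrary:

```python
import numpy as np

PULSE = 4   # frames per pulse (matches the demo; illustrative)
ORDER = 8   # memory depth of the causal depthwise conv (arbitrary)
DIM = 16    # feature dimension (arbitrary)

rng = np.random.default_rng(0)
taps = rng.standard_normal((ORDER, DIM))  # per-channel (depthwise) filter taps
x = rng.standard_normal((32, DIM))        # 32 frames of input features

def batch_causal_conv(x):
    """Whole-sequence reference: y[t] = sum_k taps[k] * x[t - k], zero-padded past."""
    padded = np.vstack([np.zeros((ORDER - 1, DIM)), x])
    return np.stack([(padded[t:t + ORDER][::-1] * taps).sum(axis=0)
                     for t in range(len(x))])

class PulsedConv:
    """Streaming version: carries the last ORDER-1 frames as state instead of
    replaying history — the same idea as tract's pulse transform for causal ops."""
    def __init__(self):
        self.state = np.zeros((ORDER - 1, DIM))

    def push(self, pulse):
        buf = np.vstack([self.state, pulse])       # past context + new frames
        out = np.stack([(buf[t:t + ORDER][::-1] * taps).sum(axis=0)
                        for t in range(len(pulse))])
        self.state = buf[-(ORDER - 1):]            # keep only what the next pulse needs
        return out

stream = PulsedConv()
ys = np.vstack([stream.push(x[i:i + PULSE]) for i in range(0, len(x), PULSE)])
assert np.allclose(ys, batch_causal_conv(x))  # pulsed output matches batch exactly
```

Because the op is strictly causal, the pulsed run reproduces the batch run bit-for-bit while only ever holding `ORDER - 1` past frames; that fixed-size state is what keeps latency bounded by the pulse size.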
Modes:
- Pulsed: low‑latency streaming using the current pulse size.
- Batch: processes larger chunks with more context per step, for parity with the offline reference.
- Both: runs both and plots their scores side by side for comparison.
Dev notes: see the VAD module README for local wasm build and serve instructions.