👀 Live Demo: Voice Activity Detection
This demo runs a Voice Activity Detection (VAD) model based on FunASR's FSMN-VAD (see the Deep-FSMN paper).
VAD estimates the probability that speech is present in short audio frames. FSMN-VAD is a
lightweight (0.4M parameters), strictly causal architecture: its "memory" is a depthwise
convolution over past frames with no lookahead, which makes it a natural fit for streaming
and for tract's pulse transform.
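Once per-frame probabilities are available, a typical VAD turns them into speech segments with some smoothing. The sketch below is a hypothetical post-processing step using hysteresis (separate onset/offset thresholds); the threshold values and the `segments` helper are illustrative, not the exact logic this demo or FSMN-VAD ships with:

```python
def segments(probs, on=0.7, off=0.3):
    """Turn per-frame speech probabilities into (start, end) frame spans.

    Hysteresis: a segment opens when the probability rises above `on`
    and closes only when it falls below `off`, which suppresses flicker
    around a single threshold.
    """
    segs, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= on:
            start = i                      # speech onset
        elif start is not None and p < off:
            segs.append((start, i))        # speech offset
            start = None
    if start is not None:
        segs.append((start, len(probs)))   # still speaking at end of stream
    return segs

print(segments([0.1, 0.8, 0.9, 0.5, 0.2, 0.1, 0.75, 0.4]))
# → [(1, 4), (6, 8)]
```

Note how the frame at probability 0.5 stays inside the first segment: it is below the onset threshold but above the offset one.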
Did you know? In tract, a pulse is a small, fixed‑size
chunk on the time axis (here: 4 frames). The engine streams these pulses through
the graph, carrying just enough internal state to run causally without
replaying history. Latency is therefore bounded by the pulse size (ops must be
streamable or have only a small look‑ahead).
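The pulse idea can be sketched with a toy causal depthwise convolution, i.e. an FSMN-style memory over past frames. This is an illustrative NumPy model, not tract's actual implementation; the pulse size matches the demo's 4 frames, while the memory depth and feature dimension are arbitrary:

```python
import numpy as np

PULSE = 4   # frames per pulse (matches the demo; illustrative)
ORDER = 8   # memory depth of the causal depthwise conv (arbitrary)
DIM = 16    # feature dimension (arbitrary)

rng = np.random.default_rng(0)
taps = rng.standard_normal((ORDER, DIM))  # per-channel (depthwise) filter taps
x = rng.standard_normal((32, DIM))        # 32 frames of input features

def batch_causal_conv(x):
    """Whole-sequence reference: y[t] = sum_k taps[k] * x[t - k], zero-padded past."""
    padded = np.vstack([np.zeros((ORDER - 1, DIM)), x])
    return np.stack([(padded[t:t + ORDER][::-1] * taps).sum(axis=0)
                     for t in range(len(x))])

class PulsedConv:
    """Streaming version: carries the last ORDER-1 frames as state instead of
    replaying history — the same idea as tract's pulse transform for causal ops."""
    def __init__(self):
        self.state = np.zeros((ORDER - 1, DIM))

    def push(self, pulse):
        buf = np.vstack([self.state, pulse])       # past context + new frames
        out = np.stack([(buf[t:t + ORDER][::-1] * taps).sum(axis=0)
                        for t in range(len(pulse))])
        self.state = buf[-(ORDER - 1):]            # keep only what the next pulse needs
        return out

stream = PulsedConv()
ys = np.vstack([stream.push(x[i:i + PULSE]) for i in range(0, len(x), PULSE)])
assert np.allclose(ys, batch_causal_conv(x))  # pulsed output matches batch exactly
```

Because the op is strictly causal, the pulsed run reproduces the batch run bit-for-bit while only ever holding `ORDER - 1` past frames; that fixed-size state is what keeps latency bounded by the pulse size.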
Modes:
- Pulsed: low‑latency streaming using the current pulse size.
- Batch: processes larger chunks with more context per step, for parity with the offline reference.
- Both: runs both and plots their scores side by side for comparison.
Dev notes: see the VAD module README for local wasm build and serve instructions.