# Speech Engines
Modern speech recognition and synthesis models are powerful but heavy: large downloads, slow startup, and high resource use. Every application that wants voice interaction must implement its own model loading, optimization, and lifecycle management, and loading a model for a few voice commands only to discard it is wasteful.

Dictare solves this by keeping speech-to-text (STT) and text-to-speech (TTS) engines loaded in memory as a background service. The models are loaded once at startup, optimized for your hardware, and stay ready. Any application can use them instantly through the OpenVIP protocol: no model loading, no ML dependencies, no GPU management. Just speak and listen.
All engines run 100% locally. No audio ever leaves your machine.
## Speech-to-Text (STT)

### Engine Selection

Dictare automatically selects the best STT engine for your platform:

| Model config | Engine | Runtime | Platform |
|---|---|---|---|
| `tiny` to `large-v3-turbo` | MLXWhisperEngine | MLX | macOS Apple Silicon |
| `tiny` to `large-v3-turbo` | FasterWhisperEngine | CTranslate2 | Linux / Intel Mac |
| `parakeet-v3` | ParakeetEngine | ONNX Runtime | Any |
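The selection rule in the table above can be sketched as follows. This is illustrative only; the real logic is internal to Dictare, and the function name here is hypothetical:

```python
import platform

def select_stt_engine(model: str) -> str:
    """Sketch of platform-based STT engine selection (hypothetical helper;
    Dictare's actual selection code may differ)."""
    if model == "parakeet-v3":
        return "ParakeetEngine"        # ONNX Runtime runs anywhere
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "MLXWhisperEngine"      # MLX on Apple Silicon
    return "FasterWhisperEngine"       # CTranslate2 on Linux / Intel Mac
```

The point is that the Whisper-family model names map to different backends per platform, while `parakeet-v3` always resolves to the ONNX engine.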
### Models

| Model | Size | Speed | Accuracy | Best for |
|---|---|---|---|---|
| `tiny` | ~75 MB | Fastest | Lower | Testing, low-resource |
| `base` | ~140 MB | Fast | Moderate | Quick dictation |
| `small` | ~460 MB | Moderate | Good | General use |
| `medium` | ~1.5 GB | Slower | High | Accuracy-focused |
| `large-v3` | ~3 GB | Slowest | Highest | Best accuracy |
| `large-v3-turbo` | ~1.6 GB | Fast | High | Recommended |
| `parakeet-v3` | ~600 MB | Fast | High | Cross-platform |
### Configuration

```toml
[stt]
model = "large-v3-turbo"   # Recommended default
language = "auto"          # Auto-detect, or set "en", "it", etc.
translate = false          # Translate output to English
hw_accel = true            # Use GPU/NPU acceleration

[stt.advanced]
device = "auto"            # auto, cpu, cuda, mlx
compute_type = "float16"   # int8 (faster), float16 (balanced), float32 (precise)
hotwords = ""              # Help recognition: "Dictare, OpenVIP, Claude"
```
### Model Management

```bash
dictare models            # List available models
dictare models download   # Download a model
dictare models delete     # Remove a cached model
```
### Hotwords

Improve recognition of technical terms and names:

```toml
[stt.advanced]
hotwords = "Dictare, OpenVIP, Claude Code, Codex, pytest"
```
## Text-to-Speech (TTS)

### Available Engines

| Engine | Platform | Quality | Speed | Notes |
|---|---|---|---|---|
| `espeak` | Any | Basic | Instant | Built-in, no download |
| `say` | macOS | Good | Instant | Uses macOS system voices |
| `piper` | Any | Good | Fast | ONNX-based, many voices |
| `kokoro` | Any | High | Moderate | Neural, natural-sounding |
| `outetts` | Any | High | Slower | Neural TTS |
### Configuration

Your default engine is set in `config.toml`:

```toml
[tts]
engine = "say"    # Your preferred engine (always loaded, fastest)
language = "en"   # Voice accent
voice = ""        # Engine-specific speaker name
```
### Using Multiple Engines

You can use any installed engine on the fly, even if it's not your default. Pass `--engine` to `dictare speak`:

```bash
dictare speak "Build complete"                # Uses the default engine
dictare speak "Test passed" --engine kokoro   # Uses kokoro for this request
dictare speak "Deploying" --engine piper      # Uses piper for this request
```
The first time you use a non-default engine, it takes a moment to load. After that, the audio is cached: the same combination of text, engine, language, and voice plays back instantly from cache.

```bash
dictare speak --list-engines   # Show all available engines
dictare speak --list-voices    # Show voices for the current engine
```
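The caching behavior above depends only on the request parameters: identical requests map to the same cached audio. A minimal sketch of such a cache key (hypothetical; Dictare's real key derivation is internal):

```python
import hashlib

def tts_cache_key(text: str, engine: str, language: str, voice: str) -> str:
    """Derive a stable cache key from the request parameters.

    Identical (text, engine, language, voice) tuples always produce the
    same key, so a repeat request can be served from cached audio.
    The NUL separator prevents ambiguous concatenations.
    """
    payload = "\x00".join([text, engine, language, voice])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k_repeat_a = tts_cache_key("Build complete", "kokoro", "en", "")
k_repeat_b = tts_cache_key("Build complete", "kokoro", "en", "")
k_other    = tts_cache_key("Build complete", "piper", "en", "")
```

Changing any single parameter (here, the engine) yields a different key, which is why switching engines triggers fresh synthesis.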
### TTS Worker Isolation

The `kokoro`, `piper`, and `outetts` engines run in an isolated subprocess worker. This prevents their dependencies from interfering with the main Dictare process. The worker is managed automatically.
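The isolation pattern can be sketched as a parent process exchanging JSON with a worker over stdin/stdout, so the worker's heavy imports never load into the parent. This is illustrative only and not Dictare's actual worker protocol:

```python
import json
import subprocess
import sys

# Hypothetical worker: a real one would import its TTS engine here
# and synthesize audio; this stub just echoes the request back.
WORKER_CODE = r"""
import json, sys
request = json.loads(sys.stdin.read())
sys.stdout.write(json.dumps({"ok": True, "text": request["text"]}))
"""

proc = subprocess.run(
    [sys.executable, "-c", WORKER_CODE],     # isolated interpreter
    input=json.dumps({"text": "Build complete"}),
    capture_output=True, text=True, check=True,
)
reply = json.loads(proc.stdout)
```

If the worker crashes or leaks memory, only the subprocess dies; the parent can simply restart it, which is what "managed automatically" amounts to.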
## Performance Tips

- **macOS Apple Silicon**: use `large-v3-turbo` with MLX for the best speed/accuracy balance
- **Linux with NVIDIA GPU**: use `large-v3-turbo` with CUDA (`device = "cuda"`)
- **CPU-only machines**: use `small` or `parakeet-v3` for reasonable speed
- **Low memory**: use `tiny` or `base`, and set `compute_type = "int8"` to reduce memory usage
- **Hotwords**: always set hotwords for domain-specific terms you use frequently