Introduction to Voxtral API
The Voxtral API provides developers with access to state-of-the-art speech understanding capabilities through a simple, RESTful interface. Built on Mistral AI's frontier language models, Voxtral offers not just transcription, but deep semantic understanding of spoken content, including built-in question-answering capabilities.
Whether you're building a voice-powered application, processing podcast content, or creating accessibility tools, this guide will help you harness the full power of Voxtral's speech understanding technology.
Prerequisites and Setup
What You'll Need
Before diving into the implementation, ensure you have:
- A Voxtral API account (sign up at voxtral.life)
- Your API key (available in your dashboard)
- Basic knowledge of HTTP requests and JSON
- A development environment with internet connectivity
Supported Audio Formats
Voxtral accepts a wide range of audio formats to maximize compatibility:
- Formats: MP3, WAV, FLAC, OGG, M4A, WEBM
- Sample Rates: 8 kHz to 48 kHz (16 kHz or higher recommended)
- Channels: Mono and stereo supported
- Duration: Up to 4 hours per file
- File Size: Maximum 500 MB per request
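The limits above are cheap to check locally before spending an upload on a file the API will reject. Here is a minimal sketch of such a pre-flight check. The helper name `check_upload` is hypothetical, it looks only at the file extension (not the actual codec), and it assumes "500 MB" means 500 × 1024 × 1024 bytes, which this guide does not specify.

```python
import os

# Values copied from the list above; everything else here is illustrative.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg", ".m4a", ".webm"}
MAX_FILE_BYTES = 500 * 1024 * 1024  # assumption: 500 MB read as 500 MiB


def check_upload(path, size_bytes=None):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    # Allow the caller to pass a known size; otherwise read it from disk.
    if size_bytes is None:
        size_bytes = os.path.getsize(path)
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    return problems
```

Calling `check_upload("meeting_recording.mp3")` before a request lets you fail fast with a readable message instead of waiting for a 400 or 413 response.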
Your First API Call
Basic Transcription Request
Let's start with a simple transcription request using cURL:
curl -X POST https://api.voxtral.life/v1/transcribe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "audio=@/path/to/your/audio.mp3" \
  -F "model=voxtral-3b" \
  -F "language=auto"
Understanding the Response
A successful response will include:
{
  "id": "txn_abc123",
  "status": "completed",
  "transcript": "Welcome to Voxtral, the future of speech understanding...",
  "segments": [
    {
      "start": 0.0,
      "end": 3.2,
      "text": "Welcome to Voxtral,",
      "confidence": 0.95
    }
  ],
  "metadata": {
    "duration": 45.6,
    "language": "en",
    "model": "voxtral-3b"
  }
}
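Once you have a response shaped like the one above, the segments array is usually the most useful part, since it carries timing and confidence for each piece of text. The sketch below turns segments into timestamped lines and optionally drops low-confidence ones. The function name and the `min_confidence` threshold are illustrative, and it assumes every segment carries the `start`, `end`, and `text` fields shown in the sample.

```python
def format_segments(response, min_confidence=0.0):
    """Render each segment as '[start - end] text', skipping low-confidence ones."""
    lines = []
    for segment in response.get("segments", []):
        # Treat a missing confidence as fully confident rather than dropping the segment.
        if segment.get("confidence", 1.0) < min_confidence:
            continue
        lines.append(
            f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}"
        )
    return lines
```

For a quick review pass, try `format_segments(result, min_confidence=0.8)` and listen again to whatever gets filtered out.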
Language-Specific Integration Examples
Python Implementation
For Python developers, here's a complete example using the requests library:
import requests


class VoxtralClient:
    """Minimal client for the Voxtral transcription endpoint."""

    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.voxtral.life/v1"
        self.headers = {"Authorization": f"Bearer {api_key}"}

    def transcribe(self, audio_file_path, model="voxtral-3b", language="auto"):
        url = f"{self.base_url}/transcribe"
        data = {
            "model": model,
            "language": language,
            "response_format": "json",
        }
        # Keep the file open for the duration of the upload.
        with open(audio_file_path, "rb") as audio_file:
            files = {"audio": audio_file}
            response = requests.post(
                url, headers=self.headers, files=files, data=data, timeout=300
            )
        if response.status_code == 200:
            return response.json()
        raise Exception(f"API Error: {response.status_code} - {response.text}")


# Usage example
client = VoxtralClient("your-api-key-here")
result = client.transcribe("meeting_recording.mp3")
print(result["transcript"])
JavaScript/Node.js Implementation
For JavaScript developers, here's how to integrate with Node.js:
const fs = require('fs');
const FormData = require('form-data');
const axios = require('axios');

class VoxtralAPI {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.baseURL = 'https://api.voxtral.life/v1';
  }

  async transcribe(audioFilePath, options = {}) {
    const form = new FormData();
    form.append('audio', fs.createReadStream(audioFilePath));
    form.append('model', options.model || 'voxtral-3b');
    form.append('language', options.language || 'auto');
    try {
      const response = await axios.post(`${this.baseURL}/transcribe`, form, {
        headers: {
          'Authorization': `Bearer ${this.apiKey}`,
          ...form.getHeaders()
        }
      });
      return response.data;
    } catch (error) {
      // The error body is usually an object, so serialize it for the message.
      const detail = error.response ? JSON.stringify(error.response.data) : error.message;
      throw new Error(`Voxtral API Error: ${detail}`);
    }
  }
}

// Usage
const voxtral = new VoxtralAPI('your-api-key');
voxtral.transcribe('./audio.mp3')
  .then(result => console.log(result.transcript))
  .catch(error => console.error(error));
Advanced Features and Configuration
Question Answering Capabilities
One of Voxtral's unique features is built-in question answering. You can ask questions about the transcribed content:
curl -X POST https://api.voxtral.life/v1/transcribe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "audio=@meeting.mp3" \
  -F "model=voxtral-24b" \
  -F "questions=[\"What were the main action items?\", \"Who was responsible for the budget?\"]"
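Escaping a JSON array by hand inside a shell string gets error-prone quickly, so in application code it is easier to let a JSON encoder build the `questions` field. The sketch below does that in Python. The helper name is hypothetical, and it assumes the endpoint expects `questions` as a JSON-encoded array in a form field, as the cURL call above suggests. This guide does not show how answers appear in the response, so the sketch stops at building the request fields.

```python
import json


def build_question_fields(questions, model="voxtral-24b"):
    """Build the non-file form fields for a transcription request with questions."""
    questions = list(questions)
    if not questions:
        raise ValueError("provide at least one question")
    return {
        "model": model,
        # json.dumps handles quoting and escaping of each question string.
        "questions": json.dumps(questions),
    }
```

The returned dictionary can be passed as the `data=` argument of `requests.post`, next to the `files=` argument that carries the audio.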
Speaker Identification
Enable speaker diarization to identify different speakers in your audio:
{
  "audio": "base64_encoded_audio",
  "model": "voxtral-24b",
  "enable_speaker_diarization": true,
  "max_speakers": 4
}
Custom Vocabulary and Context
Improve accuracy for domain-specific terminology:
{
  "audio": "base64_encoded_audio",
  "model": "voxtral-3b",
  "custom_vocabulary": ["Voxtral", "API", "transcription", "diarization"],
  "context": "This is a technical discussion about speech recognition APIs"
}
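The JSON bodies in this section carry the audio as a base64 string rather than as a file upload. Here is a minimal sketch of building such a body in Python. The helper name `build_json_body` is hypothetical, and this guide does not say which endpoint or content type accepts JSON bodies, so treat the transport details as something to confirm in the API reference.

```python
import base64
import json


def build_json_body(audio_bytes, model="voxtral-3b", **options):
    """Return a JSON string with base64 audio plus any optional feature flags."""
    body = {
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "model": model,
    }
    # Optional fields such as enable_speaker_diarization or custom_vocabulary.
    body.update(options)
    return json.dumps(body)
```

Keep in mind that base64 encodes every 3 bytes as 4 characters, so the encoded payload is roughly a third larger than the original file.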
Error Handling and Best Practices
Common Error Codes
Understanding API error codes helps with robust implementation:
- 400 Bad Request: Invalid audio format or parameters
- 401 Unauthorized: Invalid or missing API key
- 413 Payload Too Large: Audio file exceeds size limits
- 429 Rate Limit Exceeded: Too many requests
- 500 Server Error: Temporary service issue
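The codes above fall into two groups: errors that will fail again no matter how often you retry (400, 401, 413) and errors that may clear on their own (429, 500). A small sketch of that split follows. Treating every 5xx status as retryable is an assumption that goes slightly beyond the list above.

```python
def is_retryable(status_code):
    """Return True for errors worth retrying after a delay."""
    if status_code == 429:
        # Rate limited: the same request can succeed once the window resets.
        return True
    # Assumption: any 5xx is a temporary server-side problem.
    return 500 <= status_code <= 599
```

A check like this keeps a retry loop from hammering the API with a request that has a bad key or an oversized file.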
Retry Logic Implementation
Implement robust retry logic for production applications:
import time
import random


def transcribe_with_retry(client, audio_path, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.transcribe(audio_path)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: re-raise with the original traceback
            # Exponential backoff with jitter: roughly 1s, 2s, 4s, ...
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
    return None
Rate Limiting and Optimization
Optimize your API usage with these best practices:
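- Batch Processing: Process multiple files in parallel, keeping concurrency low enough to stay within your rate limit
- Audio Preprocessing: Compress audio losslessly (for example, as FLAC) to reduce upload size without degrading quality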
- Model Selection: Use voxtral-3b for fast processing, voxtral-24b for maximum accuracy
- Caching: Store results to avoid reprocessing identical content
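The caching point is easy to sketch. If you key results on a hash of the audio bytes plus the request settings, identical content is only sent once, even when it arrives under a different filename. The class below is a minimal in-memory version. Its name and interface are illustrative, and a production setup would likely back it with a database or key-value store.

```python
import hashlib


class TranscriptCache:
    """In-memory cache keyed on audio content and request settings."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key_for(audio_bytes, model, language):
        # Same bytes + same settings => same key, regardless of filename.
        digest = hashlib.sha256(audio_bytes).hexdigest()
        return f"{digest}:{model}:{language}"

    def get_or_compute(self, audio_bytes, model, language, compute):
        """Return the cached result, calling compute() only on a miss."""
        key = self.key_for(audio_bytes, model, language)
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]
```

Here `compute` would be a zero-argument callable that performs the actual API request, for example a lambda wrapping your client's transcribe call.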
Real-World Implementation Patterns
Streaming Audio Processing
For real-time applications, implement streaming with WebSocket connections:
const WebSocket = require('ws');

class VoxtralStream {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.ws = null;
  }

  connect() {
    this.ws = new WebSocket('wss://api.voxtral.life/v1/stream', {
      headers: { 'Authorization': `Bearer ${this.apiKey}` }
    });
    this.ws.on('message', (data) => {
      const result = JSON.parse(data);
      if (result.type === 'partial') {
        console.log('Partial transcript:', result.text);
      } else if (result.type === 'final') {
        console.log('Final transcript:', result.text);
      }
    });
  }

  sendAudio(audioChunk) {
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(audioChunk);
    }
  }
}
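Streaming means feeding the socket small pieces of audio rather than one large buffer. The sketch below shows the slicing step in Python, and the same logic applies in whichever language drives the connection. This guide does not state the chunk size or audio encoding the stream endpoint expects, so the default of 3200 bytes (100 ms of 16 kHz, 16-bit mono PCM) is purely an assumption.

```python
def iter_chunks(audio_bytes, chunk_size=3200):
    """Yield consecutive slices of audio_bytes, each at most chunk_size long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(audio_bytes), chunk_size):
        # The final slice may be shorter than chunk_size.
        yield audio_bytes[offset:offset + chunk_size]
```

Each yielded chunk would then be handed to the socket's send method, ideally paced to match real time when replaying a recording.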
Webhook Integration
For long audio files, use webhooks for asynchronous processing:
const express = require('express');
const axios = require('axios');

const app = express();
app.use(express.json()); // parse JSON webhook bodies into req.body

// Submit job with webhook
async function submitJob() {
  return axios.post('https://api.voxtral.life/v1/transcribe/async', {
    audio_url: 'https://example.com/long-recording.mp3',
    model: 'voxtral-24b',
    webhook_url: 'https://your-app.com/voxtral-webhook'
  }, {
    headers: { 'Authorization': 'Bearer YOUR_API_KEY' }
  });
}

// Handle webhook in your application
app.post('/voxtral-webhook', (req, res) => {
  const { job_id, status, transcript, error } = req.body;
  if (status === 'completed') {
    console.log(`Job ${job_id} completed:`, transcript);
    // Process the completed transcription
  } else if (status === 'failed') {
    console.error(`Job ${job_id} failed:`, error);
  }
  res.status(200).send('OK');
});
Performance Optimization Strategies
Audio Preprocessing
Optimize audio files before sending to the API:
- Format Conversion: Convert to FLAC, which compresses losslessly, to shrink uploads without affecting audio quality
- Noise Reduction: Remove background noise for better accuracy
- Normalization: Normalize audio levels for consistent processing
- Segmentation: Split very long files at natural pause points
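To make the normalization step concrete, here is a minimal peak-normalization sketch that operates on 16-bit PCM samples held in a plain Python list. The function name and the 0.9 target level are illustrative choices rather than Voxtral recommendations, and real pipelines typically use an audio tool or library for this instead.

```python
def normalize_peak(samples, target_peak=0.9, full_scale=32767):
    """Scale 16-bit samples so the loudest one sits at target_peak of full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        # Silence (or empty input): nothing to scale.
        return list(samples)
    gain = (target_peak * full_scale) / peak
    # Round to integers and clamp to the valid 16-bit range.
    return [max(-full_scale - 1, min(full_scale, round(s * gain))) for s in samples]
```

Applying one gain to the whole file preserves the relative dynamics of the recording while bringing quiet files up to a consistent level.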
Parallel Processing
Process multiple files simultaneously to maximize throughput:
import asyncio
import aiohttp


async def transcribe_batch(audio_files, api_key):
    async with aiohttp.ClientSession() as session:
        tasks = [transcribe_single(session, path, api_key) for path in audio_files]
        # With return_exceptions=True, failures come back in the results list
        # instead of being raised.
        return await asyncio.gather(*tasks, return_exceptions=True)


async def transcribe_single(session, file_path, api_key):
    headers = {'Authorization': f'Bearer {api_key}'}
    # Close the file handle once the upload has finished.
    with open(file_path, 'rb') as audio_file:
        data = aiohttp.FormData()
        data.add_field('audio', audio_file)
        data.add_field('model', 'voxtral-3b')
        async with session.post('https://api.voxtral.life/v1/transcribe',
                                data=data, headers=headers) as response:
            return await response.json()
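Launching every upload at once can run straight into the 429 rate limit described earlier, so it often helps to cap how many requests are in flight. The sketch below does this with `asyncio.Semaphore`. The helper name and the default limit of 4 are assumptions, so pick a value that matches your plan's actual rate limit.

```python
import asyncio


async def gather_limited(jobs, limit=4):
    """Run zero-argument coroutine functions with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run(job):
        # Wait for a free slot before starting the job.
        async with semaphore:
            return await job()

    # Results keep the order of `jobs`; failures are returned, not raised.
    return await asyncio.gather(*(run(job) for job in jobs), return_exceptions=True)
```

Each job is a callable that returns a coroutine when called, for example `functools.partial` wrapped around a per-file transcription coroutine, so no request starts until a slot is available.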
Troubleshooting Common Issues
Audio Quality Problems
If you're experiencing poor transcription accuracy:
- Ensure audio is recorded at 16kHz or higher sample rate
- Check for excessive background noise or distortion
- Verify the language setting matches the spoken content
- Consider using custom vocabulary for technical terms
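The sample-rate check is quick to automate for WAV files using only Python's standard library. The sketch below reads the header and reports whether the file meets the 16 kHz recommendation. The function name and the returned keys are illustrative, and other formats such as MP3 or FLAC would need a third-party library instead.

```python
import wave


def describe_wav(path):
    """Read basic properties from a WAV file header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate if rate else 0.0,
            # 16 kHz is the recommended minimum from this guide.
            "meets_16khz": rate >= 16000,
        }
```

If `meets_16khz` comes back false, re-recording or exporting at a higher rate is usually more effective than upsampling the existing file.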
API Connection Issues
For connectivity problems:
- Verify your API key is correct and active
- Check your internet connection and firewall settings
- Ensure you're using HTTPS for all API calls
- Implement proper timeout handling (recommended: 300 seconds)
Performance Optimization
To improve processing speed:
- Use the appropriate model size for your accuracy requirements
- Compress audio files losslessly (for example, as FLAC) to shorten upload time
- Implement connection pooling for multiple requests
- Consider regional API endpoints for lower latency
Security and Privacy Considerations
API Key Management
Protect your API credentials:
- Store API keys in environment variables, never in code
- Use different keys for development and production
- Rotate keys regularly and monitor usage
- Implement IP allowlisting when possible
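Reading the key from the environment takes only a few lines. In the sketch below, the variable name `VOXTRAL_API_KEY` is an assumption, since this guide does not prescribe one, and the helper fails loudly instead of sending requests with an empty key.

```python
import os


def load_api_key(env=None, name="VOXTRAL_API_KEY"):
    """Return the API key from the environment, or raise if it is missing."""
    # `env` can be any mapping, which makes the function easy to test.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before calling the API")
    return key
```

Using separate environments for development and production then comes down to exporting a different value in each, with no code changes.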
Data Privacy
Ensure compliance with privacy regulations:
- Review Voxtral's data retention policies
- Implement proper consent mechanisms
- Consider on-premises deployment for sensitive data
- Encrypt audio files in transit and at rest
Next Steps and Advanced Topics
Exploring Advanced Features
Once you're comfortable with basic transcription, explore:
- Custom Model Training: Train models on your specific domain
- Real-time Processing: Build live transcription applications
- Multi-modal Analysis: Combine speech with other data sources
- Integration Platforms: Connect with popular workflow tools
Community and Support
Join the Voxtral developer community:
- Follow our blog for updates and tutorials
- Join our Discord server for real-time support
- Contribute to our open-source projects on GitHub
- Attend our monthly developer webinars
Conclusion
The Voxtral API provides a powerful, flexible foundation for building speech-enabled applications. With its combination of accuracy, performance, and developer-friendly design, you can create sophisticated voice experiences that delight your users.
This guide has covered the essentials of getting started with Voxtral, from basic transcription to advanced features and optimization strategies. As you build with Voxtral, remember that our team is here to support you, so don't hesitate to reach out with questions or feedback.
Ready to start building? Head over to voxtral.life to create your account and get your API key. The future of speech understanding is at your fingertips.