I recently discovered ElevenLabs while searching for a text-to-speech solution to convert text into audio for my app. They offer an API with several features, including transforming text into speech.
In this tutorial, I will talk about how to build a simple text-to-speech app using Next.js and ElevenLabs API.
No advanced knowledge is needed, but you need to have a basic understanding of JavaScript to create a Text-to-speech app.
Text-to-speech: Step by step
Getting started with ElevenLabs
ElevenLabs is an AI platform focused on high-quality voice synthesis. The platform allows for customizable voice settings, such as adjusting stability and similarity, which helps create more realistic and natural-sounding audio.
API Endpoints overview
Here are the endpoints we will be using in this tutorial:
- Voice
- Endpoint: /voices
- Description: Retrieves all available voices
- Text-to-Speech
- Endpoint: /text-to-speech/{voice_id}
- Description: Converts text into audio using the voice specified by the voice_id parameter
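To make these endpoints concrete, here is a small sketch of how the two URLs are built. Note that `API_BASE` and the helper names are my own for illustration, not part of an official ElevenLabs SDK:

```javascript
// Base URL for the ElevenLabs REST API (v1).
const API_BASE = 'https://api.elevenlabs.io/v1';

// Hypothetical helpers that build the two endpoint URLs used in this tutorial.
const voicesUrl = () => `${API_BASE}/voices`;
const textToSpeechUrl = (voiceId) => `${API_BASE}/text-to-speech/${voiceId}`;

console.log(voicesUrl());            // https://api.elevenlabs.io/v1/voices
console.log(textToSpeechUrl('abc')); // https://api.elevenlabs.io/v1/text-to-speech/abc
```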
Get an API Key
You need to generate your API key in your profile settings. This key will be used to authenticate API requests.
Setup project
Init your Next.js
Follow these steps:
npx create-next-app text-to-speech-app
cd text-to-speech-app
Set Up .env
Create a .env file and add your API key for the Text-to-speech app.
ELEVENLABS_API_KEY=your-apikey-here
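On the server, the key is read from process.env. A minimal sketch of a fail-fast guard (the explicit check and the getApiKey name are my own additions, not required by Next.js):

```javascript
// Read the ElevenLabs key from the environment and fail fast if it is missing.
function getApiKey() {
  const apiKey = process.env.ELEVENLABS_API_KEY;
  if (!apiKey) {
    throw new Error('Missing ELEVENLABS_API_KEY in .env');
  }
  return apiKey;
}
```

Failing early like this turns a confusing upstream 401 into an obvious configuration error.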
Create API Routes
We’re going to set up our API routes to connect to the ElevenLabs API, so our front-end can use them.
Create the file /api/voices.js
export default async function handler(req, res) {
  const apiUrl = 'https://api.elevenlabs.io/v1/voices';
  const apiKey = process.env.ELEVENLABS_API_KEY;

  try {
    const response = await fetch(apiUrl, {
      method: 'GET',
      headers: {
        'xi-api-key': apiKey,
      },
    });
    const voices = await response.json();
    res.status(200).json(voices);
  } catch (error) {
    res.status(500).json({ error: 'Error fetching voices' });
  }
}
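For reference, the JSON returned by /voices contains a voices array whose entries include a voice_id and a name; those are the two fields the front-end will use. A quick sketch of extracting them (the sample values below are made up):

```javascript
// Hypothetical sample shaped like the /voices response body.
const sample = {
  voices: [
    { voice_id: 'abc123', name: 'Rachel' },
    { voice_id: 'def456', name: 'Adam' },
  ],
};

// Extract what the UI needs: one { id, label } pair per voice.
const options = sample.voices.map((v) => ({ id: v.voice_id, label: v.name }));
console.log(options);
```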
Create the file /api/text-to-speech.js
export default async function handler(req, res) {
  const { text, voice_id } = req.body;
  const apiUrl = `https://api.elevenlabs.io/v1/text-to-speech/${voice_id}`;
  const apiKey = process.env.ELEVENLABS_API_KEY;

  const headers = {
    "Accept": "audio/mpeg",
    "xi-api-key": apiKey,
    "Content-Type": "application/json"
  };

  const requestBody = JSON.stringify({
    text,
    model_id: "eleven_monolingual_v1",
  });

  try {
    const response = await fetch(apiUrl, {
      method: 'POST',
      headers: headers,
      body: requestBody
    });
    const audioBuffer = await response.arrayBuffer();
    res.setHeader('Content-Type', 'audio/mpeg');
    res.status(200).send(Buffer.from(audioBuffer));
  } catch (error) {
    console.error('Error generating text-to-speech:', error);
    res.status(500).json({ error: 'Error generating text-to-speech' });
  }
}
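One refinement worth considering: as written, the route returns a 200 with whatever body ElevenLabs sent, even when the upstream request is rejected. A small guard (my own addition, called right after the fetch and before reading the body) can surface upstream errors instead:

```javascript
// Hypothetical guard: treat any non-2xx upstream response as an error.
function checkStatus(response) {
  if (!response.ok) {
    throw new Error(`ElevenLabs API error: ${response.status}`);
  }
  return response;
}
```

Throwing here routes the failure into the existing catch block, so the client receives a 500 instead of a silent, unplayable audio body.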
We can also configure voice_settings, which lets you fine-tune the output voice in your Text-to-speech app. Here’s how adjusting these values affects the generated speech:
- Stability:
- What it does: Controls how consistent or dynamic the voice sounds.
- Higher values (closer to 1): The voice will sound more steady and formal.
- Lower values (closer to 0): The voice will be more expressive and natural.
- Similarity Boost:
- What it does: Dictates how closely the output matches the original voice model.
- Higher values (closer to 1): The voice will adhere closely to the original voice’s tone and style.
- Lower values (closer to 0): The voice will allow for more variation and flexibility.
Example:
const requestBody = JSON.stringify({
  text,
  model_id: "eleven_monolingual_v1",
  voice_settings: {
    stability: 0.5,
    similarity_boost: 0.5,
  }
});
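Both settings are expected to be values between 0 and 1. If they come from user input, a tiny helper (hypothetical, my own names) can keep them in range before building the request body:

```javascript
// Hypothetical helper: clamp a value into the 0–1 range the API expects.
const clamp01 = (value) => Math.min(1, Math.max(0, value));

// Build a voice_settings object with both values kept in range.
const voiceSettings = (stability, similarityBoost) => ({
  stability: clamp01(stability),
  similarity_boost: clamp01(similarityBoost),
});
```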
Currently, I’m using the model_id eleven_monolingual_v1, but ElevenLabs also offers other models that support multiple languages and more fine-tuned voice generation.
In fact, ElevenLabs provides an API to retrieve available models, allowing you to make this value dynamic if you want to take your integration further.
More details are available in the ElevenLabs API documentation.
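For example, assuming the models endpoint returns a list of objects that each carry a model_id, you could pick a model dynamically. The sample data below is made up for illustration:

```javascript
// Hypothetical sample shaped like a list of ElevenLabs models.
const models = [
  { model_id: 'eleven_monolingual_v1', name: 'Eleven English v1' },
  { model_id: 'eleven_multilingual_v2', name: 'Eleven Multilingual v2' },
];

// Prefer a multilingual model when one exists; otherwise fall back to the first.
const chosen =
  models.find((m) => m.model_id.includes('multilingual')) ?? models[0];
console.log(chosen.model_id);
```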
Front-end implementation
Now that we’ve set up our API routes, we can move on to the front-end integration of our Text-to-speech app.
1 – Fetch all voices
When the page loads, we fetch and display the list of available voices for the user to choose from.
const [voices, setVoices] = useState([]);
const [selectedVoice, setSelectedVoice] = useState('');

useEffect(() => {
  const loadVoices = async () => {
    const response = await fetch('/api/voices');
    const data = await response.json();
    setVoices(data.voices);
  };
  loadVoices();
}, []);

return (
  <div>
    <h1>Select a voice</h1>
    <select
      value={selectedVoice}
      onChange={(e) => setSelectedVoice(e.target.value)}
    >
      {voices.map((voice) => (
        <option key={voice.voice_id} value={voice.voice_id}>
          {voice.name}
        </option>
      ))}
    </select>
  </div>
);
2 – Generating the Speech
const handleSubmit = async (e) => {
  e.preventDefault();
  const response = await fetch('/api/text-to-speech', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, voice_id: selectedVoice }),
  });
  const audioBlob = await response.blob();
  const audioUrl = URL.createObjectURL(audioBlob);
  new Audio(audioUrl).play();
};

<form onSubmit={handleSubmit}>
  <textarea
    value={text}
    onChange={(e) => setText(e.target.value)}
    placeholder="Enter your text"
    rows="5"
  />
  <button type="submit">
    Generate Speech
  </button>
</form>
By converting the response to a Blob and creating a temporary URL with URL.createObjectURL(audioBlob), we make the browser treat the audio data like a file, so it can be played directly without being downloaded.
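One caveat: each call to URL.createObjectURL holds on to memory until the URL is revoked. A small sketch that releases the URL once playback ends (playAudio is my own helper name, not a browser API):

```javascript
// Sketch: play a Blob and release the temporary URL once playback ends.
function playAudio(audioBlob) {
  const url = URL.createObjectURL(audioBlob);
  const audio = new Audio(url);
  audio.addEventListener('ended', () => URL.revokeObjectURL(url));
  audio.play();
  return audio;
}
```

For a one-shot demo this hardly matters, but in an app that generates speech repeatedly, revoking each URL prevents the blobs from accumulating.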
Here is the entire file.
import { useState, useEffect } from 'react';

export default function Home() {
  const [text, setText] = useState('');
  const [voices, setVoices] = useState([]);
  const [selectedVoice, setSelectedVoice] = useState('');

  const handleSubmit = async (e) => {
    e.preventDefault();
    const response = await fetch('/api/text-to-speech', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, voice_id: selectedVoice }),
    });
    const audioBlob = await response.blob();
    const audioUrl = URL.createObjectURL(audioBlob);
    new Audio(audioUrl).play();
  };

  useEffect(() => {
    const loadVoices = async () => {
      const response = await fetch('/api/voices');
      const data = await response.json();
      setVoices(data.voices);
    };
    loadVoices();
  }, []);

  return (
    <section>
      <div>
        <h1>Select a Voice</h1>
        <select
          value={selectedVoice}
          onChange={(e) => setSelectedVoice(e.target.value)}
        >
          {voices.map((voice) => (
            <option key={voice.voice_id} value={voice.voice_id}>
              {voice.name}
            </option>
          ))}
        </select>
      </div>
      <form onSubmit={handleSubmit}>
        <textarea
          value={text}
          onChange={(e) => setText(e.target.value)}
          placeholder="Enter your text"
          rows="5"
        />
        <button type="submit">
          Generate
        </button>
      </form>
    </section>
  );
}
Features to Explore
What if we take it a step further? Here are a few features I’ve come across in the ElevenLabs API.
- Voice Cloning
You can upload a voice, which can then be used for text-to-speech generation.
Endpoint: /v1/voices/clone
- Sound Generation
You can convert text into sounds.
Endpoint: /v1/sound-generation
- Dub a video or an Audio
You can translate and dub the provided audio or video files into the target language
Endpoint: /v1/dubbing
Conclusion
We have reached the end of our tutorial, and as you’ve seen, with just a few steps, you can build a simple Text-to-speech app with the ElevenLabs API. I hope this introduction inspires you and gives you ideas for integrating ElevenLabs into your own future projects.
Would you like to read more articles by Tekos’s Team? Everything’s here.