I recently discovered ElevenLabs while searching for a text-to-speech solution to convert text into audio for my app. They offer an API with several features, including transforming text into speech.
In this tutorial, I will talk about how to build a simple text-to-speech app using Next.js and ElevenLabs API.
No advanced knowledge is needed, but you need to have a basic understanding of JavaScript to create a Text-to-speech app.
Text-to-speech: Step by step
Getting started with ElevenLabs
ElevenLabs is an AI platform focused on high-quality voice synthesis. The platform allows for customizable voice settings, such as adjusting stability and similarity, which helps create more realistic and natural-sounding audio.
API Endpoints overview
Here are the endpoints we will be using in this tutorial:
- Voice
- Endpoint: /voices
- Description: Retrieves all available voices
- Text-to-Speech
- Endpoint: /text-to-speech/{voice_id}
- Description: Converts text into audio using the voice specified by the voice_id parameter
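To make these endpoints concrete, here is a small sketch of how the two URLs are built. Note that `API_BASE` and the helper names are my own for illustration, not part of an official ElevenLabs SDK:

```javascript
// Base URL for the ElevenLabs REST API (v1).
const API_BASE = 'https://api.elevenlabs.io/v1';

// Hypothetical helpers that build the two endpoint URLs used in this tutorial.
const voicesUrl = () => `${API_BASE}/voices`;
const textToSpeechUrl = (voiceId) => `${API_BASE}/text-to-speech/${voiceId}`;

console.log(voicesUrl());            // https://api.elevenlabs.io/v1/voices
console.log(textToSpeechUrl('abc')); // https://api.elevenlabs.io/v1/text-to-speech/abc
```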
Get an API Key
You need to generate your API key in your profile settings. This key will be used to authenticate API requests.
Setup project
Init your Next.js
Follow these steps:
npx create-next-app text-to-speech-app
cd text-to-speech-app
Set Up .env
Create a .env file and add your API key for the Text-to-speech app.
ELEVENLABS_API_KEY=your-apikey-here
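On the server, the key is read from process.env. A minimal sketch of a fail-fast guard (the explicit check and the getApiKey name are my own additions, not required by Next.js):

```javascript
// Read the ElevenLabs key from the environment and fail fast if it is missing.
function getApiKey() {
  const apiKey = process.env.ELEVENLABS_API_KEY;
  if (!apiKey) {
    throw new Error('Missing ELEVENLABS_API_KEY in .env');
  }
  return apiKey;
}
```

Failing early like this turns a confusing upstream 401 into an obvious configuration error.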
Create API Routes
We’re going to set up our API routes to connect to the ElevenLabs API, so our front-end can use them.
Create the file /api/voices.js
export default async function handler(req, res) {
  const apiUrl = 'https://api.elevenlabs.io/v1/voices';
  const apiKey = process.env.ELEVENLABS_API_KEY;

  try {
    const response = await fetch(apiUrl, {
      method: 'GET',
      headers: {
        'xi-api-key': apiKey,
      },
    });
    const voices = await response.json();
    res.status(200).json(voices);
  } catch (error) {
    res.status(500).json({ error: 'Error fetching voices' });
  }
}
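For reference, the JSON returned by /voices contains a voices array whose entries include a voice_id and a name; those are the two fields the front-end will use. A quick sketch of extracting them (the sample values below are made up):

```javascript
// Hypothetical sample shaped like the /voices response body.
const sample = {
  voices: [
    { voice_id: 'abc123', name: 'Rachel' },
    { voice_id: 'def456', name: 'Adam' },
  ],
};

// Extract what the UI needs: one { id, label } pair per voice.
const options = sample.voices.map((v) => ({ id: v.voice_id, label: v.name }));
console.log(options);
```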
Create the file /api/text-to-speech.js
export default async function handler(req, res) {
  const { text, voice_id } = req.body;
  const apiUrl = `https://api.elevenlabs.io/v1/text-to-speech/${voice_id}`;
  const apiKey = process.env.ELEVENLABS_API_KEY;

  const headers = {
    "Accept": "audio/mpeg",
    "xi-api-key": apiKey,
    "Content-Type": "application/json"
  };

  const requestBody = JSON.stringify({
    text,
    model_id: "eleven_monolingual_v1",
  });

  try {
    const response = await fetch(apiUrl, {
      method: 'POST',
      headers: headers,
      body: requestBody
    });
    const audioBuffer = await response.arrayBuffer();
    res.setHeader('Content-Type', 'audio/mpeg');
    res.status(200).send(Buffer.from(audioBuffer));
  } catch (error) {
    console.error('Error generating text-to-speech:', error);
    res.status(500).json({ error: 'Error generating text-to-speech' });
  }
}
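One refinement worth considering: as written, the route returns a 200 with whatever body ElevenLabs sent, even when the upstream request is rejected. A small guard (my own addition, called right after the fetch and before reading the body) can surface upstream errors instead:

```javascript
// Hypothetical guard: treat any non-2xx upstream response as an error.
function checkStatus(response) {
  if (!response.ok) {
    throw new Error(`ElevenLabs API error: ${response.status}`);
  }
  return response;
}
```

Throwing here routes the failure into the existing catch block, so the client receives a 500 instead of a silent, unplayable audio body.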
We can also configure voice_settings, which lets you fine-tune the output voice in your Text-to-speech app. Here’s how adjusting these values affects the generated speech:
- Stability:
- What it does: Controls how consistent or dynamic the voice sounds.
- Higher values (closer to 1): The voice will sound more steady and formal.
- Lower values (closer to 0): The voice will be more expressive and natural.
- Similarity Boost:
- What it does: Dictates how closely the output matches the original voice model.
- Higher values (closer to 1): The voice will adhere closely to the original voice’s tone and style.
- Lower values (closer to 0): The voice will allow for more variation and flexibility.
Example:
const requestBody = JSON.stringify({
  text,
  model_id: "eleven_monolingual_v1",
  voice_settings: {
    stability: 0.5,
    similarity_boost: 0.5,
  }
});
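Both settings are expected to be values between 0 and 1. If they come from user input, a tiny helper (hypothetical, my own names) can keep them in range before building the request body:

```javascript
// Hypothetical helper: clamp a value into the 0–1 range the API expects.
const clamp01 = (value) => Math.min(1, Math.max(0, value));

// Build a voice_settings object with both values kept in range.
const voiceSettings = (stability, similarityBoost) => ({
  stability: clamp01(stability),
  similarity_boost: clamp01(similarityBoost),
});
```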
Currently, I’m using the model_id eleven_monolingual_v1, but ElevenLabs also offers other models that support multiple languages and more fine-tuned voice generation.
In fact, ElevenLabs provides an API to retrieve available models, allowing you to make this value dynamic if you want to take your integration further.
More details are available in the ElevenLabs API documentation.
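For example, assuming the models endpoint returns a list of objects that each carry a model_id, you could pick a model dynamically. The sample data below is made up for illustration:

```javascript
// Hypothetical sample shaped like a list of ElevenLabs models.
const models = [
  { model_id: 'eleven_monolingual_v1', name: 'Eleven English v1' },
  { model_id: 'eleven_multilingual_v2', name: 'Eleven Multilingual v2' },
];

// Prefer a multilingual model when one exists; otherwise fall back to the first.
const chosen =
  models.find((m) => m.model_id.includes('multilingual')) ?? models[0];
console.log(chosen.model_id);
```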
Front-end implementation
Now that we’ve set up our API routes, we can move on to the front-end integration of our Text-to-speech app.
1 – Fetch all voices
When the page loads, we fetch and display the list of available voices for the user to choose from.
const [voices, setVoices] = useState([]);
const [selectedVoice, setSelectedVoice] = useState('');

useEffect(() => {
  const loadVoices = async () => {
    const response = await fetch('/api/voices');
    const data = await response.json();
    setVoices(data.voices);
  };
  loadVoices();
}, []);

return (
  <div>
    <h1>Select a voice</h1>
    <select
      value={selectedVoice}
      onChange={(e) => setSelectedVoice(e.target.value)}
    >
      {voices.map((voice) => (
        <option key={voice.voice_id} value={voice.voice_id}>
          {voice.name}
        </option>
      ))}
    </select>
  </div>
);
2 – Generating the Speech
const handleSubmit = async (e) => {
  e.preventDefault();
  const response = await fetch('/api/text-to-speech', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, voice_id: selectedVoice }),
  });
  const audioBlob = await response.blob();
  const audioUrl = URL.createObjectURL(audioBlob);
  new Audio(audioUrl).play();
};

<form onSubmit={handleSubmit}>
  <textarea
    value={text}
    onChange={(e) => setText(e.target.value)}
    placeholder="Enter your text"
    rows="5"
  />
  <button type="submit">
    Generate Speech
  </button>
</form>
By converting the response to a Blob and creating a temporary URL with URL.createObjectURL(audioBlob), we make the browser treat the audio data like a file, so it can be played directly without being downloaded.
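One caveat: each call to URL.createObjectURL holds on to memory until the URL is revoked. A small sketch that releases the URL once playback ends (playAudio is my own helper name, not a browser API):

```javascript
// Sketch: play a Blob and release the temporary URL once playback ends.
function playAudio(audioBlob) {
  const url = URL.createObjectURL(audioBlob);
  const audio = new Audio(url);
  audio.addEventListener('ended', () => URL.revokeObjectURL(url));
  audio.play();
  return audio;
}
```

For a one-shot demo this hardly matters, but in an app that generates speech repeatedly, revoking each URL prevents the blobs from accumulating.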
Here is the entire file.
import { useState, useEffect } from 'react';

export default function Home() {
  const [text, setText] = useState('');
  const [voices, setVoices] = useState([]);
  const [selectedVoice, setSelectedVoice] = useState('');

  const handleSubmit = async (e) => {
    e.preventDefault();
    const response = await fetch('/api/text-to-speech', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, voice_id: selectedVoice }),
    });
    const audioBlob = await response.blob();
    const audioUrl = URL.createObjectURL(audioBlob);
    new Audio(audioUrl).play();
  };

  useEffect(() => {
    const loadVoices = async () => {
      const response = await fetch('/api/voices');
      const data = await response.json();
      setVoices(data.voices);
    };
    loadVoices();
  }, []);

  return (
    <section>
      <div>
        <h1>Select a Voice</h1>
        <select
          value={selectedVoice}
          onChange={(e) => setSelectedVoice(e.target.value)}
        >
          {voices.map((voice) => (
            <option key={voice.voice_id} value={voice.voice_id}>
              {voice.name}
            </option>
          ))}
        </select>
      </div>
      <form onSubmit={handleSubmit}>
        <textarea
          value={text}
          onChange={(e) => setText(e.target.value)}
          placeholder="Enter your text"
          rows="5"
        />
        <button type="submit">
          Generate
        </button>
      </form>
    </section>
  );
}
Features to Explore
What if we take it a step further? Here are a few features I’ve come across in the ElevenLabs API.
- Voice Cloning
You can upload a voice, which can then be used for text-to-speech generation.
Endpoint: /v1/voices/clone
- Sound Generation
You can convert text into sounds.
Endpoint: /v1/sound-generation
- Dub a video or an Audio
You can translate and dub the provided audio or video files into the target language
Endpoint: /v1/dubbing
Conclusion
We have reached the end of our tutorial, and as you’ve seen, with just a few steps, you can build a simple Text-to-speech app with the ElevenLabs API. I hope this introduction inspires you and gives you ideas for integrating ElevenLabs into your own future projects.
Would you like to read more articles by Tekos’s Team? Everything’s here.