Chatterbox: A New Open-Source TTS Model from Resemble AI

Here at DigitalOcean, we are very excited about voice models. Can you blame us? Open-source speech recognition and TTS models are getting so good that we are optimistic about their adoption in just about any application where voice technology makes sense. From enhancing accessibility to improving user interfaces across devices (e.g. smartphones, smart glasses, robots, and voice-operated televisions), we are excited for improved user experiences.
If we take a look at agentic computer use, for example, we’re bottlenecked by our ability to type and click. Sometimes our minds run faster than our ability to convey our thinking, so voice may well prove to be a more expeditious medium for articulating user intent. That being said, an office with ten people shouting instructions at their laptops is far from ideal, but having voice operation as an option can certainly be beneficial.
When it comes to implementation, current systems for spoken dialogue typically depend on pipelines of independently functioning components: voice activity detection, speech recognition, textual dialogue (often from an LLM), and text-to-speech.
Delays build up across the different parts of these systems, pushing the total response time to several seconds. This is much slower than natural conversation, where response times are in the hundreds of milliseconds. While considerable progress has been made, interrupting current voice AI systems mid-response still feels unnatural and awkward. Additionally, because many of these voice pipelines only understand and generate text, they cannot process any information that isn’t written, such as tone or emotion.
For those interested, the paper Moshi: a speech-text foundation model for real-time dialogue does a good job of illustrating the limitations of voice AI in its introduction. On another note, the Conversational Speech Model (CSM) from Sesame (so, so cool), which we covered in the past, borrows from this paper with its advanced tokenizer for discretizing high-fidelity audio information.
Anyways, since the focus of this article is a TTS model called Chatterbox, let’s turn our attention to text-to-speech, shall we?
Text-to-Speech (TTS) models, as their name suggests, convert text into speech. We have all heard that personalization is one of the biggest leverage-points of AI. When it comes to Voice AI, TTS models with voice cloning capabilities allow one to tailor voices to desired languages, accents, and emotional tones so that interactions feel more personal and engaging. An excellent application is audiobooks where entire books can be generated in the author’s voice. We’ll show you how you can potentially approach this in the implementation section of this article.
Thanks to TTS models, information isn’t just something you read, it’s something you can absorb while you’re cooking, driving, or waiting in line. If you haven’t tried NotebookLM already, we encourage you to do so – it’s incredible. Among its many features, NotebookLM generates a podcast with natural sounding voices creating digestible and engaging audios of your uploaded documents and links.
Our AI content team has been looking a lot at TTS models, such as Nari Labs’ Dia. Interestingly, the TTS models we’ve been exploring don’t have research papers, which makes sense given the small teams accomplishing these amazing feats. For example, Nari Labs, which released Dia, had only two people working on the model, and Chatterbox, which we are about to cover, comes from a team of three. We’re very excited about the progress made by these small but mighty teams.
Resemble AI recently launched their first open-source TTS model under an MIT license, and it has been trending on Hugging Face since its release. What’s unique about this model is a feature they call emotion exaggeration control. Feel free to play around with this adjustable exaggeration parameter in their demo.
Resemble AI acknowledges Cosyvoice, HiFT-GAN, and Llama 3 (now deprecated) as inspiration. Audio files generated by Chatterbox incorporate the PerTH Watermarker, allowing for detection of AI content.
The voice cloning ability of Chatterbox is very impressive. When testing, our team found that the cloned voices bore a remarkable similarity to our own and that the generations were of high quality. For those interested in comparisons to ElevenLabs, A/B testing is available on Podonos.
This article will cover two implementation options for using the Chatterbox TTS model:
- Gradio: Using the Gradio interface for quick testing and interaction
- Creating an Audiobook: Generating an audiobook by cloning an author’s voice and processing text segments
Option 1: Gradio Interface
Step 1: Set up a GPU Droplet
Begin by setting up a DigitalOcean GPU Droplet: select the AI/ML image and choose the NVIDIA H100 option.
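As a side note, the same Droplet can also be created from the command line with DigitalOcean’s doctl CLI if you prefer. This is only a sketch: it assumes doctl is installed and authenticated, and the region, GPU size slug, AI/ML image slug, and SSH key ID are placeholders you would look up first with the list commands.
# Look up the available GPU size slugs and public images for your account
doctl compute size list
doctl compute image list --public
# Create the GPU Droplet (replace every <placeholder> with a real value)
doctl compute droplet create chatterbox-gpu --region <region> --size <h100-size-slug> --image <ai-ml-image-slug> --ssh-keys <ssh-key-id>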
Step 2: Web Console
Once your GPU Droplet finishes loading, you’ll be able to open up the Web Console.
Step 3: Install Dependencies
Next, install the necessary software packages. In the web console, paste and run the following commands to install pip for managing Python packages and git-lfs for handling large files:
apt update
apt install python3-pip python3.10 git-lfs -y
Step 4: Gradio App
Now, download the application code from Hugging Face and prepare its environment.
git-lfs clone https://huggingface.co/spaces/ResembleAI/Chatterbox
cd Chatterbox
python3 -m venv venv_chatterbox
source venv_chatterbox/bin/activate
pip3 install -r requirements.txt
pip3 install spaces
Step 5: Configure the Application for Sharing
To make your Gradio app accessible over the internet, you need to make a small change to its source code.
Open the main application file in the Vim text editor:
vim app.py
Press the i key to enter INSERT mode. You’ll see — INSERT — at the bottom of the terminal. Then, locate the last line of the file, which likely looks something like demo.launch(). Modify it to include share=True:
demo.launch(share=True)
Press the ESC key to exit INSERT mode. Afterwards, type :wq and press Enter to save your changes and exit Vim.
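If you’d rather avoid Vim, the same change can be made with a single sed command. This assumes the launch call in app.py is written exactly as demo.launch() with no arguments; if it already passes arguments, edit the file by hand instead.
sed -i 's/demo.launch()/demo.launch(share=True)/' app.py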
Step 6: Launch the Gradio App
You’re all set! Run the application with the following command:
python3 app.py
After the script initializes, you will see a public URL in the terminal output. Open this URL in your web browser to interact with your live Gradio application.
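As an aside, if you’d rather reach the app directly at your Droplet’s IP address instead of through a Gradio share link, you could bind the server to all interfaces. This is only an alternative sketch and assumes port 7860 is reachable through your Droplet’s firewall settings:
demo.launch(server_name="0.0.0.0", server_port=7860)  # then browse to http://<droplet-ip>:7860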
Option 2: Creating an Audiobook
Step 1: Prepare a Reference Audio Sample
To create an audiobook, Chatterbox requires a short audio sample of the author’s voice to clone it effectively. For optimal results, the Resemble AI Team recommends that audio recordings should be at least 10 seconds in duration and ideally in WAV format. Furthermore, the audio should have a 24k sample rate or higher, feature a single speaker with no background noise, and if possible, be recorded on a professional microphone. The content and speaking style are also important; the context of the spoken sentence should match the emotion in the audio file, and the reference clip’s speaking style should be similar to the desired output, such as using an audiobook-style clip for audiobook generation.
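If your recording doesn’t already match these recommendations, a short script can handle the basic cleanup. Below is a minimal sketch using torchaudio (installed in the next step) that downmixes a clip to mono and resamples it to 24 kHz; the input and output filenames are just placeholders.
import torchaudio

# Load the raw recording (any format torchaudio can read)
waveform, sample_rate = torchaudio.load("raw_author_recording.wav")

# Downmix to a single channel if the recording is stereo
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to 24 kHz if needed
target_rate = 24000
if sample_rate != target_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, target_rate)

# Save the cleaned-up reference clip as WAV
torchaudio.save("author_sample.wav", waveform, target_rate)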
Step 2: Install Dependencies
Refer to Option 1 earlier in this tutorial for instructions on setting up a GPU Droplet, cloning the Chatterbox repo, and setting up a virtual environment. Then paste the command below into the terminal to install the necessary packages.
pip3 install chatterbox-tts torchaudio
Step 3: Generate Speech Using the Author’s Voice
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

# Load the pre-trained Chatterbox model (use "cpu" if CUDA is unavailable)
model = ChatterboxTTS.from_pretrained(device="cuda")

# Define the text to be converted into speech
text = "Your audiobook text goes here."

# Specify the path to the reference audio sample
audio_prompt_path = "author_sample.wav"

# Generate the speech waveform
wav = model.generate(text, audio_prompt_path=audio_prompt_path)

# Save the generated audio to a file
ta.save("audiobook_segment.wav", wav, model.sr)
Replace "Your audiobook text goes here." with the actual text from your audiobook, and author_sample.wav with the path to your reference audio file.
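For a full audiobook, you’ll likely want to generate audio in smaller chunks rather than one giant string. The sketch below continues from the snippet above (reusing model, ta, and audio_prompt_path) and splits a hypothetical chapter_01.txt into paragraphs, saving one numbered WAV file per segment; the file names and paragraph-based splitting are assumptions you can adapt.
# Split a chapter into paragraphs and synthesize one audio file per segment
with open("chapter_01.txt") as f:
    chapter_text = f.read()

segments = [p.strip() for p in chapter_text.split("\n\n") if p.strip()]

for i, segment in enumerate(segments):
    wav = model.generate(segment, audio_prompt_path=audio_prompt_path)
    ta.save(f"chapter_01_segment_{i:03d}.wav", wav, model.sr)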
Step 4: Adjust Voice Characteristics
You can adjust the expressiveness and pacing of the synthesized voice using the exaggeration and cfg_weight parameters:
exaggeration: Controls emotional expressiveness. Higher values make the speech more dramatic.
cfg_weight (classifier-free guidance weight): Adjusts the adherence to the reference voice’s characteristics. Lower values can slow down the speech for clarity.
wav = model.generate(
    text,
    audio_prompt_path=audio_prompt_path,
    exaggeration=0.7,  # More expressive
    cfg_weight=0.3,    # Slower, more deliberate pacing
)
Step 5: Compile the Audiobook
Process each chapter or section of your audiobook individually, generating the corresponding audio files. Once all segments are synthesized, use an audio editing tool like Audacity (or stitch the files together programmatically, as in the sketch after this list) to:
- Concatenate the audio segments in the correct order.
- Add background music or sound effects if desired.
- Ensure consistent volume levels and audio quality throughout.
- Finally, export the complete audiobook in your preferred format (e.g., MP3, WAV).
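If you only need the basic joining step, here is a minimal sketch that concatenates the numbered WAV files from the earlier chunking sketch using torchaudio, assuming they all share the same sample rate and channel count and that their names sort in playback order:
import glob

import torch
import torchaudio

# Collect the segment files in playback order (zero-padded names sort correctly)
segment_files = sorted(glob.glob("chapter_01_segment_*.wav"))

waveforms = []
sample_rate = None
for path in segment_files:
    wav, sr = torchaudio.load(path)
    sample_rate = sample_rate or sr
    waveforms.append(wav)

# Join the segments along the time axis and write out the full chapter
audiobook = torch.cat(waveforms, dim=1)
torchaudio.save("chapter_01_full.wav", audiobook, sample_rate)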
Chatterbox, developed by Resemble AI, is a recently released text-to-speech model with impressive voice cloning abilities and natural-sounding voices. The model can be run through a Gradio interface and incorporated into a variety of use cases (e.g. audiobooks). Chatterbox represents the significant progress being made in personalized Voice AI.
Deepgram, an enterprise Voice AI platform, published a report “State of Voice AI 2025” which highlights trends around voice AI adoption. They make the case that 2025 is the year of the Voice AI Agent.
Check out one of our older tutorials which leverages Deepgram: “Building a Real-time AI Chatbot with Vision and Voice Capabilities using OpenAI, LiveKit, and Deepgram on GPU Droplets”