Introducing Dia, a TTS model from Nari Labs

One of the most exciting areas in AI right now is the advancement of voice models. In our ongoing exploration of cutting-edge text-to-speech (TTS) models, we previously highlighted the Conversational Speech Model from Sesame.
In this article, we will discuss Dia, a 1.6 billion parameter open-source TTS model from Nari Labs. Currently, there is not much information available about its architecture other than that it is heavily inspired by SoundStorm, Parakeet, and the Descript Audio Codec. We’ll leave it to you to speculate about how the model was trained, and we may cover it in a follow-up article once more information is available. For now, we’ll focus on its implementation.
We’re very impressed with the model’s performance. Test it out yourself in this Hugging Face Space or follow the implementation instructions below.
We’ll cover two different ways of testing out this model. The first is the Web Console, which is great for one-off testing and quick checks of the model’s capabilities in a Gradio interface. The second is the Python library, which is better suited for building more intricate applications.
Step 1: Set up a GPU Droplet
Begin by setting up a DigitalOcean GPU Droplet: select AI/ML and choose the NVIDIA H100 option.
Step 2: Run Dia in the Web Console
Once your GPU Droplet finishes loading, you’ll be able to open up the Web Console.
In the web console, copy and paste the following code snippet:
# Clone the Dia repository and enter it
git clone https://github.com/nari-labs/dia.git
cd dia

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install Dia and its dependencies, then launch the Gradio app
pip install -e .
python app.py
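Before launching the app, you can optionally run a quick sanity check from the same virtual environment to confirm that PyTorch can see the H100. This is just a precaution against driver or environment issues and is not part of the Dia setup itself.

# Optional sanity check: confirm PyTorch detects the GPU before starting app.py
import torch

print(torch.cuda.is_available())      # should print True on a GPU Droplet
print(torch.cuda.get_device_name(0))  # should list the NVIDIA H100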
Running app.py will print a Gradio link that you can access within VS Code.
Step 3: Open VS Code
In VS Code, click on “Connect to…” in the Start menu.
Choose “Connect to Host…”.
Step 4: Connect to your GPU Droplet
Click “Add New SSH Host…” and enter the SSH command to connect to your Droplet. This command usually takes the form ssh root@[your_droplet_ip_address]. Press Enter to confirm, and a new VS Code window will open, connected to your Droplet.
You can find your droplet’s IP address on the GPU Droplet page.
Step 5: Access the Gradio Interface
In the new VS Code window connected to your Droplet, type >sim in the search bar at the top and select “Simple Browser: Show”.
Paste the Gradio URL from the Web Console, hit Enter, and click the arrow in the top right.
This is the Gradio interface. Feel free to modify the input text to your liking.
To use Dia effectively, it’s essential to consider the length of your input text. Nari Labs recommends aiming for text that corresponds to 5-20 seconds of audio for the most natural-sounding results. If your input text is too short, equivalent to under 5 seconds of audio, the output may sound unnatural. On the other hand, inputs that would take over 20 seconds to speak will be compressed, resulting in unnaturally fast speech. By keeping your text within the moderate range, you can achieve more realistic and engaging audio outputs.
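If you want a rough, programmatic check before generating, you can estimate how long a script will take to speak from its word count. The sketch below assumes roughly 150 words per minute of conversational speech, which is our own rule of thumb rather than an official Nari Labs figure.

# Rough duration estimate for a Dia script (assumes ~150 words per minute,
# a conversational-pace rule of thumb, not an official Nari Labs number).
import re

def estimate_seconds(script: str, words_per_minute: float = 150.0) -> float:
    # Strip speaker tags like [S1] and non-verbals like (laughs) before counting words
    words = re.sub(r"\[S\d\]|\([a-z]+\)", " ", script).split()
    return len(words) / words_per_minute * 60.0

script = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices."
print(f"Estimated duration: {estimate_seconds(script):.1f}s")  # aim for roughly 5-20 seconds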
When creating dialogue with Dia, using speaker tags correctly is crucial. Always begin your input text with the [S1] tag to indicate the first speaker. When switching between speakers, alternate between [S1] and [S2] tags, making sure to never use [S1] twice in sequence. This simple tagging system helps Dia understand the conversation flow and produce a more natural-sounding dialogue.
In addition to speaker tags, non-verbal elements can also enhance your audio outputs. However, it’s recommended to use non-verbal tags sparingly for the most natural results. Stick to the officially supported non-verbal sounds listed in the documentation, as overusing these tags or attempting to use unlisted non-verbals may introduce unwanted artifacts.
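Putting these guidelines together, a well-formed input might look like the following sketch; the script itself is purely illustrative.

# Illustrative script: starts with [S1], alternates speakers, and uses a
# single supported non-verbal tag sparingly.
script = (
    "[S1] Welcome back to the show. "
    "[S2] Thanks for having me. (laughs) "
    "[S1] Let's jump right in."
)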
To work with Dia in a more programmatic way, we can use its Python library. The code snippet below, from voice_clone.py, can be modified to your liking.
from dia.model import Dia

# Load the 1.6B Dia checkpoint in half precision
model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

# Transcript of the reference audio used for voice cloning
clone_from_text = (
    "[S1] Dia is an open weights text to dialogue model. "
    "[S2] You get full control over scripts and voices. "
    "[S1] Wow. Amazing. (laughs) "
    "[S2] Try it now on Git hub or Hugging Face."
)
clone_from_audio = "simple.mp3"

# New dialogue to generate in the cloned voices
text_to_generate = (
    "[S1] Hello, how are you? [S2] I'm good, thank you. "
    "[S1] What's your name? [S2] My name is Dia. "
    "[S1] Nice to meet you. [S2] Nice to meet you too."
)

# Condition generation on the reference transcript and audio prompt
output = model.generate(
    clone_from_text + text_to_generate,
    audio_prompt=clone_from_audio,
    use_torch_compile=True,
    verbose=True,
)

model.save_audio("voice_clone.mp3", output)
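If you don’t need voice cloning, the same API can generate audio directly from a tagged script. The minimal sketch below reuses only the calls shown above; the output filename is arbitrary.

# Minimal sketch: plain text-to-dialogue generation without an audio prompt,
# using the same Dia API calls as the voice cloning example above.
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

text = "[S1] Hello, how are you? [S2] I'm good, thank you. (laughs)"
output = model.generate(text, use_torch_compile=True, verbose=True)
model.save_audio("simple_dialogue.mp3", output)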
Kudos to Nari Labs for pushing the frontier of text-to-speech models – and what’s even more remarkable is that it’s driven by just two passionate undergraduate students. You can really just do things.
We’re excited to hear about how you’re leveraging TTS models. Share your experiences with DigitalOcean GPU Droplets in the comments below: how are you harnessing their power for your TTS applications?