
Speech translation

Easily integrate real-time speech translation into your app.

Enable multilingual communication

Translate audio from more than 30 languages and customize your translations for your organization’s specific terms—all in your preferred programming language.


Production-ready

Benefit from fast, reliable speech translation powered by neural machine translation technology.


Customizable translations

Tailor models to recognize domain-specific terminology and unique speaking styles.


Normalized text

Deliver readable translations with an engine trained to normalize speech output.


Built-in security

Your data stays yours—your speech input is not logged during processing.

Add high-quality translations to your apps

Generate speech-to-speech and speech-to-text translations with a single API call. Speech Translation captures the context of full sentences to provide accurate, fluent translations and improve communication between speakers of different languages.
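As an illustration, here is a minimal sketch of a single speech translation call using the Azure Speech SDK for Python (azure-cognitiveservices-speech); the key, region, voice name, and language codes are placeholders, and the SDK can also stream from audio files instead of the microphone.

```python
import azure.cognitiveservices.speech as speechsdk

# Configure translation: recognize English speech, translate to Spanish.
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("es")
# Optional: request synthesized target-language audio (speech-to-speech).
translation_config.voice_name = "es-ES-ElviraNeural"

audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config, audio_config=audio_config)

# Each synthesizing event carries a chunk of translated audio.
recognizer.synthesizing.connect(
    lambda evt: print(f"{len(evt.result.audio)} bytes of translated audio"))

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print("Recognized:", result.text)
    print("Translated:", result.translations["es"])
```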

Tailor translations to reflect domain-specific terminology

Normalize text for better translations.

Speech Translation can remove verbal fillers ("um," "uh," and coughs) and repeated words, add proper punctuation and capitalization, and exclude profanities for more readable translations.



Privacy and security

The Speech service, part of Azure AI Services, is certified by SOC, FedRAMP, PCI, HIPAA, HITECH, and ISO.

View or delete any of your custom translator data and models at any time. Your data is encrypted while it’s in storage.

You control your data. Your audio input and translation data are not logged during audio processing.

Backed by Azure infrastructure, the Speech service offers enterprise-grade security, availability, compliance, and manageability.

Comprehensive security and compliance, built in

Microsoft invests more than $1 billion annually in cybersecurity research and development.


We employ more than 3,500 security experts who are dedicated to data security and privacy.


Azure has more certifications than any other cloud provider. View the comprehensive list.


Flexible pricing gives you the power and control you need

Pay only for what you use, with no upfront costs.

With Speech Translation, you pay as you go, based on hours of audio translated.

Get started with an Azure free account


After your credit, move to pay as you go to keep building with the same free services. Pay only if you use more than your free monthly amounts.


Documentation and resources

Get started.

Read our documentation.

Take the Microsoft Learn course.

Explore code samples

Check out our sample code.

See customization resources

Customize your speech solution with Speech Studio. No code required.

Start building with AI Services


In our increasingly interconnected world, where language differences may present a barrier to communication, translation systems can enable people from different linguistic backgrounds to share knowledge and experiences more seamlessly. However, many of these systems today do not preserve key elements of speech that make human communication human. More specifically, it’s not just the words we choose that convey what we want to say—it’s also how we speak them. Tone of voice, pauses, and emphasis carry important signals that help us communicate emotions and intent. Moreover, human speech and translation are sensitive to nuances such as turn-taking and timing controls. Picture, for example, how human interpreters work: they find just the right balance between low-latency and accurate translations. Waiting too long stifles the flow of communication, while going too fast compromises the overall quality of a translation. Translation systems that enable authentic conversations should deliver across all of these elements of communication.


Today, we are excited to share Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real time. To build Seamless, we developed SeamlessExpressive, a model for preserving expression in speech-to-speech translation, and SeamlessStreaming, a streaming translation model that delivers state-of-the-art results with around two seconds of latency. All of the models are built on SeamlessM4T v2, the latest version of the foundational model we released in August. SeamlessM4T v2 demonstrates performance improvements for automatic speech recognition, speech-to-speech, speech-to-text, and text-to-speech capabilities. Compared to previous efforts in expressive speech research, SeamlessExpressive addresses certain underexplored aspects of prosody, such as speech rate and pauses for rhythm, while also preserving emotion and style. The model currently preserves these elements in speech-to-speech translation between English, Spanish, German, French, Italian, and Chinese.

SeamlessStreaming unlocks real-time conversations with someone who speaks a different language by generating the translation while the speaker is still talking. In contrast to conventional systems, which translate only after the speaker has finished a sentence, SeamlessStreaming lets the listener hear a translation in close to real time, with a delay of only a few seconds, rather than waiting for the full sentence. SeamlessStreaming supports automatic speech recognition and speech-to-text translation for nearly 100 input and output languages, and speech-to-speech translation for nearly 100 input languages and 36 output languages. In keeping with our approach to open science, we’re publicly releasing all four models to allow researchers to build on this work.

Introducing metadata, data, and data alignment tools


Today, alongside our models, we are releasing metadata, data, and data alignment tools to assist the research community, including:

  • Metadata of an extension of SeamlessAlign corresponding to an additional 115,000 hours of speech and text alignments on top of the existing 470,000 hours. In addition to more hours, the latest version of SeamlessAlign covers a broader range of languages (from 37 previously to 76 with the extension). This corpus is the largest public speech/speech and speech/text parallel corpus to date in terms of total volume and language coverage.
  • Metadata of SeamlessAlignExpressive, an expressivity-focused version of the dataset above. In this dataset, the pairs are parallel from both a semantic and prosodic perspective. SeamlessAlignExpressive is released as a benchmark to validate our expressive alignment approach. In order to train our expressive models, we applied our alignment method to a proprietary dataset.
  • Translated text data for mExpresso, a multilingual, parallel extension of the read speech in Expresso, a high-quality expressive speech dataset that includes both read speech and improvised dialogues rendered in different styles. This text benchmark enables evaluation of expressive translation systems from English into other languages.
  • Tools to assist the research community in collecting more datasets for translation.

In particular, we are updating our stopes library and SONAR encoders. With these tools, anyone can automatically create multimodal translation pairs from their own speech and/or text monolingual data through parallel data alignment methods.
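As a rough sketch of what parallel data alignment involves (this shows the general margin-based mining idea, not the actual stopes or SONAR API), the snippet below scores candidate pairs by cosine similarity normalized by each sentence's nearest-neighbor similarities; the random vectors stand in for real sentence embeddings.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def margin_scores(src_emb, tgt_emb, k=4):
    # Cosine similarity between every source and target embedding.
    sim = normalize(src_emb) @ normalize(tgt_emb).T
    # Average similarity to each side's k nearest neighbors.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    # Ratio margin: down-weight "hub" sentences similar to everything.
    return sim / ((knn_src + knn_tgt) / 2)

def mine_pairs(src_emb, tgt_emb, threshold=1.0):
    scores = margin_scores(src_emb, tgt_emb)
    best = scores.argmax(axis=1)
    return [(i, j, float(scores[i, j]))
            for i, j in enumerate(best) if scores[i, j] >= threshold]

# Toy demo: random vectors standing in for speech/text embeddings.
rng = np.random.default_rng(0)
pairs = mine_pairs(rng.normal(size=(50, 64)), rng.normal(size=(60, 64)))
print(pairs[:3])
```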

Our approach


All our models run on fairseq2, the latest update of our sequence modeling toolkit. Similar to our previous work on SeamlessM4T, fairseq2 offers an ideal framework for building our streaming and expressivity updates because it is lightweight, easily composable with other PyTorch ecosystem libraries, and has more efficient modeling and data loader APIs.

UnitY2, a new architecture that has a non-autoregressive text-to-unit decoder, is also instrumental to our work. In SeamlessM4T v2, we used multitask-UnitY2 to enable text input (updated from v1's multitask-UnitY). We also used the architecture for SeamlessStreaming and SeamlessExpressive. As our next generation multitask model, UnitY2 has superior speech generation capabilities through its improved text-to-unit model. This implementation leads to improved consistency between text output and speech output, compared to the SeamlessM4T v1 model.

Instead of using an autoregressive text-to-unit model as in UnitY, we used a non-autoregressive model. Autoregressive models predict the next token based on the previously generated tokens. While they model speech naturally, they scale poorly as sequence length increases and are more likely to exhibit repetitive degeneration. Non-autoregressive models predict the duration of each segment, which enables each segment to be decoded in parallel. This makes them robust to long sequences, and we see improvements over the initial iteration of UnitY. Because the model inherently predicts durations, it is also much more easily adapted to the streaming use case: we know exactly how much speech needs to be generated for each piece of text, which is not the case for autoregressive models.
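The toy PyTorch module below (a sketch, not the UnitY2 implementation) illustrates the core mechanism: predict a duration per input token, upsample token states by repetition, then predict every output unit in parallel with no left-to-right loop.

```python
import torch
import torch.nn as nn

class NarTextToUnit(nn.Module):
    """Toy non-autoregressive text-to-unit decoder with duration prediction."""
    def __init__(self, d_model=256, n_units=1000):
        super().__init__()
        self.duration_head = nn.Linear(d_model, 1)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.unit_head = nn.Linear(d_model, n_units)

    def forward(self, text_states):  # (batch, tokens, d_model)
        # Predict an integer duration (number of unit frames) per token.
        dur = self.duration_head(text_states).squeeze(-1)
        dur = torch.clamp(torch.round(torch.exp(dur)), min=1, max=10).long()
        # Length regulation: repeat each token state by its duration.
        upsampled = [h.repeat_interleave(d, dim=0)
                     for h, d in zip(text_states, dur)]
        x = nn.utils.rnn.pad_sequence(upsampled, batch_first=True)
        # All unit positions are decoded in one parallel pass.
        return self.unit_head(self.decoder(x)).argmax(-1)

units = NarTextToUnit()(torch.randn(2, 7, 256))
print(units.shape)  # (2, total predicted frames) of discrete unit ids
```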

EMMA is our core streaming algorithm; it decides when we have enough information to generate the next speech segment or target text. It improves on previous state-of-the-art algorithms, especially for the long input sequences found in speech-to-text and speech-to-speech translation. The algorithm also lets us fine-tune from offline models, so we can reap the benefits of the SeamlessM4T v2 foundation model. Finally, we show empirically that it generalizes well across many different language pairs, which is particularly challenging for streaming models because the language pairs may be structured differently.
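To make the read/write trade-off concrete, here is a deliberately simplified streaming loop that uses a fixed wait-k rule in place of EMMA's learned policy; `translate_prefix` is a hypothetical stand-in for a model that re-translates whatever source has arrived so far.

```python
def waitk_stream(source_chunks, translate_prefix, k=3):
    """Fixed wait-k policy: read k source chunks before writing, then
    emit any newly available target segments after each further chunk."""
    read, emitted = [], 0
    for chunk in source_chunks:
        read.append(chunk)                 # READ: consume more source
        if len(read) < k:
            continue
        target = translate_prefix(read)    # re-translate the prefix
        for segment in target[emitted:]:
            yield segment                  # WRITE: emit translation now
        emitted = len(target)

# Toy demo: "translation" is just upper-casing each chunk.
out = waitk_stream(["hola", "que", "tal"], lambda s: [c.upper() for c in s], k=2)
print(list(out))  # ['HOLA', 'QUE', 'TAL'], emitted while input still arrives
```

EMMA replaces the fixed k with a learned, per-step decision, which is what lets it balance latency against quality across differently structured language pairs.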

Expressivity

Preserving expression also requires a new approach. We replaced the unit HiFi-GAN vocoder in SeamlessM4T v2 with PRETSSEL, an expressive unit-to-speech generator. PRETSSEL is conditioned on the source speech for waveform generation to transfer tones, emotional expression, and vocal style qualities. We initialize our model from SeamlessM4T v2 in order to achieve high translation quality, which is the most fundamental need for a speech-to-speech translation system. We also developed Prosody UnitY2, integrating an expressivity encoder in SeamlessM4T v2 to guide unit generation with proper rhythm, speaking rate, and pauses. In addition, we release a suite of evaluation tools to capture the preservation of these aspects of expressivity.


The updates to UnitY2 have resulted in improved translation quality across a variety of tasks. SeamlessM4T v2 achieves state-of-the-art speech-to-speech and speech-to-text translation results in 100 languages. The same model also beats Whisper v3 for automatic speech recognition on average, and in particular for lower-resource languages.

For speech-to-text translation, SeamlessM4T v2 improves by 10% compared to the model we released in August and by more than 17% over the strongest cascaded models when translating into English. For speech-to-speech translation, SeamlessM4T v2 improves over SeamlessM4T (v1) by more than 15% when translating into English, and by 25% when translating from English.

In other tasks, SeamlessM4T v2 is on par with No Language Left Behind (NLLB) in text-to-text translation. It is also on par on average with MMS in automatic speech recognition (ASR), with better performance on mid- and high-resource languages while MMS performs better on low-resource languages, and it improves over the recently released Whisper-Large-v3 by more than 25%. In the zero-shot task of text-to-speech translation, SeamlessM4T v2 is on par with strong cascaded models when translating into English, and improves over these baselines by 16 percent when translating from English.

We compared SeamlessExpressive against a cascaded speech-to-text and text-to-speech pipeline, where speech-to-text is from SeamlessM4T v2 and text-to-speech is from a strong open-source cross-lingual text-to-speech system that supports vocal style and emotion transfer. Results show that SeamlessExpressive is more stable with respect to noise in the source speech, so the output speech maintains high content translation quality, and it better preserves styles and speech rate. SeamlessStreaming achieves state-of-the-art quality at low latency for speech-to-speech translation.

How we built AI translation systems responsibly: Toxicity mitigation

Accuracy is paramount in translation systems. Translation errors or unintended toxicity can cause misunderstandings between two people who don’t speak the same language.

Keeping with our commitment to building responsible AI, we explored the problem of hallucinated toxicity further. We focused our efforts on SeamlessM4T v2, which serves as the foundation for SeamlessStreaming, SeamlessExpressive, and our unified Seamless model.

The primary root cause for hallucinated toxicity often lies in the training data. Training samples can be noisy and contain unbalanced toxicity. For example, the input language side and target language side can contain different amounts of toxic words by mistake. Prior to training, we discarded any sample that showed signs of this imbalance.
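A minimal sketch of such an imbalance filter, assuming whitespace tokenization and tiny placeholder toxicity lists (real pipelines use curated multilingual word lists and subword-aware matching):

```python
SRC_TOX = {"damn"}   # placeholder source-language toxicity list
TGT_TOX = {"zut"}    # placeholder target-language toxicity list

def toxicity_balanced(src_text, tgt_text):
    """Keep a pair only if both sides contain the same number of listed
    toxic terms; a mismatch suggests toxicity was added or deleted."""
    n_src = sum(w in SRC_TOX for w in src_text.lower().split())
    n_tgt = sum(w in TGT_TOX for w in tgt_text.lower().split())
    return n_src == n_tgt

corpus = [("that is damn good", "c'est zut bien"),   # balanced: kept
          ("that is good", "c'est zut bien")]        # imbalanced: dropped
print([pair for pair in corpus if toxicity_balanced(*pair)])
```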

However, filtering is only a passive technique and does not fully prevent hallucinated toxicity. We went one step further this time, and implemented a novel approach that actively mitigates this phenomenon. During the translation generation process, our model automatically detects generated toxic words. When there are misaligned levels of toxicity, we automatically re-adjust the generation process and use a different choice of words. This works at inference time and does not require any fine-tuning of the translation model. By doing so, we significantly reduce added toxicity while preserving translation quality.

Finally, building upon our past work on toxicity and bias evaluation, we’ve extended our evaluation framework with a new hallucinated toxicity detection tool. While our previous approach relied on an intermediate transcription model (ASR), we are now capable of detecting toxicity directly in the speech signal. This is useful in cases where toxicity is not conveyed by individual words, but rather in tone or general style. This allows us to get a more precise picture of the potential toxicity profile of our model. Additional research needs to be done on responsible AI for machine translation; however, we believe these measures bring us closer to realizing safer and more human-centric translation systems.

Audio watermarking

While AI tools can help bring the world closer together, it’s just as important that we include measures to prevent the risk of imitation and other forms of misuse. Our watermarking method offers a better level of reliability compared to passive discriminators, which are becoming less effective at differentiating synthetic voices from human ones as voice preservation technology advances. Watermarking actively embeds a signal that is imperceptible to the human ear, but still detectable within the audio using a detector model. Through this watermark, the origin of the audio can be accurately traced. This helps promote the responsible use of voice preservation technology by establishing a verifiable audio provenance and helps prevent potential abuses.
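The sketch below shows a hypothetical embed/detect interface for a watermark of this kind: a small additive residual as the imperceptible mark and a convolutional detector that scores fixed-length frames, giving frame-level localization. It is illustrative only, not the released watermarking model.

```python
import torch
import torch.nn as nn

class Watermarker(nn.Module):
    """Hypothetical neural audio watermark: embed adds a tiny residual,
    detect returns a per-frame probability that the mark is present."""
    def __init__(self, frame=320):  # 20 ms frames at 16 kHz
        super().__init__()
        self.generator = nn.Conv1d(1, 1, kernel_size=9, padding=4)
        self.detector = nn.Conv1d(1, 1, kernel_size=frame, stride=frame)

    def embed(self, wav, eps=1e-3):            # wav: (batch, 1, samples)
        return wav + eps * torch.tanh(self.generator(wav))

    def detect(self, wav):                     # -> (batch, frames)
        return torch.sigmoid(self.detector(wav)).squeeze(1)

wm = Watermarker()
marked = wm.embed(torch.randn(1, 1, 16_000))   # one second of audio
print(wm.detect(marked).shape)                 # torch.Size([1, 50])
```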

Beyond sheer detection accuracy, our watermarking solution needs to be robust to various attacks. For example, bad actors can try to modify the audio by adding noise, echo, or filtering some frequencies to dilute the watermark and bypass detection. We tested our watermarking method against a broad range of attack types and the results show that it is more robust than the current state-of-the-art. Our method can also pinpoint AI-generated segments in audio down to the frame level, surpassing the previous state-of-the-art (which only provides a one second resolution).

As with any neural-network-based safety mechanism, the watermarking model could be fine-tuned in isolation to forget its core properties. However, fine-tuning SeamlessExpressive and Seamless for translation purposes would not involve any update to the watermarking model itself, which plays no role in translation quality.

Providing access to our technology

The breakthroughs we’ve achieved with Seamless show that the dream of a universal, real-time translator isn’t science fiction—it’s becoming a technical reality. We invite everyone to try our expressive translation demo. We’re also making our code, model and data available to the research community.

This blog post was made possible by the work of Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-jussà, Maha Elbayad, Hongyu Gong, Francisco Guzmán, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, and Mary Williamson.



KUDO AI Speech Translator

Live audio and captions in 30+ languages

Make your meetings and events accessible in any language


3,800 users

“The tool is great. I will come back whenever we have clients for our next events who need translation support.” (KUDO AI user)

Live-translated audio and captions, powered by AI

Audio and captions in 30+ languages, perfect for live-translated webinars, presentations, and training

Choose an AI-generated voice that imitates the natural flow of speech.

Access multilingual captions and audio from any laptop or smartphone.

Download recordings in each language for offline use after the fact.

Benefit from our end-to-end features like request-to-speak and voting.

KUDO AI is not:

  • For two-way conversations. One-to-many formats only, today.
  • Only speech-to-text. We provide live multilingual audio and captions.
  • A literal or word-for-word translator. You can expect human-like speech.
  • A ‘record and playback’ tool. We do real-time, continuous translation.

Workforce Engagement

Inclusion starts with speaking your listener's language

Increase workforce engagement with inclusive all-hands meetings and team building.

A non-engaged or disengaged workforce can cost companies up to $7.8 trillion globally in lost productivity, the equivalent of 11% of global GDP.

Increase employee participation and efficiency with truly inclusive internal meetings and L&D programs that speak to the whole company.

Webinars and Events

Productivity starts with speaking your listener's language

Expand the reach and participation rate of your webinars and events.

Time is wasted when a company has to use internal resources to translate for global executives, holding multiple meetings to deliver the same content for each country. Don't let language fragmentation put a stop to your growth.

Run your webinars and meetings in an easy, cost-effective, and more accessible way.

Training Programs

Boost the effectiveness and accuracy of your corporate training programs

Training users and workers can be challenging for sectors that require precision like manufacturing or medical devices. Inefficient training can limit product adoption and companies’ expansion to new markets.

With KUDO, you can be confident that complex topics are being effectively communicated and understood.

Research

Growth starts with speaking your listener's language

Global diffusion for your research and university courses.

Business English no longer needs to be the only language for your academic courses and MBAs, or the only language used by researchers for seminars and conferences.

What if you could make your institution’s research and programs accessible to more students and researchers?

Want to learn more about KUDO AI?

How To Boost Your L&D Strategy with Multilingual Content

Multilingual content can significantly improve employee engagement and knowledge retention. Read how to implement a multilingual L&D strategy using technology.


KUDO AI Translation Quality: Testing Results & Performance Review

Translation quality, accuracy, and fluency: what can you expect from KUDO AI? Here’s the outcome of our tests, and the methodology behind it.


How Good is AI Speech Translation Today? An A-Z Guide to Quality

How accurate and reliable are AI speech translators? What is the quality of human vs AI interpretation? Here’s everything you need to know.


KUDO CEO Takes On the ‘AI vs Human Language Interpreters’ Debate

Five Tips to Successfully Engage Your Remote Workforce

Got a question? We’re all ears.

We can answer that in five points. KUDO AI offers:  

  • Speech-to-speech translation (multilingual audio as well as captions). This means that you can hear the speaker in your preferred language without having to follow only subtitles. This is particularly useful during events or conferences where you can simply plug a pair of headphones into your smartphone and experience the content live.  
  • Real-time (simultaneous) translation. At KUDO, we exclusively offer continuous, live speech translation. No having to record your sentences then play them back one-by-one; KUDO AI will translate your voice and the voice of your listeners in real-time.  
  • A fully integrated solution. Unlike other AI translation tools on the market, KUDO AI is not a standalone app that you have to run on another device in parallel to your videoconferencing platform. And if you’re using it on the KUDO platform itself, you can expect a host of additional features like voting/polling, request-to-speak, etc.  
  • Human-like translation. KUDO was designed by a team that includes 50% language interpretation and 50% translation technology expertise. Each language that is added to KUDO AI undergoes extensive research and development to attain the highest quality translation on the market, and we have created a unique process to render the flow and intonation of sentences in the most natural way possible for the listener.  
  • An affordable solution. Inclusion is one of KUDO’s three core values. We want language accessibility to be available on a global scale, not just reserved for the organizations with the highest budgets. This is reflected in the pricing we’ve created for KUDO AI.

No, not yet. AI speech technology has come a long way in the last year, but no solution on the market can deliver a good enough user experience for a back-and-forth, real-time conversation. But we’re working on it.

As of today, KUDO AI lets you have multiple speakers of different languages taking it in turns to present to an audience, but this is not the same as having those speakers hold a two-way conversation using the solution.

For one-to-many presentations, webinars, training, and teaching, KUDO AI is a great language accessibility solution for your participants.

For decision-making, discussions, panels, and any conversational events, we recommend using professional human interpretation instead.

Yes, absolutely. Our clients include some of the world’s biggest private and public organizations, so when it comes to InfoSec, we know what we’re doing.  

  • 100% GDPR compliant  
  • ISO 27001  
  • SOC 2 Type 1/Type 2  
  • FedRAMP Moderate Readiness

Sorry to disappoint, but the answer is that it completely depends on what type of communication you want to make language-accessible: the meeting or event type, the duration, and the languages you require.

Talk to our experts about your language needs and they will advise you on which solution is best suited.



Meta’s “massively multilingual” AI model translates up to 100 languages, speech or text

Meta aims for a universal translator like the “Babel fish” from Hitchhiker’s Guide.

Benj Edwards - Aug 22, 2023 7:57 pm UTC


On Tuesday, Meta announced SeamlessM4T, a multimodal AI model for speech and text translations. As a neural network that can process both text and audio, it can perform text-to-speech, speech-to-text, speech-to-speech, and text-to-text translations for "up to 100 languages," according to Meta. Its goal is to help people who speak different languages communicate with each other more effectively.


Continuing its relatively open approach to AI, Meta is releasing SeamlessM4T under a research license (CC BY-NC 4.0) that allows developers to build on the work. The company is also releasing SeamlessAlign, which Meta calls "the biggest open multimodal translation dataset to date, totaling 270,000 hours of mined speech and text alignments." That will likely kick-start the training of future translation AI models from other researchers.

Among the features of SeamlessM4T touted on Meta's promotional blog, the company says that the model can perform speech recognition (you give it audio of speech, and it converts it to text), speech-to-text translation (it translates spoken audio to a different language in text), speech-to-speech translation (you feed it speech audio, and it outputs translated speech audio), text-to-text translation (similar to how Google Translate functions), and text-to-speech translation (feed it text and it will translate and speak it out in another language). Each of the text translation functions supports nearly 100 languages, and the speech output functions support about 36 output languages.

In the SeamlessM4T announcement, Meta references the Babel Fish, a fictional fish from Douglas Adams' classic sci-fi series that, when placed in one's ear, can instantly translate any spoken language:

Building a universal language translator, like the fictional Babel Fish in The Hitchhiker’s Guide to the Galaxy, is challenging because existing speech-to-speech and speech-to-text systems only cover a small fraction of the world’s languages. But we believe the work we’re announcing today is a significant step forward in this journey.

How did they train it? According to the SeamlessM4T research paper, Meta's researchers "created a multimodal corpus of automatically aligned speech translations of more than 470,000 hours, dubbed SeamlessAlign" (mentioned above). They then "filtered a subset of this corpus with human-labeled and pseudo-labeled data, totaling 406,000 hours."

As usual, Meta is being a little vague about where it got its training data. The text data came from "the same dataset deployed in NLLB" (sets of sentences pulled from Wikipedia, news sources, scripted speeches, and other sources, translated by professional human translators). And SeamlessM4T's speech data came from "4 million hours of raw audio originating from a publicly available repository of crawled web data," of which 1 million hours were in English, according to the research paper. Meta did not specify which repository or the provenance of the audio clips used.

Meta is far from the first AI company to offer machine-learning translation tools. Google Translate has used machine-learning techniques since 2006, and large language models (such as GPT-4) are well known for their ability to translate between languages. But more recently, the tech has heated up on the audio processing front. In September, OpenAI released its own open-source speech-to-text translation model, called Whisper, which can recognize speech in audio and translate it to text with a high level of accuracy.

SeamlessM4T builds on that trend by expanding multimodal translation to many more languages. In addition, Meta says that SeamlessM4T’s "single system approach"—a monolithic AI model instead of multiple models combined in a chain (like some of Meta's previous audio-processing techniques)—reduces errors and increases the efficiency of the translation process.

More technical details on how SeamlessM4T works are available on Meta's website, and its code and weights (the actual trained neural network files) can be found on Hugging Face.


  • SeamlessM4T is the first all-in-one multilingual multimodal AI translation and transcription model.
  • This single model can perform speech-to-text, speech-to-speech, text-to-speech, and text-to-text translations for up to 100 languages depending on the task.

The world we live in has never been more interconnected, giving people access to more multilingual content than ever before. This also makes the ability to communicate and understand information in any language increasingly important.

Today, we’re introducing SeamlessM4T, the first all-in-one multimodal and multilingual AI translation model that allows people to communicate effortlessly through speech and text across different languages. SeamlessM4T supports:

  • Speech recognition for nearly 100 languages
  • Speech-to-text translation for nearly 100 input and output languages
  • Speech-to-speech translation, supporting nearly 100 input languages and 36 (including English) output languages
  • Text-to-text translation for nearly 100 languages
  • Text-to-speech translation, supporting nearly 100 input languages and 35 (including English) output languages
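For readers who want to experiment with the released checkpoints, here is a minimal sketch using the Hugging Face transformers integration; the checkpoint name is one public option and the audio path is a placeholder.

```python
# pip install transformers sentencepiece torchaudio
import torchaudio
from transformers import AutoProcessor, SeamlessM4TModel

checkpoint = "facebook/hf-seamless-m4t-medium"
processor = AutoProcessor.from_pretrained(checkpoint)
model = SeamlessM4TModel.from_pretrained(checkpoint)

# Speech-to-speech: load a clip, resample to 16 kHz, generate Spanish audio.
wav, sr = torchaudio.load("clip.wav")  # placeholder path
wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16_000)
inputs = processor(audios=wav, sampling_rate=16_000, return_tensors="pt")
spanish_audio = model.generate(**inputs, tgt_lang="spa")[0]

# Text-to-text with the same model: switch modality and skip speech output.
inputs = processor(text="Hello, world!", src_lang="eng", return_tensors="pt")
tokens = model.generate(**inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))
```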

In keeping with our approach to open science, we’re publicly releasing SeamlessM4T under a research license to allow researchers and developers to build on this work. We’re also releasing the metadata of SeamlessAlign, the biggest open multimodal translation dataset to date, totaling 270,000 hours of mined speech and text alignments.

Building a universal language translator, like the fictional Babel Fish in The Hitchhiker’s Guide to the Galaxy, is challenging because existing speech-to-speech and speech-to-text systems only cover a small fraction of the world’s languages. But we believe the work we’re announcing today is a significant step forward in this journey. Compared to approaches using separate models, SeamlessM4T’s single system approach reduces errors and delays, increasing the efficiency and quality of the translation process. This enables people who speak different languages to communicate with each other more effectively.

SeamlessM4T builds on advancements we and others have made over the years in the quest to create a universal translator. Last year, we released No Language Left Behind (NLLB), a text-to-text machine translation model that supports 200 languages and has since been integrated into Wikipedia as one of the translation providers. We also shared a demo of our Universal Speech Translator, which was the first direct speech-to-speech translation system for Hokkien, a language without a widely used writing system. And earlier this year, we revealed Massively Multilingual Speech, which provides speech recognition, language identification, and speech synthesis technology across more than 1,100 languages.

SeamlessM4T draws on findings from all of these projects to enable a multilingual and multimodal translation experience stemming from a single model, built across a wide range of spoken data sources with state-of-the-art results.

This is only the latest step in our ongoing effort to build AI-powered technology that helps connect people across languages. In the future, we want to explore how this foundational model can enable new communication capabilities — ultimately bringing us closer to a world where everyone can be understood.

Learn more about SeamlessM4T on our AI blog.

Related News

Preserving the world’s language diversity through AI



Introducing Whisper


We’ve trained and are open-sourcing a neural net called Whisper that approaches human-level robustness and accuracy on English speech recognition.


Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.

Other existing approaches frequently use smaller, more closely paired audio-text training datasets, or use broad but unsupervised audio pretraining. Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets, we find it is much more robust and makes 50% fewer errors than those models.

About a third of Whisper’s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech to text translation and outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.
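As a quick illustration, the open-source openai-whisper package exposes this to-English translation mode through a task flag; the model size and file path below are placeholders.

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("base")
# task="translate" asks the model to output English text for
# non-English audio instead of transcribing in the source language.
result = model.transcribe("speech.mp3", task="translate")
print(result["language"], "->", result["text"])
```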

We hope Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications. Check out the paper, model card, and code to learn more details and to try out Whisper.


Title: Translatotron 2: High-Quality Direct Speech-to-Speech Translation with Voice Preservation

Abstract: We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a linguistic decoder, an acoustic synthesizer, and a single attention module that connects them. Experimental results on three datasets consistently show that Translatotron 2 outperforms the original Translatotron by a large margin on both translation quality (up to +15.5 BLEU) and speech generation quality, and approaches that of cascade systems. In addition, we propose a simple method for preserving speakers' voices from the source speech in the translated speech in a different language. Unlike existing approaches, the proposed method is able to preserve each speaker's voice on speaker turns without requiring speaker segmentation. Furthermore, compared to existing approaches, it better preserves speakers' privacy and mitigates potential misuse of voice cloning for creating spoofed audio artifacts.



High-Quality, Robust and Responsible Direct Speech-to-Speech Translation

September 23, 2021

Posted by Ye Jia and Michelle Tadmor Ramanovich, Software Engineers, Google Research

Speech-to-speech translation (S2ST) is key to breaking down language barriers between people all over the world. Automatic S2ST systems are typically composed of a cascade of speech recognition, machine translation, and speech synthesis subsystems. However, such cascade systems may suffer from longer latency, loss of information (especially paralinguistic and non-linguistic information), and compounding errors between subsystems.

In 2019, we introduced Translatotron, the first-ever model able to directly translate speech between two languages. This direct S2ST model could be trained efficiently end-to-end and also had the unique capability of retaining the source speaker’s voice (which is non-linguistic information) in the translated speech. However, despite its ability to produce natural-sounding translated speech in high fidelity, it still underperformed compared to a strong baseline cascade S2ST system (e.g., composed of a direct speech-to-text translation model [1, 2] followed by a Tacotron 2 TTS model).

In “Translatotron 2: Robust direct speech-to-speech translation”, we describe an improved version of Translatotron that significantly improves performance while also applying a new method for transferring the source speakers’ voices to the translated speech. The revised approach to voice transference is successful even when the input speech contains multiple speakers speaking in turns, while also reducing the potential for misuse and better aligning with our AI Principles. Experiments on three different corpora consistently showed that Translatotron 2 outperforms the original Translatotron by a large margin on translation quality, speech naturalness, and speech robustness.

Translatotron 2

Translatotron 2 is composed of four major components: a speech encoder, a target phoneme decoder, a target speech synthesizer, and an attention module that connects them together. The combination of the encoder, the attention module, and the decoder is similar to a typical direct speech-to-text translation (ST) model. The synthesizer is conditioned on the output from both the decoder and the attention.

There are three novel changes between Translatotron and Translatotron 2 that are key factors in improving the performance:

  • While the output from the target phoneme decoder is used only as an auxiliary loss in the original Translatotron, it is one of the inputs to the spectrogram synthesizer in Translatotron 2. This strong conditioning makes Translatotron 2 easier to train and yields better performance.
  • The spectrogram synthesizer in the original Translatotron is attention-based, similar to the Tacotron 2 TTS model, and as a consequence, it also suffers from the robustness issues exhibited by Tacotron 2. In contrast, the spectrogram synthesizer employed in Translatotron 2 is duration-based, similar to that used by Non-Attentive Tacotron, which drastically improves the robustness of the synthesized speech.
  • Both Translatotron and Translatotron 2 use an attention-based connection to the encoded source speech. However, in Translatotron 2, this attention is driven by the phoneme decoder instead of the spectrogram synthesizer. This ensures the acoustic information that the spectrogram synthesizer sees is aligned with the translated content that it’s synthesizing, which helps retain each speaker’s voice across speaker turns.
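To make the wiring concrete, here is a structural PyTorch sketch (not Google's implementation) of how the four components connect; the layer types and sizes are arbitrary placeholders, and the decoder stands in for a real autoregressive phoneme decoder.

```python
import torch
import torch.nn as nn

class Translatotron2Sketch(nn.Module):
    """Structural sketch: encoder, phoneme decoder, a single attention
    module driven by the decoder, and a synthesizer conditioned on both
    the decoder output and the attention context."""
    def __init__(self, d=256, n_phonemes=100, n_mels=80):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, d, batch_first=True)
        self.decoder = nn.LSTM(d, d, batch_first=True)
        self.attention = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.phoneme_head = nn.Linear(d, n_phonemes)
        self.synthesizer = nn.LSTM(2 * d, n_mels, batch_first=True)

    def forward(self, src_mels):                 # (batch, frames, n_mels)
        enc, _ = self.encoder(src_mels)
        dec, _ = self.decoder(enc)               # stand-in for AR decoding
        ctx, _ = self.attention(dec, enc, enc)   # attention driven by decoder
        phonemes = self.phoneme_head(dec)        # auxiliary + synthesizer input
        # Synthesizer sees decoder output *and* attention context, keeping
        # acoustics aligned with the translated content it is synthesizing.
        mels, _ = self.synthesizer(torch.cat([dec, ctx], dim=-1))
        return phonemes, mels

phonemes, mels = Translatotron2Sketch()(torch.randn(2, 120, 80))
print(phonemes.shape, mels.shape)
```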

More Powerful and Responsible Voice Retention

The original Translatotron was able to retain the source speaker's voice in the translated speech, by conditioning its decoder on a speaker embedding generated from a separately trained speaker encoder. However, this approach also enabled it to generate the translated speech in a different speaker's voice if a clip of the target speaker's recording were used as the reference audio to the speaker encoder, or if the embedding of the target speaker were directly available. While this capability was powerful, it had the potential to be misused to spoof audio with arbitrary content, which posed a concern for production deployment.

To address this, we designed Translatotron 2 to use only a single speech encoder, which is responsible for both linguistic understanding and voice capture. In this way, the trained models cannot be directed to reproduce non-source voices. This approach can also be applied to the original Translatotron.

To retain speakers' voices across translation, researchers generally prefer to train S2ST models on parallel utterances with the same speaker's voice on both sides. Such a dataset with human recordings on both sides is extremely difficult to collect, because it requires a large number of fluent bilingual speakers. To avoid this difficulty, we use a modified version of PnG NAT, a TTS model that is capable of cross-lingual voice transfer, to synthesize such training targets. Our modified PnG NAT model incorporates a separately trained speaker encoder (the same strategy used for the original Translatotron and our previous TTS work), so that it is capable of zero-shot voice transference.

Examples of direct speech-to-speech translation from Translatotron 2 in which the source speaker’s voice is retained are available with the original post.

To enable S2ST models to retain each speaker’s voice in the translated speech when the input speech contains multiple speakers speaking in turns, we propose a simple concatenation-based data augmentation technique, called ConcatAug. This method augments the training data on the fly by randomly sampling pairs of training examples and concatenating the source speech, the target speech, and the target phoneme sequences into new training examples. The resulting samples contain two speakers’ voices in both the source and the target speech, which enables the model to learn on examples with speaker turns.

Audio samples from Translatotron 2, including examples with speaker turns, are available with the original post.
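A minimal sketch of ConcatAug, assuming each training example is a dict of per-example sequences (the field names are hypothetical, and plain lists stand in for audio and phoneme arrays):

```python
import random

def concat_aug(example, dataset):
    """Pair the example with a randomly sampled second example and
    concatenate source speech, target speech, and target phonemes,
    producing a sample that contains two speakers' voices."""
    other = random.choice(dataset)
    return {key: example[key] + other[key]
            for key in ("src_speech", "tgt_speech", "tgt_phonemes")}

data = [{"src_speech": [1, 2], "tgt_speech": [3], "tgt_phonemes": ["a"]},
        {"src_speech": [7, 8], "tgt_speech": [9], "tgt_phonemes": ["b"]}]
print(concat_aug(data[0], data))  # concatenated on-the-fly training sample
```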

Performance

Translatotron 2 outperforms the original Translatotron by large margins in every aspect we measured: translation quality (measured by BLEU, where higher is better), speech naturalness (measured by MOS, higher is better), and speech robustness (measured by UDR, lower is better). It particularly excelled on the more difficult Fisher corpus. The performance of Translatotron 2 on translation quality and speech quality approaches that of a strong baseline cascade system, and is better than the cascade baseline on speech robustness.

Multilingual Speech-to-Speech Translation

Besides Spanish-to-English S2ST, we also evaluated the performance of Translatotron 2 on a multilingual set-up in which the model took speech input from four different languages and translated them into English. The language of the input speech was not provided, which forced the model to detect the language by itself.

On this task, Translatotron 2 again outperformed the original Translatotron by a large margin. Although the results are not directly comparable between S2ST and ST, the close numbers suggest that the translation quality of Translatotron 2 is comparable to that of a baseline speech-to-text translation model. These results indicate that Translatotron 2 is also highly effective on multilingual S2ST.

Acknowledgments

The direct contributors to this work include Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, and Roi Pomerantz. We also thank Chung-Cheng Chiu, Quan Wang, Heiga Zen, Ron J. Weiss, Wolfgang Macherey, Yu Zhang, Yonghui Wu, Hadar Shemtov, Ruoming Pang, Nadav Bar, Hen Fitoussi, Benny Schlesinger, and Michael Hassid for helpful discussions and support.

Orion Unveils AI-Powered Speech-to-Speech Translation for the Frontline Workforce

First-of-its-kind, bi-directional speech-to-speech translation is now available

SAN FRANCISCO, May 24, 2022 — Orion Labs, Inc., the world’s first Voice AI Communications Platform for frontline enterprises, today released its Speech-to-Speech Translation (S2ST), available via its Push-to-Talk (PTT) 2.0 Platform. Orion’s S2ST is a first-of-its-kind voice-AI solution that layers bi-directional translation into real-time, multipoint communications, unlocking entirely new staffing and collaboration possibilities for the deskless workforce and edge operations.

Orion’s Speech-to-Speech Translation uses Orion’s Frontline AI to edge-deliver speech and text translation instantly to deskless workers at their point of work and in the flow of work. Orion’s Frontline AI is powered by proprietary Automated Speech Recognition (ASR) that serves edge operations of any size, supporting each individual user’s language preferences. Orion PTT Groups support up to 60 language preferences, enabling all users to set the language they hear for their real-time PTT group or 1-to-1 direct messages, regardless of the language of the original speaker.

“Orion’s Speech-to-Speech Translation is a revolutionary, voice-driven solution that enables any team member, regardless of their native language, to collaborate in Orion talk groups, bringing translation into the flow of real-time team communication,” said Greg Taylor, Orion CEO. “Our approach is a game-changer for frontline workforce productivity and safety. I am so proud of Orion’s engineering innovation to provide the first-ever solution for global and mixed-language operations leveraging AI.”

Frontline operations across transportation, logistics, hospitality, retail, healthcare, and more all benefit from removing language barriers with real-time translation. Frontline enterprises expand labor pools, improve team collaboration, and reduce miscommunication. Managers also ensure new hires are effective on Day 1 by simplifying the onboarding of multilingual staff.

Orion’s Speech-to-Speech Translation is just one of Orion’s many Voice AI Workflows . Additional offerings include Voice Queries like the Inventory Query & Price Check Workflow or Voice Ticketing Workflow, Voice Checklists, and Voice Commands & Workflows like the Emergency Escalation Workflow, Shift Check-in Workflow, and Register Backup Workflow. Best of all, Orion Voice AI Workflows can seamlessly operate on other frontline collaboration apps including Microsoft Teams.

About Orion Labs, Inc.

Orion is the only AI Workflow Platform for frontline workforce productivity, connecting people to people and people to enterprise systems. Orion’s proprietary Voice AI Workflows are systems of action for the frontline worker with voice-driven process automation and integrations to leading operational systems of record. Orion’s proprietary Automated Speech Recognition (ASR) drives innovative operational automations like Speech-to-Speech Workflows, Voice Queries, Checklist Workflows, Emergency Response Workflows, and more.

Orion was recently named a Top AI Solution Provider 2022 by CIO Applications, a Most Promising Unified Communications Solutions for 2022 by CIO Review, a Top 10 Intelligent Transport Systems Solution Provider for 2021 by Logistics and Transportation Review, a Top 10 Industrial IoT Solution Provider 2020 by Manufacturing Technology Insights, and an IDC Innovator. Orion holds 62 patents that support its award-winning solutions. For more information, visit www.orionlabs.io.

Stay in touch with Orion Labs | LinkedIn | Twitter | Blog

Media contact:

Jacqueline Wasem Orion Labs, Inc. 415-800-5467 [email protected]

Speech to speech translation: Breaking language barriers in real-time


If you want to reach a wider audience, speech to speech translation is a great way to do it. Here's everything you need to know.

Language barriers have been a long-standing issue in communication across different cultures and regions. However, the advent of advanced translation technology, particularly speech to speech translation, is progressively minimizing these barriers. This article will delve into what speech-to-speech translation is, how it works, its advantages, and some of the top tools available in this field.

What is speech to speech translation?

Speech to speech translation (S2ST) is an advanced form of language translation that converts spoken language from one language to another in real time. Unlike traditional translation methods that operate on text, S2ST handles spoken language, including unwritten languages, making it a valuable tool for diverse, multilingual communication.

How speech to speech translation tools work

Speech to speech translation tools rely heavily on machine learning and artificial intelligence technologies, specifically natural language processing (NLP), automatic speech recognition (ASR), and text to speech (TTS) synthesis.

Here is a simplified breakdown of the process:

  • Speech recognition: The S2ST system starts by encoding the input speech using automatic speech recognition. This phase transforms spoken words into a written format.
  • Translation: The transcribed text is then processed using machine translation. It gets converted from the source language (say, English or Mandarin) into the target language (like Spanish or Hokkien).
  • Speech synthesis: Finally, the translated text is transformed back into spoken language using TTS synthesis. This results in a playback of the translated speech in the target language.
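A compact sketch of this three-stage cascade, with asr, mt, and tts as stand-ins for any concrete recognizer, translation model, and synthesizer:

```python
def cascade_s2st(audio, asr, mt, tts):
    """Cascade S2ST: speech recognition -> machine translation -> TTS."""
    text = asr(audio)          # 1. transcribe the spoken input
    translated = mt(text)      # 2. translate the transcript
    return tts(translated)     # 3. synthesize target-language speech

# Toy demo with dictionary-backed stand-ins for the three components.
asr = lambda wav: "hello world"
mt = lambda s: {"hello world": "hola mundo"}[s]
tts = lambda s: f"<synthesized audio: {s}>"
print(cascade_s2st(b"dummy-audio", asr, mt, tts))
```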

More advanced S2ST systems, known as direct speech to speech translation systems, skip the transcription phase, converting speech from one language to another without creating a written intermediary. These systems are more complex, as they must learn representations (embeddings) of speech directly from large datasets of audio in different languages.

There are two more important terms to know when it comes to speech to speech translation: speech to speech translation models and decoders:

Speech to speech translation models

A speech to speech translation model is an advanced type of translation system that uses machine learning and artificial intelligence to convert spoken language from one language to another in real time.

This technology typically comprises several components:

  • Automatic speech recognition (ASR): This component takes the input speech, recognizes it, and converts it into text form. It’s a complex process that involves identifying the spoken language, understanding the speech in the context of that language, and transforming spoken words into written words.
  • Machine translation (MT): The transcribed text is then translated from the source language into the target language using machine translation algorithms. These algorithms leverage vast datasets and sophisticated language models to ensure accuracy and fluency.
  • Text to speech synthesis (TTS): The translated text is then converted back into speech in the target language using TTS systems. These systems generate spoken language that sounds natural, maintaining the correct pronunciation and intonation.

The most advanced speech to speech translation models skip the transcription step and translate the spoken words from one language directly to another, making the process more efficient and accurate. These direct translation models are typically trained on large datasets that include a broad variety of languages and accents, allowing them to perform well in real-world situations.

Decoders

In the context of machine learning and natural language processing, a decoder is the part of a model that translates the condensed understanding of the input data into the target or output data.

Often, the term decoder is used within the architecture of an encoder-decoder model. The encoder processes the input data and compresses it into a context vector, also known as a hidden state. This hidden state is then passed to the decoder, which generates the output data.

In the context of speech-to-speech or speech to text translation, the encoder might convert the input speech into an intermediate representation, and the decoder would then generate the translated speech or text from that representation.
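A minimal PyTorch sketch of this encoder-decoder pattern, in which the encoder's final hidden state serves as the context vector that seeds the decoder:

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder compresses the input into a
    hidden state (the context vector); the decoder generates the output
    sequence conditioned on it."""
    def __init__(self, vocab=1000, d=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.emb(src_ids))       # context vector
        dec_out, _ = self.decoder(self.emb(tgt_ids), context)
        return self.out(dec_out)                           # next-token logits

logits = TinySeq2Seq()(torch.randint(0, 1000, (2, 5)),
                       torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1000])
```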

In digital communications, a decoder is a device or software that converts an encoded or compressed digital signal or data back into its original format. For instance, a video decoder takes compressed video data and converts it into a viewable format.

Advantages of speech to speech translation

So, why would you want speech to speech translation for your audio or video content? Here are the top reasons:

  • Real-time communication: One of the significant advantages of S2ST is real-time translation, which facilitates immediate communication across different languages. This is particularly valuable in real-world situations like business meetings, conferences, or travel.
  • Breaking language barriers: With the ability to translate multiple languages, including those that are traditionally unwritten, S2ST breaks down barriers, enabling more effective communication.
  • Accessibility: S2ST can also provide accessibility solutions for those with hearing or speech impairments by transcribing and translating spoken language.
  • Ease of use: Many S2ST tools are designed to be user-friendly, with interfaces that are easy to navigate, even for beginners.

Top speech to speech translation tools

Speech to speech translation is a remarkable technological breakthrough, eliminating language barriers and fostering global communication like never before. As AI and machine learning technologies continue to advance, we can expect even more efficient and accurate tools in the future.

Several tech giants and research labs are at the forefront of S2ST technology, including Google, Microsoft, and Meta (formerly Facebook).

Google Translate

This tool offers a conversation mode for speech to speech translation in real-time. It supports a variety of languages and dialects and is widely used due to its high-quality translation and user-friendly interface.

Microsoft Translator

This tool not only supports text translation but also allows speech translation. Its API can be integrated into other services to provide real-time translation.

Meta’s AI research

Meta’s research division has made significant strides in S2ST technology. They’ve been open-sourcing their models and tools, allowing others to build upon their work.

SpeechMatrix

SpeechMatrix is a large-scale corpus of mined multilingual speech-to-speech translation data released by Meta AI. It provides aligned speech across many language pairs and can be used to train both speech to text and speech to speech translation models.

Speechify AI Dubbing

Speechify AI Dubbing is completely transforming how direct speech to speech translation is done with AI dubbing. Powered by sophisticated AI voice models, this tool can provide instant language translations at the click of a button.

Get fast and accurate speech to speech translation with Speechify AI Dubbing

If you need to translate your audio or videos quickly and accurately, we recommend Speechify AI Dubbing. With it, you can translate audio content into hundreds of different languages in seconds. The AI voices are incredibly natural-sounding, and they can even be customized to meet your needs or artistic vision.

Reach a wider audience with the help of Speechify AI Dubbing.


Cliff Weitzman

Cliff Weitzman is a dyslexia advocate and the CEO and founder of Speechify, the #1 text-to-speech app in the world, totaling over 100,000 5-star reviews and ranking first place in the App Store for the News & Magazines category. In 2017, Weitzman was named to the Forbes 30 under 30 list for his work making the internet more accessible to people with learning disabilities. Cliff Weitzman has been featured in EdSurge, Inc., PC Mag, Entrepreneur, Mashable, among other leading outlets.



Speech-to-speech translation


Speech-to-speech translation (STST or S2ST) is a relatively new spoken language processing task. It involves translating speech from one language into speech in a different language:

Diagram of speech to speech translation

STST can be viewed as an extension of the traditional machine translation (MT) task: instead of translating text from one language into another, we translate speech from one language into another. STST holds applications in the field of multilingual communication, enabling speakers of different languages to communicate with one another through the medium of speech.

Suppose you want to communicate with another individual across a language barrier. Rather than writing down the information you want to convey and translating it into text in the target language, you can speak it directly and have a STST system convert your speech into the target language. The recipient can then respond by speaking back into the STST system, and you can listen to their response. This is a more natural way of communicating compared to text-based machine translation.

In this chapter, we’ll explore a cascaded approach to STST, piecing together the knowledge you’ve acquired in Units 5 and 6 of the course. We’ll use a speech translation (ST) system to transcribe the source speech into text in the target language, then text-to-speech (TTS) to generate speech in the target language from the translated text:

Diagram of cascaded speech to speech translation

We could also have used a three-stage approach, where first we use an automatic speech recognition (ASR) system to transcribe the source speech into text in the same language, then machine translation to translate the transcribed text into the target language, and finally text-to-speech to generate speech in the target language. However, adding more components to the pipeline lends itself to error propagation, where the errors introduced in one system are compounded as they flow through the remaining systems, and also increases latency, since inference has to be conducted for more models.

While this cascaded approach to STST is pretty straightforward, it results in very effective STST systems. The three-stage cascaded system of ASR + MT + TTS was previously used to power many commercial STST products, including Google Translate. It’s also a very data- and compute-efficient way of developing a STST system, since existing speech recognition and text-to-speech systems can be coupled together to yield a new STST model without any additional training.

In the remainder of this Unit, we’ll focus on creating a STST system that translates speech from any language X to speech in English. The methods covered can be extended to STST systems that translate from any language X to any language Y, but we leave this as an extension for the reader and provide pointers where applicable. We further divide the task of STST into its two constituent components: ST and TTS. We’ll finish by piecing them together to build a Gradio demo to showcase our system.

Speech translation

We’ll use the Whisper model for our speech translation system, since it’s capable of translating from over 96 languages to English. Specifically, we’ll load the Whisper Base checkpoint, which clocks in at 74M parameters. It’s by no means the most performant Whisper model, with the largest Whisper checkpoint being over 20x larger, but since we’re concatenating two auto-regressive systems together (ST + TTS), we want to ensure each model can generate relatively quickly so that we get reasonable inference speed:
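
A sketch of loading Whisper Base through the 🤗 Transformers pipeline, placing it on a GPU if one is available (checkpoint name per the Whisper release; adjust the device logic to your setup):

import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition", model="openai/whisper-base", device=device
)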

Great! To test our STST system, we’ll load an audio sample in a non-English language. Let’s load the first example of the Italian (it) split of the VoxPopuli dataset:
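
Streaming mode lets us fetch just the first example without downloading the whole corpus (the validation split is an assumption you can swap for another):

from datasets import load_dataset

dataset = load_dataset("facebook/voxpopuli", "it", split="validation", streaming=True)
sample = next(iter(dataset))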

To listen to this sample, we can either play it using the dataset viewer on the Hub: facebook/voxpopuli/viewer

Or playback using the ipynb audio feature:
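
For example, with IPython’s display helpers (the field names follow the 🤗 Datasets audio convention):

from IPython.display import Audio

Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])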

Now let’s define a function that takes this audio input and returns the translated text. You’ll remember that we have to pass the generation keyword argument "task", setting it to "translate" to ensure that Whisper performs speech translation and not speech recognition:
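
A sketch of such a function (the max_new_tokens cap is an assumed generation limit you can tune):

def translate(audio):
    outputs = pipe(audio, max_new_tokens=256, generate_kwargs={"task": "translate"})
    return outputs["text"]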

Whisper can also be ‘tricked’ into translating from speech in any language X to any language Y. Simply set the task to "transcribe" and the "language" to your target language in the generation keyword arguments, e.g. for Spanish, one would set:

generate_kwargs={"task": "transcribe", "language": "es"}

Great! Let’s quickly check that we get a sensible result from the model:
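
For example (we pass a copy so the pipeline’s pre-processing doesn’t mutate the cached sample):

translate(sample["audio"].copy())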

Alright! If we compare this to the source text:
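
VoxPopuli ships a reference transcription with each example (field name per the dataset schema):

print(sample["raw_text"])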

We see that the translation more or less lines up (you can double-check this using Google Translate), barring a few extra words at the start of the transcription, where the speaker was finishing off their previous sentence.

With that, we’ve completed the first half of our cascaded STST pipeline, putting into practice the skills we gained in Unit 5 when we learnt how to use the Whisper model for speech recognition and translation. If you want a refresher on any of the steps we covered, have a read through the section on Pre-trained models for ASR from Unit 5.

Text-to-speech

The second half of our cascaded STST system involves mapping from English text to English speech. For this, we’ll use the pre-trained SpeechT5 TTS model for English TTS. 🤗 Transformers currently doesn’t have a TTS pipeline, so we’ll have to use the model directly ourselves. This is no biggie; you’re all experts on using the model for inference following Unit 6!

First, let’s load the SpeechT5 processor, model and vocoder from the pre-trained checkpoint:
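
A sketch of loading the three components (checkpoint names per the SpeechT5 release):

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")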

As with the Whisper model, we’ll place the SpeechT5 model and vocoder on our GPU accelerator device if we have one:
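
Re-using the device we resolved for Whisper above:

model.to(device)
vocoder.to(device)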

Great! Let’s load up the speaker embeddings:
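
SpeechT5 conditions its output voice on a speaker embedding; a common choice is an x-vector from the CMU Arctic set (the index below picks one US English speaker and is an assumption you can vary):

import torch
from datasets import load_dataset

embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)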

We can now write a function that takes a text prompt as input, and generates the corresponding speech. We’ll first pre-process the text input using the SpeechT5 processor, tokenizing the text to get our input ids. We’ll then pass the input ids and speaker embeddings to the SpeechT5 model, placing each on the accelerator device if available. Finally, we’ll return the generated speech, bringing it back to the CPU so that we can play it back in our ipynb notebook:
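
A sketch of the synthesis function described above:

def synthesise(text):
    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(
        inputs["input_ids"].to(device), speaker_embeddings.to(device), vocoder=vocoder
    )
    # Bring the waveform back to the CPU for playback in the notebook
    return speech.cpu()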

Let’s check it works with a dummy text input:
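
For example (SpeechT5 generates 16 kHz audio):

speech = synthesise("Hey there! This is a test!")
Audio(speech, rate=16000)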

Sounds good! Now for the exciting part - piecing it all together.

Creating a STST demo

Before we create a Gradio demo to showcase our STST system, let’s first do a quick sanity check to make sure we can concatenate the two models, putting an audio sample in and getting an audio sample out. We’ll do this by concatenating the two functions we defined in the previous two sub-sections, such that we input the source audio and retrieve the translated text, then synthesise the translated text to get the translated speech. Finally, we’ll convert the synthesised speech to an int16 array, which is the output audio file format expected by Gradio. To do this, we first have to normalise the audio array by the dynamic range of the target dtype ( int16 ), and then convert from the default NumPy dtype ( float64 ) to the target dtype ( int16 ):
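
Putting this into code (the 16 kHz output rate matches the SpeechT5 vocoder):

import numpy as np

target_dtype = np.int16
max_range = np.iinfo(target_dtype).max

def speech_to_speech_translation(audio):
    translated_text = translate(audio)
    synthesised_speech = synthesise(translated_text)
    # Normalise to the int16 dynamic range, then cast from float to int16
    synthesised_speech = (synthesised_speech.numpy() * max_range).astype(np.int16)
    return 16000, synthesised_speech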

Let’s check this concatenated function gives the expected result:
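
Running it end-to-end on our Italian sample:

sampling_rate, synthesised_speech = speech_to_speech_translation(sample["audio"])
Audio(synthesised_speech, rate=sampling_rate)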

Perfect! Now we’ll wrap this up into a nice Gradio demo so that we can record our source speech using a microphone input or file input and playback the system’s prediction:
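
A sketch of the demo (this follows the Gradio 3.x Audio API with a source argument; newer Gradio versions use sources=[...] instead):

import gradio as gr

demo = gr.Blocks()

mic_translate = gr.Interface(
    fn=speech_to_speech_translation,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs=gr.Audio(label="Generated Speech", type="numpy"),
)

file_translate = gr.Interface(
    fn=speech_to_speech_translation,
    inputs=gr.Audio(source="upload", type="filepath"),
    outputs=gr.Audio(label="Generated Speech", type="numpy"),
)

with demo:
    gr.TabbedInterface([mic_translate, file_translate], ["Microphone", "Audio File"])

demo.launch(debug=True)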

This will launch a Gradio demo similar to the one running on the Hugging Face Space:

You can duplicate this demo and adapt it to use a different Whisper checkpoint, a different TTS checkpoint, or relax the constraint of outputting English speech and follow the tips provided for translating into a language of your choice!

Going forwards

While the cascaded system is a compute- and data-efficient way of building a STST system, it suffers from the issues of error propagation and additive latency described above. Recent works have explored a direct approach to STST, one that does not predict an intermediate text output and instead maps directly from source speech to target speech. These systems are also capable of retaining the speaking characteristics of the source speaker in the target speech (such as prosody, pitch, and intonation). If you’re interested in finding out more about these systems, check out the resources listed in the section on supplemental reading.

Generative Speech to Speech Translation

Automated AI dubbing when quality matters

Transcription: AI generated with speaker labels and detailed timing.

Translation: AI generated from proofed transcript in text, subtitles, and JSON.

Dubbing: Match original speaker or native speaker voice.

Highest quality transcription. Contextual translation.

Natural sounding dubbed output. Indistinguishable from human speech.

Dozens of language pairs and multiple dialects.

Full control of an end-to-end platform with professional support.

Speechlab with your workflow

Designed to scale with your enterprise systems and processes

Share and manage access to content and collaborate with team members.

White glove service using a network of vetted specialists to review translations and dub.

Import from and export to content management and localization services.

We work with

Expand reach for podcasts, documentaries, news, and scripted content.

Update pre-sales, marketing, customer onboarding, and compliance videos.

Expand access to educational videos by making them consumable in multiple languages.

Get started today for free


Generated Audio

You can listen to the generated audio here.

We can conclude that speech-to-speech translation involves several steps, but performing them sequentially makes the process straightforward. The article introduces a comprehensive pipeline function, “speech_to_speech_pipeline,” which orchestrates the entire translation process, making it accessible for users interested in English-to-Hindi speech translation. Speech-to-speech translation is very important for various real-time applications.



LocalVocal: Local Live Captions & Translation On-the-Go v0.2.1

  • Author royshilkrot
  • Creation date Aug 14, 2023


  • Transcribe audio to text in real time in 100 languages
  • Translate immediately to/from ~100 languages
  • Display captions on screen using text sources
  • Translate captions in real time to any language - see https://youtu.be/Q34LQsx-nlg or https://youtu.be/ryWBIEmVka4
  • Remove unwanted words from the transcription
  • Summarize the text and show "highlights" on screen
  • Detect key moments in the stream and allow triggering events (like replay)
  • Detect emotions/sentiment and allow triggering events (like changing the scene or colors etc.)
  • Background Removal removes background from webcam without a green screen.
  • Detect will detect and track >80 types of objects in real-time inside OBS
  • URL/API Source that allows fetching live data from an API and displaying it in OBS.

More resources from royshilkrot

URL/API Source: Live Data, Media and AI on OBS Made Simple


Latest updates

  • v0.2.1: translation built-in
  • v0.2.0: CUDA on Windows, Mac Apple ARM optimization
  • v0.1.1: new Whisper, variable buffer, bugfix for 7.1 audio



AI Translation

Boostlingo AI Pro converts spoken words to translated captions with AI.

AI Captions

Caption and transcribe meetings instantly, capturing and annotating every word, even if spoken in different languages.

AI Translations

Translate spoken content from one language to another instantly.

Prefer to hear the translations instead of reading them? Then AI Speech is a feature you’d like to turn on.

AI Assistance

Get AI assistance with various tasks. From meeting note summaries to an AI-generated glossary, AI is integrated at every step.

Begin captioning and translating today

Boostlingo AI Pro: How it works

Boostlingo AI Pro offers high-quality multilingual captions via AI speech recognition and AI translations. Effortlessly convert spoken words into written text with live captioning during meetings or conferences, enabling seamless communication across languages. Check out how it works!

AI CAPTIONING AND TRANSLATION

How to set up a session.

  • Set up your session by defining which languages should be supported for captioning & translating.
  • Share session details with other participants.
  • Click “start session” & captioning will begin.
  • When the meeting is over, a transcript with all meeting notes will be available to review & download.

AI Speech Translation

Toggle the audio button for AI Speech

Listen to AI translated captions with AI speech by clicking on the “audio” button during a session.

There are different languages available for:

  • Spoken Languages (input)
  • Translated Captions (output)
  • AI Speech (output)

The list of languages keeps evolving; the full list of languages is available in the app.

Integrations

Integrate with any conferencing app.

AI Pro integrates natively with third-party software like Zoom, Google Chrome, and Microsoft Teams to provide precise translated captions.

Boostlingo AI Pro is compatible with these apps:

The number of integrations will continue to increase. For a wide range of integration options and instructions on how to use them, click here.

Boostlingo AI Pro: Overview Video

Boostlingo AI Pro Pricing

Try AI Pro Free

  • AI Integrations Included
  • Captioning Included*
  • Translation Included**
  • 30 Minutes Max Length
  • AI Integrations included
  • Captioning included*
  • Translation included**
  • 5-days retention
  • 20 participants max
  • 90 minutes max length/session
  • 30-Days retention
  • 50 Participants max
  • 240 minutes max length/session
  • Unlimited retention
  • Unlimited participants
  • Unlimited length

*1 Minute per captioned source language, **1 Minute charge per translated target language 

State of AI in Language Industry

Frequently asked questions

Yes! Boostlingo AI Pro captures speech to text for multiple languages at once with over 130 languages in our library.

Boostlingo AI Pro supports transcriptions as text, in .srt and .vtt formats.

You can currently access a list of languages supported for conversation and those available for caption translation on this page .


5 Best AI Voice Generators: AI Text-To-Speech in 2024

In search of the best AI voice generator? Discover the leading AI text-to-speech platforms available in 2024.


eWEEK content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More.

An AI voice generator is a specialized type of generative AI technology that enables users to create new voices or manipulate existing vocal audio with no audio engineering expertise. Instead, they simply insert text, or some other media, with requested parameters to direct the vocal generator to create a relevant voice or voice product.

In this guide, we’ll take a closer look at the five best AI voice generators available today, but first, here’s a glance at where each of these tools differentiates itself the most:

  • Murf : Best for Multichannel Content Creation
  • PlayHT : Best for AI Voice Agents
  • LOVO : Best Combined AI Voice and Video Platform
  • ElevenLabs : Best for Enterprise AI Scalability
  • Speechify : Best for AI Narration


Top AI Voice Generator Software Comparison

In addition to text-to-speech and voice cloning capabilities, we primarily compare these tools across the key criteria for generative AI voice generation software detailed in the evaluation section below.



Murf: Best for Multichannel Content Creation

Murf is one of the top generative AI voice tools available to both casual and business users, providing them with an accessible user interface and a range of scalable voice generation and editing features. Its primary focus areas include text-to-speech content generation, no-code voice editing, AI-powered translation, AI voice deployment to apps via API, voice cloning, and an AI dubbing feature that is currently in beta for more than 20 languages.

Many business users select this tool for its wide range of collaborative features, its enterprise-level security and compliance expertise and features, its vocal quality and variety, and its comprehensive support for various enterprise use cases.

In addition to its easy-to-use enterprise integrations with various creative and product development tools, Murf also offers free creative guides and resources on the following topics: e-learning, explainer videos, YouTube videos, Spotify ads, corporate videos, advertisements, audiobooks, podcasts, video games, training videos, presentations, product demos, IVR voices, animation character voices, and documentaries.

Pricing

  • Creator Lite: $23 per month billed annually, or $29 billed monthly for one editor to access up to five projects and 24 hours per year of voice generation.
  • Creator Plus: $39 per month billed annually, or $49 billed monthly for one editor to access up to 30 projects and four hours per month of voice generation (up to 48 hours per year).
  • Business Lite: $79 per month billed annually, or $99 billed monthly for up to three editors and five viewers to access up to 50 projects and eight hours per month of voice generation (up to 96 hours per year). Free trial access to this plan’s features is available for one editor, up to two projects, and up to 10 minutes of voice generation.
  • Business Plus: $159 per month billed annually, or $199 billed monthly for up to three editors and five viewers to access up to 200 projects and 20 hours per month of voice generation (up to 240 hours per year). Free trial access to this plan’s features is available for one editor, up to two projects, and up to 10 minutes of voice generation.
  • Enterprise: Pricing information available upon request. This plan is designed for more than five editors and unlimited viewers to create custom projects with unlimited voice generation access.
  • Murf API: Pricing information available upon request.
  • AI Translation: Add-on for Enterprise and Business plan users. Pricing information available upon request.
Key Features

  • Integrations: Integrations are available for Canva, Google Slides, Adobe Audition, Adobe Captivate and Captivate Classic, and HTML Embed Code. Users can also download Murf Voices Installer to directly incorporate Murf voices into Windows apps.
  • Vocal library: More than 200 voices, styles, and tonalities in more than 20 languages are available to users.
  • Team collaboration and project organization: Folders, sub-folders, shareable links, and private folders and projects all support controlled collaboration.
  • Enterprise compliance: Depending on the plan selected, users can benefit from GDPR, SOC2, and EU compliance support as well as SSO, access logs, custom contracts, and security reviews.
  • Visual voice editing: Easy-to-use buttons and clickability to adjust pitch, emphasis, speed, interjections, pauses, pronunciation, and more.

To see a list of the leading generative AI apps, read our guide: Top 20 Generative AI Tools and Apps 2024


PlayHT: Best for AI Voice Agents

PlayHT has been a favorite artificial intelligence voice generation tool for a few years now, extending to users a highly accessible and scalable tool for multilingual AI voice generation. Compared to other AI voice generation tools, PlayHT first and foremost sets itself apart with its range of voice and language options: All plans, including the free plan, can access 907 voices and 142 different languages and accents. The tool also comes with limited instant voice clones and will soon offer high-fidelity clones to enterprise users.

Beyond its more conventional AI voice features and tools, PlayHT has set its sights on a very specific enterprise use case: AI voice agents. With its new feature set, Play Agents, users can create their own AI voice agent avatars with specific parameters and prompts about how they should greet and respond to user interactions. The tool also comes with several prebuilt agent templates, API-driven agent training and tracking for developers, and a simple table for tracking agent conversation history.

Pricing for PlayHT depends on whether you select PlayHT Studio, AI voice agents, or the API subscription plans:

PlayHT Studio

  • Free Plan: $0 for non-commercial access to all voices and languages, one instant voice clone, and up to 12,500 characters.
  • Creator: $31.20 per month billed annually, or $39 billed monthly.
  • Unlimited: Typically $99 per month, billed annually or monthly. A special discount is currently running for the annual plan for $29 per month.
  • Enterprise: Custom pricing.

AI Voice Agents

  • Free Plan: $0 for non-commercial access to 30 minutes of agent content creation.
  • Pro: $20 billed monthly plus $0.05 per each minute used over 400 minutes.
  • Business: $99 billed monthly plus $0.05 per each minute used over 2,000 minutes.
  • Growth: $499 billed monthly plus $0.05 per each minute used over 10,000 minutes.
  • Enterprise: Custom pricing for unlimited limits and other advanced features.
API

  • Hacker: $5 billed monthly plus $0.25 per every additional 1,000 characters over 25,000 characters per month.
  • Startup: $299 billed monthly plus $0.20 per every additional 1,000 characters over 1.5 million characters per month.
  • Growth: $999 billed monthly plus $0.10 per every additional 1,000 characters over 10 million characters per month.
  • Business: Custom pricing for large volume discounts and custom rate limits.
Key Features

  • Multilingual voice library: PlayHT’s voice library includes 907 text-to-speech voices and 142 languages and accents.
  • Pronunciation library: This feature allows users to define specific pronunciations and save these rules for future projects.
  • Multi-voice content creation: A single audio file and project can include multiple voices, which is useful for AI conversational projects .
  • Play Agents feature: Custom AI voice agents and preconfigured agent templates for healthcare, hotels, restaurants, front desks, and e-commerce can be used to create more intelligent customer service AI chatbots/agents.
  • Real-time streaming API: Character-based pricing for API access, which scales up to include dedicated enterprise clusters and other advanced features.

For more information about generative AI providers, read our in-depth guide: Generative AI Companies: Top 20 Leaders


LOVO: Best Combined AI Voice and Video Platform

LOVO offers its users a suite of useful AI features that not only support AI voice generation and voiceover initiatives but also other creative tasks related to video and image creation . LOVO’s flagship platform, Genny, is a user-friendly tool that uses its own generative AI technologies to enable video editing, subtitle generation, voice generation, and voice cloning tasks. With the help of ChatGPT and Stable Diffusion models , users can also generate shortform and longform text and AI art projects at no additional cost and with no third-party tooling requirements.

Users most appreciate that this tool supports multiple languages and unique vocal tones, is easy to use, and offers high-quality voice outputs compared to many competitors. Many users also appreciate that they can purchase affordable, lifetime deals through AppSumo.

Pricing for LOVO depends on whether you select an All in One or Subtitles subscription plan:

All in One

  • Basic: $24 per month billed annually, or $29 per user billed monthly. Limited to one user per plan subscription.
  • Pro: $48 per user per month, billed annually, with a 50% discount for the first year, or $48 per user billed monthly. A 14-day free trial is also available for this plan’s features.
  • Pro +: $149 per user per month, billed annually, with a 50% discount for the first year, or $149 per user billed monthly.
  • Enterprise: Pricing information available upon request.
Subtitles

  • Free: $0 for limited features.
  • Subtitles: $12 per user per month, billed annually, or $18 per user billed monthly.
Key Features

  • Genny: All-in-one video creation platform with voice generation, voice cloning, subtitle generation, art generation, text generation, and video editing capabilities.
  • Multilingual voice library: The text-to-speech library includes more than 500 voices and more than 100 languages. LOVO also caters voices to 30 different emotions.
  • Built-in voice recorder: For voice cloning, users can record their voices directly within the LOVO tool. They also have the option to upload a prerecorded clip, if preferred.
  • Simple Mode: For shorter voice generation and voiceover projects (between 2,000 and 5,000 characters), users can work with the lightweight, faster Simple Mode format.
  • API access: LOVO voice application development features are available in all plans.

For an in-depth comparison of two leading AI art generators, see our guide: Midjourney vs. Dall-E: Best AI Image Generator 2024


ElevenLabs: Best for Enterprise AI Scalability

ElevenLabs is an artificial intelligence research firm that has developed comprehensive AI voice technologies for text to speech, speech to speech, dubbing, voice cloning, and multilingual content generation. Users frequently compliment ElevenLabs on the quality of the voice products it produces, noting that the vocal tone and overall quality feel more realistic than what most other competitors are producing.

ElevenLabs is one of the most business-friendly AI voice tools on the market today, offering advanced features at different price points. Its free plan is fairly comprehensive, including access to 29 languages and thousands of voices, automated dubbing, custom voices, and API. Six different pricing tiers are available, with the top tier offering unique enterprise draws like custom terms and SSO, unlimited concurrency, and volume-based discounts.

Additionally, ElevenLabs offers a grant program designed for the unique needs of business startups. Eligible startup applicants who can convince the vendor of their long-term strategy and growth potential will be given three months of free access with 11 million characters per month and enterprise features.

Pricing

  • Free: $0 for 10,000 monthly characters, or approximately 10 minutes of audio per month.
  • Starter: $50 per year, billed annually, with the first two months free, or $5 billed monthly with 80% off the first month.
  • Creator: $220 per year, billed annually, with the first two months free, or $22 billed monthly with 50% off the first month.
  • Pro: $990 per year, billed annually, with the first two months free, or $99 billed monthly.
  • Scale: $3,300 per year, billed annually, with the first two months free, or $330 billed monthly.
  • Custom Enterprise Plans: Pricing information available upon request.
Key Features

  • Precision voice tuning: With this drag-and-drop editing feature, users can adjust vocal stability and variability, vocal clarity, and style exaggerations on a scale.
  • Multilingual voice library: More than 1,000 voices across 29 different languages are available for text-to-speech content generation.
  • Speech to speech: Users can upload an audio file or record their voice for voice changing, custom voices, and voice cloning capabilities.
  • Dubbing Studio: Video translation and dubbing available in 29 different languages. The Studio interface allows users to granularly adjust specs.
  • AI Speech Classifier: This unique feature allows users to upload an audio file so the vendor can evaluate if the clip was created by ElevenLabs AI.


Speechify: Best for AI Narration

Speechify is an AI voice solution that specializes in text-to-speech technology for mobile platforms and more casual use cases, like audiobook narration. With the Speechify AI platform, users can select from a wide variety of AI voices, including voices that mimic celebrities like Gwyneth Paltrow and Snoop Dogg. All of this is available in various mobile and online locations, including through browser extensions that are accessible and favorably reviewed by users.

While Speechify’s core audience is recreational users, students, and other more casual users who want a convenient solution for reading off text in various formats, the platform offers some key enterprise AI usability features through its Voice Over Studio for Business. With this suite of Speechify solutions, business users can benefit from unlimited video and voice downloads, commercial rights, collaborative project management features, dozens of voices, and enterprise security and compliance features.

Pricing for Speechify all depends on how you want to use the tool. Here are some of the options you have as a Speechify user:

  • Speechify Limited (text to speech): $0 for 10 standard reading voices and limited text-to-speech features.
  • Speechify Premium: $139 per year for advanced text-to-speech features and capabilities.
  • Speechify Studio Free: $0 for access to basic AI voice and video features with no downloads.
  • Speechify Studio Basic: $24 per user per month, billed annually, or $69 per user billed monthly.
  • Speechify Studio Professional: $32.08 per user per month, billed annually, or $99 per user billed monthly.
  • Speechify Studio Enterprise: Pricing information available upon request.
  • Text to Speech API: Users can join the waitlist.
  • Speechify Audiobooks: $9.99 per month, or $120 billed annually.

Custom pricing and discounts may also be available for business teams and educational organizations.

Key Features

  • Browser extensions and app: Users can access Speechify through the Chrome extension, Edge Add-on, Android, iOS, and PDF readers like Adobe Acrobat.
  • Multilingual voice library: More than 100 voices in over 40 languages are available for enterprise users.
  • AI dubbing: Dubbing is available in multiple languages, with the ability to adjust voice, tone, and speed.
  • AI video generator: Users can combine Speechify’s AI voiceovers with avatars to create AI videos.
  • Various upload and download formats: Content can be uploaded in .txt, .docx, .srt, and YouTube URL formats; Speechify projects can be downloaded as video, audio, or text.

Key Features of AI Voice Generator Software

AI voice generator software typically includes features that help users transform text, existing audio, and other media into voices with adjustable qualities to meet their needs. Additionally, many of these generative AI tools come with features to make enterprise-level collaboration and content creation run more smoothly. In general, expect to find the following features in AI voice generators:

Text to Speech

Text to speech (TTS) is a type of AI technology that changes written text into spoken audio. Most AI voice generator software allows users to upload text of different lengths and in different languages in order to generate a vocal version of the same content.

Voice Cloning

With voice cloning, AI technology can capture the content, tonality, speed, and other characteristics of a person’s voice in a recording and use that information to create a faithful replica or clone of that unique voice. With this capability, users can generate entirely new content and recordings that sound like they were spoken by that person.

Custom Voices or Voice Changing

On some AI voice platforms, if you submit your own voice clip or directly record your voice into the app, you can then change that voice into a completely different character, adjusting the tone, accent, mood, and other features. Many users want this feature for creative projects like video game development.

Multilingual Voice Library

Most generative AI voice tools give users access to a diverse, multilingual library of predeveloped voice models. Through extensive training, these TTS models are prepared to create voice transcripts and recordings that accurately adhere to each language’s specific pronunciations, tonalities, pauses, and other characteristics of that language’s speech patterns.

Dubbing and Translation

Taking TTS a step further, AI dubbing and translation convert an existing text or voice recording into a different spoken language. For dubbing specifically, existing recordings — often movies, commercials, and other visual media — receive a new vocal overlay, typically dubbed in a different language by an AI model.

APIs and Third-Party Integrations

With the help of APIs and built-in third-party integrations, users can more easily add AI voice creation and editing capabilities directly into their app and product development workflows. A growing number of AI voice tools are adding relevant third-party integrations to creative platforms as well as social and distribution channels.
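
As a purely illustrative sketch of what such an integration can look like, the snippet below posts text to a hypothetical REST text-to-speech endpoint and saves the returned audio; the URL, header, payload fields, and response format are all invented for illustration and do not belong to any specific vendor:

import requests

# Hypothetical endpoint and placeholder credential, for illustration only
resp = requests.post(
    "https://api.example-voice.com/v1/tts",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"text": "Welcome to our product tour!", "voice": "en-US-female-1"},
)
resp.raise_for_status()

with open("voiceover.mp3", "wb") as f:
    f.write(resp.content)  # assumes the service returns raw MP3 bytes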

To learn about today’s top generative AI tools for the video market, see our guide:  5 Best AI Video Generators

How We Evaluated AI Voice Generators

To evaluate these AI voice generators and other leaders in this AI market sector, we looked at each tool’s standard and unique features while focusing on the following criteria. Each criterion is weighted based on its importance to the typical business user:

Vocal Quality – 30%

Needless to say, vocal quality, fidelity, and usability are the most important aspects of an AI voice generator. Within this criterion, we evaluated each tool based on the realistic quality of AI voices, the accuracy of AI voice generations, the availability of different voices and languages, and the ability to granularly edit generated voice products. We also considered whether a tool offered users the ability to customize or record their own voices and voiceovers.

Enterprise Scalability – 30%

Enterprise scalability is hugely important for AI voice generators since many companies invest in this type of platform to create global marketing, sales, and product content at scale.

For enterprise scalability, we assessed each tool’s global library of voices and dialects, its adherence to enterprise security and compliance standards, features that go beyond voice content production, collaboration and sharing capabilities, integrations with relevant third-party tools and platforms, and the scalability of APIs. We placed a special emphasis on each tool’s enterprise-level plans and the additional features that are available at this level.

Pricing – 20%

Pricing is a crucial factor when considering AI voice technology, as the cost of these tools varies widely for the features you get at that price point. As part of this evaluation, we identified whether each tool offered a free plan option, we compared how prices scale from package to package, we considered how many price points were available to users, and we looked at the value of the features added to each tier, particularly enterprise-level tiers.

Ease of Use – 20%

AI voice tools are supposed to make content creation a simpler task; for this reason, ease of use and accessibility were also important factors in how we judged each of these tools. We looked at each tool’s no-code features, the user-friendliness of voice editing tools, the quality of customer support at each subscription tier, and the availability of self-service resources and community forums for getting started and troubleshooting.

AI Voice Generators: Frequently Asked Questions (FAQs)

Learn more about AI voice generator technology and the top solutions available through these frequently asked questions:

What is the best AI voice generator?

The best AI voice generator will depend on your particular needs and project plans, but Murf is consistently a top choice for its flexibility, with a wide range of general use cases.

Is there a free AI voice generator?

Yes, several AI voice generators are free or are available in free, limited versions.

What is the best free AI voice generator?

The best free AI voice generator options will vary based on your exact requirements. ElevenLabs is the best free solution for users who require API access and interoperability with other resources, while Speechify is the most generous for users who don’t require downloads or more complex features.

Bottom Line: AI Voice Generators Are Affordable and Customizable

AI voice technology has grown in popularity for content creators of all backgrounds and budgets. These types of generative AI tools enable creative scalability for videos, podcasts, audiobooks, customer service interactions, and a slew of other enterprise use cases that require consistent and original voice content. What’s more, this technology is frequently customizable and available in affordable plans, meaning users of all stripes can try out these tools to figure out their potential for their projects.

If you’re not sure which of the AI voice tools in this guide is the best fit for your organization, take some time to test out the free plans or trials that are available for each tool. You’ll quickly discover if the software meets your particular needs, if it’s user friendly, and if it has the features necessary to keep up with your organization’s security and compliance requirements.

For a full portrait of the AI vendors serving a wide array of business needs, read our in-depth guide:  150+ Top AI Companies 2024


Speech-to-Text

Experience industry-leading speech-to-text accuracy with Speech AI models on the cutting-edge of AI research, accessible through a simple API.

Call Transcript (04.02.2024)

Thank you for calling Acme Corporation, Sarah speaking. How may I assist you today? Hi Sarah, this is John. I’m having trouble with my Acme Widget. It seems to be malfunctioning. I’m sorry to hear that, John. Let’s get that sorted out for you. Could you please provide me with the serial number of your widget? Thank you, John. Now, could you describe the issue you’re experiencing with your widget? Well, it’s not turning on at all, even though I’ve replaced the batteries. Let’s try a few troubleshooting steps. Have you checked if the batteries are inserted correctly? Yes, I’ve double-checked that.

Universal-1

State-of-the-art multilingual speech-to-text model

  • Latency on 30 min audio file
  • Hours of multilingual training data
  • Industry’s lowest Word Error Rate (WER)

See how Universal-1 performs against other Automatic Speech Recognition providers.

See it in action

*Benchmark performed across 11 datasets, including 8 academic datasets & 3 internally curated datasets representing real world English audio.

Harness best-in-class accuracy and powerful Speech AI capabilities

Async Speech-to-Text

The AssemblyAI API can transcribe pre-recorded audio and/or video files in seconds, with human-level accuracy. Highly scalable to tens of thousands of files in parallel.

See how in docs
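
For instance, a minimal sketch with AssemblyAI’s Python SDK (pip install assemblyai); the API key and audio URL below are placeholders:

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder credential

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://example.com/meeting.mp3")  # placeholder URL
print(transcript.text)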

Custom Vocabulary

Boost accuracy for vocabulary that is unique or custom to your specific use case or product.

Speaker Diarization

Detect the number of speakers in your audio file, with each word in the text associated with its speaker.

International Language Support

Transcribe 99+ languages and counting, including Global English (English and all of its accents).

Auto Punctuation and Casing

Automatically add casing and punctuation of proper nouns to the transcription text.

Confidence Scores

Get a confidence score for each word in the transcript.

Word Timings

View word-by-word timestamps across the entire transcript text.

Filler Words

Optionally include disfluencies in the transcripts of your audio files.

Profanity Filtering

Detect and replace profanity in the transcription text with ease.

Automatic Language Detection

Automatically detect if the dominant language of the spoken audio is supported by our API and route it to the appropriate model for transcription.

Custom Spelling

Specify how you would like certain words to be spelled or formatted in the transcription text.

Continuously up-to-date and secure

Monthly updates and improvements.

View weekly product and accuracy improvements in our changelog.

View changelog

Enterprise-grade security

AssemblyAI is committed to the highest standards of security practices to keep your data and your customers' data safe.

Read more about our security

AssemblyAI's accuracy is better than any other tools in the market (and we have tried them all).

Vedant Maheshwari , Co-Founder and CEO

Explore more

Streaming Speech-to-Text

Transcribe audio streams synchronously with high accuracy and low latency.

Speech Understanding

Extract maximum value from voice data with Audio Intelligence, and leverage Large Language Models with LeMUR.

Get started in seconds
