16 Best Text to Speech APIs on the Market (Free & Paid, 2024)

Need a reliable text to speech API for your projects? Explore our list of options, including both free and paid services, to find the best one for you.

Unreal Speech

This post covers how text to speech technology is used across a variety of applications, from enhancing accessibility and customer engagement on websites to improving the user experience in mobile apps. With a text to speech API, developers can create more interactive, engaging, and personalized content on their platforms. By the end, you will have a clearer understanding of how to use a text to speech API to create more engaging user experiences.

Table of Contents

  • What Is a Text to Speech API
  • Text to Speech API Use Cases
  • Performance Variations of Text to Speech API
  • How to Choose the Ideal TTS Engine for Your Business
  • Try Unreal Speech for Free Today: Affordably and Scalably Convert Text into Natural-Sounding Speech with Our Text-to-Speech API

What Is a Text to Speech API

A Text-to-Speech (TTS) API is a cloud-based tool that leverages artificial intelligence and deep learning to transform written text into lifelike speech. This process produces high-quality audio files, such as MP3 or WAV, which can mimic human speech. A TTS API can be customized to replicate specific speaking styles and offer natural-sounding voices in various languages. This technology finds application across multiple fields, including personal assistants, navigation systems, e-learning platforms, and accessibility tools for the visually impaired or those with reading difficulties.

Importance of TTS Technology in Modern Digital Experiences

Text to Speech (TTS) technology plays a crucial role in enhancing modern applications and digital experiences.

  • By converting text into speech, TTS APIs make content more accessible to users with visual impairments or reading difficulties.
  • TTS technology improves user engagement, accessibility, and inclusivity for a diverse user base.
  • TTS technology offers assistance for users requiring hands-free interaction, such as when driving or multitasking.
  • For businesses, TTS technology provides a more interactive and engaging user experience, ultimately boosting customer satisfaction and retention.

Accessible Text-to-Speech Solutions

If you are looking for cheap, scalable, realistic TTS to incorporate into your products, try our text-to-speech API for free today. Convert text into natural-sounding speech at an affordable and scalable price.

Text to Speech API Use Cases

Entertainment: Enhancing Digital Experiences with Text-to-Speech API

One of the most exciting use cases of Text-to-Speech (TTS) API is in the entertainment industry. TTS API can provide voice-overs for video games or movies, allowing characters to speak in different languages or accents. This can create a more immersive experience for players and viewers, enhancing the overall entertainment value of the product and catering to a wider audience.

Navigation: Getting Around with Text-to-Speech API

Another essential use case of TTS API is in navigation systems. TTS API can provide turn-by-turn directions to drivers, cyclists, or pedestrians in GPS systems or navigation apps. This functionality allows users to get around more easily and safely, reducing the risk of distractions associated with traditional map navigation, especially in a situation where hands-free navigation is essential.

Accessibility: Making Digital Platforms More Inclusive with Text-to-Speech API

Text-to-Speech API plays a crucial role in improving the accessibility of websites, mobile apps, and other digital platforms for people with disabilities. By providing audible content, TTS API allows visually impaired users to access and interact with digital content more effectively, promoting inclusivity and equal access to information and services.

Customer Service: Enhancing Customer Interactions with Text-to-Speech API

TTS API is also invaluable in customer service applications, providing automated customer service over the phone or in chatbots. Companies can handle a large volume of customer inquiries quickly, improving customer satisfaction and operational efficiency. This automation helps businesses save time, reduce costs, and provide a seamless customer experience.

Healthcare: Supporting Patients with Text-to-Speech API

TTS API can also support healthcare professionals by providing audible instructions or medication reminders for patients with visual or cognitive impairments. This use case ensures that patients can access critical healthcare information autonomously, leading to better medication adherence and overall patient care.

Language Learning: Improving Language Skills with Text-to-Speech API

In the realm of education, Text-to-Speech (TTS) API helps students improve their pronunciation and listening comprehension. By providing audio content in different languages, TTS API supports language learning initiatives, helping students develop their language skills more effectively.

Personal Assistants: Enabling Conversational AI with Text-to-Speech API

Text-to-Speech (TTS) API is an essential component of personal assistant tools like Siri and Alexa, providing spoken responses to user requests. This functionality enables users to interact with AI assistants more naturally and efficiently, enhancing the overall user experience and utility of personal assistant applications.

Education: Overcoming Learning Barriers with Text-to-Speech API

In educational settings, TTS API can help students with reading difficulties, dyslexia, or visual impairments access educational materials more easily on e-learning platforms. By providing audio content, TTS API supports inclusive education practices and accommodates diverse learning needs, ensuring that all students can access educational resources effectively.

Audio Books: Engaging with Text-to-Speech API

One of the most common uses of Text-to-Speech (TTS) API is creating audiobooks. Audiobooks allow people to listen to books while on the go or while engaging in other activities. TTS API enables authors and publishers to produce audiobooks cost-effectively and reach a wider audience, catering to the needs and preferences of modern readers.

16 Best Text to Speech APIs on the Market

1. Unreal Speech

Unreal Speech offers a low-cost, highly scalable text-to-speech API with natural-sounding AI voices, which is the cheapest and highest-quality solution on the market. We cut your text-to-speech costs by up to 90%. Get human-like AI voices with our fast, low-latency API, with the option for per-word timestamps. With our simple, easy-to-use API, you can give your LLM a voice and offer this functionality at scale.

2. Amazon Polly

Amazon Polly’s cloud-based TTS API supports Speech Synthesis Markup Language (SSML) for generating realistic speech from text. It enables users to integrate speech synthesis into an application to enhance accessibility and engagement.

3. Microsoft Azure

Microsoft Azure’s text to speech API exposes a RESTful interface. The cloud-based service supports flexible deployment, letting users run TTS close to their data sources.

4. Murf

Murf is popular for its high-quality voiceovers and its ability to customize speech to a remarkable extent. It offers a unique voice model that delivers a lifelike user experience.

5. Speechify

Speechify is a powerful text-to-speech app, written in Python and built on artificial intelligence, that can help you convert any written text into natural-sounding speech.

6. IBM Watson Text to Speech

Known for its high-quality, natural-sounding voices, IBM Watson provides a unique API that can be used in several programming languages, including Python.

7. Google Cloud Text to Speech

This service utilizes Google’s powerful AI and machine learning capabilities to provide highly realistic voices. It supports numerous languages and dialects, making it suitable for global enterprises.

8. Voice Dream Reader

Known for its readability, Voice Dream Reader offers adjustable reading speed and text highlighting. It’s favoured by those with reading disabilities and language learners.

9. Resemble AI

Resemble AI provides a cutting-edge API that enables users to create human-like voice-overs in just a matter of seconds. Its extensive library of AI voices, with over 200,000 unique voices, sets it apart from other APIs on the market.

10. Play.ht

Play.ht offers an online Text-to-Speech API that converts text into natural-sounding speech with support for 142 languages and accents worldwide. With this technology, users can easily download files in MP3 or WAV format.

11. Balabolka

Balabolka is a versatile TTS API that supports multiple file formats and speech parameters. Its offline working capability and compatibility with a wide range of text types make it stand out.

12. Lovo AI

Lovo offers a high-quality AI voice generator called Genny. One of its most impressive features is Emotional Voices, which can express up to 25 emotions. This adds depth and realism to any content, making it more engaging and memorable.

13. ElevenLabs

ElevenLabs offers a state-of-the-art Text-to-Speech API that leverages advanced neural network models to convert text into natural-sounding speech. The API provides high-quality voice synthesis with customizable parameters, allowing developers to tailor the speech output to specific applications and use cases.

14. Descript's TTS API (Overdub)

Descript's TTS API provides ultra-realistic voices by utilizing the Lyrebird AI, which achieves a state-of-the-art level in voice synthesis. Overdub stands out for its ability to mimic the nuances and intonations of human speech, allowing synthesized audio to blend in seamlessly with natural recordings while matching their tonal characteristics.

15. Colossyan API

Colossyan's API provides a Text-to-Speech converter that allows users to create natural-sounding voice-overs in more than 70 languages and accents. With Colossyan, users can choose from a variety of voice-over actors or even clone their own voice for an added personal touch.

16. ReadSpeaker

ReadSpeaker is known as a leading provider in TTS. With over 20 years of experience in voice technology, ReadSpeaker offers a wide selection of languages and voices to generate speech in various accents.

Performance Variations of Text to Speech API

In the Text to Speech API market, users often have concerns about performance. The Text to Speech (TTS) market is quite dense, with various providers offering their services. Each provider has its strengths and sometimes weaknesses, which can be a deciding factor for users depending on their specific needs. Different TTS APIs can vary in performance, with some being better suited for specific applications or languages. When choosing a TTS API provider, users must consider their requirements and expectations carefully.

Languages in Text To Speech API

Text-to-Speech APIs can perform differently depending on the language being used. Some providers specialize in specific languages and dialects, while others have a broader range of language options. These differences can impact the accuracy and quality of the TTS output. Factors such as regional specializations and rare or uncommon language specializations can significantly influence the performance of TTS APIs across different languages.

Data Quality in Text To Speech API

The accuracy of TTS APIs can vary based on the quality of the input data. Factors such as punctuation, capitalization, and formatting can impact the performance of TTS APIs. Data quality is crucial for achieving high-quality TTS output, and users should ensure that their input data meets the necessary standards for optimal performance.
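As one illustration of preparing input, the sketch below tidies whitespace and punctuation before text is sent to a TTS service. The function name `normalizeForTts` and the specific rules are assumptions chosen for the example, not requirements of any particular provider:

```javascript
// Hypothetical helper: clean up text before handing it to a TTS API.
function normalizeForTts(text) {
  // Collapse runs of whitespace (including newlines) into single spaces.
  let out = text.replace(/\s+/g, " ").trim();
  // Remove stray spaces that appear before punctuation marks.
  out = out.replace(/\s+([,.;:!?])/g, "$1");
  // Terminate the final sentence so the engine can apply closing intonation.
  if (out && !/[.!?]$/.test(out)) out += ".";
  return out;
}

console.log(normalizeForTts("  Hello   world ,  this is  a test "));
```

Which rules actually help depends on the engine, so it is worth comparing audio output with and without such cleanup.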

Fields in Text To Speech API

Some TTS APIs are trained with domain-specific data, such as medical or automotive fields. This specialized training enables these APIs to perform better for specific applications within those fields. Users with diverse needs across different fields must consider these specificities and optimize their choice of TTS API accordingly. By selecting a TTS API that aligns with their industry requirements, users can ensure the best possible performance for their applications.

How to Choose the Ideal TTS Engine for Your Business

Performance and Scalability

When selecting a Text-to-Speech (TTS) API for your business, consider the performance and scalability of the solution. It is vital to check the API's response time and its ability to handle high volumes of requests. A reliable and scalable TTS API ensures consistent performance even during peak times or when processing large batches of text, improving user experience and operational efficiency.

Language and Accent Support

The language and accent support provided by a TTS engine play a crucial role in catering to diverse user populations and global audiences. Evaluate the TTS engine's support for multiple languages and dialects to ensure that your business can reach a broader audience and deliver content in different languages accurately and naturally.

Naturalness of Speech

One of the most critical factors to consider when choosing a TTS API for your business is the naturalness of the synthesized speech. Evaluate the TTS engine's ability to produce lifelike speech with proper intonation, rhythm, and emotional nuances. Natural and expressive speech enhances user engagement and creates a more immersive experience for your customers, increasing the effectiveness of your text-to-speech applications.

Integration Options

Explore the compatibility and integration capabilities of the TTS engine with your existing platforms, applications, and development frameworks. Choosing a TTS API that seamlessly integrates with your current technologies streamlines implementation and deployment processes, saving time and resources. Consider how well the TTS engine aligns with your business's technical requirements and infrastructure to ensure a smooth integration process.

Cost and Licensing

Analysis of the pricing structure, licensing agreements, and associated costs is essential when choosing a Text-to-Speech API. Understand the subscription fees, usage-based charges, and any additional features that may incur costs. Align the pricing and licensing model with your budget and scalability requirements to ensure that your business can leverage the TTS API effectively without unexpected expenses.

Unreal Speech provides an innovative and cost-effective Text-to-Speech (TTS) API that delivers highly natural-sounding AI voices. Our solution is designed to reduce your TTS costs by up to 90%, making it the most affordable and highest-quality option on the market. With Unreal Speech, you get access to human-like AI voices that can give your large language model (LLM) a voice with ease and offer this functionality at scale.

Efficient and Customizable Text-to-Speech API Features

Our super-fast and low-latency API ensures that you can convert text into natural-sounding speech quickly and efficiently. We offer the option for per-word timestamps, allowing you to enhance the user experience of your applications further. Our simple and easy-to-use API makes it effortless to incorporate our text-to-speech solution into your products, regardless of your technical expertise.

Affordable and Realistic Audio Solutions with Unreal Speech

If you are looking for a cheap, scalable, and highly realistic TTS solution for your projects, try Unreal Speech's text-to-speech API today. Experience the power of natural-sounding AI voices that can transform your text content into immersive audio experiences, all at an affordable and scalable price.

Using the Web Speech API

Speech recognition

Speech recognition involves receiving speech through a device's microphone, which is then checked by a speech recognition service against a list of grammar (basically, the vocabulary you want to have recognized in a particular app.) When a word or phrase is successfully recognized, it is returned as a result (or list of results) as a text string, and further actions can be initiated as a result.

The Web Speech API has a main controller interface for this — SpeechRecognition — plus a number of closely-related interfaces for representing grammar, results, etc. Generally, the default speech recognition system available on the device will be used for the speech recognition — most modern OSes have a speech recognition system for issuing voice commands. Think about Dictation on macOS, Siri on iOS, Cortana on Windows 10, Android Speech, etc.

Note: On some browsers, such as Chrome, using Speech Recognition on a web page involves a server-based recognition engine. Your audio is sent to a web service for recognition processing, so it won't work offline.

To show simple usage of Web speech recognition, we've written a demo called Speech color changer. When the screen is tapped/clicked, you can say an HTML color keyword, and the app's background color will change to that color.

The UI of an app titled Speech Color changer. It invites the user to tap the screen and say a color, and then it turns the background of the app that color. In this case it has turned the background red.

To run the demo, navigate to the live demo URL in a supporting mobile browser (such as Chrome).

HTML and CSS

The HTML and CSS for the app are really trivial. We have a title, instructions paragraph, and a div into which we output diagnostic messages.

The CSS provides a very simple responsive styling so that it looks OK across devices.

Let's look at the JavaScript in a bit more detail.

Prefixed properties

Browsers currently support speech recognition with prefixed properties. Therefore at the start of our code we include these lines to allow for both prefixed properties and unprefixed versions that may be supported in future:
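The demo's own lines are not reproduced in this copy. A minimal sketch of the idea, assuming the prefixed names follow the `webkit` pattern (for example `webkitSpeechRecognition`); the helper `resolvePrefixed` and the `...Ctor` variable names are illustrative and not part of the API:

```javascript
// Return the unprefixed constructor if present, else the webkit-prefixed one, else null.
// `scope` is normally `window`; passing it in keeps the lookup usable outside a browser.
function resolvePrefixed(scope, name) {
  return scope[name] || scope["webkit" + name] || null;
}

// Outside a browser there is no `window`, so fall back to an empty object.
const globalScope = typeof window !== "undefined" ? window : {};

const SpeechRecognitionCtor = resolvePrefixed(globalScope, "SpeechRecognition");
const SpeechGrammarListCtor = resolvePrefixed(globalScope, "SpeechGrammarList");
const SpeechRecognitionEventCtor = resolvePrefixed(globalScope, "SpeechRecognitionEvent");
```

A `null` result means the browser exposes neither form, which is a reasonable point to show a "not supported" message instead of continuing.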

The grammar

The next part of our code defines the grammar we want our app to recognize. The following variable is defined to hold our grammar:
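The variable itself is missing from this copy. A minimal sketch of what such a grammar string could look like, assuming a short illustrative color list (the demo's full list is not reproduced here) and a hypothetical helper name `buildColorGrammar`:

```javascript
// Illustrative subset of HTML color keywords; the real demo uses a longer list.
const colors = ["aqua", "azure", "beige", "black", "blue", "red"];

// Build a JSGF grammar with one public rule named <color>.
// Alternatives are separated by pipe characters, statements by semicolons.
function buildColorGrammar(colorList) {
  return `#JSGF V1.0; grammar colors; public <color> = ${colorList.join(" | ")};`;
}

const grammar = buildColorGrammar(colors);
console.log(grammar);
```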

The grammar format used is JSpeech Grammar Format (JSGF); you can find a lot more about it in its specification. However, for now let's just run through it quickly:

  • The lines are separated by semicolons, just like in JavaScript.
  • The first line — #JSGF V1.0; — states the format and version used. This always needs to be included first.
  • The second line indicates a type of term that we want to recognize. public declares that it is a public rule, the string in angle brackets defines the recognized name for this term (color), and the list of items that follow the equals sign are the alternative values that will be recognized and accepted as appropriate values for the term. Note how each is separated by a pipe character.
  • You can have as many terms defined as you want on separate lines following the above structure, and include fairly complex grammar definitions. For this basic demo, we are just keeping things simple.

Plugging the grammar into our speech recognition

The next thing to do is define a speech recognition instance to control the recognition for our application. This is done using the SpeechRecognition() constructor. We also create a new speech grammar list to contain our grammar, using the SpeechGrammarList() constructor.

We add our grammar to the list using the SpeechGrammarList.addFromString() method. This accepts as parameters the string we want to add, plus optionally a weight value that specifies the importance of this grammar in relation to other grammars available in the list (from 0 to 1 inclusive). The added grammar is available in the list as a SpeechGrammar object instance.

We then add the SpeechGrammarList to the speech recognition instance by setting it to the value of the SpeechRecognition.grammars property. We also set a few other properties of the recognition instance before we move on:

  • SpeechRecognition.continuous: Controls whether continuous results are captured (true), or just a single result each time recognition is started (false).
  • SpeechRecognition.lang: Sets the language of the recognition. Setting this is good practice, and therefore recommended.
  • SpeechRecognition.interimResults: Defines whether the speech recognition system should return interim results, or just final results. Final results are good enough for this simple demo.
  • SpeechRecognition.maxAlternatives: Sets the number of alternative potential matches that should be returned per result. This can sometimes be useful, say if a result is not completely clear and you want to display a list of alternatives for the user to choose the correct one from. But it is not needed for this simple demo, so we are just specifying one (which is actually the default anyway).
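The setup described above can be sketched as follows. The function name `createRecognizer` is hypothetical, the constructors are passed in so that prefixed or unprefixed versions can be supplied, and the "en-US" language and the weight of 1 are assumed values for illustration:

```javascript
// Create a recognition instance, attach a grammar list, and set the basic options.
function createRecognizer(RecognitionCtor, GrammarListCtor, grammar) {
  const recognition = new RecognitionCtor();
  const grammarList = new GrammarListCtor();

  // Second argument is the optional weight, from 0 to 1 inclusive.
  grammarList.addFromString(grammar, 1);

  recognition.grammars = grammarList;
  recognition.continuous = false;      // one result per start()
  recognition.lang = "en-US";          // assumed language for this sketch
  recognition.interimResults = false;  // final results only
  recognition.maxAlternatives = 1;     // a single best match per result
  return recognition;
}
```

In a browser this would be called with the real constructors, for example `window.SpeechRecognition || window.webkitSpeechRecognition` and the matching grammar list constructor, after checking that both exist.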

Starting the speech recognition

After grabbing references to the output <div> and the HTML element (so we can output diagnostic messages and update the app background color later on), we implement an onclick handler so that when the screen is tapped/clicked, the speech recognition service will start. This is achieved by calling SpeechRecognition.start(). The forEach() method is used to output colored indicators showing what colors to try saying.
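A minimal sketch of these two pieces, with the DOM element, recognition object, and logging function passed in as parameters. The names `buildColorHints` and `wireStart`, the markup, and the log message are assumptions for illustration:

```javascript
// Build the colored indicators showing which colors the user can try saying.
function buildColorHints(colors) {
  let html = "";
  colors.forEach((color) => {
    html += `<span style="background-color:${color};"> ${color} </span>`;
  });
  return html;
}

// Start recognition whenever the root element is tapped or clicked.
function wireStart(root, recognition, log) {
  root.onclick = () => {
    recognition.start();
    log("Ready to receive a color command.");
  };
}
```

In the page, `root` could be `document.body`, and `log` could write into the diagnostic output element.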

Receiving and handling results

Once the speech recognition is started, there are many event handlers that can be used to retrieve results, and other pieces of surrounding information (see the SpeechRecognition events). The most common one you'll probably use is the result event, which is fired once a successful result is received:

Reading the recognized text out of the event is a bit complex-looking, so let's explain it step by step. The SpeechRecognitionEvent.results property returns a SpeechRecognitionResultList object containing SpeechRecognitionResult objects. It has a getter so it can be accessed like an array, so the first [0] returns the SpeechRecognitionResult at position 0. Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that contain individual recognized words. These also have getters so they can be accessed like arrays, so the second [0] returns the SpeechRecognitionAlternative at position 0. We then read its transcript property to get the recognized result as a string, set the background color to that color, and report the color recognized as a diagnostic message in the UI.
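The indexing described above can be sketched as follows. The function names and the log wording are illustrative, and `root` and `log` stand in for the page element and diagnostic output:

```javascript
// results[0] is the first SpeechRecognitionResult,
// results[0][0] is its first SpeechRecognitionAlternative.
function firstTranscript(event) {
  return event.results[0][0].transcript;
}

// Apply the recognized color and report it.
function handleResult(event, root, log) {
  const color = firstTranscript(event);
  log(`Result received: ${color}.`);
  root.style.backgroundColor = color;
  return color;
}
```

It would be attached with `recognition.onresult = (event) => handleResult(event, document.documentElement, console.log);` or an equivalent `addEventListener("result", ...)` call.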

We also use the speechend event to stop the speech recognition service from running (using SpeechRecognition.stop()) once a single word has been recognized and it has finished being spoken:
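A minimal sketch of such a handler, using a hypothetical helper name `stopAfterSpeech`:

```javascript
// Stop the recognition service as soon as the user has finished speaking.
function stopAfterSpeech(recognition) {
  recognition.onspeechend = () => {
    recognition.stop();
  };
}
```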

Handling errors and unrecognized speech

The last two handlers are there to handle cases where speech was recognized that wasn't in the defined grammar, or an error occurred. The nomatch event seems to be supposed to handle the first case mentioned, although note that at the moment it doesn't seem to fire correctly; it just returns whatever was recognized anyway:

The error event handles cases where there is an actual error with the recognition; the SpeechRecognitionErrorEvent.error property contains the actual error returned:
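Both handlers can be sketched together. The helper name `wireFailureHandlers` and the message wording are assumptions; `event.error` is the property named in the text above:

```javascript
// Report unrecognized speech and recognition errors through a logging callback.
function wireFailureHandlers(recognition, log) {
  recognition.onnomatch = () => {
    log("I didn't recognize that color.");
  };
  recognition.onerror = (event) => {
    log(`Error occurred in recognition: ${event.error}`);
  };
}
```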

Speech synthesis

Speech synthesis (aka text-to-speech, or TTS) involves synthesizing text contained within an app into speech, and playing it out of a device's speaker or audio output connection.

The Web Speech API has a main controller interface for this — SpeechSynthesis — plus a number of closely-related interfaces for representing text to be synthesized (known as utterances), voices to be used for the utterance, etc. Again, most OSes have some kind of speech synthesis system, which will be used by the API for this task as available.

To show simple usage of Web speech synthesis, we've provided a demo called Speak easy synthesis. This includes a set of form controls for entering text to be synthesized, and setting the pitch, rate, and voice to use when the text is uttered. After you have entered your text, you can press Enter / Return to hear it spoken.

UI of an app called speak easy synthesis. It has an input field in which to input text to be synthesized, slider controls to change the rate and pitch of the speech, and a drop down menu to choose between different voices.

To run the demo, navigate to the live demo URL in a supporting mobile browser.

The HTML and CSS are again pretty trivial, containing a title, some instructions for use, and a form with some simple controls. The <select> element is initially empty, but is populated with <option>s via JavaScript (see later on).

Let's investigate the JavaScript that powers this app.

Setting variables

First of all, we capture references to all the DOM elements involved in the UI, but more interestingly, we capture a reference to Window.speechSynthesis. This is the API's entry point: it returns an instance of SpeechSynthesis, the controller interface for web speech synthesis.

Populating the select element

To populate the <select> element with the different voice options the device has available, we've written a populateVoiceList() function. We first invoke SpeechSynthesis.getVoices(), which returns a list of all the available voices, represented by SpeechSynthesisVoice objects. We then loop through this list. For each voice we create an <option> element, set its text content to display the name of the voice (grabbed from SpeechSynthesisVoice.name), the language of the voice (grabbed from SpeechSynthesisVoice.lang), and -- DEFAULT if the voice is the default voice for the synthesis engine (checked by seeing if SpeechSynthesisVoice.default returns true).

We also create data- attributes for each option, containing the name and language of the associated voice, so we can grab them easily later on, and then append the options as children of the select.

Older browsers don't support the voiceschanged event, and just return a list of voices when SpeechSynthesis.getVoices() is called. On others, such as Chrome, you have to wait for the event to fire before populating the list. To allow for both cases, we run the function once directly and again whenever the event fires.
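The population logic and the two-path startup can be sketched as follows. The demo's own function is not reproduced in this copy, so the parameterized signatures, the label format, and the helper names `voiceLabel` and `watchVoices` are assumptions for illustration:

```javascript
// Label shown for one voice, e.g. "Alex (en-US) -- DEFAULT".
function voiceLabel(voice) {
  let label = `${voice.name} (${voice.lang})`;
  if (voice.default) label += " -- DEFAULT";
  return label;
}

// Fill a <select> with one <option> per available voice.
function populateVoiceList(synth, select, doc) {
  const voices = synth.getVoices();
  select.length = 0; // drop existing options so repeated calls do not duplicate them
  for (const voice of voices) {
    const option = doc.createElement("option");
    option.textContent = voiceLabel(voice);
    option.setAttribute("data-lang", voice.lang);
    option.setAttribute("data-name", voice.name);
    select.appendChild(option);
  }
  return voices;
}

// Run once immediately, and again on voiceschanged where that event is supported.
function watchVoices(synth, refresh) {
  refresh();
  if (synth.onvoiceschanged !== undefined) {
    synth.onvoiceschanged = refresh;
  }
}
```

In a page this might be wired as `watchVoices(window.speechSynthesis, () => populateVoiceList(window.speechSynthesis, document.querySelector("select"), document));`.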

Speaking the entered text

Next, we create an event handler to start speaking the text entered into the text field. We are using an onsubmit handler on the form so that the action happens when Enter / Return is pressed. We first create a new SpeechSynthesisUtterance() instance using its constructor — this is passed the text input's value as a parameter.

Next, we need to figure out which voice to use. We use the HTMLSelectElement selectedOptions property to return the currently selected <option> element. We then use this element's data-name attribute, finding the SpeechSynthesisVoice object whose name matches this attribute's value. We set the matching voice object to be the value of the SpeechSynthesisUtterance.voice property.

Finally, we set the SpeechSynthesisUtterance.pitch and SpeechSynthesisUtterance.rate to the values of the relevant range form elements. Then, with all necessary preparations made, we start the utterance being spoken by invoking SpeechSynthesis.speak(), passing it the SpeechSynthesisUtterance instance as a parameter.
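The steps above can be sketched as one function. The names `findVoiceByName` and `speakText` and the `options` object are hypothetical; the synthesis controller and utterance constructor are passed in so the logic does not depend on a particular page:

```javascript
// Find the voice whose name matches the selected option's data-name value.
function findVoiceByName(voices, name) {
  return voices.find((voice) => voice.name === name) || null;
}

// Build an utterance from the form values and hand it to the synthesizer.
function speakText(synth, UtteranceCtor, text, options) {
  const utterance = new UtteranceCtor(text);
  const voice = findVoiceByName(synth.getVoices(), options.voiceName);
  if (voice) utterance.voice = voice; // otherwise the engine's default voice is used
  utterance.pitch = Number(options.pitch); // range inputs report strings
  utterance.rate = Number(options.rate);
  synth.speak(utterance);
  return utterance;
}

// Possible wiring in a page (element variables are assumed to exist):
// form.onsubmit = (event) => {
//   event.preventDefault();
//   speakText(window.speechSynthesis, SpeechSynthesisUtterance, input.value, {
//     voiceName: select.selectedOptions[0].getAttribute("data-name"),
//     pitch: pitchRange.value,
//     rate: rateRange.value,
//   });
//   input.blur();
// };
```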

In the final part of the handler, we include a pause event handler to demonstrate how SpeechSynthesisEvent can be put to good use. When SpeechSynthesis.pause() is invoked, the handler reports the character number and name that the speech was paused at.
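A minimal sketch of such a pause handler, assuming the event exposes `charIndex` and a reference to the `utterance` being spoken; the helper names and the message wording are illustrative:

```javascript
// Describe where in the text the speech was paused.
function describePause(event) {
  const text = event.utterance.text;
  const char = text.charAt(event.charIndex);
  return `Speech paused at character ${event.charIndex} of "${text}", which is "${char}".`;
}

// Attach the reporter to an utterance before it is spoken.
function reportPauses(utterance, log) {
  utterance.onpause = (event) => {
    log(describePause(event));
  };
}
```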

Finally, we call blur() on the text input. This is mainly to hide the keyboard on Firefox OS.

Updating the displayed pitch and rate values

The last part of the code updates the pitch / rate values displayed in the UI, each time the slider positions are moved.
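One way to keep a displayed value in step with a slider is sketched below. The helper name `mirrorRangeValue` is hypothetical, and the `input` event is used here on the assumption that the value should update while the slider is being dragged:

```javascript
// Show the current value of a range input in a separate output element.
function mirrorRangeValue(range, output) {
  const update = () => {
    output.textContent = range.value;
  };
  range.oninput = update; // fires while the slider is being moved
  update();               // show the initial value straight away
}
```

It would be called once per slider, for example with the pitch range element and the element that displays the pitch value.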

Text-to-Speech documentation

Text-to-Speech converts text or Speech Synthesis Markup Language (SSML) input into audio data of natural human speech.

Documentation resources

  • Quickstart: Use the command line
  • Quickstart: Use the client libraries
  • Create voice audio files
  • Decode base64-encoded audio content
  • List all supported voices
  • Specify regional endpoints
  • Use device profiles for generated audio
  • Supported voices and languages
  • Speech Synthesis Markup Language (SSML)
  • Text-to-Speech Client Libraries
  • Quotas & limits
  • Release notes
  • Getting support
  • Billing questions

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-10-10 UTC.

OpenAI Help Center

The basics of our text-to-speech API

What is it?

With the text-to-speech API, developers can generate high quality spoken audio from text. We’re initially offering six preset voices to choose from and two model variants, tts-1 and tts-1-hd. tts-1 is optimized for real-time use cases and tts-1-hd is optimized for quality. Pricing starts at $0.015 per 1,000 input characters (not tokens).
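Because pricing is per character rather than per token, a rough cost estimate is simple arithmetic. The sketch below uses the starting price quoted above as a default; the function name is hypothetical and other models may be priced differently:

```javascript
// Estimate cost in US dollars for synthesizing `text`.
// Note: String.length counts UTF-16 code units, which can differ slightly
// from how a provider counts characters for text outside the basic plane.
function estimateTtsCost(text, pricePerThousandChars = 0.015) {
  return (text.length / 1000) * pricePerThousandChars;
}

console.log(estimateTtsCost("a".repeat(2000)).toFixed(3));
```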

How can I use it?

Anyone with an OpenAI API account can access the new audio/speech endpoint.
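The text above names only the endpoint and the two models. The sketch below assumes the request shape from OpenAI's public API reference (a POST to the v1 audio/speech URL with model, input, and voice fields, and "alloy" as one of the preset voices); check the current reference before relying on it. The helper name `buildSpeechRequest` is hypothetical, and it only builds the request so it can be inspected without a network call:

```javascript
// Build the URL and fetch options for a text-to-speech request.
function buildSpeechRequest(apiKey, input, { model = "tts-1", voice = "alloy" } = {}) {
  return {
    url: "https://api.openai.com/v1/audio/speech",
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, input, voice }),
    },
  };
}

// Possible usage (performs a real, billable request):
// const { url, options } = buildSpeechRequest(process.env.OPENAI_API_KEY, "Hello there");
// const audio = await fetch(url, options).then((res) => res.arrayBuffer());
```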

What rate limits can I expect?

Rate limits begin at 50 RPM for paid accounts. You can see your limits in your developer console.

What’s the maximum input size I can submit per request?

4096 characters (equivalent to ~5 minutes of audio at default speed).
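Longer text therefore has to be split across several requests. One possible approach, sketched here with a hypothetical `chunkText` helper, is to cut at the last space within the limit and fall back to a hard cut when there is none:

```javascript
// Split text into pieces of at most maxChars characters, preferring word boundaries.
function chunkText(text, maxChars = 4096) {
  const chunks = [];
  let rest = text.trim();
  while (rest.length > maxChars) {
    // Look backwards from the limit for a space to cut at.
    let cut = rest.lastIndexOf(" ", maxChars);
    if (cut <= 0) cut = maxChars; // no space found: cut mid-word
    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}

console.log(chunkText("aaa bbb ccc", 7));
```

Cutting at sentence boundaries would usually sound more natural than cutting at arbitrary spaces; this sketch keeps the rule simple.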

Is it possible to stream audio?

Yes! By setting stream=True, you can chunk the returned audio file.
