How-To Geek

How to Use AWS Transcribe to Convert Speech to Text


Speech transcription is a problem that's commonly solved with expensive human workers. With machine learning though, computers have caught up, and AWS's AI-powered Speech Recognition Toolkit is now available as a service for your application to use.

Transcribe is simple: give it an audio file (stored in S3), and it churns through it and gives you a transcript. You are charged based on the length of the audio, at a rate of $0.0004 per second. A two-hour boardroom meeting would cost $2.88 to transcribe, while a quick two-minute video costs only about $0.05.
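
The per-second arithmetic is easy to check with a few lines of Python. This is only a sketch: the rate is the one quoted above and may not match current AWS pricing, and `transcription_cost` is an illustrative helper name, not part of any AWS tool.

```python
from decimal import Decimal

# Rate quoted in the article; verify against the current AWS pricing page.
RATE_PER_SECOND = Decimal("0.0004")


def transcription_cost(duration_seconds: int, rate: Decimal = RATE_PER_SECOND) -> Decimal:
    """Return the estimated charge in dollars for one audio file."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return Decimal(duration_seconds) * rate


if __name__ == "__main__":
    examples = [("two-hour meeting", 2 * 60 * 60), ("two-minute video", 2 * 60)]
    for label, seconds in examples:
        # Print the cost rounded to cents.
        print(f"{label}: ${transcription_cost(seconds):.2f}")
```

At this rate, 7,200 seconds comes to $2.88 and 120 seconds to $0.048, before any free tier or minimum charge that may apply.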

Transcribe is pretty fast, but it's not latency optimized. It's well suited for after-the-fact transcription, such as transcribing customer calls and subtitling uploaded video. If you need real-time speech-to-text transcription, you can use AWS Lex, a service for building interactive chat bots like Alexa.

To get started, head over to the AWS Transcribe Console. You can press "Start Streaming" to record from your device's microphone and test the service. It's pretty neat, but you're likely after more than this.

From the sidebar, select "Transcription Jobs" and click "Create Job." The job serves as a method of automating transcription. Each job works on one file at a time; to automate the transcription of multiple files, you need to create a separate job for each one from the command line.

Give Transcribe a path to the audio file you'd like to convert. You can optionally manually select the format and sample rate, though it should automatically recognize most common ones.

Once you click create, the transcription begins. The newly created job appears in the list, and once it's done, you can download the transcribed text.

You probably also want to know how to work with Transcribe from the command line, as creating jobs by hand in the web console is tedious and only suitable if you're processing one large audio file at a time.

aws transcribe start-transcription-job \
  --transcription-job-name NewJob \
  --language-code en-US \
  --media MediaFileUri="s3://bucket/file.mp3"

This starts the job and outputs some JSON telling you whether it was created successfully. You can check the status of a job programmatically with get-transcription-job:

aws transcribe get-transcription-job --transcription-job-name NewJob

If it's finished, TranscriptionJob.TranscriptionJobStatus is set to "COMPLETED," and you can download the file directly with curl and a little jq processing:

curl "$(aws transcribe get-transcription-job --transcription-job-name NewJob \
  | jq -r ".TranscriptionJob.Transcript.TranscriptFileUri")" \
  | jq ".results.transcripts"

Note that the transcript file is JSON, and it contains the full transcript plus a confidence assessment of each word and the alternatives. Unless you want all the confidence values, you can filter them out with the final | jq ".results.transcripts" statement.
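
If you do want the confidence values, the same file can be read in Python. The sketch below assumes the usual layout of a Transcribe result (a `results` object holding `transcripts` and per-word `items`); the sample document is hand-written for illustration, and `summarize_transcript` and the 0.8 threshold are arbitrary choices, not AWS defaults.

```python
import json


def summarize_transcript(transcript_json: str, threshold: float = 0.8):
    """Return (full_text, low_confidence_words) from a Transcribe result document."""
    results = json.loads(transcript_json)["results"]
    full_text = " ".join(t["transcript"] for t in results["transcripts"])
    doubtful = []
    for item in results.get("items", []):
        if item.get("type") != "pronunciation":
            continue  # skip punctuation items
        best = item["alternatives"][0]  # first alternative is the one used in the text
        confidence = float(best["confidence"])  # confidences are stored as strings
        if confidence < threshold:
            doubtful.append((best["content"], confidence))
    return full_text, doubtful


# Hand-written sample in the assumed shape, not real service output.
SAMPLE = json.dumps({
    "results": {
        "transcripts": [{"transcript": "Hello world."}],
        "items": [
            {"type": "pronunciation", "start_time": "0.0", "end_time": "0.4",
             "alternatives": [{"confidence": "0.99", "content": "Hello"}]},
            {"type": "pronunciation", "start_time": "0.4", "end_time": "0.9",
             "alternatives": [{"confidence": "0.62", "content": "world"}]},
            {"type": "punctuation",
             "alternatives": [{"confidence": "0.0", "content": "."}]},
        ],
    }
})

if __name__ == "__main__":
    text, doubtful = summarize_transcript(SAMPLE)
    print(text)
    print(doubtful)
```

Words that fall below the threshold are good candidates for a quick human review.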

You can also automatically transcribe audio files using Lambda functions. Lambda is a service that can run code in response to AWS events, such as new items being uploaded to S3. It's serverless, and you only pay for execution time; because Lambda isn't doing the actual processing, just creating a new job on upload, the cost should be trivial.
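
A minimal sketch of such a function in Python, assuming the Lambda is subscribed to S3 "object created" events and that every upload is US English. The handler and helper names are illustrative; only `start_transcription_job` and the S3 event fields come from AWS.

```python
import re
import urllib.parse

LANGUAGE_CODE = "en-US"  # assumption: all uploads are US English


def job_request_from_record(record: dict) -> dict:
    """Build start_transcription_job arguments from one S3 event record."""
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    # Job names may only contain letters, digits, '.', '_' and '-'.
    job_name = re.sub(r"[^0-9A-Za-z._-]", "-", key)[:200]
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": LANGUAGE_CODE,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
    }


def handler(event, context):
    """Lambda entry point: start one transcription job per uploaded object."""
    import boto3  # imported lazily so the helper above can be tested without it

    transcribe = boto3.client("transcribe")
    for record in event.get("Records", []):
        transcribe.start_transcription_job(**job_request_from_record(record))
```

Job names must be unique within an account, so a real function would likely add a timestamp or request ID to the name; this sketch does not.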

You can code it yourself if you've used Lambda before, but luckily there's a prebuilt application on the Lambda serverless app repository that can handle this exact job for you. It's called s3-lambda-transcribe-audio-to-text-s3, and you may have to click "Show apps that create custom IAM roles" to find it.

Create a new app from this template, and specify the input bucket and output bucket. Make sure the output bucket exists and that the input bucket doesn't, as the app will create the input bucket for you.

You'll also want to enter the language of the audio file. en-US is US English; for anything else, you can find the code in AWS's docs.

Deploy the application, and you should see a newly created bucket. If you drop an audio file in this bucket, Lambda can create a new Transcribe job for you.

If the app doesn't work, make sure you enabled it to create its IAM role, and make sure it has permission to work with Transcribe and the S3 buckets it needs to.

Build a Serverless Application for Audio-to-Text Conversion

Learn how to use Amazon Transcribe and AWS Lambda to build an audio-to-text conversion application written in Go.


  • Deploy the solution using AWS CloudFormation.
  • Verify the solution.
  • AWS Lambda for Go.
  • AWS Go SDK, specifically for Amazon Transcribe.
  • Go bindings for AWS CDK to implement "Infrastructure-as-code" (IaC) for the entire solution and deploy it with the AWS Cloud Development Kit (CDK) CLI.


AWS’s transcription platform is now powered by generative AI

It recognizes 100 spoken languages thanks to AI.

By Emilia David , a reporter who covers AI. Prior to joining The Verge, she covered the intersection between technology, finance, and the economy.


AWS added new languages to its Amazon Transcribe product, offering speech foundation model-based transcription for 100 languages and a slew of new AI capabilities for customers. 

Announced during the AWS re:Invent event, Amazon Transcribe can now recognize more spoken languages and spin up a call transcription. AWS customers use Transcribe to add speech-to-text capabilities to their apps on the AWS Cloud.

The company said in a blog post that Transcribe trained on “millions of hours of unlabeled audio data from over 100 languages” and uses self-supervised algorithms to learn patterns of human speech in different languages and accents. AWS said it ensured that some languages were not overrepresented in the training data to ensure that lesser-used languages could be as accurate as more frequently spoken ones. 

In late 2023, Amazon Transcribe supported 39 languages.

Amazon Transcribe improves accuracy by 20 to 50 percent over its previous version across many languages, according to AWS. It also offers automatic punctuation, custom vocabulary, automatic language identification, and custom vocabulary filters. It can recognize speech in audio and video formats and noisy environments. 

The Verge reached out to AWS for information on which foundation models it used for Amazon Transcribe.

With better language recognition, AWS said advances with Amazon Transcribe also bleed into better accuracy with its Call Analytics platform, which its contact center customers often use. Amazon Transcribe Call Analytics, now also powered by generative AI models, summarizes interactions between an agent and a customer. AWS said this cuts down on after-call work creating reports, and managers can quickly read information without needing to go through the entire transcript. 

Of course, AWS is not the only company offering AI-powered transcription services. Otter has been providing AI transcriptions to consumers and enterprises for a while and released a summarization tool in June. While not exactly the same, Meta announced it is working on a generative AI-powered translation model that recognizes nearly 100 spoken languages.

AWS also announced additional capabilities for its Amazon Personalize product, which allows clients to offer products or show recommendations to customers, like how streaming services can suggest new shows based on previous activity. AWS added Content Generation, which will write titles or email subject lines to thematically connect recommendation lists.

Correction, November 28 3:20 PM PT: Corrected to reflect the number of languages supported by Transcribe in 2023 and clarified accuracy information.


Unlocking the Power of Speech with Amazon Transcribe


Amazon Transcribe is a cutting-edge automatic speech recognition (ASR) service that transforms spoken language into text. Powered by deep learning technologies, it offers a robust solution for converting speech to text, enabling developers and businesses to enhance their applications with speech-to-text capabilities. From transcribing customer service calls to automating subtitling and generating searchable archives, Amazon Transcribe is revolutionizing how we interact with audio content.

How Amazon Transcribe Works

The Technology Behind Amazon Transcribe

At the core of Amazon Transcribe is a sophisticated machine-learning model that processes audio files and delivers accurate, time-stamped text transcripts. This service is designed to handle a variety of audio formats and environments, from high-quality studio recordings to low-fidelity phone calls. By leveraging deep learning processes, Transcribe adapts to different accents, dialects, and languages, ensuring high transcription accuracy across diverse scenarios.

Expanding Language Support

Amazon Transcribe’s capabilities have significantly expanded, now supporting transcription in over 100 languages. This advancement opens up new possibilities for global applications, allowing businesses to cater to a wider audience without language barriers. Whether it’s for customer support, content creation, or medical documentation, Transcribe’s extensive language support makes it a versatile tool for various industries.

Key Features of Amazon Transcribe

Medical Transcription with Amazon Transcribe Medical

Amazon Transcribe Medical is a specialized service designed to meet the needs of medical professionals and healthcare providers. By delivering highly accurate transcriptions of medical terminology and patient conversations, it significantly simplifies the creation of clinical documentation. This not only streamlines workflows but also enhances the accuracy of patient records, contributing to better patient outcomes.

One of the pivotal advantages of Amazon Transcribe Medical is its compliance with the Health Insurance Portability and Accountability Act (HIPAA), ensuring the utmost protection of patient data. This aspect is crucial in maintaining trust and confidentiality in patient care. Transcribe Medical presents a cost-effective solution compared to traditional transcription methods, which are often labor-intensive and prone to errors. Its state-of-the-art machine-learning technology can understand complex medical jargon, making it a reliable tool for various medical settings, including telemedicine consultations, clinical note-taking, and more.


Enhancing Customer Experiences with Call Analytics

In the realm of customer service, Amazon Transcribe Call Analytics emerges as a powerful tool for businesses aiming to elevate their customer experience. Leveraging generative AI, this feature dives deep into call transcripts to unearth valuable insights. It meticulously analyses customer and agent sentiment, pinpoints the underlying reasons for calls, and generates concise reports summarising these interactions.

The intelligence gathered through Call Analytics enables businesses to identify areas for improvement in their customer service strategies, tailor training programs for agents, and ultimately foster a more positive customer experience. Additionally, this feature aids in recognizing patterns and trends in customer inquiries, allowing companies to address common concerns and streamline their operations proactively. The ability to quickly understand and act on customer feedback is a game-changer in today’s competitive market, making Amazon Transcribe Call Analytics an indispensable asset for any customer-focused organization.


Subtitling and Toxicity Detection

Amazon Transcribe also offers invaluable services for content creators through its subtitling capabilities. This feature allows for the automatic generation of accurate subtitles for audio and video content, significantly enhancing accessibility for a global audience, including those with hearing impairments. By breaking down language barriers, content creators can reach a wider audience, enriching the viewing experience for all.

The toxicity detection feature of Amazon Transcribe is a testament to AWS’s commitment to creating a safer online environment. This innovative tool scans transcriptions for harmful content, enabling organizations to identify and mitigate instances of toxicity in user-generated content, live broadcasts, and other digital platforms. In an age where online safety is paramount, this feature provides an essential layer of protection, ensuring that digital spaces remain respectful and inclusive for everyone.

Applications of Amazon Transcribe

Amazon Transcribe’s versatility is a testament to its transformative power across a multitude of sectors. This advanced speech recognition service extends its capabilities far beyond simple transcription tasks, addressing complex needs in customer service, media production, healthcare, and numerous other fields.

Revolutionizing Customer Service

In customer service, Amazon Transcribe redefines how businesses interact with their customers. By transcribing customer calls and inquiries, it provides a textual database that can be easily searched and analysed. This capability allows for the identification of common concerns, questions, and feedback, enabling businesses to adapt their services and products to better meet customer needs. Furthermore, the integration of Transcribe with customer relationship management (CRM) systems can automate the documentation process, ensuring that every customer interaction is captured and available for future reference.

Transforming Media Production

For media professionals, Amazon Transcribe offers an invaluable tool for creating subtitles and closed captions, making content accessible to a wider audience, including those who are deaf or hard of hearing. This not only expands the reach of media content but also complies with accessibility regulations in various jurisdictions. Additionally, journalists and podcasters can leverage Transcribe to convert interviews and audio recordings into text, streamlining the content creation process and enabling more efficient editing and publication workflows.

Advancing Healthcare Documentation

In healthcare, Amazon Transcribe Medical specifically addresses the need for accurate and timely documentation. By transcribing doctor-patient conversations, medical consultations, and clinical notes, it facilitates the creation of comprehensive and precise medical records. This not only aids in ensuring better patient care but also supports medical research by providing a textual database for analysis. The service’s compliance with HIPAA regulations further underscores its suitability for sensitive medical environments.

Enhancing Educational Resources

Education is another sector that benefits significantly from Amazon Transcribe. Educators and institutions can transcribe lectures and educational content, making it accessible to students for review and study. This is particularly beneficial for learners who prefer reading to listening or those for whom English is a second language. Moreover, the ability to search through transcribed material enables students to find specific information quickly, enhancing their learning experience.

Supporting Legal and Compliance Efforts

The legal sector also finds Amazon Transcribe to be a powerful ally. Transcribing legal proceedings, depositions, and meetings can save time and resources while ensuring accurate records are kept for future reference. Additionally, the ability to quickly search through transcribed content can aid in case preparation and evidence review.

Integration with AWS Ecosystem

Amazon Transcribe’s potential is magnified when integrated within the AWS ecosystem, a network of services that provide comprehensive solutions for modern technological challenges. This integration facilitates the creation of advanced, highly functional applications that leverage the strengths of multiple AWS services, offering unparalleled efficiency and innovation in processing and understanding human speech.

Enhancing Applications with Amazon Comprehend

When Amazon Transcribe is used in conjunction with Amazon Comprehend, businesses can unlock powerful natural language processing (NLP) capabilities. Amazon Comprehend analyses the text generated by Transcribe to extract meaningful insights, such as sentiment analysis, entity recognition, and key phrase extraction. This combination can be particularly useful in customer service scenarios, where understanding the sentiment and key topics of customer calls can lead to more personalized and effective responses.
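
As a rough illustration of that pairing, the sketch below sends transcript text to Comprehend's `detect_sentiment` call. Long transcripts are split first, on the assumption that the call accepts only about 5,000 bytes of UTF-8 text per request; the helper names and the 4,500-byte margin are illustrative choices.

```python
def split_by_bytes(text: str, max_bytes: int = 4500) -> list[str]:
    """Split text on whitespace into chunks whose UTF-8 size stays within max_bytes."""
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if current and size + word_size > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += word_size
    if current:
        chunks.append(" ".join(current))
    return chunks


def transcript_sentiments(text: str, language_code: str = "en") -> list[str]:
    """Return one sentiment label per chunk of the transcript."""
    import boto3  # imported lazily so split_by_bytes works without it

    comprehend = boto3.client("comprehend")
    return [
        comprehend.detect_sentiment(Text=chunk, LanguageCode=language_code)["Sentiment"]
        for chunk in split_by_bytes(text)
    ]
```

A single word longer than `max_bytes` would still produce an oversized chunk; real transcripts rarely contain one, but a production version should guard against it.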

Leveraging Machine Learning with Amazon SageMaker

Integration with Amazon SageMaker takes Amazon Transcribe’s capabilities to the next level by incorporating advanced machine-learning models into the transcription process. This allows for the development of custom speech recognition models that are tailored to specific business needs, such as recognizing industry-specific terminology or improving accuracy for non-standard dialects. Amazon SageMaker’s machine learning tools enable businesses to continuously improve and refine their speech recognition models, ensuring that their applications remain at the cutting edge of technology.

Building Interactive Voice Applications

The synergy between Amazon Transcribe and other AWS services like Amazon Lex and AWS Lambda opens up exciting opportunities for creating interactive voice-driven applications. Amazon Lex provides the functionality to build conversational interfaces, while AWS Lambda allows for the execution of backend processes in response to voice commands. This integration enables the development of sophisticated voice assistants, automated customer service bots, and other interactive applications that can understand and respond to human speech in real time.

Streamlining Workflows and Enhancing Customer Engagement

By harnessing the combined power of Amazon Transcribe and the AWS ecosystem, businesses can automate complex workflows, enhance customer engagement, and create more immersive user experiences. For example, media companies can automate the generation of subtitles and closed captions for their content, making it more accessible to a wider audience. Similarly, healthcare providers can streamline the documentation process by transcribing medical consultations and integrating them into electronic health records (EHR) systems.

Getting Started with Amazon Transcribe

Setting up Amazon Transcribe is straightforward. Users need an AWS account and access to the AWS Management Console, AWS Command Line Interface (CLI), or the Transcribe API. The service accepts audio files stored in Amazon S3 in multiple formats and offers detailed documentation to guide users through the transcription process.

Here’s a step-by-step guide based on the content from the official AWS documentation:

Step 1: Sign Up for AWS

The first step is to create an AWS account if you don’t already have one. Visit the AWS homepage and follow the sign-up process, which will guide you through creating your account and setting up the necessary billing and contact information.

Step 2: Access the AWS Management Console

Once your AWS account is set up, log in to the AWS Management Console. This web-based interface allows you to manage your AWS services and resources. Amazon Transcribe can be accessed directly from the console, providing a user-friendly environment to start your transcription projects.

Step 3: Choose Your Access Method

Amazon Transcribe can be accessed in several ways, depending on your preferences and requirements:

  • AWS Management Console: Ideal for those who prefer a graphical interface, the console offers an intuitive way to use Amazon Transcribe.
  • AWS Command Line Interface (CLI): For users comfortable with command-line tools, the AWS CLI offers a powerful way to interact with Amazon Transcribe and other AWS services.
  • Transcribe API: Developers looking to integrate Amazon Transcribe into their applications can use the API to programmatically access the service’s features.

Step 4: Prepare Your Audio Files

Amazon Transcribe supports various audio formats, including MP3, MP4, WAV, and FLAC. Ensure your audio files are stored in Amazon S3, as the service will access your files from this cloud storage. This step involves uploading your audio files to an S3 bucket, which can be done through the AWS Management Console, AWS CLI, or SDKs.
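
With the Python SDK (boto3), the upload could be sketched as follows. The extension table covers only the four formats named above, and `media_format_for` and `upload_audio` are illustrative helper names; the bucket and key are whatever you choose.

```python
from pathlib import PurePosixPath

# File extension -> format name, limited to the formats listed in this step.
MEDIA_FORMATS = {".mp3": "mp3", ".mp4": "mp4", ".wav": "wav", ".flac": "flac"}


def media_format_for(filename: str) -> str:
    """Return the media format for a file name, or raise ValueError if unsupported."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix not in MEDIA_FORMATS:
        raise ValueError(f"unsupported audio format: {filename!r}")
    return MEDIA_FORMATS[suffix]


def upload_audio(local_path: str, bucket: str, key: str) -> str:
    """Upload a local audio file to S3 and return its s3:// URI."""
    media_format_for(local_path)  # fail early on unsupported files
    import boto3  # imported lazily so the check above works without it

    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

The returned URI is what a transcription job later takes as its media location.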

Step 5: Create a Transcription Job

With your audio files ready in S3, you can now create a transcription job. This can be done through the AWS Management Console, where you’ll specify the details of your transcription request, such as the file location, output format, and language. If you’re using the AWS CLI or Transcribe API, you’ll provide these details in your command or API request.
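
Through the API, creating a job and waiting for it might look like the sketch below, which uses boto3's `start_transcription_job` and `get_transcription_job`. The function names and the ten-second polling interval are illustrative assumptions.

```python
import time

TERMINAL_STATES = {"COMPLETED", "FAILED"}


def job_status(response: dict) -> str:
    """Extract the status string from a get_transcription_job response."""
    return response["TranscriptionJob"]["TranscriptionJobStatus"]


def transcribe_and_wait(job_name: str, media_uri: str,
                        language_code: str = "en-US", poll_seconds: int = 10) -> dict:
    """Start a transcription job and poll until it completes or fails."""
    import boto3  # imported lazily so job_status can be tested without it

    client = boto3.client("transcribe")
    client.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode=language_code,
        Media={"MediaFileUri": media_uri},
    )
    while True:
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        if job_status(response) in TERMINAL_STATES:
            return response["TranscriptionJob"]
        time.sleep(poll_seconds)  # job is still queued or in progress
```

Polling is the simplest approach for a one-off script; event-driven setups avoid the wait loop entirely.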

Step 6: Review and Analyze Your Transcriptions

Once your transcription job is complete, Amazon Transcribe will provide you with a text file containing the transcribed text. This file can be reviewed and analyzed to ensure accuracy and completeness. The service also offers features like timestamp generation for each word, making it easier to align the text with the audio.

Step 7: Integrate and Expand

After becoming familiar with the basic transcription process, explore the advanced features of Amazon Transcribe and consider integrating it with other AWS services to enhance your applications. Whether it’s leveraging Amazon Comprehend for natural language processing or automating workflows with AWS Lambda, the possibilities are vast.

Pricing and Availability

Amazon Transcribe’s pricing structure is designed to be both flexible and accessible, adhering to a pay-as-you-go model that aligns with the diverse needs of its users. This approach allows businesses and individuals to scale their usage according to their specific requirements without incurring unnecessary costs. For those new to the service, Transcribe offers a generous free tier, enabling users to explore and evaluate the service’s capabilities without any financial commitment. This free tier is particularly beneficial for small businesses and startups looking to integrate speech-to-text functionalities into their operations without upfront investment.

Once the free tier’s limits are reached, the pricing transitions to a usage-based model, where costs are determined by the amount of audio transcribed. This model is calculated on a per-second basis, ensuring that users only pay for the actual amount of transcription processed. This granular pricing strategy makes Amazon Transcribe a cost-effective solution for projects of all sizes, from small, one-time tasks to large-scale, ongoing operations.

Amazon Transcribe is transforming the way we interact with audio content, offering a powerful tool for speech-to-text conversion. With its advanced features, extensive language support, and integration with AWS services, it provides a comprehensive solution for businesses and developers looking to leverage speech recognition technology. Whether for transcribing medical conversations, analyzing customer service calls, or creating accessible content, Transcribe is paving the way for innovative applications of speech recognition technology.

Additional Resources

  • Amazon Transcribe Pricing (Understand the cost-effective pricing model of Amazon Transcribe and how it fits into your budget.)
  • Amazon Transcribe Customers (Explore case studies and success stories from businesses that have benefited from using Amazon Transcribe.)
  • Amazon Transcribe Resources (Access a wealth of resources, including documentation and tutorials, to get the most out of Amazon Transcribe.)


AWS Transcribe (Speech-to-Text) and Automation: Everything You Need to Know

Table of contents

  • What is AWS Transcribe
  • How does speech-to-text work
  • Top features
  • On-demand & live streaming content support
  • Custom language models (CLM)
  • Live streaming subtitle stabilization
  • WebVTT and SubRip support
  • Vocabulary filtering
  • Multilingual subtitle options
  • How to use AWS Console for the Transcribe process
  • How to use AWS SDK for the Transcribe process
  • How to use AWS CLI for the Transcribe process
  • How to automate the Transcribe process
  • Wrapping up

In today’s times, viewers demand an immersive and inclusive streaming experience. Accessibility and engagement have become the key to streaming success. There was a time when captions were optional in OTT videos. But today, captions and subtitles are no longer just nice-to-haves, they’re essential for accessibility and a truly inclusive viewing experience. 

At Muvi, we are committed to making your videos accessible and enjoyable for everyone. That’s why we leverage the power of AWS Transcribe to deliver real-time speech-to-text capabilities. 

In this blog, we’ll explore how AWS Transcribe works behind the scenes, transforming spoken words into on-screen text. We’ll discuss how you can automate the process and how it enhances the viewing experience for all audiences. So, let’s get started!

What is AWS Transcribe

AWS Transcribe is a service that converts spoken words into text. This tool offers a convenient and effective method for transcribing audio content into written form. It supports a variety of audio formats, making it suitable for various applications like transcribing customer service calls, interviews, and meetings.

To utilize Transcribe, you must have an AWS account. You can access Transcribe through the AWS Management Console, AWS Command Line Interface (CLI), or AWS SDKs. Before you begin transcribing, ensure that you have the appropriate permissions to utilize the Transcribe service.

How does speech-to-text work

Speech-to-text software operates by listening to audio and providing an editable, verbatim transcript on a designated device. This software achieves this through voice recognition. It employs linguistic algorithms within a computer program to differentiate auditory signals from spoken words and convert those signals into text using Unicode characters.

The conversion process from speech to text involves a sophisticated machine-learning model that comprises several stages. Let’s delve deeper into how this process unfolds:

  • When a person speaks, their vocal cords produce vibrations that create sounds. Speech-to-text technology functions by detecting these vibrations and converting them into digital language using an analog-to-digital converter.
  • The analog-to-digital converter captures sounds from an audio file, meticulously measures the waveforms, and filters them to isolate the relevant sounds.
  • These sounds are then segmented into hundredths or thousandths of seconds and matched to phonemes. A phoneme is the smallest unit of sound that distinguishes one word from another in a language. For example, English has approximately 40 phonemes.
  • The phonemes are processed through a network using a mathematical model that compares them to well-known sentences, words, and phrases.
  • Finally, the output is presented as text or a computer-generated voice based on the most probable interpretation of the audio.
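
The segmentation step in the list above can be pictured with a toy example. This is not how Amazon Transcribe is implemented internally (the source does not say); it only illustrates cutting digitized audio into 10-millisecond frames, and `frame_signal` is an illustrative name.

```python
def frame_signal(samples, sample_rate: int, frame_ms: int = 10):
    """Split a sequence of samples into consecutive frames of frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame is shorter than one sample")
    # The final frame may be shorter if the audio does not divide evenly.
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


if __name__ == "__main__":
    one_second = [0] * 16000  # one second of silence sampled at 16 kHz
    frames = frame_signal(one_second, sample_rate=16000)
    print(len(frames), "frames of", len(frames[0]), "samples each")
```

Speech recognizers commonly use overlapping windows rather than the back-to-back frames shown here; the non-overlapping version just keeps the idea easy to see.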

Top features

Let’s take you through some of the top features of the AWS Transcribe service.

On-demand & live streaming content support

You need just a single service API to manage both on-demand and live-streaming content. Supported formats for on-demand videos include FLAC, MP3, MP4, Ogg, WebM, AMR, and WAV. For live streaming, the API is available over HTTP/2 and WebSocket connections.

The list of audio codecs supported has been sorted from best quality to worst quality below:

  • FLAC (lossless)
  • WAV (lossless)
  • MP3 (lossy)
  • MP4 (lossy)
  • Ogg (lossy)
  • WebM (lossy)
  • AMR (lossy)

Custom language models (CLM)

Enhance the accuracy of AWS Transcribe by incorporating domain-specific terminology, such as names, acronyms, and slang, using custom vocabulary and CLM. For batch processing (VoD), custom vocabulary and CLM can be utilized to achieve the highest accuracy levels.

Live streaming subtitle stabilization

Improve the live subtitling experience in video broadcasts and in-game chat by controlling the stabilization level of partial transcription results. This provides the flexibility to display partial sentence results instead of waiting for the entire sentence to be subtitled.

WebVTT and SubRip support

Batch transcription jobs can output subtitles in WebVTT (.vtt) and SubRip (.srt) formats, which can be used as video subtitles in existing workflows. In both formats, the output files reflect any content redaction and vocabulary filters, and distinguish between multiple speakers.
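
In the API, subtitle output is requested with the `Subtitles` parameter of `start_transcription_job`. The helper below is a sketch that only builds the request arguments; its name and defaults are illustrative.

```python
SUBTITLE_FORMATS = {"vtt", "srt"}


def subtitle_job_request(job_name: str, media_uri: str,
                         language_code: str = "en-US",
                         formats=("vtt", "srt")) -> dict:
    """Build start_transcription_job arguments that also request subtitle files."""
    unknown = set(formats) - SUBTITLE_FORMATS
    if unknown:
        raise ValueError(f"unsupported subtitle formats: {sorted(unknown)}")
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},
        "Subtitles": {"Formats": list(formats)},
    }
```

The result would be passed as `boto3.client("transcribe").start_transcription_job(**request)`.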

Vocabulary filtering

A high-quality user experience is maintained by filtering specific slang, profanity, or inappropriate terminology. You can create and utilize multiple vocabulary filter lists to create subtitles suitable for adult or child audiences based on tags.
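
A vocabulary filter is applied to a batch job through the `Settings` argument of `start_transcription_job`. The sketch below assumes a filter with the given name has already been created in your account; `filter_settings` and the filter name "kids-filter" are hypothetical.

```python
FILTER_METHODS = {"remove", "mask", "tag"}


def filter_settings(filter_name: str, method: str = "mask") -> dict:
    """Return the Settings fragment that applies an existing vocabulary filter."""
    if method not in FILTER_METHODS:
        raise ValueError(f"method must be one of {sorted(FILTER_METHODS)}")
    return {
        "Settings": {
            "VocabularyFilterName": filter_name,
            "VocabularyFilterMethod": method,
        }
    }


if __name__ == "__main__":
    request = {
        "TranscriptionJobName": "demo-job",  # hypothetical job name
        "LanguageCode": "en-US",
        "Media": {"MediaFileUri": "s3://bucket/file.mp3"},
        **filter_settings("kids-filter"),  # hypothetical filter name
    }
    print(request["Settings"])
```

Keeping one filter per audience and choosing it at job creation time is one way to produce different subtitle versions from the same audio.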

Multilingual subtitle options

You can provide subtitles in multiple languages or translate generated subtitles for content localization with Amazon Transcribe. It helps you extend the reach of your content. Amazon Transcribe supports multiple languages, reducing the need for diverse language expertise.

To use AWS Console for the Transcribe process, you need to follow the steps given below:

1. First, extract a high-quality audio file from the target video file. Supported formats for on-demand (batch) transcription include FLAC, MP3, MP4, Ogg, WebM, AMR, and WAV. You can use HandBrake, ffmpeg, or any other tool to extract the audio track. Then upload the audio file to the S3 bucket.

2. Choose “Amazon Transcribe” from the “Services” Menu in the AWS Console.

3. Provide the S3 URL of the source file from which the caption file should be generated.

4. After clicking on “Next,” AWS Transcribe will process your audio file and take some time to generate a subtitle file based on the data provided during the previous steps. Once the process is complete, you will see a screen like the one below, which will display the output directory of the subtitle file.
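Step 1 can also be scripted. The sketch below assumes ffmpeg is installed and on the PATH, and that boto3 is available for the upload; the function names, file names, and bucket name are placeholders.

```python
import subprocess


def ffmpeg_extract_command(video_path: str, audio_path: str) -> list:
    """Build an ffmpeg command that drops the video stream and encodes FLAC audio."""
    return ["ffmpeg", "-i", video_path, "-vn", "-c:a", "flac", audio_path]


def extract_and_upload(video_path: str, audio_path: str, bucket: str, key: str) -> str:
    """Extract the audio track, upload it to S3, and return its S3 URI."""
    subprocess.run(ffmpeg_extract_command(video_path, audio_path), check=True)
    import boto3  # imported here so the command builder works without boto3
    boto3.client("s3").upload_file(audio_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

For example, `extract_and_upload("talk.mp4", "talk.flac", "my-media-bucket", "audio/talk.flac")` would return the URI to paste into step 3.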

Here is the sample PHP code for using the AWS SDK for the Transcribe process:

[Image: sample PHP code using the AWS SDK]

You can refer to the official AWS documentation to learn more about the process in detail.

To start a new transcription, use the start-transcription-job command.

[Image: example start-transcription-job command]

Amazon Transcribe responds with:

[Image: example response from Amazon Transcribe]

To understand it more clearly, you can refer to the AWS CLI page.  
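The same operation is available from the SDKs. Below is a minimal Python sketch, not the code from the original post: it starts a job and polls until it finishes. The `transcribe` argument is assumed to be a boto3 Transcribe client, and the function name, job name, and URI are placeholders.

```python
import time


def start_and_wait(transcribe, job_name, media_uri,
                   language_code="en-US", poll_seconds=10):
    """Start a transcription job and block until it completes or fails."""
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode=language_code,
        Media={"MediaFileUri": media_uri},
    )
    while True:
        job = transcribe.get_transcription_job(
            TranscriptionJobName=job_name
        )["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            # Location of the JSON transcript produced by the job.
            return job["Transcript"]["TranscriptFileUri"]
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription failed"))
        time.sleep(poll_seconds)
```

A typical call would be `start_and_wait(boto3.client("transcribe"), "my-first-job", "s3://my-media-bucket/audio/talk.flac")`. Polling is the simplest approach; an event-driven setup avoids the wait loop.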

Here are the steps to automate AWS Transcribe with PHP:

  • Set Up an S3 Bucket: Create an S3 bucket where you will upload your video/audio files for transcription.
  • Configure AWS Event Notification: Set up AWS Event Notification on the S3 bucket to trigger an event whenever a new file is uploaded. This event will be used to store details in your database using PHP logic.
  • Create a Lambda Function: Develop a Lambda function that will be triggered by the AWS Event Notification when AWS Transcribe completes its job. This function will store the output of the transcription in the specified S3 bucket.
  • Create an API: Build an API that accepts a video file URL. This API will call the Lambda function to launch an EC2 server. Inside the EC2 server, the audio track will be extracted and uploaded to the S3 bucket. Once the audio file is stored in the bucket, another API will be called to initiate the Transcribe job using the AWS SDK.
  • Process Transcription Job Completion: After the Transcribe job finishes, AWS will store the output VTT/SRT file in S3. The AWS Event Notification will notify your microservice, where you will write logic to store all details in a database. You can then use these details to show the status of the process.
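The post describes this flow with PHP; as a language-neutral illustration, here is a rough Python sketch of a Lambda handler that reacts to an S3 upload event and starts a Transcribe job. It is not the author's implementation. The output bucket name and the function names are hypothetical, and writing results to a separate bucket is an assumption made here so that the output does not retrigger the upload event.

```python
import re
import urllib.parse

OUTPUT_BUCKET = "my-transcripts-bucket"  # hypothetical bucket for results


def job_request_from_s3_event(event: dict) -> dict:
    """Turn an S3 'object created' event into a start_transcription_job request."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["object"]["key"])
    # Job names may only contain letters, digits, '.', '_' and '-'.
    job_name = re.sub(r"[^0-9a-zA-Z._-]", "-", key)
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "OutputBucketName": OUTPUT_BUCKET,
        "Subtitles": {"Formats": ["vtt", "srt"]},
    }


def lambda_handler(event, context):
    import boto3  # available by default in the AWS Lambda Python runtime
    request = job_request_from_s3_event(event)
    boto3.client("transcribe").start_transcription_job(**request)
    return {"job": request["TranscriptionJobName"]}
```

A second notification on the output bucket could then tell your service that the VTT/SRT files are ready.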

These steps will help you automate the transcription process using AWS Transcribe and PHP. The entire process has been depicted in the flow diagram given below. 

AWS Transcribe Process Workflow

In this blog, we saw how AWS Transcribe can be used to convert speech to text, and how it can be used to automate the captions and subtitles in streaming. By doing so, you can deliver a better streaming experience to your viewers, make your videos more accessible and localized, and hence increase the reach and revenue of your streaming business. 

If you are looking for a streaming SaaS solution that helps you build your own-branded streaming platform with over 500 industry-leading features, including automated captions and subtitles, out-of-the-box, without any coding, then Muvi One is the right choice for you. And you can try it for FREE for 14 days, without giving away your credit card details! So, why wait? Sign up to start your 14-day FREE trial today. 

What is AWS Transcribe?

AWS Transcribe is an Amazon Web Service that uses machine learning to convert speech to text. It’s essentially an automatic speech recognition (ASR) tool.

What are the key features of AWS Transcribe?

Here are some of the key features of AWS Transcribe:

  • Batch transcriptions: You can upload audio files to Amazon S3 and then use AWS Transcribe to transcribe them.
  • Streaming transcriptions: You can transcribe live audio streams in real-time.
  • Customizable models: You can create custom models that are trained on your specific vocabulary, which can improve the accuracy of your transcripts.
  • Support for multiple languages: AWS Transcribe supports a variety of languages, including English, Spanish, French, German, and Hindi.

What are the benefits of using AWS Transcribe for a video streaming service?

AWS Transcribe offers several benefits for video streaming, making your content more accessible and engaging for a wider audience:

  • Increased Accessibility: AWS Transcribe can generate subtitles and closed captions for your video streams. This benefits viewers who are deaf or hard of hearing, those who speak a different language, or those watching in noisy environments where they can’t hear the audio clearly.
  • Improved Engagement: Subtitles can keep viewers engaged by allowing them to follow along with the dialogue even if they’re not actively listening. This can be especially helpful for viewers who are multitasking or watching in a second language.
  • Real-time Captioning: For live streams, AWS Transcribe offers real-time streaming transcription. This allows viewers to follow the conversation as it happens, even if they can’t hear the audio.

Written by: Tanmaya Patra

Tanmaya Patra is a skilled Full stack Software Engineer at Muvi, specializing in PHP, React, Hybrid app development, Docker, and Linux. He plays a pivotal role in the company's Video/Audio Encoding Department, conducting extensive research and engineering cutting-edge solutions. With a passion for technology and a meticulous approach to problem-solving, he consistently delivers innovative solutions that meet and exceed client expectations.

Copyrights ©2024 Muvi

Blog – Speechactors

Whisper vs. AWS Transcribe: Decoding Audio Transcription

Speech-to-text technology isn’t new in 2023, but a sudden leap in quality means something exciting is happening! With the launch of Whisper, OpenAI’s open-source speech recognition model, in September 2022, the bar has been set high.

So, how does it compare to the seasoned player AWS Transcribe? Let’s dive in and find out which is the ultimate choice for your project.

The Tech Behind Whisper

OpenAI generally plays it close to the chest and rarely open-sources its models. But with Whisper, they broke the mold! This gem has been trained on a whopping 680,000 hours of data, including low-quality recordings, to maximise accuracy.

Whisper’s Many Models

Whisper comes in multiple model sizes:

  • Tiny
  • Base
  • Small
  • Medium
  • Large

Whisper is open-source, which means you can install it on your own server. However, if you choose the Large model, which is the most accurate, you need at least 10 GB of VRAM. It is also slower than the Tiny model.

But that’s not all: OpenAI recently announced an API for Whisper. If your usage isn’t heavy, the API will be the more affordable option.

Whisper understands an incredible 97 languages and even offers translation services. Choose your desired language, and Whisper will handle the rest.
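For the self-hosted route, a minimal sketch with the open-source `openai-whisper` Python package might look like this. The VRAM figures are the approximate requirements listed in the Whisper README, the helper names are hypothetical, and the package (plus ffmpeg) is assumed to be installed.

```python
# Approximate VRAM needs per model size, largest first (from the Whisper README).
APPROX_VRAM_GB = [("large", 10), ("medium", 5), ("small", 2), ("base", 1), ("tiny", 1)]


def pick_model(available_vram_gb: float) -> str:
    """Pick the largest model size that fits in the available VRAM."""
    for name, needed in APPROX_VRAM_GB:
        if available_vram_gb >= needed:
            return name
    return "tiny"  # smallest model; can also run on CPU, only slower


def transcribe_file(audio_path: str, available_vram_gb: float,
                    translate_to_english: bool = False) -> str:
    """Transcribe (or translate to English) an audio file with Whisper."""
    import whisper  # provided by the openai-whisper package
    model = whisper.load_model(pick_model(available_vram_gb))
    task = "translate" if translate_to_english else "transcribe"
    return model.transcribe(audio_path, task=task)["text"]
```

Smaller models trade accuracy for speed, so the right choice depends on your hardware and how much error you can tolerate.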

If you want to try a demo of Whisper, you can visit listenmonster, a transcription and subtitle tool for internet creators. It currently uses the large-v2 model.

Listenmonster functions as a complementary tool to Speechactors , providing an integrated experience that significantly enhances creative possibilities when utilized jointly.

What’s Cooking with AWS Transcribe?

AWS Transcribe is an AI speech-to-text service that is available through an API. It is a closed model, so nobody outside AWS knows what is behind the scenes.

It’s API-only, meaning there’s less flexibility to reduce costs as you scale up. AWS Transcribe covers 40 languages, some of them, like English and Spanish, with multiple accents.

AWS Transcribe was launched in 2017. Unlike Whisper, it does not offer a translation service. If you need translation, you have to use AWS’s separate translation API, which simply means more cost.

If you want to test it without writing any code, you can sign up for AWS and search for AWS Transcribe in the console.

The Head-to-Head Comparison

When it comes to turning spoken words into text, you want the most accurate tool. In tests, Whisper comes out on top compared to AWS Transcribe. This isn’t just for English; it’s the same for other languages too. 

Whisper was trained using not-so-great audio on purpose, just to make sure it works well even when conditions aren’t ideal. Want to see for yourself? Try out AWS Transcribe by making an account and then compare it with Whisper by going to listenmonster.

Languages Supported

Whisper supports 97 languages, without separate accent variants. You can check all the languages listed here.

AWS Transcribe supports only 40 languages, although some of them, such as English and French, come in multiple accents. You can check the full list here.

In simple words, Whisper is the clear winner here once again.

AWS Transcribe’s API follows a pay-as-you-go model:

  • First 250,000 minutes: $0.02400/min
  • Next 750,000 minutes: $0.01500/min
  • Over 5,000,000 minutes: $0.00780/min

Whisper, on the other hand, offers both API and open-source options. API costs start at $0.006/min, and some third-party services even offer rates as low as $0.001445/min.
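To see what these rates mean for a given volume, here is a small worked sketch. It uses only the first two AWS tiers, because the list above gives no rate between 1,000,000 and 5,000,000 minutes, and it treats the Whisper API price as a flat $0.006 per minute. The function names are made up for the example.

```python
# (tier size in minutes, price per minute) for the two contiguous tiers above.
AWS_TIERS = [(250_000, 0.024), (750_000, 0.015)]
WHISPER_API_RATE = 0.006  # flat price per minute


def aws_transcribe_cost(minutes: int) -> float:
    """Tiered cost in USD for up to 1,000,000 minutes."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    remaining, cost = minutes, 0.0
    for tier_size, rate in AWS_TIERS:
        used = min(remaining, tier_size)
        cost += used * rate
        remaining -= used
    if remaining > 0:
        raise ValueError("no rate listed here for volumes above 1,000,000 minutes")
    return cost


def whisper_api_cost(minutes: int) -> float:
    """Flat-rate cost in USD for the Whisper API."""
    return minutes * WHISPER_API_RATE
```

For 300,000 minutes, the AWS figure works out to 250,000 × $0.024 + 50,000 × $0.015 = $6,750, against 300,000 × $0.006 = $1,800 for the Whisper API.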

If you transcribe at a large scale, you can reduce the cost per minute further by installing Whisper on your own server.

Whisper is like the Swiss Army knife of audio transcription. It’s affordable, accurate, and chock-full of features. Whether you opt for its API or install the model on your server, you’re in for some substantial savings.

AWS Transcribe is a closed model, and AWS has more customers. AWS might improve its transcriptions in the future if it sees a potential threat from OpenAI’s Whisper.

However, at the time of writing this blog post in 2023, Whisper clearly beats AWS Transcribe in every aspect covered here.

Gen AI innovation race is leading to security gaps, according to IBM and AWS

What will it take to secure generative AI?

According to a new study released today by IBM and Amazon Web Services (AWS), there is no simple ‘silver bullet’ solution to secure gen AI, especially now. The report is based on a survey of leading executives at U.S. organizations, conducted by the IBM Institute for Business Value. While gen AI is a top initiative for many, the survey found a high level of enthusiasm for security. 82% of C-suite leaders stated that secure and trustworthy AI is essential for business success.

That said, there is a dichotomy in the results and a difference with what’s actually happening in the real world. The report found that organizations are securing only 24% of their current generative AI projects. IBM isn’t the only firm with a report that raises concerns about security. PwC recently reported that 77% of CEOs are concerned about AI cybersecurity risks. 

Not coincidentally, IBM is working on different approaches with AWS to help improve that situation in the future. Today, IBM is also announcing the IBM X-Force Red Testing Service for AI to further advance generative AI security.

“In all the client conversations I’ve been having I see that leaders are being pulled into different directions,” Dimple Ahluwalia, Global Senior Partner for Cybersecurity Services at IBM Consulting, told VentureBeat. “They feel the pressure certainly from both their internal and external stakeholders to innovate with use of gen AI, but that means for some of them that security becomes an afterthought.”

Innovation or security? Gen AI implementations tend to only pick one

While it might seem that having security is common sense for any type of technology deployment, the reality is that’s not always the case.

The report found that for 69% of organizations, innovation takes precedence over security. Ahluwalia noted that organizations have not fully ingrained security across all lines of business. The report also makes it clear that business leaders understand the importance of security and need to address that issue to make production deployments of gen AI more successful.

“People are so excited that they’re rushing to see if they can get productivity gains, if they can look at how to be more competitive,” she said. 

Ahluwalia said that the same thing happened in the early years of cloud when every conversation had to involve a discussion of moving workloads to the cloud, often without proper security oversight.

“That’s what is happening now with gen AI, everybody feels compelled and are rushing to get to it,” Ahluwalia said. “The plans haven’t been thought through and as a result, I think security kind of suffers as well.”

Guardrails and policy are the keys to gen AI security

So how can and should organizations improve?

The report recommends that in order to build trust in gen AI, organizations must start with governance. That includes establishing policies, processes and controls aligned with business objectives. 81% said generative AI requires new security governance models.

Once governance is set, strategies can address securing the full AI pipeline using available tools and controls. Collaboration across security, technology and business teams is needed. There is also benefit potentially in leveraging technology partners’ expertise for strategy, training, cost justification and navigating compliance.

How IBM X-Force Red Testing Service for AI fits in

Beyond guardrails and governance there is also a need to validate and test. 

IBM X-Force Red’s new Testing Service for AI is IBM’s first testing service tailored specifically for AI. The new service is bringing together a cross-discipline team of experts across penetration testing, AI systems, and data science. The service will also make use of expertise from IBM Research, which developed the Adversarial Robustness Toolbox (ART).

The concept of a ‘red team’ in security generally means there is a group that is taking an adversarial approach in proactively attacking resources in order to help learn where gaps exist. 

Chris Thompson, Global Head of X-Force Red at IBM explained to VentureBeat that the industry adopted the term “AI red teaming” for better or worse recently, primarily with a focus on the safety and security testing of models themselves. In his view, to date there hasn’t been a traditional red team focus on stealth and evasion. Rather the focus has been more on getting models to do something they shouldn’t, such as produce harmful content or gain access to sensitive RAG datasets. 

“Attacks against gen AI apps themselves are very similar to traditional application security attacks but with a new twist and expanded attack surface,” Thompson said.

At this point in 2024, he noted that IBM is seeing more of a convergence with what is considered to be true red teaming. The approach IBM is taking is to look at the wider attack paths into gen AI. The four areas of AI red teaming IBM has developed services around include: AI platforms, the pipeline used to tune and train the models (MLSecOps), the production environment running the gen AI applications, and the gen AI applications themselves. 

“Aligned with traditional red teaming, we’re also focused on missed detection opportunities and reducing the time it takes to detect any potential advanced threat actors successfully targeting these new AI solutions,” Thompson said.

There are more AWS SDK examples available in the AWS Doc SDK Examples GitHub repo.

Convert text to speech and back to text using an AWS SDK

The following code example shows how to:

  • Use Amazon Polly to synthesize a plain text (UTF-8) input file to an audio file.
  • Upload the audio file to an Amazon S3 bucket.
  • Use Amazon Transcribe to convert the audio file to text.
  • Display the text.

For complete source code and instructions on how to set up and run, see the full example on GitHub .
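For orientation only, the round trip can be sketched in Python as follows. This is not the AWS example from GitHub. The three client arguments are assumed to be boto3 clients for Polly, S3, and Transcribe, and the function names, voice, and polling interval are choices made for the sketch.

```python
import time


def synthesize_upload_transcribe(text, bucket, key, job_name,
                                 polly, s3, transcribe, poll_seconds=5):
    """Run the Polly -> S3 -> Transcribe round trip; return the transcript URI."""
    # 1. Synthesize the text to MP3 audio with Amazon Polly.
    speech = polly.synthesize_speech(Text=text, OutputFormat="mp3", VoiceId="Joanna")
    # 2. Upload the audio bytes to the S3 bucket.
    s3.put_object(Bucket=bucket, Key=key, Body=speech["AudioStream"].read())
    # 3. Ask Amazon Transcribe to convert the uploaded audio back to text.
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode="en-US",
        MediaFormat="mp3",
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
    )
    # 4. Poll until the job finishes.
    while True:
        job = transcribe.get_transcription_job(
            TranscriptionJobName=job_name
        )["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job["Transcript"]["TranscriptFileUri"]
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription failed"))
        time.sleep(poll_seconds)


def transcript_text(document: dict) -> str:
    """Extract the plain text from a downloaded Transcribe result document."""
    return document["results"]["transcripts"][0]["transcript"]
```

To display the text, download the JSON document at the returned URI, parse it, and pass it to `transcript_text`.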

Services used in this example

  • Amazon Polly
  • Amazon Transcribe

IMAGES

  1. What is Amazon Transcribe

  2. Building a speech-to-text notification system in different languages

  3. AWS Tutorial

  4. Speech to text using AWS transcribe and python [Step by step]

  5. Build your own text-to-speech applications with Amazon Polly

  6. Aws speech to text

VIDEO

  1. Text to Speech Voice, BRIAN VOICE from Ivona ☺ DOWNLOAD LINK BELOW

  2. Text To Speech AI

  3. Generate AI Images from Text AWS Bedrock #genai #aws #texttoimage

  4. Using Chalice to Build a Text-to-Speech Service Using Polly

  5. AWS Summit Online ASEAN 2020

  6. Vonage Audio Connector Demo

COMMENTS

  1. Speech To Text

    Automatically convert speech to text. Get Started with Amazon Transcribe. Try Free Demo. 60 minutes of speech-to-text for 12 months. ... AWS support for Internet Explorer ends on 07/31/2022. Supported browsers are Chrome, Firefox, Edge, and Safari. Learn more » ...

  2. What is Speech to Text?

    Speech to text is a speech recognition software that enables the recognition and translation of spoken language into text through computational linguistics. It is also known as speech recognition or computer speech recognition. Specific applications, tools, and devices can transcribe audio streams in real-time to display text and act on it.

  3. How Amazon Transcribe works

    How Amazon Transcribe works. Amazon Transcribe uses machine learning models to convert speech to text. In addition to the transcribed text, transcripts contain data about the transcribed content, including confidence scores and timestamps for each word or punctuation mark. To see an output example, refer to the Data input and output section.

  4. What is Amazon Transcribe?

    Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. You can use Amazon Transcribe as a standalone transcription service or to add speech-to-text capabilities to any application. With Amazon Transcribe, you can improve accuracy for your specific use case with language ...

  5. Amazon Transcribe Features

    Amazon Transcribe Features. Amazon Transcribe is a speech foundation model-powered automatic speech recognition (ASR) service that supports over 100 languages. Transcribe's features enable you to ingest audio input, produce easy to read and review transcripts, improve accuracy with customization, and filter content to ensure customer privacy.

  6. Data input and output

    Data input and output. Amazon Transcribe takes audio data, as a media file in an Amazon S3 bucket or a media stream, and converts it to text data. If you're transcribing media files stored in an Amazon S3 bucket, you're performing batch transcriptions. If you're transcribing media streams, you're performing streaming transcriptions.

  7. How to Use AWS Transcribe to Convert Speech to Text

    From the sidebar, select "Transcription Jobs" and click "Create Job." The job serves as a method of automating transcription. Each job works on one file at a time; to automate the transcription of multiple files, you need to create a separate job for each one from the command line. Give Transcribe a path to the audio file you'd like to convert.

  8. What is Amazon Transcribe

    What is Amazon Transcribe?* Amazon Transcribe makes it easy for developers to add speech to text capabilities to their applications. Audio data is virtually ...

  9. Convert Speech to Text with Amazon Transcribe

    If you're trying to convert speech from an audio or video file into text using AWS, this is the video for you. In this video, I show you how to use Amazon Tr...

  10. Build a Serverless Application for Audio to Text conversion

    Last Modified Mar 14, 2024. Amazon Transcribe is a service that utilizes machine learning models to convert speech to text automatically. It offers various features that can enhance the accuracy of the transcribed text, such as language customization, content filtering, multi-channel audio analysis, and individual speaker speech partitioning.

  11. Amazon Transcribe

    Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to their applications. Using the Amazon Transcribe API, you can analyze audio files stored in Amazon S3 and have the service return a text file of the transcribed speech. You can also send a live audio stream to ...

  12. Transcribing streaming audio

    Using Amazon Transcribe streaming, you can produce real-time transcriptions for your media content. Unlike batch transcriptions, which involve uploading media files, streaming media is delivered to Amazon Transcribe in real time. Amazon Transcribe then returns a transcript, also in real time. Streaming can include pre-recorded media (movies ...

  13. AWS's transcription platform is now powered by generative AI

    Announced during the AWS re:Invent event, Amazon Transcribe can now recognize more spoken languages and spin up a call transcription. AWS customers use Transcribe to add speech-to-text ...

  14. Amazon Transcribe now supports speech-to-text in 31 languages

    Amazon Transcribe is an easy-to-use automatic speech recognition (ASR) service that makes it easy to analyze audio files and convert those into text that includes enrichment such as speaker identification, timestamp generation, punctuation, and formatting. With the recent announcement, customers can now transcribe audio from even more languages.

  15. Unlocking the Power of Speech with Amazon Transcribe

    Amazon Transcribe is transforming the way we interact with audio content, offering a powerful tool for speech-to-text conversion. With its advanced features, extensive language support, and integration with AWS services, it provides a comprehensive solution for businesses and developers looking to leverage speech recognition technology.

  16. AWS Transcribe (Speech-To-Text) and Automation: Everything You Need to Know

    Choose "Amazon Transcribe" from the "Services" Menu in the AWS Console. 3. Provide the s3 URL of the source from which we need to generate a Caption file. 4. After clicking on "Next," AWS Transcribe will process your audio file and take some time to generate a subtitle file based on the data provided during the previous steps.

  17. Speech to Text using AWS Transcribe, S3 and Lambda

    AWS Transcribe is the speech-to-text solution provided by Amazon Web Services, which is renowned for being very quick and highly accurate. Under the hood, AWS Transcribe uses a deep learning process named ASR (automatic speech recognition) to convert the audio to text quickly and more accurately. It also has a separate service inside named Amazon ...

  18. Building a Speech-to-Text Transcription System using AWS ...

    In this article, we'll walk through the steps required to build a speech-to-text transcription system using an event-driven architecture and automatic speech recognition. Overview. Our system will consist of several AWS services that work together to transcribe audio files into text. Here's a high-level overview of the architecture:

  19. AWS Transcribe: A Powerful Speech-to-Text Service

    Welcome to our blog post on AWS Transcribe, a cutting-edge speech-to-text service provided by Amazon Web Services (AWS). As developers and tech enthusiasts, we understand the importance of accurate and efficient transcription capabilities for various applications. In this article, we will explore the key features, benefits, and how to get ...

  20. Transcribe speech to text in real time using Amazon Transcribe with

    Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to applications. In November 2018, we added streaming transcriptions over HTTP/2 to Amazon Transcribe. This enabled users to pass a live audio stream to our service and, in return, receive text transcripts in real time.

  21. Speech to Text Magic with React & AWS Transcribe

    This blog will guide you through building a React app that leverages AWS Transcribe to deliver speech-to-text functionality. Pick up audio files from S3: We'll utilise the AWS SDK for JavaScript within the React app to access audio files stored in your S3 bucket. Interact with AWS Transcribe: The code will interact with the Transcribe service ...

  22. Building a Real-Time Speech to Text React Application

    AWS Transcribe: Auto-AI service that allows developers with no ML experience to integrate Speech to Text capabilities in their backend. Identity Access and Management (IAM): Lets you manage access to AWS services through permissions and roles. We will be creating a role for our Amplify User to work with the various AWS services. 2.

  23. Whisper vs. AWS Transcribe: Decoding Audio Transcription

    AWS transcribe is an AI speech-to-text model that is available on API. It is a closed model so nobody knows what is behind the scene. It's API-only, meaning there's less flexibility to reduce costs as you scale up. AWS Transcribe covers 40 languages, some with multiple accents like English and Spanish. AWS Transcribe was launched in 2017.

  24. Amazon Transcribe Pricing

    Let's assume you want to subtitle 5,000 hours of live streaming content with Amazon Transcribe. In US East (N. Virginia), Tier 1 (T1) pricing of $0.024/minute applies to the first 250,000 minutes of transcriptions. Tier 2 pricing of $0.015/minute (38% discount to T1 pricing) applies to the next 750,000 minutes.

  25. Gen AI innovation race is leading to security gaps ...

    According to a new study released today by IBM and Amazon Web Services (AWS) there is no simple 'silver bullet' solution to secure gen AI, especially now. The report is based on a survey ...

  26. Convert text to speech and back to text using an AWS SDK

    Convert text to speech and back to text using an AWS SDK. The following code example shows how to: Use Amazon Polly to synthesize a plain text (UTF-8) input file to an audio file. Upload the audio file to an Amazon S3 bucket. Use Amazon Transcribe to convert the audio file to text. Display the text.