Speech Recognition in Unity3D – The Ultimate Guide

​There are three main strategies in converting user speech input to text:

  • Voice Commands
  • Free Dictation
  • Grammar Mode

These strategies exist in any voice detection engine (Google, Microsoft, Amazon, Apple, Nuance, Intel, or others), so the concepts described here will give you a good reference point for working with any of them. In today’s article, we’ll explore the differences between these methods, understand their use cases, and see a quick implementation of the main ones.

Prerequisites

To write and execute code, you need to install the following software:

  • Visual Studio 2019 Community

Unity3D uses a Microsoft API that works on any Windows 10 device (Desktop, UWP, HoloLens, XBOX). Similar APIs also exist for Android and iOS.

Did you know?…

LightBuzz has been helping Fortune-500 companies and innovative startups create amazing Unity3D applications and games. If you are looking to hire developers for your project, get in touch with us.

Source code

The source code of the project is available in our LightBuzz GitHub account. Feel free to download, fork, and even extend it!

1) Voice commands

We are first going to examine the simplest form of speech recognition: plain voice commands.

Description

Voice commands are predictable single words or expressions, such as:

  • “Forward”
  • “Left”
  • “Fire”
  • “Answer call”

The detection engine listens to the user and compares the result with the various possible interpretations. If one of them matches the spoken phrase within a certain confidence threshold, it’s marked as a proposed answer.

Since this is an all-or-nothing approach, the engine will either recognize the phrase or nothing at all.

This method fails when you have several ways to say one thing. For example, the words “hello”, “hi”, “hey there” are all forms of greeting. Using this approach, you have to define all of them explicitly.

This method is useful for short, expected phrases, such as in-game controls.

Our original article includes detailed examples of using simple voice commands. You may also check out the Voice Commands Scene in the sample project.

Below, you can see the simplest C# code example for recognizing a few words:
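A minimal sketch of such an example uses Unity’s built-in KeywordRecognizer class (UnityEngine.Windows.Speech); the keyword list and log messages here are illustrative:

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

public class VoiceCommands : MonoBehaviour
{
    // The predefined phrases we want to detect (illustrative).
    private readonly string[] keywords = { "forward", "left", "fire", "answer call" };

    private KeywordRecognizer recognizer;

    private void Start()
    {
        recognizer = new KeywordRecognizer(keywords);
        recognizer.OnPhraseRecognized += OnPhraseRecognized;
        recognizer.Start();
    }

    private void OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        // args.confidence tells you how certain the engine is about the match.
        Debug.Log($"Recognized: {args.text} (confidence: {args.confidence})");
    }

    private void OnDestroy()
    {
        if (recognizer != null)
        {
            if (recognizer.IsRunning) recognizer.Stop();
            recognizer.Dispose();
        }
    }
}
```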

2) Free Dictation

To solve the challenges of simple voice commands, we shall use the dictation mode.

In this mode, the engine listens for every possible word as the user speaks and tries to find the best possible match for what the user meant to say.

This is the mode your mobile device activates when you dictate a new email by voice. The engine manages to write the text less than a second after you finish saying a word.

Technically, this is really impressive, especially considering that it compares your voice across multi-lingual dictionaries, while also checking grammar rules.

Use this mode for free-form text. If your application has no idea what to expect, the Dictation mode is your best bet.

You can see an example of the Dictation mode in the sample project   Dictation Mode Scene . Here is the simplest way to use the Dictation mode:
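A minimal sketch with Unity’s DictationRecognizer (UnityEngine.Windows.Speech); the log messages are illustrative:

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

public class Dictation : MonoBehaviour
{
    private DictationRecognizer dictationRecognizer;

    private void Start()
    {
        dictationRecognizer = new DictationRecognizer();

        // Fired very quickly while the user is still speaking (may contain errors).
        dictationRecognizer.DictationHypothesis += text =>
            Debug.Log($"Hypothesis: {text}");

        // Fired after a short pause, with the most probable sentence.
        dictationRecognizer.DictationResult += (text, confidence) =>
            Debug.Log($"Result: {text} ({confidence})");

        // Fired when the engine shuts down; some causes just need a restart.
        dictationRecognizer.DictationComplete += cause =>
        {
            if (cause != DictationCompletionCause.Complete)
                Debug.LogWarning($"Dictation stopped: {cause}");
        };

        // Fired for other, unpredictable errors.
        dictationRecognizer.DictationError += (error, hresult) =>
            Debug.LogError($"Dictation error: {error} (HRESULT: {hresult})");

        dictationRecognizer.Start();
    }

    private void OnDestroy()
    {
        dictationRecognizer?.Dispose();
    }
}
```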

As you can see, we first create a new dictation engine and register for the possible events.

  • It starts with DictationHypothesis events, which are thrown really fast as the user speaks. However, hypothesized phrases may contain lots of errors.
  • DictationResult is an event thrown after the user stops speaking for 1–2 seconds. It’s only then that the engine provides a single sentence with the highest probability.
  • DictationComplete is thrown on several occasions when the engine shuts down. Some occasions are irreversible technical issues, while others just require a restart of the engine to get back to work.
  • DictationError is thrown for other unpredictable errors.

Here are two general rules-of-thumb:

  • For the highest quality, use DictationResult .
  • For the fastest response, use DictationHypothesis .

Having both quality and speed is impossible with this technique.

Is it even possible to combine high-quality recognition with high speed?

Well, there is a reason we are not yet using voice commands the way Iron Man does: in real-world applications, users frequently complain about typing errors, which probably occur in less than 10% of cases… Dictation makes many more mistakes than that.

To increase accuracy and keep the speed fast at the same time, we need the best of both worlds — the freedom of the Dictation and the response time of the Voice Commands.

The solution is  Grammar Mode . This mode requires us to write a dictionary. A dictionary is an XML file that defines various rules for the things that the user will potentially say. This way, we can ignore languages we don’t need, and phrases the user will probably not use.

The grammar file also tells the engine which words it can expect to hear next, shrinking the possibilities from anything to a limited set. This significantly increases performance and quality.

For example, using a Grammar, we could greet with either of these phrases:

  • “Hello, how are you?”
  • “Hi there”
  • “Hey, what’s up?”
  • “How’s it going?”

All of those could be listed in a single rule. Then, if the user starts saying something that sounds like "Hello", the engine only needs to differentiate it from the other phrases in the grammar (e.g. "Ciao"), rather than from every similar-sounding word in the language (e.g. "Yellow" or "Halo").

We are going to see how to create our own Grammar file in a future article.

For your reference, this is the official specification for structuring a Grammar file .
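Once you have a grammar file, loading it in Unity takes only a few lines with the built-in GrammarRecognizer class (UnityEngine.Windows.Speech). Here is a minimal sketch, assuming an SRGS file named greetings.xml placed in StreamingAssets:

```csharp
using System.IO;
using UnityEngine;
using UnityEngine.Windows.Speech;

public class GrammarCommands : MonoBehaviour
{
    private GrammarRecognizer recognizer;

    private void Start()
    {
        // greetings.xml is an SRGS grammar file listing the allowed phrases.
        string grammarPath = Path.Combine(Application.streamingAssetsPath, "greetings.xml");

        recognizer = new GrammarRecognizer(grammarPath);
        recognizer.OnPhraseRecognized += OnPhraseRecognized;
        recognizer.Start();
    }

    private void OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        Debug.Log($"Recognized: {args.text} (confidence: {args.confidence})");
    }

    private void OnDestroy()
    {
        recognizer?.Dispose();
    }
}
```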

In this tutorial, we described two methods of recognizing voice in Unity3D: Voice Commands and Dictation. Voice Commands are the easiest way to recognize pre-defined words. Dictation is a way to recognize free-form phrases. In a future article, we are going to see how to develop our own Grammar and feed it to Unity3D.

Until then, why don’t you start writing your code by speaking to your PC?

You made it to this point? Awesome! Here is the source code for your convenience.

Before you go…

Sharing is caring.

If you liked this article, remember to share it on social media, so you can help other developers, too! Also, let me know your thoughts in the comments below. ‘Til the next time… keep coding!

Shachar Oz

Shachar Oz is a product manager and UX specialist with extensive experience with emergent technologies, like AR, VR and computer vision. He designed Human Machine Interfaces for the last 10 years for video games, apps, robots and cars, using interfaces like face tracking, hand gestures and voice recognition. Website


11 comments.


Hello, I have a question, while in unity everything works perfectly, but when I build the project for PC, and open the application, it doesn’t work. Please help.


hi Omar, well, i have built it with Unity 2019.1 as well as with 2019.3 and it works perfectly.

i apologize if it doesn’t. please try to make a build from the github source code, and feel free to send us some error messages that occur.



Hello, I’m trying Dictation Recognizer and I want to change the language to Spanish but I still don’t quite get it. Can you help me with this?


hi Alexis, perhaps check if the code here could help you: https://docs.microsoft.com/en-us/windows/apps/design/input/specify-the-speech-recognizer-language


You need an object – protected PhraseRecognizer recognizer;

in the example nr 1. Take care and thanks for this article!

Thank you Carl. Happy you liked it.


does this support android builds

Hi there. Sadly not. Android and ios have different speech api. this api supports microsoft devices.


Any working example for the grammar case?

Well, you can find this example from Microsoft. It should work anyway on PC. A combination between Grammar and machine learning is how most of these mechanisms work today.

https://learn.microsoft.com/en-us/dotnet/api/system.speech.recognition.grammar?view=netframework-4.8.1#examples


How to use Text-to-Speech in Unity

Enhance your Unity game by integrating artificial intelligence capabilities. This Unity AI tutorial will walk you through the process of using the Eden AI Unity Plugin, covering key steps from installation to implementing various AI models.

What is Unity?


Established in 2004, Unity is a gaming company offering a powerful game development engine that empowers developers to create immersive games across various platforms, including mobile devices, consoles, and PCs. 

If you're aiming to elevate your gameplay, Unity allows you to integrate artificial intelligence (AI), enabling intelligent behaviors, decision-making, and advanced functionalities in your games or applications.

Unity Eden AI Plugin

Unity offers multiple paths for AI integration. Notably, the Unity Eden AI Plugin effortlessly syncs with the Eden AI API, enabling easy integration of AI tasks like text-to-speech conversion within your Unity applications.

Benefits of integrating Text to Speech into video game development

Integrating Text-to-Speech (TTS) into video game development offers a range of benefits, enhancing both the gaming experience and the overall development process:

1. Immersive Player Interaction

TTS enables characters in the game to speak, providing a more immersive and realistic interaction between players and non-player characters (NPCs).

2. Accessibility for Diverse Audiences

TTS can be utilized to cater to a diverse global audience by translating in-game text into spoken words, making the gaming experience more accessible for players with varying linguistic backgrounds.

3. Customizable Player Experience

Developers can use TTS to create personalized and adaptive gaming experiences, allowing characters to respond dynamically to player actions and choices.

4. Innovative Gameplay Mechanics

Game developers can introduce innovative gameplay mechanics by incorporating voice commands, allowing players to control in-game actions using spoken words, leading to a more interactive gaming experience.

5. Adaptive NPC Behavior

NPCs with TTS capabilities can exhibit more sophisticated and human-like behaviors, responding intelligently to player actions and creating a more challenging and exciting gaming environment.

6. Multi-Modal Gaming Experiences

TTS opens the door to multi-modal gaming experiences, combining visual elements with spoken dialogues, which can be especially beneficial for players who prefer or require alternative communication methods.

Integrating TTS into video games enhances the overall gameplay, contributing to a more inclusive, dynamic, and enjoyable gaming experience for players.

Use cases of Video Game Text-to-Speech Integration

Text-to-Speech (TTS) integration in video games introduces various use cases, enhancing player engagement, accessibility, and overall gaming experiences. Here are several applications of TTS in the context of video games:

Quest Guidance

TTS can guide players through quests by providing spoken instructions, hints, or clues, offering an additional layer of assistance in navigating game objectives.

Interactive Conversations

Enable players to engage in interactive conversations with NPCs through TTS, allowing for more realistic and dynamic exchanges within the game world.

Accessibility for Visually Impaired Players

TTS aids visually impaired players by converting in-game text into spoken words, providing crucial information about game elements, menus, and story developments.

Character AI Interaction

TTS can enhance interactions with AI-driven characters by allowing them to vocally respond to player queries, creating a more realistic and immersive gaming environment.

Interactive Learning Games

In educational or serious games, TTS can assist in delivering instructional content, quizzes, or interactive learning experiences, making the gameplay educational and engaging.

Procedural Content Generation

TTS can contribute to procedural content generation by dynamically narrating events, backstory, or lore within the game, adding depth and context to the gaming world.

Integrating TTS into video games offers a versatile set of applications that go beyond traditional text presentation, providing new dimensions of interactivity, accessibility, and storytelling.

How to integrate TTS into your video game with Unity

Step 1. Install the Eden AI Unity Plugin

Eden AI Unity Plugin

Ensure that you have a Unity project open and ready for integration. If you haven't installed the Eden AI plugin, follow these steps:

  • Open your Unity Package Manager
  • Add package from GitHub

Step 2. Obtain your Eden AI API Key

To get started with the Eden AI API, you need to sign up for an account on the Eden AI platform . 

Once registered, you will get an API key which you will need to use the Eden AI Unity Plugin. You can set it in your script or add a file auth.json to your user folder (path: ~/.edenai (Linux/Mac) or %USERPROFILE%/.edenai/ (Windows)) as follows:

Alternatively, you can pass the API key as a parameter when creating an instance of the EdenAIApi class. If the API key is not provided, it will attempt to read it from the auth.json file in your user folder.

Step 3. Integrate Text-to-Speech on Unity

Bring vitality to your non-player characters (NPCs) by empowering them to vocalize through the implementation of text-to-speech functionality. 

Leveraging the Eden AI plugin, you can seamlessly integrate a variety of services, including Google Cloud, OpenAI, AWS, IBM Watson, LovoAI, Microsoft Azure, and ElevenLabs text-to-speech providers, into your Unity project (refer to the complete list here ).

Text-to-speech on Eden AI

This capability allows you to tailor the voice model, language, and audio format to align with the desired atmosphere of your game.

1. Open your script file where you want to implement the text-to-speech functionality.

2. Import the required namespaces at the beginning of your script:

3. Create an instance of the Eden AI API class:

4. Implement the SendTextToSpeechRequest function with the necessary parameters:
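As a rough sketch, the flow looks like the code below. The EdenAIApi class, the SendTextToSpeechRequest method, and the TextToSpeechResponse type come from the plugin; the exact parameters and response fields shown here are assumptions, so check the plugin documentation for the real signature:

```csharp
using System.Threading.Tasks;
using UnityEngine;

public class NpcSpeaker : MonoBehaviour
{
    public AudioSource audioSource;

    public async Task SpeakAsync(string text)
    {
        // The API key can be passed to the constructor, or read from auth.json (see Step 2).
        EdenAIApi edenAI = new EdenAIApi();

        // Assumed parameters: the text to speak plus provider/language/voice options.
        TextToSpeechResponse response =
            await edenAI.SendTextToSpeechRequest(text /*, provider, language, voice, ... */);

        // Assumption: the response exposes the synthesized audio, e.g. as an AudioClip.
        // audioSource.clip = response.Audio;
        // audioSource.Play();
    }
}
```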

Step 4: Handle the Text-to-Speech Response

The SendTextToSpeechRequest function returns a TextToSpeechResponse object.

Access the response attributes as needed. For example:

Step 5: Customize Parameters (Optional)

The SendTextToSpeechRequest function allows you to customize various parameters:

  • Rate: Adjust speaking rate.
  • Pitch: Modify speaking pitch.
  • Volume: Control audio volume.
  • VoiceModel: Specify a specific voice model.
Include these optional parameters based on your preferences.

Step 6: Test and Debug

Run your Unity project and test the text-to-speech functionality. Monitor the console for any potential errors or exceptions, and make adjustments as necessary.

Now, your Unity project is equipped with text-to-speech functionality using the Eden AI plugin. Customize the parameters to suit your game's atmosphere, and enhance the immersive experience for your players.

TTS integration enhances immersion and opens doors for diverse gameplay experiences. Feel free to experiment with optional parameters for further fine-tuning. Explore additional AI functionalities offered by Eden AI to elevate your game development here . 

About Eden AI

Eden AI is the future of AI usage in companies: our app allows you to call multiple AI APIs.

  • Centralized and fully monitored billing
  • Unified API: quick switch between AI models and providers
  • Standardized response format: the JSON output format is the same for all suppliers.
  • The best Artificial Intelligence APIs in the market are available
  • Data protection: Eden AI will not store or use any data.


Unity Speech Recognition

This article serves as a comprehensive guide for adding on-device Speech Recognition to an Unity project.

When used casually, Speech Recognition usually refers solely to Speech-to-Text . However, Speech-to-Text represents only a single facet of Speech Recognition technologies. It also refers to features such as Wake Word Detection , Voice Command Recognition , and Voice Activity Detection ( VAD ). In the context of Unity projects, Speech Recognition can be used to implement a Voice Interface .

Fortunately Picovoice offers a few tools to help implement Voice Interfaces . If all that is needed is to recognize when specific phrases or words are said, use Porcupine Wake Word . If Voice Commands need to be understood and intent extracted with details (i.e. slot values), Rhino Speech-to-Intent is more suitable. Keep reading to see how to quickly start with both of them.

Picovoice Unity SDKs have cross-platform support for Linux , macOS , Windows , Android and iOS !

Porcupine Wake Word

To integrate the Porcupine Wake Word SDK into your Unity project, download and import the latest Porcupine Unity package .

Sign up for a free Picovoice Console account and obtain your AccessKey . The AccessKey is only required for authentication and authorization.

Create a custom wake word model using Picovoice Console.

Download the .ppn model file and copy it into your project's StreamingAssets folder.

Write a callback that takes action when a keyword is detected:

  • Initialize the Porcupine Wake Word engine with the callback and the .ppn file name (or path relative to the StreamingAssets folder):
  • Start detecting (see the sketch below):
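Putting those steps together, a minimal sketch might look like this (it assumes the PorcupineManager helper class from the Porcupine Unity package; exact namespaces and signatures may differ):

```csharp
using System.Collections.Generic;
using Pv.Unity;
using UnityEngine;

public class WakeWordListener : MonoBehaviour
{
    private PorcupineManager porcupineManager;

    private void Start()
    {
        const string accessKey = "YOUR_PICOVOICE_ACCESS_KEY"; // from Picovoice Console

        // .ppn model file copied into StreamingAssets (see the steps above).
        var keywordPaths = new List<string> { "my-wake-word.ppn" };

        // The callback receives the index of the detected keyword.
        porcupineManager = PorcupineManager.FromKeywordPaths(accessKey, keywordPaths, OnWakeWordDetected);

        // Start listening to the microphone.
        porcupineManager.Start();
    }

    private void OnWakeWordDetected(int keywordIndex)
    {
        Debug.Log($"Wake word detected (index {keywordIndex})");
    }

    private void OnDestroy()
    {
        porcupineManager?.Delete();
    }
}
```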

For further details, visit the Porcupine Wake Word product page or refer to Porcupine's Unity SDK quick start guide .

Rhino Speech-to-Intent

To integrate the Rhino Speech-to-Intent SDK into your Unity project, download and import the latest Rhino Unity package .

Create a custom context model using Picovoice Console.

Download the .rhn model file and copy it into your project's StreamingAssets folder.

Write a callback that takes action when a user's intent is inferred:

  • Initialize the Rhino Speech-to-Intent engine with the callback and the .rhn file name (or path relative to the StreamingAssets folder):
  • Start inferring (see the sketch below):
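Similarly, a minimal sketch for Rhino might look like this (it assumes the RhinoManager helper class from the Rhino Unity package; exact namespaces and signatures may differ):

```csharp
using Pv.Unity;
using UnityEngine;

public class VoiceCommandListener : MonoBehaviour
{
    private RhinoManager rhinoManager;

    private void Start()
    {
        const string accessKey = "YOUR_PICOVOICE_ACCESS_KEY"; // from Picovoice Console

        // .rhn context file copied into StreamingAssets (see the steps above).
        rhinoManager = RhinoManager.Create(accessKey, "my-context.rhn", OnInferenceResult);

        // Listen until a single intent has been inferred.
        rhinoManager.Process();
    }

    private void OnInferenceResult(Inference inference)
    {
        if (inference.IsUnderstood)
        {
            Debug.Log($"Intent: {inference.Intent}");
            foreach (var slot in inference.Slots)
                Debug.Log($"  {slot.Key}: {slot.Value}");
        }
    }

    private void OnDestroy()
    {
        rhinoManager?.Delete();
    }
}
```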

For further details, visit the Rhino Speech-to-Intent product page or refer to Rhino's Android SDK quick start guide .


Introducing the Unity Text-to-Speech Plugin from ReadSpeaker


As a game developer, how will you use text to speech (TTS)?

We’ve only begun to discover what this tool can do in the hands of creators. What we do know is that TTS can solve tough development problems , that it’s a cornerstone of accessibility , and that it’s a key component of dynamic AI-enhanced characters: NPCs that carry on original conversations with players.

There have traditionally been a few technical roadblocks between TTS and the game studio: Devs find it cumbersome to create and import TTS sound files through an external TTS engine. Some TTS speech labors under perceptible latency, making it unsuitable for in-game audio. And an unintegrated TTS engine creates a whole new layer of project management, threatening already drum-tight production schedules.

What devs need is a latency-free TTS tool they can use independently, without leaving the game engine—and that’s exactly what you get with ReadSpeaker AI’s Unity text-to-speech plugin.

ReadSpeaker AI’s Unity Text-to-Speech Plugin

ReadSpeaker AI offers a market-ready TTS plugin for Unity and Unreal Engine, and will work with studios to provide APIs for other game engines. For now, though, we’ll confine our discussion to Unity, which claims nearly 65% of the game development engine market. ReadSpeaker AI’s TTS plugin is an easy-to-install tool that allows devs to create and manipulate synthetic speech directly in Unity: no file management, no swapping between interfaces, and a deep library of rich, lifelike TTS voices. ReadSpeaker AI uses deep neural networks (DNN) to create AI-powered TTS voices of the highest quality, complete with industry-leading pronunciation thanks to custom pronunciation dictionaries and linguist support.

With this neural TTS at their fingertips, developers can improve the game development process—and the player’s experience—limited only by their creativity. So far, we’ve identified four powerful uses for a TTS game engine plugin. These include:

  • User interface (UI) narration for accessibility. User interface narration is an accessibility feature that remediates barriers for players with vision impairments and other disabilities; TTS makes it easy to implement. Even before ReadSpeaker AI released the Unity plugin, The Last of Us Part 2 (released in 2020) used ReadSpeaker TTS for its UI narration feature. A triple-A studio like Naughty Dog can take the time to generate TTS files outside the game engine; those files were ultimately shipped on the game disc. That solution might not work ideally for digital games or independent studios, but a TTS game engine plugin will.
  • Prototyping dialogue at early stages of development. Don’t wait until you’ve got a voice actor in the studio to find out your script doesn’t flow perfectly. The Unity TTS plugin allows developers to draft scenes within the engine, tweaking lines and pacing to get the plan perfect before the recording studio’s clock starts running.
  • Instant audio narration for in-game text chat. Unity speech synthesis from ReadSpeaker AI renders audio instantly at runtime, through a speech engine embedded in the game files, so it’s ideal for narrating chat messages instantly. This is another powerful accessibility tool—one that’s now required for online multiplayer games in the U.S., according to the 21st Century Communications and Video Accessibility Act (CVAA). But it’s also great for players who simply prefer to listen rather than read in the heat of action.
  • Lifelike speech for AI NPCs and procedurally generated text. Natural language processing allows software to understand human speech and create original, relevant responses. Only TTS can make these conversational voicebots—which is essentially what AI NPCs are—speak out loud. Besides, AI NPCs are just one use of procedurally generated speech in video games. What are the others? You decide. Game designers are artists, and dynamic, runtime TTS from ReadSpeaker AI is a whole new palette.

Text to Speech vs. Human Voice Actors for Video Game Characters

Note that our list of use cases for TTS in game development doesn’t include replacing voice talent for in-game character voices, other than AI NPCs that generate dialogue in real time. Voice actors remain the gold standard for character speech, and that’s not likely to change any time soon. In fact, every great neural TTS voice starts with a great voice actor; they provide the training data that allows the DNN technology to produce lifelike speech, with contracts that ensure fair, ethical treatment for all parties. So while there’s certainly a place for TTS in character voices, they are not a replacement for human talent. Instead, think of TTS as a tool for development, accessibility, and the growing role of AI in gaming.

ReadSpeaker AI brings more than 20 years of experience in TTS, with a focus on performance. That expertise helped us develop an embedded TTS engine that renders audio on the player’s machine, eliminating latency. We also offer more than 90 top-quality voices in over 30 languages, plus SSML support so you can control expression precisely. These capabilities set ReadSpeaker AI apart from the crowd. Curious? Keep reading for a real-world example.

ReadSpeaker AI Speech Synthesis in Action

Soft Leaf Studios used ReadSpeaker AI’s Unity text-to-speech plugin for scene prototyping and UI and story narration for its highly accessible game, in development at publication time, Stories of Blossom . Check out this video to see how it works:

“Without a TTS plugin like this, we would be left guessing what audio samples we would need to generate, and how they would play back,” Conor Bradley, Stories of Blossom lead developer, told ReadSpeaker AI. “The plugin allows us to experiment without the need to lock our decisions, which is a very powerful tool to have the privilege to use.”

This example begs the question every game developer will soon be asking themselves, a variation on the question we started with: What could a Unity text-to-speech plugin do for your next release? Reach out to start the conversation .

Using the Speech-to-Text API with C#

1. Overview

Google Cloud Speech-to-Text API enables developers to convert audio to text in 120 languages and variants, by applying powerful neural network models in an easy to use API.

In this codelab, you will focus on using the Speech-to-Text API with C#. You will learn how to send an audio file in English and other languages to the Cloud Speech-to-Text API for transcription.

What you'll learn

  • How to use the Cloud Shell
  • How to enable the Speech-to-Text API
  • How to Authenticate API requests
  • How to install the Google Cloud client library for C#
  • How to transcribe audio files in English
  • How to transcribe audio files with word timestamps
  • How to transcribe audio files in different languages

What you'll need

  • A Google Cloud Platform Project
  • A browser, such as Chrome or Firefox
  • Familiarity using C#

2. Setup and requirements

Self-paced environment setup

  • Sign-in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one .


  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can always update it.
  • The Project ID is unique across all Google Cloud projects and is immutable (cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference your Project ID (typically identified as PROJECT_ID ). If you don't like the generated ID, you might generate another random one. Alternatively, you can try your own, and see if it's available. It can't be changed after this step and remains for the duration of the project.
  • For your information, there is a third value, a Project Number , which some APIs use. Learn more about all three of these values in the documentation .
  • Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab won't cost much, if anything at all. To shut down resources to avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Google Cloud Shell , a command line environment running in the Cloud.

Activate Cloud Shell


If this is your first time starting Cloud Shell, you're presented with an intermediate screen describing what it is. If you were presented with an intermediate screen, click Continue .


It should only take a few moments to provision and connect to Cloud Shell.


This virtual machine is loaded with all the development tools needed. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

  • Run the following command in Cloud Shell to confirm that you are authenticated:
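```bash
gcloud auth list
```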

Command output

  • Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
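```bash
gcloud config list project
```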

If it is not, you can set it with this command:
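```bash
gcloud config set project <PROJECT_ID>
```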

3. Enable the Speech-to-Text API

Before you can begin using the Speech-to-Text API, you must enable the API. You can enable the API by using the following command in the Cloud Shell:
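```bash
gcloud services enable speech.googleapis.com
```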

4. Install the Google Cloud Speech-to-Text API client library for C#

First, create a simple C# console application that you will use to run Speech-to-Text API samples:
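```bash
# -o creates the project in a folder named SpeechToTextApiDemo
dotnet new console -o SpeechToTextApiDemo
```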

You should see the application created and dependencies resolved:

Next, navigate to SpeechToTextApiDemo folder:
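```bash
cd SpeechToTextApiDemo
```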

And add Google.Cloud.Speech.V1 NuGet package to the project:
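```bash
dotnet add package Google.Cloud.Speech.V1
```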

Now, you're ready to use Speech-to-Text API!

5. Transcribe Audio Files

In this section, you will transcribe a pre-recorded audio file in English. The audio file is available on Google Cloud Storage.

To transcribe an audio file, open the code editor from the top right side of the Cloud Shell:


Navigate to the Program.cs file inside the SpeechToTextApiDemo folder and replace the code with the following:
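A minimal version of that program might look like this (the Cloud Storage URI of the sample audio file is illustrative):

```csharp
using Google.Cloud.Speech.V1;
using System;

namespace SpeechToTextApiDemo
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var speech = SpeechClient.Create();

            var config = new RecognitionConfig
            {
                Encoding = RecognitionConfig.Types.AudioEncoding.Flac,
                LanguageCode = "en-US"
            };

            // Illustrative sample file; any FLAC file in Cloud Storage works.
            var audio = RecognitionAudio.FromStorageUri(
                "gs://cloud-samples-data/speech/brooklyn_bridge.flac");

            var response = speech.Recognize(config, audio);
            foreach (var result in response.Results)
            {
                foreach (var alternative in result.Alternatives)
                {
                    Console.WriteLine(alternative.Transcript);
                }
            }
        }
    }
}
```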

Take a minute or two to study the code and see how it is used to transcribe an audio file.

The Encoding parameter tells the API which type of audio encoding you're using for the audio file. Flac is the encoding type for .raw files (see the doc for encoding type for more details).

In the RecognitionAudio object, you can pass the API either the uri of our audio file in Cloud Storage or the local file path for the audio file. Here, we're using a Cloud Storage uri.

Back in Cloud Shell, run the app:
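```bash
dotnet run
```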

You should see the following output:

In this step, you were able to transcribe an audio file in English and print out the result. Read more about Transcribing .

6. Transcribe with word timestamps

Speech-to-Text can detect time offset (timestamp) for the transcribed audio. Time offsets show the beginning and end of each spoken word in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.

To transcribe an audio file with time offsets, navigate to the Program.cs file inside the SpeechToTextApiDemo folder and replace the code with the following:
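A minimal version with word time offsets enabled might look like this (again, the storage URI is illustrative):

```csharp
using Google.Cloud.Speech.V1;
using System;

namespace SpeechToTextApiDemo
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var speech = SpeechClient.Create();

            var config = new RecognitionConfig
            {
                Encoding = RecognitionConfig.Types.AudioEncoding.Flac,
                LanguageCode = "en-US",
                EnableWordTimeOffsets = true
            };

            var audio = RecognitionAudio.FromStorageUri(
                "gs://cloud-samples-data/speech/brooklyn_bridge.flac");

            var response = speech.Recognize(config, audio);
            foreach (var result in response.Results)
            {
                var alternative = result.Alternatives[0];
                Console.WriteLine($"Transcript: {alternative.Transcript}");

                // Each word comes with its start and end time offset.
                foreach (var word in alternative.Words)
                {
                    Console.WriteLine($"{word.Word}: {word.StartTime} -> {word.EndTime}");
                }
            }
        }
    }
}
```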

Take a minute or two to study the code and see how it is used to transcribe an audio file with word timestamps. The EnableWordTimeOffsets parameter tells the API to enable time offsets (see the doc for more details).

In this step, you were able to transcribe an audio file in English with word timestamps and print out the result. Read more about Transcribing with word offsets .

7. Transcribe different languages

Speech-to-Text API supports transcription in over 100 languages! You can find a list of supported languages here .

In this section, you will transcribe a pre-recorded audio file in French. The audio file is available on Google Cloud Storage.

To transcribe the French audio file, navigate to the Program.cs file inside the SpeechToTextApiDemo folder and replace the code with the following:
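A minimal version for French might look like this (the URI of the French sample file is illustrative):

```csharp
using Google.Cloud.Speech.V1;
using System;

namespace SpeechToTextApiDemo
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var speech = SpeechClient.Create();

            var config = new RecognitionConfig
            {
                Encoding = RecognitionConfig.Types.AudioEncoding.Flac,
                LanguageCode = "fr-FR" // tells the API the recording is in French
            };

            // Illustrative French sample file.
            var audio = RecognitionAudio.FromStorageUri(
                "gs://cloud-samples-data/speech/corbeau_renard.flac");

            var response = speech.Recognize(config, audio);
            foreach (var result in response.Results)
            {
                foreach (var alternative in result.Alternatives)
                {
                    Console.WriteLine(alternative.Transcript);
                }
            }
        }
    }
}
```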

Take a minute or two to study the code and see how it is used to transcribe an audio file. The LanguageCode parameter tells the API what language the audio recording is in.

This is a sentence from a popular French children's tale .

In this step, you were able to transcribe an audio file in French and print out the result. Read more about supported languages .

8. Congratulations!

You learned how to use the Speech-to-Text API using C# to perform different kinds of transcription on audio files!

To avoid incurring charges to your Google Cloud Platform account for the resources used in this quickstart:

  • Go to the Cloud Platform Console .
  • Select the project you want to shut down, then click ‘Delete' at the top: this schedules the project for deletion.
  • Google Cloud Speech-to-Text API: https://cloud.google.com/speech-to-text/docs
  • C#/.NET on Google Cloud Platform: https://cloud.google.com/dotnet/
  • Google Cloud .NET client: https://googlecloudplatform.github.io/google-cloud-dotnet/

This work is licensed under a Creative Commons Attribution 2.0 Generic License.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License , and code samples are licensed under the Apache 2.0 License . For details, see the Google Developers Site Policies . Java is a registered trademark of Oracle and/or its affiliates.


Overtone - Realistic AI Offline Text to Speech (TTS)

Overtone is an offline Text-to-Speech asset for Unity. Enrich your game with 15+ languages, 900+ English voices, rapid performance, and cross-platform support.

Getting Started

Welcome to the Overtone documentation! In this section, we’ll walk you through the initial steps to start using the tools. We will explain the various features of Overtone, how to set it up, and provide guidance on using the different models for text to speech

Overtone provides a versatile text-to-speech solution, supporting over 15 languages to cater to a diverse user base. It is important to note that the quality of each model varies, which in turn affects the voice output. Overtone offers four quality variations: X-LOW, LOW, MEDIUM, and HIGH, allowing users to choose the one that best fits their needs.

The plugin includes a default English-only model, called LibriTTS, which boasts a selection of more than 900 distinct voices, readily available for use. As lower quality models are faster to process, they are particularly well-suited for mobile devices, where speed and efficiency are crucial.

How to download models

The TTSVoice component provides a convenient interface for downloading the models with just a click. Alternatively you can open the window from Window > Overtone > Download Manager


The plugin contains a demo to demonstrate the functionality: text to speech. You can input text, select a downloaded voice in the TTSVoice component and listen to it.

The TTSEngine class loads and sets up the model in memory. It should be added to any scene where Overtone will be used. It exposes one method, Speak, which receives a string and a TTSVoice and returns an AudioClip.

Example programmatic usage:
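A minimal sketch based on the descriptions above, assuming a synchronous Speak(string, TTSVoice) method that returns an AudioClip (the exact signature may differ):

```csharp
using UnityEngine;

public class OvertoneExample : MonoBehaviour
{
    public TTSEngine engine;       // the TTSEngine present in the scene
    public TTSVoice voice;         // a downloaded voice model
    public AudioSource audioSource;

    public void Say(string text)
    {
        // Speak() synthesizes the text with the selected voice and returns an AudioClip.
        AudioClip clip = engine.Speak(text, voice);
        audioSource.clip = clip;
        audioSource.Play();
    }
}
```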

The TTSVoice script loads a voice model and frees it when necessary. It also allows the user to select the speaker ID to use in the voice model.

Script Reference for TTSVoice.cs

TTSPlayer.cs is a script that combines a TTSVoice and a TTSEngine to synthesize speech from text.

Script Reference for TTSPlayer.cs

SSMLPreprocessor

SSMLPreprocessor.cs is a static class that offers limited SSML (Speech Synthesis Markup Language) support for Overtone. Currently, this class supports preprocessing for the <break> tag.

Speech Synthesis Markup Language (SSML) is an XML-based markup language that provides a standard way to control various aspects of synthesized speech output, including pronunciation, volume, pitch, and speed.

While we plan to add partial SSML support in future updates, for now, the SSMLPreprocessor class only recognizes the <break> tag.

The <break> tag allows you to add a pause in the synthesized speech output.

Supported Platforms

Overtone supports the following platforms:

If interested in any other platforms, please reach out.

Supported Languages

Troubleshooting

For any questions, issues, or feature requests, don’t hesitate to email us at [email protected] or join the Discord. We are happy to help and aim to have very fast response times :)

We are a small company focused on building tools for game developers. Send us an email to [email protected] if interested in working with us. For any other inquiries, feel free to contact us at [email protected] or contact us on the discord

[Open Source] whisper.unity - free speech to text running on your machine

Discussion in ' Assets and Asset Store ' started by Macoron , Apr 12, 2023 .

Macoron

whisper.unity

Several months ago OpenAI released a powerful audio speech recognition (ASR) model called Whisper. Code and weights are under the MIT license. I used another open source implementation called whisper.cpp and moved it to Unity.

Main features:

  • Multilanguage, supports around 60 languages
  • Can do transcription from one language to another. For example, transcribe German audio to English text.
  • Works faster than realtime. On my Mac it transcribes 11 seconds of audio in 220 ms
  • Runs on the local user machine without an Internet connection
  • Free and open source, can be used in commercial projects

Feel free to use it in your projects: https://github.com/Macoron/whisper.unity

Gord10

I implemented this into my new project. It works great, thanks for this! I couldn't get this work for IL2CPP, though, I had to use Mono. (Unity 2022.3.0)  
Gord10 said: ↑ I implemented this into my new project. It works great, thanks for this! I couldn't get this work for IL2CPP, though, I had to use Mono. (Unity 2022.3.0) Click to expand...
Macoron said: ↑ Great, nice to hear that you used it for your project. For what platform did you have problem with IL2CPP? It should be supported. Click to expand...
Gord10 said: ↑ I get following errors in player (64-bits Windows build) Click to expand...
Great! Windows, Mac (tested only for Silicon) and Linux IL2CPP builds work perfectly, now, thanks for the fix.  

Warfighter789

Warfighter789

Hi there, is this asset compatible with VR devices such as Meta Quest 2? I attempted to integrate it into my project but encountered an error. Thanks in advance. Error Unity NotSupportedException: IL2CPP doesn't allow marshaling delegates that reference instance methods to native code. The method we're trying to marshal is: Whisper.Native.whisper_progress_callback::Invoke.  
Warfighter789 said: ↑ Hi there, is this asset compatible with VR devices such as Meta Quest 2? I attempted to integrate it into my project but encountered an error. Thanks in advance. Error Unity NotSupportedException: IL2CPP doesn't allow marshaling delegates that reference instance methods to native code. The method we're trying to marshal is: Whisper.Native.whisper_progress_callback::Invoke. Click to expand...

jlmarc33

Hi, Unfortunately, I have an initialization error with Unity 2022.3 LTS concerning libwhisper.dll (DllNotFoundException: libwhisper assembly) Any advice will be welcome to allow a compilation with this version.  
jlmarc33 said: ↑ Hi, Unfortunately, I have an initialization error with Unity 2022.3 LTS concerning libwhisper.dll (DllNotFoundException: libwhisper assembly) View attachment 1267375 Any advice will be welcome to allow a compilation with this version. Click to expand...
Macoron said: ↑ Check messages above. This error should be fixed by recent update. Btw, I didn't test in Oculus Quest 2, but really interested to see how fast it works. Please write back. Edit: Make sure you use lastest-latest with this update https://github.com/Macoron/whisper.unity/pull/41 Click to expand...
Warfighter789 said: ↑ The latest update fixed the issue, thank you! The speed is quite fast. I noticed that the Speech to Voice isn't as accurate anymore, is that normal? Click to expand...
Macoron said: ↑ What do you mean Speech to Voice isn't accurate anymore? Do you have bad transcription results? If that the the case, what language do you use? Click to expand...
Warfighter789 said: ↑ Yeah, my transcription results are having problems. I've noticed that it's not picking up my voice as accurately as before. Sometimes when I say something, it comes out differently in the transcript. I'm using English. Click to expand...
Macoron said: ↑ Well, you can try to use older release. The latest master uses whisper.cpp 1.4.2 which may works different from 1.2.2. I also noticed some changes, but not sure if it's better or worse. https://github.com/Macoron/whisper.unity/releases/tag/1.1.1 If you are using English, I highly recommend you to switch to `whisper.tiny.en` or `whisper.base.en` models. They are much better in English transcription. I personally use `whisper.small.en`, but they might be too heavy for quest. Click to expand...
I tested Whisper.unity successfully on my Windows 11 laptop PC without any issues. I used Unity 2021.3.9 and the latest 2022.3.4 LTS. So, my initialization problem with Unity 2022.3.0 seems to be related only to my specific desktop PC configuration... (Windows 10 with security restrictions).  
jlmarc33 said: ↑ I tested Whisper.unity successfully on my Windows 11 laptop PC without any issues. I used Unity 2021.3.9 and the latest 2022.3.4 LTS. So, my initialization problem with Unity 2022.3.0 seems to be related only to my specific desktop PC configuration... (Windows 10 with security restrictions). Click to expand...

Bullybolton

Bullybolton

@Warfighter789 what did you do to test on Quest 2? I've just built the microphone sample scene onto quest and it was very slow.  

Spellbook

Macoron said: ↑ https://github.com/Macoron/whisper.unity Click to expand...
Spellbook said: ↑ This is something I've worked towards for years and it has effectively been impossible unless you're Google, Apple or Amazon... I don't think people quite realize how revolutionary this stuff is yet. Click to expand...
One issue I've run into is sampling a short audio clip returns 0 segments. Using push-to-talk, someone might quickly say "Yes" and the clip is 1 or 2 seconds long. The WhisperWrapper line "var n = WhisperNative.whisper_full_n_segments(_whisperCtx);" returns 0, finding no segments. I assume this is probably a limitation of the Whisper internals? I wanted to ask before I artificially append a few seconds to the end of audio clips as a hack solution.  
Spellbook said: ↑ One issue I've run into is sampling a short audio clip returns 0 segments. Using push-to-talk, someone might quickly say "Yes" and the clip is 1 or 2 seconds long. Click to expand...

Utopien

Yeah, excuse my English, I'm French, so sorry if this is a noob question. I can't find how to translate speech from one language to another, except of course with the bool translateToEnglish. I want to translate all speech, whatever the language, into French. Any help would be highly appreciated. Thanks for this great package!
Utopien said: ↑ yeah , escuse my english i french sorry if question is a noob one i cant find how to translate a text from language to another exept of couse for the bool translateToEnglish, i want to translate all speech what ever the language in french any help would be highly appreciated Click to expand...


Macoron said: ↑ Find Whisper Manager on your scene and there find "Language" field. Write "fr" language code and make sure that "Translate To English" is disabled. Now any speech on any language will be translated to French text. Keep in mind, that it doesn't work as well as English translation and you will probably need bigger model than "tiny". With smaller models it will probably be just gibberish. View attachment 1272260 Click to expand...
Quick update: whisper.unity updated to 1.2.0 version! Biggest changes are prompting and streaming support. For more information, check release notes in Github repository .  

Sammyueru1

This is amazing I've always wanted to see something like this  

Strategos

Hey this is working great for me in editor but when i do an android build It dies thusly 09-11 23:43:30.505 1781 2132 I Unity : Trying to load Whisper model from buffer... 09-11 23:43:30.553 1781 1817 E Unity : DllNotFoundException: __Internal assembly:<unknown assembly> type:<unknown type> member null) 09-11 23:43:30.553 1781 1817 E Unity : at (wrapper managed-to-native) Whisper.Native.WhisperNative.whisper_init_from_buffer(intptr,uintptr) 09-11 23:43:30.553 1781 1817 E Unity : at Whisper.WhisperWrapper.InitFromBuffer (System.Byte[] buffer) [0x00054] in <82e321693d1448d4ae1fba9fa7e11c76>:0 09-11 23:43:30.553 1781 1817 E Unity : at Whisper.WhisperWrapper+<>c__DisplayClass27_0.<InitFromBufferAsync>b__0 () [0x00000] in <82e321693d1448d4ae1fba9fa7e11c76>:0 09-11 23:43:30.553 1781 1817 E Unity : at System.Threading.Tasks.Task`1[TResult].InnerInvoke () [0x0000f] in <0bfb382d99114c52bcae2561abca6423>:0 09-11 23:43:30.553 1781 1817 E Unity : at System.Threading.Tasks.Task.Execute () [0x00000] in <0bfb382d99114c52bcae2561abca6423>:0 09-11 23:43:30.553 1781 1817 E Unity : --- End of stack trace from previous location where exception was thrown --- 09-11 23:43:30.553 1781 1817 E Unity : 09-11 23:43:30.553 1781 1817 E Unity : at Whisper.WhisperWrapper.InitFromBufferAsync (System.Byte[] buffer) [0x0007d] in <82e321693d1448d4ae1fba9fa7e11c76>:0 09-11 23:43:30.553 1781 1817 E Unity : at Whisper.WhisperWrapper.InitFromFileAsync (System.String modelPath) [0x000c1] in <82e321693d1448d4ae1fba9fa7e11c76>:0 09-11 23:43:30.553 1781 1817 E Unity : at Whisper.WhisperManager.InitModel () [0x000bd] 09-11 23:43:35.290 1781 1817 E Unity : Whisper model isn't loaded! Init Whisper model first!  
Strategos said: ↑ Hey this is working great for me in editor but when i do an android build It dies thusly Click to expand...
Macoron said: ↑ For what device are you building? Which version of Unity? Please also check that your Player Settings has IL2CPP Scripting Backend and you are building for ARM64 architecture. Click to expand...

epl-matt

Is there a way to add custom words? There are some words that I need to use but it doesn't pick them up as that word ever.  
Also, @Macoron I tried installing the package from the package manager and kept getting an error about OnRecordStop delegate not being found. Removed it and copied the package com.whisper.unity package folder in the downloaded git zip download and no error.  
epl-matt said: ↑ Is there a way to add custom words? There are some words that I need to use but it doesn't pick them up as that word ever. Click to expand...
Strategos said: ↑ Thanks I will check these things and report back. Click to expand...
New major update: now whisper supports GPU inference on CUDA and Metal. It also should improve quality and fix some minor bugs. Check more details here .  

Tyke18

Macoron said: ↑ New major update: now whisper supports GPU inference on CUDA and Metal. It also should improve quality and fix some minor bugs. Check more details here . Click to expand...
Tyke18 said: ↑ Hi does this have a voice activity detection feature? I want to be able to capture mic input on a Meta Quest 3 without the user having to press anything to indicate they are speaking (audio permissions must be granted first of course). Click to expand...

DanielSCG

How does the Initial prompt work? My application needs to work in the following way. It needs to listen to a users natural speech and pick out any keywords (held in a database) on the fly during their speech. If it detects a keyword it will then trigger an event. Is filling the Initial Prompt field with my keywords the best solution to this? Should each word be comma separated in Initial Prompt field?  
DanielSCG said: ↑ How does the Initial prompt work? My application needs to work in the following way. It needs to listen to a users natural speech and pick out any keywords (held in a database) on the fly during their speech. If it detects a keyword it will then trigger an event. Is filling the Initial Prompt field with my keywords the best solution to this? Should each word be comma separated in Initial Prompt field? View attachment 1414155 Click to expand...
Macoron said: ↑ In my experience, the initial prompt works best for: - Make whisper better understand rare or new words. For example person or company name, location name, etc - Guide transcription in certain writing styles. For example "WRITE ALL TEXT IN CAPS LOCK" - Setup context of previous transcription You can try to write the initial prompt like "house, car, house, house, train, car..." to set up some context for the model. This might help whisper to work better or might cause the model to hallucinate. Hard to tell without experiments. If you have limited set of words, you might be interested in grammar rules . They basically constrain model and force it to recognize words based on the set of rules. Unfortunately, they are not supported by whisper.unity yet . Click to expand...
DanielSCG said: ↑ Hi, Thanks for the fast reply. I have tried using the initial prompt but it doesn't seem to help much. To your knowledge is Whisper the best solution to use for my requirements? Maybe there would be some other transcriber that would work similarly? All I need it to do is recognise if a keyword is spoken with a good degree of accuracy. Click to expand...
Macoron said: ↑ Whisper should be good for such task. The accuracy grows greatly with model size. If you want to support only English it also make sense to use English-only models. They are much better than general models. I would also recommend not to use streaming, because it's less accurate. Whisper.cpp fails to transcript any audio that is less than 1 second in duration, so keep some audio margins for your words. Click to expand...
DanielSCG said: ↑ I need my application to recognise and list keywords the user is speaking in Realtime. For example is a user was saying a sentence like "I am in an house and I am about to inspect the door frame". The application should in Realtime show the keywords to the user on the screen. In this example the keywords would be in a database [House, Door]. So I need the app and speech to text to run for about 5 minuets in real time and just indicate to the user when they have spoken a keyword. I am developing this for android mobile devices so it needs to be lightweight enough to run on them. My surface book is what I'm using for development and ggml-small.eng.bin is the largest model it can handle. Is there some parameters in whisper I could change to better achieve these requirements? Click to expand...


akeller / WatsonSpeechToText.cs


akeller commented May 23, 2018 • edited

BREAKING CHANGES MADE TO SDK: This is on my to-do list to fix, but it is currently not working with the most recent version in the Asset Store.


(Free) Runtime Text To Speech Plugin

Author: mgear

I needed a simple TTS plugin for a small Unity Windows game that I'm working on. I know the Asset Store has a few, but they seem to rely on Windows Speech platform voices, and the one plugin that was completely standalone didn't really have good enough speech quality. First I tried to use Mozilla-TTS (to generate voice files in advance), but it was impossible to get it to compile due to some weird tricks needed. Then I found espeak-ng, which seems to have good enough quality and already had a dotnet wrapper available! I compiled the espeak DLL with Visual Studio, had a few issues with the wrapper, but got them fixed (compare Client.cs with the original). The next issue was that the DLL kept crashing; I fixed it by using DLLManipulator.

Unity project: https://github.com/unitycoder/UnityRuntimeTextToSpeech

Note: See github issues for known issues and ideas to improve this.

Updates: Now outputs to an AudioSource component, thanks to updates from the @autious fork.
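The plugin has its own API for this; purely to illustrate the general pattern of routing synthesized audio into Unity, here is a minimal sketch that wraps raw PCM samples (from espeak-ng or any other engine) in an AudioClip and plays them through an AudioSource. The method name and the assumptions about the sample format are mine, not the plugin's:

```csharp
using UnityEngine;

[RequireComponent(typeof(AudioSource))]
public class TtsPlayback : MonoBehaviour
{
    // Call this with the PCM samples produced by whatever TTS engine you use.
    // Samples are assumed to be mono and in the -1..1 float range.
    public void PlaySynthesizedSpeech(float[] samples, int sampleRate)
    {
        // Wrap the raw samples in an AudioClip and play it through the attached AudioSource.
        var clip = AudioClip.Create("tts", samples.Length, 1, sampleRate, false);
        clip.SetData(samples, 0);
        GetComponent<AudioSource>().PlayOneShot(clip);
    }
}
```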

Comments

So cool. What platforms does this work on? iOS? Android?

Not tested, but I'm pretty sure the DLL doesn't work on those platforms (especially because it uses DLLManipulator, which needs to read the DLL file from a specific folder).

Better to use the paid Asset Store versions; they work on many platforms.

Cool. Is it possible to change the language?

Try this: https://github.com/unitycoder/UnityRuntimeTextToSpeech/issues/3

Will this work offline? If not, can you recommend another one? I actually need offline TTS.

Yes, it's offline.

List of alternatives: https://github.com/unitycoder/UnityRuntimeTextToSpeech/wiki/Alternative-TTS-plugins


Title: UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Abstract: Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.


Introducing GPT-4o: OpenAI’s new flagship multimodal model now in preview on Azure

By Eric Boyd Corporate Vice President, Azure AI Platform, Microsoft

Posted on May 13, 2024


Microsoft is thrilled to announce the launch of GPT-4o, OpenAI’s new flagship model on Azure AI. This groundbreaking multimodal model integrates text, vision, and audio capabilities, setting a new standard for generative and conversational AI experiences. GPT-4o is available now in Azure OpenAI Service, to try in preview, with support for text and image.


A step forward in generative AI for Azure OpenAI Service

GPT-4o offers a shift in how AI models interact with multimodal inputs. By seamlessly combining text, images, and audio, GPT-4o provides a richer, more engaging user experience.

Launch highlights: Immediate access and what you can expect

Azure OpenAI Service customers can explore GPT-4o’s extensive capabilities through a preview playground in Azure OpenAI Studio starting today in two regions in the US. This initial release focuses on text and vision inputs to provide a glimpse into the model’s potential, paving the way for further capabilities like audio and video.

Efficiency and cost-effectiveness

GPT-4o is engineered for speed and efficiency. Its advanced ability to handle complex queries with minimal resources can translate into cost savings and performance.

Potential use cases to explore with GPT-4o

The introduction of GPT-4o opens numerous possibilities for businesses in various sectors: 

  • Enhanced customer service : By integrating diverse data inputs, GPT-4o enables more dynamic and comprehensive customer support interactions.
  • Advanced analytics : Leverage GPT-4o’s capability to process and analyze different types of data to enhance decision-making and uncover deeper insights.
  • Content innovation : Use GPT-4o’s generative capabilities to create engaging and diverse content formats, catering to a broad range of consumer preferences.

Exciting future developments: GPT-4o at Microsoft Build 2024 

We are eager to share more about GPT-4o and other Azure AI updates at Microsoft Build 2024 , to help developers further unlock the power of generative AI.

Get started with Azure OpenAI Service

Begin your journey with GPT-4o and Azure OpenAI Service by taking the following steps (a minimal call sketch from Unity follows the list):

  • Try out GPT-4o in Azure OpenAI Service Chat Playground (in preview).
  • If you are not a current Azure OpenAI Service customer, apply for access by completing this form .
  • Learn more about Azure OpenAI Service and the latest enhancements.
  • Understand responsible AI tooling available in Azure with Azure AI Content Safety .
  • Review the OpenAI blog on GPT-4o.
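Because this guide is Unity-centric, here is a rough sketch of what a call to an Azure OpenAI chat deployment could look like from a Unity script using UnityWebRequest. The endpoint shape, api-version value, deployment name, and key below are placeholders/assumptions; consult the Azure OpenAI reference for the current values:

```csharp
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

public class AzureOpenAIChat : MonoBehaviour
{
    // All of these are placeholders: use your own resource endpoint, deployment name and key.
    private const string Endpoint   = "https://YOUR-RESOURCE.openai.azure.com";
    private const string Deployment = "gpt-4o";        // the deployment name you created
    private const string ApiVersion = "2024-02-01";    // assumed API version; check the docs
    private const string ApiKey     = "YOUR-API-KEY";

    IEnumerator Start()
    {
        string url  = $"{Endpoint}/openai/deployments/{Deployment}/chat/completions?api-version={ApiVersion}";
        string body = "{\"messages\":[{\"role\":\"user\",\"content\":\"Say hello from Unity\"}]}";

        using (var request = new UnityWebRequest(url, "POST"))
        {
            request.uploadHandler   = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
            request.downloadHandler = new DownloadHandlerBuffer();
            request.SetRequestHeader("Content-Type", "application/json");
            request.SetRequestHeader("api-key", ApiKey);

            yield return request.SendWebRequest();

            if (request.result == UnityWebRequest.Result.Success)
                Debug.Log(request.downloadHandler.text);   // raw JSON response
            else
                Debug.LogError(request.error);
        }
    }
}
```

The response comes back as JSON; in a real project you would parse it (for example with JsonUtility or a JSON library) instead of just logging it.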

Let us know what you think of Azure and what you would like to see in the future.



Benedictine College nuns denounce Harrison Butker's speech at their school

John Helton

Kansas City Chiefs kicker Harrison Butker speaks to the media during NFL football Super Bowl 58 opening night on Feb. 5, 2024, in Las Vegas. Butker railed against Pride month along with President Biden's leadership during the COVID-19 pandemic and his stance on abortion during a commencement address at Benedictine College last weekend. Charlie Riedel/AP

An order of nuns affiliated with Benedictine College rejected Kansas City Chiefs kicker Harrison Butker's comments in a commencement speech there last weekend that stirred up a culture war skirmish.

"The sisters of Mount St. Scholastica do not believe that Harrison Butker's comments in his 2024 Benedictine College commencement address represent the Catholic, Benedictine, liberal arts college that our founders envisioned and in which we have been so invested," the nuns wrote in a statement posted on Facebook .

In his 20-minute address , Butker denounced abortion rights, Pride Month, COVID-19 lockdowns and "the tyranny of diversity, equity and inclusion" at the Catholic liberal arts college in Atchison, Kan.

He also told women in the audience to embrace the "vocation" of homemaker.

"I want to speak directly to you briefly because I think it is you, the women, who have had the most diabolical lies told to you. How many of you are sitting here now about to cross the stage, and are thinking about all the promotions and titles you're going to get in your career?" he asked. "Some of you may go on to lead successful careers in the world. But I would venture to guess that the majority of you are most excited about your marriage and the children you will bring into this world."


That was one of the themes that the sisters of Mount St. Scholastica took issue with.

"Instead of promoting unity in our church, our nation, and the world, his comments seem to have fostered division," they wrote. "One of our concerns was the assertion that being a homemaker is the highest calling for a woman. We sisters have dedicated our lives to God and God's people, including the many women whom we have taught and influenced during the past 160 years. These women have made a tremendous difference in the world in their roles as wives and mothers and through their God-given gifts in leadership, scholarship, and their careers."

The Benedictine sisters of Mount St. Scholastica founded a school for girls in Atchison in the 1860s. It merged with St. Benedict's College in 1971 to form Benedictine College.

Neither Butker nor the Chiefs have commented on the controversy. An online petition calling for the Chiefs to release the kicker had nearly 215,000 signatures as of Sunday morning.


The NFL, for its part, has distanced itself from Butker's remarks.

"Harrison Butker gave a speech in his personal capacity," Jonathan Beane, the NFL's senior VP and chief diversity and inclusion officer told NPR on Thursday. "His views are not those of the NFL as an organization."

Meanwhile, Butker's No. 7 jersey is one of the league's top-sellers , rivaling those of better-known teammates Patrick Mahomes and Travis Kelce.

Butker has been open about his faith. The 28-year-old father of two told the Eternal Word Television Network in 2019 that he grew up Catholic but practiced less in high school and college before rediscovering his belief later in life.

His comments have gotten some support from football fan social media accounts and Christian and conservative media personalities .

A video of his speech posted on Benedictine College's YouTube channel has 1.5 million views.

Rachel Treisman contributed to this story.


Unlock a new era of innovation with Windows Copilot Runtime and Copilot+ PCs

  • Pavan Davuluri – Corporate Vice President, Windows + Devices

I am excited to be back at Build with the developer community this year.   

Over the last year, we have worked on reimagining  Windows PCs and yesterday, we introduced the world to a new category of Windows PCs called Copilot+ PCs.    

Copilot+ PCs are the fastest, most intelligent Windows PCs ever with AI infused at every layer, starting with the world’s most powerful PC Neural Processing Units (NPUs) capable of delivering 40+ TOPS of compute. The new class of PCs is up to 20 times more powerful 1 and up to 100 times as efficient 2 for running AI workloads compared to traditional PCs. This is a quantum leap in performance, made possible by a quantum leap in efficiency. The NPU is part of a new System on Chip (SoC) that enables the most powerful and efficient Windows PCs ever built, with outstanding performance, incredible all day battery life, and great app experiences. Copilot+ PCs will be available in June, starting with Qualcomm’s Snapdragon X Series processors. Later this year we will have more devices in this category from Intel and AMD.   

I am also excited that Qualcomm announced this morning its Snapdragon Dev Kit for Windows which has a special developer edition Snapdragon X Elite SoC. Featuring the NPU that powers the Copilot+ PCs, the Snapdragon Dev Kit for Windows has a form factor that is easily stackable and is designed specifically to be a developer’s everyday dev box, providing the maximum power and flexibility developers need. It is powered by a 3.8 GHz 12 Core Oryon CPU with dual core boost up to 4.3GHz, comes with 32 GB LPDDR5x memory, 512GB M2 storage, 80 Watt system architecture, support for up to 3 concurrent external displays and uses 20% ocean-bound-plastic. Learn more . 

Snapdragon Dev Kit for Windows

This new class of powerful next generation AI devices is an invitation to app developers to deliver differentiated AI experiences that run on the edge, taking advantage of NPUs that offer the benefits of minimal latency, cost efficiency, data privacy, and more.    

As we continue our journey into the AI era of computing, we want to give Developers who are at the forefront of this AI transformation the right software tools in addition to these powerful NPU powered devices to accelerate the creation of differentiated AI experiences to over 1 billion users. Today, I’m thrilled to share some of the great capabilities coming to Windows, making Windows the best place for your development needs.    

  • We are excited to extend the Microsoft Copilot stack to Windows with Windows Copilot Runtime. We have infused AI into every layer of Windows, including a fundamental transformation of the OS itself to enable developers to accelerate AI development on Windows.    
  • Windows Copilot Runtime has everything you need to build great AI experiences regardless of where you are on your AI journey – whether you are just getting started or already have your own models. Windows Copilot Runtime includes Windows Copilot Library which is a set of APIs that are powered by the 40+ on-device AI models that ship with Windows. It also includes AI frameworks and toolchains to help developers bring their own on-device models to Windows. This is built on the foundation of powerful client silicon, including GPUs and NPUs.   
  • We are introducing Windows Semantic Index, a new OS capability which redefines search on Windows and powers new experiences like Recall. Later, we will make this capability available for developers with Vector Embeddings API to build their own vector store and RAG within their applications and with their app data.   
  • We are introducing Phi Silica which is built from the Phi series of models and is designed specifically for the NPUs in Copilot+ PCs. Windows is the first platform to have a state-of-the-art small language model (SLM) custom built for the NPU and shipping inbox.    
  • Phi Silica API along with OCR, Studio Effects, Live Captions, Recall User Activity APIs will be available in Windows Copilot Library in June. More APIs like Vector Embedding, RAG API, Text Summarization will be coming later.  
  • We are introducing native support for PyTorch on Windows with DirectML which allows for thousands of Hugging Face models to just work on Windows.    
  • We are introducing Web Neural Network (WebNN) Developer Preview to Windows through DirectML. This allows web developers to take advantage of the silicon to deliver performant AI features in their web apps and can scale their AI investments across the breadth of the Windows ecosystem.  
  • We are introducing new productivity features in Dev Home like Environments, improvements to WSL, DevDrive and new updates to WinUI3 and WPF to help every developer become more productive on Windows.  

I can’t wait to share more with you during our keynote today, be sure to register for Build and tune in !  

Introducing Windows Copilot Runtime to provide a powerful AI platform for developers  

We want to democratize the ability to experiment, to build, and to reach people with breakthrough AI experiences. That’s why we’re committed to making Windows the most open platform for AI development. Building a powerful AI platform takes more than a new chip or model, it takes reimagining the entire system, from top to bottom. The new Windows Copilot Runtime is that system. Developers can take advantage of Windows Copilot Runtime in a variety of ways, from higher level APIs that can be accessed via simple settings toggle, all the way to bringing your own machine learning models. It represents the end-to-end Windows ecosystem:    

  • Applications and Experiences created by Microsoft and developers like you across Windows shell, Win32 Apps and Web apps.   
  • Windows Copilot Library is the set of APIs powered by the 40+ on-device models that ship with Windows. This includes APIs and algorithms that power Windows experiences and are available for developers to tap into.  
  • AI frameworks like DirectML, ONNX Runtime, PyTorch, WebNN  and toolchains like Olive, AI Toolkit for Visual Studio Code and more to help developers bring their own models and scale their AI apps across the breadth of the Windows hardware ecosystem.   
  • Windows Copilot Runtime is built on the foundation of powerful client silicon , including GPUs and NPUs.                   

Windows Copilot Runtime content

New experiences built using the Windows Copilot Runtime   

Windows Copilot Runtime powers the creation of all experiences you build, and what we – Windows – build for our end-users. Using a suite of APIs and on-device models in Windows Copilot Library, we have built incredible first-party experiences like   

  • Recall that helps users instantly find almost anything 3 they’ve seen on their PC    
  • Cocreator 4 a collaborative AI image generator that helps users bring their ideas to life using natural language and ink strokes locally on the device  
  • Restyle Image, helps users reimagine their personal photos with a new style combining image generation and photo editing in Photos  
  • Others like Windows Studio Effects, and Live captions, with real-time translation from video and audio in 40+ languages into English subtitles   

We are also partnering with several third-party developers on apps like Davinci Resolve, CapCut, WhatsApp, Camo Studio, djay Pro, Cephable, LiquidText, Luminar Neo and many more that are leveraging the NPU to deliver innovative AI experiences with reduced latency, faster task completion, enhanced privacy and lower cloud compute costs. We’re excited for developers to take advantage of the NPU and Windows Copilot Runtime and invent new experiences.  

Windows Copilot Library offers a set of APIs helping developers to accelerate local AI development   

Windows Copilot Library has a set of APIs that are powered by the 40+ on-device AI models and state-of-the-art algorithms like DiskANN , built into Windows. Windows Copilot Library consists of ready-to-use AI APIs like Studio Effects, Live captions translations, OCR, Recall with User Activity, and Phi Silica, which will be available to developers in June. Vector Embeddings, Retrieval Augmented Generation (RAG), Text Summarization along with other APIs will be coming later to Windows Copilot Library . Developers will be able to access these APIs as part of the Windows App SDK release.   

Developers can take advantage of the Windows Copilot Library with no-code effort to integrate Studio Effects into their apps like Creative filters, Portrait light, Eye contact teleprompter, Portrait blur, and Voice focus. WhatsApp among others has already upgraded their user experience adding Windows Studio Effects controls directly into the UI. Learn more.  

With a similar no-code effort, developers can take advantage of Live captions, the translation feature in Windows to caption audio and video in real time and translate into preferred language in apps.   

Developers can tap into the newly announced Recall feature on Copilot+ PCs. Enhance the user’s Recall experience with your app by adding contextual information to the underlying vector database via the User Activity API. This integration helps users pick up where they left off in your app, improving app engagement and the user’s seamless flow between Windows and your app. Edge and M365 apps like Outlook, PowerPoint and Teams have already extended their apps with Recall. Concepts, a 3rd-party sketching app, is an early example: if launched from Recall, it brings users immediately to the exact canvas location in the right document, and even the same zoom level seen in the Recall timeline.

Introducing Windows Semantic Index that redefines search on Windows. Vector Embeddings API offers the capability for developers to build their own vector store with their app data  

Recall database is powered by Windows Semantic Index, a new OS capability that redefines search on Windows. Recall is grounded in several state-of-the-art AI models, including multi-modal SLMs, running concurrently and integrated into the OS itself. These models understand different kinds of content and work across several languages, to organize a vast sea of information from text to image to videos, in Windows. This data is transformed and stored in a vector store called Windows Semantic Index. The semantic index is stored entirely on the user’s local device and accessible through natural language search. This deep integration allows a uniquely robust approach to privacy as the data does not leave the local device.    

To help developers bring the same natural language search capability in their apps, we are making Vector Embeddings and RAG API available in Windows Copilot Library later. This will enable developers to build their own semantic index store with their own app data and this combined with Retrieval Augmented Generation (RAG) API, developers can bring natural language search capability in their apps. This is a great example of how we are building new features using the models and APIs in Windows Copilot Runtime and offering the same capability for developers to do so in their apps.    

The APIs in the Windows Copilot Library cover the full spectrum from low-code APIs to sophisticated pipelines to fully multi-modal models.   

Windows is the first platform to have a state-of-the-art SLM shipping inbox and Phi Silica is custom built for the NPUs in Copilot+ PCs   

We recently introduced Phi-3 the most capable and cost-effective SLM. Phi-3-mini does better than models twice its size on key benchmarks. Today we are introducing Phi Silica, built from the Phi series of models. Phi Silica is the SOTA (state of the art) SLM included out of the box and is custom built for the NPUs in Copilot+ PCs. With full NPU offload of prompt processing, the first token latency is at 650 tokens/second – and only costs about 1.5 Watts of power while leaving your CPU and GPU free for other computations. Token generation reuses the KV cache from the NPU and runs on the CPU producing about 27 tokens/second.  

These are just a few examples of the APIs available to developers in the Windows Copilot Library. As new models and new libraries come to Windows, the possibilities will only grow. We want to make it easy for developers to bring powerful AI features into their apps, and Windows Copilot Library is the perfect place to start.   

We consistently ensure Windows AI experiences are safe, fair, and trustworthy, following our Microsoft Responsible AI principles. When developers extend their apps with Windows Copilot Library, they automatically inherit those Responsible AI guardrails.   

Developers can bring their own models and scale across breadth of Windows hardware powered by DirectML  

While the models that ship with Windows 11 power a wide range of AI experiences, many developers will want to bring their own models to Windows to power their applications. As an open platform, Windows supports a diverse silicon ecosystem, and Windows has simplified optimizing models across silicon with DirectML. Just like DirectX is for Graphics, DirectML is the high-performance low-level API for machine learning in Windows.   

DirectML abstracts across the different hardware options our Independent Hardware Vendor (IHV) partners bring to the Windows ecosystem, and supports across GPUs and NPUs, with CPU integration coming soon. It integrates with relevant frameworks, such as the ONNX Runtime, PyTorch and WebNN.  

PyTorch is now natively supported on Windows with DirectML  

We know that a lot of developers do their PyTorch development on Windows. So, we’re thrilled to announce that Windows now natively supports PyTorch through DirectML. Native PyTorch support means that thousands of Hugging Face models will just work on Windows. Not just that – we’re collaborating with Nvidia to scale these development workflows to over 100M RTX AI GPUs.   

PyTorch support on GPUs is available starting today, with NPU support coming soon. Learn more  

We recognize that many developers start with web apps today. Web apps should also be able to take advantage of silicon on local devices to deliver AI experiences to users.    

DirectML now supports web apps that can take advantage of silicon to deliver AI experiences powered by WebNN  

From native to web applications, DirectML now brings local AI scale across Windows for the web through the new WebNN Developer Preview. WebNN, an emerging web standard for machine learning, powered by DirectML and ONNX Runtime Web, simplifies how developers can leverage the underlying hardware on their user’s device for their web apps to deliver AI experiences at near native performance for tasks such as generative AI, image processing, natural language processing, computer vision and more. This WebNN Developer Preview supports GPUs with broader accelerator coverage to include NPU coming soon. Learn more about how to get started with WebNN.  

High-performance inferencing on Windows with ONNX Runtime and DirectML  

Microsoft’s ONNX Runtime builds on the power of the open-source community to enable developers to ship their AI models to production with the performance and cross-platform support they need. ONNX Runtime with DirectML applies state-of-the-art optimizations to get the best performance for all generative AI models like Phi, Llama, Mistral, and Stable Diffusion. With ONNX Runtime, developers can extend their Windows applications to other platforms like web, cloud or mobile, wherever they need to ship their application. ONNX Runtime is how Microsoft apps like Office, Visual Studio Code, and even Windows itself ship their AI to run on-device. Learn More.
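For a sense of what on-device inferencing with ONNX Runtime looks like from C#, here is a minimal sketch that requests the DirectML execution provider. It assumes the Microsoft.ML.OnnxRuntime.DirectML NuGet package; the model path, input name and input shape are placeholders:

```csharp
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

class OnnxDirectMLExample
{
    static void Main()
    {
        // Ask for the DirectML execution provider (device 0 = default GPU/NPU).
        var options = new SessionOptions();
        options.AppendExecutionProvider_DML(0);

        // "model.onnx" is a placeholder path to your model file.
        using var session = new InferenceSession("model.onnx", options);

        // Build a dummy input; the name "input" and shape 1x3x224x224 are placeholders
        // that must match whatever your model actually expects.
        var input  = new DenseTensor<float>(new[] { 1, 3, 224, 224 });
        var inputs = new[] { NamedOnnxValue.CreateFromTensor("input", input) };

        using var results = session.Run(inputs);
        foreach (var result in results)
            System.Console.WriteLine(result.Name);
    }
}
```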

DirectML helps scale your efforts across the Windows ecosystem – whether you are building your own models or you want to bring an open-source model from Hugging Face, and whether you are building a native Windows app or a web app.  

DirectML is generally available across all Windows GPUs. DirectML support on Intel® Core™ Ultra processors with Intel® AI Boost is available as a Developer Preview with GA coming soon, and Qualcomm® Hexagon™ NPU in the Snapdragon X Elite SoC is coming soon. Stay tuned for more DirectML features that will simplify how developers can differentiate with AI and scale their innovations across Windows. Grab your favorite model and get started with DirectML today at DirectML Overview or Windows AI Dev Center | Microsoft Developer  

DirectML system architecture diagram

Windows Subsystem for Linux (WSL) offers a robust platform for AI development on Windows by making it easy to run Windows and Linux workloads simultaneously.  Developers can easily share files, GUI apps, GPU and more between environments with no additional setup. WSL is now enhanced to meet the enterprise grade security requirements so enterprise customers can confidently deploy WSL for their developers to take advantage of both Windows and Linux operating systems on the same Windows device and accelerate AI development efficiently.   

WSL now incorporates two new Zero Trust features, Linux Intune Agent and integration with Microsoft Entra ID, to enable system administrators to enhance enterprise security. With Linux Intune agent integration, IT admins can determine compliance based on WSL distro versions and more, using custom scripts. Microsoft Entra ID integration provides a zero trust experience to access protected enterprise resources from within a WSL distro by providing a secure channel to acquire and utilize tokens bound to the host device. The Linux Intune agent integration is currently in public preview, and Microsoft Entra ID integration will be in public preview this summer.

New experiences designed to help every developer become more productive on Windows 11   

We know building great AI experiences starts with developer productivity. That’s why we are excited to announce new features in Dev Home, performance improvements to DevDrive and improvements to your favorite tool PowerToys.  

At Build last year, we announced Dev Home and since then we have been evolving Dev Home to be the one-stop-shop for setting up your Windows machine for development. We have made some key improvements to Dev Home to further boost developer productivity. Dev Home is now installed on every Windows machine making it easy to get started. We are introducing Environments, Windows Customization and welcoming WSL and a subset of PowerToys utilities to Dev Home.   

Environments in Dev Home help centralize your interactions with all remote environments. Create, manage, launch and configure dev environments in a snap from Dev Home    

For developers who often use virtual machines and remote environments, Environments in Dev Home is for you. With support for Hyper-V VMs and cloud Microsoft Dev Boxes, you can create new environments, set up environments with repositories, apps, and packages. You can perform quick actions such as taking snapshots, starting, and stopping, and even pin environments to the Start Menu and taskbar, all from Environments in Dev Home. To make this experience even more powerful, it’s all extensible and open source so you can add your own environments. Environments in Dev Home is available now in preview.  

Dev Home Environments screen

We know developers want zero distractions when coding, and customizing your dev machine to the ideal state is critical for productivity. We also know developers want more control and agency on their device. That’s why we are releasing Windows Customization feature in Dev Home .   

Windows Customization in Dev Home allows developers to customize their device to an ideal state with fewest clicks    

Windows Customization gives developers access to Dev Drive insights, advanced File Explorer settings, virtual machine management, and the ability to quiet background processes, giving developers more control over their Windows machine. Submit feature requests for what you want to see in Windows Customization on GitHub .   

Windows customization folders in Dev Home

New Export feature in Dev Home Machine Configuration allows you to quickly create configuration files to share with your teammates, boosting productivity  

WinGet configuration files are an easy way to get your machine set up for development exactly how you like it. For a streamlined experience, try the new export feature in Dev Home which allows you to generate a configuration file based on the choices you made in Dev Home’s Machine Configuration setup flow, allowing you to quickly create configuration files to share with your teammates for a consistent machine setup.   

Lastly, when cloning a repository in Dev Home that contains a configuration file, Dev Home can now detect that file and let you run it right away, allowing you to get set up for coding even faster than before.   

In addition to these new features, we are bringing WSL and a subset of PowerToys utilities to Dev Home, truly making Dev Home your one-stop shop for all your development needs. You can now access WSL right from Dev Home in the Environments tab. Also, a subset of PowerToys utilities such as Hosts File Editor , Environment Variables , and Registry Preview can be accessed in the new Utilities tab on Dev Home. These features are currently available in preview.     

Dev Drive introduces block cloning that will allow developers to perform large file copy operations instantaneously

At the heart of developer productivity lies improving performance for developer workloads on Windows. Last year at Build , we announced Dev Drive a new storage volume tailor-made for developers and supercharged for performance and security. Since then, we have continued to invest further in Windows performance improvements for developer workloads.   

With the release of Windows 11 24H2, workflows will get even faster when developing on a Dev Drive. Windows copy engine now has Filesystem Block Cloning, resulting in nearly instantaneous copy actions and drastically improving performance, especially in developer scenarios that copy large files. Our benchmarks include the following:   

Dev Drive is a must for any developer, especially if you are dealing with repositories with many files, or large files. You can set up Dev Drive through the Settings app under System->Storage->Disks and Volumes page.  

Reducing toil and unlocking the fun and joy of development on Windows with new features and improvements  

Sudo for Windows allows developers to run elevated commands right in Terminal  

For command line users, we’re providing a simple and familiar way for elevating your command prompt with Sudo for Windows. Simply enable Sudo within Windows developer settings and you can get started running elevated commands with Sudo right in your terminal. You can learn more about Sudo on GitHub.    

New Source code integration in File Explorer allows tracking commit messages and file status directly in File Explorer  

File Explorer will provide even more power to developers with version control protocol integration (including Git). This allows developers to monitor data including file status, commit messages, and current branch directly from File Explorer. File Explorer has also gained the ability to compress to 7zip and TAR.  

File Explorer Source Code folders

Continuing to innovate and accelerating development for Windows on Arm   

The Arm developer ecosystem momentum continues to grow with updates to Visual Studio, .NET, and many key tools delivering Arm native versions. Windows is continuing to welcome more third-party Windows apps, middleware partners and Open-Source Software natively to Arm. Learn how to add Arm support for your apps.    

  • Visual Studio now includes Arm native SQL Server Developer Tools (SSDT), the #1 requested Arm native workload for VS. Learn more  
  • .NET 8 includes tons of performance improvements for Arm: Performance Improvements in .NET 8 – .NET Blog (microsoft.com)  
  • The Unity game editor is now available in preview and will release to market with the next Unity update, allowing game developers to build, test and run Unity titles on Arm-powered Windows devices.  
  • Blender Arm native builds are available in preview, with official long-term-support builds expected to ship in June. Blender is the free and open source 3D creation suite. It supports the entirety of the 3D pipeline—modelling, rigging, animation, simulation, rendering, compositing and motion tracking, even video editing and game creation.  
  • Arm native Docker tools for Windows are now available.  
  • GitHub Actions now has Arm64 runners on Windows. This is available in private preview, with a public preview expected in the coming months. You can apply to join here.  
  • GIMP adds Arm native builds; long-term support from v3.0 will be available in May.  
  • The Qt 6.8 release, due in September, will move Arm native to LTS for Windows.

Continuing investments in WinUI3 and WPF to help developers build rich, modern Windows applications   

Windows is an open and versatile platform that supports a wide range of UI technologies. If you are looking to develop native Windows applications using our preferred UI development language, XAML, we recommend using either WinUI 3 or WPF.  

WinUI 3 includes a modern native compositor and excels at media and graphics-focused consumer and commercial applications. WPF has a longer history and can take advantage of a deep ecosystem of commercial products as well as free and open source projects, many of which are focused on enterprise and data-intensive scenarios. We recommend you first consider WinUI 3, and if that meets your app’s  needs, proceed with it for the most modern experience. Otherwise, WPF is an excellent choice. Both WinUI 3 and WPF can take advantage of all Windows has to offer, including the new features and APIs in the Windows App SDK, so you can feel confident in creating a modern application in either technology.  

WinUI 3 and Windows App SDK now support native Maps control and .NET 8   

With the latest updates to Windows App SDK 1.5+ we shipped several developer-requested features including support for .NET 8, with its faster startup, smaller footprint, and new runtime features. We’ve also brought to WinUI 3 one of the most requested features, the Maps control, powered by WebView 2 and Azure Maps. You can learn more about the controls and features in WinUI 3 in the interactive WinUI 3 Gallery App  
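If you have not used WinUI 3 before, a minimal code-behind window looks roughly like the sketch below. It assumes the standard Windows App SDK project template supplies the App class that creates and activates the window; the control and its click handler are purely illustrative:

```csharp
using Microsoft.UI.Xaml;
using Microsoft.UI.Xaml.Controls;

// In the standard template this window is created and activated from App.OnLaunched:
//   m_window = new MainWindow(); m_window.Activate();
public sealed class MainWindow : Window
{
    public MainWindow()
    {
        Title = "Hello WinUI 3";

        // A single control as the window content, wired up entirely in code-behind.
        var button = new Button { Content = "Click me" };
        button.Click += (_, _) => button.Content = "Clicked";
        Content = button;
    }
}
```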

Microsoft apps like Photos and File Explorer have migrated to WinUI3 along with developers like Apple (Apple TV, Apple Music, iCloud, Apple Devices) and Yair A (Files App), who are also adopting WinUI 3.  

Windows 11 theme support makes it easy to modernize the look and feel of your WPF applications  

WPF is popular, especially for data-heavy and enterprise apps. We listened to your feedback and are committed to continuing investments in WPF. With the latest updates to WPF, we have made it easier than ever to modernize the look and feel of your app through support for Windows 11 theming. We also improved integration with Windows by including a native FolderBrowserDialog and managed DWrite.  

Developers, including Morgan Stanley and Reincubate , have created great apps that showcase what can be built using WPF.   

Our updated Windows Dev Center includes information on both WinUI3 and WPF to help you make the best decision for your application.   

Extend Windows apps into 3D space  

As Windows transforms for the era of AI we are continuing to expand the reach of the platform including all the AI experiences developers create with the Windows Copilot Runtime. We are delivering Windows from the cloud with Windows 365 so apps can reach any device, anywhere. And we are introducing Windows experiences to new form factors beyond the PC.   

For example, we are deepening our partnership with Meta to make Windows a first-class experience on Quest devices. And Windows can take advantage of Quest’s unique capabilities to extend Windows apps into 3D space. We call these Volumetric apps. Developers will have access to a volumetric API. This is just one of many ways to broaden your reach through the Windows ecosystem.  

Building for the future of AI on Windows  

This past year has been incredibly exciting as we reimagined the Windows PC in this new era of AI. But this is just the start of our journey. With the most efficient and performant Windows PCs ever built, powered by the game-changing NPU technology, and an OS with AI at its core, we have listened to your feedback and worked to make Windows the very best platform for developers.   

We look forward to continuing to partner with you, our developer and MVP community, to bring innovation to our platform and tools, and enabling each of you to create future AI experiences that will empower every person on the planet to achieve more. We can’t wait to see what you will build next.   

Editor’s note, May 21, 2024: This post was updated to reflect the latest product information on Snapdragon Dev Kit for Windows.

Disclaimers  

1 Tested April 2024 using debug application for Windows Studio Effects workload comparing pre-release Copilot+ PC builds with Snapdragon Elite X 12 Core to Windows 11 PC with Intel 12th gen i7 configuration  

2 Tested April 2024 using Phi SLM workload running 512-token prompt processing in a loop with default settings comparing pre-release Copilot+ PC builds with Snapdragon Elite X 12 Core and Snapdragon X Plus 10 core configurations (QNN build) to Windows 11 PC with NVIDIA 4080 GPU configuration (CUDA build).  

3 Optimized for select languages (English, Chinese (simplified), French, German, Japanese, and Spanish.) Content-based and storage limitations apply. See [ aka.ms/copilotpluspcs ].  

4 Optimized for English text prompts. See aka.ms/copilotpluspcs.  

COMMENTS

  1. corycorvus/Unity-Speech-to-Text

    Watson streaming and non-streaming speech-to-text both rely on IBM's Watson SDK for Unity, which must be manually added to the project. The Unity Watson SDK can be found here. Google non-streaming and Wit.ai non-streaming speech-to-text both rely on UniWeb, which must be manually added to the project. UniWeb can be found on the Unity Asset ...

  2. Unity Speech to Text Plugin for Android & iOS

    This plugin helps you convert speech to text on Android (all versions) and iOS 10+. Offline speech recognition is supported on Android 23+ and iOS 13+ if the target language's speech recognition model is present on the device.

  3. GitHub

    Text To Speech. The model that we use for TTS is FastSpeech. The TFLite model that we used is converted from a pre-trained model found in the TensorflowTTS repository. To prevent Unity from freezing when inferencing the TFLite model, we run the inference process in a new thread and play the audio in the main thread once it is ready.

  4. Speech Recognition in Unity3D

    In this tutorial, we described two methods of recognizing voice in Unity3D: Voice Commands and Dictation. Voice Commands are the easiest way to recognize pre-defined words. Dictation is a way to recognize free-form phrases. In a future article, we are going to see how to develop our own Grammar and feed it to Unity3D.

  5. Integrate Text-to-Speech into your app with Unity

    How to integrate TTS into your video game with Unity. Step 1: Install the Eden AI Unity Plugin. Ensure that you have a Unity project open and ready for integration. If you haven't installed the Eden AI plugin, follow these steps: open your Unity Package Manager and add the package from GitHub.

  6. Speech to Text with OpenAI Whisper in Unity!

    Learn how to use OpenAI Whisper, a powerful speech recognition tool, in your Unity projects with this tutorial video.

  7. Speech Recognition in Unity Tutorial

    This article serves as a comprehensive guide for adding on-device Speech Recognition to a Unity project. When used casually, Speech Recognition usually refers solely to Speech-to-Text. However, Speech-to-Text represents only a single facet of Speech Recognition technologies. It also refers to features such as Wake Word Detection, Voice Command Recognition, and Voice Activity Detection (VAD).

  8. Introducing the Unity Text-to-Speech Plugin from ReadSpeaker

    The Unity TTS plugin allows developers to draft scenes within the engine, tweaking lines and pacing to get the plan perfect before the recording studio's clock starts running. Instant audio narration for in-game text chat. Unity speech synthesis from ReadSpeaker AI renders audio instantly at runtime, through a speech engine embedded in the ...

  9. Using the Speech-to-Text API with C#

    Install the Google Cloud Speech-to-Text API client library for C#: create a simple C# console application to run the Speech-to-Text API samples, then add the client library NuGet package to the project. (A minimal recognition sketch in C# appears after this list.)

  10. Overtone

    Overtone is an offline Text-to-Speech asset for Unity. Enrich your game with 15+ languages, 900+ English voices, rapid performance, and cross-platform support. Welcome to the Overtone documentation! In this section, we'll walk you through the initial steps to start using the tools.

  11. free speech to text running on your machine

    Several months ago OpenAI released a powerful automatic speech recognition (ASR) model called Whisper. Code and weights are under the MIT license. I used another open source implementation called whisper.cpp and moved it to Unity. Main features:

  12. Watson Unity SDK Getting Started

    Watson Unity SDK Getting Started - Speech to Text (GitHub Gist). Note: breaking changes were made to the SDK, so the gist is currently not working with the most recent version in the Asset Store.

  13. (Free) Runtime Text To Speech Plugin « Unity Coding

    (Free) Runtime Text To Speech Plugin. An article by mgear. Needed a simple TTS plugin for a small Unity Windows game that I'm working on. I know the Asset Store has a few, but they seem to rely on Windows Speech platform voices, and one plugin that was completely standalone didn't ...

  14. [2212.08055] UnitY: Two-pass Direct Speech-to-speech Translation with

    Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword ...

  15. Unity-Speech-to-Text/GoogleCloudSpeech/GoogleCloudSpeech ...

    This plugin interfaces Windows streaming, Wit.ai non-streaming, Google streaming/non-streaming, and IBM Watson streaming/non-streaming speech-to-text. - corycorvus/Unity-Speech-to-Text

  16. Speech to text

    The Audio API provides two speech-to-text endpoints, transcriptions and translations, based on our state-of-the-art open source large-v2 Whisper model. They can be used to: transcribe audio into whatever language the audio is in, or translate and transcribe the audio into English.

  17. Introducing GPT-4o: OpenAI's new flagship multimodal model now in


  18. Benedictine College nuns denounce Harrison Butker's speech at ...

    Harrison Butker's commencement address denounced by Benedictine College nuns "Instead of promoting unity in our church, our nation, and the world, his comments seem to have fostered division," the ...

  19. Unity-Speech-to-Text/Assets/SpeechToText/Scripts ...

    This plugin interfaces Windows streaming, Wit.ai non-streaming, Google streaming/non-streaming, and IBM Watson streaming/non-streaming speech-to-text. - corycorvus/Unity-Speech-to-Text

  20. Microsoft Build 2024: Essential Guide for AI Developers at Startups and

    Microsoft Build 2024 starts May 21st. Register now to attend virtually. Microsoft Build, the annual conference where developers can dive deep into the latest Microsoft technologies, kicks off May 21st. Generative AI is rapidly evolving. Take, for example, the recent launch of OpenAI's GPT-4o.

  21. Speech And Text in Unity iOS and Unity Android

    Speech to text and text to speech in Unity iOS and Unity Android. I have provided all the Java and Objective-C source, so you can see how it works, optimize it, or add any features.

  22. Unlock a new era of innovation with Windows Copilot Runtime and

    I am excited to be back at Build with the developer community this year.. Over the last year, we have worked on reimagining Windows PCs and yesterday, we introduced the world to a new category of Windows PCs called Copilot+ PCs. Copilot+ PCs are the fastest, most intelligent Windows PCs ever with AI infused at every layer, starting with the world's most powerful PC Neural Processing Units ...

  23. Speech To Text in Unity iOS, using the Cloud Speech API

    Step 2: Replace your credentials key in GOOGLE_SPEECH_TO_TEXT_KEY and set the language you want (you need to switch to the Unity iOS platform first). Step 3: Build for iOS; you will get an Xcode project. Step 4: Add the Speex SDK to your project (in the "Package" folder): copy SpeexSDK into the Xcode project -> Add other framework -> Select SpeexSDK.

  24. Python OpenAI Whisper Speech to Text Transcription

    Follow the prompts to enter the file path of the audio file and choose the desired response format (text or vtt). Note: To access the OpenAI API, you will need an API key. Please refer to the OpenAI API documentation for instructions on how to obtain and use the API key.
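As a companion to entry 9 above (Using the Speech-to-Text API with C#), here is a minimal sketch of non-streaming recognition with the Google Cloud Speech-to-Text client library. It assumes the Google.Cloud.Speech.V1 NuGet package, configured Google Cloud credentials, and a placeholder local LINEAR16 audio file; see the official codelab for the full setup:

```csharp
using Google.Cloud.Speech.V1;

class Program
{
    static void Main()
    {
        // Requires the Google.Cloud.Speech.V1 package and application default credentials.
        var client = SpeechClient.Create();

        // Describe the audio you are sending; encoding, sample rate and language are examples.
        var config = new RecognitionConfig
        {
            Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
            SampleRateHertz = 16000,
            LanguageCode = "en-US"
        };

        // "audio.raw" is a placeholder path to a local audio file.
        var audio = RecognitionAudio.FromFile("audio.raw");

        // Synchronous (non-streaming) recognition; print every returned transcript.
        var response = client.Recognize(config, audio);
        foreach (var result in response.Results)
            foreach (var alternative in result.Alternatives)
                System.Console.WriteLine(alternative.Transcript);
    }
}
```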