Speech-to-text (STT) systems, also known as automatic speech recognition (ASR) systems, transform spoken words into text that can be used in a variety of ways.
This technology has many applications, including voice-activated devices, transcription services, and accessibility tools such as captions for people who are deaf or hard of hearing.
What is Speech-to-Text?
Speech-to-Text (STT) technology turns spoken audio into written text. It is also called Automatic Speech Recognition (ASR) or computer speech recognition. It is based on two components: acoustic modeling, which maps the audio signal to speech sounds, and language modeling, which estimates which word sequences are likely.
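As a toy illustration of how acoustic and language-model scores can be combined, the sketch below picks the candidate transcript with the best total log-score. The candidate texts, their scores, and the `lm_weight` parameter are all invented for this example; real recognizers search over a far larger set of hypotheses.

```python
# Toy example: choose the transcript that maximizes
#   acoustic log-score + lm_weight * language-model log-score.
# Every number below is made up for illustration.

def pick_transcript(candidates, lm_weight=1.0):
    """Return the text of the candidate with the highest combined score.

    candidates: list of (text, acoustic_logprob, lm_logprob) tuples.
    """
    def combined(candidate):
        _, acoustic, lm = candidate
        return acoustic + lm_weight * lm

    return max(candidates, key=combined)[0]


candidates = [
    # (text, log P(audio | text), log P(text))
    ("recognize speech", -12.0, -4.0),
    ("wreck a nice beach", -11.5, -9.0),
]

print(pick_transcript(candidates))
```

With `lm_weight=0.0` the acoustically closer but less plausible phrase would win, which is why the language model matters.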
There are several commercial APIs (most with a free tier) and open-source toolkits available for speech-to-text (STT) conversion. Here are some popular options:
As Google is essentially the backbone of the Internet at this point, it's no surprise that its Speech-to-Text API is one of the most popular, and most powerful, APIs available.
Google gives users 60 minutes of free transcription, along with $300 in free credits for Google Cloud hosting.
Pros:
- Multiple machine learning models for increased accuracy
- Automatic language recognition
- Proper noun recognition
- Noise cancellation for audio from phone calls and video
Cons:
- It's expensive
- Limited custom vocabulary builder
- Poor accuracy on business audio with lots of terminology
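For a sense of what calling the Google API involves, here is a minimal sketch using the `google-cloud-speech` Python client. It assumes the package is installed, Google Cloud credentials are already configured, and the input is a short 16 kHz LINEAR16 WAV file. The helper names `build_recognition_settings` and `transcribe_file` are my own, not part of the library.

```python
def build_recognition_settings(language_code="en-US", sample_rate_hertz=16000,
                               phrases=None):
    """Collect recognition options in a plain dict (hypothetical helper)."""
    return {
        "language_code": language_code,
        "sample_rate_hertz": sample_rate_hertz,
        # Optional hint phrases, e.g. product names the audio is likely to contain.
        "phrases": list(phrases or []),
    }


def transcribe_file(path, settings):
    """Send one short audio file for synchronous recognition."""
    # Imported lazily so the settings helper works without the package.
    from google.cloud import speech  # pip install google-cloud-speech

    client = speech.SpeechClient()
    with open(path, "rb") as audio_file:
        audio = speech.RecognitionAudio(content=audio_file.read())

    contexts = []
    if settings["phrases"]:
        contexts.append(speech.SpeechContext(phrases=settings["phrases"]))

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=settings["sample_rate_hertz"],
        language_code=settings["language_code"],
        speech_contexts=contexts,
    )
    response = client.recognize(config=config, audio=audio)
    # Keep the top alternative for each recognized segment.
    return [result.alternatives[0].transcript for result in response.results]
```

Synchronous recognition is meant for short clips; longer recordings would need the long-running or streaming variants of the API.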
The Amazon Transcribe product was developed from the Alexa voice assistant. For short audio, Transcribe's command-and-response transcription is excellent. In terms of accuracy, it is on the higher end of ASR providers for consumer audio data, but it is not as good with business audio.
Amazon Transcribe offers one hour free per month for the first 12 months of use.
Pros:
- Well-known brand name
- Easy to integrate if you are already in the AWS ecosystem
- Consumer audio accuracy is fairly good
- Scales well, although costs grow with usage
Cons:
- A limited number of support options
- Cloud deployment only
- High cost
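If you are already on AWS, starting a transcription job from Python could look roughly like the sketch below. It assumes `boto3` is installed, AWS credentials are configured, and the audio already sits in an S3 bucket you control. The helper names, job name, and bucket URI are placeholders of my own.

```python
def build_job_request(job_name, media_uri, media_format="wav",
                      language_code="en-US"):
    """Build keyword arguments for start_transcription_job (hypothetical helper)."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": media_format,
        "LanguageCode": language_code,
    }


def start_job(request, region_name="us-east-1"):
    """Start the job and return its current status string."""
    # Imported lazily so the request builder works without boto3.
    import boto3

    client = boto3.client("transcribe", region_name=region_name)
    client.start_transcription_job(**request)
    job = client.get_transcription_job(
        TranscriptionJobName=request["TranscriptionJobName"]
    )
    return job["TranscriptionJob"]["TranscriptionJobStatus"]


# Placeholder values; replace with your own job name and S3 object.
request = build_job_request("demo-job", "s3://example-bucket/call.wav")
```

Jobs run asynchronously, so in practice you would poll `get_transcription_job` until the status is `COMPLETED` or `FAILED`.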
The Speech-to-Text APIs from AssemblyAI convert audio files and video streams into text automatically and help you understand their content. Speech-to-text in AssemblyAI is powered by the latest AI models, and its Audio Intelligence features detect topics, moderate content, and summarize it.
The company offers several free transcription hours for audio files or video streams per month before transitioning to an affordable paid tier.
Pros:
- High accuracy for non-technical US English
- Low cost
Cons:
- Limited customization
- Struggles with heavy terminology, jargon, and accents
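AssemblyAI is used over plain HTTPS, so a request can be sketched with nothing but the Python standard library. The example below assumes the v2 `transcript` endpoint, an API key of your own, and a publicly reachable audio URL. The function names are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://api.assemblyai.com/v2"


def build_transcript_request(api_key, audio_url):
    """Prepare (but do not send) the POST that queues a transcription."""
    body = json.dumps({"audio_url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/transcript",
        data=body,
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


def submit(api_key, audio_url):
    """Send the request and return the transcript id for later polling."""
    request = build_transcript_request(api_key, audio_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]
```

The returned id would then be polled at `/transcript/<id>` until the job reports that it has completed or failed.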
Speechmatics provides automatic transcription services using a cloud-based API. A major feature is its ability to process pre-recorded files offline (in batch), and it supports a wide range of file formats.
Speechmatics is regarded as one of the fastest and most reliable APIs for automatic transcription. As well as supporting nine languages, it also supports different variants of English, including British and Australian English.
Pros:
- Easily integrated via REST API
- Multiple file formats supported
- Multi-speaker support
- Works well with noisy audio
Cons:
- No app interface
- Every request is charged
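A batch job against the Speechmatics REST API might be submitted along the lines of the sketch below. Treat the endpoint URL, the form field names, and the config keys as assumptions based on my reading of the v2 batch API, and check them against the current documentation. The helper names are my own, and the upload step relies on the third-party `requests` package.

```python
import json


def build_job_config(language="en", diarize_speakers=False):
    """Build the JSON config for a transcription job (hypothetical helper)."""
    transcription_config = {"language": language}
    if diarize_speakers:
        # Assumed setting for labelling multiple speakers.
        transcription_config["diarization"] = "speaker"
    return {"type": "transcription", "transcription_config": transcription_config}


def submit_file(path, api_key, config,
                url="https://asr.api.speechmatics.com/v2/jobs/"):
    """Upload one audio file with its config and return the job id."""
    # Imported lazily so the config builder works without requests.
    import requests  # pip install requests

    with open(path, "rb") as audio_file:
        response = requests.post(
            url,
            headers={"Authorization": f"Bearer {api_key}"},
            data={"config": json.dumps(config)},
            files={"data_file": audio_file},
        )
    response.raise_for_status()
    return response.json()["id"]
```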
Microsoft Azure Speech Services uses deep learning models to recognize speech. It supports multiple languages and offers a free tier that allows 5 hours of use per month. Microsoft's clients include LG, KPMG, and General Electric.
Pros:
- Good choice for short command-and-response audio
Cons:
- No real-time streaming
- Scales well technically, but costs climb with volume
- Limited customization available
- Poor accuracy with business audio or audio with lots of terminology
Kaldi is an open-source speech recognition toolkit that supports various STT tasks. It provides pre-built models, scripts, and tools for training and evaluating speech recognition systems.
The Kaldi website also offers extensive documentation, including on deep neural networks. The code is mainly written in C++, but it's "wrapped" by Bash and Python scripts.
Cons:
- The architecture can result in slow processing speeds
- Requires a lot of model training of your own before it is usable
The Wav2Letter toolkit is an Automatic Speech Recognition (ASR) tool written in C++ and built on the ArrayFire tensor library.
Like Mozilla's DeepSpeech, Wav2Letter is an open-source library that is fairly accurate and easy to use.
Pros:
- Language independence
- End-to-end system
Cons:
- Complex setup
- Lack of a language model
- Struggles in noisy environments
Performance, accuracy, and specific features vary among these options. Consider your requirements, available resources, and integration preferences before selecting one.
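One practical way to compare accuracy across these options is to run the same audio through each and compute the word error rate (WER) against a human-made reference transcript. The function below is a small self-contained sketch of that metric; the name `word_error_rate` and the simple lowercase-and-split normalization are my own choices, and production evaluations usually normalize punctuation and numbers more carefully.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

A lower WER is better, and it is worth measuring on audio that resembles your own use case, since several of the options above behave differently on consumer and business audio.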
I hope you enjoyed this overview. Get in touch with Revaalo labs if you need help with Speech-to-Text APIs for your platform.