Meet Gladia, a French AI startup that wants to change how companies interact with audio data. The company develops an audio transcription application programming interface (API) that you can integrate with other products and that is meant to work much better than what’s available on the market. And this tech foundation unlocks new use cases around audio.
If you’re familiar with audio transcription APIs, you know that big cloud providers already have their own APIs. There’s Google’s speech-to-text API, Amazon Transcribe, Microsoft’s Speech to Text, and so on. They work well, but they’re expensive, slow and don’t have a ton of features.
Gladia’s co-founder and CEO Jean-Louis Quéguiner, who was previously head of AI at OVHcloud and co-founded the company with Jonathan Soto, told me about some of the limitations of current APIs. According to him, there are three pain points with existing products. First, when it comes to prices, transcribing an hour of audio often costs $1.50 to $2.
Second, the output isn’t always very reliable, as some languages work well while others are barely supported. When it comes to advanced features, if people speak in multiple languages, chances are the API simply won’t be able to notice the language switch and transcribe the audio in more than one language.
Third, transcription APIs are slow. It can take more than 15 minutes to transcribe an hour of audio. That’s fine if you don’t need transcriptions right away, but it means that you won’t be able to use these APIs in some industries.
Whisper’s whisperer
Gladia is based on Whisper, OpenAI’s open source transcription model. “We started from Whisper. We haven’t reinvented the wheel, but we listened to our customers and they told us: ‘What I want is something that works as well as Whisper,’” Jean-Louis Quéguiner told me.
But Whisper isn’t perfect. The vanilla version is still quite slow, so Gladia has spent a lot of time turning Whisper into a fast and responsive transcription model. That’s not the only issue.
“Half of Whisper is GPT-2. You’ve seen LLMs and ChatGPT, it tends to hallucinate. We’ve done a lot of work to avoid hallucination problems too,” Quéguiner said.
Specifically, he told me that Whisper has been trained on closed captions that you can find on the internet, such as on YouTube. OpenAI’s model tends to hear common phrases that you can hear in online videos, such as “if you enjoyed this video, please like and subscribe.” There’s a mathematical overrepresentation of some sentences like this one, and Gladia tries to fix these shortcomings.
In addition to these changes to Whisper and its implementation, Gladia also has some pre-processing and post-processing algorithms that improve the end results.
Gladia promises that it can transcribe an hour of audio for $0.61. And the transcription process takes roughly 60 seconds. Its API can detect when there are multiple speakers, add timestamps, detect languages and switch from one language to another if needed. Gladia also automatically adds punctuation and casing.
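To put those numbers side by side, here is a quick back-of-the-envelope sketch. It only uses the figures quoted in this article ($1.50 to $2 per hour and more than 15 minutes for existing APIs, $0.61 and roughly 60 seconds for Gladia). The function names and the 1,000-hour volume are illustrative assumptions, not anything Gladia publishes.

```python
from typing import Tuple

# Figures quoted in the article; everything else is illustrative.
INCUMBENT_PRICE_PER_HOUR = (1.50, 2.00)  # USD range cited for existing APIs
GLADIA_PRICE_PER_HOUR = 0.61             # USD, Gladia's advertised price
INCUMBENT_SECONDS_PER_HOUR = 15 * 60     # "more than 15 minutes" per hour of audio
GLADIA_SECONDS_PER_HOUR = 60             # "roughly 60 seconds" per hour of audio


def transcription_cost(hours_of_audio: float, price_per_hour: float) -> float:
    """Return the bill in USD for a given volume of audio."""
    return round(hours_of_audio * price_per_hour, 2)


def savings_range(hours_of_audio: float) -> Tuple[float, float]:
    """Return (low, high) savings in USD versus the cited incumbent price range."""
    low_price, high_price = INCUMBENT_PRICE_PER_HOUR
    gladia_bill = transcription_cost(hours_of_audio, GLADIA_PRICE_PER_HOUR)
    return (
        round(transcription_cost(hours_of_audio, low_price) - gladia_bill, 2),
        round(transcription_cost(hours_of_audio, high_price) - gladia_bill, 2),
    )


# Lower bound on the speedup, since the incumbent figure is "more than" 15 minutes.
minimum_speedup = INCUMBENT_SECONDS_PER_HOUR / GLADIA_SECONDS_PER_HOUR

if __name__ == "__main__":
    hours = 1000  # hypothetical monthly volume
    print("Gladia bill:", transcription_cost(hours, GLADIA_PRICE_PER_HOUR))
    print("Savings range:", savings_range(hours))
    print("Minimum speedup:", minimum_speedup)
```

At a hypothetical 1,000 hours of audio, the advertised prices work out to a $610 bill instead of $1,500 to $2,000, and the quoted processing times imply at least a 15x speedup.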
Like most APIs, the end result is in JSON format. But Gladia also supports SRT and VTT files for companies that want to generate subtitles.
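For readers who haven’t worked with subtitle formats, here is a minimal sketch of how timestamped transcript segments map onto SRT cues. The segment shape used here (a list of objects with `start`, `end` and `text` keys, times in seconds) is an assumption made for illustration and is not Gladia’s actual response schema. The helper names are hypothetical too.

```python
from typing import Dict, List, Union

# Hypothetical segment shape: {"start": seconds, "end": seconds, "text": str}
Segment = Dict[str, Union[float, str]]


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: List[Segment]) -> str:
    """Turn timestamped segments into SRT text, one numbered cue per segment."""
    cues = []
    for index, segment in enumerate(segments, start=1):
        start = srt_timestamp(float(segment["start"]))
        end = srt_timestamp(float(segment["end"]))
        text = str(segment["text"]).strip()
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.4, "text": "Meet Gladia."},
        {"start": 2.4, "end": 5.0, "text": "A French AI startup."},
    ]
    print(segments_to_srt(demo))
```

VTT is very close to this: the file starts with a `WEBVTT` header line and timestamps use a period instead of a comma before the milliseconds.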
I created an account and uploaded an audio recording of an interview to see how Gladia works. It took a bit more time than expected, but it was definitely much faster than Google’s or Azure’s speech-to-text APIs.
The result wasn’t flawless, but it was extremely good: it understood acronyms and technical terms. I opened the same audio file in Aiko, a Mac app developed by Sindre Sorhus that lets you transcribe audio files locally using Whisper. As expected, the output was close to Gladia’s output, but Gladia was much faster than running Aiko on my MacBook Pro.
Overall, Gladia was the best transcription API I’ve ever used.
Becoming an audio intelligence API
The company currently works with call center companies, virtual meeting services, and video publishers, including Claap, Livestorm and Selectra.
Gladia raised a $4 million seed round led by New Wave. Other investors include Sequoia, Cocoa and business angels, such as Solomon Hykes, Pierre Betouin, Miroslaw Klaba and Alexandre Berriche.
Having a rock-solid transcription API is just the first step for Gladia. The company hopes that it can then build features on top of this strong technical foundation.
For instance, after an audio file has been transcribed, Gladia can translate the text into another language. Combined with word-level timestamps, it means that a company can upload an audio file and get subtitles in dozens of languages in just a few minutes.
In the future, the company hopes that it can summarize the content of an audio file, categorize content into multiple topic categories, create chapters automatically, conduct sentiment analysis and more.
“Our longer-term vision is to move from 2D to 3D data. Audio is pretty flat, and the idea is to augment it with intelligence,” Quéguiner said. “We think that transcription will become a commodity. But we think that what’s going to matter more is the options we’re going to add.”