Last week, Meta Platforms’ artificial intelligence research arm introduced Voicebox, a machine learning model that can generate speech from text. What sets Voicebox apart from other text-to-speech models is its ability to perform many tasks that it has not been trained for, including editing, noise removal, and style transfer.
The model was trained using a special method developed by Meta researchers. While Meta has not released Voicebox due to ethical concerns about misuse, the initial results are promising and could power many applications in the future.
‘Flow Matching’
Voicebox is a generative model that can synthesize speech in six languages: English, French, Spanish, German, Polish, and Portuguese. Like large language models, it has been trained on a very general task that can be used for many applications. But while LLMs try to learn the statistical regularities of words and text sequences, Voicebox has been trained to learn the patterns that map voice audio samples to their transcripts.
Such a model can then be applied to many downstream tasks with little or no fine-tuning. “The goal is to build a single model that can perform many text-guided speech generation tasks through in-context learning,” Meta’s researchers write in their paper (PDF) describing the technical details of Voicebox.
The model was trained using Meta’s “Flow Matching” technique, which is more efficient and generalizable than the diffusion-based learning methods used in other generative models. The technique enables Voicebox to “learn from varied speech data without those variations having to be carefully labeled.” Without the need for manual labeling, the researchers were able to train Voicebox on 50,000 hours of speech and transcripts from audiobooks.
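To give a sense of what a flow-matching objective looks like, here is a minimal sketch in plain Python. It assumes the straight-line ("optimal transport") formulation commonly used in the flow matching literature: a noise sample is interpolated toward a data sample, and the model is trained to predict the velocity along that path. The function names, the `sigma_min` value, and the toy zero-velocity "model" are illustrative assumptions, not Meta's code.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant; not taken from the article


def interpolate(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point on the straight path from noise x0 (t=0) toward data x1 (t=1)."""
    return [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1, sigma_min=SIGMA_MIN):
    """Time derivative of that path; this is what the model learns to predict."""
    return [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                         # random time in [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


# Toy usage: a hypothetical "model" that always predicts zero velocity.
rng = random.Random(0)
data_sample = [0.5, -1.0, 2.0]
loss = flow_matching_loss(lambda x_t, t: [0.0] * len(x_t), data_sample, rng)
```

In a real system `predict_velocity` would be a neural network conditioned on the transcript and the surrounding audio, and the loss would be averaged over many samples and minimized by gradient descent.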
The model uses “text-guided speech infilling” as its training goal, which means it must predict a segment of speech given its surrounding audio and the complete text transcript. In practice, during training the model is provided with an audio sample and its corresponding text. Parts of the audio are then masked, and the model tries to generate the masked part using the surrounding audio and the transcript as context. By doing this over and over, the model learns to generate natural-sounding speech from text in a generalizable way.
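The masking step in this training setup can be sketched as follows. This is a minimal illustration in plain Python, assuming audio is represented as a list of per-frame feature vectors. The helper names (`make_infilling_example`, `masked_loss`) and the convention of zeroing out masked frames are hypothetical, not details from the paper.

```python
import random


def make_infilling_example(frames, transcript, span_len, rng):
    """Hide one contiguous span of frames; keep the full transcript as context."""
    if not 0 < span_len <= len(frames):
        raise ValueError("span_len must be between 1 and len(frames)")
    start = rng.randrange(len(frames) - span_len + 1)
    mask = [start <= i < start + span_len for i in range(len(frames))]
    # Masked frames are zeroed out here (an assumed convention).
    context = [[0.0] * len(f) if m else list(f) for f, m in zip(frames, mask)]
    return {"context": context, "mask": mask,
            "transcript": transcript, "target": frames}


def masked_loss(predicted, target, mask):
    """Mean squared error computed over the masked frames only."""
    errors = [
        (p - q) ** 2
        for pf, tf, m in zip(predicted, target, mask) if m
        for p, q in zip(pf, tf)
    ]
    return sum(errors) / len(errors)


rng = random.Random(0)
frames = [[float(i), float(i) + 0.5] for i in range(10)]  # 10 toy frames
example = make_infilling_example(frames, "hello world", span_len=4, rng=rng)
# A "model" that just echoes the zeroed context gets a nonzero loss on the gap.
baseline = masked_loss(example["context"], example["target"], example["mask"])
```

Scoring only the masked frames means the model is rewarded for filling the gap correctly, not for copying the audio it was already given.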
Replicating voices across languages, editing out errors in speech, and more
Unlike generative models that are trained for a specific application, Voicebox can perform many tasks that it has not been trained for. For example, the model can use a two-second voice sample to generate speech for new text. Meta says this capability can be used to bring speech to people who are unable to speak, or to customize the voices of non-playable game characters and virtual assistants.
Voicebox also performs style transfer in different ways. For example, you can provide the model with two audio and text samples. It will use the first audio sample as a style reference and modify the second to match the voice and tone of the reference. Interestingly, the model can do the same thing across different languages, which could be used to “help people communicate in a natural, authentic way” even if they don’t speak the same languages.
The model can also do a variety of editing tasks. For example, if a dog barks in the background while you’re recording your voice, you can provide the audio and transcript to Voicebox and mask out the segment with the background noise. The model will use the transcript to generate the missing portion of the audio without the background noise.
The same technique can be used to edit speech. For example, if you have misspoken a word, you can mask that portion of the audio sample and pass it to Voicebox along with a transcript of the edited text. The model will generate the missing part with the new text in a way that matches the surrounding voice and tone.
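Both the noise-removal and word-replacement cases come down to choosing which stretch of audio to regenerate. As a rough sketch, assuming 10-millisecond frames and a hypothetical request format (Voicebox has not been released, so there is no real API to call), the mask for an edit could be built like this:

```python
def span_to_frame_mask(num_frames, start_ms, end_ms, frame_ms=10):
    """Boolean mask marking frames that overlap [start_ms, end_ms).

    frame_ms=10 is an assumed frame length, chosen for illustration.
    """
    if not 0 <= start_ms < end_ms:
        raise ValueError("need 0 <= start_ms < end_ms")
    first = start_ms // frame_ms
    last = min(num_frames, -(-end_ms // frame_ms))  # ceiling division
    return [first <= i < last for i in range(num_frames)]


def build_edit_request(num_frames, start_ms, end_ms, edited_transcript):
    """Bundle the mask with the corrected transcript (hypothetical structure)."""
    return {
        "mask": span_to_frame_mask(num_frames, start_ms, end_ms),
        "transcript": edited_transcript,
    }


# A 3-second clip (300 frames) where the audio between 1.2 s and 1.6 s is redone.
request = build_edit_request(300, 1200, 1600, "the corrected sentence")
masked_frames = sum(request["mask"])
```

For noise removal the transcript stays the same and only the mask changes. For a misspoken word, the transcript carries the corrected text for the masked stretch.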
One of the interesting applications of Voicebox is voice sampling. The model can generate various speech samples from a single text sequence. This capability can be used to generate synthetic data to train other speech processing models. “Our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech, with 1 percent error rate degradation as opposed to 45 to 70 percent degradation with synthetic speech from previous text-to-speech models,” Meta writes.
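The degradation figures in that quote compare a recognizer's error rate when trained on synthetic speech with its error rate when trained on real speech. The sketch below shows one way such a comparison could be computed. It assumes word error rate (WER) as the metric and a relative, not absolute, degradation; the article specifies neither, and the error rates used are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_degradation(wer_real, wer_synthetic):
    """Percent increase in error rate when training on synthetic speech."""
    return 100.0 * (wer_synthetic - wer_real) / wer_real


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
# Made-up error rates, for illustration only.
degradation = relative_degradation(wer_real=0.10, wer_synthetic=0.101)
```

Under these assumptions, a recognizer whose error rate moves from 10 percent to 10.1 percent has degraded by about 1 percent.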
Voicebox has limits too. Since it has been trained on audiobook data, it does not transfer well to conversational speech that is casual and contains non-verbal sounds. It also doesn’t provide full control over different attributes of the generated speech, such as voice style, tone, emotion, and acoustic condition. The Meta research team is exploring techniques to overcome these limitations in the future.
Model not released
There is growing concern about the threats of AI-generated content. For example, cybercriminals recently tried to scam a woman by calling her and using an AI-generated voice to impersonate her grandson. Advanced speech synthesis systems such as Voicebox could be used for similar purposes or other nefarious deeds, such as creating fake evidence or manipulating real audio.
“As with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm,” Meta wrote on its AI blog. Due to these concerns, Meta did not release the model but provided details on the architecture and training process in the technical paper. The paper also contains details about a classifier model that can detect speech and audio generated by Voicebox, to mitigate the risks of using the model.