
Twelve Labs is building models that can understand videos at a deep level

by WeeklyAINews

Text-generating AI is one thing. But AI models that understand images as well as text can unlock powerful new applications.

Take, for example, Twelve Labs. The San Francisco-based startup trains AI models to, as co-founder and CEO Jae Lee puts it, “solve complex video-language alignment problems.”

“Twelve Labs was founded … to create an infrastructure for multimodal video understanding, with the first endeavor being semantic search — or ‘CTRL+F for videos,’” Lee told TechCrunch in an email interview. “The vision of Twelve Labs is to help developers build programs that can see, listen and understand the world as we do.”

Twelve Labs’ models attempt to map natural language to what’s happening inside a video, including actions, objects and background sounds, allowing developers to create apps that can search through videos, classify scenes and extract topics from within those videos, automatically summarize and split video clips into chapters, and more.
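To make the “CTRL+F for videos” idea concrete, here is a minimal sketch of what a semantic search call against a video-understanding API of this kind might look like. The endpoint, index name, fields and response shape are illustrative assumptions, not Twelve Labs’ actual API:

```python
import requests

# Hypothetical endpoint and credentials -- illustrative only,
# not Twelve Labs' documented API surface.
API_URL = "https://api.example-video-search.com/v1/search"
API_KEY = "your-api-key"

# Ask the index for moments matching a natural-language query:
# the model maps the text onto actions, objects and background
# sounds inside previously indexed footage.
response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "index_id": "my-video-index",           # assumed: a pre-built index
        "query": "a chef tossing a salad",      # free-form natural language
        "search_options": ["visual", "audio"],  # assumed: modalities to match
        "page_limit": 5,
    },
)

for clip in response.json().get("results", []):
    # Each hit is assumed to carry the source video and a time span.
    print(clip["video_id"], clip["start_sec"], clip["end_sec"], clip["score"])
```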

Lee says that Twelve Labs’ technology can drive things like ad insertion and content moderation (for instance, figuring out which videos showing knives are violent versus instructional). It can also be used for media analytics, Lee added, and to automatically generate highlight reels, or blog post headlines and tags, from videos.

I asked Lee about the potential for bias in these models, given that it’s well-established science that models amplify the biases in the data on which they’re trained. For example, training a video understanding model mostly on clips of local news, which often spends a lot of time covering crime in a sensationalized, racialized way, could cause the model to learn racist as well as sexist patterns.


Lee says that Twelve Labs strives to meet internal bias and “fairness” metrics for its models before releasing them, and that the company plans to release model-ethics-related benchmarks and data sets in the future. But he had nothing to share beyond that.

Mockup of an API for fine-tuning the model to work better with salad-related content.

“In terms of how our product is different from large language models [like ChatGPT], ours is specifically trained and built to process and understand video, holistically integrating visual, audio and speech components within videos,” Lee said. “We have really pushed the technical limits of what’s possible for video understanding.”

Google is developing a similar multimodal model for video understanding called MUM, which the company is using to power video recommendations across Google Search and YouTube. Beyond MUM, Google, as well as Microsoft and Amazon, offer API-level, AI-powered services that recognize objects, places and actions in videos and extract rich metadata at the frame level.

But Lee argues that Twelve Labs is differentiated both by the quality of its models and by the platform’s fine-tuning features, which allow customers to fine-tune the platform’s models with their own data for “domain-specific” video analysis.
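The article doesn’t detail how that fine-tuning is exposed, but the mockup above suggests a job-based flow. Here is a hedged sketch under that assumption; every endpoint, model identifier and field name below is a guess for illustration:

```python
import requests

API_BASE = "https://api.example-video-search.com/v1"  # hypothetical base URL
headers = {"Authorization": "Bearer your-api-key"}

# Assumed shape: upload labeled, domain-specific clips and start a
# tuning job so the base model adapts to a vertical -- here, the
# salad-related content from the mockup caption above.
job = requests.post(
    f"{API_BASE}/fine-tunes",                 # assumed endpoint
    headers=headers,
    json={
        "base_model": "video-foundation-v1",  # assumed model identifier
        "training_data": [
            {"video_url": "https://example.com/clips/caesar-salad.mp4",
             "labels": ["salad", "chopping", "dressing"]},
            {"video_url": "https://example.com/clips/greek-salad.mp4",
             "labels": ["salad", "feta", "olives"]},
        ],
    },
).json()

# Poll or log the job; field names are assumptions.
print("fine-tune job:", job.get("id"), job.get("status"))
```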

On the model front, Twelve Labs is today unveiling Pegasus-1, a new multimodal model that understands a range of prompts related to whole-video analysis. For example, Pegasus-1 can be prompted to generate a long, descriptive report about a video or just a few highlights with timestamps.
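As a rough illustration of that prompting style, the snippet below poses the two whole-video prompts the article mentions. Again, the endpoint and fields are assumptions, not Pegasus-1’s published interface:

```python
import requests

API_BASE = "https://api.example-video-search.com/v1"  # hypothetical base URL
headers = {"Authorization": "Bearer your-api-key"}

# Two whole-video prompts of the kind described above: a long
# descriptive report, and a handful of timestamped highlights.
for prompt in (
    "Write a detailed, descriptive report of this video.",
    "List the three most important moments, each with a timestamp.",
):
    out = requests.post(
        f"{API_BASE}/generate",   # assumed endpoint name
        headers=headers,
        json={"video_id": "abc123", "prompt": prompt},  # assumed fields
    ).json()
    print(prompt, "->", out.get("text"))
```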

“Enterprise organizations recognize the potential of leveraging their vast video data for new business opportunities … However, the limited and simplistic capabilities of conventional video AI models often fall short of catering to the intricate understanding required for most enterprise use cases,” Lee said. “Leveraging powerful multimodal video understanding foundation models, enterprise organizations can attain human-level video comprehension without manual analysis.”


Since launching in private beta in early May, Twelve Labs’ user base has grown to 17,000 developers, Lee claims. And the company is now working with a number of companies (it’s unclear how many; Lee wouldn’t say) across industries including sports, media and entertainment, e-learning and security, among them the NFL.

Twelve Labs is also continuing to raise money, a vital part of any startup business. Today, the company announced that it closed a $10 million strategic funding round from Nvidia, Intel and Samsung Next, bringing its total raised to $27 million.

“This new funding is all about strategic partners that can accelerate our company in research (compute), product and distribution,” Lee said. “It’s fuel for ongoing innovation, based on our lab’s research, in the field of video understanding so that we can continue to bring the most powerful models to customers, whatever their use cases may be … We are moving the industry forward in ways that free companies up to do incredible things.”

