Why video and images present such a challenge to enterprise AI teams
Much of the world's text-based content has already been used to train large language models. In fact, fresh text is in short supply because LLMs keep consuming it, yet a huge pool of unstructured data – sound, video and images – remains largely untapped. Why aren't people tapping into it more?
At a recent HumanX panel, I spoke with three leaders – Cody Coleman, co-founder and CEO of Coactive AI; Soyoung Lee, co-founder and head of GTM at Twelve Labs; and Alex Ratner, co-founder and CEO of Snorkel AI – about the challenges involved in processing large volumes of unstructured data.
Coleman, whose company helps customers search and analyze unstructured data without tagging, says that while models thrive on large amounts of data, the volume here is astronomical compared with text, and it's simply hard to process it all. "When we think about actually being able to process that massive amount of unstructured data and the oceans of unstructured data that are out there, you need fundamentally a different set of tools to be able to deal with that messiness and to be able to pull information out of it," he said.
He offered a neat comparison to illustrate the magnitude of the volume: 10 million rows of structured data runs about 40 megabytes; 10 million text documents jumps to 40 gigabytes; and 10 million images brings you up to 20 terabytes. He likened the jump to going from the surface area of Lake Tahoe to the Caspian Sea to the Pacific Ocean – and that's before you even factor in video.
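For a rough sense of the arithmetic behind those figures, here is a back-of-envelope sketch. The per-item byte counts are illustrative assumptions implied by Coleman's totals, not measurements he cited:

```python
def human(nbytes: float) -> str:
    """Format a byte count in the nearest convenient unit."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if nbytes < 1000:
            return f"{nbytes:.0f} {unit}"
        nbytes /= 1000
    return f"{nbytes:.0f} PB"

ITEMS = 10_000_000

# Assumed average size per item, in bytes (order-of-magnitude guesses only).
per_item_bytes = {
    "structured rows": 4,        # -> ~40 MB total
    "text documents": 4_000,     # -> ~40 GB total
    "images": 2_000_000,         # -> ~20 TB total
}

for kind, size in per_item_bytes.items():
    print(f"10M {kind:>15} ≈ {human(ITEMS * size)}")
```

Each step is roughly a thousandfold increase in raw bytes, which is why tooling built for tables and text strains at image and video scale.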
Lee, whose company builds AI models tuned for unstructured data, says that even though the world holds enormous amounts of video, building models that can actually understand it is far more complex.
"Video is a uniquely human concept, where it's so similar to the real world. It has sound. It has visuals. It has language. You have to understand how things change over time and connect everything to understand the narrative," Lee explained. And it's easy for a model to get confused. Does crying mean someone's happy or sad? As humans we can usually interpret the subtle differences in human emotions. For models, it's much more difficult to gauge this kind of sentiment.
She says while the challenges are real, companies like hers are working to solve the problems related to processing large amounts of video. "And so the ability for the models to process data with a level of intelligence across vast amounts of information is a capability that is now becoming a lot more meaningful and production-ready today," she said.
But before models can work effectively, the data itself needs preparation.
Finding new ways to process this data
Ratner, whose company builds data environments for evaluating and tuning AI, says the challenge is getting all that data ready for a model to understand and process. "Collecting that data, storing that data, and ultimately labeling it, curating it, refining it so that it can actually be something that is fuel for modern AI models is one of the most important parts of the supply chain for AI today," he said.
It's this difference the three leaders describe that makes it so difficult to rely on a general-purpose model. Coleman says a frontier model may get you 60% of the way there, but that's not enough for critical enterprise use cases. "You can't be kind of a blunt object and throw a model at everything. It'll be too expensive and too slow for that type of budget and for that type of turnaround, especially as you think of scale," he said.
That calls for specialized models that understand the content, because more volume means more processing time – and that can lead to big AI bills that undermine the very reason you wanted to use AI in the first place. As Lee says, "Not every frame of video is meaningful. And so if you're trying to run compute on every frame of video, you can imagine how that scales across all of the video that exists within an enterprise, owned by your users, or in the whole world." In other words, you need a model that knows what's important to keep and what's OK to leave out; a simple version of that idea is sketched below.
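To make that concrete, here is a minimal sketch of one common way to skip redundant frames: sample only frames that differ meaningfully from the last one kept, and send just those to the expensive model. It uses OpenCV; the threshold and stride values are illustrative assumptions, not anything the panelists described.

```python
import cv2
import numpy as np

def sample_key_frames(path: str, diff_threshold: float = 30.0, stride: int = 5):
    """Yield frames that differ meaningfully from the last kept frame.

    A naive illustration of selective processing: read every `stride`-th frame,
    compare it to the previously kept frame, and keep it only if the mean
    pixel difference exceeds `diff_threshold`.
    """
    cap = cv2.VideoCapture(path)
    last_kept = None
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if last_kept is None or np.mean(cv2.absdiff(gray, last_kept)) > diff_threshold:
                last_kept = gray
                yield index, frame  # only these frames reach the costly model
        index += 1
    cap.release()

# Usage (hypothetical downstream call):
# for idx, frame in sample_key_frames("meeting_recording.mp4"):
#     run_model(frame)
```

Production systems are far more sophisticated, but even this crude filter shows how much compute can be avoided by not treating every frame as equally important.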
Models with that kind of understanding could save you valuable processing time and plenty of cash. As Ratner puts it, "The better you can scope what you want to do with your use case, either as a vendor working with your customer or as the end user, the more you can get a small model with the right data, the right tuning, to excel at that target criteria."
At the end of the day, it is still about using the right tools for the job. Standard models can get you part of the way there, but when it comes to unstructured data at enterprise scale, they are too slow, too expensive, and not accurate enough to trust on their own.