AI projects are not cheap

One of the promises behind the generative AI hype is that it will help companies activate the mountains of unstructured data piling up across the enterprise. But that’s easier said than done: unstructured data is growing faster than IT budgets, driven by the spread of cloud apps and lax governance. Users tend to store everything in files, objects, and virtual disks by default, just in case.

Komprise’s Unstructured Data Management Report 2024 found that optimizing storage costs was a higher priority for IT decision makers than developing artificial intelligence. Komprise COO Krishna Subramanian says she was surprised:

We thought AI would rank higher than cost optimization, but actually cost optimization was ranked highest, and I think that’s because most of the customers we talk to are still trying to understand AI. They probably do a little bit of AI, but they’re more interested in getting their data ready for AI, and 70% of them don’t have budget for that. So they have to fund the AI projects by getting the money from somewhere else. If they want to do an AI project, they’ll probably say, “Let’s increase storage efficiency or increase compute power, and we’ll take that money and use it for AI.”

Komprise was founded in 2015 to develop tools to help automate the management and governance of unstructured data. A key component is a metadata tagging process, which essentially creates a semantic layer to track what data exists, how it was generated, what it means, and how it is used. This adds a bit of structure to guide various automated processes.

Subramanian says one of the key ways organizations can reduce their storage and backup costs is by improving their data tiering processes. Tiering lets teams automatically move less-used data to cold storage, which can cut storage costs by up to 70%. At the same time, Komprise’s tools leave a virtual link in place so that the moved data still appears in the same file system or structure as the more actively used data, across both file systems and object storage.
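To make the tiering idea concrete, here is a minimal sketch of how a policy might pick out cold data by last-access time and estimate how much of the total footprint would move. It is not Komprise’s implementation: the FileRecord type, the function names, and the one-year idle threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    # Hypothetical metadata record; field names are illustrative only.
    path: str
    size_bytes: int
    last_accessed: datetime


def select_cold_files(records, now, max_idle_days=365):
    """Return the records that have not been accessed within max_idle_days.

    The one-year default is an assumed threshold, not a vendor setting.
    """
    cutoff = now - timedelta(days=max_idle_days)
    return [r for r in records if r.last_accessed < cutoff]


def cold_fraction(records, cold):
    """Return the share of total bytes that would move to the cold tier."""
    total = sum(r.size_bytes for r in records)
    if total == 0:
        return 0.0
    return sum(r.size_bytes for r in cold) / total


if __name__ == "__main__":
    now = datetime(2024, 9, 1)
    records = [
        FileRecord("/share/inspection.mp4", 700, datetime(2022, 1, 1)),
        FileRecord("/share/budget.xlsx", 300, datetime(2024, 8, 1)),
    ]
    cold = select_cold_files(records, now)
    print([r.path for r in cold], cold_fraction(records, cold))
```

A real tiering tool would also have to move the bytes and leave the link behind; this sketch only covers the selection step, which is where the policy decision lives.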

What is unstructured data?

The term “unstructured data” is often used but is somewhat confusing because all data has some structure reflected in how it is formatted and managed, as well as its meaning. It is probably more useful to think of it as a spectrum that can vary depending on the application, format and use case. For example, an invoice or procurement document has a common structure but may differ depending on the vendor or customer. On the other hand, video footage of inspections may be less structured but also relates to specific assets, times and conditions.

Subramanian explains what she thinks about the development and proliferation of unstructured data:

If you think about the evolution of data, you know that the first data application was actually structured data. So your bank accounts and CRM and ERP applications were all very structured. They required a database. Companies invested in database technology and block storage for structured data. Unstructured data really just originated as files in file shares or personal documents on a PC. At first the amount of unstructured data was quite small, then it grew rapidly and today 90% of the data in any organization is actually unstructured.

And when we say unstructured data, we don’t mean that the data doesn’t have a structure. It’s just that there’s no common structure. So it could be audio files, video files, genomic data, data from self-driving cars, documents. A lot of it is actually generated by applications. It’s just not tabular data. But the problem for companies is that most of the technology is built for structured data, not unstructured data.

Explosive growth

On a personal level, managing unstructured data can be like storing all the photos and videos on your phone, even if you never look at them. The problem becomes even more serious when thousands of employees generate new data that they neither actively use nor delete. Subramanian estimates that storage costs now consume 30 to 50 percent of IT budgets, and the amount of data in the cloud and on-premises is growing by 20 to 200 percent annually.

Another challenge is sheer complexity. The raw data can consist of billions of files scattered across many systems. Subramanian estimates that most large organizations have tens of petabytes of data, which can take several months to move or make available to AI. She explains:

So when you consider how much data is piled up in so many places, it’s really not practical to push all that data into a cloud or somewhere else and then run an AI model on it. What you really want is to have a way to figure out what data is actually useful and feed that into an AI process, because AI is iterative. You can iterate on the same data using different AI processes. The idea that you have to push all your data to every place that an AI process is running and then move it to the next place is untenable for unstructured data. That’s why you need something that’s globally indexed to know where things are.

Automating data management

With this in mind, companies are also turning to various data and AI processing techniques to add structure to their data. The latest large language models convert raw text, images and videos into vectors to improve semantic search in vector databases. Knowledge graphs and graph databases help connect the entities, events and processes mentioned in documents.

Komprise focuses one level above that to bring order to metadata, or the data about the data. This could include things like: who collected it, where is it stored, when was it last used, how should it be managed, and what is the latest version? For example, if you ask an HR AI copilot a question about what your health insurance plan covers, this metadata index could make it easier to track down the latest information from your region.
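As a rough illustration of the HR copilot example, the sketch below looks up the newest version of a document for a given topic and region in a small in-memory metadata index. The MetadataEntry fields and the latest_for helper are hypothetical and are not taken from Komprise’s product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MetadataEntry:
    # Hypothetical index entry; these fields are assumptions for illustration.
    path: str
    topic: str
    region: str
    version: int
    last_modified: date


def latest_for(index, topic, region):
    """Return the highest-version entry for a topic and region, or None."""
    matches = [e for e in index if e.topic == topic and e.region == region]
    return max(matches, key=lambda e: e.version, default=None)


if __name__ == "__main__":
    index = [
        MetadataEntry("/hr/eu/plan_v1.pdf", "health-plan", "EU", 1, date(2023, 1, 10)),
        MetadataEntry("/hr/eu/plan_v2.pdf", "health-plan", "EU", 2, date(2024, 2, 5)),
        MetadataEntry("/hr/us/plan_v3.pdf", "health-plan", "US", 3, date(2024, 3, 1)),
    ]
    entry = latest_for(index, "health-plan", "EU")
    print(entry.path if entry else "no match")
```

The point of the design is that the copilot only needs to fetch the one file the index points to, rather than scanning every copy of the plan document across every system.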

It also provides a foundation for controlling the growth of shadow AI, where users begin feeding enterprise data into AI services without oversight. Many enterprise AI services provide management tools to control how data is reused, but many consumer variants do not, which can lead to data leaks or breaches.

Another challenge is keeping track of the origin of the data. The rapid adoption of generative AI tools is also driving the proliferation of AI-generated content, which may be inaccurate or contain hallucinations. Subramanian explains:

Generative AI may not always give you the same answer. It contains errors and hallucinations. So in our index, when you have results from AI, you can actually mark which ones have been reviewed by humans. This helps you build more confidence and trust the AI more.

My opinion

I was recently surprised to find that short videos on my phone were consuming gigabytes of my cloud storage, but I was able to shrink them by re-encoding with the free HandBrake app. These days, it takes a lot of neurotic discipline to keep my email, photos, and files under my quota. The problem is exacerbated by the fact that every new phone has an even fancier new camera but no extra storage. It feels like this is part of the cloud data upselling strategy.

Additionally, each new app seems to have its own storage format, location, and workflow, which can drive you mad when trying to move large amounts of content from one app to another. My wife regularly asks for my assistance when she’s tearing her hair out trying to transfer her data from one app to another. And she scowls at me every time I suggest that the file system on her laptop would be simpler.

I can only imagine the problems large companies with a mix of technical skills, AI ambitions, and neurotic discipline face when trying to bring order to this growing chaos. There are many vendors building active metadata management tools for structured data, but surprisingly there is little competition when it comes to unstructured data. Komprise’s relatively simple approach to automating and controlling this process could go a long way toward curbing this chaos, not to mention raising money for AI (mis)adventures.

Another point is that companies will continue to struggle with managing AI-generated content that confuses employees and customers. Earlier this year, Air Canada lost a legal dispute related to an AI chatbot that lied about its bereavement reimbursement policy. Cases like this should serve as a wake-up call that it is time to take the consequences of hallucinating AI seriously. This requires labeling all AI content and applying a chain of trust for human-verified content. In addition, it will help companies avoid some of the problems associated with AI model collapse, which can arise when new models are trained on AI-generated content.

By Olivia
