Why Data Management Is Important

The importance of data management for AI

Kai Yang

5/9/2025 · 2 min read

Truly accelerating the use of AI and machine learning models at scale for customer outcomes requires strong data management foundations. If the goal is to shift towards truly agentic AI, where LLMs make unsupervised workflow decisions, managing internal and external data systematically at scale becomes all the more important. In regulated industries subject to GDPR-like customer privacy laws, this is a licence-to-operate prerequisite, not a nice-to-have: the fines and remediation costs can be astronomical, to say nothing of the reputational risk.

On average, more than 70% of the time spent across the data analytics lifecycle goes into data discovery, data approvals, and data quality cleansing rather than into building models for decisions. This is largely because metadata capture, data catalogues, data modelling ontologies, and reference data are not standardised across the applications and data platforms that support business workflows end to end. The under-investment persists largely because these capabilities are not directly visible to business leaders, who tend to prioritise visible data products such as BI and visualisation tools (e.g., Looker) and front-end applications in digital channels. The challenge is that, over time, the data collected and stored, the true asset for fine-tuning AI/ML models, ends up trapped in individual silos. Every dollar invested in these foundational capabilities, especially upstream, is worth every cent.
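As a concrete illustration of what standardised metadata capture can look like, here is a minimal sketch of a catalogue entry. The `DatasetRecord` structure and its field names are my own illustrative assumptions, not any specific catalogue product's schema:

```python
from dataclasses import dataclass, field
from datetime import date

# A hypothetical, minimal schema for a standardised data catalogue entry.
# Capturing these fields consistently across platforms is what makes
# datasets discoverable and shortens approval and quality-review cycles.
@dataclass
class DatasetRecord:
    name: str                # canonical dataset name
    owner: str               # accountable data owner (person or team)
    domain: str              # business domain, e.g. "payments"
    schema_version: str      # version of the agreed data model / ontology
    classification: str      # e.g. "public", "internal", "PII"
    quality_checked: bool    # has it passed the agreed quality rules?
    last_refreshed: date     # freshness, a common quality dimension
    tags: list[str] = field(default_factory=list)  # free-form discovery tags

# Example entry: with records like this in one searchable catalogue,
# "where is the approved customer dataset?" becomes a query, not a project.
record = DatasetRecord(
    name="customer_transactions_daily",
    owner="payments-data-team",
    domain="payments",
    schema_version="2.3",
    classification="PII",
    quality_checked=True,
    last_refreshed=date(2025, 9, 1),
    tags=["transactions", "customer", "gdpr-scope"],
)
print(record.name, record.classification)
```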

The industry trend towards data mesh and data fabric is the right one: it makes data discoverable and accessible via virtualisation, governed by a standardised policy-based access and entitlement engine. Netflix is probably the most visible example of a firm that has benefited from data mesh. It should be noted, however, that Netflix is a tech-native firm with very few legacy systems and applications, which is not the case for firms with 30-plus years of history, i.e., those founded before the mid-1990s.
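To make the "standardised policy access and entitlement engine" idea concrete, here is a minimal sketch of an attribute-based access check. The policies, roles, purposes, and the `is_allowed` function are illustrative assumptions, not any specific product's API:

```python
# A hypothetical attribute-based access control (ABAC) check of the kind a
# standardised entitlement engine might evaluate before serving virtualised
# data. Policies live in one place instead of being re-implemented per silo.
POLICIES = [
    # (data classification, allowed role, allowed purpose); "*" = any
    ("public",   "*",             "*"),
    ("internal", "analyst",       "reporting"),
    ("PII",      "fraud-analyst", "fraud-investigation"),
]

def is_allowed(classification: str, role: str, purpose: str) -> bool:
    """Return True if any policy permits this (classification, role, purpose)."""
    for cls, allowed_role, allowed_purpose in POLICIES:
        if (cls == classification
                and allowed_role in ("*", role)
                and allowed_purpose in ("*", purpose)):
            return True
    return False

# Example: a generic analyst cannot read PII, a fraud analyst can,
# but only for the purpose the policy names.
print(is_allowed("PII", "analyst", "reporting"))                  # False
print(is_allowed("PII", "fraud-analyst", "fraud-investigation"))  # True
```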

As of mid-2025, the largest LLMs reportedly have on the order of 1.8 trillion parameters and have largely exhausted the freely available data on the internet. It is probably fair to say that, over time, LLMs from the hyperscalers will reach parity on baseline evaluation criteria, e.g., reasoning and coding capabilities. This means organisations that truly want to leverage agentic AI to double their productivity will need to maximise the value of their internal data. The ones that understand this sooner, and truly unlock their 'pot of gold' through good data management, will be the ones that surge ahead and build on their competitive advantage.