How MDM Platforms Actually Run in Production vs the Pitch

Realities From the Integration Trenches
- The Sales Pitch: The promise of instant agentic AI readiness and automated golden records across your entire enterprise stack.
- The Production Reality: Multi-month schema alignment, API rate-limiting bottlenecks, and the heavy operational friction of manual data stewardship.
- The Architectural Split: Traditional centralized MDM suites versus modern zero-copy connected apps running directly on your cloud data warehouse.
- The Next Action: Audit your entity resolution latency and duplicate rates before signing any seven-figure enterprise renewal contracts.
The High Cost of Dirty Data in the Era of Agentic AI
The sales presentation for modern Master Data Management (MDM) platforms always looks like magic. A slick slide deck promises that by deploying their software, your enterprise will instantly unify its fragmented customer, product, and inventory data into a single, pristine source of truth. At industry events like Informatica World 2026, the talk is all about "headless data management" and "agentic MDM," promising that autonomous AI agents can seamlessly navigate your systems to make real-time operational decisions.
But back in the engineering room at 3 a.m., the reality of master data management is far messier. When an AI agent queries a database to check a customer’s lifetime value, it does not see a clean, unified profile. Instead, it encounters three different records for "Jon Smith" at the same physical address, each with slightly different purchase histories across your Shopify storefront and your legacy ERP. If the agent acts on this fragmented data, it sends the wrong promotional offer or, worse, triggers an incorrect billing sequence.
This gap between AI ambition and data reality is widening rapidly. A 2026 CDO survey revealed that 76% of data leaders acknowledge their governance systems have not kept pace with AI adoption, while 61% state that higher-quality data is the single most important factor for moving AI pilots into production. Furthermore, research from Drexel University indicates that 67% of organizations do not fully trust the data they rely on daily. To build systems that actually work, we have to look past the marketing hype and understand how these platforms operate when the queries start hitting production scale.
Figures compiled from the sources cited below.
How Entity Resolution Works Under the Hood
To understand why MDM is so difficult, we have to strip away the vendor jargon and look at the core technical challenge: entity resolution. This is the process of determining whether two different records in separate databases refer to the same real-world entity. It is not a matter of running a simple SQL join on a primary key. It requires comparing messy, human-entered strings across millions of rows.
Think of entity resolution like a bouncer at a club door with a handwritten guest list. If a guest arrives and says their name is "Jon Smith," but the list says "Jonathan Smith," the bouncer has to make a judgment call. He looks at other clues: Is the phone number the same? Is the email address a match? In an MDM platform, this judgment call is handled through a combination of deterministic rules and probabilistic matching algorithms.
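The bouncer's judgment call can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's matching engine: the field names (`name`, `email`, `phone`), the 0.75 threshold, and the 0.2 phone bonus are all assumptions chosen for the example, and the fuzzy score comes from the standard library's `difflib.SequenceMatcher` rather than a production-grade probabilistic model.

```python
from difflib import SequenceMatcher


def normalize_name(value: str) -> str:
    """Lowercase a name and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def digits_only(value: str) -> str:
    """Keep only the digits of a phone number."""
    return "".join(ch for ch in value if ch.isdigit())


def name_similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def resolve(rec_a: dict, rec_b: dict, threshold: float = 0.75) -> str:
    """Return "match" or "no_match" for two records (hypothetical schema)."""
    # Deterministic rule: an identical, non-empty email is treated as a match.
    email_a = rec_a.get("email", "").strip().lower()
    email_b = rec_b.get("email", "").strip().lower()
    if email_a and email_a == email_b:
        return "match"

    # Probabilistic fallback: fuzzy name score, nudged up by a shared phone.
    score = name_similarity(rec_a["name"], rec_b["name"])
    phone_a = digits_only(rec_a.get("phone", ""))
    phone_b = digits_only(rec_b.get("phone", ""))
    if phone_a and phone_a == phone_b:
        score = min(1.0, score + 0.2)  # assumed bonus, not a calibrated weight
    return "match" if score >= threshold else "no_match"
```

Calling `resolve({"name": "Jon Smith", "phone": "555-0100"}, {"name": "Jonathan Smith", "phone": "(555) 0100"})` takes the probabilistic path, because neither record carries an email for the deterministic rule to fire on.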
The Mechanics of Probabilistic Matching and Headless APIs
In a traditional centralized MDM architecture, data is extracted from source systems, loaded into the MDM hub, cleansed, matched, and then written back to the source systems as a "golden record." This hub-and-spoke model is highly structured but slow, because every change has to make a round trip through the hub before downstream systems see it. When SAP agreed to acquire Reltio in early 2026 to strengthen its SAP Business Data Cloud (SAP BDC), it was a direct acknowledgement that enterprises need better ways to harmonize SAP and non-SAP data for AI applications. Even so, synchronizing these massive datasets across external systems introduces significant latency.
The alternative approach gaining traction is the "connected app" model, exemplified by Semarchy launching its connected app for Snowflake. Instead of moving gigabytes of data out of your cloud warehouse to an external MDM engine, connected apps run their matching algorithms directly inside your warehouse using native SQL and push-down execution. This eliminates the egress costs and security risks of moving data, but it puts a heavy compute burden on your warehouse, which can quietly spike your monthly Snowflake bill if your matching rules are unoptimized.
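To make the push-down idea concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for a cloud warehouse. It is not Semarchy's or Snowflake's implementation; the `customers` table, its columns, and the match rule are hypothetical. The point is only that the pairing logic is expressed as SQL and executed by the database engine, so the client receives matched id pairs instead of the full table.

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, last_name TEXT, postal_code TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Smith", "94107", "jon@example.com"),
        (2, "Smith", "94107", "JON@example.com"),
        (3, "Smyth", "10001", "j.smyth@example.com"),
    ],
)

# The self-join runs inside the engine. `a.id < b.id` avoids pairing a row
# with itself and avoids returning each pair twice.
PAIRS_SQL = """
SELECT a.id, b.id
FROM customers AS a
JOIN customers AS b
  ON a.id < b.id
 AND a.postal_code = b.postal_code
 AND lower(a.email) = lower(b.email)
ORDER BY a.id, b.id
"""

# Only the candidate id pairs leave the database, not the raw rows.
candidate_pairs = conn.execute(PAIRS_SQL).fetchall()
print(candidate_pairs)
```

On a real warehouse the same self-join is where the compute cost shows up: without a selective join condition, the engine compares every row against every other row.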
"An AI agent is only as smart as the database it queries, and feeding dirty, un-deduplicated customer records to a large language model is just a faster way to hallucinate bad business decisions."
How to Modernize Your Enterprise Master Data Strategy
If you are tasked with cleaning up your enterprise data layer to support downstream AI or analytics, you cannot rely on automated software to do all the heavy lifting. You need a structured, step-by-step approach to build a reliable master data pipeline.
- Profile your duplicate rates: Run a baseline analysis on your primary customer and product tables to estimate your duplicate rate. Use simple blocking keys, such as matching the first three characters of a last name and the first three digits of a postal code, to group similar records before running heavy matching algorithms.
- Establish clear stewardship workflows: Define who owns the data when automated matching rules fail. If the matching algorithm is only 85% confident that two records are the same person, the record must be routed to a human data steward via an automated queue in tools like Profisee or Stibo Systems.
- Expose golden records via headless APIs: Do not force downstream applications to query the raw MDM database directly. Use headless data management patterns, such as those introduced by Informatica, to expose clean, governed golden records through high-performance REST or gRPC endpoints.
- Enforce governance at the ingestion point: Prevent dirty data from entering your systems in the first place by integrating real-time address validation and email verification APIs directly into your customer-facing applications and CRM forms.
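The blocking-key step in the first item above can be sketched as follows. This is a rough baseline under stated assumptions: the `last_name` and `postal_code` field names are hypothetical, and the resulting number is a candidate rate rather than a true duplicate rate, since blocking can both group non-duplicates and miss duplicates whose keys differ (for example "Smith" versus "Smyth").

```python
from collections import defaultdict


def blocking_key(record: dict) -> str:
    """First 3 letters of the last name plus first 3 digits of the postal code."""
    last = record["last_name"].strip().lower()[:3]
    postal = "".join(ch for ch in record["postal_code"] if ch.isdigit())[:3]
    return f"{last}|{postal}"


def block_records(records: list) -> dict:
    """Group records by blocking key so matching only runs within a block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return dict(blocks)


def candidate_duplicate_rate(records: list) -> float:
    """Share of records that share a block with at least one other record."""
    if not records:
        return 0.0
    blocks = block_records(records)
    flagged = sum(len(group) for group in blocks.values() if len(group) > 1)
    return flagged / len(records)


sample = [
    {"last_name": "Smith", "postal_code": "94107"},
    {"last_name": "Smithson", "postal_code": "94110"},
    {"last_name": "Jones", "postal_code": "10001"},
    {"last_name": "Smyth", "postal_code": "94107"},
]
print(candidate_duplicate_rate(sample))
```

Only the records inside a shared block need to go through the expensive pairwise comparison, which is what keeps the heavy matching pass affordable on large tables.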
The Operational Trade-Off: Heavy Suites vs. Connected Apps
Choosing an MDM strategy is not about finding the "best" platform; it is about choosing which type of operational friction your engineering team is willing to tolerate. There are two valid, competing approaches to this problem, and each comes with its own set of compromises.
- Unified Enterprise Suites (Informatica, SAP + Reltio): These platforms offer incredibly deep entity resolution, robust compliance frameworks for regulations like GDPR and SOX, and mature data stewardship interfaces. However, the total cost of ownership (TCO) is exceptionally high, integration cycles are measured in quarters rather than weeks, and you are locked into the vendor's proprietary ecosystem.
- Connected Warehouse Apps (Semarchy on Snowflake): This approach leverages your existing cloud data warehouse infrastructure, allowing you to run zero-copy data cleansing and matching. It is highly appealing to modern DataOps teams because it keeps data inside your security perimeter and matches the speed of your dbt pipelines. The catch is that you must manage the underlying compute costs, and you lack the out-of-the-box, multi-domain business workflows found in legacy suites.
The Data Gravity Rule: If more than 70% of your transactional data already lives inside a single ecosystem like SAP, buy the suite; if your data is evenly distributed across cloud warehouses, a connected app will save you millions in egress and integration costs.
The Hidden Pitfalls of Automated Data Harmonization
When deploying these systems in production, engineering teams frequently fall into predictable traps that destroy the ROI of their MDM investments.
- The "Set-and-Forget" Fuzzy Matching Trap: Relying too heavily on probabilistic matching without continuous monitoring. If you set your matching threshold too low, the system will incorrectly merge distinct customers, which can lead to compliance violations; if you set it too high, genuine duplicates go unmerged and your duplicate rate barely changes.
- The Egress Cost Blindspot: Extracting massive datasets out of cloud data warehouses to run matching rules in an external SaaS MDM platform. This pattern frequently results in thousands of dollars in monthly cloud egress fees and introduces unnecessary network latency to your data pipelines.
- Ignoring the Human Steward: Assuming that AI agents or machine learning models can completely replace human data stewards. When resolving complex corporate hierarchies or B2B accounts, human context is often required to determine the correct parent-child relationships between business entities.
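One way to avoid both the threshold trap and the steward trap is to treat matching as a three-way decision instead of a single cutoff. The sketch below is illustrative only: the `MatchRouter` class, its 0.95 and 0.70 cutoffs, and the in-memory steward queue are assumptions, not the behavior of any particular MDM product. It also tracks decision counts so that a drift in the review share can be monitored over time.

```python
from dataclasses import dataclass, field


@dataclass
class MatchRouter:
    """Route match scores to auto-merge, steward review, or no action."""

    auto_merge_at: float = 0.95  # assumed cutoff: merge without review
    review_at: float = 0.70      # assumed cutoff: below this, treat as distinct
    steward_queue: list = field(default_factory=list)
    counts: dict = field(
        default_factory=lambda: {"merge": 0, "review": 0, "distinct": 0}
    )

    def route(self, pair_id: str, score: float) -> str:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score must be in [0, 1], got {score}")
        if score >= self.auto_merge_at:
            decision = "merge"
        elif score >= self.review_at:
            decision = "review"
            # Ambiguous pairs wait for a human decision instead of merging.
            self.steward_queue.append((pair_id, score))
        else:
            decision = "distinct"
        self.counts[decision] += 1
        return decision

    def review_share(self) -> float:
        """Fraction of routed pairs sent to stewards; watch this over time."""
        total = sum(self.counts.values())
        return self.counts["review"] / total if total else 0.0
```

With these assumed cutoffs, a pair scored at 0.85 lands in the steward queue rather than being merged. A sudden rise in `review_share()` is a hint that source data or matching rules have shifted and the thresholds need another look.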
Frequently Asked Questions
What happens to our compliance audit trail when an external MDM platform's API goes down during a batch sync?
If your MDM platform experiences an API outage during a batch synchronization, your source systems will temporarily fall out of sync, creating data drift. To prevent compliance failures under regulations like GDPR or HIPAA, your integration pipeline must implement an ordered, idempotent transactional queue. Ordering ensures that when the API comes back online, the updates are applied in the exact order they occurred; idempotency ensures that a retried update is not applied twice. Together they preserve a clear, auditable trail of who changed what record and when.
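A minimal in-memory sketch of such a queue is shown below. Everything here is hypothetical: the `Update` and `ReplayQueue` names, the sequence-number scheme, and the `send` callable that stands in for the MDM platform's API client. A production version would persist the queue and the applied set durably, which this example does not attempt.

```python
from dataclasses import dataclass


@dataclass
class Update:
    seq: int          # monotonically increasing sequence number from the source
    record_id: str
    changed_by: str
    payload: dict


class ReplayQueue:
    """Buffer updates during an outage and replay them once, in order."""

    def __init__(self, send):
        self._send = send       # hypothetical callable that calls the MDM API
        self._pending = []      # updates not yet acknowledged
        self._applied = set()   # sequence numbers already acknowledged
        self.audit_log = []     # (seq, record_id, changed_by) in applied order

    def enqueue(self, update: Update) -> None:
        # Idempotency: drop an update whose sequence number was already seen.
        if update.seq in self._applied:
            return
        if any(u.seq == update.seq for u in self._pending):
            return
        self._pending.append(update)

    def flush(self) -> int:
        """Send pending updates in sequence order; stop at the first failure."""
        self._pending.sort(key=lambda u: u.seq)
        sent = 0
        while self._pending:
            update = self._pending[0]
            try:
                self._send(update)
            except ConnectionError:
                break  # API still down: keep this and later updates queued
            self._pending.pop(0)
            self._applied.add(update.seq)
            self.audit_log.append((update.seq, update.record_id, update.changed_by))
            sent += 1
        return sent
```

Client-side deduplication alone does not cover the case where the API applies an update but the acknowledgement is lost, so the receiving side should also treat the sequence number as an idempotency key.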
How does the latency of a Snowflake Connected App compare to a traditional hub-and-spoke MDM architecture?
A connected app running directly on Snowflake typically has lower data-movement latency because it eliminates the need to export, transform, and load data into an external hub. However, the query latency for real-time lookups can be higher if the warehouse is cold or if the matching views require complex, high-cardinality joins. Traditional hub-and-spoke systems, while slower on batch updates, often deliver lower single-record lookup latencies (under 50 milliseconds) because they serve read requests from highly optimized, memory-cached relational databases.
Why does SAP's acquisition of Reltio matter if we already run our data pipelines on non-SAP systems?
The acquisition of Reltio by SAP is a significant market signal because it shows that ERP giants realize they can no longer operate in a silo. Even if your primary data pipeline runs on non-SAP systems, this acquisition means SAP Business Data Cloud will increasingly seek to ingest and govern your external data sources. For engineering teams, this means you must prepare for tighter integrations and potential licensing changes if you rely on Reltio to master data that eventually feeds into an SAP-dominated financial ledger.
The ultimate success of your data strategy does not depend on whether you buy a headless, agentic, or connected MDM platform. It depends on your team's willingness to do the hard work of defining clear business rules, setting up manual stewardship workflows, and keeping your matching logic close to where your data actually lives.
How many duplicate customer records are currently sitting in your production database, and what is it costing your engineering team to ignore them?
Sources
- Informatica Unveils Headless Data Management, Agentic MDM at Informatica World 2026 - HPCwire
- Semarchy Launches Snowflake Connected App for Governed Data Products and Enterprise AI - Business Wire
- SAP to Acquire Reltio: Make SAP and Non-SAP Data AI-Ready - SAP News Center
- Informatica puts the meat on its headless AI effort for partners - IT Europa
- Master Data Management Market Size Share & Forecast - 2035 - Market Research Future
- Master Data Management: Building a Single Source of Truth for Ecommerce (2026) - Shopify