How MDM Platforms Resolve Entity Duplication in 2026

The Operational Briefing
- The Core Mechanism: Master Data Management (MDM) platforms reconcile, clean, and synchronize core business entities across highly fragmented database environments.
- The Strategic Shift: SAP's March 2026 acquisition of Reltio highlights a massive transition from slow, batch-oriented data governance to real-time, cloud-native entity resolution.
- The Operational Friction: Most enterprises remain trapped in a half-finished migration, attempting to bridge legacy on-prem transactional systems with modern, API-first cloud data pipelines.
Why is SAP Buying Reltio When Enterprises Already Have Databases?
SAP's March 2026 acquisition of Reltio highlights a persistent enterprise headache: keeping data clean across silos, which is the problem modern MDM platforms exist to solve.
Every large company suffers from a quiet, expensive identity crisis. The sales team uses Salesforce, the billing department runs on SAP S/4HANA, the marketing team tracks leads in HubSpot, and the data science team queries a Snowflake data lakehouse. If a customer changes their address, updates their phone number, or registers under a parent company, each of these systems records a slightly different version of reality. The database does not know that "Acme Corp," "Acme Corporation," and "ACME Inc." are the exact same legal entity.
This is where Master Data Management (MDM) platforms fit into the architecture. They do not replace your transaction databases or your analytical warehouses. Instead, they sit adjacent to them, acting as the definitive arbiter of identity. They ingest messy, conflicting data from every corner of the business, run it through a series of matching engines, and output a single, clean, authoritative record known as the golden record.
The industry is currently caught in a slow, uneven transition. For decades, master data was managed through heavy, on-prem, batch-processed registries like legacy SAP Master Data Governance (MDG) or older Informatica implementations. These systems were built for a world where data moved once a day at midnight. Today, engineering teams are trying to stream live data via Apache Kafka and feed clean context into vector databases for real-time retrieval-augmented generation (RAG). Yet, the legacy transactional systems refuse to move faster. The result is a highly fragmented data architecture where real-time cloud APIs are constantly waiting on batch-oriented ERP backends to update.
The Mechanics of Sorting Out Who is Who in Your Data
To understand how an MDM platform solves this, we have to look past the marketing jargon and examine the actual data pipeline. The process of turning chaotic, duplicate records into a single golden record relies on a highly sequenced, multi-stage mechanism. It begins with ingestion, moves through schema normalization, applies matching algorithms, executes survivorship rules, and finally syndicates the clean data back to the systems that need it.
Think of an MDM platform as an elite hotel concierge who cross-references a messy stack of handwritten guest lists, online bookings, and VIP club registries to ensure the guest checking in gets their preferred room and correct bill, even if their name is spelled three different ways across those lists.
In practice, the platform must first normalize the incoming data. This means converting all phone numbers to the E.164 standard, parsing physical addresses using postal verification APIs, and stripping out junk characters. Once the data is clean, the matching engine takes over. Modern platforms like Reltio combine deterministic matching (exact matches on unique identifiers like tax IDs or Social Security numbers) with probabilistic matching (using string-similarity measures like Jaro-Winkler or Levenshtein distance to estimate the likelihood that two values refer to the same entity).
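The normalize-then-match flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Reltio or any other vendor implements it: the field names (`name`, `tax_id`), the legal-suffix list, the 0.85 threshold, and the simplified phone handling (10-digit numbers assumed to belong to one default country) are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical list of legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "incorporated", "llc", "ltd", "co"}


def normalize_phone(raw: str, default_country_code: str = "1") -> Optional[str]:
    """Reduce a phone string to an E.164-style value, or None if implausible.

    Simplification: 10-digit numbers without a leading "+" are assumed to
    belong to default_country_code. Real platforms use full numbering plans.
    """
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        candidate = digits
    elif len(digits) == 10:
        candidate = default_country_code + digits
    else:
        candidate = digits
    # E.164 allows at most 15 digits; 8 is a rough lower sanity bound.
    if not 8 <= len(candidate) <= 15:
        return None
    return "+" + candidate


def normalize_name(raw: str) -> str:
    """Lowercase, strip punctuation, and drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", raw.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to the range 0.0 (different) to 1.0 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> str:
    """Try the deterministic rule first, then fall back to fuzzy name matching."""
    if rec_a.get("tax_id") and rec_a.get("tax_id") == rec_b.get("tax_id"):
        return "deterministic"
    score = similarity(normalize_name(rec_a["name"]), normalize_name(rec_b["name"]))
    return "fuzzy" if score >= threshold else "no_match"


print(normalize_phone("(415) 555-0100"))                      # +14155550100
print(match({"name": "Acme Corp"}, {"name": "ACME Inc."}))    # fuzzy
print(match({"name": "Acme Corp"}, {"name": "Initech LLC"}))  # no_match
```

Running the deterministic rule first keeps the cheap, high-confidence check ahead of the fuzzy comparison. A production engine would typically score several attributes together rather than the name alone.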
The Friction Point Between Deterministic and Probabilistic Matching
The hardest part of this process is balancing the trade-off between false positives and false negatives. If your matching rules are too strict (for example, exact matches only), you get false negatives: duplicate records survive because a simple typo in a last name prevents a match. If your rules are too loose (for example, a low similarity threshold), you get false positives: the records of two entirely different customers are merged, which can lead to compliance violations under GDPR or severe billing errors in your ERP.
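To make the trade-off concrete, here is a small sketch that sweeps a match threshold over a handful of labeled record pairs and counts both kinds of error. The similarity scores and labels are made-up illustrative values, not measurements from any real dataset.

```python
from typing import List, Tuple

# (similarity score, truly the same entity?) -- hypothetical, hand-labeled pairs.
LABELED_PAIRS: List[Tuple[float, bool]] = [
    (0.98, True), (0.93, True), (0.88, True), (0.81, True), (0.74, True),
    (0.90, False), (0.79, False), (0.65, False), (0.52, False), (0.40, False),
]


def count_errors(pairs: List[Tuple[float, bool]], threshold: float) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) for a given match threshold."""
    false_positives = sum(1 for score, same in pairs if score >= threshold and not same)
    false_negatives = sum(1 for score, same in pairs if score < threshold and same)
    return false_positives, false_negatives


for threshold in (0.70, 0.85, 0.95):
    fp, fn = count_errors(LABELED_PAIRS, threshold)
    print(f"threshold={threshold:.2f} false_positives={fp} false_negatives={fn}")
# threshold=0.70 false_positives=2 false_negatives=0
# threshold=0.85 false_positives=1 false_negatives=2
# threshold=0.95 false_positives=0 false_negatives=4
```

Raising the threshold trades wrongly merged customers for surviving duplicates. No single value removes both, which is why borderline scores usually need a separate handling path.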
"An MDM platform does not create new data; it resolves the conflicting stories your existing databases are telling you."
Legacy architectures handled this by routing every borderline match to a human data steward for manual review. In a massive enterprise, this creates an enormous operational bottleneck. Modern cloud-native platforms attempt to automate this by using machine learning models to continuously tune match thresholds based on historical steward decisions. However, this is where the half-finished migration reveals itself: while the matching engine can run in milliseconds in the cloud, pushing those resolved changes back to an on-prem ERP often requires waiting for a nightly batch job, creating a temporary state of data divergence.
To see how these two paradigms contrast, we can compare the operational realities of legacy on-prem systems against modern cloud-native platforms like Reltio:
| Operational Metric | Legacy Batch MDM (e.g., On-Prem SAP MDG) | Modern Cloud MDM (e.g., Reltio) |
|---|---|---|
| Processing Latency | Hours to days (scheduled batch runs) | Sub-second to real-time (API-driven) |
| Data Model Flexibility | Rigid relational schemas (SQL-based) | Flexible graph schemas (NoSQL/Graph-based) |
| Matching Approach | Strictly deterministic, rule-heavy SQL | Hybrid deterministic and probabilistic ML |
| Downstream Sync | File-based exports (CSV/XML over SFTP) | Real-time event streaming (Kafka/Webhooks) |
Rule of Thumb: If your entity resolution strategy relies on downstream data consumers writing their own custom SQL joins to clean up duplicate customer records, you do not have an MDM strategy; you have a data debt factory.
An Operator's Step-by-Step Playbook for Entity Resolution
Implementing an MDM platform is not a matter of turning on a software license and watching the data clean itself. It requires a disciplined, step-by-step operational sequence to ensure that the systems upstream and downstream do not break when identities begin to shift.
- Establish Source System Authority and Lineage: Before writing a single matching rule, you must map your data landscape and assign a trust score to every source system for every specific attribute. For example, you might decide that your CRM (Salesforce) is the absolute authority for customer phone numbers and email addresses, but your ERP (SAP S/4HANA) is the sole authority for billing addresses and credit limits. This hierarchy prevents systems from overwriting each other in an infinite sync loop.
- Configure the Match Rules and Run a Passive Simulation: Next, you load your historical data into the MDM platform and run matching rules in a passive, read-only mode. This allows you to analyze the matches the system *would* make without actually merging any records in your live databases. You then review the results to identify false positives (e.g., merging father and son records because they share an address and a similar name) and adjust your probabilistic thresholds accordingly.
- Activate Survivorship and Begin Real-Time Syndication: Once the match rules are tuned, you define the survivorship rules—the logic that dictates how the golden record is constructed from the winning attributes of the merged source records. With survivorship active, you turn on the outbound syndication pipelines. This means configuring the MDM platform to publish any changes to a golden record directly to an event bus like Kafka, allowing downstream analytical warehouses, search indexes, and operational apps to update their local caches immediately.
This sequence must be executed in order. Activating survivorship and syndication without first running the passive simulation risks pushing bad merges into your downstream transactional systems, where they are expensive to unwind.
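The attribute-level trust hierarchy and the survivorship step from the playbook can be sketched as follows. This is a minimal illustration, assuming a hypothetical trust table and record layout. The scores, source names, and field names are invented for the example and do not reflect any platform's configuration format.

```python
from typing import Any, Dict, List

# Hypothetical trust scores per attribute and source system (higher wins).
TRUST: Dict[str, Dict[str, float]] = {
    "phone": {"salesforce": 0.9, "sap_s4hana": 0.4},
    "email": {"salesforce": 0.9, "sap_s4hana": 0.3},
    "billing_address": {"salesforce": 0.3, "sap_s4hana": 0.95},
    "credit_limit": {"salesforce": 0.0, "sap_s4hana": 1.0},
}


def build_golden_record(source_records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Pick, per attribute, the non-empty value from the most trusted source."""
    golden: Dict[str, Any] = {}
    for attribute, trust_by_source in TRUST.items():
        candidates = [
            (trust_by_source.get(rec["source"], 0.0), rec[attribute], rec["source"])
            for rec in source_records
            if rec.get(attribute) not in (None, "")
        ]
        if not candidates:
            continue  # no source supplied this attribute
        _, value, source = max(candidates, key=lambda c: c[0])
        # Keep the winning source next to the value so lineage stays visible.
        golden[attribute] = {"value": value, "source": source}
    return golden


crm_record = {
    "source": "salesforce",
    "phone": "+14155550100",
    "email": "ap@acme.example",
    "billing_address": "1 Old Rd",
}
erp_record = {
    "source": "sap_s4hana",
    "phone": "+14155550199",
    "billing_address": "500 Main St",
    "credit_limit": 50000,
}

golden = build_golden_record([crm_record, erp_record])
print(golden["phone"])            # {'value': '+14155550100', 'source': 'salesforce'}
print(golden["billing_address"])  # {'value': '500 Main St', 'source': 'sap_s4hana'}
```

Because the function only reads its inputs and returns a new record, the same logic can run in a passive simulation: compute the golden records, compare them with what the live systems hold, and publish nothing until the results look right.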
Where Data Architects Trip Up During the Migration
The path to clean master data is littered with failed deployments. Most of these failures do not stem from software limitations, but rather from fundamental misunderstandings of how data behaves across an enterprise.
- Believing real-time MDM solves the dirty source data problem: An MDM platform is an engine of reconciliation, not a magic wand. If your sales representatives are allowed to type "N/A," "test," or "none" into required fields in your CRM, the MDM platform will faithfully ingest those values. At best, it will flag them as low-confidence matches; at worst, it will resolve them into a single, massive, useless golden record representing a customer named "N/A." Data quality must still be enforced at the point of entry.
- Treating MDM as a pure database consolidation project: Many teams assume they can replace MDM by simply migrating all their data into a single, massive cloud data warehouse like Snowflake or a lakehouse platform like Databricks. While a warehouse is excellent for running analytical queries on historical data, it is not designed to handle the real-time, bidirectional operational workflows that MDM platforms manage. A warehouse shows you what happened; an MDM platform tells your operational systems what is true right now.
- Assuming LLMs and vector search render MDM obsolete: There is a common misconception that because large language models (LLMs) can perform semantic searches and understand context, you no longer need structured entity resolution. The reality is the exact opposite. If the vector database behind an LLM is populated with embeddings of duplicate, conflicting, and outdated customer records, the model retrieves contradictory context and generates inaccurate, hallucinated responses. Clean master data is a prerequisite for reliable generative AI in the enterprise.
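The first pitfall in this list, placeholder values typed into required fields, is cheap to guard against at the point of entry. The sketch below is one possible check, assuming a hypothetical placeholder list and field names. A real deployment would tune both to its own data.

```python
from typing import Any, Dict, List

# Hypothetical set of junk values that should never count as real data.
PLACEHOLDERS = {"", "-", "n/a", "na", "none", "null", "test", "unknown"}


def is_placeholder(value: Any) -> bool:
    """True when a value is missing or is a known junk placeholder."""
    return value is None or str(value).strip().lower() in PLACEHOLDERS


def invalid_required_fields(record: Dict[str, Any], required: List[str]) -> List[str]:
    """List the required fields that are missing or hold placeholder values."""
    return [name for name in required if is_placeholder(record.get(name))]


lead = {"name": "N/A", "email": "buyer@acme.example", "phone": " test "}
print(invalid_required_fields(lead, ["name", "email", "phone"]))  # ['name', 'phone']
```

Rejecting or quarantining such records before ingestion keeps them from being matched to each other and merged into one meaningless golden record.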
Frequently Asked Questions
What happens to our downstream analytics when the MDM survivorship rules are updated mid-quarter?
Updating survivorship rules mid-quarter can introduce immediate data drift in your analytical environments. If you change a rule so that ERP data now overrides CRM data for a specific attribute, your historical reporting in Snowflake or BigQuery will retroactively shift. To mitigate this, you must version your golden records, maintain a strict lineage log of which rules were active when a record was updated, and ensure your analytics team is notified before any core matching or survivorship logic is altered in production.
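One way to picture versioned golden records with a lineage log is an append-only history that stamps every version with the ruleset that produced it. The sketch below is an in-memory illustration under assumed names (`GoldenRecordVersion`, `ruleset_version`, `as_of`). It is not the storage model of any particular MDM product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class GoldenRecordVersion:
    entity_id: str
    version: int
    attributes: Dict[str, Any]
    ruleset_version: str  # which survivorship rules produced this version
    valid_from: datetime


class GoldenRecordHistory:
    """Append-only history: old versions are never overwritten."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[GoldenRecordVersion]] = {}

    def append(self, entity_id: str, attributes: Dict[str, Any],
               ruleset_version: str, valid_from: datetime) -> GoldenRecordVersion:
        history = self._versions.setdefault(entity_id, [])
        record = GoldenRecordVersion(
            entity_id, len(history) + 1, dict(attributes), ruleset_version, valid_from
        )
        history.append(record)
        return record

    def as_of(self, entity_id: str, moment: datetime) -> Optional[GoldenRecordVersion]:
        """Return the version that was current at `moment`, if any."""
        candidates = [v for v in self._versions.get(entity_id, []) if v.valid_from <= moment]
        return max(candidates, key=lambda v: v.valid_from) if candidates else None


history = GoldenRecordHistory()
history.append("C-1", {"billing_address": "1 Old Rd"}, "survivorship-v1",
               datetime(2026, 1, 10, tzinfo=timezone.utc))
history.append("C-1", {"billing_address": "500 Main St"}, "survivorship-v2",
               datetime(2026, 2, 15, tzinfo=timezone.utc))

january_view = history.as_of("C-1", datetime(2026, 1, 31, tzinfo=timezone.utc))
print(january_view.ruleset_version)  # survivorship-v1
```

With a history like this, a report for a closed period can query the record as of the period end instead of the current state, so a mid-quarter rule change does not silently rewrite numbers that were already reported.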
How do we handle the latency mismatch when a modern CRM writes in real-time but our legacy ERP only processes batch updates at midnight?
This mismatch is the classic "half-finished migration" trap. The best approach is to decouple the systems using an event-driven architecture. When the CRM writes a change, it should stream to the MDM platform via an API. The MDM platform resolves the entity, updates the golden record, and publishes the update to an event queue (like Apache Kafka). The real-time systems consume this update immediately, while a separate connector stages the update, holds it, and formats it for the ERP's nightly batch processing window, ensuring database tables do not lock up during business hours.
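The decoupling pattern can be sketched with an in-memory stand-in for the event queue. Everything here is illustrative: `GoldenRecordBus` is a toy substitute for a system like Kafka, not a client for it, and the choice to keep only the latest update per entity in the ERP staging buffer is an assumption that suits last-write-wins attributes but not every workload.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


class GoldenRecordBus:
    """Toy in-memory publish/subscribe bus standing in for an event queue."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._subscribers:
            handler(event)


class ErpBatchStager:
    """Holds updates until the nightly window, keeping the latest per entity."""

    def __init__(self) -> None:
        self._pending: Dict[str, Event] = {}

    def handle(self, event: Event) -> None:
        self._pending[str(event["entity_id"])] = event

    def flush(self) -> List[Event]:
        """Called by the nightly job: return the staged batch and reset."""
        batch = list(self._pending.values())
        self._pending.clear()
        return batch


bus = GoldenRecordBus()
stager = ErpBatchStager()
realtime_cache: Dict[str, object] = {}


def update_cache(event: Event) -> None:
    # Real-time consumers apply each change as soon as it is published.
    realtime_cache[str(event["entity_id"])] = event["attributes"]


bus.subscribe(update_cache)
bus.subscribe(stager.handle)

bus.publish({"entity_id": "C-1", "attributes": {"phone": "+14155550100"}})
bus.publish({"entity_id": "C-1", "attributes": {"phone": "+14155550199"}})
bus.publish({"entity_id": "C-2", "attributes": {"phone": "+442079460958"}})

nightly_batch = stager.flush()
print(realtime_cache["C-1"])  # {'phone': '+14155550199'}
print(len(nightly_batch))     # 2
```

The real-time cache sees all three events as they arrive, while the ERP receives a single batch of two rows, one per entity, during its own processing window.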
Can we just use a vector database and semantic search to handle entity resolution instead of buying an MDM platform?
No, you cannot. While vector databases and semantic search are highly effective at identifying similar strings or concepts, they lack the operational framework required for enterprise master data. A vector database does not have built-in survivorship logic, data stewardship workflows, audit trails, or the deterministic constraints required for regulatory compliance (such as SOX or GDPR). You can use vector embeddings as a feature inside a modern MDM platform's probabilistic matching engine, but a vector database alone cannot govern the lifecycle of a golden record.
The Architect's Verdict: The acquisition of Reltio by SAP signals that real-time, cloud-native entity resolution is no longer an optional luxury for modern enterprise stacks. However, buying the platform is only half the battle; the real work lies in configuring the precise matching and survivorship rules that prevent your transactional systems from corrupting your analytical layers. Success requires treating identity as a continuous, governed pipeline rather than a static database state.
When you look at your own data stack today, can you confidently say how many unique customer records exist across your CRM, ERP, and data warehouse—or are you still relying on downstream analysts to write custom SQL queries to hide the duplicates?