Rules-Based and Probabilistic Entity Resolution
By Robert Barker, Infoglide Senior VP and Chief Marketing Officer
If you’ve followed recent developments in the entity resolution market, including the repositioning of existing vendors like Netrics and Initiate Systems, you may have heard debate about the relative merits of rules-based entity resolution, which uses attribute-specific analytics, versus probabilistic entity resolution, which relies exclusively on mathematical analytics. Facts can get distorted in the heat of discussion, so let’s examine a few of the arguments you may hear and then look at the facts:
- Rules-based systems can only yield binary answers; that is, they require that attributes either match exactly or not at all.
- Rules-based systems demand that all data sources be centralized and rationalized.
- Probabilistic-only entity resolution systems enable decision-making based on relative likelihood, while rules-based systems do not.
FACT: The best systems combine rules with similarity-search technology, which lets them compute the distance between attribute values and make complex decisions based on those calculations rather than on exact matches alone. These hybrid solutions can run more efficiently while providing better results.
FACT: Flexible systems do not require data sources to be centralized, warehoused, or have conforming schemas.
FACT: It’s possible to combine the best of rules and probability to create effective identity resolution solutions that make decisions based on relative likelihood.
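To make the hybrid idea concrete, here is a minimal Python sketch of one way rules and graded similarity could work together. It is an illustration only, not a description of any vendor's product: the field names, weights, and thresholds are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for a real similarity engine.

```python
from difflib import SequenceMatcher

# Hypothetical attribute weights; a real deployment would tune these.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}


def similarity(a: str, b: str) -> float:
    """Graded similarity in [0, 1] between two attribute values."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def resolve(rec_a: dict, rec_b: dict) -> tuple:
    """Return (decision, overall_score) for a pair of records."""
    scores = {field: similarity(rec_a[field], rec_b[field]) for field in WEIGHTS}
    overall = sum(WEIGHTS[field] * scores[field] for field in WEIGHTS)

    # Rule operating on graded scores: an identical phone number plus a
    # close (not necessarily exact) name counts as a match.
    if scores["phone"] == 1.0 and scores["name"] >= 0.8:
        return "match", overall

    # Otherwise decide on relative likelihood (illustrative thresholds).
    if overall >= 0.85:
        return "match", overall
    if overall >= 0.6:
        return "review", overall
    return "no match", overall


a = {"name": "Jon Smith", "address": "12 Oak St", "phone": "5125550100"}
b = {"name": "John Smith", "address": "12 Oak Street", "phone": "5125550100"}
decision, score = resolve(a, b)
print(decision, round(score, 2))
```

The rule here never demands an exact name match, and the fallback thresholds produce a graded outcome ("match", "review", or "no match") rather than a yes-or-no answer.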
“One size fits all” doesn’t work well in many domains, and it unnecessarily constrains the development of entity resolution solutions. For example, suppose your solution needs to include automobile license plate numbers as one attribute to help resolve entities. A purely mathematical comparison won’t detect that “13” is similar to “B” because the two values share no characters, while an attribute-specific analytic that knows which plate characters are easily mistaken for one another quickly makes the connection.
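The license plate example can be sketched in a few lines of Python. This is a hypothetical attribute-specific comparator, assuming the similarity in question is visual (a "1" next to a "3" reads like a "B"). The confusion tables below are illustrative assumptions, not a published standard.

```python
import re

# Hypothetical look-alike tables for plate characters (assumptions for
# illustration). Multi-character sequences are handled before single ones.
MULTI_CHAR = {"13": "B"}
SINGLE_CHAR = {"0": "O", "1": "I", "5": "S", "8": "B"}


def canonical_plate(plate: str) -> str:
    """Reduce a plate to a canonical form that merges look-alike characters."""
    # Uppercase and drop spaces, dashes, and other separators.
    text = re.sub(r"[^A-Z0-9]", "", plate.upper())
    for sequence, replacement in MULTI_CHAR.items():
        text = text.replace(sequence, replacement)
    return "".join(SINGLE_CHAR.get(ch, ch) for ch in text)


def plates_look_alike(a: str, b: str) -> bool:
    """True when two plates collapse to the same canonical form."""
    return canonical_plate(a) == canonical_plate(b)


print(plates_look_alike("ABC 13X", "abc-bx"))
print(plates_look_alike("ABC123", "XYZ789"))
```

Collapsing to a canonical form is deliberately lossy, so a production analytic would more likely return a graded score than a yes-or-no answer. The sketch only shows where domain knowledge about the attribute enters the comparison.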
So what’s the takeaway? Be skeptical when you hear that rules and probability don’t mix. More importantly, question why you have to choose one or the other when you could have both.
