Architectures for Entity Resolution-Part 3
Thursday, April 29th, 2010
By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)
In the last two posts we reviewed the basic architectures used to implement entity resolution (ER) systems. We started with the most basic systems, the merge/purge and heterogeneous join processes. In the last post, we discussed identity resolution systems, the first of two types of ER architectures that perform identity management. By retaining identity information, these systems are able to recognize the same identity over time and to assign it a persistent identifier.
The distinguishing characteristic of identity resolution systems is that they start with a given set of identities to which input references are resolved. An example would be a customer recognition system where the starting identities are the customers of the business. However, there are many situations where the identities are not necessarily known in advance. In some cases, it is not because the entities are unknown, but simply that they are not organized in a way that can be easily pre-loaded.
For example, suppose two companies merge. Each company has its own customer database, but the two identify customers in different ways. The same situation can arise within a single company through poor systems and practices, leaving no confidence that master records are free of duplicates across business lines or company locations.
The type of system often used to address these situations is called an “identity capture” system. Identity capture systems resemble a cross between a “smart” merge/purge system and an identity resolution system. They support identity management and persistent identifiers, but start without a preloaded set of identities.
Here is how they work. As references are resolved, the system saves what it has learned rather than discarding it, so identities are built on the fly as references are processed. For example, suppose an elementary school has 10 years of enrollment records; that is, for each year it has records of all the students who were in grades 1 through 6. Each year some students leave grade 6 for middle school or transfer from any of the grades to another school. At the same time, some new students enter at first grade or transfer into upper grades from another school. In an identity capture system, the identity master starts out empty. When the first enrollment file is processed, almost all of the enrollment records will represent new identities. The identity characteristics in each record are captured and stored to create a new identity master record.
When the next year of enrollment is processed, the system should recognize students re-enrolling from the previous year, so it captures as new identities only those students entering the school that year. However, in many identity capture systems, the capture process goes beyond simply adding new identities and can also be used to enhance existing ones. For example, suppose that from the first-year enrollment, an identity was created for student Edgardo Mendez with a 7/12/2000 date of birth (DOB). Then in the next year of enrollment the system is presented with the record of Eddie Mendez with a 7/12/2000 DOB. Based on the resolution rules (including conflict rules), the embedded identity resolution process may decide that these are both references to the same student. If so, it would enhance the identity master record to include the first-name variant from the second year, so that going forward it would recognize the same identity with a first name of either Edgardo or Eddie.
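The capture-and-enhance behavior in the school example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names (`IdentityCapture`, `Identity`, `nicknames`) are hypothetical, and the matching rule (same last name, same DOB, and a first name that maps to a known variant through an assumed nickname table) stands in for whatever resolution and conflict rules a real system would apply.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Identity:
    """One identity master record (hypothetical structure)."""
    identifier: int
    last_name: str
    dob: str
    first_names: set = field(default_factory=set)


class IdentityCapture:
    """Toy identity capture system: the identity master starts out empty."""

    def __init__(self, nicknames):
        self.master = []            # no preloaded identities
        self.nicknames = nicknames  # assumed nickname table, e.g. {"eddie": "edgardo"}
        self._next_id = count(1)    # source of persistent identifiers

    def _canonical(self, first_name):
        name = first_name.lower()
        return self.nicknames.get(name, name)

    def _matches(self, identity, first, last, dob):
        # Assumed rule: last name and DOB agree exactly, and the first name
        # has the same canonical form as a variant already captured.
        if identity.last_name != last.lower() or identity.dob != dob:
            return False
        known = {self._canonical(n) for n in identity.first_names}
        return self._canonical(first) in known

    def resolve(self, first, last, dob):
        for identity in self.master:
            if self._matches(identity, first, last, dob):
                # Enhance the existing identity with the new name variant.
                identity.first_names.add(first.lower())
                return identity.identifier
        # No match: capture the reference as a new identity.
        new = Identity(next(self._next_id), last.lower(), dob, {first.lower()})
        self.master.append(new)
        return new.identifier


system = IdentityCapture(nicknames={"eddie": "edgardo"})
year1_id = system.resolve("Edgardo", "Mendez", "7/12/2000")  # captured as new
year2_id = system.resolve("Eddie", "Mendez", "7/12/2000")    # resolved and enhanced
```

In this sketch both calls return the same identifier, and the master record afterwards holds both first-name variants, so a later reference under either name resolves to the same identity.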
The advantage of collecting identity information on the fly is offset to some extent by the problem of splits and consolidations. The order in which references are processed can sometimes affect the system’s identity decisions. Information that connects two references may arrive only after the two references have already created separate identity master records (a false negative). This requires the two identity master records to be consolidated, or merged. Although the master records can be corrected, doing so undermines the idea of a persistent identifier: many previously processed references may have been assigned the identifier of the retired master record, while others were assigned the identifier of the surviving master record.
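A small Python sketch can make the order dependence and the consolidation step concrete. Everything here is an assumption made for illustration: the class `IdentityMaster`, the toy rule that a reference matches an identity when they share any attribute value, and the `retired` table that maps a retired identifier to its survivor so that previously issued identifiers can still be followed.

```python
class IdentityMaster:
    """Toy identity master with consolidation (hypothetical design)."""

    def __init__(self):
        self.records = {}   # identifier -> set of attribute values
        self.retired = {}   # retired identifier -> surviving identifier
        self._next_id = 1

    def current(self, identifier):
        # Follow the chain of retirements to the surviving identifier.
        while identifier in self.retired:
            identifier = self.retired[identifier]
        return identifier

    def resolve(self, attributes):
        attributes = set(attributes)
        # Assumed rule: any shared attribute value links reference and identity.
        hits = sorted(i for i, known in self.records.items() if known & attributes)
        if not hits:
            identifier = self._next_id
            self._next_id += 1
            self.records[identifier] = attributes
            return identifier
        survivor = hits[0]
        self.records[survivor] |= attributes
        for other in hits[1:]:
            # The reference connects identities created separately earlier:
            # consolidate them and retire the extra identifier.
            self.records[survivor] |= self.records.pop(other)
            self.retired[other] = survivor
        return survivor


master = IdentityMaster()
id_a = master.resolve({"phone:555-0101"})
id_b = master.resolve({"email:em@example.org"})
id_c = master.resolve({"phone:555-0101", "email:em@example.org"})
```

Processed in this order, the first two references create separate identities, and the third forces a consolidation that retires the second identifier. Had the third reference been processed first, all three would have resolved to a single identity with no consolidation. The `retired` table is one possible way to soften the persistent-identifier problem, since references stamped with the old identifier can still be translated through `current`.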
Splits are the reverse situation, where references to two different entities are mistakenly used to create a single identity master record (a false positive). Splits are harder to correct than consolidations, and for this reason, ER systems that manage identity tend to err on the side of false negatives rather than false positives.
In the next post we will discuss the four most common strategies for linking references.


