Architectures for Entity Resolution

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In the last post we looked at a formal model for describing entity-based integration. Now let’s turn our attention to how entity resolution (ER) systems are actually implemented. One of the most important design decisions is whether the system will perform entity identity management. A system performs identity management when it creates and stores the attribute values for the identities it processes. Identity management is necessary for systems that assign persistent entity identifiers, i.e. systems that must give all of the references to the same entity the same identifier value from one resolution process to the next.

The most basic form of ER is the merge/purge process. A merge/purge process reads a large batch of references and systematically makes pair-wise comparisons between them. During the process, it assigns a group identifier to all of the references it determines to be for the same entity. However, these identifiers are transient: they exist only while a particular batch of references is being processed, since the end result is to create a single, merged record (called a “survivor” record) in place of each reference group. As a result, references to the same entity occurring in two different merge/purge processes will likely be given different group identifiers from one process to the next. For example, the references for John Doe in the first batch of references processed might be given the group ID of 213, but references to the same John Doe in a batch of references processed the next day might be given a group ID of 634. The merge/purge process can still correctly resolve the entity references in each batch, but the values of the group IDs don’t persist or carry over for the same entities from batch to batch.
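To make the transient nature of the group IDs concrete, here is a minimal Python sketch of the grouping step of a merge/purge run. Everything in it is an assumption made for illustration: the `is_match` rule (same normalized name and ZIP), the record fields, and the choice to number groups in order of first appearance. Building the survivor record is left out. It is not how any particular product works.

```python
from itertools import combinations


def is_match(a, b):
    # Hypothetical match rule: same normalized name and same ZIP code.
    return (a["name"].strip().lower() == b["name"].strip().lower()
            and a["zip"] == b["zip"])


def merge_purge(batch, first_group_id=1):
    """Return one group ID per reference in the batch.

    IDs are handed out in order of first appearance within this batch,
    so they carry no meaning outside this single run.
    """
    parent = list(range(len(batch)))  # union-find forest over positions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Systematic pair-wise comparison of every reference in the batch.
    for i, j in combinations(range(len(batch)), 2):
        if is_match(batch[i], batch[j]):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    group_ids = {}
    assigned = []
    next_id = first_group_id
    for i in range(len(batch)):
        root = find(i)
        if root not in group_ids:
            group_ids[root] = next_id
            next_id += 1
        assigned.append(group_ids[root])
    return assigned


batch_monday = [
    {"name": "John Doe", "zip": "72204"},
    {"name": "Mary Roe", "zip": "72201"},
    {"name": "john doe ", "zip": "72204"},
]
batch_tuesday = [
    {"name": "Mary Roe", "zip": "72201"},
    {"name": "John Doe", "zip": "72204"},
    {"name": "JOHN DOE", "zip": "72204"},
]

print(merge_purge(batch_monday))   # [1, 2, 1]
print(merge_purge(batch_tuesday))  # [1, 2, 2]
```

Both batches are resolved correctly, yet John Doe lands in group 1 on Monday and group 2 on Tuesday, simply because the references arrived in a different order.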

Another characteristic of the merge/purge ER process is that it is designed to operate in batch mode. However, there are transactional or “on-demand” versions of merge/purge that are sometimes referred to as heterogeneous database join systems. Instead of combining all of the reference sources into a single file for batch processing, each reference source is loaded as a database table. The application is connected to all of the source tables and has metadata that describes the structure of each reference source. This allows a single query or “join request” to be submitted to the application, which then translates the request into an appropriate query for each source. The individual query responses are collected and processed into a single view that is returned as the result of the initial query. Just as in the merge/purge process, the groups of references brought together for an entity (that is, for a query) are transient. These types of query-based ER systems are common in law enforcement and other hypothesis-testing applications.
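The query-translation idea can be sketched with Python’s built-in `sqlite3` module. The two tables, their column names, the `SOURCE_METADATA` mapping, and the `join_request` function are all hypothetical, and a real heterogeneous join system would connect to separate databases rather than one in-memory file. The sketch only shows how metadata lets one request fan out into a query per source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_contacts (full_name TEXT, postal_code TEXT, email TEXT);
    CREATE TABLE billing_accounts (cust_nm TEXT, zip TEXT, balance REAL);
    INSERT INTO crm_contacts VALUES ('John Doe', '72204', 'jdoe@example.com');
    INSERT INTO crm_contacts VALUES ('Mary Roe', '72201', 'mroe@example.com');
    INSERT INTO billing_accounts VALUES ('John Doe', '72204', 125.50);
""")

# Hypothetical metadata: how each source names the logical fields.
SOURCE_METADATA = {
    "crm_contacts": {"name": "full_name", "zip": "postal_code"},
    "billing_accounts": {"name": "cust_nm", "zip": "zip"},
}


def join_request(conn, name, zip_code):
    """Translate one logical request into a query per source and
    collect the responses into a single, transient view."""
    view = []
    for table, cols in SOURCE_METADATA.items():
        # Table and column names come from trusted metadata, not from
        # the caller; the search values are bound as parameters.
        sql = (
            f"SELECT * FROM {table} "
            f"WHERE lower({cols['name']}) = lower(?) AND {cols['zip']} = ?"
        )
        cursor = conn.execute(sql, (name, zip_code))
        col_names = [d[0] for d in cursor.description]
        for row in cursor:
            view.append({"source": table, **dict(zip(col_names, row))})
    return view


for record in join_request(conn, "john doe", "72204"):
    print(record)
```

Nothing is stored after the call returns: the combined view exists only as the answer to that one request.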

On the other hand, there are other ER architectures designed to retain and manage entity identity information. By doing this they are able to “recognize” references to the same entity over time and assign those references the same entity identifier, i.e. maintain persistent entity identifiers. In CRM applications these kinds of systems are sometimes called Customer Recognition Systems.
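As a contrast with transient group IDs, the following Python sketch shows the bare idea of retaining identity information between runs. The `IdentityStore` class, its exact-match rule on normalized name and ZIP, and the in-memory dictionary are assumptions chosen to keep the example short. A real system would persist the store and use far richer matching.

```python
class IdentityStore:
    """Hypothetical store that keeps attribute values for each known
    identity so later references can be recognized."""

    def __init__(self):
        self._identities = {}  # entity_id -> stored attribute values
        self._next_id = 1

    @staticmethod
    def _key(ref):
        # Assumed match rule: normalized name plus ZIP code.
        return (ref["name"].strip().lower(), ref["zip"])

    def resolve(self, ref):
        """Return the persistent ID for a reference, creating a new
        identity only when no stored identity matches."""
        key = self._key(ref)
        for entity_id, stored in self._identities.items():
            if self._key(stored) == key:
                return entity_id
        entity_id = self._next_id
        self._next_id += 1
        self._identities[entity_id] = dict(ref)
        return entity_id


store = IdentityStore()

monday = [{"name": "John Doe", "zip": "72204"},
          {"name": "Mary Roe", "zip": "72201"}]
tuesday = [{"name": "Mary Roe", "zip": "72201"},
           {"name": "JOHN DOE", "zip": "72204"}]

print([store.resolve(r) for r in monday])   # [1, 2]
print([store.resolve(r) for r in tuesday])  # [2, 1]
```

Because the store survives between batches, John Doe keeps identifier 1 on both days regardless of the order in which references arrive.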

There are two major types of ER system architectures that perform identity management: “identity resolution” systems and “identity capture” systems. In the next post, I will pick up here with a discussion of how these systems manage identity and maintain persistent entity identifiers.
