Entity Identity Management
Friday, January 14th, 2011
By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)
First, let me wish everyone a Happy and Prosperous New Year. Also, since my last post, my book Entity Resolution and Information Quality has been published and is now available from Morgan Kaufmann Publishing (http://mkp.com/news/entity-resolution-and-information-quality).
What is entity identity management? It simply means that an ER system can store and maintain a record of identity information that persists over time. Entity identity management is essential for an ER engine to operate in identity resolution or identity capture mode and for it to maintain persistent entity identifiers.
As you may recall from previous discussions, an identity resolution ER system starts with a set of known (asserted) identities and attempts to determine if a given entity reference refers to one of these known entities. On the other hand, an identity capture ER system starts with a blank slate and tries to construct an identity based on the (equivalent) references it processes.
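The contrast between the two modes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `same_entity` rule (exact match on a made-up `email` field), the identifier format, and all names below are hypothetical and are not taken from any particular ER system.

```python
import itertools


def same_entity(ref, member):
    # Hypothetical match rule for illustration only: exact match on "email".
    return ref["email"] == member["email"]


def identity_resolution(ref, known_identities):
    """Identity resolution: look ref up among asserted identities.

    Returns the matching identifier, or None. It never creates an identity.
    """
    for entity_id, members in known_identities.items():
        if any(same_entity(ref, m) for m in members):
            return entity_id
    return None


class IdentityCapture:
    """Identity capture: start empty and build identities from references."""

    def __init__(self):
        self.identities = {}           # identifier -> list of references
        self._ids = itertools.count(1)

    def process(self, ref):
        for entity_id, members in self.identities.items():
            if any(same_entity(ref, m) for m in members):
                members.append(ref)    # the identity grows over time
                return entity_id
        entity_id = f"E{next(self._ids)}"
        self.identities[entity_id] = [ref]
        return entity_id


known = {"E100": [{"name": "J. Smith", "email": "js@example.com"}]}
hit = identity_resolution({"name": "John Smith", "email": "js@example.com"}, known)
miss = identity_resolution({"name": "A. Jones", "email": "aj@example.com"}, known)

capture = IdentityCapture()
first = capture.process({"name": "J. Smith", "email": "js@example.com"})
second = capture.process({"name": "John Smith", "email": "js@example.com"})
third = capture.process({"name": "A. Jones", "email": "aj@example.com"})
```

In this sketch the resolution function leaves an unmatched reference unresolved, while the capture object assigns it a new identifier and stores it. That stored state is what allows identifiers to persist from one run to the next.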
Two important concepts here bear further discussion. The first is the structure used to represent the identity of an entity. The second, somewhat more philosophical, is the question of what constitutes entity identity.
There are two commonly used approaches to representing identity in ER systems – one is an attribute-level structure sometimes called a “merge identity” and the other is a reference-level structure sometimes called a “cluster identity.” The difference between a merge identity and a cluster identity can be illustrated by a simple example.
Suppose we have a system where entity references have three attributes A, B, and C, and that we are given two specific entity references R1=(a1, b1, c1) and R2=(a2, b2, c1), where a1 and a2 are values for attribute A, b1 and b2 values for attribute B, and c1 a value for attribute C. Finally assume that references R1 and R2 are determined to be equivalent references (i.e. references to the same real-world entity). In the merge identity approach, the entity identity EM referenced by R1 and R2 would be represented as
EM=[A:{a1, a2}, B:{b1, b2}, C:{c1}]
This means that for identity EM the A attribute can take on either the value a1 or a2, the B attribute can take on either the value b1 or b2, and the C attribute can take on only the value c1. In a merge identity the binding between the values a1 and b1 that was expressed by their co-occurrence in the reference R1 is lost. Similarly, the binding between a2 and b2 expressed by R2 is no longer present in EM.
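A minimal Python sketch of a merge identity, assuming references are plain tuples in (A, B, C) order. The function name `build_merge_identity` and the dictionary-of-sets layout are illustrative choices, not a prescribed implementation.

```python
from collections import defaultdict

ATTRIBUTES = ("A", "B", "C")  # assumed attribute order within a reference


def build_merge_identity(references):
    """Collapse equivalent references into one set of values per attribute."""
    identity = defaultdict(set)
    for ref in references:
        for attr, value in zip(ATTRIBUTES, ref):
            identity[attr].add(value)
    return dict(identity)


r1 = ("a1", "b1", "c1")
r2 = ("a2", "b2", "c1")
em = build_merge_identity([r1, r2])

# The pair (a1, b2) never occurred together in R1 or R2, yet the merge
# identity accepts it, because each value is checked against its own set.
accepts_mixed_pair = "a1" in em["A"] and "b2" in em["B"]
```

Because each attribute keeps only a set of values, nothing in `em` records which A value arrived with which B value.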
In a cluster identity structure, the original reference binding between attribute values is preserved. In the cluster identity approach, the entity identity EC referenced by R1 and R2 would be represented as
EC=[(A:a1, B:b1, C:c1), (A:a2, B:b2, C:c1)]
Thus, for identity EC the attributes A, B, and C can only take on the combinations of values given by the original references R1 and R2. There are advantages and disadvantages to both approaches, but most significantly they can lead to different resolutions for the same set of references.
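For comparison, here is a minimal sketch of a cluster identity built from R1=(a1, b1, c1) and R2=(a2, b2, c1) as defined earlier. The tuple layout and the name `build_cluster_identity` are assumptions made for illustration.

```python
ATTRIBUTES = ("A", "B", "C")  # assumed attribute order within a reference


def build_cluster_identity(references):
    """Keep every equivalent reference intact as an attribute-to-value map."""
    return [dict(zip(ATTRIBUTES, ref)) for ref in references]


r1 = ("a1", "b1", "c1")
r2 = ("a2", "b2", "c1")
ec = build_cluster_identity([r1, r2])

# Only (A, B) pairs that actually co-occurred in a reference are present.
observed_ab_pairs = {(member["A"], member["B"]) for member in ec}
```

Here the identity is simply the list of its member references, so the pairing of a1 with b1 and of a2 with b2 is still available to any later matching step.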
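To illustrate, let’s continue with the preceding example by supposing that the systems using the merge identity and the cluster identity both use the same two resolution rules. Rule 1 is that two references are considered equivalent if they agree (exact match) on Attribute C. Rule 2 is that they are equivalent if they agree (exact match) on both Attributes A and B. For a merge identity, a reference agrees on an attribute when its value is one of the values the identity holds for that attribute. For a cluster identity, the rules are applied to each member reference in turn.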
Now suppose that each system processes a third entity reference R3=(a1, b2, c2). Using the two rules just discussed, the merge identity system would resolve R3 as equivalent to the identity EM built from references R1 and R2. Rule 1 does not apply, because c2 is not a C value of EM, but by Rule 2, R3 agrees with EM on both attribute A (a1 is one of EM's A values) and attribute B (b2 is one of EM's B values). On the other hand, R3 would not resolve to the identity EC in the cluster identity system. R3 does not satisfy either Rule 1 or Rule 2 with respect to either of the references R1 and R2 that make up the cluster identity EC: it agrees with R1 only on A, and with R2 only on B.
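The two outcomes can be checked with a short Python sketch. It assumes the same tuple layout (A, B, C) for references and hard-codes Rule 1 and Rule 2. The function names are illustrative only.

```python
# Merge identity EM and cluster identity EC, both built from
# R1=(a1, b1, c1) and R2=(a2, b2, c1).
em = {"A": {"a1", "a2"}, "B": {"b1", "b2"}, "C": {"c1"}}
ec = [("a1", "b1", "c1"), ("a2", "b2", "c1")]


def matches_merge(ref, identity):
    """Apply Rule 1 or Rule 2 against the per-attribute value sets."""
    a, b, c = ref
    rule1 = c in identity["C"]
    rule2 = a in identity["A"] and b in identity["B"]
    return rule1 or rule2


def matches_cluster(ref, cluster):
    """Apply Rule 1 or Rule 2 against each member reference individually."""
    a, b, c = ref
    return any(
        c == mc or (a == ma and b == mb)
        for (ma, mb, mc) in cluster
    )


r3 = ("a1", "b2", "c2")
merge_result = matches_merge(r3, em)      # Rule 2 is satisfied via a1 and b2
cluster_result = matches_cluster(r3, ec)  # no single member satisfies a rule
```

Under these assumptions `merge_result` is True and `cluster_result` is False, which is the divergence described above: the merge identity accepts a value pairing that neither original reference ever contained.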
Merge identities and cluster identities both represent valid, but different, approaches to identity management. To some extent they also represent two different ways of thinking about entity identity. I plan to discuss the concept of the entity identity further in the next post.
