By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)
From a business standpoint, entity resolution (ER) is really the first step of a two-part process of integrating information about entities. Entity reference records usually carry two types of attributes describing the entity: identifying attributes and informational attributes. Although the line between the two can be fuzzy, identifying attributes are those that describe the entity’s “characteristics,” information that tends to persist over time and helps to distinguish one entity from another of the same type.
For example, a customer reference might have identifying attributes such as name, mailing address, or age, all relating to the identity of a person. But the record may also have informational attributes such as marital status, hobby interests, or the make and model of his or her automobile. The latter information could be important in understanding how to market to this individual, but may not be as helpful for identifying the person.
Let’s go back to where we left the technical discussion. In the last post we looked at representing the outcome of an ER process (E) acting on a list of entity references (S) in process order (λ) as being equivalent to a partition (P) of the set S. The notation we used for this was P = (E, S, λ). Recall that a partition of S is simply any collection of non-empty subsets of S with two properties, 1) that the subsets don’t overlap, yet 2) the union of all the subsets is equal to S. So in the case of the process E, if we divide S into subsets based on whether they reference the same entity, then these subsets will give us a partition of S. Even though the partition P doesn’t tell us how the ER process operates, it does convey all of the information about the result of the process. For any two references in S, the partition P will tell us the decision of E. If the two references are in different partition subsets, it means E’s decision is that the two references are to different entities. On the other hand if the two references are in the same subset, it means the references are to the same entity.
Therefore all ER processes acting upon a set of references can be described in terms of a partition of the reference set. The reverse is also true. Any partition of the reference set can be thought of as the result of a decision process, such as an ER process. This, then, is a nice “black box” way to describe an ER process in terms of its result without having to worry about its internal mechanism.
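To make the partition view concrete, here is a minimal Python sketch. The function names (`is_partition`, `same_entity`) and the sample reference labels are illustrative assumptions, not part of the original notation. The sketch checks the two partition properties described above and then reads E's decision for a pair of references directly off the partition.

```python
from itertools import combinations


def is_partition(subsets, s):
    """Return True if `subsets` is a partition of the set `s`."""
    subsets = [set(b) for b in subsets]
    # Every subset must be non-empty.
    if any(len(b) == 0 for b in subsets):
        return False
    # Property 1: no two subsets overlap.
    if any(a & b for a, b in combinations(subsets, 2)):
        return False
    # Property 2: the union of all subsets equals S.
    return set().union(*subsets) == set(s)


def same_entity(subsets, r1, r2):
    """E's decision for two references: True if they share a subset."""
    return any(r1 in b and r2 in b for b in subsets)


# Hypothetical reference set S and ER outcome P (assumed sample data).
S = {"r1", "r2", "r3", "r4", "r5"}
P = [{"r1", "r3"}, {"r2"}, {"r4", "r5"}]

print(is_partition(P, S))
print(same_entity(P, "r1", "r3"))
print(same_entity(P, "r1", "r2"))
```

Note that nothing in the sketch depends on how the subsets were produced, which is exactly the “black box” point.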
So if a marketer has several sources of entity information, the first step is to apply an entity resolution process that brings together those records about the same customer, then to merge the attribute values among these records to assemble a more complete view of each entity. Now here is an interesting twist. The attributes of the reference records can themselves be thought of as entities. For example, just as “Jim” and “James” can be considered equivalent names, the attributes of age and date-of-birth can be considered equivalent in the sense that, for a fixed point in time, the value of one can be transformed into the value of the other.
Okay, now let’s look at the general case. We start with several reference sources R1, R2, …, Rn, where each reference source (Rj) is defined by its underlying set of reference records (Sj), a set of attributes defined for each reference record (Aj), a set of attribute values that the attributes can take on (Vj), and a mapping (Mj) that assigns a value to each attribute of each record. That is,
Rj = (Sj, Aj, Vj, Mj), where
Mj(r, a) = v, where r is a record in Sj, a is an attribute in Aj, and v is a value in Vj.
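One way to picture a reference source is as a small data structure. The sketch below is only an illustration under assumed names: the class `ReferenceSource`, its field names, and the sample customer data are hypothetical, and the mapping Mj is stored as a dictionary keyed by (record, attribute).

```python
from dataclasses import dataclass


@dataclass
class ReferenceSource:
    """Illustrative stand-in for Rj = (Sj, Aj, Vj, Mj)."""
    records: set      # Sj: the reference records
    attributes: set   # Aj: the attributes defined for each record
    values: set       # Vj: the values the attributes can take on
    mapping: dict     # Mj: (record, attribute) -> value

    def M(self, r, a):
        """Return Mj(r, a), the value of attribute a for record r."""
        if r not in self.records:
            raise KeyError(f"unknown record: {r!r}")
        if a not in self.attributes:
            raise KeyError(f"unknown attribute: {a!r}")
        return self.mapping[(r, a)]


# Hypothetical source with two records and two attributes.
R1 = ReferenceSource(
    records={"x", "y"},
    attributes={"name", "age"},
    values={"Jim", "James", 41},
    mapping={
        ("x", "name"): "Jim",
        ("x", "age"): 41,
        ("y", "name"): "James",
        ("y", "age"): 41,
    },
)

print(R1.M("x", "name"))
```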
Now let S represent the union of all the individual reference sets S1…Sn, and let A represent the union of all the attributes A1…An. We can describe an entity-based integration model as follows.
Let P be a partition of S (all of the records from all sources) and let Q be a partition of A (all attributes from all sources). As we described earlier, if two records are in the same subset of partition P, it means that they refer to the same entity. In this case P is modeling the ER process. The partition Q, on the other hand, models attribute equivalence. If two attributes are in the same subset of Q, it means that they are equivalent. As with the example of date-of-birth and age, equivalent attributes may not be exactly the same, but the value of one attribute can be systematically mapped into a value of the other attribute.
Here’s how it works. Suppose that {x, y, z} is one of the subsets of P, meaning that x, y, and z are all references to the same entity, and that {u, v} is one of the subsets of Q, meaning that u and v are equivalent attributes. Also suppose that u is an attribute for records x and y, and that v is an attribute for z. The table below shows an “integration cell.”
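The age and date-of-birth example can be sketched in a few lines of Python. Everything named here (`age_from_dob`, `to_age`, the attribute labels, and the sample partition `Q`) is an assumption made for illustration. The sketch only shows the direction from date-of-birth to age at a fixed point in time; going the other way from an age alone would narrow the date of birth to a range rather than a single day.

```python
from datetime import date


def age_from_dob(dob, as_of):
    """Map a date-of-birth value to an age in whole years at `as_of`."""
    years = as_of.year - dob.year
    # Subtract one if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years


# Hypothetical attribute partition Q: age and date_of_birth are equivalent.
Q = [{"age", "date_of_birth"}, {"name"}]


def to_age(attribute, value, as_of):
    """Express a value of either equivalent attribute as an age."""
    if attribute == "age":
        return value
    if attribute == "date_of_birth":
        return age_from_dob(value, as_of)
    raise ValueError(f"{attribute!r} is not equivalent to age")


fixed_point = date(2010, 6, 15)
print(to_age("date_of_birth", date(1980, 6, 15), fixed_point))
print(to_age("age", 30, fixed_point))
```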

Because x, y, and z are equivalent references, the three rows of this table really represent one entity “e” while u and v represent the same attribute “w”. In this case there is a conflict because records x and y contribute different values. It is not clear if the integrated entity e should have a value of “ab” or a value of “cd” for the integrated attribute w. Deciding which value to select among conflicting values is called “knowledgebase arbitration.” One way to select is the “voting” scheme. Using this scheme the value would be “ab” because it occurs most frequently in the integration cell.
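The voting scheme can be sketched as follows. The assignment of values to individual records is an assumption: the text only says that x and y contribute different values and that “ab” occurs most frequently, so the sketch arbitrarily gives “ab” to x and z and “cd” to y. The function name `vote` and its tie handling (returning `None` when no single value wins) are likewise illustrative choices, not part of the model as published.

```python
from collections import Counter


def vote(cell_values):
    """Pick the most frequent value in an integration cell.

    Returns None when the cell is empty or the top count is tied,
    since how to break ties is not specified here.
    """
    counts = Counter(v for v in cell_values if v is not None)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Assumed contents of the integration cell, keyed by (record, attribute).
cell = {
    ("x", "u"): "ab",
    ("y", "u"): "cd",
    ("z", "v"): "ab",
}

w = vote(cell.values())
print(w)
```

A tie, such as a cell holding only “ab” and “cd” once each, would need some other arbitration rule.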
Space doesn’t permit a full exposition of this model, but if you want to explore further, a more complete description can be found in the paper: Talburt, J. & Hashemi, R. (2008). A formal framework for defining entity-based, data source integration. In H. Arabnia & R. Hashemi (Eds.), 2008 International Conference on Information and Knowledge Engineering (pp. 394-398). Las Vegas, NV: CSREA Press.
In the next post we will discuss the most common architectures for ER systems.