The Two Sides of Entity Resolution

By John Talburt, Professor of Information Science and Director of the Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ) at the University of Arkansas at Little Rock

I have always liked the definition of entity resolution put forward by the Infolab at Stanford University: “locating and merging records that refer to the same real-world entities.” The reason is that it succinctly describes the two primary facets of entity resolution, namely locating and merging. If you look at the literature on entity and identity resolution, you generally find that the focus is on one facet but not the other. Until recently, commercial entity resolution focused almost entirely on the merge side, mainly because the records being processed came from databases, flat files, or other structured sources. In a structured source, the entity attributes, and consequently the identity attributes, are given explicitly. In this case, most of the work centers on record linking, i.e., assigning a common identifier to records that refer to the same entity. Unfortunately, all too many of these record linking processes subscribe to the “matching myth,” the false assumption that two records represent the same entity if and only if their identifying attributes match, but more about that in another article.
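To make the merge side concrete, here is a minimal sketch of record linking over a structured source, written in Python. It is an illustration only, not a description of any particular commercial product: the function names, the sample records, and the choice of name and city as identity attributes are all assumptions made for the example. Note that the match rule is plain attribute agreement after light normalization, which is exactly the simplification behind the “matching myth,” so treat it as a starting point rather than a sound linking rule.

```python
from itertools import combinations


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(value.lower().split())


def records_match(a, b, attributes):
    """Hypothetical match rule: every identity attribute agrees after normalization."""
    return all(normalize(a[k]) == normalize(b[k]) for k in attributes)


def link_records(records, attributes):
    """Assign a common link identifier to records judged to refer to the same entity.

    Uses a small union-find structure so that links are transitive:
    if A matches B and B matches C, all three receive the same identifier.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of records (fine for a sketch, too slow for large files).
    for i, j in combinations(range(len(records)), 2):
        if records_match(records[i], records[j], attributes):
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the smaller index as the root so identifiers are stable.
                parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(len(records))]


# Made-up sample data for illustration.
records = [
    {"name": "Mary Smith", "city": "Little Rock"},
    {"name": "mary  smith", "city": "little rock"},
    {"name": "John Doe", "city": "Conway"},
]
link_ids = link_records(records, ["name", "city"])
print(link_ids)
```

The first two records receive the same link identifier because their identity attributes agree once case and spacing are normalized; the third stands alone.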

However, I am now seeing increasing attention on the locating side of entity resolution. Locating is required when information is presented in an unstructured format, such as text documents or images. In this case, the entity references must first be located (identified) in the source and extracted from it before the merging process can take place. Once considered the purview of academics, the art of “feature extraction” has gone mainstream as organizations realize that they often possess more information in unstructured format than in structured files. Recent books like Tapping into Unstructured Data by Inmon and Nesavich, along with a number of new commercial software packages for processing unstructured data, are evidence of this emerging trend. Interest by the US intelligence community in developing techniques for efficient, large-scale entity extraction has also motivated new research in this area. As in so many areas of information technology, the advent of low-cost, high-performance computing has opened the door to many new approaches to entity extraction and identification that were not practical before. Based on some of the work I have seen, I believe we are rapidly approaching a point where the expression “machine readable” will no longer mean just binary encoding, but reading and understanding in a human sense.
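For readers who want a feel for the locating side, the following is a deliberately simple Python sketch that finds person references in free text. It assumes, purely for illustration, that a person is always introduced by a title such as “Dr.” or “Mr.”; the pattern, the function name, and the sample sentence are all made up for this example. Real extraction tools rely on far richer techniques than a single regular expression.

```python
import re

# Assumed pattern: a title, a period, then one or more capitalized words.
NAME_PATTERN = re.compile(
    r"\b(?:Mr|Ms|Mrs|Dr|Prof)\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)


def locate_person_references(text):
    """Return each located reference with its character offsets in the source text."""
    return [
        {"name": m.group(1), "start": m.start(1), "end": m.end(1)}
        for m in NAME_PATTERN.finditer(text)
    ]


# Made-up sample text for illustration.
text = "Dr. Mary Smith met Mr. John Doe in Little Rock. Later, Dr. Smith left."
refs = locate_person_references(text)
for ref in refs:
    print(ref)
```

Each extracted reference keeps its offsets so it can be traced back to the source document. Deciding whether “Smith” and “Mary Smith” refer to the same person is then a job for the merge side.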
