
Archive for July, 2009

Identity Resolution Daily Links 2009-07-06

Monday, July 6th, 2009

[Post from Infoglide] Entity Extraction

“In my last post I discussed my definitions for entity resolution, entity identification, entity disambiguation, and anonymous entity resolution.  (And I reiterate that these are just my definitions and are not binding on anyone except possibly my students.) Let’s go back to the overarching term entity resolution (ER).  In its broadest sense, I see ER as encompassing three major activities…”

Article Marketing: Electronic Medical Records – Are There Reasons For Low Implementation?

“Doctors may soon have little choice but to implement computerized medical billing and patient record systems. HIPAA’s scope recently expanded to health care providers with less than $5 million in revenue.”

OCDQ Blog: Worthy Data Quality Whitepapers (Part 1)

“It is about the data – the quality of the data… This is the subtitle of two brief but informative data quality whitepapers freely available (no registration required) from the Electronic Commerce Code Management Association (ECCMA): Transparency and Data Portability.”

SmartData Collective: Moving BI Into The Cloud Part 1

“Over the last year I have been reading a lot about cloud computing and trying to predict how this can be used in business intelligence and analytics. I believe that the cloud is becoming relevant for a number of reasons…”

Happy Fourth of July

Friday, July 3rd, 2009

Our thoughts and prayers are with everyone serving overseas and their families this Fourth of July. Be safe and have a great weekend. We’ll return on Monday.


Entity Extraction

Wednesday, July 1st, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In my last post I discussed my definitions for entity resolution, entity identification, entity disambiguation, and anonymous entity resolution.  (And I reiterate that these are just my definitions and are not binding on anyone except possibly my students.)

Let’s go back to the overarching term entity resolution (ER).  In its broadest sense, I see ER as encompassing three major activities:

1.    Extracting or collecting entity references from sources
2.    Linking references to the same entity
3.    Exploring networks of entity associations

In this post let’s focus on the first process, extracting and collecting entity references.  For many academic researchers, this is what entity resolution is all about, i.e. it represents the really interesting and challenging part. An extensive body of research literature discusses the methods and techniques for finding entity references in unstructured information, especially unstructured textual information (UTI).

There are many other ER participants for whom this process holds little or no interest.  These are primarily the commercial ER processors who expect that the information they begin with is already structured.  For them, the starting point is a record or database instance assumed to relate to an entity (e.g. customer) and that has well-defined fields or columns.  Their game is all about the process of linking these records.

However, there’s a growing realization that most of an organization’s information assets reside in unstructured data stores such as emails, reports, spreadsheets, photos, graphs, comments, notes, and other sources that are not only unstructured but may not even be in computer-readable format.  A commonly cited rule of thumb is the 80-20 rule: 80% unstructured to 20% structured.  The actual proportion will vary from organization to organization, but there is no denying that a tremendous amount of information is tucked away in unstructured formats.  Consequently, the text miners, law enforcement, the intelligence community, and other old hands at entity extraction are now being joined by the commercial world in the rush to exploit this new source of information and, potentially, business intelligence (BI).

Researchers in image processing have long recognized the process of “feature extraction” where the parts of an image of interest (such as a human face) are located within the larger image. Thus I like the term “entity extraction” to describe this process in a broader sense that also includes text, audio, and other media, not just images.

The level or degree to which an entity reference is classified is another important issue in entity extraction.  Entities are just the people, places, or things we are interested in for a given application, and as we have learned from object-oriented analysis, these entity/objects often exist within a logical hierarchy.  The level of classification often impacts the strategy and the complexity of the extraction process.

To illustrate levels of classification, consider the following example of unstructured text that might appear in a newspaper announcement:

“On July 21, 2008, Mary Jo Smith, daughter of Sam and Sue Smith of Ft. Worth, was married to John Doe, son of Bill and Mary Doe, in a ceremony at the St. Joe Church in Dallas.”

At the highest level, several parsers could read the text and classify most of these references as people (e.g. Mary Jo Smith, John Doe), places (e.g. Ft. Worth, Dallas), and dates (e.g. July 21, 2008).  At a deeper level, however, we are interested not just in entity class, but more particularly in the entity’s sub-class or role.  The context is often given in the form of an “ontology” that specifies the entities and roles within a given context.  In this case, a “marriage ontology” would have roles for Bride, Groom, Date-Of-Marriage, Parent-of-Bride, Parent-of-Groom, and Place-of-Marriage. In our example above, determining that the reference Mary Jo Smith is not only to a person, but to the person in the role of “bride” in the context of a marriage announcement is a more demanding problem than simply discovering that Mary Jo is a person.
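To make the two levels concrete, here is a minimal Python sketch, not a general extractor. It assumes the announcement follows exactly the sentence shape of the example above. The pattern `ANNOUNCEMENT`, the helper `expand_couple`, and the `ROLE_CLASS` table are hypothetical names invented for this illustration. The role names come from the marriage ontology described above.

```python
import re

TEXT = ("On July 21, 2008, Mary Jo Smith, daughter of Sam and Sue Smith of "
        "Ft. Worth, was married to John Doe, son of Bill and Mary Doe, in a "
        "ceremony at the St. Joe Church in Dallas.")

# Hypothetical pattern: it only handles this one sentence shape.
ANNOUNCEMENT = re.compile(
    r"On (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4}), "
    r"(?P<bride>[^,]+), daughter of (?P<bride_parents>.+?) of [^,]+, "
    r"was married to (?P<groom>[^,]+), son of (?P<groom_parents>[^,]+), "
    r"in a ceremony at (?P<place>.+)\.$"
)

# Top-level class for each role in the (assumed) marriage ontology.
ROLE_CLASS = {
    "Bride": "person",
    "Groom": "person",
    "Parent-of-Bride": "person",
    "Parent-of-Groom": "person",
    "Date-Of-Marriage": "date",
    "Place-of-Marriage": "place",
}


def expand_couple(phrase):
    """Naively turn 'Sam and Sue Smith' into ['Sam Smith', 'Sue Smith']."""
    first, second = phrase.split(" and ", 1)
    surname = second.split()[-1]
    if " " not in first:
        # A lone first name borrows the surname of the second person.
        first = f"{first} {surname}"
    return [first, second]


def extract_roles(text):
    """Return a role -> value mapping, or None if the text does not match."""
    m = ANNOUNCEMENT.search(text)
    if m is None:
        return None
    return {
        "Bride": m["bride"],
        "Groom": m["groom"],
        "Date-Of-Marriage": m["date"],
        "Parent-of-Bride": expand_couple(m["bride_parents"]),
        "Parent-of-Groom": expand_couple(m["groom_parents"]),
        "Place-of-Marriage": m["place"],
    }


roles = extract_roles(TEXT)
for role, value in roles.items():
    # Print the role, its top-level class, and the extracted reference(s).
    print(f"{role} ({ROLE_CLASS[role]}): {value}")
```

The sketch only works because the sentence template is hard-coded. Any change in word order, such as naming the groom first, causes the pattern to fail. That brittleness is exactly why role-level extraction is harder than class-level extraction.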

Even from this simple example, it is clear that developing a general solution for extracting and classifying entity references is a formidable challenge.  Another growing area of ER research moves beyond linking references to the same entity and toward exploring networks of linkages among entities, a topic for my next post.

