Archive for the ‘Identity Resolution’ Category

Entity Extraction

Wednesday, July 1st, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In my last post I discussed my definitions for entity resolution, entity identification, entity disambiguation, and anonymous entity resolution.  (And I reiterate that these are just my definitions and are not binding on anyone except possibly my students.)

Let’s go back to the overarching term entity resolution (ER).  In its broadest sense, I see ER as encompassing three major activities:

1.    Extracting or collecting entity references from sources
2.    Linking references to the same entity
3.    Exploring networks of entity associations

In this post let’s focus on the first process, extracting and collecting entity references.  For many academic researchers, this is what entity resolution is all about, i.e. it represents the really interesting and challenging part. An extensive body of research literature discusses the methods and techniques for finding entity references in unstructured information, especially unstructured textual information (UTI).

There are many other ER participants for whom this process holds little or no interest.  These are primarily the commercial ER processors, who expect the information they begin with to be structured already.  For them, the starting point is a record or database instance that is assumed to relate to an entity (e.g. a customer) and that has well-defined fields or columns.  Their game is all about the process of linking these records.

However, there’s a growing realization that most of an organization’s information assets reside in unstructured data stores such as emails, reports, spreadsheets, photos, graphs, comments, notes, and other sources that are not only unstructured but may not even be in computer-readable format.  The consensus is that the 80-20 rule applies: 80% unstructured to 20% structured.  The actual proportion will vary from organization to organization, but there is no denying that a tremendous amount of information is tucked away in unstructured formats.  Consequently, text miners, law enforcement, the intelligence community, and other old hands at entity extraction are now being joined by the commercial world in the rush to exploit this new source of information and, potentially, business intelligence (BI).

Researchers in image processing have long recognized the process of “feature extraction” where the parts of an image of interest (such as a human face) are located within the larger image. Thus I like the term “entity extraction” to describe this process in a broader sense that also includes text, audio, and other media, not just images.

The level or degree to which an entity reference is classified is another important issue in entity extraction.  Entities are just the people, places, or things we are interested in for a given application, and as we have learned from object-oriented analysis, these entity/objects often exist within a logical hierarchy.  The level of classification often impacts the strategy and the complexity of the extraction process.

To illustrate levels of classification, consider the following example of unstructured text that might appear in a newspaper announcement:

“On July 21, 2008, Mary Jo Smith, daughter of Sam and Sue Smith of Ft. Worth, was married to John Doe, son of Bill and Mary Doe, in a ceremony at the St. Joe Church in Dallas.”

At the highest level, several parsers could read the text and classify most of these references as people (e.g. Mary Jo Smith, John Doe), places (e.g. Ft. Worth, Dallas), and dates (e.g. July 21, 2008).  At a deeper level, however, we are interested not just in entity class, but more particularly in the entity’s sub-class or role.  The context is often given in the form of an “ontology” that specifies the entities and roles within a given context.  In this case, a “marriage ontology” would have roles for Bride, Groom, Date-Of-Marriage, Parent-of-Bride, Parent-of-Groom, and Place-of-Marriage. In our example above, determining that the reference Mary Jo Smith is not only to a person, but to the person in the role of “bride” in the context of a marriage announcement is a more demanding problem than simply discovering that Mary Jo is a person.
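The two levels of classification can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the `MARRIAGE_ONTOLOGY` table, the `NAME` pattern, and the `extract_roles` function are hypothetical names invented here, and the single hand-written pattern fits the sentence structure of this one announcement, not marriage announcements in general.

```python
import re

ANNOUNCEMENT = (
    "On July 21, 2008, Mary Jo Smith, daughter of Sam and Sue Smith of "
    "Ft. Worth, was married to John Doe, son of Bill and Mary Doe, in a "
    "ceremony at the St. Joe Church in Dallas."
)

# Hypothetical "marriage ontology": each role mapped to its coarse entity class.
MARRIAGE_ONTOLOGY = {
    "Date-Of-Marriage": "date",
    "Bride": "person",
    "Parent-of-Bride": "person",
    "Groom": "person",
    "Parent-of-Groom": "person",
    "Place-of-Marriage": "place",
}

# One or more capitalized words, e.g. "Mary Jo Smith" (an assumption about names).
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# A pattern tied to the phrasing of the example sentence above.
PATTERN = re.compile("".join([
    r"On (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4}), ",
    rf"(?P<bride>{NAME}), daughter of (?P<bride_parents>{NAME} and {NAME})",
    r"(?: of [^,]+)?, was married to ",
    rf"(?P<groom>{NAME}), son of (?P<groom_parents>{NAME} and {NAME})",
    r"(?: of [^,]+)?, in a ceremony at (?P<place>.+)\.$",
]))


def extract_roles(text):
    """Return (entity_class, role, reference) triples, or [] if no match."""
    m = PATTERN.search(text)
    if m is None:
        return []
    refs = [("Date-Of-Marriage", m["date"]), ("Bride", m["bride"])]
    refs += [("Parent-of-Bride", p) for p in m["bride_parents"].split(" and ")]
    refs.append(("Groom", m["groom"]))
    refs += [("Parent-of-Groom", p) for p in m["groom_parents"].split(" and ")]
    refs.append(("Place-of-Marriage", m["place"]))
    # The entity class is the shallow level; the role is the deeper level.
    return [(MARRIAGE_ONTOLOGY[role], role, value) for role, value in refs]


for entity_class, role, value in extract_roles(ANNOUNCEMENT):
    print(f"{entity_class:7} {role:18} {value}")
```

Note that the class alone cannot separate Mary Jo Smith from Mary Doe, since both are simply people. Only the surrounding context ("daughter of", "son of ... and") assigns one the role of Bride and the other the role of Parent-of-Groom, and a rigid pattern like this one breaks as soon as the wording changes.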

Even from this simple example, it is clear that developing a general solution for extracting and classifying entity references is a formidable challenge.  Another growing area of ER research moves beyond linking references to the same entity and toward exploring networks of linkages, a topic for my next post.

Identity Resolution Daily Links 2009-06-30

Tuesday, June 30th, 2009

By the Infoglide Team

Francine Hardaway’s Blog: Are There Economies of Scale in Medicine?

“The efficiencies come when a group of physicians are all responsible for a patient’s continuity of care, and when they share information such as that possible with electronic health records (EHRs).”

Insurance & Financial Advisor: Poizner, industry oppose California downgrading of insurance fraud felonies

“‘Reclassifying 73 crimes including “false insurance claims” is a disservice to the consumers and businesses in the state of California,’ the letter said. ‘In addition, taking the power out of the hands of the public prosecutor to charge someone with a felony crime will have a serious impact on public safety.’”

BAM INTEL: A Growing Trend - Fusion Centers Connect Private and Public Sector Thinking

“The private sector owns about 80% of all critical infrastructure, and a communication disconnect could result catastrophically in a disaster scenario.”

Identity Resolution Daily Links 2009-06-27

Saturday, June 27th, 2009

[Post from Infoglide] The Real Test of Identity Resolution

“So the title ‘Catching Terrorists and Making the World a Safer Place’ certainly caught my eye! And the content of the post did not disappoint, as the author Chris Boorman of Informatica did a great job of crystallizing the issue that drove the creation of this blog over two years ago: ‘So how do we balance the freedom of movement we have come to expect as hard-working citizens with the need to spot terrorists?’”

[Post from Infoglide] Identity Resolution Daily on Twitter

“At Identity Resolution Daily, we often come across interesting tidbits about entity resolution, and now we can share them in real time. Just add our ID - @IDResolution - to your twitter sources. Happy tweeting!”

GreenvilleOnline.com: Consumers may see insurance rates rise

“According to Love, the average family spends about $1,000 more per year as a result of insurance fraud. That’s felt in higher insurance premiums, taxes, and the cost of goods and services, she said.”

Ezine: Fraud Alert - Lottery Retailers Win More Than Their Customers Do

“California Lottery did an undercover sting where they brought, what they knew to be, a winning lottery ticket to a retailer to have it verified. They caught many retailers on hidden camera telling them that the winning ticket was a loser and, subsequently, went on to claim the money themselves. On top of that, a statistician studied big wins of lottery retailers in Ontario, Canada and found that retailers won big jackpots a lot more than you would statistically expect them to.”

data quality pro: Rethinking Data Quality: The Need for a Data Quality Profession

“Processes, projects, products – each of these contributes to the efforts to improve data quality. But they haven’t solved the problems individually or collectively. To really make substantial and sustainable differences in the quality of data we need to take a different approach. We need to think of data quality as a profession.”


The Real Test of Identity Resolution

Wednesday, June 24th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

So the title “Catching Terrorists and Making the World a Safer Place” certainly caught my eye! And the content of the post did not disappoint, as the author Chris Boorman of Informatica did a great job of crystallizing the issue that drove the creation of this blog over two years ago: “So how do we balance the freedom of movement we have come to expect as hard-working citizens with the need to spot terrorists?” His answer is “technology” and of course we agree.

When Identity Resolution Daily first began in the summer of 2007, we pointed out the constant tension between freedom and privacy on the one hand and the need for security on the other:

In the US, the debate between personal privacy (and perhaps liberties in general) versus security is a long-standing one with roots in the very founding of the nation itself. Folks interested in obtaining data often wonder how much people are willing to give up in the name of greater security or convenience. On the other hand, those more focused on privacy worry about how data is obtained, what it’s used for and where it ends up.

Infoglide CEO Mike Shultz also discussed the responsibility that comes with providing technology that deals with identity:

It was important to all of us here that we didn’t create some sort of Big-Brother-enabling technology. As a result, we designed software that can resolve identities across multiple sources while protecting data privacy and security.

His point that the design of the software is critical is an important one, and The Center for Digital Government’s white paper entitled “Resolving Identity: The Importance of Who’s Who and the Search for the Perfect Engine” delves into what technology can do to answer questions like “who’s who” and “who’s related to whom.”

In a more recent post, we talked about the components needed for an effective identity resolution solution. It’s not enough to have great similarity matching algorithms, and it’s not even enough to be able to find hidden connections in real time across millions of rows of data, although both those capabilities are obviously required. The real test in catching terrorists and making the world a safer place using identity resolution is how decision-making is automated and integrated into existing business processes.

Identity Resolution Daily on Twitter

Wednesday, June 24th, 2009

At Identity Resolution Daily, we often come across interesting tidbits about entity resolution, and now we can share them in real time. Just add our ID - @IDResolution - to your twitter sources. Happy tweeting!

Identity Resolution Daily Links 2009-06-22

Monday, June 22nd, 2009

By the Infoglide Team

intelligent enterprise: They Better Get This MDM Program Right

“As reported in The New York Times and on the TSA Web site, the Secure Flight program will improve upon current practices in matching passenger identities to watch lists in many ways. At first glance, this appears to be a well thought-out program that conforms to several basic tenets of Master Data Management (in bold below), in this case for the ‘Customer’ entity.”

EHRWMS: Georgia’s Best EMR Used By Three of Top Ten Pediatricians

“Of approximately 100 respondents, 28 used an EMR, of which 40% used the EncounterPRO Pediatric EMR. There were only three other EMRs used more than once, and they were used by only 10%, 7%, and 7% of the survey respondents respectively.”

Government Executive: Enforcement agencies boost cooperation on drug investigations

“In addition, ICE agents for the first time will fully participate in the Organized Crime Drug Enforcement Task Force Fusion Center. The center allows participating federal, state and local law enforcement agencies, including DEA and the FBI, to share information and analytical resources to enhance their overall investigative capacity.”

SmartData Collective: The Data-Information Continuum

“Data could be considered a constant while information is a variable that redefines data for each specific use. Data is not truly a constant since it is constantly changing. However, information is still derived from data and many different derivations can be performed while data is in the same state (i.e. before it changes again).”

Identity Resolution Daily Links 2009-06-19

Friday, June 19th, 2009

[Post from Infoglide] Speaking of Narrative Fallacy

Nassim Nicholas Taleb’s book The Black Swan: The Impact of the Highly Improbable uses “narrative fallacy” to describe how we humans tend to enhance ex post facto our ability to predict events that in fact are extremely complex and random. A recent post on Netrics HD attempts to leverage this argument to demonstrate the superiority of “Machine Learning” (i.e. probabilistic analysis) over “data matching” (i.e. deterministic analysis).

advance: Security and Privacy Challenges to EHR Adoption

“Lest we forget, our country is trying to establish similar capabilities with the widespread initiative to implement electronic health records (EHRs). My health history should travel with me — just as easily as my financial information. With some sort of authentication process, a “core” set of data should be easily available to assist in my receipt of health services.”

New York Times: Flying? Don’t Book Under a Nickname

“The government’s aim is to streamline the process of checking travelers’ names against its watch lists — a task currently handled separately by each airline — and to collect more detailed information so passengers with names similar to those on the watch list are less likely to be mistakenly detained. Asking for a birth date, for instance, decreases the likelihood that a child with a name close to one on the list would be subject to an additional search — one example of a false match that has led to complaints.”

Integrated Solutions for Retailers: Organized Retail Crime: Scope, Solutions

“Popular targets of organized retail crime rings include Crest Whitestrips, Rogaine, Similac baby formula, razor blades, and pregnancy tests. Having not been stored or managed properly, these items can pose serious health risks for innocent shoppers looking for a good bargain. And, because most of these items are sold “new in box,” well-meaning consumers are unaware that what they purchased may be spoiled or expired  —  and stolen.”

Speaking of Narrative Fallacy

Wednesday, June 17th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

Nassim Nicholas Taleb’s book The Black Swan: The Impact of the Highly Improbable uses “narrative fallacy” to describe how we humans tend to enhance ex post facto our ability to predict events that in fact are extremely complex and random. A recent post on Netrics HD attempts to leverage this argument to demonstrate the superiority of “Machine Learning” (i.e. probabilistic analysis) over “data matching” (i.e. deterministic analysis).

Product managers have a long history of creating oversimplified comparisons to competing products and technologies to demonstrate the superiority of their own. A favorite technique is to set up a straw man that can then be knocked down. In the case under discussion, the approach is to describe a “rules-based” system that is very unwieldy to use and requires huge amounts of time to tune, and to embed an underlying premise that each new application of a rules-based system starts from scratch with no accumulated domain-specific intelligence. (Of course, this doesn’t work if you choose a more intelligent identity resolution system for comparison.)

We’ve spent time here before talking about the differences between these two approaches, so I’m not going to restate the details. Truthfully, probabilistic systems like that from Netrics have their place in screening large amounts of data, but like any system, they have their limitations. While they can reach a certain level of performance in emulating users’ decisions, they typically don’t leave a trail for an investigator to follow, they don’t support a rational drill-down into possible suspect transactions the way that deterministic systems do, and they don’t allow attribute-specific tweaking so you can leverage the information and better understanding that you’ve gained over time.

The larger issue is whether a solution can take advantage of appropriate technologies in appropriate circumstances (e.g. using both probabilistic and deterministic analytics in one solution), rather than being forced into an either/or, one-size-fits-all scenario. Solutions like those offered by identity resolution companies supply a framework that can incorporate all of them.
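To make the idea of using both kinds of analytics in one solution concrete, here is a minimal Python sketch. It is not how any vendor's product works. The field names, the rules, the threshold, and the m/u probabilities in `WEIGHTS` are all invented for illustration, and the probabilistic part uses classic Fellegi-Sunter agreement weights as a stand-in for whatever scoring a real probabilistic system would use.

```python
import math

# Hypothetical per-attribute probabilities (Fellegi-Sunter style):
#   m = P(attribute agrees | records refer to the same entity)
#   u = P(attribute agrees | records refer to different entities)
WEIGHTS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_date": (0.97, 0.003),
}


def deterministic_rules(a, b):
    """Return the names of the rules that fired: a trail an investigator can read."""
    trail = []
    if a.get("passport") and a.get("passport") == b.get("passport"):
        trail.append("same passport number")
    if (a.get("last_name") and a.get("birth_date")
            and a.get("last_name") == b.get("last_name")
            and a.get("birth_date") == b.get("birth_date")):
        trail.append("same last name and birth date")
    return trail


def probabilistic_score(a, b):
    """Sum log-likelihood weights over attributes present in both records."""
    score = 0.0
    for field, (m, u) in WEIGHTS.items():
        if a.get(field) is None or b.get(field) is None:
            continue  # a missing value contributes no evidence either way
        if a[field] == b[field]:
            score += math.log2(m / u)              # agreement weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight
    return score


def resolve(a, b, threshold=7.0):
    """Try the deterministic rules first, then fall back to the score."""
    trail = deterministic_rules(a, b)
    if trail:
        return {"match": True, "method": "deterministic", "trail": trail}
    score = probabilistic_score(a, b)
    return {"match": score >= threshold, "method": "probabilistic",
            "score": round(score, 2), "trail": []}
```

In this sketch a pair that satisfies a rule is accepted with a readable reason attached, and each rule can be adjusted attribute by attribute. Pairs that satisfy no rule still get a score, so a misspelled last name with a matching first name and birth date can clear the threshold. The order of the two steps and the value of the threshold are design choices, not recommendations.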

Identity Resolution Daily Links 2009-06-15

Monday, June 15th, 2009

By the Infoglide Team

New England Journal of Medicine: Use of Electronic Health Records in U.S. Hospitals

“The very low levels of adoption of electronic health records in U.S. hospitals suggest that policymakers face substantial obstacles to the achievement of health care performance goals that depend on health information technology.”

Federal Computer Week: Standard updated for reporting suspicious activity

“The changes from the Office of the Director of National Intelligence’s Program Manager for the Information Sharing Environment (PM-ISE) come as that office continues a pilot program for the SAR information sharing program at sites around the country. The program uses state and local intelligence fusion centers as a node for verifying and disseminating data on suspicious activity through information technology systems.”

Travel Sentry: Secure Flight Q&A

“TSA collects as little personal information as possible to conduct effective watch list matching. Also, personal data is collected, used, distributed, stored, and disposed of in accordance with stringent guidelines and all applicable privacy laws and regulations.”

Central Valley Business Times: Three accused of multi-million workers comp fraud

“‘When businesses cheat the system to save money, they are only setting themselves up to pay later — by serving time in prison,’ says state Insurance Commissioner Steve Poizner.”

Identity Resolution Daily Links 2009-06-12

Friday, June 12th, 2009

[Post from Infoglide] Data Source Disintermediation?

“According to Wikipedia, ‘disintermediation is the removal of intermediaries in a supply chain: ‘cutting out the middleman’… Buyers bypass the middlemen (wholesalers and retailers) in order to buy directly from the manufacturer and thereby pay less.’”

[Jim Harris] OCDQ Blog: The Two Headed Monster of Data Matching

“Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household). Data matching is commonly plagued by what I refer to as The Two Headed Monster…”

CorpWatch: CorpWatch announces release of the CrocTail application and open CorpWatch API

“CrocTail provides an interface for browsing information about several hundred thousand U.S. publicly traded corporations and their many foreign and domestic subsidiaries. Information from company Securities and Exchange Commission (SEC) filings has been parsed and annotated by CorpWatch to highlight specific corporate accountability issues. CrocTail also serves as a demonstration of the features and data available through the CorpWatch API.”

Vos Is Neias: Washington - TSA Advising Travelers To Book Airline Tickets Using Full Real Names

“While the T.S.A. has announced Aug. 15 as a target date for the airlines to begin asking for each passenger’s full name, gender and date of birth, and has already begun publicizing the program, called Secure Flight, the agency acknowledged that it would go into effect in phases as the airlines update their systems.”

