Archive for the ‘Information Quality’ Category

Big Data and Entity Resolution (part 2)

Thursday, December 16th, 2010

By Mike Betron, Infoglide Software Director of Marketing

We talked a week ago about the rapidly emerging market space called Big Data. One statistic that opened my eyes is Gartner’s prediction that the volume of new data generated by enterprises will grow by 650% in the next five years, and 80% of that will be unstructured data!

The 451 Group’s definition of Big Data describes a growing need for non-traditional processes that can treat massive amounts of data as a whole, because the sheer scale makes many traditional tools and techniques impractical. Data is voluminous, complex, and very dynamic, yet business drivers demand that it be captured, managed, and harnessed to benefit the organization.

Entity resolution (ER) software is technologically mature, and the evolving requirements for managing Big Data fit it well. For example, Infoglide’s Identity Resolution Engine (IRE) scales to meet Big Data requirements. That scalability, together with its flexibility in handling ambiguous structured and unstructured data with missing elements, makes it an ideal solution for wringing value from the “data deluge” we increasingly find ourselves in.

One of the unique problems associated with Big Data is its multiple disparate sources that include email, Word documents, spreadsheets, and social media such as IM, newsfeeds, Facebook, and LinkedIn, just to name a few. Again, entity resolution systems like IRE now include support for multiple data forms and have created special ways to incorporate social media.

So, while Big Data presents a daunting challenge for many organizations, flexible technologies like entity resolution represent a key element of any solution.

Identity Resolution Daily Links 2010-11-23

Tuesday, November 23rd, 2010

By the Infoglide Staff

Tim Estes: Information Systems in an Entity-Centric World

Gartner: Four Converging Trends That Will Change the Face of IT and Business
“Gartner has identified four broad trends that will change IT, and the economy, in the next 10 years:

  1. Cloud
  2. Business impact of social computing
  3. Context Aware Computing
  4. Pattern Based Strategy”

WSJ Health Blog: Web-Based Electronic Health Record Safety Registry Launches

“Even if EHRs reduce the risk of errors overall, they may produce entirely new ones, Edward Fotsch, CEO of PDR Network, which will provide network operations for the new reporting system, tells the Health Blog. For example, EHRs may cut the risk of failing to alert a patient to an abnormal test result, but confusing user interfaces may produce their own mistakes and need tinkering.”

Community of Experts: Identities and Entities: Resolution or Dissolution?

“Even with these differences, a human can rapidly determine that they refer to the same individual for two reasons. The first is that the values that differ across the pair of records are not too different from each other, and the second is that there seems to be enough support from across each pair of attributes to assert some degree of similarity.”

Identity Resolution Daily Links 2010-11-15

Monday, November 15th, 2010

By the Infoglide Team

Main Justice: Eric Holder’s Prepared Remarks at Health Care Fraud Prevention Summit

“In just the last fiscal year, we obtained settlements and judgments of more than $2.5 billion in False Claims Act matters alleging health care fraud. This marked a new record – and an increase of more than 60 percent from fiscal year 2009. We also opened more than 2,000 new criminal and civil health-care fraud investigations, reached an all-time high in the number of health-care fraud defendants charged, stopped numerous large-scale fraud schemes in their tracks, and returned more than $2.5 billion to the Medicare Trust Fund and more than $800 million to cash-strapped state Medicaid programs.”

SearchDataManagement.com: Gartner Magic Quadrant ranks MDM software vendors

“Gartner reports that due to the sluggish economy, customer demand for MDM software is growing at a significantly slower rate than years past. But it is growing. The analyst firm predicts that the overall market for MDM software will increase from $1 billion in 2008 to $2.9 billion by 2013. Gartner also predicts that by 2010, investments in MDM software will lead to an 80% reduction in costs associated with managing redundant data.”

The Crime Report: Fusion Centers Could Face Budget Issues As States Cut Back

“Some of the nation’s 72 fusion centers–where federal, state, and local law enforcement agencies share data on terrorism and crime threats–may face budget problems in the nation’s tough economic conditions. Ross Ashley of the National Fusion Center Association, which represents the centers, says that some newly elected governors must be convinced of the centers’ worth. The agencies typically do not have line-item budgets and are dependent on allocations from various levels of government to operate.”

Sponsoring ICIQ This Weekend

Thursday, November 11th, 2010

By Mike Betron, Infoglide Software Director of Marketing

Infoglide Software is a proud sponsor of the 15th International Conference on Information Quality (ICIQ). The 2010 edition of this annual event is being hosted this weekend by the George W. Donaghey College of Engineering and Information Technology at the University of Arkansas at Little Rock. Researchers from all over the world will convene to share the results of their efforts.

The organizer of the event is John Talburt, PhD, founder and director of the Center for Advanced Research in Entity Resolution and Information Quality (ERIQ). Infoglide has sponsored the ERIQ lab and the Information Quality graduate program in recent years.

If you’re attending, we’ll be there and look forward to meeting you in Little Rock.

Identity Resolution Daily Links 2010-10-30

Saturday, October 30th, 2010

[Post from Infoglide] Absentee Ballot Fraud

“We’re currently in the heat of the election season. No matter how impeccable the record of any candidate that the major parties put forward, minions of the opposing parties go to great lengths to uncover an embarrassing incident that can be exposed (or even an incident that can be twisted to appear embarrassing) in order to influence voters away from voting for that candidate. While the populace is reasonably good at figuring these tricks out, even more disturbing are the stories involving voter fraud.”

Rob Karel’s Blog: Discussing The Forrester Wave™: Enterprise Data Quality Platforms, Q4 2010

“Also, many data quality vendors specialize and provide depth of expertise in a focused part of the data quality market such as postal address verification (e.g., Experian QAS, Melissa DATA), matching or identity resolution [e.g., Infoglide Software, Netrics (acquired by TIBCO Software), and Pervasive Software], and data profiling (e.g., Ab Initio and Business Data Quality).”

Providence Journal: Deportee charged in identity theft case

“The R.I. State Fusion Center, a state police unit that tracks information on homeland security and crime, assisted in the investigation through the use of facial recognition software that determined that Medrano had been previously issued a Massachusetts identity document in his real name.”

Aviation News Today: November 1 Ends Grace Period For Secure Flight Data Submissions

“While TSA’s watch-list matching takes seconds and can be completed up until the time of departure, the agency cautions passengers that a boarding pass will not be issued until the airline submits complete passenger data to Secure Flight. The agency noted that, despite the crackdown, minor variations in the name on the boarding pass and ID, like middle initials, should not present problems at checkpoints.”

Identity Resolution Daily Links 2010-10-10

Sunday, October 10th, 2010

[Post from Infoglide] OYSTER: A Configurable ER Engine

“Now that I have finished the four-part series on linking methods, I would like to talk about one of my pet projects, OYSTER.  It stands for Open sYSTem Entity Resolution, a project to build a configurable, open-source entity resolution.  Although I am somewhat hesitant to announce a system that is not yet available to readers, it does exist and has been a valuable teaching tool in my ER class.  A run-time version (Java JAR file) will be available soon on the ERIQ website, and the source code should be available on Source Forge by the end of the year.”

DATAMONITOR: Bad data costing US businesses $700 billion a year

“Madan Sheina, author of the report and an Ovum lead analyst, said: ‘Bad data is a growing problem for businesses due to the sheer volume and pace at which it is now moved between organisations. We now estimate that bad data costs US companies 30 per cent of their revenues – a massive $700 billion per year and a figure that is set to increase.’”

thestar.com: Watchdog warns criminals, terrorists could abuse new payment methods

“‘FINTRAC anticipates that the FATF will publish a public report on this work later in 2010,’ it said. Over the past few years, prepaid cards and Internet payment services have only been identified in a minority of domestic money laundering and terrorist financing cases. In 2008-2009, for instance, Internet-based payment services were involved in roughly 4 per cent of all disclosed cases, FINTRAC said in its report.”

Identity Resolution Daily Links 2010-09-03

Friday, September 3rd, 2010

[Post from Infoglide] Reference Linking Methods - Part 4

“In the direct matching, transitive linking, and association analysis methods discussed in previous posts, the evidence for establishing a link comes from the references themselves, either as attribute values or relationships with other references.  A link created in this way is also called an inferred link. But in almost any ER context, some pairs of equivalent references (i.e. that refer to the same entity) will have insufficient evidence available in the references themselves to make that determination, thereby leaving them as unlinked false negatives.”

Liliendahl on Data Quality: Out of Facebook

“Doing ‘Social Master Data Management’ will become an integrated part of customer master data management offering both opportunities for approaching a ‘single version of the truth’ and some challenges in doing so. Of course privacy is a big issue.”

CRN: SMB Cloud Spending To Approach $100 Billion By 2014

“Total cloud-related information and communications technology spending among SMBs globally surpassed $52 billion in 2009, representing just 6 percent of total worldwide SMB ICT spending. But AMI predicts that that will nearly double over a five-year period.”

Media Newswire: Owner of illegal money transmitting business sentenced to 2 years in prison, ordered to forfeit $690K

“According to court documents, between Jan. 1, 2004 and Dec. 31, 2008, Lemine, owner of Sorrento Grocery in Sorrento, Fla., cashed more than $4 million in checks from a local construction company in return for a fee of between 1 and 1.5 percent of the checks’ face value. He did so knowing that the owners of the construction company were attempting by cashing the checks through the grocery to conceal their employment of illegal aliens, avoid paying worker’s compensation and employment taxes, and hide income from state and federal tax officials.”

Reference Linking Methods - Part 4

Thursday, September 2nd, 2010

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

This is the last in a series of four posts that discuss four methods for linking references.  These methods are:

  1. Direct matching
  2. Transitive linking
  3. Linking by association
  4. Asserted linking

In the direct matching, transitive linking, and association analysis methods discussed in previous posts, the evidence for establishing a link comes from the references themselves, either as attribute values or relationships with other references.  A link created in this way is also called an inferred link.

But in almost any ER context, some pairs of equivalent references (i.e. that refer to the same entity) will have insufficient evidence available in the references themselves to make that determination, thereby leaving them as unlinked false negatives.  For example, in the previous post we discussed how it might be possible to discover that the references to Mary Smith on Oak St and the Mary Smith on Elm St are equivalent through association analysis.  But if the collateral evidence of the shared address association were not available, then the link could not have been inferred.

A different way to approach this problem is through asserted linking.  An asserted link between two references is based on prior knowledge that they are equivalent.  For this reason, creating links in this way is also called knowledge-based linking, and ER systems that use this method of resolution are called knowledge-based ER systems.

An asserted link often takes the form of a single record carrying the attribute values of two non-matching references.  The assertion about Mary Smith’s change of address might be something like:

The Mary Smith previously residing at 123 Oak is now residing at 456 Elm.

It reflects the knowledge that references to Mary Smith on Oak Street and Mary Smith on Elm Street are equivalent independent of any similarity or dissimilarity between their corresponding attribute values.

So where do these assertions come from?  Not out of thin air.  An assertion like this could have been self-reported, acquired from public records, or obtained from a commercial data provider, such as a magazine subscription service.  If this knowledge were to be acquired and provisioned in the ER identity management system prior to processing a reference to either Mary Smith on Oak Street or Mary Smith on Elm Street, then both references would be recognized as equivalent and could be linked at the time they were processed, regardless of the order in which they were received.  Jeff Jonas calls ER systems that have this property “sequence neutral.”
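As a rough illustration of the idea, the sketch below models provisioned assertions as a simple lookup table and checks that the two Mary Smith references resolve to the same entity whichever one arrives first. The table layout, the `resolve` function, and the entity ID "E1" are all assumptions made up for this example, not a description of how any particular ER product stores assertions.

```python
from itertools import permutations

# Hypothetical assertion store: each known (name, address) reference maps
# to a shared entity identifier. The identifier "E1" is invented.
ASSERTIONS = {
    ("Mary Smith", "123 Oak"): "E1",
    ("Mary Smith", "456 Elm"): "E1",
}


def resolve(references, assertions):
    """Assign an entity ID to each reference, in arrival order.

    A reference covered by an assertion gets the asserted entity ID.
    Anything else gets a fresh ID of its own (no inferred matching here).
    """
    links = {}
    next_new = 1
    for ref in references:
        if ref in assertions:
            links[ref] = assertions[ref]
        else:
            links[ref] = f"NEW{next_new}"
            next_new += 1
    return links


oak = ("Mary Smith", "123 Oak")
elm = ("Mary Smith", "456 Elm")

# Try both arrival orders; the asserted link does not depend on the order.
results = [resolve(order, ASSERTIONS) for order in permutations([oak, elm])]
for links in results:
    print(links[oak], links[elm], links[oak] == links[elm])
```

Because the link comes from the lookup rather than from comparing attribute values, the dissimilar addresses never enter the decision.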

Asserted linking is not just theoretical.  For example, Acxiom® Corporation has made asserted linking the backbone of its AbiliTec® CDI technology that manages billions of assertions for U.S. consumers alone.

The disadvantage of asserted linking is that it is a non-trivial activity to acquire, store, and manage the assertions.  Asserted linking divides the overall ER process into two concurrent processes.  One is a foreground process for resolving equivalence and applying links.  The other is a background process that acquires and integrates assertions into the identity management system.  Of course, timing is critical.  If an assertion is not acquired and available before the references that need it are processed, then their equivalence will not be recognized and they will not be linked.
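The timing issue can be made concrete with a toy model of the two processes. In this sketch, which is only an assumption about how such a system might be organized, `add_assertion` stands in for the background process and `process` for the foreground one; the class name `IdentityStore` and the ID scheme are hypothetical.

```python
class IdentityStore:
    """Toy identity management store; every name here is hypothetical."""

    def __init__(self):
        self.assertions = {}  # reference -> asserted entity ID
        self._next_id = 1

    def _new_id(self):
        entity = f"E{self._next_id}"
        self._next_id += 1
        return entity

    def add_assertion(self, ref_a, ref_b):
        """Background process: record that two references are equivalent."""
        entity = self._new_id()
        self.assertions[ref_a] = entity
        self.assertions[ref_b] = entity

    def process(self, ref):
        """Foreground process: link using only the assertions known right now."""
        return self.assertions.get(ref) or self._new_id()


oak = ("Mary Smith", "123 Oak")
elm = ("Mary Smith", "456 Elm")

# Case 1: the assertion is provisioned before the references arrive.
on_time = IdentityStore()
on_time.add_assertion(oak, elm)
linked = on_time.process(oak) == on_time.process(elm)

# Case 2: the references arrive first, so each gets its own entity ID.
too_late = IdentityStore()
first, second = too_late.process(oak), too_late.process(elm)
too_late.add_assertion(oak, elm)  # arrives after the links were applied
missed = first != second

print(linked, missed)  # True True
```

In the second case the equivalence is simply never applied to the references already processed, which is the false negative described above.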

In the next post, I plan to discuss the role of ER in entity-based information exchange systems, sometimes called “information hubs.”

Identity Resolution Daily Links 2010-08-07

Saturday, August 7th, 2010

[Post from Infoglide] Reference Linking Methods - Part 3

“This is the third in a series of four posts that discuss four methods for linking references.  These methods are:

  1. Direct matching
  2. Transitive linking
  3. Linking by association
  4. Asserted linking

In the last post I discussed transitive linking, and why it is essential for producing a unique and deterministic outcome of an ER process.  In this post I will discuss the third method, linking by association.”

BeyeNETWORK: Computed Attributes, Entity Resolution and Connectivity Hierarchies

“There are many types of relationships that are discovered as a by-product of entity resolution, such as households or families. These terms take on different meaning depending on the subject area and the business situation. For example, we can examine parent-child and sibling relationships associated with individuals, we can look at components such as paper clips or screws that are in the same ‘family,’ or we can look at corporate ownership relationships that reflect families of companies. Alternatively, we can look at other types of relationships – individuals belonging to the same health club, components manufactured from the same type of metal, or companies that share the same board members.”

WTVM: Phenix City doctor accused of multi-million dollar Medicare fraud

“In an 80-page civil complaint, the United States Attorney’s Office claims 51-year-old Doctor Robert Ritchea, a physician, not only allowed an unlicensed medical assistant to inject patients with pain medications, but also improperly billed Medicare for the treatments. The complaint also alleges Ritchea over-billed Medicare by more than $2.2 million in over 4,300 separate claims over a period of four years.”

Liliendahl on Data Quality: Location, Location, Location

“If you know that 123 Main Street in Anytown is a single family house there is a high probability that this is the same real world individual. But if you know that 123 Main Street in Anytown is a building used as a nursing home, a campus or that this entrance has many apartments or other kind of units, then it is not so certain that these records represents the same real world individual (not at least if the name is John Smith). So this example highlights the importance of using external reference data in data matching.”

Reference Linking Methods - Part 3

Thursday, August 5th, 2010

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

This is the third in a series of four posts that discuss four methods for linking references.  These methods are:

  1. Direct matching
  2. Transitive linking
  3. Linking by association
  4. Asserted linking

In the last post I discussed transitive linking, and why it is essential for producing a unique and deterministic outcome of an ER process.  In this post I will discuss the third method, linking by association.

Recall that in transitive linking, each intermediate reference must be equivalent to the next reference in the chain.  However, it is possible to discover links by exploring associations among entity references that don’t rise to the level of equivalence.  These explorations are often done using techniques borrowed from graph theory and network analysis.

A simple example shows how association analysis might work.  Consider the following four references:

  • Ref#1, Mary Smith 123 Oak St
  • Ref#2, Mary Smith 456 Elm St
  • Ref#3, John Smith 123 Oak St
  • Ref#4, John Smith 456 Elm St

Note that none of the six possible pairings of these four references agree on both name and address.  Therefore, under typical identity rules, none of the six pairs would be considered equivalent.  The diagram below is a graphical representation of these four references and their relationships.

[Figure: linking-by-association.png, a graph of the four references and their relationships]

However, given that this association is unlikely to occur by chance, it is reasonable to infer that these are the same John Smith and Mary Smith at both addresses and that their references should be linked (are equivalent).  The decision to link is even more compelling if supported by other evidence, such as uncommon names, or the lack of conflicting evidence, such as different dates-of-birth.

Unlike direct matching and transitive linking where decisions are made pair-by-pair, association analysis allows multiple relationships to be considered at the same time.  As in this example, the decision to link the pairs (R1, R2) and (R3, R4) is justified only when the entire configuration of relationships is considered as a whole, not by examining each pair independently.  The application of graph theory and network analysis of entity relationships is a rapidly growing area of ER research.
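The four-reference example can be sketched in a few lines. The code below first applies a typical direct-match rule (name and address must both agree) to all six pairs, then applies one very simple association rule: link two same-name references when another pair of references with a different name spans exactly the same two addresses. That rule and the function names are assumptions chosen to mirror this one example; real association analysis weighs far more evidence.

```python
from itertools import combinations

# The four references from the example: ref number -> (name, address).
refs = {
    1: ("Mary Smith", "123 Oak St"),
    2: ("Mary Smith", "456 Elm St"),
    3: ("John Smith", "123 Oak St"),
    4: ("John Smith", "456 Elm St"),
}


def direct_match(a, b):
    """Typical identity rule: name and address must both agree."""
    return refs[a] == refs[b]


# Pair-by-pair matching finds nothing among the six pairs.
direct = [(a, b) for a, b in combinations(refs, 2) if direct_match(a, b)]


def association_links(refs):
    """Link same-name pairs whose address pair is shared by a different name."""
    same_name = [
        (a, b)
        for a, b in combinations(refs, 2)
        if refs[a][0] == refs[b][0] and refs[a][1] != refs[b][1]
    ]
    links = set()
    for (a, b), (c, d) in combinations(same_name, 2):
        different_people = refs[a][0] != refs[c][0]
        same_addresses = {refs[a][1], refs[b][1]} == {refs[c][1], refs[d][1]}
        if different_people and same_addresses:
            links.update([(a, b), (c, d)])
    return sorted(links)


print(direct)                   # []
print(association_links(refs))  # [(1, 2), (3, 4)]
```

Note that neither (1, 2) nor (3, 4) would be linked if examined alone; the link only appears when both pairs are considered together.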

Typically these analyses are intended to produce an algorithm for decomposing (partitioning) a large reference graph into smaller sub-graphs where each sub-graph represents a group (cluster) of equivalent references.  These algorithms assess the relative cohesiveness of the sub-graphs in terms of the number and types of connections that each node (reference) has with its adjacent nodes.  One example is SCAN (Structural Clustering Algorithm for Networks), developed by Xu, Yuruk, Feng, and Schweiger, which uses a measure called graph modularity (http://www.citeulike.org/user/socwangnan/article/1918601).
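To give a feel for how a partition's cohesiveness can be scored, the sketch below computes the commonly used Newman-Girvan form of modularity for a small made-up graph. It is not an implementation of SCAN, and the six-node graph (two triangles joined by one edge) is purely illustrative. For each cluster, the score adds the fraction of edges that fall inside the cluster and subtracts the fraction expected if edges were placed at random given the node degrees.

```python
def modularity(edges, clusters):
    """Newman-Girvan modularity of a partition of an undirected graph.

    edges    -- list of (u, v) pairs, one entry per undirected edge
    clusters -- list of sets of nodes; together they must cover every node
    """
    m = len(edges)
    cluster_of = {node: i for i, members in enumerate(clusters) for node in members}
    internal = [0] * len(clusters)  # edges with both ends inside the cluster
    degree = [0] * len(clusters)    # sum of node degrees inside the cluster
    for u, v in edges:
        degree[cluster_of[u]] += 1
        degree[cluster_of[v]] += 1
        if cluster_of[u] == cluster_of[v]:
            internal[cluster_of[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree))


# Two tightly connected triangles joined by a single edge (c, d).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]

two_clusters = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
one_cluster = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])

print(round(two_clusters, 3), round(one_cluster, 3))
```

Splitting the graph along the single bridging edge scores higher than leaving it as one cluster, which matches the intuition that each triangle is a cohesive group of references.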

In the next post we will discuss asserted linking.

[Editor’s Note: You can hear Dr. Talburt discuss entity resolution in a recently posted video on Infoglide Software’s corporate web site.]
