“Knowing what we know now, would the U.S. be able to stop another attack like that of Christmas Day 2009? This is certainly the question on the minds of many Americans today. It is also one that Jamie McIntyre, veteran journalist and blogger for Military.com, had the opportunity to ask of Rand Beers, Under Secretary for National Protection and Programs Directorate from DHS, at a Heritage Foundation National Security Bloggers Luncheon.”
“In the middle of all of this are software providers, primarily IBM InfoSphere Identity Insight Solutions, Infoglide (which is providing software for the DHS) and Informatica… Identity recognition and resolution systems enable organizations to use data matches to gain a better understanding of identity across multiple systems. This could include not just individual identities but also networks and relationships: that is, who people know and how they are connected.”
“It’s been a heady couple of months in the IT infrastructure market, as any independent company that wasn’t tied down seemed to be swept up in a whirlwind of M&A activity. Independent data integration specialist Informatica, a 4,000-customer company in business since 1993, announced in January that it had acquired Siperian for $130 million.”
By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)
In the last post we examined how entity resolution (ER) systems are actually implemented, starting with the most basic merge/purge process and heterogeneous join systems. Both of these approaches focus on collecting equivalent references from among the sources provided, either as a large batch of references in a single file, or through queries against a federation of databases. The entity identities found by these ER systems are transient in the sense that they depend upon the sources input into the process. When different sources are provided, different identities will emerge.
On the other hand, there are ER systems that retain and manage identity information. By doing this they are able to “recognize” the same identity over time and assign that identity the same entity identifier (sometimes called “persistent identifiers” or “persistent links”). In Customer Data Integration (CDI) applications, these kinds of systems are sometimes called Customer Recognition Systems.
Two major types of ER systems perform identity management. The first type is the “identity resolution” system. It is most effective in situations where a fairly stable set of known identities of interest exists, such as the set of vendors or customers of a company, a set of products, or the students enrolled in a school. The attributes of these identities are pre-loaded into the system and assigned identifiers. When a reference is given to the system, it then decides whether the reference is to one of the known identities, and if so, returns the identifier of that identity.
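To make the flow concrete, here is a minimal sketch of an identity resolution lookup. The identifiers, attribute names, and the exact-match rule after normalization are all assumptions made for illustration; a production system would use far more tolerant matching.

```python
from typing import Optional

# Hypothetical pre-loaded identities: identifier -> known attributes.
KNOWN_IDENTITIES = {
    "C001": {"name": "john doe", "zip": "72204"},
    "C002": {"name": "mary smith", "zip": "78701"},
}


def normalize(value: str) -> str:
    """Lower-case the value and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def resolve(reference: dict) -> Optional[str]:
    """Return the identifier of the known identity the reference matches, else None."""
    ref = {key: normalize(value) for key, value in reference.items()}
    for identifier, attrs in KNOWN_IDENTITIES.items():
        # Assumed rule: every stored attribute must agree after normalization.
        if all(ref.get(key) == value for key, value in attrs.items()):
            return identifier
    return None


print(resolve({"name": "John  Doe", "zip": "72204"}))  # C001
print(resolve({"name": "Jane Roe", "zip": "10001"}))   # None
```

The point of the sketch is that the identifier comes from the stored identity, not from the input, so the same customer receives the same identifier no matter which batch or transaction the reference arrives in.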
Identity resolution systems can operate in either batch or transactional mode. In cases where there are a large number of pre-stored identities, the performance of batch operations can be improved through distributed processing where the identities are partitioned over multiple processors and resolved in parallel.
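One way to picture the partitioned batch mode is the sketch below. It assumes each identity can be reduced to a single match key, uses a CRC32 hash to pick a partition, and lets a thread pool stand in for separate processors; the key format and the partition count are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # assumed number of workers


def partition_of(key: str) -> int:
    """Deterministically map a match key to a partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def build_partitions(identities: dict) -> list:
    """Spread pre-stored identities (identifier -> match key) over the partitions."""
    partitions = [dict() for _ in range(NUM_PARTITIONS)]
    for identifier, key in identities.items():
        partitions[partition_of(key)][key] = identifier
    return partitions


def resolve_batch(partitions: list, references: list) -> list:
    """Route each reference to its partition and resolve the partitions in parallel."""
    routed = [[] for _ in range(NUM_PARTITIONS)]
    for position, key in enumerate(references):
        routed[partition_of(key)].append((position, key))

    def resolve_partition(index: int) -> list:
        table = partitions[index]
        return [(position, table.get(key)) for position, key in routed[index]]

    results = [None] * len(references)
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        for resolved in pool.map(resolve_partition, range(NUM_PARTITIONS)):
            for position, identifier in resolved:
                results[position] = identifier
    return results


partitions = build_partitions({"C001": "john doe|72204", "C002": "mary smith|78701"})
print(resolve_batch(partitions, ["mary smith|78701", "unknown|00000", "john doe|72204"]))
```

Because a reference is routed with the same hash used to place the identities, each worker only has to search its own slice of the identity store.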
However, there are many situations where the identities are not necessarily known in advance, or in some cases the entities are known but simply not organized in such a way that they can be easily pre-loaded. For example, suppose two companies merge and each company has its own customer database. The customers are identified in different ways in each database, and furthermore, for the customers of one company, poor systems and practices prevent having any confidence that the master records are unduplicated across business lines or company locations.
The type of system often applied in these situations is an “identity capture” system. The identity capture architecture can be seen as a hybrid of merge/purge and identity resolution systems. It supports identity management and persistent identifiers, but without starting with a preloaded set of identities. In my next post, we’ll delve deeper into the identity capture process.
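As a rough preview, the sketch below shows the essential behavior of an identity capture store: it starts empty, creates a new identity the first time a reference is seen, and returns the same identifier for matching references in later batches. The class name, the identifier format, and the name-plus-zip match key are assumptions for illustration only.

```python
import itertools


class IdentityCaptureStore:
    """Hypothetical store that captures identities as references arrive."""

    def __init__(self):
        self._identities = {}              # match key -> persistent identifier
        self._counter = itertools.count(1)

    @staticmethod
    def _key(reference: dict) -> tuple:
        # Assumed match rule: normalized name plus zip code.
        name = " ".join(reference["name"].lower().split())
        return (name, reference["zip"].strip())

    def capture(self, reference: dict) -> str:
        """Return the persistent identifier, creating a new identity if needed."""
        key = self._key(reference)
        if key not in self._identities:
            self._identities[key] = f"E{next(self._counter):04d}"
        return self._identities[key]


store = IdentityCaptureStore()
day1 = [store.capture(r) for r in [
    {"name": "John Doe", "zip": "72204"},
    {"name": "Mary Smith", "zip": "78701"},
]]
day2 = [store.capture(r) for r in [{"name": "john doe", "zip": "72204"}]]
print(day1)  # ['E0001', 'E0002']
print(day2)  # ['E0001']
```

Unlike merge/purge, the store outlives any single batch, which is what makes the identifiers persistent.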
[Jill Dyche] “Last year, Informatica’s MDM story verged on the schizophrenic as the company simultaneously advocated a “roll your own” approach to MDM using various software components while at the same time making investments in both Siperian and rival Initiate Systems. Siperian fills in some significant voids in Informatica’s MDM capabilities, most notably hierarchy management and transaction integration—updating the golden record in real time.”
“What is Secure Flight and what does it do? Secure Flight is a behind the scenes program that streamlines the watch list matching process. It will improve the travel experience for all passengers, including those who have been misidentified in the past.”
“First is the classic ‘entity resolution‘ challenge. Information about any individual is likely going to be scattered across a range of databases. While one database may contain a red-flag item — a pending drug charge or a secondary connection to a known terrorist — another database may not. The challenge is bringing this information together to create a single record — a ’single version of the truth’ — about an individual or entity.”
Andrew White of Gartner recently posed a question about whether master data management (MDM) is dead. He didn’t actually suggest that the demise of master data management is imminent. He was challenging whether our current terminology adequately clarifies the current reality about MDM and associated product areas.
Certainly the terms describing many markets and types of products are being associated with MDM. Jackie Roberts of DATAForge pointed out that the definition of MDM now seems to include “data integrity, data quality, entity resolution, matching, data integration, governance, metrics and analysis.”
While entity resolution was mentioned in her list, our obsessive focus on entity resolution (aka identity resolution) leads to the conclusion that, rather than being subsumed, its role is growing. Wayne Eckerson at TDWI seems to agree that identity resolution is a critical component of the recent MDM acquisitions. In his post about the acquisitions by Informatica and IBM of Siperian and Initiate Systems, respectively, he described the two transactions this way:
“You could say that Siperian is mostly MDM, but with identity resolution and other capabilities, whereas Initiate is mostly about identity resolution, but with MDM and other capabilities.”
Identity resolution is becoming an integral part of many product areas. Within MDM itself, creating a single-entity view is best done with an identity resolution engine. Data mining is greatly enhanced by the addition of entity resolution. Dan Power of Hub Solution Designs wrote about how key identity resolution is to data matching. We’ve talked about how social CRM can resolve identities of individuals across multiple disparate data sources using identity resolution, as well as “rationalize multiple variations and errors and anomalies that block finding existing customers within their systems”.
Although identity resolution technology has been years in the making, it has only recently risen into the consciousness of most analysts and customers. Because of its ability to bring enhanced clarity to ambiguous data, advanced identity resolution is now beginning to have a significant impact across many data-centered disciplines.
In March 2006, the Communications Fraud Control Association (CFCA) estimated that annual global fraud losses in the telecom sector were between $54 billion and $60 billion, and the losses continue to be substantial. Many types of fraud have been identified, but by far the most prevalent is subscription fraud.
A new subscriber signs up for mobile service using false or stolen identification, with no intention of paying the bill. Since new subscribers are given a grace period of one to three months before the account is shut off, the criminal can make thousands of dollars worth of calls before being detected.
Subscription fraud can be difficult to differentiate from simple bad debt when genuine customers are unable to pay. It’s been estimated that 30% or more of all bad debt is actually subscription fraud.
Different solutions have been tried, yet fraud continues to be a problem. One common method is to look for patterns of use that suggest potential fraud, but criminals adapt and learn to probe the limits of these fraud detection systems fairly quickly.
Given the industry’s long history with fraudsters, it seems probable that enough is known about them that they could be spotted at the time they subscribe. Using similarity searching technology, would-be fraudsters can be vetted against lists of known bad actors. Using multiple public and private data sources, non-obvious relationships can highlight risky individuals, and they can then be asked to submit to a more thorough qualification process.
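A minimal sketch of vetting an applicant against a list of known bad actors might look like the following. The records, the equal weighting of name and phone, and the 0.8 threshold are all assumptions for illustration, and Python's standard difflib stands in for a real similarity search engine.

```python
from difflib import SequenceMatcher

# Hypothetical watch list of identities tied to earlier subscription fraud.
KNOWN_BAD_ACTORS = [
    {"name": "Robert Smyth", "phone": "5125550100"},
    {"name": "Lena Kowalski", "phone": "2145550177"},
]


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(applicant: dict, known: dict) -> float:
    """Assumed scoring rule: equal-weight average of name and phone similarity."""
    return 0.5 * field_similarity(applicant["name"], known["name"]) \
        + 0.5 * field_similarity(applicant["phone"], known["phone"])


def vet(applicant: dict, threshold: float = 0.8) -> list:
    """Return the known bad actors that resemble the applicant, best match first."""
    hits = [(score(applicant, known), known) for known in KNOWN_BAD_ACTORS]
    return sorted((hit for hit in hits if hit[0] >= threshold),
                  key=lambda hit: hit[0], reverse=True)


for similarity, record in vet({"name": "Rob Smith", "phone": "5125550100"}):
    print(f"{similarity:.2f} {record['name']}")
```

An applicant who clears the threshold would not be rejected outright; as described above, the match simply routes them to a more thorough qualification process.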
Identity resolution is already used across multiple industries to solve similar problems. By matching an individual’s attributes with common attributes associated with those committing fraud, the “bad guys” are being detected in areas like lottery fraud, fusion centers, insider trading, and workers’ compensation employer fraud. Part of finding the bad guys is finding hidden relationships, connections that often uncover rings of criminals.
The “birds of a feather” axiom predicts that subscription fraud criminals often share the same types of social networks. Applying identity resolution to the subscription fraud problem may be the way to finally solve it.
“The Dallas police has a high tech fusion center that monitors potential threats in Dallas. They helped foil the plot when a man was planning on blowing up the Bank of America building… Four years ago, Dallas Police put alert on Kimberly Al-Homsi because she was scouting runways at Love Field. On Saturday, she was arrested allegedly with pipe bombs in her car.”
“When a recruiter and/or a hiring manager finds someone for a job position it is basically done by getting in a number of candidates and then choose the best fit among them. This of course don’t make up for, that there may be someone better fit among all those people that were not among the candidates. We have the same problem in data matching when we are deduplicating, consolidating or matching for other purposes.”
“Proposals in Obama’s new proposal with a strong I.T. flavor include… Adopt real-time analysis of claims and payments data to identify waste, fraud and abuse in public health programs… Establish a CMS/IRS data-matching program to match information on entities that have evaded filing taxes against provider billing data to better detect fraudulent providers.”
“We’ve noted several times over the past couple of years how the market visibility of entity resolution has been evolving. Now the consolidation of the master data management (MDM) market is causing even more conjecture about the crucial role of this technology.”
“Lindsey adds that personnel on Joint Terrorism Task Forces, in fusion centers or in other counterterrorism-related positions could benefit from the system by accessing the more complete data source and incorporating information found there into their own analyses and evaluations. ‘We’re out there for the crime fighters, but we’re also out there to prevent terrorism activities,’ he states.”
“The Federal Bureau of Investigation estimates that the total cost of insurance fraud (excluding health care) exceeds $40 billion per year. That means insurance fraud costs the average U.S. family between $400 and $700 annually in the form of increased premiums. In California alone, the Department of Insurance (CDOI) identified the potential loss from fraud in the 2007/2008 fiscal year at $1.2 billion, according to the 2008 Annual Report of the Insurance Commissioner.”
“Some airlines already have moved to a new identification program, called Secure Flight. All domestic carriers are expected to move to the new program by March. The government system will include more details about the passenger in question, including the passenger’s sex, birth date and full name as it appears on a government identification document.”
“The American Recovery and Reinvestment Act of 2009 provides significant cash incentives to physicians who implement electronic health records. However, in order to qualify for these incentives the physician must not only have the proper software but must engage in “meaningful use” of the software. The government plans to publish the criteria for meaningful use in February 2010. ARRA incentive reimbursement to physicians will begin in 2011.”
“The two acquisitions focus the spotlight on two of the hottest functions today, in terms of user organizations adopting them, namely: MDM and identity resolution. More than ever, organizations need trusted data, in support of regulatory reporting, compliance, business intelligence, analytics, operational excellence, and other data-driven requirements. MDM and identity resolution are key enablers for these requirements, so it’s no surprise that two leading vendors have chosen to acquire these at this time.”
“Serrao says that in the time he has spent in a dozen different fusion centers in the United States — coupled with his own background in law enforcement — he’s gleaned several ‘best practices’ for consideration. Ideally, he says, leadership should ’set a specific strategic mission before the center is even built. Everything else follows. Determine the role of the center and whether strategic intelligence analysis will be part of the mix. Then, it will be easier to define what processes will be developed, what reporting mechanisms are needed, what technology is appropriate, and what types of personnel are needed.’”
“The state of Kansas has been conducting sting operations to prevent this kind of theft by lottery terminal clerks. Law enforcement agents fanned out across the state and presented ‘winning’ tickets at several retail lottery outlets. In five separate cases clerks told the agents the tickets were worthless and then tried to redeem the ‘winning’ lottery tickets. The undercover investigation led to charges of attempted theft and computer crime against five people across the state.”
By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)
In the last post we looked at a formal model for describing entity-based integration. Now let’s turn our attention to how entity resolution (ER) systems are actually implemented. One of the most important design decisions is whether the system will perform entity identity management. Systems perform identity management when they create and store the attribute values for the identities that they process. Identity management is necessary for systems that assign persistent entity identifiers, i.e. the system must give all of the references to the same entity the same identifier value from one resolution process to the next.
The most basic form of ER is the merge/purge process. A merge/purge process reads a large batch of references and systematically makes pair-wise comparisons between them. During the process, it assigns a group identifier to all of the references it determines to be for the same entity. However, these identifiers are transient: they exist only while a particular batch of references is being processed, since the end result is to create a single, merged record (called a “survivor” record) in place of each reference group. The result is that references to the same entity occurring in two different merge/purge processes will likely be given different group identifiers from one process to the next. For example, the references for John Doe in the first batch of references processed might be given the group ID of 213, but references to the same John Doe in a batch of references processed the next day might be given a group ID of 634. The merge/purge process can still correctly resolve the entity references in each batch, but the values of the group IDs don’t persist or carry over for the same entities from batch to batch.
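The transient nature of the group IDs can be seen in a small sketch. It assumes references are plain name strings, uses difflib's similarity ratio with a hypothetical 0.85 threshold as the pair-wise match rule, and numbers groups in order of first appearance within the batch; survivor record creation is left out.

```python
from difflib import SequenceMatcher
from itertools import combinations


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Assumed match rule: string similarity at or above the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def merge_purge(batch: list) -> list:
    """Pair-wise compare a batch and return (reference, group_id) tuples."""
    parent = list(range(len(batch)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(batch)), 2):
        if is_match(batch[i], batch[j]):
            parent[find(j)] = find(i)

    # Group IDs are numbered per batch, so they carry no meaning outside it.
    group_ids = {}
    result = []
    for i, reference in enumerate(batch):
        group_id = group_ids.setdefault(find(i), len(group_ids) + 1)
        result.append((reference, group_id))
    return result


print(merge_purge(["John Doe", "Mary Smith", "Jon Doe"]))
print(merge_purge(["Mary Smith", "John Doe"]))
```

In the first batch the two John Doe references share a group ID, so the batch is resolved correctly, yet John Doe's group ID in the second batch is a different number because nothing was retained between runs.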
Another characteristic of the merge/purge ER process is that it is designed to operate in batch mode. However, there are transactional or “on-demand” versions of merge/purge that are sometimes referred to as heterogeneous database join systems. Instead of combining all of the reference sources into a single file for batch processing, each reference source is loaded as a database table. The application is connected to all of the source tables and has metadata that describes the structure of each reference source. This allows a single query or “join request” to be submitted to the application, which then translates the request into an appropriate query for each source. The individual query responses are collected and processed into a single view that is provided as the query result for the initial query. Just as in the merge/purge process, the groups of references brought together for an entity (a query) are transient. These types of query-based ER systems are common in law enforcement and other hypothesis testing applications.
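The join-request idea can be sketched with Python's built-in sqlite3 module. The two tables, their column names, and the metadata dictionary are hypothetical, and a single in-memory database stands in for what would normally be a federation of separate databases.

```python
import sqlite3

# One in-memory database stands in for several independent sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_customers (full_name TEXT, city TEXT);
    CREATE TABLE billing_accounts (account_holder TEXT, balance REAL);
    INSERT INTO crm_customers VALUES ('John Doe', 'Little Rock'), ('Mary Smith', 'Austin');
    INSERT INTO billing_accounts VALUES ('John Doe', 120.50);
""")

# Hypothetical metadata: how each source names the logical field "name".
SOURCE_METADATA = {
    "crm_customers": {"name": "full_name"},
    "billing_accounts": {"name": "account_holder"},
}


def join_request(field: str, value: str) -> list:
    """Translate one request into a query per source and merge the responses."""
    view = []
    for table, mapping in SOURCE_METADATA.items():
        column = mapping[field]
        # Table and column names come from trusted metadata; the value is bound.
        cursor = conn.execute(f"SELECT * FROM {table} WHERE {column} = ?", (value,))
        columns = [description[0] for description in cursor.description]
        for row in cursor:
            view.append({"source": table, **dict(zip(columns, row))})
    return view


for record in join_request("name", "John Doe"):
    print(record)
```

The combined view exists only as the answer to this one request. Nothing about the grouping is stored, which is why the identities remain transient.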
On the other hand, there are other ER architectures designed to retain and manage entity identity information. By doing this they are able to “recognize” references to the same entity over time and assign those references the same entity identifier, i.e. maintain persistent entity identifiers. In CRM applications these kinds of systems are sometimes called Customer Recognition Systems.
There are two major types of ER system architectures that perform identity management: “identity resolution” systems and “identity capture” systems. In the next post, I will pick up here with a discussion of how these systems manage identity and maintain persistent entity identifiers.
IBM announced today that it plans to buy MDM vendor Initiate Systems. As hypothesized here in this blog last week, the move was not entirely unexpected, but on the heels of last week’s announcement by Informatica to purchase Siperian, it certainly creates yet another wave in the marketplace. More moves are certain to take place as competing companies align – and realign – their Single Entity View (SEV) strategies. The key to this realignment will be for current industry players to maximize their functionality beyond “playing with matches”. That dated view of fuzzy matching is no longer enough. Not for the large data quality vendors. Certainly not for the customer.
The question of when companies like Oracle, SAP and Microsoft react, and how, will keep the blogosphere humming for a while.
From the perspective of identity resolution, meaning technologies that go well beyond simple matching, the IBM announcement creates a very interesting scenario. Let’s be honest… there are three organizations that have been truly positioned as leaders in providing SEV functionality that helps organizations expose fuzzy matches and non-obvious relationships across data sources. IBM and Initiate are two; Infoglide Software Corporation is the third. IBM’s Identity Insight (formerly EAS), Initiate’s entity resolution, and Infoglide’s Identity Resolution Engine (IRE) all deliver the promise of SEV or “who’s who… and who knows whom” technology, and all three answer considerably more than “yes it’s a match” or “no it’s not a match”.
In the case of Initiate Systems, the entity resolution product is new, and frankly came about as a basic repackaging of their successful MDM product for the Healthcare market. IBM’s product, like Infoglide’s, was built from the ground up as an identity resolution engine by Jeff Jonas and the old SRD organization. Now, with today’s announcement, IBM seems to have created some painful duplication in its offerings. It occurs to me that IBM has not become a global technology leader by mismanaging its products and messaging, so something’s gotta give! Which product goes away, and when, will be interesting to see.
Either way, there are now effectively two players left standing in the SEV market – IBM and Infoglide.
Infoglide Software provides entity resolution and analysis solutions for retail, banking, insurance, government, and law enforcement. Without the need for data cleansing or warehousing, Infoglide Software's Identity Resolution Engine™ (IRE) analyzes all of the information relating to individuals and/or entities from multiple sources of data and then applies...