
Archive for the ‘Data Matching’ Category

And Then There Were Two

Wednesday, February 3rd, 2010

By Douglas Wood, Infoglide Senior Vice President

IBM announced today that it plans to buy MDM vendor Initiate Systems.  As hypothesized here in this blog last week, the move was not entirely unexpected, but on the heels of last week’s announcement by Informatica to purchase Siperian, it certainly creates yet another wave in the marketplace.  More moves are certain to take place as competing companies align – and realign – their Single Entity View (SEV) strategies.  The key to this realignment will be for current industry players to maximize their functionality beyond “playing with matches”.  That dated view of fuzzy matching is no longer enough.  Not for the large data quality vendors.  Certainly not for the customer.

The question of when companies like Oracle, SAP and Microsoft react – and how – will keep the blogosphere humming for a while.

From the perspective of identity resolution – technologies that go well beyond simple matching – the IBM announcement creates a very interesting scenario.  Let’s be honest… three organizations have been truly positioned as leaders in providing SEV functionality that helps organizations expose fuzzy matches and non-obvious relationships across data sources.  IBM and Initiate are two; Infoglide Software Corporation is the third.  IBM’s Identity Insight (formerly EAS), Initiate’s entity resolution, and Infoglide’s Identity Resolution Engine (IRE) all deliver the promise of SEV or “who’s who… and who knows whom” technology, and all three answer considerably more than “yes it’s a match” or “no it’s not a match”.

In the case of Initiate Systems, the entity resolution product is new, and frankly came about as a basic repackaging of their successful MDM product for the Healthcare market.  IBM’s product, like Infoglide’s, was built from the ground up as an identity resolution engine by Jeff Jonas and the old SRD organization.  Now, with today’s announcement, IBM seems to have created some painful duplication in its offerings.  It occurs to me that IBM has not become a global technology leader by mismanaging its products and messaging, so something’s gotta give!  Which product goes away, and when, will be interesting to see.

Either way, there are now effectively two players left standing in the SEV market – IBM and Infoglide.

Gentlemen, Start Your Engines (and don’t play with those matches)

Monday, February 1st, 2010

By Douglas Wood, Infoglide Senior Vice President

Much is happening these days in the Data Quality space.  Customers are embracing MDM strategies at a record pace, M&A activity has picked up from an industry perspective, and the various players in the data quality marketplace are expanding their offerings like never before.  It matters little if the objective is to vet fraud or to master data. The race to deliver the dream of an enterprise-wide single-entity-view (SEV) is on.  Gentlemen (and Danica Patrick)… start your engines!

The key word here, naturally, is ‘engines’.  An engine moves things forward, and performs considerably more than one basic task.  As has been well-documented here at IdentityResolutionDaily, a true identity resolution engine plays a vital part in any SEV initiative.  Technologies that can look at data across disparate silos and return results that point to both matches AND non-obvious relationships are in high demand…  and set to grow even further in 2010.  The simplicity of “yes it’s a match” or “no, it’s not a match” is no longer sufficient for most organizations as they seek the single-entity-view.  Remember, an entity is not merely made up of attributes… but also relationships.  A true ‘engine’ points to those relationships, and moves the entire data quality initiative forward.

An engine cares little what the car looks like, and ought to drive a multitude of vehicles.  Similarly, an identity resolution engine ought to be built to solve a multitude of problems.  SEV for exposing risk and fraud, SEV for Healthcare Patient Matching, SEV for Law Enforcement, SEV for customer relationship management, SEV for data disambiguation, SEV for house-holding, and so on and so on.  The engine should perform the same functions… while only the domain (or body type) changes.

It also occurs to us that the engine ought to be flexible in terms of what is mounted to the chassis – and how.  Do you want the 2.2L engine?  4 cylinder or 6 cylinder?  In the case of an identity resolution engine, customers ought to be able to pick how the functionality is delivered.  Full enterprise software license with professional services to build the car?  Done.  Functionality on demand a la Infoglide Software’s Identity Resolution as a Service (IRaaS TM) offering?  You got it.  A SEV appliance that sits behind a customer’s firewall to alleviate privacy-in-data concerns?  No problem.

The need for an SEV engine that provides a powerful library of matching and relationship capabilities, delivered in a variety of customer-friendly methods, is now more critical than ever.  With the increase in activity lately around the MDM space, one thing is clear:  the race is most definitely on.

Identity Resolution Daily Links 2010-01-29

Friday, January 29th, 2010

[Post from Infoglide] Master Data Movement

“I read with interest yesterday’s article at SeekingAlpha which discusses rumors swirling around the MDM software industry.  According to the article, sources suggest that two deals are very near completion.  The first of those rumored transactions would see Informatica picking up MDM provider Siperian.  On the heels of their acquisitions of Identity Systems and AddressDoctor, the Siperian purchase could not be totally unexpected – but would most certainly create some ripple effect worth watching.”

[Post from Infoglide] Connecting the Dots: We May Be Closer Than We Think

“Paul Rosenzweig, former Deputy Assistant Secretary for Policy at the Department of Homeland Security, recently posted an intriguing piece on Harvard National Security Journal about connecting the dots regarding the Christmas Bomber. He makes a strong case that a decision to stop research on data analytic tools in 2003 has contributed to the problem analysts face today in making sense of the massive and manifold data sources they sift through.”

Forrester Blog: Introducing The MDM Market’s Newest 800lb Gorilla: Informatica Acquires Siperian!

“In the short term, I’m sure Informatica will be more than happy to continue to collect revenue from Oracle while keeping this partnership alive, but don’t expect future negotiated contracted terms to remain very reasonable as Informatica gains traction with its MDM strategy. No matter how often Oracle says how happy they are to maintain a friendly state of co-opetition with strategic partners, I don’t anticipate they will want to run the risk of a competitor pulling the rug out from under its aggressive MDM strategy.”

News8Austin: Community forum poses questions about Fusion Center

“According to department officials, sharing information with neighboring jurisdictions as well as state and federal agencies ensures that crime history and other information is shared outside the city limits. The department said the center will be one that ‘analyzes information in order to best detect, respond and hopefully prevent criminal and terrorist activity – as well as other public safety hazards.’”

Ramon Chen: Informatica + Siperian Acquisition = Premier MDM Platform

“As expected, Informatica has announced that it has acquired Siperian (disclosure, my former company) for $130M… If predictions are correct, this will be a relative ‘bargain’ when compared with the upcoming IBM and Initiate Systems tie up which is expected to be 4 to 5x Initiate’s $90M annual revenues.”

Master Data Movement

Thursday, January 28th, 2010

By Douglas Wood, Infoglide Senior Vice President

I read with interest yesterday’s article at SeekingAlpha which discusses rumors swirling around the MDM software industry.  According to the article, sources suggest that two deals are very near completion.  The first of those rumored transactions would see Informatica picking up MDM provider Siperian.  On the heels of their acquisitions of Identity Systems and AddressDoctor, the Siperian purchase could not be totally unexpected – but would most certainly create some ripple effect worth watching.

The first thing that springs to mind is what Oracle would intend to do with Informatica.  A long-time business partner of Oracle, strengthened through the 2008 purchase of Identity Systems, Informatica could now only be classified as a true and direct competitor to Oracle.  Can Oracle continue to OEM technology (SSA Name3, for example) from what would instantly become a major competitor?  Sleeping with the enemy is one thing… leaving money on the nightstand afterwards is another thing altogether!  It will be interesting to see what happens here, to say the least.

The other rumored acquisition is that of Initiate Systems by IBM.  Thought to be roughly twice the size of Siperian, Initiate would tend to give further credibility to IBM’s vast – and growing – presence in the Health Care industry, where Initiate has become a recognized industry leader.  What muddies the waters, however, would be the question of what IBM would intend to do with Initiate’s entity resolution engine.  In a nutshell, Initiate has been one of two software vendors doing an excellent job of providing technologies applicable for both MDM and fraud/risk related implementations.  Infoglide Software Corporation is the other.

Marketed in an eerily similar fashion to Infoglide’s earlier-released Identity Resolution Engine (is imitation the most sincere form of flattery?), Initiate’s offering in this identity resolution space could become short-lived given IBM’s large and ongoing investment in InfoSphere Identity Insight Solutions (formerly Entity Analytics Solutions).  How soon that would happen, of course, is anyone’s guess.

One thing is certain, however: the need for technology that both supports MDM initiatives and exposes risk and fraud through matching and linking of entities is very real and growing.  How the other major industry players react – should either or both of these rumors become reality – will define the industry for years to come.

Connecting the Dots: We May Be Closer Than We Think

Wednesday, January 27th, 2010

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

Paul Rosenzweig, former Deputy Assistant Secretary for Policy at the Department of Homeland Security, recently posted an intriguing piece on Harvard National Security Journal about connecting the dots regarding the Christmas Bomber. He makes a strong case that a decision to stop research on data analytic tools in 2003 has contributed to the problem analysts face today in making sense of the massive and manifold data sources they sift through.

Initiating more research would clearly add to the tools that analysts have at their disposal. At the same time, applying existing entity resolution software technology to more data sources could add significant firepower and help address the data challenge.

Let’s examine four issues Mr. Rosenzweig raised and evaluate the current state of entity resolution technology to address each issue:

1.  Scalability

“This is a veritable flood of data.  In hindsight, of course, it is very easy to see the pieces that connect together to form a picture of Abdulmutallab’s plot.  But those 10 or so bits of information were floating in an ocean of other data—literally millions of different individual entries from thousands of different sources in a host of different databases.”

Existing entity resolution technology scales to handle multiple tens of millions of transactions daily. While the “flood of data” would likely test the limits of existing systems, it’s not clear whether reaching the required scalability is limited by the software or is simply a function of establishing well-founded rules and incorporating the needed amount of hardware capacity.

2.    Real-Time Analysis

“We continue to rely on the intuition of analysts to provide the insight we need.  It is all well and good to say ‘with the NSA intercept about a Nigerian we should have started looking at all Nigerians’ or ‘we should have begun looking at everyone named Umar Farouk,’ but those leaps of insight and anticipation are not routine—they require analysis and consideration.  And that requires time—time to ponder the necessity of making precisely that inquiry. But time is what our analysts don’t have.  At least not enough of it.  Not with the flood of data we are seeing.  They have to prioritize and move certain lines of inquiry to the top of the pile.”

Crucial attributes of entity resolution technology are its ability to (a) process massive amounts of data in real time and (b) make automated decisions that prioritize the importance of each element. Entity resolution will never displace trained analysts, but its ability to sift through millions of pieces of data to produce a prioritized list of the most important potential connections offers the best way to fully exploit analysts’ brainpower and accelerate the process of detecting impending terrorism.

3.    Automated Scoring

“What we lack is not human intuition.  Rather we lack the tools to make human intuition effective and automated.  The head of the NCTC told a rather shocked Senate committee the other day that, in effect, NCTC analysts don’t have a “Google‐like” tool for database inquiries.  They can’t, for example, simply type in ‘Umar Farouk’ and pull up all the pages with links to that name.”

While a “Google-like” tool isn’t currently being used, the components needed to build one are available. By connecting to the appropriate data sources, some of the more powerful entity analytic software can “similarity search” a name across multiple disparate (and even remote) databases, and the software will detect similar attributes of multiple identities, and then combine them to yield a broad picture of an individual’s activities as documented in the data sources.

4.    Multiple Attributes

“But even that wouldn’t be enough—because there would likely still be far too many ‘Umar Farouk’ pages for any analyst to review (especially if instead the name we had was, for example, ‘Omar Abdul’).  What is necessary, as the Markle Foundation has said persistently, is for us to authorize and invest in tools that allow for automated analytics—things like tagged data (so that corrections to information are automatically transmitted for updates), identity resolution techniques (so that ‘Umar’ and ‘Omar’ are both considered), and persistent queries (so that a question that an analyst asked last month about Umar Farouk persists in the databases and is automatically linked to a father’s warning about his son Umar when that comes in three weeks later).”

One untouched topic is the effect of associating other attributes with an identity in addition to names (e.g. phone, SSN, passport, license plate, eye color, DOB). Matching similar names in the absence of other information may not be adequate to raise an alert about an identity, but when other attributes are captured and added, the problem becomes markedly more manageable. “Persisting” an identity is a good suggestion that enables more attributes to be added over time. Growing the data in this fashion will enable the system to trigger when a connection to someone on a watch list is identified.
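To make the point about multiple attributes concrete, here is a minimal sketch in Python. It is not how any particular entity resolution product works: the `match_score` function, the attribute weights, and the sample records are all assumptions made up for illustration, and the name comparison simply uses the standard library’s `difflib.SequenceMatcher` as a stand-in for a real similarity algorithm.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Crude fuzzy name comparison in [0, 1]; a stand-in for a real matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average similarity over the attributes both records carry.

    Names are compared fuzzily; every other attribute is compared exactly.
    The weights are hypothetical and would need tuning on real data.
    """
    total = 0.0
    weight_sum = 0.0
    for attr, weight in weights.items():
        va, vb = rec_a.get(attr), rec_b.get(attr)
        if va is None or vb is None:
            continue  # attribute missing on one side: no evidence either way
        sim = name_similarity(va, vb) if attr == "name" else float(va == vb)
        total += weight * sim
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Hypothetical weights and made-up records, for illustration only.
WEIGHTS = {"name": 1.0, "dob": 2.0, "passport": 3.0}
watch_entry = {"name": "Umar Farouk", "dob": "1980-01-01", "passport": "P0000001"}
name_only = {"name": "Omar Farouk"}
with_attributes = {"name": "Omar Farouk", "dob": "1980-01-01", "passport": "P0000001"}

score_name_only = match_score(watch_entry, name_only, WEIGHTS)
score_with_attributes = match_score(watch_entry, with_attributes, WEIGHTS)
```

In this sketch a similar name alone yields a fairly high but inconclusive score, while agreement on date of birth and passport pushes the score closer to 1. A persisted identity that gains attributes over time would simply be re-scored as each new attribute arrives.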

Entity resolution technology is already sufficient to make an enormous difference today if it were just more broadly applied. While Mr. Rosenzweig is correct in his assertion that more research on data analytics tools is needed and can help move the process forward, we should also move rapidly to leverage available technology: entity resolution.

Healthcare Identity Resolution Confusion

Wednesday, January 20th, 2010

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

Confusion about medical records can lead to chaos. We’ve all heard horror stories about hospital tragedies caused by misidentification of a patient, such as performing an unnecessary surgery. It’s hard to overemphasize the importance of correct, unambiguous information in the practice of medicine. Knowing as much as possible about a patient enables a practitioner to reach a correct diagnosis and the proper treatment regimen in the least amount of time.

Underscoring the importance that accurate information plays in effective treatment, the American Recovery and Reinvestment Act (ARRA) passed in 2009 includes incentives for hospitals and doctors to adopt and support certified electronic health record (EHR) technology. In fact, the Act set aside $20 billion to encourage health care organizations to improve their recordkeeping through healthcare information technology.

Today’s hot healthcare industry topic, therefore, is electronic health records. While an EHR can create the potential for interoperability, it can’t deliver interoperability without robust identity resolution. High-quality health care depends on complete, unambiguous patient information being available at all times, so identity resolution technology has become a crucial component of a well-designed healthcare identification infrastructure.

Applied to patient identification integrity, identity resolution can prevent common medical errors.
Duplicates are a simple example, where two records exist for the same person within a single facility. More complex types of errors can easily start to mount up, including overlaps, where more than one record exists for one person across two facilities within a single organization, and overlays, where information for two people is integrated under a single record.

The rush to respond to ARRA resulted in overstatements of the identity resolution capabilities of many products. For example, most master data management (MDM) systems include matching and de-duplication capabilities that have become labeled “identity resolution” while in fact they lack the critical requirements for identity resolution. Dan Power of Hub Solution Designs has pointed out the growing role of identity resolution in MDM and the need for MDM vendors to move beyond “not invented here” thinking to incorporate true identity resolution into their offerings.

Confusion about medical records can lead to chaos. Clearing up confusion about identity resolution clears a path out of the chaos that will lead to better solutions.

Entity-Based Integration Model

Wednesday, January 13th, 2010

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

From a business standpoint, entity resolution (ER) is really the first step of a two-part process of integrating information about entities.  Entity reference records usually carry two types of attributes describing the entity: identifying attributes and informational attributes. Although the line between the two can be fuzzy, identifying attributes are those that describe the entity’s “characteristics,” information that tends to persist over time and helps to distinguish one entity from another of the same type.

For example, a customer reference might have identifying attributes like name, a mailing address, or age, relating to the identity of a person.  But the record may also have attributes such as marital status, hobby interests, or the make and model of his or her automobile.  The latter information could be important in understanding how to market to this individual, but may not be as helpful for identifying the person.

Let’s go back to where we left the technical discussion. In the last post we looked at representing the outcome of an ER process (E) acting on a list of entity references (S) in process order (λ) as being equivalent to a partition (P) of the set S. The notation we used for this was P = (E, S, λ).  Recall that a partition of S is simply any collection of non-empty subsets of S with two properties, 1) that the subsets don’t overlap, yet 2) the union of all the subsets is equal to S.  So in the case of the process E, if we divide S into subsets based on whether they reference the same entity, then these subsets will give us a partition of S.  Even though the partition P doesn’t tell us how the ER process operates, it does convey all of the information about the result of the process.  For any two references in S, the partition P will tell us the decision of E.  If the two references are in different partition subsets, it means E’s decision is that the two references are to different entities.  On the other hand if the two references are in the same subset, it means the references are to the same entity.

Therefore all ER processes acting upon a set of references can be described in terms of a partition of the reference set.  The reverse is also true.  Given any partition of the reference set, it can be thought of as the result of a decision process, such as an ER process.  This then is a nice “black box” way to describe an ER process in terms of its result without having to worry about its internal mechanism.
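As an illustration of the partition view, the Python sketch below builds a partition from a list of pairwise “same entity” decisions and then reads the ER decision for any two references back out of it. The way the partition is produced here (closing the pairwise decisions transitively with a small union-find) is an assumption for the sake of the example, since the partition itself says nothing about the internal mechanism; the function names and the references r1…r4 are likewise hypothetical.

```python
def partition_from_matches(refs, matches):
    """Group references into disjoint subsets, given pairwise match decisions.

    Pairwise decisions are closed transitively (an assumption of this sketch).
    """
    parent = {r: r for r in refs}

    def find(r):
        # Follow parent links to the representative, compressing the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matches:
        parent[find(a)] = find(b)

    blocks = {}
    for r in refs:
        blocks.setdefault(find(r), set()).add(r)
    return list(blocks.values())


def is_partition(blocks, refs):
    """Check the two properties: subsets are non-empty and disjoint, and their union is S."""
    seen = set()
    for block in blocks:
        if not block or seen & block:
            return False
        seen |= block
    return seen == set(refs)


def same_entity(blocks, a, b):
    """Read the ER decision for two references off the partition."""
    return any(a in block and b in block for block in blocks)


S = ["r1", "r2", "r3", "r4"]
P = partition_from_matches(S, [("r1", "r2"), ("r2", "r3")])
```

Here `P` contains two subsets, one holding r1, r2 and r3 and one holding r4 alone, so `same_entity(P, "r1", "r3")` is true even though that pair was never compared directly.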

So if a marketer has several sources of entity information, the first step is to apply an entity resolution process that brings together those records about the same customer, then to merge the attribute values among these records to assemble a more complete view of each entity.  Now here is an interesting twist.  The attributes of the reference records can themselves be thought of as entities.  For example, just as “Jim” and “James” can be considered equivalent names, the attributes of age and date-of-birth can be considered equivalent in the sense that, for a fixed point in time, the value of one can be transformed into the value of the other.
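The age and date-of-birth example can be sketched in a few lines of Python. The function name `age_on` and the sample dates are assumptions for illustration. The sketch only covers the date-of-birth to age direction; going the other way, an age at a fixed point in time pins the date of birth down to a one-year range rather than a single day.

```python
from datetime import date


def age_on(dob: date, as_of: date) -> int:
    """Age in whole years at a fixed point in time."""
    years = as_of.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years


# Made-up date of birth, evaluated as of a fixed date.
example_age = age_on(date(1970, 6, 15), date(2010, 1, 13))
```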

Okay, now let’s look at the general case.  We start with several reference sources R1, R2, … Rn, where each reference source (Rj) is defined by its underlying set of reference records (Sj), a set of attributes defined for each reference record (Aj), a set of attribute values that the attributes can take on (Vj), and a mapping (Mj) that assigns a value to each attribute of each record.  That is,

Rj = (Sj, Aj, Vj, Mj), where

Mj(r, a) = v, where r is a record in Sj, a is an attribute in Aj, and v is a value in Vj.
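For readers who prefer code to notation, a reference source can be sketched as a small Python data structure. The class name `ReferenceSource`, its field names, and the sample records are assumptions for illustration; only the four components (Sj, Aj, Vj, Mj) come from the definition above.

```python
from dataclasses import dataclass


@dataclass
class ReferenceSource:
    """One reference source R_j = (S_j, A_j, V_j, M_j)."""
    records: set      # S_j: the reference records
    attributes: set   # A_j: attributes defined for each record
    values: set       # V_j: values the attributes can take on
    mapping: dict     # M_j: (record, attribute) -> value

    def M(self, r, a):
        """Return the value v that M_j assigns to attribute a of record r."""
        return self.mapping[(r, a)]

    def is_consistent(self) -> bool:
        """Every (record, attribute) pair is mapped, and only to values in V_j."""
        pairs = {(r, a) for r in self.records for a in self.attributes}
        return set(self.mapping) == pairs and set(self.mapping.values()) <= self.values


# A made-up source with two records and two attributes.
R1 = ReferenceSource(
    records={"x", "y"},
    attributes={"name", "age"},
    values={"Jim", "James", 39, 40},
    mapping={("x", "name"): "Jim", ("x", "age"): 39,
             ("y", "name"): "James", ("y", "age"): 40},
)
```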

Now let S represent the union of all the individual reference sets S1…Sn, and let A represent the union of all the attributes A1…An.  We can describe an entity-based integration model as follows.

Let P be a partition of S (all of the records from all sources) and let Q be a partition of A (all attributes from all sources).  As we described earlier, if two records are in the same subset of partition P, it means that they refer to the same entity.  In this case P is modeling the ER process.  The partition Q, on the other hand, models attribute equivalence.  If two attributes are in the same subset of Q, as with the example of date-of-birth and age, they are equivalent: they may not be exactly the same, but the value of one attribute can be systematically mapped into a value of the other.

Here’s how it works.  Suppose that {x, y, z} is one of the subsets of P, meaning that x, y, and z are all references to the same entity, and that {u, v} is one of the subsets of Q, meaning that u and v are equivalent attributes.  Also suppose that u is an attribute for records x and y, and that v is an attribute for z.  The table below shows an “integration cell.”

[Table: integration cell for records x, y, z and attributes u, v]

Because x, y, and z are equivalent references, the three rows of this table really represent one entity “e” while u and v represent the same attribute “w”.  In this case there is a conflict because records x and y contribute different values.  It is not clear if the integrated entity e should have a value of “ab” or a value of “cd” for the integrated attribute w.  Deciding which value to select among conflicting values is called “knowledgebase arbitration.”  One way to select is the “voting” scheme.  Using this scheme the value would be “ab” because it occurs most frequently in the integration cell.
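The voting scheme is easy to sketch in Python. The cell contents below are an assumption chosen to be consistent with the description (x and y disagree, and “ab” occurs most often), not a copy of the original table, and the function name `arbitrate_by_vote` is hypothetical.

```python
from collections import Counter


def arbitrate_by_vote(cell_values):
    """Pick the most frequent non-missing value in an integration cell.

    Ties go to the value seen first, which is an arbitrary choice made
    for this sketch rather than part of the voting scheme itself.
    """
    counts = Counter(v for v in cell_values if v is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Assumed integration cell: record -> (attribute, value).
integration_cell = {
    "x": ("u", "ab"),
    "y": ("u", "cd"),
    "z": ("v", "ab"),
}

# u and v are equivalent attributes, so all three values vote on attribute w.
w_value = arbitrate_by_vote(value for _, value in integration_cell.values())
```

Voting is only one possible arbitration rule. Others, such as preferring the most recent or the most trusted source, would replace `arbitrate_by_vote` without changing the rest of the model.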

Space doesn’t permit a full exposition of this model, but if you want to explore further, a more complete description can be found in the paper Talburt, J. & Hashemi, R. (2008) A formal framework for defining entity-based, data source integration. H. Arabnia & R. Hashemi (Eds), 2008 International Conference on Information and Knowledge Engineering, Las Vegas, NV: CSREA Press (pp. 394-398).

In the next post we will discuss the most common architectures for ER systems.

Identity Resolution Daily Links 2010-01-11

Monday, January 11th, 2010

[Post from Infoglide] Actionable Identity Intelligence from Identity Resolution

“The recent ‘Christmas Bomber’ incident incited many posts about applying technology to address the gaps that allowed it to happen. For example, David Loshin wrote a piece for BeyeNETWORK about a ‘master terrorist system’ while Lawrence Dubov suggested improving the watch list process using entity resolution. While technology is a critical component of any solution, some specific issues about the technology are important to understand.”

[Post from Infoglide] Entity Resolution Cloud Rising in 2010

“A recent Information Week article referenced Oracle CEO Larry Ellison’s views on the future of IT that were offered during a December 17th analyst call. His remarks hint at the growing importance of cloud computing as a key driver in 2010. Writer Bob Evans mentioned that ‘Ellison also quite casually wove the terms ‘private clouds’ and ‘cloud computing’ into his strategic overview without lampooning them, which was a big step forward even though Ellison’s discomfort with the term is shared by IBM CEO Sam Palmisano and Hewlett-Packard CEO Mark Hurd.’”

Business Computing World: Trends In Master Data Management

[Philip Howard] “One of the outcomes of the recession has been that a lot of companies have cut back on long-term projects, especially where ROI may not be clear. And talking to various people it is clear that one of the areas so hit has been large hub-based MDM (Master Data Management) projects. That is because these typically take 18 months to 2 years to implement, require a lot of investment in time and money, and the benefits are a long way in the future.”

Chicago Security: What is a Fusion Intelligence Analyst?

“These analysts are responsible for providing support to decision makers by fusing information from local and federal law enforcement criminal databases with national-level intelligence from the Department of Homeland Security, for example, to create relevant intelligence products (finished reports about salient issues) to leaders (also known as “intelligence customers”) at all levels of government.”

Initiate Blog: Entity Resolution to Build a Better “Watch List”

“We should not be afraid to create more data sources and integrate more information. The fear is we run the risk of missing the useful information in a sea of worthless data. Entity resolution technology can make sense of all that information and resolve identities and relationships between them.”

Identity Resolution Daily Links 2010-01-05

Tuesday, January 5th, 2010

By the Infoglide Team

Center for Advanced Public Safety: SHARE & PUSH

“While SHARE is strictly for communications between law enforcement and the state’s Fusion Center, a companion portal, called the Portal to Uphold a Secure Homeland (PUSH), was also developed as part of the USDHS ITEP project to support private sector security personnel who oversee critical infrastructure.”

HealthNewsDigest.com: Medical/Healthcare Privacy and Fraud Outlook for 2010

“You may not be aware of this, but medical-related fraud and identity theft are growing problems in America. With the exploding cost of healthcare, increasing bureaucratic administrative healthcare systems, and a large, aging Baby Boomer population requiring increased medical care, it would seem that we are entering into a kind of ‘perfect storm’ for medical fraud.”

Aerospace News & Views: Business Travel Association Calls for Greater Attention to Aviation Security

“NBTA has long supported risk-management programs that enhance aviation security. TSA’s Secure Flight helps to enhance domestic and international travel through the use of improved watch list matching, while the US-VISIT program collects biometric information from international travelers, both of which help to protect travelers and our nation. These programs should be used as readily available tools to improve the system that protects our global aviation security.”

[Wes Richel] Gartner: Simple Interop: Why We Don’t Seek a Top Level Domain Name

“Should anyone need a demonstration of the difficulties that delay reaching global agreements, consider that the term “EHR” has an idiosyncratic definition in the U.S. when compared to most of the world. In the U.S. the term refers to the record of patient information that is kept by an individual care delivery organization (CDO), with the proviso that there be some degree of interoperability. In most other countries that use the term it refers to some specific sharing of information that may be sourced from many places including but not limited to the electronic patient records of individual CDOs.”

Identity Resolution Daily Links 2009-12-21

Monday, December 21st, 2009

By the Infoglide Team

Citizen-Times: Lawmakers to mediate spat over Iowa Lottery security

“The investigation began after questions arose about a northwest Iowa store clerk who won the lottery six times in 12 months, collecting $264,000. The ombudsman’s report, called ‘Taking Chances on Integrity,’ included 60 recommendations for changes in lottery procedures and policies.”

Cheap Mommy: EHR Savings Go Beyond Time and Money

“The national government will pump billions of dollars into the transfer of medical records to electronic data in order to improve medical care and communications. Doctors, drugstores, hospitals and insurance companies will be more efficient with the utilization of electronic medical information. They will be able to exchange data instantaneously through electronic health networks, saving time and reducing the frustration of patients. Having electronic files can also guarantee greater privacy than hard-copy records. E-files can monitor exactly who has access to your medical data and log when it is accessed.”

SFGate: Forecast calls for more clouds in computing

“‘Cloud computing certainly had mindshare and now, for many people, it has credibility,’ said Ray Valdes, analyst with Gartner Research. ‘A lot of the initial anxieties have faded.’ Gartner ranks cloud computing its top strategic technology area for 2010 and forecasts that revenue will grow from about $56.3 billion in 2009 to $150.1 billion in 2013.”

[Wes Richel] Gartner: Simple Interop: The Health Internet Node

“The goal here is to establish a framework for secure communications among healthcare organizations and between healthcare organizations and patient/consumers. Although we propose some specific uses (protected email and transactions among EHRs) our premise is that the framework will support a much broader set of use cases and Internet technologies.”

