Archive for January, 2010

Identity Resolution Daily Links 2010-01-29

Friday, January 29th, 2010

[Post from Infoglide] Master Data Movement

“I read with interest yesterday’s article at SeekingAlpha which discusses rumors swirling around the MDM software industry.  According to the article, sources suggest that two deals are very near completion.  The first of those rumored transactions would see Informatica picking up MDM provider Siperian.  On the heels of their acquisitions of Identity Systems and AddressDoctor, the Siperian purchase could not be totally unexpected – but would most certainly create some ripple effect worth watching.”

[Post from Infoglide] Connecting the Dots: We May Be Closer Than We Think

“Paul Rosenzweig, former Deputy Assistant Secretary for Policy at the Department of Homeland Security, recently posted an intriguing piece on Harvard National Security Journal about connecting the dots regarding the Christmas Bomber. He makes a strong case that a decision to stop research on data analytic tools in 2003 has contributed to the problem analysts face today in making sense of the massive and manifold data sources they sift through.”

Forrester Blog: Introducing The MDM Market’s Newest 800lb Gorilla: Informatica Acquires Siperian!

“In the short term, I’m sure Informatica will be more than happy to continue to collect revenue from Oracle while keeping this partnership alive, but don’t expect future negotiated contracted terms to remain very reasonable as Informatica gains traction with its MDM strategy. No matter how often Oracle says how happy they are to maintain a friendly state of co-opetition with strategic partners, I don’t anticipate they will want to run the risk of a competitor pulling the rug out from under its aggressive MDM strategy.”

News8Austin: Community forum poses questions about Fusion Center

“According to department officials, sharing information with neighboring jurisdictions as well as state and federal agencies ensures that crime history and other information is shared outside the city limits. The department said the center will be one that ‘analyzes information in order to best detect, respond and hopefully prevent criminal and terrorist activity — as well as other public safety hazards.’”

Ramon Chen: Informatica + Siperian Acquisition = Premier MDM Platform

“As expected, Informatica has announced that it has acquired Siperian (disclosure, my former company) for $130M… If predictions are correct, this will be a relative ‘bargain’ when compared with the upcoming IBM and Initiate Systems tie up which is expected to be 4 to 5x Initiate’s $90M annual revenues.”

Master Data Movement

Thursday, January 28th, 2010

By Douglas Wood, Infoglide Senior Vice President

I read with interest yesterday’s article at SeekingAlpha which discusses rumors swirling around the MDM software industry.  According to the article, sources suggest that two deals are very near completion.  The first of those rumored transactions would see Informatica picking up MDM provider Siperian.  On the heels of their acquisitions of Identity Systems and AddressDoctor, the Siperian purchase could not be totally unexpected – but would most certainly create some ripple effect worth watching.

The first thing that springs to mind is what Oracle would intend to do with Informatica.  A long-time business partner of Oracle, strengthened through the 2008 purchase of Identity Systems, Informatica could now only be classified as a true and direct competitor to Oracle.  Can Oracle continue to OEM technology (SSA Name3, for example) from what would instantly become a major competitor?  Sleeping with the enemy is one thing… leaving money on the nightstand afterwards is another thing altogether!  It will be interesting to see what happens here, to say the least.

The other rumored acquisition is that of Initiate Systems by IBM.  Thought to be roughly twice the size of Siperian, Initiate would tend to give further credibility to IBM’s vast – and growing – presence in the Health Care industry, where Initiate has become a recognized industry leader.  What muddies the waters, however, would be the question of what IBM would intend to do with Initiate’s entity resolution engine.  In a nutshell, Initiate has been one of two software vendors doing an excellent job of providing technologies applicable for both MDM and fraud/risk related implementations.  Infoglide Software Corporation is the other.

Marketed in an eerily similar fashion to Infoglide’s earlier-released Identity Resolution Engine (is imitation the most sincere form of flattery?), Initiate’s offering in this identity resolution space could become short-lived given IBM’s large and ongoing investment in InfoSphere Identity Insight Solutions (formerly Entity Analytics Solutions).  How soon that would happen, of course, is anyone’s guess.

One thing is certain, however: the need for technology that both supports MDM initiatives and exposes risk and fraud through matching and linking of entities is very real and growing.  How the other major industry players react – should either or both of these rumors become reality – will define the industry for years to come.

Connecting the Dots: We May Be Closer Than We Think

Wednesday, January 27th, 2010

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

Paul Rosenzweig, former Deputy Assistant Secretary for Policy at the Department of Homeland Security, recently posted an intriguing piece on Harvard National Security Journal about connecting the dots regarding the Christmas Bomber. He makes a strong case that a decision to stop research on data analytic tools in 2003 has contributed to the problem analysts face today in making sense of the massive and manifold data sources they sift through.

Initiating more research would clearly add to the tools that analysts have at their disposal. At the same time, applying existing entity resolution software technology to more data sources could add significant firepower and help address the data challenge.

Let’s examine four issues Mr. Rosenzweig raised and evaluate the current state of entity resolution technology to address each issue:

1.  Scalability

“This is a veritable flood of data.  In hindsight, of course, it is very easy to see the pieces that connect together to form a picture of Abdulmutallab’s plot.  But those 10 or so bits of information were floating in an ocean of other data—literally millions of different individual entries from thousands of different sources in a host of different databases.”

Existing entity resolution technology scales to handle multiple tens of millions of transactions daily. While the “flood of data” would likely test the limits of existing systems, it’s not clear whether reaching the required scalability is limited by the software or is simply a function of establishing well-founded rules and incorporating the needed amount of hardware capacity.

2.    Real-Time Analysis

“We continue to rely on the intuition of analysts to provide the insight we need.  It is all well and good to say ‘with the NSA intercept about a Nigerian we should have started looking at all Nigerians’ or ‘we should have begun looking at everyone named Umar Farouk,’ but those leaps of insight and anticipation are not routine—they require analysis and consideration.  And that requires time—time to ponder the necessity of making precisely that inquiry. But time is what our analysts don’t have.  At least not enough of it.  Not with the flood of data we are seeing.  They have to prioritize and move certain lines of inquiry to the top of the pile.”

Crucial attributes of entity resolution technology are its ability to (a) process massive amounts of data in real time and (b) make automated decisions that prioritize the importance of each element. Entity resolution will never displace trained analysts, but its ability to sift through millions of pieces of data to produce a prioritized list of the most important potential connections offers the best way to fully exploit analysts’ brainpower and accelerate the process of detecting impending terrorism.

3.    Automated Scoring

“What we lack is not human intuition.  Rather we lack the tools to make human intuition effective and automated.  The head of the NCTC told a rather shocked Senate committee the other day that, in effect, NCTC analysts don’t have a “Google‐like” tool for database inquiries.  They can’t, for example, simply type in ‘Umar Farouk’ and pull up all the pages with links to that name.”

While a “Google-like” tool isn’t currently being used, the components needed to build one are available. Connected to the appropriate data sources, some of the more powerful entity analytics software can “similarity search” a name across multiple disparate (and even remote) databases, detect similar attributes across multiple identities, and combine them to yield a broad picture of an individual’s activities as documented in the data sources.

4.    Multiple Attributes

“But even that wouldn’t be enough—because there would likely still be far too many ‘Umar Farouk’ pages for any analyst to review (especially if instead the name we had was, for example, ‘Omar Abdul’).  What is necessary, as the Markle Foundation has said persistently, is for us to authorize and invest in tools that allow for automated analytics—things like tagged data (so that corrections to information are automatically transmitted for updates), identity resolution techniques (so that ‘Umar’ and ‘Omar’ are both considered), and persistent queries (so that a question that an analyst asked last month about Umar Farouk persists in the databases and is automatically linked to a father’s warning about his son Umar when that comes in three weeks later).”

One untouched topic is the effect of associating other attributes with an identity in addition to names (e.g., phone, SSN, passport, license plate, eye color, DOB). Matching similar names in the absence of other information may not be adequate to raise an alert about an identity, but when other attributes are captured and added, the problem becomes markedly more manageable. “Persisting” an identity is a good suggestion that enables more attributes to be added over time. Growing the data in this fashion will enable the system to trigger when a connection to someone on a watch list is identified.
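To make the point about multiple attributes concrete, here is a minimal sketch, not a description of any vendor's engine. It assumes a toy scoring scheme in which name similarity (computed with Python's standard difflib) supplies half of the score and agreement on other attributes supplies the rest. The field names, the weights, and the sample values are all hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict,
                other_fields=("dob", "passport", "phone")) -> float:
    """Combine name similarity with agreement on other attributes.

    Hypothetical weighting: the name contributes half of the score, and the
    fraction of comparable attributes that agree contributes the other half.
    """
    name_part = name_similarity(rec_a["name"], rec_b["name"])
    # Only compare attributes that are present on both records.
    comparable = [f for f in other_fields if rec_a.get(f) and rec_b.get(f)]
    if not comparable:
        return 0.5 * name_part
    agree = sum(rec_a[f] == rec_b[f] for f in comparable)
    return 0.5 * name_part + 0.5 * agree / len(comparable)


# Made-up records: the same spelling variation, with and without extra attributes.
watch_entry = {"name": "Umar Farouk", "dob": "1980-01-01", "passport": "X0000001"}
name_only = {"name": "Omar Farouk"}
with_attributes = {"name": "Omar Farouk", "dob": "1980-01-01", "passport": "X0000001"}

print(match_score(watch_entry, name_only))
print(match_score(watch_entry, with_attributes))
```

In this sketch a similar name alone can never score above 0.5, while the same name backed by a matching date of birth and passport number scores much higher, which is the effect described above.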

Entity resolution technology is already sufficient to make an enormous difference today if it were just more broadly applied. While Mr. Rosenzweig is correct in his assertion that more research on data analytics tools is needed and can help move the process forward, we should also move rapidly to leverage available technology: entity resolution.

Identity Resolution Daily Links 2010-01-25

Monday, January 25th, 2010

By the Infoglide Team

Liliendahl on Data Quality: Create Table Homo_Sapiens

“Identity Resolution is about the same but – if a distinction is considered to exist – uses a wider range of data, rules and functionality to relate collected data rows to real world entities. In my eyes exploiting external reference data will add considerable efficiency in the years to come within deduplication / identity resolution.”

OmniMD: Clock starts ticking on meaningful use comments

“The clock starts ticking today on a two-month window in which the public can comment on the Health & Human Service Department’s “meaningful use” proposal, a set of rules outlining how providers can qualify for incentives for using electronic health records.”

Beyond Search: Startling Fact: Size of Cloud Computing Market

“The global cloud computing market is expected to grow at a compounded annual rate of 28 percent from $47 billion in 2008 to $126 billion by 2012, according to IBM based on various market estimates.”

National Underwriter: Fraud Increases In ’09; Bureau Budgets Tighten 

“The Coalition interviewed 37 fraud bureaus during the first three weeks of Oct. 2009 for its survey, titled ‘The Economy and Fraud Fighting on the State Level.’ The bureau directors were asked for their views on trends in 15 areas of fraud, which include staged auto accidents, auto give-ups, padding auto and homeowner claims, arson, and workers’ compensation fraud by both workers and employers.”

Identity Resolution Daily Links 2010-01-22

Friday, January 22nd, 2010

[Post from Infoglide] Healthcare Identity Resolution Confusion

“Confusion about medical records can lead to chaos. We’ve all heard horror stories about hospital tragedies caused by misidentification of a patient, such as applying an unnecessary surgery. It’s hard to overemphasize the importance of correct, unambiguous information in the practice of medicine. Knowing as much as possible about a patient enables a practitioner to reach a correct diagnosis and the proper treatment regimen in the least amount of time.”

NewsandSentinel.com: Local officials do their part to fight terrorism

“Tom Campbell, a consultant on terrorist issues who has worked with Sandy in the past, has been in the field of counter-terrorism for 14 years. ‘We do not profile based on ethnicity and race, what we do is profile behavior,’ said Campbell. ‘Terrorism is evolutionary. Terrorists are always changing their behavior, appearances and tactics. What we try to do to prevent terrorism is focus on the behavior. That’s how we disrupt it before it happens. The emphasis is on prevention.’”

intelligent enterprise: Predicting BI Highlights for 2010

“Cloud computing and SaaS will become less niche as both BI heavy weights and vertically-focused vendors recognize that the infrastructure side of BI offers little competitive advantage; instead, it’s the time-to-value and agility. IT owners who don’t want to give up any control are in for a bruising.”

ISRIA: Testimony of Secretary Napolitano before the Senate Committee on the Homeland Security and Governmental Affairs, “Intelligence Reform: The Lessons and Implications of the Christmas Day Attack”

“DHS uses TSDB data, managed by the Terrorist Screening Center that is administered by the FBI, to determine who may board, who requires further screening and investigation, who should not be admitted, or who should be referred to appropriate law enforcement personnel. Specifically, to help make these determinations, DHS uses the No-Fly List and the Selectee List, two important subsets within the TSDB. Individuals on the No-Fly List should not receive a boarding pass for a flight to, from, over, or within the United States.”

Healthcare Identity Resolution Confusion

Wednesday, January 20th, 2010

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

Confusion about medical records can lead to chaos. We’ve all heard horror stories about hospital tragedies caused by misidentification of a patient, such as applying an unnecessary surgery. It’s hard to overemphasize the importance of correct, unambiguous information in the practice of medicine. Knowing as much as possible about a patient enables a practitioner to reach a correct diagnosis and the proper treatment regimen in the least amount of time.

Underscoring the importance that accurate information plays in effective treatment, the American Recovery and Reinvestment Act (ARRA) passed in 2009 includes incentives for hospitals and doctors to adopt and support certified electronic health record (EHR) technology. In fact, the Act set aside $20 billion to encourage health care organizations to improve their recordkeeping through healthcare information technology.

Today’s hot healthcare industry topic, therefore, is electronic health records. While an EHR can create the potential for interoperability, it can’t deliver interoperability without robust identity resolution. High-quality health care depends on complete, unambiguous patient information being available at all times, so identity resolution technology has become a crucial component of a well-designed healthcare identification infrastructure.

Applying identity resolution to patient identification integrity can prevent common medical errors:
Duplicates are a simple example, where two records exist for the same person within a single facility. More complex types of errors can easily start to mount up, including overlaps, where more than one record exists for one person across facilities within a single organization, and overlays, where information for two people is integrated under a single record.
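As a rough illustration of the first two error types, the sketch below assumes that an identity resolution step has already assigned an entity identifier to every patient record. The tuple layout, the identifiers, and the facility names are hypothetical and do not come from any particular EHR product.

```python
from collections import defaultdict


def classify_record_groups(records):
    """Find duplicates and overlaps among resolved patient records.

    records: iterable of (record_id, entity_id, facility) tuples, where
    entity_id is assumed to be the output of an identity resolution step.
    Returns (duplicates, overlaps), both keyed by entity_id.
    """
    by_entity = defaultdict(list)
    for record_id, entity_id, facility in records:
        by_entity[entity_id].append((record_id, facility))

    duplicates, overlaps = {}, {}
    for entity_id, recs in by_entity.items():
        by_facility = defaultdict(list)
        for record_id, facility in recs:
            by_facility[facility].append(record_id)
        # Duplicate: the same person has several records inside one facility.
        dups = {f: ids for f, ids in by_facility.items() if len(ids) > 1}
        if dups:
            duplicates[entity_id] = dups
        # Overlap: the same person has records in more than one facility.
        if len(by_facility) > 1:
            overlaps[entity_id] = sorted(by_facility)
    return duplicates, overlaps


sample = [
    ("r1", "p1", "North"), ("r2", "p1", "North"),   # duplicate within North
    ("r3", "p2", "North"), ("r4", "p2", "South"),   # overlap across facilities
    ("r5", "p3", "South"),
]
print(classify_record_groups(sample))
```

Overlays are deliberately left out of the sketch: when two people share one record, the evidence is inside the record's contents rather than in how records group together, so it cannot be read off this structure.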

The rush to respond to ARRA resulted in overstatements of the identity resolution capabilities of many products. For example, most master data management (MDM) systems include matching and de-duplication capabilities that have become labeled “identity resolution” while in fact they lack the critical requirements for identity resolution. Dan Power of Hub Solution Designs has pointed out the growing role of identity resolution in MDM and the need for MDM vendors to move beyond “not invented here” thinking to incorporate true identity resolution into their offerings.

Confusion about medical records can lead to chaos. Clearing up confusion about identity resolution clears a path out of the chaos that will lead to better solutions.

Identity Resolution Daily Links 2010-01-18

Monday, January 18th, 2010

By the Infoglide Team

hrtools: Workers’ comp anti-fraud and compliance program saved $128 million in FY 2009

“The fight against fraud in the workers’ compensation system brought in $128 million last year, according to a new report from the Washington Department of Labor & Industries (L&I)… L&I also referred 25 fraud cases for criminal prosecution, including 18 workers, four employers, and three health care providers — with a 100 percent success rate.”  [Link to Full Report]

Connecticutplus.com: Governor Rell directs State Homeland Security officials to review summary of NWA 253 failures

“‘Connecticut is home to a state and local ‘fusion center’ – a place where we share the information with our federal homeland security partners,’ Governor Rell said… ‘Connecticut’s proximity to New York, its number of high-profile locations and its importance as a transportation hub mean that fusion center is a critical – and very busy – place. We want to make sure there are no avoidable breakdowns.’”

FierceEMR: CDC: More than 40 percent of docs have EMRs

“Breaking down the numbers leads to a little more sanity. About 20.5 percent of respondents say they had a basic system capable of recording patient demographics, problem lists, clinical notes, medication orders and of viewing test results. Just 6.3 percent had fully functional EMRs, with medical histories, electronic order entry, drug interaction checking, highlighting of abnormal readings and reminders for guideline-based interventions, the CDC says.”

The Server Room: Cloud Computing and the Hype Cycle

“Hence we’d like to claim that the recent interest in cloud computing, taken in the context of prior developments on grid computing, the service paradigm and virtualization and over the infrastructure provided by the Internet, is actually the slow climb into the Slope of Enlightenment.  Experimentation will continue, and some attempts will still fail.  However the general trend will be toward mainstreaming.”

Identity Resolution Daily Links 2010-01-15

Friday, January 15th, 2010

[Post from Infoglide] Entity-Based Integration Model

“From a business standpoint, entity resolution (ER) is really the first step of a two-part process of integrating information about entities.  Entity reference records usually carry two types of attributes describing the entity, identifying attributes and informational attributes. Although the line between the two can be fuzzy, identifying attributes are those that describe the entity’s ‘characteristics,’ information that tends to persist over time and helps to distinguish one entity from another of the same type.”

Healthcare Technology Online: 10 Healthcare IT Trends To Watch In 2010

“According to the latest statistics from HIMSS (Healthcare Information and Management Systems Society), only 0.5% of U.S. hospitals currently have a complete EMR (electronic medical record) system that provides data continuity throughout the institution. Hospitals and healthcare systems will install, integrate, and enhance EMR systems at an accelerated pace in an effort to demonstrate ‘meaningful use’ and capitalize on ARRA incentives.”

InformationWeek: Airline Security: The Technical Task Of Connecting Dots

“Pulling those data streams together–from federal agencies, law enforcement, foreign governments, and private sector companies–and getting that information to the right people quickly and in useable format are huge technical challenges. While there were obvious missed opportunities in the case of Umar Farouk Abdulmutallab, including failure to take action with information in hand, it would be a mistake to underestimate the end-to-end data integration effort required as one of, simply, ‘connecting the dots.’”

ChannelWeb: Gartner: Cloud Computing Contributes To Mass IT Asset Exodus

“Cloud computing will take such a stranglehold on the market as companies try to reduce hardware spending that Gartner has made the bold proclamation that one-fifth of all businesses will own absolutely no IT assets come 2012.”

Entity-Based Integration Model

Wednesday, January 13th, 2010

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

From a business standpoint, entity resolution (ER) is really the first step of a two-part process of integrating information about entities.  Entity reference records usually carry two types of attributes describing the entity, identifying attributes and informational attributes. Although the line between the two can be fuzzy, identifying attributes are those that describe the entity’s “characteristics,” information that tends to persist over time and helps to distinguish one entity from another of the same type.

For example, a customer reference might have identifying attributes like name, mailing address, or age, relating to the identity of a person.  But the record may also have attributes such as marital status, hobby interests, or the make and model of his or her automobile.  The latter information could be important in understanding how to market to this individual, but may not be as helpful for identifying the person.

Let’s go back to where we left the technical discussion. In the last post we looked at representing the outcome of an ER process (E) acting on a list of entity references (S) in process order (λ) as being equivalent to a partition (P) of the set S. The notation we used for this was P = (E, S, λ).  Recall that a partition of S is simply any collection of non-empty subsets of S with two properties, 1) that the subsets don’t overlap, yet 2) the union of all the subsets is equal to S.  So in the case of the process E, if we divide S into subsets based on whether they reference the same entity, then these subsets will give us a partition of S.  Even though the partition P doesn’t tell us how the ER process operates, it does convey all of the information about the result of the process.  For any two references in S, the partition P will tell us the decision of E.  If the two references are in different partition subsets, it means E’s decision is that the two references are to different entities.  On the other hand if the two references are in the same subset, it means the references are to the same entity.

Therefore all ER processes acting upon a set of references can be described in terms of a partition of the reference set.  The reverse is also true.  Given any partition of the reference set, it can be thought of as the result of a decision process, such as an ER process.  This then is a nice “black box” way to describe an ER process in terms of its result without having to worry about its internal mechanism.
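The partition view can be sketched in a few lines of Python. This is only an illustration: the function names and the sample references are made up, and the partition is assumed to be stored as a list of sets.

```python
def is_partition(subsets, references):
    """Check the two partition properties over non-empty subsets.

    1) no two subsets overlap, and 2) their union equals the reference set.
    """
    seen = set()
    for subset in subsets:
        if not subset or seen & subset:
            return False  # empty subset, or overlap with an earlier subset
        seen |= subset
    return seen == set(references)


def same_entity(subsets, r1, r2):
    """Read the ER decision for two references off the partition."""
    return any(r1 in subset and r2 in subset for subset in subsets)


S = ["a", "b", "c", "d"]            # hypothetical entity references
P = [{"a", "b"}, {"c"}, {"d"}]      # one possible ER outcome

print(is_partition(P, S))           # True
print(same_entity(P, "a", "b"))     # True: same subset, same entity
print(same_entity(P, "a", "c"))     # False: different subsets
```

Nothing in the sketch says how the subsets were produced, which is the “black box” point: the partition records the decisions, not the mechanism.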

So if a marketer has several sources of entity information, the first step is to apply an entity resolution process that brings together those records about the same customer, then to merge the attribute values among these records to assemble a more complete view of each entity.  Now here is an interesting twist.  The attributes of the reference records can themselves be thought of as entities.  For example, just as “Jim” and “James” can be considered equivalent names, the attributes of age and date-of-birth can be considered equivalent in the sense that, for a fixed point in time, the value of one can be transformed into the value of the other.
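The age and date-of-birth example can be written down directly. The sketch below covers one direction of the mapping, from date of birth to age at a fixed point in time; the function name and the sample dates are hypothetical.

```python
from datetime import date


def age_on(date_of_birth: date, as_of: date) -> int:
    """Map a date of birth to an age in whole years at a fixed point in time."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


as_of = date(2010, 1, 13)
print(age_on(date(1970, 1, 13), as_of))  # 40: birthday falls on the as-of date
print(age_on(date(1970, 1, 14), as_of))  # 39: birthday is one day away
```

Going the other way, an age pins the date of birth down only to a one-year window, so a practical mapping from age back to date of birth would return a range rather than a single value.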

Okay, now let’s look at the general case.  We start with several reference sources R1, R2, …, Rn, where each reference source (Rj) is defined by its underlying set of reference records (Sj), a set of attributes defined for each reference record (Aj), a set of attribute values that the attributes can take on (Vj), and a mapping (Mj) that assigns a value to each attribute of each record.  That is,

Rj = (Sj, Aj, Vj, Mj), where

Mj(r, a) = v, where r is a record in Sj, a is an attribute in Aj, and v is a value in Vj.
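For readers who think in code, one way to hold a single reference source Rj = (Sj, Aj, Vj, Mj) is a small data class. This is a sketch under assumptions: the class name, the field names, and the sample records are hypothetical, and the mapping is assumed to be a dictionary keyed by (record, attribute).

```python
from dataclasses import dataclass


@dataclass
class ReferenceSource:
    """R = (S, A, V, M): records, attributes, allowed values, and the mapping M."""
    records: set       # S: reference records
    attributes: set    # A: attributes defined for each record
    values: set        # V: values the attributes can take on
    mapping: dict      # M: (record, attribute) -> value

    def M(self, r, a):
        """Return the value assigned to attribute a of record r."""
        return self.mapping[(r, a)]

    def is_well_formed(self) -> bool:
        """True if M assigns a value from V to every attribute of every record."""
        return all(
            (r, a) in self.mapping and self.mapping[(r, a)] in self.values
            for r in self.records
            for a in self.attributes
        )


R1 = ReferenceSource(
    records={"x", "y"},
    attributes={"u"},
    values={"ab", "cd"},
    mapping={("x", "u"): "ab", ("y", "u"): "cd"},
)
print(R1.M("x", "u"))        # ab
print(R1.is_well_formed())   # True
```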

Now let S represent the union of all the individual reference sets S1…Sn, and let A represent the union of all the attributes A1…An.  We can describe an entity-based integration model as follows.

Let P be a partition of S (all of the records from all sources) and let Q be a partition of A (all attributes from all sources).  As we described earlier, if two records are in the same subset of partition P, it means that they refer to the same entity.  In this case P is modeling the ER process.  The partition Q, on the other hand, models attribute equivalence: if two attributes are in the same subset of Q, they are equivalent.  As with the example of date-of-birth and age, equivalent attributes may not be exactly the same, but the value of one attribute can be systematically mapped into a value of the other attribute.

Here’s how it works.  Suppose that {x, y, z} is one of the subsets of P, meaning that x, y, and z are all references to the same entity, and that {u, v} is one of the subsets of Q, meaning that u and v are equivalent attributes.  Also suppose that u is an attribute for records x and y, and that v is an attribute for z.  The table below shows an “integration cell.”

[Table image: integration cell for records x, y, z and attributes u, v]

Because x, y, and z are equivalent references, the three rows of this table really represent one entity “e” while u and v represent the same attribute “w”.  In this case there is a conflict because records x and y contribute different values.  It is not clear if the integrated entity e should have a value of “ab” or a value of “cd” for the integrated attribute w.  Deciding which value to select among conflicting values is called “knowledgebase arbitration.”  One way to select is the “voting” scheme.  Using this scheme the value would be “ab” because it occurs most frequently in the integration cell.
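The voting scheme is easy to sketch. The cell contents below are an assumption chosen to agree with the description above (x and y contribute different values, and “ab” occurs most often); the original table is not reproduced here, so which record carries which value is a guess. The function name and the tie rule are also hypothetical.

```python
from collections import Counter


def arbitrate_by_voting(cell_values):
    """Pick the most frequent value in an integration cell.

    Returns None for an empty cell or when the top values tie, because simple
    voting gives no basis for choosing in those cases.
    """
    counts = Counter(cell_values).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the leading candidates
    return counts[0][0]


# Assumed cell: records x and y supply attribute u, record z supplies attribute v.
integration_cell = {("x", "u"): "ab", ("y", "u"): "cd", ("z", "v"): "ab"}

print(arbitrate_by_voting(list(integration_cell.values())))  # ab
```

Returning None on a tie is one design choice among several; a real arbitration step might instead fall back on source reliability or recency.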

Space doesn’t permit a full exposition of this model, but if you want to explore further, a more complete description can be found in the paper Talburt, J. & Hashemi, R. (2008) A formal framework for defining entity-based, data source integration. H. Arabnia & R. Hashemi (Eds), 2008 International Conference on Information and Knowledge Engineering, Las Vegas, NV: CSREA Press (pp. 394-398).

In the next post we will discuss the most common architectures for ER systems.

Identity Resolution Daily Links 2010-01-11

Monday, January 11th, 2010

[Post from Infoglide] Actionable Identity Intelligence from Identity Resolution

“The recent ‘Christmas Bomber’ incident incited many posts about applying technology to address the gaps that allowed it to happen. For example, David Loshin wrote a piece for BeyeNETWORK about a ‘master terrorist system’ while Lawrence Dubov suggested improving the watch list process using entity resolution. While technology is a critical component of any solution, some specific issues about the technology are important to understand.”

[Post from Infoglide] Entity Resolution Cloud Rising in 2010

“A recent Information Week article referenced Oracle CEO Larry Ellison’s views on the future of IT that were offered during a December 17th analyst call. His remarks hint at the growing importance of cloud computing as a key driver in 2010. Writer Bob Evans mentioned that ‘Ellison also quite casually wove the terms ‘private clouds’ and ‘cloud computing’ into his strategic overview without lampooning them, which was a big step forward even though Ellison’s discomfort with the term is shared by IBM CEO Sam Palmisano and Hewlett-Packard CEO Mark Hurd.’”

Business Computing World: Trends In Master Data Management

[Philip Howard] “One of the outcomes of the recession has been that a lot of companies have cut back on long-term projects, especially where ROI may not be clear. And talking to various people it is clear that one of the areas so hit has been large hub-based MDM (Master Data Management) projects. That is because these typically take 18 months to 2 years to implement, require a lot of investment in time and money, and the benefits are a long way in the future.”

Chicago Security: What is a Fusion Intelligence Analyst?

“These analysts are responsible for providing support to decision makers by fusing information from local and federal law enforcement criminal databases with national-level intelligence from the Department of Homeland Security, for example, to create relevant intelligence products (finished reports about salient issues) to leaders (also known as “intelligence customers”) at all levels of government.”

Initiate Blog: Entity Resolution to Build a Better “Watch List”

“We should not be afraid to create more data sources and integrate more information. The fear is we run the risk of missing the useful information in a sea of worthless data. Entity resolution technology can make sense of all that information and resolve identities and relationships between them.”

