Archive for November, 2009

Identity Resolution Daily Links 2009-11-24

Tuesday, November 24th, 2009

By the Infoglide Team

Topeka Capital-Journal: Five arrests in lottery fraud

“In this investigation, agents presented unsigned winning lottery tickets to retailers. The clerks were required to advise customers they had won a prize and instruct each they had to redeem winnings at lottery headquarters. In six instances, the clerks withheld information about the winning ticket but later tried to redeem the prize personally or with help of an accomplice.”

New York Times: Computerized Health Records

“Most other countries have much more use of electronic health records than we do. For example, the Danes have virtually 100 percent of physicians using electronic health records. In Britain, virtually 100 percent of primary care physicians use them. In Australia, Sweden, Norway, virtually 100 percent. In many, many other Western countries, the electronic record is virtually ubiquitous.”

ebizQ: Eight Reasons Why Data-Centricity Is The Future Of Business

“Unfortunately less well known, Data-as-a-Service (DaaS) is, however, likely the most strategic aspect of creating business value over the network, more than SaaS and possibly even more than PaaS. Creating a best-of-breed set of data, wrapping a business model around it (advertising, metering, internal chargebacks, build a network effect, etc), defining an SLA, and opening it up internally or to the world is how to both generate consumption as well as becoming in itself the new lock-in.”

reviewjournal.com: A fusion of crime fighters 

“The fusion center concept, which was developed by the federal government after the 9/11 attacks, is grounded in the idea that information flow between police agencies is key to stopping terrorism. But in Las Vegas and elsewhere, the concept has evolved to include a broader ‘all crimes, all hazards’ approach.”

Identity Resolution Daily Links 2009-11-20

Friday, November 20th, 2009

[Post from Infoglide] Entity Resolution Metrics

“In the last post we looked at the problem of measuring the accuracy of entity resolution processes.  As with any accuracy measure, comparing to a known standard of correctness or benchmark is required.  However, even without a benchmark, other measures are also important in evaluating ER outcomes.”

SmartData Collective: MDM: Build or Buy?

“In the paper, I describe five core MDM functions that should drive a deliberate MDM strategy:

1. Data cleansing and correction
2. Metadata
3. Security and access services
4. Data migration
5. Identity resolution”

New York Times: The Rules on Names Could Bend a Little

“Given more precise information at booking, the T.S.A. expects to be able to match more precisely a passenger’s identity against those on the watch list. This should reduce the number of false positives — people who are flagged at security until it can be determined that they are not the person with a similar name who is on a watch list. ‘The Secure Flight watch-list matching process occurs before a passenger even gets to the airport,’ Mr. Leyh said. ‘So if you get a boarding pass, the Secure Flight watch-list matching process is done.’ In other words, you are clear once you get that pass.”

O’Reilly radar: Health gets personal in the cloud

“A Personal Health Record (PHR) is one way that patients can have some control of their own health data, while providing an interoperable platform for sharing relevant clinical data between providers. Healthcare is changing rapidly and there are some important trends worth watching. Healthcare in the near future will be quite different than it is today. Web enabled technology is already changing the way medicine is practiced. As the digital nation comes of age we will see new opportunities, and new challenges, bringing healthcare in America into the 21st century.”

Entity Resolution Metrics

Thursday, November 19th, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In the last post we looked at the problem of measuring the accuracy of entity resolution processes.  As with any accuracy measure, comparing to a known standard of correctness or benchmark is required.  However, even without a benchmark, other measures are also important in evaluating ER outcomes.

As discussed previously, if S is a list of references, we can think of the outcome of an ER process E acting on S as the partition of S produced by the assignment of links by E.  If the process E assigns each and every reference a single link ID, then the collection of subsets of S sharing the same link ID value will form a partition of S.

It is fairly easy to see that this must be true.  Since E assigns every reference a link ID, every reference must be in one of the subsets even if it is in a set by itself.  Because it assigns every reference only one link ID, a reference can only be a member of one of the link ID sharing subsets.  Therefore, the collection of subsets of S that share a common link ID fits the definition of a partition, i.e. a set of non-empty, non-overlapping subsets of S whose union is S.
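The argument above can be checked mechanically. The following is a minimal sketch, not part of the original post: the reference names r1 to r4, the link IDs, and both function names are made up for illustration. It groups references by link ID and then tests the three conditions in the definition of a partition.

```python
from collections import defaultdict

def partition_from_links(link_ids):
    """Group references by link ID; link_ids maps reference -> link ID."""
    groups = defaultdict(set)
    for reference, link_id in link_ids.items():
        groups[link_id].add(reference)
    return list(groups.values())

def is_partition(subsets, references):
    """Check the definition: non-empty, non-overlapping, union equals S."""
    seen = set()
    for subset in subsets:
        if not subset or seen & subset:
            return False  # an empty subset, or one that overlaps an earlier one
        seen |= subset
    return seen == set(references)

# Hypothetical references and link IDs, for illustration only.
links = {"r1": "A", "r2": "B", "r3": "A", "r4": "C"}
parts = partition_from_links(links)
print(parts)
print(is_partition(parts, links.keys()))
```

Because the dictionary gives each reference exactly one link ID, the check succeeds for any input of this shape, which is the point of the argument.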

The notation for this is P = (E, S, λ).  Wait a second – we talked about the process E and the set of references S, so where did λ come from?  Well, λ represents the order in which E processes the set S.  Unfortunately, many ER processes are order-dependent so that, even with the same process E acting on the same set of references S, it’s possible that the outcome (i.e. the partition of S) will be different if the order of processing is different.  In other words, it may be that (E, S, λ1) and (E, S, λ2) are different partitions when λ1 and λ2 are different orders of processing, either because of the physical ordering of S or because of differences in logical order due to parallel or distributed processing.

One of the reasons that ER processes may be order-dependent is that they often rely upon probabilistic matching of identity attributes.  Take as a simple example the Levenshtein Edit Distance that is commonly used for approximate string matching.  Edit distance is defined as the minimum number of “edits” that will transform one string into another string.  The edits most commonly used are “insert character”, “delete character”, and “replace character”.

Consider the three strings “JOHNSON”, “JOHNSTON”, and “JOHANSON”.  The edit distance from “JOHNSON” to “JOHNSTON” is 1 because we can simply insert a “T”.  The edit distance from “JOHNSON” to “JOHANSON” is also 1 because we can simply insert an “A”.  However, the edit distance from “JOHNSTON” to “JOHANSON” is 2: no single edit will convert one to the other, so at least two are required, inserting an “A” and deleting the “T”.
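For readers who want to verify these distances, here is a short sketch of the standard dynamic-programming computation of Levenshtein distance, with insert, delete, and replace each counted as one edit. The function name edit_distance is an illustrative choice, not something taken from a particular ER product.

```python
def edit_distance(a, b):
    """Levenshtein distance: insert, delete, and replace each cost 1."""
    # previous[j] holds the distance from the first i-1 characters of a
    # to the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]

for left, right in [("JOHNSON", "JOHNSTON"),
                    ("JOHNSON", "JOHANSON"),
                    ("JOHNSTON", "JOHANSON")]:
    print(left, right, edit_distance(left, right))
```

The three printed distances are 1, 1, and 2, matching the values worked out by hand above.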

So to illustrate order dependence of processing, suppose that S comprises the three names just given and that E is a process that assigns links by building identity groups.  The rule for building identity groups is that a reference belongs to the first existing identity group in which it differs from one of the group’s exemplars by an edit distance of 1 or less.  If no identity group satisfies this condition, then the reference starts a new identity group.

Using this scheme, suppose that “JOHNSON” is the first reference to be processed.  Then by default, Identity Group 1 comprises [JOHNSON].  Suppose that the second reference processed is “JOHNSTON”.  Then by our rule, it also belongs to Identity Group 1 because the edit distance from “JOHNSTON” to the Identity Group 1 exemplar “JOHNSON” is 1.  So now Identity Group 1 comprises [JOHNSON, JOHNSTON].  Finally, suppose that the last reference processed is “JOHANSON”.  Since the edit distance from “JOHANSON” to “JOHNSON” is 1, E will also assign it to Identity Group 1.  The final result is that the partition of S created by E is a single set, the set S itself, i.e. all 3 references are to the same entity.

However, using the same set S and process E, the result will be different if the order of processing is reversed.  If we suppose that “JOHANSON” is processed first, then Identity Group 1 comprises [JOHANSON].  If “JOHNSTON” is processed second, E will determine that it starts a new Identity Group 2 because its edit distance from “JOHANSON” is greater than 1.  Thus Identity Group 2 comprises [JOHNSTON].  Finally, when “JOHNSON” is processed, it will differ from “JOHANSON” in Identity Group 1 by an edit distance of 1, thus it will be assigned to Identity Group 1.  The final result is that S is partitioned into two subsets, [JOHANSON, JOHNSON] and [JOHNSTON], a different outcome than for the original order.
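The two runs just described can be reproduced with a small sketch of the grouping rule. It assumes that every member of an identity group serves as an exemplar, which the description leaves open, and the names build_identity_groups and max_distance are illustrative rather than taken from any actual ER system.

```python
def edit_distance(a, b):
    """Levenshtein distance: insert, delete, and replace each cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]

def build_identity_groups(references, max_distance=1):
    """Assign each reference to the first group with a close enough exemplar.

    Assumption: every member of a group counts as an exemplar.
    """
    groups = []
    for reference in references:
        for group in groups:
            if any(edit_distance(reference, exemplar) <= max_distance
                   for exemplar in group):
                group.append(reference)
                break
        else:
            # No existing group matched, so start a new identity group.
            groups.append([reference])
    return groups

order_1 = ["JOHNSON", "JOHNSTON", "JOHANSON"]
order_2 = list(reversed(order_1))
print(build_identity_groups(order_1))
print(build_identity_groups(order_2))
```

The first order yields a single group of all three names, while the reversed order yields two groups, which is exactly the difference between (E, S, λ1) and (E, S, λ2).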

The underlying problem is that probabilistic matching processes such as edit distance are not transitive.  Equivalence relations such as equality exhibit the property that if A=B and B=C, then we can conclude that A=C.  This is called the transitive property.  However, as we have seen, “JOHNSTON” ≅ “JOHNSON” and “JOHNSON” ≅ “JOHANSON” does not imply that “JOHNSTON” ≅ “JOHANSON” where ≅ represents the relation that two strings differ by an edit distance of less than 2.
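The failure of transitivity can also be checked exhaustively for the three names. This sketch hard-codes the pairwise edit distances given earlier in the post; the helper names similar and transitivity_violations are illustrative.

```python
from itertools import permutations

# Pairwise edit distances as stated earlier in the post.
DISTANCES = {
    frozenset(("JOHNSON", "JOHNSTON")): 1,
    frozenset(("JOHNSON", "JOHANSON")): 1,
    frozenset(("JOHNSTON", "JOHANSON")): 2,
}

def similar(a, b):
    """The relation from the text: edit distance of less than 2."""
    return a == b or DISTANCES[frozenset((a, b))] < 2

def transitivity_violations(items, related):
    """List triples (a, b, c) where a~b and b~c hold but a~c does not."""
    return [(a, b, c) for a, b, c in permutations(items, 3)
            if related(a, b) and related(b, c) and not related(a, c)]

names = ["JOHNSON", "JOHNSTON", "JOHANSON"]
violations = transitivity_violations(names, similar)
print(violations)
```

Both violations have “JOHNSON” in the middle position, since it is the one name within distance 1 of each of the other two.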

This issue of order dependence is basic to the development of entity resolution metrics.  Next time, I will wrap up this discussion and move into a model for entity-based information integration.

Identity Resolution Daily Links 2009-11-16

Monday, November 16th, 2009

By the Infoglide Team

Government Technology: California Plans to Launch Information Security Operations Center

“The CA-ISOC would watch for attacks on the state government’s critical information infrastructure, including attempts to disrupt automated control networks for dams, power plants and other physical facilities. The plan also envisions creating a California Computer Incident Response Team that would work in concert with the state’s Emergency Management Agency and Fusion Center, as well as the U.S. Department of Homeland Security.”

Liliendahl on Data Quality: Splitting names

“When working through a list of names in order to make a deduplication, consolidation or identity resolution you will meet name fields populated as these:

  • Margaret & John Smith
  • Margaret Smith. John Smith
  • Maria Dolores St. John Smith
  • Johnson & Johnson Limited
  • Johnson & Johnson Limited, John Smith
  • Johnson Furniture Inc., Sales Dept
  • Johnson, Johnson and Smith Sales Training…”

tunnellsystems: EHR vs EMR

“‘An EHR refers to a person’s health record that can be accessed online from many separate, compatible systems within a network. An EMR refers to an electronic patient record that can be accessed from a single system in a doctor’s office and that may, or may not, be shared with other health care professionals.’”

Identity Resolution Daily Links 2009-11-13

Friday, November 13th, 2009

[Post from Infoglide] The Big Story: Evolution

“Technology writer Chris Calnan’s story opened with a comment about Infoglide that nicely sums up the evolution of the broader market for identity resolution and entity analytics: ‘The market may have finally caught up with Infoglide Software Corp.’s technology.’”

OCDQ Blog: Beyond a “Single Version of the Truth”

“However, in his excellent book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman explains: ‘A fiendishly attractive concept is… ‘a single version of the truth’…the logic is compelling…unfortunately, there is no single version of the truth. For all important data, there are…too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. This does not imply malfeasance on anyone’s part; it is simply a fact of life. Getting everyone to work from a single version of the truth may be a noble goal, but it is better to call this the ‘one lie strategy’ than anything resembling truth.’”

RISK&INSURANCE: States of Disparity

“Risk & Insurance® looked at four factors that indicate how well a state’s workers’ comp system may be working. Those factors were adjusted by giving additional weight to the amount of premium charged to the employer, and the benefits paid to claimants. The states are ranked by their composite score.”

Security Management: DHS Official Outlines Federal Support to State-based Fusion Centers

“To better facilitate information sharing, Johnson promised DHS will deploy personnel to all fusion centers while giving fusion centers access to the Homeland Security Data Network by the end of fiscal year 2010. Currently, I&A has 44 field representatives based in fusion centers nationwide. I&A will also manage the newly created Joint Fusion Center-Program Management Office (JFC-PMO), which Napolitano tasked in October with coordinating how DHS’ various components and other federal agencies will support fusion centers.”

MedicExchange.com: EMR likely to boom throughout 2013

“Health IT currently is growing at an 11 percent annual rate, and solid growth should continue at least through 2013, which would be the third year of the federal EMR stimulus program here in the States, the Scientia report forecasts. In that time frame, health IT will increase its market share by a quarter, to 5 percent of global healthcare products sales from the current 4 percent.”

The Big Story: Evolution

Wednesday, November 11th, 2009

Technology writer Chris Calnan’s story opened with a comment about Infoglide that nicely sums up the evolution of the broader market for identity resolution and entity analytics: “The market may have finally caught up with Infoglide Software Corp.’s technology.”

While identity resolution technology has evolved rapidly over the past decade, its market visibility only emerged fairly recently. It was barely two years ago in mid-2007 when Gartner analyst Mark Beyer dubbed it “entity resolution and analysis” and pointed out that it “was previously an obscure, but gradually developing, technology that has come to the forefront as a result of world events and market forces.” Gartner singled it out as an “On the Rise” technology within operational business intelligence.

That first Gartner “hype cycle” showed entity resolution and analysis entering at the earliest stage. A year later in mid-2008, a broader report on data management  depicted it significantly higher on the curve in the opinion of the Gartner analyst team. In both reports, its estimated time to “mainstream adoption” was 2-5 years, the second fastest category.

At the end of 2008, noted consultant and speaker Jill Dyché of Baseline Consulting issued her predictions for 2009. Along with predictions about SaaS, data governance, BI, and MDM, she said that “Identity Resolution will get its due.” Rob Karel of Forrester had written several months before about Informatica’s acquisition of Identity Systems, one of the two closest Infoglide competitors (IBM EAS being the other one), from Nokia for $85 million.

As we progressed further into 2009, the most meaningful indicator of identity resolution’s growing importance surfaced: an escalating identification with the space by other companies. IBM, Infoglide, and Informatica were joined by Initiate Systems, Intelligent Search, and Netrics, each of whom began incorporating messaging around identity and entity resolution.

For our customers and for us, this is all good news.  As our evolving space becomes better known and more highly valued, customers will have more alternatives and our own visibility will increase. The future of identity resolution looks bright, and we all win.

[Distributed earlier this week in our quarterly publication, Identity Resolution Quarterly]

Identity Resolution Daily Links 2009-11-09

Monday, November 9th, 2009

By the Infoglide Team

NYTimes Dealbook: Insider Scheme Had Touches of James Bond

“Unlike the Galleon case, where senior officials at corporations passed tips on early earnings estimates to people at the fund, the Goffer case centers on allegations that may sound more familiar to students of the insider trading scandals of 25 years ago — early tips about deals from the people involved in doing them. According to the criminal complaints, Mr. Cutillo passed the information along through a friend, Jason C. Goldfarb, 31, who specialized in workers compensation law at a private firm in Brooklyn and who was also arrested on Thursday.”

Computerworld: Data quality vendors missing the mark, study finds

“One-fifth of respondents felt data quality is a prerequisite to an MDM initiative and wanted to see more vendor offerings integrating those two areas. Hayler says one would expect vendor partnerships between the areas of data quality and MDM, and that is precisely what is currently happening in the industry.”

docinthemachine: Encrypt EHR — Else HIPAA Violations Need Be Reported To Government & Media

“For example, if a physician maintains patient information in a laptop computer containing the unsecured information of more than 500 patients and the laptop is stolen, the physician would be required to notify not only the patients affected by the breach, but would likely need to also notify the DHHS and the media. A medical practice need not report a breach if the patient information has been properly encrypted – because information that is encrypted is not considered ‘unsecure.’”

Initiate Blog: The Brittle Nature of Data Warehouses

“Usually, only a small percentage of the data are ever used. So why bother? The TCO for extracting, copying, converting, transferring, transforming, integrating, propagating, backing-up, loading, and verifying the data skyrockets far beyond its value and injects significant risk and brittleness into the entire ecosystem.”

Identity Resolution Daily Links 2009-11-06

Friday, November 6th, 2009

[Post from Infoglide] The Other Half of Entity Resolution

“In a recent post, Jonathan McDonald quotes one definition of entity resolution: ‘According to Gartner, entity resolution is ‘the capability to resolve multiple labels for individuals, products or other noun classes of data into a single resolved entity when pseudonyms, alias names or other synonym-style constructs exist.’ …While the definition nicely captures the value of ‘first degree’ entity resolution, it falls short by omitting non-obvious relationship detection.”

iHealthBeat: Study: U.S. Lags Behind Many Other Countries in EHR Use

“The study found that 46% of U.S. physicians use electronic health records, up from 28% in 2006. The researchers found that 99% of doctors in the Netherlands use EHRs. Australia, Italy, New Zealand, Norway, Sweden and the U.K. also reported EHR adoption rates of 94% or higher.”

data quality PRO: Profit by Data Quality Best Practices

“Insurers use data to manage litigation, detect fraudulent claims and limit financial exposure to claims through reinsurance, but this practice works only when the data is credible. It is no overstatement that sound, profitable property / casualty operations begin – and end – with quality data.”

Federal News Radio: What airline passengers need to know about TSA’s Secure Flight program

“The information is then used ‘behind the scenes’ to match against the No-Fly list. ‘It’s a behind the scenes process,’ said Leyh. ‘If you get to the airport and you have your boarding pass, the Secure Flight part of it, and the watch list matching part of it, is over. It’s done with.’”

information management: Inefficiency as a Standard in Product Information Management

“Managing product information across a large organization consists of much more than making sure prices and descriptions are accurate and consistent. Large manufacturers and retailers employ teams of people tasked with the job of cross checking product data. While the deployment of these teams is a good idea in theory, the process is loaded with inefficiency and errors are all but guaranteed.”

The Other Half of Entity Resolution

Wednesday, November 4th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

In a recent post, Jonathan McDonald quotes one definition of entity resolution:

According to Gartner, entity resolution is “the capability to resolve multiple labels for individuals, products or other noun classes of data into a single resolved entity when pseudonyms, alias names or other synonym-style constructs exist. This is especially true in cases wherein there exists intentional falsification of information or the creation of false identities. While most prevalent in detecting perpetrators of criminal or illegal activity, more-commercial applications exist as well.”

While the definition nicely captures the value of “first degree” entity resolution, it falls short by omitting non-obvious relationship detection.

Basic entity resolution determines “who’s who” by sifting through massive amounts of noun/attribute data in multiple disparate data sources. Cutting through ambiguity caused by missing attributes, pseudonyms, aliases, and obvious efforts to deceive, it mines and resolves the essential elements of identity to form an unambiguous picture that greatly enhances business decisions and reduces risk.

However, in many application domains, pinpointing “who knows whom” is equally valuable. In detecting insider trading, for example, it’s important to resolve identity information to achieve an unambiguous picture of a person of interest, but to expose fraudulent activity, it’s critical to identify second and third degree linkages between suspects and their friends, relatives, and business associates.

More examples abound. In insurance, fraudsters change roles each time they stage a car accident and also intentionally modify their identities in accident reports. Fraudulent employers who want to reduce their workers’ compensation premiums will close their company and start a new one with modified identities of corporate officers. In retail, non-receipted returns of merchandise are often linked to store employees and the customers they enlist to act as their confederates. The list goes on and on.

In each case, entity resolution finds hidden connections by evaluating multiple ambiguous attributes with the same algorithms used to resolve identities. A retail employee who takes a customer’s winning lottery ticket (while telling the customer he didn’t win!) can be traced through address and phone information to other suspiciously connected people, e.g. frequent lottery winners and lottery commission employees.
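To make “degrees of separation” concrete, here is a deliberately simplified sketch. All records, names, and field values are hypothetical and invented for illustration. It links people only on exact matches of address or phone, whereas a real system would apply fuzzy matching of the kind used for identity resolution itself.

```python
from collections import defaultdict, deque

# Hypothetical records, invented for illustration only.
RECORDS = [
    {"name": "clerk", "address": "12 Oak St", "phone": "555-0101"},
    {"name": "roommate", "address": "12 Oak St", "phone": "555-0102"},
    {"name": "frequent_winner", "address": "9 Elm Ave", "phone": "555-0102"},
    {"name": "bystander", "address": "4 Pine Rd", "phone": "555-0199"},
]

def build_links(records, attributes=("address", "phone")):
    """Link two people whenever they share an exact attribute value."""
    by_value = defaultdict(set)
    for record in records:
        for attribute in attributes:
            by_value[(attribute, record[attribute])].add(record["name"])
    links = defaultdict(set)
    for names in by_value.values():
        for name in names:
            links[name] |= names - {name}
    return links

def degrees_of_separation(links, start):
    """Breadth-first search: number of hops from start to each reachable person."""
    degrees = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in links[current]:
            if neighbor not in degrees:
                degrees[neighbor] = degrees[current] + 1
                queue.append(neighbor)
    return degrees

degrees = degrees_of_separation(build_links(RECORDS), "clerk")
print(degrees)
```

In this made-up data the clerk shares an address with the roommate, who shares a phone number with the frequent winner, so the frequent winner sits two degrees from the clerk while the bystander is unreachable.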

With apologies to the experts at Gartner, here’s a suggested addition to the definition that acknowledges the other half of entity resolution:

The capability to (a) resolve multiple labels for individuals, products or other noun classes of data into a single resolved entity when ambiguity from pseudonyms, alias names or other synonym-style constructs exists, and (b) expose hidden connections between entities that are two or more degrees of separation apart. This is especially true in cases where there exists intentional falsification of information or the creation of false identities. While most prevalent in detecting perpetrators of criminal or illegal activity, more-commercial applications exist as well.

Identity Resolution Daily Links 2009-11-02

Monday, November 2nd, 2009

By the Infoglide Team

Come by and see us at TDWI World in Orlando Nov. 3 & 4, Booth 405

The Emculturated World: Unmanage Master Data Management

“MDM breaks down in the moment it becomes divorced from a practical, immediate attempt to capture just what is needed today. The moment it attempts to “bank” standard symbols ahead of their usage, the MDM process becomes speculative, and proscriptive.”

Governing: Can I Say No to an Electronic Health Record?

“In some instances, patients don’t even know their information is being shared. For example, if consumers turn over prescription drug records when applying for life insurance, the insurer will sometimes hand off the information to business partners who then hand it off to data miners. To keep a tighter grip on privacy, Deven McGraw, director of health privacy at the Center for Democracy and Technology, would like a set of rules that all organizations in the health IT world would have to follow.”

Related post: “Applying Identity Resolution to Patient Identification Integrity”

San Antonio Express-News: McManus recalls 9-11 at GEOINT summit

“Bart Johnson, acting undersecretary for intelligence and analysis with the Homeland Security Department, said cooperation is improving, although problems remain with security clearances and interdepartmental connectivity. ‘The federal government can only do so much in getting it down to the street level,’ Johnson said. Homeland security and Justice Department officials have formed 72 ‘fusion centers’ — terrorism prevention and response centers where federal agencies work with the military, local law enforcement and private partners. Three are in Texas: Austin, Dallas and Collin County near Dallas.”

information management: From Search to Explore

“It’s no surprise that people are looking at more and more internal and external resources for informed decision-making. In the internal case, data integration is a foundation of master data management as well. But integration for BI to common visual tools is increasingly taking place in subsystems, relational databases and cubes, and the visualization layer itself.”

