Archive for the ‘Data Synchronization’ Category

Identity Resolution Daily Links 2010-04-11

Saturday, April 10th, 2010

By the Infoglide Team

Liliendahl on Data Quality: What is a best-in-class match engine?

“I don’t think anyone knows what product is the best match engine, because I don’t think that all match engines have been benchmarked with a representative set of data.”

ITBusinessEdge: SOA Spending on the Rise. Surprised? Here’s Why

“It’s important to realize that SOA is really a rather loose collection of best practices. It’s not necessarily a well-defined list where you have some checklist of things to do SOA and if you miss one, you’re not doing SOA. What’s happening is architecture teams are incorporating SOA best practices into various other initiatives.”

BTNonline.com: TSA To Assume All Watchlist Matching For U.S. Carriers By June, All Carriers By January

“The U.S. Transportation Security Administration is on track to assume watchlist matching from all U.S. carriers by the end of May, only slightly behind its March 31 U.S. implementation target for the Secure Flight passenger prescreening system, according to a U.S. Government Accountability Office report. The Secure Flight program also calls for TSA to assume watchlist matching from foreign carriers, and the agency already is working with 19 airlines outside the United States to do so. Five of those carriers are fully functional within the program, and an additional 14 are testing, GAO reported.”

[video] KENS5.com: UT Health Science Center helps bring medicine into computer age

“Currently 80 to 90 percent of all medical records are stored on paper. The goal is to have an electronic health record for everyone in the U.S. by 2014. Electronic health records are expected to greatly reduce the number of medical errors, which is significant. Each year in the United States, as many as 100,000 people die in hospitals because of such errors. That’s the equivalent of one major airline crash every single day of every single year.”

Identity Resolution Daily Links 2009-07-27

Monday, July 27th, 2009

By the Infoglide Team

information management: Multidomain Master Data Management for Business Success

“All data that flows through an enterprise can be categorized into six different types: who, what, when, where, how and why. Master data is about who, what, when and where. ‘Who’ data is about the parties of interest that matter most to a business or organization including stakeholders, benefactors, customers, suppliers, owners, providers, partners, etc.”

HSToday: DHS Highlights Intelligence Improvements in Report Marking 9/11 Report Anniversary

“To date, 72 fusion centers have been designated throughout the country, with DHS having provided more than $340 million from fiscal years 2004-2009 to state and local governments to support these centers. DHS also deployed the Homeland Security Data Network to 29 fusion centers, which allows the federal government to share information and intelligence with states and provides fusion center staff access to the most current terrorism-related information.”

The Healthcare IT Guy: Guest Article: Why Doctors Hate Electronic Medical Records

“The fact is that doctors love high-tech. They have reason to hate EMRs but not computers and iPhones.”

DecisionStats: Interview Jim Harris Data Quality Expert OCDQ Blog

“Jim Harris - ‘I know that Gartner has reported that 25% of critical data within large businesses is somehow inaccurate or incomplete and that 50% of implementations fail due to lack of attention to data quality issues.’”

Identity Resolution Daily Links 2009-07-24

Friday, July 24th, 2009

[Post from Infoglide] Entity Resolution as Data Mining

“In my last post, I suggested that entity resolution in the broadest sense (“Big ER”) really encompasses three activities.  The first is locating and collecting entity references from unstructured sources (entity extraction), the second is resolving and merging references to the same entity (“Little ER”), and the third is analyzing associations among entities.  Not every ER process involves all three activities.”

BeyeNETWORK: Some Perspectives on Quality

[Bill Inmon] “There are then very legitimate circumstances where incorrect data is best left in the database or data warehouse. Stated differently, there is no circumstance where correcting data or not correcting data is the right thing to do. In order to determine which approach is proper, the context of the corrections has to be known. Only then can it be determined whether correcting errors is the proper thing to do.”

Homeland Security Watch: How To Improve Homeland Security: Give the ODNI Oversight Responsibility for Fusion Centers

“To me, fusion centers are a fine example of Darwinian logic in homeland security.  There was no comprehensive national plan to create fusion centers.  In original intent, Founding-Fathers-federalism fashion, states and cities decided they were not getting the intelligence they wanted.  Arizona, Georgia, Illinois, New York and a handful of other jurisdictions took responsibility for processing - or “fusing” - their own intelligence.”

ITBusinessEdge: Master Data Management and the CIO’s Strategic Plan

“If we look at MDM as a collection of techniques providing enterprise-wide data requirements analysis and subsequent implementation of best practices in data management, then the savvy IT manager might cherry-pick from the tools offered by vendors to provide the optimal solution that unifies the view of critical data concepts while satisfying the data quality requirements imposed by a horizontal information solution.”

I, Cringely: Medical Records R Us

“So medical records are an area where IT could make us healthier and, if done correctly, ought to save lots of money, too.  What we need is some form of centralized medical record keeping that preserves patient privacy yet, at the same time, keeps us from shopping all over town for bogus Oxycontin prescriptions.”

Identity Resolution Daily Links 2009-06-12

Friday, June 12th, 2009

[Post from Infoglide] Data Source Disintermediation?

“According to Wikipedia, ‘disintermediation is the removal of intermediaries in a supply chain: ‘cutting out the middleman’… Buyers bypass the middlemen (wholesalers and retailers) in order to buy directly from the manufacturer and thereby pay less.’”

[Jim Harris] OCDQ Blog: The Two Headed Monster of Data Matching

“Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household). Data matching is commonly plagued by what I refer to as The Two Headed Monster…”

CorpWatch: CorpWatch announces release of the CrocTail application and open CorpWatch API

“CrocTail provides an interface for browsing information about several hundred thousand U.S. publicly traded corporations and their many foreign and domestic subsidiaries. Information from company Securities and Exchange Commission (SEC) filings has been parsed and annotated by CorpWatch to highlight specific corporate accountability issues. CrocTail also serves as a demonstration of the features and data available through the CorpWatch API.”

Vos Is Neias: Washington - TSA Advising Travelers To Book Airline Tickets Using Full Real Names

“While the T.S.A. has announced Aug. 15 as a target date for the airlines to begin asking for each passenger’s full name, gender and date of birth, and has already begun publicizing the program, called Secure Flight, the agency acknowledged that it would go into effect in phases as the airlines update their systems.”

Solving the False Negative Problem

Wednesday, April 22nd, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In my March 25, 2009 post “The Myth of Matching,” I discussed the confusion between entity resolution and matching, as in record de-duplication.  Matching is a necessary part of entity resolution, but it is not sufficient.  In particular, I brought up the issue of “false negatives,” cases where records don’t match but are in fact references to the same entity.  I used the example of Mary Doe living on Elm Street, who married John Smith living on Pine Street, resulting in two references, “Mary Doe, 234 Elm St” and “Mary Smith, 456 Pine St,” that don’t match but are nevertheless references to the same person.  Let’s discuss a couple of approaches to solving this problem: enlarging the scope of identity attributes and utilizing asserted associations.

The Mary Doe - Mary Smith case might be resolved if the scope of identity attributes were increased, i.e. if additional information such as date of birth, driver’s license number, or social security number were available in both records.  But as anyone acquainted with information quality understands, acquiring and maintaining additional information can create as many problems as it solves.  It also brings up a number of questions that the information custodians and collectors must answer.

Is this information available? Is it costly? Is use for this purpose permissible/legal?  Even if expanding the number of identity attributes is an option, it is not necessarily a panacea.  Increasing the number of identity attributes also increases the complexity of the matching.  What if some values are missing?  What if some values agree, but others disagree?
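The last two questions can be made concrete with a minimal sketch of an attribute-by-attribute comparison. The field names, the sample values, and the three-way outcome (agree, disagree, missing) are assumptions made for illustration only; they do not describe any particular matching product.

```python
# Hypothetical sketch: compare two entity references one attribute at a time.
# Field names and sample values are invented for illustration.

def compare_attributes(ref_a, ref_b, attributes):
    """Return 'agree', 'disagree', or 'missing' for each attribute."""
    outcome = {}
    for attr in attributes:
        a, b = ref_a.get(attr), ref_b.get(attr)
        if not a or not b:
            # A value absent from either record can neither confirm nor refute.
            outcome[attr] = "missing"
        elif a.strip().lower() == b.strip().lower():
            outcome[attr] = "agree"
        else:
            outcome[attr] = "disagree"
    return outcome


mary_doe = {
    "first_name": "Mary",
    "last_name": "Doe",
    "street": "234 Elm St",
    "date_of_birth": "1980-05-17",  # made-up value
}
mary_smith = {
    "first_name": "Mary",
    "last_name": "Smith",
    "street": "456 Pine St",
    "date_of_birth": None,  # not collected in this source
}

result = compare_attributes(
    mary_doe, mary_smith, ["first_name", "last_name", "street", "date_of_birth"]
)
print(result)
```

In this sketch the extra date-of-birth attribute does not help, because it is missing from one record. A decision rule still has to say what one agreement, two disagreements, and one missing value add up to.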

A second approach is to collect and use asserted associations.  The fundamental problem is that if Mary Doe and Mary Smith do not share any matching identity attributes, you cannot know that they are the same person without some separately acquired knowledge that they are in fact the same person.  Moreover, because not every Mary Doe is the same person as Mary Smith, you also need additional context such as the address to make the connection clear.  The upshot is that you need to possess the explicit knowledge that “Mary Doe at 234 Elm St is the same person as Mary Smith at 456 Pine St.”
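One way to picture an asserted association is as an explicit fact stored alongside the data, keyed on name plus address so that other Mary Does are not swept in. The sketch below uses a small union-find structure to hold such facts. That structure, the class name `AssertionStore`, and the normalization rule are assumptions chosen here for illustration, not a method the post prescribes.

```python
# Hypothetical sketch: asserted associations as explicit "same person" facts.

def normalize(name, address):
    """Key a reference on name plus address, ignoring case and outer spaces."""
    return (name.strip().lower(), address.strip().lower())


class AssertionStore:
    """Holds asserted links between references using union-find."""

    def __init__(self):
        self._parent = {}

    def _find(self, key):
        self._parent.setdefault(key, key)
        while self._parent[key] != key:
            # Path halving keeps later lookups short.
            self._parent[key] = self._parent[self._parent[key]]
            key = self._parent[key]
        return key

    def assert_same(self, ref_a, ref_b):
        """Record the external knowledge that two references are one entity."""
        root_a = self._find(normalize(*ref_a))
        root_b = self._find(normalize(*ref_b))
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_entity(self, ref_a, ref_b):
        return self._find(normalize(*ref_a)) == self._find(normalize(*ref_b))


store = AssertionStore()
store.assert_same(("Mary Doe", "234 Elm St"), ("Mary Smith", "456 Pine St"))

# The asserted pair resolves to one entity even though no attribute matches.
linked = store.same_entity(("Mary Smith", "456 Pine St"), ("mary doe", "234 elm st"))
# A different Mary Doe, at another address, is not pulled in by the assertion.
other = store.same_entity(("Mary Doe", "99 Oak Ave"), ("Mary Smith", "456 Pine St"))
print(linked, other)
```

Because the links are transitive in this design, a later assertion tying “Mary Smith, 456 Pine St” to a third reference would also tie that reference to the original Mary Doe.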

If Mary lives in the United States and Mary registers her change of name and address with the US Postal Service, then you might be able to resolve this through the USPS Change of Address file.  Besides the fact that this is only helpful in the US, relying on the USPS COA file has other disadvantages, not the least of which is that Mary may have decided not to register with the USPS.  For this reason, some companies choose to maintain their own knowledge by acquiring information from other public and private sources.

For example in the US, marriage records are publicly available and are a possible source of this associative information.  It may also be true that while Mary didn’t register her change of address with the USPS, she may have wanted to avoid missing any issues of her Modern Square Dancing magazine subscription and promptly registered her change of address with the publisher.  There are potentially many other data sources, such as changes in utility service, cable service, or required licensure notifications.

Even though the application of external association information can alleviate the false negative problem, it comes at a cost.  The collection and maintenance of associative information can be a monumental task for some types of entities. For example, at least 20% of the US population moves each year.  Because it is too large a task for most organizations to take on by themselves, companies that aggregate large amounts of associative data sometimes offer the application of this knowledge as a product.

In the next installment, I will discuss another common confusion, the difference between entity resolution and identity resolution.

Identity Resolution Daily Links 2009-03-02

Monday, March 2nd, 2009

By the Infoglide Team

Background Now: AG Seeks Injunction Against Contractors Asset Protection Association, Inc. (ConAPA) and Eugene Magre

“‘This company falsely promised its clients that if they gave their employees empty titles and worthless shares of stock they could avoid tens of thousands of dollars in workers compensation premiums,’ Attorney General Brown said. ‘But you can’t simply call a security guard a vice president and avoid complying with the law through a sophisticated and fraudulent scheme.’”

DailyTech: New Bills Target Stolen Merchandise Sold Online

“Under the new legislation, the brick and mortar retailers would score a major coup in that they could order eBay.com, Overstock.com, and Amazon.com to remove numerous goods without any proof.  Under the proposed laws, failure by the online retailers to ‘expeditiously investigate’ and remove the items would result in criminal penalties.”

BeyeNetwork: Business Drivers and Master Data

“Is the actual business need for a single version of the data, or just multiple versions, each of which is of higher quality? Drill down into this a little bit and you may need additional information from your business customers. What constitutes a requirement for master data? A situation in which two business processes need to have a fully shared view of the same representation of a data item?”

Web of Data: Report on Data Discovery by Bloor Research

“…there are now a number of products on the market that can discover data relationships that do not fall within the category of either data profiling or data quality. As a result, it is time to consider the importance of data discovery, and its requirements, as a market in its own right.”

Entity Extraction: The Flip Side of Entity Resolution

Wednesday, February 25th, 2009

By John Talburt, PhD, CDMP, Director of the Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ) at the University of Arkansas at Little Rock

Under our working definition of entity resolution as locating and merging references to the same entity, the last installment focused on the merge problem, and how matching is often used as a stand-in for ER.  Now let’s take a look at the locating problem.

First we should note that information comes to us in two forms, structured and unstructured.  The traditional world of IT has been built around structured information based on the discipline of relational database schemas.  In essence, data is structured if it is ready to be loaded into a relational database, i.e. all of the entities and their attributes are clearly delimited or tagged in a way that a computer can correctly read the entire data set by following one simple, repeating pattern.  In the good ole days, the flat-file format gave us this by requiring that every record must have a fixed length and every attribute must occupy a fixed position in the record.  Inspired by the spreadsheet paradigm, a friendlier version came along only requiring that all of the attributes be presented together in a fixed order, each separated from the other by a specially designated character, the delimiter.  Now XML has brought us yet another discipline of explicitly tagging the start and end of records and attributes with a consistent naming convention.
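As a rough illustration of “following the pattern,” the sketch below reads the same record from each of the three formats just described using only the Python standard library. The column positions, the `|` delimiter, and the tag names are assumptions invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical layout: (field name, start column, end column).
FIXED_LAYOUT = [("first_name", 0, 10), ("last_name", 10, 20), ("street", 20, 35)]
FIELD_ORDER = ["first_name", "last_name", "street"]


def parse_fixed(line):
    """Flat file: every attribute occupies a fixed position in the record."""
    return {name: line[start:end].strip() for name, start, end in FIXED_LAYOUT}


def parse_delimited(line, delimiter="|"):
    """Delimited: attributes appear in a fixed order, split by one character."""
    row = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    return dict(zip(FIELD_ORDER, row))


def parse_xml(text):
    """XML: each attribute is explicitly tagged at its start and end."""
    record = ET.fromstring(text)
    return {child.tag: (child.text or "").strip() for child in record}


fixed_line = "Mary".ljust(10) + "Doe".ljust(10) + "234 Elm St".ljust(15)
delimited_line = "Mary|Doe|234 Elm St"
xml_text = (
    "<person><first_name>Mary</first_name>"
    "<last_name>Doe</last_name><street>234 Elm St</street></person>"
)

records = [parse_fixed(fixed_line), parse_delimited(delimited_line), parse_xml(xml_text)]
print(records[0] == records[1] == records[2])
```

In all three cases the parser needs to know only one repeating rule to recover every record, which is what makes locating entity references trivial for structured data.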

So in the structured world, locating is easy: you just follow the pattern.  The problem is that we are now beginning to realize that there is a tremendous amount of information in unstructured formats such as free-form documents, photos, videos, audio files, and sensor data, formats that are not easily mapped into an entity-attribute schema.  Even if we just focus on information encoded in character (text) format, the total amount of unstructured information in most organizations often exceeds the amount of structured information by a considerable margin.  What’s more, we now realize that some of this information could be important, i.e. that processes like customer relationship management (CRM) could be transformed if the company only knew what its customers were saying in their emails to the company or in the comments they gave to telemarketers or technical support personnel, who typed those comments into a free-form notes field.

So how did we end up with so much unstructured information? Did good information go bad?  No, the reason is that the information age operates on four channels – people to computers, computers to people, computers to computers, and people to people – and it is the last of these that generates the unstructured information.  Person-to-person communication is inherently complex and often carries a tremendous amount of implicit and explicit context that people understand, but computers don’t.

Early in my career, I worked with a professor on the problem of disambiguating homographs using thesauri.  That is a fancy way of asking whether a computer can tell the difference in meaning between two words that are spelled the same but mean different things, just by looking at the synonyms of the words around them, e.g. “I can open this can.”  His favorite test was “Time flies like an arrow, but fruit flies like a banana.”

But getting back on topic, if you want to resolve whether references are to the same or different entities, you must first have the references.  So if the information sources are unstructured, the locating side of entity resolution is about finding the entity references.  This process is variously referred to as “named entity recognition”, “entity identification”, or “entity extraction”.  In the next installment we will discuss some of the strategies for entity extraction from unstructured text documents.
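To give a feel for what finding references in free text involves, here is a deliberately naive sketch that pulls name-and-address pairs out of a notes field with one regular expression. The pattern, the sample note, and the short list of street suffixes are assumptions for illustration; practical entity extraction has to cope with far more variation than this handles.

```python
import re

# Naive, hypothetical pattern: a capitalized first and last name, a connector
# ("," or "at" or "of"), then a house number, street name, and street suffix.
REFERENCE = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"
    r"(?:,| at| of) "
    r"(?P<address>\d+ [A-Z][a-z]+ (?:St|Ave|Rd|Blvd))\b"
)


def extract_references(text):
    """Return (name, address) pairs found in unstructured text."""
    return [(m.group("name"), m.group("address")) for m in REFERENCE.finditer(text)]


note = (
    "Customer Mary Doe at 234 Elm St called about her invoice. "
    "She said to send future mail to Mary Smith, 456 Pine St instead."
)

refs = extract_references(note)
print(refs)
```

The output is a list of structured references that a resolution step could then compare. Any phrasing the pattern did not anticipate would simply be missed.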

Identity Resolution Daily Links 2008-12-12

Friday, December 12th, 2008

[Post from Infoglide] Part Deux: If Only Data Quality Were That Simple

“Applying generic algorithms to data attributes with wildly varying characteristics simply can’t match the accuracy of applying a family of deterministic analytics, each built around specific characteristics of a particular attribute type.”

Data Value Talk: The added value of an integrated customer view

“So it appears that the data itself plays a crucial role in the lack of an integrated customer view. Or more accurately, the better the data - the better the customer view.  And the better the matching of customer records across separate systems the better the integrated customer view. So Data Quality and Matching (Identity Resolution) determine in large parts the quality of the integrated customer view and the added value that it delivers.”

Marion Star: Muzzle loading and compensation

“Investigators from the Ohio Bureau of Workers’ Compensation, posing as gun enthusiasts, twice visited SMS. Those visits consisted primarily of small talk about guns and ammo. McGraw discussed some pistols that he had recently sold and invited one of the investigators to bring in an allegedly defective gun, telling them he would ‘take a look at it.’”

Intelligent Enterprise: ‘Surround Strategy:’ A Prediction for 2009

“Rather than trying to remodel the data warehouse to accommodate fresher and more detailed operational data (near real-time activity in operational systems, process logs, etc.), these data sources will operate in parallel (or horizontally, whichever word you like) as complementary feeds to analytics. It takes too long and is too expensive to expand the data warehouse concept to do this.”

New York State Insurance Department: Cortland Woman Accused of Workers’ Comp Fraud

“Horton is charged with making false statements and submitting false testimony to the Workers’ Compensation Board to receive benefits. She claimed that an April 2006 back injury she suffered while she was a health aide prevented her from working or attending school. Investigators learned that she was attending school full-time.”

Gartner: When is SOA, DOA? When it’s without MDM!

[Andrew White] “Clearly, if every SOA-based application interaction had to incur the costs of data reconciliation, mapping, clean up etc, then the cost of building and maintaining that SOA-based application would exceed what it costs today without SOA.  The bottom line: SOA needs MDM to help with the evolution of the information infrastructure.”

The State Journal: Insurance Fraud Unit Wins 45 Convictions This Year

“Since January 2007, the fraud unit has received 1,703 case referrals for review from those in the insurance industry and private citizens. After reviewing the referrals, field investigators have been assigned 397 cases to pursue. During that time, [West Va. Insurance Commissioner Jane] Cline said, 292 criminal cases have been referred to various prosecuting authorities, as well as in-house prosecutors who have been assigned to the unit on a full-time basis. Further, the fraud unit has secured indictments on 84 individuals for 294 felony counts and successfully obtained 73 convictions, including 45 in 2008.”

Part Deux: If Only Data Quality Were That Simple

Wednesday, December 10th, 2008

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

During the past two weeks, Philip Howard at Bloor Research has raised interesting questions about the nature and efficiency of data quality solutions in a series of posts entitled “The problem with data quality solutions.” Last week I responded on his blog and posted an expanded discussion of the same points here.

His fourth installment opens some interesting new topics. Perhaps the best approach is to lift some quotes and then respond below.

“Where I will comment is on the importance of understanding relationships not just between data elements but also between data and applications and even between data and the business. Understanding data relationships is arguably the most important factor whenever you are moving and transforming data, especially in data migration and data archiving environments but also for moving data into a warehouse and similar applications.”

We agree that finding non-obvious connections is crucial to building effective data quality solutions. Many technologies fall short in this regard. They are unable to evaluate relationships based on similarity when data is inconsistent. Philip’s simple example baffles many technologies:

“A typical case might be where one application required a five digit numeric field and another application requires the same five numbers plus an additional two alphabetic characters. So, here’s a question for data quality vendors: can your software tell the difference?”

Applying generic algorithms to data attributes with wildly varying characteristics simply can’t match the accuracy of applying a family of deterministic analytics, each built around specific characteristics of a particular attribute type.
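As a minimal sketch of what “telling the difference” could look like for this one example, the code below profiles each field by its character shape and then checks whether two values share the same five-digit core. The function names and sample values are hypothetical, and this is only an illustration of an attribute-specific check, not a description of how any vendor’s product works.

```python
import re
from collections import Counter


def shape(value):
    """Reduce a value to its pattern: digits become 9, letters become A."""
    out = []
    for ch in value.strip():
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile(values):
    """Count how often each shape occurs in a field."""
    return Counter(shape(v) for v in values)


CODE = re.compile(r"(\d{5})([A-Za-z]{2})?")


def same_core(value_a, value_b):
    """True if both values are valid codes with the same five leading digits."""
    m_a = CODE.fullmatch(value_a.strip())
    m_b = CODE.fullmatch(value_b.strip())
    return bool(m_a and m_b) and m_a.group(1) == m_b.group(1)


app_one = ["10482", "55731"]        # five-digit numeric field
app_two = ["10482AB", "55731XZ"]    # same digits plus two letters

profile_one = profile(app_one)
profile_two = profile(app_two)
print(profile_one, profile_two)
print(same_core("10482", "10482AB"), same_core("10482", "55731XZ"))
```

The shape profile shows that the two fields are formatted differently, while the core comparison shows that individual values can still refer to the same thing. A generic string comparison would report only that “10482” and “10482AB” are not equal.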

He goes on: “Unfortunately, discovering relationships is not just about profiling your database. There may be relationships that exist across data sources (and types of data source) that you need to understand; and then there is the application factor. While it may not be theoretically correct from a purist data management perspective the fact is that many data relationships are defined within applications so, in one way or another, you really need to discover these.”

We couldn’t have articulated it any better. Many data quality solutions assume a higher degree of order than actually exists in the real world. Being able to deal with ambiguity (e.g., data sometimes missing, data entered in wrong fields) distinguishes the best technologies from their more simplistic brethren.

This post is getting a little long, so we’ll continue this discussion next week. In the meantime, we’d like to hear your reaction.

Identity Resolution Daily Links 2008-11-17

Monday, November 17th, 2008

[Post from Infoglide] Identity Resolution Daily: Proud of Our Heritage

“When we examine our company’s roots, we see that our heritage is finding bad guys. That’s what David Wheeler set out to do when he saw that detectives had a critical need for better tools for criminal investigations. That is what we are beginning to do in the great State of Washington to identify businesses trying to cheat on their workers’ compensation premiums. From desktops to mainframes and everything in between, our roots have spread and have helped keep us stable as the winds of change have buffeted us about.”

Miami Herald: Workers’ compensation investigator accused of fraud

“In September, according to an arrest warrant, Vega visited Pipe Designs Inc., 7710 NW 72nd Ave., in Miami-Dade. The company did not have any workers’ compensation coverage, Vega found. Vega told owner Ronald Triana that he would lower the hefty penalty — between $27,000 and $30,000 — if Triana gave him a $2,500 money order with the payee information blank, according to the warrant.”

onestopclick: MDM ‘driving software development’

“Studies carried out by IT industry analyst Gartner indicate the necessity for firms to increase the effectiveness of their database development, while reducing costs and meeting compliance requirements, is driving the take-up of MDM technologies.”

Computing SA: IT downturn: every cloud has a silver lining

“Open source data integration, data quality, and extraction, transformation and loading (ETL) applications will flourish in these conditions because they are less costly to obtain, widely supported and constantly updated.”

opodo: Travellers reminded of Esta regulations

“Jim Forster, British Airways’ government and industry affairs manager, said: ‘The US is our biggest overseas market and we have been working hard to advise our visa waiver customers that they must apply to the Department of Homeland Security well in advance of travel.’”

