
Archive for the ‘Data Quality’ Category

Identity Resolution Daily Links 2010-08-31

Tuesday, August 31st, 2010

By the Infoglide Team

OCDQ Blog: The Data-Decision Symphony

“Data is now everywhere.  Data is no longer just in the structured rows of our relational databases and spreadsheets.  Data is also in the unstructured streams of our Facebook and Twitter status updates, as well as our blog posts, our photos, and our videos. The challenge is can we somehow manage to listen for business insights among the endless cacophony of chaotic data volumes, and use those insights to enable better business decisions and deliver optimal business performance.”

SecurityInfoWatch.com: Welcome to the melting pot

“A Fusion Center is a terrorism prevention and response center program that began as a joint project between the DHS and the U.S. Department of Justice’s Office of Justice Program. It is designed to gather information from government and the private sectors to aid in safety and security. The Fusion Centers share information at the federal level between the CIA, FBI, DoJ, U.S. Military and state and local level governments, as well as Emergency Operations Centers in the event of a disaster. State and local police departments provide both space and resources for the majority of Fusion Centers. The analysts working there can be drawn from DHS, local police, or the private sector as in the case of Dallas.”

South Florida Business Journal: Clinic operator convicted in $2.3M fraud

“According to evidence presented during the two-week trial in Michigan, between about November 2006 and March 2007, the defendants submitted about $2.3 million in claims to Medicare for injection therapy services that were never provided and were not medically necessary. Medicare paid about $1.7 million.”

Identity Resolution Daily Links 2010-08-29

Monday, August 30th, 2010

[Post from Infoglide] Surface Web, Dark Web, and Social Media

“A recent article in Bank Systems & Technology  says that financial services institutions are discovering increasingly sophisticated attempts to defraud their customers – more sophisticated in how they gather information and employ it in their criminal schemes. ‘As fraudsters increasingly seek to exploit weaknesses in consumers’ defenses through social engineering schemes rather than hack vulnerabilities in banks’ security systems, the need for enterprisewide solutions to detect fraud across channels is greater than ever.’”

IT-Director.com: An Intelligent Match

“Buying rather than building speeds up the process of filling gaps in (or simply improving) functionality, and so is a logical step, and Experian itself has plenty of acquisition experience (including of course QAS itself). It opens up the intriguing possibility that Experian QAS may be looking in the future to spread its wings beyond its historically tight market of contact data management. If so then this may not be the last acquisition that it makes.”

AllBusiness: TSA “Secure Flight” Requirements

“Effective November 1, 2010, if you do not accurately provide the TSA with your full legal name as it appears on your government issued identification within 72 hours of a flight, your reservation could be canceled, at will, by the Transportation Security Administration (TSA). Why are they doing this?  To enhance the security of commercial air travel, the TSA has developed Secure Flight, a program that compares airline passenger information against U.S. government watch lists.”

InformationWeek: Top 10 Cloud Computing Complaints

“‘You need the ability to migrate data from one cloud service provider to another, and there are cloud interoperability scenarios that need to be addressed as well,’ notes Matt Edwards, director of the cloud services initiative at TM Forum, a communications industry association. ‘There are multiple things that need to be addressed to avoid vendor lock-in and to remove the barriers for the adoption of cloud services.’”

Identity Resolution Daily Links 2010-08-07

Saturday, August 7th, 2010

[Post from Infoglide] Reference Linking Methods - Part 3

“This is the third in a series of four posts that discuss four methods for linking references.  These methods are:

  1. Direct matching
  2. Transitive linking
  3. Linking by association
  4. Asserted Linking

In the last post I discussed transitive linking, and why it is essential for producing a unique and deterministic outcome of an ER process.  In this post I will discuss the third method, linking by association.”

BeyeNETWORK: Computed Attributes, Entity Resolution and Connectivity Hierarchies

“There are many types of relationships that are discovered as a by-product of entity resolution, such as households or families. These terms take on different meaning depending on the subject area and the business situation. For example, we can examine parent-child and sibling relationships associated with individuals, we can look at components such as paper clips or screws that are in the same ‘family,’ or we can look at corporate ownership relationships that reflect families of companies. Alternatively, we can look at other types of relationships – individuals belonging to the same health club, components manufactured from the same type of metal, or companies that share the same board members.”

WTVM: Phenix City doctor accused of multi-million dollar Medicare fraud

“In an 80-page  civil complaint, the United States Attorney’s Office claims 51-year-old Doctor Robert Ritchea, a physician, not only allowed an unlicensed medical assistant to inject patients with pain medications, but also improperly billed Medicare for the treatments. The complaint also alleges Ritchea over-billed Medicare by more than $2.2 million in over 4,300 separate claims over a period of four years.”

Liliendahl on Data Quality: Location, Location, Location

“If you know that 123 Main Street in Anytown is a single family house there is a high probability that this is the same real world individual. But if you know that 123 Main Street in Anytown is a building used as a nursing home, a campus or that this entrance has many apartments or other kind of units, then it is not so certain that these records represents the same real world individual (not at least if the name is John Smith). So this example highlights the importance of using external reference data in data matching.”

Reference Linking Methods - Part 1

Thursday, May 27th, 2010

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In the last few posts, we reviewed the basic architectures used to implement entity resolution (ER) systems.  Although this gives us the big picture at the systems level, ER really takes place at the reference (record) level where the system must ultimately decide whether two references are for the same or for different real-world objects, i.e. to link or not to link.  In this series I’ll discuss some of the most common methods for making these linking decisions.

I classify these methods into four categories:

  1. Direct matching
  2. Transitive matching
  3. Association analysis
  4. Assertion

In this post, we’ll consider the first and most familiar category, direct matching.  Here we decide to link based on the degree of similarity between the values of corresponding identity attributes.  For example, if the identity attributes are first name, last name, and date-of-birth in a certain context, then in direct matching we would compare the values for these attributes between two references.  In its simplest form, deterministic matching, the decision is yes if and only if all three values match exactly, i.e. two references should be linked (are equivalent) only when the first names are the same, the last names are the same, and the dates-of-birth are the same.  Otherwise they are judged to be references to different persons.
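The deterministic rule described above can be sketched in a few lines of Python. This is only an illustration: the `Reference` type, its field names, and the assumption that dates are stored as ISO strings are hypothetical, not part of any particular ER system.

```python
from typing import NamedTuple


class Reference(NamedTuple):
    """One reference (record) with the three identity attributes from the example."""
    first_name: str
    last_name: str
    date_of_birth: str  # assumed to be an ISO string such as "1989-08-13"


def deterministic_match(a: Reference, b: Reference) -> bool:
    """Return True only when every identity attribute matches exactly."""
    return (
        a.first_name == b.first_name
        and a.last_name == b.last_name
        and a.date_of_birth == b.date_of_birth
    )


john = Reference("John", "Doe", "1989-08-13")
jon = Reference("Jon", "Doe", "1989-08-13")

print(deterministic_match(john, john))  # True
print(deterministic_match(john, jon))   # False: one missing letter blocks the link
```

The second call shows why the rule is brittle: a single spelling variation is enough to prevent a link.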

Deterministic matching is very easy to implement, but it is not usually very effective.  Its lack of effectiveness stems from the pervasiveness of information quality (IQ) issues.  If the reference values are inaccurate, inconsistent, or missing, then direct matching creates too many false negatives, i.e. references to the same entity that really should be linked but don’t satisfy the deterministic matching criteria.  When names are misspelled, nicknames are used, values are missing, or date formats are inconsistent, the direct match between references can fail even though the references were intended to reference the same person. It should be clear that IQ is closely related to ER.

To address these issues, most systems rely upon some level of probabilistic matching.  In this form of matching, we link records even if some attributes’ values are different as long as the values of certain other attributes are the same.  Using the previous example, we might decide that the context in which we are working only requires an exact match on last name and date-of-birth in order to link the records.  Generally this incurs a certain amount of risk of creating a false positive link, i.e. references to different entities that match on certain attributes but should not be linked.  This risk is expressed as the probability that this might happen, hence the term “probabilistic” matching.

Probabilistic matching has been the subject of extensive research.  Most modern practice in probabilistic matching is based on the work of two Canadian statisticians, I. P. Fellegi and A. B. Sunter, who published A Theory for Record Linkage in 1969.  Their model, the Fellegi-Sunter Model, provides a systematic way of creating a probabilistic matching scheme that is optimal with respect to a given level (tolerance) of false positive and false negative risk.

Probabilistic matching is binary in the sense that attribute values either match or don’t match.  In our example, we could represent the case where we only require the last name and date-of-birth to match by the binary string “011” where the zero in the first position means that the first name doesn’t match, while the ones in the second and third positions mean that the last name and date-of-birth must match exactly.  When there are three attributes there would be 8 possible binary combinations to consider.  The problem with the binary model is that it doesn’t account for similarity.  Intuitively we would feel much more confident that the references “John, Doe, 1989-08-13” and “Jon, Doe, 1989-08-13” should be linked than we would the references “John, Doe, 1989-08-13” and “Mary, Doe, 1989-08-13”.
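As a rough sketch of the binary model, the agreement pattern for a pair of references can be computed as a string of ones and zeros. The function names and the `LINK_PATTERNS` rule below are hypothetical and chosen only to mirror the example. Note that both example pairs produce the same pattern, which is exactly the weakness described above.

```python
from itertools import product


def agreement_pattern(a, b):
    """Build a binary string: 1 where attribute values are equal, 0 otherwise."""
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


# Hypothetical rule: link on full agreement, or on last name plus date-of-birth.
LINK_PATTERNS = {"111", "011"}


def should_link(a, b):
    """Link when the agreement pattern is one of the accepted patterns."""
    return agreement_pattern(a, b) in LINK_PATTERNS


john = ("John", "Doe", "1989-08-13")
jon = ("Jon", "Doe", "1989-08-13")
mary = ("Mary", "Doe", "1989-08-13")

# Three attributes give 2**3 = 8 possible patterns.
all_patterns = ["".join(bits) for bits in product("01", repeat=3)]
print(len(all_patterns))              # 8

print(agreement_pattern(john, jon))   # 011
print(agreement_pattern(john, mary))  # 011, indistinguishable from the pair above
```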

Therefore, a common extension of probabilistic matching is to allow for intermediate levels of similarity between values, i.e. accounting for the fact that attribute values may not be the same but are similar.  For example, if the name values differ only by one character, we could say that the names are similar, or if the dates-of-birth differ by less than 10 days, we could say the dates are similar.  We have now moved from a binary model to a ternary (base 3) model so that in our previous example, the first pair of references would fit the pattern “122” and the second pair the pattern “022” where 1 represents similar values and 2 represents the same value.  The downside is that there are now more patterns to analyze and evaluate.  For 3 attributes there are now 27 cases to consider instead of 8.

Probabilistic matching that allows for intermediate levels of similarity is sometimes called fuzzy matching.  Although the term fuzzy implies that there is some leeway, in practice we must always set a discrete threshold that limits the amount of dissimilarity we are willing to tolerate.  Fuzzy matching also introduces a plethora of schemes for measuring similarity between two values.  In the cases where the values are character strings, such as for names, the schemes are called approximate string matching (ASM) algorithms.
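The three-level (base 3) patterns can be sketched as follows. The helper names are hypothetical, dates are assumed to be ISO strings, and "differ only by one character" is interpreted here as being one insertion, deletion, or substitution apart.

```python
from datetime import date

SAME, SIMILAR, DIFFERENT = "2", "1", "0"


def within_one_edit(s, t):
    """True when s and t are at most one insertion, deletion, or substitution apart."""
    if abs(len(s) - len(t)) > 1:
        return False
    if len(s) > len(t):
        s, t = t, s  # make s the shorter (or equal-length) string
    i = 0
    while i < len(s) and s[i] == t[i]:
        i += 1
    if len(s) == len(t):
        return s[i + 1:] == t[i + 1:]  # at most one substitution
    return s[i:] == t[i + 1:]          # one extra character in t


def compare_names(x, y):
    """Score two name values as same (2), similar (1), or different (0)."""
    if x == y:
        return SAME
    return SIMILAR if within_one_edit(x, y) else DIFFERENT


def compare_dates(x, y):
    """Score two ISO dates; dates less than 10 days apart count as similar."""
    if x == y:
        return SAME
    gap = abs((date.fromisoformat(x) - date.fromisoformat(y)).days)
    return SIMILAR if gap < 10 else DIFFERENT


def similarity_pattern(a, b):
    """Pattern over (first name, last name, date-of-birth)."""
    return compare_names(a[0], b[0]) + compare_names(a[1], b[1]) + compare_dates(a[2], b[2])


john = ("John", "Doe", "1989-08-13")
jon = ("Jon", "Doe", "1989-08-13")
mary = ("Mary", "Doe", "1989-08-13")

print(similarity_pattern(john, jon))   # 122
print(similarity_pattern(john, mary))  # 022
print(3 ** 3)                          # 27 possible patterns for 3 attributes
```

Unlike the binary patterns, the two pairs are now distinguishable, at the cost of 27 patterns to evaluate.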

One of the most often used is the Levenshtein Edit Distance that counts the minimum number of character transformations (usually insertion, deletion, and substitution) that will transform one string into another.  For example the edit distance between “Smythe” and “Smith” is 2 because in the first string you can substitute “i” for “y” and delete the “e” to create the second string in 2 transformations.
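For readers who want to see the computation, here is a minimal sketch of the edit distance using the standard dynamic-programming approach, with unit cost for each insertion, deletion, and substitution. The function name is an arbitrary choice.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs from s
                current[j - 1] + 1,      # insert ct into s
                previous[j - 1] + cost,  # substitute cs with ct, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Smythe", "Smith"))  # 2
```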

Typically ASM outputs are normalized to a scale from 0 to 1.  To normalize edit distance, divide the edit distance by the number of characters in the longer string.  In this example, the normalized edit distance would be 2/6 or 0.33.  Many other ASM algorithms have been developed such as Jaro, Jaro-Winkler, q-grams, Soundex, Smith-Waterman, and Ukkonen, just to mention a few.
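The normalization step is simple enough to show directly. In this sketch the raw edit distance is passed in as an argument, and returning 0.0 for two empty strings is an assumption made here to avoid dividing by zero.

```python
def normalized_edit_distance(distance: int, s: str, t: str) -> float:
    """Divide a raw edit distance by the length of the longer string."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 0.0  # assumption: two empty strings are treated as identical
    return distance / longest


# Edit distance between "Smythe" and "Smith" is 2; the longer string has 6 characters.
score = normalized_edit_distance(2, "Smythe", "Smith")
print(round(score, 2))  # 0.33
```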

In the next post we will discuss transitive matching.

Identity Resolution Daily Links 2010-05-25

Tuesday, May 25th, 2010

By the Infoglide Team

Information Management: 10 Key Trends In MDM

“During 2010, independent/standalone data quality vendors (Clavis, Pitney Bowes, Human Inference and Trillium) will focus on name and address cleansing as they struggle against better-funded match/merge and data profiling capabilities increasingly integrated with megavendor MDM. Also at this time, a dearth of non-aligned matching algorithms (such as those from Digital Trowel, Infoglide, Omikron and Uniserve) will engender ‘algorithm envy’ among disenfranchised MDM providers.”

NewCityPatch: Legislator: Rockland Should Review Medicaid Spending

“Rockland County Legislator Ed Day, R-New City, has called for a review of Medicaid spending by the county that would also determine whether enough is being done to prevent and detect Medicaid fraud. ‘Medicaid expenditures represent an amount that is 110 percent of all the property taxes collected here in Rockland,’ said Day.”

Canadian Immigration: Canada should improve its AML efforts according to US report

“The most significant area of concern is organized crime. Canadian Security Intelligence Service estimates that there are about 750 organized crime groups operating in Canada and 80% of them are involved in the illicit drug trade. The cross-border movement of currency was identified as a continued concern.”

Identity Resolution Daily Links 2010-05-15

Saturday, May 15th, 2010

[Post from Infoglide] Trade-Based Money Laundering

“Who’d have thought that iTunes could be used for money laundering? Yet that is exactly what five men in Great Britain were recently jailed for the other day. Using stolen credit card numbers, they bought £750,000 in vouchers, then sold them at cheaper prices over eBay. Methods of money laundering continue to evolve.”

Liliendahl on Data Quality: Big Time ROI in Identity Resolution

“So the question is if authorities may have avoided losing 5 billion taxpayer Euros if some identity resolution including automated fuzzy connection checks and real world checks was implemented. I know that you are so much more enlightened on what could have been done when the scam is discovered, but I actually think that there may be a lot of other billions of Euros (Pounds, Dollars, Rupees) to avoid losing out there by making some decent identity resolution.”

LISTA: The Privacy and Security Challenges of Electronic and Personal Health Records: Is Your Business Prepared?

“In a 2008 study conducted by Kroll Fraud Solutions/HIMSS Analytics to better understand the status of patient data security at hospitals, the hospitals surveyed reported an average level of preparedness to deal with a security breach of 5.88 on a one to seven ascending scale.  Yet the same study indicated that only 56 percent of these hospitals had notified patients whose information was compromised as a result of a security breach.”

Newsweek: Intel Paper Says Al Qaeda’s Yemeni Affiliate More Determined Than Ever to Attack Inside U.S.

“The ‘official use only’ bulletin, produced by the Northern California Regional Intelligence Center, a partnership of federal, state, and local agencies originally set up to deal with drug trafficking, is entitled ‘Al-Qa’ida in the Arabian Peninsula’s Online Rhetoric Signals Shift in Intentions.’”

Identity Resolution Daily Links 2010-05-11

Tuesday, May 11th, 2010

By the Infoglide Team

HealthLeaders Media: Detroit Doc Gets Six Years for Medicare Fraud

“Myint, of Bloomfield Hills, MI, was also ordered to pay more than $3.1 million in restitution, jointly with co-defendants, and to serve two years of supervised release following his prison term. Terrence Hicks, of Jackson, MI, the patient recruiter, was ordered to pay more than $4.9 million in restitution, jointly with co-defendants, and to serve three years of supervised release following his prison term.”

AolTravel: Is the No-Fly List Working?

“‘The TSA is hoping to smooth glitches with the new Secure Flight program — a system by which the ‘TSA will conduct uniform prescreening of passenger information against federal government watchlists,’ according to an official statement. ‘The TSA is taking over this responsibility from the airlines.’ The TSA says the Secure Flight system will be in effect for all domestic flights by mid-2010 and all international flights by the end of 2010, at which time the latest two-hour notification rule will become moot (since the airlines will no longer be responsible). Meanwhile, in the case of Shahzad, Kahn says it’s important to remember that the current system — for all its perceived faults related to his near escape — ultimately did what it was meant to do.”

ITBusinessEdge: Baby Steps to Master Data Management

“If you want to start small with master data management, you’ve got to start with a noun, says Evan Levy, a partner at Baseline Consulting  and an instructor with The Data Warehousing Institute… The problem is, IT doesn’t think in nouns. IT is all about the verb: Defining, coding, testing, supporting. What’s more, IT departments tend to view the world in terms of projects – fulfilling this feature request, upgrading to this release, migrating to this server.”

Liliendahl on Data Quality: Aadhar (or Aadhaar)

“In Denmark we have had such an identifier (one for citizens and one for companies) for many years. It is not used by everyone everywhere – so you still are able to make money being a data quality professional specializing in data matching. The main reason that the unique citizen identifier is not used all over is of course privacy considerations.”

The Big Short: How the Credit Scoring World Has Shifted

Wednesday, May 5th, 2010

By Infoglide Software CEO Mike Shultz

The hottest non-fiction book at the moment is The Big Short: Inside the Doomsday Machine. Best-selling author Michael Lewis explores and explains what went on behind the scenes during the years leading up to the big stock market crash in 2008 and answers a crucial question: “Who understood the risk inherent in the assumption of ever-rising real estate prices, a risk compounded daily by the creation of those arcane, artificial securities loosely based on piles of doubtful mortgages?” While misguided government policies together with greed and stupidity provide the larger answer, events during that time raise certain questions about the specific ways in which credit risk is evaluated.

For decades, several well-known organizations have assessed credit risk, i.e. the likelihood that a loan applicant will default on a loan. Financial services organizations like banks, credit card companies, and mortgage lenders base decisions to lend on credit scores like FICO. The scores are based primarily on a person’s financial history, including whether they have taken out loans and paid them back. Two major trends work together to hinder the effectiveness of traditional credit scores.

First, the use of credit cards as a form of payment has become ubiquitous, and a large percentage of people carry a balance from month to month. For example,  according to creditcards.com the average credit card debt per household carrying a balance is over $16,000. It’s clear that an ever-increasing number of households depend on credit cards to manage cash flow.

The second trend is a behavioral one. For many years, past loan history was a reasonable predictor of future behavior. People in general were committed to paying off their mortgage, and if they were in a tight cash flow situation, their highest priority was keeping up their mortgage payments. Now, with zero-down and interest-only loans, the lack of equity in the home translates to lower commitment, and defaulting on a loan is a less traumatic event. When credit cards are used for cash management and the penalty for mortgage foreclosure is not so high, it’s not hard to predict a much higher rate of foreclosures.

So, if past behavior is less predictive than before, expect a growing desire in the industry for more sophisticated measures that draw on both historical data and other sources, e.g. up-to-the-minute income and banking status. With its ability to combine and score disparate data, identity resolution technology is certain to play a key role in improving the financial industry’s ability to assess risk.

Identity Resolution Daily Links 2010-05-04

Tuesday, May 4th, 2010

By the Infoglide Team

USDOJ: Houston Medical Equipment Company Owner, Operator and Patient Recruiter Plead Guilty to Health Care Fraud Scheme and Illegal Health Care Kickbacks

“Onward began billing Medicare for fraudulent durable medical equipment in 2003, according to court documents. Vinitski and Lachman admitted they paid kickbacks, sometimes $1,000 per patient, to recruiters who brought patients to Onward. Lachman and Vinitski then would bill Medicare for durable medical equipment that these patients did not need or never received.”

Healthcare IT News: ONC turns its attention to health reform IT

“Congress and the administration want to make it easy for people to learn if they are eligible for benefits under the law, which in addition to changes to Medicare and Medicaid calls for setting up 50 new state health insurance plan exchanges for people who lack coverage. Potential beneficiaries would check their eligibility for such benefits online, where the data necessary to enroll an applicant might be scattered across separate social service and agency databases, like food stamps, a school lunch program, or the state tax department.”

CapitalSoup.com: CFO Sink Announces Arrest Of Eight Pip Scammers In Continued Pip Sweep Arrests

“The charges stem from a previous arrest of both suspects in May 2009 for participating in a staged accident ring under a former business name, E & B Rehabilitation Center. After their first arrest, they posted bond and reopened a clinic under a new name, Ganesha Medical Center Corporation; continuing to work without a business license.”

Liliendahl on Data Quality: Data Quality from the Cloud

“Many of the data quality issues I encounter in my daily work with clients and partners is caused by that adequate information isn’t available at data entry – or isn’t exploited. But information needed will in most cases already exist somewhere in the cloud. The challenge ahead is how to integrate available information in the cloud into business processes.”

Identity Resolution Daily Links 2010-04-20

Tuesday, April 20th, 2010

By the Infoglide Team

The Miami Herald: Medicare’s fraud hot line begins to root out billing scams

“By September, Feliberto Ramos was arrested on fraud charges accusing him and his company, Miracle Group Rehabilitation Center, of falsely billing the federal healthcare program $3.1 million over just three months. Medicare paid Ramos $1.9 million for rehab services never provided to angry beneficiaries.”

OCDQ: Data, data everywhere, but where is data quality?

“Data matters because everything—and not just the rows in our relational databases and spreadsheets, but also our status updates from Facebook and Twitter, our blog posts, and even most of our daily conversations—is data. The growing challenge is can we extract meaningful insights from these vast and veritable oceans of unrelenting data volumes, and use those insights to make better decisions in near real-time in order to positively impact the various aspects of our lives.”

eBusiness Tweets: Microsoft entering the electronic medical record (EMR) software market

“You would think Microsoft would be in such a promising industry, but you won’t find a Microsoft EHR available. The primary reason why is that EHRs are highly specialized, and Microsoft’s main products (Dynamics, CRM, and SharePoint) don’t come anywhere near the needs of physician practices. It would be very difficult for Microsoft to build an EHR from scratch and introduce it to the market. So what should Microsoft do to enter the industry? Acquire a current player.”

