Archive for the ‘Data-Mining’ Category

#3: If Only Data Quality Were That Simple

Wednesday, December 17th, 2008

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

Our previous post was a response to Philip Howard at Bloor, who recently raised questions about data quality solutions in a series of posts. One final point he made raises an issue that deserves an extended comment.

On this point Philip says “where I think there may be a significant difference between products is in their ability to discover relationships. However, I will not comment on this now as I am conducting research into this issue and plan to publish a detailed report in the New Year.”  This brief comment highlights what I believe is a rapidly emerging market.

Discovering hidden relationships is a crucial part of a market known by various aliases: entity resolution and analysis, entity analytics, and our favorite of course, identity resolution. Regardless of the name you use, identity resolution problems have distinct characteristics not adequately addressed by existing solutions for data quality (DQ). Neither are they addressed by customer relationship management (CRM), master data management (MDM), or business intelligence (BI).

Identity resolution solutions focus on a horizontal need: identifying bad actors in multiple industries. They require at a minimum the following capabilities:
1.    Identity matching through an extensive library of attribute-specific analytics
2.    Relationship detection and resolution
3.    Decisioning that leverages industry-standard and other rules-based systems
4.    Seamless integration with existing business processes via web services and APIs.

Until recently, identity resolution problems were overlooked, addressed by custom in-house applications, or served by cobbling together products from adjacent and overlapping markets. As the identity resolution market has emerged, vendors from these adjacent markets have tried to address the needs with existing products, but customers quickly learn that these products lack the combination of integrated capabilities needed to adequately address this unique problem area.

Identity resolution has a similar relationship to each of the four adjacent areas mentioned above. Each area causes the creation of data sources that can be consumed by identity resolution solutions, while the addition of identity resolution technology enhances the accuracy of DQ, BI, CRM, and MDM.

Here’s a chart that seeks to clarify these relationships and to characterize the strengths and weaknesses of the adjacent products when applied to identity resolution problems. The adjacent products simply can’t address identity resolution well because they were built for a different purpose. Do you agree?

Identity Resolution Daily Links 2008-12-12

Friday, December 12th, 2008

[Post from Infoglide] Part Deux: If Only Data Quality Were That Simple

“Applying generic algorithms to data attributes with wildly varying characteristics simply can’t match the accuracy of applying a family of deterministic analytics, each built around specific characteristics of a particular attribute type.”

Data Value Talk: The added value of an integrated customer view

“So it appears that the data itself plays a crucial role in the lack of an integrated customer view. Or more accurately, the better the data - the better the customer view.  And the better the matching of customer records across separate systems the better the integrated customer view. So Data Quality and Matching (Identity Resolution) determine in large parts the quality of the integrated customer view and the added value that it delivers.”

Marion Star: Muzzle loading and compensation

“Investigators from the Ohio Bureau of Workers’ Compensation, posing as gun enthusiasts, twice visited SMS. Those visits consisted primarily of small talk about guns and ammo. McGraw discussed some pistols that he had recently sold and invited one of the investigators to bring in an allegedly defective gun, telling them he would ‘take a look at it.’”

Intelligent Enterprise: ‘Surround Strategy:’ A Prediction for 2009

“Rather than trying to remodel the data warehouse to accommodate fresher and more detailed operational data (near real-time activity in operational systems, process logs, etc.), these data sources will operate in parallel (or horizontally, whichever word you like) as complementary feeds to analytics. It takes too long and is too expensive to expand the data warehouse concept to do this.”

New York State Insurance Department: Cortland Woman Accused of Workers’ Comp Fraud

“Horton is charged with making false statements and submitting false testimony to the Workers’ Compensation Board to receive benefits. She claimed that an April 2006 back injury she suffered while she was a health aide prevented her from working or attending school. Investigators learned that she was attending school full-time.”

Gartner: When is SOA, DOA? When it’s without MDM!

[Andrew White] “Clearly, if every SOA-based application interaction had to incur the costs of data reconciliation, mapping, clean up etc, then the cost of building and maintaining that SOA-based application would exceed what it costs today without SOA.  The bottom line: SOA needs MDM to help with the evolution of the information infrastructure.”

The State Journal: Insurance Fraud Unit Wins 45 Convictions This Year

“Since January 2007, the fraud unit has received 1,703 case referrals for review from those in the insurance industry and private citizens. After reviewing the referrals, field investigators have been assigned 397 cases to pursue. During that time, [West Va. Insurance Commissioner Jane] Cline said, 292 criminal cases have been referred to various prosecuting authorities, as well as in-house prosecutors who have been assigned to the unit on a full-time basis. Further, the fraud unit has secured indictments on 84 individuals for 294 felony counts and successfully obtained 73 convictions, including 45 in 2008.”

Part Deux: If Only Data Quality Were That Simple

Wednesday, December 10th, 2008

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

During the past two weeks, Philip Howard at Bloor Research has raised interesting questions about the nature and efficiency of data quality solutions in a series of posts entitled “The problem with data quality solutions.” Last week I responded on his blog and posted an expanded discussion of the same points here.

His fourth installment opens some interesting new topics. Perhaps the best approach is to lift some quotes and then respond below.

“Where I will comment is on the importance of understanding relationships not just between data elements but also between data and applications and even between data and the business. Understanding data relationships is arguably the most important factor whenever you are moving and transforming data, especially in data migration and data archiving environments but also for moving data into a warehouse and similar applications.”

We agree that finding non-obvious connections is crucial to building effective data quality solutions. Many technologies fall short in this regard. They are unable to evaluate relationships based on similarity when data is inconsistent. Philip’s simple example baffles many technologies:

“A typical case might be where one application required a five digit numeric field and another application requires the same five numbers plus an additional two alphabetic characters. So, here’s a question for data quality vendors: can your software tell the difference?”

Applying generic algorithms to data attributes with wildly varying characteristics simply can’t match the accuracy of applying a family of deterministic analytics, each built around specific characteristics of a particular attribute type.
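To make the five-digit example concrete, here is a minimal sketch of an attribute-specific check. It classifies a code value by its shape and then compares two values on their shared five-digit core. The function names and format labels are our own assumptions for illustration, not part of any product.

```python
import re

# Hypothetical shapes taken from the example: five digits, or the same
# five digits followed by two alphabetic characters.
FIVE_DIGIT = re.compile(r"\d{5}")
FIVE_DIGIT_PLUS_ALPHA = re.compile(r"\d{5}[A-Za-z]{2}")


def classify(value: str) -> str:
    """Return a label describing the shape of a code value."""
    value = value.strip()
    if FIVE_DIGIT.fullmatch(value):
        return "five_digit"
    if FIVE_DIGIT_PLUS_ALPHA.fullmatch(value):
        return "five_digit_plus_alpha"
    return "unknown"


def same_core(a: str, b: str) -> bool:
    """True when both values have a recognized shape and share the same five digits."""
    if "unknown" in (classify(a), classify(b)):
        return False
    return a.strip()[:5] == b.strip()[:5]
```

With this check, `same_core("12345", "12345AB")` treats the two fields as related while still reporting that their formats differ, which a generic string comparison would not tell you.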

He goes on: “Unfortunately, discovering relationships is not just about profiling your database. There may be relationships that exist across data sources (and types of data source) that you need to understand; and then there is the application factor. While it may not be theoretically correct from a purist data management perspective the fact is that many data relationships are defined within applications so, in one way or another, you really need to discover these.”

We couldn’t have articulated it any better. Many data quality solutions assume a higher degree of order than actually exists in the real world. Being able to deal with ambiguity (e.g., data sometimes missing, data entered in wrong fields) distinguishes the best technologies from their more simplistic brethren.
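As a rough sketch of what dealing with ambiguity can look like, the comparison below skips attributes that are missing on either side and also tries a swapped first/last name alignment, so a value typed into the wrong field can still match. The field names and the use of Python's standard `difflib` as the similarity measure are assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def field_score(rec_a: dict, rec_b: dict, pairs: list) -> float:
    """Mean similarity over (field_a, field_b) pairs where both values exist."""
    scores = [
        similarity(rec_a[fa], rec_b[fb])
        for fa, fb in pairs
        if rec_a.get(fa) and rec_b.get(fb)  # missing data is skipped, not penalized
    ]
    return sum(scores) / len(scores) if scores else 0.0


def compare(rec_a: dict, rec_b: dict) -> float:
    """Best score across the straight and the swapped name alignment."""
    straight = [("first", "first"), ("last", "last"), ("city", "city")]
    swapped = [("first", "last"), ("last", "first"), ("city", "city")]
    return max(field_score(rec_a, rec_b, straight),
               field_score(rec_a, rec_b, swapped))
```

A record with the names entered in the wrong fields and no city, such as `{"first": "Ripley", "last": "John", "city": None}`, still compares as a full match against `{"first": "John", "last": "Ripley", "city": "Austin"}`.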

This post is getting a little long, so we’ll continue this discussion next week. In the meantime, we’d like to hear your reaction.

Leveraging Identity Resolution Data Sources

Wednesday, November 19th, 2008

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

Ever have this experience? You’re searching Google for specific examples of a topic when you come across an already compiled and complete list of related examples – what a find! I had this experience recently when looking for contact information for people we wanted to invite to a marketing event, and voila! – I found a list of them that was 90% complete and accurate. Without that find, the project could have taken a couple days longer.

Aggregations of this type abound. Besides lists that individuals put together and post on the web, public and private databases offer all sorts of information on people that are useful in addressing multiple types of business problems and opportunities. Here are a few links to “people data” that you may not be aware of, and there are many more:

CapitalIQ

ChoicePoint

KnowX

NETR Online

PublicRecordFinder

SearchSystems

Who’s Who

Identities contained in multiple databases with varying schemas as well as ambiguous and sometimes missing attributes can be resolved to deliver a clear picture of a person and activities they are involved in. Here’s an example that illustrates how you can draw on multiple data sources to solve a complex problem.

Everyone knows about insider trading, especially with the recent allegations about Mark Cuban. Essentially, someone uses confidential knowledge about a financial transaction to buy or sell stock to their personal advantage.

Many illegal insider stock trades can be readily identified. How? By “similarity searching” across records of stock trades, associated timelines (who knew what and when about the event), and public company financial institution data (e.g., CapitalIQ), then finding hidden relationships using biographical information (e.g., Who’s Who), background screening and residential information (e.g., ChoicePoint), and other public and private sources.

There are many more cases where identity resolution can exploit available data sources to address complex problems. Making sense out of these massive amounts of data by aggregating and sifting through them requires an ability to score the results accurately. Just as importantly, you need to be able to configure the scoring to fit the specific problem, i.e., the solution must be tuned to meet unique requirements.
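One way to picture configurable scoring is a weighted average whose weights and match threshold live in a per-problem profile. The sketch below is only an illustration: the profile names, weights, thresholds, and attribute scores are invented numbers, and the per-attribute scores are assumed to come from some upstream comparison step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringProfile:
    """Hypothetical tuning knobs for one business problem."""
    weights: dict      # attribute name -> relative importance
    threshold: float   # minimum overall score to report a match


def overall_score(attribute_scores: dict, profile: ScoringProfile) -> float:
    """Weighted average of per-attribute scores (each assumed to be in [0, 1])."""
    used = {k: w for k, w in profile.weights.items() if k in attribute_scores}
    total = sum(used.values())
    if total == 0:
        return 0.0
    return sum(attribute_scores[k] * w for k, w in used.items()) / total


def is_match(attribute_scores: dict, profile: ScoringProfile) -> bool:
    return overall_score(attribute_scores, profile) >= profile.threshold


# Two illustrative profiles; the numbers are made up for the example.
insider_trading = ScoringProfile({"name": 1.0, "address": 3.0, "employer": 2.0}, 0.70)
retail_returns = ScoringProfile({"name": 3.0, "address": 1.0, "employer": 0.0}, 0.90)

scores = {"name": 0.6, "address": 1.0, "employer": 0.9}
```

The same attribute scores count as a match under the first profile and not under the second, which is the point of tuning: the data does not change, only the weight given to each piece of evidence.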

Solving complex business problems often requires knowing more about who you’re dealing with and their relationships. Vast amounts of data are accessible online via APIs and web services, and they can be incorporated into new kinds of online applications that once were impossible.

Identity Resolution Daily Links 2008-6-27

Friday, June 27th, 2008

[Post from Infoglide] The Importance of Identity Resolution to MDM

“In a recent post on the Hub Solution Designs blog, I wrote about the importance of integration to Master Data Management (MDM). Today, I’d like to dig into the importance of identity resolution to MDM.”

b-eye.com - Business Intelligence Network: Loss Prevention in the Retail Environment

“Leading retailers are using business intelligence to implement enterprise-wide loss prevention solutions that take into account the wide variety of factors that contribute to economic loss.”

DataFlux Community of Experts: Data Quality Strategy

“I was just creating a data quality strategy for a customer going through a conversion from multiple sources to one integrated database. While doing the conversion, new software will be introduced and propagated out to the organization. Not an easy task for any organization!”

Wired Danger Room: FBI Data-Mining Slashed After G-Men Dis Congress

“Earlier today, a House appropriators voted to pull $11 million to expand a controversial FBI data-mining project, after the Bureau repeatedly stiff-armed Congressmen and their gumshoes in the Government Accountability Office.”

b-eye.com - Business Intelligence Network: The Dirty Job of Data Governance

“Hence, this inaugural edition of the Jill Dyché Data Governance newsletter, courtesy of our buddies at the Business Intelligence Network (BeyeNETWORK.com). We’re encountering some new clients who’ve asked us to do cleanup duty on their failed data governance efforts, and that’s taught us a lot about how to deliver data governance, data management, data quality, master data management (MDM) and data integration initiatives the right way. Through this monthly newsletter, I’ll be sharing some of those hard-won lessons with you.”

Data Governance and Data Quality Insider: Data Quality Events – Powerful and Cozy

“For those of you who enjoy hobnobbing with the information quality community, I have a couple of recommendations for you. These events are your chance to rub elbows with different factions of the community. In the case of these events, the crowds are small but the information is powerful.”

Identity Resolution Daily Links 2008-3-17

Monday, March 17th, 2008

PogoWasRight.org: Dept. of Homeland Security 2008 Data Mining Report

“From their site: ‘This is the third report by the Privacy Office to Congress on data mining.’”

North Country gazette: Insurance Report: Fraud Arrests Up 17%

“The Bureau reported that most insurance fraud cases are now tracked electronically by means of a web-based case management system that became fully operational in 2007. The system allows insurers to electronically transmit suspected fraud reports to the Bureau and allows Bureau personnel to track investigative tasks and events from initial assignment through closure.”

Data Governance and Data Quality Insider: Data Governance in a Recession

“Talk of a recession may slide your plans for big projects like master data management and data governance onto the back burner. Instead, you may be asked to be more tactical – solving problems at a project level rather than an enterprise level. . . . The good news is that times will get better. If and when there is a recession, we most certainly DON’T want to have to rewire and re-do our efforts later on. If you are asked to become more tactical, there are some things to keep in mind that’ll save you strategic frustration:”

ComputerWeekly.com: Speakers at BCS event unravel data privacy issues

“There has always been a need for information sharing within and across enterprises. The aim of the information system, then, should be to provide the right information, to the right person, at the right time. For this to happen, developers should focus on the whole system rather than just the technology.”

Identity Resolution Daily Links 2008-3-14

Friday, March 14th, 2008

[Post from Infoglide Software] Fraud Tax Increase

“We recently featured a link to an article titled ‘You may be paying $400 to $600 a year to offset shoplifting costs’. When we coined the phrase ‘fraud tax’ in a previous post, we estimated the cost of both property and casualty insurance and retail fraud combined to be about $600 per household. . . . Apparently, we may need to adjust our figure if retail theft alone costs as much as $600 per person, as is speculated in the article.”

DataFlux Community of Experts: Is MDM the Same as Data Quality?

“I just reviewed a handful of case studies on master data management, and I had the distinct feeling of deja vu. Many of the MDM programs in the case studies centered (of course) on customer data, and even more pointedly, on the matching and linkage aspects of customer data integration. Considering that five years ago (prior to the creation of the term ‘master data management’), these case studies would have been touted as best practices in data quality.”

Evolution of Security: Apple MacBook Airs are Cleared for Takeoff

“I’ve never taken part in the war between Mac & PC users… I’ve used both and I enjoy using both, but I thought surely the TSA wasn’t diving into the digital trenches and waging war against Apple. I know we’re a versatile agency, but I would have to admit this would definitely be mission creep.”

Public Opinion: Thefts of Oil of Olay, other beauty products, on the rise

“In some cases, the people who are stealing aren’t working alone. An investigation into shoplifting in Polk County, Fla., uncovered a multimillion-dollar theft ring, leading to 18 people being charged in January. According to the Ledger newspaper in Lakeland, Fla., $60 million to $100 million in merchandise was stolen, which was then transported to ‘fences’ who resold the products at flea markets and with online auctions, such as eBay. Police began investigating the theft ring after two of its members were arrested for retail theft. A search on eBay revealed more than 1,100 products for sale, with many of them being sold for prices that are considerably less than retail value.”

ZDNet: How an information system helped nail Eliot Spitzer and a prostitution ring

“On the surface, Spitzer’s downfall is a New York tabloid’s dream. . . . But what really snared Spitzer was a money laundering investigation that was flagged by suspicious activity reports (SARs) that banks have to file with the Treasury to surface everything from money laundering to terrorist activity.”

Identity Resolution Daily Links 2008-2-8

Friday, February 8th, 2008

ars technica: TSA answers our question, changes policy after blog comments

“When the TSA launched its Evolution of Security blog last month in an effort to engage in open communication with travelers, my colleague Jon Stokes expressed skepticism and characterized as laughable the prospect of TSA acting on feedback from the public. Having written about the ineptitude of TSA in the past myself, I was emphatically inclined to agree with Jon. It looks like we might have passed judgment prematurely, because it only took TSA one week to start making broad policy changes in response to feedback received through the blog.”

Roseville & Rocklin Today: Organized Retail Crime Ring Busted

“Dave Kemp spotted suspicious activity by a shopper within the Longs store on Stanford Ranch Road in Rocklin and obtained the license plate number of the car that the shopper and two others left in. He suspected that the shoppers were involved in ORC based upon the methods of filling shopping carts; specifically, the type of products being taken and the method of hiding the products in the shopping cart.”

cnet: How will Real ID affect you?

“The Real ID law is touted by Homeland Security officials as an anticrime and antiterror measure, but is steadfastly opposed by some state governments on privacy and sovereignty grounds. Computer scientists also have raised concerns about how its creation of a national interlinked database would work in practice. . . . Real ID will require states to share detailed information about anyone with a state ID card or driver’s license, perhaps through a network called AAMVAnet, which the Department of Transportation is paying to expand in hopes of supporting the massive amount of data that will be exchanged. Databases owned by Social Security and U.S. Citizenship and Immigration Services will also be integrated.”

casino.co.uk: Is Casino Fraud Increasing?

“There has been a flurry of police investigations and arrests on the subject of Casino fraud, although surprisingly it has been a number of casino employees whom have been arrested. So is Casino fraud on the increase or is this just a spate of fraud cases hitting the media?”

b-eye.com - Business Intelligence Network: What is Master Data?

“The enormous interest in master data management (MDM) that has appeared in the past couple of years has not yet generated a great deal of methodological progress. Hopefully, as data professionals, consultants, and vendors grapple with the complex issues involved, the situation will improve. A central problem, however, is that there is little agreement about what master data is.”

[Note: Look for our upcoming blog post on Identity Resolution vs. Master Data Management, a part of our series on Mistaken Identity Resolution.]

Identity Resolution Daily Links 2008-1-21

Monday, January 21st, 2008

EyeforRetail: Loss Prevention Executive Interviews

“Q: What do you think has been the most successful move against internal loss in retail in the last 5 years? A: Generally, the increased use of technology such as video, data mining and suchlike. In addition to this, retailers are beginning to make fundamental changes and are safeguarding their organisations more effectively. Of course, the answer can’t be found in data mining advances, staff recruitment, training or any other solution singularly; effective screening, instruction, control and detection are ALL key and will be the only way for retailers to win the internal loss battle.”

Insurance Journal: New York Says Insurance Fraud Arrests Are Up

“Beginning this year, the bureau has established a major case unit that will focus on the investigation of systemic insurance fraud involving organized conspiracies. The unit will be headed up by a deputy chief investigator and will include five investigators who were selected from the bureau’s specialized units. The new unit will take the lead in investigating complex insurance cases involving no-fault, commercial rate evasion, health care fraud and workers’ compensation premium fraud.”

The Creswell Chronicle: New laws going into effect Jan. 1

“SB 331 A: Creates a new crime of ‘organized retail theft’. To be guilty of organized retail theft, the state must establish that: (1) the person stole merchandise; (2) from a mercantile establishment; (3) the person acted in concert with another person; and (4) the aggregate value of the merchandise within a 90-day period exceeds $5,000. SB 331 A also places organized retail theft within Oregon’s RICO (Racketeer Influenced and Corrupt Organization Act) statute.”

Rusty Hubcaps and Old Boots: A Fish Tale by John Ripley, Chief Software Architect - Part II

Monday, December 10th, 2007

[NOTE: We apologize that this is late in getting posted. We were having some technical problems that we just resolved.]

We left off Wednesday with Fisherman Bob and his catch of fish, lobsters, rusty hubcaps, and old boots.

Practically speaking, the balance of “false positives” (records I shouldn’t have found but did) and “false negatives” (records I should have found but didn’t) is something that challenges anyone in the business of information retrieval.
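The balance is usually measured as precision (how much of the catch is actually fish) and recall (how many of the fish in the water were caught). Here is a small sketch that computes both from a returned set and a relevant set; the function name and the sample catch are ours.

```python
def precision_recall(returned: set, relevant: set) -> tuple:
    """Return (precision, recall) for one search result."""
    true_pos = len(returned & relevant)
    false_pos = len(returned - relevant)  # found but should not have been
    false_neg = len(relevant - returned)  # should have been found but were not
    precision = true_pos / (true_pos + false_pos) if returned else 1.0
    recall = true_pos / (true_pos + false_neg) if relevant else 1.0
    return precision, recall


# Fisherman Bob's net: two good catches, two pieces of junk, one crab missed.
catch = {"fish", "lobster", "old boot", "rusty hubcap"}
wanted = {"fish", "lobster", "crab"}
precision, recall = precision_recall(catch, wanted)
```

Here half of the catch is junk (precision 0.5) and one of the three wanted items got away (recall of two thirds).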

In the world of retail returns fraud, it is a question of potentially turning a loyal customer into a former customer. We either:

  • Deny their return because of a false positive hit against a “known shoplifters” database or
  • Let “Ima Theef” return a shoplifted item because the shoplifter database contained “Iama Thief” but the tightened matching policy is such that “Ima McLoyalCustomer” would not be denied.
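The trade-off in the two cases above comes down to where the matching threshold sits. As a simplified sketch (using plain edit distance as a stand-in for whatever measure a real system applies), the code below shows the same watchlist entry catching or missing “Ima Theef” depending on how tight the policy is. The function names and threshold values are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """1.0 for identical names, falling toward 0.0 as edits accumulate."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest


def is_flagged(customer: str, watchlist: list, threshold: float) -> bool:
    """True when the customer is similar enough to any watchlist name."""
    return any(name_similarity(customer, w) >= threshold for w in watchlist)


watchlist = ["Iama Thief"]
```

With a tight threshold of 0.9, `is_flagged("Ima Theef", watchlist, 0.9)` is false and the shoplifter gets the refund. Loosening it to 0.75 flags “Ima Theef” while “Ima McLoyalCustomer” still passes, but every step looser raises the odds of denying a loyal customer.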

In the retail world, I suspect the tendency is to lean toward avoiding false “positives” because after all, the customer is always right.

Contrast that with the world of background checks and employee vetting. The consequences of missing a hit against the sex offender registry because a newly hired school cafeteria worker misspelled his last name and changed his date of birth by 1 digit on his application are immeasurable. The parameters of this search would most likely err on the side of caution and consider multiple variations on name and date of birth.
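To illustrate what “multiple variations” on a date of birth might mean, here is a small sketch that generates every valid calendar date differing from the stated one by exactly one digit. The ISO date format and the function name are assumptions for the example; a cautious search could then check the registry against each variant.

```python
from datetime import datetime


def one_digit_variants(dob: str) -> set:
    """All valid YYYY-MM-DD dates that differ from dob in exactly one digit."""
    variants = set()
    for i, ch in enumerate(dob):
        if not ch.isdigit():
            continue  # leave the separators alone
        for digit in "0123456789":
            if digit == ch:
                continue
            candidate = dob[:i] + digit + dob[i + 1:]
            try:
                datetime.strptime(candidate, "%Y-%m-%d")
            except ValueError:
                continue  # not a real calendar date, e.g. month 21
            variants.add(candidate)
    return variants


variants = one_digit_variants("1970-01-15")
```

The result includes dates such as 1970-01-16 and 1971-01-15 and leaves out impossible ones, so the expanded search stays bounded even though it is deliberately generous.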

In either case, the search problems are similar regardless of the tendency towards false positives or false negatives. Ultimately, whether searching with a wide net or a standard one, the results of the search should be qualified. There needs to be a way to remove the “old boot” from the record set before it gets to the consumers of the data. They should have confidence that records of relevance were returned to them. In principle it sounds like a simple expectation to fulfill, but in practical terms, it is not easy to achieve in a consistent, automated fashion. We hear it regularly from our customers in need of a better solution.

Infoglide Software to the RESQ! No, that is not a typo, it is my not-so-clever acronym to describe what Identity Resolution Engine(tm) (IRE) offers for the search challenges I have mentioned: Restrict/Expand, Search, and Qualify.

Restrict. No, you can’t boil the ocean. Whether limited by processing power, memory, single search performance, bandwidth, data volume, or overall throughput, at some point you will be unable to do a complete “deep-dive” search and still achieve the required performance characteristics. The system needs an intelligent means to logically subset the searchable data.

Expand. Using exact matches against indexed fields has been the bread and butter of the relational database market for the past 30 years. They are incredibly fast at doing it, but that is not enough. We need to find variations in the data. We need to find things that are “like” what we are looking for. We need to find things that are “near” what we are looking for. The system needs an intelligent means to expand our search net so that our candidate set more-than-likely contains our records of interest. The more “more-than-likely” the better. Obviously, returning every record on every search guarantees 100% recall, but the answer lies in between.

Search. Of course, we have to physically search at some point in our processing. That search may be a SQL query with all the search criteria restricted and expanded as appropriate, a high-performance in-memory database, or a web-service call to a data provider, to name a few. In any case, we end up with a result set of records that, in some shape or form, more than likely contain the records we are looking for.

Qualify. The records in our candidate set have been included because they have, for better or worse, met one or more of the restrictive or expanded parameters in our search. However, as a whole, the record may be discarded because of other attributes, contradictory information, or some other reason to disqualify the record.

As a practical (yet trivial) example, one of my restriction/expansion strategies may have been to do a “starts with” when searching for “John Ripley.” That would yield a search that might be: find me all records where first name starts with ‘Jo’ and last name starts with ‘Rip.’ In this simple example, I have avoided boiling the ocean and only asked to return records that are like Jo___ and Rip___. In my record set I have Joe Riplinski, Jon Ripley, Jolene Ripple, Joseph Ripcord, Johnathan Smith (Riptide California), John Ripley, Rip Johnson, and Cal Joe. Ripken.
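For readers who want to see the “starts with” query in code, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names are hypothetical, and the query is stricter than the result list above: it only looks at two structured name columns, so hits that came from other fields (a city named Riptide, for instance) would need a looser query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        ("Joe", "Riplinski"), ("Jon", "Ripley"), ("Jolene", "Ripple"),
        ("Joseph", "Ripcord"), ("John", "Ripley"),
        ("Rip", "Johnson"), ("Jane", "Smith"),
    ],
)


def starts_with_search(first_prefix: str, last_prefix: str) -> list:
    """Prefix search; values are bound as parameters, not spliced into the SQL."""
    return conn.execute(
        "SELECT first_name, last_name FROM people "
        "WHERE first_name LIKE ? AND last_name LIKE ? "
        "ORDER BY last_name, first_name",
        (first_prefix + "%", last_prefix + "%"),
    ).fetchall()


candidates = starts_with_search("Jo", "Rip")
```

The query returns the five Jo___ Rip___ rows and skips the rest of the table, which is the restrict step doing its job. Deciding which of those five are worth keeping is a separate question.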

Effectively, most of the records are “old boots” and “rusty hubcaps.” The system needs an intelligent means to separate the quality matches from the junk. IRE’s patented “Similarity Search,” with its large library of configurable measures, both generic and domain-specific, can be applied across one or more attributes, qualify the degree of similarity of a potential record, and either include or exclude it from the final record set based on configuration-time or search-time business logic. In this example, John Ripley would qualify 1st, with Jon Ripley coming in 2nd and so on. At some point the scores would be low enough to discard the record completely.
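The qualify step can be sketched in a few lines. The code below is not IRE's Similarity Search; it swaps in the generic `difflib` ratio from Python's standard library, averages it over first and last name, and drops anything under a threshold. The candidate list is a structured simplification of the record set above, and the 0.85 cut-off is an arbitrary choice for the example.

```python
from difflib import SequenceMatcher


def sim(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def qualify(query: tuple, candidates: list, threshold: float) -> list:
    """Score each (first, last) candidate and keep those at or above threshold."""
    q_first, q_last = query
    scored = []
    for first, last in candidates:
        score = (sim(q_first, first) + sim(q_last, last)) / 2
        if score >= threshold:
            scored.append((score, first, last))
    scored.sort(key=lambda row: row[0], reverse=True)
    return [(f"{first} {last}", round(score, 3)) for score, first, last in scored]


candidates = [
    ("Joe", "Riplinski"), ("Jon", "Ripley"), ("Jolene", "Ripple"),
    ("Joseph", "Ripcord"), ("Johnathan", "Smith"), ("John", "Ripley"),
    ("Rip", "Johnson"), ("Cal", "Ripken"),
]
ranked = qualify(("John", "Ripley"), candidates, threshold=0.0)
kept = qualify(("John", "Ripley"), candidates, threshold=0.85)
```

Even with this generic measure, John Ripley ranks first and Jon Ripley second, and the 0.85 threshold discards the old boots. Attribute-specific measures would sharpen the ordering further down the list.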

IRE has many configurable techniques to handle the “intelligent” part of restrict and expand such as nicknames, field transpositions, word stemming, ranges, and geographic proximity just to name a few.
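Two of those techniques, nicknames and field transpositions, are easy to sketch. The nickname groups below are a tiny hypothetical table made up for this post, and the function names are ours; the idea is simply to turn one query into the set of queries worth running.

```python
# Hypothetical nickname groups; a real deployment would use a curated table.
NICKNAME_GROUPS = [
    {"john", "jon", "johnny", "jack"},
    {"robert", "rob", "bob", "bobby"},
]


def nickname_variants(name: str) -> set:
    """The name itself plus every name that shares a nickname group with it."""
    name = name.lower()
    variants = {name}
    for group in NICKNAME_GROUPS:
        if name in group:
            variants |= group
    return variants


def expand_query(first: str, last: str) -> set:
    """All (first, last) pairs to search, including swapped fields."""
    last = last.lower()
    expanded = set()
    for variant in nickname_variants(first):
        expanded.add((variant, last))
        expanded.add((last, variant))  # field transposition
    return expanded


queries = expand_query("Bob", "Barker")
```

A search for Bob Barker now also looks for Robert Barker and for records where the two names landed in each other's fields.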

The example above uses two fields (first and last name). With only that little amount of information, how we qualify the record (with respect to the search criteria) is limited because we only have name information. Imagine if we also had address and date of birth information. We could make more informed decisions . . . perhaps house-holding logic, familial relationship information (parent and children in the same house), and neighbors. But like the HDTV search discussed on Wednesday, we have been trained by ineffective search strategies to limit the types of things we ask for. We went from searching brand, model number, type of electronics, and feature keywords “Samsung LX-53567 TV HD 1080p” down to brand and type (“Samsung TV”). With the techniques of restrict, expand, search, and qualify, our initial search would not have yielded 0 results but instead “Samsung LX-53576 TV HD 1080i” and “Samsung VX-3345 TV HD 1080i” and so on, ranked by strength of match.
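The HDTV case can be sketched the same way. Below, an all-tokens-must-match search returns nothing for the detailed query, while a token-by-token fuzzy score ranks the two Samsung models by strength of match. The scoring scheme, the 0.6 threshold, and the third catalog entry are assumptions invented for the illustration.

```python
from difflib import SequenceMatcher

CATALOG = [
    "Samsung LX-53576 TV HD 1080i",
    "Samsung VX-3345 TV HD 1080i",
    "Sony KD-100 Radio",  # invented filler item, not from the original search
]


def exact_search(query: str, catalog: list) -> list:
    """Every query token must appear verbatim in the product description."""
    tokens = query.lower().split()
    return [p for p in catalog if all(t in p.lower().split() for t in tokens)]


def token_score(query: str, product: str) -> float:
    """Average, over query tokens, of the best fuzzy match among product tokens."""
    product_tokens = product.lower().split()
    best = [
        max(SequenceMatcher(None, q, p).ratio() for p in product_tokens)
        for q in query.lower().split()
    ]
    return sum(best) / len(best)


def ranked_search(query: str, catalog: list, threshold: float = 0.6) -> list:
    """Products at or above threshold, strongest match first."""
    scored = [(token_score(query, p), p) for p in catalog]
    return [p for score, p in sorted(scored, reverse=True) if score >= threshold]


query = "Samsung LX-53567 TV HD 1080p"
```

`exact_search(query, CATALOG)` comes back empty because of the transposed model digits and the 1080p/1080i mismatch, while `ranked_search(query, CATALOG)` returns the LX-53576 first and the VX-3345 second.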

The power of the RESQ search pattern lies not only in the ability to configure each phase of the process with a variety of restrictors, expanders, and qualifiers, but in your ability to actually replace them with a completely different implementation to meet a particular customer and data need. Meanwhile, the pattern remains the same: Flexible yet powerful.
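To show what “replaceable” could mean in code, here is a toy sketch of the pattern with each phase as a plain callable that can be swapped independently. This is our own illustrative wiring, not IRE's architecture: the class name, the phase signatures, and the trivial sample implementations are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ResqPipeline:
    """Hypothetical wiring of the four phases; each one is a plain callable."""
    restrict: Callable  # (query, records) -> subset of records
    expand: Callable    # (query) -> list of query variants
    search: Callable    # (query variant, records) -> list of candidate records
    qualify: Callable   # (query, record) -> score in [0, 1]
    threshold: float = 0.5

    def run(self, query: dict, records: list) -> list:
        subset = list(self.restrict(query, records))
        candidates = []
        for variant in self.expand(query):
            for record in self.search(variant, subset):
                if record not in candidates:
                    candidates.append(record)
        scored = [(self.qualify(query, r), r) for r in candidates]
        kept = [(score, r) for score, r in scored if score >= self.threshold]
        return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Deliberately simple phase implementations, for illustration only.
def same_state(query, records):
    return [r for r in records if r["state"] == query["state"]]


def first_name_variants(query):
    variants = {"john": ["john", "jon"]}.get(query["first"], [query["first"]])
    return [{**query, "first": v} for v in variants]


def exact_name(query, records):
    return [r for r in records
            if r["first"] == query["first"] and r["last"] == query["last"]]


def name_overlap(query, record):
    fields = ("first", "last")
    return sum(query[f] == record[f] for f in fields) / len(fields)


records = [
    {"first": "john", "last": "ripley", "state": "TX"},
    {"first": "jon", "last": "ripley", "state": "TX"},
    {"first": "john", "last": "ripley", "state": "CA"},
    {"first": "joe", "last": "riplinski", "state": "TX"},
]
query = {"first": "john", "last": "ripley", "state": "TX"}
pipeline = ResqPipeline(same_state, first_name_variants, exact_name, name_overlap)
results = pipeline.run(query, records)
```

Swapping `exact_name` for a database query, or `name_overlap` for a richer similarity measure, changes the behavior of one phase without touching the others or the `run` method.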

