Archive for the ‘Data Management’ Category

Identity Resolution Daily Links 2009-06-22

Monday, June 22nd, 2009

By the Infoglide Team

intelligent enterprise: They Better Get This MDM Program Right

“As reported in The New York Times and on the TSA Web site, the Secure Flight program will improve upon current practices in matching passenger identities to watch lists in many ways. At first glance, this appears to be a well thought-out program that conforms to several basic tenets of Master Data Management (in bold below), in this case for the ‘Customer’ entity.”

EHRWMS: Georgia’s Best EMR Used By Three of Top Ten Pediatricians

“Of approximately 100 respondents, 28 used an EMR, of which 40% used the EncounterPRO Pediatric EMR. There were only three other EMRs used more than once, and they were used by only 10%, 7%, and 7% of the survey respondents respectively.”

Government Executive: Enforcement agencies boost cooperation on drug investigations

“In addition, ICE agents for the first time will fully participate in the Organized Crime Drug Enforcement Task Force Fusion Center. The center allows participating federal, state and local law enforcement agencies, including DEA and the FBI, to share information and analytical resources to enhance their overall investigative capacity.”

SmartData Collective: The Data-Information Continuum

“Data could be considered a constant while information is a variable that redefines data for each specific use. Data is not truly a constant since it is constantly changing. However, information is still derived from data and many different derivations can be performed while data is in the same state (i.e. before it changes again).”

Identity Resolution Daily Links 2009-06-12

Friday, June 12th, 2009

[Post from Infoglide] Data Source Disintermediation?

“According to Wikipedia, ‘disintermediation is the removal of intermediaries in a supply chain: ‘cutting out the middleman’… Buyers bypass the middlemen (wholesalers and retailers) in order to buy directly from the manufacturer and thereby pay less.’”

[Jim Harris] OCDQ Blog: The Two Headed Monster of Data Matching

“Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household). Data matching is commonly plagued by what I refer to as The Two Headed Monster…”

CorpWatch: CorpWatch announces release of the CrocTail application and open CorpWatch API

“CrocTail provides an interface for browsing information about several hundred thousand U.S. publicly traded corporations and their many foreign and domestic subsidiaries. Information from company Securities and Exchange Commission (SEC) filings has been parsed and annotated by CorpWatch to highlight specific corporate accountability issues. CrocTail also serves as a demonstration of the features and data available through the CorpWatch API.”

Vos Is Neias: Washington - TSA Advising Travelers To Book Airline Tickets Using Full Real Names

“While the T.S.A. has announced Aug. 15 as a target date for the airlines to begin asking for each passenger’s full name, gender and date of birth, and has already begun publicizing the program, called Secure Flight, the agency acknowledged that it would go into effect in phases as the airlines update their systems.”

Data Source Disintermediation?

Wednesday, June 10th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

According to Wikipedia, “disintermediation is the removal of intermediaries in a supply chain: ‘cutting out the middleman’… Buyers bypass the middlemen (wholesalers and retailers) in order to buy directly from the manufacturer and thereby pay less.” Some famous disintermediation examples are:

•    Bookselling (e.g., Amazon’s long-tail marketing of millions of books online)
•    Travel (e.g., Southwest Airlines selling tickets direct to consumers on the web)
•    Computers (e.g., Dell selling computers direct to consumer and businesses over the internet).

Disintermediation was THE hot topic during the dot com boom, but the heady prediction that virtually every industry would be disintermediated has yet to become a reality. Nevertheless, over the past decade or so we’ve all tracked the news as one business model after another is attacked by competitors who seek a way to “disintermediate” a particular sector.

Part of the power of identity resolution solutions derives from the data sources upon which they’re based, and both the quantity and quality of data sources can affect the results. One challenging identity resolution problem we’ve written about that relies on a variety of data sources is insider trading (see Leveraging Identity Resolution Data Sources). Drawing on multiple internal and external, public and private data sources, identity resolution unwinds multiple degrees of business, friendship, and familial relationships to uncover likely illegal stock market gains.

Now potential disintermediation plays related to data sources are emerging. CrunchBase is a well-known example, offering a free database of technology companies, people, and investors that anyone can edit. San Francisco-based CorpWatch is a non-profit engaged in “investigative research and journalism to expose corporate malfeasance and to advocate for multinational corporate accountability and transparency”. They’ve just announced an API that makes it easier to search SEC data:

“Although the SEC provides a search interface for locating company filings (EDGAR / IDEA), and the subsidiary information is not presented in a standardized format suitable for automated use or insertion into a database. The CorpWatch API uses parsers to “scrape” the subsidiary relationship information from Exhibit 21 of the 10-K filings and provides a well-structured interface for programs to query and process the subsidiary data.”

The free CorpWatch API enables identity resolution and other applications to look up the formal and alternate names of corporations, ascertain their relationships to other corporations, find their locations around the world, and access other useful information. Up to now, you could only get this kind of information through relatively expensive paid subscriptions from commercial data providers.

Is it possible that the efforts of organizations like CorpWatch point to a future in which an abundance of new, free sources of data will make it even easier to create identity resolution applications?

Identity Resolution Daily Links 2009-05-18

Monday, May 18th, 2009

By the Infoglide Team

e-patients.net: Meaningful Use: The Elephant IS In The Room

“A recent NPR/Kaiser Family Foundation poll shows that the American public is surprisingly more positive about the potentials of EHRs than most professionals. People already are familiar with computerized information and accept its risks.”

IT-Director.com: Trends in Master Data Management

“The interesting question is how much pressure this puts on the other MDM players with data quality solutions (like Dataflux and SAP/Business Objects) to build out their data profiling capabilities into the area of data discovery.”

NationalSecurity.org: MYTHBUSTER: TSA’S WATCH LIST IS MORE THAN 1 MILLION PEOPLE STRONG

“There are less than 400,000 individuals on the consolidated terrorist watch list and less than 50,000 individuals on the no-fly and selectee lists. Individuals on the no-fly and selectee lists are identified by law enforcement and intelligence partners as legitimate threats to transportation requiring either additional screening or prohibition from boarding an aircraft.”

OCDQ Blog: TDWI World Conference Chicago 2009

“TDWI World Conference Chicago 2009 was held May 3-8 in Chicago, Illinois at the Hyatt Regency Hotel and was a tremendous success.  I attended as a Data Quality Journalist for the International Association for Information and Data Quality (IAIDQ). I used Twitter to provide live reporting from the conference.  Here are my notes from the courses I attended…”

Identity Resolution Daily Links 2009-05-14

Thursday, May 14th, 2009

[Post from Infoglide] Let’s Be Reasonable

“A recent post, ‘Terrorist Watchlist, Troubling Flaws Revealed’, starts out by making a valid point. If the terrorist watchlist is flawed, then the name matching results against such a list will be flawed. The author then goes on to reach related conclusions through rationalization rather than reasoning.”

Acxiom: Prognostications for the New Year 

“Identity resolution will get its due. Sure, you can call it infrastructure. Processing and rules intensive, customer identity resolution has been relegated to the underlying algorithms of third-party data providers, MDM, and data quality vendors. However, companies are recognizing that they may have unique customer data-matching needs - a bank we work with has more than 50 definitions of a household - and they’ll be looking at smarter, more specialized ways to automate them.”

Dallas Morning News: Dallas Police Department’s Fusion Center outsmarts criminals

“Chief David Kunkle, who championed the unit’s formation in January 2007, refers to it as the “brains” of a department that reported a 10 percent drop in crime last year and a nearly 19 percent decline in the first quarter of this year.”

datanomic: Fractured approaches to Sanctions Screening put UK Companies at risk, says new FSA report

“‘The use of multiple identities is common in the criminal world and Al-Qaeda’s own training manual requires its operatives to use false identities to hide their terrorist activities. Exploiting variations of a criminal’s real name is, perhaps, the simplest way of acquiring a new identity. Typical approaches are to use name variations or switching the order of names,’ added Pearson. ‘Other data, such as dates of birth are often manipulated simply by transposing digits.’”

Cloud Computing Journal: Experian QAS Launches QAS Pro On Demand

“‘By offering address verification in a SaaS model, we are enabling organizations of all sizes to maintain accurate contact data in a cost-effective platform,’ said Joel Curry, chief operating officer, Experian QAS. ‘As businesses change over time, our new infrastructure is able to adapt to shifting demands.’”

Identity Resolution Daily Links 2009-05-11

Monday, May 11th, 2009

By the Infoglide Team

BI Blogs: Business Intelligence - The Unconquered Territories

“Let’s face it - There are technology limitations. Operational BI (Lack of real-time data access), Guided analytics (Lack of comprehensive business metadata), Information as a Service (Lack of SOA based BI architecture) are some of those technology limitations that come to my mind.”

SecurityInfoWatch: RILA survey: Retail crime on rise

“Some 72 percent of respondents said they have seen an increase in organized retail crime (ORC), and 52 percent said they had experienced a rise in financial fraud. Paul Jones, vice president of asset protection for RILA, noted that the increase in ORC should set off alarms not only within the retail community, but also within the business and law enforcement community. Organized retail crime typically involves organized groups of criminals operating shoplifting rings which have networks to fence their stolen goods, which may also appear on Internet auction sites like eBay, as well as at flea markets.”

Fast Company: Work/Life: “Secure Flight” Takes to the Air in August

“So now is the time to examine your driver’s license or passport to see that your first name, middle initial (if you use one), and last name appeared exactly the same across all of your identification. If you need a new photo for your driver’s license, now is the time to get it. Being consistent with your name also means that all of your bookings - including air, hotel, and car rental - must be consistent.”

SmartDataCollective: Enterprise Data World 2009

[Jim Harris] “Enterprise Data World is the business world’s most comprehensive vendor-neutral educational event about data and information management.  This year’s program was bigger than ever before, with more sessions, more case studies, and more can’t-miss content.”

Identity Resolution Daily Links 2009-04-03

Friday, April 3rd, 2009

[Post from Infoglide] Secure Flight and Identity Resolution

“Most of the time we’re just heads down doing our job of providing identity resolution solutions for our customers. But this is one of those weeks at Infoglide when we appreciate the opportunities we’ve had to make a difference. TSA announced that the Secure Flight Program has begun vetting passengers.”

[Post from Infoglide] TDWI Interview: Identity Resolution Reveals

“In a comprehensive discussion, Doug Wood of Infoglide Software spoke about an area of confusion that exists when people discuss identity resolution. He pointed out that the term is sometimes misapplied to describe software that performs data matching alone.”

data quality PRO: Identifying Duplicate Customers (Part 2)

[Jim Harris] “False negatives can be caused when the greater concern about false positives motivates a cautious approach to duplicate identification. This leads many projects to adopt a strategy allowing only exact matches. Therefore, let’s begin by looking for duplicates where the exact same information is repeated on multiple records – meaning where all attributes are populated and have the same value…”

TMCnet.com: Workers’ comp fraud rising: Task force to make unannounced visits

“‘The theory is that in a difficult economic climate, crime tends to go up as does fraud. People who aren’t otherwise motivated to be dishonest may follow that path,’ said Maureen O’Connell deputy district attorney with the fraud unit. The unit has seen a marked increase in workers’ compensation fraud over the past year and has had to double the fraud unit’s staff to handle the volume, she said.”

data quality PRO: External Reference Data - An Overview

“My guess is that exploiting external reference data as an important element in achieving optimal data quality will increase heavily in the following years, and cloud computing will be a main driver.”

Identity Resolution Daily Links 2009-03-30

Monday, March 30th, 2009

By the Infoglide Team

data quality PRO: Identifying Duplicate Customers (Part 1)

[Jim Harris] “What is sometimes overlooked is that although technology provides the solution, what is being solved is a business problem. Technology sometimes carries with it a dangerous conceit – that what works in the laboratory and the engineering department will work in the board room and the accounting department, that what is true for the mathematician and the computer scientist will be true for the business analyst and the data steward.”

Correction Officers Going Wrong: California Correctional Officer Arrested for Fraud

“Each insurance fraud count carries up to five years in state prison. Also, California workers’ compensation fraud statutes require restitution of double the monetary amount of the fraud; the suspected loss on this case is more than $150,000, not including more than $1.6 million in disability retirement from the California Public Employee Retirement System (CalPERS) that would have been paid out on this suspect claim.”

TwinCities.com: What if your lottery sales clerk said your ticket was a loser … and lied?

“‘(We) really need our retailers to be honest and to have their employees do it right every time,’ said state lottery director Clint Harris. The stings took place last December and January at 186 randomly selected metro stores, Harris said. Undercover agents would ask clerks to verify the specially constructed crossword game scratch-offs as winners. The prizes ranged from $7,000 to $21,000.”

Register Herald: Fusion center helps fight war on crime

“Kirk is handling a new mission in life, directing West Virginia’s fledgling fusion center, a new tack in the war on terrorism and crime in general. Put simply, it acts as a clearinghouse so data can be analyzed and the proper law enforcement agency put on notice for immediate, or long-range, investigations.”

Dashboard INSIGHT: The increasing convergence of MDM and data governance

“The concept of data governance is simple.  The Data Governance Institute (datagovernance.com) defines it as ‘a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods.’”

Identity Resolution Daily Links 2009-03-23

Monday, March 23rd, 2009

By the Infoglide Team

Vermont News Guy: When Is a Worker Not an Employee?

“The practice - scorned as ‘1099ing,’ by construction union officials (for the Internal Revenue Service form that freelance workers fill out) - short-changes Worker Compensation, Unemployment Insurance and Social Security funds. It also ‘creates an unlevel playing field,’ in the words of Vermont Labor Commissioner Patricia Moulton Powden. Businesses that play by the rules can be underbid by their competitors who do not. The specific reason is the company, GNPB/Kal-Vin, which sometimes goes by only one or the other of those names, and which is known by contractors, union leaders, and government officials as a company with a spotty labor law record.”

Forrester Blog: Lean Information Management Strategies For Lean Times

[James Kobielus] “Many organizations struggle to gain control over information infrastructures that have become too bloated, rigid, and slow to realign with new business drivers. Lean information management practices are essential for corporate survival. They are far more than belt-tightening exercises. They also help you build analytic muscle for excelling in any business environment.”

Chicago Daily Herald: State’s new fraud unit targets workers’ comp abusers

“Every employer in Illinois is required by law to have workers’ compensation insurance. The amount a company pays for this insurance depends on the business and - just like with car insurance - it will go up based on the number of claims filed. For this reason, workers’ compensation fraud is not committed only by employees, but by medical providers and employers trying to avoid pricey premiums and payouts. They do this in a variety of ways, said Michael McRaith, director of insurance for the Illinois Department of Financial and Professional Regulation.”

Entity Extraction: The Flip Side of Entity Resolution

Wednesday, February 25th, 2009

By John Talburt, PhD, CDMP, Director of the Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ) at the University of Arkansas at Little Rock

Under our working definition of entity resolution as locating and merging references to the same entity, the last installment focused on the merge problem, and how matching is often used as a stand-in for ER.  Now let’s take a look at the locating problem.

First we should note that information comes to us in two forms, structured and unstructured.  The traditional world of IT has been built around structured information based on the discipline of relational database schemas.  In essence, data is structured if it is ready to be loaded into a relational database, i.e. all of the entities and their attributes are clearly delimited or tagged in a way that a computer can correctly read the entire data set by following one simple, repeating pattern.  In the good ole days, the flat-file format gave us this by requiring that every record must have a fixed length and every attribute must occupy a fixed position in the record.  Inspired by the spreadsheet paradigm, a friendlier version came along only requiring that all of the attributes be presented together in a fixed order, each separated from the other by a specially designated character, the delimiter.  Now XML has brought us yet another discipline of explicitly tagging the start and end of records and attributes with a consistent naming convention.
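The three conventions just described can be sketched in a few lines of Python. This is only an illustration: the record layout, field names, and sample values below are hypothetical. The point is that each format can be read by following one simple, repeating pattern, and all three yield the same entity-attribute record.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical layout: name (10 characters), city (8), year (4).
FIXED_LAYOUT = [("name", 0, 10), ("city", 10, 18), ("year", 18, 22)]
FIELD_NAMES = [field for field, _, _ in FIXED_LAYOUT]


def parse_fixed(line):
    """Flat file: every attribute occupies a fixed position in the record."""
    return {field: line[start:end].strip() for field, start, end in FIXED_LAYOUT}


def parse_delimited(line, delimiter=","):
    """Delimited: attributes come in a fixed order, separated by a delimiter."""
    values = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    return dict(zip(FIELD_NAMES, values))


def parse_xml(text):
    """XML: the start and end of the record and each attribute are tagged."""
    record = ET.fromstring(text)
    return {child.tag: (child.text or "").strip() for child in record}


# The same (made-up) record expressed in each of the three formats.
fixed_line = "Ann Lee".ljust(10) + "Austin".ljust(8) + "2009"
delimited_line = "Ann Lee,Austin,2009"
xml_text = "<record><name>Ann Lee</name><city>Austin</city><year>2009</year></record>"

records = [parse_fixed(fixed_line), parse_delimited(delimited_line), parse_xml(xml_text)]
print(records[0])
print(records[0] == records[1] == records[2])
```

In each case the parser needs to know nothing about the meaning of the data, only the pattern, which is exactly what makes locating easy in the structured world.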

So in the structured world, locating is easy: you just follow the pattern.  The problem is that we are now beginning to realize that there is a tremendous amount of information in unstructured formats such as free-form documents, photos, videos, audio files, and sensor data, formats that are not easily mapped into an entity-attribute schema.  Even if we just focus on information encoded in character (text) format, the total amount of unstructured information in most organizations often exceeds the amount of structured information by a considerable margin.  What’s more, we now realize that some of this information could be important, i.e. that processes like customer relationship management (CRM) could be transformed if the company only knew what their customers were saying in their emails to the company or in the comments they gave to telemarketers or technical support personnel who typed those comments into a free-form notes field.

So how did we end up with so much unstructured information? Did good information go bad?  No, the reason is that the information age operates on four channels – people to computers, computers to people, computers to computers, and people to people – and it is the last of these that generates the unstructured information.  Person-to-person communication is inherently complex and often carries a tremendous amount of implicit and explicit context that people understand, but computers don’t.

Early in my career, I worked with a professor on the problem of disambiguation of homographs using thesauri (a fancy way of asking if a computer can understand the difference in meaning between two words that are spelled the same, but mean different things, just by looking at the synonyms of the words around them, e.g. “I can open this can.”)  His favorite test was “Time flies like an arrow, but fruit flies like a banana.”

But getting back on topic, if you want to resolve whether references are to the same or different entities, you must first have the references.  So if the information sources are unstructured, the locating side of entity resolution is about finding the entity references.  This process is variously referred to as “named entity recognition”, “entity identification”, or “entity extraction”.  In the next installment we will discuss some of the strategies for entity extraction from unstructured text documents.
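As a rough illustration of what locating references in free text involves, here is a minimal Python sketch. It is not one of the strategies this series will cover, just a naive pattern-based heuristic; the patterns, the function name, and the sample note are all assumptions made up for the example.

```python
import re

# Naive heuristic (an assumption for illustration only): a run of two or more
# capitalized words is treated as a candidate person or organization name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")
# Simplified email pattern; real addresses allow more forms than this.
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract_references(text):
    """Return (kind, text, offset) tuples for each candidate entity reference."""
    found = []
    for kind, pattern in (("name", NAME_PATTERN), ("email", EMAIL_PATTERN)):
        for match in pattern.finditer(text):
            found.append((kind, match.group(), match.start()))
    # Order the references by where they appear in the text.
    return sorted(found, key=lambda ref: ref[2])


# Hypothetical free-form note of the kind typed by support personnel.
note = ("Customer called about a late order. Spoke with Mary Jones, who asked "
        "that we copy mary.jones@example.com and her manager Robert Smith.")

refs = extract_references(note)
for ref in refs:
    print(ref)
```

The output is only a list of references. Deciding whether “Mary Jones” and the email address refer to the same entity is the merge side of entity resolution, which starts once the references have been located.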

