Start Early on Data Quality
Wednesday, October 8th, 2008
“A problem well stated is a problem half solved.”
Charles F. Kettering (1876 - 1958)
Have you been wondering how something as big as the sub-prime mortgage fiasco could not have been forecast well before it happened? Ellen Pearlman pondered this in CIOZone while discussing Thomas Redman’s recent book on data quality. Interestingly, Redman wrote that the mortgage crisis “illustrates perfectly how bad or missing data contribute to issues of international importance and the costs and uncertainties that result”, and he wrote it before this past week’s federal intervention.
The data quality predicament in the mortgage arena is one example of a far more widespread problem. The volume of data doubles every 12-18 months, and that includes bad data. Redman estimates that bad data costs as much as 10-20% of revenue. Doesn’t that mean that the earlier in the data life cycle we get a handle on the data quality problem, the smaller the financial impact?
So says Alena Semeshko in a post on ZDNet UK. “I keep wondering how come data quality check still exists as a procedure performed once in a while, rather than as a part of the front-end process? How come most companies start worrying about the quality of your data only when it’s already dirty and in use?”
Identity resolution plays a vital role in data quality applications. Applying identity resolution at the front end can ensure that error-filled and fraudulent identity information is detected and kept out of production systems.
How does it work? Take the example of a web application for applicants or new customers. After an online form is completed, the individual’s identity attributes (name, address, etc.) are compared in real time to a list of known fraudsters to screen out bad actors. The identity information is also run through a similarity search to find any existing master record for that person, so that the two can be matched and resolved into one before the data even enters the system.
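To make the flow above concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular identity resolution product works: the function names, the record layout, the 0.85 threshold, and the use of the standard library's `difflib.SequenceMatcher` as the similarity measure are all assumptions. Real systems typically rely on far more robust matching (nicknames, phonetics, address standardization).

```python
from difflib import SequenceMatcher

def normalize(value):
    # Lowercase and collapse whitespace so trivial differences do not block a match.
    return " ".join(value.lower().split())

def similarity(a, b):
    # Average the name and address similarity ratios (each between 0.0 and 1.0).
    fields = ("name", "address")
    scores = [
        SequenceMatcher(None, normalize(a[f]), normalize(b[f])).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)

def best_match(applicant, records, threshold):
    # Return the most similar record at or above the threshold, else None.
    best, best_score = None, threshold
    for record in records:
        score = similarity(applicant, record)
        if score >= best_score:
            best, best_score = record, score
    return best

def screen(applicant, watchlist, master_records, threshold=0.85):
    # Step 1: compare against known fraudsters before anything is stored.
    if best_match(applicant, watchlist, threshold) is not None:
        return ("reject", None)
    # Step 2: look for an existing master record to resolve against.
    existing = best_match(applicant, master_records, threshold)
    if existing is not None:
        return ("merge", existing["id"])
    # Step 3: no match anywhere, so this is a new identity.
    return ("create", None)

# Hypothetical data, invented for illustration.
watchlist = [{"name": "John Q. Fraudster", "address": "1 Fake Street, Springfield"}]
master_records = [
    {"id": 101, "name": "Jonathan Smith", "address": "42 Elm Street, Boston"},
]

applicant = {"name": "Jonathon Smith", "address": "42 Elm St, Boston"}
decision, master_id = screen(applicant, watchlist, master_records)
print(decision, master_id)
```

The point of the sketch is the ordering: both checks run at form submission, so a misspelled duplicate is resolved to its master record and a watchlist hit is stopped before either reaches a production system.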
Redman’s “rule of 10” states that “it costs ten times as much to complete a unit of work when the input data are defective (late, incorrect, missing, etc.) as it does when the input data are perfect.” With bad data, just as with software bugs, the earlier the problem is found, the more money is saved.
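As a rough back-of-the-envelope reading of the rule of 10, the average cost of a unit of work can be sketched as a blend of clean and defective inputs. The function name and the 5% defect rate below are hypothetical, chosen only to show the arithmetic; they are not figures from Redman.

```python
def expected_unit_cost(clean_cost, defect_rate, multiplier=10):
    # Clean inputs cost clean_cost; defective inputs cost multiplier times as much.
    return clean_cost * ((1 - defect_rate) + multiplier * defect_rate)

# Hypothetical scenario: 5% of inputs are defective.
baseline = expected_unit_cost(1.0, 0.0)
with_defects = expected_unit_cost(1.0, 0.05)
print(f"Cost increase: {(with_defects / baseline - 1) * 100:.0f}%")
```

Under these assumed numbers, even a small share of defective records raises the average cost per unit of work noticeably, which is the arithmetic behind catching bad data at the front end.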
