Managing Unstructured Data
By Guest Blogger Dan Power, President, Hub Solution Designs
In an earlier article, Governing Unstructured Data, I discussed some of the challenges in managing and securing unstructured data in a large enterprise. Given that unstructured data accounts for more than 80% of all business data, this is a big issue.
In my own company, we use Microsoft Groove as a collaborative document repository and content management system. It was developed by Ray Ozzie at Groove Networks, which Microsoft acquired in 2005 (Ray is now Microsoft’s Chief Software Architect). Groove has its quirks, but it works for us, and its military-grade security features and robust document encryption help guard against “lost laptop” syndrome.
Windows, which in its various versions is the operating system running more than 90% of the world’s computers, is inherently insecure. Users can create documents without any controls, and can store them on their local drives, shared network drives, or even on USB flash drives. Many enterprise applications allow users to download information into Excel spreadsheets or text files. And more than 80% of U.S. firms lose laptops with sensitive data each year.
Undoubtedly, an enterprise content management (ECM) system can help with these issues. According to Gartner, as of 2007, the ECM market leaders were Open Text Corporation, EMC (Documentum), IBM, and Oracle Corporation.
But one of the main challenges in building a successful content management system is indexing and categorization. In a Master Data Management (MDM) hub, where the information is largely structured, identity management or resolution (i.e., matching) is relatively straightforward.
But with unstructured data, you’re looking for indications of whether a document deals with a customer, a supplier, an employee, an internal business unit, etc. And once your matching algorithm has a hint about that, it has to find the identity information (person or business name, street address, city/state/zip/country, etc.) somewhere in the unstructured document.
This is not a trivial challenge. In a typical MDM hub, the hub vendor’s match engine may be sufficient. But when trying to categorize, index, and secure thousands of unstructured documents from all over the enterprise, an advanced matching or identity resolution engine from a company like Infoglide may be needed.
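To make the two steps described above concrete (first guess what kind of entity a document is about, then find identity details in the text and match them to a master record), here is a deliberately naive sketch in Python. Everything in it is an assumption for illustration: the keyword hints, the sample master records, the function names, and the 0.85 threshold are all hypothetical, and the standard library's difflib is only a crude stand-in for a commercial identity resolution engine. It is not a description of how any vendor's product works.

```python
import re
from difflib import SequenceMatcher

# Hypothetical keyword hints per entity type; a real system would use far richer signals.
ENTITY_HINTS = {
    "customer": ("customer", "invoice", "account"),
    "supplier": ("supplier", "vendor", "purchase order"),
    "employee": ("employee", "payroll", "performance review"),
}

# Hypothetical master records, standing in for the contents of an MDM hub.
MASTER_RECORDS = [
    {"id": "C-001", "type": "customer", "name": "Acme Corporation", "zip": "02043"},
    {"id": "S-001", "type": "supplier", "name": "Globex Industries", "zip": "60601"},
]

# U.S. ZIP code, with optional +4 extension.
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def guess_entity_type(text):
    """Step 1: count keyword hints to guess which kind of entity the document concerns."""
    lowered = text.lower()
    scores = {
        etype: sum(lowered.count(word) for word in words)
        for etype, words in ENTITY_HINTS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def best_name_similarity(name, text):
    """Slide a window of words across the text and return the best fuzzy match to `name`."""
    words = re.findall(r"[A-Za-z0-9&']+", text.lower())
    target = name.lower()
    size = len(target.split())
    best = 0.0
    for i in range(max(len(words) - size + 1, 0)):
        window = " ".join(words[i:i + size])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def resolve_identity(text, threshold=0.85):
    """Step 2: match identity details in the text against master records of the guessed type."""
    etype = guess_entity_type(text)
    zips = set(ZIP_PATTERN.findall(text))
    best_record, best_score = None, 0.0
    for record in MASTER_RECORDS:
        if etype is not None and record["type"] != etype:
            continue
        score = best_name_similarity(record["name"], text)
        if record["zip"] in zips:
            score = min(score + 0.1, 1.0)  # a matching ZIP code adds a little confidence
        if score > best_score:
            best_record, best_score = record, score
    if best_record is not None and best_score >= threshold:
        return etype, best_record["id"]
    return etype, None


# The company name is misspelled on purpose to exercise the fuzzy match.
document = "Invoice for customer Acme Corporaton, 12 Main St, Springfield 02043"
print(resolve_identity(document))
```

Even this toy version hints at why the problem is hard: misspellings, abbreviations, multiple entities in one document, and addresses in inconsistent formats all defeat simple keyword and string-similarity rules quickly.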
I continue to be interested in the issues of data governance for unstructured data. Not that MDM and data governance for structured data like customers and products are in any way a “solved problem.” Many companies are just starting to think about MDM and data governance on an enterprise scale. And very few are including unstructured data in their first few phases.
But it will eventually have to be tackled. Government and consumer patience with sensitive data being lost, inadvertently posted on web sites, or stolen outright by international hacker rings will at some point wear thin. In my opinion, more stringent regulations on how companies must manage and secure sensitive information about companies and individuals are inevitable. As with structured data, the tools exist to manage unstructured data. It’s just a matter of making Master Data Management an enterprise-wide priority and including a method, such as identity resolution, for searching, analyzing, and managing unstructured data in conjunction with structured data.
