Rusty Hubcaps and Old Boots: A Fish Tale
by John Ripley, Chief Software Architect
When thinking about the technical challenge of searching data, I find it helpful to start with a very non-technical analogy. Consider the case of Fisherman Bob. He is looking for fish in a vast ocean. With the technology of the day, he has two options.
1) Use his standard fishing net, cast it into the water, and haul in whatever it catches. Fisherman Bob, having done this for 25 years, is pretty skillful at his craft. He knows where to throw the net, what depth to leave it at, and what fish he is likely to catch. Sure enough, after two hours of trawling, he hauls in a good-sized catch. It is primarily cod, with the odd squid and haddock. He can pretty much dump the whole lot into the holding tanks and start his next cast.
2) Try a wider net. It holds the opportunity of bringing in a much bigger catch. It is a little harder to cast due to its size. It also trawls across a wider range of depths and can catch fish at the surface as well as at the bottom of the ocean. After a few hours, Fisherman Bob pulls up the net, eager to see what he has caught. Once again he has caught cod, and more of it. However, as Fisherman Bob is about to dump his catch into the holding tanks, he notices something red in the net. He has caught a lobster this time . . . and not only one . . . but 57 of them. Well, they may not be fish, but they are seafood. Close enough, I suppose. His customers on shore surely won’t mind getting the odd lobster in their delivery of fish, will they? He opens up the net and, as his catch falls into the tanks, Fisherman Bob notices much more than fish and lobster. To his dismay, in addition to what he was looking for, he has pulled up four bags of trash, 22 rusty hubcaps, and seven old boots. Fisherman Bob’s customers would be pretty upset to find an old boot in their daily delivery of fish, so much so that they would likely question his ability to deliver fish reliably and consistently. To keep his customers satisfied, Fisherman Bob now has to climb down into the tanks, amongst the flopping fish and snapping lobsters, and discard the trash, hubcaps, and boots. Any efficiency gained with the wider net has been lost to the manual separation of good from bad, along with a potential loss of customer confidence.
In terms of search technology, we have all been in Fisherman Bob’s predicament. Say you go to your favorite electronics e-tailer site, search for “Samsung LX-53567 TV HD 1080p”, and get 0 matches. Oops, forgot the TV was 1080i. Try the search again: “Samsung LX-53567 HD”. Still 0 hits. What the . . . ahh, the heck with it . . . just search “Samsung TV”. 220 hits! And your TV is on page 14 of 22.
How, then, can we reconcile two diametrically opposed approaches? On one side is “exact search”, with its indexed, high-performance characteristics but inherent risk of missing records that have minor (or major) variances (false negatives, or Type II errors). On the other is “cast a wide net” search, with its “fuzzy match” techniques and inherent risk of bringing back non-relevant records (false positives, or Type I errors).
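To make the trade-off concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the four-item catalog is made up, and the "wide net" is a simple token-overlap score standing in for whatever fuzzy-match technique a real search engine would use. It is not a description of any particular product's method.

```python
# Hypothetical four-item catalog, invented for this illustration.
CATALOG = [
    "Samsung LX-53567 TV HD 1080i",   # the TV we actually want
    "Samsung LX-53890 TV HD 1080p",
    "Samsung Galaxy phone case",
    "Sony KD-55 TV HD 1080p",
]


def exact_search(query, catalog):
    """Standard net: keep a record only if it contains every query token."""
    tokens = query.lower().split()
    return [
        record for record in catalog
        if all(token in record.lower().split() for token in tokens)
    ]


def wide_net_search(query, catalog, threshold):
    """Wide net: keep a record if enough of the query tokens appear in it.

    The score is the fraction of query tokens found in the record, a
    deliberately crude stand-in for a real fuzzy-match score.
    """
    tokens = query.lower().split()
    hits = []
    for record in catalog:
        words = record.lower().split()
        score = sum(token in words for token in tokens) / len(tokens)
        if score >= threshold:
            hits.append(record)
    return hits


query = "Samsung LX-53567 TV HD 1080p"   # the shopper misremembers 1080i as 1080p

# Exact search misses the TV entirely: a false negative (Type II error).
print(exact_search(query, CATALOG))

# The wide net finds the TV, but hauls in a different Samsung model and a
# Sony along with it: false positives (Type I errors).
print(wide_net_search(query, CATALOG, threshold=0.5))
```

Raising the threshold toward 1.0 shrinks the net back toward exact search and its false negatives; lowering it brings in more boots and hubcaps.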
Please come back on Friday when I’ll net an answer for you.
