The Setup
For a few days now, I’ve been browsing through various link collection/directory sites, trying to get a large sampling of code, product data, and even theme packs for several projects. The recurring thought throughout this process has been, “Gee, why is sampling so onerous these days? Isn’t there an easier way?”
The Premise
For quite some time, I’ve wondered about whether anybody has created a tool that selectively fetches the contents of URLs as they are encountered on visited Web pages… but not in a fully web crawling robotic way, not even in a rule-based way, but in a user-driven (perhaps even intuitive) manner.
Years ago, there were various competing products on the market (both shareware and commercial) that did a somewhat respectable job of churning through sites that had a bazillion links. These were generally sites that were vast collections of static content but arranged in such a way that a person would have to traverse several thousand links to get to small tidbits of useful information. The benefit/cost ratio for these sites was so low without such tools that the information being “hidden” was often considered lost. Unfortunately, those early tools were made obsolete by the ever-increasing population of dynamic, interactive sites that also contain tons of links but require a certain degree of user input. Exacerbating this problem was the ever-expanding frontier of the Web and the other Internet data sources; this led to the formation, rise, and fall of large competing search engine companies and their products.
During the massive upswell of FOSS projects geared toward search technology, the need for significantly more effective web crawlers was identified, so a lot of effort was spent on developing agents that would use various search algorithms and techniques to pull URLs and content snippets to feed the search databases, either as endpoints or as data to feed back into further search iterations. This is all well and good, but for personal research uses and for highly specialized search needs, being fully inclusive is not as important as being appropriately selective.
Trying to be the “next Google” has slowly morphed into trying to be the “next adver-Google”, but the massively centralized, uselessly redundant search databases have worsened the search initiative, not improved it. Much of this is due to spammers, cheaters, and strange advertisers trying to capitalize on the Internet’s accessibility to increase their search ranks by getting creative in duplicating their URLs in search results. But people who are serious about finding useful information have to struggle through millions (or in some cases, hundreds of millions) of search results with no easy way to determine which are relevant and which are not… other than submitting more “search within the results” queries over and over and over. Well, you get the idea.
The Action
Why not leave some of the determination up to the user during the actual data fetch/crawl? So, instead of making the mechanistic aspect of the total search procedure fully automated and performed well in advance of the user’s search requests, let the user feed various search criteria into the crawler to allow more fine-grained control much earlier in the search procedure, so the need to store lots of extraneous search results is dramatically reduced.
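To make the idea a little more concrete, here is a minimal Python sketch of what “feeding search criteria into the crawler” might look like at the level of a single page. Everything in it is an assumption made for illustration: the names `Criteria`, `LinkCollector`, and `select_links`, and the choice of keywords plus an allowed-host list as the criteria, are hypothetical and not taken from any existing tool.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects (absolute_url, anchor_text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


@dataclass
class Criteria:
    """Hypothetical user-supplied criteria: keywords and permitted hosts."""
    keywords: list = field(default_factory=list)
    allowed_hosts: list = field(default_factory=list)

    def matches(self, url, text):
        # An empty allowed_hosts list means "any host".
        if self.allowed_hosts and urlparse(url).netloc not in self.allowed_hosts:
            return False
        # An empty keywords list means "any link on a permitted host".
        haystack = (url + " " + text).lower()
        return not self.keywords or any(k.lower() in haystack for k in self.keywords)


def select_links(html, base_url, criteria):
    """Return only the links on this page that satisfy the user's criteria."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url, text in collector.links if criteria.matches(url, text)]
```

With a `Criteria` of keyword “theme” and host `example.org`, a page linking to a theme pack, an off-site ad, and an “About” page would yield only the theme pack URL. The point is simply that the filtering happens at fetch time, so the extraneous links are never stored in the first place.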
Of course, this would mean that the number of “mistakes” during this new search procedure could be fairly high: after all, doesn’t a user frequently not know exactly what they want until they find it? Maybe… maybe not. Feeding the results from any of the centralized search engines into the new search procedure may help, so allowing that to be an option makes sense. But if the user already knows, in general, the sites where they want to start their search, then why not begin there? Intra-site spidering may have a new renaissance because of this, but your garden-variety Web browser is most likely the wrong context for this type of search.
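As a rough illustration of user-driven intra-site spidering, the Python sketch below walks a single site breadth-first from a seed URL and asks a `decide` callback (standing in for the user) before following each link. The function name `guided_crawl`, the `fetch_links` and `decide` callbacks, and the `max_pages` limit are all hypothetical; a real client would plug in actual page fetching and some form of interactive prompt.

```python
from collections import deque
from urllib.parse import urlparse


def guided_crawl(seed_url, fetch_links, decide, max_pages=50):
    """Breadth-first walk of one site; `decide` approves each link first.

    fetch_links(url) -> iterable of absolute URLs found on that page
    decide(url)      -> True if the user wants this link followed
    """
    site = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}        # every URL already considered, followed or not
    visited = []             # pages actually fetched, in order

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link in seen:
                continue
            seen.add(link)
            # Stay inside the site the user chose to start from.
            if urlparse(link).netloc != site:
                continue
            # The user, not a fixed rule set, has the final say.
            if decide(link):
                queue.append(link)
    return visited
```

Because each URL is put to `decide` at most once, a “mistake” costs one skipped or one wasted fetch rather than a polluted result set. Seeding the queue from a search engine’s results instead of a single known site would be a small change to the same loop.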
… So a new kind of client or collaborating set of agents is required.
Starting Small…
In the FOSS space, there’s no need to create a fully realized solution to this ever-increasing problem from the beginning. This is something that may be attacked in logical phases; a lot of the “intermediate” work may yield useful products in their own right.
Call to Action
So that’s that for now… the search for this wondrous product begins.
Go figure.