Too Busy For Words - the PaulWay Blog

Wed 11th Jul, 2007

Accessing the Deep Web

IP Australia has an interesting post about the "Deep Web" - those documents which are available on the internet but only by typing in a search query on the relevant website.

On reading their article I get the impression that they think that this is both a hitherto-unknown phenomenon and one which is still baffling web developers. This puzzles me, as even a relative neophyte such as myself knows how to make these documents available to search engines: indexes. All you need is a linked-to page somewhere which then lists all of the documents available. This page doesn't have to be as obvious as my Set Dance Music Database index - it can be tucked away in a 'site map' page somewhere so that it doesn't confuse too many people into thinking that that's the correct way to get access to their documents. However, don't try to hide it so that only search engines can see it, or you'll fall afoul of the regular 'link-farming' detection and elimination mechanisms most modern search engines employ.
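As a rough illustration of how little such an index page needs, here is a minimal sketch in Python. It assumes you can already get a list of (title, URL) pairs out of whatever database sits behind the search form; the function name `build_index_page` and the example URL are made up for this sketch and are not taken from any real site.

```python
from html import escape


def build_index_page(documents, page_title="Document index"):
    """Return a plain HTML page with one link per document.

    `documents` is an iterable of (title, url) pairs. How you obtain
    them (database query, directory listing, ...) is up to you; this
    sketch only covers turning them into a crawlable page.
    """
    # Escape titles and URLs so that characters such as & and < do not
    # break the markup.
    items = "\n".join(
        f'  <li><a href="{escape(url, quote=True)}">{escape(title)}</a></li>'
        for title, url in documents
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(page_title)}</title></head>\n"
        "<body>\n"
        f"<h1>{escape(page_title)}</h1>\n"
        "<ul>\n"
        f"{items}\n"
        "</ul>\n"
        "</body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical document list, for demonstration only.
    docs = [("Tulla Ceili Band", "http://example.com/cd?id=1&format=full")]
    print(build_index_page(docs))
```

Link the resulting page from a site map or footer, where both people and crawlers can reach it, rather than hiding it.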

Of course, being a traditionalist (as you can see from both the content and design of the Set Dance Music Database) I tend to think that lists are still useful, at least if kept small. And I do need to put in some mechanisms for searching on the SDMDB, as well as a few other drill-down methods. So giving people only a search form may not cater to all the methods they employ when finding content. Wikis realised this years ago - people like interlinking. And given that these 'deep web' documents are still accessible via a simple URL, if you really need to you can assist the search engines by creating your own index page to their documents: script up a series of searches on their website and put the resulting links into your index, taking care not to list duplicates.
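The duplicate-avoiding part of that idea might look something like the following Python sketch. It assumes you have already downloaded the HTML of each search results page (for example with `urllib.request.urlopen`) and that the results are ordinary `<a href>` links; the names `LinkCollector` and `collect_unique_links` are invented for this example. It also treats every link on the page as a result, so a real site would need extra filtering to skip navigation links.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop any #fragment, so that two
            # links to the same document compare as equal.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(absolute)


def collect_unique_links(result_pages, base_url):
    """Return links from all result pages, first occurrence only, in order."""
    seen = set()
    unique = []
    for page_html in result_pages:
        parser = LinkCollector(base_url)
        parser.feed(page_html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                unique.append(link)
    return unique
```

The returned list can then be written out as a simple page of links. Be polite when scripting someone else's search form: space your requests out and check the site's terms first.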

So the real question is: why are the owners of these web sites not doing this? We may just need to suggest it to them if they haven't thought of it themselves. The benefits of having their documents listed on Google are many - what downsides are there? I'm sure the various criticisms of such indexing are mainly due to organisational bias and narrow-mindedness, and can either be solved or routed around.

There are two variants of this that annoy me. One is the various websites where the only way to get to what you want is by clicking - no direct link is ever provided and your entire navigation is done through JavaScript, Flash or unspeakable black magic. These people are deliberately making it hard for you to get straight to what you want, either because they want to show you a bunch of advertising on the way or because they want to know exactly what you're up to on their site for some insidious purpose. There is already one Irish music CD store online that I've basically had to completely ignore (except for cross-checking with material on other sites) because there is no way for me to refer people directly to a CD. I refuse outright to give instructions such as "go to http://example.com and type in the words 'Tulla Ceili Band' in the search box", because that's not good navigation.

The other type of annoyance I find ties in with this: it is the practice of making a hidden index, or a privileged level of access, available to search engines that normal people don't see. I've seen a few computing and engineering websites do this, and Experts Exchange is particularly annoying for it: you can google your query and see an excerpt from the page with the question but when you go there you find out that access to the answers requires membership and/or payment. This, as far as I'm concerned, is just a blatant money-grabbing exercise and should be anathema. Either your results are free to access, or they're not - search engines should not be privileged in that respect.



All posts licensed under the CC-BY-NC license. Author Paul Wayper.


© 2004-2016 Paul Wayper