Too Busy For Words - the PaulWay Blog

Wed 9th Jun, 2010


Wikis in general have revolutionised information on the internet. Not only are data and information more accessible, but they can also be improved as time goes on with the same amount of effort. Wikis apply the 'many eyes' principle of fault-finding and make it easy for someone who can improve something to do so. Before wikis, web pages were arcane things guarded by religious orders of designers and programmers, charged with the sacred task of protecting these pages from just anyone editing them. Now content has been opened up to the masses.

We don't, however, have the same kind of tool for data. I'm talking both about tables of information - phone number lists, customer data, etc. - and the relationships between them. There are masses of data like this around, and most of it is in CSV files, HTML tables and occasionally in some database or other. There are interchange formats around and systems like ODBC for communicating between one database engine and another, but this still requires the database administrators to come forth from their temples and bless the queries and connections.

We need tools that can allow a web site to:

  1. Take a variety of formatted data - CSV, HTML, SQL dump, etc. - and try to intelligently work out what it contains.
  2. Store the data in actual SQL tables, using data types appropriate to the task (i.e. not everything gets stored in VARCHAR(255) fields).
  3. Allow the user to specify constraints on the data, such as integer fields, phone number formats, etc.
  4. Allow the user to specify relationships between the data, assisted by the site where possible.
  5. Generate views of the data including joins, grouping and arithmetic expressions.
  6. Detect simple errors and inconsistencies if possible.
  7. Allow access to this data from other web sites in a structured format.
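
As a rough illustration of steps 1 and 2 in the list above, here is a minimal Python sketch that guesses a type for each CSV column and loads the rows into a SQLite table. The function names (`infer_type`, `import_csv`), the sample data and the choice of SQLite are all assumptions made for the example, not part of any existing tool, and the type guessing is deliberately naive.

```python
import csv
import io
import sqlite3

# Hypothetical sample data, made up for this sketch.
SAMPLE_CSV = """\
name,branch_id,balance,phone
Alice,1,10.50,02 6200 1234
Bob,2,7,02 6200 5678
"""


def infer_type(values):
    """Guess the narrowest SQL type that fits every non-empty value."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "TEXT"
    for sql_type, caster in (("INTEGER", int), ("REAL", float)):
        try:
            for value in non_empty:
                caster(value)
            return sql_type
        except ValueError:
            continue  # at least one value did not fit; try the next type
    return "TEXT"


def import_csv(conn, table, text):
    """Create `table` from CSV text and return the inferred column types.

    Assumes the first row is a header and that names contain no double quotes.
    """
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, data = rows[0], rows[1:]
    types = [infer_type([row[i] for row in data]) for i in range(len(header))]
    columns = ", ".join(f'"{name}" {t}' for name, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    # Store empty cells as NULL rather than as empty strings.
    cleaned = [[v if v.strip() else None for v in row] for row in data]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', cleaned)
    return dict(zip(header, types))
```

Calling `import_csv(sqlite3.connect(":memory:"), "customers", SAMPLE_CSV)` creates a table whose `balance` column can be summed directly in SQL. A real tool would need far more care than this: a phone number written without spaces, for instance, would be guessed as an integer and lose its leading zero.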

Microsoft Access is a good example of some of this - in one of my former jobs I saw many people use it to store and present a lot of useful data. But often they were really using it as a glorified form of spreadsheet - storing multiple addresses as separate columns rather than in a separate one-to-many table because they didn't know any better. Often these applications could have used a link to common facilities - I think I saw three separate tables of our branch locations, all inaccurate or lacking to some degree. Often these were business-critical applications, maintained by one person and provided in a haphazard, unreliable fashion. Access's approach to multi-user work was to prevent it completely. While Access put the power of a database in people's hands, it did so badly and with little thought to being scalable (at least when I was using it in 1995-2000).
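
To make the spreadsheet-versus-database point concrete, here is a small sketch of turning "address in several columns" rows into a proper one-to-many table, with a view that joins and groups the result. The table names, the sample rows and the helper functions `build_schema` and `normalise` are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical spreadsheet-style rows: one name, then up to three addresses.
WIDE_ROWS = [
    ("Alice", "1 Main St", "PO Box 7", ""),
    ("Bob", "22 High St", "", ""),
]


def build_schema(conn):
    """Create a customer table, a one-to-many address table and a summary view."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript("""
        CREATE TABLE customer (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE address (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id),
            street      TEXT NOT NULL
        );
        CREATE VIEW customer_address_count AS
            SELECT c.name AS name, COUNT(a.id) AS addresses
            FROM customer c
            LEFT JOIN address a ON a.customer_id = c.id
            GROUP BY c.id, c.name;
    """)


def normalise(conn, wide_rows):
    """Split each wide row into one customer row plus one row per address."""
    for name, *streets in wide_rows:
        cur = conn.execute("INSERT INTO customer (name) VALUES (?)", (name,))
        for street in streets:
            if street.strip():  # skip the unused address columns
                conn.execute(
                    "INSERT INTO address (customer_id, street) VALUES (?, ?)",
                    (cur.lastrowid, street),
                )
```

With the foreign key in place, an address that points at a customer who does not exist is rejected by the database itself instead of quietly accumulating, which is the kind of simple error detection the wish list above asks for.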

And while I'm finding Django to be a great framework to work with, I still seem to end up doing all the work of manually importing CSV files, KML lists, HTML tables and former SQL databases. This should be a simple process of no more than half a dozen steps. Every table in Wikipedia should be a data reference that can be sorted, keyed against, and used in someone else's pages. Google has put a lot of effort into understanding everything from movie times to stock prices, but for the rest of us it's a matter of asking a programmer. There are all sorts of interesting data mash-ups going on, but they still seem to require APIs, server software and lots of code.

It's hard to stand on the shoulders of giants if you can't climb up...

The title of this page comes from the Wagiman language from the Northern Territory. It roughly means fast-find - given that I know of the language only what the dictionary gave me, I don't know if I've formed the words correctly. But it's that key property that I think makes this such a compelling idea: the ability to throw a CSV table into a website and have it become searchable, sortable and accessible to others instantly. While I think hand-crafted data relationships will always be faster or more accurate than automatic imports, the latter is still better than locking the data away for want of a system to access it.


All posts licensed under the CC-BY-NC license. Author Paul Wayper.

© 2004-2016 Paul Wayper