
general approach


Why is the Information Commons the way forward?

It would take a long time to give a complete answer to this evolving question. This page tries to sketch the beginnings of an answer - hopefully enough to keep your interest and excitement.

To begin with, let us restructure the question as follows ...

When unstructured data doesn't bond, and highly structured data breaks, how do you build a strong solution?

This question was posed as a motivating question in one of our presentations.

Unstructured data

By unstructured data we mean the vast amounts of text, image, and speech data increasingly available on the internet. This data can be a wonderful resource, but it doesn't bond together in any coherent fashion. Any number of websites will tell you different things about Pittsburgh, Pennsylvania, but this information is disorganized and haphazard. Even with a good internet search engine, all you will find are references to information about Pittsburgh which you, the user, must trawl through, interpret, and synthesize if you are to learn about the city. A search engine can find this information for you - some can even cluster the information into different subject areas - but no search engine can put the different sources together into a complete, scalable information model.

At the same time, there are several other cities with the name Pittsburgh. How are we to distinguish (say) demographic information about Pittsburgh, Pennsylvania from analogous information about Pittsburgh, California? When there is other contextual information such as the name of the state, you can often distinguish these - but if you are just told that "Pittsburgh has a population of 313,210", how on earth are you to work out which Pittsburgh this refers to?

Unstructured data is far too intractable and unpredictable to support all the inferences we need to make. Even if technologies like artificial intelligence, text mining and natural language processing worked well enough to analyze and synthesize all the unstructured data for us (as we hope one day they will), it would be far too inefficient to run these technologies from scratch every time any user asks for the same piece of information. These technologies would have to put the information into some intermediate format so that it can be rapidly deployed to support requests from different users (e.g. to plot the data on different maps). Which still leaves the question: what should this intermediate representation look like?

Highly structured data

A traditional relational database has some of the opposite problems. If I have a database of US cities, I can distinguish the different cities called Pittsburgh by giving them different index keys in a table. Then when I look up the corresponding population in the table, I know which city I'm referring to. But this doesn't help me at all to integrate information from different tables. Suppose one dataset contains the population data, one contains latitude and longitude data, and others contain lists of schools, social services, museums, theaters, hospitals ... the list goes on indefinitely.
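The point about index keys can be sketched in a few lines of Python. This is only an illustration, not part of any real system: the table layouts, the key values, the museum name, and the population numbers are all hypothetical placeholders. Keys resolve the ambiguity inside one database, but a dataset from elsewhere that identifies cities by name alone cannot be joined without first working out which city each row means.

```python
# Hypothetical tables: keys, names, and all numbers are illustrative placeholders.
cities = {
    1: {"name": "Pittsburgh", "state": "Pennsylvania"},
    2: {"name": "Pittsburgh", "state": "California"},
}
population = {1: 1000, 2: 200}  # keyed by the same city key as `cities`


def population_of(name, state):
    """Resolve (name, state) to a key, then look up the population."""
    for key, row in cities.items():
        if row["name"] == name and row["state"] == state:
            return population[key]
    raise KeyError((name, state))


# A dataset from another source identifies cities by name, not by our keys.
museums_by_name = {"Pittsburgh": ["Example Museum"]}


def candidate_keys(name):
    """Return every city key that a bare name could refer to."""
    return [key for key, row in cities.items() if row["name"] == name]


# Inside one database the key removes the ambiguity ...
print(population_of("Pittsburgh", "California"))
# ... but the external dataset still matches more than one city.
for name in museums_by_name:
    print(name, "could be any of", candidate_keys(name))
```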

Are we to design a single database table beforehand with space for all the information we might ever want to know about every city in the world? This is very unlikely to succeed. At the beginning, such a database would be vastly over-engineered, setting aside enormous amounts of space for information that we only possess about a few cities, in the hope that we'll fill in the rest of the table as time goes on. As time goes on, we would be faced with the opposite (and much worse) problem - what happens when we realize that we want to add columns that we hadn't previously allowed for?

In short, cities throughout the world have varying attributes, and trying to decide in advance what attributes (and no others) should be possessed by all cities in the world is a fruitless task.

On the other hand, there are a few properties that all cities arguably possess. They all have locations and populations, for example, and are probably contained in countries. Now, many or most of these properties are shared by other things too. Airports also have locations - and if we want to represent cities and airports on the same map, it makes sense for their locations to be represented in the same way (for example, by a latitude and longitude that mark their approximate center).

Relational databases are far too brittle to cope with this ever-present tension between the need for common representations and the need for indefinite scalability. Nobody believes that such a structure could ever be general and yet flexible enough to be not just a database, but the database.

The Information Commons Solution

In between these two extremes of unstructured and rigidly structured data, the Information Commons follows a simple and scalable middle road. Objects are represented by a collection of attributes and values so that, for example, whether something is an airport or a city, you can make assertions like latitude=x and longitude=y. You can add new attributes to any object at any time - so if you then want to add the information population=z, you can. You can even add a population to a data object that's supposed to represent an airport (or even a historical event!) - but why would you?

An object represented by a collection of attributes and values is then given a unique identifier or UUID, and this bundle of information is called a u-form. That's really all the Information Commons is - a collection of u-forms. New u-forms can be added and old u-forms can be extended at any time, with any information.
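As a rough sketch of the idea, a u-form can be modelled in Python as a UUID plus a free-form dictionary of attributes. This is not the actual Information Commons implementation or API; the names `new_uform`, `add`, and `commons`, and the airport with its coordinates, are hypothetical. The coordinates for Pittsburgh are approximate, and the population is the figure quoted earlier on this page.

```python
import uuid


def new_uform(**attributes):
    """Create a u-form: a unique identifier plus a bundle of attributes."""
    return {"uuid": str(uuid.uuid4()), "attributes": dict(attributes)}


# In this sketch the whole store is just a collection of u-forms, indexed by UUID.
commons = {}


def add(uform):
    """Add a u-form to the store and return its UUID."""
    commons[uform["uuid"]] = uform
    return uform["uuid"]


# A city and a (hypothetical) airport share the same location attributes.
city_id = add(new_uform(name="Pittsburgh", latitude=40.44, longitude=-80.0))
airport_id = add(new_uform(name="Example Airport", latitude=1.0, longitude=2.0))

# An existing u-form can be extended at any time; no schema change is needed.
commons[city_id]["attributes"]["population"] = 313210

print(commons[city_id])
```

Nothing here stops you from adding a population to the airport too; the structure simply does not forbid it.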

So if any data object is just a bundle of attributes and values, how does any piece of software know what to do with this data? This ability is conferred by the system of roles. A role is similar to, but subtly different from, a traditional data schema: it is not a way of saying what structure a piece of information must have. Instead, it is a means of interpreting certain information in predictable ways if the information is present. Once you've defined a u-form for a city, you need to add the interpretation rule that says this u-form represents a populated place. Then the software using the Information Commons will know that the values of the latitude and longitude attributes can be used to plot the data on a map, the value of the name attribute should be placed in a global name index, and so on.
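One way to picture a role is as a function that interprets whatever attributes happen to be present, rather than a schema that rejects a u-form for missing them. The sketch below is an assumption about how this could look, not the real mechanism: the `roles` field, the `ROLES` registry, the function names, and the example identifiers are all hypothetical.

```python
def interpret_populated_place(uform, name_index, map_points):
    """Interpret a u-form as a populated place, using only what is present."""
    attrs = uform["attributes"]
    if "name" in attrs:
        # The name goes into a global name index.
        name_index.setdefault(attrs["name"], []).append(uform["uuid"])
    if "latitude" in attrs and "longitude" in attrs:
        # The location becomes a point that can be plotted on a map.
        map_points.append((attrs["latitude"], attrs["longitude"]))


# Hypothetical registry: role name -> interpretation function.
ROLES = {"populated_place": interpret_populated_place}


def interpret(uform, name_index, map_points):
    """Apply every role attached to the u-form."""
    for role in uform.get("roles", []):
        ROLES[role](uform, name_index, map_points)


name_index, map_points = {}, []
city = {
    "uuid": "example-uuid-1",
    "roles": ["populated_place"],
    "attributes": {"name": "Pittsburgh", "latitude": 40.44, "longitude": -80.0},
}
# A u-form with no location is still accepted; it just is not plotted.
sparse = {
    "uuid": "example-uuid-2",
    "roles": ["populated_place"],
    "attributes": {"name": "Somewhere"},
}
interpret(city, name_index, map_points)
interpret(sparse, name_index, map_points)
print(name_index)
print(map_points)
```

The second u-form shows the difference from a schema: missing attributes do not make the data invalid, they only limit what the role can do with it.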

As users see their information indexed and interpreted - literally appearing on maps before their eyes - the benefits of the Information Commons become very apparent. Instead of waiting for a search engine to come and trawl your website, and then hoping that some user types in the right keywords to find your information, people can find and fuse data in seconds, seeing not just individual webpages but whole datasets described and plotted so that they can interpret the whole picture. This might be the way a population varied over time, the way transportation routes are distributed over a region, or even the relationship between temperature and rainfall in a region over several years.

