Understanding the future of Data: Data 2.0

Okay, so by now, unless you have been living in a cave, you've heard all about Web 2.0, and you've probably used at least one Web 2.0 app. Cool... the world is changing.

Still, it is important to note that Web 2.0 is nothing without data. So far, most of the successful Web 2.0 applications get their data from their users; that is how things like Flickr and Del.icio.us work.

Going forward, more and more Web 2.0 apps will need a lot of existing data BEFORE they can get any users, so they will need to mash up existing sources of data.

Unfortunately, there is currently no purpose-built foundation to support this growing trend of pulling internet data together. So most Web 2.0 apps will be forced to build their own foundation, manually integrating web-services and other sources of data.

I think the future of data lies in creating a virtual database over web-services and other sources of data. Data 2.0 if you will. If we had this virtual database spanning web-services then life would be so much easier for Web 2.0 application developers.

Imagine being able to tell the virtual database what you want (a little like SQL), rather than having to figure out how to get it yourself by writing plumbing code to link one web-service with another.

It would be declarative integration or Nirvana.
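
To make the contrast a bit more concrete, here is a minimal sketch of what "telling the virtual database what you want" might feel like. Everything in it is made up for illustration: `VirtualDatabase`, `photo_service` and `bookmark_service` are hypothetical names, and the two "services" just return canned rows rather than calling anything over the network.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


# Hypothetical stand-ins for two web-services. Real ones would make
# network calls; these return canned rows so the sketch runs offline.
def photo_service() -> List[Row]:
    return [
        {"photo_id": 1, "owner": "alice", "title": "Sunset"},
        {"photo_id": 2, "owner": "bob", "title": "Harbour"},
    ]


def bookmark_service() -> List[Row]:
    return [
        {"user": "alice", "url": "http://example.com/a"},
        {"user": "alice", "url": "http://example.com/b"},
    ]


class VirtualDatabase:
    """Toy 'virtual database': named sources plus one declarative join."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], Iterable[Row]]] = {}

    def register(self, name: str, fetch: Callable[[], Iterable[Row]]) -> None:
        self._sources[name] = fetch

    def join(self, left: str, right: str, left_key: str, right_key: str) -> List[Row]:
        # The caller states WHAT to combine; the plumbing lives here.
        right_rows = list(self._sources[right]())
        return [
            {**l, **r}
            for l in self._sources[left]()
            for r in right_rows
            if l[left_key] == r[right_key]
        ]


db = VirtualDatabase()
db.register("photos", photo_service)
db.register("bookmarks", bookmark_service)

# One declarative request instead of hand-written glue between services.
rows = db.join("photos", "bookmarks", left_key="owner", right_key="user")
```

The point is only the shape of the last line: the application names the sources and the relationship, and the linking code lives in one shared place instead of in every app.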

Right now Base4 provides services for integrating existing databases. But that isn't enough. To truly provide Data 2.0 we need to be able to create a virtual database over web-services and other sources of data. Why? Well, you, 'the Web 2.0 developer', probably don't own all the data you want to use, and the real owner of that data probably isn't going to give you access to their database!

What we need is something to pull other people's data together and make it look like it is ours!

So for the last month or so I've been doing a lot of thinking about pulling data together from all over the web, creating a virtual internet database.

What I see as the real key to creating a virtual internet database or Data 2.0 is upgrading the idea of the foreign key.

A foreign key in a row of a database table is kind of the same thing as a hyperlink in a webpage.

Wow, that is a big conceptual jump, I know, but see if you can give me the benefit of the doubt on this.

So a foreign key is a hyperlink (or URL), but it has one MASSIVE limitation: the foreign key must point to a row in the SAME database. Not much good for the web, I think you will agree!

Continuing with the webpage analogy: a foreign key is like a hyperlink to another page on the same site. Or somewhat more formally, but less accurately, you can think of it as a RELATIVE hyperlink, i.e. something like this: '/Table/Key' rather than something like this: 'http://Server/Database/Table/Key'.

Running blindly with this analogy some more: What we need to create 'a virtual internet database' is the ability to use ABSOLUTE hyperlinks too: i.e. allowing the foreign key to get *really* foreign and point anywhere that data might exist on the internet.
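
As a rough sketch of the relative versus absolute idea, the function below turns either kind of key into a full URL. The name `resolve_key` and the example URLs are my own inventions for illustration, and the rule it applies (a relative key is simply appended to the current database's URL) is an assumption that follows the '/Table/Key' shorthand above rather than strict URL resolution rules.

```python
from urllib.parse import urlparse


def resolve_key(current_database_url: str, foreign_key: str) -> str:
    """Return a full URL for a foreign key (hypothetical helper).

    A key with a scheme and host is treated as ABSOLUTE and returned
    unchanged. Anything else is treated as RELATIVE to the database the
    row lives in, like a classic foreign key.
    """
    parsed = urlparse(foreign_key)
    if parsed.scheme and parsed.netloc:
        return foreign_key  # a *really* foreign key: it can point anywhere
    # Assumption: relative keys hang off the current database URL.
    return current_database_url.rstrip("/") + "/" + foreign_key.lstrip("/")


here = "http://local.example/Database"
local_row = resolve_key(here, "/Customers/42")
remote_row = resolve_key(here, "http://other.example/Database/Customers/42")
```

With that in place the rest of the system only ever sees full URLs, so it no longer matters whether the row is next door or on someone else's server.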

Of course all this is a little academic, because surely there is nothing big enough to run a virtual database server over the top of all the internet's data. Nothing could manage all that data, right?

Well, actually, if you look at the way OR Mappers work, I think you can see the embryo of a solution.

OR Mappers typically provide a mechanism for pulling objects out of databases on demand, and then walking relationships like foreign keys to hydrate other related objects only when they are needed.

To do this most OR Mappers share concepts like EntityProxies, or EntityListProxies, that will hydrate related objects by using information stored in the foreign key.
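
Here is a minimal sketch of that proxy idea. It is not how any particular OR Mapper implements it; the class below just borrows the `EntityProxy` name from above, and `load_customer` and the `CUSTOMERS` table are hypothetical stand-ins for a real database lookup.

```python
from typing import Callable, Dict, List, Optional

Entity = Dict[str, object]


class EntityProxy:
    """Placeholder for a related object; hydrates on first attribute access."""

    def __init__(self, key: object, loader: Callable[[object], Entity]) -> None:
        self._key = key          # the foreign key value
        self._loader = loader    # how to fetch the real entity
        self._entity: Optional[Entity] = None

    @property
    def is_hydrated(self) -> bool:
        return self._entity is not None

    def __getattr__(self, name: str) -> object:
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._entity is None:
            self._entity = self._loader(self._key)  # hydrate lazily, once
        try:
            return self._entity[name]
        except KeyError:
            raise AttributeError(name) from None


# Hypothetical "database" and loader, recording each load for illustration.
CUSTOMERS: Dict[object, Entity] = {42: {"id": 42, "name": "Ada"}}
load_calls: List[object] = []


def load_customer(key: object) -> Entity:
    load_calls.append(key)
    return CUSTOMERS[key]


# The order holds a proxy built from its foreign key, not the customer itself.
order = {"id": 1, "customer": EntityProxy(42, load_customer)}
```

Nothing is loaded when the order is created. The customer is only fetched the first time something like `order["customer"].name` is read, and later reads reuse the hydrated object.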

Interestingly, the application using the OR Mapper is acting in some sense as an in-memory cache for these objects. You can be sure that as the amount of data or the number of objects grows, it becomes very unlikely that all of the objects are in memory at the same time. Only what you want is in memory. The OR Mapper is essentially just providing virtual memory for objects (as previously discussed by Mats Helander): the database is the disk, and the objects are the memory.

The question is: could we do the same with the internet's data? What if, rather than hydrating objects using RELATIVE foreign keys, you could hydrate ROWS or TUPLES from a hyperlink or URL? Hydrating only what you want, when you want it. You would be able to access a ROW or TUPLE from any URL, but you would only ever need a fraction of all the ROWS or TUPLES in memory at once.
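
To sketch what "only a fraction in memory at once" could look like, here is a small cache that hydrates tuples by URL and evicts the least recently used one when it is full. The `TupleCache` name, the `fetch_row` function and the example URLs are all hypothetical; a real `fetch_row` would go out to the web, whereas this one reads from a dictionary so the sketch runs on its own.

```python
from collections import OrderedDict
from typing import Callable, Dict, List, Tuple

Row = Tuple[object, ...]


class TupleCache:
    """Hydrates rows by URL, keeping at most `capacity` of them in memory."""

    def __init__(self, fetch: Callable[[str], Row], capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._fetch = fetch
        self._capacity = capacity
        self._rows: "OrderedDict[str, Row]" = OrderedDict()

    def get(self, url: str) -> Row:
        if url in self._rows:
            self._rows.move_to_end(url)     # mark as most recently used
            return self._rows[url]
        row = self._fetch(url)              # hydrate on demand
        self._rows[url] = row
        if len(self._rows) > self._capacity:
            self._rows.popitem(last=False)  # evict least recently used
        return row

    def __contains__(self, url: str) -> bool:
        return url in self._rows

    def __len__(self) -> int:
        return len(self._rows)


# Hypothetical "internet": a few rows addressable by absolute URL.
FAKE_WEB: Dict[str, Row] = {
    "http://a.example/Database/Customers/1": (1, "Ada"),
    "http://b.example/Database/Orders/7": (7, "Widget"),
    "http://c.example/Database/Products/3": (3, "Gadget"),
}
fetched: List[str] = []


def fetch_row(url: str) -> Row:
    fetched.append(url)
    return FAKE_WEB[url]


cache = TupleCache(fetch_row, capacity=2)
```

The application can ask for any URL it likes, but memory use stays bounded by `capacity`, which is the same trade that virtual memory makes between RAM and disk.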

I get the feeling that using OR Mapping principles you could create a unified data platform for Web 2.0 -> Data 2.0. Now that would be really powerful, unlocking scenarios we can only dream of today.

So perhaps the legacy of OR Mappers will be that they laid the groundwork for Data 2.0?

Keen to hear your thoughts... email me or contact me here.

Alex James
15-06-2006