Cluster computing commoditization

Posted on Wed 02 April 2008
I came across an interesting report on the first Hadoop Summit, which happened last week. Hadoop (an open-source implementation of Google's distributed MapReduce infrastructure) is picking up a lot of steam, and higher-level open source projects are now emerging on top of it that I wouldn't even have dreamed of a few years ago.

This is a perfect example of commoditization at work, and in fact of multiple commoditization trends nurturing each other: the physical cluster infrastructure is being commoditized by players like Amazon with EC2 (now with static IPs!). Setting up your own cluster, which previously required careful planning, good sysadmins and a lot of money, is now a few mouse clicks away, and Hadoop provides the software infrastructure to easily harness this computing power for complex processing of large datasets.
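To make the model concrete, here is a minimal sketch of the map/shuffle/reduce phases that Hadoop distributes across a cluster, simulated locally in plain Python on a toy word-count problem (the function and variable names are my own illustration, not Hadoop's API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key; here, sum the counts."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"], counts["fox"])  # prints "3 2"
```

The point of the model is that map and reduce are side-effect-free per key, so the framework is free to run them on thousands of machines and merge the results; that is exactly the contract Hadoop asks your job to honor.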

And now that the basic building blocks are in place, people are looking at the higher levels: Mahout builds the machine learning tools (classification, clustering, etc.) that are so useful when you have to process large sets of user profiles and activity logs to bring value to a social network website, HBase stores huge amounts of data on a cluster, and high-level query languages like Pig or Jaql are appearing. What will be the next level to be addressed?

When I introduced Hadoop at Joost more than 18 months ago to process the user activity logs, it was still very young and a bit shaky. It's nice to see it maturing quickly and getting so much interest.

For sure, not every project needs this, but when you're building a social website you have to consider this kind of tool to cope with the explosive growth you expect to see. That growth doesn't always happen, of course, but if it does you have to be prepared for it!
