Monday, May 18, 2009

The datacenter is the new mainframe

For a while, I have been planning to write a post comparing large-scale clusters with the mainframes of yore, a piece full of colorful references to timesharing, scheduling and renting compute resources, and other tales that would date me as the fossil that I am.

Fortunately, Googlers Luiz André Barroso and Urs Hölzle recently wrote a fantastic long paper, "The Datacenter as a Computer" (PDF), that not only spares me the task, but covers the topic with far more data and insight than I ever could.

Some excerpts:
New large datacenters ... cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in [our] facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment.

In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC).

Much like an operating system layer is needed to manage resources and provide basic services in a single computer, a system composed of thousands of computers, networking, and storage also requires a layer of software that provides an analogous functionality at this larger scale.

[For example] resource management ... controls the mapping of user tasks to hardware resources, enforces priorities and quotas, and provides basic task management services. Nearly every large-scale distributed application needs ... reliable distributed storage, message passing, and cluster-level synchronization.

The paper goes on to describe the challenges of making an entire datacenter behave, for applications and programmers, like a single large compute resource, including a discussion of existing application frameworks and the need for further tools.
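
To make the cluster-OS analogy concrete, here is a minimal sketch in Python of the kind of resource manager the excerpt describes: mapping tasks to machines while enforcing per-user quotas and task priorities. All names and numbers are hypothetical; the paper describes the problem, not this implementation.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                     # lower number = more important
    name: str = field(compare=False)
    cpus: int = field(compare=False)
    user: str = field(compare=False)

class ClusterScheduler:
    """Toy cluster-level resource manager: maps tasks to machines,
    enforcing per-user CPU quotas and dispatching by priority."""

    def __init__(self, machines, quotas):
        self.free = dict(machines)            # machine -> free CPUs
        self.quotas = dict(quotas)            # user -> max CPUs in use
        self.in_use = {u: 0 for u in quotas}  # user -> CPUs in use
        self.pending = []                     # min-heap ordered by priority

    def submit(self, task):
        heapq.heappush(self.pending, task)
        self.dispatch()

    def dispatch(self):
        deferred = []
        while self.pending:
            task = heapq.heappop(self.pending)
            # Quota enforcement: over-quota users wait, whatever the priority.
            if self.in_use[task.user] + task.cpus > self.quotas[task.user]:
                deferred.append(task)
                continue
            # First-fit mapping of the task onto a machine with spare CPUs.
            machine = next((m for m, cpus in self.free.items()
                            if cpus >= task.cpus), None)
            if machine is None:
                deferred.append(task)
                continue
            self.free[machine] -= task.cpus
            self.in_use[task.user] += task.cpus
            print(f"placed {task.name} on {machine}")
        for task in deferred:
            heapq.heappush(self.pending, task)

sched = ClusterScheduler(machines={"rack1-m1": 8, "rack1-m2": 8},
                         quotas={"alice": 8, "bob": 4})
sched.submit(Task(priority=0, name="web-frontend", cpus=4, user="alice"))
sched.submit(Task(priority=2, name="batch-index", cpus=8, user="bob"))  # over quota, waits
```

A real cluster manager also handles failures, preemption, and bin-packing far better than this first-fit loop, but the shape is the same: one software layer owning the map from tasks to hardware.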

Do not miss Figure 1.3, which shows latency, bandwidth, and capacity to resources in the datacenter. It includes an insightful look at the latency of local and remote memory, and at the nearly equivalent latencies but drastically different capacities of local and remote disk, for programs running in the cluster. As the authors say, a key challenge is to expose these differences when they matter and hide them when they don't.
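
The figure's point lends itself to a tiny sketch. The numbers below are my own rough orders of magnitude, not the figure's exact values, and the placement function is hypothetical; it just illustrates the latency-versus-capacity tradeoff a cluster's storage layer makes on a program's behalf.

```python
# Rough, illustrative orders of magnitude; not the paper's exact figures.
STORAGE_TIERS = [
    # (tier name,        latency_us,  capacity_gb)
    ("local DRAM",              0.1,          16),
    ("rack-remote DRAM",      100.0,       1_000),
    ("local disk",         10_000.0,       2_000),
    ("rack-remote disk",   11_000.0,     160_000),
]

def place(dataset_gb):
    """Pick the lowest-latency tier with enough capacity."""
    for name, latency_us, capacity_gb in sorted(STORAGE_TIERS,
                                                key=lambda t: t[1]):
        if dataset_gb <= capacity_gb:
            return name, latency_us
    raise ValueError("dataset larger than any single tier")

print(place(8))        # fits in local DRAM: ~0.1 us access
print(place(50_000))   # only rack-remote disk is big enough: ~11 ms
```

Note the disk rows: going over the network costs almost nothing relative to the seek, while the capacity jumps by two orders of magnitude. That is exactly the difference worth exposing to some programs and hiding from others.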

There are also thought-provoking tidbits on minimizing interference between jobs on the cluster and maximizing utilization, two goals that are often at odds with each other.

Much of the rest of the paper covers the cost and efficiency of datacenters more generally, nicely and concisely summarizing much of the recent publicly known work on power, cooling, and infrastructure.

One small thing it does not mention is the ability to rent out resources on a WSC (e.g., as EC2 does), much like buying time on the old mainframes, and the impact that could have on utilization, especially once variable pricing and priorities are allowed.
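
For instance, spare WSC capacity could be auctioned off. The sketch below is a toy uniform-price auction, entirely hypothetical and not Amazon's actual mechanism: it admits the highest bidders until spare capacity runs out and charges them all the lowest admitted bid.

```python
def allocate_spare_capacity(bids, capacity):
    """bids: list of (bid_per_cpu_hour, job_name, cpus).
    Admit the highest bidders until spare capacity is exhausted and
    charge all of them the lowest admitted bid (a uniform-price
    auction). A toy mechanism for illustration only."""
    admitted, used = [], 0
    for bid, job, cpus in sorted(bids, reverse=True):
        if used + cpus <= capacity:
            admitted.append((job, bid, cpus))
            used += cpus
    if not admitted:
        return [], 0.0
    price = min(bid for _, bid, _ in admitted)  # market-clearing price
    return [(job, cpus) for job, _, cpus in admitted], price

jobs, price = allocate_spare_capacity(
    [(0.12, "batch-index", 40), (0.03, "log-scrub", 30), (0.08, "ml-train", 20)],
    capacity=80)
print(jobs, price)   # batch-index and ml-train run; both pay $0.08/CPU-hour
```

The appeal for utilization is that low-priority batch work soaks up idle machines whenever its bid clears, and gets pushed off as demand (and the clearing price) rises.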

[Google paper found via James Hamilton]

Update: A couple weeks later, security guru Bruce Schneier writes, "Cloud computing is nothing new. It's the modern version of the timesharing model from the 1960s."

Update: Seven months later, Amazon launches variable pricing for low priority jobs on EC2.
