Now that Web "cloud" computing and data storage are available through Amazon, Sun Microsystems, and IBM, is it time for high-energy physicists to ditch their traditional, custom-built computing networks in favor of commercial services?
A new study looks at this question in detail for perhaps the first time. The conclusion: Not yet. In their paper, the researchers outline a number of things that would have to change before Amazon's S3 data storage and EC2 computing services could meet the sophisticated, data-heavy needs of physicists.
The researchers traced 27 months' worth of data usage by DZero, one of two experiments at Fermilab's Tevatron accelerator, to see how physicists actually handle and crunch data. The study analyzed 113,062 DZero jobs executed between January 2003 and March 2005. These involved nearly a million hours of computation and processed more than 5.2 million gigabytes of data.
The study tested the reliability and accessibility of Amazon's Simple Storage Service (S3) and Elastic Compute Cloud (EC2) from five public Internet nodes in the US and Europe.
The authors are Mayur Palankar and Adriana Iamnitchi of the University of South Florida, Matei Ripeanu of the University of British Columbia, and Simson Garfinkel of Harvard. The study will be presented at the Data-Aware Distributed Computing Workshop being held in late June in Boston.
The commercial utility computing services work like this: The service provider buys the computers, servers, and other equipment, and hires the people needed to keep it all running; customers pay only for the amount of storage space or computing time they actually use. Amazon Web Services makes three promises: it will never lose your data, you will always have access to it, and that access will be fast.
But when it comes to computing, high-energy physics is a tough customer. It was an early adopter of the Grid--the idea that computing could be spread among many computer networks in widely scattered places and function as a utility, with users submitting jobs that could run anywhere. (See our story on Grid computing and physics in the November 2005 symmetry.) Further, physicists have a tradition of custom-building their own computer networks to meet the rigorous demands of experiments, which may involve hundreds of collaborators in dozens of countries. And they have a unique way of approaching data, says Gabriele Garzoglio, head of the Open Science Grid group at Fermilab. Unlike the Internet as a whole, where huge numbers of people try to access a few very popular files, physicists access and analyze the same data over and over again. In fact, they may consume seven or eight times as much data as they store.
The amount of data those experiments churn out is staggering--and rapidly increasing. DZero, for instance, has processed about 45 petabytes of data--45 million gigabytes--since it began in 1999; nearly half of that, 20 petabytes, came in the past year alone.
"The most valuable thing experiments have is their data," Garzoglio says. "The amount of data consumed grows almost exponentially with time. We haven't seen any business-provided external system that can handle this amount of data. For now, we like the idea of having the situation in our hands."
Amazon's S3 service stores data in "buckets," which are essentially folders that hold an unlimited number of "data objects" of up to 5 gigabytes each. Storage costs 15 cents per gigabyte per month in the US--slightly more in Europe--and uploads and downloads cost between 10 and 18 cents per gigabyte. Computing time on the EC2 system costs 10 cents per CPU-hour, and there is no bandwidth charge for moving data between EC2 and S3, so processing data on EC2 could potentially save money.
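For a concrete picture of that bucket-and-object model, here is a minimal sketch using the boto3 Python library; the bucket and file names are hypothetical examples, and the study itself worked against Amazon's own interfaces rather than this library.

```python
# Minimal sketch of the S3 bucket/object model using the boto3 library.
# Bucket and key names are hypothetical examples, not taken from the study.
import boto3

s3 = boto3.client("s3")

# A bucket is essentially a folder holding an unlimited number of objects.
s3.create_bucket(Bucket="dzero-demo-bucket")

# Each stored object can be up to 5 gigabytes (the limit cited above).
with open("event_data.raw", "rb") as f:
    s3.put_object(Bucket="dzero-demo-bucket", Key="run42/event_data.raw", Body=f)

# Retrieval is always by bucket plus key; transfers are billed per gigabyte.
obj = s3.get_object(Bucket="dzero-demo-bucket", Key="run42/event_data.raw")
payload = obj["Body"].read()
```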
The system appears to be as reliable as advertised, Iamnitchi said. No data was permanently lost during 12 months of experimental transfers in and out of Amazon's S3 storage, although the researchers note that the study spanned too short a period to make an adequate assessment.
It appears that the Amazon services would cost significantly more than the system now in place for DZero, Iamnitchi said. But because it's difficult to separate DZero's costs from the rest of the lab's computing budget, the study was not able to quantify that.
Handling the amount of data produced by DZero through Amazon would cost $691,000 per year for storage and $335,000 for transfer, for a total of about $1 million per year at US rates, the study found. It outlined possible ways to reduce this cost. For instance, only half of the data files the experiment generated were still being actively accessed after one month, and only 35 percent after five months; the rest could be archived in some cheaper form of storage. Or Amazon could separate the three performance characteristics it offers and charge users only for the ones they want; the person who values durable storage over instant availability should be able to pay less for an agreement in which their data is available only three weeks out of the month.
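To see how the per-gigabyte rates scale to figures of that size, here is a back-of-envelope calculation in Python; the data volumes are illustrative assumptions chosen only to land near the article's numbers, not figures taken from the study.

```python
# Back-of-envelope S3 cost model at the US rates quoted above.
# The stored/transferred volumes below are illustrative assumptions, chosen
# only to come out near the article's ~$691,000 and ~$335,000 figures.
STORAGE_RATE = 0.15                    # dollars per gigabyte per month
TRANSFER_RATE = 0.18                   # dollars per gigabyte, upper end of the quoted range

stored_gb = 384_000                    # hypothetical steady-state footprint (~0.4 petabytes)
transferred_gb_per_year = 1_860_000    # hypothetical annual transfer volume

storage_per_year = stored_gb * STORAGE_RATE * 12
transfer_per_year = transferred_gb_per_year * TRANSFER_RATE

print(f"Storage:  ${storage_per_year:,.0f} per year")
print(f"Transfer: ${transfer_per_year:,.0f} per year")
print(f"Total:    ${storage_per_year + transfer_per_year:,.0f} per year")
```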
"One significant problem, I think, is the matter of trust," Iamnitchi tells me. The S3 service agreement says users will be compensated for system failures, but "there is no way you can prove to a court, for example, that they lost your data," she says. Further, since your account is linked to your credit card, "if somebody breaks into your account, not only can they steal your data but they can drain your credit card." The money would go to Amazon rather than to the thief, but no matter, Iamnitchi said; the loss is the same.
Another inconvenient aspect of the Amazon service is that it doesn't allow you to search across all your buckets of data; to find something, you have to know which bucket it's in.
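In practice, the best you can do is query each bucket you already know about and filter object keys by prefix, as in this rough boto3 sketch with hypothetical bucket names.

```python
# There is no search across buckets: you must already know which buckets
# exist and query each one, filtering keys by prefix. Names are hypothetical.
import boto3

s3 = boto3.client("s3")
for bucket in ["dzero-run2a", "dzero-run2b"]:
    resp = s3.list_objects_v2(Bucket=bucket, Prefix="run42/")
    for obj in resp.get("Contents", []):
        print(bucket, obj["Key"], obj["Size"])
```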
"Clearly these technologies are very, very young," says Fermilab's Garzoglio, who assisted the researchers but was not a study author. "So while on one hand we want to keep being informed and participating in these studies, we feel it's not the right time at this point to trust these new models. People here know how to develop these technologies and operate theses technologies. Essentially we know what we are doing, and the experiments trust us to safeguard their data."