As the cloud becomes a more popular computing solution in the commercial world, it is starting to pique the interest of the academic research community. Collaborators on the Belle experiment at the High Energy Accelerator Research Organization (KEK) in Japan are considering supplementing their computing with Amazon’s Elastic Compute Cloud (EC2), which provides on-demand, virtual computing resources over the Internet.
The KEKB particle accelerator produces a more intense, densely packed beam than any other currently operating collider, and an upgrade to the machine in 2013 is expected to increase the beam intensity by a factor of 50. The new Belle II detector will come online at the same time, equipped to handle an estimated hundred-fold increase in data, expected to total 40 petabytes per year. Processing that load will require more than 100,000 CPU cores, leading the collaboration to consider new computing options, said Martin Sevior, a KEK collaborator based at the University of Melbourne in Australia.
Sevior and his colleagues ran the complete Belle simulated data analysis chain on EC2 to test it, and found it easy to deploy jobs. They created an Amazon Machine Image (AMI), a customized computing environment containing both the Scientific Linux operating system and the applications of the Belle analysis system. Each instance launched from the AMI provides eight CPU cores, roughly the equivalent of eight PCs, and launching 20 such instances creates a virtual 160-core cluster. To lower costs, the team set up an automation system that launches new instances on demand and shuts them down as need drops.
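The scale-with-demand automation the team describes might look roughly like the sketch below, written here against boto3, the current AWS SDK for Python (the original 2009 work would have used earlier tooling). The AMI ID, instance type, region, and the jobs_waiting argument are illustrative placeholders, not details from the Belle setup.

```python
# Minimal sketch of on-demand cluster scaling on EC2 (illustrative only).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

AMI_ID = "ami-0123456789abcdef0"   # hypothetical custom image: Scientific Linux + Belle software
INSTANCE_TYPE = "c1.xlarge"        # an 8-core instance type, standing in for the "eight PCs"
MAX_INSTANCES = 20                 # cap the virtual cluster at 20 x 8 = 160 cores


def running_worker_ids():
    """Return the IDs of workers launched from our AMI that are still running or starting."""
    resp = ec2.describe_instances(Filters=[
        {"Name": "image-id", "Values": [AMI_ID]},
        {"Name": "instance-state-name", "Values": ["pending", "running"]},
    ])
    return [i["InstanceId"]
            for r in resp["Reservations"]
            for i in r["Instances"]]


def scale(jobs_waiting: int) -> None:
    """Launch instances while jobs are queued; terminate them all once the queue drains."""
    workers = running_worker_ids()
    if jobs_waiting > 0 and len(workers) < MAX_INSTANCES:
        ec2.run_instances(
            ImageId=AMI_ID,
            InstanceType=INSTANCE_TYPE,
            MinCount=1,
            MaxCount=min(jobs_waiting, MAX_INSTANCES - len(workers)),
        )
    elif jobs_waiting == 0 and workers:
        ec2.terminate_instances(InstanceIds=workers)
```

Driving a loop like this from the job queue lets the virtual cluster grow toward its 20-instance (160-core) cap while work is pending and collapse to zero, and therefore to zero cost, once the queue drains.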
CPU usage in high-energy physics experiments is far from constant, so building a data center sized for peak demand would leave it underutilized for significant periods. Since EC2 allows flexible scaling, Sevior said, a possible solution is a hybrid system in which the cloud covers peak demand and the grid supplies the base-load resources.
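As a rough illustration of that hybrid idea, and not a description of the collaboration's actual scheduler, the dispatch logic could be as simple as filling the grid's free base-load capacity first and bursting everything beyond it to EC2. The job structure and submit callbacks below are hypothetical.

```python
# Hypothetical sketch of hybrid grid/cloud dispatch: the grid carries the base
# load, and jobs beyond its free capacity burst to on-demand cloud instances.

def dispatch(jobs, grid_cores_free, submit_to_grid, submit_to_cloud):
    """Send jobs to the grid while capacity remains; overflow goes to the cloud."""
    for job in jobs:
        cores = job["cores"]          # cores the job needs (illustrative field)
        if cores <= grid_cores_free:
            submit_to_grid(job)       # base load stays on the conventional grid
            grid_cores_free -= cores
        else:
            submit_to_cloud(job)      # peak demand absorbed by EC2
```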
Sevior is concerned about remaining challenges, however. Although EC2 works nicely at the scales tested so far, it may not be feasible at the scales needed by Belle II in 2013. Pricing may need to come down before then, too, if it is to be a competitive option. Finally, he suspects that Belle II’s requirements for data production may exceed Amazon’s capacity to transfer the data back to the Belle II grid.
“EC2 is competitive for jobs that need a lot of resources for a short time, but we have not yet demonstrated it will be useful for the very large-scale needs of Belle II,” Sevior said. “What is clear is that we will need a distributed computing solution very much like a conventional grid and are currently planning to employ gLite for the bulk of our computing. Our investigations are to see to what extent cloud computing can or should supplement this.”
This story first appeared in International Science Grid This Week on May 20, 2009.