The most detailed three-dimensional maps of the universe so far came from the Sloan Digital Sky Survey. Between 2000 and 2010, SDSS collected 20 terabytes of data, photographing one-third of the night sky.
When the Large Synoptic Survey Telescope high in the Chilean Andes becomes fully operational in 2022, its 3.2-gigapixel camera will collect the same amount of data—every night. And it will do so over and over again for ten years.
Back in the days of SDSS, scientists often downloaded data to their own institutions’ computers and ran analyses on their own equipment. That won’t be possible with LSST. “At half an exabyte, people are not going to be able to put this on their laptops,” Yusra AlSayyad, technical manager for the Princeton branch of the data management team, says of the LSST data.
Instead of bringing the data to scientists, LSST will need to bring scientists to the data.
The LSST data management team, consisting of approximately 80 people spread over six sites in the United States, is responsible for turning this deluge into something scientists can access and analyze.
Keeping the instructions clear
LSST has two main objectives: immediate data processing and long-term data aggregation.
In the very short term—in the first 60 seconds after the LSST captures an image, to be precise—the National Center for Supercomputing Applications in Illinois will process the image. It will send alerts to scientists who study supernovas, asteroids and other quickly-changing phenomena if there have been any changes to that portion of the sky when compared to a reference image.
In the long term, LSST will create comprehensive catalogs of the telescope’s observations—both the photographs themselves and tables of data extracted from them—to be published yearly.
LSST’s data management team has people working on both of these objectives.
The sheer magnitude of the data collected, the size of the team, and the number of different people and organizations who will want to access the data all pose challenges to the group. Making sure they have good documentation—human-readable information about what each piece of code is doing and how to use it—is one of them.
“The most popular projects out there have been popular not because their code is implemented the best way—they’re popular because their documentation is the best and easiest,” AlSayyad says.
Documentation is important for both scientists who will use LSST data and for the data management team working on code for use within the LSST project.
The team has a developer guide and regular code reviews to help keep coding practices consistent. Any team member can initiate requests for comments and modifications of policies.
With members of the team spread out geographically, the developer guide helps keep everyone on the same page from a distance. “I don’t get to go down the hall to help them with something,” says Jonathan Sick, a member of the Science Quality and Reliability Engineering team, which is based at NSF’s National Optical-Infrared Astronomy Research Laboratory in Tucson. “I have to spend a lot of time literally documenting how to document.”
A common challenge
Another challenge facing the LSST data management team is deciding what technologies to use. “Whatever you choose, it needs to be supported in the future, and it needs to be widely used in the future,” AlSayyad says.
That is not only to address the needs of scientists wanting to study LSST data when it is collected, but also to address the long-term needs of the profession, she says.
AlSayyad and the other project managers want to make sure early-career members find their time at LSST valuable whether they eventually wind up as astronomy faculty or in data science, programming, or other jobs and to make the platform useful for astronomy students who may be accessing LSST data years from now. “We understand that academia is a pyramid, and not everybody who majors in astronomy as an undergraduate is going to become faculty,” she says.
LSST is making use of the growing availability of cloud computing platforms. The team’s Science Platform, based on the JupyterLab software development environment, will allow anyone to run their code with LSST’s data right from their web browsers—no locally saved data required. The LSST Education and Public Outreach team is also working to make parts of the environment as user-friendly as possible so it can be used in classrooms and for citizen science projects.
It is nearly impossible to predict how technology will change over the course of LSST’s mission, so flexibility is key, says Fritz Mueller, a technical manager based at DOE’s SLAC National Accelerator Laboratory. “We have to be prepared to change and evolve,” he says.
One of the team’s priorities is to make sure that their design decisions do not commit them to a single way of dealing with the data. “You try to keep the individual pieces as flexible and general-purpose as you can, so that if you find you need to reorganize them later, that’s possible,” he says.
Financial support for LSST comes from the US National Science Foundation, the US Department of Energy and private funding raised by the LSST Corporation.
LSST funding requires that all LSST software be open source, meaning that anyone can freely use and modify the code. A major goal of the data management group is to deliver software that is as flexible as possible, allowing scientists to adapt the software easily for new types of analyses that were not built in from the beginning.
“The whole mission of a survey telescope is that you’re not necessarily making the discoveries yourself,” says Nate Lust, a member of the team in Princeton. “The LSST project is a community tool for all scientists to make discoveries. The same is true for all of its software.”