With open data, scientists share their work

It could be said that astronomy, one of the oldest sciences, was one of the first fields to have open data. The open records of Chinese astronomers from 1054 A.D. allowed astronomer Carl Otto Lampland to identify the Crab Nebula as the remnant of a supernova in 1921. In 1705 Edmond Halley used the previous observations of Johannes Kepler and Petrus Apianus, who did their work before Halley was old enough to use a telescope, to deduce the orbit of his eponymous comet.

In science, making data open means making the observations or other information collected in a scientific study available, free of charge, so that other researchers can examine them for themselves, either to verify them or to conduct new analyses.

Scientists continue to use open data to make new discoveries today. In 2010, a team of scientists led by Professor Doug Finkbeiner at Harvard University found vast gamma-ray bubbles above and below the Milky Way. The accomplishment was compared to the discovery of a new continent on Earth. The scientists didn’t find the bubbles by making their own observations; they did it by analyzing publicly available data from the Fermi Gamma-ray Space Telescope.

“Open data often can be used to answer other kinds of questions that the people who collected the data either weren’t interested in asking, or they just never thought to ask,” says Kyle Cranmer, a professor at New York University. By making scientific data available, “you’re enabling a lot of new science by the community to go forward in a more efficient and powerful way.”

Cranmer is a member of ATLAS, one of the two general-purpose experiments that, among other things, co-discovered the Higgs boson at the Large Hadron Collider at CERN. He and other CERN researchers recently published a letter in Nature Physics titled “Open is not enough,” which shares lessons learned about providing open data in high-energy physics. The CERN Open Data Portal, which facilitates public access to datasets from CERN experiments, now contains more than two petabytes of information.

The fields of both particle physics and astrophysics have seen rapid developments in the use and spread of open data, says Ulisses Barres, an astrophysicist at the Brazilian Center for Research in Physics. “Astronomy is going to, in the next decade, increase the amount of data that it produces by a factor of hundreds,” he says. “As the amount of data grows, there is more pressure for increasing our capacity to convert information into knowledge.”

The Square Kilometer Array Telescope, which is being built in Australia and South Africa and is set to turn on in the 2020s, is expected to produce about 600 terabytes of data per year.

Raw data from studies conducted during the site selection process are already available on the SKA website, with a warning that “these files are very large indeed, and before you download them you should check whether your local file system will be able to handle them.”

Barres sees the growth in open data as an opportunity for developing nations to participate in the global science community in new ways. He and a group of fellow astrophysicists helped develop something called the Open Universe Initiative “with the objective of stimulating a dramatic increase in the availability and usability of space science data, extending the potential of scientific discovery to new participants in all parts of the world and empowering global educational services.”

The initiative, proposed by the government of Italy, is currently in the “implementation” phase within the United Nations Office for Outer Space Affairs.

“I think that data is this proper entry point for science development in places that don’t have much science developed yet,” Barres says. “Because it’s there, it’s available, there is much more data than we can properly analyze.”

There are barriers to implementing open data. One is the concept of ownership: a lab might not want to release data that it could use for another project, or it might worry about proper credit and attribution. Another is the natural human fear of being accused of being wrong or of having one’s data used irresponsibly.

But one of the biggest barriers, according to physics professor Jesse Thaler of MIT, is making the data understandable. “From the user perspective, every single aspect of using public data is challenging,” Thaler says.

Think of a high school student’s chemistry lab notebook. A student might mark certain measurements in her data table with a star, to remind herself that she used a different instrument to take those measurements. Or she may use acronyms to name different samples. Unless she writes these schemes down, another student wouldn’t know the star’s significance and wouldn’t be able to know what the samples were.
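One common remedy is to publish a "data dictionary" alongside the data, a small table that writes those private schemes down. The Python sketch below illustrates the idea using the notebook analogy; the sample names, the star flag, the instruments and the helper functions are all hypothetical, invented for illustration, and are not drawn from any real dataset or portal.

```python
# Hypothetical lab measurements: "sample" and "flag" use private shorthand.
measurements = [
    {"sample": "SA", "mass_g": 2.31, "flag": "*"},
    {"sample": "SB", "mass_g": 4.07, "flag": ""},
]

# The data dictionary records what each shorthand code means,
# so that someone other than the original author can read the table.
data_dictionary = {
    "sample": {"SA": "sodium acetate", "SB": "sodium bicarbonate"},
    "flag": {
        "*": "measured with the backup balance",
        "": "measured with the main balance",
    },
}


def describe(row, dictionary):
    """Return a copy of row with each documented code replaced by its meaning."""
    described = {}
    for column, value in row.items():
        codes = dictionary.get(column)
        # Columns without an entry (such as mass_g) are passed through unchanged.
        described[column] = codes.get(value, value) if codes is not None else value
    return described


def undocumented_codes(rows, dictionary):
    """List (column, value) pairs in coded columns that the dictionary does not explain."""
    missing = set()
    for row in rows:
        for column, value in row.items():
            if column in dictionary and value not in dictionary[column]:
                missing.add((column, value))
    return sorted(missing)


readable = [describe(row, data_dictionary) for row in measurements]
```

Running a check like `undocumented_codes` before releasing a table is one way to catch shorthand that never got written down, though real scientific datasets have far more structure than this toy example.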

This has been a challenge for the CERN Open Data Portal, Cranmer says. “It’s very well curated, but it’s hard to use, because the data has got a lot of structure to it. It’s very complicated. You have to put additional effort to make it more usable.”

And for a lot of scientists already working to manage gigantic projects, doing extra work to make their data usable to outside groups, well, “that’s just not mission critical,” Cranmer says. But Thaler adds that the CMS experiment has been very responsive to the needs of outside users.

“Figuring out how to release data is challenging because you want to provide as much relevant information to outside users as possible,” Thaler says. “But it’s often not obvious, until outside users actually get their hands on the data, what information is relevant.”

Still, there are many examples of open data benefiting astrophysics and particle physics. Members of the wider scientific community have discovered exoplanets through public data from the Kepler Space Telescope. When the Gaia spacecraft mapped the positions of 1.7 billion stars and released them as open data, scientists flocked to hackathons hosted by the Flatiron Institute to interpret the data and produced about 20 papers’ worth of research.

Open data policies have allowed for more accountability. The physics community was able to thoroughly check data from the first black hole collisions detected by LIGO and question a proposed dark-matter signal from the DAMA/LIBRA experiment.

Open data has also allowed for new collaborations and has nourished existing ones. Thaler, who is a theorist, says the dialogue between experimentalists and theorists has always been strong, but “open data is an opportunity to accelerate that conversation.”

For Cari Cesarotti, a graduate student who uses CMS Open Data for research in particle physics theory at Harvard, one of the most important benefits of open data is how it maximizes the scientific value of data that experimentalists have to work very hard to obtain.

“Colliders are really expensive and quite laborious to build and test,” she says. “So the more that we can squeeze out utility using the tools that we already have—to me, that’s the right thing to do, to try to get as much mileage as we possibly can out of the data set.”