By Melissa Haendel & Nicole Vasilevsky
In this update to the PLOS Open Data Collection we highlight a few of our favorite picks from the last year. In our earlier treatise on this subject we considered several criteria, which we’ve reused to select new additions to this collection: What is important for data sharing? The impact on policy change? Highlighting ethical issues? Data science that advances our abilities to share or vice versa? Technologies that leverage shared data (the noble discipline of “data parasitism”)? Community-focused efforts that implore the world to change? How sexy the figures are? And then of course, there is simply what do people think – how much is the article being discussed? Finally, we felt that we must consider disciplinary perspectives that foster cross-pollination of ideas and approaches. We also broke down and included one of our own papers as it has been so highly discussed and because only our colleague Julie McMurry could take the exceptionally nerdy topic of persistent and unique identifier provisioning and management and make beautiful illustrations about it.
“Who shares genetic data?”
With the explosion of direct-to-consumer (DTC) products coming on the market, and increased usage, consumers now have the opportunity to share their own genetic data with databases such as OpenSNP. The study “Open sharing of genomic data: Who does it and why?” evaluates demographics and motivations for data sharing via DTC genetic testing. The majority of respondents chose to be tested not for health reasons, but because they wanted to learn about themselves and to contribute to science. Interestingly, age distribution was broad, there was no gender divide, and educational background was varied – characteristics that are not consistent with other research suggesting that individuals purchasing DTC genetic and genomic tests are highly educated, middle-aged users. It seems our population is just more curious than one might have expected!
Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data in PLOS Biology was a community treatise describing how our biological knowledge is fundamentally reliant on the quality and management of the identifiers that identify everything from an organism, to a gene, to a dataset, to a transcript, to a Gene Ontology term. A companion blog post, “Bad Identifiers are the Potholes of the Information Superhighway”, aims to explain to the less nerdy what this means for the average researcher – consumers who, when everything works well (with identifiers), should not notice anything awry.
“Rules to navigate a thorny (ethical) bramble”
We liked Ten simple rules for responsible big data research because as ethics teachers, we find that the issues relating to unintended consequences of information are some of the most thorny. This manuscript implores researchers to recognize that all data is essentially about people – and that improper use or distribution can cause harm. The examples are extensive and hit home, such as “categorization based on zip codes resulted in less access to Amazon Prime same-day delivery service for African-Americans in United States cities” and “Google’s reverse image search can connect previously separate personal activities—such as dating and professional profiles—in unanticipated ways.” The authors also point out that some practices, such as “marketing based on search patterns” can have a certain creepiness factor, even if no harm is done and there has been no security breach. I also really liked some of the final rules that speak to my own ethics mantra, “ethics is a team sport,” – debate, codes of conduct, auditing, and engagement all speak to best practice in generating and using big data responsibly and ethically.
“We’re on to you”
In Wide-Open: Accelerating public data release by automating detection of overdue datasets, the authors used text mining (regex for standard identifiers!) to identify datasets that had not been publicly shared when they should have been, given the publication of a corresponding manuscript. Since data sharing is often required by funding agencies within a certain time frame, an overdue dataset constitutes a violation of those policies. The authors identified 473 datasets that were overdue for public release, and as a result, got data stewards to release 400 datasets in one week. This paper was interesting because it turned what had been a manual nagging process into an automated one. However, if authors released their datasets at the time of their contribution, we would not need such a wonderful tool!
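For the curious, the "regex for standard identifiers" idea can be sketched in a few lines of Python. This is only an illustration, not the Wide-Open authors' actual pipeline: it assumes the identifiers of interest look like GEO series accessions ("GSE" followed by digits), the accession numbers in the sample text are made up, and the helper names `find_accessions` and `overdue` are hypothetical. A real tool would also need to query the repository to learn which datasets are already public.

```python
import re

# Assumption: GEO series accessions look like "GSE" followed by digits.
GEO_SERIES = re.compile(r"\bGSE\d+\b")


def find_accessions(text):
    """Return unique accession-like strings in order of first appearance."""
    seen = []
    for match in GEO_SERIES.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def overdue(accessions, public_ids):
    """Return accessions cited in a paper that are not in the public set."""
    return [acc for acc in accessions if acc not in public_ids]


# Made-up manuscript text and a made-up set of already-public datasets.
sample = "Data are available under GSE12345 and GSE67890 (see also GSE12345)."
cited = find_accessions(sample)
print(cited)
print(overdue(cited, public_ids={"GSE12345"}))
```

Anything cited in a published manuscript but missing from the public set is a candidate for a (now automated) nag.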
As there are increasing demands for science to be more open, reproducible, transparent, and frankly efficient in this modern world, it is so rewarding to see our scientific colleagues embracing so many ways to make open science the norm.
About the Authors
Melissa Haendel is trained as a developmental neuroscientist and is deeply invested in making every graduate student’s data count.
Nicole Vasilevsky aims to improve all research by educating the scientific bucket brigade in open data practices.
Both are faculty at Oregon Health & Science University.