Published on March 6, 2014
...how licensing can change the way we do research Nicole Nogoy VUW, 7 March 2014 Open-Review Open-Source Open-Access Open-Data
Journal, data-platform and database for large-scale data in conjunction with Editor-in-Chief: Laurie Goodman Executive Editor: Scott Edmunds Commissioning Editor: Nicole Nogoy Lead Curator: Chris Hunter Data Platform: Peter Li Data Scientist: Rob Davidson www.gigasciencejournal.com
Open-Review Open-Source Open-Access Open-Data
Why? How? What can be achieved?
Take home message: It's all about the re-use. To do this everything needs to be free and accessible to be read by humans & machines* * See: http://www.biomedcentral.com/about/datamining
Era of Data-Driven Science Big Potential: Using networking power of the internet to tackle problems Can ask new questions & find patterns & connections hidden in others' data Build on each other's efforts quicker & more efficiently Harness wisdom of the crowds: crowdsourcing, citizen science Big Challenges: cultural and technical Removing silos and putting in the commons Usability: interoperable standards/formats for humans/machines
Good for a field: Genomics/Bioinformatics Long term sharing infrastructure: Strong use of standards/policies: Plummeting cost/explosion in volumes:
Sharing aids specific communities… Rice v Wheat: consequences of publicly available genome data. [Figure: number of papers for rice vs wheat; y-axis "Papers", 0 to 700]
Sharing aids authors… Sharing Detailed Research Data Is Associated with Increased Citation Rate. Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308 Every 10 datasets collected contribute to at least 4 papers in the following 3 years. Piwowar HA, Vision TJ, & Whitlock MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285-285. DOI: 10.1038/473285a
Established in 1995
We’re not laughing now
Problem: growing replication gap Out of 18 microarray papers, results from 10 could not be reproduced 1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14 2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Growing Issue: increasing number of retractions >15X increase in last decade Strong correlation of “retraction index” with higher impact factor At the current rate of increase, by 2045 as many papers will be retracted as published! 1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html 2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
Reasons • Data not available • From the start – Lost over time • Software not available • From the start – Lost over time • Lack of standards • None established – Not followed • Unclear methods • Missing information • Honest errors • Pure and simple data fabrication
Impact Wasted Time Wasted money **Delayed ‘payoff’ to the community** *** Distrust of Scientists and science***
How a New Hope in Cancer Fell Apart - NYTimes.com http://www.nytimes.com/2011/07/08/health/research/08genes.h... July 7, 2011 How Bright Promise in Cancer Testing Fell Apart By GINA KOLATA When Juliet Jacobs found out she had lung cancer, she was terrified, but realized that her hope lay in getting the best treatment medicine could offer. So she got a second opinion, then a third. In February of 2010, she ended up at Duke University, where she entered a research study whose promise seemed stunning. Doctors would assess her tumor cells, looking for gene patterns that would determine which drugs would best attack her particular cancer. She would not waste precious time with ineffective drugs or trial-and-error treatment. The Duke program — considered a breakthrough at the time — was the first fruit of the new genomics, a way of letting a cancer cell’s own genes reveal the cancer’s weaknesses. But the research at Duke turned out to be wrong. Its gene-based tests proved worthless, and the research behind them was discredited. Ms. Jacobs died a few months after treatment, and her husband and other patients’ relatives have retained lawyers. The episode is a stark illustration of serious problems in a field in which the medical community has placed great hope: using patterns from large groups of genes or other molecules to improve the detection and treatment of cancer. Companies have been formed and products have been introduced that claim to use genetics in this way, but assertions have turned out to be unfounded.
While researchers agree there is great promise in this science, it has yet to yield many reliable methods for diagnosing cancer or identifying the best treatment. Instead, as patients and their doctors try to make critical decisions about serious illnesses, they may be getting worthless information that is based on bad science. The scientific world is concerned enough that two prominent groups, the National Cancer Institute and the Institute of Medicine, have begun examining the Duke case; they hope to find new ways to evaluate claims based on emerging and complex analyses of patterns of genes and other molecules.
GigaSolution: deconstructing the paper Provide infrastructure and mechanisms of reward for: • Data availability • Metadata/curation Metadata • Analyses Interoperability Methods • Availability of workflows • Transparent analyses Data
GigaSolution: deconstructing the paper Combines and integrates: Open-access journal Data Publishing Platform Data Analysis Platform Utilizes big-data infrastructure and expertise from: World's largest genomics organisation with: 20PB storage, 20.5K cores, 212TFlops, >1000 bioinformaticians www.gigadb.org www.gigasciencejournal.com
Why/what/how? Where does licensing fit? Open-Access
Importance of licensing: ability to mine & reuse content Budapest Open Access Initiative: “By “open access” to [peer-reviewed research literature], we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.” Needs to be: = NC, ND put unnecessary restrictions and are not counted as “true OA” = CC0 better than CC-BY for datasets to prevent “attribution stacking”
Importance of licensing: ability to mine & reuse content = • Gives authors control over the integrity of their work and the right to be properly acknowledged and cited. • Does not grant publicity rights, and attribution can be used to clearly disclaim endorsement • Restrictions rarely benefit the author, and inhibit reuse: they prevent translations, cause incompatibility issues when mixing other licenses (some combinations are illegal, e.g. CC-NC-SA & CC-BY-SA), hinder non-profits and mixed collaborations, are practically unenforceable, and dealing with requests is more trouble than it's worth. Use of non-CC-BY by publishers = “double dipping” (selling content, reprints, etc.) Further reading: http://www.nature.com/nature/journal/v495/n7442/full/495440a.html http://blogs.ch.cam.ac.uk/pmr/2011/11/29/scientists-should-never-use-cc-nc-this-explains-why/
Open-Data Data Publishing Why/what/how?
New incentives/credit Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.” Nature Biotechnology 27, 579 (2009) Prepublication data sharing (Toronto International Data Release Workshop) “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.” Nature 461, 168-170 (2009)
New incentives/credit = Data Citation? “increase acceptance of research data as legitimate, citable contributions to the scholarly record”. “data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”. http://www.force11.org/datacitation
Anatomy of a Publication Idea Study Metadata Data Analysis Answer
Anatomy of a Data Publication Idea Study Metadata Data Analysis Answer
GigaScience Data Publishing Platform Currently 60 datasets & almost 50TB data
• TBs of data from: BGI, ACRG, G10K • Provide curation & integration with other DBs
Many data types…
BGI Datasets Get DOIs
Key: Released pre-publication; Paper published in GigaScience
Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Schistosoma; Silkworm; Parasitic nematode; Pacific oyster
Human: Asian individual (YH) (DNA methylome, Genome assembly v1+2, Transcriptome); Cancer (14TB); Single cell bladder cancer; HBV infected exomes; Ancient DNA (Saqqaq Eskimo, Aboriginal Australian)
Vertebrates: Darwin’s finch; Giant panda; Macaque (Chinese rhesus, Crab-eating); Mini-pig; Naked mole rat; Parrot, Puerto Rican; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; DA and F344 rats; Sheep; Tibetan antelope
Microbe/metagenomics: E. coli O104:H4 TY-2482; T2D gut metagenome; Bulk pooled insects; T. tengcongensis proteome
Cell-Lines: Chinese Hamster Ovary; Mouse methylomes; Cancer quantitative proteomics
Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum; Wheat A+B
Other: fMRI
Reward better handling of metadata… Novel tools/formats for data interoperability/handling. Cloud solutions?
Reward better handling of metadata… Novel tools/formats for data interoperability/handling. Cloud solutions? BMC Research Awards 2013 Winner of open data award
Open-Source Why/what/how? The new way of doing science?
Open-Source: the source of it all Software community understands benefits • Transparent, fast, collaborative • Long history, large community • Many licenses • Many repositories • Many users/platforms
New & more transparent peer-review: Pre-publication: pre-prints
New & more transparent peer-review: During-publication: open-review BMC Series Medical Journals
New & more transparent peer-review: Post-publication review Open content lets you do interesting things post-publication: New pub models: Comments, blogs, online journal clubs Altmetrics:
Open-Data Data Publishing
Our first DOI: To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as: Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001 To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
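The citation above pairs a DOI (doi:10.5524/100001) with its resolver link (http://dx.doi.org/10.5524/100001). As a minimal sketch of how that mapping could be scripted for the dataset DOIs mentioned in these slides, the helper below simply strips an optional "doi:" prefix and prepends the resolver address. The function name `doi_to_url` and the list name `SLIDE_DOIS` are illustrative assumptions, not part of GigaDB or any DOI library.

```python
# Hypothetical helper (not a GigaDB API): builds a resolver link from a DOI,
# following the doi:10.5524/100001 -> http://dx.doi.org/10.5524/100001
# pattern shown on the slide.
def doi_to_url(doi: str, resolver: str = "http://dx.doi.org/") -> str:
    cleaned = doi.strip()
    # Accept both "doi:10.5524/100001" and "10.5524/100001".
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    return resolver + cleaned


# DOIs that appear in this presentation (datasets and one paper).
SLIDE_DOIS = [
    "doi:10.5524/100001",
    "10.5524/100038",
    "10.5524/100039",
    "10.5524/100044",
    "10.1186/2047-217X-1-18",
]

if __name__ == "__main__":
    for doi in SLIDE_DOIS:
        print(doi_to_url(doi))
```

Keeping the resolver as a parameter is a design choice for illustration only: it lets the same identifier be pointed at a different resolver address without changing the stored DOI.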
The People's Parrot: Amazona vittata Puerto Rican Parrot Genome Project Rarest parrot, national bird of Puerto Rico Community funded from artworks, fashion shows, crowdfunding… Genome annotated by students in community college as part of bioinformatics education Paper and Data published in GigaScience and GigaDB Taras K Oleksyk, et al., (2012) A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome Sequencing Project Increases Avian Data and Advances Young Researcher Education. GigaScience 2012, 1:14 Steven J. O’Brien. (2012): Genome empowerment for the Puerto Rican parrot – Amazona vittata. GigaScience 2012, 1:13 Oleksyk et al., (2012): Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience. http://dx.doi.org/10.5524/100039
Disseminating new types of data
Open-Source Software Publishing
How are we supporting data reproducibility? Open-Data Open-Paper Data sets DOI:10.5524/100038 78GB CC0 data DOI:10.1186/2047-217X-1-18 ~21,000 accesses Open-Pipelines Open-Workflows Analyses DOI:10.5524/100044 Open-Review 8 reviewers tested data in ftp server & named reports published Open-Code ~21,000 downloads Enabled code to be picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2 Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/
New & more transparent peer-review: The GigaScience way: 8 referees downloaded & tested data, then signed reports
New & more transparent peer-review: The GigaScience way: Real-time open-review = paper in arXiv + blogged reviews
Implement workflows in a community-accepted format Open source Over 36,000 main Galaxy server users Over 1000 papers citing Galaxy use Over 55 Galaxy servers deployed http://galaxyproject.org
GigaGalaxy & Metabolomics Tool list Tool parameterisation Results panel Results panel
Changing the way we publish:
“Regular” Journal “Conscientious” “Deconstructed” Journal Online Journal
Help us make it happen! Give us your data, papers & pipelines* Contact us: firstname.lastname@example.org email@example.com firstname.lastname@example.org * APCs currently FREE until end of December 2014, saving you up to £1,250 – courtesy of BGI www.gigasciencejournal.com
Thanks to: team: Peter Li Chris Hunter Rob Davidson Jesse Si Zhe Scott Edmunds Nicole Nogoy Laurie Goodman Follow us: Our collaborators: Ruibang Luo (BGI/HKU) Shaoguang Liang (BGI-SZ) Tin-Lap Lee (CUHK) Huayen Gao (CUHK) Qiong Luo (HKUST) Senghong Wang (HKUST) Yan Zhou (HKUST) Funding from: CBIIT @gigascience facebook.com/GigaScience blogs.openaccesscentral.com/blogs/gigablog/ www.gigadb.org galaxy.cbiit.cuhk.edu.hk www.gigasciencejournal.com