10 Simple rules for design, provision, and reuse of identifiers for web-based life science data

McMurry, Julie; Blomberg, Niklas; Burdett, Tony; Conte, Nathalie; Dumontier, Michel; Fellows, Donal K.; Gonzalez-Beltran, Alejandra; Gormanns, Philipp; Hastings, Janna; Haendel, Melissa A.; Hermjakob, Henning; Hériché, Jean-Karim; Ison, Jon C.; Jimenez, Rafael C.; Jupp, Simon; Juty, Nick; Laibe, Camille; Le Novère, Nicolas; Malone, James; Martin, Maria J.; McEntyre, Johanna R.; Morris, Chris; Muilu, Juha; Müller, Wolfgang; Mungall, Christopher J.; Rocca-Serra, Philippe; Sansone, Susanna-Assunta; Sariyar, Murat; Snoep, Jacky L.; Stanford, Natalie J.; Swainston, Neil; Washington, Nicole; Williams, Alan R.; Wolstencroft, Katherine; Goble, Carole; Parkinson, Helen

Preprint published in 2015 by Julie McMurry, Niklas Blomberg, Tony Burdett, Nathalie Conte, Michel Dumontier, Donal K. Fellows, Alejandra Gonzalez-Beltran, Philipp Gormanns, Janna Hastings, Melissa A. Haendel, Henning Hermjakob, Jean-Karim Hériché, Jon C. Ison, Rafael C. Jimenez, Simon Jupp and other authors.

This paper is available in a repository.

Abstract

Life science data is evolving to be ever larger, more distributed, and more natively web-based. However, our collective handling of identifiers has lagged behind these advances. Diverse identifier issues (for instance “link rot” and “content drift”) have hampered our ability to integrate data and derive new knowledge from it. Optimizing web-based identifiers is harder than it appears and no single scheme is perfect: identifiers are reused in different ways for different reasons, by different consumers. Moreover, digital entities (e.g., files), physical entities (e.g., biosamples), and descriptive entities (e.g., ‘mitosis’) have different requirements for identifiers. Nevertheless, there is substantial room for improvement throughout the life sciences and several other groups have been converging on identifier standards that are broadly applicable. Building on these efforts and drawing on our experience, we focus on the use case of large-scale data integration: we outline the identifier qualities and best practices that we feel are most important in this context. Specifically, we propose actions that providers of online databases (repositories, registries, and knowledgebases) should take when designing new identifiers or maintaining existing ones (Rules 1-9). In Rule 10, we conclude with guidance to data integrators and redistributors on how best to reference identifiers from these diverse sources. This article may also be useful to data generators and end users as it offers insight into the issues associated with data provision in a web environment. We call upon data providers to take a long-term view of their entities’ scope and lifecycle, and to consider existing identifier platforms and services.

Rule 1. Use established identifiers
Rule 2. Design identifiers for use by others
Rule 3. Help local identifiers travel well: document Prefix and Namespace
Rule 4. Opt for simple durable web resolution
Rule 5. Avoid embedding meaning
Rule 6. Make URIs clear and findable
Rule 7. Implement a version management policy
Rule 8. Do not re-assign or delete identifiers
Rule 9. Document the identifiers you issue and use
Rule 10. Reference responsibly

This manuscript is a revision of doi:10.5281/zenodo.18003 and was recently resubmitted to PLoS Computational Biology.
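
As a rough illustration of what Rule 3 (document Prefix and Namespace) and Rule 4 (simple durable web resolution) could look like in practice, the sketch below expands a prefixed local identifier into a full URI using a documented prefix-to-namespace table, and compacts it back again. The prefixes, the `example.org` namespaces, and the function names `expand` and `compact` are all hypothetical and are not taken from the paper; a real provider would publish its own mapping.

```python
# Hypothetical prefix-to-namespace table. The prefixes and URLs are
# illustrative assumptions, not registered identifiers.
PREFIX_MAP = {
    "EX": "https://example.org/records/",
    "DEMO": "https://data.example.org/demo/",
}


def expand(compact_id: str, prefix_map: dict[str, str] = PREFIX_MAP) -> str:
    """Turn 'PREFIX:local_id' into a full URI using the documented mapping."""
    prefix, sep, local_id = compact_id.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a prefixed identifier: {compact_id!r}")
    if prefix not in prefix_map:
        raise ValueError(f"undocumented prefix: {prefix!r}")
    return prefix_map[prefix] + local_id


def compact(uri: str, prefix_map: dict[str, str] = PREFIX_MAP) -> str:
    """Turn a full URI back into 'PREFIX:local_id' if its namespace is known."""
    for prefix, namespace in prefix_map.items():
        if uri.startswith(namespace) and len(uri) > len(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    raise ValueError(f"no documented namespace matches: {uri!r}")


if __name__ == "__main__":
    uri = expand("EX:12345")
    print(uri)
    print(compact(uri))
```

Because the mapping is explicit, a data integrator can move between the short form and the resolvable form without guessing, and an identifier with an undocumented prefix fails loudly instead of being resolved incorrectly.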