Sherlock: an open-source data platform to store, analyze and integrate Big Data for biology

Bohár, Balázs; Fazekas, David; Madgwick, Matthew; Csabai, Luca; Olbei, Marton; Korcsmáros, Tamás; Szalay-Beko, Mate

Published in

F1000Research, 10, p. 409, 2021

DOI: 10.12688/f1000research.52791.1


Journal article published in 2021 by Balázs Bohár, David Fazekas, Matthew Madgwick, Luca Csabai, Marton Olbei, Tamás Korcsmáros, Mate Szalay-Beko

This paper is made freely available by the publisher.

Abstract

In the era of Big Data, data collection underpins biological research more than ever before. In many cases, it can be as time-consuming as the analysis itself: researchers must download multiple public databases with different data structures and often spend days on preparation before they can address any biological question. To solve this problem, we introduce an open-source, cloud-based big data platform called Sherlock (https://earlham-sherlock.github.io/). Sherlock fills a gap by giving biologists a way to store, convert, query, share and generate biological data, streamlining bioinformatics data management. The platform provides a simple interface to big data technologies such as Docker and PrestoDB, and is designed to analyse, process, query and extract information from extremely large and complex data sets. Sherlock can handle differently structured data (interaction, localization or genomic sequence) from several sources and convert them to a common, optimized storage format such as Optimized Row Columnar (ORC). This format allows Sherlock to execute distributed analytical queries quickly on extremely large data files and to share datasets between teams. The Sherlock platform is freely available on GitHub and contains loader scripts for structured data sources from genomics, interaction and expression databases. With these loader scripts, users can quickly create and work with specific file formats such as JavaScript Object Notation (JSON) or ORC. For computational biology and large-scale bioinformatics projects, Sherlock provides an open-source platform that supports data management, data analytics, data integration and collaboration through modern big data technologies.
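To make the idea of converting differently structured sources to a common format more concrete, the following minimal sketch maps two hypothetical interaction sources onto one shared record layout and serializes the result as JSON Lines. It is not taken from Sherlock's loader scripts: the field names (source_id, target_id, source_db), the input column names and the database labels are all assumptions chosen for illustration, and the sketch stops at JSON rather than writing ORC.

```python
import csv
import io
import json

# Hypothetical common schema; these field names are illustrative only
# and are not taken from the Sherlock loader scripts.
COMMON_FIELDS = ("source_id", "target_id", "source_db")


def from_tab_separated(text, source_db):
    """Map rows of a tab-separated table (columns protein_a, protein_b)
    onto the common schema."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        yield {
            "source_id": row["protein_a"],
            "target_id": row["protein_b"],
            "source_db": source_db,
        }


def from_json_records(text, source_db):
    """Map a JSON array of {"from": ..., "to": ...} objects onto the
    common schema."""
    for record in json.loads(text):
        yield {
            "source_id": record["from"],
            "target_id": record["to"],
            "source_db": source_db,
        }


def to_json_lines(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(
        json.dumps({field: record[field] for field in COMMON_FIELDS})
        for record in records
    )


# Two small made-up inputs with different structures.
tsv_input = "protein_a\tprotein_b\nP04637\tQ00987\n"
json_input = '[{"from": "P38398", "to": "P51587"}]'

merged = list(from_tab_separated(tsv_input, "db_one"))
merged += list(from_json_records(json_input, "db_two"))

json_lines = to_json_lines(merged)
print(json_lines)
```

Once every source shares one record layout, a single query can span all of them, which is the property the abstract attributes to the common storage format.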
