A self-supervised deep learning method for data-efficient training in genomics

Gündüz, Hüseyin Anil; Binder, Martin; To, Xiao-Yin; Mreches, René; Bischl, Bernd; McHardy, Alice C.; Münch, Philipp C.; Rezaei, Mina

Published in

Nature Research, Communications Biology, 1(6), 2023

DOI: 10.1038/s42003-023-05310-2

Journal article published in 2023 by Hüseyin Anil Gündüz, Martin Binder, Xiao-Yin To, René Mreches, Bernd Bischl, Alice C. McHardy, Philipp C. Münch, Mina Rezaei

This paper is made freely available by the publisher.

Preprint: archiving allowed

Postprint: archiving forbidden

Published version: archiving allowed

Abstract

Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models in the presence of limited labeled data. Although many self-supervised learning methods have been suggested before, they have failed to exploit the unique characteristics of genomic data. Therefore, we introduce Self-GenomeNet, a self-supervised learning technique that is custom-tailored for genomic data. Self-GenomeNet leverages reverse-complement sequences and effectively learns short- and long-term dependencies by predicting targets of different lengths. Self-GenomeNet performs better than other self-supervised methods in data-scarce genomic tasks and outperforms standard supervised training with ~10 times fewer labeled training data. Furthermore, the learned representations generalize well to new datasets and tasks. These findings suggest that Self-GenomeNet is well suited for large-scale, unlabeled genomic datasets and could substantially improve the performance of genomic models.
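
The abstract names two ingredients: reverse-complement sequences and prediction targets of different lengths. The following is a minimal Python sketch of those two ideas only, not the authors' implementation. The function names `reverse_complement` and `targets_of_lengths`, and the way targets are sliced from a fixed start position, are assumptions made for illustration; the abstract does not specify how Self-GenomeNet builds its targets.

```python
# Illustrative sketch only; not code from the Self-GenomeNet paper.
# All names below are hypothetical.

# Translation table for Watson-Crick base pairing (A<->T, C<->G).
# Characters outside the table (e.g. "N") are left unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def targets_of_lengths(seq: str, start: int, lengths=(1, 3)) -> list:
    """Slice candidate targets of several lengths beginning at `start`.

    Assumption: a "target" is simply the subsequence that follows a
    context ending at `start`. Lengths that would run past the end of
    the sequence are skipped.
    """
    return [seq[start:start + n] for n in lengths if start + n <= len(seq)]


if __name__ == "__main__":
    read = "ACGTACGT"
    print(reverse_complement(read))
    print(targets_of_lengths(read, 2, lengths=(1, 3)))
```

Applying `reverse_complement` twice returns the original sequence, which is the symmetry a method can exploit when it treats both strands of a read as views of the same underlying DNA.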
