Pitfalls of Merging GWAS Data: Lessons Learned in the eMERGE Network and Quality Control Procedures to Maintain High Data Quality

Zuvich, Rebecca L.; Armstrong, Loren L.; Bielinski, Suzette J.; Bradford, Yuki; Carlson, Christopher S.; Crawford, Dana C.; Crenshaw, Andrew T.; de Andrade, Mariza; Doheny, Kimberly F.; Haines, Jonathan L.; Hayes, M. Geoffrey; Jarvik, Gail P.; Jiang, Lan; Kullo, Iftikhar J.; Li, Rongling; Ling, Hua; Manolio, Teri A.; Matsumoto, Martha E.; McCarty, Catherine A.; McDavid, Andrew N.; Mirel, Daniel B.; Olson, Lana M.; Paschall, Justin E.; Pugh, Elizabeth W.; Rasmussen, Luke V.; Rasmussen-Torvik, Laura J.; Turner, Stephen D.; Wilke, Russell A.; Ritchie, Marylyn D.

Published in

Wiley, Genetic Epidemiology, 8(35), p. 887-898, 2011

DOI: 10.1002/gepi.20639

Journal article published in 2011 by Rebecca L. Zuvich, Loren L. Armstrong, Suzette J. Bielinski, Yuki Bradford, Christopher S. Carlson, Dana C. Crawford, Andrew T. Crenshaw, Mariza de Andrade, Kimberly F. Doheny, Jonathan L. Haines, M. Geoffrey Hayes, Gail P. Jarvik, Lan Jiang, Iftikhar J. Kullo, Rongling Li and other authors.

This paper is made freely available by the publisher.

Abstract

Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient re-use of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute (NHGRI)-supported consortium composed of five sites to perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms to comprise a core set of fourteen phenotypes for extraction of study samples from each site’s DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or at the Center for Inherited Disease Research (CIDR) using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample quality, marker quality, and various batch effects. Upon completion of the genotyping and QC analyses for each site’s primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset re-entered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion to the eMERGE QC pipeline for merged datasets. 
These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II, and will also serve as a starting point for investigators merging multiple genotype datasets accessible through the National Center for Biotechnology Information (NCBI) in the database of Genotypes and Phenotypes (dbGaP). Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.
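
The abstract names marker quality and batch effects as QC targets but gives no thresholds or procedures. As a rough illustration only, the Python sketch below shows one way a per-marker check across two genotyping centers might look. Everything in it is an assumption made for the example: the function names, the 0/1/2 genotype coding, the toy data, and the cutoffs (`min_call_rate=0.98`, `max_freq_diff=0.10`). It is not the eMERGE pipeline, and a real analysis would likely use a formal statistical test rather than a raw frequency difference.

```python
from typing import Dict, List, Optional

# Assumed coding: 0, 1, or 2 copies of the counted allele; None = missing call.
Genotype = Optional[int]


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of non-missing genotype calls for one marker."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def allele_frequency(genotypes: List[Genotype]) -> Optional[float]:
    """Frequency of the counted allele among called genotypes (None if no calls)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def flag_markers(
    center_a: Dict[str, List[Genotype]],
    center_b: Dict[str, List[Genotype]],
    min_call_rate: float = 0.98,   # hypothetical threshold
    max_freq_diff: float = 0.10,   # hypothetical threshold
) -> Dict[str, List[str]]:
    """Return markers shared by both centers that fail either illustrative check."""
    flagged: Dict[str, List[str]] = {}
    for marker in sorted(set(center_a) & set(center_b)):
        reasons: List[str] = []
        # Marker-quality check: call rate within each center.
        for label, data in (("A", center_a), ("B", center_b)):
            if call_rate(data[marker]) < min_call_rate:
                reasons.append(f"low call rate in center {label}")
        # Crude batch-effect check: allele frequency gap between centers.
        freq_a = allele_frequency(center_a[marker])
        freq_b = allele_frequency(center_b[marker])
        if freq_a is not None and freq_b is not None:
            if abs(freq_a - freq_b) > max_freq_diff:
                reasons.append("allele frequency differs between centers")
        if reasons:
            flagged[marker] = reasons
    return flagged


# Toy data (invented for illustration): three markers, ten samples per center.
center_a = {
    "rs1": [0, 1, 2, 1, 0, 1, 1, 0, 2, 1],
    "rs2": [0, 0, 1, 0, None, 0, 0, 0, 1, 0],
    "rs3": [2, 2, 1, 2, 2, 2, 1, 2, 2, 2],
}
center_b = {
    "rs1": [1, 0, 2, 1, 1, 0, 1, 0, 2, 1],
    "rs2": [0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    "rs3": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
}

flagged = flag_markers(center_a, center_b)
for marker, reasons in flagged.items():
    print(marker, "->", "; ".join(reasons))
```

In this toy run, `rs2` is flagged for a missing call in center A and `rs3` for a large frequency gap between centers, while `rs1` passes both checks. Flagged markers would then be candidates for closer inspection, not automatic removal.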
