Indiana University

Analyzing population genomic data

Project Leads: Michael Lynch (PI), Matthew S. Ackerman (Project lead)

Research made possible by: UITS Research Technologies' (RT's) High Performance Systems (HPS), RT's Scientific Applications and Performance Tuning (SciAPT), National Center for Genome Analysis Support (NCGAS), IU's Karst supercomputer

Daphnia Water Flea
Figure 1. Contributions to master, excluding merge commits.

As the cost of sequencing an organism's genome has declined, it has become possible to sequence the genomes of many individuals within a population, giving birth to the era of population genomics. To use the rich data sets generated, two kinds of errors must be accounted for: 1) sequencing errors, where the sequencing machine infers an erroneous base, and 2) undersampling, where only one of the two copies, or haplotypes, of an individual's genome has been sequenced. Most techniques address both problems in the same fashion: by sequencing large amounts of DNA from each individual. This ensures that each location in an organism's genome is sequenced many independent times, making sequencing errors obvious and making it nearly certain that both copies of an individual's genome will be sequenced. We take an alternative approach: instead of sequencing enough DNA to make the analysis simple, we perform sophisticated analysis of small amounts of DNA. Investigators typically spend tens of thousands of dollars generating a sufficient amount of DNA sequence for population genomic studies. By decreasing the amount of sequencing per individual, we dramatically reduce the cost of these studies or allow investigators to dramatically increase their scale.
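To make the second kind of error concrete, the short Python sketch below estimates how often only one haplotype is observed at a heterozygous site for a given number of reads. It assumes, purely for illustration, that each read samples one of the two haplotypes independently with probability 1/2. The function name `prob_one_haplotype_only` is hypothetical and is not part of any tool mentioned on this page.

```python
def prob_one_haplotype_only(coverage: int) -> float:
    """Probability that every read at a heterozygous site comes from the same haplotype.

    Illustrative assumption: each read samples either haplotype
    independently with probability 1/2.
    """
    if coverage < 1:
        raise ValueError("coverage must be at least 1")
    # Either of the two haplotypes may be the only one seen:
    # 2 * (1/2)**coverage == (1/2)**(coverage - 1).
    return 0.5 ** (coverage - 1)


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20):
        p = prob_one_haplotype_only(n)
        print(f"coverage {n:2d}: P(only one haplotype seen) = {p:.6f}")
```

Under this assumption the probability halves with every additional read, which is why deep sequencing makes the problem negligible and why low-coverage data needs a statistical treatment instead.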

The computational investment is not trivial, and a cluster computing approach is necessary to make the analysis practical. The program 'mapgd' (Maximum-likelihood Analysis of Population Genomic Data) produces statistically rigorous estimates using a maximum likelihood methodology. In addition to allowing investigators to optimize the statistical power of their experiments, it introduces several new analytical techniques, including techniques for the analysis of pooled sequencing (where DNA is prepared from many unknown individuals, rather than from a collection of separate individuals) and for analyzing the genealogical relationships of individuals. A development version of this program is currently available on GitHub and is being prepared for release in early November.
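As a rough illustration of what a maximum likelihood analysis of low-coverage data can look like, the Python sketch below estimates the frequency of a minor allele at a single site from per-individual read counts. It is not mapgd's actual model or code. It assumes two alleles, a symmetric per-read error rate, Hardy-Weinberg genotype proportions, and a simple grid search. All function names and the default error rate of 0.01 are hypothetical choices made for this example.

```python
import math


def read_likelihood(n_major: int, n_minor: int, genotype: int, error: float) -> float:
    """P(read counts | genotype), where genotype is the number of minor-allele copies."""
    # Chance that a single read shows the minor allele under a symmetric error model.
    p_minor_read = {0: error, 1: 0.5, 2: 1.0 - error}[genotype]
    n = n_major + n_minor
    return math.comb(n, n_minor) * p_minor_read**n_minor * (1.0 - p_minor_read) ** n_major


def log_likelihood(p: float, counts, error: float = 0.01) -> float:
    """Log-likelihood of minor-allele frequency p, summing over unknown genotypes."""
    # Hardy-Weinberg genotype proportions (an assumption of this sketch).
    prior = {0: (1.0 - p) ** 2, 1: 2.0 * p * (1.0 - p), 2: p**2}
    total = 0.0
    for n_major, n_minor in counts:
        site = sum(prior[g] * read_likelihood(n_major, n_minor, g, error) for g in (0, 1, 2))
        total += math.log(site)
    return total


def ml_allele_frequency(counts, error: float = 0.01, steps: int = 1000) -> float:
    """Grid-search maximum likelihood estimate of the minor-allele frequency."""
    if not 0.0 < error < 0.5:
        raise ValueError("error must lie strictly between 0 and 0.5")
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: log_likelihood(p, counts, error))


if __name__ == "__main__":
    # (major reads, minor reads) for six individuals, each sequenced at low coverage.
    counts = [(2, 0), (1, 1), (0, 2), (3, 0), (2, 1), (2, 0)]
    print("estimated minor-allele frequency:", ml_allele_frequency(counts))
```

Because the genotype of each individual is summed over rather than called outright, individuals with only one or two reads still contribute information. Each site is evaluated independently, so work of this kind can be split across the nodes of a cluster.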

About RT Groups: The High Performance Systems (HPS) group implements, operates, and supports some of the fastest supercomputers in the world (IU's Big Red II, the Quarry cluster, Karst, and the large-memory Mason system) in order to advance Indiana University's mission in research, training, and engagement in the state. HPS also supports databases and database engines used by the IU community.

The mission of the Scientific Applications and Performance Tuning (SciAPT) group is to deliver and support software tools that promote effective and efficient use of IU's advanced cyberinfrastructure, which in turn improves research and enables discoveries.

NSF GSS Codes:

Primary Field: Genetics (610) Human-Medical Genetics

Secondary Field: Computer Science (401) Data Modeling