Sampling of genealogies from an approximate posterior under the infinite sites model with recombination.

The ancestry of a collection of individuals at a single locus is most naturally represented by a binary tree. In order to represent ancestry across multiple loci we must extend this to a graph, the ancestral recombination graph (ARG). Under some simplifying asumptions (see e.g. the Handbook of Statistical Genetics) we can show that the coalescent with recombination gives a good model for the distribution of these graphs. Given genetic data, this allows us to estimate genealogies, calculate likelihoods, and estimate parameters of the evolutionary process.

We present here a Monte Carlo approach to simulating heuristic genealogies for large regoins of the genome. These can be powerful tools for a range of applications including imputation and detection of genetic associations. Building on the approach of Fearnhead and Donnelly (Genetics, 2001) We introduce a number of heuristics which, when employed, greatly improve the speed of their calculations for a single genealogy, at the cost of exploring the posterior as evenly.

treesim can also be used to calculate likelihoods using importance sampling, and approximates the optimal importance sampler, following the approach of Stephens and Donnelly and Fearnhead and Donnelly in constructing an importance sampler to calculate the likelihood curves under both the SMC and the Coalescent. The method generates a coalescent genealogy backwards in time until the most recent common ancestor is reached at all sites. The genealogy is constructed using the sample and is augmented with mutations so that the probability of the sample configuration given the augmented genealogy is 1 or 0.

The program treesim is written in C++ and can be run from the command line on Windows, Linux and Mac OSX.

Treesim was used by Zhan Su to create a set of genealogies for the hapmap CEU and YRI populations, here

Source Code (.tar.gz file)
Files for a test run (.tar file)
Helper R code - to run and parse/collate/plot output