%0 Journal Article
%A Lawson, Daniel John
%A Falush, Daniel
%T Population Identification Using Genetic Data
%D 2012
%J Annual Review of Genomics and Human Genetics
%V 13
%N Volume 13, 2012
%P 337-361
%@ 1545-293X
%R https://doi.org/10.1146/annurev-genom-082410-101510
%K population structure
%K genetic distance
%K haplotypes
%K similarity measure
%K principal components
%K PCA
%I Annual Reviews
%X A large number of algorithms have been developed to classify individuals into discrete populations using genetic data. Recent results show that the information used by both model-based clustering methods and principal components analysis can be summarized by a matrix of pairwise similarity measures between individuals. Similarity matrices have been constructed in a number of ways, usually treating markers as independent but differing in the weighting given to polymorphisms of different frequencies. Additionally, methods are now being developed that take linkage into account. We review several such matrices and evaluate their information content. A two-stage approach for population identification is to first construct a similarity matrix and then perform clustering. We review a range of common clustering algorithms and evaluate their performance through a simulation study. The clustering step can be performed either on the matrix or by first using a dimension-reduction technique; we find that the latter approach substantially improves the performance of most algorithms. Based on these results, we describe the population structure signal contained in each similarity matrix and find that accounting for linkage leads to significant improvements for sequence data. We also perform a comparison on real data, where we find that population genetics models outperform generic clustering approaches, particularly with regard to robustness for features such as relatedness between individuals.
%U https://www.annualreviews.org/content/journals/10.1146/annurev-genom-082410-101510