
A large number of algorithms have been developed to classify individuals into discrete populations using genetic data. Recent results show that the information used by both model-based clustering methods and principal components analysis can be summarized by a matrix of pairwise similarity measures between individuals. Similarity matrices have been constructed in a number of ways, usually treating markers as independent but differing in the weighting given to polymorphisms of different frequencies. Additionally, methods are now being developed that take linkage into account. We review several such matrices and evaluate their information content. A two-stage approach for population identification is to first construct a similarity matrix and then perform clustering. We review a range of common clustering algorithms and evaluate their performance through a simulation study. The clustering step can be performed either on the matrix or by first using a dimension-reduction technique; we find that the latter approach substantially improves the performance of most algorithms. Based on these results, we describe the population structure signal contained in each similarity matrix and find that accounting for linkage leads to significant improvements for sequence data. We also perform a comparison on real data, where we find that population genetics models outperform generic clustering approaches, particularly with regard to robustness for features such as relatedness between individuals.
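The pipeline described above (similarity matrix, then dimension reduction, then clustering) can be illustrated with a minimal sketch. This is not the authors' code or any of the reviewed methods. It assumes unlinked biallelic markers, uses a standardized genotype covariance as one possible similarity measure, and substitutes a bare-bones two-cluster k-means for the clustering algorithms compared in the review. All function names, the simulated allele frequencies, and the parameter values are hypothetical.

```python
import numpy as np


def simulate_genotypes(n_per_pop=20, n_snps=500, seed=0):
    """Simulate 0/1/2 genotypes for two populations (hypothetical toy model)."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.2, 0.8, size=n_snps)
    shift = rng.normal(0.0, 0.15, size=n_snps)
    # Row 0 and row 1 hold the allele frequencies of the two populations.
    freqs = np.clip(np.stack([base + shift, base - shift]), 0.05, 0.95)
    labels = np.repeat([0, 1], n_per_pop)
    genotypes = rng.binomial(2, freqs[labels])  # shape: (individuals, SNPs)
    return genotypes, labels


def similarity_matrix(genotypes):
    """Pairwise similarity as the covariance of frequency-standardized genotypes.

    Markers are treated as independent; linkage is ignored in this sketch.
    """
    g = genotypes.astype(float)
    p = g.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)  # drop monomorphic markers
    z = (g[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
    return z @ z.T / keep.sum()


def top_eigenvectors(sim, k):
    """Dimension reduction: coordinates on the k leading eigenvectors."""
    vals, vecs = np.linalg.eigh(sim)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))


def two_means(points, n_iter=20):
    """Plain k-means with k=2, started from the extremes of the first axis."""
    centers = points[[np.argmin(points[:, 0]), np.argmax(points[:, 0])]]
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        centers = np.stack([points[assign == j].mean(axis=0) for j in range(2)])
    return assign


genotypes, labels = simulate_genotypes()
sim = similarity_matrix(genotypes)        # stage 1: similarity matrix
coords = top_eigenvectors(sim, k=2)       # optional dimension reduction
assign = two_means(coords)                # stage 2: clustering
# Cluster labels are arbitrary, so score agreement up to a label swap.
accuracy = max((assign == labels).mean(), (assign != labels).mean())
print(f"agreement with simulated populations: {accuracy:.2f}")
```

Clustering the rows of `sim` directly instead of `coords` corresponds to the "on the matrix" variant mentioned in the abstract; the sketch uses the reduced coordinates because the review reports that this improves most algorithms.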
Download the Supplemental Appendix as a PDF.
The supplemental material expands on results discussed in the main text and provides a more complete literature overview. The clustering score for real data is precisely defined, and seven approaches from the literature are categorized by their choice of similarity measure, clustering algorithm, and dimensionality reduction. Results for the Human Genome Diversity Project (HGDP) are reported for additional cases, along with the commands used to run external programs.
Contents:
• Supplemental Section 1: Clustering score based on the ChromoPainter matrix
• Supplemental Section 2: Additional approaches
• Supplemental Section 3: HGDP Results
• Supplemental Section 4: Details of algorithms
• Supplemental Figure 1: Simulated Data Similarity Matrices
• Supplemental Figure 2: Correlation Between Similarity Measures
• Supplemental Figure 3: Simulated Data Clustering Performance
• Supplemental Figure 4: HGDP PCA plots
• Supplemental Figure 5: HGDP results for CPU
• Supplemental Figure 6: HGDP results for Spectral approaches
• Supplemental Figure 7: HGDP IBD Spectral TW results
• Supplemental Figure 8: HGDP IBD Spectral PA results
• Supplemental Figure 9: HGDP Spectral Reconstruction
• Supplemental Figure 10: Determining the number of Eigenvalues