
      A Genealogical Interpretation of Principal Components Analysis

      research-article
      PLoS Genetics
      Public Library of Science


          Abstract

          Principal components analysis, PCA, is a statistical method commonly used in population genetics to identify structure in the distribution of genetic variation across geographical location and ethnic background. However, while the method is often used to inform about historical demographic processes, little is known about the relationship between fundamental demographic parameters and the projection of samples onto the primary axes. Here I show that for SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes. The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture. I also demonstrate a link between PCA and Wright's F_ST and show that SNP ascertainment has a largely simple and predictable effect on the projection of samples. Using examples from human genetics, I discuss the application of these results to empirical data and the implications for inference.
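
The link between pairwise quantities and the PCA projection can be illustrated with a minimal Python sketch. It is not taken from the article: the function names and the random 0/1 haplotype matrix are hypothetical, and pairwise differences stand in for coalescent times on the assumption that, under an infinite-sites mutation model, expected pairwise differences are proportional to expected coalescence times. The sketch checks one algebraic fact only: the sample covariance matrix used by PCA equals the double-centred matrix of mean pairwise differences, so the projection depends on the data only through pairwise comparisons.

```python
import numpy as np

def pca_covariance(Z):
    """Sample-by-sample covariance M = X X^T / L, with each SNP (column) mean-centred."""
    X = Z - Z.mean(axis=0, keepdims=True)
    return X @ X.T / Z.shape[1]

def pairwise_differences(Z):
    """Mean squared per-SNP difference between every pair of samples (rows of Z)."""
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).mean(axis=2)

def covariance_from_pairwise(D):
    """Double-centre D: M_ij = (Dbar_i + Dbar_j - Dbar - D_ij) / 2."""
    row = D.mean(axis=1, keepdims=True)
    col = D.mean(axis=0, keepdims=True)
    return 0.5 * (row + col - D.mean() - D)

def project(M, k=2):
    """Coordinates of the samples on the top-k principal components of M."""
    vals, vecs = np.linalg.eigh(M)              # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]          # indices of the k largest
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))

# Hypothetical data: 6 haploid samples, 200 biallelic SNPs coded 0/1.
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(6, 200)).astype(float)

M_direct = pca_covariance(Z)
M_pairs = covariance_from_pairwise(pairwise_differences(Z))
print(np.allclose(M_direct, M_pairs))
coords = project(M_direct)
```

The two matrices agree to floating-point precision for any input, because the identity is purely algebraic. Replacing the observed differences with expected coalescence times is the modelling step, and it rests on the assumption stated above.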

          Author Summary

          Genetic variation in natural populations typically demonstrates structure arising from diverse processes including geographical isolation, founder events, migration, and admixture. One technique commonly used to uncover such structure is principal components analysis, which identifies the primary axes of variation in data and projects the samples onto these axes in a graphically appealing and intuitive manner. However, as the method is non-parametric, it can be hard to relate PCA to underlying process. Here, I show that the underlying genealogical history of the samples can be related directly to the PC projection. The result is useful because it is straightforward to predict the effects of different demographic processes on the sample genealogy. However, the result also reveals the limitations of PCA, in that multiple processes can give the same projections, it is strongly influenced by uneven sampling, and it discards important information in the spatial structure of genetic variation along chromosomes.
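
The sensitivity to uneven sampling can be seen in a toy Python example. This is a sketch under stated assumptions, not the article's analysis: the block matrix of pairwise times (1.0 within a population, 2.0 between populations) and both function names are hypothetical, and the matrix is double-centred in the usual way to obtain the matrix whose leading eigenvector gives the first principal component.

```python
import numpy as np

def two_population_times(n_a, n_b, within=1.0, between=2.0):
    """Hypothetical pairwise times: `within` inside a population, `between` across, 0 on the diagonal."""
    labels = np.array([0] * n_a + [1] * n_b)
    same = labels[:, None] == labels[None, :]
    T = np.where(same, within, between)
    np.fill_diagonal(T, 0.0)
    return T, labels

def pc1_from_times(T):
    """PC1 coordinates from a symmetric matrix of pairwise times."""
    n = T.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centring matrix
    M = -0.5 * J @ T @ J                    # double-centred, sign-flipped
    vals, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    return vecs[:, -1] * np.sqrt(vals[-1])

# Uneven sampling: 8 samples from population 0, 2 from population 1.
T, labels = two_population_times(8, 2)
pc1 = pc1_from_times(T)
print(np.abs(pc1[labels == 0]).mean(), np.abs(pc1[labels == 1]).mean())
```

In this toy setting the more heavily sampled population sits closer to the origin of PC1, even though the underlying pairwise times are symmetric between the two populations. With equal sample sizes the two populations sit at equal distances from the origin.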


                Author and article information

                Contributors
                Role: Editor
                Journal: PLoS Genetics (PLoS Genet)
                Publisher: Public Library of Science (San Francisco, USA)
                ISSN: 1553-7390, 1553-7404
                Published: 16 October 2009
                Volume: 5
                Issue: 10
                Article number: e1000686
                Affiliations
                [1]Department of Statistics, University of Oxford, Oxford, United Kingdom
                University of Chicago, United States of America
                Author notes

                Conceived and designed the experiments: GM. Performed the experiments: GM. Analyzed the data: GM. Wrote the paper: GM.

                Article
                Manuscript number: 09-PLGE-RA-0897R3
                DOI: 10.1371/journal.pgen.1000686
                PMCID: PMC2757795
                PMID: 19834557
                Record ID: b254f3a1-5e7b-46b0-9f78-b669262284e4
                Gil McVean. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                History
                Received: 2 June 2009
                Accepted: 16 September 2009
                Page count
                Pages: 10
                Categories
                Research Article
                Evolutionary Biology/Human Evolution
                Genetics and Genomics/Population Genetics

                Genetics
