
  • Open access
  • Published: 29 August 2022

Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated

  • Eran Elhaik 1  

Scientific Reports, volume 12, Article number: 14683 (2022)


  • Computational models
  • Population genetics

Principal Component Analysis (PCA) is a multivariate analysis that reduces the complexity of datasets while preserving data covariance. The outcome can be visualized on colorful scatterplots, ideally with only a minimal loss of information. PCA applications, implemented in well-cited packages like EIGENSOFT and PLINK, are extensively used as the foremost analyses in population genetics and related fields (e.g., animal and plant or medical genetics). PCA outcomes are used to shape study design, identify and characterize individuals and populations, and draw historical and ethnobiological conclusions on origins, evolution, dispersion, and relatedness. The replicability crisis in science has prompted us to evaluate whether PCA results are reliable, robust, and replicable. We analyzed twelve common test cases using an intuitive color-based model alongside human population data. We demonstrate that PCA results can be artifacts of the data and can be easily manipulated to generate desired outcomes. PCA adjustment also yielded unfavorable outcomes in association studies. PCA results may not be reliable, robust, or replicable as the field assumes. Our findings raise concerns about the validity of results reported in the population genetics literature and related fields that place a disproportionate reliance upon PCA outcomes and the insights derived from them. We conclude that PCA may have a biasing role in genetic investigations and that 32,000–216,000 genetic studies should be reevaluated. An alternative mixed-admixture population genetic model is discussed.


Introduction

The ongoing reproducibility crisis, undermining the foundation of science 1 , raises various concerns ranging from study design to statistical rigor 2 , 3 . Population genetics is confounded by its utilization of small sample sizes, ignorance of effect sizes, and adoption of questionable study designs. The field is relatively small and may involve financial interests 4 , 5 , 6 and ethical dilemmas 7 , 8 . Since biases in the field rapidly propagate to related disciplines like medical genetics, biogeography, association studies, forensics, and paleogenomics in humans and non-humans alike, it is imperative to ask whether and to what extent our most elementary tools satisfy risk criteria.

Principal Component Analysis (PCA) is a multivariate analysis that reduces the data’s dimensionality while preserving their covariance. When applied to genotype bi-allelic data, typically encoded as AA, AB, and BB, PCA finds the eigenvalues and eigenvectors of the covariance matrix of allele frequencies. The data are reduced to a small number of dimensions termed principal components (PCs); each describes a decreased proportion of the genomic variation. Genotypes are then projected onto space spanned by the PC axes, which allows visualizing the samples and their distances from one another in a colorful scatter plot. In this visualization, sample overlap is considered evidence of identity, due to common origin or ancestry 9 , 10 . PCA’s most attractive property for population geneticists is that the distances between clusters allegedly reflect the genetic and geographic distances between them. PCA also supports the projection of points onto the components calculated by a different dataset, presumably accounting for insufficient data in the projected dataset. Initially adapted for human genomic data in 1963 11 , the popularity of PCA has slowly increased over time. It was not until the release of the SmartPCA tool (EIGENSOFT package) 10 that PCA was propelled to the front stage of population genetics.
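The projection step described above can be sketched in a few lines. This is a minimal illustration of our own (not the SmartPCA implementation, which adds per-SNP normalization and other refinements): genotypes coded as 0/1/2 allele counts are mean-centered per SNP and decomposed with SVD, and the sample coordinates on the top PCs are the scaled left singular vectors.

```python
import numpy as np

def pca(genotypes, k=2):
    """Minimal PCA sketch for an (n_samples, n_snps) matrix of 0/1/2 allele counts.
    Each SNP column is mean-centered; sample coordinates on the top k PCs are
    U * S, and the squared singular values give the variance per component."""
    X = genotypes - genotypes.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = S**2 / np.sum(S**2)          # proportion of variance per PC
    return U[:, :k] * S[:k], explained[:k]

# toy data: 4 samples x 3 SNPs (hypothetical genotypes, for illustration only)
G = np.array([[0, 1, 2],
              [0, 1, 1],
              [2, 2, 0],
              [2, 1, 0]], dtype=float)
pcs, ev = pca(G)
```

EIGENSOFT's SmartPCA additionally normalizes each SNP (e.g., by its estimated standard deviation under a binomial model); the sketch omits such refinements for clarity.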

PCA is used as the first analysis of data investigation and data description in most population genetic analyses, e.g., Refs. 12, 13, 14, 15. It has a wide range of applications. It is used to examine the population structure of a cohort or individuals to determine ancestry, analyze the demographic history and admixture, decide on the genetic similarity of samples and exclude outliers, decide how to model the populations in downstream analyses, describe the ancient and modern genetic relationships between the samples, infer kinship, identify ancestral clines in the data, e.g., Refs. 16, 17, 18, 19, detect genomic signatures of natural selection, e.g., Ref. 20, and identify convergent evolution 21. PCA or PCA-like tools are considered the ‘gold standard’ in genome-wide association studies (GWAS) and GWAS meta-analyses. They are routinely used to cluster individuals with shared genetic ancestry and to detect, quantify, and adjust for population structure 22. PCA is also used to identify cases, controls 23, 24, 25, and outliers (samples or data) 17, and to calculate population structure covariates 26. The demand for large sample sizes has prompted researchers to “outsource” analyses to direct-to-consumer companies, which employ discretion in their choice of tools, methods, and data—none of which are shared—and return the PCA loadings and other “summary statistics” 27, 28. Loadings are also offered by databases like gnomAD 29 and the UK Biobank 30. PCA serves as the primary tool to identify the origins of ancient samples in paleogenomics 14, to identify biomarkers for forensic reconstruction in evolutionary biology 31, and to geolocalize samples 32. As of April 2022, 32,000–216,000 genetic papers employed PC scatterplots to interpret genetic data, draw historical and ethnobiological conclusions, and describe the evolution of various taxa from prehistorical times to the present—no doubt Herculean tasks for any scatterplot.
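For the GWAS usage mentioned above, "adjusting for structure" typically means entering the top PCs as covariates in the association model. The sketch below is a hypothetical ordinary-least-squares version of that idea (our code and simulated data; real pipelines use dedicated tools and logistic or mixed models):

```python
import numpy as np

def adjusted_snp_effect(snp, phenotype, pcs):
    """Estimate a SNP's effect on a phenotype while adjusting for population
    structure by including the top PCs as covariates (OLS sketch)."""
    n = len(snp)
    design = np.column_stack([np.ones(n), snp, pcs])  # intercept, SNP, PC covariates
    beta, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return beta[1]                                    # adjusted SNP coefficient

# simulated toy cohort in which structure (the PCs) confounds the phenotype
rng = np.random.default_rng(1)
n = 200
pcs = rng.normal(size=(n, 2))                    # stand-in for the top two PCs
snp = rng.binomial(2, 0.3, size=n).astype(float)
phen = 0.5 * snp + pcs @ np.array([1.0, -1.0]) + rng.normal(size=n)
beta = adjusted_snp_effect(snp, phen, pcs)       # recovers roughly the simulated effect
```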

PCA’s widespread use could not have been achieved without several key traits that distinguish it from other tools—all tied to the replicability crisis. PCA can be applied to any numerical dataset, small or large, and it always yields results. It is parameter-free and nearly assumption-free 9. It does not involve measures of significance, effect size evaluations, or error estimates. It is, by and large, a “black box” harboring complex calculations that cannot be traced. Excepting the squared cosines, which are not commonly used, the proportion of explained variance of the data is the single quantity available to evaluate the quality of PCA. There is no consensus on the number of PCs to analyze. Price et al. 10 recommended using 10 PCs, and Patterson et al. 9 proposed the Tracy–Widom statistic to determine the number of components. However, this statistic is highly sensitive and inflates the number of PCs. In practice, most authors use the first two PCs, which are expected to reflect genetic similarities that are difficult to observe in higher PCs. The remaining authors use an arbitrary number of PCs or adopt ad hoc strategies to aid their decision, e.g., Ref. 33. Pardiñas et al. 34, for example, selected the first five PCs “as recommended for most GWAS approaches” and principal components 6, 9, 11, 12, 13, and 19, whereas Wainschtein et al. 35 preferred the top 280 PCs. There are no proper usage guidelines for PCA, and “innovations” toward less restrictive usage are adopted quickly. Recently, even the practice of displaying the proportion of variation explained by each PC has faded as those proportions shrank 14. Since PCA is affected by the choice of markers, samples, populations, the precise implementation, and various flags implemented in the PCA packages—each with an unpredictable effect on the results—replication cannot be expected.

In population genetics, PCA and admixture-like analyses are the de facto standards used as non-parametric genetic data descriptors. They are considered the hammer and chisel of genetic analyses 36. Lawson et al. 37 and Elhaik and Graur 38 commented on the misuse of admixture-like tools and argued that they should not be used to draw historical conclusions. Thus far, no investigation has thoroughly explored PCA usage and accuracy across most common study designs.

Because PCA fulfills many of the risk criteria for reproducibility 2 and is typically used as a first hypothesis generator in population genetic studies, this study assesses its reliability, robustness, and reproducibility. As PCA is a mathematical model employed to describe the unknown truth, testing its accuracy requires a convincing model where the truth is unambiguous. For that, we developed an intuitive and simple color-based model (Fig. 1A). Because all colors consist of three dimensions—red, green, and blue—they can be plotted in a 3D plot representing the true colors (Fig. 1B). Applied to these data, PCA reduces the dataset to two dimensions that explain most of the variation. This allows us to visualize the true colors (still using their 3D values) in PCA’s 2D scatterplot, measure the distances of the samples from each other along the PCs, and compare them to their true 3D distances. We can thereby generate “color populations,” always consisting of 3 variables, analogous to SNPs, to aid us in evaluating the accuracy of PCA. If PCA works well, we expect it to properly represent the true distances of the colors from one another in a 2D plot (i.e., light Green should cluster near Green; Red, Green, and Blue should cluster away from each other). Let us agree that if PCA cannot perform well in this simplistic setting, where subpopulations are genetically distinct (FST is maximized) and the dimensions are well separated and defined, it should not be used in more complex analyses and certainly cannot be used to derive far-reaching conclusions about history. In parallel, we analyzed genotype data of modern and ancient human populations.
Because the inferred population structure and population history may be debatable, we asked whether and to what extent PCA can generate contradictory results and lead to absurd conclusions ( reductio ad absurdum ), whether seemingly “correct” conclusions can be derived without prior knowledge ( cherry-picking or circular reasoning ), and whether PCA grants a posteriori knowledge independent of experience (a priori). Let us also agree that if the answer to any of those questions is negative, PCA is of no use to population geneticists.

Figure 1

Applying PCA to four color populations. (A) An illustration of the PCA procedure (using the singular value decomposition (SVD) approach) applied to a color dataset consisting of four colors (nAll = 1). (B) A 3D plot of the original color dataset with the axes representing the primary colors; each color is represented by three numbers (“SNPs”). After PCA is applied to this dataset, the projections of color samples or populations (in their original color) are plotted along their first two eigenvectors (or principal components [PCs]) with (C) nAll = 1, (D) nAll = 100, and (E) nAll = 10,000. The latter two results are identical to those of (C). Grey lines and labels mark the Euclidean distances between the color populations calculated across all three PCs.

We carried out an extensive empirical evaluation of PCA through twelve test cases, each assessing a typical usage of PCA using color and human genomic data. In all the cases, we applied PCA according to the standards in the literature but modulated the choice of populations, sample sizes, and, in one case, the selection of markers. The PCA tool used here yields near-identical results to the PCA implemented in EIGENSOFT (Supplementary Figs. S1 – S2 ). To illustrate the way PCA can be used to support multiple opposing arguments in the same debate, we constructed fictitious scenarios with parallels to many investigations in human ancestry that are shown in boxes. We reasoned that if PCA results are irreproducible, contradictory, or absurd, and if they can be manipulated, directed, or controlled by the experimenter, then PCA must not be used for genetic investigations, and an incalculable number of findings based on its results should be reevaluated. We found that this is indeed the case.

The near-perfect case of dimensionality reduction

Applying principal component analysis (PCA) to a dataset of four evenly sampled populations, the three primary colors (Red, Green, and Blue) and Black, illustrates a near-ideal dimension reduction example. PCA condensed the dataset of these four samples from a 3D Euclidean space (Fig. 1B) into three principal components (PCs), the first two of which explained 88% of the variation and can be visualized in a 2D scatterplot (Fig. 1C). Here, and in all other color-based analyses, the colors represent the true 3D structure, whereas their positions on the 2D plots are the outcome of PCA. Although PCA correctly positioned the primary colors at even distances from each other and Black, it distorted the distances between the primary colors and Black (from 1 in 3D space to 0.82 in 2D space). Thereby, even in this limited and near-perfect demonstration of data reduction, the observed distances do not reflect the actual distances between the samples (which are impossible to recreate in a 2D dataset). In other words, distances between samples in a reduced dimensionality plot do not and cannot be expected to represent actual genetic distances. Evenly increasing all the sample sizes yields identical results irrespective of the sample size (Fig. 1D,E).
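The distortion described above is easy to reproduce. In this sketch (our code, plain SVD), the top two eigenvalues capture 2/2.25 ≈ 89% of the variance, the primary-to-primary distances survive the projection intact (√2), but the Red-Black distance shrinks from 1 to √(2/3) ≈ 0.82, because Black's centered coordinates lie entirely along the discarded third component.

```python
import numpy as np

# the four "color populations" of Fig. 1, one sample each, three "SNPs" (R, G, B)
colors = np.array([[1., 0., 0.],   # Red
                   [0., 1., 0.],   # Green
                   [0., 0., 1.],   # Blue
                   [0., 0., 0.]])  # Black
X = colors - colors.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U[:, :2] * S[:2]                            # 2D PCA coordinates

explained = (S[:2] ** 2).sum() / (S ** 2).sum()   # 2 / 2.25, i.e. ~88.9%
d_rb_3d = np.linalg.norm(colors[0] - colors[3])   # Red-Black in 3D: exactly 1
d_rb_2d = np.linalg.norm(pcs[0] - pcs[3])         # Red-Black in 2D: sqrt(2/3) ~ 0.82
d_rg_2d = np.linalg.norm(pcs[0] - pcs[1])         # Red-Green in 2D: sqrt(2), undistorted
```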

When analyzing human populations, which harbor most of the genomic variation between continental populations (12%) with only 1% of the genetic variation distributed within continental populations 39 , PCA tends to position Africans, Europeans, and East Asians at the corners of an imaginary triangle, which closely resembles our color-population model and illustration. Analyzing continental populations, we obtained similar results for two even-sized sample datasets (Fig.  2 A,C) and their quadrupled counterparts (Fig.  2 B,D). As before, the distances between the populations remain similar (Fig.  2 A–D), demonstrating that for same-sized populations, sample size does not contribute to the distortion of the results if the increase in size is proportional.

Figure 2

Testing the effect of even-sample sizes using two population sets. The top plots show nine populations with n  = 50 ( A ) and n  = 188 ( B ). The bottom plots show a different set of nine populations with n  = 50 ( C ) and n  = 192 ( D ). In both cases, increasing the sample size did not alter the PCs (the y-axis flip between ( C ) and ( D ) is a known phenomenon).
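The axis flip noted in the caption is expected: each eigenvector is determined only up to sign, so an implementation may return either orientation without any change in the data. A quick check (our sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Negate PC2 and its loading vector: the factorization X = U S Vt is unchanged,
# so a scatterplot along this PC may appear mirrored between otherwise identical runs.
U2, Vt2 = U.copy(), Vt.copy()
U2[:, 1] *= -1
Vt2[1, :] *= -1
unchanged = np.allclose(U @ np.diag(S) @ Vt, U2 @ np.diag(S) @ Vt2)
```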

The case of different sample sizes

The extent to which different-sized populations produce results with conflicting interpretations is illustrated through a typical study case in Box 1 .

Note that unlike in Figs.  1 C and 3 A, where Black is in the middle, in other figures, the overrepresentation of certain “alleles” (e.g., Fig. 4 B) shifts Black away from (0,0). Intuitively, this can be thought of as the most common “allele” (Green in Fig. 4 B) repelling Black, which has three null or alternative “alleles”.

PCA is commonly reported as yielding a stable differentiation of continental populations (e.g., Africans vs. non-Africans, Europeans vs. Asians, and Asians vs. Native Americans or Oceanians, on the primary PCs 40 , 41 , 42 , 43 ). This prompted prehistorical inferences of migrations and admixture, viewing the PCA results that position Africans, East Asians, and Europeans in three corners of an imaginary triangle as representing the post Out Of Africa event followed by multiple migrations, differentiation, and admixture events. Inferences for Amerindians or Aboriginals typically follow this reconstruction. For instance, Silva-Zolezzi et al. 42 argued that the Zapotecos did not experience a recent admixture due to their location on the Amerindian PCA cluster at the Asian end of the European-Asian cline.

Here we show that the appearance of continental populations at the corners of a triangle is an artifact of the sampling scheme, since variable sample sizes can easily create alternative results as well as alternative “clines”. We first replicated the triangular depiction of continental populations (Fig. 3A,B) before altering it (Fig. 3C–F). Now, East Asians appear as a three-way admixed group of Africans, Europeans, and Melanesians (Fig. 3C), whereas Europeans appear on an African-East Asian cline (Fig. 3D). Europeans can also be made to appear in the middle of the plot as an admixed group of African-Asian-Oceanian origins (Fig. 3E), and Oceanians can cluster with (Fig. 3F) or without East Asians (Fig. 3E). The latter depiction maximizes the proportion of explained variance, which common wisdom would consider the correct explanation. According to some of these results, only Europeans and Oceanians (Fig. 3C) or East Asians and Oceanians (Fig. 3D) experienced the Out of Africa event. By contrast, East Asians (Fig. 3C) and Europeans (Fig. 3D) may have remained in Africa. Contrary to Silva-Zolezzi et al.’s 42 claim, the same Mexican-American cohort can appear closer to Europeans (Fig. 3A) or as a European-Asian admixed group (Fig. 3B). It is easy to see that none of these scenarios stands out as more or less correct than the others.
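The sample-size sensitivity is reproducible with the color model alone. In the sketch below (our code), oversampling one population reweights the covariance matrix, which rotates the retained components and changes the apparent distance between two populations whose data never changed: with even samples, the Red-Black 2D distance is √(2/3) ≈ 0.82, but enriching Green shrinks it.

```python
import numpy as np

def red_black_pc_distance(counts):
    """2D PCA distance between the Red and Black centroids when the four color
    populations (Red, Green, Blue, Black) are sampled `counts` times each."""
    base = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [0., 0., 0.]])
    X = np.repeat(base, counts, axis=0)          # replicate each color population
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    pcs = U[:, :2] * S[:2]
    labels = np.repeat(np.arange(4), counts)
    red = pcs[labels == 0].mean(axis=0)
    black = pcs[labels == 3].mean(axis=0)
    return np.linalg.norm(red - black)

even = red_black_pc_distance([10, 10, 10, 10])   # sqrt(2/3) ~ 0.82, as with n = 1
skew = red_black_pc_distance([10, 200, 10, 10])  # oversampling Green shrinks it
```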

Figure 3

PCA of uneven-sized African (Af), European (Eu), Asian (As), and Mexican-American (Ma) or Oceanian (Oc) populations. Fixing the sample size of Mexican-Americans and altering the sample sizes of other populations: (A) nAf = 198; nEu = 20; nAs = 483; nMa = 64 and (B) nAf = 20; nEu = 343; nAs = 20; nMa = 64 changes the results. An even more dramatic change can be seen when repeating this analysis on Oceanians: (C) nAf = 5; nEu = 25; nAs = 10; nOc = 20 and (D) nAf = 5; nEu = 10; nAs = 15; nOc = 20 and when altering their sample sizes: (E) nAf = 98; nEu = 25; nAs = 150; nOc = 24 and (F) nAf = 98; nEu = 83; nAs = 30; nOc = 15.

Reich et al. 44 presented further PCA-based “evidence” for the ‘out of Africa’ scenario. Applying PCA to Africans and non-Africans, they reported that non-Africans cluster together at the center of African populations when PC1 was plotted against PC4 and that this “rough cluster[ing]” of non-Africans is “about what would be expected if all non-African populations were founded by a single dispersal ‘out of Africa.’” However, observing PC1 and PC4 in Supplementary Fig. S3, we found no “rough cluster” of non-Africans at the center of Africans, contrary to Reich et al.’s 44 claim. Remarkably, we found a “rough cluster” of Africans at the center of non-Africans (Supplementary Fig. S3C), suggesting that Africans were founded by a single dispersal ‘into Africa’ by non-Africans. We could also infer, based on PCA, either that Europeans never left Africa (Supplementary Fig. S3D), that Europeans left Africa through Oceania (Supplementary Fig. S3B), that Asians and Oceanians never left Europe (or the other way around) (Supplementary Fig. S3F), or, since all are valid PCA results, all of the above. Unlike Reich et al. 44, we do not believe that their example “highlights how PCA methods can provide evidence of important migration events”. Instead, our examples (Fig. 3, Supplementary Fig. S3) show how PCA can be used to generate conflicting and absurd scenarios, all mathematically correct but, obviously, biologically incorrect, and to cherry-pick the most favorable solution. This is an example of how vital a priori knowledge is to PCA. It is thereby misleading to present one or a handful of PC plots without acknowledging the existence of many other solutions, let alone while not disclosing the proportion of explained variance.

Box 1: Studying the origin of Black using the primary colors

Three research groups sought to study the origin of Black. A previous study that employed even sample-sized color populations alluded that Black is a mixture of all colors (Fig. 1B–D). A follow-up study with a larger sample size (nRed = nGreen = nBlue = 10) and enriched in Black samples (nBlack = 200) (Fig. 4A) reached the same conclusion. However, the Black-is-Blue group suspected that the Blue population was mixed. After QC procedures, the Blue sample size was reduced, which decreased the distance between Black and Blue and supported their speculation that Black has a Blue origin (Fig. 4B). The Black-is-Red group hypothesized that the underrepresentation of Green, compared to its actual population size, masks the Red origin of Black. They comprehensively sampled the Green population and showed that Black is very close to Red (Fig. 4C). Another Black-is-Red group contributed to the debate by genotyping more Red samples. To reduce the bias from other color populations, they kept the Blue and Green sample sizes even. Their results replicated the previous finding that Black is closer to Red and thereby shares a common origin with it (Fig. 4D). A new Black-is-Green group challenged those results, arguing that the small sample size and omission of Green samples biased the results. They increased the sample sizes of the populations of the previous study and demonstrated that Black is closer to Green (Fig. 4E). The Black-is-Blue group challenged these findings on the grounds of the relatively small sample sizes that may have skewed the results and dramatically increased all the sample sizes. However, believing that they are of Purple descent, Blue refused to participate in further studies. Their relatively small cohort was explained by their isolation and small effective population size. The results of the new sampling scheme confirmed that Black is closer to Blue (Fig. 4F), and the group was praised for the large sample sizes that, no doubt, captured the actual variation in nature better than the former studies.

Figure 4

PCA of uneven-sized samples of four color populations. (A) nRed = nGreen = nBlue = 10; nBlack = 200, (B) nRed = nGreen = 10; nBlue = 5; nBlack = 200, (C) nRed = 10; nGreen = 200; nBlue = 50; nBlack = 200, (D) nRed = 25; nGreen = nBlue = 50; nBlack = 200, (E) nRed = 300; nGreen = 200; nBlue = nBlack = 300, and (F) nRed = 1000; nGreen = 2000; nBlue = 300; nBlack = 2000. Scatter plots show the top two PCs. The numbers on the grey bars reflect the Euclidean distances between the color populations over all PCs. Colors include Red [1,0,0], Green [0,1,0], Blue [0,0,1], and Black [0,0,0].

The case of one admixed population

The question of who the ancestors of admixed populations are and the extent of their contribution to other groups is at the heart of population genetics. It may not be surprising that authors hold conflicting views on interpreting these admixtures from PCA. Here, we explore how an admixed group appears in PCA, whether its ancestral groups are identifiable, and how its presence affects the findings for unmixed groups through a typical study case (Box 2 ).

To understand the impact of parameter choices on the interpretation of PCA, we revisited the first large-scale study of Indian population history carried out by Reich et al. 45 . The authors applied PCA to a cohort of Indians, Europeans, Asians, and Africans using various sample sizes that ranged from 2 (Srivastava) (out of 132 Indians) to 203 (Yoruban) samples. After applying PCA to Indians and the three continental populations to exclude “outliers” that supposedly had more African or Asian ancestries than other samples, PCA was applied again in various settings.

At this point, the authors engaged in circular logic as, on the one hand, they removed samples that appeared via PCA to have experienced gene flow from Africa (their Note 2, iii) and, on the other hand, employed an a priori claim (unsupported by historical documents) that “African history has little to do with Indian history” (which must stand in sharp contrast to the rich history of gene flow from Utah (US) residents to Indians, which was equally unsupported). Reich et al. provided no justification for the exact protocol used or any discussion of the impact of using different parameter values on the resulting clusters. They then generated a plethora of conflicting PCA figures, never disclosing the proportion of explained variance of the first four PCs examined. They then inferred based on PCA that Gujarati Americans exhibit no “unusual relatedness to West Africans (YRI) or East Asians (CHB or JPT)” (Supplementary Fig. S4) 45. Their concluding analysis of Indians, Asians, and Europeans (Fig. 4) 45 showed Indians at the apex of a triangle with Europeans and Asians at the opposite corners. This plot was interpreted as evidence of an “ancestry that is unique to India” and an “Indian cline”. Indian groups were explained to have inherited different proportions of ancestry from “Ancestral North Indians” (ANI), related to western Eurasians, and “Ancestral South Indians” (ASI), who split from the Onge. The authors then followed up with additional analyses using Africans as an outgroup, supposedly confirming the results of their selected PCA plot. Indians have since been described using the terms ANI and ASI.

In evaluating the claims of Reich et al. 45 that rest on PCA, we first replicated the finding of the alleged “Indian cline” (Fig. 5A). We next garnered support for an alternative cline using Indians, Africans, and Europeans (Fig. 5B). We then demonstrated that PCA results support Indians to be European (Fig. 5C), East Asian (Fig. 5D), and African (Fig. 5E), as well as a genuinely European-Asian, admixed population (Fig. 5F). Whereas the first two PCs of Reich et al.’s primary figure explain less than 8% of the variation (according to our Fig. 5A; Reich et al.’s Fig. 4 does not report this information), four out of five of our alternative depictions explain 8–14% of the variation. Our results also expose the arbitrariness of the scheme used by Reich et al. and show how radically different clustering can be obtained merely by manipulating the non-Indian populations used in the analyses. Our results also question the authors’ choice of an analysis that explained such a small proportion of the variation (let alone not reporting it) and yielded no support for a unique ancestry to India, and they cast doubt on the reliability and usefulness of the ANI-ASI model to describe Indians, given its exclusive reliance on a priori knowledge in interpreting the PCA patterns. Although supported by downstream analyses, the plurality of PCA results could not be used to support the authors’ findings because, using PCA, it is impossible to answer a priori whether Africa is in India or the other way around (Fig. 5E). We speculate that the motivation for Reich et al.’s strategy was to declare Africans an outgroup, an essential component of D-statistics. Clearly, PCA-based a posteriori inferences can lead to errors of Colombian magnitude.

Figure 5

Studying the origin of Indians using PCA. ( A ) Replicating Reich et al.’s 45 results using n Eu  = 99; n As  = 146; n Ind  = 321. Generating alternative PCA scenarios using: ( B ) n Af  = 178; n Eu  = 99; n Ind  = 321, ( C ) n Af  = 400; n Eu  = 40; n As  = 100; n Ind  = 321, ( D ) n Af  = 477; n Eu  = 253; n As  = 23; n Ind  = 321, ( E ) n Af  = 25; n Eu  = 220; n As  = 490; n Ind  = 320, and ( F ) n Af  = 30; n Eu  = 200; n As  = 50; n Ind  = 320.

To evaluate the extent of deviation of PCA results from genetic distances, we adopted a simple genetic distance scheme where we measured the Euclidean distance between allelic counts (0, 1, 2) in the same data used for PCA calculations. We are aware of the diversity of existing genetic distance measures. However, to the best of our knowledge, no study has ever shown that PCA outcomes numerically correlate with any genetic distance measure, except in very simple scenarios; the same holds for ADMIXTURE-like tools, which, like PCA, exhibit high design flexibility. Plotting the genetic distances against those obtained from the top two PCs shows the deviation between these two measures for each dataset. We found that all the PC projections (Fig. 6) distorted the genetic distances in unexpected ways that differ between the datasets. PCA correctly represented the genetic distances for only a minority of the populations, and neither these nor the most poorly represented populations were distinguishable from the others. Moreover, populations that clustered under PCA exhibited mixed results, questioning the accuracy of PCA clusters. Although it remains unclear which sampling scheme to adopt, neither scheme is genetically accurate. These results further question the genetic validity of the ANI-ASI model.
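The comparison scheme is straightforward to sketch (our code, with random genotypes for illustration): compute Euclidean distances once on the raw 0/1/2 allele counts and once on the top-two PC coordinates, normalize each matrix to [0, 1], and inspect the deviation; pairs represented faithfully would sit on the x = y line.

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.integers(0, 3, size=(30, 100)).astype(float)  # 30 samples x 100 SNPs (0/1/2)

X = G - G.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U[:, :2] * S[:2]                                # top-two PC coordinates

def normalized_pairwise(M):
    """All pairwise Euclidean distances, scaled to [0, 1] by the maximum."""
    d = np.linalg.norm(M[:, None, :] - M[None, :, :], axis=-1)
    return d / d.max()

genetic = normalized_pairwise(G)      # distances on raw allele counts
pc2d = normalized_pairwise(pcs)       # distances in the 2D PCA plane
deviation = np.abs(genetic - pc2d)    # 0 only where PCA preserves the distance
```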

Figure 6

Comparing the genetic distances with PCA-based distances for the corresponding datasets of Fig. 5. Genetic and PCA (PC1 + PC2) distances between population pairs (symbol pairs) and 2000 random individual pairs (grey dots) were calculated using Euclidean distances and normalized to range from 0 to 1. Population and individual pairs whose PC distances reflect their genetic distances are shown along the x = y dotted line. Note that the position of heterogeneous populations on the plot may deviate from that of their samples and that some populations are very small.

We are aware that PCA disciples may reject our reductio ad absurdum argument and attempt to read into these results, as ridiculous as they may be, a valid description of Indian ancestry. For those readers, demonstrating the ability of the experimenter to generate near-endless contradictory historical scenarios using PCA may be more convincing or at least exhausting. For brevity, we present six more such scenarios in which PCA supports: Indians as a heterogeneous group with European admixture and Mexican-Americans as an Indian-European mixed population (Supplementary Fig. S4A), Mexican-Americans as an admixed African-European group with Indians as a heterogeneous group with European admixture (Supplementary Fig. S4B), Indians and Mexican-Americans as European-Japanese admixed groups with common origins and high genetic relatedness (Supplementary Fig. S4C), Indians and Mexican-Americans as European-Japanese admixed groups with no common origins or genetic relatedness (Supplementary Fig. S4D), Europeans as an Indian and Mexican-American admixed group with the Japanese fully clustering with the latter (Supplementary Fig. S4E), and the Japanese and Europeans clustering as admixed Indian and Mexican-American groups (Supplementary Fig. S4F). Readers are encouraged to use our code to produce novel alternative histories. We suspect that almost any topology could be obtained by finding the right set of input parameters. In this sense, any PCA output can reasonably be considered meaningless.

Contrary to Reich et al.'s claims, a more common interpretation of PCA is that the populations at the corners of the triangle are ancestral or are related to the mixed groups within the triangle, which are the outcome of admixture events, typically referred to as “gradient” or “clines 45 ”. However, some authors held different opinions. Studying the African component of Ethiopian genomes, Pagani et al. 46 produced a PC plot showing Europeans (CEU), Yoruba (western African), and Ethiopians (Eastern Africans) at the corners of a triangle (Supplementary Fig. S4 ) 46 . Rather than suggesting that the populations within the triangle (e.g., Egyptians, Spaniards, Saudi) are mixtures of these supposedly ancestral populations, the authors argued that Ethiopians have western and eastern African origins, unlike the central populations with “different patterns of admixture”. Obviously, neither interpretation is correct. Reich et al.’s interpretation does not explain why CEUs are not an Indian-African admix nor why Africans are not a European-Indian admix and is analogous to arguing that Red has Green and Blue origins (Fig.  1 ). Pagani et al.’s interpretation is a tautology, ignores the contribution of non-Africans, and is analogous to arguing that Red has Red and Green origins. We carried out forward simulations of populations with various numbers of ancestral populations and found that admixture cannot be inferred from the positions of samples in a PCA plot (Supplementary Text 1 ).

In a separate effort to study the origins of AJs, Need et al. 47 applied PCA to 55 Ashkenazic Jews (AJs) and 507 non-Jewish Caucasians. Their PCA plot showed that AJs (marked as “Jews”) formed a distinct cluster from Europeans (marked as “non-Jews”). Based on these results, the authors suggested that PCA can be used to detect linkage to Jewishness. A follow-up PCA where Middle Eastern (Bedouin, Palestinians, and Druze) and Caucasus (Adygei) populations were included showed that AJs formed a distinct cluster that nested between the Adygei (and the European cluster) and Druze (and the Middle Eastern cluster). The authors then concluded that AJs might have mixed Middle Eastern and European ancestries. The proximity to the Adygei cluster was noted as interesting but dismissed based on the small sample size of the Adygei ( n  = 17). The authors concluded that AJ genomes carry an “unambiguous signature of their Jewish heritage, and this seems more likely to be due to their specific Middle Eastern ancestry than to inbreeding”. A similar strategy was employed by Bray et al. 48 to claim that PCA “confirmed that the AJ individuals cluster distinctly from Europeans, aligning closest to Southern European populations along with the first principal component, suggesting a more southern origin, and aligning with Central Europeans along the second, consistent with migration to this region.” Other authors 49 , 50 made similar claims.

It is easy to show why PCA cannot be used to reach such conclusions. We first replicated Need et al.’s 47 primary results (Fig. 7 A), showing that AJs cluster separately from Europeans. However, such an outcome is typical when comparing Europeans and non-European populations like Turks (Fig. 7 B). It is not unique to AJs, nor does it prove that they are genetically detectable. A slightly modified design shows that most AJs overlap with Turks, in support of the Turkic (or Near Eastern) origin of AJs (Fig. 7 C). We can easily refute our own conclusion by including continental populations and showing that most AJs cluster with Iberians rather than Turks (Fig. 7 D). This last design explains more of the variance than all the previous analyses together, although, as should be evident by now, it is not indicative of accuracy. This analysis casts doubt on the use of PCA as a discriminatory genetic utility and as a tool for inferring genetic ancestry.

figure 7

Studying the origin of 55 AJs using PCA. ( A ) Replicating Need et al.’s results using n Eu  = 507; Generating alternative PCA scenarios using: ( B ) n Eu  = 223; n Turks  = 56; ( C ) n Eu  = 400; n Turks + Caucasus  = 56, and ( D ) n Af  = 100, n As  = 100 (Africans and Asians are not shown), n Eu  = 100; and n Turks  = 50. Need et al.'s faulty terminology was adopted in A and B .

There are several more oddities with the report of Need et al. 47 . First, they did not report the variance explained by their sampling scheme (it is likely ~1%, as in Fig. 7 A). Second, they misrepresented the actual populations analyzed. AJs are not the only Jews, and Europeans are not the only non-Jews (Figs.  1 , 7 A) 47 . Finally, their dual interpretations of AJs as a mixed population of Middle Eastern origin are based solely on a priori belief: first, because most of the populations in their PCA are nested between and within other populations, yet the authors did not suggest that they are all admixed; and second, because AJs nested between Adygei and Druze 51 , 52 , both formed in the Near East. The conclusions of Need et al. 47 were thereby obtained based on particular PCA schemes and what may be preconceived ideas of AJ origins that are no more real than the Iberian origin of AJs (Fig. 7 D). This is yet another demonstration (discussed in Elhaik 36 ) of how PCA can be misused to promote ethnocentric claims due to its design flexibility.

Box 2: Studying the origin of Black using the primary and one secondary (admixed) color populations

Following criticism of the sampling scheme used to study the origin of Black (Box 1 ), the redoubtable Black-is-Red group genotyped Cyan. Using even sample sizes, they demonstrated that Black is closer to Red ( D Black-Red  = 0.46) (Fig. 8 A), where D is the Euclidean distance between the samples over all three PCs (short distances indicate high similarity). The Black-is-Green school criticized their findings on the grounds that their Cyan samples were biased and their results do not apply to the broad Black cohort. They also reckoned that the even sampling scheme favored Red because Blue is related to Cyan through shared language and customs. The Black-is-Red group responded by enriching their cohort in Cyan and Black ( n Cyan , n Black  = 1000) and provided even more robust evidence that Black is Red ( D Black-Red  = 0.12) (Fig. 8 B). However, the Black-is-Green camp dismissed these findings. Conscious of the effects of admixture, they retained only the most homogeneous Green and Cyan ( n Green , n Cyan  = 33), genotyped new Blue and Black ( n Blue , n Black  = 400), and analyzed them with the published Red cohort ( n Red  = 100). The Black-is-Green results supported their hypothesis that Black is Green ( D Black-Green  = 0.27) and that Cyan shared a common origin with Blue ( D Blue-Green  = 0.27) (Fig. 8 C) and should thereby be considered an admixed Blue population. Unsurprisingly, the Black-is-Red group claimed that these results were due to the under-representation of Black since when they oversampled Black, PCA supported their findings (Fig. 8 A). In response, the Black-is-Green school maintained even sample sizes for Cyan, Blue, and Green ( n Blue , n Green , n Cyan  = 33) and enriched Black and Red ( n Red , n Black  = 100).
Not only did their results ( D Black-Green  = 0.63 <  D Black-Red  = 0.89) support their previous findings, but they also demonstrated that Green and Blue completely overlapped, presumably due to their shared co-ancestry, and that together with Cyan ( D Cyan-Green  = 0.63 <  D Cyan-Red  = 1.09) (Fig. 8 B,D) they represent an antique color clade. They explained that these color populations only appeared separated due to genetic drift. However, they still retained sufficient cryptic genetic information that PCA can uncover if the correct sampling scheme is used. Further analyses by the other groups contested these findings (Supplementary Fig. S5 A-D). Among other claims, it was argued that Black is a Green–Red admixed group (Supplementary Fig. S5 C) and that Black and Cyan were the ancestors of Blue and Green (Supplementary Fig. S5 D).
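The color model and D statistic of this box can be sketched as follows; the noise level and sampling schemes are illustrative, and D is computed here over the top two PCs (the box uses all three). Re-weighting sample sizes alone shifts D(Black, Red):

```python
# A minimal sketch (not the paper's code) of the color model: individuals are
# RGB vectors with small Gaussian noise, and D is the Euclidean distance
# between population means in the top-two-PC space.
import numpy as np

rng = np.random.default_rng(1)
COLORS = {"Red": [1, 0, 0], "Green": [0, 1, 0], "Blue": [0, 0, 1],
          "Cyan": [0, 1, 1], "Black": [0, 0, 0]}

def pc_distance(ns, a="Black", b="Red", noise=0.05):
    X, labels = [], []
    for name, n in ns.items():
        X.append(np.array(COLORS[name]) + noise * rng.normal(size=(n, 3)))
        labels += [name] * n
    X, labels = np.vstack(X), np.array(labels)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    s = Xc @ Vt[:2].T                              # top two PCs only
    mean = {p: s[labels == p].mean(axis=0) for p in ns}
    return np.linalg.norm(mean[a] - mean[b])

even = {c: 100 for c in COLORS}                    # even sampling scheme
skew = dict(even, Cyan=1000, Black=1000)           # enrich Cyan and Black
d_even, d_skew = pc_distance(even), pc_distance(skew)
print(d_even, d_skew)                              # D(Black, Red) shifts
```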

figure 8

PCA with the primary and mixed color populations. ( A ) n all  = 100; n Black  = 200, ( B ) n Red  =  n Green  =  n Blue  = 100; n Black  =  n Cyan  = 500, ( C ) n Red  = 100; n Green  =  n Cyan  = 33; n Blue  =  n Black  = 400; and ( D ) n Red  =  n Black  = 100; n Green  =  n Blue  =  n Cyan  = 33; Scatter plots show the top two PCs. The numbers on the grey bars reflect the Euclidean distances between the color populations over all PCs. Colors include Red [1,0,0], Green [0,1,0], Blue [0,0,1], Cyan [0,1,1], and Black [0,0,0].

The case of a multi-admixed population

The question of how analyzing admixed groups with multiple ancestral populations affects the findings for unmixed groups is illustrated through a typical study case in Box 3 .

To understand how PCA can be misused to study multiple mixed populations, we will investigate other PCA applications to the study of AJs. Such analyses follow a thematic interpretation, where the clustering of AJ samples is evidence of a shared Levantine origin, e.g., Refs. 12 , 13 , where “short” distances between AJs and Levantines indicate close genetic relationships in support of a shared Levantine past, e.g., Ref. 12 , and where the “short” distances between AJs and Europeans are evidence of admixture 13 . Finally, as a rule, the much shorter distances between AJs and the Caucasus or Turkish populations, observed by all recent studies, were ignored 12 , 13 , 47 , 48 . Bray et al. 48 concluded that not only do AJs have a “more southern origin” but that their alignment with Central Europeans is “consistent with migration to this region”. In these studies, “short” and “between” received a multitude of interpretations. For example, Gladstein and Hammer's 53 PCA plot, which showed AJs at the extreme edge of the plot with Bedouins and French at the other edges, was interpreted as AJs clustering “tightly between European and Middle Eastern populations”. The authors interpreted the lack of “outliers” among AJs (which were never defined) as evidence of common AJ ancestry.

Following the rationale of these studies, it is easy to show how PCA can be orchestrated to yield a multitude of origins for AJs. We replicated the observation that AJs are a “population isolate,” i.e., that AJs form a distinct group, separated from all other populations (Fig. 9 A), and are thereby genetically distinguishable 47 . We also replicated the most common yet often-ignored observation, that AJs cluster tightly with Caucasus populations (Fig. 9 B). We next produced novel results where AJs cluster tightly with Amerindians due to the north Eurasian or Amerindian origins of both groups (Fig. 9 C). We can also show that AJs cluster much closer to South Europeans than to Levantines (Fig. 9 D) and overlap Finns entirely, as solid evidence of AJs’ ancient Finnish origin (Fig. 9 E). Last, we wish to refute our previous finding and show that only half of the AJs are of Finnish origin, with the remaining analysis supporting the lucrative Levantine origin (Fig. 9 F)—a discovery touted by all the previous reports though never actually shown. Excitingly enough, the primary PCs of this last Eurasian Finnish-Levantine mixed-origin depiction explained the highest amount of variance. An intuitive interpretation of those results is a recent migration of the Finnish AJs to the Levant, where they experienced high admixture with the local Levantine populations that altered their genetic background. These examples demonstrate that PCA can generate nonsensical results for the very same populations, with no a posteriori knowledge to distinguish the credible from the absurd.

figure 9

An in-depth study of the origin of AJs using PCA in relation to Africans (Af), Europeans (Eu), East Asians (Ea), Amerindians (Am), Levantines (Le), and South Asians (Sa). ( A ) n Eu  = 159; n AJ  = 60; n Le  = 82, ( B ) n Af  = 30; n Eu  = 159; n Ea  = 50; n AJ  = 60; n Le  = 60, ( C ) n Af  = 30; n Ea  = 583; n AJ  = 60; n Am  = 255; ( D ) n Af  = 200; n Eu  = 115; n Ea  = 200; n AJ  = 60; n Le  = 235; n Sa  = 88, ( E ) n Af  = 200; n Eu  = 30; n AJ  = 400, n Le  = 80 ( F ) n Af  = 200; n Eu  = 30; n AJ  = 50; n Le  = 160. Large squares indicate insets.

Box 3: Studying the origin of Black using the primary and multiple mixed colors

The value of using mixed color populations to study origins prompted new analyses using even (Fig. 10 A) and variable sample sizes (Fig. 10 B–D). Using this novel sampling scheme, the Black-is-Green school reaffirmed, in a series of analyses, that Black is closest to Green (Fig. 10 A,C,D), but a different cohort yielded a novel finding that Black is closest to Pink (Fig. 10 B).

figure 10

PCA with the primary and multiple mixed color populations. ( A ) n all  = 50, ( B ) n all  = 50 or 10, ( C , D ) n All  = [50, 5, 100, or 25]. Scatter plots show the top two PCs. Color codes are shown. ( E ) The difference between the true distances, calculated in the 3D color space, between every color population pair (shown side by side) from ( D ) and their Euclidean distances calculated from the top two PCs. Pairs whose PC distances from each other reflect their true 3D distances are shown along the x = y dotted line. One of the largest PCA distortions is the distance between the Red and Green populations (inset). The true Red-Green distance is 1.41 (x-axis), but the PCA distance is 0.5 (y-axis).

The extent to which PCA distances obtained from the top two PCs reflect the true distances among color population pairs is shown in Fig. 10 E. PCA distorted the distances between most color populations, but the distortion was uneven among the pairs; while a minority of the pairs are correctly projected via PCA, most are not. Identifying which pairs are correctly projected is impossible without a priori information. For example, some shades of blue and purple were less biased than similar shades. We thereby show that PCA-inferred distances are biased in an unpredictable manner and are thereby uninformative for clustering.

The case of multiple admixed populations without “unmixed” populations

Unlike stochastic models that possess inherent randomness, PCA is a deterministic process, a property that contributes to its perceived robustness. To explore its behavior, we tested whether the same computer code produces similar or different results when the only element that changes is the standard randomization technique used throughout this paper to generate the individual samples of the color populations.

We evaluated two color sets. In the first set, Black was closest to Yellow (Fig.  11 A), Purple (Fig.  11 C), and Cyan (Fig.  11 D,E). When White was added in the second set, Black behaved as an outgroup, and the distances between the secondary colors largely deviated from expectation, producing false results (Fig.  11 D–F). These results illustrate the sensitivity of PCA to tiny changes in the dataset, unrelated to the populations or the sample sizes.
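A sketch of this rerun experiment with hypothetical secondary-color populations: the code is identical between runs, and only the seed of the random draw that generates the individuals changes, yet the color closest to Black in the top-two-PC space can flip across reruns.

```python
# Sketch: rerun the identical analysis, changing only the random draw used to
# generate the individuals, and record which color lands closest to Black in
# the top-two-PC space. Populations and noise level are illustrative.
import numpy as np

COLORS = {"Cyan": [0, 1, 1], "Purple": [1, 0, 1], "Yellow": [1, 1, 0],
          "Black": [0, 0, 0]}

def closest_to_black(seed, n=50, noise=0.2):
    rng = np.random.default_rng(seed)
    X, labels = [], []
    for name, c in COLORS.items():
        X.append(np.array(c) + noise * rng.normal(size=(n, 3)))
        labels += [name] * n
    X, labels = np.vstack(X), np.array(labels)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    s = Xc @ Vt[:2].T
    mean = {p: s[labels == p].mean(axis=0) for p in COLORS}
    others = [p for p in COLORS if p != "Black"]
    return min(others, key=lambda p: np.linalg.norm(mean[p] - mean["Black"]))

winners = {closest_to_black(seed) for seed in range(20)}
print(winners)  # the set of "closest" colors observed across reruns
```

Because the four centered color centroids form a near-regular tetrahedron with an almost isotropic covariance, the orientation of the top-two-PC plane is driven by sampling noise alone, which is why the winner is unstable.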

figure 11

Studying the effects of minor sample variation on PCA results using color populations ( n all  = 50). ( A – C ) Analyzing secondary colors and Black. ( D – E ) Analyzing secondary colors, White, and Black. Scatter plots show the top two PCs. Colors include Cyan [0,1,1], Purple [1,0,1], Yellow [1,1,0], White [1,1,1], and Black [0,0,0].

To explore this effect on human populations, we curated a cohort of 16 populations. We carried out PCA on ten random individuals from 15 random populations. We show that these analyses result in spurious and conflicting results (Fig.  12 ). Puerto Ricans, for instance, clustered close to Europeans (A), between Africans and Europeans (B), close to Adygei (C), and close to Europeans and Adygei (D). Indians clustered with Mexicans (A, B, and D) or apart from them (C). Mexicans themselves cluster with (A and D) or without (B and C) Africans. Papuans and Russians cluster close to (B) or far from (C) East Asian populations. More robust clustering was observed for East Asians, Caucasians, and Europeans, as well as Africans. However, these were not only indistinguishable from the less robust clustering but also failed to replicate over multiple runs (results not shown). These examples show that PCA results are unpredictable and irreproducible even when 94% of the populations are the same. Note that the proportion of explained variance was similar in all the analyses, demonstrating that it is not an indication of accuracy or robustness.

figure 12

Studying the effect of sampling on PCA results. A cohort of 16 worldwide populations (see legend) was selected. In each analysis, a random population was excluded. Populations were represented by random samples ( n  = 10). The clusters highlight the most notable differences.

We found that, although PCA is a deterministic process, it behaves unexpectedly, and minor variations can lead to an ensemble of different outputs that appear stochastic. This effect is more substantial when continental populations are excluded from the analysis.

The cases of case–control matching and GWAS

Samples of unknown ancestry or self-reported ancestry are typically identified by applying PCA to a cohort of test samples combined with reference populations of known ancestry (e.g., 1000 Genomes), e.g., Refs. 22 , 54 , 55 , 56 . To test whether using PCA to identify the ancestry of an unknown cohort with known samples is feasible, we simulated a large and heterogeneous Cyan population (Fig.  13 A, circles) of self-reported Blue ancestry. Following a typical GWAS scheme, we carried out PCA for these individuals and seven known and distinct color populations. PCA grouped the Cyan individuals with Blue and Black individuals (Fig.  13 B), although none of the Cyan individuals were Blue or Black (Fig.  13 A), as a different PCA scheme confirmed (Fig.  13 C). A case–control assignment of this cohort to Blue or Black based on the PCA result (Fig.  13 B) produced poor matches that reduced the power of the analysis. When repeating the analysis with different reference populations (Fig.  13 D), the simulated individuals exhibited minimal overlap with Blue, no overlap with Black, and overlapped mostly with the Cyan reference population present this time. We thereby showed that the clustering with Blue and Black is an artifact of the choice of reference populations. In other words, the introduction of reference populations with mismatched ancestries relative to the unknown samples biases the ancestry inference of the latter.
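The reference-mismatch artifact can be sketched as follows. The nearest-centroid assignment and all population centers and sizes are illustrative assumptions, not the matching procedure of any particular GWAS pipeline: a Cyan-like test cohort is absorbed by Blue and Green when no Cyan reference exists, and by Cyan when one does.

```python
# Sketch of the reference-mismatch artifact: a "self-reported Blue" test
# cohort drawn around Cyan is assigned to whichever reference panel happens
# to be present. All names, centers, and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(2)

def make(center, n, noise=0.1):
    return np.array(center) + noise * rng.normal(size=(n, 3))

def assign(test, refs):
    """Label each test sample with the nearest reference centroid in PC1-2."""
    X = np.vstack([test] + list(refs.values()))
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    s = Xc @ Vt[:2].T
    t, r = s[:len(test)], s[len(test):]
    centroids, names, start = [], list(refs), 0
    for name in names:
        k = len(refs[name])
        centroids.append(r[start:start + k].mean(axis=0))
        start += k
    d = np.linalg.norm(t[:, None, :] - np.array(centroids)[None, :, :], axis=-1)
    hits = np.array(names)[d.argmin(axis=1)]
    return {name: int((hits == name).sum()) for name in names}

test = make([0, 0.8, 0.8], 200)          # heterogeneous Cyan-like cohort
refs_no_cyan = {"Blue": make([0, 0, 1], 100), "Black": make([0, 0, 0], 100),
                "Green": make([0, 1, 0], 100), "Red": make([1, 0, 0], 100)}
refs_with_cyan = dict(refs_no_cyan, Cyan=make([0, 1, 1], 100))

no_cyan = assign(test, refs_no_cyan)
with_cyan = assign(test, refs_with_cyan)
print(no_cyan)    # the Cyan-like cohort is forced into the available panels
print(with_cyan)  # mostly Cyan once the right reference panel exists
```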

figure 13

Evaluating the accuracy of PCA clustering for a heterogeneous test population in a simulation of a GWAS setting. ( A ) The true distribution of the test Cyan population ( n  = 1000). ( B ) PCA of the test population with eight even-sized ( n  = 250) samples from reference populations. ( C ) PCA of the test population with Blue from the previous analysis shows a minimal overlap between the cohorts. ( D ) PCA of the test population with five even-sized ( n  = 250) samples from reference populations, including Cyan (marked by an arrow). Colors ( B ) from top to bottom and left to right include: Yellow [1,1,0], light Red [1,0,0.5], Purple [1,0,1], Dark Purple [0.5,0,0.5], Black [0,0,0], dark Green [0,0.5,0], Green [0,1,0], and Blue [0,0,1].

We next asked whether PCA results can group Europeans into homogeneous clusters. Analyzing four European populations yielded 43% homogeneous clusters (Fig.  14 A). Adding Africans and Asians and then South Asian populations decreased the European cluster homogeneity to 14% and 10%, respectively (Fig.  14 B,C). Including the 1000 Genomes populations, as customarily done, yielded 14% homogeneous clusters (Fig.  14 D). Although the Europeans remained the same, the addition of other continental populations resulted in a three to four times decrease in the homogeneity of their clusters.

figure 14

Evaluating the cluster homogeneity of European samples. PCA was applied to the four European populations (Tuscan Italians [TSI], Northern and Western Europeans from Utah [CEU], British [GBR], and Spanish [IBS]) alone ( A ), together with an African and an Asian population ( B ), as well as a South Asian population ( C ), and finally with all the 1000 Genomes populations ( D ). ( E ) Evaluating the usefulness of PCA-based clustering. The bottom two plots show the sizes of non-homogeneous and homogeneous clusters, and the top three plots show the proportion of individuals in homogeneous clusters. Each plot shows the results for 10 or 20 random African, European, or Asian populations for the same PCs ( x -axis).

The number of PCs analyzed in the literature ranges from 2 to, at least, 280 35 , which raises the question of whether using more PCs increases cluster homogeneity or is another cherry-picking strategy. We calculated the cluster homogeneity for different PCs for either 10 or 20 African ( n 10  = 337, n 20  = 912), Asian ( n 10  = 331, n 20  = 785), and European ( n 10  = 440, n 20  = 935) populations of similar sample sizes (Fig.  14 E). Even in this favorable setting that included only continental populations, on average, the homogeneous clusters identified using PCA were significantly smaller than the non-homogeneous clusters ( µ Homogeneous  = 12.5 samples; µ Non-homogeneous  = 42.6 samples; Kruskal–Wallis test [ n Homogeneous  =  n Non-homogeneous  = 238 samples, p  = 1.95 × 10 –75 , Chi-square = 338]) and included a minority of the individuals when 20 populations were analyzed. Analyzing higher PCs decreased the size of the homogeneous clusters and increased the size of the non-homogeneous ones. The maximum number of individuals in the homogeneous clusters fluctuated for different populations and sample sizes. Mixing other continental populations with each cohort decreased the homogeneity of the clusters and their sizes (results not shown). Overall, these examples show that PCA is a poor clustering tool, particularly as sample size increases, in agreement with Elhaik and Ryan 57 , who reported that PCA clusters are neither genetically nor geographically homogeneous and that PCA does not handle admixed individuals well. Note that the cluster homogeneity in this limited setting should not be confused with the amount of variance explained by additional PCs.
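The homogeneity measurement can be sketched as follows, with synthetic stand-ins for four close "European" and two distant "continental" populations. The farthest-point-initialized k-means and the 4-dimensional toy genotypes are assumptions made for the sketch, not the paper's procedure.

```python
# Sketch: PCA to two PCs, a small k-means, then the fraction of "European"
# individuals sitting in single-population clusters. All data are synthetic.
import numpy as np

rng = np.random.default_rng(3)

def top2(X):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def kmeans(X, k, iters=30):
    C = [X[0]]                                    # farthest-point initialization
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None] - np.array(C)[None], axis=-1).min(axis=1)
        C.append(X[d.argmax()])
    C = np.array(C)
    for _ in range(iters):                        # Lloyd iterations
        lab = np.linalg.norm(X[:, None] - C[None], axis=-1).argmin(axis=1)
        C = np.array([X[lab == j].mean(axis=0) if (lab == j).any() else C[j]
                      for j in range(k)])
    return lab

def homogeneous_fraction(lab, pops, mask):
    """Fraction of mask-ed individuals whose cluster holds one population."""
    ok = np.zeros(len(lab), bool)
    for j in np.unique(lab):
        m = lab == j
        if len(np.unique(pops[m])) == 1:
            ok |= m
    return ok[mask].mean()

n, sd = 50, 0.02
# Four close "European" populations differ only in dimensions 3-4.
eu_centers = [(0, 0, 0, 0), (0, 0, .3, 0), (0, 0, 0, .3), (0, 0, .3, .3)]
X = [np.array(c) + sd * rng.normal(size=(n, 4)) for c in eu_centers]
pops = [f"EU{i}" for i in range(4) for _ in range(n)]
# Two distant "continental" populations differ in dimensions 1-2.
for i, c in enumerate([(10, 0, 0, 0), (0, 10, 0, 0)]):
    X.append(np.array(c) + sd * rng.normal(size=(n, 4)))
    pops += [f"CONT{i}"] * n
X, pops = np.vstack(X), np.array(pops)
is_eu = np.char.startswith(pops, "EU")

alone = homogeneous_fraction(kmeans(top2(X[is_eu]), 4), pops[is_eu],
                             np.ones(is_eu.sum(), bool))
mixed = homogeneous_fraction(kmeans(top2(X), 6), pops, is_eu)
print(f"homogeneity, Europeans alone: {alone:.2f}; with continents: {mixed:.2f}")
```

Once the distant populations dominate PC1-2, the axes separating the close populations fall out of the retained PCs and their clusters stop being homogeneous.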

To further assess whether PCA clustering represents shared ancestry or biogeography, two of the most common applications of PCA, e.g., Ref. 22 , we applied PCA to 20 Puerto Ricans (Fig.  15 ) and 300 Europeans. The Puerto Ricans clustered indistinguishably with Europeans (by contrast to Fig.  12 ) using the first two and higher PCs (Fig.  15 ). The Puerto Ricans represented over 6% of the cohort, sufficient to generate a stratification bias in an association study. We tested this by randomly assigning case–control labels to the European samples, with all the Puerto Ricans as controls. We then assigned causal alleles to the evenly-sized cohorts and computed the association before and after PCA adjustment. We repeated the analysis with randomly assigned labels for all the samples. In all our 12 case–control analyses, the outcomes of the PCA adjustment for 2 and 10 PCs were worse than the unadjusted results, i.e., PCA-adjusted results had more false positives, fewer true positives, and weaker p -values than the unadjusted results (Supplementary Text 3 ).
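The mechanics of PC adjustment in an association test can be sketched as follows; the residualization approach, the 1-df trend statistic, and the simulated confounded cohort are illustrative simplifications, not the paper's pipeline (here, adjustment removes a purely stratification-driven signal).

```python
# Sketch of PC adjustment in an association test (not the paper's pipeline):
# genotype and phenotype are residualized on the top PCs, and a 1-df trend
# statistic n*r^2 is compared before and after adjustment.
import math
import numpy as np

rng = np.random.default_rng(7)

def chi2_p(stat):               # 1-df chi-square tail via the error function
    return math.erfc(math.sqrt(stat / 2))

def trend_stat(g, y):
    r = np.corrcoef(g, y)[0, 1]
    return len(g) * r * r

def residual(v, pcs):
    A = np.column_stack([np.ones(len(v)), pcs])
    beta, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ beta

# Two strata with different allele frequencies and case rates (confounding).
n = 500
stratum = rng.integers(0, 2, n)                  # hidden population label
g = rng.binomial(2, np.where(stratum == 1, 0.7, 0.3)).astype(float)
y = rng.binomial(1, np.where(stratum == 1, 0.8, 0.2)).astype(float)

# PCs computed from unlinked SNPs that capture the stratification.
snps = rng.binomial(2, np.where(stratum[:, None] == 1, 0.8, 0.2),
                    size=(n, 50)).astype(float)
Xc = snps - snps.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt[:2].T

raw = trend_stat(g, y)
adj = trend_stat(residual(g, pcs), residual(y, pcs))
print(f"unadjusted stat {raw:.1f} (p={chi2_p(raw):.2g}); "
      f"adjusted stat {adj:.1f} (p={chi2_p(adj):.2g})")
```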

figure 15

PCA of 20 Puerto Ricans and 300 random Europeans from the 1000 Genomes. The results are shown for various PCs.

We next assessed whether the distance between individuals and populations is a meaningful biological or demographic quantity by studying the relationships between Chinese and Japanese, a question of major interest in the literature 58 , 59 . We already applied PCA to Chinese and Japanese, using Europeans as an outgroup (Supplementary Fig. S2.4 ). The only element that varied in the following analyses was the number of Mexicans serving as the second outgroup (5, 25, and 50). We found that the proportion of homogeneous Japanese and Chinese clusters dropped from 100% (Fig.  16 A) to 93.33% (Fig.  16 B) and 40% (Fig.  16 C), demonstrating that the genetic distances between Chinese and Japanese depend entirely on the number of Mexicans in the cohort rather than on the actual genetic relationships between these populations, as one may expect.
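The outgroup-size effect can be sketched with synthetic stand-ins for the four populations (all centers, noise levels, and sample sizes are assumptions): the PC1-PC2 gap between the two close populations collapses once the outgroup is enriched, because their separating axis is pushed out of the top two PCs.

```python
# Sketch of the outgroup-size effect: the PC1-2 distance between two close
# populations ("CHB"/"JPT" stand-ins) collapses once enough samples of an
# unrelated outgroup ("MXL" stand-in) enter the cohort. All data synthetic.
import numpy as np

rng = np.random.default_rng(4)

def pc_centroid_gap(n_outgroup):
    blocks = {"EU":  ((10, 0, 0), 100), "CHB": ((0, 0, 0.7), 100),
              "JPT": ((0, 0, -0.7), 100), "MXL": ((0, 3, 0), n_outgroup)}
    X, labels = [], []
    for name, (c, n) in blocks.items():
        X.append(np.array(c) + 0.1 * rng.normal(size=(n, 3)))
        labels += [name] * n
    X, labels = np.vstack(X), np.array(labels)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    s = Xc @ Vt[:2].T
    return np.linalg.norm(s[labels == "CHB"].mean(axis=0)
                          - s[labels == "JPT"].mean(axis=0))

gap_few, gap_many = pc_centroid_gap(5), pc_centroid_gap(200)
print(f"CHB-JPT gap with 5 MXL: {gap_few:.2f}; with 200 MXL: {gap_many:.2f}")
```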

figure 16

The effect of varying the number of Mexican–Americans on the inference of genetic distances between Chinese and Japanese using various PCs. We analyzed a fixed number of 135 Han Chinese (CHB), 133 Japanese (JPT), and 115 Italians (TSI) and a variable number of Mexicans (MXL), including 5 (left column), 25 (middle column), and 50 (right column) individuals, over the top four PCs. We found that the overlap between Chinese and Japanese in PC scatterplots, typically used to infer genomic distances, was unexpectedly conditional on the number of Mexicans in the cohort. We noted the meaning of the axes of variation whenever apparent (red). The right column had the same axes of variation as the middle one.

Some authors consider higher PCs informative and advise considering these PCs alongside the first two. In our case, however, these PCs were not only susceptible to bias due to the addition of Mexicans but also exhibited the exact opposite pattern to that observed in the primary PCs (e.g., Fig.  16 G–I). It has also been suggested that in datasets with ancestry differences between samples, axes of variation often have a geographic interpretation 10 . Accordingly, the addition of Mexicans altered the order of the axes of variation between the cases, making the analysis of additional PCs valuable. We demonstrate that this is not always the case. Excepting PC1, over 60% of the axes had no geographical interpretation or an incorrect one. A priori knowledge of the current distribution of the populations was essential to differentiate these cases. The addition of the first 20 Mexicans replaced the second axis of variation (initially undefined) with a third axis (Eurasia-America) in the middle and right columns and resulted in a minor decline of ~ 5% of the homogeneous clusters. Adding 25 Mexicans to the second cohort did not affect the axes, but the proportion of homogeneous clusters declined by 66%. The axes changes were unexpected and altered the interpretation of PCA results. Such changes were not detectable without a priori knowledge.

These results demonstrate that (1) the observable distances (and thereby clusters) between populations inferred from PCA plots (Figs.  14 , 15 , 16 ) are artifacts of the cohort and do not provide meaningful biological or historical information, (2) that distances between samples can be easily manipulated by the experimenter in a way that produces unpredictable results, (3) that considering higher PCs produces conflicting patterns, which are difficult to reconcile and interpret, and (4) that our extensive “exploration” of PCA solutions to Chinese and Japanese relationships using 18 scatterplots and four PCs produced no insight. It is easy to see that the multitude of conflicting results allows the experimenter to select the favorable solution that reflects their a priori beliefs.

The case of projections

Incorporating precalculated PCA is done by projecting the PCA results calculated for the first dataset onto the second one, e.g., Ref. 17 . Here, we tested the accuracy of this approach by projecting one or more color populations onto precalculated color populations that may or may not match the projected ones. The accuracy of the results was dependent on the identity of the populations of the two cohorts. When the same populations were analyzed, they overlapped (Fig.  17 A), but when unique populations were found in the two datasets, PCA created misleading matches (Fig.  17 B–D). In the latter case, and when the sample sizes were uneven (Fig.  17 C), the projected samples formed clusters with the wrong populations, and their positioning in the plot was incorrect. Overall, we found that PCA projections are unreliable and misleading, with correct outcomes indistinguishable from incorrect ones.
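The projection procedure itself can be sketched as follows: the mean and loadings are fixed by the base cohort and applied unchanged to new samples. "Teal" is a hypothetical population absent from the base, so it must land near some base population regardless of its true relationships.

```python
# Sketch of PCA projection: PCs (mean and loadings) are fitted on a base
# cohort and new samples are mapped through them. Populations are illustrative.
import numpy as np

rng = np.random.default_rng(5)

def fit(X):
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:2]

def project(X, mean, V2):
    return (X - mean) @ V2.T

def make(c, n):
    return np.array(c) + 0.05 * rng.normal(size=(n, 3))

base = {"Red": make([1, 0, 0], 100), "Green": make([0, 1, 0], 100),
        "Blue": make([0, 0, 1], 100)}
mean, V2 = fit(np.vstack(list(base.values())))
base_pc = {k: project(v, mean, V2).mean(axis=0) for k, v in base.items()}

def nearest(p):
    return min(base_pc, key=lambda k: np.linalg.norm(base_pc[k] - p))

# A population present in the base projects onto itself...
green2 = project(make([0, 1, 0], 50), mean, V2).mean(axis=0)
# ...but Teal, absent from the base, must land on some base population anyway.
teal = project(make([0, 1, 0.8], 50), mean, V2).mean(axis=0)
print("Green ->", nearest(green2), "| Teal ->", nearest(teal))
```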

figure 17

Examining the accuracy of PCA projections. The PCA results of one dataset (circles) were projected onto another (squares). In ( A ), testing the case of varying sample sizes between the first ( n Red  = 200, n Green  = 10, n Blue  = 200, n Purple  = 10) and second ( n Red  = 200, n Green  = 200, n Blue  = 10, n Purple  = 10) datasets, where in the second dataset, colors varied a little (e.g., [1,0,0] → [1,0.1,0.1]). In ( B – D ), the sample size varied (10 ≤  n  ≤ 300) for both datasets. Colors include Red [1,0,0], Green [0,1,0], light Green [1,0.2,1], Cyan [0,1,1], Blue [0,0,1], Purple [1,0,1], Yellow [1,1,0], Grey [0.5,0.5,0.5], White [1,1,1], and Black [0,0,0].

To evaluate the reliability of projections for human populations, we tested whether the projected populations cluster with their closest groups and to what extent these results can be manipulated. We found that populations can be shown to correctly align with continental populations when the base (or test) populations and the projected populations are very similar (Fig.  18 A), which gives us confidence in the accuracy of PCA projections. However, even in the simplest scenario of using three continental populations, it is unclear how to interpret the overlap between the base and projected populations, since the Spanish would not be considered genetically closer to Finns than to Italians, as suggested by PCA. In another simple scenario, where Europeans are projected onto other Europeans, distinct populations like AJs, Iberians, French, CEU, and British overlap entirely (Fig.  18 B), whereas Finns and Italians remain separate. Not only do the results share no apparent resemblance to the geographical distribution, but they also produce conflicting information as to the genetic distances between these populations—two properties that PCA enthusiasts claim it represents. Adding more populations, even if only to the projected populations, contributes to further distortions, with previously distinct populations (Fig.  18 B) now clustering (Fig.  18 C). In a different dataset, projecting Japanese onto a base dataset of Africans and Europeans places them as an admixed African-European population. The projected Finns cluster with other Europeans (Fig.  18 D), at odds with the previous results (Fig.  18 B) that singled them out.

figure 18

PCA projections of populations (italic, with a black star inside the shape) onto base populations with even-sized samples ( n  = 50, unless noted otherwise) (regular font). In ( A ) n projected  = 100, ( B ) n projected  = 50, ( C ) n projected  = 20, ( D ) n projected  = 100, ( E ) n projected  = 80 and n projected  = 100, and ( F ) 80 ≤  n projected  ≤ 100 and 12 ≤  n projected  ≤ 478.

To test the behavior of PCA when projecting populations different from the base populations, we projected Chinese, Finns, Indians, and AJs onto Levantine and two European populations (Fig.  18 E). The results imply that the Chinese and AJs are of Indian origin, itself derived from a European-Levantine mix. Replacing Levantines with Africans does not stabilize the projected results (Fig.  18 F). Now the projected Chinese and Japanese overlap, and AJs cluster with Iranians.

Overall, our results show that it is unfeasible to rely on PCA projections, particularly in studies involving different populations, as is commonly done. Even when the projected populations are identical to the base ones, the base and projected populations may or may not overlap.

The case of ancient DNA

PCA is the primary tool in paleogenomics, where ancient samples are initially identified based on their clustering with modern or other ancient samples. Here, a wide variety of strategies is employed. In some studies, ancient and modern samples are combined 60 . In other studies, PCA is performed separately for each ancient individual and “particular reference samples”, and the PC loadings are combined 61 . Some authors projected present-day human populations onto the top two principal components defined by ancient hominins (and non-humans) 62 . The most common strategy is to project ancient DNA onto the top two principal components defined by modern-day populations 14 . Here, we will investigate the accuracy of this strategy.

Since ancient populations show more genetic diversity than modern ones 14 , we defined “ancient colors” ( a ) as brighter colors whose allele frequency is 0.95 with an SD of 0.05 and “modern colors” ( m ) as darker colors whose allele frequency is 0.6 with an SD of 0.02. Two approaches were used in analyzing the two datasets: calculating PCA separately for the two datasets and presenting the results jointly (Fig.  19 A,B), and projecting the PCA results of the “ancient” populations onto the “modern” ones (Fig.  19 C,D). In both cases, meaningful results would show the ancient colors clustering close to their modern counterparts in distances corresponding to their true distances.
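Such color populations can be generated in a few lines: each sample is a noisy copy of its population's color channels, repeated over a block of markers and scaled by a brightness level (0.95 for "ancient," 0.6 for "modern"). A minimal sketch (the marker count and the Gaussian noise model are illustrative simplifications of our color model):

```python
import numpy as np

rng = np.random.default_rng(1)

def color_population(color, n, level, sd, block=10):
    """n samples; each of the 3 color channels is repeated over `block`
    markers, scaled by `level` (brightness), with Gaussian noise `sd`."""
    mean = np.repeat(color, block) * level  # 3*block markers per sample
    return np.clip(mean + rng.normal(0, sd, (n, 3 * block)), 0, 1)

primaries = ([1, 0, 0], [0, 1, 0], [0, 0, 1])
# "Ancient": brighter (0.95) and more diverse (SD 0.05), n = 25 each
ancient = np.vstack([color_population(c, 25, 0.95, 0.05) for c in primaries])
# "Modern": darker (0.6) and tighter (SD 0.02), n = 75 each
modern = np.vstack([color_population(c, 75, 0.60, 0.02) for c in primaries])

# Joint PCA of the combined dataset
X = np.vstack([ancient, modern])
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
```

With the palette balanced in this way, the first two PCs recover the color simplex and most of the variance; the failures arise once the palette or the sample sizes become imbalanced.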

figure 19

Merging PCA of “ancient” (circles) and “modern” (squares) color populations using two approaches. First, PCA is calculated separately on the two datasets, and the results are plotted together ( A , B ). Second, PCA results of “ancient” populations are projected onto the PCs of the “modern” ones ( C , D ). In ( A ), even-sized samples from “ancient” ( n  = 25) and “modern” ( n  = 75) color populations are used. In ( B ), different-sized samples from “ancient” (10 ≤  n  ≤ 25) and “modern” (10 ≤  n  ≤ 75) populations are used. In ( C ) and ( D ), different-sized samples from “ancient” (10 ≤  n  ≤ 75) are used alongside even-sized samples from “modern” populations: ( C ) ( n  = 15) and ( D ) n  = 25. Colors include Red [1,0,0], dark Red [0.6,0,0], Green [0,1,0], dark Green [0,0.6,0], Blue [0,0,1], dark Blue [0,0,0.6], light Cyan [0,0.6,0.6], light Yellow [0.6,0.6,0], light Purple [0.6,0,0.6], and Black [0,0,0].

These are indeed the results of PCA when even-sized “modern” and “ancient” samples from color populations are analyzed and the color palette is balanced (Fig. 19A). In the more realistic scenario, where the color palette is imbalanced and sample sizes differ, PCA produced incorrect results: ancient Green (aGreen) clustered with modern Yellow (mYellow), away from its closest mGreen, which clustered close to aRed. mPurple appeared as a four-way mix of aRed, aBlue, mCyan, and mDark Blue. Instead of being at the center (Fig. 19A), Black became an outgroup, and its distances to the other colors were distorted (Fig. 19B). Projecting “ancient” colors onto “modern” ones also highly misrepresented the relationships among the ancient samples, as aRed overlapped with aBlue or aGreen, mYellow appeared closer to mCyan or aRed, and the outgroups continuously changed (Fig. 19C,D). Note that of all the analyses, the first two PCs of the last results explained the most variance (89%).

Recently, Lazaridis et al. 14 projected ancient Eurasians onto modern-day Eurasians and reported that ancient samples from Israel clustered at one end of the Near Eastern “cline” and ancient Iranians at the other, close to modern-day Jews. Insights from the positions of the ancient populations were then used in their admixture modeling, which supposedly confirmed the PCA results. To test whether the authors’ inferences were correct and to what extent those PCA results are unique, we used similar modern and ancient populations to replicate the results of Lazaridis et al. 14 (Fig. 20A). By adding the modern-day populations that Lazaridis et al. 14 omitted, we found that the ancient Levantines cluster with Turkish (Fig. 20B), Caucasian (Fig. 20C), Iranian (Fig. 20D), Russian (Fig. 20E), and Pakistani (Fig. 20F) populations. The overlap between the ancient Levantines and other populations also varied widely, as they clustered with ancient Iranians and Anatolians, with Caucasians, or alone, as a “population isolate.” Moreover, the remaining ancient populations exhibited conflicting results inconsistent with our understanding of their origins. Mesolithic and Neolithic Swedes, for instance, clustered with modern Eastern Europeans (Fig. 20A–C) or remotely from them (Fig. 20D–F). These examples show the wide variety of results and interpretations that can be generated with ancient populations projected onto modern ones. Lazaridis et al.’s 14 results are neither the only possible ones, nor do they explain the most variation. It is difficult to justify Lazaridis et al.’s 14 preference for the first outcome, where the first two components explained only 1.35% of the variation (in our replication analysis; Lazaridis et al. omitted the proportion of explained variation) (Fig. 20A), over all the alternative outcomes that explained a much larger portion of the variation (1.92–6.06%).

figure 20

PCA of 65 ancient Palaeolithic, Mesolithic, Chalcolithic, and Neolithic samples from Iran (12), Israel (16), the Caucasus (7), Romania (10), Scandinavia (15), and Central Europe (5) (colorful shapes) projected onto modern-day populations of various sample sizes (grey dots, black labels). The full population labels are shown in Supplementary Fig. S8. In addition to the modern-day populations used in ( A ), the following subfigures also include ( B ) Han Chinese, ( C ) Pakistani (Punjabi), ( D ) additional Russians, ( E ) Pakistani (Punjabi) and additional Russians, and ( F ) Pakistani (Punjabi), additional Russians, Han Chinese, and Mexicans. The ancient samples remained the same in all the analyses. In each plot ( A – F ), the ancient Levantines cluster with different modern-day populations.

We note that for high-dimensionality data where markers are in high LD, projected samples tend to “shrink,” i.e., move towards the center of the plot. Corrections to this phenomenon have been proposed in the literature, e.g., Ref. 63. This phenomenon does not affect our datasets, which are either very small (Fig. 19) or LD-pruned (Fig. 20).

The case of marker choice

The effect of marker choice on PCA results has received little attention in the literature. Although PCA is routinely applied to different SNP sets, the PCs are typically deemed comparable. In forensic applications, which typically employ 100–300 markers, this is a major problem. Our color model is unsuitable for evaluating the effect of different marker types on PCA outcomes, but it can be used to study the effects of missing data and noise, which are common in genomic datasets and reflect the biological properties of different marker types in capturing the population structure. Remarkably, introducing 50% (Fig. 21A) and even 90% missingness (Fig. 21B) still allowed recovery of the original population structure. The structure decayed when random noise was added to the latter dataset (Fig. 21C). To further explore the effect of noise, we added random markers to the dataset. An addition of 10% noisy markers increased the dataset's disparity, but the original structure was retained (Fig. 21D). Interestingly, even adding 100% noisy markers allowed identifying the key features of the original structure (Fig. 21E). Only when adding 1000% noisy markers did the original structure disappear (Fig. 21F). Note that the introduction of noise also sharply reduced the percent of variation explained by the PCs. These results highlight the importance of using ancestry-informative markers (AIMs) to uncover the true structure of the dataset and of accounting for disruptive markers.
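The missingness and noise manipulations above are straightforward to reproduce in principle: missing genotypes are mean-imputed per marker, and "noise" markers are random genotypes appended to the matrix. A minimal sketch with three toy populations (the allele frequencies, sample sizes, and marker counts are illustrative, not those of our color datasets):

```python
import numpy as np

rng = np.random.default_rng(2)

def top2_explained(X):
    """Fraction of total variance captured by the first two PCs."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    return (s[:2] ** 2).sum() / (s ** 2).sum()

# Three populations (50 samples each) at allele frequencies 0.1, 0.5, 0.9
X = np.vstack([rng.binomial(2, f, (50, 300))
               for f in (0.1, 0.5, 0.9)]).astype(float)

# 90% missingness with per-marker mean imputation
missing = rng.random(X.shape) < 0.9
X_miss = np.where(missing, np.nan, X)
X_imp = np.where(missing, np.nanmean(X_miss, axis=0), X)

# Appending 10x as many random markers (frequency 0.5 everywhere)
noise = rng.binomial(2, 0.5, (150, 3000)).astype(float)
X_noisy = np.hstack([X, noise])
```

Mean imputation at 90% missingness still leaves the extreme populations on opposite sides of PC1, whereas a ten-fold excess of random markers slashes the variance explained by the top two PCs.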

figure 21

Testing the effects of missingness and noise in a PCA of six fixed-size ( n  = 50) samples from color populations. The top plots show the effect of missingness alone or combined with noise: ( A ) 50% missingness, ( B ) 90% missingness, and ( C ) 90% missingness and low-level random noise in all the markers. The bottom plots test the effect of noise when added to the original markers in the above plots using: ( D ) 30 random markers, ( E ) 300 random markers, and ( F ) 3000 random markers. Colors include Red [1,0,0], Green [0,1,0], Blue [0,0,1], Cyan [0,1,1], Yellow [1,1,0], and Black [0,0,0].

To evaluate the extent to which different marker types represent the population structure, we studied the relationships between the UK British and other Europeans (Italians and Iberians) using different types of 30,000 SNPs, a number of similar magnitude to the number of SNPs analyzed by some groups 64 , 65 . According to the full SNP set, the British do not overlap with the Europeans (Fig. 22A). However, coding SNPs show considerable overlap (Fig. 22B) compared with intronic SNPs (Fig. 22C). Protein-coding SNPs, RNA molecules, and upstream or downstream SNPs (Fig. 22D–F, respectively) also show little overlap. The identification of “outliers,” already a subjective measure, may also differ based on the proportions of each marker type. These results illustrate not only how the choice of markers and populations profoundly affects PCA results but also the difficulty of recovering the population structure in exome datasets. Overall, different marker types represent the population structure differently.

figure 22

PCA of Tuscany Italians ( n  = 115), British ( n  = 105), and Iberians ( n  = 150) across all markers ( p  ~ 129,000) ( A ) and different marker types ( p  ~ 30,000): ( B ) coding SNPs, ( C ) intronic SNPs, ( D ) protein-coding SNPs, ( E ) RNA molecules, and ( F ) upstream and downstream SNPs. A convex hull was used to generate the European cluster.

The case of inferring a personal ancestry

PCA is used to infer the ancestry of individuals for various purposes; however, with a minimal sample size of one, such inferences may be even more subject to biases than population studies. We found that such biases can occur when individuals with Green (Fig. 23A) and Yellow (Fig. 23B) ancestries clustered near admixed Cyan and Orange individuals, rather than with Greens or by themselves, respectively. One Grey individual clustered with Cyan (Fig. 23C) when Cyan was the only available reference population, much like a Blue sample clustered with Green samples (Fig. 23D).

figure 23

Inferring single-individual ancestries using reference individuals. In ( A ), even-sized samples from reference populations ( n  = 37): Red [1,0,0], Green [0,1,0], bright Cyan [0, 0.9, 0.8], dark Cyan [0, 0.9, 0.6], and heterogeneous darker Cyan [0, 0.9, 0.4] with a high standard deviation (0.25), with a light Green test individual [0, 0.5, 0]. In ( B ), the same reference populations as in ( A ) with uneven sizes: Red ( n  = 15), Green ( n  = 15), bright Cyan ( n  = 100), dark Cyan ( n  = 15), and heterogeneous darker Cyan ( n  = 100), with a Yellow test individual [1,1,0]. In ( C ), a heterogeneous Cyan population [0, 1, 1] ( n  = 300) with a high standard deviation (0.25) and a Grey test individual [0.5, 0.5, 0.5]. In ( D ), Red [1,0,0] ( n  = 10), Green [0,1,0] ( n  = 10), a heterogeneous population [1, 1, 0.5] ( n  = 200), and a Blue test individual [0,0,1].

Arguably, one of the most famous cases of personal ancestral inference occurred during the 2020 US presidential primaries when a candidate published the outcome of their genetic test undertaken by Carlos Bustamante that tested their Native American ancestry ( https://elizabethwarren.com/wp-content/uploads/2018/10/Bustamante_Report_2018.pdf ). Analyzing 764,958 SNPs, Bustamante sought to test the existence of Native American ancestry using populations from the 1000 Genomes Project and Amerindians. RFMix 66 was used to identify Native American ancestry segments and PCA, elevated to be a “machine learning technique,” to verify that ancestry independently of RFMix. The longest of five genetic segments, judged to be of Native American origin, was analyzed using PCA and reported to be “clearly distinct from segments of European ancestry” and “strongly associated with Native American ancestry” as it clustered with Native Americans distinctly from Europeans and Africans (Fig.  1 in their report) and between Native American samples (Fig.  2 in their report). Bustamante concluded that “While the vast majority of the individual’s ancestry is European, the results strongly support the existence of an unadmixed Native American ancestor in the individual’s pedigree, likely in the range of 6–10 generations ago”.

We have already shown that AJs (Fig.  9 C) and Pakistanis (Fig.  14 D) can cluster with Native Americans. With the candidate’s DNA unavailable (and their specific European ancestry undisclosed), we tested whether the two PCA patterns observed by Bustamante can be reproduced for modern-day Eurasians without any reported Native American ancestry (Pakistani, Iranian, Even Russian, and Moscow Russian) (Figs.  24 A–D, respectively).

figure 24

Evaluation of Native American ancestry for four Eurasians. ( A ) Using an even sample size ( n  = 37) for Africans, Mexican-Americans, British, Puerto Ricans, Colombians, and a Pakistani. ( B ) Using uneven sample sizes for Africans ( n  = 100), Mexican-Americans ( n  = 20), British ( n  = 50), Puerto Ricans ( n  = 89), Colombians ( n  = 89), and an Iranian. ( C ) Analyzing a whole-Amerindian cohort of Colombians ( n  = 93), Mexican-Americans ( n  = 117), Peruvians ( n  = 75), Puerto Ricans ( n  = 102), and an Even Russian. ( D ) Using uneven sample sizes for Africans ( n  = 100), Mexican-Americans ( n  = 53), British ( n  = 20), Puerto Ricans ( n  = 30), Colombians ( n  = 89), and a Moscow Russian. All the samples were randomly selected.

These analyses show that the experimenter can easily generate desired patterns to support personal ancestral claims, making PCA an unreliable and misleading tool for inferring personal ancestry. We further question the accuracy of Bustamante’s report, given the biased reference population panel used by RFMix to infer the DNA segments with the alleged Amerindian origin, which excluded East European and North Eurasian populations. We draw no conclusions about the candidate’s ancestry.

The reproducibility crisis in science has called for a rigorous evaluation of scientific tools and methods. Due to PCA’s centrality in population genetics, and since it was never proven to yield correct results, we sought to assess its reliability, robustness, and reproducibility in twelve test cases, using a simple color-based model, where the true population structure was known, alongside real human populations. PCA failed in all three measures.

PCA did not produce correct and/or consistent results across the design schemes, whether even sampling was used or not, and whether for unmixed or admixed populations. We have shown that the distances between the samples are biased and can be easily manipulated to create the illusion of closely or distantly related populations. Whereas the clustering of populations between other populations in the scatter plot has been regarded as “decisive proof” or “very strong evidence” of their admixture 18 , we demonstrated that such patterns are artifacts of the sampling scheme and meaningless for any biohistorical purposes. Sample clustering, a subject that received much attention in the literature, e.g., Ref. 9 , is another artifact of the sampling scheme and likewise biologically meaningless (e.g., Figs. 12, 13, 14, 15), which is unsurprising if the distances are distorted. PCA’s violations of the true distances and clusters between samples limit its usability as a dimensionality reduction tool for genetic analyses. Excepting PC1, where the distribution patterns may (e.g., Fig. 5a) or may not (e.g., Fig. 9) bear some geographical resemblance, most of the other PCs are mirages (e.g., Fig. 16). The axes of variation may also change unexpectedly when a few samples are added, altering the interpretation.

Specifically, in analyzing real populations, we showed that PCA can be used to generate contradictory results and lead to absurd conclusions ( reductio ad absurdum ), that “correct” conclusions cannot be derived without a priori knowledge, and that cherry-picking or circular reasoning is always needed to interpret PCA results. This means that the difference between the a posteriori knowledge obtained from PCA and a priori knowledge rests solely on belief. The conflicting PCA outcomes shown here via over 200 figures demonstrate the experimenter’s high degree of control over PCA’s outcome. By manipulating the choice of populations, sample sizes, and markers, experimenters can create multiple conflicting scenarios with real or imaginary historical interpretations, cherry-pick the one they like, and adopt circular reasoning to argue that the PCA results support their explanation.

Overall, the notion that PCA can yield biologically or historically meaningful results is a misconception supported by a priori knowledge and post hoc reasoning. PCA’s “correct” results under some study designs are utterly indistinguishable from incorrect results constructed using other study designs, and neither design can be justified a priori as the correct one. Likewise, PCA correctly represented the genetic distances and clusters for a minuscule fraction of the samples (e.g., Fig. 6), which were otherwise indistinguishable from the remaining samples, whose genetic distances were distorted. Therefore, like a broken clock, PCA can be tuned by the experimenter (e.g., Fig. 20) to yield presumed “correct” results, and “correct” results can be cherry-picked if known a priori , but neither is evidence of the accuracy of PCA. Just like a broken clock, working clocks (i.e., other tools) are essential to decide which PCA results are “correct.” This begs the question of why PCA should be used at all, particularly as a first hypothesis generator.

Some authors 67 revealed the cards by proposing to use PCA for “exploration” purposes; however, the “exploration” protocol was never scripted, and neither was the method by which a posteriori knowledge can be garnered from this journey into the unknown. “Exploration” is thereby synonymous with cherry-picking specific PCA results deemed similar to those generated by other tools. If this were a realistic approach, the practice of PCA could simply be dismissed as cumbersome and unnecessary. However, in the literature, the reverse procedure is dominant, i.e., the broken clock is used to call the hours for the other clocks. We believe that such a design is popular because the downstream analyses are equally maneuverable or designed to address specific questions, allowing the experimenter control over the general narrative.

Indeed, after “exploring” the 200 figures generated in this study, we obtained no a posteriori wisdom about the population structure of colors or human populations. We showed that the inferences that followed the standard interpretation in the literature were wrong. PCA is highly sensitive to minor alterations in the allele frequencies (Fig. 12), study design (e.g., Fig. 9), or choice of markers (Fig. 22) (see also Refs. 57 , 68 ). PCA results also cannot be reproduced (e.g., Fig. 13) unless an identical dataset is used, which defeats the usefulness of this tool. In that, our findings join similar reports on PCA’s unexpected and arbitrary behavior 69 , 70 . Note that variations in the implementations of PCA (e.g., PCA, singular value decomposition [SVD], and recursive PCA), as well as various flags, as implemented in EIGENSOFT, yield major differences in the results, none more biologically correct than the others. That the same mathematical procedure produces biologically conflicting and false results proves that biohistorical inferences drawn only from PCA are fictitious.

Several aspects of this study are important to emphasize. First, this study does not ask whether the PC transformation is correct. If properly implemented, the computational procedure that computes the principal components and uses them to change the basis of the data is considered correct. This study asks whether the PC transformation produces correct or wrong outcomes for the original datasets, consisting of color populations, where the truth is known. For real populations, we avoided judging results as correct or not, since many of those questions are subjects of ongoing debates. Instead, we asked whether PCA results are consistent with one another, align with their interpretation in the literature, and can lead to absurd conclusions. Second, this study focuses on genetic variation data, particularly human data, which have particular characteristics. For other data types or datasets not tested here, PC analyses may be more successful, e.g., Ref. 71 , if they survive the test criteria presented here. We note, however, that PCA produced incorrect results in our simple model (e.g., Fig. 3) and that criticism is neither rare nor unique to genetics (see criticism of PCA in geology 72 and physical anthropology 73 ). To better understand how PCA reached prominence, we shall review the historical debate on whether the PCA transformation represents genetic data correctly.

A brief history of PCA and its application to population genetics

It is well recognized that Pearson 74 introduced PCA and Hotelling 75 the terminology. Hotelling’s motivation was to address the problem of evaluating independent mental traits in psychology. Thurstone presented another principal-axes solution to the problem of factor analysis 75 . However, he later recanted, as he could not see how principal axes describe a meaningful psychological model 76 . The argument about the truthfulness and reliability of PCs continues to this day 77 .

In population genetics, PCA is primarily used to reduce the dimensionality of multivariate datasets by linearly transforming the genotypes into a set of mutually uncorrelated principal components (PCs) ranked according to their variances. Under the assumption that most of the original variability is contained in the primary two PCs, they are typically visualized on a colorful scatter plot. The early work of Cavalli-Sforza suggested that PCA can detect ancient migrations and population spreads in genomic data 78 , 79 . The authors proposed that PCA would “give us new insight into the evolutionary history of the populations represented in the map” 78 , although they later explained their inability to interpret the PCA results for Africans because “the genetic and archeological knowledge in these regions is not as detailed as in Europe” 79 , i.e., in the absence of a priori knowledge. Cavalli-Sforza’s arguments were not very convincing.
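The transformation itself is simple to state: center each marker, and the singular value decomposition of the genotype matrix yields the principal axes and the mutually uncorrelated, variance-ranked scores. A minimal sketch (the genotype matrix is simulated; SmartPCA additionally scales each marker, roughly by an estimate of its binomial standard deviation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy genotype matrix: 100 samples x 500 markers, 0/1/2 allele counts,
# each marker drawn at its own (illustrative) allele frequency
freqs = rng.uniform(0.05, 0.95, 500)
G = rng.binomial(2, freqs, (100, 500)).astype(float)

# Center each marker; the SVD's right singular vectors are the principal
# axes, and U*s are the principal-component scores, ranked by variance
Gc = G - G.mean(axis=0)
U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
scores = U * s

variances = scores.var(axis=0)  # non-increasing across PCs
```

The scores are mutually uncorrelated by construction, which is a statement about the linear algebra, not about the biological meaning of any particular PC.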

During the twentieth century, PCA was sparsely employed in genomic analyses alongside other multidimensional scaling tools. The next-generation sequencing revolution in the early twenty-first century produced large genomic datasets that required new and powerful computational tools with appealing graphical interfaces, like STRUCTURE 80 . PCA was not used in the publications of the first two HapMaps or of the HGDP dataset 81 , 82 , 83 .

In 2006, Price et al. 10 introduced the SmartPCA tool (EIGENSOFT package) and claimed that PCA has “a solid statistical footing” that can “discover structure in genetic data,” even in admixed populations. These claims were based on a simulated dataset and an application of PCA to a dataset of European Americans, which revealed an incoherent pattern claimed to reflect genetic variation between northwest and southeast Europe. Simultaneously, Patterson et al. 9 applied PCA to three African and three Asian populations, claiming that the dispersion patterns of the primary two PCs reflect the true population structure. SmartPCA offered no remedy to the known problems with PCA, only new promises.

The next milestone in the rise of PCA to prominence was the work of Novembre and colleagues 32 , who showed a correlation between PCA and geography among Europeans. The authors applied PCA to a dataset of European genotypes, positioned the PCs on Europe’s map, and rotated the axes to increase the correlation with the map. After fitting a model of longitude and latitude that included PC1, PC2, and their interactions, samples were positioned on Europe’s map. The authors claimed that “the resulting figure bears a notable resemblance to a geographic map of Europe” and reported that, on average, 50% of the samples from populations with more than six samples were predicted within 400 km of their country. Most of those populations, however, were from the extreme ends of the map (Italy, the UK, and Spain) and were predicted most accurately because PCA maximizes the variance along the two axes. By contrast, samples from mid- and north-Europe were predicted most poorly. Only 24% of the samples from all European countries (Table 3 in Ref. 32 ) were predicted to their correct country, 50% of the populations were predicted within 574 km (about the distance from Berlin to Warsaw), and 90% of the populations were predicted within 809 km (about the distance from Berlin to Zurich). Overall, it is fair to say that, in practice, this method does not perform as implied because it strongly depends on the specific cohort; therefore, it has no practical applications. A more proper title for the paper would have been “populations can be selected to mirror geography in a quarter of Europe.” Novembre et al.'s study became iconic, which in retrospect may be unwarranted, since authors claimed to see geographical patterns in PCA results irrespective of Novembre et al.'s transformation. Later, Yang et al. 84 claimed to have expanded the method to global samples. Elhaik et al. 85 showed that the new method has less than 2% accuracy, with some samples being predicted outside our planet. Thus far, no PCA or PCA-like application has ever reached an accuracy higher than 2% worldwide 86 . By contrast, an admixture-based approach achieved 83% accuracy in classifying individuals to countries and even to islands and villages 85 .
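The fitting step of Novembre et al. amounts to a least-squares regression of latitude and longitude on PC1, PC2, and their interaction. A toy sketch of such a fit (the coefficients, coordinates, and noise level below are invented for illustration and bear no relation to their data):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical training data: PC scores and known coordinates
n = 300
pcs = rng.normal(size=(n, 2))
feats = np.column_stack([pcs, pcs[:, 0] * pcs[:, 1]])  # PC1, PC2, PC1*PC2
true_map = np.array([[5.0, 1.0], [0.5, 8.0], [0.2, -0.3]])  # invented
latlong = feats @ true_map + [50.0, 10.0] + rng.normal(0, 1.0, (n, 2))

# Least-squares fit of latitude/longitude on PC1, PC2, and their interaction
A = np.column_stack([np.ones(n), feats])
coef, *_ = np.linalg.lstsq(A, latlong, rcond=None)
pred = A @ coef
```

On synthetic data generated by the model itself the fit is, of course, excellent; the criticism above concerns how such in-sample fits were read as evidence of out-of-sample predictive power across cohorts.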

Ignoring these methodological problems and further promoting their PCA tool, Reich et al. 44 wrote in an editorial for Novembre et al.'s study that “PCA has a population genetics interpretation and can be used to identify differences in ancestry among populations and samples, regardless of the historical patterns underlying the structure,” that “PCA is also useful as a method to address the problem of population stratification—allele frequency differences between cases and controls due to ancestry differences—that can cause spurious associations in disease association studies,” and finally that “PCA methods can provide evidence of important migration events”—none of which were supported by the work of Novembre et al.

After its applications to the HGDP 87 and HapMap 3 88 datasets, PCA became the foremost utility in population genetic investigations, reaching “fixation” by 2013, the point at which it was used in almost every paper in the field (Fig. 25).

figure 25

Evaluating the usage of PCA in population genetic publications by sampling four random population genetic papers per year from Nature and PNAS. The percentage of publications that used at least one PCA is shown.

Evaluating the core properties of PCA

Table 1 summarizes the main findings of the twelve test cases analyzed here. Several additional limitations of PCA are worth highlighting, since they may not have been evident in the test cases. First, PCA typically explains a tiny part of the variation (Supplementary Fig. S9) that may have a genealogical interpretation 69 ; not only does this part grow smaller as more samples are added (Supplementary Fig. S9), but it also grows in inaccuracy (Fig. 9). This leads to a paradox, whereby increasing the sample size, which intuitively should increase the accuracy of analyses, decreases the proportion of explained variance and the accuracy. Second, analyzing only the top two PCs does not resolve the rapid decline in the proportion of explained variation (Supplementary Fig. S10). Interestingly, the average variance explained by the two primary PCs over hundreds and thousands of individuals from different populations is very small (Supplementary Fig. S10, inset). Third, PCs higher than three not only explain a minuscule amount of variation, but they also cannot differentiate the true data structure from noise (Supplementary Fig. S11). In other words, PC plots where the first two PCs explain ~ 1% of the variance, as we calculated for Lazaridis et al. 14 , capture as much of the population structure as they would from a randomized dataset. Recall that all the datasets analyzed here include AIMs, which improve the discovery of population structure. The fourth limitation concerns PCA’s “big-p, little-n” characteristic, where p stands for dimensions and n for samples, otherwise known as the p  >  >  n problem or the curse of dimensionality 89 . Briefly, it refers to phenomena that arise when analyzing data in high-dimensional spaces and that are unobserved in lower-dimensional spaces. As a dimensionality reduction technique, PCA aims to address this problem, but it introduces biases of its own by misrepresenting the distances and clusters. In high-dimensional space, the distances between the data points increase compared to low-dimensional space (Supplementary Fig. S12). As such, formerly close population samples appear more distant and no longer cluster. In other words, cases and controls cannot be reliably identified in high-dimensional data, as is commonly done. Finally, PCA adjustments may be disadvantageous. We show that applying PCA adjustment to case–control data yielded a higher proportion of false positives, a smaller proportion of true positives, and weaker p -values (Supplementary Text 3).
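The distance inflation described above is easy to illustrate: for random points, pairwise Euclidean distances both grow and concentrate as dimensionality increases, so "near" and "far" lose contrast. A minimal sketch with uniform random data (the dimensions and sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

def distance_stats(p, n=200):
    """Mean and relative spread (coefficient of variation) of pairwise
    Euclidean distances among n uniform random points in p dimensions."""
    X = rng.random((n, p))
    # Pairwise squared distances via the Gram matrix, clipped at zero
    sq = (X ** 2).sum(axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0, None)
    d = np.sqrt(d2[np.triu_indices(n, k=1)])
    return d.mean(), d.std() / d.mean()

low_mean, low_cv = distance_stats(2)        # 2 dimensions
high_mean, high_cv = distance_stats(2000)   # 2000 dimensions
```

As the dimension grows, the mean distance increases while the relative spread collapses, which is the concentration effect that makes case-control separation in high dimensions unreliable.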

Misuses of PCA in the literature

To understand how and why a tool with so many limitations became the foremost tool in population genetics, we will briefly review how authors handled those limitations.

We have already demonstrated that authors misinterpret PCA findings and do not disclose the amount of variation explained by PCA. Fascinatingly, in 2008, Reich and colleagues found it necessary to assess “whether the proportion of the variance explained by the first PC is sufficiently large,” most likely before they realized just how small this variation really is. To the best of our knowledge, they omitted this information in their numerous publications that employed PCA, e.g., Refs. 14 , 45 , 62 , 90 , 91 , 92 , 93 .

Remarkably, Novembre and Stephens 94 warned that “PCA results depend on the details of a particular dataset, they are affected by factors in addition to population structure, including distribution of sampling locations and amounts of data. Both these features limit the utility of PCA for drawing inferences about underlying processes” but nonetheless found PCA to be “undoubtedly an extremely useful tool for investigating and summarizing population structure,” and correctly anticipated that it will play “a prominent role in analyses of ongoing studies of population genetic variation”.

Although authors were aware that PCA results depend on the sample cohort, they continued using it, presenting only the results that fit their a priori hypotheses. For example, Tian et al. 49 recognized that PCA “is sensitive to differences in the inclusion or exclusion of specific population groups” and that it “can be dramatically affected by differences in relatively small genomic regions that may not reflect true population substructure.” Likewise, Tian et al. 50 noted that Ashkenazic Jews (AJs) “have a unique genotypic pattern that may not reflect geographic origins” and that “the inclusion or exclusion of particular ethnic groups… shifted the relationships in PCA.” They acknowledged that their findings “show that PCA results are highly dependent on which population groups are included in the analysis.” Still, both groups drew conclusions based on PCA and their a priori perceptions. Price et al. 95 needed no Levantine populations to conclude from a PCA plot of Ashkenazic Jews and Europeans that “both Ashkenazi Jewish and southeast European ancestries are derived from migrations/expansions from the Middle East and subsequent admixture with existing European populations.” Given its flexibility, it should come as no surprise that PCA and, in one case, multidimensional scaling (MDS) 96 spearheaded claims of a Levantine origin for AJs 97 . We showed that PCA can be easily engineered to foster American, Iberian, West, Central, and South European, British, Scandinavian, South Central Asian, Central Asian, Middle Eastern, Caucasian, and even Levantine origins for AJs.

PCA applications in biology have been criticized by several groups. McVean 69 cautioned that “Sub-sampling from populations to achieve equal representation, as in Novembre et al. 32 , is the only way to avoid this problem [= the distortion of the projection space]” and that “the influence of uneven sample size can be to bias the projection of samples on the first few PCs in unexpected ways”. However, these statements are incorrect. First, Novembre et al.’s sample sizes ranged from 1 to 219. Second, McVean’s simulation was limited to the case of symmetric populations arranged in a lattice formation, as in Figs.  1 C or 19 A. This led McVean to believe that accuracy can be achieved when sample sizes are even and that the method thereby has some merit (“The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture”). Had McVean explored the slightly more realistic case of populations sampled evenly but contributing unevenly to the covariance matrix (e.g., Figs.  4 A, 9 A), he would have realized that PCA’s accuracy is limited to well-controlled simulations of even-sized samples from isotropic populations (populations distributed symmetrically across all dimensions). In reality, “populations” are unknown, of uneven sizes, anisotropic, and sampled unevenly using different markers. These limitations invalidate PCA as a useful tool for population genomic studies. Elhaik and Ryan 57 showed that PCA could not model admixed samples, echoing our findings from forward simulation (Supplementary Text 1 ). Elhaik et al. 85 showed that PCA-like tools could not be used for biogeography, which is not surprising if PC distances are meaningless. François et al. 67 noticed that the gradients observed in the first PC often contradict formulated expectations and offered a biological explanation for the phenomenon.
They concluded that PCA should be considered a data exploration tool (i.e., cherry-picking) and that interpreting the results in terms of past routes of migration “remains a complicated exercise”. Björklund 98 raised concerns about sampling problems that render PCA biologically meaningless and provided several recommendations, like evaluating the distinctness of the PCs and presenting the percentage of explained variance. The practice of ignoring sample dates in paleogenomic analyses that incorporate ancient and modern samples has also been criticized 99 . Recently, Chari et al. 70 showed that in single-cell gene expression analyses, where PCA-preconditioned t-SNE and UMAP visuals are often used to infer or confirm relationships between cells in qualitative and quantitative manners for many purposes, including to “validate” clustering, PCA caused major distortion of the data and, when applied to equidistant points, was tantamount to applying a random projection. The authors developed an art model and showed that it produces metrics comparable to those produced by the PCA-refined dataset on which t-SNE and UMAP were applied. The authors reported that the “application of PCA to a set of equidistant points produces an arbitrary projection that will depend on software implementation details, including random number seeds and the numerical methods implemented for computing eigenvalues and eigenvectors”. Our findings, albeit in population genetics, demonstrate that, with the exceptions discussed above, all PCA results are wrong, independently of the level of “cautiousness” exhibited by the experimenter, even for “exploration” purposes.

PCA as a Dataism exercise in population genetics

Dataism describes an ideology formed by the emergence of Big Data, where measuring the data is the ultimate achievement 100 . Dataism proponents believe that with sufficient data and computing power, the world’s mysteries would reveal themselves. Dataism enthusiasts rarely ask themselves whether PCA results are correct but rather how to interpret the results correctly . As such, clustering is interpreted as identity due to common ancestry, and its absence as genetic drift . Populations nested between other populations are admixed or isolates , and those at the corners of the PC scatterplot are unmixed, pure , or races .

Although a newly coined term, the roots of the dataism philosophy are traceable to the Hotelling-Thurstone debate and specifically to the Cavalli-Sforza-Sokal conundrum. Cavalli-Sforza et al. 101 (p338) interpreted the first six components as reflecting ancient human cross-continental expansions, but they never explained to what extent those historical inferences were distinguishable from the null hypothesis, since they did not have one. Sokal and colleagues showed that the PCA maps are subject to substantial errors and that apparent geographic trends may be detected in spatially random data (the null). Sokal et al. did not express doubt in human history, only in its revelation through the PC maps, as do we. Cavalli-Sforza’s group responded that Sokal et al.’s sampling scheme was extremely irregular 102 and questioned Sokal et al.’s disbelief in a wrong method that yields a conclusion that they were willing to accept otherwise. Sokal et al. 103 were concerned with the lack of response to their original inquiries, the PC interpolation (to overcome gaps in the data) and smoothing technique that introduced more noise, the specific sampling scheme of Cavalli-Sforza and colleagues that appeared incidental rather than genuinely comprehensive, and the continued absence of a null model. In further criticism of Cavalli-Sforza et al. 101 , they claimed that whereas some of the results appear biologically sound, others do not, yet both are discussed equally seriously. Cavalli-Sforza 104 stood by PCA and the historical inferences (the Neolithic spread to Europe made “between 8000 and 5000 years ago”) that allegedly can be derived from it.
In other words, whereas Cavalli-Sforza and colleagues believed that once sufficient data are available, the value of PCA for bio-history would reveal itself, Sokal and colleagues questioned the robustness and reliability of the approach to generate valid historical and ethnobiological results and cautioned that data that “have been interpolated or smoothed, invite ethnohistorical interpretation by the unwary” 105 . The issues at the heart of the debate were not as much about biostatistics as about dataism.

At first, Sokal and colleagues had the upper hand in the debate. PCA was not used in the first Big Data analyses of 2003–2005 until resurrected by Price et al. 10 . Price et al. ignored Sokal’s reasoning: they neither produced a null model nor proved that the method yields biologically correct results. The appeal of their tool was mainly its applicability to the large genetic datasets that had begun emerging at that time and the visual appeal of the PC scatterplots that condensed these data. Interestingly, Novembre and Stephens 94 showed that the structured PCA patterns that Cavalli-Sforza and others have interpreted as migration events are no more than mathematical artifacts that arise when PCA is applied to standard spatial data in which the similarity between locations decays with geographic distance. Nonetheless, their warning was largely ignored, perhaps because the parallel study of Novembre et al. 32 left a stronger impact, and Cavalli-Sforza’s dataism was vindicated.

Evidently, PCA produces patterns that are no more historical than Alice in Wonderland and no more similar to geographical maps. Overall, positioning a method that lacks any measurable power, test of significance, or null model, all of which any diligent scientist should seek, at the forefront of population genetic analyses is problematic at the very least. It would not be an exaggeration to consider PCA the Rorschach of population genetics, a procedure almost entirely open to manipulation and consequent interpretation, where researchers may see “geographical maps” or “Neolithic clines” as they will. In PCA-driven science, almost all answers are equally acceptable, and the truth is in the eyes of the beholder.

Moving beyond PCA

As an alternative to PCA, we briefly note the advantages of a supervised machine-learning-like model implemented in tools like the Geographic Population Structure (GPS) 85 and Pair Matcher (PaM) 57 . In this model, gene pools are simulated from a collection of geographically localized populations, and the ancestry of the tested individuals is then estimated in relation to these gene pools. All individuals are thus represented as proportions of the gene pools, and their results do not change when samples are added or removed in the second part of the analysis. Population groups are bounded within the gene pools, and inclusion in these groups can be evaluated. This model was shown to be reliable, replicable, and accurate for many of the applications discussed here, including biogeography 85 , population structure modeling 106 , ancestry inference 107 , paleogenomic modeling 108 , forensics 86 , and cohort matching 57 . An evaluation of other tools that may be useful to infer population structure, and their limitations, can be found elsewhere 37 , 109 .

Conclusions

PCA is a mathematical transformation that reduces the dimensionality of the data to a smaller set of uncorrelated dimensions called principal components (PCs) and has numerous applications in science. In population genetics alone, PCA usage is ubiquitous, with dozens of standard applications. PCA is typically the first and primary analysis, and its outcomes determine the study design. That PCA is completely non-parametric is the source of its strength: any genotype dataset can be rapidly processed with no concerns about parameters or data validity. It is also a weakness, because the outcome is unique to, and dependent on, the particular dataset, which is where reliability, robustness, and reproducibility become a concern. The implicit expectation of PCA users is that the variance explained along the first two PCs provides a reasonable representation of the complete dataset. When this variance is minuscule (as is often the case with human populations), it represents the data poorly. Rather than considering alternative analyses, authors often choose not to report the variation explained by PCA; regardless, it is not a proxy for the reliability of the results.

Here, we carried out extensive analyses of twelve PCA applications, using model and real populations, to evaluate the reliability, robustness, and reproducibility of PCA. We found that PCA failed in all criteria and showed how easily it could generate erroneous, contradictory, and absurd results. This is not surprising because PCA is blind to the data and their meaning. The covariance matrix is calculated from the centered matrix, created simply by subtracting the column means from the original matrix A, disregarding the weights and geography. The remaining transformation consists of the dimensionality reduction, which is less problematic; however, that only the first two PCs, which capture most but still a very small part of the genetic variation, are typically analyzed creates further misinterpretations. Given the omnipresence of PCA in science, an intriguing question is whether multidisciplinary PCA results should be reevaluated. Based on our analyses and critical evaluations published elsewhere, we cannot dismiss this possibility.

As PCA lacks any measurable significance or accuracy, we argue that its dominance in population genetics could not have been achieved without the adoption of two fallacies: cherry-picking or circular reasoning (i.e., “exploration”), the screening and selection of PCA scatterplots that fit preconceived hypotheses while ignoring the other plots; and the a priori fallacy, where PCA results are interpreted based on pre-existing knowledge because PCA scatterplots are uninformative a posteriori. As a “black box” basking in bioinformatic glory, free from any enforceable rules of proper usage, PCA misappropriations, demonstrated here for the first time, are nearly impossible to spot.

The fact that population affinities vary appreciably between closely related, ostensibly equivalent datasets is deeply worrying (PCA applications have been cited 32,000–216,000 times). Researchers from adjacent fields, like animal, plant, or medical genetics, may be even less aware of the inherent biases in PCA and the variety of nonsensical results it can generate. We consider PCA scatterplots analogous to Rorschach plots. We find PCA unsuitable for population genetic investigations and recommend reevaluating all PCA-based studies.

Generating the color populations

All the color populations were generated in a similar way, with the number of dimensions p equal to 3. Every individual color was represented by [ P 1  +  R  ×  N , P 2  +  R  ×  N , P 3  +  R  ×  N ], where P 1–3 are the three color dimensions or components that range from 0 to 1, R is a pseudorandom value drawn from the standard normal distribution (Matlab’s randn function), and N is noise, set to 0.01 in almost all analyses, with the following exceptions where larger noise was needed: Figs.  17  ( N  = 0.02), 19 ( N  = 0.02 or 0.05),  13 B, 13 C ( N  = 0.05), 13 A ( N  = 0.17), and Supplementary Fig. S2.3C ( N  = 0.015). Colors are represented by a name and value (i.e., Red is [1,0,0], to which R  ×  N was added), rounded for brevity.
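The construction above can be sketched in a few lines of NumPy (an illustrative re-implementation of the Matlab procedure described here; the function and variable names are ours, and an independent R is drawn per dimension):

```python
import numpy as np

def make_color_population(base_rgb, n, noise=0.01, seed=None):
    """Simulate a 'color population': each individual is the base color
    [P1, P2, P3] plus standard-normal noise R scaled by N, i.e.
    [P1 + R*N, P2 + R*N, P3 + R*N]."""
    rng = np.random.default_rng(seed)
    base = np.asarray(base_rgb, dtype=float)
    return base + rng.standard_normal((n, 3)) * noise

# e.g., 100 "Red" individuals scattered tightly around [1, 0, 0]
red = make_color_population([1, 0, 0], 100, noise=0.01, seed=0)
```

With N = 0.01, each population forms a tight cloud around its nominal color; raising N (as in the figures listed above) spreads the cloud.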

Sample collection

Alongside the simulated color datasets, we employed three human genotype datasets:

1. 2068 global modern-DNA samples genotyped over 621,799 SNPs 14 , available at https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/NearEastPublic.tar.gz ;

2. the overlap of dataset 1 with 2504 humans from the 1000 Genomes Project 110 , available at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp , and 471 Ashkenazic Jews 48 , available at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE23636  (overall, 5,043 samples); and

3. the overlap of dataset 2 with 514 ancient DNA samples from the Allen Ancient DNA Resource (AADR) (version 44.3) 14 (Supplementary Table S1 ) (overall, 5,557 samples).

We used Lazaridis et al.’s 14 dataset to LD-prune all the datasets. After LD pruning with PLINK (parameters 50 10 0.8) and removing SNPs with missingness, allowing no more than five missing SNPs per sample, the datasets included p 1  = 230,569, p 2  = 128,568, and p 3  = 128,568 autosomal SNPs, respectively.

Data analyses

All PCA calculations were carried out using Matlab’s (R2020a, Statistics and Machine Learning Toolbox Version 11.7) pca function, which uses singular value decomposition (SVD), like SmartPCA, and yields nearly identical results to the basic SmartPCA tool 9 (Version 7.2.1, without removing outliers, normalization, or projection) (Supplementary Figs. S1 – S2 ).
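For readers without Matlab, the centering-plus-SVD computation can be sketched as follows (a minimal NumPy illustration of the default pipeline, not the SmartPCA or Matlab code itself):

```python
import numpy as np

def pca_svd(A, n_components=2):
    """PCA via SVD of the column-centered matrix: returns the sample
    coordinates on the top PCs and the fraction of variance each explains."""
    X = A - A.mean(axis=0)                         # center each marker (column)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # PC coordinates per sample
    explained = S**2 / (S**2).sum()                  # variance fractions
    return scores, explained[:n_components]
```

The score columns are mutually orthogonal by construction, and the `explained` values are the per-PC variance fractions that, as noted in the Conclusions, are often left unreported.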

In test cases where simulated data were used, we manipulated the colors and the sample sizes, both shown in each figure legend and caption. We evaluated the accuracy of PCA’s projection of the colors onto a 2D plane as deviations from the true distances between the colors in the 3D color space.
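One such deviation measure can be sketched as follows (our own illustrative statistic, comparing true 3D pairwise distances with the 2D PC-plane distances; the exact metric used may differ):

```python
import numpy as np

def pairwise_dists(X):
    """Euclidean distances between all pairs of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def projection_deviation(colors_3d, pcs_2d):
    """Mean absolute difference between the true pairwise distances in
    3D color space and the pairwise distances on the 2D PC plane."""
    return np.abs(pairwise_dists(colors_3d) - pairwise_dists(pcs_2d)).mean()
```

A faithful projection yields a deviation near zero; the distortions discussed in the test cases inflate it.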

In test cases where human data were used, we modulated the choice of populations and sample sizes (individuals were always sampled randomly), both shown in each figure legend and caption. Dataset 1 was used to produce Supplementary Figs. S1 – S2 . All the human test cases were carried out on dataset 2, except for the case of ancient DNA, where the third dataset was used. By and large, we refrained from commenting on the accuracy of the prediction, even when it is well established, and instead focused on the conflicting interpretations produced by PCA.

To evaluate the proportion of homogeneous clusters, we applied k -means clustering (Matlab’s kmeans function) to the top two PCs, using k clusters (unless stated otherwise), where k was the square root of the number of samples. Clusters were considered homogeneous if they harbored samples from only one population.
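The homogeneity measure can be sketched as follows (a NumPy stand-in for Matlab's kmeans using a minimal Lloyd's algorithm; function names are ours):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=None):
    """Minimal Lloyd's k-means (a stand-in for Matlab's kmeans)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # init from data points
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def homogeneous_fraction(pcs, populations, k=None, seed=0):
    """Cluster the top two PCs into k = sqrt(n) clusters and report the
    fraction of clusters harboring samples from one population only."""
    pcs = np.asarray(pcs, dtype=float)[:, :2]
    k = k or int(round(np.sqrt(len(pcs))))
    labels = kmeans(pcs, k, seed=seed)
    pops = np.asarray(populations)
    homogeneous = sum(len(set(pops[labels == j])) <= 1 for j in range(k))
    return homogeneous / k
```

A fraction of 1.0 indicates that every cluster is population-pure; overlapping populations in PC space drive it down.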

Evaluating missingness and noise

To evaluate the effects of missingness and noise in the case of marker choice, each color component was evenly divided across a window size of 200, generating a dataset of 600 “SNPs”. Missingness was then simulated by randomly nullifying different values of the matrix. The tri-color component structure was recovered by the reverse operation of summing the three 200-SNP-sets. The noise was generated by adding random markers (generated using Matlab’s rand function) to the color SNP set.
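A sketch of this expansion/nullification procedure (our own NumPy illustration; the function names and the uniform-missingness scheme are assumptions):

```python
import numpy as np

def expand_colors(colors, window=200):
    """Spread each of the three color components evenly across a window
    of SNPs, giving an (n, 3*window) matrix of 600 'SNPs'."""
    return np.repeat(colors / window, window, axis=1)

def add_missingness(snps, fraction, seed=None):
    """Nullify (zero out) a random fraction of the matrix entries."""
    rng = np.random.default_rng(seed)
    out = snps.copy()
    out[rng.random(out.shape) < fraction] = 0.0
    return out

def recover_colors(snps, window=200):
    """Reverse operation: sum each 200-SNP window back into one component."""
    n, p = snps.shape
    return snps.reshape(n, p // window, window).sum(axis=2)
```

Without missingness, `recover_colors(expand_colors(colors))` returns the original components; missingness shrinks the recovered components in proportion to the nullified entries.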

Projection of ancient samples

A major challenge in projecting ancient samples onto modern-day samples is handling the high proportion of missing data. Lazaridis et al. 14 addressed this problem using the least-squares projection (lsqproject) implemented in EIGENSOFT. Wang et al. 68 cautioned that this method does not address the shrinkage problem (where all the projected samples cluster together) and that the results might be misleading. To avoid this problem and the difficulties associated with missing data, in the case of ancient DNA, we analyzed 65 of the 102 ancient samples of interest that had over 10,000 SNPs in our dataset (with a median of 48,249 SNPs). We then projected one ancient sample at a time onto the modern-day samples, using only the genotyped SNPs of the former.
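The one-sample-at-a-time projection can be sketched as follows (an illustrative NumPy least-squares fit restricted to the sample's non-missing SNPs, not the EIGENSOFT lsqproject code; names are ours):

```python
import numpy as np

def fit_modern_pcs(modern, n_components=2):
    """PCA on the modern reference: returns column means and SNP loadings."""
    mu = modern.mean(axis=0)
    U, S, Vt = np.linalg.svd(modern - mu, full_matrices=False)
    return mu, Vt[:n_components].T          # loadings: (snps, n_components)

def project_sample(sample, mu, loadings):
    """Project one (possibly ancient) sample using only its genotyped SNPs:
    restrict the means and loadings to the non-missing markers (NaN = missing)
    and solve a least-squares fit for the PC coordinates."""
    keep = ~np.isnan(sample)
    L = loadings[keep]
    coords, *_ = np.linalg.lstsq(L, sample[keep] - mu[keep], rcond=None)
    return coords
```

For a sample with no missing data, this reduces to the ordinary PC scores; with missing markers, only the genotyped SNPs constrain the fit.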

Estimating the citation number of PCA tools

Very conservatively, we estimate that, as of April 2022, 32,000 genetic studies employed PCA, based on Google Scholar’s citation counts for the most commonly used PCA tools using the following searches: “EIGENSTRAT OR EIGENSOFT OR smartPCA” (8300), “PLINK AND PCA -EIGENSOFT -SNPRelate” (8390), “genalex AND PCA” (5990), “FlashPCA OR FlashPCA2” (365), “PCA in R AND genetics” (530), “adegenet AND PCA” (5350), “ClustVis AND PCA” (2170), and “pcadapt AND PCA” (624). A search for “(population Genetics) AND ("PCA")” yielded 159,000 results. This, too, is likely a small fraction of the true number of studies that employed PCA. Searching for “(Genetics OR genome) AND ("PCA")” yielded 216,000 results.

Data availability

All our data and scripts that can replicate our results and figures are available via GitHub: https://github.com/eelhaik/PCA_critique .

Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533 , 452–454. https://doi.org/10.1038/533452a (2016).

Ioannidis, J. P. A. Why most published research findings are false. PLoS Med. 2 , e124. https://doi.org/10.1371/journal.pmed.0020124 (2005).

Krafczyk, M., Shi, A., Bhaskar, A., Marinov, D. & Stodden, V. Learning from reproducing computational results: Introducing three principles and the Reproduction Package. Philos. Trans. R. Soc. A 379 , 20200069 (2021).

Stokstad, E. Genetics lab accused of misusing African DNA. Science 366 , 555–556. https://doi.org/10.1126/science.366.6465.555 (2019).

Lee, S. S., Bolnick, D. A., Duster, T., Ossorio, P. & Tallbear, K. Genetics. The illusive gold standard in genetic ancestry testing. Science 325 , 38–39. https://doi.org/10.1126/science.1173038 (2009).

Kaiser, J. Who has your DNA—Or wants it. Science 349 , 1475 (2015).

Pennisi, E. Private partnership to trace human history. Science 308 , 340–340. https://doi.org/10.1126/science.308.5720.340a (2005).

Holmes, I. What happens when geneticists talk sloppily about race. (2018). https://www.theatlantic.com/science/archive/2018/04/reich-genetics-racism/558818/ (Accessed 3 May 2020).

Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2 , e190. https://doi.org/10.1371/journal.pgen.0020190 (2006).

Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38 , 904–909. https://doi.org/10.1038/ng1847 (2006).

Edwards, A. & Cavalli-Sforza, L. Analysis of human evolution. In Genetics Today. Proceedings, 11th International Congress of Genetics, The Hague, The Netherlands 3 , 923–933 (1963).

Behar, D. M. et al. The genome-wide structure of the Jewish people. Nature 466 , 238–242. https://doi.org/10.1038/nature09103 (2010).

Atzmon, G. et al. Abraham’s children in the genome era: Major Jewish diaspora populations comprise distinct genetic clusters with shared Middle Eastern ancestry. Am. J. Hum. Genet. 86 , 850–859. https://doi.org/10.1016/j.ajhg.2010.04.015 (2010).

Lazaridis, I. et al. Genomic insights into the origin of farming in the ancient Near East. Nature 536 , 419–424. https://doi.org/10.1038/nature19310 (2016).

Campbell, C. L. et al. North African Jewish and non-Jewish populations form distinctive, orthogonal clusters. Proc. Natl. Acad. Sci. USA 109 , 13865–13870. https://doi.org/10.1073/pnas.1204840109 (2012).

Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42 , 565–569. https://doi.org/10.1038/ng.608 (2010).

Moorjani, P. et al. The history of African gene flow into Southern Europeans, Levantines, and Jews. PLoS Genet. 7 , e1001373. https://doi.org/10.1371/journal.pgen.1001373 (2011).

Patterson, N. et al. Genetic structure of a unique admixed population: Implications for medical research. Hum. Mol. Genet. 19 , 411–419. https://doi.org/10.1093/hmg/ddp505 (2010).

Ramstetter, M. D. et al. Benchmarking relatedness inference methods with genome-wide data from thousands of relatives. Genetics 207 , 75–82. https://doi.org/10.1534/genetics.117.1122 (2017).

Duforet-Frebourg, N., Luu, K., Laval, G., Bazin, E. & Blum, M. G. B. Detecting genomic signatures of natural selection with principal component analysis: Application to the 1000 genomes data. Mol. Biol. Evol. 33 , 1082–1093. https://doi.org/10.1093/molbev/msv334 (2015).

Galinsky, K. J. et al. Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. Am. J. Hum. Genet. 98 , 456–472. https://doi.org/10.1016/j.ajhg.2015.12.022 (2016).

Chen, G.-B. et al. Across-cohort QC analyses of GWAS summary statistics from complex traits. Eur. J. Hum. Genet. 25 , 137. https://doi.org/10.1038/ejhg.2016.106 (2017).

Genovese, G. et al. A risk allele for focal segmental glomerulosclerosis in African Americans is located within a region containing APOL1 and MYH9. Kidney Int. 78 , 698–704. https://doi.org/10.1038/ki.2010.251 (2010).

Luca, D. et al. On the use of general control samples for genome-wide association studies: Genetic matching highlights causal variants. Am. J. Hum. Genet. 82 , 453–463. https://doi.org/10.1016/j.ajhg.2007.11.003 (2008).

Mobuchon, L. et al. A GWAS in uveal melanoma identifies risk polymorphisms in the CLPTM1L locus. NPJ Genom. Med. 2 , 5. https://doi.org/10.1038/s41525-017-0008-5 (2017).

Peterson, R. E. et al. Genome-wide association studies in ancestrally diverse populations: Opportunities, methods, pitfalls, and recommendations. Cell 179 , 589–603. https://doi.org/10.1016/j.cell.2019.08.051 (2019).

Ganna, A. et al. Large-scale GWAS reveals insights into the genetic architecture of same-sex sexual behavior. Science 365 , eaat7693. https://doi.org/10.1126/science.aat7693 (2019).

Ball, C. A. et al. Ethnicity Estimate 2020 White Paper (2020). https://www.ancestrycdn.com/dna/static/pdf/whitepapers/Ethnicity2020_white%20paper.pdf (accessed 9 Dec 2020).

Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. bioRxiv. https://doi.org/10.1101/531210 (2020).

Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562 , 203–209. https://doi.org/10.1038/s41586-018-0579-z (2018).

Li, J. et al. Robust genome-wide ancestry inference for heterogeneous datasets: Illustrated using the 1000 genome project with 3D facial images. Sci. Rep. 10 , 11850. https://doi.org/10.1038/s41598-020-68259-w (2020).

Novembre, J. et al. Genes mirror geography within Europe. Nature 456 , 98–101. https://doi.org/10.1038/nature07331 (2008).

Solovieff, N. et al. Clustering by genetic ancestry using genome-wide SNP data. BMC Genet. 11 , 108 (2010).

Pardiñas, A. F. et al. Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat. Genet. 50 , 381–389. https://doi.org/10.1038/s41588-018-0059-2 (2018).

Wainschtein, P. et al. Recovery of trait heritability from whole genome sequence data. bioRxiv https://doi.org/10.1101/588020 (2019).

Elhaik, E. In search of the jüdische Typus : A proposed benchmark to test the genetic basis of Jewishness challenges notions of “Jewish biomarkers”. Front. Genet. https://doi.org/10.3389/fgene.2016.00141 (2016).

Lawson, D. J., van Dorp, L. & Falush, D. A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nat. Commun. 9 , 3258. https://doi.org/10.1038/s41467-018-05257-7 (2018).

Elhaik, E. & Graur, D. On the unfounded enthusiasm for soft selective sweeps III: The supervised machine learning algorithm that isn’t. Genes 12 , 527 (2021).

Elhaik, E. Empirical distributions of F ST from large-scale Human polymorphism data. PLoS ONE 7 , e49837. https://doi.org/10.1371/journal.pone.0049837 (2012).

Qin, P. et al. Quantitating and dating recent gene flow between European and East Asian populations. Sci. Rep. 5 , 9500. https://doi.org/10.1038/srep09500 (2015).

Li, J. Z. et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319 , 1100–1104. https://doi.org/10.1126/science.1153717 (2008).

Silva-Zolezzi, I. et al. Analysis of genomic diversity in Mexican Mestizo populations to develop genomic medicine in Mexico. Proc. Natl. Acad. Sci. USA 106 , 8611–8616. https://doi.org/10.1073/pnas.0903045106 (2009).

Pugach, I., Delfin, F., Gunnarsdottir, E., Kayser, M. & Stoneking, M. Genome-wide data substantiate Holocene gene flow from India to Australia. Proc. Natl. Acad. Sci. USA 110 , 1803–1808. https://doi.org/10.1073/pnas.1211927110 (2013).

Reich, D., Price, A. L. & Patterson, N. Principal component analysis of genetic data. Nat. Genet. 40 , 491. https://doi.org/10.1038/ng0508-491 (2008).

Reich, D., Thangaraj, K., Patterson, N., Price, A. L. & Singh, L. Reconstructing Indian population history. Nature 461 , 489–494. https://doi.org/10.1038/nature08365 (2009).

Pagani, L. et al. Ethiopian genetic diversity reveals linguistic stratification and complex influences on the Ethiopian gene pool. Am. J. Hum. Genet. https://doi.org/10.1016/j.ajhg.2012.05.015 (2012).

Need, A. C., Kasperaviciute, D., Cirulli, E. T. & Goldstein, D. B. A genome-wide genetic signature of Jewish ancestry perfectly separates individuals with and without full Jewish ancestry in a large random sample of European Americans. Genome Biol. 10 , R7. https://doi.org/10.1186/gb-2009-10-1-r7 (2009).

Bray, S. M. et al. Signatures of founder effects, admixture, and selection in the Ashkenazi Jewish population. Proc. Natl. Acad. Sci. USA 107 , 16222–16227. https://doi.org/10.1073/pnas.1004381107 (2010).

Tian, C. et al. Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 4 , e4. https://doi.org/10.1371/journal.pgen.0040004 (2008).

Tian, C. et al. European population genetic substructure: further definition of ancestry informative markers for distinguishing among diverse European ethnic groups. Mol. Med. 15 , 371–383. https://doi.org/10.2119/molmed.2009.00094 (2009).

Shlush, L. I. et al. The Druze: A population genetic refugium of the Near East. PLoS One 3 , e2105. https://doi.org/10.1371/journal.pone.0002105 (2008).

Marshall, S., Das, R., Pirooznia, M. & Elhaik, E. Reconstructing Druze population history. Sci. Rep. 6 , 35837. https://doi.org/10.1038/srep35837 (2016).

Gladstein, A. L. & Hammer, M. F. Substructured population growth in the Ashkenazi Jews inferred with approximate bayesian computation. Mol. Biol. Evol. 36 , 1162–1171. https://doi.org/10.1093/molbev/msz047 (2019).

Connolly, S., Anney, R., Gallagher, L. & Heron, E. A. Evidence of assortative mating in autism spectrum disorder. Biol. Psychiatry 86 , 286–293. https://doi.org/10.1016/j.biopsych.2019.04.014 (2019).

Suzuki, K. et al. Identification of 28 new susceptibility loci for type 2 diabetes in the Japanese population. Nat. Genet. 51 , 379–386. https://doi.org/10.1038/s41588-018-0332-4 (2019).

Willis, J. et al. Genome-wide analysis of the role of copy-number variation in pancreatic cancer risk. Front. Genet. https://doi.org/10.3389/fgene.2014.00029 (2014).

Elhaik, E. & Ryan, D. M. Pair Matcher (PaM): Fast model-based optimisation of treatment/case-control matches. Bioinformatics 35 , 2243–2250. https://doi.org/10.1093/bioinformatics/bty946 (2019).

Wang, Y., Lu, D., Chung, Y.-J. & Xu, S. Genetic structure, divergence and admixture of Han Chinese, Japanese and Korean populations. Hereditas 155 , 19. https://doi.org/10.1186/s41065-018-0057-5 (2018).

Tian, C. et al. Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS ONE 3 , e3862. https://doi.org/10.1371/journal.pone.0003862 (2008).

Gamba, C. et al. Genome flux and stasis in a five millennium transect of European prehistory. Nat. Commun. 5 , 5257. https://doi.org/10.1038/ncomms6257 (2014).

Skoglund, P. et al. Origins and genetic legacy of Neolithic farmers and hunter-gatherers in Europe. Science 336 , 466–469. https://doi.org/10.1126/science.1216304 (2012).

Reich, D. et al. Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468 , 1053–1060. https://doi.org/10.1038/nature09710 (2010).

Zou, F., Lee, S., Knowles, M. R. & Wright, F. A. Quantification of population structure using correlated SNPs by shrinkage principal components. Hum. Hered. 70 , 9–22. https://doi.org/10.1159/000288706 (2010).

Watkins, L. E. et al. FKBP5 polymorphisms, childhood abuse, and PTSD symptoms: Results from the National Health and Resilience in Veterans Study. Psychoneuroendocrinology 69 , 98–105. https://doi.org/10.1016/j.psyneuen.2016.04.001 (2016).

Wright, K. M. et al. A prospective analysis of genetic variants associated with human lifespan. G3 Genes Genomes Genet. 9 , 2863–2878. https://doi.org/10.1534/g3.119.400448 (2019).

Maples, B. K., Gravel, S., Kenny, E. E. & Bustamante, C. D. RFMix: A discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93 , 278–288. https://doi.org/10.1016/j.ajhg.2013.06.020 (2013).

François, O. et al. Principal component analysis under population genetic models of range expansion and admixture. Mol. Biol. Evol. 27 , 1257–1268. https://doi.org/10.1093/molbev/msq010 (2010).

Wang, C., Zhan, X., Liang, L., Abecasis, G. R. & Lin, X. Improved ancestry estimation for both genotyping and sequencing data using projection procrustes analysis and genotype imputation. Am. J. Hum. Genet. https://doi.org/10.1016/j.ajhg.2015.04.018 (2015).

McVean, G. A genealogical interpretation of principal components analysis. PLoS Genet. 5 , e1000686. https://doi.org/10.1371/journal.pgen.1000686 (2009).

Chari, T., Banerjee, J. & Pachter, L. The specious art of single-cell genomics. bioRxiv https://doi.org/10.1101/2021.08.25.457696 (2021).

Elhaik, E., Graur, D. & Josić, K. “Genome order index” should not be used for defining compositional constraints in nucleotide sequences—A case study of the Z-curve. Biol. Direct 5 , 10. https://doi.org/10.1186/1745-6150-5-10 (2010).

Trochimczyk, J. & Chayes, F. Sampling variation of principal components. J. Int. Assoc. Math. Geol. 9 , 497–506. https://doi.org/10.1007/BF02100961 (1977).

Adams, D., Cardini, A., Monteiro, L., O’higgins, P. & Rohlf, F. Morphometrics and phylogenetics: Principal components of shape from cranial modules are neither appropriate nor effective cladistic characters. J. Hum. Evol. 60 , 240–243 (2011).

Pearson, K. L. I. I. I. On lines and planes of closest fit to systems of points in space. Lond. Edinburgh Dublin Philos. Mag. J. Sci. 2 , 559–572. https://doi.org/10.1080/14786440109462720 (1901).

Article   MATH   Google Scholar  

Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24 , 417–441. https://doi.org/10.1037/h0071325 (1933).

Thurstone, L. L. The Vectors of Mind: Multiple-Factor Analysis for the Isolation of Primary Traits . (University of Chicago Press, 1935).

Hubert, L. Whence principal components. (2016).

Menozzi, P., Piazza, A. & Cavalli-Sforza, L. Synthetic maps of human gene frequencies in Europeans. Science 201 , 786–792. https://doi.org/10.1126/science.356262 (1978).

Piazza, A., Menozzi, P. & Cavalli-Sforza, L. L. Synthetic gene frequency maps of man and selective effects of climate. Proc. Natl. Acad. Sci. USA 78 , 2638–2642. https://doi.org/10.1073/pnas.78.4.2638 (1981).

Article   ADS   CAS   PubMed   PubMed Central   MATH   Google Scholar  

Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155 , 945–959 (2000).

The International HapMap Consortium. A haplotype map of the human genome. Nature 437 , 1299–1320. https://doi.org/10.1038/nature04226 (2005).

Article   ADS   CAS   PubMed Central   Google Scholar  

The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature 449 , 851–861. https://doi.org/10.1038/nature06258 (2007).

Article   CAS   PubMed Central   Google Scholar  

Conrad, D. F. et al. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat. Genet. 38 , 1251–1260. https://doi.org/10.1038/ng1911 (2006).

Yang, W. Y., Novembre, J., Eskin, E. & Halperin, E. A model-based approach for analysis of spatial structure in genetic data. Nat. Genet. 44 , 725–731. https://doi.org/10.1038/ng.2285 (2012).

Elhaik, E. et al. Geographic population structure analysis of worldwide human populations infers their biogeographical origins. Nat. Commun. 5 , 1–12. https://doi.org/10.1038/ncomms4513 (2014).

Mason-Buck, G. et al. DNA based methods in intelligence-moving towards metagenomics. Preprints 2020020158 (2020).

Biswas, S., Scheinfeldt, L. B. & Akey, J. M. Genome-wide insights into the patterns and determinants of fine-scale population structure in humans. Am. J. Hum. Genet. 84 , 641–650. https://doi.org/10.1016/j.ajhg.2009.04.015 (2009).

Altshuler, D. M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467 , 52–58. https://doi.org/10.1038/nature09298 (2010).

Rohlf, F. J. Why clusters and other patterns can seem to be found in analyses of high-dimensional data. Evol. Biol. 48 , 1–16. https://doi.org/10.1007/s11692-020-09518-6 (2021).

Mathieson, I. & Reich, D. Differences in the rare variant spectrum among human populations. PLoS Genet. 13 , e1006581. https://doi.org/10.1371/journal.pgen.1006581 (2017).

Moorjani, P. et al. Reconstructing roma history from genome-wide data. PLoS One 8 , e58633. https://doi.org/10.1371/journal.pone.0058633 (2013).

Olalde, I. et al. The genomic history of the Iberian Peninsula over the past 8000 years. Science 363 , 1230–1234. https://doi.org/10.1126/science.aav4040 (2019).

Pickrell, J. K. et al. The genetic prehistory of southern Africa. Nat. Commun. 3 , 1143. https://doi.org/10.1038/ncomms2140 (2012).

Novembre, J. & Stephens, M. Interpreting principal component analyses of spatial population genetic variation. Nat. Genet. 40 , 646–649. https://doi.org/10.1038/ng.139 (2008).

Price, A. L. et al. Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 4 , e236. https://doi.org/10.1371/journal.pgen.0030236 (2008).

Kopelman, N. M. et al. High-resolution inference of genetic relationships among Jewish populations. Eur. J. Hum. Genet. https://doi.org/10.1038/s41431-019-0542-y (2020).

Elhaik, E. Editorial: Population genetics of worldwide Jewish people. Front. Genet. https://doi.org/10.3389/fgene.2017.00101 (2017).

Björklund, M. Be careful with your principal components. Evolution 73 , 2151–2158 (2019).

Francois, O. & Jay, F. Factor analysis of ancient population genomic samples. Nat. Commun. 11 , 4661. https://doi.org/10.1038/s41467-020-18335-6 (2020).

Brooks, D. The philosophy of data. N.Y. Times 4 , 2013 (2013).

Google Scholar  

Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. The History and Geography of Human Genes . (Princeton University Press, 1994).

Rendine, S., Piazza, A., Menozzi, P. & Cavalli-Sforza, L. L. A problem with synthetic maps: Reply to Sokal et al. Hum. Biol. 71 , 15–25 (1999).

Sokal, R. R., Oden, N. L. & Thomson, B. A. Problems with synthetic maps remain: Reply to Rendine et al. Hum. Biol. 71 , 447–453 (1999).

CAS   PubMed   Google Scholar  

Manni, F. Interview with Luigi Luca Cavalli-Sforza: Past research and directions for future investigations in human population genetics. Hum. Biol. 82 , 245–266 (2010).

Sokal, R. R., Oden, N. L. & Thomson, B. A. A problem with synthetic maps. Hum. Biol. 71 , 1–13 (1999).

Das, R., Wexler, P., Pirooznia, M. & Elhaik, E. Localizing Ashkenazic Jews to primeval villages in the ancient Iranian lands of Ashkenaz. Genome Biol. Evol. 8 , 1132–1149. https://doi.org/10.1093/gbe/evw046 (2016).

Baughn, L. B. et al. Differences in genomic abnormalities among African individuals with monoclonal gammopathies using calculated ancestry. Blood Cancer J. 8 , 96. https://doi.org/10.1038/s41408-018-0132-1 (2018).

Esposito, U., Das, R., Syed, S., Pirooznia, M. & Elhaik, E. Ancient ancestry informative markers for identifying fine-scale ancient population structure in Eurasians. Gene 9 , 625. https://doi.org/10.3390/genes9120625 (2018).

Carress, H., Lawson, D. J. & Elhaik, E. Population genetic considerations for using biobanks as international resources in the pandemic era and beyond. BMC Genom. 22 , 351. https://doi.org/10.1186/s12864-021-07618-x (2021).

MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335 , 823–828. https://doi.org/10.1126/science.1215040 (2012).

Download references

Acknowledgements

EE was partially supported by the Crafoord Foundation, the Swedish Research Council (2020-03485), and Erik Philip-Sörensen Foundation (G2020-011). The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at Lund, partially funded by the Swedish Research Council through Grant agreement no. 2018-05973.

Open access funding provided by Lund University.

Author information

Authors and Affiliations

Department of Biology, Lund University, 22362, Lund, Sweden

Eran Elhaik


Contributions

E.E. carried out all the work.

Corresponding author

Correspondence to Eran Elhaik .

Ethics declarations

Competing interests

EE consults for the DNA Diagnostics Center.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1.

Supplementary Information 2.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Elhaik, E. Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated. Sci Rep 12 , 14683 (2022). https://doi.org/10.1038/s41598-022-14395-4

Download citation

Received: 14 January 2022

Accepted: 06 June 2022

Published: 29 August 2022

DOI: https://doi.org/10.1038/s41598-022-14395-4




The Official Journal of the Pan-Pacific Association of Input-Output Studies (PAPAIOS)

  • Open access
  • Published: 02 March 2022

A new method of identifying key industries: a principal component analysis

  • Lefteris Tsoulfidis   ORCID: orcid.org/0000-0003-2691-0128 1 &
  • Ioannis Athanasiadis 1  

Journal of Economic Structures volume  11 , Article number:  2 ( 2022 ) Cite this article


This article uses principal component analysis (PCA) to identify key industries and group them into particular clusters. The data come from the US benchmark input–output tables of the years 2002, 2007 and 2012, and the most recently published input–output table, of the year 2019. We observe some intertemporal switches of industries, both between and within the top clusters. The findings further suggest that structural change is a slow-moving process and that it takes time for industries to move from one cluster to another. This information may prove important in the design of effective economic policies that target key industries, and also for the stability properties of the economic system.

1 Introduction

In this article, we apply dimensionality reduction to three benchmark input–output tables of the USA, of the years 2002, 2007 and 2012, as well as the last available input–output table, as of this writing, of the year 2019. The dimensions of the tables are reduced to 70 × 70, as we have eliminated the housing industry, because of its many imputations, and also the fictitious household industry. The idea of dimensionality reduction, that is, of reducing the complexity of the modeling of an economy, can be traced to the writings of the Physiocrats and their tableau économique, whose purpose was to mimic the operation of the entire economy by compressing it into three sectors. The tableau is essentially the prototype of a one (multipurpose) commodity world. A similar idea can be recognized in Ricardo’s corn model and certainly in Marx’s schemes of simple reproduction, where a single commodity functions both as a consumer and an investment good. This dual property of the commodity derives from the uniform capital intensity of the two sectors (departments) of the economy. A similar idea can be identified in Samuelson’s ( 1962 ) parable production function, also based on the production of a single commodity. Finally, in recent economic growth theory, we increasingly observe models of a one-commodity world.

In previous works (Mariolis and Tsoulfidis 2018 ; Tsoulfidis 2021 and 2022 ) there has been an effort to strip down the behavior of the entire economic system and compress it into a single hyper-industry through the application of the Schur and singular value decomposition (SVD) techniques (Meyer 2002 , ch. 5). The rationale for applying these techniques is the skew distribution of the economic system’s eigenvalues and the wide gap between the second eigenvalue and the maximal one. The exponentially falling pattern of the subdominant eigenvalues ensures that their effect on the economy is minimal and, therefore, for all practical purposes they can be side-stepped. Similarly, using principal component analysis (PCA), we separate out the impact of the top two (at most three) eigenvalues, which is equivalent to saying that the movement of prices induced by changes in the rate of profit is curvilinear, and the same is true of the wage rate–rate of profit curves or, what is the same, the factor price frontier. Such findings suggest that the distribution of eigenvalues is mainly responsible for the shape of the price–profit or wage–profit curves, and that the first eigenvalue, along with the second, compresses most of the information regulating the economic system’s motion, leaving little to be explained by the third or fourth eigenvalues (see Tsoulfidis 2021 , ch. 6 and 2022 ).

The remainder of the article continues as follows: Sect.  2 gives a brief description of the fundamentals of the PCA and its use in determining clusters in the economy. Subsequently, it applies the PCA and contrasts the first principal component with the backward linkages of the industries. The k-means clustering technique is used to extract the optimal number of groups of industries, followed by the Silhouette method to form the clusters of industries. Sect.  3 orders each of the industries in the form of a dendrogram and a respective network. Finally, Sect.  4 summarizes and concludes with remarks about future directions of the research.

2 Methods and results

2.1 Dimensionality reduction through principal components analysis

PCA is an effective dimensionality reduction technique that constructs relevant features through linear combinations of the original features. The construction of relevant features is carried out by linearly transforming correlated variables into fewer uncorrelated variables. This transformation becomes possible through the projection of the initial data onto the reduced PCA space using the eigenvectors of the covariance/correlation matrix or, what is the same, the principal components (PCs). The resulting projected data are essentially linear combinations of the initial data, capturing most, if not all, of the variance in the data. Furthermore, the PCA increases interpretability and, in so doing, can be particularly helpful in dealing with economic datasets. Notwithstanding its great advantages, the PCA has not been used extensively in economics or in input–output analysis. The usual economic applications of the PCA are in finance (e.g., Plerou et al. 2002 ; Farné and Vouldis 2021 ) and in economic geography and regional analysis, mainly in determining clusters of industries and agglomeration economies (e.g., Vom Hofe and Bhatta 2007 ; Ramos and Moreno 2013 ).

The advantages of the PCA render it particularly applicable to input–output data in the direction of identifying the relative importance of industries in the operation of the economy. From the estimated PCs, the dimensionality reduction requires just the top two. Experience has repeatedly shown that the top two eigenvalues compress most of the influence or, what is the same, most of the variance contained in the economic system’s input–output data. A third or a fourth PC could also be included in case the first two PCs are inadequate. However, by adding more PCs the model gains very little in accuracy but loses a lot of its parsimonious character, and therefore its analytical strength progressively fades away. In dealing with the available input–output matrices and their eigendecomposition, we have repeatedly found that linear and, at most, quadratic approximations are adequate to mimic the actual movement of prices induced by changes in income distribution. The eigendecomposition's remaining terms, although many, contain minuscule information, and therefore we can dispense with them (see Bienenfeld 1988 ; Iliadi et al. 2014 ; Mariolis and Tsoulfidis 2018 ; Tsoulfidis 2021 , ch. 6 and 2022 ). The application of the PCA to input–output data consists of the following main operational steps:

Starting with the matrix of total requirements, that is, the Leontief inverse \(\left[ {{\mathbf{I}} - {\mathbf{A}}} \right]^{ - 1}\) provided by the U.S. Bureau of Economic Analysis (BEA) for the benchmark years, we estimate the input–output matrix, \({\mathbf{A}}\) , as follows:

\({\mathbf{A}} = {\mathbf{I}} - \left( {\left[ {{\mathbf{I}} - {\mathbf{A}}} \right]^{ - 1} } \right)^{ - 1}\)

From which, in turn, we get the matrix of vertically integrated input–output coefficients,

\({\mathbf{H}} = {\mathbf{A}}\left[ {{\mathbf{I}} - {\mathbf{A}}} \right]^{ - 1}\)

and so, we end up with the matrix \({\mathbf{H}}\) of 70 × 70 industries, whose PCs we seek to estimate in the effort to group industries into clusters with differences in relative importance and meaning. The advantage of the matrix \({\mathbf{H}}\) is that it gives more (less) weight to the larger (smaller) coefficients in matrix \({\mathbf{A}}\) . Footnote 1 Furthermore, the matrix \({\mathbf{H}}\) is used in the estimation of prices and their changes in the face of income redistribution (Tsoulfidis 2021 , ch. 6 and the literature cited there).

We centered the matrix \({\mathbf{H}}\) by subtracting from each column element the column’s mean; repeating the process for each of our 70 industries, we get

\({\overline{\mathbf{H}}} = {\mathbf{H}} - \tfrac{1}{70}\,{\mathbf{e}}^{\prime} {\mathbf{e}}{\mathbf{H}}\)

where \({\mathbf{e}}\) is the row (1 × 70) vector of ones, or simply the summation vector, and a prime indicates the transpose of a vector or matrix. In order to find the variance/covariance matrix, we multiply \({\overline{\mathbf{H}}}\) from the left by \({\overline{\mathbf{H}}}^{\prime}\) and form the new matrix

\({\overline{\mathbf{H}}}^{\prime} {\overline{\mathbf{H}}}/69\)

The eigenvalues of the above variance/covariance matrix are ranked from the maximum to the minimum, along with their respective eigenvectors. It is important to note that, among the eigenvalues of the matrix \({\overline{\mathbf{H}}}^{^{\prime}} {\overline{\mathbf{H}}}/69\) , the maximal eigenvalue stands for the maximal variance, and so forth for the rest.

We know that the eigenvalues denote the relative importance of their corresponding eigenvectors. It follows, therefore, that the ratio of each eigenvalue to the total sum of eigenvalues gives the proportion of variance explained. From the estimated eigenvalues, we isolate the first couple, whose percentage of the total is found quite satisfactory: despite their relatively low percentage, the first two eigenvalues provide a fairly accurate description of the motion of the entire economic system. Consequently, by adding the next-ranked eigenvalues, we do not improve in any significant way our overall understanding of the relative importance and interconnections of industries. The eigenvectors indicate both the size and the direction of the variance, and they are ranked according to their respective eigenvalues, starting with the maximal and going to the second, third, and so forth. We rotate the eigenvectors so as to place the first PC on the horizontal axis and the second PC on the vertical axis.

Figuratively speaking, in the case of two PCs, we may think of the first PC as the axis one obtains from the vertical section of a cylinder, provided that its height is far greater than its circumference. The second PC is a horizontal section of the same cylinder, perpendicular to that axis. The variance equated with the height of the first PC is meaningful only if it is significantly larger than that of the second PC, and both are distant enough from the rest. In our data, a third PC would give rise to a three-dimensional graph, but it would not add much to our denoising process and the extraction of the relative importance of each of our 70 industries. The PCA seeks to maximize, to the extent possible, the information content of the first PC; the remaining information is in the second PC, and so forth. The scree plot, that is, the distribution of eigenvalues of the matrix \({\overline{\mathbf{H}}}^{^{\prime}} {\overline{\mathbf{H}}}/69\) , signifies the relative importance of each PC.

Having selected the first two eigenvalues of the variance/covariance matrix \({\overline{\mathbf{H}}}^{^{\prime}} {\overline{\mathbf{H}}}/69\) and their respective eigenvectors, we discard those of lesser significance (below the second eigenvalue) and form, with the remaining ones, a matrix of vectors that we call the feature vector. In short, the feature vector is simply a matrix whose columns are the eigenvectors of the components that we decide to utilize.

This is the first step towards dimensionality reduction, because if we choose to keep only the first two eigenvectors (components) out of 70, the final data set will have only 70 × 2 dimensions. Subsequently, for the derivation of the PCs, we apply the following multiplication,

\(\overline{{{\mathbf{PC}}}} = {\overline{\mathbf{H}}}\,{\mathbf{F}}\)

where \({\mathbf{F}}\) is the feature vector, that is, the matrix of retained eigenvectors, and we end up with the \(\overline{{{\mathbf{PC}}}}\) of the economic system’s matrices, from which we keep the eigenvectors corresponding to the top two eigenvalues.
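The centering, covariance and projection steps above can be sketched in a few lines of pure Python. The 4 × 2 data matrix below is a purely illustrative stand-in for the 70 × 70 matrix \({\mathbf{H}}\), and the closed-form 2 × 2 eigendecomposition replaces the numerical routine one would use at full scale.

```python
# Sketch of the PCA steps above: center columns, form the covariance
# matrix Hc'Hc/(n - 1), eigendecompose it, and project onto the leading
# eigenvector. The 4x2 data are purely illustrative.
import math

H = [[2.5, 2.4],
     [0.5, 0.7],
     [2.2, 2.9],
     [1.9, 2.2]]
n, m = len(H), len(H[0])

# Center each column on its mean.
means = [sum(row[j] for row in H) / n for j in range(m)]
Hc = [[row[j] - means[j] for j in range(m)] for row in H]

# Covariance matrix C = Hc' Hc / (n - 1).
C = [[sum(Hc[i][a] * Hc[i][b] for i in range(n)) / (n - 1)
      for b in range(m)] for a in range(m)]

# Eigenvalues of the 2x2 covariance matrix, in closed form.
tr = C[0][0] + C[1][1]
det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
disc = math.sqrt(tr * tr / 4 - det)
lam1, lam2 = tr / 2 + disc, tr / 2 - disc        # lam1 >= lam2

# Leading eigenvector (the "feature vector" for PC1), normalized.
v = [C[0][1], lam1 - C[0][0]]
norm = math.hypot(v[0], v[1])
v = [x / norm for x in v]

# Project the centered data onto PC1; lam1's share is the variance explained.
pc1 = [sum(Hc[i][j] * v[j] for j in range(m)) for i in range(n)]
share = lam1 / (lam1 + lam2)
```

At the paper's scale, the same steps apply, with the eigendecomposition of the 70 × 70 covariance matrix computed numerically and the top two eigenvectors retained as the feature vector.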

2.6 Principal components and clusters of industries with input–output data

Before we introduce the PCA and its application to identifying key industries in the input–output structure of the economy, it is important to establish its connection with what has hitherto been used in input–output analysis. For this purpose, we start with the Leontief inverse, which is directly given in the input–output tables of the BEA, and make a comparison between the estimated first PC and the estimates of linkages, backward or forward. From the examination of the simple sums of columns or rows of the total requirements matrices of the benchmark years 2002, 2007 and 2012, and also 2019, the last input–output data available, we find, as expected, that the simple row sums of \(\left[ {{\mathbf{I}} - {\mathbf{A}}} \right]^{ - 1}\) have a low correlation with the estimated first PC of the matrix \({\mathbf{H}}\) . By contrast, the column sums, or the column norms, of the matrix \(\left[ {{\mathbf{I}} - {\mathbf{A}}} \right]^{ - 1}\) are closely associated with the first PC.

It is important to emphasize at this juncture that the forward linkages (FL) or backward linkages (BL) of industries have the exact same ranking regardless of whether one uses the matrix \(\left[ {{\mathbf{I}} - {\mathbf{A}}} \right]^{ - 1}\) or the vertically integrated technical coefficients matrix \({\mathbf{H}}\) , which is used for the estimation of our PCs. For the estimation of the PCs, we opted to utilize the matrix \({\mathbf{H}}\) , which has been utilized in the estimation of relative prices as well as the price and wage rate of profit curves (Tsoulfidis 2021 ). The estimated first PC from the above matrix is found to be highly correlated with the direct (unweighted) BL of the total requirements matrix, \(\left( {{\mathbf{e}}\left[ {{\mathbf{I}} - {\mathbf{A}}} \right]^{ - 1} } \right)^{^{\prime}} /70\) , where \({\mathbf{e}}\) is the row (1 × 70) summation vector of ones; the so-derived industry average BL was further divided by the economy’s average BL (Chenery and Watanabe 1958 ; Miller and Blair 2009 , ch. 12). If an industry’s linkages are higher than one, it follows that this industry weighs more than the economy-wide average. Consequently, changes in this particular industry exert higher-than-average effects on the total economy; the converse holds for those with linkages below one.
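As a sketch of the normalized backward-linkage calculation described above (column sums of the Leontief inverse divided by the economy-wide average), using a hypothetical 3 × 3 direct-requirements matrix in place of the 70 × 70 US table:

```python
# Normalized (unweighted) backward linkages: column sums of the Leontief
# inverse [I - A]^{-1}, each divided by the economy-wide average.
# The 3x3 direct-requirements matrix A is purely illustrative.
n = 3
A = [[0.10, 0.30, 0.05],
     [0.20, 0.05, 0.10],
     [0.10, 0.20, 0.05]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Leontief inverse via the convergent power series I + A + A^2 + ...
# (valid here because every column sum of A is below one).
L = [[float(i == j) for j in range(n)] for i in range(n)]
P = [row[:] for row in L]
for _ in range(200):
    P = matmul(P, A)
    L = [[L[i][j] + P[i][j] for j in range(n)] for i in range(n)]

bl = [sum(L[i][j] for i in range(n)) for j in range(n)]  # column sums
avg = sum(bl) / n
bl_norm = [b / avg for b in bl]   # > 1: above-average pull on suppliers
```

An industry with `bl_norm` above one pulls more than the average on its suppliers, which is the sense in which the text treats linkages above one as marking key industries.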

The first PC, corresponding to the eigenvector with the maximal eigenvalue, provides us with information as to where the data are maximally spread out and, therefore, explains the most variance of the system. The second PC has a lower eigenvalue and thus encompasses most of the system’s remaining variance. The skew distribution of eigenvalues allows the selection of the top two eigenvalues which, although they account for only about 50% of the total variance, are nevertheless adequate, since we are dealing with so many (70 × 70) observations.

The idea is that an industry with a high power of dispersion (variance) displays the features of a key industry. This is equivalent to saying that a given percentage increase in its output will deliver a significant impact on its suppliers. By contrast, in an industry with relatively small variance, a change in its output draws evenly and lightly on its suppliers. Footnote 2

It is also important to note that we did not get an equally strong relationship between the principal components of the matrix \({\mathbf{H}}\) and the FL. The reason is that the sums of rows refer to output proportions, and so a closer relationship would require estimation of \(\overline{{{\mathbf{PC}}}}\) from the matrix \({\mathbf{H^{\prime}}} = \left[ {{\mathbf{I}} - {\mathbf{A}}} \right]^{ - 1} {\mathbf{A}}\) . This does not mean that the FL are not important for understanding the interrelationships between industries and the structural changes in the economic system as a whole. For this purpose, the first PC alone is not adequate and needs to be supplemented by at least the second PC.

From the panel of four graphs (Fig.  1 ), we observe that the first PC is strongly associated with the BL of each industry, as reflected in the high R-squared, which varies from 72 to 80%. The Pearson correlation coefficient (displayed in each of the four graphs in Fig.  1 ), which is well above 0.80, indicates a strong similarity and positive relationship between the two compared variables in each year of our study. These findings suggest that the application of the PCA may lead to fuller and more informative results with respect to the interrelations of industries. Furthermore, since we are looking for higher variances, the PCA becomes particularly operational in evaluating the relative importance of each industry, as one may judge from the covariance of the centered array and the eigenvalues and eigenvectors of that covariance. The dominant eigenvalue explains most of the variance in the data, and the second, along with the subdominant eigenvalues ranked from the top down, is used to categorize each industry into the appropriate cluster, as shown below. The covariance matrix \({\overline{\mathbf{H}}}^{^{\prime}} {\overline{\mathbf{H}}}/69\) gave the eigenvalues for each of the four years of our study, which we plot in a panel of scree graphs in Fig.  2 .

Figure 1. Principal components vs. backward linkages: a horn kind of relationship

Figure 2. Scree plots of covariance matrices 2002, 2007, 2012, and 2019

In spectral methods, the top few eigenvalues decide the dimensions or, what is the same, the number of PCs to be selected, depending on the so-called “elbow rule”, according to which the percentage explained by the selected PCs should usually be around 70%. However, when we are dealing with large dimensions, as in our case, this percentage might be significantly lower. From the panel of four scree graphs in Fig.  2 , we observe that the first two (or at most three) eigenvalues are much higher than the rest, and from the third eigenvalue onwards the decay of the eigenvalues begins. By adding the subdominant eigenvalues and their associated eigenvectors, we do not improve our overall explained variability, and certainly we do not affect our results qualitatively.
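The elbow-rule bookkeeping reduces to computing each eigenvalue's share of the total variance and accumulating shares until a chosen threshold is reached. The eigenvalue spectrum below is illustrative, not taken from the paper's data:

```python
# Share of total variance explained by each PC from a hypothetical skew
# eigenvalue spectrum (sorted in decreasing order), and the smallest
# number of PCs whose cumulative share reaches a chosen threshold.
eigvals = [5.2, 2.1, 0.9, 0.5, 0.3, 0.2, 0.1]   # illustrative values
total = sum(eigvals)
shares = [v / total for v in eigvals]

cumulative, acc = [], 0.0
for s in shares:
    acc += s
    cumulative.append(acc)

# "Elbow rule": keep the fewest PCs reaching, say, 70% of the variance.
k = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.70)
```

With this skew spectrum the first two components already clear the 70% threshold, mirroring the paper's choice of the top two PCs.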

2.7 Identifying industry clusters

In data science, and especially in datasets with many features (or variables) such as ours, clustering is a very useful tool. The method for finding cluster structure in a dataset seeks the greatest dissimilarity between different clusters and, most important, the greatest similarity within the same cluster. The earliest method, used mainly by biologists and social scientists, is hierarchical clustering, which established cluster analysis as a branch of multivariate statistical analysis (Jain and Dubes 1988 ; Kaufman and Rousseeuw 1990 ). This approach to machine learning is called unsupervised, unlike supervised learning, where the data or inputs are placed in already known categories (fruits are classified into apples and oranges) and the dependent variables are separated from the independent ones. By contrast, in unsupervised learning the algorithms are left to their own devices to detect the presence of similarities and extract patterns between the inputs, so that similar inputs can be clustered together.

From a statistical viewpoint, these clustering methods are generally considered either probability-model-based or non-parametric approaches. The first approach assumes that the observations come from a mixture probability model, with the consequence that a mixture-likelihood approach to clustering is used (McLachlan and Basford 1988 ). The Expectation–Maximization (EM) algorithm is the most frequently used in model-based approaches (Yu et al. 2018 ). In the non-parametric approach, the preferred method for clustering is the use of an objective function based on a similarity or dissimilarity measure, with a division into hierarchical and partitional methods in nearly every case (Kaufman and Rousseeuw 1990 ; Jain 2010 ; Yang et al. 2018 ).

On the other hand, the main problem of the k-means clustering algorithms is the need to specify the number of clusters a priori. To solve this, validity indices that are supposed to be independent of the clustering algorithm should be used (Halkidi et al. 2001 ). Many such validity indices have been proposed, such as the Bayesian information criterion (BIC; Kass and Raftery 1995 ), the Akaike information criterion (AIC; Bozdogan 1987 ) and the Silhouette width (SW; Rousseeuw 1987 ), among others.

How similar an object is to its own cluster (cohesion) compared to other clusters (separation) is a very useful measure obtained by the Silhouette value (Rousseeuw 1987 ), which we use in our approach to deriving the key industries of the US economy in four years quite far apart from each other. This measure ranges from − 1 to + 1, with higher values indicating a good match to the object’s own cluster, whereas low values indicate objects poorly matched relative to neighboring clusters (Sinaga and Young 2020 ).
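A minimal sketch of the Silhouette computation, using toy one-dimensional data and an assumed labelling (every cluster is assumed to have at least two members):

```python
# Silhouette value s(i) = (b - a) / max(a, b): a is the mean distance
# from point i to the other members of its own cluster (cohesion), and
# b is the smallest mean distance to any other cluster (separation).
# Toy 1-D data with an assumed labelling; purely illustrative.
pts = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
labels = [0, 0, 0, 1, 1, 1]

def mean_dist(p, members):
    return sum(abs(p - q) for q in members) / len(members)

def avg_silhouette(pts, labels):
    clusters = {c: [p for p, l in zip(pts, labels) if l == c]
                for c in set(labels)}
    scores = []
    for p, l in zip(pts, labels):
        a = mean_dist(p, [q for q in clusters[l] if q != p])  # cohesion
        b = min(mean_dist(p, clusters[c])                     # separation
                for c in clusters if c != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

width = avg_silhouette(pts, labels)   # near +1: well-separated clusters
```

Comparing the average width across candidate partitions is what drives the choice of the number of clusters in the text: the partition with the highest average width wins.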

The various partition methods can be represented by finite cluster prototypes with their own objective functions. Furthermore, the dissimilarity (or distance) between a point and a cluster prototype is crucial for the partition methods (Jain and Dubes 1988 ). The method of k-means clustering, with its various extensions, is very popular in the literature, with applications in various scientific areas (Alhawarat and Hegazi 2018 ).

The next step is to separate our 70 industries into distinct clusters according to their similarity or, what is the same, their homogeneity. The similarity or dissimilarity of the industries depends on the question asked and the type of industries; in our case, the relevant question is the ranking of our industries in order of importance, namely, that not all industries impart or incur internally or externally generated shocks in the same way. Some industries are tightly connected to each other but weakly to others, and some are only lightly connected with the rest. In our case, we categorized the 70 industries into three clusters according to the popular k-means criterion. The latter is a method of partitioning  n  observations into  k  clusters in which each industry is assigned to a particular cluster according to the nearest mean, or cluster centroid, around which industries are crowded. The property of  k -means clustering is that it minimizes within-cluster variances, that is, squared Euclidean distances: the mean minimizes squared errors, whereas the geometric median minimizes Euclidean distances. Cluster analysis starts by selecting a distance measure and an optimization process, which meaningfully determine the number of  k  partitions or clusters (only a few) and the industries contained in each. In Fig.  3 , we determine the optimal number of clusters following the Silhouette method for each of the four years of our analysis. The results show that in every particular year the optimal number of clusters is three.
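A bare-bones version of the k-means partitioning described above; the 2-D points stand in for industries in the first-two-PC plane, and both the data and the initial centroids are purely illustrative:

```python
# Bare-bones k-means: assign each point to its nearest centroid, then
# recompute each centroid as the mean of its members, until the
# assignments stabilize.
def kmeans(points, centroids, iters=50):
    labels = []
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each point
        # (squared Euclidean distance, which the mean update minimizes).
        labels = [min(range(len(centroids)),
                      key=lambda c: (p[0] - centroids[c][0]) ** 2 +
                                    (p[1] - centroids[c][1]) ** 2)
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new = []
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new.append((sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members)))
            else:
                new.append(centroids[c])   # keep an empty cluster's centroid
        if new == centroids:
            break
        centroids = new
    return labels, centroids

points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),   # a low-PC1 crowd
          (2.0, 2.1), (2.2, 1.9),               # high PC1, high PC2
          (2.1, -2.0), (1.9, -2.2)]             # high PC1, low PC2
labels, centroids = kmeans(points, [(0.0, 0.0), (2.0, 2.0), (2.0, -2.0)])
```

The three recovered groups play the role of the W, N–E and S–E clusters in Figs. 4, 5, 6 and 7; in practice the number of clusters is first chosen with the Silhouette method rather than fixed by hand.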

figure 3

Optimal number of clusters with the Silhouette method

Having established that the number of clusters is three, the industries are classified in each of these three clusters according to their respective centroid. The scree plot helps us to choose the principal components and understand the basic data structure. In Figs.  4 , 5 , 6 and 7 , we display the three distinct clusters and the number of industries contained in each.

figure 4

Clusters of industries, 2002

figure 5

Clusters of industries, 2007

figure 6

Clusters of industries, 2012

figure 7

Clusters of industries, 2019

In the interest of brevity and clarity of presentation, we explain, for each of the four years of our analysis, the clusters that we form as well as the industries they include. As already mentioned, the first PC is placed on the horizontal axis and the second PC on the vertical axis. Of the three clusters, we single out the blue or South–East (S–E) cluster and the green or North–East (N–E) cluster as the most important ones, containing the key industries. The ranking within each cluster follows the first PC and also takes into account the second PC. The majority of industries are compressed in the orange or Western (W) cluster. The nomenclature of industries is in the Appendix 1 . Footnote 3 Thus, the benchmark input–output data of the year 2002 are as given in Fig.  4 .

In Fig.  4 , the industries in the S–E (blue) and N–E (green) clusters are ranked starting from the South–East, placing industries from right to left until we exhaust the S–E cluster and then continuing with the N–E cluster. We thus have the following ranking, noting first the number of the industry and then its name: 25 chemical products, 27 wholesale trade, 52 miscellaneous professional scientific, 54 administrative and support services, 48 other real estate, 10 primary metals, 11 fabricated metal products, 13 computer and electronic products, 53 management of companies and enterprises, 6 utilities, 22 paper products, 15 motor vehicles bodies and trailers. Thirteen industries in total are included in the first cluster, while the second, green N–E cluster contains the following three industries: 44 federal credit intermediation, 3 oil and gas extraction, 45 securities commodity contracts. Footnote 4

Continuing with the benchmark year 2007, we distinguish the following three clusters shown in Fig.  5 .

We observe that the number of industries in the blue or S–E cluster decreased, while it increased in the North–East (green) cluster. More specifically, in descending order and starting from the right end of the first PC, we have the following four industries: 25 chemical products, 10 primary metals, 27 wholesale trade, 11 fabricated metal products. The N–E cluster now includes more industries, also ranked in descending order: 3 oil and gas extraction, 52 miscellaneous professional scientific and technical services, 24 petroleum and coal products, 48 other real estate, 54 administrative and support services, 44 federal credit intermediation, 45 securities commodity contracts, 46 insurance carriers and related activities.

Continuing with the benchmark year 2012 the three clusters along with the industries contained in each are displayed in Fig.  6 .

We observe that in the year 2012 the number of industries in the top South–East cluster dropped to only three and these are: 25 chemical products, 27 wholesale trade, 10 primary metals.

The North–East cluster includes the following industries, ranked in descending order: 3 oil and gas extraction, 52 miscellaneous professional, scientific and technical services, 24 petroleum and coal products, 48 other real estate, 54 administrative and support services, 53 management of companies and enterprises, 44 federal credit intermediation, 46 insurance carriers and related activities, 45 securities, commodity contracts, etc.

Finally, given the lack of a benchmark input–output table for the year 2017, which is still in the making, we utilize the last available input–output table, that of the year 2019; the three clusters are shown in Fig.  7 .

The blue or S–E cluster of the year 2019 is augmented: besides the three top industries of 2012 (27 wholesale trade, 25 chemical products, 10 primary metals), which lie at the right end of the cluster and so continue to exert most of their influence on the economy, three more industries are added, namely 53 management of companies and enterprises, 11 fabricated metal products, and 15 motor vehicles, bodies and trailers.

By contrast, the green or N–E cluster contains the following industries: 52 miscellaneous professional, scientific and technical services, 48 other real estate, 54 administrative and support services, 46 insurance carriers and related activities, 45 securities, commodity contracts and 44 federal credit intermediation.

3 Discussion

3.1 Ranking of industries

The clustering of industries into three groups offers an altogether different vantage point. Grouping industries into clusters ranked in order of importance makes it possible to estimate the impact of internally or externally generated shocks on the economy as a whole. For this purpose, we compare our findings from the PCA with those of total BL and FL. In effect, we experimented with the traditional techniques of identifying key industries according to the forward or backward linkages. The results showed that neither the forward nor the backward linkages, taken individually, accord with the ranking of industries found through the clustering procedure. However, taking the total linkages, that is, the average of an industry's forward and backward linkages, the resulting ranking is quite similar to that of the PCA. Footnote 5 This lends support to our clustering technique as a meaningful and, at the same time, effective way of ranking industries.

The next task is to group industries into particular clusters according to how homogeneous they are. The common property characterizing industries is their variance, which can be classified as high, medium, or low. This clustering is a fine example of “unsupervised learning”, since we are not making predictions; we merely categorize the industries into particular groups. The PCA captures, better than any other equally parsimonious method, the variance associated with each of the industries. Furthermore, the PCA enables the classification of industries according to their distance from the origin: the further away from zero, the higher the variance of an industry relative to the rest, and similarly situated industries can be grouped into a cluster. The location of an industry in a particular cluster indicates its association with the two PCs. Thus, in our case, the industries grouped in the S–E cluster are characterized by high variance and therefore have a much larger effect on the economy. The second in importance, the N–E area, also contains influential industries, but less so than those in the S–E. The majority of industries are crowded near zero and are far from being key industries.
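A sketch of this idea with synthetic data (a hypothetical 70×70 matrix, not the BEA input–output data): rows with markedly higher variance end up far from the origin in the space of the first two PCs, computed here via the SVD of the centred matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 70x70 stand-in for the industry matrix; the first three rows
# are given much larger variation and play the role of "key industries".
A = rng.normal(scale=0.05, size=(70, 70))
A[:3] += 1.0

Xc = A - A.mean(axis=0)                  # centre the columns
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc_scores = Xc @ Vt[:2].T                # coordinates on the first two PCs
dist = np.linalg.norm(pc_scores, axis=1) # distance from the origin in PC space
print(sorted(np.argsort(dist)[-3:]))     # -> [0, 1, 2]: the high-variance rows
```

Ranking rows by this distance is the same criterion the paper uses to separate the S–E and N–E clusters from the crowd near the origin.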

It is interesting to note that the ranking using the average of the BL and FL is quite close to that of the PCA. The underlying idea behind the total linkages is that if an industry's value is higher than one, its importance exceeds the economy-wide average; the converse holds when the figure is lower than one. We picked 1.18 as our threshold for total linkages and found about 15 key industries in the four years of our study, no different from those derived from our two clusters. Of course, there are differences in rankings, but the clustering method based on the PCA gives consistent results from one benchmark year to the next and through the final year, 2019. The PCA results are close to those derived using the traditional total linkages. Furthermore, the PCA possesses some additional properties that may give rise to a new research agenda and methodology for identifying clusters of industries and their possible economic impact.
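The total-linkage benchmark can be sketched as follows. The 4×4 technical-coefficients matrix A here is hypothetical, not the BEA data; normalised (Rasmussen-type) linkages average to one, so values above one flag above-average influence:

```python
import numpy as np

# Hypothetical technical-coefficients matrix (column j = inputs per unit of
# output of industry j); all row sums < 1, so (I - A) is invertible and the
# Leontief inverse is non-negative.
A = np.array([[0.1, 0.2, 0.0, 0.1],
              [0.3, 0.1, 0.2, 0.0],
              [0.0, 0.1, 0.1, 0.2],
              [0.1, 0.0, 0.3, 0.1]])
n = A.shape[0]
L = np.linalg.inv(np.eye(n) - A)   # Leontief inverse: direct + indirect needs

bl = n * L.sum(axis=0) / L.sum()   # normalised backward linkages (column sums)
fl = n * L.sum(axis=1) / L.sum()   # normalised forward linkages (row sums)
total = (bl + fl) / 2              # total linkage, averaging to one
key = np.flatnonzero(total > 1)    # industries with above-average total linkage
print(np.round(total, 2), key)
```

The paper's threshold of 1.18 plays the role of `total > 1` here, simply cutting deeper into the ranking.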

An inspection of the three clusters reveals that the outer right cluster is the one whose relative importance makes it the principal cluster. Thus, the industries in the S–E are, other things equal, more influential than similarly situated industries located in the N–E cluster. The rankings of industries according to the PCA are displayed in Table 1 . More specifically, and for each of our four years, the first column displays the ranking of industries, the second column shows the industry number placed according to the traditional method, and the third column shows the industry numbers contained in each cluster, starting from the S–E, continuing to the N–E, and going to the W cluster until the 15th industry. There is some variability between industries, but not much.

3.2 Industries ordered in dendrograms

A salient feature of the PCA is that clustering enables the presentation of the industries in a dendrogram, which marks the last step in cluster analysis. A dendrogram is a hierarchical tree plot that displays a grouping of industries into distinct clusters. The length of each branch on the graph measures the distance between industries in the cluster. The purpose of dendrograms is to decide upon the suitable number of clusters. For this purpose, we employ the agglomerative method, which creates a hierarchy of industries starting with all of them as if they were completely separated and then fuses them until there is only one cluster left.

The dendrograms below mark the presence of three clusters. The clusters are distinguished starting from the horizontal axis, where all 70 industries stand like the leaves of a tree. As we move upward, ideally drawing a line parallel to the horizontal axis, we start distinguishing the branches directed to the core of the imaginary tree. In this process, we end up with three main branches for each year of our study, exactly as indicated by our k-means testing procedure. In each of these branches, we identify not only the three clusters but also their detailed connections. We observe a hierarchical arrangement of the industries ranked according to the height of the branches: the higher a branch, the higher the relative importance of the industry.
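The agglomerative step can be sketched with SciPy; the synthetic points stand in for the 70 industries, and `scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Three synthetic groups standing in for industries
X = np.vstack([rng.normal(m, 0.3, size=(10, 2))
               for m in ([0, 0], [5, 0], [0, 5])])

Z = linkage(X, method='ward')                    # agglomerative merge tree
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree at 3 clusters
print(len(np.unique(labels)))                    # -> 3
```

Cutting the tree with a horizontal line, as described in the text, corresponds exactly to the `fcluster` call with `criterion='maxclust'`.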

figure a

Dendrogram of industries, 2002

The nodes that are higher indicate the importance of each cluster and, within the same cluster, the importance of the industry. In other words, the longer the branch, the more important the industries that branch out of it; and within each branch, the industries located above the others carry more weight than those below them. Thus, starting from the benchmark year 2002, the dashed red horizontal line indicates the presence of three clusters, and the top one consists, in descending order, of industries 45, 3 and 44, which make up the first main branch. Moving to the next branch in importance, industries 52, 48 and 54 come first; industries 25 and 27 follow, then 15, 13, 22 and so forth. This ordering may deviate somewhat from the clustering displayed in Table 1 , but the dendrogram makes essentially the same selection as the Silhouette method of clustering, and the differences are border cases.

figure b

Dendrogram of industries, 2007

In the dendrogram of the year 2007, we have the following ranking in descending order: 10, 25, 27 from the first major branch and, for the second, 3, 45, 46, 24, 25, 44, 6, 53, 48 and 54.

figure c

Dendrogram of industries, 2012

The same ranking with respect to the top, longer branch is repeated in the dendrogram of the year 2012. Thus, we once again have industries 10, 25, 27, followed by 3, 45, 46, 24, 25, 6, 53, 44, 48 and 54.

figure d

Dendrogram of industries, 2019

Finally, the dendrogram of the year 2019 gives the following ranking of industries: 25, 27, 10, 53, 11, 15, 45, 44, 46, 54, 48, 52. In the four dendrograms that we constructed, we observe that industries 10, 25 and 27 are connected to each other, are distinct from the other industries, and rank among the top industries in the last three input–output datasets.

Dendrograms may prove particularly helpful for understanding the formation and internal structure of clusters, and they can be profitably used to lay bare underlying trends and highlight outliers. Such information is unquestionably practical in tracking down processes of structural and technological change. For this purpose, the more detailed inter-cluster and intra-cluster connections of industries shed further light on all of the above. The panel of four dendrograms displayed above provides a visual description of such inter- and intra-cluster connections of industries.

A similar picture is drawn by the networks of industries displayed in the panel of four graphs below. We observe that industries form clusters, which we paint in green, blue, and red to distinguish them from each other. In these networks, we identify the connections between industries within each of the three clusters and their branching out. The latter indicates more clearly the interconnections between industries, which may prove invaluable in picking strategic industries as the targets of particular industrial policies.

figure e

Network of industries, 2002

In the 2002 network above, the green branch takes precedence over the blue and red branches. Industry 45 in the green branch appears to be the most important, followed by industries 44 and 3. The second in importance, the blue branch, has as its top industries 27, 25, 11, 10 and 52. Finally, in the red branch we distinguish, in order of importance, industries 47 and 46 against all the remaining industries.

The usefulness of this graphic is that we can clearly see, as branches and leaves, all the connections between industries and the precedence of certain industries over others.

figure f

Network of industries, 2007

In the 2007 network above, industry 10 in the blue branch has priority over 25 and 27. In the green branch, industry 45 is the most influential, followed by industries 11, 3, 52 and so forth. In the red branch, industries 17, 15 and 47 have priority over all others.

figure g

Network of industries, 2012

In the 2012 network, industry 10 has priority in the blue branch; the green branch is led by industry 3, followed by 45 and 46; and the red branch by industry 11, followed by 15, 18 and 36.

figure h

Network of industries, 2019

Finally, in the 2019 network, industry 10 has priority in the blue branch; in the green branch, industry 3 has priority over 45 and 46; and in the red branch, industries 11, 36, 18 and 38 lead all the remaining industries.

4 Concluding remarks

In evaluating the key industries grouped into clusters and the basic economic structures, the PCA possesses distinct advantages over the standard BL and FL methods. The PCA, a mathematically rigorous and parsimonious technique, enables more efficient utilization of the input–output data. As a result, besides ranking industries as in the standard linkage methods, the PCA further refines this ranking by placing the industries into distinct and well-defined clusters. In so doing, the PCA extends the identification of key industries in new directions. Thus, starting with clusters and going on to dendrograms and networks, we identify the relative importance of industries and their connections with each other within the same cluster.

In experimenting with the data from our four input–output tables of the US economy, we observed a close association between the leading PCs and the total linkages of industries, a result that encourages the application of the PCA to input–output data for the identification of key industries. The next step was to use the top two PCs, which are perpendicular to each other, meaning their correlation is zero, as the horizontal and vertical axes, respectively. In this representation, we grouped the data into three clusters for each of the four distant years of our study. The clustering of industries was based on the k-means and Silhouette procedures. It is interesting to note that the top two clusters include nearly the same industries over the years, and the very few that are not included stand as border cases. Moreover, the industries in the top two clusters do not differ from those derived from the backward and forward linkages. The salient feature of the PCA is the grouping of industries into clusters and dendrograms, while the presentation of networks reveals the interlinkages between industries within clusters as well as the hierarchical positions of clusters and their interconnections. Future research should focus on applying the above techniques to input–output data from many countries and years. In this respect, more industry-detailed input–output data would enhance our understanding of the interconnections of industries and of changes in the structure of economies. Such information is extremely useful in planning an effective industrial policy.

Availability of data and materials

The datasets for the present study are available electronically and publicly. Data on input–output tables are available from the website of the BEA.

The Leontief inverse of the input–output matrix accounts for both the direct and indirect interindustry linkages. The data were downloaded in March 2021.

In the literature, there are various ways to weigh the relative importance of each industry, for example, by industries' relative shares in final demand or total output, instead of the unweighted measure that we used. However, we do not expect any qualitative differences in our results (see Miller and Blair 2009 , ch. 12).

The clusters were determined using the R programming language.

The nomenclature of industries is in the Appendix 1 .

The characterization of an industry's relative importance by accounting for both the BL and FL can be found in Rasmussen ( 1956 ) and Hewings ( 1982 ), among others. The idea is that key industries are those whose links give them an above-average influence on the rest of the economy. Based on the hierarchies of backward and forward linkages, key industries are those whose BL and FL are both greater than one.

Alhawarat M, Hegazi M (2018) Revisiting K-means and topic modeling, a comparison study to cluster Arabic documents. IEEE Access 6:42740–42749


Bienenfeld M (1988) Regularity in price changes as an effect of changes in distribution. Camb J Econ 12:247–255. https://doi.org/10.1093/oxfordjournals.cje.a035057

Bozdogan H (1987) Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika 52(3):345–370

Chenery H, Watanabe T (1958) International comparison of the structure of production. Econometrica 26(4):487–521

Dunn JC (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3(3):32–57

Farné M, Vouldis A (2021) Banks’ business models in the euro area: a cluster analysis in high dimensions. Ann Oper Res. https://doi.org/10.1007/s10479-021-04045-

Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17(2):107–145

Hewings G (1982) The empirical identification of key sectors in an economy: a regional perspective. Dev Econ 20(2):173–195

Iliadi F, Mariolis T, Soklis G, Tsoulfidis L (2014) Bienenfeld’s approximation of production prices and eigenvalue distribution: further evidence from five European economies. Contrib Polit Econ 33(1):35–54

Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31(8):651–666

Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs


Kass RE, Raftery AE (1995) Bayes factors. J Amer Stat Assoc 90:773–795

Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York


Mariolis T, Tsoulfidis L (2018) Less is more: capital theory and almost irregular uncontrollable actual economies. Contrib Polit Econ 37(1):65–88

McLachlan GJ, Basford KE (1988) Mixture to models: inference and applications clustering. Marcel Dekker, New York

Meyer C (2002) Matrix analysis and applied linear algebra. Society for Industrial and Applied Mathematics, New York

Miller R, Blair P (2009) Input-output analysis: foundations and extensions. Cambridge University Press, New York

Plerou V, Gopikrishnan P, Rosenow B, Amaral L, Guhr T, Stanley E (2002) Random matrix approach to cross correlations in financial data. Phys Rev E Stat Nonlin Soft Matter Phys 65(6):1–18

Ramos C, Moreno B (2013) Characterization of Spanish economic sectors from an economic and environmental perspective: evolution and forecast of greenhouse gas emissions. Reg Sect Econ Stud 13(2):117–134

Rasmussen P (1956) Studies in intersectoral relations. North-Holland, Amsterdam

Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

Samuelson PA (1962) Parable and realism in capital theory: The surrogate production function. Rev Econ Stud 29(3):193–206

Sinaga KP, Yang MS (2020) Unsupervised K-means clustering algorithm. IEEE Access 8:80716–80727

Tsoulfidis L (2021) Capital theory and political economy: prices, income distribution and stability. Routledge, London

Tsoulfidis L (2022) Capital theory debates: new developments and direction. Invest Econ 81(318):3–36. https://doi.org/10.22201/fe.01851667p.2022.319.79901

Vom Hofe R, Bhatta SD (2007) Method for identifying local and domestic industrial clusters using interregional commodity trade data. Ind Geogr 4(2):1–27

Yang M-S, Chang-Chien S-J, Nataliani Y (2018) A fully-unsupervised possibilistic C-means clustering algorithm. IEEE Access 6:78308–78320

Yu J, Chaomurilige C, Yang M-S (2018) On convergence and parameter selection of the EM and DA-EM algorithms for Gaussian mixtures. Pattern Recognit 77:188–203

Zhu J, Jiang Z, Evangelidis GD, Zhang C, Pang S, Li Z (2019) Efficient registration of multi-view point sets by K-means clustering. Inf Sci 488:205–218

Download references

Acknowledgements

Not applicable.

We thank the two reviewers of this journal for their constructive comments and advice, which helped us present our arguments more clearly. The usual caveat applies.

The authors have not received funding from any sources for this research.

Author information

Authors and affiliations

Department of Economics, University of Macedonia, Thessaloniki, Greece

Lefteris Tsoulfidis & Ioannis Athanasiadis


Contributions

Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Lefteris Tsoulfidis .

Ethics declarations

Ethics approval and consent to participate; Consent for publication; Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Nomenclature of industries

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Tsoulfidis, L., Athanasiadis, I. A new method of identifying key industries: a principal component analysis. Economic Structures 11 , 2 (2022). https://doi.org/10.1186/s40008-022-00261-z

Download citation

Received : 08 November 2021

Revised : 22 January 2022

Accepted : 05 February 2022

Published : 02 March 2022

DOI : https://doi.org/10.1186/s40008-022-00261-z


JEL classification

  • Principal components
  • Structural change
  • Dimensionality reduction


American Accounting Association

Principal Component Analysis and Factor Analysis in Accounting Research


We thank Ed deHaan (editor), two anonymous reviewers, and the technical support teams from SAS and Stata for their assistance with our discussion and analysis.

Kristian D. Allee, University of Arkansas, Walton College of Business, William Dillard Department of Accounting, Fayetteville, AR, USA; Chuong Do, University of Nevada, Reno, College of Business, Department of Accounting, Reno, NV, USA; Fellipe G. Raymundo, University of Arkansas, Walton College of Business, William Dillard Department of Accounting, Fayetteville, AR, USA.

Supplemental materials can be accessed by clicking the links in Appendix D.

Editor's note: Accepted by Ed deHaan.


Kristian D. Allee , Chuong Do , Fellipe G. Raymundo; Principal Component Analysis and Factor Analysis in Accounting Research. Journal of Financial Reporting 1 September 2022; 7 (2): 1–39. https://doi.org/10.2308/JFR-2021-005


Principal component analysis (PCA) and factor analysis (FA) are both variable reduction techniques used to represent a set of observed variables in terms of a smaller number of variables. While both PCA and FA are similar along several dimensions (e.g., extraction of common components/factors), researchers often fail to recognize that these techniques are designed to achieve different goals and can produce significantly different results. We conduct a comprehensive review of the use of PCA and FA in accounting research. We offer simple guidelines on how to program PCA and FA in SAS/Stata and emphasize the importance of the implementation techniques as well as the disclosure choices made when utilizing these methodologies. Furthermore, we present a few intuitive, practical examples highlighting the unique differences between the techniques. Finally, we provide some recommendations, observations, notes, and citations for researchers considering using these procedures in future research.

Data Availability : The data used in this paper are publicly available from the sources indicated in the text.

JEL Classifications:  C38; C88; M41.


Philos Trans A Math Phys Eng Sci

Principal component analysis: a review and recent developments

Ian T. Jolliffe

1 College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter, UK

Jorge Cadima

2 Secção de Matemática (DCEB), Instituto Superior de Agronomia, Universidade de Lisboa, Tapada da Ajuda, Lisboa 1340-017, Portugal

3 Centro de Estatística e Aplicações da Universidade de Lisboa (CEAUL), Lisboa, Portugal

Associated Data

The fossil teeth data are available from I.T.J. The atmospheric science data were taken from the publicly accessible NCEP/NCAR reanalysis database (see [ 36 ] for details).

Large datasets are increasingly common and are often difficult to interpret. Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance. Finding such new variables, the principal components, reduces to solving an eigenvalue/eigenvector problem, and the new variables are defined by the dataset at hand, not a priori , hence making PCA an adaptive data analysis technique. It is adaptive in another sense too, since variants of the technique have been developed that are tailored to various different data types and structures. This article will begin by introducing the basic ideas of PCA, discussing what it can and cannot do. It will then describe some variants of PCA and their application.

1. Introduction

Large datasets are increasingly widespread in many disciplines. In order to interpret such datasets, methods are required to drastically reduce their dimensionality in an interpretable way, such that most of the information in the data is preserved. Many techniques have been developed for this purpose, but principal component analysis (PCA) is one of the oldest and most widely used. Its idea is simple—reduce the dimensionality of a dataset, while preserving as much ‘variability’ (i.e. statistical information) as possible.

Although it is used, and has sometimes been reinvented, in many different disciplines it is, at heart, a statistical technique and hence much of its development has been by statisticians.

This means that ‘preserving as much variability as possible’ translates into finding new variables that are linear functions of those in the original dataset, that successively maximize variance and that are uncorrelated with each other. Finding such new variables, the principal components (PCs), reduces to solving an eigenvalue/eigenvector problem. The earliest literature on PCA dates from Pearson [ 1 ] and Hotelling [ 2 ], but it was not until electronic computers became widely available decades later that it was computationally feasible to use it on datasets that were not trivially small. Since then its use has burgeoned and a large number of variants have been developed in many different disciplines. Substantial books have been written on the subject [ 3 , 4 ] and there are even whole books on variants of PCA for special types of data [ 5 , 6 ]. In § 2 , the formal definition of PCA will be given, in a standard context, together with a derivation showing that it can be obtained as the solution to an eigenproblem or, alternatively, from the singular value decomposition (SVD) of the (centred) data matrix. PCA can be based on either the covariance matrix or the correlation matrix. The choice between these analyses will be discussed. In either case, the new variables (the PCs) depend on the dataset, rather than being pre-defined basis functions, and so are adaptive in the broad sense. The main uses of PCA are descriptive, rather than inferential; an example will illustrate this.

Although for inferential purposes a multivariate normal (Gaussian) distribution of the dataset is usually assumed, PCA as a descriptive tool needs no distributional assumptions and, as such, is very much an adaptive exploratory method which can be used on numerical data of various types. Indeed, many adaptations of the basic methodology for different data types and structures have been developed, two of which will be described in § 3 a,d. Some techniques give simplified versions of PCs, in order to aid interpretation. Two of these are briefly described in § 3 b, which also includes an example of PCA, together with a simplified version, in atmospheric science, illustrating the adaptive potential of PCA in a specific context. Section 3 c discusses one of the extensions of PCA that has been most active in recent years, namely robust PCA (RPCA). The explosion in very large datasets in areas such as image analysis or the analysis of Web data has brought about important methodological advances in data analysis which often find their roots in PCA. Each of § 3 a–d gives references to recent work. Some concluding remarks, emphasizing the breadth of application of PCA and its numerous adaptations, are made in § 4 .

2. The basic method

(a) Principal component analysis as an exploratory tool for data analysis

Consider an n × p data matrix X , whose n rows are observations of p numerical variables. Seeking the linear combination X a of the columns of X with maximal sample variance, subject to the normalization constraint a ′ a =1, leads via a Lagrange multiplier argument to the eigen-equation

S a = λ a .  ( 2.1 )

Thus, a must be a (unit-norm) eigenvector, and λ the corresponding eigenvalue, of the covariance matrix S . In particular, we are interested in the largest eigenvalue, λ 1 (and corresponding eigenvector a 1 ), since the eigenvalues are the variances of the linear combinations defined by the corresponding eigenvector a : var( X a )= a ′ S a = λ a ′ a = λ . Equation ( 2.1 ) remains valid if the eigenvectors are multiplied by −1, and so the signs of all loadings (and scores) are arbitrary and only their relative magnitudes and sign patterns are meaningful.

More generally, the full set of solutions to ( 2.1 ) provides p eigenvectors a 1 ,…, a p of S , with corresponding eigenvalues ordered so that λ 1 ≥ λ 2 ≥⋯≥ λ p ≥0. The successive linear combinations X a k maximize variance subject to being uncorrelated with all previous ones.

It is these linear combinations X a k that are called the principal components of the dataset, although some authors confusingly also use the term ‘principal components’ when referring to the eigenvectors a k . In standard PCA terminology, the elements of the eigenvectors a k are commonly called the PC loadings , whereas the elements of the linear combinations X a k are called the PC scores , as they are the values that each individual would score on a given PC.
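These definitions are easy to make concrete numerically. The following is a minimal sketch in Python with NumPy (the paper's own examples use R's prcomp; the synthetic data here are purely illustrative): the loadings are the eigenvectors of the sample covariance matrix, the scores are the values of each individual on each PC, and the variance of the k-th PC equals its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4)) @ rng.normal(size=(4, 4))  # 30 observations, 4 correlated variables

Xc = X - X.mean(axis=0)                 # column-centred data matrix X*
S = Xc.T @ Xc / (X.shape[0] - 1)        # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # reorder so that lambda_1 >= ... >= lambda_p
eigvals, A = eigvals[order], eigvecs[:, order]   # columns of A are the PC loadings

scores = Xc @ A                         # PC scores: each individual's value on each PC

# The variance of the k-th PC equals its eigenvalue lambda_k
print(np.allclose(scores.var(axis=0, ddof=1), eigvals))  # True
```

Note that the signs of each column of A (and hence of the scores) are arbitrary, exactly as stated above.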

In what follows, each variable x j is centred by subtracting its mean from every observation, giving the centred variable x * j whose n values are x ij − x̄ j . In terms of the centred data, the covariance matrix is S = X * ′ X * /( n −1).

Denoting by X * the n × p matrix whose columns are the centred variables x * j , any n × p matrix Y of rank r can be written, using its singular value decomposition (SVD), as

Y = ULA ′,  ( 2.2 )

where U , A are n × r and p × r matrices with orthonormal columns ( U ′ U = I r = A ′ A , with I r the r × r identity matrix) and L is an r × r diagonal matrix. The columns of A are called the right singular vectors of Y and are the eigenvectors of the p × p matrix Y ′ Y associated with its non-zero eigenvalues. The columns of U are called the left singular vectors of Y and are the eigenvectors of the n × n matrix Y Y ′ that correspond to its non-zero eigenvalues. The diagonal elements of matrix L are called the singular values of Y and are the non-negative square roots of the (common) non-zero eigenvalues of both matrix Y ′ Y and matrix Y Y ′. We assume that the diagonal elements of L are in decreasing order, and this uniquely defines the order of the columns of U and A (except for the case of equal singular values [ 4 ]). Hence, taking Y = X *, the right singular vectors of the column-centred data matrix X * are the vectors a k of PC loadings. Due to the orthogonality of the columns of A , the columns of the matrix product X * A = ULA ′ A = UL are the PCs of X *. The variances of these PCs are given by the squares of the singular values of X *, divided by n −1. Equivalently, and given ( 2.2 ) and the above properties,

( n −1) S = X * ′ X * =( ULA ′)′( ULA ′)= AL U ′ UL A ′= A L 2 A ′,  ( 2.4 )

where L 2 is the diagonal matrix with the squared singular values (i.e. the eigenvalues of ( n −1) S ). Equation ( 2.4 ) gives the spectral decomposition , or eigendecomposition , of matrix ( n −1) S . Hence, PCA is equivalent to an SVD of the column-centred data matrix X *.
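The equivalence between the eigendecomposition and SVD routes can be verified numerically. A hedged NumPy sketch (synthetic data, illustrative only; note that np.linalg.eigh returns eigenvalues in ascending order, so they are reversed to match the PCA convention):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
Xc = X - X.mean(axis=0)
n = X.shape[0]

# Route 1: eigendecomposition of the covariance matrix S
lam, A = np.linalg.eigh(Xc.T @ Xc / (n - 1))
lam, A = lam[::-1], A[:, ::-1]

# Route 2: SVD of the column-centred data matrix X*
U, L, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values, divided by n-1, are the eigenvalues of S ...
assert np.allclose(L**2 / (n - 1), lam)
# ... and the right singular vectors are the loadings, up to sign
assert np.allclose(np.abs(Vt.T), np.abs(A))
# PC scores via either route agree up to sign as well: X*A = UL
assert np.allclose(np.abs(Xc @ A), np.abs(U * L))
```

The absolute values in the comparisons reflect the sign arbitrariness of eigenvectors noted in § 2 a.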

The properties of an SVD imply interesting geometric interpretations of a PCA. Given any rank r matrix Y of size n × p , the matrix Y q of the same size, but of rank q < r , whose elements minimize the sum of squared differences with corresponding elements of Y is given [ 7 ] by

Y q = U q L q A q ′,  ( 2.5 )

where L q is the q × q diagonal matrix with the first (largest) q diagonal elements of L and U q , A q are the n × q and p × q matrices obtained by retaining the q corresponding columns in U and A .

Taking Y = X *, this means that the first q PCs provide the best rank q least-squares approximation X * q = U q L q A q ′ to the centred data: the n × q matrix U q L q of PC scores gives the optimal q -dimensional representation of the n observations.

The quality of any q -dimensional approximation can be measured by the variability associated with the set of retained PCs. In fact, the sum of variances of the p original variables is the trace (sum of diagonal elements) of the covariance matrix S . Using simple matrix theory results it is straightforward to show that this value is also the sum of the variances of all p PCs. Hence, the standard measure of quality of a given PC is the proportion of total variance that it accounts for,

π k = λ k /( λ 1 + λ 2 +⋯+ λ p ).  ( 2.6 )

It is common practice to use some predefined percentage of total variance explained to decide how many PCs should be retained (70% of total variability is a common, if subjective, cut-off point), although the requirements of graphical representation often lead to the use of just the first two or three PCs. Even in such situations, the percentage of total variance accounted for is a fundamental tool to assess the quality of these low-dimensional graphical representations of the dataset. The emphasis in PCA is almost always on the first few PCs, but there are circumstances in which the last few may be of interest, such as in outlier detection [ 4 ] or some applications of image analysis (see § 3 c).
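The cumulative-proportion rule described above can be sketched as follows. The helper function n_components_for and the eigenvalues are hypothetical, chosen only to illustrate the common (if subjective) 70% cut-off:

```python
import numpy as np

def n_components_for(explained, threshold=0.70):
    """Smallest q such that the first q PCs account for at least `threshold` of total variance."""
    cum = np.cumsum(explained) / np.sum(explained)
    return int(np.searchsorted(cum, threshold) + 1)

# Hypothetical eigenvalues (PC variances) of some covariance matrix
lam = np.array([4.0, 2.0, 1.0, 0.5, 0.3, 0.2])
prop = lam / lam.sum()              # proportion of total variance per PC
print(prop[:2].sum())               # first two PCs account for 0.75
print(n_components_for(lam))        # 2 -- they already exceed the 70% cut-off
```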

PCs can also be introduced as the optimal solutions to numerous other problems. Optimality criteria for PCA are discussed in detail in numerous sources (see [ 4 , 8 , 9 ], among others). McCabe [ 10 ] uses some of these criteria to select optimal subsets of the original variables, which he calls principal variables . This is a different, computationally more complex, problem [ 11 ].

(b) Example: fossil teeth data

PCA has been applied and found useful in very many disciplines. The two examples explored here and in § 3 b are very different in nature. The first examines a dataset consisting of nine measurements on 88 fossil teeth from the early mammalian insectivore Kuehneotherium, while the second, in § 3 b, is from atmospheric science.

Kuehneotherium is one of the earliest mammals and remains have been found during quarrying of limestone in South Wales, UK [ 12 ]. The bones and teeth were washed into fissures in the rock, about 200 million years ago, and all the lower molar teeth used in this analysis are from a single fissure. However, it looked possible that there were teeth from more than one species of Kuehneotherium in the sample.

Of the nine variables, three measure aspects of the length of a tooth, while the other six are measurements related to height and width. A PCA was performed using the prcomp command of the R statistical software [ 13 ]. The first two PCs account for 78.8% and 16.7%, respectively, of the total variation in the dataset, so the two-dimensional scatter-plot of the 88 teeth given by figure 1 is a very good approximation to the original scatter-plot in nine-dimensional space. It is, by definition, the best variance-preserving two-dimensional plot of the data, representing over 95% of total variation. All of the loadings in the first PC have the same sign, so it is a weighted average of all variables, representing ‘overall size’. In figure 1 , large teeth are on the left and small teeth on the right. The second PC has negative loadings for the three length variables and positive loadings for the other six variables, representing an aspect of the ‘shape’ of teeth. Fossils near the top of figure 1 have smaller lengths, relative to their heights and widths, than those towards the bottom. The relatively compact cluster of points in the bottom half of figure 1 is thought to correspond to a species of Kuehneotherium, while the broader group at the top cannot be assigned to Kuehneotherium, but to some related, but as yet unidentified, animal.

Figure 1. The two-dimensional principal subspace for the fossil teeth data. The coordinates in either or both PCs may switch signs when different software is used.

(c) Some key issues

(i) Covariance and correlation matrix principal component analysis

So far, PCs have been presented as linear combinations of the (centred) original variables. However, the properties of PCA have some undesirable features when these variables have different units of measurement. While there is nothing inherently wrong, from a strictly mathematical point of view, with linear combinations of variables with different units of measurement (their use is widespread in, for instance, linear regression), the fact that PCA is defined by a criterion (variance) that depends on units of measurement implies that PCs based on the covariance matrix S will change if the units of measurement on one or more of the variables change (unless all p variables undergo a common change of scale, in which case the new covariance matrix is merely a scalar multiple of the old one, hence with the same eigenvectors and the same proportion of total variance explained by each PC). To overcome this undesirable feature, it is common practice to begin by standardizing the variables. Each data value x ij is both centred and divided by the standard deviation s j of the n observations of variable j ,

z ij =( x ij − x̄ j )/ s j .  ( 2.7 )

Thus, the initial data matrix X is replaced with the standardized data matrix Z , whose j th column is vector z j with the n standardized observations of variable j ( 2.7 ). Standardization is useful because most changes of scale are linear transformations of the data, which share the same set of standardized data values.

The covariance matrix of the standardized data Z is the matrix of correlations between the original variables, that is, the correlation matrix R . A PCA of Z is therefore referred to as a correlation matrix PCA.

Correlation matrix PCs are invariant to linear changes in units of measurement and are therefore the appropriate choice for datasets where different changes of scale are conceivable for each variable. Some statistical software assumes by default that a PCA means a correlation matrix PCA and, in some cases, the normalization used for the vectors of loadings a k of correlation matrix PCs is not the standard a ′ k a k =1. In a correlation matrix PCA, the coefficient of correlation between the j th variable and the k th PC is given by (see [ 4 ])

r jk =√ λ k a jk .  ( 2.8 )

In the fossil teeth data of § 2 b, all nine measurements are in the same units, so a covariance matrix PCA makes sense. A correlation matrix PCA produces similar results, since the variances of the original variables do not differ very much. The first two correlation matrix PCs account for 93.7% of total variance. For other datasets, differences can be more substantial.
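The effect of standardization is easy to demonstrate. A NumPy sketch with three hypothetical variables on wildly different scales (illustrative only): the covariance matrix PCA is dominated by the large-variance variable, whereas the correlation matrix PCA treats all variables on an equal footing, with eigenvalues summing to p.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3)) * np.array([1.0, 10.0, 100.0])  # very different scales

Xc = X - X.mean(axis=0)
Z = Xc / X.std(axis=0, ddof=1)      # standardized data

cov_lam = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
cor_lam = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]

# Covariance PCA is dominated by the large-variance variable ...
print(cov_lam[0] / cov_lam.sum())        # close to 1
# ... whereas the correlation-matrix eigenvalues sum to p = 3
print(np.round(cor_lam.sum(), 8))        # 3.0
```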

(ii) Biplots

One of the most informative graphical representations of a multivariate dataset is via a biplot [ 14 ], which is fundamentally connected to the SVD of a relevant data matrix, and therefore to PCA. A rank q approximation X * q of the full column-centred data matrix X *, defined by ( 2.5 ), is written as X * q = GH ′, where G = U q and H = A q L q (although other options are possible, see [ 4 ]). The n rows g i of matrix G define graphical markers for each individual, which are usually represented by points. The p rows h j of matrix H define markers for each variable and are usually represented by vectors. The properties of the biplot are best discussed assuming that q = p , although the biplot is defined on a low-rank approximation (usually q =2), enabling a graphical representation of the markers. When q = p the biplot has the following properties:

  • — The cosine of the angle between any two vectors representing variables is the coefficient of correlation between those variables; this is a direct result of the fact that the matrix of inner products between those markers is HH ′= AL 2 A ′=( n −1) S ( 2.4 ), so that inner products between vectors are proportional to covariances (variances for a common vector).
  • — Similarly, the cosine of the angle between any vector representing a variable and the axis representing a given PC is the coefficient of correlation between those two variables.

  • — The orthogonal projection of an individual marker g i onto the direction of a variable marker h j recovers the centred value of individual i on variable j , since the matrix of inner products between row and column markers is GH ′= X * q .

  • — The Euclidean distance between the markers for individuals i and i ′ is proportional to the Mahalanobis distance between them (see [ 4 ] for more details).

As stated above, these results are only exact if all q = p dimensions are used. For q < p , the results are merely approximate and the overall quality of such approximations can be measured by the percentage of variance explained by the q largest variance PCs, which were used to build the marker matrices G and H .
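A minimal sketch of the marker construction, taking q = p so that the biplot properties above hold exactly (NumPy, synthetic data; in practice q =2 would be used for plotting and the properties hold only approximately):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)
n = X.shape[0]

U, L, Vt = np.linalg.svd(Xc, full_matrices=False)
q = 4  # with q = p the biplot properties hold exactly

G = U[:, :q]                     # row markers, one point per individual
H = Vt.T[:, :q] * L[:q]          # column markers, one vector per variable

# Inner products between variable markers reproduce (n-1) times the covariances
assert np.allclose(H @ H.T, (n - 1) * np.cov(Xc, rowvar=False))
# Inner products between row and column markers reproduce the centred data
assert np.allclose(G @ H.T, Xc)
```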

Figure 2 gives the biplot for the correlation matrix PCA of the fossil teeth data of § 2 b. The variable markers are displayed as arrows and the tooth markers as numbers. The group of three nearly horizontal and very tightly knit variable markers for two width variables and one height variable, WIDTH , HTMDT and TRIWIDTH , suggests a group of highly correlated variables, which are also strongly correlated with the first PC (represented by the horizontal axis). The very high proportion of variability explained by the two-dimensional principal subspace provides solid grounds for these conclusions. In fact, the smallest of the three true coefficients of correlation between these three variables is 0.944 ( HTMDT and TRIWIDTH ), and the smallest magnitude correlation between PC1 and any of these variables is 0.960 ( TRIWIDTH ). The sign difference in PC2 loadings between the three length variables (towards the bottom left of the plot) and the other variables is clearly visible. Projecting the marker for individual 58 onto the positive directions of all variable markers suggests that fossil tooth 58 (on the left of the biplot) is a large tooth. Inspection of the data matrix confirms that it is the largest individual on six of the nine variables, and close to largest on the remaining three. Likewise, individuals 85–88 (on the right) are small-sized teeth. Individuals whose markers are close to the origin have values close to the mean for all variables.

Figure 2. Biplot for the fossil teeth data (correlation matrix PCA), obtained using R's biplot command.

(iii) Centrings

As was seen in § 2 , PCA amounts to an SVD of a column-centred data matrix. In some applications [ 15 ], centring the columns of the data matrix may be considered inappropriate. In such situations, it may be preferred to avoid any pre-processing of the data and to subject the uncentred data matrix to an SVD or, equivalently, to carry out the eigendecomposition of the matrix of non-centred second moments, T , whose eigenvectors define linear combinations of the uncentred variables. This is often referred to as an uncentred PCA and there has been an unfortunate tendency in some fields to equate the name SVD only with this uncentred version of PCA.


In some applications, row centrings, or both row- and column-centring (known as double-centring) of the data matrix, have been considered appropriate. The SVDs of such matrices give rise to row-centred and doubly centred PCA , respectively.
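The various centrings amount to simple pre-processing steps before the SVD. A NumPy sketch (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 5)) + 7.0   # data with a non-zero origin

col_centred = X - X.mean(axis=0)                         # standard (column-centred) PCA
row_centred = X - X.mean(axis=1, keepdims=True)          # row-centred PCA
double_centred = (X - X.mean(axis=0)
                    - X.mean(axis=1, keepdims=True)
                    + X.mean())                          # doubly centred PCA
uncentred = X                                            # uncentred PCA: SVD of X itself

# Double-centring removes both row and column means
assert np.allclose(double_centred.mean(axis=0), 0)
assert np.allclose(double_centred.mean(axis=1), 0)
```

The SVD of each of these matrices yields the corresponding PCA variant.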

(iv) When n < p

Datasets where there are fewer observed entities than variables ( n < p ) are becoming increasingly frequent, thanks to the growing ease of observing variables, together with the high costs of repeating observations in some contexts (such as microarrays [ 16 ]). For example, [ 17 ] has an example in genomics in which n =59 and p =21 225.

When n < p , the p × p covariance matrix S has rank at most n −1 after centring, so at most n −1 PCs have non-zero variance. The SVD of the n × p matrix X * remains a computationally efficient route to these PCs, since it avoids forming the very large matrix S .

Recent research (e.g. [ 18 , 19 ]) has examined how well underlying ‘population’ PCs are estimated by the sample PCs in the case where n ≪ p , and it is shown that in some circumstances there is little resemblance between sample and population PCs. However, the results are typically based on a model for the data which has a very small number of structured PCs, and very many noise dimensions, and which has some links with recent work in RPCA (see § 3 c).
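The rank deficiency when n < p is visible directly in the singular values of the centred data matrix. A small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 10, 200                       # far fewer observations than variables
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

# The p x p covariance matrix has rank at most n-1, so at most n-1 PCs
# carry non-zero variance; the SVD of the n x p matrix finds them directly
s = np.linalg.svd(Xc, compute_uv=False)
print(len(s))                        # min(n, p) = 10 singular values
print(np.sum(s > 1e-10))             # 9 = n-1 of them are non-zero
```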

3. Adaptations of principal component analysis

The basic idea of PCA, leading to low-dimensional representations of large datasets in an adaptive and insightful way, is simple. However, the subsections in § 2 have shown a number of subtleties that add some complexity. Going further, there are many ways to adapt PCA to achieve modified goals or to analyse data of different types. Because PCA is used in a large number of areas, research into modifications and adaptations is spread over literatures from many disciplines. Four such adaptations, chosen fairly arbitrarily from the many that exist, namely functional PCA, modifications of PCA to simplify interpretations, RPCA and symbolic data PCA, are described in the following subsections. Other adaptations are briefly mentioned in § 4 .

(a) Functional principal component analysis

In some applications, such as chemical spectroscopy, observations are functional in nature, changing with some continuous variable which, for simplicity, we assume is time. The dataset is then a collection of n functions x i ( t ).

The analogue of the covariance matrix is the covariance function S ( s , t )=Σ_{ i =1}^{ n } x * i ( s ) x * i ( t )/( n −1), where the x * i are the centred data functions.

The analogue of the eigen-equation ( 2.1 ) involves an integral transform , which reflects the functional nature of S ( s , t ) and of inner products

∫ S ( s , t ) a ( t ) d t = λ a ( s ).  ( 3.2 )

The eigenfunctions a ( t ) which are the analytic solutions of equation ( 3.2 ) cannot, in general, be determined. Ramsay & Silverman [ 20 ] discuss approximate solutions based on numerical integration. An alternative approach, which they explore in greater detail, involves the assumption that the curves x i ( t ) can be written as linear combinations of a set of G basis functions ϕ 1 ( t ),…, ϕ G ( t ), so that, for any data function i ,

x i ( t )=Σ_{ g =1}^{ G } c ig ϕ g ( t ).  ( 3.3 )

These basis functions can be chosen to reflect characteristics that are considered relevant in describing the observed functions. Thus, Fourier series functions may be chosen to describe periodic traits and splines for more general trends (B-splines are recommended). Other basis functions that have been used and can be considered are wavelets, exponential, power or polynomial bases. In theory, other bases, adapted to specific properties of a given set of observed functions, may be considered, although the computational problems that arise from any such choice must be kept in mind. The advantage of the basis function approach lies in the simplification of the expressions given previously. Denoting the n -dimensional vector of functions x i ( t ) as x ( t ), the G -dimensional vector of basis functions as ϕ ( t ) and the n × G matrix of coefficients c ij as C , the n data functions in equation ( 3.3 ) can be written as a single equation x ( t )= C ϕ ( t ). The eigenfunction a ( t ) can also be written in terms of the basis functions, with a ( t )= ϕ ( t )′ b for some G -dimensional vector of coefficients b =( b 1 ,…, b G ). Assuming furthermore that x ( t ) and ϕ ( t ) are centred, the covariance function at time ( s , t ) becomes

S ( s , t )= ϕ ( s )′ C ′ C ϕ ( t )/( n −1)  ( 3.4 )

and eigen-equation ( 3.2 ) becomes, after some algebraic manipulation (see [ 4 , 20 ] for details),

C ′ C W b /( n −1)= λ b ,  ( 3.5 )

where W is the G × G matrix of inner products of the basis functions, with elements w gh =∫ ϕ g ( t ) ϕ h ( t ) d t .

If the basis functions are orthonormal, W is the G × G identity matrix and we end up with a standard eigenvalue problem which provides the solutions a ( t )= ϕ ( t )′ b to equation ( 3.2 ).
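A sketch of the basis-expansion route with an orthonormal Fourier-type basis, for which the eigenproblem above is standard (NumPy; the basis, grid and coefficients are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0, 1, 200, endpoint=False)   # uniform grid on [0, 1)

# Orthonormal Fourier-type basis on [0, 1] (G = 4 basis functions)
phi = np.vstack([np.ones_like(t),
                 np.sqrt(2) * np.cos(2 * np.pi * t),
                 np.sqrt(2) * np.sin(2 * np.pi * t),
                 np.sqrt(2) * np.cos(4 * np.pi * t)])

n, G = 30, 4
C = rng.normal(size=(n, G)) * np.array([3.0, 2.0, 1.0, 0.5])  # basis coefficients
C = C - C.mean(axis=0)                                        # centred curves
curves = C @ phi                        # x_i(t) = sum_g c_ig phi_g(t)

# Orthonormal basis => W = I, so the eigenproblem is a standard one for C'C/(n-1)
lam, B = np.linalg.eigh(C.T @ C / (n - 1))
lam, B = lam[::-1], B[:, ::-1]
eigenfunctions = B.T @ phi              # a(t) = phi(t)'b, one row per eigenfunction

# Eigenfunctions are orthonormal with respect to integration over t
dt = t[1] - t[0]
gram = eigenfunctions @ eigenfunctions.T * dt
print(np.round(gram, 2))                # approximately the identity matrix
```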

Ramsay & Silverman [ 20 ] further explore methods in which data functions x i ( t ) are viewed as solutions to differential equations, an approach which they call principal differential analysis , in order to highlight its close connections with PCA.

Research on functional PCA has continued apace since the publication of Ramsay and Silverman’s comprehensive text. Often this research is parallel to, or extends, similar ideas for data of non-functional form. For example, deciding how many PCs to retain is an important topic. A large number of suggestions have been made for doing so [ 4 ] and many selection criteria are based on intuitive or descriptive ideas, such as the obvious ‘proportion of total variance’. Other approaches are based on models for PCs. The problem of ‘how many functional PCs?’ is addressed in [ 22 ] using a model-based approach and criteria based on information theory.

As with other statistical techniques, it is possible that a few outlying observations may have a disproportionate effect on the results of a PCA. Numerous suggestions have been made for making PCA more robust to the presence of outliers for the usual data structure (see [ 4 ] and also § 3 c). One suggestion, using so-called S-estimators, is extended to functional PCA in [ 23 ].

Sometimes, as well as correlations between the p variables, there is a dependence structure between the n observations. A ‘dynamic’ version of functional PCA is proposed in [ 24 ], which is relevant when there are correlations between the observed curves, as well as the obvious correlation within the curves. It is based on an idea first suggested by Brillinger [ 25 ] for vector time series and uses frequency domain analysis.

(b) Simplified principal components

PCA gives the best possible representation of a p -dimensional dataset in q dimensions ( q < p ) in the sense of maximizing variance in q dimensions. A disadvantage is, however, that the new variables that it defines are usually linear functions of all p original variables. Although it was possible to interpret the first two PCs in the fossil teeth example, it is often the case for larger p that many variables have non-trivial coefficients in the first few components, making the components difficult to interpret. A number of adaptations of PCA have been suggested that try to make interpretation of the q dimensions simpler, while minimizing the loss of variance due to not using the PCs themselves. There is a trade-off between interpretability and variance. Two such classes of adaptations are briefly described here.

The first class borrows the idea of rotation from factor analysis: the q retained components are rotated within the subspace they span, according to a simplicity criterion such as varimax, which encourages each rotated component to have loadings that are either large in magnitude or close to zero.

Rotation can considerably simplify interpretation and, when viewed with respect to the q -dimensional space that is rotated, no variance is lost, as the sum of variances of the q rotated components is the same as for the unrotated components. What is lost is the successive maximization of the unrotated PCs, so that the total variance of the q components is more evenly distributed between components after rotation.

Drawbacks of rotation include the need to choose from the plethora of possible rotation criteria, though this choice often makes less difference than the choice of how many components to rotate. The rotated components can look quite different if q is increased by 1, whereas the successively defined nature of unrotated PCs means that this does not happen.
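As an illustration of rotation, here is a sketch of the classical varimax criterion (one of the many rotation criteria alluded to above), implemented via the standard SVD-based iteration; the toy loading matrix is hypothetical:

```python
import numpy as np

def varimax(loadings, tol=1e-10, max_iter=1000):
    """Rotate a p x q loading matrix to (approximately) maximize the varimax criterion."""
    p, q = loadings.shape
    R = np.eye(q)
    d_old = 0.0
    for _ in range(max_iter):
        B = loadings @ R
        # SVD step of Kaiser's varimax iteration
        U, s, Vt = np.linalg.svd(
            loadings.T @ (B**3 - B @ np.diag((B**2).sum(axis=0)) / p))
        R = U @ Vt
        d_new = s.sum()
        if d_new < d_old * (1 + tol):
            break
        d_old = d_new
    return loadings @ R, R

# Toy loadings: two components, each a mix of all four variables
L = np.array([[0.7,  0.5],
              [0.7,  0.4],
              [0.5, -0.6],
              [0.4, -0.7]])
L_rot, R = varimax(L)

# Rotation is orthogonal, so the total squared loading (variance) is preserved,
# exactly as stated above: only its distribution between components changes
assert np.allclose(R.T @ R, np.eye(2))
assert np.isclose((L_rot**2).sum(), (L**2).sum())
```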

The second class builds simplicity directly into the variance-maximization problem, adding constraints or penalties that force some loadings towards, or exactly to, zero.

A difference between the rotation and constraint approaches is that the latter has the advantage for interpretation of driving some loadings in the linear functions exactly to zero, whereas rotation usually does not. Adaptations of PCA in which many coefficients are exactly zero are generally known as sparse versions of PCA, and there has been a substantial amount of research on such PCs in recent years. A good review of such work can be found in Hastie et al. [ 28 ] (see also § 3 c).

An early example of the constraint approach is SCoTLASS, which maximizes the variance a ′ k S a k subject to the usual normalization a ′ k a k =1, together with an additional LASSO-type constraint Σ_{ j =1}^{ p }| a jk |≤ t for some tuning constant t . For sufficiently small t , some loadings are driven exactly to zero.

The original SCoTLASS optimization problem is non-convex and is also not solvable by simple iterative algorithms, although it is possible to re-express SCoTLASS as an equivalent, though still non-convex, optimization problem for which simple algorithms can be used [ 30 ]. Another approach, due to d’Aspremont et al. [ 31 ], reformulates SCoTLASS in a more complex manner, but then drops one of the constraints in this new formulation in order to make the problem convex.

Achieving sparsity is important for large p and particularly when n ≪ p . A number of authors have investigated versions of sparse PCA for this situation using models for the data in which the vast majority of the variables are completely unstructured noise [ 18 , 19 , 32 ]. These papers and others suggest and investigate the properties of algorithms for estimating sparse PCs when data are generated from their models. Lee et al. [ 17 ] use a different type of model, this time a random effects model for PC loadings, to derive an alternative penalty function to that used by SCoTLASS, giving another sparse PCA method. Additionally incorporating shrinkage of eigenvalues leads to yet another method, deemed super-sparse PCA in [ 17 ]. Comparisons are given between their methods, SCoTLASS and the elastic net [ 28 ] for simulated data and a large genomic example.

(i) Example: sea-level pressure data

One discipline in which PCA has been widely used is atmospheric science. It was first suggested in that field by Obukhov [ 33 ] and Lorenz [ 34 ] and, uniquely to that discipline, it is usually known as empirical orthogonal function (EOF) analysis. The book by Preisendorfer & Mobley [ 35 ] discusses many aspects of PCA in the context of meteorology and oceanography.

The format of the data in atmospheric science is different from that of most other disciplines. This example is taken from [ 36 ]. The data consist of measurements of winter (December, January and February) monthly mean sea-level pressure (SLP) over the Northern Hemisphere north of 20° N. The dataset is available on a 2.5°×2.5° regular grid and spans the period from January 1948 to December 2000. Some preprocessing is done to adjust for the annual cycle and the different areas covered by grid squares at different latitudes. In many atmospheric science examples, the variables are measurements at grid points, and the loadings, known as EOFs, are displayed as smooth spatial patterns, as in figure 3 for the first two correlation-based EOFs for the SLP data [ 36 ]. There are 1008 variables (grid-points) in this dataset, and the first two PCs account for 21% and 13% of the variation in these 1008 variables. The first EOF in figure 3 gives a pattern commonly known as the Arctic Oscillation (AO). It is a measure of the north–south pressure gradient in the Atlantic Ocean and, to a lesser extent, in the Pacific Ocean and is a major source of variation in weather patterns. The second EOF is dominated by variation in the Pacific Ocean. The PCs for examples of this type are time series, so the first PC, for example, will display which years have high values of the AO and which have low values.

Figure 3. ( a , b ) The first two correlation-based EOFs for the SLP data account for 21% and 13% of total variation. (Adapted from [ 36 ].)

Figure 4 shows simplified EOFs based on SCoTLASS [ 36 ]. The main difference from the EOFs in figure 3 is for the first EOF, which is now completely dominated by the north–south pressure gradient in the Atlantic (the North Atlantic Oscillation) with exactly zero loadings for many grid-points. The simplification is paid for by a reduction in percentage of variation explained for the corresponding simplified PC (17% compared with 21%). The second simplified PC is very similar to the original second EOF, also explaining 13% of variation.

Figure 4. ( a , b ) LASSO-based simplified EOFs for the SLP data. Grey areas are grid-points with exactly zero loadings. (Adapted from [ 36 ].)

(c) Robust principal component analysis

By its very nature, PCA is sensitive to the presence of outliers and therefore also to the presence of gross errors in the datasets. This has led to attempts to define robust variants of PCA and the expression RPCA has been used for different approaches to this problem. Early work by Huber [ 37 , 38 ] discussed robust alternatives to covariance or correlation matrices and ways in which they can be used to define robust PCs. This work was extended in [ 39 , 40 ]; see also [ 41 ].

The need for methods to deal with very large datasets in areas such as image processing, machine learning, bioinformatics or Web data analysis has generated a recent renewed interest in robust variants of PCA and has led to one of the most vigorous lines of research in PCA-related methods. A discussion of this issue can be found in [ 42 ]. Wright et al. [ 43 ] defined RPCA as a decomposition of an n × p data matrix X into a sum of two n × p components: a low-rank component L and a sparse component S . More precisely, a convex optimization problem was defined as identifying the matrix components of X = L + S that minimize a linear combination of two different norms of the components:

minimize ‖ L ‖ * + λ ‖ S ‖ 1 subject to L + S = X ,  ( 3.6 )

where ‖ L ‖ * denotes the nuclear norm of L (the sum of its singular values), ‖ S ‖ 1 is the sum of the absolute values of the entries of S and λ is a positive tuning constant.
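This convex problem can be solved by alternating singular-value thresholding (for the low-rank part) with entrywise soft-thresholding (for the sparse part). The sketch below follows the widely used inexact augmented-Lagrangian scheme, not necessarily the exact algorithm of [ 43 ], and the synthetic low-rank-plus-sparse data are illustrative only:

```python
import numpy as np

def rpca(X, lam=None, tol=1e-7, max_iter=500):
    """Principal component pursuit via an inexact augmented Lagrangian method."""
    n, p = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(n, p))
    mu = 1.25 / np.linalg.svd(X, compute_uv=False)[0]
    mu_bar, rho = mu * 1e7, 1.5
    norm_X = np.linalg.norm(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)
    for _ in range(max_iter):
        # singular-value thresholding gives the low-rank update
        U, sig, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # entrywise soft-thresholding gives the sparse update
        T = X - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        Z = X - L - S
        Y = Y + mu * Z
        mu = min(mu * rho, mu_bar)
        if np.linalg.norm(Z) / norm_X < tol:
            break
    return L, S

rng = np.random.default_rng(7)
L_true = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))      # low-rank part
S_true = np.zeros((60, 40))
mask = rng.random((60, 40)) < 0.05
S_true[mask] = rng.normal(scale=10.0, size=mask.sum())            # sparse corruption
L_hat, S_hat = rpca(L_true + S_true)
```

With data of this kind (low rank, a small fraction of large sparse corruptions), the low-rank component is typically recovered almost exactly.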

(d) Symbolic data principal component analysis

There is a recent body of work with so-called symbolic data, which is a general designation for more complex data structures, such as intervals or histograms [ 47 , 48 ].

Interval data arise when one wishes to retain a measure of underlying variability in the observations. This may occur if we wish to reflect the lack of precision of a measuring instrument or, more fundamentally, because the data are summary observations for which associated variability is considered inherent to the measurement. This is often the case when each observation corresponds to a group, rather than an individual, as would be the case with measurements on species, for which a range of values is considered part of the group value. If all p observed variables are of this type, each observation is represented by a hyper-rectangle, rather than a point, in p -dimensional space. Extensions of PCA for such data [ 47 , 49 ] seek PCs that are also of interval type, and which therefore also reflect ranges of values.

Another common type of symbolic data is given by histograms, which can be considered a generalization of interval-valued data where for each observation there are several intervals (the histogram bins) and associated frequencies. A recent review [ 50 ] covers several proposed definitions of PCA-type analyses for histogram data. Most of them require the definition of concepts such as distances between histograms (the Wasserstein distance being a common choice) or the sum and mean of histograms.

4. Conclusion

Although PCA in its standard form is a widely used and adaptive descriptive data analysis tool, it also has many adaptations of its own that make it useful to a wide variety of situations and data types in numerous disciplines. Adaptations of PCA have been proposed, among others, for binary data, ordinal data, compositional data, discrete data, symbolic data or data with special structure, such as time series [ 4 ] or datasets with common covariance matrices [ 6 , 40 ]. PCA or PCA-related approaches have also played an important direct role in other statistical methods, such as linear regression (with principal component regression [ 4 ]) and even simultaneous clustering of both individuals and variables [ 51 ]. Methods such as correspondence analysis, canonical correlation analysis or linear discriminant analysis may be only loosely connected to PCA, but, insofar as they are based on factorial decompositions of certain matrices, they share a common approach with PCA. The literature on PCA is vast and spans many disciplines. Space constraints mean that it has been explored very superficially here. New adaptations and methodological results, as well as applications, are still appearing.

Acknowledgements

We thank Pamela Gill and Abdel Hannachi for helpful discussions regarding their data and results.

Data accessibility

Authors' contributions

Both authors were equally involved in drafting the manuscript.

Competing interests

We have no competing interests.

Research by J.C. is partially supported by the Portuguese Science Foundation FCT - PEst-OE/MAT/UI0006/2014.

Prediction of Water Quality Using Principal Component Analysis

  • Published: 04 May 2012
  • Volume 4, pages 93–104 (2012)

  • S. S. Mahapatra 1 ,
  • Mrutyunjaya Sahu 2 ,
  • R. K. Patel 3 &
  • Biranchi Narayan Panda 4  

2047 Accesses

58 Citations

Groundwater throughout the world is heavily contaminated by acidity, alkalinity, toxic substances, heavy metals, and microbes as a result of population growth, urbanization, and industrialization. Evaluating groundwater quality is therefore extremely important for preparing remedial measures. This paper presents an empirical approach for classifying water samples based on 10 water-quality parameters. In this research work, water samples from 10 sources were collected across three different years and seasons to assess water quality. Q-mode principal component analysis was applied to classify the water samples into four categories using parameters such as pH, DO, turbidity, TDS, hardness, calcium ion (Ca ++ ), chloride ion (Cl − ), BOD, iron (Fe ++ ), and sulfate ( \(\mathrm{SO}_{4}^{--}\) ). This classification will be useful to planners and field engineers in taking ameliorative measures in advance to prevent groundwater contamination. The non-parametric method proposed here efficiently assesses a water quality index for classifying water quality. The model can also be used to estimate water quality on-line, although its accuracy depends upon the judicious selection of parameters.
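The Q-mode PCA mentioned in the abstract analyzes associations among the samples rather than among the parameters, i.e. it eigen-decomposes a samples-by-samples matrix. A minimal sketch with simulated stand-in data (the random values, the two-component retention, and the quadrant-based four-way split are all illustrative assumptions, not the paper's procedure):

```python
import numpy as np

# Hypothetical water-quality matrix: rows = 10 samples, cols = 10
# parameters (pH, DO, turbidity, TDS, hardness, Ca, Cl, BOD, Fe, SO4).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 10))   # stand-in for measured values

# Standardize each parameter (column) so differing units do not dominate.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Q-mode: eigen-decompose the samples-by-samples association matrix
# rather than the usual parameters-by-parameters one.
Q = Z @ Z.T / Z.shape[1]
eigvals, eigvecs = np.linalg.eigh(Q)
order = np.argsort(eigvals)[::-1]
scores = eigvecs[:, order[:2]] * np.sqrt(eigvals[order[:2]])

# A crude four-way classification: the sign pattern of the first two
# Q-mode components assigns each sample to one of four quadrants.
category = (scores[:, 0] > 0).astype(int) * 2 + (scores[:, 1] > 0).astype(int)
```

Samples falling in the same quadrant share a similar overall quality profile, which is the intuition behind grouping samples into categories.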



Author information

Authors and Affiliations

Department of Mechanical Engineering, National Institute of Technology, Rourkela, India

S. S. Mahapatra

Department of Civil Engineering, National Institute of Technology, Rourkela, India

Mrutyunjaya Sahu

Department of Chemistry, National Institute of Technology, Rourkela, India

R. K. Patel

Department of Manufacturing Science, Veer Surendra Sai University, Burla, India

Biranchi Narayan Panda

Corresponding author

Correspondence to S. S. Mahapatra .


About this article

Mahapatra, S.S., Sahu, M., Patel, R.K. et al. Prediction of Water Quality Using Principal Component Analysis. Water Qual Expo Health 4 , 93–104 (2012). https://doi.org/10.1007/s12403-012-0068-9

Download citation

Received : 12 December 2011

Revised : 01 April 2012

Accepted : 05 April 2012

Published : 04 May 2012

Issue Date : June 2012

DOI : https://doi.org/10.1007/s12403-012-0068-9

  • Water quality index
  • Principal component analysis
  • Correlation
  • Classification

Principal Component Analysis for Nonstationary Series

James D. Hamilton & Jin Xi, NBER Working Paper 32068 (January 2024)

This paper develops a procedure for uncovering the common cyclical factors that drive a mix of stationary and nonstationary variables. The method does not require knowing which variables are nonstationary or the nature of the nonstationarity. An application to the FRED-MD macroeconomic dataset demonstrates that the approach offers similar benefits to those of traditional principal component analysis with some added advantages.
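The paper's own procedure is not reproduced here. For context, the traditional approach it is compared against can be sketched as PCA on first-differenced, standardized series, cumulating the leading component to recover a common trend up to sign and scale (the simulated data and all names below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 200, 5
trend = np.cumsum(rng.normal(size=T))   # one nonstationary common factor
X = np.outer(trend, rng.normal(size=N)) + rng.normal(size=(T, N))

# Baseline: difference each series to remove the unit root, standardize,
# extract the first principal component of the differences, then
# cumulate it to recover the common trend up to sign and scale.
dX = np.diff(X, axis=0)
dX = (dX - dX.mean(axis=0)) / dX.std(axis=0)
U, S, Vt = np.linalg.svd(dX, full_matrices=False)
factor = np.cumsum(U[:, 0] * S[0])

# How closely the recovered factor tracks the true trend (up to sign)
# can be checked via their correlation in levels.
corr = np.corrcoef(factor, trend[1:])[0, 1]
```

Note that this baseline requires deciding in advance which series to difference; avoiding that choice is precisely the advantage the paper claims for its procedure.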

This research was supported by the Robert F. Engle Chair in Economics. The views expressed herein are those of the author and do not necessarily reflect the views of the National Bureau of Economic Research.


Research and Evaluation Based on Principal Component Analysis and BP Neural Networks

In this paper, based on the GBET dataset, 12 factors are used as secondary indicators, the 12 indicators are then extracted into 3 primary indicators, and the fuzzy comprehensive evaluation model is used for principal component analysis. Taking the 3 indicators as inputs and the evaluation results as outputs, the BP neural network model is used to establish the evaluation model …

