Common genetic variation associated with Mendelian disease severity revealed through cryptic phenotype analysis

Clinical datasets

Phenotypic analyses were conducted using the University of California San Francisco De-Identified Clinical Data Warehouse (UCSF-CDW)58, a database of structured health information that is made available to UCSF researchers free-of-charge. The data was captured for use on 31 May 2019 and includes roughly 8 years of clinic visits and inpatient hospitalizations (see Supplementary Methods). Following capture, patient demographic data was aligned to the International Classification of Disease, Tenth Revision, Clinical Modification (ICD10-CM) diagnostic codes available within the medical encounters. The individual diagnostic codes were simplified by collapsing multiple appearances of each code into a single value (at-least-one binarization), enabling the full set of diagnostic codes specific to each patient to be stored as a sparse, binary array. The ICD10-CM codes were filtered according to multiple criteria, which are described in the Supplementary Methods. This generated a dataset containing 10,483 ICD10-CM codes aligned to 1,204,212 patients. This is subsequently referred to as the UCSF-ICD10-CM dataset.

The UCSF-ICD10-CM was further processed in two ways. First, the ICD10-CM codes were transformed into human phenotype ontology (HPO)26 terms using a customized mapping, the construction of which is outlined below and in the Supplementary Methods (resulting dataset denoted UCSF-HPO). This alignment resulted in a global diagnostic matrix encoding 1674 HPO symptoms. Second, we translated the ICD10-CM codes into the ICD10 terminology utilized by the UK Biobank (ICD10-UKBB), taking advantage of the fact that the UKBB encoding is a less granular subset of the ICD10-CM (details regarding the precise translation can be found within our vLPI software package available on Github59). This processed dataset is subsequently referred to as UCSF-ICD10-UKBB. The UCSF-ICD10-UKBB dataset was also translated into HPO terms (denoted UCSF-HPO-UKBB). These less granular datasets contained 4933 and 1423 diagnostic terms respectively.

The UK Biobank (UKBB) is a collection of ≈500,000 middle-aged British adults who have received extensive genotyping and phenotyping23. The bulk UKBB dataset was downloaded on 22 January 2020 using the software provided by the organization. Following download, the raw data file was parsed, isolating demographic variables of interest and collapsing main/secondary inpatient summary diagnoses into a single data value (using at-least-one binarization). The resulting diagnostic codes were filtered according to multiple criteria (see Supplementary Methods), resulting in a 1:1 correspondence between the diagnostic codes available within the UKBB and the UCSF-ICD10-UKBB datasets. These ICD10 codes were then translated into HPO terms. The full UKBB dataset (after removing withdrawn subjects; N = 502,488) was used for cryptic phenotype inference, but the subjects were also filtered according to recommended best practices for genetic analyses23,60. Filtering resulted in the following two subsets: (1) 485,014 subjects (with exome data, N = 199,234) that remained after removing individuals whose genetic data is likely to be confounded by artifact (UKBB-Full), and 2) 342,796 unrelated subjects (with exome data, N = 153,182) of likely Western European (Caucasian) ancestry (UKBB-Unrelated). Further details regarding this processing can be found in the Supplementary Methods.

Because the UCSF-CDW and UKBB were both used for phenotype model inference and evaluation, the datasets were a priori divided each into training and testing subsets. To ensure that the testing datasets contained positive cases for each rare disease included in our analysis, distinct training and testing subsets were generated for every disorder. The subsets were constructed by randomly subsampling 75% of the data for training and 25% for testing while maintaining an equal ratio of diagnosed rare disease cases in each. All model inference and preliminary analyses were performed using the training datasets, while the testing datasets were only used for the final evaluation of cryptic phenotypes (see below, Fig. 2d and e, and Supplementary Methods).

Aligning rare diseases to structured medical data

Based on previous work9,61,62,63, we integrated multiple biomedical ontologies and terminologies to map rare diseases and their symptoms to structured medical data (i.e. diagnostic billing codes). To generate a set of rare diseases for analysis, we first used the Human Disease Ontology64 to obtain mappings between the Online Mendelian Inheritance in Man (OMIM) database65 and the ICD10-CM terminologies. Building on previous work62, we curated the OMIM-to-ICD10-CM alignments, selecting and grouping ICD10-CM codes that reliably mapped to a single or homogenous set of OMIM diseases, ensuring that the disorders were also annotated within the Human Phenotype Ontology26. This resulted in 166 rare, Mendelian conditions that were aligned to both the HPO and ICD10 terminologies (Supplementary Fig. 1). The 166 diseases were sorted according to their diagnostic prevalence in the UCSF-CDW; 50 disorders were selected for follow up testing (see Supplementary Methods; listed in Supplementary Data 1).

The HPO symptoms themselves were aligned to the ICD10-CM terminology in an automated fashion by integrating the information contained within multiple biomedical ontologies66,67,68,69. Details regarding the alignment are provided in the Supplementary Methods. This resulted in 1674 unique alignments between HPO terms and ICD10-CM codes (Supplementary Data 2). We assessed their performance by using them as features in a rare disease diagnosis prediction task (Supplementary Fig. 2). We found that prediction models constructed from the annotated26, ICD10-CM-aligned HPO terms had performances that were similar to models constructed using the complete ICD10-CM codebook (see Supplementary Table 1).

Cryptic phenotype analysis

Cryptic phenotype analysis (CPA) refers to the process by which a set of symptoms is used to infer a univariate, quantitative trait that captures the clinical heterogeneity observed within a disease of interest. This quantitative but cryptic phenotype can be used to assess clinical variability in both the diagnosed cases and the more general population, enabling the types of analyses described above. CPA consists of two stages. In the first, the symptoms annotated to a particular disease are decomposed into a low-dimensional set of quantitative, latent phenotypes. In the second stage, the trait that best captures disease morbidity (i.e. its symptom expressivity) is identified, since multiple latent traits are often recovered from a single symptom matrix. Below, we briefly outline the two stages of CPA. A more detailed description is provided in the Supplementary Methods.

Latent phenotype inference

Consider the set of (K) symptoms that are associated with some rare disease of interest, and furthermore, assume that these symptoms are binary (present/absent) and permanent (i.e. once diagnosed, they do not resolve). Let ({S}_{i,j}) denote the status of the (j)th symptom in the (i)th subject such that ({S}_{i,j}=1) indicates that the patient has been diagnosed with this symptom. Furthermore, let ({{{{{boldsymbol{S}}}}}}) denote an (Ntimes K)-dimensional matrix of symptom diagnoses such that the (i)th row of the matrix (denoted ({{{{{{boldsymbol{S}}}}}}}_{i})) contains the diagnoses for subject (i). Finally, let ({{{{{boldsymbol{Z}}}}}}) denote an (Ntimes L)-dimensional matrix of latent phenotypes, where each column represents the magnitude (i.e. severity) of an independent latent phenotype. We modeled the joint likelihood of the disease symptoms and latent phenotypes according to

$$Pleft({{{{{boldsymbol{S}}}}}},{{{{{boldsymbol{Z}}}}}}|theta right)=fleft({{{{{boldsymbol{Z}}}}}}{{{{{rm{;}}}}}}theta right)times P({{{{{boldsymbol{Z}}}}}})$$


where (fleft({{{{{bf{Z}}}}}};{{{{{rm{theta }}}}}}right)) is the symptom risk function (defined by the parameter set ({{{{{rm{theta }}}}}})) that maps the latent phenotypes onto the matrix of symptom probabilities (i.e. (fleft({{{{{boldsymbol{Z}}}}}};{{{{{rm{theta }}}}}}right)in {left[{{{{mathrm{0,1}}}}}right]}^{Ntimes K}equiv Pleft({{{{{boldsymbol{S}}}}}}|{{{{{boldsymbol{Z}}}}}},{{{{{rm{theta }}}}}}right))) and (Pleft({{{{{bf{Z}}}}}}right)) is a generative model for the latent phenotypes themselves. Additional details regarding (fleft({{{{{boldsymbol{Z}}}}}};theta right)) and (Pleft({{{{{bf{Z}}}}}}right)) are provided in the Supplementary Methods.

Given an observed symptom matrix (denoted ({{{{{boldsymbol{S}}}}}}={{{{{boldsymbol{s}}}}}})), we obtained estimates for the symptom risk function parameters (denoted (hat{{{{{{rm{theta }}}}}}})) by optimizing a lower bound approximation to the model marginal likelihood (i.e. P(s|θ) = ∫P(s,Z|θ)dZ) using an amortized, variational inference algorithm27,28. Model inference was conducted using the training subsets only. Estimates for the latent phenotypes of interest (denoted (hat{{{{{{boldsymbol{Z}}}}}}})) were obtained as a direct by-product of this optimization process (see Supplementary Methods). In practice, the observed symptom matrices for each rare disease were constructed from the UCSF-HPO, the UKBB-HPO, and the UCSF-HPO-UKBB datasets using the annotations available on the HPO website (see Supplementary Data 3 for the complete disease-to-symptom mappings). However, some of the aligned symptoms were manually curated to resolve convergence issues (see Supplementary Methods); Supplementary Data 5 contains the final disease-to-symptom mappings used to infer the cryptic phenotypes for the 10 diseases that passed all our filters (see below). Additional details concerning our model inference and evaluation procedures are provided in the Supplementary Methods.

Cryptic phenotype identification and evaluation

Following inference, we assigned each rare disease a single cryptic phenotype, which we define as the latent trait that best captures the symptom frequency intrinsic to the rare disease of interest (i.e. its morbidity). By default, all our models were initialized with a total of 10 possible latent phenotype components, as multiple pathologic processes can contribute to the correlation structure observed among some set of symptoms (see Supplementary Methods for more information). Although this meant that many of our models were initially overdetermined, we found that our inference algorithm was able to automatically remove unnecessary components by zeroing out their parameters in the symptom risk function. The number of latent components that remained following model inference was termed the model’s effective rank (({L}_{{{mbox{eff}}}}), see Supplementary Methods for precise definition), which was typically much less than the number of components used to initialize the model (Supplementary Fig. 4). When ({L}_{{{mbox{eff}}}}=1), then this single component was automatically selected to be the disease’s cryptic phenotype. When ({L}_{{{mbox{eff}}}} , > ,1), then each inferred latent phenotype was used separately as a classifier to predict rare disease diagnoses in the training dataset, noting that the component that best captures the morbidity of a disease should be most predictive of its diagnostic status (see Fig. 3a–c, inset for examples). This top-performing latent component (assessed using the average precision score implemented in scikit-learn70) was then selected as the disease’s cryptic phenotype.

Note, the model fitting described above was completed in both the UCSF and UKBB datasets, with the caveat that not all the Mendelian diseases in Supplementary Data 1 map to specific ICD10 diagnostic codes in the UKBB dataset (the encoding for this dataset is more limited, see above). Therefore, all cryptic phenotype models inferred in the UKBB dataset were also applied to the UCSF dataset (using UCSF-HPO-UKBB, see Fig. 2e for results). To ensure the assigned cryptic phenotypes were in fact capturing Mendelian-disease related morbidity, we compared the average cryptic phenotype severity among diagnosed cases to their undiagnosed controls (using the test datasets only). For a cryptic phenotype to successfully capture disease morbidity, the average symptom severity among Mendelian disease cases had to be significantly higher in both the UCSF and UKBB datasets (significance assessed through bootstrapped re-sampling71 after performing Bonferroni corrections, see Fig. 2d and e for results). If Mendelian disease diagnostic codes were not available in the UKBB, then this increase in cryptic phenotype severity only needed to occur in the UCSF dataset.

Beyond capture, we also wanted to ensure that models inferred within the two independent datasets were consistent, meaning that they generated similar results when applied to the same dataset. Therefore, the phenotype models inferred within the UKBB were directly applied to the UCSF-HPO-UKBB dataset. Consistency was then assessed in three ways. First, the same latent component had to be assigned as the cryptic phenotype in both datasets (see above). Second, the UKBB model had to reproduce the increase in phenotype severity observed among the Mendelian disease cases within this new dataset. Third, the cryptic phenotypes produced by the UCSF and UKBB models needed to be correlated (as assessed through the coefficient of determination, r2). Using an r2 cutoff of 0.2, ten of the original fifty Mendelian disorders survived our capture, replication, and consistency filters. However, it is entirely plausible that replicable and consistent cryptic phenotypes could have been inferred for the other disorders through careful curation of annotated symptoms, larger sample sizes, and more focused adjustment of inference algorithm parameters (see Supplementary Methods).

Cryptic phenotype validation

The cryptic phenotypes for the five diseases listed in Table 1 were further validated through rare variant association studies. This required identifying pathogenic variant carriers within the UKBB. For A1ATD, the causal Pi*Z allele (rs28929474) was directly ascertained through array-based genotyping, so carriers of the Pi*MZ and Pi*ZZ genotypes were identified in the call/imputation files (see UKBB Data Category 263). For the remaining diseases, we downloaded the VCF files that contained the known causal genes (see Table 1). We then used the ClinVar database VCF (available at to identify all variants in the UKBB that have pathogenic/likely pathogenic annotations (accomplished using bcftools72). Because heterozygous loss-of-function (LoF) is an established molecular mechanism for each of diseases in Table 1 (except for A1ATD), we also identified LoF variants that were not listed in ClinVar. These were annotated using the LOFTEE plugin29 for the ensembl variant effect predictor73. Not all the variants isolated in this manner have equivalent levels of evidence for pathogenicity. Therefore, we added a flag to each variant to indicate if: (1) it had conflicting annotations, (2) it was annotated by a single submitter, or (3) it was located within a non-canonical transcript (LoF variants only). Supplementary Data 6 contains a complete list of the P/LP variants analyzed in this study.

Using the VCF and genotype call files, we then identified all carriers of the P/LP variants described above. To assess whether the variants were associated with cryptic phenotype severity, we estimated their average genetic effect using the following linear model:

$${mathrm {C{P}}}_{i}={beta }_{0}+{beta }_{mathrm {{P/{LP}}}}times {G}_{i}+{{{{{boldsymbol{a}}}}}}times {{{{{{boldsymbol{X}}}}}}}^{{mathrm {T}}}$$


where ({mathrm {C{P}}}_{i}) denotes the cryptic phenotype of the ith subject, ({beta }_{0}) is an intercept parameter, ({beta }_{{{{{{rm{P}}}}}}/{{{{{rm{LP}}}}}}}) is the average effect parameter for the P/LP variants, ({G}_{i}) is the carrier status of the ith patient, and ({{{{{{boldsymbol{X}}}}}}}^{{mathrm {T}}}) denotes a vector of covariates (with their corresponding parameter vector given by ({{{{{bf{a}}}}}})). Sex, age (inverse rank-transformed to remove skew), UKBB array platform, and the first 10 principal components of the genetic relatedness matrix were used as covariates. The analysis was limited to unrelated individuals of similar ancestry (Caucasian) to reduce the risk for population structure confounding (N = 153,182). Estimates for the parameters were produced using ordinary least squares, and per-parameter significance was assessed using a two-sided T-test. To account for effects related to variant annotation, we also fit the following linear model:

$${mathrm {C{P}}}_{i}={beta }_{0}+{beta }_{mathrm {{P/{LP}}}}times {G}_{i}+{beta }_{mathrm {{{{Unflagged},{P}}/{LP}}}}times {G}_{i}times {U}_{i}+{{{{{boldsymbol{a}}}}}}times {{{{{{boldsymbol{X}}}}}}}^{{mathrm {T}}}$$


where ({U}_{i}) is a binary variable that indicates if the ith patient carries a variant without any annotation flags (see above). This enabled us to decompose the phenotypic contributions of P/LP variants into baseline (({beta }_{{{{{{rm{P}}}}}}/{{{{{rm{LP}}}}}}})) and unflagged (({beta }_{{{{{{rm{UnflaggedP}}}}}}/{{{{{rm{LP}}}}}}})) effects, which are displayed at the top of the panels in Fig. 3a–c and Supplementary Fig. 6b. Note, the AS phenotype is known to be more severe among hemizygous male carriers of COL4A5 pathogenic variants, consistent with X-linked inheritance. Therefore, we included an interaction term between sex and COL4A5 carrier status during our molecular validation of the AS cryptic phenotype. This interaction effect did not reach statistical significance (βCOL4A5xSex = 0.09 ± 0.15; P-value = (0.56)), likely due to the small number of male P/LP COL4A5 carriers in the dataset (N = 12 in UKBB-Unrelated). As a result, sex-specific interaction effects were not included in downstream analyses.

Common variant genome-wide association analyses

Genome-wide association studies were performed to identify common genetic variants associated with the cryptic phenotypes assigned to the diseases in Table 1. To reduce the risk of confounding, the association analyses were conducted using a subset of patients isolated from UKBB-Unrelated ((N={{{{mathrm{342,796}}}}})) that met the following criteria: (1) did not possess a P/LP variant in a gene linked to the disease of interest, (2) were never diagnosed with this disease, and (3) were not a 3rd degree or closer relative of any of these subjects. From this training cohort, a random subset of 10% were removed and added to the Mendelian disease P/LP carriers. This second dataset is called the target cohort, and it was used to perform polygenic score replication and validation. All SNPs meeting the following criteria were in included into the analyses: directly genotyped by the UKBB, minor allele frequency (MAF) ≥ 1%, missing genotype fraction ≤5%, and Hardy–Weinberg equilibrium (HWE) P-value ≥ 10−12. Note, a relatively limited number of genetic markers (579,429 SNPs) met these criteria, but this smaller set of features enabled us construct individual-level prediction models for polygenic score inference (see below).

Genome-wide association studies (GWAS) were conducted by fitting the following linear model to each cryptic phenotype:

$${mathrm {C{P}}}_{i}={beta }_{0}+{beta }_{j}^{{{mathrm {SNP}}}}times {G}_{i,j}+{{{{{boldsymbol{a}}}}}}times {{{{{{boldsymbol{X}}}}}}}^{{mathrm {T}}}$$


where ({mathrm {C{P}}}_{i}) indicates the cryptic phenotype in the ith patient, ({{{{{{rm{beta }}}}}}}_{j}^{{{mathrm {SNP}}}}) represents the average effect of the jth SNP, ({G}_{i,j}) encodes the minor allele count (({G}_{i}in {{{{{mathrm{0,1,2}}}}}}); additive model), and ({{{{{{boldsymbol{X}}}}}}}^{{mathrm {T}}})/({{{{{bf{a}}}}}}) denote covariates/effect parameters respectively. Sex, age (rank-normalized), UKBB array platform, and the first 10 principal components of the genetic relatedness matrix were used as covariates. Association statistics were estimated using the Plink274 software package (–glm command). Lead SNPs and their corresponding annotations were obtained using the FUMA75 platform. The loci identified for the three diseases with genome-wide significant effects are provided as Supplementary Tables 2–4 (A1ATD, AS, and ADPKD, respectively).

SumHer (available within the LDAK toolkit76) was used to produce estimates for the fraction of the additive variance explained by the genotyped SNPs (narrow-sense heritability, denoted h2). This required the specification of an underlying heritability model76. Based on recommended best-practices, we used the LDAK-Thin model given its simplicity and portability to individual-level prediction. This required computing a tagging file, which was constructed using a random subset (N = 10,000) of UKBB-Unrelated. First, duplicate SNPs were identified using the ldak—thin command with the following options:–window-prune .98–window-kb 100. Next, the tagging file itself was constructed using the ldak –calc-tagging command (with options –power -.25—window-cm 1 –save-matrix YES). Finally, narrow-sense heritability estimates were produced from the GWAS summary statistics using the ldak –sum-hers command (while also storing the per-SNP heritability estimates for downstream analyses).

Polygenic prediction models summarizing the common variant association statistics were inferred for the three diseases in Table 1 that had cryptic phenotype h2 estimates significantly >0 (A1ATD, AS, and ADPKD). These models were estimated using the individual-level genotype data available for each training cohort. More specifically, we used LDAK-Bolt-Predict32 (ldak –bolt command) to estimate effect sizes for every SNP included in the cryptic phenotype association analyses (while conditioning on the covariates included in the initial linear model, see above). This required access to the per-SNP heritability estimates, which were produced by SumHer (see above). Note, 10% of the training data was withheld during model inference (using the –cv-proportion .1 flag) to estimate prior parameters. After model fitting was complete, polygenic scores were imputed into the target cohort using the –calc-scores command (with –power flag set to 0). The per-SNP effect size estimates produced by the predictor models are included as Supplementary Data 7–9 (A1ATD, AS, and ADPKD respectively).

Estimating the effects of polygenic load on Mendelian disease severity and outcomes

Polygenic scores (PGS) were imputed into the target cohorts for each rare disease in order to: (1) replicate the PGS-cryptic phenotype relationships, (2) assess for interaction effects between the PGS and P/LP variants, and (3) determine if high polygenic load was associated with established Mendelian disease outcomes.

The first two analyses were accomplished by fitting the following linear model within the target cohort constructed for each cryptic phenotype:

$${{{mathrm {CP}}}}_{i}={beta }_{0}+{beta }_{{{mathrm {PGS}}}}times {xi }_{i}+{beta }_{{mathrm {P}}{{backslash }}{{mathrm {LP}}}}times {G}_{i}+{beta }_{{{mathrm {PGSxP}}}{{backslash }}{{mathrm {LP}}}}times {G}_{i}times {xi }_{i}+{{{{{boldsymbol{a}}}}}}times {{{{{{boldsymbol{X}}}}}}}^{{mathrm {T}}}$$


where ({{{{{{rm{xi }}}}}}}_{{{{{{rm{i}}}}}}}) represents the PGS for the ith patient, ({{{{{{rm{beta }}}}}}}_{{{{{{rm{PGS}}}}}}}) represents its average phenotypic effect, and ({{{{{{rm{beta }}}}}}}_{{{{{{rm{PGSxPbackslash LP}}}}}}}times {{G}}_{{i}}times {{{{{{rm{xi }}}}}}}_{{i}}) models the interaction between the PGS and the P/LP variants. In the case of A1ATD, the two pathogenic genotypes (Pi*ZZ and Pi*MZ) were modeled as separate genetic effects, each with their own PGS interaction terms. For AS, both flagged and unflagged P/LP variants were included into the analysis, as they were both shown to influence cryptic phenotype severity (see Supplementary Fig. 6b). For ADPKD, only the unflagged variants were included, as there was no detectable phenotypic effect for the flagged variants, suggesting that they most likely represent annotation noise (see Fig. 3c). The previous model was fit using ordinary least squares, and association statistics were computed using two-sided T-tests.

Regarding covariates (i.e. a(times {{{{{{boldsymbol{X}}}}}}}^{{mathrm {T}}})), sex, age, UKBB array platform, and the first 10 principal components of the genetic relatedness matrix were included in every model. Smoking history was also included into each model, although its incorporation varied across diseases. For A1ATD and AS, self-reported pack-years (defined by Data Field: 20161) were used to quantify smoking history. Note, there was a significant interaction effect between the Pi*Z allele and smoking history (as expected), so interaction terms between pack-years and the pathogenic genotypes were included into the regression model for this disease (see Fig. 5b). There was no significant interaction effect between pack-years and P/LP carrier status for AS (βPack-years x P/LP = 0.03 ± 0.05; P-value = 0.55), so smoking interaction terms were not included for this disorder.

Regarding ADPKD, smoking had a strong protective effect on cryptic phenotype severity such that those P/LP carriers with a history of ever-smoking (provided by Data Field: 20160) had systematically lower cryptic phenotype scores (βSmoke x P/LP = −0.41 ± 0.11; P-value = (1.2times {10}^{-4})). This result is clearly at odds with the known pathophysiology of smoking and renal disease, and it likely stems from the fact that subjects with moderate-to-severe ADPKD are often diagnosed at a young age, prior to when smoking behavior is established (see Supplementary Fig. 9d for Kaplan–Meier curve of ESRD among P/LP carriers). Consistent with this hypothesis, significantly fewer P/LP carriers reported ever-smoking when compared to the general population (see Supplementary Fig. 9b). Based on these results, the relationship between smoking history and ADPKD severity is likely to be confounded by multiple unmeasured factors (specifically, medical intervention and counseling). Given our inability to adequately adjust for such complex confounding, smoking history in ADPKD was modeled using a simple binary variable (UKBB Data Field: 20160), which was included along with a P/LP interaction term. Note, similar confounding likely plays a role in the interaction effects between smoking and genotype for the other disorders (see Fig. 5e and Supplementary Fig. 7b for examples), but it was only significant enough to reverse the established morbidity relationship for ADPKD.

To confirm a role for polygenic load on Mendelian disease outcomes, we examined its effect on quantitative measurements that capture established pathophysiology but are distinct from the symptoms used to construct the cryptic phenotype. For A1ATD, we used the FEV1/FVC ratio (UKBB Data Field: 20258), a measurement derived from spirometry that quantifies the severity of obstructive lung disease (see Supplementary Fig. 7c). For AS, we examined urine microalbumin level (UKBB Data Field: 30500), which correlates with renal health and glomerular barrier function (see Supplementary Fig. 8c). Finally, for ADPKD, we computed an estimate46 of the glomerular filtration rate (eGFR) from the serum creatinine level (UKBB Data Field: 30700), which is often used as a proxy for overall renal function (see Supplementary Fig. 9c). The regression models themselves incorporated the same genetic and covariate effects that were used for the cryptic phenotypes, and they were again fit using ordinary least squares with association statistics computed using two-sided T-tests.

Finally, the effect of polygenic load on Mendelian disease severity was assessed by estimating its association with: (1) the rare disease diagnosis itself (when available) and (2) the onset of clinically important outcomes. The effect of the PGS on Mendelian disease diagnostic risk was modeled using logistic regression according to

$${{{mbox{Log-Odds}}}}left({D}_{i}right)={beta }_{0}+{beta }_{{{mathrm {P GS}}}}times {xi }_{i}+{beta }_{{mathrm {P}}{{backslash }}{{mathrm {LP}}}}times {G}_{i}+{beta }_{{{mathrm {PGSxP}}}{{backslash }}{{mathrm {LP}}}}times {G}_{i}times {xi }_{i}+{{{{{boldsymbol{a}}}}}}times {{{{{{boldsymbol{X}}}}}}}^{{mathrm {T}}}$$


where ({D}_{i}) is in a binary variable indicating whether a disease diagnosis is present or absent. The covariates included were sex, age, UKBB array platform, the first 10 principal components of the genetic relatedness matrix, and smoking history (plus interaction terms where relevant, see above). Model fitting was performed using the maximum-likelihood method with a Firth penalty term, which was included given the risk for Type I error rate inflation in the setting of unbalanced samples and rare predictors77. Significance for a given association was assessed using a likelihood-ratio ({chi }^{2}) test78.

The age-of-onset for clinically important Mendelian disease outcomes was also used to assess the effects of polygenic load on disease severity. The outcomes included in this study were: end-stage renal disease (ESRD; UKBB Data Field: 42026), chronic obstructive pulmonary disease (COPD; UKBB Data Field: 42016), recurrent and persistent hematuria (UKBB Data Field: 132002), and cystic kidney disease (UKBB Data Field: 132532). Details concerning the construction of these data fields are available through the UKBB. For each outcome, age-of-onset was modeled using Cox proportional hazards (CPH) regression:

$${lambda }_{i}={beta }_{{{mathrm {PGS}}}}times {xi }_{i}+{beta }_{{mathrm {P}}{{backslash}}{{mathrm {LP}}}}times {G}_{i}+{beta }_{{{mathrm {PGSxP}}}{{backslash}}{{mathrm {LP}}}}times {G}_{i}times {xi }_{i}+{{{{{boldsymbol{a}}}}}}times {{{{{{boldsymbol{X}}}}}}}^{{mathrm {T}}}$$


where ({lambda }_{i}) represents the logarithm of the partial hazard function for the ith subject. The following covariates were included into the model: sex, UKBB array platform, the first 10 principal components of the genetic relatedness matrix, and smoking history (with interaction terms as described above). Model fitting was performed by maximizing the partial likelihood (using the lifelines software package79), and significance was assessed using a likelihood-ratio ({{{{{{rm{chi }}}}}}}^{2}) test.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Leave a Comment