Nov 05, 2024

Public workspaceA new method to refine GWAS results based on the UKBiobank phenotype database

  • 1University of Palermo
Icon indicating open access to content
QR code linking to this content
Protocol CitationDavide Noto 2024. A new method to refine GWAS results based on the UKBiobank phenotype database. protocols.io https://dx.doi.org/10.17504/protocols.io.ewov1o5nklr2/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: January 31, 2023
Last Modified: November 05, 2024
Protocol Integer ID: 76145
Keywords: UKBiobank, GWAS, REGENIE, R
Disclaimer
All the necessary files are contained in the atteched ZIP file. For all the necessary files, routines, related to UK Biobank and RAP contact the UKBiobank organization for details.
Abstract
Genome wide association studies (GWAS) is an untargeted methodology able to identify novel gene variants associated with diseases. Sometimes the gene variants identified by GWAS are located within genes whose function is unknown or not related to the investigated trait. This paper describes a novel methodology based on GWAS filtering, aimed to find novel phenotypes associated to genetic loci that may determine cardiovascular disease (CVD), not conventionally listed among CVD canonical risk factors. UKBiobank, the largest clinical, laboratory, instrumental and genetic phenotypes database, was interrogated by an automated routine. Six gene variants associated with CVD, independently of canonical risk factors, were identified using a variants database of more than 400k genotyped subjects, and many novel clinical and biochemical phenotypes have been associated to the variants. The phenotypical characterisation of the loci resolved some ambiguities regarding gene loci lacking a clear CVD-associated identification.
Attachments
Safety warnings
There is no plan to upgrade the R routines in the future. Please check DNAnexus documentation for database creation and manipulation, REGENIE instruction for the RAP platform, and all other workflows within RAP.
Description of the Protocol
This protocol describes an automated workflow able to enrich the results of a Genome Wide Association Study (GWAS ) obtained from the UK Biobank data. In particular the workflow uses the Single Nucleotide Polymorphisms (SNP) that resulted associated with the investigated trait ( Cardiovascular disease in the current example) and search for further associations with the hundreds of phenotypic variables stored in theUK Biobank Database.Phenotypic association are evaluated by an automated set of multiple regression analyses (for numerical variables) and multiple logistic analyses (for binary variables).
All the procedures, from GWAS to multivariate analyses, are adjusted for canonical risk factors (of CVD in our example) in order to let emerge novel loci associated to the trait (CVD) independent from canonical risk factor. The association of other phenotypes to the SNP has some advantages:
i)If the SNP is located in proximity of a gene with unknown function, or with functions not directly related to the investigated trait, the enrichment of the phenotype could supply clues of how the gene is linked to the trait.
ii)If the locus contains different genes with different functions, the colocalization of the investigated trait with other phenotypes could help to discriminate which gene could be responsible for the association with the trait, if some associated phenotype points to a gene with known function.

The Figure shows the proposed protocol in graphical format.


The figure shows the diagram flow of the proposed study workflow. The first GWAS identified six loci associated with CVD independently from canonical risk factors. For every locus, the representative SNP were selected considering the SNP with highest LOD score after pruning of the variables in Linkage Disequilibrium. The six SNP were then used as predictors in a series of multiple logistic regression analyses using the clinical conditions stored as ICD10 codes in the UK Biobank database as dependent variables, while a set of multiple regression analyses were used to deal with numerical (clinical and Biochemical ) parameters stored in the database. The clinical conditions and the numerical parameters associated with the SNPs with p-values < 1E07 were then subjected to a new round of GWAS. The results were then merged and the colocalizations of the CVD GWAS with the novel binary/quantitative trait loci were investigated by a merged plot of the relative LOD scores, normalized between 0 and 1 for comparison purposes.

STEP2: GWAS analysis


The first step is the preparation of the database for the GWAS analysis of UKBiobank. The following procedures describe the workflow in the Research Analysis Platform (RAP) online platform from DNAnexus. (https://ukbiobank.dnanexus.com/landing).

The second step is the creation of the cohorts of cases and controls for the analysis. This procedure is realized with the Cohort Browser application of RAP. Check the DNAnexus documentation for details (https://documentation.dnanexus.com/user/cohort-browser).

The third step is the preparation of the phenotype file. Again check the documentation. (https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/ukb-rap)

In particular, extraction of the traits and other database manipulation can be executed using the JupiterNotebook pipeline within RAP. Explanation can be found in youtube DNAnexus channels (https://www.youtube.com/watch?v=762PVlyZJ-U&ab_channel=DNAnexus).



The final product of the REGENIE pipeline is the result file with extension “.regenie”, (CVD_out_firth_pheno_CVD.regenie in our example). The structure is:




The result con be plotted with the R “qqman”package. The code to plot our results is in the “plot GWAS_base.R” script.





STEP3: Phenotypes association to SNPs (representative of GWAS loci)

All the data can be downloaded by the “Table exporter” app from RAP. SNP genotypes in (0,1,2 minor allele format can be downloaded by plink. Data can be merged within R with the “merge” function using the “IID” column as index. Data are reduced in width. The list of the phenotypes is usually from 100 to 1000 column long
The main database needs to be arranged to be analyzed by the R script. The final structure should be:





Once ready the file is analyzed by the main R script “UK_Biobank_phenotypizer_ ver2.0.R”.
The script checks if the all the supporting files are present in the working directory. Then, the first analysis is executed, and the association of diseases stored in ICD10 column with the SNP is evaluated for every ICD10 term. This ICD10 column corresponds to the p41202 field of the UK Biobank database, and it stores all information in a single string format, whose elements are separated by a “|” sign. Example:


The function “ICDxSNP” ask for some information about the column position of the trait, covariate, phenotypes and SNP and start to calculate the association of phenotypes to SNP adjusting for covariates. The routine extracts every ICD term detected in more than a x number of subjects, where x is a user defined threshold, and . Then the routine performs a series of chi2 test on a 6 cells table (ICD10 yes/no) x (SNP 0,1,2) . If the chi2 testp-value is < 10E-04 it proceeds to a logistic regression with format:
The it perform a series of logistic regression with format:

glm[ ICD10 (0,1)~ covariates + SNP rs(0,1,2)].

If the glm p-value is < 10E-07, it stores the prevalence of the ICD10 condition according to the SNP (0,1,2) , the odds ratios and confidence intervals.

The function “NUMPARxSNP” ask for some information about the column position of the trait, covariate, phenotypes and SNP and start to calculate the association of numeric phenotypes to SNP adjusting for covariates. The routine performs a series of ANOVAtest on a numerical variables x (SNP 0,1,2) . If the ANOVAtestp-value is < 10E-04 it proceeds to a series of multiple regression with format:

lm[ numeric_parameters~ covariates + SNP rs(0,1,2)].

If the glm p-value is < 10E-07, it stores mean ± SD according to the SNP, multiple regression coefficient beta coefficient and multiple regression p-value.


STEP4 : REGENIE analysis of ICD10 conditions and numerical parameters obtained in STEP3.

A new set of REGENIE analyses is performed on the variables resulted in STEP3.
The aim of this step is to refine and enrich the loci found in STEP1 by adding new phenotypes. Then REGENIE is set to work with single chromosomes, those containing the loci associated with the main trait (CVD in out example).

For binary phenotypes (as ICD10 condition) REGENIE option is set to “bt”:
Example for chrom 12.
regenie \
--step 1 \
--bed ukb22418_c12_b0_v2 \ #chrom 12
--extract snps_qc_pass.snplist \
--covarFile phenofile_to_refine_eg1.txt \
--phenoFile phenofile_to_refine_eg1.txt \
--phenoCol pheno_CVD \ # main trait
--phenoCol E039 \ # ICD10 condition
--phenoCol K900 \ # ICD10 condition
--covarCol age --covarCol sex --covarCol diab --covarCol hbp --covarCol disl \
--bsize 750 \
--bt --lowmem --loocv \ # bt for binary trait
--lowmem-prefix tmp_rg \
--out CVD_fit_bin_out_refine_c12 #result file

regenie \
--step 2 \
--bed ukb22418_c12_b0_v2 \
--extract snps_qc_pass.snplist \
--covarFile phenofile_to_refine_eg1.txt \
--phenoFile phenofile_to_refine_eg1.txt \
--phenoCol pheno_CVD \
--phenoCol E039 \
--phenoCol K900 \
--covarCol age --covarCol sex --covarCol diab --covarCol hbp --covarCol disl \
--bsize 750 \
--bt \
--strict \
--firth --approx \
--pThresh 0.01 \
--pred CVD_fit_bin_out_refine_c12_pred.list \
--out test_bin_out_firth_c12_refine

For numeric phenotypes (as p30010_i0, plasma Albumin concentration ) REGENIE option is set to “qt”:
Example for chrom 10.

regenie \
--step 1 \
--bed ukb22418_c10_b0_v2 \ #chrom 10
--extract snps_qc_pass.snplist \
--covarFile phenofile_to_refine_eg1_unix.txt \
--phenoFile phenofile_to_refine_eg1_unix.txt \
--phenoCol p30010_i0 \ # quantitative trait
--covarCol age --covarCol sex --covarCol diab --covarCol hbp --covarCol disl \
--bsize 750 \
--qt --lowmem --loocv \ # qt for quantitative traits
--lowmem-prefix tmp_rg \
--out CVD2_fit_bin_out_refine_c10 #result files

regenie \
--step 2 \
--bed ukb22418_c10_b0_v2 \
--extract snps_qc_pass.snplist \
--covarFile phenofile_to_refine_eg1_unix.txt \
--phenoFile phenofile_to_refine_eg1_unix.txt \
--phenoCol p30010_i0 \
--covarCol age --covarCol sex --covarCol diab --covarCol hbp --covarCol disl \
--bsize 750 \
--qt \
--strict \
--firth --approx \
--pThresh 0.01 \
--pred CVD2_fit_bin_out_refine_c10_pred.list \
--out test2_bin_out_firth_c10_refine

Check the REGENIE pipeline in DNAnexus documentation for details.
STEP5: plotting the results of REGENIE analyses The final step is to merge the LOD score from the main REGENIE file (for CVD in out example ) with the secondary REGENIE analyses on selected traits (from STEP3) together with the chromosomal positions of the gene contained in the loci. The “plot_multi_LOD” functions ask for position of regenie file to merge within the result directory, extract and merge LOD scores, extract gene positions from the “Megabed_hg19.bed” file and arrange all the information in a single plot. This the results for chr10.
The Figure shows the chromosomal loci in chromosome 10 associated with CVD identified by the present study. LOD scores (y-axis) are plotted according to the chromosomal position (x-axis) of the SNPs. The phenotypes associated to the chromosomal loci are also superimposed to the CVD LOD scoresThe colours of the dots correspond to the label colours.