GSCAN dbGaP
Revision as of 00:05, 25 August 2016 by Mengzhen Liu (talk | contribs)
Framingham
ID Mapping
The following snppit describes how the PLINK genotype file IDs (which use "sampid") were mapped to the dbGaP_Subject_ID contained in the available phenotype files. The snippet only contains one of Joyce Ling's phenotype files for illustration only.
This script can be found in /work/KellerLab/GSCAN/dbGaP/Framingham/ID_mapping.r
### ID mapping file from dbGaP_Subject_ID to SAMPID ID.map <- read.table(gzfile("/work/KellerLab/GSCAN/dbGaP/Framingham/PhenoGenotypeFiles/RootStudyConsentSet_phs000007.Framingham.v28.p10.c1.HMB-IRB-MDS/PhenotypeFiles/phs000007.v28.pht001415.v16.p10.Framingham_Sample.MULTI.txt.gz"), header=T, sep="\t", stringsAsFactors=F) ### Genotype files genotype.IDs <- read.table("/work/KellerLab/GSCAN/dbGaP/Framingham/PhenoGenotypeFiles/ChildStudyConsentSet_phs000342.Framingham.v16.p10.c1.HMB-IRB-MDS/GenotypeFiles/phg000006.v9.FHS_SHARe_Affy500K.genotype-calls-matrixfmt.c1/subject_level_PLINK_sets/FHS_SHARe_Affy500K_subjects_c1.fam", header=F) names(genotype.IDs) <- c("famid", "SAMPID", "patid", "matid", "sex", "phenotype") length(which(genotype.IDs$SAMPID %in% ID.map$SAMPID)) ## 6954 x <- merge(genotype.IDs, ID.map, by="SAMPID", all.x=T) x <- x[,c(1:5,7)] ### One of Joyce's phenotype ID files (eventually all phenotypes for all available participants were merged in, but this is a good example.) phenotypes <- read.table("OffC_Exam_1.txt", header=T, sep="\t") xx <- merge(x, phenotypes, by="dbGaP_Subject_ID", all=TRUE) xx <- xx[,c(2:ncol(xx),1)] ## Reorder so the dbGaP_Subject_ID is the last column write.table(xx, file="framingham_GSCAN_phenotypesCovariates.ped", quote=FALSE, sep="\t", row.names=F)
Genotypes
Pre-Phasing QC
QC parameters that we chose: MAF > 0.01
SNP callrate > 0.95
Missingness per individual > 0.95
HWE = 0.05 / number of markers but greater than 5e-8
To update the strand builds: http://www.well.ox.ac.uk/~wrayner/strand/
## Check strands against latest 1000G: http://www.well.ox.ac.uk/~wrayner/tools/ #!/bin/bash #SBATCH --qos=blanca-ibg #SBATCH --mem=40gb perl HRC-1000G-check-bim.pl -b ARIC_b37_filtered.bim -f ARIC_b37_filtered.frq -r 1000GP_Phase3_combined.legend -g -p EUR
## Phasing using shapeit #!/bin/bash #SBATCH --mem=20gb #SBATCH --time=48:00:00 #SBATCH -o shapeit_aric_%j.out #SBATCH -e shapeit_aric_%j.err #SBATCH --qos blanca-ibgc1 #SBATCH --ntasks-per-node 48 #SBATCH -J shapeit_aric shapeit -B ARIC_b37_filtered-updated-chr${1} -M /rc_scratch/meli7712/dbGAP/1000GP_Phase3/genetic_map_chr${1}_combined_b37.txt -O phased/ARIC_b37_filtered-updated-chr${1}.phased -T 48
## To convert the shapeit output into vcf #!/bin/bash #SBATCH --mem=20gb #SBATCH --time=24:00:00 #SBATCH -o shapeit_mesa_%j.out #SBATCH -e shapeit_mesa_%j.err #SBATCH --qos janus #SBATCH --ntasks-per-node 12 #SBATCH -J shapeit_mesa shapeit -convert --input-haps mesa-chr${1}.phased --output-vcf mesa-chr${1}.phased.vcf -T 12
## Imputation #!/bin/bash #SBATCH --mem=30gb #SBATCH --time=72:00:00 #SBATCH -o impute_mesa_%j.out #SBATCH -e impute_mesa_%j.err #SBATCH --qos blanca-ibgc1 #SBATCH --ntasks-per-node 48 #SBATCH -J impute_mesa /work/KellerLab/Zhen/bin/Minimac3/bin/Minimac3 --haps mesa-chr${1}.phased.vcf --cpus 48 --refHaps /rc_scratch/meli7712/dbGAP/references/${1}.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz --chr ${1} --noPhoneHome --format GT,DS,GP --allTypedSites --prefix mesa-chr${1}.phased.imputed