GSCAN dbGaP: Difference between revisions
Jump to navigation
Jump to search
Mengzhen Liu (talk | contribs) No edit summary |
|||
| Line 26: | Line 26: | ||
write.table(xx, file="framingham_GSCAN_phenotypesCovariates.ped", quote=FALSE, sep="\t", row.names=F) | write.table(xx, file="framingham_GSCAN_phenotypesCovariates.ped", quote=FALSE, sep="\t", row.names=F) | ||
==Genotypes== | |||
===Pre-Phasing QC=== | |||
QC parameters that we chose: | |||
MAF > 0.01 | |||
SNP callrate > 0.95 | |||
Missingness per individual > 0.95 | |||
HWE = 0.05 / number of markers but greater than 5e-8 | |||
To update the strand builds: http://www.well.ox.ac.uk/~wrayner/strand/ | |||
## Check strands against latest 1000G: http://www.well.ox.ac.uk/~wrayner/tools/ | |||
#!/bin/bash | |||
#SBATCH --qos=blanca-ibg | |||
#SBATCH --mem=40gb | |||
perl HRC-1000G-check-bim.pl -b ARIC_b37_filtered.bim -f ARIC_b37_filtered.frq -r 1000GP_Phase3_combined.legend -g -p EUR | |||
## Phasing using shapeit | |||
#!/bin/bash | |||
#SBATCH --mem=20gb | |||
#SBATCH --time=48:00:00 | |||
#SBATCH -o shapeit_aric_%j.out | |||
#SBATCH -e shapeit_aric_%j.err | |||
#SBATCH --qos blanca-ibgc1 | |||
#SBATCH --ntasks-per-node 48 | |||
#SBATCH -J shapeit_aric | |||
shapeit -B ARIC_b37_filtered-updated-chr${1} -M /rc_scratch/meli7712/dbGAP/1000GP_Phase3/genetic_map_chr${1}_combined_b37.txt -O phased/ARIC_b37_filtered-updated-chr${1}.phased -T 48 | |||
## To convert the shapeit output into vcf | |||
#!/bin/bash | |||
#SBATCH --mem=20gb | |||
#SBATCH --time=24:00:00 | |||
#SBATCH -o shapeit_mesa_%j.out | |||
#SBATCH -e shapeit_mesa_%j.err | |||
#SBATCH --qos janus | |||
#SBATCH --ntasks-per-node 12 | |||
#SBATCH -J shapeit_mesa | |||
shapeit -convert --input-haps mesa-chr${1}.phased --output-vcf mesa-chr${1}.phased.vcf -T 12 | |||
## Imputation | |||
#!/bin/bash | |||
#SBATCH --mem=30gb | |||
#SBATCH --time=72:00:00 | |||
#SBATCH -o impute_mesa_%j.out | |||
#SBATCH -e impute_mesa_%j.err | |||
#SBATCH --qos blanca-ibgc1 | |||
#SBATCH --ntasks-per-node 48 | |||
#SBATCH -J impute_mesa | |||
/work/KellerLab/Zhen/bin/Minimac3/bin/Minimac3 --haps mesa-chr${1}.phased.vcf --cpus 48 --refHaps /rc_scratch/meli7712/dbGAP/references/${1}.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz --chr ${1} --noPhoneHome --format GT,DS,GP --allTypedSites --prefix mesa-chr${1}.phased.imputed | |||
Revision as of 00:05, 25 August 2016
Framingham
ID Mapping
The following snppit describes how the PLINK genotype file IDs (which use "sampid") were mapped to the dbGaP_Subject_ID contained in the available phenotype files. The snippet only contains one of Joyce Ling's phenotype files for illustration only.
This script can be found in /work/KellerLab/GSCAN/dbGaP/Framingham/ID_mapping.r
### ID mapping file from dbGaP_Subject_ID to SAMPID
ID.map <- read.table(gzfile("/work/KellerLab/GSCAN/dbGaP/Framingham/PhenoGenotypeFiles/RootStudyConsentSet_phs000007.Framingham.v28.p10.c1.HMB-IRB-MDS/PhenotypeFiles/phs000007.v28.pht001415.v16.p10.Framingham_Sample.MULTI.txt.gz"), header=T, sep="\t", stringsAsFactors=F)
### Genotype files
genotype.IDs <- read.table("/work/KellerLab/GSCAN/dbGaP/Framingham/PhenoGenotypeFiles/ChildStudyConsentSet_phs000342.Framingham.v16.p10.c1.HMB-IRB-MDS/GenotypeFiles/phg000006.v9.FHS_SHARe_Affy500K.genotype-calls-matrixfmt.c1/subject_level_PLINK_sets/FHS_SHARe_Affy500K_subjects_c1.fam", header=F)
names(genotype.IDs) <- c("famid", "SAMPID", "patid", "matid", "sex", "phenotype")
length(which(genotype.IDs$SAMPID %in% ID.map$SAMPID))
## 6954
x <- merge(genotype.IDs, ID.map, by="SAMPID", all.x=T)
x <- x[,c(1:5,7)]
### One of Joyce's phenotype ID files (eventually all phenotypes for all available participants were merged in, but this is a good example.)
phenotypes <- read.table("OffC_Exam_1.txt", header=T, sep="\t")
xx <- merge(x, phenotypes, by="dbGaP_Subject_ID", all=TRUE)
xx <- xx[,c(2:ncol(xx),1)] ## Reorder so the dbGaP_Subject_ID is the last column
write.table(xx, file="framingham_GSCAN_phenotypesCovariates.ped", quote=FALSE, sep="\t", row.names=F)
Genotypes
Pre-Phasing QC
QC parameters that we chose: MAF > 0.01
SNP callrate > 0.95
Missingness per individual > 0.95
HWE = 0.05 / number of markers but greater than 5e-8
To update the strand builds: http://www.well.ox.ac.uk/~wrayner/strand/
## Check strands against latest 1000G: http://www.well.ox.ac.uk/~wrayner/tools/ #!/bin/bash #SBATCH --qos=blanca-ibg #SBATCH --mem=40gb perl HRC-1000G-check-bim.pl -b ARIC_b37_filtered.bim -f ARIC_b37_filtered.frq -r 1000GP_Phase3_combined.legend -g -p EUR
## Phasing using shapeit
#!/bin/bash
#SBATCH --mem=20gb
#SBATCH --time=48:00:00
#SBATCH -o shapeit_aric_%j.out
#SBATCH -e shapeit_aric_%j.err
#SBATCH --qos blanca-ibgc1
#SBATCH --ntasks-per-node 48
#SBATCH -J shapeit_aric
shapeit -B ARIC_b37_filtered-updated-chr${1} -M /rc_scratch/meli7712/dbGAP/1000GP_Phase3/genetic_map_chr${1}_combined_b37.txt -O phased/ARIC_b37_filtered-updated-chr${1}.phased -T 48
## To convert the shapeit output into vcf
#!/bin/bash
#SBATCH --mem=20gb
#SBATCH --time=24:00:00
#SBATCH -o shapeit_mesa_%j.out
#SBATCH -e shapeit_mesa_%j.err
#SBATCH --qos janus
#SBATCH --ntasks-per-node 12
#SBATCH -J shapeit_mesa
shapeit -convert --input-haps mesa-chr${1}.phased --output-vcf mesa-chr${1}.phased.vcf -T 12
## Imputation
#!/bin/bash
#SBATCH --mem=30gb
#SBATCH --time=72:00:00
#SBATCH -o impute_mesa_%j.out
#SBATCH -e impute_mesa_%j.err
#SBATCH --qos blanca-ibgc1
#SBATCH --ntasks-per-node 48
#SBATCH -J impute_mesa
/work/KellerLab/Zhen/bin/Minimac3/bin/Minimac3 --haps mesa-chr${1}.phased.vcf --cpus 48 --refHaps /rc_scratch/meli7712/dbGAP/references/${1}.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz --chr ${1} --noPhoneHome --format GT,DS,GP --allTypedSites --prefix mesa-chr${1}.phased.imputed