Difference between revisions of "GSCAN dbGaP"

From Vrieze Wiki
Jump to navigation Jump to search
Line 26: Line 26:
 
   
 
   
 
  write.table(xx, file="framingham_GSCAN_phenotypesCovariates.ped", quote=FALSE, sep="\t", row.names=F)
 
  write.table(xx, file="framingham_GSCAN_phenotypesCovariates.ped", quote=FALSE, sep="\t", row.names=F)
 +
 +
==Genotypes== 
 +
 +
===Pre-Phasing QC===
 +
QC parameters that we chose:
 +
MAF > 0.01
 +
 +
SNP callrate > 0.95
 +
 +
Missingness per individual > 0.95
 +
 +
HWE = 0.05 / number of markers but greater than 5e-8
 +
 +
To update the strand builds: http://www.well.ox.ac.uk/~wrayner/strand/
 +
 +
 +
## Check strands against latest 1000G: http://www.well.ox.ac.uk/~wrayner/tools/
 +
#!/bin/bash
 +
#SBATCH --qos=blanca-ibg
 +
#SBATCH --mem=40gb
 +
perl HRC-1000G-check-bim.pl -b ARIC_b37_filtered.bim -f ARIC_b37_filtered.frq -r  1000GP_Phase3_combined.legend -g -p EUR
 +
 +
 +
## Phasing using shapeit
 +
#!/bin/bash
 +
#SBATCH --mem=20gb
 +
#SBATCH --time=48:00:00
 +
#SBATCH -o shapeit_aric_%j.out
 +
#SBATCH -e shapeit_aric_%j.err
 +
#SBATCH --qos blanca-ibgc1
 +
#SBATCH --ntasks-per-node 48
 +
#SBATCH -J shapeit_aric
 +
shapeit -B ARIC_b37_filtered-updated-chr${1} -M /rc_scratch/meli7712/dbGAP/1000GP_Phase3/genetic_map_chr${1}_combined_b37.txt -O phased/ARIC_b37_filtered-updated-chr${1}.phased -T 48
 +
 +
 +
## To convert the shapeit output into vcf
 +
#!/bin/bash
 +
#SBATCH --mem=20gb
 +
#SBATCH --time=24:00:00
 +
#SBATCH -o shapeit_mesa_%j.out
 +
#SBATCH -e shapeit_mesa_%j.err
 +
#SBATCH --qos janus
 +
#SBATCH --ntasks-per-node 12
 +
#SBATCH -J shapeit_mesa
 +
shapeit -convert --input-haps mesa-chr${1}.phased --output-vcf mesa-chr${1}.phased.vcf -T 12
 +
 +
## Imputation
 +
#!/bin/bash
 +
#SBATCH --mem=30gb
 +
#SBATCH --time=72:00:00
 +
#SBATCH -o impute_mesa_%j.out
 +
#SBATCH -e impute_mesa_%j.err
 +
#SBATCH --qos blanca-ibgc1
 +
#SBATCH --ntasks-per-node 48
 +
#SBATCH -J impute_mesa
 +
 +
/work/KellerLab/Zhen/bin/Minimac3/bin/Minimac3 --haps mesa-chr${1}.phased.vcf --cpus 48 --refHaps /rc_scratch/meli7712/dbGAP/references/${1}.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz --chr ${1} --noPhoneHome --format GT,DS,GP --allTypedSites --prefix mesa-chr${1}.phased.imputed

Revision as of 00:05, 25 August 2016

Framingham

ID Mapping

The following snppit describes how the PLINK genotype file IDs (which use "sampid") were mapped to the dbGaP_Subject_ID contained in the available phenotype files. The snippet only contains one of Joyce Ling's phenotype files for illustration only.

This script can be found in /work/KellerLab/GSCAN/dbGaP/Framingham/ID_mapping.r

### ID mapping file from dbGaP_Subject_ID to SAMPID 
ID.map <- read.table(gzfile("/work/KellerLab/GSCAN/dbGaP/Framingham/PhenoGenotypeFiles/RootStudyConsentSet_phs000007.Framingham.v28.p10.c1.HMB-IRB-MDS/PhenotypeFiles/phs000007.v28.pht001415.v16.p10.Framingham_Sample.MULTI.txt.gz"), header=T, sep="\t", stringsAsFactors=F)

### Genotype files
genotype.IDs <-  read.table("/work/KellerLab/GSCAN/dbGaP/Framingham/PhenoGenotypeFiles/ChildStudyConsentSet_phs000342.Framingham.v16.p10.c1.HMB-IRB-MDS/GenotypeFiles/phg000006.v9.FHS_SHARe_Affy500K.genotype-calls-matrixfmt.c1/subject_level_PLINK_sets/FHS_SHARe_Affy500K_subjects_c1.fam", header=F)
names(genotype.IDs) <- c("famid", "SAMPID", "patid", "matid", "sex", "phenotype")

length(which(genotype.IDs$SAMPID %in% ID.map$SAMPID))
## 6954 

x <- merge(genotype.IDs, ID.map, by="SAMPID", all.x=T)
x <- x[,c(1:5,7)]

### One of Joyce's phenotype ID files (eventually all phenotypes for all available participants were merged in, but this is a good example.)
phenotypes <- read.table("OffC_Exam_1.txt", header=T, sep="\t")

xx <- merge(x, phenotypes, by="dbGaP_Subject_ID", all=TRUE)

xx <- xx[,c(2:ncol(xx),1)] ## Reorder so the dbGaP_Subject_ID is the last column 

write.table(xx, file="framingham_GSCAN_phenotypesCovariates.ped", quote=FALSE, sep="\t", row.names=F)

Genotypes

Pre-Phasing QC

QC parameters that we chose: MAF > 0.01

SNP callrate > 0.95

Missingness per individual > 0.95

HWE = 0.05 / number of markers but greater than 5e-8

To update the strand builds: http://www.well.ox.ac.uk/~wrayner/strand/


## Check strands against latest 1000G: http://www.well.ox.ac.uk/~wrayner/tools/
#!/bin/bash
#SBATCH --qos=blanca-ibg
#SBATCH --mem=40gb
perl HRC-1000G-check-bim.pl -b ARIC_b37_filtered.bim -f ARIC_b37_filtered.frq -r  1000GP_Phase3_combined.legend -g -p EUR


## Phasing using shapeit
#!/bin/bash
#SBATCH --mem=20gb
#SBATCH --time=48:00:00
#SBATCH -o shapeit_aric_%j.out
#SBATCH -e shapeit_aric_%j.err
#SBATCH --qos blanca-ibgc1
#SBATCH --ntasks-per-node 48
#SBATCH -J shapeit_aric
shapeit -B ARIC_b37_filtered-updated-chr${1} -M /rc_scratch/meli7712/dbGAP/1000GP_Phase3/genetic_map_chr${1}_combined_b37.txt -O phased/ARIC_b37_filtered-updated-chr${1}.phased -T 48

## To convert the shapeit output into vcf
#!/bin/bash
#SBATCH --mem=20gb
#SBATCH --time=24:00:00
#SBATCH -o shapeit_mesa_%j.out
#SBATCH -e shapeit_mesa_%j.err
#SBATCH --qos janus
#SBATCH --ntasks-per-node 12
#SBATCH -J shapeit_mesa
shapeit -convert --input-haps mesa-chr${1}.phased --output-vcf mesa-chr${1}.phased.vcf -T 12
## Imputation
#!/bin/bash
#SBATCH --mem=30gb
#SBATCH --time=72:00:00
#SBATCH -o impute_mesa_%j.out
#SBATCH -e impute_mesa_%j.err
#SBATCH --qos blanca-ibgc1
#SBATCH --ntasks-per-node 48
#SBATCH -J impute_mesa

/work/KellerLab/Zhen/bin/Minimac3/bin/Minimac3 --haps mesa-chr${1}.phased.vcf --cpus 48 --refHaps /rc_scratch/meli7712/dbGAP/references/${1}.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz --chr ${1} --noPhoneHome --format GT,DS,GP --allTypedSites --prefix mesa-chr${1}.phased.imputed