Homework 7: Functional Annotation
Jump to navigation
Jump to search
Notes
Code for Homework
# PSYC 7102 -- Statistical Genetics
# Homework #7: Functional Annotation
# Due: December 10th
# Overview: In this homework I ask you to annotate genes in your genome
# and interpret the consequences of a few annotations.
# Try not to frequently run commands on the head node, use an interactive
# session by running the following command:
qsub -I -l walltime=23:00:00 -q short -l mem=2gb -l nodes=1:ppn=2
# Some modules you'll need
module load tabix_0.2.6
module load anno_1.0
# I encourage you to annotate your own genome, but for purposes of this
# homework I ask that you conduct all analyses of this file, which contains
# a single person from 1000 Genomes. Please copy both files to a directory
# of your choice.
/Users/scvr9332/annotation/chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz
/Users/scvr9332/annotation/chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz.tbi
###############
### STEP 1 ###
###############
# Please run the following command to annotate your genome.
anno -i chrALL.filtered.PASS.beagled.HG00096.rsID.vcf.gz \
-o chrALL.filtered.PASS.beagled.HG00096.rsID.anno.vcf.gz \
-r /Users/scvr9332/reference_data/gotcloud.ref/hs37d5.fa \
-g /Users/scvr9332/reference_data/anno/refFlat_hg19.txt.gz \
-p /Users/scvr9332/reference_data/anno/priority.txt \
-c /Users/scvr9332/reference_data/anno/codon.txt
### Take a look at all the output files (especially the top.anno.frq
### file).
###
###----------------------------------###
###----QUESTION 1, a-b (2 points)----###
###----------------------------------###
### a) How many stop gain variants are there?
### b) How many UTR5 variants?
##############
### STEP 2 ###
##############
### Recall from the 1000 Genomes paper that stop gain variants tend to
### be rare. Let's compare their frequency to that of synonymous
### variants.
zgrep '^#\|Stop_Gain' chrALL.filtered.PASS.beagled.HG00096.rsID.anno.vcf.gz | bgzip -c > stop_gains.vcf.gz
zgrep '^#\|Synonymous' chrALL.filtered.PASS.beagled.HG00096.rsID.anno.vcf.gz | bgzip -c > synonymous.vcf.gz
### The following command tells me at how many sites this individual
### has a heterozygous stop gain allele:
zgrep '\s0|1:\|\s1|0:' stop_gains.vcf.gz | wc -l
###----------------------------------###
###----QUESTION 2, a-e (5 points)----###
###----------------------------------###
### Modify the zgrep command immediately above to answer:
### a) At how many sites does this individual have a heterozygous synonymous variant?
### b) At how many sites does this individual have a homozygous stop gain?
###
### Now Take the first homozygous stop gain.
### c) What gene does this variant affect?
### d) How many transcripts of this gene does this variant affect?
### e) What other functional consequences does this variant have in
### this gene? (e.g., splice site)
##############
### STEP 3 ###
##############
### Spencer's favorite gene is TLR4. Let's take a look.
tabix chrALL.filtered.PASS.beagled.HG00096.rsID.anno.vcf.gz 9:12000000-121000000 | grep 'TLR4' | less -S
###----------------------------------###
###----QUESTION 3, a-d (4 points)----###
###----------------------------------###
### a) How many variants are annotated to lie within some component of TLR4?
### b) How many are stop gain or essential splice site?
### c) Look at rs78848399. Descibe the amino acid change that occurs
### as a result of having an alternate allele at this site.
### Now use a tabix command to extract variant rs564925636.
### d) Describe the amino acid change that occurs as a result of
### having the alternate allele at this site. How does strand impact
### your answer to this question compared to your answer to c)?