The raw FASTQ files are available in:
Control/Input data => https://www.encodeproject.org/experiments/ENCSR000BPG/
Nanog data => https://www.encodeproject.org/experiments/ENCSR000BMT/
Pou5f1 (Oct4) data => https://www.encodeproject.org/experiments/ENCSR000BMU/
A useful link to find out more about Blacklisted Regions were identified and also useful links to relevant files and datasets:
Since the genotype information is stored in a different structure than other variant information, we need to use a specific syntax to filter. Normally, if we had sample information loaded via a PED file we could access the sample by subject ID. Since we don't have that information we need to apply the syntax of the wildcard.
The first two parentheses specify the genotype information we want and the selected samples we wish to evaluate. The second set of parentheses allow us to filter based on criteria/levels within the genotype information we specified and which entries we wish to retrieve (i.e. all, any, none)
Applying this to the filter we had tried in class, we get all variants that are HET in all samples (which in our case is just the one)
--gt-filter “(gt_types).(*).(== HET).(all)”
The slides uploaded in Session V for "Variant Annotation and Prioritization" have now been updated to include instructions on how to create a database (slide 13). In class, we provided this for you in the interest of time, as you can imagine compiling that much information can take some time.
If you plan on trying this please note that you will need to re-create a snpEff annotated VCF.
snpEff -Xmx2G -i vcf -o vcf -classic -formatEff -dataDir /home/mm573/HBC-NGS/var-calling/reference/snpeff/ hg19 na12878_q20_annot.vcf > na12878_new_snpEff.vcf
The additional parameters -classic and -formatEff are required because of version incompatibility. The new version of snpEff stores all annotation in the ANN field of your VCF INFO column. GEMINI has not been updated to handle the changes in output from the new snpEff and is expecting information to be store in a field called EFF. By adding the -classic and -formatEff the results are written using the old format with EFF.
To load your VCF into GEMINI:
gemini load -v na12878_new_snpEff.vcf -t snpEff na12878_new.db
For larger (full datasets) you will want to also add the parameter to specify cores for multi-threading:
snpEff is dependent on Java 7, which is not loaded by default on Orchestra. Before using snpeEff be sure to load the module for Java 7:
module load dev/java/jdk1.7
When annotating our VCF using snpEff, we encountered a problem with accessing the genome database which was located within our home directory. Rather than giving the full path to snpEff we indicated the path relative to our current directory ('variants'). This was the source of the error.
In class we used:
-dataDir ../reference/snpeff hg19
What we should use instead (user_name is your Orchestra login):
-dataDir /home/user_name/HBC-NGS/var-calling/reference/snpeff hg19
It seems that snpEff (which is a Java program) is unable to interpret the unix-specific notation of moving up a parent directory, and requires the full path. If you remember with Trimmomatic (also a Java program) we ran into a similar problem with the ~ notation for home directory. Gotta love Java.
Often you will find with bioinformatic tools, is that the output from one tool may not be compatible with another. If you are having this problem there is a good chance someone else has also encountered the same thing. For class, we incorporated Picard tools (which is part of the GATK pipeline) and found that the output from bwa was offensive in cases where the read was unmapped (based on flags) but the MAPQ not set to zero.
ERROR: Record 643, Read name FCC1LGYACXX:3:1314:10126:29246, MAPQ should be 0 for unmapped read
Searching through the GATK forums, I found that this was an issue specific to bwa:
"Ultimately, this is a BWA issue - it's the program generating the offending alignments. But this particular error is pretty inoffensive - as long as the unmapped flag is set, the MAPQ is unimportant to just about any program out there. So in my view, using LENIENT validation in Picard to avoid this issue is okay (I don't believe GATK cares about this error)"
In the last block of code in the edgeR script there is a mistake. In order to generate a vector of logical values that corresponds to the significant genes, you will need to run the following code instead of what is currently in the script :
# Get significant genes sigKD <- decideTestsDGE(lrt_kd, adjust.method="BH", p.value = 0.01) sigKD <- as.logical(abs(sigKD) > 0) length(which(sigKD)) sigOE <- decideTestsDGE(lrt_oe, adjust.method="BH", p.value = 0.01) sigOE <- as.logical(abs(sigOE) > 0) length(which(sigOE))
Our access to our hbctraining email is functional again. Please feel free to email us again at firstname.lastname@example.org
Just a reminder that office hours are 2-4pm today in TMEC448.
The answer key for session II homework can be found at: https://wiki.harvard.edu/confluence/x/bo6PCg
We have received several questions about extracting substrings in Unix. For example, if you have an alphanumeric sequence such as "askdjg_235423_dkfjsn", how would you extract just the numeric portion. Please see the following stackoverflow post for a couple of simple methods: http://stackoverflow.com/questions/428109/extract-substring-in-bash
The twitter hashtag for the NGS Data Analysis Course is #hbcngs2015.
Our email@example.com email is functioning again. Please feel free to email us at this address.
HSPH has disconnected our access to the hbctraining email address. We hope to gain access to our hbctraining account by the end of the day today, but if you have any questions please feel free to contact us at our personal emails: firstname.lastname@example.org, email@example.com, and firstname.lastname@example.org. We will post another announcement as soon as the email is up.
Also, please contact us if you previously sent an email and did not receive a response from us.
Due to the Boston marathon and Patriot's Day occurring on April 20th, we have moved the dates for Session III: RNA-Seq Part I. Session III will now take place on April 21st - April 22nd. The course agenda is up-to-date.