README.md 6.07 KB
Newer Older
Matteo Barcella's avatar
Matteo Barcella committed
1
2
## Introduction

Matteo Barcella's avatar
Matteo Barcella committed
3
**Whole Exome** data analysis - from fastq data to variants calling and annotation
Matteo Barcella's avatar
Matteo Barcella committed
4
As stated in supplementary methods we performed WES analysis relying mostly on the GATK best practices for WES analysis.  
Matteo Barcella's avatar
Matteo Barcella committed
5
Raw fastq data are available at [PRJEB63520](https://www.ebi.ac.uk/ena/browser/view/PRJEB63520) repository.   
Matteo Barcella's avatar
Matteo Barcella committed
6
7
8
9
10
11
12

Below the main steps performed and the relative running commands:

# Alignment #

Fastq files were aligned with [BWA aligner](https://github.com/lh3/bwa) (v0.7.17) to GRCh38 reference genome (GRCh38.p13 gencodegenes) using default parameters, except for the -M option for [Picard](https://broadinstitute.github.io/picard/) compatibility necessary for marking of duplicates.  

Matteo Barcella's avatar
Matteo Barcella committed
13
```
Matteo Barcella's avatar
Matteo Barcella committed
14
- bwa mem -t 12 -R @RG\tID:SampleID_L1\tSM:SampleID  PL:ILLUMINA -M GRCh38.p13.genome.fa 3_S8_L001_R1_001.fastq.gz 3_S8_L001_R2_001.fastq.gz` 
Matteo Barcella's avatar
Matteo Barcella committed
15
- picard SortSam INPUT=SampleID_L1.sam OUTPUT=BALL_12_27_1_L1_mouse.bam SORT_ORDER=coordinate # both lanes L1 and L2
Matteo Barcella's avatar
Matteo Barcella committed
16
17
18
19
20
21
- picard MergeSamFiles I=SampleID_L1_mouse.bam I=SampleID_L2_mouse.bam OUTPUT=SampleID_mouse.bam
- samtools index SampleID_mouse.bam
- picard MarkDuplicates INPUT=SampleID_mouse.bam OUTPUT=SampleID.dedup_reads_mouse.bam METRICS_FILE=SampleID.metrics_mouse.txt
- gatk BaseRecalibrator --input SampleID.dedup_reads_mouse.bam --reference $genome --known-sites $vreference --output SampleID_recal_data_mouse.table
- gatk ApplyBQSR --reference $genome --input SampleID.dedup_reads_mouse.bam --output SampleID.dedup_reads_mouse_recal.bam --bqsr-recal-file SampleID_recal_data_mouse.table --static-quantized-quals 10 --static-quantized-qual
- gatk BaseRecalibrator --input SampleID.dedup_reads_mouse_recal.bam --reference $genome --known-sites $vreference --output SampleID_post_recal_data_mouse.table
Matteo Barcella's avatar
Matteo Barcella committed
22
```
Matteo Barcella's avatar
Matteo Barcella committed
23

Matteo Barcella's avatar
Matteo Barcella committed
24
25
26
27
## Disambiguation ##

We perform alignment using both human and mouse reference genomes in order to perform reads disambiguation using [disambiguate](https://pubmed.ncbi.nlm.nih.gov/27990269/) and discard reads from human mapping that belong to mouse cells.  

Matteo Barcella's avatar
Matteo Barcella committed
28
29
30
```
python disambiguate.py -a bwa -s SampleID 02-Alignment_human/SampleID.dedup_reads_rehead_recal.bam 02-Alignment_mouse/SampleID.dedup_reads_mouse_recal.bam
```
Matteo Barcella's avatar
Matteo Barcella committed
31
32
# Variant calling #

Matteo Barcella's avatar
Matteo Barcella committed
33
34
35
36
37
38
39
40
41
42
43
44
45
For variant calling analysis we opted for Mutect2 algorithm (more sensitive for somatic variants identification) in tumor-only mode.  
Below the workflow and commands used:  

```
- gatk Mutect2 -I SampleID.disambiguatedSpeciesA.bam.sorted.bam -O Mutect2_raw_call_SampleID_exome_refseq_light.vcf.gz -R GRCh38.p13.genome.fa -L exome_light.bed -G StandardMutectAnnotation -OVI true -OVM true -ip 50
- gatk FilterMutectCalls -R GRCh38.p13.genome.fa -V Mutect2_raw_call_SampleID_exome_refseq_light.vcf.gz -O Mutect2_raw_call_SampleID_exome_refseq_light_filtered.vcf.gz
```

# Annotation #

We implemented several layers of annotation using different datasets including [SNPeff](http://pcingola.github.io/SnpEff/) (v.86), [dbSNP](https://www.ncbi.nlm.nih.gov/snp/) (v.152), [dbNSFP4](http://database.liulab.science/dbNSFP) (4.0a for academic). 

```
Matteo Barcella's avatar
Matteo Barcella committed
46
snpEff -v -csvStats SampleID_rawstats_canon.csv -s SampleID_rawstats_canon.html -canon GRCh38.86 Mutect2_raw_call_SampleID_exome_refseq_light_filtered.vcf.gz > SampleID_raw_annot_refseq_light_filtered.vcf
Matteo Barcella's avatar
Matteo Barcella committed
47
48
49

# dbSNP annotation

Matteo Barcella's avatar
Matteo Barcella committed
50
51
SnpSift annotate $dbsnp SampleID_raw_annot_refseq_light_filtered.vcf > SampleID_dbsnp_refseq_light_filtered.vcf
gatk IndexFeatureFile -I SampleID_dbsnp_refseq_light_filtered.vcf
Matteo Barcella's avatar
Matteo Barcella committed
52
53
54

# Cosmic annotation (coding)

Matteo Barcella's avatar
Matteo Barcella committed
55
56
SnpSift annotate $cosmic SampleID_dbsnp_refseq_light_filtered.vcf > SampleID_dbsnp_cc_refseq_light_filtered.vcf
gatk IndexFeatureFile -I SampleID_dbsnp_cc_refseq_light_filtered.vcf
Matteo Barcella's avatar
Matteo Barcella committed
57
58
59

# Cosmic annotation (non coding)

Matteo Barcella's avatar
Matteo Barcella committed
60
61
SnpSift annotate $cosmicnc SampleID_dbsnp_cc_refseq_light_filtered.vcf > SampleID_dbsnp_cc_cnc_refseq_light_filtered.vcf
gatk IndexFeatureFile -I SampleID_dbsnp_cc_cnc_refseq_light_filtered.vcf
Matteo Barcella's avatar
Matteo Barcella committed
62
63
64

# dbNSFP4 annotation 

Matteo Barcella's avatar
Matteo Barcella committed
65
66
67
68
69
70
71
72
73
74
SnpSift dbnsfp -db dbNSFP4.0a/dbNSFP4.0a.txt.gz -v SampleID_dbsnp_cc_cnc_refseq_light_filtered.vcf  > SampleID_dbsnp_cc_cnc_dbnsfp_refseq_light_filtered.vcf

# NOTE: a pass-only version is also produced by picking-up only PASS variants:  

bcftools_viewCommand=view -f PASS -O z Mutect2_raw_call_SampleID_exome_refseq_light_filtered_DP10.vcf.gz
- SnpEff  -csvStats SampleID_rawstats_canon.csv -s SampleID_rawstats_canon.html GRCh38.86 Mutect2_raw_call_SampleID_exome_refseq_light_filtered_DP10_PASS.vcf.gz
- SnpSift Annotate dbSNP.v152.vcf.gz SampleID_raw_annot_refseq_light.vcf
- SnpSift Annotate CosmicCodingMuts.vcf.gz SampleID_dbsnp_refseq_light.vcf
- SnpSift Annotate CosmicNonCodingVariants.vcf.gz SampleID_dbsnp_cc_refseq_light.vcf
- SnpSift DbNsfp SampleID_dbsnp_cc_cnc_refseq_light.vcf
Matteo Barcella's avatar
Matteo Barcella committed
75
76
77
78
79
80
81

```

# Field extraction for AF comparison #

We leverage SnpSift extractfield function for selecting fields of interest and performing Allele Fraction analysis.

Matteo Barcella's avatar
Matteo Barcella committed
82
```
Matteo Barcella's avatar
Matteo Barcella committed
83
84
- SnpSift extractFields SampleID_dbsnp_cc_cnc_dbnsfp_refseq_light_filtered.vcf CHROM POS REF ALT GEN[0].AF GEN[0].DP GEN[0].AD[0] GEN[0].AD[1] GEN[0].GT ID FILTER ANN[0].HGVS_P ANN[0].GENE ANN[0].BIOTYPE ANN[0].RANK ANN[0].EFFECT ANN[0].IMPACT COMMON G5 dbNSFP_ExAC_Adj_AF dbNSFP_1000Gp3_AF dbNSFP_ExAC_AF dbNSFP_phastCons100way_vertebrate dbNSFP_FATHMM_pred dbNSFP_GERP___RS dbNSFP_GERP___NR dbNSFP_CADD_phred dbNSFP_MetaSVM_pred dbNSFP_LRT_pred dbNSFP_PROVEAN_pred dbNSFP_MutationTaster_pred dbNSFP_MutationAssessor_pred dbNSFP_SIFT_pred dbNSFP_Polyphen2_HVAR_pred dbNSFP_Polyphen2_HDIV_pred > SampleID_dbsnp_cc_cnc_dbnsfp_refseq_light_raw_filtered.fields.txt

Matteo Barcella's avatar
Matteo Barcella committed
85
- SnpSift extractFields SampleID_dbsnp_cc_cnc_dbnsfp_refseq_light.vcf CHROM POS REF ALT GEN[0].AF GEN[0].DP GEN[0].AD[0] GEN[0].AD[1] GEN[0].GT ID FILTER ANN[0].HGVS_P ANN[0].GENE ANN[0].BIOTYPE ANN[0].RANK ANN[0].EFFECT ANN[0].IMPACT COMMON G5 dbNSFP_ExAC_Adj_AF dbNSFP_1000Gp3_AF dbNSFP_ExAC_AF dbNSFP_phastCons100way_vertebrate dbNSFP_FATHMM_pred dbNSFP_GERP___RS dbNSFP_GERP___NR dbNSFP_CADD_phred dbNSFP_MetaSVM_pred dbNSFP_LRT_pred dbNSFP_PROVEAN_pred dbNSFP_MutationTaster_pred dbNSFP_MutationAssessor_pred dbNSFP_SIFT_pred dbNSFP_Polyphen2_HVAR_pred dbNSFP_Polyphen2_HDIV_pred > SampleID_dbsnp_cc_cnc_dbnsfp_refseq_light.fields.txt
Matteo Barcella's avatar
Matteo Barcella committed
86

Matteo Barcella's avatar
Matteo Barcella committed
87
88
89
90
```



Matteo Barcella's avatar
Matteo Barcella committed
91
92
93



Matteo Barcella's avatar
Matteo Barcella committed
94
95