WES data were analyzed following the GATK "Best Practices Workflows".
- Data Pre-process: read quality was assessed using FastQC, and low-quality bases were trimmed using Trim Galore
- Read Disambiguation (human vs mouse): reads were aligned to both the human and mouse reference genomes, and each read was then assigned to the organism showing the better alignment
- Downsampling: samples were randomly downsampled to 300M, 230M, or 50M reads, depending on the experiment, using the Seqtk toolkit
- Alignment: reads were aligned to the human genome assembly (GRCh38) using BWA, and duplicates were marked in the alignments using Picard MarkDuplicates
- GATK: BaseRecalibrator and ApplyBQSR were used to recalibrate base quality scores, with dbSNP variants as known sites
- Variant Calling: HaplotypeCaller in GVCF mode was used to call variants in each sample; the per-sample GVCFs were then combined using CombineGVCFs and genotyped with GenotypeGVCFs
- Variant Filter: variants were filtered using VariantFiltration based on their quality by depth 'QD' (i.e., --filter-expression 'QD < 2.0') and overall coverage 'DP' (i.e., --filter-expression 'DP < 50'). To identify variants private to each sample, additional filters were applied: variants with low genotype quality (i.e., GQ < 80) and low coverage (i.e., DP < 100 for the stringent setting and DP < 10 for the relaxed setting) were removed
- Variant Post-process: for each experiment, the "Mock electro" in vitro sample was used as the germline reference, and its variants were filtered out from all other samples; multi-allelic variants (mainly involving repetitive sequences) were also removed
- Variant Annotation: variants were annotated using SnpEff on the canonical isoform from the GRCh38.p13.RefSeq reference database
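
The filtering and post-processing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline itself: the record layout (dicts with chrom, pos, ref, alts, qd, dp, gq, sample_dp), the function names, the use of per-sample depth for the private-variant filter, and the reading that a call is removed when either the GQ or the DP criterion fails are all hypothetical choices made for the example. In the workflow described, the site-level filters were applied with GATK VariantFiltration.

```python
# Thresholds quoted in the text; the dict layout and names are illustrative.
MIN_QD = 2.0
MIN_SITE_DP = 50
MIN_GQ = 80
MIN_SAMPLE_DP = {"stringent": 100, "relaxed": 10}


def variant_key(v):
    """Identify a variant by position and alleles."""
    return (v["chrom"], v["pos"], v["ref"], tuple(v["alts"]))


def passes_site_filters(v):
    """Mirror the two site-level expressions: fail if QD < 2.0 or DP < 50."""
    return v["qd"] >= MIN_QD and v["dp"] >= MIN_SITE_DP


def passes_private_filters(v, mode="stringent"):
    """Assumed reading: drop a call if GQ < 80 or its sample depth is too low."""
    return v["gq"] >= MIN_GQ and v["sample_dp"] >= MIN_SAMPLE_DP[mode]


def private_variants(sample, germline, mode="stringent"):
    """Keep filtered, biallelic calls that are absent from the germline reference."""
    germline_keys = {variant_key(v) for v in germline}
    kept = []
    for v in sample:
        if not passes_site_filters(v) or not passes_private_filters(v, mode):
            continue
        if len(v["alts"]) != 1:  # multi-allelic site
            continue
        if variant_key(v) in germline_keys:  # shared with the "Mock electro" sample
            continue
        kept.append(v)
    return kept
```

Calling private_variants(sample_calls, mock_electro_calls, mode="relaxed") would apply the DP < 10 setting instead of DP < 100; both argument names are placeholders for lists of such records.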
In the downstream analysis, the final variants were classified by type (insertion, deletion, or SNV), and all SNVs were then examined to classify mutation events. Variants falling in a panel of cancer-related genes were assessed based on their annotations.
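
As an illustration of this classification, the sketch below assigns a type to each biallelic variant from its REF and ALT alleles and collapses SNVs into the six pyrimidine-centred substitution classes (C>A, C>G, C>T, T>A, T>C, T>G). The six-class convention and all function names are assumptions made for the example; the text does not state which scheme was used to classify mutation events.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def variant_type(ref, alt):
    """Classify a biallelic variant by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "other"  # equal-length substitutions longer than one base


def substitution_class(ref, alt):
    """Return the pyrimidine-centred class of an SNV, e.g. G>A becomes C>T."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a valid SNV: {ref}>{alt}")
    if ref in ("A", "G"):  # purine reference: report the complementary strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def mutation_spectrum(variants):
    """Count substitution classes over (ref, alt) pairs, ignoring non-SNVs."""
    return Counter(
        substitution_class(ref, alt)
        for ref, alt in variants
        if variant_type(ref, alt) == "SNV"
    )
```

Here mutation_spectrum takes an iterable of (ref, alt) pairs, so insertions and deletions can be passed in the same list and are simply skipped.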
For WES on colonies, low-frequency variants were additionally examined by calling variants with Mutect2 and then filtering out those with coverage lower than 10. To enrich for variants private to each colony, including those installed by the treatment, only variants within the expected range of variant allele frequency (i.e., between 0.05 and 0.2) were kept, considering that each pool was composed of six colonies (12 alleles).
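
The allele-frequency window can be motivated with a short calculation: assuming diploid colonies that contribute equally to the pool, a heterozygous variant private to one of six colonies is expected at a VAF of 1/12 ≈ 0.083, and a homozygous one at 2/12 ≈ 0.167, both inside the 0.05 to 0.2 window. The sketch below applies the coverage and VAF filters to read counts. The argument names (ref_count, alt_count), the definition of coverage as reference plus alternate reads, and the inclusive bounds are assumptions made for the example rather than details given in the text.

```python
N_COLONIES = 6
TOTAL_ALLELES = 2 * N_COLONIES  # 12 alleles per pool, assuming diploid colonies
MIN_COVERAGE = 10
VAF_RANGE = (0.05, 0.2)


def expected_vaf(mutant_alleles, total_alleles=TOTAL_ALLELES):
    """Expected VAF when mutant_alleles of the pooled alleles carry the variant."""
    return mutant_alleles / total_alleles


def vaf(ref_count, alt_count):
    """Variant allele frequency from reference and alternate read counts."""
    depth = ref_count + alt_count
    return alt_count / depth if depth else 0.0


def keep_variant(ref_count, alt_count):
    """Keep a call with coverage of at least 10 and a VAF inside the window."""
    if ref_count + alt_count < MIN_COVERAGE:
        return False
    low, high = VAF_RANGE
    return low <= vaf(ref_count, alt_count) <= high
```

For example, a call with 92 reference and 8 alternate reads (VAF 0.08) would be kept, whereas a call at VAF 0.5, as expected for a variant shared by all colonies, would be discarded.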