# Fiumara_BasePrimeEd2022_WES

Fiumara M, Ferrari F, Omer-Javed A, Beretta S et al. **Discovery and mitigation of genotoxic effects of base and prime editing in human hematopoietic stem cells.** _Nature Biotechnology_ 2023.

- WES Std: [PRJEB58344](https://www.ebi.ac.uk/ena/browser/view/PRJEB58344)
- WES Opt: [PRJEB64407](https://www.ebi.ac.uk/ena/browser/view/PRJEB64407)
- WES Colonies: [PRJEB64207](https://www.ebi.ac.uk/ena/browser/view/PRJEB64207)

---

### Analyses ###

WES data were analyzed following the GATK "Best Practice Workflows".

- Data Pre-process: the quality of the input reads was assessed using _FastQC_, and low-quality bases were trimmed using _trim-galore_
- Read Disambiguation (human vs mouse): reads were aligned to both the human and mouse reference genomes, and each read was assigned to the organism showing the best alignment
- Downsampling: samples were randomly downsampled to 300M, 230M, or 50M reads, depending on the experiment, using the _Seqtk_ toolkit
- Alignment: reads were aligned to the human genome assembly (GRCh38) using _BWA_, and duplicates were marked in the alignments using _Picard MarkDuplicates_
- GATK: _BaseRecalibrator_ + _ApplyBQSR_ were used to recalibrate base quality scores using dbSNP known sites
- Variant Calling: _HaplotypeCaller_ in GVCF mode was used to call variants in each sample, which were then combined using _CombineGVCFs_ and genotyped with _GenotypeGVCFs_
- Variant Filter: variants were filtered using _VariantFiltration_ based on 'QualityByDepth' (i.e., `--filter-expression 'QD < 2.0'`) and overall coverage 'DP' (i.e., `--filter-expression 'DP < 50'`). To identify private variants belonging to each sample, additional filters were applied: variants having low genotype quality (i.e., GQ < 80) and low coverage (i.e., DP < 100 for the stringent filter, DP < 10 for the relaxed filter) were removed
- Variant Post-process: the "Mock electro" in vitro sample of each experiment was used as germline reference, and its variants were filtered out from all other samples, together with multi-allelic variants (mainly involving repetitive sequences)
- Variant Annotation: variants were annotated using _SnpEff_ on the canonical isoform from the GRCh38.p13.RefSeq reference database

Downstream analysis of the final variants classified them by type (insertion, deletion, or SNV) and then focused on SNVs to classify mutation events. Variants were also assessed against a panel of cancer-related genes, based on their annotations.

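
The filtering and post-processing logic described above can be illustrated with a minimal Python sketch. It is not the code used for the paper: the `Call` record, its field names, and the helper functions are hypothetical, and it assumes that the private-variant DP threshold applies to the per-sample depth, that failing either the GQ or the DP threshold removes a variant, and that germline variants are matched by exact position and alleles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical record for one variant call in one sample."""
    chrom: str
    pos: int
    ref: str
    alts: tuple      # ALT alleles; more than one means multi-allelic
    qd: float        # site-level QualityByDepth
    site_dp: int     # site-level DP (overall coverage)
    gq: int          # genotype quality of this sample
    sample_dp: int   # depth of this sample (assumed meaning of DP here)


def passes_site_filters(call):
    # Mirrors the hard filters: a site is flagged when QD < 2.0 or DP < 50.
    return call.qd >= 2.0 and call.site_dp >= 50


def passes_private_filters(call, min_dp=100):
    # Stringent setting: min_dp=100; relaxed setting: min_dp=10.
    return call.gq >= 80 and call.sample_dp >= min_dp


def variant_type(ref, alt):
    """Classify a bi-allelic variant by comparing REF and ALT lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # e.g. multi-nucleotide substitutions


def private_variants(sample_calls, mock_calls, min_dp=100):
    """Keep bi-allelic, high-confidence calls absent from the mock sample."""
    germline = {(c.chrom, c.pos, c.ref, c.alts) for c in mock_calls}
    kept = []
    for c in sample_calls:
        if len(c.alts) != 1:
            continue  # drop multi-allelic sites
        if not passes_site_filters(c):
            continue
        if not passes_private_filters(c, min_dp):
            continue
        if (c.chrom, c.pos, c.ref, c.alts) in germline:
            continue  # present in the "Mock electro" germline reference
        kept.append(c)
    return kept
```

Calling `private_variants(sample_calls, mock_calls)` applies the stringent setting, while passing `min_dp=10` applies the relaxed one.
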
For WES on colonies, low-frequency variants were additionally investigated by calling variants with _Mutect2_ and then discarding those with coverage lower than 10. To enrich for variants private to each colony, including those installed by the treatment, only variants within the expected range of variant allele frequency (i.e., between 0.05 and 0.2) were kept, considering that each pool was composed of six colonies (12 alleles).
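
The reasoning behind the allele frequency window can be sketched in a few lines of Python. With 12 alleles per pool, a heterozygous variant private to one colony is expected at a VAF of 1/12 (about 0.08) and a homozygous one at 2/12 (about 0.17), both inside the 0.05 to 0.2 range. The function names below are hypothetical, and the sketch assumes that coverage is the sum of reference and alternate read counts and that the window bounds are inclusive; the source does not specify either detail.

```python
def expected_vaf(n_colonies=6, ploidy=2, alt_copies=1):
    """Expected VAF of a variant carried by a single colony in the pool."""
    return alt_copies / (n_colonies * ploidy)


def keep_low_frequency(ref_count, alt_count, min_cov=10,
                       vaf_min=0.05, vaf_max=0.2):
    """Return True if a call passes the coverage and VAF window filters."""
    coverage = ref_count + alt_count  # assumption: coverage = REF + ALT reads
    if coverage < min_cov:
        return False
    vaf = alt_count / coverage
    return vaf_min <= vaf <= vaf_max  # assumption: inclusive bounds
```

For example, `keep_low_frequency(110, 10)` evaluates a call with 120 reads and a VAF of about 0.083, which falls within the window, whereas a call at a VAF of 0.5 would be treated as shared or germline-like and dropped.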