# Fiumara_BasePrimeEd2022_RNAseq

Fiumara M, Ferrari F, Omer-Javed A, Beretta S et al.
**Discovery and mitigation of genotoxic effects of base and prime editing in human hematopoietic stem cells.**
_Nature Biotechnology_ 2023.
- RNA-seq Base Editing: [GSE218462](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE218462)
- RNA-seq Prime Editing: [GSE218463](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE218463)

---

### Analyses ###

RNA-seq analysis to compare Treated (BE4, ABE8, Cas9, Mock electro) vs Untreated samples at Day1 or Day3 (GSE218462), and Prime edited vs Mock electro samples (GSE218463):
- Input pre-processing with _FastQC_ and quality trimming with _Trimmomatic_
- Read alignment to the human genome assembly (GRCh38) with _STAR_ using standard parameters
- Gene quantification with _featureCounts_
- Differential Gene Expression (DGE) analysis with the R/Bioconductor package _DESeq2_: genes with FDR < 0.05 were considered differentially expressed
- Functional enrichment post-analyses with the R/Bioconductor package _clusterProfiler_, using the Hallmark collection from _MSigDB_ as reference database
- Visualization of the (spliced) alignments on the TP73 gene with the Integrative Genomics Viewer (_IGV_)
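
The FDR < 0.05 selection in the DGE step can be illustrated with a minimal Python sketch. It assumes that "FDR" refers to Benjamini-Hochberg adjusted p-values (the default adjustment in _DESeq2_); the function names and the toy p-values are hypothetical and are not part of this repository, and _DESeq2_'s independent filtering is not reproduced here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_genes(pvalues_by_gene, fdr_cutoff=0.05):
    """Keep genes whose adjusted p-value is below the FDR cutoff."""
    genes = list(pvalues_by_gene)
    padj = benjamini_hochberg([pvalues_by_gene[g] for g in genes])
    return {g: q for g, q in zip(genes, padj) if q < fdr_cutoff}


if __name__ == "__main__":
    # Hypothetical raw p-values, for illustration only.
    toy = {"GENE_A": 0.001, "GENE_B": 0.5, "GENE_C": 0.04}
    print(significant_genes(toy))
```

Note that a raw p-value below 0.05 does not guarantee selection: the cutoff is applied after multiple-testing correction.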

Variant calling analysis on RNA-seq base editing data (GSE218462):
- Merging of reads from replicates of each condition and downsampling to 120M reads with _Seqtk_
- Alignment to the human genome assembly (GRCh38) with _STAR_
- Duplicate marking with _Picard MarkDuplicates_ and splitting of reads containing Ns in their CIGAR strings with _GATK SplitNCigarReads_
- Variant calling using three different tools: _HaplotypeCaller_ (with options `--min-base-quality-score 20`, `--dont-use-soft-clipped-bases`, and `--standard-min-confidence-threshold-for-calling 20`), _Mutect2_ (in tumor-only mode, with option `--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter`), and _FreeBayes_
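
As a rough illustration of how the three callers might be invoked with the options listed above, the following Python sketch builds the command lines without running them. The file paths are hypothetical placeholders, the use of the `gatk` wrapper script is an assumption, and any options not listed above (threads, intervals, memory) are deliberately left out; the exact commands used for the paper may differ.

```python
import shlex


def haplotypecaller_cmd(reference, bam, out_vcf):
    """Command for GATK HaplotypeCaller with the options listed above."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference, "-I", bam, "-O", out_vcf,
        "--min-base-quality-score", "20",
        "--dont-use-soft-clipped-bases",
        "--standard-min-confidence-threshold-for-calling", "20",
    ]


def mutect2_cmd(reference, bam, out_vcf):
    """Command for GATK Mutect2 in tumor-only mode (no matched normal)."""
    return [
        "gatk", "Mutect2",
        "-R", reference, "-I", bam, "-O", out_vcf,
        "--disable-read-filter", "MateOnSameContigOrNoMappedMateReadFilter",
    ]


def freebayes_cmd(reference, bam):
    """Command for FreeBayes with default settings; the VCF goes to stdout."""
    return ["freebayes", "-f", reference, bam]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    ref, bam = "GRCh38.fa", "sample.split.bam"
    for cmd in (
        haplotypecaller_cmd(ref, bam, "sample.hc.vcf.gz"),
        mutect2_cmd(ref, bam, "sample.mutect2.vcf.gz"),
        freebayes_cmd(ref, bam),
    ):
        print(shlex.join(cmd))
```

Building the commands as argument lists makes them easy to pass to `subprocess.run` later without shell quoting issues.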

The nucleotide composition at each position was also assessed with [_REDItools_](https://github.com/tflati/reditools2.0) on each sample, discarding all positions with coverage lower than 20 and bases with quality lower than 30 to avoid errors due to low sampling.<br>
Variants called by each tool in the untreated controls were filtered out of the treated samples to enrich for private mutations. This procedure retained only variants at high-quality genomic positions in both treated and untreated samples, for which the untreated sample showed ≥ 99% of reads supporting the reference (non-mutant) base at the position of the mutation (based on REDItools).<br>
The final list of variants for each sample consists of the variants called by all three tools that passed the filtering procedure (intersection).
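
The filtering and intersection logic described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the paper: variants are represented as hypothetical `(chrom, pos, ref, alt)` tuples, and `SiteStats` is a hypothetical per-position summary assumed to have been derived from REDItools output after the base-quality filter was applied.

```python
from typing import NamedTuple


class SiteStats(NamedTuple):
    """Hypothetical per-position summary derived from REDItools output."""
    coverage: int        # reads covering the position (base quality >= 30)
    ref_fraction: float  # fraction of those reads supporting the reference


def private_consensus_variants(treated_calls, untreated_calls,
                               treated_sites, untreated_sites,
                               min_coverage=20, min_ref_fraction=0.99):
    """Return variants called by every tool that pass the control filters.

    treated_calls / untreated_calls: dict of caller name -> set of variants.
    treated_sites / untreated_sites: dict of (chrom, pos) -> SiteStats.
    """
    consensus = None
    for caller, calls in treated_calls.items():
        # Drop variants the same tool also called in the untreated control.
        private = calls - untreated_calls.get(caller, set())
        kept = set()
        for variant in private:
            site = variant[:2]
            treated = treated_sites.get(site)
            untreated = untreated_sites.get(site)
            # The position must be high quality in both samples.
            if treated is None or untreated is None:
                continue
            if treated.coverage < min_coverage or untreated.coverage < min_coverage:
                continue
            # The control must be (almost) purely reference at this position.
            if untreated.ref_fraction < min_ref_fraction:
                continue
            kept.add(variant)
        # Intersect across callers.
        consensus = kept if consensus is None else consensus & kept
    return consensus or set()
```

Filtering per caller before intersecting gives the same result as intersecting first, since the position-level filters do not depend on the caller; the per-caller form simply mirrors the order of the description.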