scRNA-seq

Single Cell RNA-seq (scRNA-seq) is a technique used to examine the transcriptome from individual cells within a population using next-generation sequencing (NGS) technologies. It provides information about heterogeneity in a given population of cells or a tissue and it allows the identification of rare cell types. Several technologies have been developed so far, but the most popular are the plate-based Smart-Seq2 and microdroplet-based 10x Chromium. A detailed course can be found here, linked to the video on the right.

General guidelines:

- The preferred platform depends on the biological question and the type of sample available. For example, plate-based technologies (Smart-Seq2) are preferable when studying a rare cell type that can be FACS-sorted, because fewer cells are sequenced and the sequencing depth per cell can be increased. Droplet-based technologies (10x Chromium) are better suited when a larger number of cells needs to be sequenced, for example when characterising the composition of a tissue. A good comparison of methods can be found here.

- For both platforms, it is crucial to first identify the best dissociation method for the cells/tissue of interest, in order to avoid doublets or excessive cell death.

- For droplet-based methods, 10x Genomics recommends sequencing a minimum of 20,000 read pairs/cell for Single Cell 3' v3 and Single Cell 5' gene expression libraries and 50,000 read pairs/cell for Single Cell 3' v2 libraries. More details here.

- scRNA-seq libraries are typically sequenced paired-end.

Pipeline for Data Analysis

STEP 1 : Sequencing QC

Downloading raw data

Raw reads are stored as FASTQ files. File extensions include .fastq or .fq, and .fastq.gz (gzip-compressed). Sequencing data can also be archived by the Sequence Read Archive (SRA) as .sra files, which are commonly found through public repositories such as GEO. SRA files can be converted into FASTQ using fastq-dump from the sratoolkit.
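As a minimal sketch of the FASTQ layout mentioned above (four lines per read: header, sequence, separator, quality string), the following Python reads plain or gzip-compressed FASTQ files. The function name `read_fastq` and the tiny demo file are illustrative assumptions and are not part of any of the toolkits listed here.

```python
import gzip
import os
import tempfile


def read_fastq(path):
    """Yield (read_id, sequence, quality) tuples from a .fastq/.fq or .gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            if not header.startswith("@") or len(sequence) != len(quality):
                raise ValueError(f"Malformed FASTQ record: {header!r}")
            # Keep only the read name, dropping the '@' and any description.
            yield header[1:].split()[0], sequence, quality


# Hypothetical two-read file, written only to demonstrate the parser.
demo_records = "@read1 sample\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n"
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(demo_records)

reads = list(read_fastq(demo_path))
print(reads)
```

For real data, dedicated tools such as FastQC remain the recommended route; a parser like this is only useful for quick inspection.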

File type: SRA

Tool: sratoolkit

File type: FASTQ

Tools: FastQC, FastX-Toolkit, cutadapt, trimmomatic

QC on FASTQ reads

It is recommended to perform quality control checks on the FASTQ data using FastQC. The first step is to trim adapters and low-quality bases. It is also recommended to trim reads to the same fixed length for all samples prior to alignment.
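The idea of trimming to a fixed length and discarding low-quality reads can be sketched in a few lines of Python. This is a simplified illustration, not a replacement for cutadapt or trimmomatic: it does no adapter trimming, it assumes Phred+33 quality encoding, and the function names and thresholds are hypothetical.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def trim_to_length(sequence, quality, length):
    """Trim a read and its quality string to at most `length` bases."""
    return sequence[:length], quality[:length]


def filter_and_trim(reads, length, min_mean_quality):
    """Keep reads of at least `length` bases whose trimmed mean quality passes."""
    kept = []
    for read_id, sequence, quality in reads:
        if len(sequence) < length:
            continue  # too short to reach the fixed length
        seq, qual = trim_to_length(sequence, quality, length)
        if mean_phred(qual) >= min_mean_quality:
            kept.append((read_id, seq, qual))
    return kept


example_reads = [
    ("r1", "ACGTACGT", "IIIIIIII"),  # 'I' encodes Phred 40
    ("r2", "ACGTACGT", "########"),  # '#' encodes Phred 2
    ("r3", "ACG", "III"),            # shorter than the target length
]
trimmed = filter_and_trim(example_reads, length=6, min_mean_quality=20)
print(trimmed)
```

Only `r1` survives here, trimmed to six bases; the length and quality cut-offs are arbitrary values chosen for the demonstration.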

Cell Ranger

Cell Ranger is a pipeline for 10x Chromium data to align reads, generate feature-barcode matrices and perform clustering and gene expression analysis. If using this pipeline, proceed directly to Step 5.

Cell Ranger also generates cloupe files that can be directly visualised using Loupe Browser.

STEP 2 : Alignment

Mapping reads

After FASTQ files have undergone the necessary QC, they have to be mapped to a reference genome. It is important that all samples being compared are mapped to the same version of the genome (genome assembly). The first step is to download an index for the genome of choice, and a genome annotation (GTF or GFF files). The alignment step tends to be the most time-consuming and the files generated are very large (several GB).

File type: FASTQ

Tools: STAR, Kallisto

File types: BAM, SAM

File formats

STAR aligns reads to a reference genome, whereas Kallisto is a pseudo-aligner, which matches the k-mers of each read to a reference transcriptome instead. STAR outputs aligned reads in SAM/BAM format; Kallisto primarily reports abundance estimates and can optionally write its pseudoalignments as BAM. Mapped reads can also be stored in CRAM format (a compressed and smaller alternative to SAM/BAM; see cramtools for more). CRAM can be converted to BAM using samtools view. Aligned reads in BAM can also be converted back into FASTQ using samtools fastq, if the alignment needs to be redone (e.g. with a different genome assembly or a different aligner).

QC on aligned reads

BAM files can be filtered for mapping quality with the samtools view command, bamtools or RSeQC.
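To make the mapping-quality filter concrete, here is a minimal Python sketch that works on SAM text lines, where MAPQ is the fifth tab-separated column and FLAG bit 0x4 marks unmapped reads. It mimics in spirit what a minimum-MAPQ filter in samtools view does; the function name `filter_sam_by_mapq`, the example records and the threshold of 30 are assumptions for illustration.

```python
def filter_sam_by_mapq(sam_lines, min_mapq):
    """Keep header lines and mapped alignments with MAPQ >= min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:
            continue  # read is unmapped
        if mapq >= min_mapq:
            kept.append(line)
    return kept


# Hypothetical SAM records: one header, one confident, one ambiguous, one unmapped.
sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
filtered = filter_sam_by_mapq(sam, min_mapq=30)
print("\n".join(filtered))
```

In practice BAM files are binary and should be filtered with the tools named above; the sketch only shows which fields the filter inspects.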

STEP 3 : Quantification

Quantifying genes

After aligning reads to the genome, the reads need to be counted in order to produce a counts matrix (genes x cells). Typically this will be a .txt or .csv file, to be used as input for downstream analysis with count-based statistical methods. You can generate a count matrix from aligned reads (BAM file) using the featureCounts function in the Rsubread R package or HTSeq, or by counting unique molecular identifiers (UMIs).
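The shape of the resulting counts matrix can be illustrated with a short Python sketch. It assumes each read has already been assigned to a gene and a cell (the job done by featureCounts or HTSeq); the helper names and the toy assignments are hypothetical.

```python
import csv
import io
from collections import Counter


def build_count_matrix(assignments):
    """Return (genes, cells, matrix) where matrix[i][j] counts gene i in cell j."""
    counts = Counter(assignments)
    genes = sorted({gene for gene, _ in counts})
    cells = sorted({cell for _, cell in counts})
    matrix = [[counts[(gene, cell)] for cell in cells] for gene in genes]
    return genes, cells, matrix


def to_csv(genes, cells, matrix):
    """Serialise the genes x cells matrix as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["gene"] + cells)
    for gene, row in zip(genes, matrix):
        writer.writerow([gene] + row)
    return buffer.getvalue()


# Hypothetical (gene, cell) assignment for four reads.
assignments = [
    ("GeneA", "cell1"),
    ("GeneA", "cell1"),
    ("GeneB", "cell2"),
    ("GeneA", "cell2"),
]
genes, cells, matrix = build_count_matrix(assignments)
csv_text = to_csv(genes, cells, matrix)
print(csv_text)
```

Rows are genes and columns are cells, matching the orientation described above.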

Tools: Rsubread, UMI

File type: BAM

File type: Counts

To store as: SingleCellExperiment

Unique molecular identifiers (UMIs)

UMIs are short (4-10bp) random tags which are added to the mRNAs during library prep at the reverse-transcription step. Using these tags, each read can be assigned to one transcript molecule, removing amplification biases.
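A minimal Python sketch of UMI-based counting follows: reads that share the same cell barcode, gene and UMI are treated as amplification copies of one molecule and counted once. This exact-match version is a simplification (dedicated tools also handle sequencing errors within UMIs), and the function name and example barcodes are hypothetical.

```python
from collections import defaultdict


def count_umis(read_records):
    """Count distinct UMIs per (cell, gene); duplicates collapse to one molecule."""
    umis = defaultdict(set)
    for cell_barcode, umi, gene in read_records:
        umis[(cell_barcode, gene)].add(umi)
    return {key: len(values) for key, values in umis.items()}


# Hypothetical reads as (cell barcode, UMI, gene).
records = [
    ("AAAC", "TTGCA", "GeneA"),
    ("AAAC", "TTGCA", "GeneA"),  # PCR duplicate of the read above
    ("AAAC", "CCGTA", "GeneA"),  # second molecule in the same cell
    ("GGGT", "TTGCA", "GeneA"),  # same UMI, but a different cell
]
molecule_counts = count_umis(records)
print(molecule_counts)
```

Four reads collapse to three molecules: two for GeneA in cell AAAC and one in cell GGGT.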

The SingleCellExperiment class is an R (Bioconductor) data container for scRNA-seq data, in which rows represent features (genes, transcripts, genomic regions) and columns represent cells. It provides methods for storing metadata for genes and libraries, as well as dimensionality reduction coordinates and data for alternative feature sets.

STEP 4 : Expression Matrix

Matrix QC

Once the expression matrix is stored in a SingleCellExperiment, a few additional steps to clean up and normalise the data are necessary before the final analysis. For example, a low number of detected genes or a high proportion of mitochondrial counts is indicative of poor-quality cells (e.g. cells with a broken membrane). In contrast, cells with very high counts and a large number of detected genes could be doublets. QC therefore often includes upper and lower count thresholds and mtRNA filtering. It is then important to normalise for library size, and finally to investigate whether batch correction is necessary between the samples being compared.
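The QC logic described above can be sketched in plain Python. This is an illustration under stated assumptions rather than what scater or scran do internally: mitochondrial genes are assumed to carry an "MT-" prefix, the thresholds are arbitrary, and scaling each cell to 10,000 counts before a log transform is just one common normalisation choice.

```python
import math


def cell_qc_metrics(genes, matrix):
    """Per-cell total counts, detected genes and mitochondrial fraction.

    `matrix` is genes x cells; mitochondrial genes are assumed to start with 'MT-'.
    """
    metrics = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        total = sum(column)
        detected = sum(1 for value in column if value > 0)
        mito = sum(v for g, v in zip(genes, column) if g.upper().startswith("MT-"))
        metrics.append({
            "total": total,
            "detected": detected,
            "mito_fraction": mito / total if total else 0.0,
        })
    return metrics


def passes_qc(metric, min_counts, max_counts, min_genes, max_mito_fraction):
    """Apply low/high count, detected-gene and mitochondrial thresholds."""
    return (min_counts <= metric["total"] <= max_counts
            and metric["detected"] >= min_genes
            and metric["mito_fraction"] <= max_mito_fraction)


def normalise_column(column, target_sum=10_000):
    """Scale one cell to `target_sum` total counts, then log1p-transform."""
    total = sum(column)
    return [math.log1p(value * target_sum / total) for value in column]


# Hypothetical genes x cells matrix with three cells.
qc_genes = ["ACTB", "GAPDH", "MT-CO1"]
qc_matrix = [
    [50, 1, 400],
    [30, 0, 350],
    [5, 9, 20],
]
metrics = cell_qc_metrics(qc_genes, qc_matrix)
keep = [passes_qc(m, min_counts=20, max_counts=500, min_genes=2,
                  max_mito_fraction=0.2) for m in metrics]
normalised_cell1 = normalise_column([row[0] for row in qc_matrix])
print(keep)
```

In this toy example the second cell fails on low counts and a high mitochondrial fraction, and the third exceeds the upper count threshold, which is the pattern one might flag as a possible doublet.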

Tools: scater, scran

File type: SingleCellExperiment

Batch effects

When cells from different experiments are pooled together and compared, it is important to check for batch effects. Batch effects can occur for cells run on different chips, in different sequencing lanes, or harvested on different days. Aside from technical effects, regressing out biological effects, such as the cell cycle, might occasionally also be necessary. However, it is important to consider cell cycle effects carefully, as they might be informative of the biology (e.g. when comparing proliferative and non-proliferative populations of cells), and correcting for these processes may unintentionally mask others.

STEP 5 : Functional Analysis

Dimensionality reduction and Clustering

Once a normalised data matrix has been generated, the analysis to carry out will depend on the biological question we are trying to answer. Usually, the first step is to reduce the dimensionality of the data, starting by keeping only "informative" genes, usually Highly Variable Genes (HVGs). The reduced data can then be visualised with different plots, such as PCA, t-SNE and UMAP. In order to investigate heterogeneity, the next step is to group cells according to their similarity, a step called clustering. However, in cases such as developmental processes, it helps to look at trajectories through pseudotime, which better recapitulate dynamic processes. Finally, additional analyses can be performed, such as differential expression, Gene Ontology, or ligand-receptor analysis.
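As a deliberately simple stand-in for the gene-selection step, the sketch below ranks genes by their variance across cells and keeps the most variable ones. Scanpy and Seurat use more refined, mean-adjusted dispersion measures; the function name `top_variable_genes` and the toy values are assumptions for illustration only.

```python
from statistics import pvariance


def top_variable_genes(genes, matrix, n_top):
    """Rank genes by variance across cells and return the n_top most variable."""
    ranked = sorted(zip(genes, matrix),
                    key=lambda item: pvariance(item[1]),
                    reverse=True)
    return [gene for gene, _ in ranked[:n_top]]


# Hypothetical normalised genes x cells matrix (three genes, four cells).
hvg_genes = ["GeneA", "GeneB", "GeneC"]
hvg_matrix = [
    [1.0, 1.0, 1.0, 1.0],  # constant across cells: uninformative
    [0.0, 3.0, 0.0, 3.0],  # strongly variable
    [1.0, 1.5, 1.0, 1.5],  # mildly variable
]
selected = top_variable_genes(hvg_genes, hvg_matrix, n_top=2)
print(selected)
```

The constant gene is dropped, which is the intuition behind restricting PCA, t-SNE or UMAP to highly variable genes.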

File types: SingleCellExperiment, MTX (matrix)

Integrated workflows:

Scanpy (Python) --> tutorials

Seurat (R) --> tutorials

Both tutorials will guide you through the entire workflow described here. We highly recommend them.

Different tools can be used to perform the different steps, some of which are listed below:

Clustering --> louvain

Trajectory inference --> Monocle, PAGA

Differential expression --> MAST, edgeR, DESeq2

Ligand-Receptor --> CellPhoneDB

Legend:

should be performed on a cluster (via terminal)

time consuming step

important QC step

can be performed locally (via terminal)

can be performed on Galaxy (web interface)

Where to run these tools:

Terminal (local or cluster): sratoolkit, FastQC, cutadapt, trimmomatic, FastX, STAR, Kallisto, samtools, bamtools, bedtools, UCSC application binaries, deeptools

Galaxy (web interface): sratoolkit, FastQC, cutadapt, trimmomatic, samtools, bamtools, Seurat

RStudio: Rsubread, scran, scater, Seurat, DESeq2, edgeR, Monocle, MAST

Download software: IGV, Loupe browser

Web-interface: UCSC browser

Python (Jupyter notebook): Scanpy, CellPhoneDB

Sources:

https://scrnaseq-course.cog.sanger.ac.uk/website/index.html

Svensson, V., Natarajan, K., Ly, L. et al. Power analysis of single-cell RNA-sequencing experiments. Nat Methods 14, 381–387 (2017). https://doi.org/10.1038/nmeth.4220

Luecken, M.D., Theis, F.J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol 15, e8746 (2019). https://doi.org/10.15252/msb.20188746
