Vallier Lab NGS Pipelines

RNA-seq

RNA sequencing (RNA-seq) is a technique used to quantify RNA in biological samples using next-generation sequencing (NGS). Typically, total RNA is extracted from the tissues or cell populations of interest and enriched for coding RNA with polyA selection, or for coding and noncoding RNA with ribosomal RNA depletion. The RNA is reverse transcribed into cDNA, which is fragmented and size selected in order to construct a library to be sequenced. Smaller RNAs, such as miRNAs, require a different size selection.

ENCODE Guidelines:

- Sequence at least 2 biological replicates per experiment

- If different treatments are tested (e.g. shRNA/siRNA), a control must be included in each experiment

- At least 30 million mapped fragments per replicate (60 million paired-end reads)

- Biological replicates should have a Spearman correlation of >0.9 (same donor) or >0.8 (different donor)

- To obtain the required number of reads, libraries can be multiplexed (i.e. pooled and sequenced simultaneously during a single run). The number of reads obtained will depend on (1) the number of samples you multiplex and (2) the number of reads you get per run, which varies across sequencing platforms.
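The multiplexing arithmetic in the last guideline can be sketched as below. This is a minimal illustration that assumes reads are split evenly across pooled libraries, which real runs only approximate; the run output of 400 million reads is a hypothetical figure, not a platform specification.

```python
def reads_per_sample(reads_per_run: int, n_samples: int) -> float:
    """Expected reads per library if the run output is split evenly."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return reads_per_run / n_samples


def max_samples_per_run(reads_per_run: int, required_reads: int) -> int:
    """Largest number of libraries that still reach the required depth."""
    if required_reads < 1:
        raise ValueError("required_reads must be at least 1")
    return reads_per_run // required_reads


# Hypothetical run yielding 400 million reads; 60 million reads needed per library.
print(reads_per_sample(400_000_000, 8))
print(max_samples_per_run(400_000_000, 60_000_000))
```

In practice it is safer to pool fewer libraries than this upper bound, since not every read will map.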

RNA-seq allows you to interrogate changes in gene expression across different samples, treatments or experimental timepoints. In addition, you can explore other features such as alternative splicing, post-transcriptional modifications or variant detection, among others. For a comprehensive overview of this technology and its applications click here.

Pipeline for Data Analysis

STEP 1 : Sequencing QC

Downloading raw data

Raw reads are stored in FASTQ format. File extensions include .fastq or .fq, and .fastq.gz (gzip compressed). FASTQ data can also be stored in the Sequence Read Archive format as an SRA (.sra) file. This is commonly found in public repositories such as GEO. SRA files can be converted into FASTQ using fastq-dump from the sratoolkit.
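To make the FASTQ layout concrete, here is a minimal Python sketch that reads plain or gzip-compressed FASTQ. It assumes the common four-line record layout (header, sequence, '+' separator, quality) with no line wrapping; the function names are illustrative and not part of any of the tools listed on this page.

```python
import gzip
from typing import IO, Iterable, Iterator, Tuple


def open_fastq(path: str) -> IO[str]:
    """Open a plain (.fastq/.fq) or gzip-compressed (.gz) FASTQ file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        sequence = next(it, "").rstrip("\n")
        next(it, "")  # '+' separator line
        quality = next(it, "").rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


# Two made-up records, as they would appear line by line in a file.
example = [
    "@read1\n", "ACGTAC\n", "+\n", "IIIIII\n",
    "@read2\n", "GGCATT\n", "+\n", "FFFFFF\n",
]
records = list(parse_fastq(example))
print(len(records), records[0])
```

With a real file, `parse_fastq(open_fastq(path))` iterates over the records without loading the whole file into memory.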

File type: SRA

Tool: sratoolkit

File type: FASTQ

Tools: FastQC, FastX-Toolkit, cutadapt, trimmomatic

QC on FASTQ reads

It is recommended to perform some quality control checks on the FASTQ data using FastQC. This will generate a QC report and highlight potential problems. It may be necessary to trim adaptors and/or trim the sequence length, for which several tools are available. If you trim, it is recommended to trim reads to the same fixed length for all samples prior to alignment.
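The idea of fixed-length trimming and per-read quality can be sketched in a few lines of Python. This is an illustration only, assuming Phred+33 quality encoding and reads held as (read_id, sequence, quality) tuples; for real data the dedicated tools listed above are the better choice.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (read_id, sequence, quality)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def trim_to_length(records: Iterable[Record], length: int) -> Iterator[Record]:
    """Trim every read to `length` bases and drop reads that are shorter."""
    for read_id, sequence, quality in records:
        if len(sequence) < length:
            continue  # too short to reach the fixed length
        yield read_id, sequence[:length], quality[:length]


# Made-up reads: the second one is shorter than the target length.
reads = [("read1", "ACGTACGT", "IIIIFFFF"), ("read2", "ACG", "III")]
trimmed = list(trim_to_length(reads, 5))
print(trimmed)
print(mean_phred("II##"))
```

Dropping reads shorter than the target is one possible policy; it keeps every remaining read at exactly the same length.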

STEP 2 : Alignment

Mapping reads

After FASTQ files have undergone the necessary QC, they have to be mapped to a reference genome. It is important that all samples being compared are mapped to the same version of the genome (genome assembly). The first step is to download or build an index for the genome of choice. The alignment step tends to be the most time consuming and the files generated are very large (several GB).

File type: FASTQ

Tools: STAR, TopHat

File type: BAM

File formats

STAR and TopHat output aligned reads in BAM format. Mapped reads can also be stored in CRAM format (a more compact alternative to SAM/BAM; see cramtools for more). CRAM can be converted to BAM using samtools view. Aligned reads in BAM can also be converted back into FASTQ using samtools fastq, if the alignment needs to be redone (e.g. with a different genome assembly or a different aligner).

QC on aligned reads

BAM files can be filtered for mapping quality with samtools view command or bamtools.

PCR duplicates can be removed with samtools rmdup (superseded by samtools markdup in recent samtools versions).
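What a mapping-quality filter does can be shown on SAM text, where MAPQ is the fifth tab-separated column. The sketch below is a plain-Python illustration of the same idea as a minimum-MAPQ filter in samtools view; the threshold of 30 and the example alignments are assumptions chosen for the demonstration.

```python
from typing import Iterable, Iterator


def filter_by_mapq(sam_lines: Iterable[str], min_mapq: int = 30) -> Iterator[str]:
    """Keep SAM header lines and alignments with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines are passed through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 of a SAM record is MAPQ
            yield line


# Made-up SAM content: one header line and two alignments.
sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII",
]
kept = list(filter_by_mapq(sam, min_mapq=30))
print(len(kept))
```

BAM files are binary, so in practice this filtering is done with samtools or bamtools rather than by parsing text.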

Merge / Sort

When the same library has been sequenced across different lanes it will be necessary to merge the different BAM files using samtools merge. Some downstream applications require you to sort the BAM files and this can be done using samtools sort commands.
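The merge and sort steps can be scripted by building the command lines in Python. This sketch only constructs and prints the commands; the file names are hypothetical, and actually running them (for example with `subprocess.run(cmd, check=True)`) assumes samtools is installed and on the PATH.

```python
import shlex
from typing import List, Sequence


def merge_command(output_bam: str, lane_bams: Sequence[str]) -> List[str]:
    """Command that merges per-lane BAM files into a single BAM."""
    if len(lane_bams) < 2:
        raise ValueError("need at least two BAM files to merge")
    return ["samtools", "merge", output_bam, *lane_bams]


def sort_command(input_bam: str, output_bam: str, threads: int = 1) -> List[str]:
    """Command that coordinate-sorts a BAM file."""
    return ["samtools", "sort", "-@", str(threads), "-o", output_bam, input_bam]


# Hypothetical file names for one library sequenced on two lanes.
merge_cmd = merge_command("sample1.bam", ["sample1_L001.bam", "sample1_L002.bam"])
sort_cmd = sort_command("sample1.bam", "sample1.sorted.bam", threads=4)
print(shlex.join(merge_cmd))
print(shlex.join(sort_cmd))
```

Keeping commands as lists rather than strings avoids shell-quoting problems when file names contain spaces.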

STEP 3 : Visualisation

File type: BAM

Tools: bedtools, deeptools

File type: BEDGRAPH

Tools: UCSC binaries, deeptools

File type: BIGWIG

Browsers: Biodalliance, IGV, UCSC

Tools: deeptools

QC on reproducibility

Reproducibility can be checked by plotting the correlation of the read coverage (BAM or BIGWIG) among biological replicates using deeptools plotCorrelation.
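The statistic behind this check, the Spearman correlation quoted in the ENCODE guidelines, is a Pearson correlation computed on ranks. The sketch below is a plain-Python illustration and not how deeptools computes it; the per-bin coverage values are made up.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up coverage values over the same five bins in two replicates.
rep1 = [0.0, 5.2, 3.1, 8.8, 1.4]
rep2 = [0.1, 4.9, 3.5, 9.3, 1.0]
print(spearman(rep1, rep2))
```

Because it works on ranks, the Spearman correlation is less sensitive than Pearson to a few very highly covered regions.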

Coverage tracks

This step generates a BIGWIG (.bw) file containing the read coverage over every chromosome. Coverage can be normalised at this stage, which makes it possible to compare different samples/treatments in the same experiment. An intermediary format is the .bedgraph, which can also be visualised in certain browsers. You can upload the tracks (.bw files) to your browser of choice and visualise the expression of your favourite genes, including which exons are being expressed.
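Why normalisation makes tracks comparable can be seen with a small example. The sketch below scales raw coverage to counts per million mapped reads (CPM), which is one common choice and is assumed here for illustration; the coverage values and library sizes are made up.

```python
from typing import List, Sequence


def cpm_normalise(coverage: Sequence[float], mapped_reads: int) -> List[float]:
    """Scale raw per-bin coverage to counts per million mapped reads."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return [value * 1_000_000 / mapped_reads for value in coverage]


# Sample B was sequenced twice as deeply as sample A, so its raw coverage
# is twice as high even though the underlying expression is the same.
sample_a = cpm_normalise([40, 80], mapped_reads=20_000_000)
sample_b = cpm_normalise([80, 160], mapped_reads=40_000_000)
print(sample_a, sample_b)
```

After scaling, the two tracks agree, so any remaining difference between samples reflects expression rather than sequencing depth.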

STEP 4 : Quantification

Quantifying genes

After reads have been aligned to the genome, they need to be counted in order to produce a counts matrix (genes x samples). This is typically a .txt or .csv file that is used as input for downstream analysis with count-based statistical methods. You can generate a count matrix from aligned reads (BAM files) using the featureCounts function in the Rsubread R package.
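The shape of a counts matrix can be illustrated by assembling one from per-sample gene counts. This sketch is not a replacement for featureCounts; it only shows the genes x samples layout and how it can be written as CSV. The gene and sample names are hypothetical.

```python
import csv
import io
from typing import Dict, List


def build_count_matrix(per_sample: Dict[str, Dict[str, int]]) -> List[List[object]]:
    """Return rows of a genes x samples table, header row first."""
    samples = list(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    rows: List[List[object]] = [["gene", *samples]]
    for gene in genes:
        # A gene missing from a sample is recorded as a count of 0.
        rows.append([gene, *(per_sample[s].get(gene, 0) for s in samples)])
    return rows


def to_csv(rows: List[List[object]]) -> str:
    """Serialise the table as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical raw counts for two samples.
counts = {
    "ctrl_rep1": {"GENE_A": 10, "GENE_B": 0},
    "treated_rep1": {"GENE_A": 25, "GENE_C": 7},
}
matrix_csv = to_csv(build_count_matrix(counts))
print(matrix_csv)
```

The values stay as raw integer counts, which is the form the count-based methods in Step 5 expect.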

File type: BAM

Tools: Rsubread, Salmon

File type: Counts

OR

File type: FASTQ

Tool: Salmon

File type: Counts

Transcript abundance

Alternatively, FASTQ files can be mapped and quantified in a single step using Salmon. This is a fast tool that also requires less memory and disk space. You will need to build an index of the transcriptome, or you can download a pre-built one. If you have already aligned your data with another aligner, you can also quantify your transcripts from your BAM files with Salmon; note that this alignment-based mode expects reads aligned to the transcriptome rather than to the genome. The output is transcript-level abundance data, which you will need to import into R using the tximport package.
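The step from transcript-level to gene-level values can be sketched as a sum over the transcripts of each gene. The example below assumes a Salmon-style quant.sf table (tab-separated, with Name and NumReads columns) and a hypothetical transcript-to-gene mapping; tximport does more than this, so treat it as an illustration of the idea only.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable


def gene_level_counts(quant_lines: Iterable[str], tx2gene: Dict[str, str]) -> Dict[str, float]:
    """Sum estimated read counts over all transcripts of each gene."""
    totals: Dict[str, float] = defaultdict(float)
    for row in csv.DictReader(quant_lines, delimiter="\t"):
        gene = tx2gene.get(row["Name"])
        if gene is not None:  # transcripts without a mapping are ignored
            totals[gene] += float(row["NumReads"])
    return dict(totals)


# Made-up quantification table and transcript-to-gene mapping.
quant = [
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n",
    "tx1\t1500\t1300\t12.0\t10.5\n",
    "tx2\t900\t700\t8.0\t4.5\n",
    "tx3\t2000\t1800\t30.0\t20.0\n",
]
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_counts = gene_level_counts(quant, tx2gene)
print(gene_counts)
```

Note that the estimated counts are not integers, which is one reason a dedicated import step is used before count-based modelling.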

STEP 5 : Functional Analysis

Un-normalised counts

Once your quantification inputs are ready, it is time to load them into count-based statistical methods, such as DESeq2. DESeq2 expects input data to be in the form of a matrix of un-normalized counts. The DESeq2 model internally corrects for library size, so transformed or normalized values, such as counts scaled by library size, should not be used as input. Other similar statistical methods include edgeR and limma. For a typical RNA-seq workflow, see the Bioconductor rnaseqGene workflow listed under Sources.
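To see how a library-size correction can be estimated from raw counts, here is a sketch of the median-of-ratios idea that underlies DESeq2's size factors. It is a simplified plain-Python illustration, not DESeq2's own code, and the count matrix is made up.

```python
import math
from statistics import median
from typing import List, Sequence


def size_factors(counts: Sequence[Sequence[int]]) -> List[float]:
    """Median-of-ratios size factors for a genes x samples count matrix."""
    n_samples = len(counts[0])
    log_ratios: List[List[float]] = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c == 0 for c in row):
            continue  # the geometric mean is zero, so the gene is skipped
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Made-up counts: the second sample has exactly twice the depth of the first.
raw = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],  # contains a zero, so it does not contribute
]
factors = size_factors(raw)
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale. This is why the input must be raw counts: values that were already normalised would be corrected a second time.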

Tools: DESeq2, edgeR, limma

File type: Counts

Tools: AnnotationDbi, topGO, GSEABase

Exploring your data

The DESeq2 package provides not only methods for statistical testing of differential gene expression, but also different ways to explore and visualize your data, including plotting sample-to-sample distances (including PCA), plotting normalized counts and clustering your counts matrix (heatmaps). You can incorporate additional packages to annotate your results as well as to test for enrichment of gene ontology terms (topGO, GSEABase).
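Sample-to-sample distances can be sketched as Euclidean distances between the columns of a transformed count matrix. The example below uses a log2(count + 1) transform as a simple stand-in for the variance-stabilising transforms DESeq2 offers; that substitution, and the counts themselves, are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def log2_transform(counts: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply log2(count + 1) to every entry of a genes x samples matrix."""
    return [[math.log2(c + 1) for c in row] for row in counts]


def sample_distances(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Euclidean distance between every pair of samples (columns)."""
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    return [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(col_i, col_j))) for col_j in columns]
        for col_i in columns
    ]


# Made-up counts: samples 1 and 2 are identical, sample 3 differs.
raw_counts = [
    [3, 3, 15],
    [7, 7, 0],
]
distances = sample_distances(log2_transform(raw_counts))
for row in distances:
    print(row)
```

Replicates of the same condition are expected to sit close together in such a matrix, which is what the heatmaps and PCA plots mentioned above visualise.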

Legend:

should be performed on a cluster (via terminal)

time consuming step

important QC step

can be performed locally (via terminal)

can be performed on Galaxy (web interface)

can be performed locally on Seqmonk (download software)

Where to run these tools:

Terminal (local or cluster): sratoolkit, FastQC, cutadapt, trimmomatic, FastX, TopHat, samtools, bamtools, deeptools, UCSC application binaries, Salmon

Galaxy (web interface): sratoolkit, FastQC, cutadapt, trimmomatic, TopHat, samtools, bamtools, deeptools, DESeq2, edgeR, limma

RStudio: Rsubread, tximport, DESeq2, edgeR, limma, AnnotationDbi, topGO, GSEABase

Download software: Seqmonk, IGV

Web-interface: UCSC browser, Biodalliance.

Sources:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4527835/

https://rnabio.org

https://www.encodeproject.org/pipelines/

http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html
