Vallier Lab NGS Pipelines

ATAC-seq

ATAC-seq (Assay for Transposase-Accessible Chromatin using high-throughput sequencing) is a method to investigate chromatin accessibility across the genome. It is based on a modified hyperactive Tn5 transposase that binds open chromatin regions, fragments the DNA and inserts sequences corresponding to sequencing adapters. The tagged fragments are then used for library preparation and sequencing.

See the References section at the end of this page for the original protocol paper (Buenrostro et al., 2013), an adaptation of the protocol (Abcam), and additional info and references (Illumina).

ENCODE Guidelines:

- Sequence at least 2 biological replicates per experiment

- Sequencing may be paired-end or single-end, but paired-end is preferred

- Each replicate should have 25 million non-mitochondrial mapped fragments (50 million paired-end reads)

- The detailed ENCODE ATAC-seq analysis pipeline is available from the ENCODE project

- To obtain the required number of reads, libraries can be multiplexed (i.e. pooled and sequenced simultaneously during a single run). The number of reads obtained will depend on (1) the number of samples you multiplex and (2) the number of reads you get per run, which varies across sequencing platforms
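The multiplexing arithmetic in the last guideline can be sketched as follows. This is only a rough estimate: the function name and the run yield used in the example are hypothetical, and the real yield should be taken from the specifications of your sequencing platform.

```python
def max_samples_per_run(reads_per_run, reads_per_sample=50_000_000):
    """Return how many libraries can be pooled in a single run.

    reads_per_sample defaults to the ENCODE target quoted above
    (50 million paired-end reads per replicate). Reads lost to
    mitochondrial DNA, duplicates or poor mapping are NOT accounted
    for, so leave a safety margin in practice.
    """
    if reads_per_sample <= 0:
        raise ValueError("reads_per_sample must be positive")
    return reads_per_run // reads_per_sample


# Hypothetical run yielding 400 million reads
print(max_samples_per_run(400_000_000))  # prints 8
```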

Pipeline for Data Analysis

STEP 1 : Sequencing QC

Downloading raw data

Raw reads are stored in FASTQ format. File extensions include .fastq or .fq, and .fastq.gz (gzip compressed). FASTQ data can also be archived by the Sequence Read Archive (formerly the Short Read Archive) as SRA (.sra) files. This format is commonly found in public repositories such as GEO. SRA files can be converted into FASTQ using fastq-dump from the sratoolkit.
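As an illustration of the SRA to FASTQ conversion, the sketch below only assembles a fastq-dump command line; it does not run it. The helper name, the placeholder accession and the output folder are hypothetical, and it assumes a paired-end library (hence --split-files).

```python
import shlex


def fastq_dump_command(sra_path, out_dir="."):
    """Build (but do not run) a fastq-dump call for a paired-end .sra file."""
    return [
        "fastq-dump",
        "--split-files",      # write read 1 and read 2 to separate files
        "--gzip",             # compress the output as .fastq.gz
        "--outdir", out_dir,  # folder that will receive the FASTQ files
        sra_path,
    ]


cmd = fastq_dump_command("SRR0000000.sra", out_dir="fastq")  # placeholder accession
print(shlex.join(cmd))
```

The printed line can be pasted into a terminal, or the list can be passed to subprocess.run once the sratoolkit is installed.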

File type: SRA

Tool: sratoolkit

File type: FASTQ

Tools: FastQC, FastX-Toolkit, cutadapt, trimmomatic

QC on FASTQ reads

It is recommended to perform some quality control checks on the FASTQ data using FastQC. The first step is to check the presence of the Nextera adapters and the quality of the reads. If needed, adapters, short reads and low quality bases can be trimmed using Cutadapt. In this case, it is recommended to trim reads to a fixed length for all samples prior to alignment.
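A minimal sketch of the trimming step, assuming paired-end reads and the standard Nextera transposase adapter sequence. The helper name and the quality, minimum-length and fixed-length values are illustrative assumptions, not settings recommended by this page. The command is only assembled, not executed.

```python
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"  # Nextera transposase adapter sequence


def cutadapt_command(r1, r2, out1, out2,
                     min_quality=20, min_length=20, fixed_length=None):
    """Build a paired-end cutadapt call (thresholds are illustrative)."""
    cmd = [
        "cutadapt",
        "-a", NEXTERA_ADAPTER,    # adapter to remove from read 1
        "-A", NEXTERA_ADAPTER,    # adapter to remove from read 2
        "-q", str(min_quality),   # trim low-quality bases from the 3' end
        "-m", str(min_length),    # discard reads shorter than this after trimming
    ]
    if fixed_length is not None:
        # Shorten reads to at most this length so all samples are comparable
        cmd += ["--length", str(fixed_length)]
    cmd += ["-o", out1, "-p", out2, r1, r2]
    return cmd


print(" ".join(cutadapt_command("s_R1.fastq.gz", "s_R2.fastq.gz",
                                "trim_R1.fastq.gz", "trim_R2.fastq.gz",
                                fixed_length=50)))
```

Re-running FastQC on the trimmed files is a simple way to confirm that the adapter signal has gone.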

STEP 2 : Alignment

Mapping reads

After FASTQ files have undergone the necessary QC, they have to be mapped to a reference genome. It is important that all samples being compared are mapped to the same version of the genome (genome assembly). The first step is to download an index for the genome of choice. The alignment step tends to be the most time-consuming, and the files generated are very large (several GB). A good mapping tool is Bowtie2.
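The mapping step could look roughly like the sketch below, which assembles a paired-end bowtie2 command without running it. The helper name, the thread count, the --very-sensitive preset and the maximum fragment length of 2000 bp (a value often used for ATAC-seq) are assumptions, not requirements of this pipeline.

```python
def bowtie2_command(index_prefix, r1, r2, out_sam, threads=4, max_fragment=2000):
    """Build a paired-end bowtie2 call that writes SAM output."""
    return [
        "bowtie2",
        "--very-sensitive",         # slower but more sensitive preset
        "-X", str(max_fragment),    # maximum fragment length for valid pairs
        "-p", str(threads),         # number of alignment threads
        "-x", index_prefix,         # prefix of the downloaded genome index
        "-1", r1,
        "-2", r2,
        "-S", out_sam,              # output file in SAM format
    ]


# "hg38" is a placeholder index prefix; use the same assembly for all samples
print(" ".join(bowtie2_command("hg38", "trim_R1.fastq.gz", "trim_R2.fastq.gz",
                               "sample.sam")))
```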

Tools: bowtie2, STAR

File type: FASTQ

File type: SAM

Tools: samtools, STAR

File type: BAM

QC Tools:

Remove mitochondrial and low-quality reads -> samtools view

Remove PCR duplicates -> samtools rmdup

Plot fragment size -> ATACseqQC

File formats

bowtie2 produces its output in SAM format, whereas STAR can write either SAM or BAM. SAM output needs to be converted into BAM for several downstream applications. However, some peak-calling tools require SAM format instead of BAM. These formats can be interconverted using samtools view. In addition, mapped reads can be stored in CRAM format (a compressed and smaller alternative to SAM/BAM; see cramtools for more). CRAM files can also be converted to BAM using samtools view. Aligned reads in BAM can be converted back into FASTQ using samtools fastq.
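The conversions described above can be summarised as a few samtools command lines. The sketch below only builds them; the helper names and file names are hypothetical. Note that CRAM decoding generally needs the reference FASTA used for alignment, and that samtools fastq expects mates to be adjacent (name-collated) for paired-end data.

```python
def sam_to_bam(sam, bam):
    return ["samtools", "view", "-b", "-o", bam, sam]   # -b: write BAM


def bam_to_sam(bam, sam):
    return ["samtools", "view", "-h", "-o", sam, bam]   # -h: keep the header


def cram_to_bam(cram, reference_fasta, bam):
    # -T: reference FASTA used to decode the CRAM file
    return ["samtools", "view", "-b", "-T", reference_fasta, "-o", bam, cram]


def bam_to_fastq(bam, fq1, fq2):
    # -1 / -2: output files for read 1 and read 2
    return ["samtools", "fastq", "-1", fq1, "-2", fq2, bam]


for command in (sam_to_bam("sample.sam", "sample.bam"),
                bam_to_sam("sample.bam", "sample.sam"),
                cram_to_bam("sample.cram", "genome.fa", "sample.bam"),
                bam_to_fastq("sample.bam", "sample_R1.fastq", "sample_R2.fastq")):
    print(" ".join(command))
```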

QC on aligned reads

BAM files can be filtered to remove mitochondrial reads and low mapping quality reads using samtools view or bamtools. PCR duplicates can be removed with samtools rmdup (marked as obsolete in current samtools releases, which provide samtools markdup instead). It is also important to plot the insert size (the size of the DNA fragment). The distribution should show a periodicity of roughly 150-200 bp (nucleosome size), and can be plotted using ATACseqQC.
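To make the filtering logic concrete, here is a small pure-Python sketch that applies the same two criteria to SAM-formatted text lines and collects fragment sizes from the TLEN column. It is not a replacement for samtools or ATACseqQC; the function names, the MAPQ cut-off of 30 and the mitochondrial contig names are assumptions that depend on your aligner and genome assembly.

```python
def keep_alignment(sam_line, min_mapq=30, mito_names=("chrM", "MT")):
    """Return True if a SAM record is non-mitochondrial and well mapped."""
    fields = sam_line.rstrip("\n").split("\t")
    rname = fields[2]        # column 3: reference sequence name
    mapq = int(fields[4])    # column 5: mapping quality
    return rname != "*" and rname not in mito_names and mapq >= min_mapq


def fragment_lengths(sam_lines, **filters):
    """Collect fragment sizes (TLEN, column 9) from records passing the filter."""
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):      # skip header lines
            continue
        if not keep_alignment(line, **filters):
            continue
        tlen = int(line.split("\t")[8])
        if tlen > 0:                  # positive TLEN: count each pair once
            lengths.append(tlen)
    return lengths


example = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t42\t50M\t=\t250\t200\t*\t*",
    "r1\t147\tchr1\t250\t42\t50M\t=\t100\t-200\t*\t*",
    "r2\t99\tchrM\t100\t42\t50M\t=\t150\t100\t*\t*",   # mitochondrial
    "r3\t99\tchr2\t100\t5\t50M\t=\t180\t130\t*\t*",    # low mapping quality
]
print(fragment_lengths(example))  # prints [200]
```

A histogram of these lengths is what reveals the nucleosomal periodicity mentioned above.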

Merge / Sort / Index

When the same library has been sequenced across different lanes it will be necessary to merge the different BAM files using samtools merge. Some downstream applications require you to sort and index the BAM files and this can be done using samtools sort and index commands.
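One possible ordering of these commands is sketched below: each lane is coordinate-sorted first (samtools merge expects sorted inputs), the sorted files are merged, and the merged file is indexed. The helper name and file-naming scheme are hypothetical, and the commands are only assembled, not run.

```python
import os


def merge_lanes_commands(lane_bams, sample):
    """Build samtools sort / merge / index calls for one library."""
    commands = []
    sorted_bams = []
    for bam in lane_bams:
        sorted_bam = os.path.splitext(bam)[0] + ".sorted.bam"
        commands.append(["samtools", "sort", "-o", sorted_bam, bam])
        sorted_bams.append(sorted_bam)
    merged = f"{sample}.merged.bam"
    commands.append(["samtools", "merge", merged, *sorted_bams])
    commands.append(["samtools", "index", merged])   # writes merged + ".bai"
    return commands


for command in merge_lanes_commands(["s1_L001.bam", "s1_L002.bam"], "s1"):
    print(" ".join(command))
```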

STEP 3 : Peak Calling

Calling peaks

The peak calling step is essential for identifying regions of the genome with accessible chromatin. ATAC-seq generally produces both "narrow" and "broad" peaks, so the parameters need fine-tuning for each experiment. MACS2 is a good tool to use, with different settings depending on whether you want to focus on where the 'cutting sites' are or on single nucleosome detection (see the --shift parameter in the MACS2 documentation for more info).
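As a hedged example of the 'cutting sites' configuration, the sketch below assembles a MACS2 call using --nomodel with --shift -100 and --extsize 200, a combination often quoted for this purpose (the shift is minus half the extension, so the window is centred on the 5' end of each read). The helper name, the genome size code, the q-value cut-off and the output folder are assumptions to adapt to your experiment.

```python
def macs2_cutsite_command(bam, name, genome_size="hs", out_dir="macs2",
                          shift=-100, extsize=200, qvalue=0.01, broad=False):
    """Build a MACS2 callpeak command centred on Tn5 cutting sites."""
    cmd = [
        "macs2", "callpeak",
        "-t", bam,                  # treatment file (filtered BAM)
        "-f", "BAM",                # treat reads individually, not as pairs
        "-g", genome_size,          # effective genome size ("hs" = human)
        "-n", name,                 # prefix for output files
        "--outdir", out_dir,
        "--nomodel",                # skip fragment-size model building
        "--shift", str(shift),      # move 5' ends before extending
        "--extsize", str(extsize),  # extend reads to this length
        "-q", str(qvalue),          # FDR cut-off
    ]
    if broad:
        cmd.append("--broad")       # call broad domains as well
    return cmd


print(" ".join(macs2_cutsite_command("sample.filtered.bam", "sample")))
```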

File type: BAM

Tool: MACS2

File type: BED

Tools: IDR, DiffBind

Reproducibility and Differential Enrichment

Apart from basic overlap of peaks, several tools are available to assess the reproducibility of peaks between biological replicates. This allows the identification of only statistically significant reproducible peaks with reduced false positives. ENCODE recommends IDR, the Irreproducible Discovery Rate.
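For reference, the "basic overlap" mentioned above amounts to the interval test sketched below. This is deliberately the naive approach and not IDR, which additionally ranks peaks and models their consistency between replicates. Peaks are represented as (chromosome, start, end) tuples with BED-style half-open coordinates; the function names and example coordinates are made up.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def peaks_supported_by(rep1, rep2):
    """Return the peaks of rep1 that overlap at least one peak of rep2."""
    return [peak for peak in rep1 if any(overlaps(peak, other) for other in rep2)]


rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
rep2 = [("chr1", 150, 250), ("chr2", 200, 300)]
print(peaks_supported_by(rep1, rep2))  # prints [('chr1', 100, 200)]
```

The chr2 peaks are adjacent but do not overlap under half-open coordinates, which is why only one peak is reported. For large peak sets a sorted sweep or bedtools intersect is far more efficient than this quadratic loop.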

Identifying peaks differentially enriched between different samples or treatments is also possible using the R package DiffBind.

STEP 4 : Visualisation

File type: BAM

Tools: bedtools, deeptools

QC on coverage reproducibility

Reproducibility can be checked by plotting the correlation of the read coverage among biological replicates using deeptools plotCorrelation. Other features such as checking GC bias or ChIP strength are also available.
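With deeptools, the correlation plot is usually produced in two steps: multiBamSummary computes read coverage in genomic bins, and plotCorrelation draws the result. The sketch below assembles both commands; the helper name, the output file names and the choice of Spearman correlation with a heatmap are assumptions.

```python
def correlation_commands(bams, prefix="replicates"):
    """Build multiBamSummary + plotCorrelation calls for a set of BAM files."""
    summary = f"{prefix}.npz"
    return [
        ["multiBamSummary", "bins",
         "--bamfiles", *bams,          # sorted and indexed BAM files
         "--outFileName", summary],    # compressed coverage matrix
        ["plotCorrelation",
         "--corData", summary,
         "--corMethod", "spearman",    # "pearson" is the alternative
         "--whatToPlot", "heatmap",    # or "scatterplot"
         "--plotFile", f"{prefix}_correlation.png"],
    ]


for command in correlation_commands(["rep1.bam", "rep2.bam"]):
    print(" ".join(command))
```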

File type: BEDGRAPH

Tools: UCSC binaries, deeptools

File type: BIGWIG

Browsers: IGV, UCSC

Tools: deeptools, ChIPseeker, Genomation

Genome coverage

This step generates a BIGWIG (.bw) file containing the read coverage over every chromosome. Coverage can be normalized at this stage, which makes it possible to compare different samples/treatments in the same experiment. An intermediary format is the .bedgraph, which can also be visualised in certain browsers.
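One way to produce such a coverage track is deeptools bamCoverage. The sketch below assembles the command only; the helper name, the 10 bp bin size and CPM normalization are assumptions (the --normalizeUsing option belongs to deeptools 3 and later).

```python
def bam_coverage_command(bam, out_file, bin_size=10, normalize="CPM",
                         out_format="bigwig"):
    """Build a bamCoverage call producing a normalized coverage track."""
    if out_format not in ("bigwig", "bedgraph"):
        raise ValueError("out_format must be 'bigwig' or 'bedgraph'")
    return [
        "bamCoverage",
        "--bam", bam,                    # sorted and indexed BAM file
        "--outFileName", out_file,
        "--outFileFormat", out_format,
        "--binSize", str(bin_size),      # resolution of the track in bp
        "--normalizeUsing", normalize,   # e.g. CPM, RPKM, BPM, RPGC or None
    ]


print(" ".join(bam_coverage_command("sample.bam", "sample.bw")))
```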

Enrichment profile

BIGWIG files can be uploaded directly to a genome browser for inspection. BED files can also be uploaded and allow you to visualise both the coverage profile and the result of the peak calling in the genes of interest. At a global level, the overall enrichment across a set of regions or peaks can be plotted (e.g. around the TSS) and clustered using tools such as deeptools plotHeatmap, Genomation or ChIPseeker. BIGWIG and BED files are relatively small and can easily be used locally.
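A TSS-centred heatmap with deeptools typically takes two commands, computeMatrix followed by plotHeatmap. The sketch below builds both; the helper name, the 2 kb flanks, the file names and the optional k-means clustering are illustrative assumptions.

```python
def tss_heatmap_commands(bigwigs, regions_bed, prefix="tss", flank=2000,
                         clusters=None):
    """Build computeMatrix + plotHeatmap calls around the start of each region."""
    matrix = f"{prefix}.matrix.gz"
    compute = [
        "computeMatrix", "reference-point",
        "--referencePoint", "TSS",          # anchor on the start of each region
        "--scoreFileName", *bigwigs,        # coverage tracks to summarise
        "--regionsFileName", regions_bed,   # BED file of genes or peaks
        "--upstream", str(flank),
        "--downstream", str(flank),
        "--outFileName", matrix,
    ]
    plot = ["plotHeatmap", "--matrixFile", matrix,
            "--outFileName", f"{prefix}_heatmap.png"]
    if clusters:
        plot += ["--kmeans", str(clusters)]  # group regions by signal profile
    return [compute, plot]


for command in tss_heatmap_commands(["sample.bw"], "genes.bed", clusters=3):
    print(" ".join(command))
```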

STEP 5 : Functional Analysis & Motif Discovery

Annotation and Gene Ontology

To explore the functional relevance of the identified peaks, these can be annotated (e.g. to the nearest gene or TSS) or plotted relative to their genomic location (e.g. % of peaks in promoters, intergenic regions, etc.). Peaks can also be analysed for ontologies associated with the corresponding genes or regions. The tools listed here are examples of the many that perform these and several additional functions.
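The idea of nearest-TSS annotation can be illustrated with a few lines of Python. This is a toy version of what packages such as ChIPseeker or ChIPpeakAnno do (they also handle strand, gene models and genomic features); the function name, gene names and coordinates below are invented.

```python
def nearest_gene(peak, tss_list):
    """Return (gene, distance) for the TSS closest to the peak centre.

    peak is (chrom, start, end); tss_list holds (chrom, position, gene).
    Returns None when no TSS lies on the same chromosome.
    """
    chrom, start, end = peak
    centre = (start + end) // 2
    candidates = [(abs(centre - position), gene)
                  for tss_chrom, position, gene in tss_list
                  if tss_chrom == chrom]
    if not candidates:
        return None
    distance, gene = min(candidates)
    return gene, distance


tss = [("chr1", 1000, "GENE_A"), ("chr1", 5000, "GENE_B"), ("chr2", 300, "GENE_C")]
print(nearest_gene(("chr1", 4400, 4600), tss))  # prints ('GENE_B', 500)
```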

Tools: ChIPseeker, ChIPpeakAnno, clusterProfiler, genomation, ChromHMM, Homer, GREAT

File type: BAM

File type: BED

File type: BIGWIG

Tools: MEME suite, Homer, FLR, Wellington

Finding Motifs and foot-printing analysis

Motif discovery consists of finding over-represented DNA sequences that are significantly more frequent in a set of peaks than would be expected by chance (i.e. compared against a background). These tools offer a wide range of options including de novo motif discovery, motif enrichment and motif scanning (MEME, Homer).

In addition, to investigate TF occupancy using ATAC-seq, it is possible to perform foot-printing analysis. The DNA bound by a TF at its binding motif is protected from cutting by Tn5, which leaves a "footprint" wherever a TF occupies a specific site in the genome (FLR, Wellington).
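The notion of enrichment against a background can be illustrated with a toy calculation: the fraction of peak sequences containing a motif divided by the same fraction in background sequences. Real tools such as MEME or Homer use position weight matrices, scan both strands and attach statistical significance; the function names and sequences below are invented for illustration only.

```python
def fraction_with_motif(sequences, motif):
    """Fraction of sequences containing the exact motif (forward strand only)."""
    if not sequences:
        return 0.0
    motif = motif.upper()
    hits = sum(1 for sequence in sequences if motif in sequence.upper())
    return hits / len(sequences)


def motif_fold_enrichment(peak_seqs, background_seqs, motif):
    """Ratio of motif frequency in peaks versus background (None if undefined)."""
    background = fraction_with_motif(background_seqs, motif)
    if background == 0:
        return None
    return fraction_with_motif(peak_seqs, motif) / background


peaks = ["AAGATAAGG", "CCGATAACC", "TTTTTTTT", "GGGATAAAA"]
background = ["AAGATAAGG", "CCCCCCCC", "TTTTTTTT", "GGGGGGGG"]
print(motif_fold_enrichment(peaks, background, "GATAA"))  # prints 3.0
```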

Legend:

should be performed on a cluster (via terminal)

time consuming step

important QC step

can be performed locally (via terminal)

can be performed on Galaxy (web interface)

Where to run these tools:

Terminal (local or cluster): sratoolkit, FastQC, cutadapt, trimmomatic, FastX, bowtie2, samtools, bamtools, MACS2, Homer, IDR, bedtools, UCSC application binaries, deeptools, ChromHMM

Galaxy (web interface): sratoolkit, FastQC, cutadapt, trimmomatic, bowtie2, samtools, bamtools, MACS2, IDR, bedtools, deeptools, MEME

RStudio: DiffBind, ChIPseeker, ChIPpeakAnno, clusterProfiler, genomation

Download software: Seqmonk, IGV

Web-interface: UCSC browser, GREAT, MEME suite

References:

Buenrostro, J., Giresi, P., Zaba, L. et al. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10, 1213–1218 (2013). https://doi.org/10.1038/nmeth.2688

https://www.abcam.com/epigenetics/epigenetics-application-spotlight-atac-seq

https://emea.illumina.com/techniques/popular-applications/epigenetics/atac-seq-chromatin-accessibility.html
