Vallier Lab NGS Pipelines

ChIP-seq

Chromatin immunoprecipitation (ChIP) is used to explore protein interactions with genomic DNA. Typically, chromatin is cross-linked to fix proteins to the DNA, sheared into small fragments and immunoprecipitated with an antibody that recognises the protein of interest (e.g. transcription factors, chromatin modifiers, histone tails). The protein-DNA complexes are then purified and the associated DNA is sequenced in order to identify the genome-wide binding of the target proteins. The enrichment of a particular protein at different genomic regions is usually compared against an "input" control (i.e. genomic DNA that was not immunoprecipitated). A typical ChIP protocol and several useful webinars on this topic are available from Abcam (see References).

ENCODE Guidelines:

- Sequence at least 2 biological replicates per experiment

- Each experiment should have a corresponding input

- For histone narrow peaks (e.g. H3K27ac, H3K4me3) at least 20 million mapped fragments (40 million paired-end reads)

- For histone broad peaks (e.g. H3K4me1, H3K27me3) at least 45 million mapped fragments (90 million paired-end reads)

- For transcription factors at least 20 million mapped reads per replicate

- The detailed ENCODE ChIP-seq analysis pipeline is described on the ENCODE ChIP-seq pages (see References)

- To obtain the required number of reads, libraries can be multiplexed (i.e. pooled and sequenced simultaneously during a single run). The number of reads obtained will depend on (1) the number of samples you multiplex and (2) the number of reads you get per run, which varies across sequencing platforms.
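The multiplexing arithmetic above can be sketched in a few lines of Python. This is only an illustration: the run yield of 400 million reads is a hypothetical figure, not a property of any particular platform, and the calculation assumes the run is shared evenly between libraries, which real runs only approximate.

```python
def reads_per_sample(reads_per_run: int, n_samples: int) -> int:
    """Expected reads per library if the run is shared evenly."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return reads_per_run // n_samples


def max_samples_per_run(reads_per_run: int, reads_needed: int) -> int:
    """Largest number of libraries that can each reach reads_needed."""
    if reads_needed < 1:
        raise ValueError("reads_needed must be at least 1")
    return reads_per_run // reads_needed


# Hypothetical run yield; check the specification of your own platform.
run_yield = 400_000_000

# Eight pooled libraries on one run.
print(reads_per_sample(run_yield, 8))

# How many narrow-peak histone libraries (40 million paired-end reads each) fit?
print(max_samples_per_run(run_yield, 40_000_000))
```

In practice it is sensible to leave some headroom, since not every read will map and pooling is rarely perfectly even.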

Pipeline for Data Analysis

STEP 1 : Sequencing QC

Downloading raw data

Raw reads are stored in FASTQ format. File extensions include .fastq or .fq, and .fastq.gz (gzip-compressed). FASTQ data can also be archived by the Sequence Read Archive (SRA) as .sra files, which are commonly found through public repositories such as GEO. SRA files can be converted into FASTQ using fastq-dump from the sratoolkit.
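To make the FASTQ layout concrete, here is a minimal Python sketch of a reader for plain or gzip-compressed files. It assumes the common four-lines-per-record layout (header, sequence, "+" separator, quality); the names FastqRecord and read_fastq are made up for this example and do not come from any of the tools listed on this page.

```python
import gzip
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(path: str) -> Iterator[FastqRecord]:
    """Yield records from a .fastq/.fq file, optionally gzip-compressed."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # the '+' separator line is not needed
            quality = handle.readline().rstrip("\n")
            if not header.startswith("@") or len(sequence) != len(quality):
                raise ValueError(f"Malformed FASTQ record: {header!r}")
            yield FastqRecord(header[1:], sequence, quality)
```

A typical use would be counting reads, e.g. sum(1 for _ in read_fastq("sample.fastq.gz")), where the file name is a placeholder.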

File type: SRA

Tool: sratoolkit

File type: FASTQ

Tools: FastQC, FastX-Toolkit, cutadapt, trimmomatic

QC on FASTQ reads

It is recommended to perform quality control checks on the FASTQ data using FastQC. This will generate a QC report and highlight potential problems. It may be necessary to trim adaptors and/or shorten the reads, and several tools are available for this. If reads are trimmed, it is recommended to trim them to the same fixed length for all samples prior to alignment.
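As a rough illustration of what read-level QC and fixed-length trimming involve, the sketch below computes the mean base quality of a read and trims a read to a fixed length. It assumes Phred+33 quality encoding (the usual encoding for recent Illumina data) and is not a substitute for FastQC, cutadapt or trimmomatic; the function names are illustrative only.

```python
from typing import Optional, Tuple


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 by default)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def trim_to_length(sequence: str, quality: str, length: int) -> Optional[Tuple[str, str]]:
    """Trim a read to a fixed length; return None if the read is too short."""
    if len(sequence) < length:
        return None
    return sequence[:length], quality[:length]
```

Dropping reads shorter than the target length, as done here, is one possible policy; keeping them untrimmed is another, and the choice should be the same for every sample.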

STEP 2 : Alignment

Mapping reads

After FASTQ files have undergone the necessary QC, they have to be mapped to a reference genome. It is important that all samples being compared are mapped to the same version of the genome (genome assembly). The first step is to download (or build) an index for the genome of choice. The alignment step tends to be the most time-consuming and the files generated are very large (several GB).
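A mapping call can be assembled in Python before being handed to the shell. The sketch below builds a bowtie2 command line using its standard options (-x for the index prefix, -U or -1/-2 for the reads, -S for the SAM output, -p for threads). All file names and the index prefix are placeholders, and the function name is invented for this example.

```python
from typing import List, Optional


def bowtie2_command(index_prefix: str, reads_1: str, sam_out: str,
                    reads_2: Optional[str] = None, threads: int = 4) -> List[str]:
    """Build a bowtie2 command for single-end or paired-end reads."""
    cmd = ["bowtie2", "-p", str(threads), "-x", index_prefix]
    if reads_2 is None:
        cmd += ["-U", reads_1]                 # single-end reads
    else:
        cmd += ["-1", reads_1, "-2", reads_2]  # paired-end mates
    cmd += ["-S", sam_out]
    return cmd


# Placeholder paths; the same index must be used for every sample compared.
cmd = bowtie2_command("genome_index/hg38", "chip_R1.fastq.gz", "chip.sam",
                      reads_2="chip_R2.fastq.gz", threads=8)
print(" ".join(cmd))
```

The resulting list can be run with subprocess.run(cmd, check=True) on a machine where bowtie2 is installed.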

File type: FASTQ

Tools: bowtie2, STAR

File type: SAM

Tools: samtools, STAR

File type: BAM

File formats

bowtie2 produces output in SAM format, whereas STAR can write BAM directly (for example with its --outSAMtype BAM option). SAM output needs to be converted into BAM for several downstream applications, although some peak calling tools require SAM instead. These formats can be interconverted using samtools view. Mapped reads can also be stored in CRAM format, a more compact alternative to SAM/BAM (see cramtools for more); CRAM can likewise be converted to BAM using samtools view. Aligned reads in BAM can be converted back into FASTQ using samtools fastq.
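The conversions described above differ only in the output flag passed to samtools view (-b for BAM, -h to keep the header when writing SAM, -C for CRAM, with -T naming the reference FASTA where one is needed). A small sketch, assuming the output format is chosen from the destination file extension; the helper name and paths are illustrative only.

```python
from pathlib import Path
from typing import List, Optional

# Output flag for samtools view, keyed by destination extension.
VIEW_FLAGS = {".bam": "-b", ".sam": "-h", ".cram": "-C"}


def samtools_view_command(src: str, dst: str,
                          reference_fasta: Optional[str] = None) -> List[str]:
    """Build a samtools view command converting src into dst's format."""
    suffix = Path(dst).suffix.lower()
    if suffix not in VIEW_FLAGS:
        raise ValueError(f"Unsupported output extension: {suffix!r}")
    cmd = ["samtools", "view", VIEW_FLAGS[suffix]]
    if reference_fasta is not None:
        cmd += ["-T", reference_fasta]  # reference used for CRAM
    cmd += ["-o", dst, src]
    return cmd


print(" ".join(samtools_view_command("chip.sam", "chip.bam")))
print(" ".join(samtools_view_command("chip.bam", "chip.cram", "genome.fa")))
```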

QC on aligned reads

BAM files can be filtered for mapping quality with the samtools view command or with bamtools.

PCR duplicates can be removed with samtools rmdup (superseded by samtools markdup in recent samtools releases).
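To show what mapping-quality filtering does, the following sketch applies the same rule as samtools view -q to SAM text lines: header lines are kept and alignments are kept only if their MAPQ (the fifth column) reaches a threshold. The threshold of 30 is an arbitrary example value, not a recommendation from this pipeline, and in practice samtools or bamtools should be used on real BAM files.

```python
from typing import Iterable, Iterator


def filter_sam_by_mapq(lines: Iterable[str], min_mapq: int = 30) -> Iterator[str]:
    """Keep SAM header lines and alignments with MAPQ >= min_mapq."""
    for line in lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 of a SAM record is MAPQ
            yield line
```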

Merge / Sort / Index

When the same library has been sequenced across different lanes, it will be necessary to merge the different BAM files using samtools merge. Some downstream applications require sorted and indexed BAM files, which can be produced with samtools sort and samtools index.
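The merge, sort and index steps can be chained as a list of commands. Because samtools merge expects its inputs to be sorted, this sketch sorts each lane first, then merges and indexes the result. File names are placeholders and the helper is illustrative only.

```python
from typing import List


def merge_lanes_commands(lane_bams: List[str], merged_bam: str,
                         threads: int = 4) -> List[List[str]]:
    """Commands to sort each lane BAM, merge them and index the merged file."""
    if not lane_bams:
        raise ValueError("At least one lane BAM is required")
    commands = []
    sorted_bams = []
    for bam in lane_bams:
        stem = bam[:-4] if bam.endswith(".bam") else bam
        sorted_bam = stem + ".sorted.bam"
        commands.append(["samtools", "sort", "-@", str(threads),
                         "-o", sorted_bam, bam])
        sorted_bams.append(sorted_bam)
    commands.append(["samtools", "merge", "-@", str(threads),
                     merged_bam, *sorted_bams])
    commands.append(["samtools", "index", merged_bam])
    return commands


for cmd in merge_lanes_commands(["lane1.bam", "lane2.bam"], "sample.bam"):
    print(" ".join(cmd))
```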

STEP 3 : Peak Calling

Calling peaks

The peak calling step identifies regions in the genome that are significantly enriched in the target protein relative to an input control. Depending on the type of ChIP, different tools and parameters are recommended and these need to be fine-tuned for each experiment. ChIP for transcription factors usually results in well-defined, "narrow" peaks, whereas histone marks might span large regions and therefore produce "broad" peaks. MACS2 and Homer are two options that can account for these different types of peaks.
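As one possible starting point, the sketch below assembles a MACS2 callpeak command for a ChIP sample and its input control, switching between narrow and broad peak calling with the --broad flag. The genome size "hs" (human) and all file names are assumptions for illustration; parameters should be tuned for each experiment as noted above.

```python
from typing import List


def macs2_callpeak_command(treatment_bam: str, control_bam: str, name: str,
                           outdir: str, genome_size: str = "hs",
                           broad: bool = False,
                           paired_end: bool = False) -> List[str]:
    """Build a macs2 callpeak command for ChIP versus input."""
    cmd = [
        "macs2", "callpeak",
        "-t", treatment_bam,                     # ChIP sample
        "-c", control_bam,                       # input control
        "-f", "BAMPE" if paired_end else "BAM",  # input file format
        "-g", genome_size,                       # effective genome size
        "-n", name,                              # prefix for output files
        "--outdir", outdir,
    ]
    if broad:
        cmd.append("--broad")  # broad marks such as H3K27me3
    return cmd


print(" ".join(macs2_callpeak_command("H3K27me3.bam", "input.bam",
                                      "H3K27me3", "peaks", broad=True)))
```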

File types: BAM, SAM

Tools: MACS2, Homer

File type: BED

Tools: IDR, DiffBind

File type: BED

Reproducibility and Differential Enrichment

Several tools are available to assess the reproducibility of peak calling between biological replicates. This makes the peak calling step more robust, since peaks that are consistently significant across replicates are less likely to be false positives. Options include the Irreproducible Discovery Rate (IDR) pipeline and Homer.

Identifying peaks differentially enriched between different samples or treatments is also possible using the R package DiffBind or Homer.
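A very rough first look at reproducibility is the fraction of peaks from one replicate that overlap a peak in the other. The sketch below computes this for peaks given as BED-style (chromosome, start, end) tuples with half-open coordinates. This is a plain overlap count, not the IDR statistic, and the function names are made up for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[str, int, int]  # (chromosome, start, end), half-open like BED


def overlaps(a: Peak, b: Peak) -> bool:
    """True if two peaks share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_reproduced(rep1: List[Peak], rep2: List[Peak]) -> float:
    """Fraction of rep1 peaks overlapping at least one rep2 peak."""
    if not rep1:
        return 0.0
    by_chrom: Dict[str, List[Peak]] = defaultdict(list)
    for peak in rep2:
        by_chrom[peak[0]].append(peak)
    hits = sum(1 for peak in rep1
               if any(overlaps(peak, other) for other in by_chrom[peak[0]]))
    return hits / len(rep1)
```

For large peak sets, bedtools or an interval tree would be far more efficient than this pairwise comparison.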

STEP 4 : Visualisation

File type: BAM

Tools: bedtools, deeptools

QC on coverage reproducibility

Reproducibility can be checked by plotting the correlation of the read coverage among biological replicates using deeptools plotCorrelation. Other features such as checking GC bias or ChIP strength are also available.
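The idea behind a coverage correlation can be illustrated on a toy scale: count reads per genomic bin in each replicate, then correlate the two vectors. The sketch below computes a Pearson coefficient in plain Python on hypothetical per-bin counts; deeptools performs this kind of comparison genome-wide and should be used for real data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long coverage vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("Need two vectors of equal length (at least 2 bins)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for constant coverage")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical read counts per bin for two biological replicates.
replicate_1 = [12, 40, 7, 95, 3, 18]
replicate_2 = [10, 44, 9, 88, 5, 21]
print(round(pearson(replicate_1, replicate_2), 3))
```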

File type: BEDGRAPH

Tools: UCSC binaries, deeptools

File type: BIGWIG

Browsers: Biodalliance, IGV, UCSC

Tools: deeptools, ChIPseeker, genomation

Genome coverage

This step generates a BIGWIG (.bw) file containing the read coverage over every chromosome. Coverage can also be normalized at this stage, which makes it possible to compare different samples/treatments in the same experiment. An intermediate format is bedGraph (.bedgraph), which can also be visualised in certain browsers.
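To show what a coverage track contains, the sketch below turns read intervals on one chromosome into bedGraph-style rows (chromosome, start, end, value), merging consecutive positions of equal depth and applying an optional scale factor. The counts-per-million scaling shown is one common normalization and is an assumption here, not necessarily the one used by a given tool; real data should go through bedtools or deeptools.

```python
from typing import List, Tuple


def cpm_scale(total_mapped_reads: int) -> float:
    """Scale factor for counts-per-million normalization."""
    return 1_000_000 / total_mapped_reads


def coverage_bedgraph(chrom: str, chrom_length: int,
                      reads: List[Tuple[int, int]],
                      scale: float = 1.0) -> List[Tuple[str, int, int, float]]:
    """bedGraph rows for half-open, 0-based read intervals on one chromosome."""
    diff = [0] * (chrom_length + 1)  # +1 where a read starts, -1 where it ends
    for start, end in reads:
        start, end = max(0, start), min(chrom_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    rows = []
    depth = 0
    run_start, run_depth = 0, 0
    for pos in range(chrom_length):
        depth += diff[pos]
        if pos == 0:
            run_depth = depth
        elif depth != run_depth:
            if run_depth > 0:  # uncovered stretches are omitted
                rows.append((chrom, run_start, pos, run_depth * scale))
            run_start, run_depth = pos, depth
    if chrom_length > 0 and run_depth > 0:
        rows.append((chrom, run_start, chrom_length, run_depth * scale))
    return rows


for row in coverage_bedgraph("chr1", 10, [(0, 4), (2, 6)]):
    print("\t".join(str(value) for value in row))
```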

Enrichment profile

BIGWIG files can be uploaded directly to a genome browser for inspection. BED files can also be uploaded, allowing you to visualise both the coverage profile and the result of the peak calling at the genes of interest. At a global level, the overall enrichment across a set of regions or peaks can be plotted (e.g. around the TSS) and clustered using tools such as deeptools plotHeatmap or ChIPseeker. BIGWIG and BED files are relatively small and can easily be used locally.

STEP 5 : Functional Analysis & Motif Discovery

Annotation and Gene Ontology

In order to explore the functional relevance of the identified peaks, they can be annotated (e.g. to the nearest gene or TSS), which also provides information on their genomic location (e.g. % of peaks in promoters, intragenic regions, etc). Peaks can then be analysed for ontologies associated with the corresponding genes or regions. The tools listed here are examples of the many that perform these and several additional functions.
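Nearest-TSS annotation can be sketched as a binary search over sorted TSS positions. The example below assigns each peak midpoint to the closest TSS on the same chromosome and reports the distance. It ignores strand and gene structure, which dedicated packages such as ChIPseeker handle properly, and the gene names and coordinates are invented for illustration.

```python
import bisect
from typing import Dict, List, Optional, Tuple

Peak = Tuple[str, int, int]
Annotation = Tuple[str, int, int, Optional[str], Optional[int]]


def annotate_peaks(peaks: List[Peak],
                   tss_by_chrom: Dict[str, List[Tuple[int, str]]]) -> List[Annotation]:
    """Attach the nearest TSS (gene name, midpoint minus TSS position) to each peak.

    tss_by_chrom maps a chromosome to (position, gene) pairs sorted by position.
    """
    annotated = []
    for chrom, start, end in peaks:
        sites = tss_by_chrom.get(chrom, [])
        if not sites:
            annotated.append((chrom, start, end, None, None))
            continue
        midpoint = (start + end) // 2
        positions = [position for position, _ in sites]
        i = bisect.bisect_left(positions, midpoint)
        candidates = sites[max(0, i - 1):i + 1]  # neighbours on either side
        position, gene = min(candidates, key=lambda site: abs(site[0] - midpoint))
        annotated.append((chrom, start, end, gene, midpoint - position))
    return annotated


# Hypothetical TSS table and peaks.
tss = {"chr1": [(1000, "GENE_A"), (5000, "GENE_B")]}
for row in annotate_peaks([("chr1", 900, 1300), ("chr1", 4000, 4400)], tss):
    print(row)
```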

Tools: ChIPseeker, ChIPpeakAnno, clusterProfiler, genomation, ChromHMM, Homer, GREAT

File type: BED

Tools: MEME suite, Pscan-ChIP, Homer

Finding Motifs

Motif discovery consists of finding DNA sequences that are significantly more frequent in a set of peaks than would be expected by chance (i.e. compared against a background). The tools listed here offer a wide range of options, including de novo motif discovery, motif enrichment and motif scanning.
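The notion of "more frequent than expected by chance" can be made concrete with a small enrichment test. The sketch below counts how many peak and background sequences contain an exact motif string and computes a one-sided hypergeometric p-value. It is a deliberate simplification: it matches only the forward strand and exact strings, whereas tools such as Homer or the MEME suite use position weight matrices and more careful background models. The sequences are invented.

```python
from math import comb
from typing import List


def count_hits(sequences: List[str], motif: str) -> int:
    """Number of sequences containing the motif (forward strand, exact match)."""
    motif = motif.upper()
    return sum(1 for sequence in sequences if motif in sequence.upper())


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n sequences from N, of which K contain the motif."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def motif_enrichment(peaks: List[str], background: List[str], motif: str) -> float:
    """P-value for the motif being over-represented in peaks versus background."""
    k = count_hits(peaks, motif)
    K = k + count_hits(background, motif)
    return hypergeom_pvalue(k, len(peaks), K, len(peaks) + len(background))


# Invented sequences; "CACGTG" is used purely as an example motif.
peak_seqs = ["TTCACGTG", "GGGGCACGTGAA", "ACGTGACC", "ATATATAT"]
background_seqs = ["ATATATAT", "GGGGCCCC", "TTTTAAAA", "CACGTGTT"]
print(motif_enrichment(peak_seqs, background_seqs, "CACGTG"))
```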

Legend:

should be performed on a cluster (via terminal)

time consuming step

important QC step

can be performed locally (via terminal)

can be performed on Galaxy (web interface)

can be performed locally on Seqmonk (download software)

Where to run these tools:

Terminal (local or cluster): sratoolkit, FastQC, cutadapt, trimmomatic, FastX, bowtie2, samtools, bamtools, MACS2, Homer, IDR, bedtools, UCSC application binaries, deeptools, ChromHMM

Galaxy (web interface): sratoolkit, FastQC, cutadapt, trimmomatic, bowtie2, samtools, bamtools, MACS2, IDR, bedtools, deeptools, MEME

RStudio: DiffBind, ChIPseeker, ChIPpeakAnno, clusterProfiler, genomation

Download software: Seqmonk, IGV

Web-interface: Biodalliance, UCSC browser, GREAT, MEME suite, Pscan-ChIP

References:

https://www.abcam.com/protocols/cross-linking-chromatin-immunoprecipitation-x-chip-protocol

https://www.encodeproject.org/chip-seq/histone/

https://www.encodeproject.org/chip-seq/transcription_factor/

https://deeptools.readthedocs.io/en/develop/content/help_glossary.html#file-formats

http://homer.ucsd.edu/homer/basicTutorial/index.html

https://hbctraining.github.io/Intro-to-ChIPseq/schedule/2-day.html
