

ChIP-seq
Chromatin immunoprecipitation (ChIP) is used to explore protein interactions with genomic DNA. Typically, chromatin is cross-linked to fix proteins to the DNA, sheared into small fragments and immunoprecipitated with an antibody that recognises the protein of interest (e.g. transcription factors, chromatin modifiers, modified histone tails). The protein-DNA complexes are then purified and the associated DNA is sequenced to identify where the target protein binds across the genome. The enrichment of a particular protein at different genomic regions is usually compared against an "input" control (i.e. genomic DNA that was not immunoprecipitated). A typical ChIP protocol (see References) and several useful webinars on this topic are available from Abcam.
ENCODE Guidelines:
- Sequence at least 2 biological replicates per experiment
- Each experiment should have a corresponding input
- For histone narrow peaks (e.g. H3K27ac, H3K4me3), at least 20 million mapped fragments (40 million paired-end reads)
- For histone broad peaks (H3K4me1, H3K27me3), at least 45 million mapped fragments (90 million paired-end reads)
- For transcription factors at least 20 million mapped reads per replicate
- The detailed ENCODE ChIP-seq analysis pipeline is described on the ENCODE project website (see References)
- To obtain the required number of reads, libraries can be multiplexed (i.e. pooled and sequenced simultaneously during a single run). The number of reads obtained will depend on (1) the number of samples you multiplex and (2) the number of reads you get per run, which varies across sequencing platforms.
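The multiplexing arithmetic in the last point can be sketched as follows. This is a minimal illustration that assumes reads are split evenly across the pooled libraries, which is rarely exactly true in practice; the function names and the run yield used in the example are hypothetical.

```python
def max_samples_per_run(reads_per_run: int, target_reads_per_sample: int) -> int:
    """Largest number of libraries that can be pooled while each still
    reaches its target depth, assuming an even split of the run."""
    if target_reads_per_sample <= 0:
        raise ValueError("target_reads_per_sample must be positive")
    return reads_per_run // target_reads_per_sample


def expected_reads_per_sample(reads_per_run: int, n_samples: int) -> float:
    """Expected reads per library when n_samples are pooled evenly."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return reads_per_run / n_samples


if __name__ == "__main__":
    # Hypothetical run yield of 400 million reads; 40 million reads
    # per sample matches the narrow histone mark guideline above.
    print(max_samples_per_run(400_000_000, 40_000_000))
    print(expected_reads_per_sample(400_000_000, 8))
```

Since pooling is never perfectly even, it is sensible to leave some margin rather than pooling the maximum number of samples.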
Pipeline for Data Analysis
STEP 1 : Sequencing QC
Downloading raw data
Raw reads are stored in FASTQ format. File extensions include .fastq or .fq, and .fastq.gz (gzip-compressed). Public repositories such as GEO commonly distribute reads through the Sequence Read Archive as SRA (.sra) files. SRA files can be converted into FASTQ using fastq-dump from sratoolkit.
File types and tool: SRA → sratoolkit → FASTQ
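To make the FASTQ format concrete, the sketch below reads records from a plain or gzip-compressed file. It assumes the common four-line layout (header, sequence, "+" separator, quality) with no line wrapping; `read_fastq` is an illustrative helper, not part of any of the tools named here.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a .fastq/.fq file,
    transparently handling gzip compression (.gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality  # drop the leading '@'
```

For example, `sum(1 for _ in read_fastq("sample.fastq.gz"))` counts the reads in a file (the file name is a placeholder).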
QC on FASTQ reads
It is recommended to run quality control checks on the FASTQ data with FastQC, which generates a QC report and highlights potential problems. It may be necessary to trim adaptors and/or shorten the reads, for which several tools are available (e.g. cutadapt, trimmomatic). If reads are shortened, trim them to the same fixed length for all samples prior to alignment.
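The two ideas in this step, per-read quality and trimming to a fixed length, can be illustrated with a few lines of Python. This is only a sketch: it assumes Phred+33 quality encoding, and in practice dedicated trimming tools should be used. The function names are illustrative.

```python
from typing import Optional, Tuple


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def trim_to_length(sequence: str, quality: str,
                   length: int) -> Optional[Tuple[str, str]]:
    """Keep the first `length` bases of a read.

    Returns None when the read is shorter than `length`, so that all
    retained reads end up with exactly the same length.
    """
    if len(sequence) < length:
        return None
    return sequence[:length], quality[:length]
```

Discarding reads shorter than the target length is one possible design choice; it keeps read length uniform across samples at the cost of losing some reads.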
STEP 2 : Alignment
Mapping reads
After FASTQ files have undergone the necessary QC, they have to be mapped to a reference genome. It is important that all samples being compared are mapped to the same version of the genome (genome assembly). The first step is to download an index of the chosen genome for the aligner being used. The alignment step tends to be the most time-consuming, and the files generated are very large (several GB).
File types: FASTQ → SAM → BAM
File formats
STAR can write its output directly in BAM format, whereas bowtie2 writes SAM. SAM output needs to be converted into BAM for several downstream applications, although some peak calling tools require SAM instead of BAM. The two formats can be interconverted using samtools view. Mapped reads can also be stored in CRAM format (a compressed, smaller alternative to SAM/BAM; see cramtools for more), which can likewise be converted to BAM using samtools view. Aligned reads in BAM can be converted back into FASTQ using samtools fastq.
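The conversions above could be expressed as command lines along the following lines. The Python helpers below only build the argument lists and do not run samtools; all file names are placeholders, and the helper names are illustrative. Options used: `-b` (write BAM), `-h` (include the header), `-o` (output file) and `-T` (reference FASTA, needed to decode CRAM).

```python
import shlex
from typing import List


def sam_to_bam(sam: str, bam: str) -> List[str]:
    return ["samtools", "view", "-b", "-o", bam, sam]


def bam_to_sam(bam: str, sam: str) -> List[str]:
    # -h keeps the header so the SAM file remains self-describing
    return ["samtools", "view", "-h", "-o", sam, bam]


def cram_to_bam(cram: str, reference_fasta: str, bam: str) -> List[str]:
    return ["samtools", "view", "-b", "-T", reference_fasta, "-o", bam, cram]


def bam_to_fastq(bam: str) -> List[str]:
    # Written like this, samtools fastq sends reads to standard output
    return ["samtools", "fastq", bam]


if __name__ == "__main__":
    for command in (sam_to_bam("sample.sam", "sample.bam"),
                    bam_to_sam("sample.bam", "sample.sam"),
                    cram_to_bam("sample.cram", "genome.fa", "sample.bam"),
                    bam_to_fastq("sample.bam")):
        print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` on a machine where samtools is installed; for `bam_to_fastq`, standard output would need to be redirected to a file.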
QC on aligned reads
BAM files can be filtered by mapping quality with the samtools view command or with bamtools.
PCR duplicates can be removed with samtools rmdup (marked obsolete in recent samtools releases in favour of samtools markdup).
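To show what these filters act on, the sketch below applies them to individual SAM text lines: column 2 is the bitwise FLAG and column 5 is the mapping quality (MAPQ). The MAPQ threshold of 30 is an assumed example value, and the duplicate flag (0x400) is only present if a duplicate-marking tool has been run beforehand. `keep_alignment` is an illustrative helper.

```python
UNMAPPED_FLAG = 0x4     # read is unmapped
DUPLICATE_FLAG = 0x400  # read was marked as a PCR/optical duplicate


def keep_alignment(sam_line: str, min_mapq: int = 30,
                   drop_duplicates: bool = True) -> bool:
    """Decide whether a SAM line passes the MAPQ and duplicate filters."""
    if sam_line.startswith("@"):
        return True  # header lines are always kept
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    mapq = int(fields[4])
    if flag & UNMAPPED_FLAG:
        return False
    if mapq < min_mapq:
        return False
    if drop_duplicates and flag & DUPLICATE_FLAG:
        return False
    return True
```

With samtools itself, a comparable filter would be `samtools view -b -q 30 -F 1028 in.bam` (1028 = 0x4 + 0x400), again with placeholder file names.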
Merge / Sort / Index
When the same library has been sequenced across several lanes, the resulting BAM files need to be merged using samtools merge. Some downstream applications require sorted and indexed BAM files, which can be produced with the samtools sort and samtools index commands.
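One possible ordering of these commands is sketched below: sort each lane, merge the sorted files (samtools merge expects inputs sorted in the same way), then index the result. The helper only builds the command lists; the output naming scheme is an assumption made for this example.

```python
import os
from typing import List


def lane_merge_commands(lane_bams: List[str], sample: str) -> List[List[str]]:
    """Commands to sort each lane BAM, merge them and index the result."""
    commands = []
    sorted_bams = []
    for bam in lane_bams:
        sorted_bam = os.path.splitext(bam)[0] + ".sorted.bam"
        commands.append(["samtools", "sort", "-o", sorted_bam, bam])
        sorted_bams.append(sorted_bam)
    merged = f"{sample}.merged.bam"  # hypothetical output name
    commands.append(["samtools", "merge", merged, *sorted_bams])
    commands.append(["samtools", "index", merged])
    return commands


if __name__ == "__main__":
    for command in lane_merge_commands(["lane1.bam", "lane2.bam"], "sampleA"):
        print(" ".join(command))
```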
STEP 3 : Peak Calling
Calling peaks
The peak calling step identifies regions of the genome that are significantly enriched for the target protein relative to an input control. Depending on the type of ChIP, different tools and parameters are recommended, and these need to be fine-tuned for each experiment. ChIP for transcription factors usually results in well-defined, "narrow" peaks, whereas histone marks might span large regions and therefore produce "broad" peaks. MACS2 and Homer are two options that can account for these different types of peaks.
File types: BAM, SAM → Tools: MACS2, Homer → File type: BED
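As an illustration of the narrow versus broad distinction, the helper below builds a MACS2 `callpeak` command and adds `--broad` for broad marks. It is a sketch only: the file names, output directory and the `hs` (human) genome size shortcut are assumptions for this example, and real experiments usually need further parameter tuning.

```python
from typing import List


def macs2_callpeak(chip_bam: str, input_bam: str, name: str,
                   genome_size: str = "hs", broad: bool = False,
                   outdir: str = "peaks") -> List[str]:
    """Build a MACS2 callpeak command for a ChIP sample and its input."""
    command = ["macs2", "callpeak",
               "-t", chip_bam,      # treatment (ChIP) alignments
               "-c", input_bam,     # control (input) alignments
               "-f", "BAM",
               "-g", genome_size,   # effective genome size or shortcut
               "-n", name,          # prefix for output files
               "--outdir", outdir]
    if broad:
        command.append("--broad")   # broad-peak mode, e.g. H3K27me3
    return command


if __name__ == "__main__":
    print(" ".join(macs2_callpeak("tf_chip.bam", "input.bam", "tf")))
    print(" ".join(macs2_callpeak("h3k27me3.bam", "input.bam",
                                  "h3k27me3", broad=True)))
```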
Reproducibility and Differential Enrichment
Several tools are available to assess the reproducibility of peak calling between biological replicates. This helps to confirm that peak calling is robust and to retain peaks that are consistently significant across replicates, reducing false positives. These tools include the Irreproducible Discovery Rate (IDR) pipeline and Homer.
Peaks that are differentially enriched between samples or treatments can be identified using the R package DiffBind or Homer.
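A much simpler check than IDR, useful as a first sanity test, is the fraction of peaks in one replicate that overlap a peak in the other. The sketch below is not the IDR procedure; it assumes peaks are given as BED-style (chromosome, start, end) tuples with half-open coordinates, and the function name is illustrative.

```python
from typing import Dict, List, Tuple

Peak = Tuple[str, int, int]  # (chromosome, start, end), half-open


def overlapping_fraction(peaks_a: List[Peak], peaks_b: List[Peak]) -> float:
    """Fraction of peaks in peaks_a overlapping at least one peak in peaks_b."""
    if not peaks_a:
        return 0.0
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = 0
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap when each starts before the other ends
        if any(start < b_end and b_start < end
               for b_start, b_end in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(peaks_a)
```

The linear scan per peak is fine for small examples; for genome-scale peak sets a tool such as bedtools is more appropriate.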
STEP 4 : Visualisation
File type: BAM
QC on coverage reproducibility
Reproducibility can be checked by plotting the correlation of read coverage among biological replicates using deeptools plotCorrelation. Other QC checks, such as GC bias or ChIP strength, are also available in deeptools.
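The quantity being plotted is a correlation coefficient between per-bin read counts of two samples. A minimal sketch of the Pearson coefficient is given below, assuming the coverage has already been summarised into equally sized genomic bins; the function name and input layout are illustrative.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long lists of bin counts."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)
```

Biological replicates are expected to correlate more strongly with each other than with the input or with unrelated samples.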
File types: BEDGRAPH, BIGWIG
Genome coverage
This step generates a BIGWIG (.bw) file containing the read coverage over every chromosome. Coverage can also be normalised at this stage, which makes it possible to compare different samples/treatments in the same experiment. An intermediate format is bedGraph (.bedgraph), which can also be visualised in certain browsers.
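One common normalisation is counts per million mapped reads (CPM), which rescales each bin by the library size. The sketch below writes CPM values as bedGraph lines (chromosome, start, end, value, tab-separated); the input layout and function name are assumptions for this example.

```python
from typing import List, Tuple

BinCount = Tuple[str, int, int, int]  # (chromosome, start, end, read count)


def cpm_bedgraph(bin_counts: List[BinCount],
                 total_mapped_reads: int) -> List[str]:
    """Return bedGraph lines with counts scaled to counts per million."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    scale = 1_000_000 / total_mapped_reads
    return [f"{chrom}\t{start}\t{end}\t{count * scale:.3f}"
            for chrom, start, end, count in bin_counts]
```

In deeptools, a comparable normalisation is available through `bamCoverage --normalizeUsing CPM`.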
Enrichment profile
BIGWIG files can be uploaded directly to a genome browser for inspection. BED files can also be uploaded, so that both the coverage profile and the result of the peak calling can be viewed at the genes of interest. At a global level, the overall enrichment across a set of regions or peaks (e.g. around the TSS) can be plotted and clustered using tools such as deeptools plotHeatmap or ChIPseeker. BIGWIG and BED files are relatively small and can easily be handled locally.
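The idea behind an aggregate enrichment profile can be sketched as averaging the coverage in a fixed window around each reference point (e.g. each TSS). The example below is a simplification: it uses per-base coverage held in memory and ignores gene strand, both of which are assumptions made to keep it short.

```python
from typing import Dict, List, Tuple


def average_profile(coverage: Dict[str, List[float]],
                    centers: List[Tuple[str, int]],
                    flank: int) -> List[float]:
    """Mean coverage at each offset from -flank to +flank around the centers.

    Centers whose window runs off the chromosome are skipped.
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for chrom, position in centers:
        track = coverage.get(chrom)
        if track is None or position - flank < 0 or position + flank >= len(track):
            continue
        for i in range(width):
            totals[i] += track[position - flank + i]
        used += 1
    return [total / used for total in totals] if used else totals
```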
STEP 5 : Functional Analysis & Motif Discovery
Annotation and Gene Ontology
To explore the functional relevance of the identified peaks, they can be annotated (e.g. to the nearest gene or TSS), which also provides information on their genomic location (e.g. % of peaks in promoters, intragenic regions, etc). Peaks can then be analysed for ontologies associated with the corresponding genes or regions. The tools listed here are examples of the many that perform these and several additional functions.
File type: BED
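Annotation to the nearest TSS can be illustrated with a binary search over sorted TSS positions. This sketch measures distance from the peak midpoint and ignores strand; the gene names in the usage example are hypothetical, and dedicated annotation packages handle many more cases.

```python
import bisect
from typing import Dict, List, Optional, Tuple


def nearest_tss(peak: Tuple[str, int, int],
                tss_by_chrom: Dict[str, List[Tuple[int, str]]]
                ) -> Optional[Tuple[str, int]]:
    """Return (gene, distance) for the TSS closest to the peak midpoint.

    tss_by_chrom maps each chromosome to a list of (position, gene)
    sorted by position. Distance is TSS position minus peak midpoint.
    """
    chrom, start, end = peak
    sites = tss_by_chrom.get(chrom)
    if not sites:
        return None
    midpoint = (start + end) // 2
    positions = [position for position, _ in sites]
    i = bisect.bisect_left(positions, midpoint)
    # Only the neighbours on either side of the midpoint can be closest
    candidates = sites[max(i - 1, 0):i + 1]
    position, gene = min(candidates, key=lambda site: abs(site[0] - midpoint))
    return gene, position - midpoint


if __name__ == "__main__":
    tss = {"chr1": [(1000, "geneA"), (5000, "geneB")]}  # hypothetical genes
    print(nearest_tss(("chr1", 1100, 1300), tss))
```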
Finding Motifs
Motif discovery consists of finding DNA sequences that are over-represented, i.e. significantly more frequent in a set of peaks than would be expected by chance (compared against a background). The tools listed here offer a wide range of options, including de novo motif discovery, motif enrichment and motif scanning.
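The notion of over-representation against a background can be illustrated with a toy k-mer comparison. This is not how the listed motif tools work internally: it ignores reverse complements, uses exact k-mers rather than position weight matrices, and reports a ratio rather than a significance test. The pseudocount and function names are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def kmer_counts(sequences: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurring in the given sequences."""
    counts: Counter = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        for i in range(len(sequence) - k + 1):
            counts[sequence[i:i + k]] += 1
    return counts


def kmer_enrichment(peak_seqs: Iterable[str], background_seqs: Iterable[str],
                    k: int, pseudocount: float = 1.0) -> Dict[str, float]:
    """Ratio of each k-mer's frequency in peaks to its background frequency."""
    foreground = kmer_counts(peak_seqs, k)
    background = kmer_counts(background_seqs, k)
    kmers = set(foreground) | set(background)
    fg_total = sum(foreground.values()) + pseudocount * len(kmers)
    bg_total = sum(background.values()) + pseudocount * len(kmers)
    return {
        kmer: ((foreground[kmer] + pseudocount) / fg_total)
              / ((background[kmer] + pseudocount) / bg_total)
        for kmer in kmers
    }
```

Ratios above 1 indicate k-mers that are more frequent in the peak sequences than in the background; the pseudocount avoids division by zero for k-mers absent from one set.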
Legend:
should be performed on a cluster (via terminal)
time consuming step
important QC step
can be performed locally (via terminal)
can be performed on Galaxy (web interface)
can be performed locally on Seqmonk (download software)
Where to run these tools:
Terminal (local or cluster): sratoolkit, FastQC, cutadapt, trimmomatic, FastX, bowtie2, samtools, bamtools, MACS2, Homer, IDR, bedtools, UCSC application binaries, deeptools, ChromHMM
Galaxy (web interface): sratoolkit, FastQC, cutadapt, trimmomatic, bowtie2, samtools, bamtools, MACS2, IDR, bedtools, deeptools, MEME
RStudio: DiffBind, ChIPseeker, ChIPpeakAnno, clusterProfiler, genomation
Download software: Seqmonk, IGV
Web-interface: Biodalliance, UCSC browser, GREAT, MEME suite, Pscan-ChIP
References:
https://www.abcam.com/protocols/cross-linking-chromatin-immunoprecipitation-x-chip-protocol
https://www.encodeproject.org/chip-seq/histone/
https://www.encodeproject.org/chip-seq/transcription_factor/
https://deeptools.readthedocs.io/en/develop/content/help_glossary.html#file-formats
http://homer.ucsd.edu/homer/basicTutorial/index.html
https://hbctraining.github.io/Intro-to-ChIPseq/schedule/2-day.html