Lines Of Code

Getting Started

Introduction to NGS data analysis


Next-generation sequencing (NGS) is a term used to describe modern sequencing technologies that decode the identity and order of nucleotides within DNA/RNA molecules. NGS has revolutionised genomic research by providing high-throughput, rapid and accurate sequencing of genes at low cost compared with previous sequencing methods.

 

Bioinformatics is vital for analysing the large volumes of data generated. All NGS platforms sequence millions of small fragments of DNA, and bioinformatic tools are required to piece these fragments together by mapping individual reads to a reference genome. Different kinds of sequence data can be generated: for example, sequencing RNA quantifies expression levels (RNA-seq), while sequencing chromatin immunoprecipitation products reveals how genes are regulated (ChIP-seq).

 

NGS has become an essential tool across all areas of biological science, and the ability to analyse sequencing data is a beneficial and sought-after skill. Below are resources for those who wish to learn NGS data analysis. You can analyse NGS data using coding languages such as R, Python or Bash (see Introduction to Coding). Alternatively, you can use platforms specifically created for the analysis of large biological datasets which do not require any coding, such as Galaxy or Seqmonk (see User-friendly Interfaces). The choice is yours!


For more information, click on the underlined links within the text.

Working on a Terminal


Most of the tools that you'll find on this website need to run on a Unix/Linux operating system, which you access through a "terminal". This is a command-line interface (also known as a shell) that allows you to type text commands using a specific syntax (Bash) in order to perform a number of actions. The terminal application is easily accessible on a Mac (simply search "terminal" in your Launchpad). Terminals are also available on Windows, but in order to connect to a Linux machine and run commands there you will need a client app such as PuTTY. To use the terminal comfortably, you will need to learn a few basic commands that allow you to create and delete directories, navigate between them, and set up your working directory by defining an environment variable. This is a prerequisite before you start downloading and using any tools on a terminal.
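As a quick reference, those basic operations look like this in a Bash terminal (the directory names here are just examples):

```shell
# print the current working directory
pwd

# create a directory for your project and move into it
mkdir ngs_project
cd ngs_project

# list the contents of the current directory
ls

# create and then delete an empty subdirectory
mkdir temp_dir
rmdir temp_dir

# define an environment variable pointing to your working directory
export WORKDIR="$PWD"
echo "$WORKDIR"
```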


Typically, all the commands you want to run will be listed in a single text file (a script) containing all the necessary information to tell the shell environment what to do and when to do it. A script functions like a protocol in the lab: you not only record the steps (i.e. commands) you use, but also any other essential details of what each step means and why you did it (i.e. comments). As with a lab book, it is best practice to keep a good record of your scripts so that you and others can reproduce your analysis, the same way you maintain a lab book to ensure the reproducibility of your experiments in the lab. There are a number of script editing tools (e.g. Visual Studio Code and Sublime Text) that allow you to edit and annotate your scripts in a text editor while integrating the terminal in the same program. Similarly, if you are using R, a tool like RStudio will allow you to edit, run and visualise the result of your commands (e.g. pretty heatmaps) all in the same interface. Other interactive notebooks such as Jupyter allow you to edit your Python and R scripts in a web-based, user-friendly interface. When it comes to sharing your code and software with the wider community, GitHub is the go-to platform. This is an online repository hosting service with a diversity of applications.
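For illustration, a very small Bash script in this lab-book style might look like the sketch below; the file name and contents are hypothetical (it generates its own tiny FASTQ file so it is self-contained), but the commenting style is the point:

```shell
#!/bin/bash
# count_reads.sh - count how many reads a FASTQ file contains.
# Why: a quick sanity check on a file before starting an analysis.

# Step 1: make a tiny two-read FASTQ file so the example is self-contained
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nIIII\n' > sample.fastq

# Step 2: count the lines; each FASTQ read occupies exactly 4 lines,
# so the number of reads is the line count divided by 4
lines=$(wc -l < sample.fastq)
echo "sample.fastq contains $((lines / 4)) reads"
```

In real use you would point the script at your own FASTQ files rather than generating one, but the record of what each step does and why would stay the same.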


As you navigate through the tutorials for each of the tools you will want to use for your analyses, you will find instructions on how to download and install them. Alternatively, you can do this using a package manager such as Bioconda, a channel for the conda package manager specialising in bioinformatics software. This also functions as a repository, allowing you to easily find, install and update different software packages such as Bowtie2, samtools, Homer, etc.
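Assuming you already have conda installed, setting up Bioconda and installing a tool typically takes only a couple of commands (Bowtie2 and samtools here as examples):

```shell
# register the channels Bioconda relies on (one-off configuration)
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# install tools into the current conda environment
conda install bowtie2 samtools
```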

 

A limitation of personal computers and laptops is the lack of sufficient processing power and RAM to handle large datasets. The best way to circumvent this is to perform your analysis on a cluster: a number of computers configured to act together as a single unit with enhanced performance (more on high performance computing here). Each computer in a cluster is referred to as a node. Typically, you access a cluster by logging in to the head node via the terminal, which requires credentials for your institution's cluster if that is what you are using. Alternatively, you can consider cloud computing, which provides pay-as-you-go computing power and storage from a cloud provider such as Amazon Web Services (AWS).
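Logging in to a head node is usually a single ssh command from your terminal; the username and hostname below are placeholders for the credentials your institution provides:

```shell
# connect to the cluster's head node (placeholder address, not a real one)
ssh username@cluster.example.ac.uk
```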

[Diagram: scripts are submitted from a terminal to the cluster's head node, with compute nodes and storage]

Interaction with the cluster via the head node happens through the submission of a batch script, usually referred to as a job, which is a text file containing the sequential commands for your analysis. The way you submit and cancel jobs will depend on the job scheduling system of your cluster, so you will have to refer to its own documentation. Once submitted, the job is placed in a queue and runs when the computing resources you requested become available. You will be handling and producing large volumes of data, and it is essential that these data are stored appropriately and securely in the designated folders on your cluster's filesystem and within your allocated quota.
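What a batch script looks like depends on the scheduler. As one common example, on a SLURM-based cluster a minimal job could be sketched as below; the resource numbers, file and index names are purely illustrative:

```shell
#!/bin/bash
#SBATCH --job-name=align_sample    # name shown in the queue
#SBATCH --cpus-per-task=4          # CPUs requested for this job
#SBATCH --mem=16G                  # memory requested
#SBATCH --time=02:00:00            # wall-time limit (hh:mm:ss)

# sequential commands for the analysis (example: an alignment)
bowtie2 -p 4 -x genome_index -U sample.fastq.gz -S sample.sam
```

On SLURM you would submit this with sbatch, inspect the queue with squeue and cancel with scancel; other schedulers (e.g. PBS, SGE) use different commands, so check your cluster's documentation.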

Introduction to Coding

 

Here are a few introductory courses on programming languages used in data analysis, suited for those with no coding experience.


        R : R is an open source programming language commonly used for statistical computing and graphics. It is one of the most popular languages used by statisticians, data analysts and researchers to retrieve, clean, analyse, visualise and present data. With R you can perform tasks such as linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering etc. and produce publication-quality plots. 


Intro courses:

Bitesizebio : An introduction to statistical analysis with R in the form of a webinar, 1-2h

Swirlstats :  An easy and interactive tutorial on R programming, 6-8h

Datacamp : An introduction to R, good to understand basic data types, some content free but limited, 4 hours


Tools for data analysis:

RStudio : This is a downloadable set of integrated tools that allows you to write and run R commands while visualising the resulting plots in the same window.

Bioconductor : This is an open-source software development project through which most of the packages used for biological data analysis in R are installed.


 

         

Shell/Bash : A shell is a command-line interpreter that allows users to interact with their computer and tell it what to do through commands. There are different types of shells; Bash ('Bourne Again SHell') is the most widely used. From the command line you can obtain, scrub, explore and model your data. The shell also allows you to combine existing programs in new ways, automate repetitive tasks and generally improve your workflow to become more efficient in your data analysis. Learning the fundamentals of Bash is highly recommended, as you will need it to access data files on your institution's cluster and run bioinformatic programs.
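As a small taste of that automation, the loop below renames a batch of files in one go (the files are created first so the sketch is self-contained):

```shell
# create a few example files to work on
touch sample1.txt sample2.txt sample3.txt

# loop over every matching .txt file and add a 'processed_' prefix
for f in sample*.txt; do
    mv "$f" "processed_$f"
done

# list the renamed files
ls processed_*
```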


Intro courses:

Udemy : An introduction to Linux Shell Scripting, 1 hour

Datacamp: Introduction to Shell, 4 hours

Datacamp: Introduction to Bash Scripting, 4 hours


Python: Python is an all-purpose programming language used in a wide range of applications, including statistical computing and graphics. Most of the tasks that can be performed in R can also be done in Python; both languages are useful for data work and offer different advantages and disadvantages, so the choice between R and Python depends on your needs. Python is generally recommended for those who wish to build their own bioinformatic software/tools.


Intro courses:

edX : An introduction to Python for Data Science, 2-5 hours

Datacamp: Introduction to Python, 4 hours


User-friendly Graphical Interfaces

(no coding required)


                   Galaxy

Galaxy is an open-source application that enables researchers without informatics expertise to perform computational analyses through the web. The user uploads data directly through the website, and Galaxy interacts with the servers that run the analyses and the disks where the data are stored. A broad range of tools is available for mapping, peak calling and heatmap generation (ChIP-seq, RNA-seq, scRNA-seq).

A comprehensive list of Galaxy tutorials can be found here.

deepTools tutorial : a suite of tools for ChIP-seq analysis


       Seqmonk​

Seqmonk is a tool for analysing high-throughput mapped sequence data locally (ChIP-seq, RNA-seq, scRNA-seq). It requires installation on your computer.

Training course

Video training course


Support and Communities

If you run into a coding problem, the internet is your friend! There is a high chance someone has had the same problem and the online community has kindly provided a solution. Here are a few guides and communities you might find helpful:


Bioconductor: see questions and answers about Bioconductor packages

Stack Overflow: an online community for developers to learn and exchange knowledge. A very good section provides a list of R-related resources, and you can search the [r-faq] tag for frequently asked questions about R.


Public NGS Repositories 


ArrayExpress Archive

This platform stores raw and processed data from high-throughput functional genomics experiments and provides these data for reuse by the research community.

Downloadable formats: fastq, fastq.gz, sra, bam, counts (.txt)


Gene Expression Omnibus

GEO is a public functional genomics data repository comprising raw and processed NGS data.

Downloadable formats: sra, gene counts (.txt), bigwig, bedgraph, bed
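For example, raw sra runs deposited in GEO can be fetched from the command line with the sra-tools package (installable via Bioconda); the accession below is a placeholder for a real SRR identifier:

```shell
# download the .sra file for a sequencing run by its accession
prefetch SRR0000000

# convert the downloaded run into FASTQ files
fasterq-dump SRR0000000
```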


ENCODE Portal

This portal stores raw and processed NGS data from ENCODE and associated projects. You can access protocols, raw data and ground-level analysis data from a wide range of assays such as RNA-seq, ChIP-seq, ATAC-seq, etc.

Downloadable formats: sra, fastq.gz, gene counts (.txt), bigwig, bed


CISTROME 

Cistrome is an integrative analysis pipeline that helps experimental biologists integrate and explore publicly available ChIP-seq and DNase-seq data. All available data are processed under the same criteria and quality control (QC) and are ready to visualise in a genome browser.

Downloadable formats: bigwig, bed

 
