Session 1: Introduction to Computing Clusters and Bioinformatics

Download the original slides for this lecture here

Presenters: Alyssa Klein, Wei Zhao

Introduction to Available Computing Clusters

We will begin by discussing two computing clusters available at NIH: CCAD and Biowulf. As this is a course from DCEG, we will open this section with a brief description of the Computer Cluster at DCEG (CCAD). CCAD is a cluster reserved for DCEG members, so this section is mostly optional information for interested DCEG members; all following practical and lecture sessions will focus on Biowulf when cluster use is required.


Computer Cluster at DCEG (CCAD)

CCAD is the dedicated computing cluster for DCEG. CCAD operates under a fair use policy to avoid monopolization of resources. In other words, users are given equal access to the available resources as they become available.

For example, say the job queue is empty and User A submits 8 jobs to a cluster which can run 5 jobs at any time. The first 5 of User A’s jobs are started while the other 3 remain in the queue. Meanwhile, User B submits 1 job, which is placed in the queue after User A’s 3 remaining jobs. When one of User A’s running jobs completes, User B’s job will be run next, despite being entered into the queue after User A’s, to ensure fair use. Job scheduling and fair use assurance are managed automatically by the Sun Grid Engine management software, through which all jobs must be submitted.

The two types of cluster use are interactive sessions and cluster jobs.

Interactive sessions allow for actions to be performed on the command line after logging into the cluster. Logging onto CCAD logs a user onto a specialized login node which provides a place for interactive use but does not actively control the CCAD cluster. Cluster jobs are the primary and preferred method of using CCAD in which users submit a job to a queueing system to be run when resources become available, according to fair use.

For your reference, here are the resources available at CCAD:


Additional CCAD Resources


Biowulf

Compared to CCAD, Biowulf is a much larger computing cluster available to all of NIH.

from https://hpc.nih.gov/systems/

Like CCAD, Biowulf operates under a fair use policy by which jobs are prioritized according to each user’s recent usage volume, measured in CPU-hours. If cluster resources are in high demand, jobs from users with lower recent usage are prioritized over jobs from users with higher recent usage. Jobs are scheduled automatically using a workload management software called Slurm, analogous to the Sun Grid Engine noted for CCAD.

Note that, like CCAD, there is a login node on the cluster. This login node is what you are operating immediately upon logging in to Biowulf, and it is shared among all users. For this reason, scientific applications and other computation-intensive processes must be run via job submission, or others will be unable to use the cluster.
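For illustration, a minimal Slurm batch script of the kind submitted with sbatch might look like the following; the job name, resource values, and application shown are hypothetical placeholders, not a prescribed setup:

```shell
#!/bin/bash
# Minimal Slurm batch script -- all values below are illustrative placeholders.
#SBATCH --job-name=MyJobName      # name shown in the queue
#SBATCH --cpus-per-task=4         # CPUs requested for the job
#SBATCH --mem=8g                  # total memory requested
#SBATCH --time=02:00:00           # wall-time limit (hh:mm:ss)

module load samtools              # load a scientific application on the compute node
samtools --version                # replace with your actual analysis commands
```

Saved as myscript.sh, this would be submitted with `sbatch myscript.sh` and monitored with `squeue -u $USER`.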

Biowulf is one of the top 500 most powerful clusters in the world, featuring:

  • 100,000+ cores
  • 200,000+ CPUs
  • 4,000+ nodes
  • 900+ TB memory
  • 3+ PB local scratch (lscratch)
  • ~40 PB high performance storage
  • 5 PB object storage
  • 800+ GPUs (4,000,000+ CUDA cores)

Biowulf offers many versions of over 1000 scientific applications and maintains data from several public databases for user convenience, such as reference genomes, the NCBI nt BLAST database, the NCBI taxonomy database, ClinVar, and gnomAD. Biowulf also features many versions of both Python and R, with over 500 and 1600 packages installed, respectively, and supports containerization using Singularity.

For information on using Biowulf, see the Biowulf website, which contains a wealth of useful information. In particular, we want to highlight the in-depth education and training resources on everything from basic Linux to advanced scripting for Biowulf, as well as the detailed guides for individual scientific software packages.


Additional Biowulf Resources


Comparison of Biowulf and CCAD

Here is a table to summarise and highlight the differences between working with CCAD and Biowulf:

|  | Biowulf | CCAD |
| --- | --- | --- |
| Job submission | sbatch --cpus-per-task=# --mem=#g --job-name #JobName --time ##:##:## myscript.sh<br>swarm -g #memory -t #threads -b #bundle --job-name #JobName --time ##:##:## myscript.sh | module load sge<br>qsub -N JobName -e error.e -o output.o --cpus-per-task # --mem #g myscript.sh |
| Interactive jobs | sinteractive --cpus-per-task=#cpus --mem=#g | module load sge<br>qlogin [options], qsh |
| Cancel jobs | scancel #job-id<br>scancel --name=#JobName | qdel job_id [options] |
| Monitor jobs | squeue, sjobs, jobload, jobhist | qstat |
| Transfer data | Globus, WinSCP/Fugu | scp, rsync |
| Load applications | module spider #module<br>module load #module | module load #module |

Snapshots

Snapshots are read-only copies of data at a particular point in time. These snapshots are very helpful if you inadvertently delete a file that you need. To locate the file you are interested in, you can go to a specific snapshot using the following command:

cd .snapshot

This will take you to a main snapshot directory that contains all of the other snapshot directories. From there you can find the file you need and copy it back to the desired directory.
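The recover-and-copy pattern can be sketched as follows. This is a toy demonstration that uses a temporary directory to stand in for the cluster’s real .snapshot tree; the snapshot and file names are hypothetical:

```shell
# Toy demonstration of snapshot recovery: a temporary directory stands in
# for the cluster's real .snapshot tree (names here are hypothetical).
workdir=$(mktemp -d)
mkdir -p "$workdir/.snapshot/daily-2025-01-01"
echo "important results" > "$workdir/.snapshot/daily-2025-01-01/results.txt"

cd "$workdir"
ls .snapshot/                                # see which snapshots exist
cp .snapshot/daily-2025-01-01/results.txt .  # copy the deleted file back
cat results.txt                              # verify the recovered contents
```

On the clusters themselves, the only difference is that the .snapshot directory already exists inside your storage directory; you never create it yourself.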

Here are the policies of both clusters on snapshots:

| Biowulf | CCAD |
| --- | --- |
| Home dir: weekly backups, with daily incremental backups<br>Data dir: NOT BACKED UP<br>Buy-in storage<br>Additional information: File Backups and Snapshots on the HPC Systems | Nightly snapshots last one week<br>6-hour snapshots last 3 days<br>True backups done via CBIIT, taken weekly and retained based on their policies<br>Permanent backups need to be requested to be transferred to the archive |

More information on Biowulf snapshots can be found here: https://hpc.nih.gov/storage/backups.html.


Cluster How-Tos: Connect, Transfer Files/Share Data

CCAD and Biowulf are primarily accessed for one of two purposes: direct use of the cluster or transferring files. Here we will summarise how to do both.

| Host | Hostname | Accessible by | Purpose |
| --- | --- | --- | --- |
| Biowulf | biowulf.nih.gov | All HPC users | cluster head node |
| Helix | helix.nih.gov | All HPC users | data transfer |
| HPCdrive | hpcdrive.nih.gov | All HPC users | data transfer |
| CCAD | ccad.nci.nih.gov | All HPC users | cluster head node |
| CCAD Tdrive | gigantor.nci.nih.gov | All HPC users | data transfer |

Connecting via SSH

The only method for directly accessing either cluster is the command line, via secure shell (SSH).

Connecting via SSH varies depending on whether you are using a macOS or Windows computer. Windows users will need to install PuTTY to connect via SSH.

PuTTy configuration menu
Logging into Biowulf using PuTTY

If you are using a Mac, no additional software is necessary. Simply open the Terminal app and type ssh -Y [user]@biowulf.nih.gov, where [user] is your login ID.

Note: -Y enables trusted X11 forwarding. X11 forwarding allows graphical applications to be used on a remote server (e.g. the Integrative Genomics Viewer, IGV). Another option for running graphical applications is NoMachine (NX) (https://hpc.nih.gov/docs/nx.html#setup).

We will detail and practice connecting to CCAD and Biowulf further in the practical section.


Transfer of files can be accomplished in one of several ways:

Graphical User Interface (GUI) file transfer applications

GUI-based transfer applications can be a convenient way to transfer data. WinSCP for Windows and FileZilla for both Windows and macOS are free applications recommended for file transfers.

WinSCP login window
File transfer with WinSCP
File transfer with FileZilla

Mounting drives

Drives from the NIH HPC and CCAD can be mounted directly to your local computer, which allows you to click and drag files in familiar fashion. This is best suited to small file transfers; larger files should be transferred through another method.

Doing so will require you to specify a folder to be mounted. Refer to the table below for the correct path formatting.

| Host | Description | Directory at cluster | SMB path for Windows | SMB path for Mac |
| --- | --- | --- | --- | --- |
| Biowulf/Helix | user’s home directory | /home/[user] | \\hpcdrive.nih.gov\[user] | smb://hpcdrive.nih.gov/[user] |
| Biowulf/Helix | data directory | /data/[user] | \\hpcdrive.nih.gov\data | smb://hpcdrive.nih.gov/data |
| Biowulf/Helix | user’s scratch space | /scratch/[user] | \\hpcdrive.nih.gov\scratch\[user] | smb://hpcdrive.nih.gov/scratch/[user] |
| Biowulf/Helix | shared group area (e.g. you are a member of group PQRlab) | /data/PQRlab | \\hpcdrive.nih.gov\PQRlab | smb://hpcdrive.nih.gov/PQRlab |
| CCAD | main directory | /home/ | \\gigantor.nci.nih.gov\ifs | smb://gigantor.nci.nih.gov/ifs |
| CCAD | user’s home directory | /home/[user] | \\gigantor.nci.nih.gov\ifs\DCEG\Home\[user] | smb://gigantor.nci.nih.gov/ifs/DCEG/Home/[user] |

For more detailed instructions on how to mount drives on Biowulf, see here.


Globus file transfer

Globus is the recommended method to transfer and share files from Biowulf. Globus has the ability to monitor transfer performance, retry failures, recover from faults automatically, and report transfer status. See here for how to set up a Globus account.


Command line file transfer

Finally, one can use the command line to transfer files via the secure FTP (Windows via PuTTY) and secure copy (Windows via PuTTY, macOS) commands, like so to transfer to the HPC:

Windows via PuTTY: pscp source_folder/my_file_1.txt username@helix.nih.gov:/destination_folder

MacOS: scp source_folder/my_file_1.txt username@helix.nih.gov:/destination_folder

Or conversely from the HPC:

Windows via PuTTY: pscp username@helix.nih.gov:/source_folder/my_file_2.txt destination_folder

MacOS: scp username@helix.nih.gov:/source_folder/my_file_2.txt destination_folder

Note that these commands use the Helix system, not Biowulf (as designated by @helix.nih.gov). Biowulf should not be used for file transfers unless done from within a job on the cluster.
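The comparison table earlier also lists rsync, which is convenient for transferring whole directories because it preserves timestamps and can resume interrupted transfers. A sketch using the same placeholder paths as the scp examples above:

```shell
# -a preserves permissions and timestamps, -v is verbose,
# -P shows progress and keeps partial files so transfers can resume
rsync -avP source_folder/ username@helix.nih.gov:/destination_folder/
```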


Basic Linux Commands

Linux Programming

Because CCAD and Biowulf must be accessed via the command line, it is necessary to know some Linux before using either. We have already discussed a few command line programs, such as ssh for connecting to the clusters and scp for file transfer, but more is required to operate them. We will cover some of the most useful commands in the practical section for this session.

For now, to learn more about Linux, please see the Linux cheatsheet and vim cheatsheet we have prepared for your reference, and/or view these external resources:


Bioinformatics File Formats and Tools

Finally we will discuss some of the most important file formats used in bioinformatics. These file formats will include the ones you are likely to encounter in a typical genomics study, but there are still many more specialized file formats which we won’t cover in this session.

Before we describe these file formats in more detail, below is a workflow diagram tracing the flow of information over the course of a cancer study, along with a slightly more detailed description of each format in the table that follows.


Raw Data

Sequencing data with per-base quality scores is stored in one of two forms: fastq (raw reads) or SAM/BAM/CRAM (aligned reads).

Fastq is the most basic sequencing format and stores only sequence identifiers, sequenced bases, and per-base quality scores. One nucleotide base in fastq format takes roughly 2 bytes to store (sequence + quality score). Bearing in mind that the human genome is about 3 gigabases, one 30x whole genome sequencing sample comes to roughly 3 Gb × 30 × 2 bytes per base ≈ 180 GB. Compressed fastq files are typically about 25% of the original size.
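The arithmetic above can be sanity-checked directly in the shell:

```shell
# uncompressed size: 3 Gb genome x 30x coverage x ~2 bytes per base
echo "$((3 * 30 * 2)) GB uncompressed"    # prints: 180 GB uncompressed
# compressed size at ~25% of the original
echo "$((3 * 30 * 2 / 4)) GB compressed"  # prints: 45 GB compressed
```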

Fastq reads that have been aligned to a reference genome are stored as alignment files in SAM/BAM/CRAM format. SAM (Sequence Alignment Map) is a human-readable, uncompressed alignment format, whereas BAM and CRAM are both compressed forms of SAM. BAM/CRAM files are considerably smaller than SAM files and are therefore the most common alignment formats for data sharing and deposit.

BAM is a lossless compression: the original SAM can always be perfectly reconstructed during decompression. CRAM offers a greater degree of compression than BAM and can be either lossless or lossy (i.e. some information may be discarded during compression).
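As a sketch of how these formats are interconverted in practice with Samtools (the file and reference names here are hypothetical; on Biowulf, samtools would first be loaded with the module system):

```shell
# SAM -> BAM: lossless compression of the alignment file
samtools view -b -o sample.bam sample.sam

# BAM -> CRAM: CRAM compression is reference-based, so the same
# reference genome FASTA used for alignment must be supplied
samtools view -C -T reference.fa -o sample.cram sample.bam
```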


Processed Data

Variant calling results can be shared in VCF or MAF format. In somatic genomic studies, one VCF file is generated for each tumor/normal pair. In germline studies, one VCF contains pooled variants for all samples.

MAF files contain aggregated mutation information from VCF files and are generated at the project level. MAF files also include annotation of variants from public databases.


Preparing Data for Database Submission

  • NCBI Sequence Read Archive (SRA)
    • requires raw data with per-base quality scores for all submitted data.
    • accepts binary files such as BAM/CRAM, HDF5 (for PacBio, Nanopore), SFF (when BAM is not available, for 454 Life Science and Ion Torrent data), and text formats such as FASTQ.
    • For more information: https://www.ncbi.nlm.nih.gov/sra/docs/submit/
  • Gene Expression Omnibus (GEO)
    • studies concerning quantitative gene expression, gene regulation, epigenetics, or other functional genomic studies. (e.g. mRNA-seq, miRNA-seq, ChIP-Seq, HiC-seq, methyl-seq/bisulfite-seq)
    • does NOT accept WGS, WES, metagenomic sequencing, or variation or copy number projects.
    • a complete submission includes: metadata, processed data, raw data containing sequence reads and quality scores (will be submitted to SRA by the GEO team)
    • For more information: https://www.ncbi.nlm.nih.gov/geo/info/seq.html
  • The database of Genotypes and Phenotypes (dbGaP)
    • studies investigating the interaction of genotype and phenotype in humans.
    • all submissions that require controlled access must be submitted through dbGaP.
    • requires registration of the study and subjects prior to data submission.
    • raw data will be submitted to the protected SRA account.
    • For more information: https://www.ncbi.nlm.nih.gov/sra/docs/submitdbgap/

File formats

Finally, we present a summary table of the bioinformatics file formats and what tools are available to work with them:

| Format name | Data type | Tools |
| --- | --- | --- |
| SRA | a raw data archive with per-base quality scores | sra-tools |
| FASTA | a text file of reference genome sequence data | FASTA Tools |
| FASTQ | a text file of sequencing data with quality scores | FastQC, FASTX-Toolkit, Seqtk, Samtools, Picard tools |
| SAM/BAM/CRAM | formats of sequence alignment data | Samtools, Picard tools |
| BCF/VCF/gVCF | a tab-delimited text file to store variant calls | bcftools |
| BED (PEBED) | a tab-delimited text file to store the coordinates of genomic regions | bedtools |
| GTF/GFF/GFF3 | a tab-delimited text file describing genes or other features | gff tools, GFF utilities (gffread, gffcompare) |
| MAF | a tab-delimited text file of aggregated, project-level mutation information from VCF files | MAFtools |

Additional Resources

For full descriptions of aforementioned formats as well as tools to work with them, see the resources below as well as the supplementary information tab for this session.