Processing of RNA-seqΒΆ

  • SDU

The SDU workflow for processing RNA-seq data is given as:

#!/bin/sh

#### PATHS TO REQUIRED SOFTWARE, INPUT & OUTPUT FOLDERS
DAT="/path/to/data" ### PATH TO FOLDER CONTAINING GZ-COMPRESSED FASTQ FILES
OUT="/path/to/output" ### PATH TO FOLDER DEPOSITING THE RESULTS
REFT="/path/to/reference" ### PATH TO FOLDER TRANSCRIPTOME REFERENCE (gtf/gff file)
REFG="/path/to/reference" ### PATH TO FOLDER GENOME REFERENCE (fa file and bowtie index files)
SAMTOOLS="/path/to/samtools" ### PATH TO SAMTOOLS
TOPHAT="/path/to/tophat2" ### PATH TO TOPHAT2
BOWTIE="/path/to/bowtie" ### PATH TO BOWTIE
BBMAP="/path/to/BBMAP" ### PATH TO BBMAP
PYTHON="/path/to/python"
HTSEQ="/path/to/HTSeq"

export PATH=$SAMTOOLS:$BOWTIE:$TOPHAT:$HTSEQ:$PATH
echo $PATH

### UNPACKING gz-compressed fastq files
cd $DAT
for i in *_R1.fastq.gz;
do
newfile=$(basename $i _R1.fastq.gz)
gunzip /$DAT/${newfile}_R1.fastq.gz
gunzip /$DAT/${newfile}_R2.fastq.gz
done

### Quality cleaning (adaptor removal, trimming of low quality bases and reads)
for i in *_R1.fastq;
do
newfile=$(basename $i _R1.fastq)
$BBMAP/bbduk.sh -Xmx20g \
                in1=/$DAT/${newfile}_R1.fastq \
                in2=/$DAT/${newfile}_R2.fastq \
                out1=/$DAT/${newfile}_clean_R1.fastq \
                out2=/$DAT/${newfile}_clean_R2.fastq \
                ref=$BBMAP/resources/adapters.fa \
                ktrim=r ktrim=l k=23 mink=11 hdist=1 tpe tbo \
                qtrim="rl" trimq=10 maq=10 minlen=25
done

#### Bowtie2-Tophat2 alignment
for i in *_clean_R1.fastq;
do
newfile=$(basename $i _clean_R1.fastq)
$TOPHAT -p 1 -G $REFT \
                --output-dir $OUT/${newfile} \
                $REFG $DAT/${newfile}_clean_R1.fastq \
                $DAT/${newfile}_clean_R2.fastq
$SAMTOOLS/samtools index $OUT/${newfile}/accepted_hits.bam
mv $OUT/${newfile}/accepted_hits.bam $OUT/${newfile}/${newfile}_accepted_hits.bam
mv $OUT/${newfile}/accepted_hits.bam.bai $OUT/${newfile}/${newfile}_accepted_hits.bam.bai

### Count Matrix construction by HTSeq
$python -m HTSeq.scripts.count \
                --format bam \
                --mode union \
                --stranded no \
                --minaqual 1 \
                --type gene \
                --idattr gene_id \
                $OUT/${newfile}/${newfile}_accepted_hits.bam $REFT \
                > $OUT/${newfile}_gene_read_counts_table.tsv

done
  • AAUH
The spliced alignment of RNA-seq performed with tophat in the above script can altertively be done using a 2-pass alignment with STAR. This is the suggested method in the GATK best practices

A script for doing this is shown below:

## Create temp dir
mkdir /scratch/$PBS_JOBID
TMPDIR=/scratch/$PBS_JOBID
cd $TMPDIR

## Path to paired fastq files
rawDataDir="Path to directory with QC fastq files"
R1=$(ls $rawDataDir | grep $id | grep R1)
R2=$(ls $rawDataDir | grep $id | grep R2)

## Move fastq files to scratch
cp $rawDataDir/$R1 $rawDataDir/$R2 $TMPDIR

## Programs
STAR="path to star"

## Reference data
assembly="Path to reference genome fasta"
genomeDir="Path to STAR indexed reference genome"

##########################
#### Align using STAR ####
##########################

### Do 1st pass
mkdir $TMPDIR/1pass
cd $TMPDIR/1pass

$STAR \
--genomeDir $genomeDir \
--readFilesIn ../$R1 ../$R2 \
--readFilesCommand zcat \
--runThreadN 8

### Create new index using splice junction information from 1st pass
mkdir $TMPDIR/b37_2pass

$STAR \
--runMode genomeGenerate \
--genomeDir $TMPDIR/b37_2pass \
--genomeFastaFiles $assembly \
--sjdbFileChrStartEnd $TMPDIR/1pass/SJ.out.tab \
--sjdbOverhang 75 \
--genomeSAsparseD 2 \
--runThreadN 8 \
--limitGenomeGenerateRAM 20000000000

### Do 2nd pass
mkdir $TMPDIR/2pass
cd $TMPDIR/2pass
$STAR \
--genomeDir $TMPDIR/b37_2pass \
--readFilesIn ../$R1 ../$R2 \
--readFilesCommand zcat \
--runThreadN 8 \
--outSAMstrandField intronMotif \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix ${id}_STAR_
#--outSaMmapqUnique 60 \

## Return output
cp * $outDir

## Clean up scratch
cd /scratch
rm -fr $PBS_JOBID

Transcipt level expression can then be inferred using Cufflinks. This is done using the script below:

## Paths
mkdir /scratch/$PBS_JOBID
TMPDIR=/scratch/$PBS_JOBID
bamFile="path to STAR aligned BAM file"
cufflinks="path to cufflinks"
gff="Path to gff file"
outDir="Path for output files"

cd $TMPDIR
cp $bamFile $TMPDIR

# Construct the mask file
grep rRNA $gff > mask.gff3
grep tRNA $gff >> mask.gff3

# Run cufflinks
echo Running cufflinks ...
$cufflinks \
--GTF-guide $gff \
--mask-file mask.gff3 \
--library-type fr-unstranded \
--num-threads 8 \
--output-dir cufflinks \
--quiet \
$bamFile

echo moving files to home dir
cp -fr $TMPDIR/cufflinks/* $outDir