I have a bunch of raw Nanopore pod5 output files that I need to take through basecalling, demultiplexing, trimming, and genome alignment. Several of these steps are performed automatically during sequencing, and I have those output files, but they were generated with differing software versions. To keep the basecalling model and the set of modifications called consistent, I need to redo everything from the raw pod5s.
Luckily, all of these steps (modified basecalling, demultiplexing, trimming, alignment) can be done with a single command using Nanopore's Dorado software! I've done some preliminary trials on some of the Flongle output; the code documents are in my SIFP-nanopore repo.
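In outline, that single command looks something like the sketch below (paths are placeholders; the full version I actually ran appears later in this post):

```bash
# One Dorado call covering all four steps (paths are placeholders):
#   --modified-bases : modified basecalling (5mC/5hmC in CG context, plus 6mA)
#   --kit-name       : demultiplexing by barcode kit
#   --trim           : adapter/barcode trimming
#   --reference      : alignment to a genome (output is BAM)
dorado basecaller hac pod5_dir/ \
  --modified-bases 5mCG_5hmCG 6mA \
  --kit-name SQK-NBD114-96 \
  --trim 'all' \
  --reference genome.fna \
  > recalled.bam
```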
Unfortunately, this is very computationally intensive. Scaling up from the Flongle runs (50-200 Mb output) to MinION runs (2-8 Gb output) will require significantly more computational power. Re-calling the trial Flongle data took several hours using CPUs alone, so the MinION runs would take days.
I started some CPU-based MinION recalling for Group 4 (Group4 Library1 MinION recalling, Group4 Library2 MinION recalling), which has the lowest output. However, I want to figure out how to use Hyak GPUs for the Dorado basecalling to significantly reduce the time required.
Sam’s previously done this (though on Mox, and with the older Nanopore file format and basecalling software, fast5 and Guppy), so I’ll be basing my work off of his notebook post.
After a whole day of modifying and testing the SLURM script (each test takes 1-2 hrs, since my jobs have to wait in the ckpt queue), I finally got it to work! The big problem I had to figure out was how to modify Sam’s script to use containerized software. In Sam’s fast5 basecalling script, he just had Guppy installed directly on the Mox server; on Klone, however, Dorado is available in a container (/gscratch/srlab/containers/srlab-R4.4-bioinformatics-container-3886a1c.sif). I found the clue for how to do this by searching Sam’s notebook repo, in the directory of sbatch scripts, for the word “container”, and finding this script:
```bash
# Execute Roberts Lab bioinformatics container
# Binds home directory
# Binds /gscratch directory
# Directory bindings allow outputs to be written to the hard drive.
apptainer exec \
--home "$PWD" \
--bind /mmfs1/home/ \
--bind /gscratch \
/gscratch/srlab/sr320/srlab-bioinformatics-container-586bf21.sif \
/gscratch/scrubbed/samwhite/gitrepos/ceasmallr/code/02.01-bismark-bowtie2-alignment-SLURM-array.sh
```
Sam used the --bind option when executing the container. After looking into what that does, I learned that you have to “bind” your working directories (the ones containing your input data and output folder) to the container so that it knows where they are. Otherwise the container, which is essentially a self-contained computing environment, won’t have access to them. Such a simple fix after sooo long debugging 🥴
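In hindsight, a quick sanity check like the sketch below would have caught this much sooner. If the bind works, the container lists the same files the host sees (the ls target is just an example path):

```bash
# Verify the container can see a host directory through the bind;
# "No such file or directory" here means the bind is missing or wrong
apptainer exec \
  --bind /gscratch \
  /gscratch/srlab/containers/srlab-R4.4-bioinformatics-container-3886a1c.sif \
  ls /gscratch/srlab/kdurkin1/SIFP-nanopore/
```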
After adding binding to my container executions, I finally got a working SLURM batch script! For example, here’s the script used to Dorado basecall the sequencing data from the Group 4 Library 1 MinION run:
```bash
#!/bin/bash
## Job Name
#SBATCH --job-name=G4L1_MinION_Dorado

## Allocation Definition
#SBATCH --account=srlab-ckpt
#SBATCH --partition=ckpt

## Resources
## GPU
#SBATCH --gres=gpu:2080ti:1
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=0-02:00:00
## Memory per node
#SBATCH --mem=120G

## Turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=kdurkin1@uw.edu

## Specify the working directory for this job
#SBATCH --chdir=/gscratch/srlab/kdurkin1/SIFP-nanopore/D-Group4/output/03.01-G4-Library1-MinION-Dorado-recall-GPU/

## Script for running ONT Dorado to perform
## basecalling (i.e. convert raw ONT pod5 to aligned BAM) of Nanopore data generated
## Summer 2025, as part of K.Durkin SIFP project

## This script utilizes a GPU node. These nodes are only available as part of the checkpoint
## partition/account. Since we don't own a GPU node, our GPU jobs are lowest priority and
## can be interrupted at any time if the node owner submits a new job.

###################################################################################
# These variables need to be set by user

wd=$(pwd)

# Programs array (used at the end of the script to log each program's options)
declare -A programs_array
programs_array=(
[dorado]="apptainer exec --nv --bind /gscratch /gscratch/srlab/containers/srlab-R4.4-bioinformatics-container-3886a1c.sif dorado"
)

# Establish variables for more readable code
# Input files directory
raw_pod5_dir=/gscratch/srlab/kdurkin1/SIFP-nanopore/D-Group4/data/03.01-G4-Library1-MinION-Dorado-recall-GPU/
output_dir=/gscratch/srlab/kdurkin1/SIFP-nanopore/D-Group4/output/03.01-G4-Library1-MinION-Dorado-recall-GPU/
genome_file=/gscratch/srlab/kdurkin1/SIFP-nanopore/data/GCA_965233905.1_jaEunKnig1.1/GCA_965233905.1_jaEunKnig1.1_genomic.fna

# Output directory
out_dir=${wd}

# CPU threads (not referenced by the GPU Dorado call below)
threads=28

# Sequencing kit used
kit="SQK-NBD114-96"

# Flow Cell ID
flow_cell_id="FBD08455"

# GPU devices setting
GPU_devices=auto

# Number of FastQ sequences written per file (0 means all in one file;
# not referenced by the Dorado call below)
records_per_fastq=0

###################################################################################

# Exit script if any command fails
set -e

# Load CUDA GPU module
module load cuda/12.9.1

# Run Dorado inside the container. --nv exposes the GPU to the container;
# the --bind options make home and /gscratch visible inside it.
apptainer exec \
--nv \
--home "$PWD" \
--bind /mmfs1/home/ \
--bind /gscratch \
/gscratch/srlab/containers/srlab-R4.4-bioinformatics-container-3886a1c.sif \
dorado basecaller \
hac \
-r ${raw_pod5_dir}/ \
--kit-name ${kit} \
--trim 'all' \
--reference ${genome_file} \
--modified-bases 5mCG_5hmCG 6mA \
--device ${GPU_devices} \
> ${output_dir}/${flow_cell_id}_pass_recalled.bam

###################################################################################

# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr ':' '\n'
} >> system_path.log

# Capture program options
for program in "${!programs_array[@]}"
do
{
echo "Program options for ${program}: "
echo ""
${programs_array[$program]} --help
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true
done
```
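Submitting and monitoring is the standard SLURM workflow; a sketch (the script filename here is hypothetical, and the job ID is G4L1's from the table below):

```bash
# Submit the batch script (filename is hypothetical)
sbatch 03.01-G4L1-MinION-dorado-recall-GPU.sh

# Check queue position / run state of my jobs
squeue -u $USER

# Follow Dorado's progress in the job's default log file
tail -f slurm-30542902.out
```

Since ckpt jobs can be preempted at any time, it's also worth noting that dorado basecaller has a --resume-from option for picking up from a partially written BAM.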
Note that I specified a fairly small GPU in this script, gpu:2080ti:1, because Hyak has a lot of these, and the queue time ended up being much shorter. For larger jobs, a more powerful GPU may be desirable.
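To see which GPU types are actually on offer before picking one, SLURM can list the GRES advertised by nodes in a partition; a sketch (the :20/:40 suffixes are just display column widths, and output will vary with cluster config):

```bash
# Survey GPU types available on nodes in the checkpoint partition
sinfo -p ckpt --Format=nodehost:20,gres:40 | sort -u
```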
I set up and ran scripts to Dorado basecall the passed pod5 files of all 5 MinION sequencing runs by late night of 10/29/2025. By the next morning, all but the G1L4 run had finished:
| Sequencing Run | Dorado Basecalling Job ID | Runtime (HH:MM:SS) | Output Directory |
|---|---|---|---|
| G1L4 MinION | 30552584 | - | https://github.com/shedurkin/SIFP-nanopore/tree/main/A-Group1/output/06.01-G1-Library4-MinION-Dorado-recall-GPU |
| G2L2 MinION | 30552393 | 03:25:48 | https://github.com/shedurkin/SIFP-nanopore/tree/main/B-Group2/output/04.01-G2-Library2-MinION-Dorado-recall-GPU |
| G2L3 MinION | 30553071 | 03:01:26 | https://github.com/shedurkin/SIFP-nanopore/tree/main/B-Group2/output/05.01-G2-Library3-MinION-Dorado-recall-GPU |
| G4L1 MinION | 30542902 | 01:26:17 | https://github.com/shedurkin/SIFP-nanopore/tree/main/D-Group4/output/03.01-G4-Library1-MinION-Dorado-recall-GPU |
| G4L2 MinION | 30550488 | 01:33:34 | https://github.com/shedurkin/SIFP-nanopore/tree/main/D-Group4/output/04.01-G4-Library2-MinION-Dorado-recall-GPU |
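The runtimes above come straight from SLURM's accounting database; a command like this retrieves them for a given job ID (here G4L1's, from the table):

```bash
# Query elapsed wall time and final state for a completed job
sacct -j 30542902 --format=JobID,JobName%30,Elapsed,State
```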
I’m soooo excited to have this working, because basecalling on CPUs alone was taking a very long time. For the G4L1 MinION data, I had it running on CPUs for 24 hours and it generated ~120 MB of output (without completing the run). On a GPU, the full ~900 MB of output was generated in just an hour and a half!
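Next up is sanity-checking the recalled BAMs before any downstream work. Assuming the bioinformatics container includes samtools (I haven't verified that), a quick check would look something like:

```bash
# Quick integrity/mapping-rate check on the recalled, aligned BAM;
# assumes samtools is available inside the container
apptainer exec --bind /gscratch \
  /gscratch/srlab/containers/srlab-R4.4-bioinformatics-container-3886a1c.sif \
  samtools flagstat \
  /gscratch/srlab/kdurkin1/SIFP-nanopore/D-Group4/output/03.01-G4-Library1-MinION-Dorado-recall-GPU/FBD08455_pass_recalled.bam
```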