Want to compile some summary stats (e.g., sequencing output, coverage, modification %s) and plot them for the Smithsonian No Bones presentation (12/16/25).
modkit summary
First, finished the modkit summaries of the % modifications called in each sample from each sequencing run. The full process, in outline:
Re-basecall all the MinION .pod5 files (quality-passed only) using Dorado, including both alignment to the E. knighti genome and modified basecalling for 5mCG, 5hmCG, and 6mA modifications (note that only 5mC and 5hmC found in a CG context were basecalled). Outputs are BAM files.
From those, separate by barcode and isolate only mapped reads, so I end up with one BAM file per barcode that contains only the reads with that barcode that mapped to the E. knighti genome.
Then, use modkit summary to summarize the presence of each modification type. This is global modification, e.g. the % of all A nucleotides in the BAM file that have the 6mA modification. I performed modkit summary on both the un-separated BAM files and on the barcodes individually, saved the outputs, and plotted them (e.g. the below plot, taken from G4L2).
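The steps above can be sketched as command lines. This is a minimal Python sketch that only builds the command strings and does not run anything. The file names, the `sup` model, and the barcode label are hypothetical placeholders, and it assumes Dorado wrote each read's barcode to the `BC` tag. Barcode classification options are left out because the kit is not named in these notes. The knitted code linked below is the authoritative version.

```python
import shlex

# Hypothetical placeholders: none of these names come from the notebook.
POD5_DIR = "pod5_pass/"
REFERENCE = "E_knighti_genome.fasta"
MODEL = "sup"
RUN_BAM = "run_calls.bam"
BARCODE = "barcode01"


def basecall_cmd(pod5_dir, reference, model, out_bam):
    """Dorado basecalling with alignment to the reference and
    modified-base calling for 5mCG/5hmCG and 6mA; BAM goes to stdout."""
    args = ["dorado", "basecaller", model, pod5_dir,
            "--reference", reference,
            "--modified-bases", "5mCG_5hmCG", "6mA"]
    return shlex.join(args) + " > " + shlex.quote(out_bam)


def mapped_barcode_cmd(run_bam, barcode, out_bam):
    """Keep reads whose BC tag equals the barcode (-d) and that are
    mapped (-F 4 drops unmapped reads). Assumes a BC tag is present."""
    args = ["samtools", "view", "-b", "-F", "4",
            "-d", f"BC:{barcode}", "-o", out_bam, run_bam]
    return shlex.join(args)


def summary_cmd(bam, out_txt):
    """modkit summary prints its table to stdout; redirect it to a file."""
    return shlex.join(["modkit", "summary", bam]) + " > " + shlex.quote(out_txt)


if __name__ == "__main__":
    barcode_bam = f"{BARCODE}.mapped.bam"
    print(basecall_cmd(POD5_DIR, REFERENCE, MODEL, RUN_BAM))
    print(mapped_barcode_cmd(RUN_BAM, BARCODE, barcode_bam))
    print(summary_cmd(RUN_BAM, "run_summary.txt"))          # un-separated BAM
    print(summary_cmd(barcode_bam, f"{BARCODE}_summary.txt"))  # per barcode
```

One thing to check when reusing this: `modkit summary` estimates its numbers from a sample of reads by default, so its sampling options matter if exact global percentages are wanted.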
Knitted code and plots for each sequencing run linked below:
Group 1, Library 4, MinION (2019 vouchers)
Group 2, Library 2, MinION (1960s vouchers)
Group 2, Library 3, MinION (1960s vouchers)
Group 4, Library , MinION (pre-1900 vouchers)
Group 4, Library , MinION (pre-1900 vouchers)
Plotting
Then generated a bunch of summary plots potentially useful for the presentation, including visualizations of output, quality, modifications, etc. Code writing supported with Claude Sonnet AI. Of particular interest might be plots of specimen output, quality, and modification %s by age, to highlight issues with sequencing older specimens.
Knitted code, with plots, found here: 03-nobones-plotting
Additional analyses
Finally, did some additional data processing: summarized per-base modifications. The current BAM files show, for every single read, each base called with or without modifications. modkit pileup summarizes these down to BED files that show, for every position in the genome, the % modification (if applicable). Performed this by specimen (barcode), so for specimens that were sequenced multiple times all the data is combined.
Knitted code: 04-specimen-pileup
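To illustrate what combining data across runs means for a per-position percentage, here is a minimal Python sketch. It assumes the per-position modified and valid-coverage read counts (as reported in modkit's bedMethyl output) have already been read into tuples. The record layout, the contig name, and the counts are hypothetical, and this is not necessarily how the knitted code combines runs (merging BAMs before the pileup would give the same pooled result).

```python
from collections import defaultdict


def pooled_percent(records):
    """Pool counts per (chrom, pos, mod_code) across runs.

    records: iterable of (chrom, pos, mod_code, n_mod, n_valid) tuples,
    one per run and position. Returns {key: percent modified}.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [n_mod, n_valid]
    for chrom, pos, mod_code, n_mod, n_valid in records:
        totals[(chrom, pos, mod_code)][0] += n_mod
        totals[(chrom, pos, mod_code)][1] += n_valid
    # Skip positions with no valid coverage to avoid dividing by zero.
    return {key: 100.0 * m / v for key, (m, v) in totals.items() if v > 0}


# Hypothetical counts for one CpG position seen in two sequencing runs.
records = [
    ("contig_1", 1000, "m", 2, 4),    # run 1: 2 of 4 reads modified (50%)
    ("contig_1", 1000, "m", 9, 12),   # run 2: 9 of 12 reads modified (75%)
]

pooled = pooled_percent(records)
print(pooled[("contig_1", 1000, "m")])  # 11 of 16 reads -> 68.75
```

Pooling counts weights each run by its coverage, so the result (68.75%) differs from the plain average of the two per-run percentages (62.5%). That difference matters when one run contributes far fewer reads than the other.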