# Extract only the CDS sequence lines from the gff
grep -w 'CDS' ${transcriptome_gff} > ${transcriptome_gff_filtered}
head -n 5 ${transcriptome_gff_filtered}
In the process of running kallisto on the three E5 deep dive species, we realized that there’s no transcriptome FASTA available for P. evermanni. That means we need to generate one!
Related posts:
RNAseq abundance quantification (A.Pulchra, P.evermanni, P.meandrina)
P.evermanni RNAseq kallisto debugging
Testing bedtools: gff vs bed input file
Starter files
The closest file to a P. evermanni transcriptome that we have is a P.evermanni coding sequence (CDS) gff file, and a corresponding genome scaffolds FASTA.
head of gff:
Porites_evermani_scaffold_1 Gmove mRNA 3107 4488 543 - . ID=Peve_00000001;Name=Peve_00000001;start=0;stop=1;cds_size=543 Porites_evermani_scaffold_1 Gmove CDS 3107 3444 . - . Parent=Peve_00000001
Porites_evermani_scaffold_1 Gmove CDS 4284 4488 . - . Parent=Peve_00000001
A gff file lists information that can be used to identify and retrieve specific sequences. The first column lists the region or scaffold of a reference genome in which the sequence can be found; the third identifies which genomic feature (e.g., mRNA, CDS, UTR) the sequence is; and the fourth and fifth are genomic coordinates for the beginning and end of the sequence. Since we have both the coding sequences gff and a reference fasta, we should be able to extract FASTA sequences for all of the relevant sequences in the gff!
Extract CDS sequences
First, it looks like the gff has features labelled a little weirdly. Normally we’d want all of the mRNA sequences, but it looks like this file lists mRNA sequences that include introns. Instead, we want all of the CDS sequences associated with each mRNA (association is denoted in the final column with a parent mRNA ID, e.g. “Parent=Peve_00000001”). That means the first step is to extract all of the CDS sequences from our gff file.
Porites_evermani_scaffold_1 Gmove CDS 3107 3444 . - . Parent=Peve_00000001
Porites_evermani_scaffold_1 Gmove CDS 4284 4488 . - . Parent=Peve_00000001
Porites_evermani_scaffold_1 Gmove CDS 424479 425361 . - . Parent=Peve_00000002
Porites_evermani_scaffold_1 Gmove CDS 426181 426735 . - . Parent=Peve_00000002
Porites_evermani_scaffold_1 Gmove CDS 427013 427140 . - . Parent=Peve_00000002
Convert gff to bed
Then we convert this filtered CDS gff file to a bed file, which has slightly different formatting. We’re doing this because we’ll be using a tool called bedtools getfasta, which can take both gff and bed files, but which I trust more to appropriately parse bed files (though, see post on whether it can handle gff files correctly as well).
# Ensure bedops can find its dependencies when running
export PATH=/home/shared/bedops_linux_x86_64-v2.4.41/bin:$PATH
${bedops}/gff2bed \
\
--do-not-sort < ${transcriptome_gff_filtered} \
> ${transcriptome_bed}
head -n 3 ${transcriptome_gff_filtered}
echo ""
head -n 3 ${transcriptome_bed}
Porites_evermani_scaffold_1 Gmove CDS 3107 3444 . - . Parent=Peve_00000001
Porites_evermani_scaffold_1 Gmove CDS 4284 4488 . - . Parent=Peve_00000001
Porites_evermani_scaffold_1 Gmove CDS 424479 425361 . - . Parent=Peve_00000002
Porites_evermani_scaffold_1 3106 3444 . . - Gmove CDS . Parent=Peve_00000001
Porites_evermani_scaffold_1 4283 4488 . . - Gmove CDS . Parent=Peve_00000001
Porites_evermani_scaffold_1 424478 425361 . . - Gmove CDS . Parent=Peve_00000002
Generating a transcriptome FASTA
The basic approach to generating this transcriptome makes use of the fact that each CDS sequence in our bed file is assigned a “parent” mRNA, denoted by a parent Id in the last column (e.g., “Parent=Peve_00000001”). Each sequence with the same parent ID should be used to a) retrieve a FASTA for that sequence, and b) concatenate all of the FASTAs with the same parent ID into a single, labelled gene FASTA. The below code will iterate through our entire bed file and do this for each parent ID and its associated sequences.
# Load bash variables into memory
source .bashvars
# Navigate to correct directory and make output file
cd ${transcriptome_dir}
echo > ${transcriptome_fasta_name}
# Helper list for processing all parent IDs
processed_ids=()
######################################################
# Helper function to concatenate and format several bedtools output sequences
# into a single, appropriately named contig
concatenate_helper() {
local input_bedtools_fastas="$1"
local parent_ID="$2"
local reference_name=""
local positions=""
local concatenated_sequences=""
# Read the input line by line
while IFS= read -r line; do
# Check if the line starts with ">"
if [[ "$line" == ">"* ]]; then
# Extract reference name and position from the line
reference_position="${line:1}" # Remove ">"
reference_name=$(echo "$reference_position" | cut -d: -f1)
position=$(echo "$reference_position" | cut -d: -f2)
# Append position to the positions variable
positions+="$position,"
else
# Concatenate sequences
concatenated_sequences+="$line"
fi
done <<< "$input_bedtools_fastas"
# Remove trailing comma from positions
positions="${positions%,}"
# Output the reformatted result
echo ">$parent_ID $reference_name:$positions"
echo "$concatenated_sequences"
}
######################################################
# Process your input bed file
while IFS= read -r line; do
# pull the parent ID number for the current line of the bed
parentID=$(echo "$line" | grep -o 'Parent=Peve_[0-9]\+')
# Only continue if you haven't already processed the CDS sequences associated with this parent ID
if [[ ! " ${processed_ids[@]} " =~ " $parentID " ]]; then
# Add the current parentID to the processed list
processed_ids+=("$parentID")
# Create temporary files to store intermediate results
temp_CDS_bed_file=$(mktemp)
temp_bedtools_fasta_file=$(mktemp)
# Grab all of the CDS sequences with the same parent ID and write to temporary file
grep "$parentID" ${transcriptome_bed} > "$temp_CDS_bed_file"
# Use bedtools to extract corresponding FASTAs and write to temporary file
${programs_array[bedtools]} getfasta -fi ${genome_fasta} -bed "$temp_CDS_bed_file" -fo "$temp_bedtools_fasta_file"
# Use our helper function to concatenate and format all of these CDS fastas into a single contig
concatenated_fasta=$(concatenate_helper "$(cat "$temp_bedtools_fasta_file")" "$parentID")
# Add the concatenated CDS fasta to our output file on a new line
echo "$concatenated_fasta" >> ${transcriptome_fasta}
# Remove the temporary files
rm "$temp_CDS_bed_file" "$temp_bedtools_fasta_file"
fi
done < ${transcriptome_bed}
# The output file ends up having a blank first line before the data, so delete that unwanted empty first line
sed -i '1{/^$/d}' ${transcriptome_fasta}
Now we have a transcriptome FASTA the can be used for kallisto abundance quantification!