2 Sequence reads to SNP-level parent-specific read counts

2.1 Overview

This bash tutorial demonstrates best practices for processing whole genome sequence (WGS) and mRNA sequencing (mRNA-seq) reads for allele-specific transcriptomics analysis.

Sequencing data for this tutorial are described in Bresnahan et al., 2024, “Intragenomic conflict underlies extreme phenotypic plasticity in queen-worker caste determination in honey bees (Apis mellifera)”, bioRxiv. See the tutorial introduction for more details.

F1 WGS reads are trimmed with fastp to remove adapter sequences and filter low-quality reads, then aligned to the A. mellifera Amel_HAv3.1 reference genome assembly with BWA-MEM. Alignments are coordinate sorted and filtered using samtools, and duplicate alignments are removed with GATK MarkDuplicates. Variants against the A. mellifera reference genome assembly are called with freebayes, with the ploidy setting adjusted for diploid females (queens) and haploid males (drones). The variant calls are then filtered for quality and read depth with VCFtools, and all variants except homozygous SNPs are removed with BCFtools. The high-confidence homozygous SNPs are then integrated into the A. mellifera reference genome assembly with GATK FastaAlternateReferenceMaker, and intersected using bedtools to identify SNPs that are unique to each parent but shared between the crosses of a reciprocal cross pair. Finally, these parent-specific SNPs are intersected with the A. mellifera reference genome annotation with bedtools to identify positions within the longest transcript for each gene.

F2 mRNA-seq reads are trimmed with fastp to remove adapter sequences, filter low-quality reads, and generate quality control metrics, and then aligned to each respective parent genome using STAR. Alignments are then filtered using samtools. For each sample, read coverage is calculated at each F1 SNP using bedtools intersect in strand-aware mode.

2.2 Setup

2.2.1 Command-line tools

Mamba

Several command-line tools are used in this tutorial. These are easily installed with the package manager Mamba. Follow the Miniforge distribution documentation for installation, then create a new environment called “AST-bioinfo” for installing the necessary tools.

mamba create --name AST-bioinfo
sra-tools

The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for retrieving data from the INSDC Sequence Read Archives. (Documentation)

In this tutorial, sra-tools is used to retrieve Apis mellifera WGS & mRNA-seq reads from SRA BioProject accession PRJNA1106847.

mamba install -n AST-bioinfo bioconda::sra-tools
fastp

fastp is a tool designed to provide fast all-in-one preprocessing for FastQ files. (Documentation)

In this tutorial, fastp is used to trim any sequencing adapter contamination from the reads and assess quality control metrics.

mamba install -n AST-bioinfo bioconda::fastp
BWA

BWA is a software package for mapping DNA sequences against a large reference genome. (Documentation)

In this tutorial, the BWA-MEM algorithm, which is designed for Illumina sequence reads ranging from 70 bp to a few megabases and is generally faster and more accurate than the other BWA algorithms, is used to align the F1 (parental) WGS reads to the Apis mellifera reference genome.

mamba install -n AST-bioinfo bioconda::bwa
STAR

Spliced Transcripts Alignment to a Reference (STAR) is one of the best performing tools for alignment of mRNA-seq reads to a reference genome. (Documentation)

In this tutorial, STAR is used to align the F2 mRNA-seq reads to their respective F1 reference genomes.

mamba install -n AST-bioinfo bioconda::star
freebayes

freebayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment. (Documentation)

In this tutorial, freebayes is used to identify SNPs in the F1 WGS reads with reference to the Apis mellifera reference genome.

mamba install -n AST-bioinfo bioconda::freebayes
samtools

samtools is a suite of tools for manipulating next-generation sequencing data. (Documentation)

In this tutorial, samtools is used to process alignments produced by BWA-MEM & STAR and SNP calls produced by freebayes.

mamba install -n AST-bioinfo bioconda::samtools
BEDTools

The bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF. (Documentation)

In this tutorial, BEDTools is used to identify SNPs that differ between F1 parents and to annotate the Apis mellifera transcript annotations with the positions of these SNPs.

mamba install -n AST-bioinfo bioconda::bedtools
VCFtools

VCFtools is a program package designed for working with VCFs. (Documentation)

In this tutorial, VCFtools is used to filter SNPs called by freebayes.

mamba install -n AST-bioinfo bioconda::vcftools
GATK

The Genome Analysis Toolkit is a collection of command-line tools for analyzing high-throughput sequencing data with a primary focus on variant discovery. (Documentation)

In this tutorial, the MarkDuplicates tool is used to filter duplicate WGS reads, and the FastaAlternateReferenceMaker tool is used to integrate F1 SNPs into the Apis mellifera reference genome to generate parental reference genomes.

mamba install -n AST-bioinfo bioconda::gatk4
FastQC

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. (Documentation)

In this tutorial, FastQC is used for an initial quality control check of the sequencing reads prior to pre-processing steps.

mamba install -n AST-bioinfo bioconda::fastqc
MultiQC

MultiQC searches a given directory for analysis logs and compiles a HTML report. It’s a general use tool, perfect for summarising the output from numerous bioinformatics tools. (Documentation)

In this tutorial, MultiQC is used to aggregate reports from FastQC and fastp to assess quality control metrics of the sequencing reads before and after filtering.

mamba install -n AST-bioinfo bioconda::multiqc
Activate mamba environment

Upon opening a new terminal, you must activate your mamba environment to access these tools:

mamba activate AST-bioinfo

Project directory structure

It is important to create a cohesive and consistent directory structure for your bioinformatics projects. For this tutorial, create these directories with mkdir:

AST-tutorial
└───RAW
│   └───WGS
│   └───mRNA
└───TRIM
│   └───WGS
│   └───mRNA
└───ALIGN_GENOME
│   └───UNFILTERED
│   └───FILTERED
└───VARIANTS
│   └───UNFILTERED
│   └───FILTERED
└───ANALYSIS
│   └───INTERMEDIATE
│   └───RESULTS
└───ANALYSIS_SETS
└───PARENT_GENOMES
└───ALIGN_PARENT_GENOMES
│   └───UNFILTERED
│   └───FILTERED
└───COUNT
└───INDEX
└───SCRIPTS
└───REPORTS
│   └───WGS
│   └───mRNA
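
One way to create this tree in a single pass is with mkdir -p and bash brace expansion (a minimal sketch; adjust the top-level path if your project lives elsewhere):

# Create the full project tree; -p creates missing parent directories
mkdir -p AST-tutorial/{RAW,TRIM,REPORTS}/{WGS,mRNA} \
         AST-tutorial/{ALIGN_GENOME,VARIANTS,ALIGN_PARENT_GENOMES}/{UNFILTERED,FILTERED} \
         AST-tutorial/ANALYSIS/{INTERMEDIATE,RESULTS} \
         AST-tutorial/{ANALYSIS_SETS,PARENT_GENOMES,COUNT,INDEX,SCRIPTS}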

2.2.2 Amel_HAv3.1 reference genome assembly and annotation files

Retrieve the RefSeq assembly files from NCBI via ftp.

DIR_INDEX="AST-tutorial/INDEX"

wget -O ${DIR_INDEX}/Amel_HAv3.1.fasta.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.fna.gz

gunzip ${DIR_INDEX}/Amel_HAv3.1.fasta.gz

wget -O ${DIR_INDEX}/Amel_HAv3.1.gff.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz

gunzip ${DIR_INDEX}/Amel_HAv3.1.gff.gz

wget -O ${DIR_INDEX}/Amel_HAv3.1.gtf.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.gtf.gz

gunzip ${DIR_INDEX}/Amel_HAv3.1.gtf.gz
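
As an optional sanity check, confirm the downloads decompressed as expected, e.g., by counting the sequences in the assembly and the gene records in the annotation (exact counts are not asserted here):

# Number of sequences in the assembly (chromosomes + unplaced scaffolds)
grep -c ">" ${DIR_INDEX}/Amel_HAv3.1.fasta

# Number of gene records in the GFF annotation
awk '$3 == "gene"' ${DIR_INDEX}/Amel_HAv3.1.gff | wc -l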

2.2.3 A note on parallel processing in high performance computing clusters

The code chunks throughout the tutorial are written so as to run each FILE in a FILES array through serial loops. If you are running this tutorial in a high-performance computing (HPC) cluster with a batch submission system (e.g., SLURM on Penn State ROAR Collab), you can instead submit a “parent” job that submits individual “child” jobs for each file. For example:

Parent job
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0:10:00
#SBATCH --mem-per-cpu=1gb
#SBATCH --partition=open

SCRIPTS="path/to/SCRIPTS"
THREADS=4
DIR_OUT="path/to/OUTPUT_DIRECTORY"
DIR_RAW="path/to/RAW_DATA"

FILES=(FILE1 FILE2 etc...)

for FILE in "${FILES[@]}"
do
  sbatch ${SCRIPTS}/script_to_run.sh ${FILE} ${THREADS} ${DIR_OUT} ${DIR_RAW}
done
Child job
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=8:00:00
#SBATCH --mem-per-cpu=8gb
#SBATCH --partition=open 

# To source miniconda
source /storage/home/stb5321/work/miniconda3/etc/profile.d/conda.sh

# Assign arguments from sbatch command called in parent script to variables
FILE="$1"
THREADS="$2"
DIR_OUT="$3"
DIR_RAW="$4"

commandX -t ${THREADS} \
  ${DIR_RAW}/${FILE}_1.fastq ${DIR_RAW}/${FILE}_2.fastq \
  -o ${DIR_OUT}/${FILE}.out

Note that depending on how your HPC is set up, you may need to source your conda installation in each child script. For more information on the #SBATCH commands in the headers of these scripts, see the Slurm Workload Manager documentation. Specifically for PSU, see the ROAR Collab documentation.
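
Assuming standard SLURM commands are available, the parent script is itself submitted with sbatch, and the resulting child jobs can be monitored with squeue (parent_job.sh below is a placeholder name for the parent script shown above):

# Submit the parent job, which in turn submits one child job per file
sbatch ${SCRIPTS}/parent_job.sh

# Monitor your queued and running jobs
squeue -u ${USER}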

2.3 Generation of F1 genomes and transcriptomes

2.3.1 WGS read pre-processing

2.3.1.1 Retrieve WGS reads from SRA

SRA Cross Parent
SRR28865844 A Male
SRR28865843 A Female
SRR28865842 B Male
SRR28865841 B Female

Note: “Cross” here also refers to the lineage of the maternal parent. So, e.g., “Cross A” is synonymous with A-lineage mother x B-lineage father.

DIR_RAW="AST-tutorial/RAW/WGS"

FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)

for FILE in "${FILES[@]}"
do
  # Retrieve the .sra file
  prefetch -O ${DIR_RAW} ${FILE}
  # Extract .fastq from .sra
  fasterq-dump -O ${DIR_RAW} ${DIR_RAW}/${FILE}.sra
  # Delete the .sra file
  rm ${DIR_RAW}/${FILE}.sra
done
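
Optionally, verify that fasterq-dump produced paired FASTQ files of the expected size; a FASTQ record spans four lines, so dividing the line count by four gives the number of reads per mate file:

for FILE in "${FILES[@]}"
do
  echo "${FILE}: $(( $(wc -l < ${DIR_RAW}/${FILE}_1.fastq) / 4 )) read pairs"
done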

2.3.1.2 Assess raw quality control metrics

Here, FastQC reports for each library are generated. These can be viewed individually, or aggregated into a single report with MultiQC. This tutorial will demonstrate the latter approach.

DIR_RAW="AST-tutorial/RAW/WGS"
DIR_REPORTS="AST-tutorial/REPORTS/WGS"
THREADS="integer number of threads for parallel processing, usually 4 is fine"

FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)

for FILE in "${FILES[@]}"
do
  fastqc \
   -t ${THREADS} \
   ${DIR_RAW}/${FILE}_1.fastq ${DIR_RAW}/${FILE}_2.fastq \
   --outdir ${DIR_REPORTS}
done

2.3.1.3 Trim and filter reads

fastp will detect common sequencing adapters and trim them from the ends of sequencing reads, and will also filter out low-quality reads (by default, reads in which too many bases fall below a quality threshold and/or reads shorter than a minimum length). Here, run fastp with its default settings to clean up the reads.

DIR_RAW="AST-tutorial/RAW/WGS"
DIR_TRIM="AST-tutorial/TRIM/WGS"
DIR_REPORTS="AST-tutorial/REPORTS/WGS"
THREADS="integer number of threads for parallel processing, usually 8 is fine"

FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)

for FILE in "${FILES[@]}"
do
  fastp \
   -w ${THREADS} \
   -i ${DIR_RAW}/${FILE}_1.fastq -I ${DIR_RAW}/${FILE}_2.fastq \
   -o ${DIR_TRIM}/${FILE}_1.fastq -O ${DIR_TRIM}/${FILE}_2.fastq \
   -j ${DIR_REPORTS}/${FILE}_fastp.json
done
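
If you would rather make the filtering thresholds explicit than rely on fastp's defaults, they are exposed as flags; for example, -q sets the per-base quality a base must reach to count as "qualified" and -l sets the minimum read length. The values below are illustrative, not taken from the original study; the call would otherwise sit in the same loop as above:

fastp \
 -w ${THREADS} \
 -q 20 -l 50 \
 -i ${DIR_RAW}/${FILE}_1.fastq -I ${DIR_RAW}/${FILE}_2.fastq \
 -o ${DIR_TRIM}/${FILE}_1.fastq -O ${DIR_TRIM}/${FILE}_2.fastq \
 -j ${DIR_REPORTS}/${FILE}_fastp.json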

2.3.1.4 Aggregate QC reports with MultiQC

Quality control metrics of the WGS reads before and after trimming/filtering can be visualized in one html report. Read more about interpreting NGS QC metrics from these reports here from the Harvard Chan Bioinformatics Core. An example MultiQC report of WGS reads post-trimming can be found here.

DIR_REPORTS="AST-tutorial/REPORTS/WGS"
multiqc ${DIR_REPORTS}/. \
 -o ${DIR_REPORTS} \
 -n WGS.html

2.3.2 WGS read alignment and post-processing

2.3.2.1 Generate BWA alignment index for the Amel_HAv3.1 reference genome assembly

DIR_INDEX="AST-tutorial/INDEX"
bwa index ${DIR_INDEX}/Amel_HAv3.1.fasta

2.3.2.2 Align reads to the Amel_HAv3.1 reference genome assembly with BWA-MEM

DIR_TRIM="AST-tutorial/TRIM/WGS"
DIR_INDEX="AST-tutorial/INDEX"
DIR_ALIGN="AST-tutorial/ALIGN_GENOME/UNFILTERED"
THREADS="integer number of threads for parallel processing, usually 8 is fine"

FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)

for FILE in "${FILES[@]}"
do
  bwa mem \
   -t ${THREADS} \
   ${DIR_INDEX}/Amel_HAv3.1.fasta \
   ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \
   > ${DIR_ALIGN}/${FILE}.sam
done

2.3.2.3 Sort and filter alignments with samtools and filter duplicates with GATK

Here, alignments are coordinate sorted and filtered to remove unmapped reads, reads with unmapped mates, secondary, supplementary, and duplicate alignments (see here for information on SAM flags). These are important to remove for variant calling steps as they can contribute to both false positive and negative calls. For more information about these practices, see the GATK best practices workflow for variant discovery and Koboldt (2020), Genome Med, https://doi.org/10.1186/s13073-020-00791-w.
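
To see exactly which flag bits the value 2316 combines, samtools can decode it:

samtools flags 2316
# expected output (roughly): 0x90c  2316  UNMAP,MUNMAP,SECONDARY,SUPPLEMENTARY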

DIR_ALIGN="AST-tutorial/ALIGN_GENOME/UNFILTERED"
DIR_ALIGN_FILTERED="AST-tutorial/ALIGN_GENOME/FILTERED"
DIR_REPORTS="AST-tutorial/REPORTS/WGS"
THREADS="integer number of threads for parallel processing, usually 6 is fine"

FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)

for FILE in "${FILES[@]}"
do
  # Remove alignments with any flags in 2316
  # Convert SAM to BAM and sort using 4Gb memory per thread
  samtools view \
   -@ ${THREADS} -F 2316 -O bam ${DIR_ALIGN}/${FILE}.sam \
   | samtools sort -@ ${THREADS} -m 4g -O bam - \
   > ${DIR_ALIGN}/${FILE}.bam

  # Remove duplicates and sort by coordinate
  gatk MarkDuplicates \
   -I ${DIR_ALIGN}/${FILE}.bam \
   -O ${DIR_ALIGN_FILTERED}/${FILE}_rmvdup.bam \
   -M ${DIR_REPORTS}/${FILE}_rmvdup_metrics.txt \
   --ASSUME_SORT_ORDER coordinate \
   --REMOVE_DUPLICATES true \
   --ADD_PG_TAG_TO_READS false
done

2.3.3 Variant discovery and filtration

Here, variants (SNPs, MNPs, & indels) are identified using freebayes, with the ploidy setting adjusted for DIPLOID and HAPLOID samples. After variant discovery, the variant call files (VCFs) are subset to homozygous SNPs and filtered by quality (QUAL), minimum depth of coverage (MINDP), and maximum depth of coverage (MAXDP) using VCFtools and BCFtools, then compressed with bgzip and indexed with tabix. For more information about these practices, see this how-to guide on Filtering and handling VCFs for population genomics from Mark Ravinet & Joana Meier.

QUAL="an integer, 30 is ideal"
MINDP="an integer, usually 10 is fine"
MAXDP="an integer, usually 50 is fine"
DIR_ALIGN_FILTERED="AST-tutorial/ALIGN_GENOME/FILTERED"
DIR_VARIANTS="AST-tutorial/VARIANTS/UNFILTERED"
DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED"
DIR_INDEX="AST-tutorial/INDEX"

DIPLOID=(SRR28865843 SRR28865841)

for FILE in "${DIPLOID[@]}"
do
  freebayes \
   -p 2 -f ${DIR_INDEX}/Amel_HAv3.1.fasta \
   ${DIR_ALIGN_FILTERED}/${FILE}_rmvdup.bam > ${DIR_VARIANTS}/${FILE}.vcf
done

HAPLOID=(SRR28865844 SRR28865842)

for FILE in "${HAPLOID[@]}"
do
  freebayes \
   -p 1 -f ${DIR_INDEX}/Amel_HAv3.1.fasta \
   ${DIR_ALIGN_FILTERED}/${FILE}_rmvdup.bam > ${DIR_VARIANTS}/${FILE}.vcf
done

FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)

for FILE in "${FILES[@]}"
do
  # Remove variants that do not meet criteria
  # Retain only high-quality homozygous SNPs
  vcftools \
   --vcf ${DIR_VARIANTS}/${FILE}.vcf \
   --minQ ${QUAL} \
   --min-meanDP ${MINDP} --max-meanDP ${MAXDP} \
   --minDP ${MINDP} --maxDP ${MAXDP} \
   --recode --stdout \
   | bcftools filter -e 'GT="het"' - \
   | bcftools filter -i 'TYPE="snp"' - \
   > ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf
   
  # Compress
  bgzip -c ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf \
   > ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf.gz
  
  # Index the compressed file
  tabix -p vcf ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf.gz
done
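
A quick, optional way to gauge how stringent the filtering was is to count the records remaining in each filtered VCF (lines not beginning with "#"):

for FILE in "${FILES[@]}"
do
  echo "${FILE}: $(grep -vc '^#' ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf) homozygous SNPs retained"
done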

2.3.4 Genome and transcriptome construction

2.3.4.1 Construct F1 genomes with GATK FastaAlternateReferenceMaker

First, create a “sequence dictionary” from the Amel_HAv3.1 reference genome for use with GATK. Then, integrate the high-quality homozygous SNPs into the Amel_HAv3.1 reference genome for each parent to generate individual genomes. Finally, reformat the numeric headers of the F1 genome .fasta files produced by GATK to match the nucleotide accessions in the headers of the Amel_HAv3.1 reference genome fasta file.

DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED"
DIR_INDEX="AST-tutorial/INDEX"
DIR_PARENT_GENOMES="AST-tutorial/PARENT_GENOMES"

gatk CreateSequenceDictionary \
 -R ${DIR_INDEX}/Amel_HAv3.1.fasta \
 -O ${DIR_INDEX}/Amel_HAv3.1.dict

FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)

for FILE in "${FILES[@]}"
do
  gatk FastaAlternateReferenceMaker \
   -R ${DIR_INDEX}/Amel_HAv3.1.fasta \
   -O ${DIR_PARENT_GENOMES}/${FILE}.fasta \
   -V ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf.gz
done

# Get all sequence headers from one of the F1 genomes (= "bad")
grep ">" ${DIR_PARENT_GENOMES}/SRR28865844.fasta \
 | sed 's/>//g' > ${DIR_INDEX}/bad_headers.txt

# Get all sequence headers from the reference genome (= "good")
grep ">" ${DIR_INDEX}/Amel_HAv3.1.fasta \
 | sed 's/\s.*$//' | sed 's/>//g' > ${DIR_INDEX}/good_headers.txt

# Combine the "bad" and "good" headers into one .tsv
paste -d"\t" ${DIR_INDEX}/bad_headers.txt ${DIR_INDEX}/good_headers.txt \
 > ${DIR_INDEX}/replace_headers.tsv

for FILE in "${FILES[@]}"
do
  # Use replace_headers.tsv as a lookup table for replacement
  awk 'FNR==NR{a[">"$1]=$2;next}$1 in a{sub(/>/,">"a[$1]"|",$1)}1' \
   ${DIR_INDEX}/replace_headers.tsv ${DIR_PARENT_GENOMES}/${FILE}.fasta \
   | sed 's/:.*//' > ${DIR_PARENT_GENOMES}/${FILE}_fixed.fasta
done

This will replace the sequential numeric sequence headers in the resultant F1 genome .fasta files produced by GATK, e.g.,

>1
ATCCTCCACCT....
>2 
GGGAATTGCCA....

With the nucleotide accessions in the sequence headers of the Amel_HAv3.1 reference genome fasta file, e.g.,

>NC_037638.1
ATCCTCCACCT....
>NC_037639.1
GGGAATTGCCA....

This step is critical for the final mRNA-seq allele-specific read counting step as the chromosome field of the annotation file used to assign counts to transcripts must match the chromosome field of the alignment files used for counting.
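
Because a mismatch here would only surface at the final counting step, it is worth spot-checking that the fixed headers now carry the reference nucleotide accessions (SRR28865844 is used as an example):

# The first few headers should show reference accessions, e.g. >NC_037638.1
grep ">" ${DIR_PARENT_GENOMES}/SRR28865844_fixed.fasta | head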

2.3.4.2 Construct F1 transcriptomes for STAR

THREADS="integer number of threads for parallel processing, usually 8 is fine"
DIR_INDEX="AST-tutorial/INDEX"
DIR_PARENT_GENOMES="AST-tutorial/PARENT_GENOMES"

FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)

for FILE in "${FILES[@]}"
do
  STAR \
   --runThreadN ${THREADS} \
   --runMode genomeGenerate \
   --genomeDir ${DIR_PARENT_GENOMES}/${FILE} \
   --genomeFastaFiles ${DIR_PARENT_GENOMES}/${FILE}_fixed.fasta \
   --sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf
done

2.3.4.3 Annotate F1 SNPs within genes

Here, the high-quality homozygous SNPs for each reciprocal cross are intersected to find those that are unique to each parent but shared between the crosses. For example, consider these hypothetical SNPs reported by a variant calling tool:

Parent Cross Chr Pos Ref Alt
Male A 1 100 A T
Female B 1 100 A T

In this example, in Cross A, the “A>T” SNP is present in the male parent but absent in the female parent (i.e., the “Alt” allele for the female parent is the same as the “Ref” allele and thus was not reported as a SNP). In Cross B, this SNP is present in the female parent but absent in the male parent. Thus, this SNP is unique to the parents of each cross, but shared between the crosses of the reciprocal cross. Therefore, this SNP can be used to differentiate reads originating from each parent in each cross, allowing for assessment of genotype-specific or parent-specific bias within the reciprocal cross design.

Get gene annotations

First, subset the Amel_HAv3.1 reference genome feature file (gff) to gene features, and modify column 9 (the info column) to keep only the RefSeq GeneID (these will be used in all downstream analyses).

DIR_INDEX="AST-tutorial/INDEX"
DIR_VARIANTS="AST-tutorial/VARIANTS/UNFILTERED"
DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED"

# Subset Amel_HAv3.1.gff to rows with "gene" in column 3
# Print columns 1-8
awk '$3 == "gene" { print $0 }' ${DIR_INDEX}/Amel_HAv3.1.gff \
 | awk -v OFS="\t" '{print $1, $2, $3, $4, $5, $6, $7, $8}' \
 > ${DIR_INDEX}/Amel_HAv3.1_genes.txt

# Extract the "GeneID" from column 9
awk '$3 == "gene" { print $0 }' ${DIR_INDEX}/Amel_HAv3.1.gff \
 | awk '{print $9}' \
 | grep -o 'GeneID[^\s]*' \
 | cut -d':' -f2 \
 | cut -d';' -f1 \
 | cut -d',' -f1 \
 | sed -e 's/^/LOC/' \
 > ${DIR_INDEX}/Amel_HAv3.1_geneIDs.txt

# Combine the two files to create a new gene-only .gff 
paste -d'\t' ${DIR_INDEX}/Amel_HAv3.1_genes.txt \
${DIR_INDEX}/Amel_HAv3.1_geneIDs.txt \
> ${DIR_INDEX}/Amel_HAv3.1_genes.gff3
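
Because paste simply joins the two intermediate files row by row, it is worth confirming (optionally) that they contain the same number of lines before relying on the combined .gff3:

# Both files should report identical line counts
wc -l ${DIR_INDEX}/Amel_HAv3.1_genes.txt ${DIR_INDEX}/Amel_HAv3.1_geneIDs.txt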

The resultant Amel_HAv3.1_genes.gff3 file should look like this:

NC_037638.1 Gnomon gene 9273 12174 . - . LOC551580
NC_037638.1 Gnomon gene 10792 17180 . + . LOC551555
NC_037638.1 Gnomon gene 17090 23457 . - . LOC726347
Identify informative F1 SNPs

Next, identify the informative SNPs by taking the difference of each respective parental SNP call set (i.e., SNPs unique to each parent) and finding the intersection between crosses.

DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED"
DIR_ANALYSIS="AST-tutorial/ANALYSIS_SETS"
CrossApat="SRR28865844"
CrossAmat="SRR28865843"
CrossBpat="SRR28865842"
CrossBmat="SRR28865841"

# Get SNPs in -a that are not in -b
bedtools intersect -header -v \
 -a ${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps.vcf \
 -b ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps.vcf \
 > ${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps_outer.vcf

bedtools intersect -header -v \
 -a ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps.vcf \
 -b ${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps.vcf \
 > ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps_outer.vcf

bedtools intersect -header -v \
 -a ${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps.vcf \
 -b ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps.vcf \
 > ${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps_outer.vcf

bedtools intersect -header -v \
 -a ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps.vcf \
 -b ${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps.vcf \
 > ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps_outer.vcf

# Combine Cross A SNPs from above to single .vcf
grep -v '^#' \
 ${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps_outer.vcf \
 | cat ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps_outer.vcf - \
 > ${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf

# Compress the combined Cross A .vcf
bgzip -c \
 ${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf \
 > ${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf.gz

# Combine Cross B SNPs from above to single .vcf
grep -v '^#' \
 ${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps_outer.vcf \
 | cat ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps_outer.vcf - \
 > ${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf

# Compress the combined Cross B .vcf
bgzip -c \
 ${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf \
 > ${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf.gz

# Intersect the combined Cross A & Cross B SNPs
bedtools intersect -header -u \
 -a ${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf.gz \
 -b ${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf.gz \
 > ${DIR_ANALYSIS}/Analysis_SNP_Set.vcf

# Create a basic .bed file from the intersected SNPs
# Use the genomic position as the start and end coordinates
# Label each variant sequentially (i.e., SNP_1, SNP_2, etc.)
grep -v '^#' ${DIR_ANALYSIS}/Analysis_SNP_Set.vcf \
 | awk -v OFS="\t" '{print $1, $2, $2}' \
 | awk -v OFS="\t" '$4=(FNR FS $4)' \
 | awk -v OFS="\t" '{print $1, $2, $3, "snp_"$4}' \
 > ${DIR_ANALYSIS}/Analysis_SNP_Set.bed

The resultant Analysis_SNP_Set.bed file should look like this:

NC_037638.1 155356 155356 snp_1
NC_037638.1 174473 174473 snp_2
NC_037638.1 183350 183350 snp_3
Intersect informative F1 SNPs with genes

Finally, annotate the genes with the positions of informative F1 SNPs. In Section 3, these SNPs will be subset to those within the longest transcript for each gene. Mitochondrial genes (on the chromosome with nucleotide accession “NC_001566.1”) are filtered out, as these are inappropriate for assessing parent-of-origin effects (all mitochondrial genes will be maternally expressed).

DIR_ANALYSIS="AST-tutorial/ANALYSIS_SETS"
DIR_INDEX="AST-tutorial/INDEX"

# Intersect the gene annotations and SNP set
# Combine the SNP_ID and GeneIDs in column 4 (name) as SNP_ID:GeneID
bedtools intersect -wb \
 -a ${DIR_INDEX}/Amel_HAv3.1_genes.gff3 \
 -b ${DIR_ANALYSIS}/Analysis_SNP_Set.bed \
 | awk -v OFS="\t" '{print $10, $11, $12, $13 ":" $9, $6, $7}' \
 | grep -v '^NC_001566.1' | sort -k1,1 -k2,2n \
 > ${DIR_ANALYSIS}/SNPs_for_Analysis.bed

sort --parallel=8 -k1,1 -k2,2n \
 ${DIR_ANALYSIS}/SNPs_for_Analysis.bed \
 > ${DIR_ANALYSIS}/SNPs_for_Analysis_Sorted.bed

The resultant SNPs_for_Analysis_Sorted.bed file should look like this:

NC_037638.1 54751 54751 snp_99616:LOC107964061 . +
NC_037638.1 91396 91396 snp_99617:LOC113219112 . +
NC_037638.1 147182 147182 snp_99618:LOC726544 . -
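
Since column 4 encodes the SNP ID and GeneID as SNP_ID:GeneID, an optional tally of informative SNPs per gene can be produced with standard text tools:

# Count informative SNPs per gene, most SNP-dense genes first
cut -f4 ${DIR_ANALYSIS}/SNPs_for_Analysis_Sorted.bed \
 | cut -d':' -f2 \
 | sort | uniq -c | sort -rn | head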

2.4 Quantification of F2 allele-specific transcription

2.4.1 mRNA-seq read pre-processing

2.4.1.1 Retrieve mRNA-seq reads from SRA

SRA Cross Phenotype
SRR28865816 A Worker
SRR28865815 A Worker
SRR28865811 A Queen
SRR28865812 A Queen
SRR28865777 A Queen
SRR28865776 A Worker
SRR28865775 A Queen
SRR28865774 A Worker
SRR28865773 A Queen
SRR28865772 B Queen
SRR28865770 B Queen
SRR28865769 B Worker
SRR28865768 B Queen
SRR28865771 B Queen
SRR28865767 B Worker
SRR28865808 B Worker
SRR28865806 B Worker
SRR28865804 A Worker
SRR28865803 B Worker
SRR28865802 B Queen
DIR_RAW="AST-tutorial/RAW/mRNA"

FILES=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \
       SRR28865777 SRR28865776 SRR28865775 SRR28865774 \
       SRR28865773 SRR28865772 SRR28865770 SRR28865769 \
       SRR28865768 SRR28865771 SRR28865767 SRR28865808 \
       SRR28865806 SRR28865804 SRR28865803 SRR28865802)

for FILE in "${FILES[@]}"
do
  prefetch -O ${DIR_RAW} ${FILE}
  fasterq-dump -O ${DIR_RAW} ${DIR_RAW}/${FILE}.sra
  rm ${DIR_RAW}/${FILE}.sra
done

2.4.1.2 Assess raw quality control metrics

DIR_RAW="AST-tutorial/RAW/mRNA"
DIR_REPORTS="AST-tutorial/REPORTS/mRNA"
THREADS="integer number of threads for parallel processing, usually 4 is fine"

FILES=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \
       SRR28865777 SRR28865776 SRR28865775 SRR28865774 \
       SRR28865773 SRR28865772 SRR28865770 SRR28865769 \
       SRR28865768 SRR28865771 SRR28865767 SRR28865808 \
       SRR28865806 SRR28865804 SRR28865803 SRR28865802)

for FILE in "${FILES[@]}"
do
  fastqc \
  -t ${THREADS} \
  ${DIR_RAW}/${FILE}_1.fastq ${DIR_RAW}/${FILE}_2.fastq \
  --outdir ${DIR_REPORTS}
done

2.4.1.3 Trim and filter reads

DIR_RAW="AST-tutorial/RAW/mRNA"
DIR_TRIM="AST-tutorial/TRIM/mRNA"
DIR_REPORTS="AST-tutorial/REPORTS/mRNA"
THREADS="integer number of threads for parallel processing, usually 8 is fine"

FILES=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \
       SRR28865777 SRR28865776 SRR28865775 SRR28865774 \
       SRR28865773 SRR28865772 SRR28865770 SRR28865769 \
       SRR28865768 SRR28865771 SRR28865767 SRR28865808 \
       SRR28865806 SRR28865804 SRR28865803 SRR28865802)

for FILE in "${FILES[@]}"
do
  fastp \
  -w ${THREADS} \
  -i ${DIR_RAW}/${FILE}_1.fastq -I ${DIR_RAW}/${FILE}_2.fastq \
  -o ${DIR_TRIM}/${FILE}_1.fastq -O ${DIR_TRIM}/${FILE}_2.fastq \
  -j ${DIR_REPORTS}/${FILE}_fastp.json
done

2.4.1.4 Aggregate QC reports with MultiQC

DIR_REPORTS="AST-tutorial/REPORTS/mRNA"
multiqc ${DIR_REPORTS}/. \
  -o ${DIR_REPORTS} \
  -n mRNA.html

2.4.2 mRNA-seq read alignment to F1 transcriptomes

Here, the mRNA-seq reads for each F2 sample are aligned to their respective parental (F1) transcriptomes.

DIR_TRIM="AST-tutorial/TRIM/mRNA"
DIR_PARENT_GENOMES="AST-tutorial/PARENT_GENOMES"
DIR_ALIGN="AST-tutorial/ALIGN_PARENT_GENOMES/UNFILTERED"
DIR_INDEX="AST-tutorial/INDEX"
THREADS="integer number of threads for parallel processing, usually 8 is fine"

CrossApat="SRR28865844"
CrossAmat="SRR28865843"
CrossA=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \
        SRR28865777 SRR28865776 SRR28865775 SRR28865774 \
        SRR28865773 SRR28865804)

for FILE in "${CrossA[@]}"
do
  STAR \
  --outSAMtype BAM SortedByCoordinate \
  --runThreadN ${THREADS} \
  --sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \
  --genomeDir ${DIR_PARENT_GENOMES}/${CrossApat} \
  --readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \
  --outFileNamePrefix ${DIR_ALIGN}/${CrossApat}_${FILE} \
  --outSAMattributes NH HI AS nM NM
  
  STAR \
  --outSAMtype BAM SortedByCoordinate \
  --runThreadN ${THREADS} \
  --sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \
  --genomeDir ${DIR_PARENT_GENOMES}/${CrossAmat} \
  --readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \
  --outFileNamePrefix ${DIR_ALIGN}/${CrossAmat}_${FILE} \
  --outSAMattributes NH HI AS nM NM
done

CrossBpat="SRR28865842"
CrossBmat="SRR28865841"
CrossB=(SRR28865772 SRR28865770 SRR28865769 SRR28865768 \
        SRR28865771 SRR28865767 SRR28865808 SRR28865806 \
        SRR28865803 SRR28865802)

for FILE in "${CrossB[@]}"
do
  STAR \
  --outSAMtype BAM SortedByCoordinate \
  --runThreadN ${THREADS} \
  --sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \
  --genomeDir ${DIR_PARENT_GENOMES}/${CrossBpat} \
  --readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \
  --outFileNamePrefix ${DIR_ALIGN}/${CrossBpat}_${FILE} \
  --outSAMattributes NH HI AS nM NM
  
  STAR \
  --outSAMtype BAM SortedByCoordinate \
  --runThreadN ${THREADS} \
  --sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \
  --genomeDir ${DIR_PARENT_GENOMES}/${CrossBmat} \
  --readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \
  --outFileNamePrefix ${DIR_ALIGN}/${CrossBmat}_${FILE} \
  --outSAMattributes NH HI AS nM NM
done
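
Each STAR run writes a Log.final.out file alongside its BAM (named using the --outFileNamePrefix given above); assuming STAR's standard log format, the unique mapping rates for all samples can be scanned in one pass:

# Print the uniquely-mapped-read percentage for every alignment run
grep -H "Uniquely mapped reads %" ${DIR_ALIGN}/*Log.final.out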

2.4.3 mRNA-seq alignment post-processing and allele-specific read counting

This final step may seem a bit convoluted at first, but after testing many other read-counting tools, I found this to be the only approach that allows counting strand-specific reads at the SNP level without making any additional assumptions.

DIR_ANALYSIS="AST-tutorial/ANALYSIS_SETS"
DIR_INDEX="AST-tutorial/INDEX"
DIR_ALIGN="AST-tutorial/ALIGN_PARENT_GENOMES/UNFILTERED"
DIR_ALIGN_FILTERED="AST-tutorial/ALIGN_PARENT_GENOMES/FILTERED"
DIR_COUNT="AST-tutorial/COUNT"
THREADS="integer number of threads for parallel processing, usually 8 is fine"
ASET="${DIR_ANALYSIS}/SNPs_for_Analysis_Sorted.bed"

Parents=("SRR28865844" "SRR28865843")
CrossA=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \
        SRR28865777 SRR28865776 SRR28865775 SRR28865774 \
        SRR28865773 SRR28865804)

for FILE in "${CrossA[@]}"
do
  for PARENT in "${Parents[@]}"
  do
    # Keep only mapped, primary alignments with 0 mismatches
    samtools view -@ ${THREADS} -e '[NM]==0' -F 260 -O BAM \
    ${DIR_ALIGN}/${PARENT}_${FILE}Aligned.sortedByCoord.out.bam \
    > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam
    
    # Index the alignments
    samtools index ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \
    ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam.bai
    
    # Convert BAM to BED
    bedtools bamtobed -i ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \
    > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed
    
    # Sort BED by coordinate
    sort --parallel=${THREADS} \
    -k1,1 -k2,2n ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed \
    > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed
    
    # Count reads by SNP accounting for strand of alignment
    bedtools intersect -S -sorted -c -a ${ASET} \
    -b ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed \
    > ${DIR_COUNT}/${PARENT}_${FILE}.txt
  done
done


Parents=("SRR28865842" "SRR28865841")
CrossB=(SRR28865772 SRR28865770 SRR28865769 SRR28865768 \
        SRR28865771 SRR28865767 SRR28865808 SRR28865806 \
        SRR28865803 SRR28865802)

for FILE in "${CrossB[@]}"
do
  for PARENT in "${Parents[@]}"
  do
    # Keep only mapped, primary alignments with 0 mismatches
    samtools view -@ ${THREADS} -e '[NM]==0' -F 260 -O BAM \
    ${DIR_ALIGN}/${PARENT}_${FILE}Aligned.sortedByCoord.out.bam \
    > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam
    
    # Index the alignments
    samtools index ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \
    ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam.bai
    
    # Convert BAM to BED
    bedtools bamtobed -i ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \
    > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed
    
    # Sort BED by coordinate
    sort --parallel=${THREADS} \
    -k1,1 -k2,2n ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed \
    > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed
    
    # Count reads by SNP accounting for strand of alignment
    bedtools intersect -S -sorted -c -a ${ASET} \
    -b ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed \
    > ${DIR_COUNT}/${PARENT}_${FILE}.txt
  done
done

The resultant count files should look identical to SNPs_for_Analysis_Sorted.bed, with an additional column containing the read counts:

NC_037638.1 54751 54751 snp_99616:LOC107964061 . + 8
NC_037638.1 91396 91396 snp_99617:LOC113219112 . + 7
NC_037638.1 147182 147182 snp_99618:LOC726544 . - 9
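
Because bedtools intersect -c reports every interval in -a exactly once, each count file should contain the same number of rows as SNPs_for_Analysis_Sorted.bed; a quick consistency check:

# All files listed should report the same line count
wc -l ${ASET} ${DIR_COUNT}/*.txt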

In Section 3: SNP-level parent-specific read counts to allele-specific transcription, the resultant count files will be combined into a single matrix, and various statistical methods will be used to assess and compare allele-specific transcription between phenotypes.