2 Sequence reads to SNP-level parent-specific read counts
2.1 Overview
This bash tutorial demonstrates best practices for processing whole genome sequence (WGS) and mRNA sequencing (mRNA-seq) reads for allele-specific transcriptomics analysis.
Sequencing data for this tutorial are described in Bresnahan et al., 2024, “Intragenomic conflict underlies extreme phenotypic plasticity in queen-worker caste determination in honey bees (Apis mellifera)”, bioRxiv. See the tutorial introduction for more details.
F1 WGS reads are trimmed with fastp to remove adapter sequences and filter low-quality reads, then aligned to the A. mellifera Amel_HAv3.1 reference genome assembly with BWA-MEM. Alignments are coordinate sorted and filtered using samtools, and duplicate alignments are removed with GATK MarkDuplicates. Variants against the A. mellifera reference genome assembly are detected using freebayes, accounting for differences in ploidy between diploid females (queens) and haploid males (drones). The variant calls are then filtered for quality and read depth with VCFtools, and all variants other than homozygous SNPs are removed with BCFtools. The high-confidence homozygous SNPs are then integrated into the A. mellifera reference genome assembly with GATK FastaAlternateReferenceMaker, and intersected using bedtools to identify SNPs that are unique to each parent but shared between the crosses of a reciprocal cross pair. Parent-specific SNPs are then intersected with the A. mellifera reference genome annotation using bedtools to identify positions within the longest transcript for each gene.
F2 mRNA-seq reads are trimmed with fastp to remove adapter sequences, filter low-quality reads, and generate quality control metrics, and then aligned to each respective parent genome using STAR. Alignments are then filtered using samtools. For each sample, read coverage is calculated at each F1 SNP using bedtools intersect in strand-aware mode.
2.2 Setup
2.2.1 Command-line tools
Mamba
Several command-line tools are used in this tutorial. These are easily installed with the package manager Mamba. Follow the Miniforge distribution documentation for installation, then create a new environment called “AST-bioinfo” for installing the necessary tools.
mamba create --name AST-bioinfo
sra-tools
The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for retrieving data in the INSDC Sequence Read Archives. (Documentation)
In this tutorial, sra-tools is used to retrieve Apis mellifera WGS & mRNA-seq reads from SRA BioProject accession PRJNA1106847.
mamba install -n AST-bioinfo bioconda::sra-tools
fastp
fastp is a tool designed to provide fast all-in-one preprocessing for FastQ files. (Documentation)
In this tutorial, fastp is used to trim any sequencing adapter contamination from the reads and assess quality control metrics.
mamba install -n AST-bioinfo bioconda::fastp
BWA
BWA is a software package for mapping DNA sequences against a large reference genome. (Documentation)
In this tutorial, the BWA-MEM algorithm, which is designed for Illumina sequence reads ranging from 70 bp to a few megabases and is generally faster and more accurate than the other BWA algorithms, is used to align the F1 (parental) WGS reads to the Apis mellifera reference genome.
mamba install -n AST-bioinfo bioconda::bwa
STAR
Spliced Transcripts Alignment to a Reference (STAR) is one of the best performing tools for alignment of mRNA-seq reads to a reference genome. (Documentation)
In this tutorial, STAR is used to align the F2 mRNA-seq reads to their respective F1 reference genomes.
mamba install -n AST-bioinfo bioconda::star
freebayes
freebayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment. (Documentation)
In this tutorial, freebayes is used to identify SNPs in the F1 WGS reads with reference to the Apis mellifera reference genome.
mamba install -n AST-bioinfo bioconda::freebayes
samtools
samtools is a suite of tools for manipulating next-generation sequencing data. (Documentation)
In this tutorial, samtools is used to process alignments produced by BWA-MEM & STAR, and its companion package BCFtools is used to filter the SNP calls produced by freebayes.
mamba install -n AST-bioinfo bioconda::samtools bioconda::bcftools
BEDTools
The bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF. (Documentation)
In this tutorial, BEDTools is used to identify SNPs that differ between F1 parents and to annotate the Apis mellifera transcript annotations with the positions of these SNPs.
mamba install -n AST-bioinfo bioconda::bedtools
VCFtools
VCFtools is a program package designed for working with VCFs. (Documentation)
In this tutorial, VCFtools is used to filter SNPs called by freebayes.
mamba install -n AST-bioinfo bioconda::vcftools
GATK
The Genome Analysis Toolkit is a collection of command-line tools for analyzing high-throughput sequencing data with a primary focus on variant discovery. (Documentation)
In this tutorial, the MarkDuplicates tool is used to filter duplicate WGS reads, and the FastaAlternateReferenceMaker tool is used to integrate F1 SNPs into the Apis mellifera reference genome to generate parental reference genomes.
mamba install -n AST-bioinfo bioconda::gatk4
FastQC
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. (Documentation)
In this tutorial, FastQC is used for an initial quality control check of the sequencing reads prior to pre-processing steps.
mamba install -n AST-bioinfo bioconda::fastqc
MultiQC
MultiQC searches a given directory for analysis logs and compiles an HTML report. It’s a general use tool, perfect for summarising the output from numerous bioinformatics tools. (Documentation)
In this tutorial, MultiQC is used to aggregate reports from FastQC and fastp to assess quality control metrics of the sequencing reads before and after filtering.
mamba install -n AST-bioinfo bioconda::multiqc
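With the installations finished, activate the environment so the tools are on your PATH, and optionally confirm that everything resolved correctly by printing version numbers (exact versions will vary; if mamba activate is not initialized in your shell, conda activate works the same way):
mamba activate AST-bioinfo
# Quick sanity check: each command should print a version string
samtools --version | head -n 1
bcftools --version | head -n 1
bedtools --version
vcftools --version
fastp --version
fastqc --version
STAR --version
freebayes --version
gatk --version
multiqc --version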
Project directory structure
It is important to create a cohesive and consistent directory structure for your bioinformatics projects. For this tutorial, create these directories with mkdir (a one-line mkdir -p command that builds the entire tree is shown after the diagram):
AST-tutorial
└───RAW
│ └───WGS
│ └───mRNA
└───TRIM
│ └───WGS
│ └───mRNA
└───ALIGN_GENOME
│ └───UNFILTERED
│ └───FILTERED
└───VARIANTS
│ └───UNFILTERED
│ └───FILTERED
└───ANALYSIS
│ └───INTERMEDIATE
│ └───RESULTS
└───ANALYSIS_SETS
└───PARENT_GENOMES
└───ALIGN_PARENT_GENOMES
│ └───UNFILTERED
│ └───FILTERED
└───COUNT
└───INDEX
└───SCRIPTS
└───REPORTS
│ └───WGS
│ └───mRNA
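If your shell supports brace expansion (bash and zsh do), the whole tree above can be created with a single mkdir -p call run from the directory that will contain AST-tutorial:
# Create every directory in the project tree in one command
mkdir -p AST-tutorial/{RAW,TRIM,REPORTS}/{WGS,mRNA} \
AST-tutorial/{ALIGN_GENOME,VARIANTS,ALIGN_PARENT_GENOMES}/{UNFILTERED,FILTERED} \
AST-tutorial/ANALYSIS/{INTERMEDIATE,RESULTS} \
AST-tutorial/{ANALYSIS_SETS,PARENT_GENOMES,COUNT,INDEX,SCRIPTS}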
2.2.2 Amel_HAv3.1 reference genome assembly and annotation files
Retrieve the RefSeq assembly files from NCBI via ftp.
DIR_INDEX="AST-tutorial/INDEX"
wget -O ${DIR_INDEX}/Amel_HAv3.1.fasta.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.fna.gz
gunzip ${DIR_INDEX}/Amel_HAv3.1.fasta.gz
wget -O ${DIR_INDEX}/Amel_HAv3.1.gff.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz
gunzip ${DIR_INDEX}/Amel_HAv3.1.gff.gz
wget -O ${DIR_INDEX}/Amel_HAv3.1.gtf.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.gtf.gz
gunzip ${DIR_INDEX}/Amel_HAv3.1.gtf.gz
2.2.3 A note on parallel processing in high performance computing clusters
The code chunks throughout the tutorial are written so as to run each FILE in a FILES array through serial loops. If you are running this tutorial in a high-performance computing (HPC) cluster with a batch submission system (e.g., SLURM on Penn State ROAR Collab), you can instead submit a “parent” job that submits individual “child” jobs for each file. For example:
Parent job
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0:10:00
#SBATCH --mem-per-cpu=1gb
#SBATCH --partition=open
SCRIPTS="path/to/SCRIPTS"
THREADS=4
DIR_OUT="path/to/OUTPUT_DIRECTORY"
DIR_RAW="path/to/RAW_DATA"
FILES=(FILE1 FILE2 etc...)
for FILE in "${FILES[@]}"
do
sbatch ${SCRIPTS}/script_to_run.sh ${FILE} ${THREADS} ${DIR_OUT} ${DIR_RAW}
done
Child job
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=8:00:00
#SBATCH --mem-per-cpu=8gb
#SBATCH --partition=open
# To source miniconda
source /storage/home/stb5321/work/miniconda3/etc/profile.d/conda.sh
# Assign arguments from sbatch command called in parent script to variables
FILE="$1"
THREADS="$2"
DIR_OUT="$3"
DIR_RAW="$4"
commandX -t ${THREADS} \
${DIR_RAW}/${FILE}_1.fastq ${DIR_RAW}/${FILE}_2.fastq \
-o ${DIR_OUT}/${FILE}.out
Note that depending on how your HPC is set up, you may need to source your conda installation in each child script. For more information on the #SBATCH directives in the headers of these scripts, see the Slurm Workload Manager documentation. Specifically for PSU, see the ROAR Collab documentation.
2.3 Generation of F1 genomes and transcriptomes
2.3.1 WGS read pre-processing
2.3.1.1 Retrieve WGS reads from SRA
SRA | Cross | Parent |
---|---|---|
SRR28865844 | A | Male |
SRR28865843 | A | Female |
SRR28865842 | B | Male |
SRR28865841 | B | Female |
Note: “Cross” here also refers to the lineage of the maternal parent. So, e.g., “Cross A” is synonymous with A-lineage mother x B-lineage father.
DIR_RAW="AST-tutorial/RAW/WGS"
FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)
for FILE in "${FILES[@]}"
do
# Retrieve the .sra file
prefetch -O ${DIR_RAW} ${FILE}
# Extract .fastq from .sra
fasterq-dump -O ${DIR_RAW} ${DIR_RAW}/${FILE}.sra
# Delete the .sra file
rm ${DIR_RAW}/${FILE}.sra
done
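Optionally, confirm that each library was extracted as a proper read pair by checking that the forward and reverse FASTQ files contain the same number of reads (each read occupies four lines):
for FILE in "${FILES[@]}"
do
# Divide the line count by 4 to get the number of reads in each mate file
R1=$(( $(wc -l < ${DIR_RAW}/${FILE}_1.fastq) / 4 ))
R2=$(( $(wc -l < ${DIR_RAW}/${FILE}_2.fastq) / 4 ))
echo "${FILE}: ${R1} forward reads, ${R2} reverse reads"
done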
2.3.1.2 Assess raw quality control metrics
Here, FastQC reports for each library are generated. These can be viewed individually, or aggregated into a single report with MultiQC. This tutorial will demonstrate the latter approach.
DIR_RAW="AST-tutorial/RAW/WGS"
DIR_REPORTS="AST-tutorial/REPORTS/WGS"
THREADS="integer number of threads for parallel processing, usually 4 is fine"
FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)
for FILE in "${FILES[@]}"
do
fastqc \
-t ${THREADS} \
${DIR_RAW}/${FILE}_1.fastq ${DIR_RAW}/${FILE}_2.fastq \
--outdir ${DIR_REPORTS}
done
2.3.1.3 Trim and filter reads
fastp will detect common sequencing adapters and trim them from the ends of sequencing reads, in addition to filtering low-quality reads (those whose average base call quality score is below some threshold and/or that are shorter than n nucleotides). Here, run fastp using the default settings to clean up the reads.
DIR_RAW="AST-tutorial/RAW/WGS"
DIR_TRIM="AST-tutorial/TRIM/WGS"
DIR_REPORTS="AST-tutorial/REPORTS/WGS"
THREADS="integer number of threads for parallel processing, usually 8 is fine"
FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)
for FILE in "${FILES[@]}"
do
fastp \
-w ${THREADS} \
-i ${DIR_RAW}/${FILE}_1.fastq -I ${DIR_RAW}/${FILE}_2.fastq \
-o ${DIR_TRIM}/${FILE}_1.fastq -O ${DIR_TRIM}/${FILE}_2.fastq \
-j ${DIR_REPORTS}/${FILE}_fastp.json
done
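The defaults are used above; if you would rather set the quality and length thresholds explicitly, fastp's -q (qualified quality score) and -l (minimum read length after trimming) options can be added. The thresholds below are illustrative only, not values prescribed by this tutorial:
# -q 20: bases below Q20 count as low quality; -l 50: drop reads shorter than 50 bp after trimming
fastp \
-w ${THREADS} \
-q 20 -l 50 \
-i ${DIR_RAW}/${FILE}_1.fastq -I ${DIR_RAW}/${FILE}_2.fastq \
-o ${DIR_TRIM}/${FILE}_1.fastq -O ${DIR_TRIM}/${FILE}_2.fastq \
-j ${DIR_REPORTS}/${FILE}_fastp.json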
2.3.1.4 Aggregate QC reports with MultiQC
Quality control metrics of the WGS reads before and after trimming/filtering can be visualized in one html report. Read more about interpreting NGS QC metrics from these reports here from the Harvard Chan Bioinformatics Core. An example MultiQC report of WGS reads post-trimming can be found here.
DIR_REPORTS="AST-tutorial/REPORTS/WGS"
multiqc ${DIR_REPORTS}/. \
-o ${DIR_REPORTS} \
-n WGS.html
2.3.2 WGS read alignment and post-processing
2.3.2.1 Generate BWA alignment index for the Amel_HAv3.1 reference genome assembly
DIR_INDEX="AST-tutorial/INDEX"
bwa index ${DIR_INDEX}/Amel_HAv3.1.fasta
2.3.2.2 Align reads to the Amel_HAv3.1 reference genome assembly with BWA-MEM
DIR_TRIM="AST-tutorial/TRIM/WGS"
DIR_INDEX="AST-tutorial/INDEX"
DIR_ALIGN="AST-tutorial/ALIGN_GENOME/UNFILTERED"
THREADS="integer number of threads for parallel processing, usually 8 is fine"
FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)
for FILE in "${FILES[@]}"
do
bwa mem \
-t ${THREADS} \
${DIR_INDEX}/Amel_HAv3.1.fasta \
${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \
> ${DIR_ALIGN}/${FILE}.sam
done
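Optional: if you intend to run additional GATK tools that expect read group information, a read group can be attached at alignment time with bwa mem's -R option, which adds a @RG header line and tags every read with it. The read group fields below simply reuse the SRA accession and are illustrative:
bwa mem \
-t ${THREADS} \
-R "@RG\tID:${FILE}\tSM:${FILE}\tPL:ILLUMINA" \
${DIR_INDEX}/Amel_HAv3.1.fasta \
${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \
> ${DIR_ALIGN}/${FILE}.sam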
2.3.2.3 Sort and filter alignments with samtools and filter duplicates with GATK
Here, alignments are coordinate sorted and filtered to remove unmapped reads, reads with unmapped mates, secondary, supplementary, and duplicate alignments (see here for information on SAM flags). These are important to remove for variant calling steps as they can contribute to both false positive and negative calls. For more information about these practices, see the GATK best practices workflow for variant discovery and Koboldt (2020), Genome Med, https://doi.org/10.1186/s13073-020-00791-w.
DIR_ALIGN="AST-tutorial/ALIGN_GENOME/UNFILTERED"
DIR_ALIGN_FILTERED="AST-tutorial/ALIGN_GENOME/FILTERED"
DIR_REPORTS="AST-tutorial/REPORTS/WGS"
THREADS="integer number of threads for parallel processing, usually 6 is fine"
FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)
for FILE in "${FILES[@]}"
do
# Remove alignments with any flags in 2316
# Convert SAM to BAM and sort using 4Gb memory per thread
samtools view \
-@ ${THREADS} -F 2316 -O bam ${DIR_ALIGN}/${FILE}.sam \
| samtools sort -@ ${THREADS} -m 4G -O bam - \
> ${DIR_ALIGN}/${FILE}.bam
# Remove duplicates and sort by coordinate
gatk MarkDuplicates \
-I ${DIR_ALIGN}/${FILE}.bam \
-O ${DIR_ALIGN_FILTERED}/${FILE}_rmvdup.bam \
-M ${DIR_REPORTS}/${FILE}_rmvdup_metrics.txt \
--ASSUME_SORT_ORDER coordinate \
--REMOVE_DUPLICATES true \
--ADD_PG_TAG_TO_READS false
done
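Before moving on to variant calling, it is worth checking how many alignments survived filtering and deduplication with samtools flagstat, and indexing the final BAMs (useful if you later restrict freebayes to particular regions or inspect alignments in a genome browser):
for FILE in "${FILES[@]}"
do
# Summarize the filtered, deduplicated alignments
samtools flagstat ${DIR_ALIGN_FILTERED}/${FILE}_rmvdup.bam \
> ${DIR_REPORTS}/${FILE}_rmvdup_flagstat.txt
# Index the BAM for random access
samtools index ${DIR_ALIGN_FILTERED}/${FILE}_rmvdup.bam
done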
2.3.3 Variant discovery and filtration
Here, variants (SNPs, MNPs, & indels) are identified using freebayes to account for differences in ploidy between DIPLOID and HAPLOID samples. After variant discovery, the variant call files (VCFs) are filtered by quality (QUAL), minimum depth of coverage (MINDP), and maximum depth of coverage (MAXDP) with VCFtools, subset to homozygous SNPs with BCFtools, then compressed with bgzip and indexed with tabix. For more information about these practices, see this how-to guide on Filtering and handling VCFs for population genomics from Mark Ravinet & Joana Meier.
QUAL="an integer, 30 is ideal"
MINDP="an integer, usually 10 is fine"
MAXDP="an integer, usually 50 is fine"
DIR_ALIGN_FILTERED="AST-tutorial/ALIGN_GENOME/FILTERED"
DIR_VARIANTS="AST-tutorial/VARIANTS/UNFILTERED"
DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED"
DIR_INDEX="AST-tutorial/INDEX"
DIPLOID=(SRR28865843 SRR28865841)
for FILE in "${DIPLOID[@]}"
do
freebayes \
-p 2 -f ${DIR_INDEX}/Amel_HAv3.1.fasta \
${DIR_ALIGN_FILTERED}/${FILE}_rmvdup.bam > ${DIR_VARIANTS}/${FILE}.vcf
done
HAPLOID=(SRR28865844 SRR28865842)
for FILE in "${HAPLOID[@]}"
do
freebayes \
-p 1 -f ${DIR_INDEX}/Amel_HAv3.1.fasta \
${DIR_ALIGN_FILTERED}/${FILE}_rmvdup.bam > ${DIR_VARIANTS}/${FILE}.vcf
done
FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)
for FILE in "${FILES[@]}"
do
# Remove variants that do not meet criteria
# Retain only high-quality homozygous SNPs
vcftools \
--vcf ${DIR_VARIANTS}/${FILE}.vcf \
--minQ ${QUAL} \
--min-meanDP ${MINDP} --max-meanDP ${MAXDP} \
--minDP ${MINDP} --maxDP ${MAXDP} \
--recode --stdout \
| bcftools filter -e 'GT="het"' - \
| bcftools filter -i 'TYPE="snp"' - \
> ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf
# Compress
bgzip -c ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf \
> ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf.gz
# Index the compressed file
tabix -p vcf ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf.gz
done
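A quick way to see how many homozygous SNPs were retained for each parent is to count the non-header lines of each filtered VCF:
for FILE in "${FILES[@]}"
do
# Count variant records (lines that are not part of the VCF header)
N=$(gunzip -c ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf.gz | grep -vc '^#')
echo "${FILE}: ${N} homozygous SNPs retained"
done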
2.3.4 Genome and transcriptome construction
2.3.4.1 Construct F1 genomes with GATK FastaAlternateReferenceMaker
First, index the Amel_HAv3.1 reference genome with samtools faidx and create a “sequence dictionary” from it for use with GATK. Then, integrate the high-quality homozygous SNPs into the Amel_HAv3.1 reference genome for each parent to generate individual genomes. Finally, reformat the numeric headers of the F1 genome .fasta files produced by GATK to match the nucleotide accessions in the headers of the Amel_HAv3.1 reference genome fasta file.
DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED"
DIR_INDEX="AST-tutorial/INDEX"
DIR_PARENT_GENOMES="AST-tutorial/PARENT_GENOMES"
# Index the reference; GATK requires both the .fai index and the sequence dictionary
samtools faidx ${DIR_INDEX}/Amel_HAv3.1.fasta
gatk CreateSequenceDictionary \
-R ${DIR_INDEX}/Amel_HAv3.1.fasta \
-O ${DIR_INDEX}/Amel_HAv3.1.dict
FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)
for FILE in "${FILES[@]}"
do
gatk FastaAlternateReferenceMaker \
-R ${DIR_INDEX}/Amel_HAv3.1.fasta \
-O ${DIR_PARENT_GENOMES}/${FILE}.fasta \
-V ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf.gz
done
# Get all sequence headers from one of the F1 genomes (= "bad")
grep ">" ${DIR_PARENT_GENOMES}/SRR28865844.fasta \
| sed 's/>//g' > ${DIR_INDEX}/bad_headers.txt
# Get all sequence headers from the reference genome (= "good")
grep ">" ${DIR_INDEX}/Amel_HAv3.1.fasta \
| sed 's/\s.*$//' | sed 's/>//g' > ${DIR_INDEX}/good_headers.txt
# Combine the "bad" and "good" headers into one .tsv
paste -d"\t" ${DIR_INDEX}/bad_headers.txt ${DIR_INDEX}/good_headers.txt \
> ${DIR_INDEX}/replace_headers.tsv
for FILE in "${FILES[@]}"
do
# Use replace_headers.tsv as a lookup table for replacement
awk 'FNR==NR{a[">"$1]=$2;next}$1 in a{sub(/>/,">"a[$1]"|",$1)}1' \
${DIR_INDEX}/replace_headers.tsv ${DIR_PARENT_GENOMES}/${FILE}.fasta \
| sed 's/:.*//' > ${DIR_PARENT_GENOMES}/${FILE}_fixed.fasta
done
This will replace the sequential numeric sequence headers in the resultant F1 genome .fasta files produced by GATK, e.g.,
>1
ATCCTCCACCT....
>2
GGGAATTGCCA....
with the nucleotide accessions in the sequence headers of the Amel_HAv3.1 reference genome fasta file, e.g.,
>NC_037638.1
ATCCTCCACCT....
>NC_037639.1
GGGAATTGCCA....
This step is critical for the final mRNA-seq allele-specific read counting step as the chromosome field of the annotation file used to assign counts to transcripts must match the chromosome field of the alignment files used for counting.
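To confirm the header replacement worked, compare the headers of each fixed F1 genome against good_headers.txt; assuming GATK preserved the sequence order of the reference (it should), diff will print nothing when the headers match:
for FILE in "${FILES[@]}"
do
# Strip ">" from the fixed headers and compare them to the reference accessions
diff <(grep ">" ${DIR_PARENT_GENOMES}/${FILE}_fixed.fasta | sed 's/>//g') \
${DIR_INDEX}/good_headers.txt
done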
2.3.4.2 Construct F1 transcriptomes for STAR
THREADS="integer number of threads for parallel processing, usually 8 is fine"
DIR_INDEX="AST-tutorial/INDEX"
DIR_PARENT_GENOMES="AST-tutorial/PARENT_GENOMES"
FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841)
for FILE in "${FILES[@]}"
do
STAR \
--runThreadN ${THREADS} \
--runMode genomeGenerate \
--genomeDir ${DIR_PARENT_GENOMES}/${FILE} \
--genomeFastaFiles ${DIR_PARENT_GENOMES}/${FILE}_fixed.fasta \
--sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf
done
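STAR's defaults are used above. Two genomeGenerate options worth knowing about are --sjdbOverhang, ideally set to your read length minus 1, and --genomeSAindexNbases, which STAR recommends lowering for genomes much smaller than the human genome (around 13 for the ~225 Mb honey bee genome, per the formula in the STAR manual). A variant of the command inside the loop with both set explicitly, assuming 150 bp reads:
STAR \
--runThreadN ${THREADS} \
--runMode genomeGenerate \
--genomeDir ${DIR_PARENT_GENOMES}/${FILE} \
--genomeFastaFiles ${DIR_PARENT_GENOMES}/${FILE}_fixed.fasta \
--sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \
--sjdbOverhang 149 \
--genomeSAindexNbases 13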
2.3.4.3 Annotate F1 SNPs within genes
Here, the high-quality homozygous SNPs for each reciprocal cross are intersected to find those that are unique to each parent but shared between the crosses. For example, consider these hypothetical SNPs reported by a variant calling tool:
Parent | Cross | Chr | Pos | Ref | Alt |
---|---|---|---|---|---|
Male | A | 1 | 100 | A | T |
Female | B | 1 | 100 | A | T |
In this example, in Cross A, the “A>T” SNP is present in the male parent but absent in the female parent (i.e., the “Alt” allele for the female parent is the same as the “Ref” allele and thus was not reported as a SNP). In Cross B, this SNP is present in the female parent but absent in the male parent. Thus, this SNP is unique to the parents of each cross, but shared between the crosses of the reciprocal cross. Therefore, this SNP can be used to differentiate reads originating from each parent in each cross, allowing for assessment of genotype-specific or parent-specific bias within the reciprocal cross design.
Get gene annotations
First, subset the Amel_HAv3.1 reference genome feature file (gff) to gene features, and modify column 9 (the info column) to keep only the RefSeq GeneID (these will be used in all downstream analyses).
DIR_INDEX="AST-tutorial/INDEX"
DIR_VARIANTS="AST-tutorial/VARIANTS/UNFILTERED"
DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED"
# Subset Amel_HAv3.1.gff to rows with "gene" in column 3
# Print columns 1-8
awk '$3 == "gene" { print $0 }' ${DIR_INDEX}/Amel_HAv3.1.gff \
| awk -v OFS="\t" '{print $1, $2, $3, $4, $5, $6, $7, $8}' \
> ${DIR_INDEX}/Amel_HAv3.1_genes.txt
# Extract the "GeneID" from column 9
awk '$3 == "gene" { print $0 }' ${DIR_INDEX}/Amel_HAv3.1.gff \
| awk '{print $9}' \
| grep -o 'GeneID[^\s]*' \
| cut -d':' -f2 \
| cut -d';' -f1 \
| cut -d',' -f1 \
| sed -e 's/^/LOC/' \
> ${DIR_INDEX}/Amel_HAv3.1_geneIDs.txt
# Combine the two files to create a new gene-only .gff
paste -d'\t' ${DIR_INDEX}/Amel_HAv3.1_genes.txt \
${DIR_INDEX}/Amel_HAv3.1_geneIDs.txt \
> ${DIR_INDEX}/Amel_HAv3.1_genes.gff3
The resultant Amel_HAv3.1_genes.gff3 file should look like this:
NC_037638.1 | Gnomon | gene | 9273 | 12174 | . | - | . | LOC551580 |
NC_037638.1 | Gnomon | gene | 10792 | 17180 | . | + | . | LOC551555 |
NC_037638.1 | Gnomon | gene | 17090 | 23457 | . | - | . | LOC726347 |
Identify informative F1 SNPs
Next, identify the informative SNPs by taking the difference of each respective parental SNP call set (i.e., SNPs unique to each parent) and finding the intersection between crosses.
DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED"
DIR_ANALYSIS="AST-tutorial/ANALYSIS_SETS"
CrossApat="SRR28865844"
CrossAmat="SRR28865843"
CrossBpat="SRR28865842"
CrossBmat="SRR28865841"
# Get SNPs in -a that are not in -b
bedtools intersect -header -v \
-a ${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps.vcf \
-b ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps.vcf \
> ${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps_outer.vcf
bedtools intersect -header -v \
-a ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps.vcf \
-b ${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps.vcf \
> ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps_outer.vcf
bedtools intersect -header -v \
-a ${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps.vcf \
-b ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps.vcf \
> ${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps_outer.vcf
bedtools intersect -header -v \
-a ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps.vcf \
-b ${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps.vcf \
> ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps_outer.vcf
# Combine Cross A SNPs from above to single .vcf
grep -v '^#' \
${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps_outer.vcf \
| cat ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps_outer.vcf - \
> ${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf
# Compress the combined Cross A .vcf
bgzip -c \
${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf \
> ${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf.gz
# Combine Cross B SNPs from above to single .vcf
grep -v '^#' \
${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps_outer.vcf \
| cat ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps_outer.vcf - \
> ${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf
# Compress the combined Cross B .vcf
bgzip -c \
${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf \
> ${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf.gz
# Intersect the combined Cross A & Cross B SNPs
bedtools intersect -header -u \
-a ${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf.gz \
-b ${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf.gz \
> ${DIR_ANALYSIS}/Analysis_SNP_Set.vcf
# Create a basic .bed file from the intersected SNPs
# Use the genomic position as the start and end coordinates
# Label each variant sequentially (i.e., SNP_1, SNP_2, etc.)
grep -v '^#' ${DIR_ANALYSIS}/Analysis_SNP_Set.vcf \
| awk -v OFS="\t" '{print $1, $2, $2}' \
| awk -v OFS="\t" '$4=(FNR FS $4)' \
| awk -v OFS="\t" '{print $1, $2, $3, "snp_"$4}' \
> ${DIR_ANALYSIS}/Analysis_SNP_Set.bed
The resultant Analysis_SNP_Set.bed file should look like this:
NC_037638.1 | 155356 | 155356 | snp_1 |
NC_037638.1 | 174473 | 174473 | snp_2 |
NC_037638.1 | 183350 | 183350 | snp_3 |
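As a quick sanity check before intersecting with genes, you can count the total number of informative SNPs and see how they are distributed across chromosomes:
# Total number of informative SNPs
wc -l ${DIR_ANALYSIS}/Analysis_SNP_Set.bed
# Number of informative SNPs per chromosome/scaffold
cut -f1 ${DIR_ANALYSIS}/Analysis_SNP_Set.bed | sort | uniq -c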
Intersect informative F1 SNPs with genes
Finally, annotate the genes with the positions of informative F1 SNPs. In Section 3, these SNPs will be subset to those within the longest transcript for each gene. Mitochondrial genes (on the sequence with nucleotide accession “NC_001566.1”) are removed, as these are inappropriate for assessing parent-of-origin effects (all mitochondrial genes will be maternally expressed).
DIR_ANALYSIS="AST-tutorial/ANALYSIS_SETS"
DIR_INDEX="AST-tutorial/INDEX"
# Intersect the gene annotations and SNP set
# Combine the SNP_ID and GeneIDs in column 4 (name) as SNP_ID:GeneID
bedtools intersect -wb \
-a ${DIR_INDEX}/Amel_HAv3.1_genes.gff3 \
-b ${DIR_ANALYSIS}/Analysis_SNP_Set.bed \
| awk -v OFS="\t" '{print $10, $11, $12, $13 ":" $9, $6, $7}' \
| grep -v '^NC_001566.1' | sort -k1,1 -k2,2n \
> ${DIR_ANALYSIS}/SNPs_for_Analysis.bed
sort --parallel=8 -k1,1 -k2,2n \
${DIR_ANALYSIS}/SNPs_for_Analysis.bed \
> ${DIR_ANALYSIS}/SNPs_for_Analysis_Sorted.bed
The resultant SNPs_for_Analysis_Sorted.bed file should look like this:
NC_037638.1 | 54751 | 54751 | snp_99616:LOC107964061 | . | + |
NC_037638.1 | 91396 | 91396 | snp_99617:LOC113219112 | . | + |
NC_037638.1 | 147182 | 147182 | snp_99618:LOC726544 | . | - |
2.4 Quantification of F2 allele-specific transcription
2.4.1 mRNA-seq read pre-processing
2.4.1.1 Retrieve mRNA-seq reads from SRA
SRA | Cross | Phenotype |
---|---|---|
SRR28865816 | A | Worker |
SRR28865815 | A | Worker |
SRR28865811 | A | Queen |
SRR28865812 | A | Queen |
SRR28865777 | A | Queen |
SRR28865776 | A | Worker |
SRR28865775 | A | Queen |
SRR28865774 | A | Worker |
SRR28865773 | A | Queen |
SRR28865772 | B | Queen |
SRR28865770 | B | Queen |
SRR28865769 | B | Worker |
SRR28865768 | B | Queen |
SRR28865771 | B | Queen |
SRR28865767 | B | Worker |
SRR28865808 | B | Worker |
SRR28865806 | B | Worker |
SRR28865804 | A | Worker |
SRR28865803 | B | Worker |
SRR28865802 | B | Queen |
DIR_RAW="AST-tutorial/RAW/mRNA"
FILES=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \
SRR28865777 SRR28865776 SRR28865775 SRR28865774 \
SRR28865773 SRR28865772 SRR28865770 SRR28865769 \
SRR28865768 SRR28865771 SRR28865767 SRR28865808 \
SRR28865806 SRR28865804 SRR28865803 SRR28865802)
for FILE in "${FILES[@]}"
do
prefetch -O ${DIR_RAW} ${FILE}
fasterq-dump -O ${DIR_RAW} ${DIR_RAW}/${FILE}.sra
rm ${DIR_RAW}/${FILE}.sra
done
2.4.1.2 Assess raw quality control metrics
DIR_RAW="AST-tutorial/RAW/mRNA"
DIR_REPORTS="AST-tutorial/REPORTS/mRNA"
THREADS="integer number of threads for parallel processing, usually 4 is fine"
FILES=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \
SRR28865777 SRR28865776 SRR28865775 SRR28865774 \
SRR28865773 SRR28865772 SRR28865770 SRR28865769 \
SRR28865768 SRR28865771 SRR28865767 SRR28865808 \
SRR28865806 SRR28865804 SRR28865803 SRR28865802)
for FILE in "${FILES[@]}"
do
fastqc \
-t ${THREADS} \
${DIR_RAW}/${FILE}_1.fastq ${DIR_RAW}/${FILE}_2.fastq \
--outdir ${DIR_REPORTS}
done
2.4.1.3 Trim and filter reads
DIR_RAW="AST-tutorial/RAW/mRNA"
DIR_TRIM="AST-tutorial/TRIM/mRNA"
DIR_REPORTS="AST-tutorial/REPORTS/mRNA"
THREADS="integer number of threads for parallel processing, usually 8 is fine"
FILES=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \
SRR28865777 SRR28865776 SRR28865775 SRR28865774 \
SRR28865773 SRR28865772 SRR28865770 SRR28865769 \
SRR28865768 SRR28865771 SRR28865767 SRR28865808 \
SRR28865806 SRR28865804 SRR28865803 SRR28865802)
for FILE in "${FILES[@]}"
do
fastp \
-w ${THREADS} \
-i ${DIR_RAW}/${FILE}_1.fastq -I ${DIR_RAW}/${FILE}_2.fastq \
-o ${DIR_TRIM}/${FILE}_1.fastq -O ${DIR_TRIM}/${FILE}_2.fastq \
-j ${DIR_REPORTS}/${FILE}_fastp.json
done
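As with the WGS reads (Section 2.3.1.4), the FastQC and fastp reports for the mRNA-seq libraries can be aggregated into a single MultiQC report:
DIR_REPORTS="AST-tutorial/REPORTS/mRNA"
multiqc ${DIR_REPORTS}/. \
-o ${DIR_REPORTS} \
-n mRNA.html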
2.4.2 mRNA-seq read alignment to F1 transcriptomes
Here, the mRNA-seq reads for each F2 sample are aligned to their respective parental (F1) transcriptomes.
DIR_TRIM="AST-tutorial/TRIM/mRNA"
DIR_PARENT_GENOMES="AST-tutorial/PARENT_GENOMES"
DIR_ALIGN="AST-tutorial/ALIGN_PARENT_GENOMES/UNFILTERED"
DIR_INDEX="AST-tutorial/INDEX"
THREADS="integer number of threads for parallel processing, usually 8 is fine"
CrossApat="SRR28865844"
CrossAmat="SRR28865843"
CrossA=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \
SRR28865777 SRR28865776 SRR28865775 SRR28865774 \
SRR28865773 SRR28865804)
for FILE in "${CrossA[@]}"
do
STAR \
--outSAMtype BAM SortedByCoordinate \
--runThreadN ${THREADS} \
--sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \
--genomeDir ${DIR_PARENT_GENOMES}/${CrossApat} \
--readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \
--outFileNamePrefix ${DIR_ALIGN}/${CrossApat}_${FILE} \
--outSAMattributes NH HI AS nM NM
STAR \
--outSAMtype BAM SortedByCoordinate \
--runThreadN ${THREADS} \
--sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \
--genomeDir ${DIR_PARENT_GENOMES}/${CrossAmat} \
--readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \
--outFileNamePrefix ${DIR_ALIGN}/${CrossAmat}_${FILE} \
--outSAMattributes NH HI AS nM NM
done
CrossBpat="SRR28865842"
CrossBmat="SRR28865841"
CrossB=(SRR28865772 SRR28865770 SRR28865769 SRR28865768 \
SRR28865771 SRR28865767 SRR28865808 SRR28865806 \
SRR28865803 SRR28865802)
for FILE in "${CrossB[@]}"
do
STAR \
--outSAMtype BAM SortedByCoordinate \
--runThreadN ${THREADS} \
--sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \
--genomeDir ${DIR_PARENT_GENOMES}/${CrossBpat} \
--readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \
--outFileNamePrefix ${DIR_ALIGN}/${CrossBpat}_${FILE} \
--outSAMattributes NH HI AS nM NM
STAR \
--outSAMtype BAM SortedByCoordinate \
--runThreadN ${THREADS} \
--sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \
--genomeDir ${DIR_PARENT_GENOMES}/${CrossBmat} \
--readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \
--outFileNamePrefix ${DIR_ALIGN}/${CrossBmat}_${FILE} \
--outSAMattributes NH HI AS nM NM
done
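STAR writes an alignment summary for each run to a file ending in Log.final.out under the output prefix. A quick way to scan the unique mapping rates across all alignments before filtering:
# Print the uniquely mapped read percentage from every STAR summary file
grep -H "Uniquely mapped reads %" ${DIR_ALIGN}/*Log.final.out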
2.4.3 mRNA-seq alignment post-processing and allele-specific read counting
This final step may seem a bit convoluted at first, but after testing many other read counting tools, I found this to be the only approach that allows counting strand-specific reads at the SNP level without making any additional assumptions.
DIR_ANALYSIS="AST-tutorial/ANALYSIS_SETS"
DIR_INDEX="AST-tutorial/INDEX"
DIR_ALIGN="AST-tutorial/ALIGN_PARENT_GENOMES/UNFILTERED"
DIR_ALIGN_FILTERED="AST-tutorial/ALIGN_PARENT_GENOMES/FILTERED"
DIR_COUNT="AST-tutorial/COUNT"
THREADS="integer number of threads for parallel processing, usually 8 is fine"
ASET="${DIR_ANALYSIS}/SNPs_for_Analysis_Sorted.bed"
Parents=("SRR28865844" "SRR28865843")
CrossA=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \
SRR28865777 SRR28865776 SRR28865775 SRR28865774 \
SRR28865773 SRR28865804)
for FILE in "${CrossA[@]}"
do
for PARENT in "${Parents[@]}"
do
# Keep only mapped, primary alignments with 0 mismatches
samtools view -@ ${THREADS} -e '[NM]==0' -F 260 -O BAM \
${DIR_ALIGN}/${PARENT}_${FILE}Aligned.sortedByCoord.out.bam \
> ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam
# Index the alignments
samtools index ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \
${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam.bai
# Convert BAM to BED
bedtools bamtobed -i ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \
> ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed
# Sort BED by coordinate
sort --parallel=${THREADS} \
-k1,1 -k2,2n ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed \
> ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed
# Count reads by SNP accounting for strand of alignment
bedtools intersect -S -sorted -c -a ${ASET} \
-b ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed \
> ${DIR_COUNT}/${PARENT}_${FILE}.txt
done
done
Parents=("SRR28865842" "SRR28865841")
CrossB=(SRR28865772 SRR28865770 SRR28865769 SRR28865768 \
SRR28865771 SRR28865767 SRR28865808 SRR28865806 \
SRR28865803 SRR28865802)
for FILE in "${CrossB[@]}"
do
for PARENT in "${Parents[@]}"
do
# Keep only mapped, primary alignments with 0 mismatches
samtools view -@ ${THREADS} -e '[NM]==0' -F 260 -O BAM \
${DIR_ALIGN}/${PARENT}_${FILE}Aligned.sortedByCoord.out.bam \
> ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam
# Index the alignments
samtools index ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \
${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam.bai
# Convert BAM to BED
bedtools bamtobed -i ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \
> ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed
# Sort BED by coordinate
sort --parallel=${THREADS} \
-k1,1 -k2,2n ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed \
> ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed
# Count reads by SNP accounting for strand of alignment
bedtools intersect -S -sorted -c -a ${ASET} \
-b ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed \
> ${DIR_COUNT}/${PARENT}_${FILE}.txt
done
done
Resultant counts files should look identical to SNPs_for_Analysis_Sorted.bed, with an additional column containing the read counts:
NC_037638.1 | 54751 | 54751 | snp_99616:LOC107964061 | . | + | 8 |
NC_037638.1 | 91396 | 91396 | snp_99617:LOC113219112 | . | + | 7 |
NC_037638.1 | 147182 | 147182 | snp_99618:LOC726544 | . | - | 9 |
In Section 3: SNP-level parent-specific read counts to allele-specific transcription, the resultant count files will be combined into a single matrix, and various statistical methods will be used to assess and compare allele-specific transcription between phenotypes.