Methods

Project directory structure

TCRBOA/
├── raw_fastq
├── trim_fastq
├── raw_bam
├── filter_bam
├── raw_vcf
├── filter_vcf
├── index
├── reports

Data retrieval

Illumina Whole Exome Sequencing (WXS) reads

Sequencing reads were described in Becnel, L. B. et al. (2016). An open access pilot freely sharing cancer genomic data from participants in Texas. Scientific Data 3: 160010. https://doi.org/10.1038/sdata.2016.10.

Run	Sample	Disease	Tissue	Sex
SRR2089363	TCR002361-T	9690/3: Follicular lymphoma	Lymph node of the groin	male
SRR2089364	TCR002361-N	Normal	Blood	male
SRR2184243	TCRBOA6-T	9690/3: Follicular lymphoma	Lymph node of the groin	female
SRR2184242	TCRBOA6-N	Normal	Blood	female
SRR2089355	TCR000484-T	8246/3: Neuroendocrine carcinoma, NOS	Tail of pancreas	female
SRR2089359	TCR000484-N	Normal	Blood	female
SRR2089357	TCR002101-T	8500/3: Infiltrating duct adenocarcinoma, NOS	Pancreas	female
SRR2089356	TCR002101-N	Normal	Blood	female
SRR2089360	TCR002103-T	8500/3: Infiltrating duct adenocarcinoma, NOS	Pancreas	male
SRR2089358	TCR002103-N	Normal	Blood	male
SRR2089361	TCR002201-T	8500/3: Infiltrating duct adenocarcinoma, NOS	Pancreas	male
SRR2089362	TCR002201-N	Normal	Blood	male
SRR2089365	TCR002182-T	8500/3: Infiltrating duct adenocarcinoma, NOS	Pancreas	male
SRR2089366	TCR002182-N	Normal	Blood	male

Fastq files were retrieved from NCBI SRA (BioProject accessions PRJNA284596 and PRJNA284598) via The SRA Toolkit:

cd ~/TCRBOA/raw_fastq

FILES=(SRR2089363 SRR2089364 SRR2184243 SRR2184242 \
       SRR2089355 SRR2089359 SRR2089357 SRR2089356 \
       SRR2089360 SRR2089358 SRR2089361 SRR2089362 \
       SRR2089365 SRR2089366)

for FILE in "${FILES[@]}"
do

 prefetch ${FILE}
 fasterq-dump ${FILE}
 
done

Human genome assembly, annotations, and variant call sets

Analysis set

Retrieved from UCSC.

hg38.analysisSet.fasta.gz: The “analysis set” is a version of the genome prepared for next-gen sequencing read alignment. It contains no alternate sequences, no patches, fixes or haplotypes, only the main chromosomes. For more information, see https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/.

Gene annotation

Retreived from UCSC.

hg38.ncbiRefSeq.gtf.gz

Variant call sets

Retrieved from the GATK resource bundle.

Data preprocessing

Following the hpc.nih.gov genome sequence data preprocessing workflow.

Adapter trimming

WXS reads were adapter trimmed with fastp:

for FILE in "${FILES[@]}"
do

 fastp -w 8 \
  -i raw_fastq/${FILE}_1.fastq \
  -I raw_fastq/${FILE}_2.fastq \
  -o trim_fastq/${FILE}_1.fastq \
  -O trim_fastq/${FILE}_2.fastq \
  --json reports/${FILE}_fastp.json

done

Alignment

For alignment of WXS reads to hg38 with BWA-MEM, index files were generated:

gunzip index/hg38.analysisSet.fasta.gz

samtools faidx index/hg38.analysisSet.fasta

bwa index index/hg38.analysisSet.fasta

WXS reads were aligned to hg38 with BWA-MEM, including SRA accessions as read groups. Alignments in SAM format were coordinate sorted and converted to BAM format with samtools.

for FILE in "${FILES[@]}"
do

 bwa mem -a -M \
  -R '@RG\tID:'${FILE}'_TCRBOA1\tSM:'${FILE}'\tLB:'${FILE}'\tPL:ILLUMINA' \
  -t 8 index/hg38.analysisSet.fasta \
  trim_fastq/${FILE}_1.fastq trim_fastq/${FILE}_2.fastq | \
   samtools view -@ 8 -O BAM | \
    samtools sort -@ 8 -O BAM > \
     raw_bam/${FILE}.bam

done

Alignment processing

Following the GATK Best Practices Workflow on Data pre-processing for variant discovery.

Mark duplicates

PCR & Sequencing (optical) duplicate reads were annotated to mark all but one copy from a set of duplicate alignments as duplicates. The unmarked copy is assumed to originate from a single molecule of sequenced DNA.

for FILE in "${FILES[@]}"
do

 gatk MarkDuplicatesSpark \
  -I raw_bam/${FILE}.bam \
  -O filter_bam/${FILE}.bam \
  -M reports/${FILE}.MarkDuplicates

done

Base quality scores recalibration

“This processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it’s important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate measurements from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.” – GATK Best Practices Workflow

Homo_sapiens_assembly38.dbsnp138.vcf, Homo_sapiens_assembly38.known_indels.vcf.gz and Mills_and_1000G_gold_standard.indels.hg38.vcf.gz were used for BaseRecalibrator --known-sites to mask out positions with known variation to avoid confusing real variation with errors.

tabix -p vcf index/Homo_sapiens_assembly38.dbsnp138.vcf.gz

for FILE in "${FILES[@]}"
do

 gatk BaseRecalibrator \
  -I filter_bam/${FILE}.bam \
  -R index/hg38.analysisSet.fasta \
  -O reports/${FILE}_BQSR.report \
  --known-sites index/Homo_sapiens_assembly38.dbsnp138.vcf \
  --known-sites index/Homo_sapiens_assembly38.known_indels.vcf.gz \
  --known-sites index/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
  
 gatk ApplyBQSR \
  -I filter_bam/${FILE}.bam \
  -R index/hg38.analysisSet.fasta \
  --bqsr-recal-file reports/${FILE}_BQSR.report \
  -O filter_bam/${FILE}_BQSR.bam
  
done

Somatic short variant discovery (SNVs + indels)

Following the GATK Best Practices Workflow.

Candidate variant calling

Mutect2 was used to call somatic SNVs and indels via local assembly of haplotypes.

“Mutect2 calls SNVs and indels simultaneously via local de-novo assembly of haplotypes in an active region. That is, when Mutect2 encounters a region showing signs of somatic variation, it discards the existing mapping information and completely reassembles the reads in that region in order to generate candidate variant haplotypes. Like HaplotypeCaller, Mutect2 then aligns each read to each haplotype via the Pair-HMM algorithm to obtain a matrix of likelihoods. Finally, it applies a Bayesian somatic likelihoods model to obtain the log odds for alleles to be somatic variants versus sequencing errors.” – GATK Best Practices Workflow

af-only-gnomad.hg38.vcf.gz was used as a germline resource and 1000g_pon.hg38.vcf.gz as the panel of normals.

TUMOR=(SRR2089363 SRR2184243 SRR2089355 SRR2089357 \
       SRR2089360 SRR2089361 SRR2089365)

NORMAL=(SRR2089364 SRR2184242 SRR2089359 SRR2089356 \
        SRR2089358 SRR2089362 SRR2089366)

INDEX=0

for FILE in "${TUMOR[@]}"
do

 gatk Mutect2 \
  -R index/hg38.analysisSet.fasta \
  -I filter_bam/${FILE}_BQSR.bam \
  -I filter_bam/${NORMAL[$INDEX]}_BQSR.bam \
  -normal ${NORMAL[$INDEX]} \
  --germline-resource index/af-only-gnomad.hg38.vcf.gz \
  --panel-of-normals index/1000g_pon.hg38.vcf.gz \
  -O raw_vcf/${FILE}.vcf
  
 let INDEX=${INDEX}+1
 
done

Candidate variant processing

Contamination assessment

“This step emits an estimate of the fraction of reads due to cross-sample contamination for each tumor sample and an estimate of the allelic copy number segmentation of each tumor sample. Unlike other contamination tools, CalculateContamination is designed to work well without a matched normal even in samples with significant copy number variation and makes no assumptions about the number of contaminating samples.” – GATK Best Practices Workflow

GetPileupSummaries

Summarizes counts of reads that support reference, alternate and other alleles for given sites. af-only-gnomad.hg38.vcf.gz was used for a common germline variant sites resource.

for FILE in "${FILES[@]}"
do

 gatk GetPileupSummaries \
  -I filter_bam/${FILE}_BQSR.bam \
  -V index/af-only-gnomad.hg38.vcf.gz \
  -L index/af-only-gnomad.hg38.vcf.gz \
  -O reports/${FILE}_pileups.table
 
done

CalculateContamination

Calculates the fraction of reads coming from cross-sample contamination by estimating contamination from ref reads at hom alt sites, given results from GetPileupSummaries.

INDEX=0

for FILE in "${TUMOR[@]}"
do

 gatk CalculateContamination \
  -I reports/${FILE}_pileups.table \
  -matched reports/${NORMAL[$INDEX]}_pileups.table
  -O reports/${FILE}_contamination.table
 
 let INDEX=${INDEX}+1
 
done

Filtering

Filter variants due to contamination with FilterMutectCalls.

for FILE in "${TUMOR[@]}"
do

 gatk FilterMutectCalls \
  -R index/hg38.analysisSet.fasta \
  -V raw_vcf/${FILE}.vcf \
  --contamination-table reports/${FILE}_contamination.table \
  -O filter_vcf/${FILE}_filtered.vcf

done

Somatic variant discovery with tumor-normal WXS

Sean Bresnahan

2024-02-02