Methods

Project directory structure

TCRBOA/
├── raw_fastq
├── trim_fastq
├── raw_bam
├── filter_bam
├── raw_vcf
├── filter_vcf
├── index
├── reports

Data retrieval

Illumina Whole Exome Sequencing (WXS) reads

Sequencing reads were described in Becnel, L. B. et al. (2016). An open access pilot freely sharing cancer genomic data from participants in Texas. Scientific Data 3: 160010. https://doi.org/10.1038/sdata.2016.10.

Run Sample Disease Tissue Sex
SRR2089363 TCR002361-T 9690/3: Follicular lymphoma Lymph node of the groin male
SRR2089364 TCR002361-N Normal Blood male
SRR2184243 TCRBOA6-T 9690/3: Follicular lymphoma Lymph node of the groin female
SRR2184242 TCRBOA6-N Normal Blood female
SRR2089355 TCR000484-T 8246/3: Neuroendocrine carcinoma, NOS Tail of pancreas female
SRR2089359 TCR000484-N Normal Blood female
SRR2089357 TCR002101-T 8500/3: Infiltrating duct adenocarcinoma, NOS Pancreas female
SRR2089356 TCR002101-N Normal Blood female
SRR2089360 TCR002103-T 8500/3: Infiltrating duct adenocarcinoma, NOS Pancreas male
SRR2089358 TCR002103-N Normal Blood male
SRR2089361 TCR002201-T 8500/3: Infiltrating duct adenocarcinoma, NOS Pancreas male
SRR2089362 TCR002201-N Normal Blood male
SRR2089365 TCR002182-T 8500/3: Infiltrating duct adenocarcinoma, NOS Pancreas male
SRR2089366 TCR002182-N Normal Blood male

Fastq files were retrieved from NCBI SRA (BioProject accessions PRJNA284596 and PRJNA284598) via The SRA Toolkit:

cd ~/TCRBOA/raw_fastq

FILES=(SRR2089363 SRR2089364 SRR2184243 SRR2184242 \
       SRR2089355 SRR2089359 SRR2089357 SRR2089356 \
       SRR2089360 SRR2089358 SRR2089361 SRR2089362 \
       SRR2089365 SRR2089366)

for FILE in "${FILES[@]}"
do

 prefetch ${FILE}
 fasterq-dump ${FILE}
 
done

Human genome assembly, annotations, and variant call sets

Analysis set

Retrieved from UCSC.

Gene annotation

Retreived from UCSC.

Data preprocessing

Following the hpc.nih.gov genome sequence data preprocessing workflow.

Adapter trimming

WXS reads were adapter trimmed with fastp:

for FILE in "${FILES[@]}"
do

 fastp -w 8 \
  -i raw_fastq/${FILE}_1.fastq \
  -I raw_fastq/${FILE}_2.fastq \
  -o trim_fastq/${FILE}_1.fastq \
  -O trim_fastq/${FILE}_2.fastq \
  --json reports/${FILE}_fastp.json

done

Alignment

For alignment of WXS reads to hg38 with BWA-MEM, index files were generated:

gunzip index/hg38.analysisSet.fasta.gz

samtools faidx index/hg38.analysisSet.fasta

bwa index index/hg38.analysisSet.fasta

WXS reads were aligned to hg38 with BWA-MEM, including SRA accessions as read groups. Alignments in SAM format were coordinate sorted and converted to BAM format with samtools.

for FILE in "${FILES[@]}"
do

 bwa mem -a -M \
  -R '@RG\tID:'${FILE}'_TCRBOA1\tSM:'${FILE}'\tLB:'${FILE}'\tPL:ILLUMINA' \
  -t 8 index/hg38.analysisSet.fasta \
  trim_fastq/${FILE}_1.fastq trim_fastq/${FILE}_2.fastq | \
   samtools view -@ 8 -O BAM | \
    samtools sort -@ 8 -O BAM > \
     raw_bam/${FILE}.bam

done

Alignment processing

Following the GATK Best Practices Workflow on Data pre-processing for variant discovery.

Mark duplicates

PCR & Sequencing (optical) duplicate reads were annotated to mark all but one copy from a set of duplicate alignments as duplicates. The unmarked copy is assumed to originate from a single molecule of sequenced DNA.

for FILE in "${FILES[@]}"
do

 gatk MarkDuplicatesSpark \
  -I raw_bam/${FILE}.bam \
  -O filter_bam/${FILE}.bam \
  -M reports/${FILE}.MarkDuplicates

done

Base quality scores recalibration

“This processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it’s important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate measurements from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.” – GATK Best Practices Workflow

Homo_sapiens_assembly38.dbsnp138.vcf, Homo_sapiens_assembly38.known_indels.vcf.gz and Mills_and_1000G_gold_standard.indels.hg38.vcf.gz were used for BaseRecalibrator --known-sites to mask out positions with known variation to avoid confusing real variation with errors.

tabix -p vcf index/Homo_sapiens_assembly38.dbsnp138.vcf.gz

for FILE in "${FILES[@]}"
do

 gatk BaseRecalibrator \
  -I filter_bam/${FILE}.bam \
  -R index/hg38.analysisSet.fasta \
  -O reports/${FILE}_BQSR.report \
  --known-sites index/Homo_sapiens_assembly38.dbsnp138.vcf \
  --known-sites index/Homo_sapiens_assembly38.known_indels.vcf.gz \
  --known-sites index/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
  
 gatk ApplyBQSR \
  -I filter_bam/${FILE}.bam \
  -R index/hg38.analysisSet.fasta \
  --bqsr-recal-file reports/${FILE}_BQSR.report \
  -O filter_bam/${FILE}_BQSR.bam
  
done

Somatic short variant discovery (SNVs + indels)

Following the GATK Best Practices Workflow.

Candidate variant calling

Mutect2 was used to call somatic SNVs and indels via local assembly of haplotypes.

“Mutect2 calls SNVs and indels simultaneously via local de-novo assembly of haplotypes in an active region. That is, when Mutect2 encounters a region showing signs of somatic variation, it discards the existing mapping information and completely reassembles the reads in that region in order to generate candidate variant haplotypes. Like HaplotypeCaller, Mutect2 then aligns each read to each haplotype via the Pair-HMM algorithm to obtain a matrix of likelihoods. Finally, it applies a Bayesian somatic likelihoods model to obtain the log odds for alleles to be somatic variants versus sequencing errors.” – GATK Best Practices Workflow

af-only-gnomad.hg38.vcf.gz was used as a germline resource and 1000g_pon.hg38.vcf.gz as the panel of normals.

TUMOR=(SRR2089363 SRR2184243 SRR2089355 SRR2089357 \
       SRR2089360 SRR2089361 SRR2089365)

NORMAL=(SRR2089364 SRR2184242 SRR2089359 SRR2089356 \
        SRR2089358 SRR2089362 SRR2089366)

INDEX=0

for FILE in "${TUMOR[@]}"
do

 gatk Mutect2 \
  -R index/hg38.analysisSet.fasta \
  -I filter_bam/${FILE}_BQSR.bam \
  -I filter_bam/${NORMAL[$INDEX]}_BQSR.bam \
  -normal ${NORMAL[$INDEX]} \
  --germline-resource index/af-only-gnomad.hg38.vcf.gz \
  --panel-of-normals index/1000g_pon.hg38.vcf.gz \
  -O raw_vcf/${FILE}.vcf
  
 let INDEX=${INDEX}+1
 
done

Candidate variant processing

Contamination assessment

“This step emits an estimate of the fraction of reads due to cross-sample contamination for each tumor sample and an estimate of the allelic copy number segmentation of each tumor sample. Unlike other contamination tools, CalculateContamination is designed to work well without a matched normal even in samples with significant copy number variation and makes no assumptions about the number of contaminating samples.” – GATK Best Practices Workflow

GetPileupSummaries

Summarizes counts of reads that support reference, alternate and other alleles for given sites. af-only-gnomad.hg38.vcf.gz was used for a common germline variant sites resource.

for FILE in "${FILES[@]}"
do

 gatk GetPileupSummaries \
  -I filter_bam/${FILE}_BQSR.bam \
  -V index/af-only-gnomad.hg38.vcf.gz \
  -L index/af-only-gnomad.hg38.vcf.gz \
  -O reports/${FILE}_pileups.table
 
done
CalculateContamination

Calculates the fraction of reads coming from cross-sample contamination by estimating contamination from ref reads at hom alt sites, given results from GetPileupSummaries.

INDEX=0

for FILE in "${TUMOR[@]}"
do

 gatk CalculateContamination \
  -I reports/${FILE}_pileups.table \
  -matched reports/${NORMAL[$INDEX]}_pileups.table
  -O reports/${FILE}_contamination.table
 
 let INDEX=${INDEX}+1
 
done

Filtering

Filter variants due to contamination with FilterMutectCalls.

for FILE in "${TUMOR[@]}"
do

 gatk FilterMutectCalls \
  -R index/hg38.analysisSet.fasta \
  -V raw_vcf/${FILE}.vcf \
  --contamination-table reports/${FILE}_contamination.table \
  -O filter_vcf/${FILE}_filtered.vcf

done