TCRBOA/
├── raw_fastq
├── trim_fastq
├── raw_bam
├── filter_bam
├── raw_vcf
├── filter_vcf
├── index
├── reports
Sequencing reads were described in Becnel, L. B. et al. (2016). An open access pilot freely sharing cancer genomic data from participants in Texas. Scientific Data 3: 160010. https://doi.org/10.1038/sdata.2016.10.
Run | Sample | Disease | Tissue | Sex |
---|---|---|---|---|
SRR2089363 | TCR002361-T | 9690/3: Follicular lymphoma | Lymph node of the groin | male |
SRR2089364 | TCR002361-N | Normal | Blood | male |
SRR2184243 | TCRBOA6-T | 9690/3: Follicular lymphoma | Lymph node of the groin | female |
SRR2184242 | TCRBOA6-N | Normal | Blood | female |
SRR2089355 | TCR000484-T | 8246/3: Neuroendocrine carcinoma, NOS | Tail of pancreas | female |
SRR2089359 | TCR000484-N | Normal | Blood | female |
SRR2089357 | TCR002101-T | 8500/3: Infiltrating duct adenocarcinoma, NOS | Pancreas | female |
SRR2089356 | TCR002101-N | Normal | Blood | female |
SRR2089360 | TCR002103-T | 8500/3: Infiltrating duct adenocarcinoma, NOS | Pancreas | male |
SRR2089358 | TCR002103-N | Normal | Blood | male |
SRR2089361 | TCR002201-T | 8500/3: Infiltrating duct adenocarcinoma, NOS | Pancreas | male |
SRR2089362 | TCR002201-N | Normal | Blood | male |
SRR2089365 | TCR002182-T | 8500/3: Infiltrating duct adenocarcinoma, NOS | Pancreas | male |
SRR2089366 | TCR002182-N | Normal | Blood | male |
Fastq files were retrieved from NCBI SRA (BioProject accessions PRJNA284596 and PRJNA284598) via The SRA Toolkit:
cd ~/TCRBOA/raw_fastq
FILES=(SRR2089363 SRR2089364 SRR2184243 SRR2184242 \
SRR2089355 SRR2089359 SRR2089357 SRR2089356 \
SRR2089360 SRR2089358 SRR2089361 SRR2089362 \
SRR2089365 SRR2089366)
for FILE in "${FILES[@]}"
do
prefetch ${FILE}
fasterq-dump ${FILE}
done
Retrieved from UCSC.
Following the hpc.nih.gov genome sequence data preprocessing workflow.
WXS reads were adapter trimmed with fastp:
for FILE in "${FILES[@]}"
do
fastp -w 8 \
-i raw_fastq/${FILE}_1.fastq \
-I raw_fastq/${FILE}_2.fastq \
-o trim_fastq/${FILE}_1.fastq \
-O trim_fastq/${FILE}_2.fastq \
--json reports/${FILE}_fastp.json
done
For alignment of WXS reads to hg38 with BWA-MEM, index files were generated:
gunzip index/hg38.analysisSet.fasta.gz
samtools faidx index/hg38.analysisSet.fasta
bwa index index/hg38.analysisSet.fasta
WXS reads were aligned to hg38 with BWA-MEM, including SRA accessions as read groups. Alignments in SAM format were coordinate sorted and converted to BAM format with samtools.
for FILE in "${FILES[@]}"
do
bwa mem -a -M \
-R '@RG\tID:'${FILE}'_TCRBOA1\tSM:'${FILE}'\tLB:'${FILE}'\tPL:ILLUMINA' \
-t 8 index/hg38.analysisSet.fasta \
trim_fastq/${FILE}_1.fastq trim_fastq/${FILE}_2.fastq | \
samtools view -@ 8 -O BAM | \
samtools sort -@ 8 -O BAM > \
raw_bam/${FILE}.bam
done
Following the GATK Best Practices Workflow on Data pre-processing for variant discovery.
PCR & Sequencing (optical) duplicate reads were annotated to mark all but one copy from a set of duplicate alignments as duplicates. The unmarked copy is assumed to originate from a single molecule of sequenced DNA.
for FILE in "${FILES[@]}"
do
gatk MarkDuplicatesSpark \
-I raw_bam/${FILE}.bam \
-O filter_bam/${FILE}.bam \
-M reports/${FILE}.MarkDuplicates
done
“This processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it’s important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate measurements from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.” – GATK Best Practices Workflow
Homo_sapiens_assembly38.dbsnp138.vcf,
Homo_sapiens_assembly38.known_indels.vcf.gz
and Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
were used for BaseRecalibrator --known-sites
to mask out
positions with known variation to avoid confusing real variation with
errors.
tabix -p vcf index/Homo_sapiens_assembly38.dbsnp138.vcf.gz
for FILE in "${FILES[@]}"
do
gatk BaseRecalibrator \
-I filter_bam/${FILE}.bam \
-R index/hg38.analysisSet.fasta \
-O reports/${FILE}_BQSR.report \
--known-sites index/Homo_sapiens_assembly38.dbsnp138.vcf \
--known-sites index/Homo_sapiens_assembly38.known_indels.vcf.gz \
--known-sites index/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
gatk ApplyBQSR \
-I filter_bam/${FILE}.bam \
-R index/hg38.analysisSet.fasta \
--bqsr-recal-file reports/${FILE}_BQSR.report \
-O filter_bam/${FILE}_BQSR.bam
done
Following the GATK Best Practices Workflow.
Mutect2 was used to call somatic SNVs and indels via local assembly of haplotypes.
“Mutect2 calls SNVs and indels simultaneously via local de-novo assembly of haplotypes in an active region. That is, when Mutect2 encounters a region showing signs of somatic variation, it discards the existing mapping information and completely reassembles the reads in that region in order to generate candidate variant haplotypes. Like HaplotypeCaller, Mutect2 then aligns each read to each haplotype via the Pair-HMM algorithm to obtain a matrix of likelihoods. Finally, it applies a Bayesian somatic likelihoods model to obtain the log odds for alleles to be somatic variants versus sequencing errors.” – GATK Best Practices Workflow
af-only-gnomad.hg38.vcf.gz was used as a germline resource and 1000g_pon.hg38.vcf.gz as the panel of normals.
TUMOR=(SRR2089363 SRR2184243 SRR2089355 SRR2089357 \
SRR2089360 SRR2089361 SRR2089365)
NORMAL=(SRR2089364 SRR2184242 SRR2089359 SRR2089356 \
SRR2089358 SRR2089362 SRR2089366)
INDEX=0
for FILE in "${TUMOR[@]}"
do
gatk Mutect2 \
-R index/hg38.analysisSet.fasta \
-I filter_bam/${FILE}_BQSR.bam \
-I filter_bam/${NORMAL[$INDEX]}_BQSR.bam \
-normal ${NORMAL[$INDEX]} \
--germline-resource index/af-only-gnomad.hg38.vcf.gz \
--panel-of-normals index/1000g_pon.hg38.vcf.gz \
-O raw_vcf/${FILE}.vcf
let INDEX=${INDEX}+1
done
“This step emits an estimate of the fraction of reads due to cross-sample contamination for each tumor sample and an estimate of the allelic copy number segmentation of each tumor sample. Unlike other contamination tools, CalculateContamination is designed to work well without a matched normal even in samples with significant copy number variation and makes no assumptions about the number of contaminating samples.” – GATK Best Practices Workflow
Summarizes counts of reads that support reference, alternate and other alleles for given sites. af-only-gnomad.hg38.vcf.gz was used for a common germline variant sites resource.
for FILE in "${FILES[@]}"
do
gatk GetPileupSummaries \
-I filter_bam/${FILE}_BQSR.bam \
-V index/af-only-gnomad.hg38.vcf.gz \
-L index/af-only-gnomad.hg38.vcf.gz \
-O reports/${FILE}_pileups.table
done
Calculates the fraction of reads coming from cross-sample contamination by estimating contamination from ref reads at hom alt sites, given results from GetPileupSummaries.
INDEX=0
for FILE in "${TUMOR[@]}"
do
gatk CalculateContamination \
-I reports/${FILE}_pileups.table \
-matched reports/${NORMAL[$INDEX]}_pileups.table
-O reports/${FILE}_contamination.table
let INDEX=${INDEX}+1
done
Filter variants due to contamination with FilterMutectCalls.
for FILE in "${TUMOR[@]}"
do
gatk FilterMutectCalls \
-R index/hg38.analysisSet.fasta \
-V raw_vcf/${FILE}.vcf \
--contamination-table reports/${FILE}_contamination.table \
-O filter_vcf/${FILE}_filtered.vcf
done