Code that downloads one bacterial genome at a time, maps the reads, calls the variants, annotates them and outputs a pseudoalignment
BactStream avoids the storage requirements of downloading thousands of .fastq_ and gets right to the much smaller .vcf
BactStream --reference <reference.fasta> --sra_ids <accessions.txt>If the run is interrupted just run BactStream again and it'll continue where it left off
BactStream builds the custom snpEff database from the reference annotation. To avoid snpEff ERROR_CHROMOSOME_NOT_FOUND errors, the reference annotation is built with contig IDs derived from the reference FASTA header up to the first whitespace character. For example:
>NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
is treated as:
NC_000913.3
This matches the chromosome names produced by BWA/GATK for the same reference.
The custom snpEff database name is derived from the reference contig names and
lengths, so different references do not reuse the same custom-db cache. If
you are running an older BactStream checkout and see this error, remove the old
generic cache first:
rm -rf customref.gbk customref.reference.fna customref_bakta
rm -rf <snpeff-directory>/data/custom-dbFor workflows that only need read mapping, variant calling, and snpEff annotation, skip per-sample SPAdes assembly and per-sample Bakta annotation:
BactStream \
--reference <reference.fasta> \
--sra_ids <accessions.txt> \
--baktadb <bakta-db> \
--skip_assemblyThe reference is still annotated with Bakta, unless customref.gbk already exists.
When --skip_assembly is used, BactStream can process multiple accessions at the same time. If --jobs is not supplied, it uses the available CPU allocation and the per-genome --threads value to choose the number of concurrent genome streams. For example, on a 16 CPU job:
BactStream \
--reference <reference.fasta> \
--sra_ids <accessions.txt> \
--baktadb <bakta-db> \
--skip_assembly \
--threads 4runs up to 4 genomes at once with 4 threads each. To force the stream count explicitly, add --jobs <N>.
For each completed accession, BactStream writes:
<accession>.filtered.snpeff.vcf
<accession>.consensus.aln.fasta
The consensus FASTA is reference-coordinate aligned. PASS SNPs and MNVs replace the reference bases, and query deletions are represented as - gap characters. Insertions relative to the reference are ignored in this FASTA so every sample remains the same length as the reference and can be concatenated or combined into a pseudoalignment.
For a single-contig reference, the FASTA record name is the accession. For multi-contig references, records are named <accession>|<contig>.
If any of your dependencies are not in the path they can be provided:
usage: BactStream [-h] --reference REFERENCE [--snpeff SNPEFF] [--gatk GATK]
[--samtools SAMTOOLS] --sra_ids SRA_IDS
[--trimgalore TRIMGALORE] [--cutadapt CUTADAPT]Combine whole reference SNP alignments, extract the core genome with BactCore and generated a phylogeny with IQTREE and ClonalFrameML
cat */*aln > wga.aln
BactCore wga.aln > core_alignment.fasta
iqtree -s core_alignment.fasta -B 1000
ClonalFrameML core_alignment.fasta.treefile core_alignment.fasta bactvarprojectBactVar is written in python3 and requires the following python packages:
- Biopython
BactVar has been built around bwa, GATK, snpEff etc. The most up to date versions (mid 2023) have been used:
- prokka
- Java ≥v17
- GATK (tested on v4.4.0)
- bwa (tested on 0.7.13-r1126)
- snpEff (tested on v5.1f)
- samtools (tested on v1.9)
It won't work with much earlier versions of GATK for example, so I reccomend installing these versions
For local installations of java I reccommend: https://pypi.org/project/install-jdk/
BWA:
Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]
SnpEff:
"A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.", Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672
BioPython:
Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
GATK:
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 20:1297-303. DOI: 10.1101/gr.107524.110.
You may also include the following for GATK:
DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet, 43:491-498. DOI: 10.1038/ng.806.
Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, Kling DE, Gauthier LD, Levy-Moonshine A, Roazen D, Shakir K, Thibault J, Chandran S, Whelan C, Lek M, Gabriel S, Daly MJ, Neale B, MacArthur DG, Banks E. (2017). Scaling accurate genetic variant discovery to tens of thousands of samples bioRxiv, 201178. DOI: 10.1101/201178
Van der Auwera GA & O'Connor BD. (2020). Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (1st Edition). O'Reilly Media.
PROKKA:
Seemann T. Prokka: rapid prokaryotic genome annotation, Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063