Output
Introduction
This section describes the output produced by the pipeline.
Pipeline overview
The sieve pipeline is built using Nextflow and processes data using the following steps:
Control of local input reads (trimming) - (Only if --local is set)
Fetching data from MGnify API (Filters and download) - (Only if --noapi is not set)
Identification of genes of interest on raw trimmed reads - (Only if --nodiamond is not set)
Assembly of trimmed reads
Protein-coding gene prediction of assemblies
Identification of contigs that contain macromolecular systems or genetic pathways of interest
Taxonomic classification of the targeted contigs
Binning and binning refinement of targeted contigs
Genome annotation of binned genomes
Taxonomic classification of binned genomes
Summary
Pipeline information - Metrics generated during the workflow execution
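The conditional steps in the overview can be summarised with a small sketch. The function below is purely illustrative and is not part of the pipeline: the name planned_steps and its keyword arguments are hypothetical stand-ins for the --local, --noapi and --nodiamond options, and the step labels are taken from the section titles of this page.

```python
def planned_steps(local=False, noapi=False, nodiamond=False):
    """Return the ordered step names expected for a flag combination.

    Hypothetical helper: the arguments mirror the --local, --noapi and
    --nodiamond options described in the overview.
    """
    steps = []
    if local:
        # Trimming only applies to local input reads.
        steps.append("Control of local input reads")
    if not noapi:
        # MGnify filtering and download is skipped with --noapi.
        steps.append("Fetching data from MGnify API")
    if not nodiamond:
        # The DIAMOND search is skipped with --nodiamond.
        steps.append("Identification of genes of interest")
    # The remaining steps are listed without a condition in the overview.
    steps += [
        "Assembly",
        "Gene prediction",
        "Identification of macromolecular systems",
        "Taxonomic classification of the targeted contigs",
        "Binning and binning refinement",
        "Genome annotation of binned genomes",
        "Taxonomic classification of binned genomes",
        "Summary",
    ]
    return steps
```

For example, planned_steps(local=True, noapi=True) would describe a run on local reads only, starting with the trimming step and skipping the MGnify download.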
Control of local input reads
AdapterRemoval
AdapterRemoval searches for and removes remnant adapter sequences from High-Throughput Sequencing (HTS) data and optionally trims low-quality bases from the 3’ end of reads following adapter removal. The output logs are stored in the results folder.
Output files:
[sample]_trimSE.fastq.gz if --single-end is specified
[sample]_trimPE.fastq.gz for paired-end reads
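As a quick illustration of the naming rule above, the following sketch derives the expected trimmed-read file name from a sample name. The helper trimmed_read_name is hypothetical and only mirrors the two file patterns listed here; it does not reproduce the pipeline's own code.

```python
def trimmed_read_name(sample, single_end=False):
    """Build the expected trimmed-read file name for a sample.

    Hypothetical helper: single_end stands in for the --single-end option.
    """
    suffix = "trimSE" if single_end else "trimPE"
    return f"{sample}_{suffix}.fastq.gz"
```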
Fetching data from MGnify API
Getting accession
Retrieves all analysis accessions from the MGnify API that match the parameters given on the command line.
Output files:
in outdir/acession/
accession.csv
Targeting taxonomy
Applies taxonomic filters to the accessions obtained in the previous step.
Output files:
in outdir/taxonomy/
[sample]_ID_to_download.csv
[sample]_taxonomy_details.csv
Downloading
Downloads the data using the accession numbers.
Output files:
[sample].fastq.gz
Identification of genes of interest
Generating DIAMOND database
Creates a DIAMOND-formatted reference database from a FASTA input file.
Output files:
references.dmnd
references.fasta
Identification of target genes
DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data.
Output files:
[sample].daa
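The two DIAMOND steps above can be sketched as command lines. This is an assumption-laden illustration, not the pipeline's actual invocation: it assumes a translated search of nucleotide reads (diamond blastx) and DAA output (--outfmt 100), and the helper name diamond_commands and the db_prefix default are hypothetical. The pipeline may pass additional or different options.

```python
def diamond_commands(reference_fasta, reads, sample, db_prefix="references"):
    """Build argument lists for a DIAMOND database build and a read search.

    Hypothetical helper; the exact options used by the pipeline are not
    documented here, so these are plausible defaults only.
    """
    # 'diamond makedb' writes <db_prefix>.dmnd from a protein FASTA file.
    makedb = ["diamond", "makedb", "--in", reference_fasta, "--db", db_prefix]
    # 'diamond blastx' aligns translated DNA reads against the protein database;
    # output format 100 is the DIAMOND alignment archive (.daa).
    search = [
        "diamond", "blastx",
        "--db", db_prefix,
        "--query", reads,
        "--out", f"{sample}.daa",
        "--outfmt", "100",
    ]
    return makedb, search
```

Each list can be passed to subprocess.run(cmd, check=True) on a machine where DIAMOND is installed.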
Assembly
Reads are assembled with MEGAHIT. MEGAHIT is a single-node assembler for large and complex metagenomic short reads.
Output files:
[sample]_assembly_MG.fasta if experiment type is metagenomic
[sample]_assembly_AS.fasta if experiment type is assembly
Gene prediction
Protein-coding genes are predicted for each assembly using Prodigal.
Output files:
[sample].faa
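To make the content of [sample].faa more concrete, here is a minimal sketch of reading one Prodigal protein header. It assumes Prodigal's usual header layout, in which the protein ID is followed by start, end, strand and an attribute string, separated by " # ", and in which the protein ID is the contig name plus a gene index. The function name parse_prodigal_header and the example contig name are hypothetical.

```python
def parse_prodigal_header(header):
    """Split a Prodigal FASTA header into its positional fields.

    Assumes the layout '>ID # start # end # strand # attributes'.
    """
    fields = [f.strip() for f in header.lstrip(">").split(" # ")]
    protein_id, start, end, strand = fields[:4]
    # Prodigal names proteins '<contig>_<gene index>', so the contig name
    # is everything before the last underscore.
    contig = protein_id.rsplit("_", 1)[0]
    return {
        "protein_id": protein_id,
        "contig": contig,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
    }
```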
Identification of macromolecular systems
MacSyFinder is a program to model and detect macromolecular systems and genetic pathways in protein datasets. Criteria for system detection include component content (quorum) and genomic co-localization. Each component corresponds to a hidden Markov model (HMM) protein profile, which is used for sequence similarity searches with HMMER.
Output files:
in <outdir>/contig/
[sample]_contig.fasta
[sample]_contig_name_deduplicated.txt
in work dir
[sample]_contig_names.txt
[sample]_contig_proteins.faa.idx
in out_macsyfinder/[sample]/
all_best_solutions.tsv
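The relation between the MacSyFinder hits and the deduplicated contig names can be sketched as follows. This is an illustration under assumptions, not the pipeline's implementation: it assumes the table is tab-separated, that comment lines start with "#", that a hit_id column holds Prodigal-style protein IDs, and that the contig name is the protein ID without its trailing gene index. The function name contigs_with_systems is hypothetical.

```python
import csv


def contigs_with_systems(tsv_text, hit_column="hit_id"):
    """Return unique contig names, in first-seen order, from a hits table.

    Assumes a tab-separated table with '#' comment lines and a column of
    Prodigal-style protein IDs ('<contig>_<gene index>').
    """
    lines = [
        line for line in tsv_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    reader = csv.DictReader(lines, delimiter="\t")
    seen = {}
    for row in reader:
        contig = row[hit_column].rsplit("_", 1)[0]
        # Dict keys keep insertion order, which gives an ordered de-duplication.
        seen.setdefault(contig, None)
    return list(seen)
```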
Taxonomic classification of the targeted contigs
CAT is a toolkit for annotating contigs and bins from metagenome-assembled-genomes. The sieve pipeline uses CAT to assign taxonomy to targeted contigs.
Output files:
in <outdir>/contig/classification/
[sample]_classification_summary.txt
in work dir
[sample].alignment.diamond
[sample].contig2classification.txt
[sample].ORF2LCA.txt
[sample]classification_names.txt
[sample]classification_official_names.txt
[sample].log
[sample].predicted.proteins.faa
[sample].predicted.proteins.gff
Binning and binning refinement
Contig coverage
This step creates a bwa index, aligns the reads with bwa mem, converts the SAM file to a sorted BAM file using samtools, indexes the BAM file, outputs per-contig coverage using pileup.sh, and generates an abundance file from the mapped reads. These files are used by the downstream binning steps.
Output files:
[sample]_abundance.txt
[sample]_aln.bam
[sample]_aln.sam
[sample]_cov.txt
[sample]_index.amb/ann/bwt/pac/sa
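How an abundance file can be derived from a per-contig coverage table is sketched below. The layout is an assumption: a tab-separated table whose header line starts with "#ID" and contains an Avg_fold column, as BBMap's pileup.sh typically writes. Which column the pipeline actually uses is not stated here, so depth_column is a hypothetical parameter.

```python
def abundance_from_coverage(cov_text, depth_column="Avg_fold"):
    """Reduce a coverage table to 'contig<TAB>depth' lines.

    Assumes a tab-separated table whose first line is a '#'-prefixed header.
    """
    lines = cov_text.splitlines()
    header = lines[0].lstrip("#").split("\t")
    depth_index = header.index(depth_column)
    out = []
    for line in lines[1:]:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and any further comment lines
        fields = line.split("\t")
        out.append(f"{fields[0]}\t{fields[depth_index]}")
    return "\n".join(out)
```

The result is a two-column, headerless table of the kind that binners such as MaxBin2 accept as an abundance file.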
MaxBin2
MaxBin2 recovers genome bins (that is, contigs/scaffolds that all belong to the same organism) from metagenome assemblies.
Output files:
[sample]_maxbin.*.fasta
[sample]_maxbin.log
[sample]_maxbin.marker
[sample]_maxbin.marker_of_each_bin.tar.gz
[sample]_maxbin.noclass
[sample]_maxbin.summary
[sample]_maxbin.tooshort
CONCOCT
CONCOCT performs unsupervised binning of metagenomic contigs by using nucleotide composition, coverage data in multiple samples and linkage data from paired end reads.
Output files:
in workdir/[sample]_concot_bins/
[sample]_concoct.*.fa
in work dir
[sample]_args.txt
[sample]_clustering_gt1000.csv
[sample]_clustering_merged.csv
[sample]_concoct.contigs2bin.tsv
[sample]_contigs_10K.bed
[sample]_contigs_10K.fasta
[sample]_coverage_table.tsv
[sample]_original_data_gt1000.csv
[sample]_PCA_components_data_gt1000.csv
[sample]_PCA_transformed_data_gt1000.csv
[sample]_log.txt
DASTool
DAS Tool is an automated binning refinement method that integrates the results of a flexible number of binning algorithms to calculate an optimized, non-redundant set of bins from a single assembly. The sieve pipeline uses this tool to attempt to further improve bins by combining the CONCOCT and MaxBin2 binning output, assuming sufficient quality is met for those bins.
DAS Tool will remove contigs from bins that do not pass additional filtering criteria, and will discard redundant lower-quality output from binners that represent the same estimated ‘organism’, until the single highest quality bin is represented.
Warning
If DAS Tool does not find any bins passing your selected threshold, it will exit with an error.
Output files:
in workdir/[sample]_DASTool_bins/
[sample]_concoct.*.fa and/or [sample]_maxbin.*.fasta
in work dir
all_prot.dmnd
[sample]_allBins.eval
[sample]_DASTool_contig2bin.tsv
[sample]_DASTool_summary.tsv
[sample]_maxbin.contigs2bin.tsv
[sample]_proteins.faa/all.b6/bacteria.scg/findSCG.b6/scg.candidates.faa
[sample].seqlength
[sample]_DASTool.log
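The contigs2bin tables listed above map each contig to the bin it was assigned to, which is the input DAS Tool expects from each binner. A minimal sketch of building such a mapping from bin FASTA files is shown here. The function name contigs_to_bins and the in-memory input (bin name mapped to FASTA text) are hypothetical simplifications; the pipeline may generate these tables differently.

```python
def contigs_to_bins(bin_fastas):
    """Return (contig, bin) pairs from a mapping of bin name to FASTA text.

    Hypothetical helper: each FASTA header contributes one row, using the
    first whitespace-delimited token of the header as the contig name.
    """
    rows = []
    for bin_name, fasta_text in bin_fastas.items():
        for line in fasta_text.splitlines():
            if line.startswith(">"):
                contig = line[1:].split()[0]
                rows.append((contig, bin_name))
    return rows


def to_tsv(rows):
    """Format the pairs as tab-separated lines, one contig per line."""
    return "\n".join(f"{contig}\t{bin_name}" for contig, bin_name in rows)
```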
miComplete
miComplete is a compact software tool aimed at rapidly and accurately determining the quality of assembled genomes, often metagenome-assembled bins. It also aims to provide a more reliable completeness and redundancy metric by weighting the presence or absence of different marker genes differently.
Output files:
[bin_name].fna
[bin_name]_bins_stats_quality.tab
micomplete.log
Genome annotation of binned genomes
miComplete also predicts protein-coding genes for each bin that meets the bin quality criteria defined by the user.
Output file:
[bin_name]_profigal.faa
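Selecting the bins that meet user-defined quality criteria can be pictured with the sketch below. Everything specific in it is an assumption: the column names Name, Completeness and Redundancy, the convention that completeness is a fraction and redundancy is close to 1.0 for a clean genome, and the default thresholds. The pipeline's real criteria and option names are not documented here.

```python
def passing_bins(rows, min_completeness=0.9, max_redundancy=1.1):
    """Return the names of bins that satisfy both quality thresholds.

    rows is a list of dicts, for example from csv.DictReader over the
    quality table. Column names and thresholds are illustrative assumptions.
    """
    selected = []
    for row in rows:
        completeness = float(row["Completeness"])
        redundancy = float(row["Redundancy"])
        if completeness >= min_completeness and redundancy <= max_redundancy:
            selected.append(row["Name"])
    return selected
```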
Taxonomic classification of binned genomes
BAT is a toolkit for annotating contigs and bins from metagenome-assembled-genomes. The sieve pipeline uses BAT to assign taxonomy to genome bins based on the taxonomy of the contigs.
Output files:
in work dir
[bin].ORF2LCA.txt
[bin]classification_names.txt
[bin]classification_official_names.txt
[bin].log
Summary
Generate the general stats table and plot for the pipeline.
Output file:
in <outdir>/
results_summary.tsv
Important
To visualize the main statistical results, upload the file results_summary.tsv directly to the sieve Shiny app.