Output

Introduction

This section describes the output produced by the pipeline.

Pipeline overview

The sieve pipeline is built using Nextflow and processes data using the following steps:

Control of local input reads (trimming) - (only if --local is set)
Fetching data from MGnify API (filters and download) - (only if --noapi is not set)
Identification of genes of interest on raw trimmed reads - (only if --nodiamond is not set)
Assembly of trimmed reads
Protein-coding gene prediction on assemblies
Identification of contigs that contain macromolecular systems or genetic pathways of interest
Taxonomic classification of the targeted contigs
Binning and binning refinement of targeted contigs
Genome annotation of binned genomes
Taxonomic classification of binned genomes
Summary
Pipeline information - Metrics generated during the workflow execution
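
The conditions attached to the first three steps can be summarised in a short Python sketch. It is only an illustration of the list above: the function name and its boolean arguments are hypothetical stand-ins for the local, noapi and nodiamond options, and the pipeline itself is written in Nextflow, not Python.

```python
def planned_steps(local=False, noapi=False, nodiamond=False):
    """Return the overview steps that would run for a given set of options.

    The arguments mirror the local, noapi and nodiamond options described
    above. Illustration only; this is not code from the pipeline.
    """
    steps = []
    if local:
        steps.append("Control of local input reads (trimming)")
    if not noapi:
        steps.append("Fetching data from MGnify API")
    if not nodiamond:
        steps.append("Identification of genes of interest")
    # The remaining steps carry no condition in the overview.
    steps += [
        "Assembly of trimmed reads",
        "Gene prediction",
        "Identification of macromolecular systems",
        "Taxonomic classification of the targeted contigs",
        "Binning and binning refinement",
        "Genome annotation of binned genomes",
        "Taxonomic classification of binned genomes",
        "Summary",
    ]
    return steps


if __name__ == "__main__":
    # Local reads only: no MGnify download, no DIAMOND search.
    for step in planned_steps(local=True, noapi=True, nodiamond=True):
        print(step)
```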

Control of local input reads

AdapterRemoval

AdapterRemoval searches for and removes remnant adapter sequences from high-throughput sequencing (HTS) data and optionally trims low-quality bases from the 3' end of reads following adapter removal. The output logs are stored in the results folder.

Output files:

[sample]_trimSE.fastq.gz if --single-end is specified

[sample]_trimPE.fastq.gz for paired-end reads

Fetching data from MGnify API

Getting accession

Retrieves all analysis accessions from the MGnify API that match the parameters given on the command line.

Output files:

in outdir/acession/

accession.csv

Targeting taxonomy

Applies taxonomic filters to the accessions obtained in the previous step.

Output files:

in outdir/taxonomy/

[sample]_ID_to_download.csv

[sample]_taxonomy_details.csv

Downloading

Downloads the data for each selected accession number.

Output files:

[sample].fastq.gz

Identification of genes of interest

Generating diamond database

Creates a DIAMOND-formatted reference database from a FASTA input file.

Output files:

references.dmnd

references.fasta

Identification of target genes

DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of large sequence datasets.

Output files:

[sample].daa
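
The .daa file is DIAMOND's binary alignment archive, so it cannot be read directly as text. A minimal sketch for converting it to BLAST-style tabular output with `diamond view`, assuming DIAMOND is installed and on the PATH. The helper names and file paths are hypothetical and not part of the pipeline.

```python
import subprocess


def daa_to_tabular_cmd(daa_path, tsv_path):
    """Build a `diamond view` command that writes tabular output (format 6)."""
    return [
        "diamond", "view",
        "--daa", daa_path,   # input alignment archive
        "--outfmt", "6",     # BLAST tabular format
        "--out", tsv_path,   # output file
    ]


def daa_to_tabular(daa_path, tsv_path):
    """Run the conversion; raises CalledProcessError if DIAMOND fails."""
    subprocess.run(daa_to_tabular_cmd(daa_path, tsv_path), check=True)


if __name__ == "__main__":
    # Hypothetical sample name; only prints the command, does not run it.
    print(" ".join(daa_to_tabular_cmd("sample1.daa", "sample1.tsv")))
```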

Assembly

Reads are assembled with MEGAHIT, a single-node assembler for large and complex metagenomic short-read datasets.

Output files:

[sample]_assembly_MG.fasta if experiment type is metagenomic

[sample]_assembly_AS.fasta if experiment type is assembly

Gene prediction

Protein-coding genes are predicted for each assembly using Prodigal.

Output files:

[sample].faa
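
As a quick check of a [sample].faa file, the sketch below counts predicted proteins per contig. It assumes Prodigal's usual header convention, in which each protein ID is the contig name followed by an underscore and a gene index; if the pipeline renames sequences, this assumption does not hold. The function name and the example headers are hypothetical.

```python
from collections import Counter


def proteins_per_contig(faa_lines):
    """Count protein records per contig from FASTA header lines.

    Assumes IDs of the form <contig>_<gene_index>, as Prodigal usually writes.
    """
    counts = Counter()
    for line in faa_lines:
        if not line.startswith(">"):
            continue  # sequence line
        fields = line[1:].split()
        if not fields:
            continue  # empty header, nothing to parse
        # Split off the trailing gene index to recover the contig name.
        contig, _, _gene_index = fields[0].rpartition("_")
        counts[contig] += 1
    return counts


if __name__ == "__main__":
    example = [
        ">k141_1_1 # 2 # 181 # 1 # ID=1_1",
        "MKVLA*",
        ">k141_1_2 # 200 # 500 # -1 # ID=1_2",
        "MSTQ*",
        ">k141_7_1 # 3 # 300 # 1 # ID=2_1",
        "MAAK*",
    ]
    print(proteins_per_contig(example))
```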

Identification of macromolecular systems

MacSyFinder is a program to model and detect macromolecular systems, genetic pathways… in protein datasets. Criteria for system detection include component content (quorum) and genomic co-localization. Each component corresponds to a hidden Markov model (HMM) protein profile, which is used to perform sequence similarity searches with the program HMMER.
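
The two criteria named above, quorum and co-localization, can be illustrated with a deliberately simplified toy example. This is not MacSyFinder's algorithm, whose models and scoring are richer; the function, the parameter names (max_gap, min_components) and the example hits are all hypothetical.

```python
def find_candidate_systems(hits, max_gap=5, min_components=3):
    """Group hits into co-localized clusters and keep those reaching a quorum.

    hits: iterable of (gene_position, component_name) on one replicon.
    max_gap: largest allowed distance between neighbouring hits in a cluster.
    min_components: number of distinct components required (the quorum).
    """
    clusters = []
    current = []
    for pos, name in sorted(hits):
        # Start a new cluster when the next hit is too far from the last one.
        if current and pos - current[-1][0] > max_gap:
            clusters.append(current)
            current = []
        current.append((pos, name))
    if current:
        clusters.append(current)
    # Quorum check: enough distinct components within the cluster.
    return [c for c in clusters if len({name for _, name in c}) >= min_components]


if __name__ == "__main__":
    hits = [(10, "A"), (12, "B"), (14, "C"), (40, "A"), (41, "B")]
    # Only the first group is both co-localized and complete enough.
    print(find_candidate_systems(hits))
```

In this toy run the hits at positions 40 and 41 are close together but cover only two components, so they fail the quorum and are discarded.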

Output files:

in <outdir>/contig/
- [sample]_contig.fasta
- [sample]_contig_name_deduplicated.txt

in work dir
- [sample]_contig_names.txt
- [sample]_contig_proteins.faa.idx

in out_macsyfinder/[sample]/
- all_best_solutions.tsv
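
A minimal sketch for loading all_best_solutions.tsv in Python, assuming the file is tab-separated and may begin with comment lines starting with "#". Column names are taken from the file's own header row rather than hard-coded; the function name and the tiny in-memory example (including its column names) are made up for illustration.

```python
import csv
import io


def read_tsv_skipping_comments(handle):
    """Read a tab-separated file into a list of dicts, ignoring '#' lines."""
    rows = (
        line for line in handle
        if line.strip() and not line.startswith("#")
    )
    # The first remaining line is treated as the header row.
    return list(csv.DictReader(rows, delimiter="\t"))


if __name__ == "__main__":
    # Made-up content standing in for a real file.
    fake_file = io.StringIO("# comment line\ncol_a\tcol_b\n1\t2\n")
    for record in read_tsv_skipping_comments(fake_file):
        print(record)
```

With a real file, pass an open handle instead, for example `open(path, newline="")`, where `path` points to the file inside out_macsyfinder/[sample]/.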

Taxonomic classification of the targeted contigs

CAT is a toolkit for annotating contigs and bins from metagenome-assembled-genomes. The sieve pipeline uses CAT to assign taxonomy to targeted contigs.

Output files:

in <outdir>/contig/classification/
- [sample]_classification_summary.txt

in work dir
- [sample].alignment.diamond
- [sample].contig2classification.txt
- [sample].ORF2LCA.txt
- [sample]classification_names.txt
- [sample]classification_official_names.txt
- [sample].log
- [sample].predicted.proteins.faa
- [sample].predicted.proteins.gff
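
To get a quick overview of a [sample].contig2classification.txt file, the sketch below tallies the values found in its second column. It assumes CAT's usual layout: tab-separated, a header line starting with "#", the contig name in the first column and the classification status in the second. The function name and the example rows are hypothetical.

```python
from collections import Counter


def classification_counts(lines):
    """Count how often each value appears in the second tab-separated column."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:
            counts[fields[1]] += 1
    return counts


if __name__ == "__main__":
    example = [
        "# contig\tclassification\treason\tlineage\tlineage scores\n",
        "k141_1\ttaxid assigned\tbased on 4/5 ORFs\t1;2\t1.00;0.90\n",
        "k141_2\tno taxid assigned\tno ORFs found\n",
        "k141_3\ttaxid assigned\tbased on 2/2 ORFs\t1;2\t1.00;1.00\n",
    ]
    print(classification_counts(example))
```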

Genome annotation of binned genomes

miComplete also performs protein-coding gene prediction for each bin that matches the bin quality criteria defined by the user.

Output file:

[bin_name]_profigal.faa

Taxonomic classification of binned genomes

BAT is a toolkit for annotating contigs and bins from metagenome-assembled genomes. The sieve pipeline uses BAT to assign taxonomy to genome bins based on the taxonomy of their contigs.

Output files:

in work dir
- [bin].ORF2LCA.txt
- [bin]classification_names.txt
- [bin]classification_official_names.txt
- [bin].log

Summary

Generates the general statistics table and plot for the pipeline.

Output file:

in <outdir>/
- results_summary.tsv

Important

To visualize the main statistical results, upload the file results_summary.tsv directly to the sieve Shiny app.
