Introduction

SIEVE is a bioinformatics filtering and analysis pipeline for the assembly, binning and annotation of metagenomes, either mined from the EBI public database or supplied locally by the user.

Pipeline summary

To analyse metagenomic datasets, users can either supply their own data (the pipeline handles the read trimming) or filter and collect data from the European public database EBI through the MGnify API.

The pipeline then:

Identifies genes of interest using DIAMOND
Performs assembly using MEGAHIT and predicts protein-coding genes for the assemblies using Prodigal
Identifies contigs that contain macromolecular systems or genetic pathways of interest using MacSyFinder
Extracts contigs of interest using seqtk and assigns taxonomy using CAT
Performs metagenome binning using MaxBin2 and CONCOCT and checks the quality of the genome bins using miComplete
Refines bins with DAS Tool
Assigns taxonomy to bins using BAT

Furthermore, the pipeline writes several reports to the specified results directory, including a final table summarizing the main findings of the run. A Shiny app, ‘Sieve app’, is available to visualise the main results.

Basic usage

Note

If you are new to Nextflow, please refer to this page on how to set up Nextflow.

nextflow run . -with-singularity sieve.sif --resultsDir <OUTDIR> --cat_db <PATH/TO/CAT_database> --cat_taxonomy <PATH/TO/CAT_taxonomy>

Warning

Please provide pipeline parameters via the command line or the Nextflow config file nextflow.config.

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

To see the results of an example test run with a full-size dataset, refer to the results tab on the GitHub pipeline page. For more details about the output files and reports, please refer to the output documentation.

Inputs

The pipeline supports two types of input.

Local data

The user can enter their own data by adding the flag --local. All raw reads must be in the same directory and have the same extension .fastq.gz.

On the command line, the user must also pass --local_input with the path to the sample sheet (CSV format). The sample sheet lists, for each sample, the sample name, the paths of the corresponding raw read files and the biome lineage, separated by commas. A template is available here.

It has the format: sample,read_1,read_2,biome. For more details please refer to the usage documentation.
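Before launching a run, it can help to check the sample sheet against the constraints described above. The following is a minimal sketch, not part of SIEVE itself: the function name check_sample_sheet is hypothetical, and it assumes that the sheet starts with a header line, that both read_1 and read_2 are filled in for every sample, and that paths use forward slashes. The biome lineages and paths in the example are purely illustrative.

```python
import csv
import io
from pathlib import PurePosixPath

# Column order stated in the documentation; treating it as a header row is an assumption.
EXPECTED_HEADER = ["sample", "read_1", "read_2", "biome"]


def check_sample_sheet(text):
    """Return a list of human-readable problems found in a sample sheet."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or [cell.strip() for cell in rows[0]] != EXPECTED_HEADER:
        return ["first line must be: " + ",".join(EXPECTED_HEADER)]

    problems = []
    read_dirs = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_no}: expected 4 fields, found {len(row)}")
            continue
        sample, read_1, read_2, biome = (cell.strip() for cell in row)
        for read in (read_1, read_2):
            # All raw reads must share the .fastq.gz extension.
            if not read.endswith(".fastq.gz"):
                problems.append(f"line {line_no}: {read} does not end in .fastq.gz")
            read_dirs.add(str(PurePosixPath(read).parent))

    # All raw reads must live in the same directory.
    if len(read_dirs) > 1:
        problems.append(
            "reads are spread over several directories: " + ", ".join(sorted(read_dirs))
        )
    return problems


# Illustrative sheet: the second sample breaks both rules.
example = """sample,read_1,read_2,biome
S1,/data/reads/S1_R1.fastq.gz,/data/reads/S1_R2.fastq.gz,root:Environmental:Aquatic
S2,/data/reads/S2_R1.fastq.gz,/data/other/S2_R2.fq.gz,root:Environmental:Aquatic
"""

for problem in check_sample_sheet(example):
    print(problem)
```

An empty list means the sheet passed these checks. It does not guarantee that the pipeline will accept the sheet, since the template and the usage documentation remain the reference.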

Warning

The ‘local data’ input option only works with short reads.
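When runs are launched from a script, the basic command and the local-data flags can be assembled programmatically. The sketch below is an illustration only: the helper build_sieve_command and the example paths are hypothetical, and it assumes that --local and --local_input are simply appended to the basic command shown under "Basic usage". Only flags mentioned in this documentation are used.

```python
import shlex


def build_sieve_command(results_dir, cat_db, cat_taxonomy,
                        local_sheet=None, image="sieve.sif"):
    """Assemble the argument list for a SIEVE run (hypothetical helper)."""
    args = [
        "nextflow", "run", ".",
        "-with-singularity", image,      # Nextflow option, single dash
        "--resultsDir", results_dir,     # pipeline parameters, double dash
        "--cat_db", cat_db,
        "--cat_taxonomy", cat_taxonomy,
    ]
    if local_sheet is not None:
        # Local data: pass the sample sheet and switch on local mode.
        args += ["--local_input", local_sheet, "--local"]
    return args


cmd = build_sieve_command(
    "results", "db/CAT_database", "db/CAT_taxonomy", local_sheet="samples.csv"
)
# Print a shell-safe version of the command for inspection before running it.
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps paths containing spaces intact if the command is later passed to subprocess.run.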

MGnify API

The pipeline can be run with metagenomic data from the European public database EBI. The data are retrieved using the MGnify API.

The MGnify resource:

“Microbiome research involves the study of all genomes present within a specific environment. The approach can provide unique insights into the complex processes performed by environmental micro-organisms and their relationship to their surroundings, to each other, and, in some cases, to their host.

MGnify offers an automated pipeline for the analysis and archiving of microbiome data to help determine the taxonomic diversity and functional & metabolic potential of environmental samples. Users can submit their own data for analysis or freely browse all of the analysed public datasets held within the repository. In addition, users can request analysis of any appropriate dataset within the European Nucleotide Archive (ENA). User-submitted or ENA-derived datasets can also be assembled on request, prior to analysis.”

If you use the MGnify API option as input please cite the article: Lorna Richardson, Ben Allen, Germana Baldi, Martin Beracochea, Maxwell L Bileschi, Tony Burdett, Josephine Burgin, Juan Caballero-Pérez, Guy Cochrane, Lucy J Colwell, Tom Curtis, Alejandra Escobar-Zepeda, Tatiana A Gurbich, Varsha Kale, Anton Korobeynikov, Shriya Raj, Alexander B Rogers, Ekaterina Sakharova, Santiago Sanchez, Darren J Wilkinson, Robert D Finn, MGnify: the microbiome sequence data analysis resource in 2023, Nucleic Acids Research, Volume 51, Issue D1, 6 January 2023, Pages D753–D759, https://doi.org/10.1093/nar/gkac1080

For more details, please refer to the input documentation.

Credits

The SIEVE pipeline was written by Zelia Bontemps, Andrei Gulliaev and Lionel Guy at Uppsala University (Department of Medical Biochemistry and Microbiology).

We thank the MGnify team for their assistance in the development of this pipeline.

Citation

If you use SIEVE, please cite the article: XXX