Introduction
============

`SIEVE `_ is a bioinformatics filtering and analysis pipeline for the assembly, binning and annotation of metagenomes, using either data mined from the EBI public database or local user data.

Pipeline summary
----------------

To analyse metagenomic datasets, users can supply their own data (the pipeline takes care of the trimming) or filter and collect data from the European public database EBI using the MGnify API.

The pipeline then:

* Identifies genes of interest using `diamond `_ .
* Performs assembly using `MEGAHIT `_ and predicts protein-coding genes for the assemblies using `Prodigal `_ .
* Identifies contigs that contain macromolecular systems or genetic pathways of interest with `MacSyFinder `_ .
* Extracts contigs of interest using `seqtk `_ and assigns taxonomy using `CAT `_ .
* Performs metagenome binning using `MaxBin2 `_ and `CONCOCT `_ and checks the quality of the genome bins using `miComplete `_ .
* Refines bins with `DAS Tool `_ .
* Assigns taxonomy to bins using `BAT `_ .

Furthermore, the pipeline creates various reports in the specified results directory, including a final table summarizing the main findings of the run. A Shiny app, `'Sieve app' `_ , is available to visualise the main results.

Basic usage
-----------

.. NOTE::
   If you are new to Nextflow, please refer to this `page `_ on how to set up Nextflow.

.. code-block:: console

   nextflow run . -with-singularity sieve.sif --resultsDir --cat_db --cat_taxonomy

.. WARNING::
   Please provide pipeline parameters via the command line or the Nextflow config file ``nextflow.config``.

For more details and further functionality, please refer to the :doc:`usage ` documentation and the :doc:`parameter ` documentation.

Pipeline output
---------------

To see the results of an example test run with a full-size dataset, refer to the results tab on the GitHub pipeline page.

For more details about the output files and reports, please refer to the :doc:`output ` documentation.

Inputs
------

The pipeline supports two types of input.

Local data
~~~~~~~~~~

Users can supply their own data by adding the flag ``--local``. All raw reads must be in the same directory and have the same extension, ``.fastq.gz``. On the command line, the user must also add ``--local_input`` with the path to the sample sheet (CSV format). The sample sheet lists, for each sample, the sample name, the paths of the corresponding raw read files and the biome lineage, separated by commas. A template is available `here `_ . It has the format: ``sample,read_1,read_2,biome``.

For more details, please refer to the :doc:`usage ` documentation.

.. WARNING::
   The 'local data' input option only works with short reads.

MGnify API
~~~~~~~~~~

The pipeline can be run with metagenomic data from the European public database EBI. The data are retrieved using the `MGnify API `_ .

The MGnify resource:

"Microbiome research involves the study of all genomes present within a specific environment. The approach can provide unique insights into the complex processes performed by environmental micro-organisms and their relationship to their surroundings, to each other, and, in some cases, to their host. MGnify offers an automated pipeline for the analysis and archiving of microbiome data to help determine the taxonomic diversity and functional & metabolic potential of environmental samples. Users can submit their own data for analysis or freely browse all of the analysed public datasets held within the repository. In addition, users can request analysis of any appropriate dataset within the European Nucleotide Archive (ENA). User-submitted or ENA-derived datasets can also be assembled on request, prior to analysis."

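
As an illustration of what filtering MGnify data by biome can look like, the sketch below builds a query for the public MGnify REST API and reads sample accessions from a response page. It is not the code SIEVE runs internally. The API root, the ``samples`` endpoint, the ``lineage``, ``experiment_type`` and ``page_size`` query parameters, and the JSON:API response layout (``data`` and ``links.next``) are assumptions about the MGnify service rather than details taken from this documentation, and all function names are hypothetical.

.. code-block:: python

   import json
   from urllib.parse import urlencode
   from urllib.request import urlopen

   # Assumption: root of the public MGnify REST API (not stated in the SIEVE docs).
   MGNIFY_API = "https://www.ebi.ac.uk/metagenomics/api/v1"


   def build_samples_url(lineage, experiment_type="metagenomic", page_size=25):
       """Return a URL listing MGnify samples for one biome lineage.

       The query parameter names are assumed, see the text above.
       """
       query = urlencode({
           "lineage": lineage,
           "experiment_type": experiment_type,
           "page_size": page_size,
       })
       return f"{MGNIFY_API}/samples?{query}"


   def sample_accessions(payload):
       """Extract the ``id`` of every entry in one JSON:API response page."""
       return [entry["id"] for entry in payload.get("data", [])]


   def fetch_accessions(lineage, max_pages=1):
       """Follow ``links.next`` for up to ``max_pages`` pages (needs network access)."""
       url = build_samples_url(lineage)
       accessions = []
       for _ in range(max_pages):
           with urlopen(url) as response:
               payload = json.load(response)
           accessions.extend(sample_accessions(payload))
           # Stop when the service reports no further page.
           url = payload.get("links", {}).get("next")
           if not url:
               break
       return accessions

Under these assumptions, ``fetch_accessions("root:Environmental:Aquatic")`` would collect the accessions listed on the first page of results; the lineage string is only an example value.
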
If you use the MGnify API option as input, please cite the article:

Lorna Richardson, Ben Allen, Germana Baldi, Martin Beracochea, Maxwell L Bileschi, Tony Burdett, Josephine Burgin, Juan Caballero-Pérez, Guy Cochrane, Lucy J Colwell, Tom Curtis, Alejandra Escobar-Zepeda, Tatiana A Gurbich, Varsha Kale, Anton Korobeynikov, Shriya Raj, Alexander B Rogers, Ekaterina Sakharova, Santiago Sanchez, Darren J Wilkinson, Robert D Finn, MGnify: the microbiome sequence data analysis resource in 2023, Nucleic Acids Research, Volume 51, Issue D1, 6 January 2023, Pages D753–D759, https://doi.org/10.1093/nar/gkac1080

For more details, please refer to the :doc:`input ` documentation.

Credits
-------

The SIEVE pipeline was written by Zelia Bontemps, Andrei Gulliaev and Lionel Guy at Uppsala University (Department of Medical Biochemistry and Microbiology).

We thank the MGnify team for their assistance in the development of this pipeline.

Citation
--------

If you use SIEVE, please cite the article: XXX