Usage

MGnify API input

To use data from the MGnify API, specify one or more of the following: a sample accession, study accession, experiment type, pipeline version, instrument platform, instrument model, or biome name. Valid examples could look like the following:

--sample_accession ERR2136697
--study_accession MGYS00001946
--experiment_type metagenomic
--pipeline_version 2.0
--instrument_platform Illumina
--instrument_model 'Illumina MiSeq'
--biome_name aquatic

For more details please refer to the parameters documentation.

Please note the following requirements:

You must specify at least one of these parameters on the command line
If the command line parameter --noapi is specified, all parameters related to the API input are ignored

You can apply taxonomy filters to these data before the download step. Filters can be specified at each taxonomic level directly on the command line. Valid examples could look like the following:

--taxonomy_phylum proteobacteria
--taxonomy_class alphaproteobacteria
--taxonomy_order ['legionellales','thiotrichales']
--taxonomy_family legionellaceae
--taxonomy_genus legionella
--taxonomy_species 'legionella pneumophila'

For more details please refer to the usage documentation.
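Values that contain spaces, such as an instrument model or a species name, must be quoted so that the shell passes them as a single argument. The following is a minimal sketch of assembling such a command in Python. `build_command` is a hypothetical helper, not part of sieve; it only joins strings and does not run Nextflow, and a real command also needs the options described under "Running the pipeline".

```python
import shlex


def build_command(filters):
    """Return a shell-safe command string for a {parameter: value} mapping.

    Hypothetical helper for illustration only; parameter names are taken
    from the examples in this section and are not validated here.
    """
    if not filters:
        raise ValueError("at least one input parameter is required")
    args = ["nextflow", "run", "main.nf"]
    for name, value in filters.items():
        args += [f"--{name}", str(value)]
    # shlex.join (Python 3.8+) quotes any value containing spaces
    return shlex.join(args)


command = build_command({
    "biome_name": "aquatic",
    "instrument_model": "Illumina MiSeq",
    "taxonomy_species": "legionella pneumophila",
})
print(command)
```

The printed string can be pasted into a terminal; values without spaces are left unquoted.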

Warning

Some biomes in MGnify have a large number of analyses associated with them, for example the human digestive system. sieve can handle these large biomes, but the pipeline will take a long time to complete. Adding more filters, such as a taxonomy filter or a filter based on user-provided genes, decreases the completion time significantly.

Local data input

To use your own local data, you can specify a CSV samplesheet input file that contains the paths to your FASTQ files and additional metadata.

Note

By default, the input data comes from the MGnify API. If you want to use your own data as input, you have to specify --local on the command line.

At a minimum, the CSV file should contain the following columns: sample,read_1,read_2,biome

The path to read_2 is optional. Valid examples could look like the following:

sample,read_1,read_2,biome
sample1,data/sample1_R1.fastq.gz,data/sample1_R2.fastq.gz,root:Environmental:Terrestrial:Soil:Mine
sample2,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,root:Environmental:Terrestrial:Soil
sample3,data/sample3_R1.fastq.gz,data/sample3_R2.fastq.gz,root:Environmental:Terrestrial:Soil:Cave

or

sample,read_1,read_2,biome
sample1,data/sample1.fastq.gz,,root:Environmental:Terrestrial:Soil:Mine
sample2,data/sample2.fastq.gz,,root:Environmental:Terrestrial
sample3,data/sample3.fastq.gz,,root:Environmental:Terrestrial:Soil:Cave
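A samplesheet like the paired-end example above can also be generated from a list of file paths. The sketch below assumes files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz` and that all samples share one biome lineage; both are assumptions made for illustration, and `samplesheet_rows` and `to_csv` are hypothetical helpers, not part of sieve. Files that do not follow this naming are ignored.

```python
import csv
import io

HEADER = ["sample", "read_1", "read_2", "biome"]


def samplesheet_rows(fastq_paths, biome):
    """Group <sample>_R1/_R2.fastq.gz paths into one row per sample."""
    samples = {}
    for path in sorted(fastq_paths):
        name = path.rsplit("/", 1)[-1]
        for tag, column in (("_R1.fastq.gz", "read_1"), ("_R2.fastq.gz", "read_2")):
            if name.endswith(tag):
                sample = name[: -len(tag)]
                samples.setdefault(sample, {"read_1": "", "read_2": ""})[column] = path
    return [[s, r["read_1"], r["read_2"], biome] for s, r in sorted(samples.items())]


def to_csv(rows):
    """Serialize the rows, preceded by the required header, as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


paths = [
    "data/sample1_R1.fastq.gz",
    "data/sample1_R2.fastq.gz",
    "data/sample2_R1.fastq.gz",
    "data/sample2_R2.fastq.gz",
]
print(to_csv(samplesheet_rows(paths, "root:Environmental:Terrestrial:Soil")), end="")
```

The output can be saved to a file with a .csv extension and adjusted by hand where samples belong to different biomes.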

Please note the following requirements:

A minimum of 4 comma-separated columns
Valid file extension: .csv
Must contain the header sample,read_1,read_2,biome
FASTQ files must be compressed (.fastq.gz, .fq.gz)
Within one samplesheet, only one type of input can be specified: single-end reads, assembled reads, or paired-end reads
If single-end reads are specified, the command line parameter --single_end must be specified as well
If assembled reads are specified, the command line parameter --assembly_input must be specified as well
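Several of these requirements can be checked before launching the pipeline. The following is a minimal sketch, not the check that sieve itself performs: `check_samplesheet` is a hypothetical helper that covers FASTQ input only (the file format for assembled reads is not described here, so it is not handled) and does not verify that the files exist or that the biome lineage is valid.

```python
import csv
import io

EXPECTED_HEADER = ["sample", "read_1", "read_2", "biome"]
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")


def check_samplesheet(filename, text):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not filename.endswith(".csv"):
        problems.append("file extension must be .csv")
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0][:4] != EXPECTED_HEADER:
        problems.append("header must be sample,read_1,read_2,biome")
        return problems
    layouts = set()
    for number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) < 4:
            problems.append(f"line {number}: fewer than 4 columns")
            continue
        read_1, read_2 = row[1], row[2]
        if not read_1:
            problems.append(f"line {number}: read_1 is empty")
        for path in (read_1, read_2):
            if path and not path.endswith(FASTQ_SUFFIXES):
                problems.append(f"line {number}: {path} is not a compressed FASTQ file")
        # An empty read_2 field marks a single-end row
        layouts.add("paired" if read_2 else "single")
    if len(layouts) > 1:
        problems.append("single-end and paired-end rows are mixed")
    return problems


sheet = (
    "sample,read_1,read_2,biome\n"
    "sample1,data/sample1_R1.fastq.gz,data/sample1_R2.fastq.gz,root:Environmental:Terrestrial:Soil\n"
    "sample2,data/sample2.fastq.gz,,root:Environmental:Terrestrial:Soil\n"
)
for problem in check_samplesheet("samplesheet.csv", sheet):
    print(problem)
```

In this usage example the two rows mix paired-end and single-end reads, so the sketch reports that the layouts are mixed.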

Warning

Please provide the biome lineage using the same nomenclature as MGnify. If you don’t know the biome lineage, you can find it on the MGnify website (browse biomes data).

Note

A sample sheet template is available on the GitHub repository.

Running the pipeline

The typical command for running the pipeline is as follows:

nextflow run main.nf -with-singularity sieve01.sif --resultsDir <OUTDIR> --cat_db <PATH/TO/CAT_database> --cat_taxonomy <PATH/TO/CAT_taxonomy>

Note that the pipeline will create the following files in your working directory:

work                # Directory containing the nextflow working files
<OUTDIR>            # Finished results in specified location (defined with --resultsDir)
.nextflow.log       # Log file from Nextflow

How to skip steps

Some of the pipeline steps are optional, such as the identification of genes of interest, the identification of macromolecular systems, and the use of each binning tool. If you want to skip one or all of these steps, you can specify this directly on the command line.

Valid examples could look like the following:

--noapi                # Skip the API processes
--nodiamond            # Skip the identification of genes
--nomacsyfinder        # Skip the identification of macromolecular systems
--nomaxbin2            # Skip binning with maxbin2
--noconcoct            # Skip binning with concoct

Need help writing the command line?

We have developed a Shiny app with a ‘command_generator’ tab that generates the command line through a graphical interface.