Workflow Parameters
The workflow parameters should be included in a configuration file, an example of which can be found at https://raw.githubusercontent.com/mriffle/nf-skyline-dia-ms/main/resources/pipeline.config
The parameters in this file should be changed to indicate the locations of your data, the options you’d like to use for the software included in the workflow, and the capabilities and configuration for the system on which you are running the workflow steps.
The configuration file is roughly organized as:
params {
...
}
profiles {
...
}
mail {
...
}
The params section includes locations of data and configuration options for a specific run of the workflow.
The profiles section includes parameters that describe the capabilities of the systems that run the steps of the workflow. For example, if running on your local system, this will include things like how many cores and how much RAM may be used by the steps of the workflow. This will not need to be changed for each run of the workflow.
The mail section includes configuration options for sending email. This is optional and only necessary if you wish to send emails when the workflow completes. This will not need to be changed for each run of the workflow.
Below is a complete description of all parameters that may be included in these sections.
Note
This workflow can process files stored in PanoramaWeb. When specifying directories or file locations, any paths that begin with https:// will be interpreted as being PanoramaWeb locations.
For example, to process raw files stored in PanoramaWeb, you would have the following in your pipeline.config file:
quant_spectra_dir = 'https://panoramaweb.org/_webdav/path/to/@files/RawFiles/'
Here, https://panoramaweb.org/_webdav/path/to/@files/RawFiles/ is the WebDAV URL of the folder on the Panorama server.
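As a rough sketch, local and PanoramaWeb locations could be mixed within one params section. All paths below are placeholders. That fasta and spectral_library accept either kind of location is an assumption drawn from the note above, not a tested configuration.

```groovy
params {
    // Begins with https://, so it is interpreted as a PanoramaWeb (WebDAV) location
    quant_spectra_dir = 'https://panoramaweb.org/_webdav/path/to/@files/RawFiles/'

    // Placeholder local paths; replace with the locations of your own files
    fasta            = '<path to background FASTA>'
    spectral_library = '<path to spectral library>'
}
```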
The params Section
Req? |
Parameter Name |
Description |
|---|---|---|
|
The path to the spectral library to use. May be a |
|
|
The path to the background FASTA file to use. This parameter is required, except when running Cascadia. |
|
✓ |
|
The path to the directory containing the raw data to be quantified. If using narrow-window DIA and GPF to generate a chromatogram library, this is the location of the wide-window data to be searched using the chromatogram library.
Supported file formats are |
|
Which files in this directory to use. Default: |
|
|
Use this regex instead of |
|
|
Randomly select |
|
|
If you are creating a chromatogram library using GPF and narrow window DIA, this is the path to the directory containing the narrow-window raw data.
Accepts the same file formats as |
|
|
Which files in this directory to use. Default: |
|
|
Use this regex instead of |
|
|
If supported by the |
|
|
If use_vendor_raw is set to true, Nextflow will attempt to use hard links to the raw file, which is required by vendor libraries. However, this is not supported in all environments. If hard links are not supported, set this to true to create physical copies of the files instead of hard links. This will use extra disk space.
Default is |
|
|
Randomly select |
|
|
The seed used to randomly select files for the |
|
|
Must be set to either |
|
|
Metadata annotations for each |
|
|
If set to |
|
|
The email address to which a notification should be sent upon workflow completion. If no email is specified, no email will be sent. To send email, you must configure mail server settings (see below). |
params.pdc
Parameter Name |
Description |
|---|---|
|
When this option is set, raw files and metadata will be downloaded from the PDC. Default: |
|
Path to a pre-downloaded PDC study metadata file ( |
|
Override the study name used when |
|
A |
|
If this option is set, only the first |
|
Additional command line arguments passed to |
|
If set to |
|
A |
params.carafe
Parameter Name |
Description |
|---|---|
|
Legacy direct |
|
Directory, or list of directories, containing the |
|
Glob used to select files in |
|
Use this regex instead of |
|
The path to a DIA-NN |
|
FASTA file used by Carafe to generate final spectral library. If |
|
Command line options to pass to Carafe. Note: Do not set the |
|
Set to |
|
Set to |
|
The number of variable modifications allowed per peptide. Ignored if no variable modifications are included. Default: |
|
The FASTA file used by the DIA-NN search in the Carafe subworkflow. If not set either |
|
List of PDC file names (matching entries in the PDC study’s metadata) to feed Carafe. Requires |
|
Random sample size: pick |
params.msconvert
Parameter Name |
Description |
|---|---|
|
If starting with raw files, this is the value used by |
|
If starting with raw files, this is the value used by |
|
If starting with raw files, |
params.diann
When using DIA-NN, the chromatogram_library_spectra_dir parameter can optionally be used to create a subset library.
The files in chromatogram_library_spectra_dir are searched first using either the spectral library specified by params.spectral_library or a predicted library generated in the workflow by Carafe or DIA-NN.
The resulting subset library, which contains only those precursors identified in the first search, is then used to search the files in quant_spectra_dir.
DIA-NN requires at least 2 MS files in each search input.
This applies to quant_spectra_dir, and (when configured) also to chromatogram_library_spectra_dir.
The match-between-runs step (DIANN_MBR) needs two or more runs to emit the spectral library used downstream;
the workflow will fail with an explicit error naming which input(s) are too small when fewer files are supplied.
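The two-stage search described above might be configured along the following lines. This is a minimal sketch: every path is a placeholder, and the parameter names are the ones mentioned in this section and elsewhere on this page.

```groovy
params {
    search_engine = 'diann'

    // Library used for the first search (omit if a predicted library is generated instead)
    spectral_library = '<path to spectral library>'
    fasta            = '<path to background FASTA>'

    // First search: builds the subset library. Needs at least 2 MS files.
    chromatogram_library_spectra_dir = '<path to mzML/raw files>'

    // Second search: quantified using the subset library. Needs at least 2 MS files.
    quant_spectra_dir = '<path to mzML/raw files>'
}
```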
Parameter Name |
Description |
|---|---|
|
The parameters passed to DIA-NN when it is run. Default: |
|
Parameters used when generating a predicted spectral library with DIA-NN.
Note: Do not set the Default is: |
params.encyclopedia and params.cascadia
Parameter Name |
Description |
|---|---|
|
If you are generating a chromatogram library for quantification, these are the command-line options passed to EncyclopeDIA during the chromatogram generation step. Default: |
|
The command line options passed to EncyclopeDIA during the quantification step. Default: |
|
EncyclopeDIA generates many intermediate files that are subsequently processed by the workflow to generate the final results. These intermediate files may be large. If set to |
|
If set to |
|
Score threshold applied to Cascadia predictions. Must be between 0 and 1. Default: |
Cascadia has additional behavioral constraints worth knowing:
Multi-batch mode is not supported. Setting quant_spectra_dir to a Map (or using pdc.batch_file) with search_engine = 'cascadia' will cause the workflow to fail at startup.
Any user-supplied spectral_library is ignored with a warning. Cascadia performs de novo identification and produces its own library.
Cascadia generates its own FASTA from identified sequences; params.fasta is not required when running Cascadia.
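Taken together, these constraints suggest a Cascadia configuration roughly like the sketch below. The path is a placeholder, and any Cascadia-specific options from the table above are left at their defaults.

```groovy
params {
    search_engine = 'cascadia'

    // A single location, not a Map: multi-batch mode fails at startup with Cascadia
    quant_spectra_dir = '<path to mzML/raw files>'

    // fasta is left unset: Cascadia generates its own FASTA from identified sequences.
    // spectral_library is left unset: a user-supplied library would be ignored with a warning.
}
```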
params.skyline
Parameter Name |
Description |
|---|---|
|
If set to |
|
The base of the file name of the generated Skyline document. If set to |
|
Path(s) (local file system or Panorama WebDAV) to a |
|
The Skyline template file used to generate the final Skyline file. By default a
pre-made Skyline template file suitable for EncyclopeDIA or DIA-NN will be used. Specify a file
location here to use your own template. Note: The filenames in the .zip file must match
the name of the zip file, itself. E.g., |
|
If |
|
If |
|
The fasta file to use as a background proteome in Skyline. If |
|
If |
|
If |
|
On systems that allow it, setting this to |
params.qc_report and params.batch_report
Parameter Name |
Description |
|---|---|
|
If set to |
|
Normalization method to use for plots in QC and batch report(s). This option applies to both the QC and batch reports. Available options are |
|
Method to use to impute missing precursor peak areas for plots in QC and batch report(s).
This option applies to both the QC and batch reports.
Available options are |
|
List of protein names in Skyline document to plot retention times for. For example: If |
|
List of metadata variables to color PCA plots by. For example: This option applies to both the QC and batch reports.
If |
|
Export tsv files containing normalized precursor and protein quantities? Default is |
|
List of formats to render the QC report in. Allowed values are |
|
List of replicate names to exclude from normalization and batch correction. Default: |
|
List of batch/project names to exclude from normalization and batch correction. Default: |
|
If set to |
|
Metadata key for batch level 1. If |
|
Metadata key for batch level 2. A second batch level is only supported with |
|
Metadata key(s) to use as covariates for batch correction. If |
|
Metadata key indicating replicates which are controls for CV plots. If |
|
Metadata value(s) mapping to |
|
File extension for standalone plots. If |
params.panorama
Parameter Name |
Description |
|---|---|
|
Whether or not to upload results to PanoramaWeb. Default: |
|
The WebDAV URL of a directory in PanoramaWeb to which to upload the results. Note that |
|
If set to |
Running the workflow in multi-batch mode
The workflow can be run in multi-batch mode if the search engine selected by params.search_engine supports it.
Among the search engines, only 'diann' supports multi-batch mode; EncyclopeDIA and Cascadia raise an error at startup if invoked with batch inputs (a Map-shaped quant_spectra_dir or a pdc.batch_file). No-search mode (search_engine = null) also accepts batch inputs and produces one Skyline document per batch.
There are two ways to activate multi-batch mode:
Using quant_spectra_dir as a Map
For non-PDC runs, params.quant_spectra_dir must be a Map in which each key is a batch name and each value is the location of the MS files for that batch.
For example:
params {
quant_spectra_dir = ['Plate_1': '<path to mzML/raw files>',
'Plate_2': '<path to mzML/raw files>']
}
Note: mzML/raw file names cannot be duplicated across batches. If there are duplicate file names, the DIANN_MBR process will fail.
Using pdc.batch_file for PDC runs
For PDC runs, multi-batch mode is activated by setting params.pdc.batch_file to a tsv file that assigns each downloaded PDC file to a batch. The file must have file_name and batch columns:
| file_name | batch |
|---|---|
| sample_001.raw | BatchA |
| sample_002.raw | BatchA |
| sample_003.raw | BatchB |
| sample_004.raw | BatchB |
The workflow validates that all files in the batch file match files downloaded from the PDC study, and that all downloaded files appear in the batch file.
For example:
params {
pdc.study_id = 'PDC000504'
pdc.batch_file = '/path/to/pdc_batches.tsv'
}
Differences in result files in multi batch mode
A separate Skyline document is generated for each batch, with the batch name appended to the document name.
For example, if params.skyline.document_name is 'human_dia' and using the batches in the example above, 2 documents would be generated:
human_dia_Plate_1.sky.zip
human_dia_Plate_2.sky.zip
For PDC runs where skyline.document_name defaults to the study name, the batch name is appended similarly:
study_name_BatchA.sky.zip
study_name_BatchB.sky.zip
Any optional Skyline reports will be generated separately for each document.
A separate QC report is generated for each Skyline document.
If results are uploaded to PanoramaWeb, any mzML files generated in the workflow are put into a separate subdirectory for each batch.
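To tie the naming rule to a configuration, a multi-batch run that sets the document name might look like the sketch below. The paths are placeholders, and the batch names are the ones from the Map example earlier in this section.

```groovy
params {
    search_engine = 'diann'   // the only search engine that supports multi-batch mode

    // Base name; the batch name is appended to it for each generated document
    skyline.document_name = 'human_dia'

    // One entry per batch: batch name -> location of that batch's MS files
    quant_spectra_dir = ['Plate_1': '<path to mzML/raw files>',
                         'Plate_2': '<path to mzML/raw files>']
}
```

With these settings, the documents described above would be named human_dia_Plate_1.sky.zip and human_dia_Plate_2.sky.zip.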
Providing replicate metadata
The replicate_metadata file can be a tsv or csv file in which the first column has the header Replicate. The values in the Replicate column should exactly match the names of the mzML or raw files that will be in the Skyline document. The headers of subsequent columns are the names of the metadata variables, and the values in each column are the annotations corresponding to each replicate.
| Replicate | sample_type | strain |
|---|---|---|
| replicate_1.raw | test | BALB/cJ |
| replicate_2.raw | test | C57BL/6J |
| replicate_3.raw | IBQC | Pool |
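A metadata file like the one above would then be referenced from the configuration file. The sketch below assumes that replicate_metadata is set directly inside the params section. The file path is a placeholder.

```groovy
params {
    // Placeholder path to a tsv or csv file whose first column header is "Replicate"
    replicate_metadata = '/path/to/replicate_metadata.tsv'
}
```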
The profiles Section
The example configuration file includes this profiles section:
profiles {
// "standard" is the profile used when the steps of the workflow are run
// locally on your computer. These parameters should be changed to match
// your system resources (that you are willing to devote to running
// workflow jobs).
standard {
params.max_memory = '8.GB'
params.max_cpus = 4
params.max_time = '240.h'
params.mzml_cache_directory = '/data/mass_spec/nextflow/nf-skyline-dia-ms/mzml_cache'
params.panorama_cache_directory = '/data/mass_spec/nextflow/panorama/raw_cache'
}
}
These parameters describe the capability of your local computer for running the steps of the workflow. Below is a description of each parameter:
Req? |
Parameter Name |
Description |
|---|---|---|
✓ |
|
The maximum amount of RAM that may be used by steps of the workflow. Default: 8 gigabytes. |
✓ |
|
The number of cores that may be used by the workflow. Default: 4 cores. |
✓ |
|
The maximum amount of time a step in the workflow may run before it is stopped and an error is generated. Default: 240 hours. |
✓ |
|
When |
✓ |
|
If the RAW files to be processed are in PanoramaWeb, the RAW files will be downloaded to and cached in this directory for future use. |
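Because these settings describe a system rather than a run, one option is to keep a second profile alongside standard for a larger machine. In the sketch below, the profile name workstation, the resource values, and the cache paths are all hypothetical. Only the five parameter names come from the example above.

```groovy
profiles {
    // Used by default for local runs
    standard {
        params.max_memory = '8.GB'
        params.max_cpus = 4
        params.max_time = '240.h'
        params.mzml_cache_directory = '/path/to/mzml_cache'
        params.panorama_cache_directory = '/path/to/raw_cache'
    }

    // Hypothetical profile for a machine with more cores and RAM
    workstation {
        params.max_memory = '64.GB'
        params.max_cpus = 16
        params.max_time = '240.h'
        params.mzml_cache_directory = '/path/to/mzml_cache'
        params.panorama_cache_directory = '/path/to/raw_cache'
    }
}
```

A non-default profile is selected with Nextflow's -profile command line option, for example -profile workstation.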
The process Section
In Nextflow the default compute resources allocated to a process can be adjusted in the process section using the withName selector.
The following processes will dynamically adjust the requested memory and run time to fit the number and size of the files being processed.
Nextflow will try to allocate resources using the formulas below up to the maximum values specified by params.max_memory, params.max_time and params.max_cpus.
| Process | CPUs | Memory | Walltime |
|---|---|---|---|
| | 8 | Maximum of 16 GB and 2 times the sum of the sizes of the MS and spectral library files | 2 hours |
| | 32 | Maximum of 32 GB and 2 times the sum of the MS file sizes | 10 minutes times the number of MS files |
| | 2 | Maximum of 8 GB and 1.5 times the size of the precursor report file | 2 hours |
| | 8 | 16 GB | 4 hours |
| | 32 | Maximum of 32 GB and 4 times the number of MS files | 24 hours |
| | 8 | Maximum of 8 GB and 10 times the spectral library size | 4 hours |
| | 8 | Maximum of 8 GB and the sum of the MS file and skyline template with spectral library | 2 hours |
| | 32 | Maximum of 8 GB and 1.5 times the sum of the sizes of the .skyd files | 8 hours |
| | 8 | Maximum of 8 GB and 1.5 times the size of the skyline zip file | 4 hours |
| | 8 | Maximum of 8 GB and 1.5 times the size of the skyline zip file | 4 hours |
| | 2 | Maximum of 8 GB and the sum of the sizes of the precursor reports | 8 hours |
| | 8 | Maximum of 8 GB and 2 times the size of the batch database | 4 hours |
| | 2 | Maximum of 8 GB and 2 times the size of the batch database | 2 hours |
| | 2 | Maximum of 8 GB and 2 times the size of the batch database | 4 hours |
| | 2 | Maximum of 8 GB and 2 times the size of the batch database | 2 hours |
| | 2 | Maximum of 8 GB and 2 times the size of the batch database | 2 hours |
| | 2 | Maximum of 8 GB and 2 times the size of the batch database | 2 hours |
In most cases there is no need for users to adjust the default values.
One instance where adjusting these parameters could be useful is to select the AWS batch queue to be used for a specific process.
The DIANN_MBR process downloads all MS files to a single EC2 instance.
In cases where large numbers of files are being processed the available disk space on the default EC2 instance might not be sufficient to hold all the MS files.
The DIANN_MBR process can be set to run in a queue with more disk space by adding the following to the pipeline config.
process {
withName:DIANN_MBR {
queue = "nextflow_basic_ec2_1tb"
}
}
The resource requirements allocated to a process can be fully customized by adding a withName selector to the process section of the pipeline config file.
For example, to override the default memory and wall time for DIANN_MBR you could add the following to the pipeline config:
process {
withName:DIANN_MBR {
memory = 248.GB
time = 48.h
}
}
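Fixed overrides are not the only option. Nextflow also accepts closures for resource directives, so a request can grow when a task is retried. The sketch below uses the standard Nextflow directives errorStrategy and maxRetries together with task.attempt. The specific values are hypothetical. Whether the workflow's params.max_memory and params.max_time limits still apply to values overridden this way is not covered here and should be checked before relying on it.

```groovy
process {
    withName:DIANN_MBR {
        // Hypothetical starting values, scaled by the attempt number (1, 2, 3)
        memory = { 64.GB * task.attempt }
        time = { 24.h * task.attempt }

        // Resubmit a failed task up to two more times with the larger request
        errorStrategy = 'retry'
        maxRetries = 2
    }
}
```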
The mail Section
This is a more advanced and entirely optional set of parameters. When the workflow completes, it can optionally send an email to the address specified above in the params section.
For this to work, the following parameters must be changed to match the settings of your email server. You may need to contact your IT department to obtain the appropriate settings.
The example configuration file includes this mail section:
mail {
from = 'address@host.com'
smtp.host = 'smtp.host.com'
smtp.port = 587
smtp.user = 'smtp_user'
smtp.password = 'smtp_password'
smtp.auth = true
smtp.starttls.enable = true
smtp.starttls.required = false
mail.smtp.ssl.protocols = 'TLSv1.2'
}
Below is a description of each parameter:
Req? |
Parameter Name |
Description |
|---|---|---|
✓ |
|
The email address from which the email should be sent. |
✓ |
|
The internet address (host name or IP address) of the email SMTP server. |
✓ |
|
The port on the host to connect to. Most likely will be |
|
If authentication is required, this is the username. |
|
|
If authentication is required, this is the password. |
|
✓ |
|
Whether or not (true or false) authentication is required. |
✓ |
|
Whether or not to enable TLS support. |
✓ |
|
Whether or not TLS is required. |
✓ |
|
SSL protocol to use for sending SMTP messages. |
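For a mail server that does not require authentication, the username and password could presumably be left out. The sketch below assumes that the two rows not marked as required in the table correspond to smtp.user and smtp.password. The host, sender address, and port 25 are placeholders to be replaced with the values for your server.

```groovy
mail {
    from = 'address@host.com'
    smtp.host = 'smtp.host.com'
    smtp.port = 25                    // placeholder; use the port your server expects
    smtp.auth = false                 // no username or password needed
    smtp.starttls.enable = false
    smtp.starttls.required = false
    mail.smtp.ssl.protocols = 'TLSv1.2'
}
```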