Workflow Overview

These documents describe a standardized Nextflow workflow for processing DIA mass spectrometry data to quantify peptides and proteins. The source code for the workflow can be found at: https://github.com/mriffle/nf-skyline-dia-ms.

This workflow supports three search engines: DIA-NN, EncyclopeDIA, and Cascadia (for de novo searches). Each search engine works as a drop-in replacement for the others and supports the same pre- and post-analysis steps. In all cases, the workflow supports converting RAW files, integrating with PanoramaWeb (ProteomeXchange) and the Proteomic Data Commons, and generating a Skyline document suitable for visualization and analysis in Skyline.

Supported input file formats

The workflow accepts the following MS input file formats for quant_spectra_dir and chromatogram_library_spectra_dir:

.mzML — supported by all search engines.
.raw (Thermo) — supported by all search engines. Files are converted to mzML using msconvert unless use_vendor_raw is enabled.
.d.zip (Bruker) — a zipped Bruker .d directory. The workflow extracts these to .d directories rather than running msconvert. Bruker .d.zip input is only supported when search_engine is 'diann' or null (no-search, Skyline-only). EncyclopeDIA and Cascadia do not read Bruker data.

All matched files in a single directory must share one extension; mixing formats within a batch is not supported.
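The format rules above can be sketched as a small pre-flight check. This is only an illustration of the stated rules, not code from the workflow: the function names are hypothetical, and the case-sensitive suffix matching is an assumption.

```python
from pathlib import PurePath

# Extensions listed in this section. ".d.zip" is a two-part suffix, so the
# check below uses str.endswith rather than PurePath.suffix.
SUPPORTED_EXTENSIONS = (".d.zip", ".mzML", ".raw")

# Engines that can read Bruker data according to the text above.
# None stands for search_engine = null (no-search, Skyline-only).
BRUKER_CAPABLE_ENGINES = {"diann", None}


def input_extension(filename):
    """Return the supported extension of filename, or raise ValueError.

    Matching is case-sensitive here, which is an assumption.
    """
    name = PurePath(filename).name
    for ext in SUPPORTED_EXTENSIONS:
        if name.endswith(ext):
            return ext
    raise ValueError(f"unsupported input file: {filename}")


def check_batch(filenames, search_engine):
    """Validate the files matched in one directory against the rules above."""
    extensions = {input_extension(f) for f in filenames}
    # All matched files in a directory must share one extension.
    if len(extensions) > 1:
        raise ValueError(f"mixed formats in one batch: {sorted(extensions)}")
    # Bruker input is limited to DIA-NN and no-search mode.
    if ".d.zip" in extensions and search_engine not in BRUKER_CAPABLE_ENGINES:
        raise ValueError(f"Bruker .d.zip input is not supported by {search_engine}")
    return extensions.pop() if extensions else None
```

For example, check_batch(["s1.d.zip", "s2.d.zip"], "diann") returns ".d.zip", while the same files with "encyclopedia" raise a ValueError.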

Cascadia workflow:

The workflow will perform de novo identification of peptides using user-supplied DIA RAW (or mzML) files. The workflow will generate a Skyline document where users may visualize the de novo results and export integrated peak areas for the identified peptides.

DIA-NN workflow:

The workflow will quantify peptides and proteins using user-supplied DIA RAW, mzML, or Bruker .d.zip files, a FASTA file, and an optional spectral library. If the user does not specify a spectral library, DIA-NN will be run in “library-free” mode, in which it predicts its own spectral library from the FASTA file. Finally, the workflow will generate a Skyline document using the quantified peptides and proteins.

EncyclopeDIA workflow:

This workflow is summarized in the following article:

Chromatogram libraries improve peptide detection and quantification by data independent acquisition mass spectrometry. Searle BC, Pino LK, Egertson JD, Ting YS, Lawrence RT, MacLean BX, Villén J, MacCoss MJ. Nat Commun. 2018 Dec 3;9(1):5128. (https://pubmed.ncbi.nlm.nih.gov/30510204/)

The workflow will quantify peptides and proteins using user-supplied DIA RAW (or mzML) files, a FASTA file, and a spectral library. If the experimental design includes generation of a chromatogram library using narrow window DIA data, the workflow will first generate the chromatogram library (Figure 1A) and use that as input to the next phase (Figure 1B) to quantify peptides and proteins. If the experimental design does not include this, the user-supplied spectral library is used as input for quantifying peptides and proteins. Finally, the workflow will generate a Skyline document using the quantified peptides and proteins.
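The branching described above can be sketched as a function that lists the EncyclopeDIA stages implied by a set of inputs. This is an illustrative sketch only: the function name is hypothetical, detecting a BLIB library by its file extension is an assumption, and so is treating a set chromatogram_library_spectra_dir as the signal that narrow window DIA data was supplied.

```python
def encyclopedia_stages(spectral_library, chromatogram_library_spectra_dir=None):
    """List the EncyclopeDIA stages implied by the supplied inputs."""
    stages = []
    # A BLIB spectral library is converted to DLIB before it is used
    # (assumption: the library type is recognized by its extension).
    if spectral_library.lower().endswith(".blib"):
        stages.append("convert BLIB to DLIB")
    # Narrow window DIA data triggers chromatogram library generation
    # (Figure 1A), and that library feeds quantification.
    if chromatogram_library_spectra_dir:
        stages.append("generate chromatogram library")
        quant_library = "chromatogram library"
    else:
        quant_library = "user-supplied spectral library"
    # Quantification always runs (Figure 1B).
    stages.append(f"quantify using {quant_library}")
    return stages
```

Calling encyclopedia_stages("library.dlib") yields only the quantification stage, whereas supplying a narrow window directory adds the chromatogram library stage before it.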

Other run modes

No-search mode (search_engine = null) — the search step is skipped and the user-supplied spectral_library, fasta, and quant_spectra_dir files are imported directly into Skyline. Useful when you already have a curated library and only need quantification + Skyline import.
msconvert-only mode (msconvert_only = true) — the workflow resolves MS inputs (downloading and converting RAW or extracting Bruker .d.zip as needed), optionally uploads them to PanoramaWeb, and exits. No search, library generation, or Skyline document is produced.
Carafe library generation (carafe.spectra_dir set, or — when running against PDC — carafe.pdc_files / carafe.pdc_n_files set) — Carafe runs before the main search to build a spectral library from your data. The resulting library overrides any user-supplied spectral_library. With PDC input, Carafe consumes a subset of the PDC quant download (no separate upload required).
PDC input (pdc.study_id set) — RAW or Bruker .d.zip files and study metadata are downloaded from the Proteomic Data Commons instead of from local/Panorama paths. Outside msconvert_only mode, PDC requires search_engine = 'diann'.
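How these modes combine can be sketched as a function over a flat parameter dictionary. This is a hedged illustration of the rules listed above, not the workflow's own logic: describe_run and its return keys are hypothetical, the dotted keys simply mirror the parameter names used in this section, and the precedence of Carafe over Cascadia's own library is an assumption because the text does not describe that combination.

```python
def describe_run(params):
    """Summarize which run mode, input source, and library a parameter dict selects."""
    engine = params.get("search_engine")  # None models search_engine = null
    uses_pdc = params.get("pdc.study_id") is not None
    input_source = "pdc" if uses_pdc else "local or Panorama"

    # msconvert-only mode exits after resolving MS inputs: no search,
    # no library generation, no Skyline document.
    if params.get("msconvert_only", False):
        return {"mode": "msconvert-only", "input_source": input_source, "library": None}

    # Outside msconvert-only mode, PDC input requires DIA-NN.
    if uses_pdc and engine != "diann":
        raise ValueError("PDC input requires search_engine = 'diann'")

    # Carafe is triggered by carafe.spectra_dir or, with PDC input,
    # by carafe.pdc_files / carafe.pdc_n_files.
    carafe_enabled = params.get("carafe.spectra_dir") is not None or (
        uses_pdc
        and (params.get("carafe.pdc_files") is not None
             or params.get("carafe.pdc_n_files") is not None)
    )

    if carafe_enabled:
        library = "carafe"  # overrides any user-supplied spectral_library
    elif engine == "cascadia":
        library = "cascadia"  # user-supplied libraries are ignored
    elif params.get("spectral_library"):
        library = "user-supplied"
    else:
        library = None  # e.g. DIA-NN library-free mode

    mode = "no-search" if engine is None else "search"
    return {"mode": mode, "input_source": input_source, "library": library}
```

For instance, a dict with search_engine set to "diann", a placeholder pdc.study_id, and carafe.pdc_n_files set resolves to a search run on PDC input with a Carafe-generated library, even if spectral_library is also present.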

The workflow is summarized graphically as:

Figure 1 (_images/workflow_figure.png). An overview of the computational pipeline implemented by this workflow. (A) The optional generation of a chromatogram library that can be fed into part (B) for peptide and protein quantification using DIA. If part (A) is not run, a user-supplied spectral library or chromatogram library may be used for quantification in part (B).

How to Run

This workflow uses the Nextflow standardized workflow platform, which emphasizes ease of use, workflow portability, and containerization of the individual steps. To run this workflow, you do not need to install any of its software components, so there are no software libraries to install, no version incompatibilities to resolve, and no complex or fickle software to compile.

To run the workflow you need only install Nextflow, which is relatively simple. To run the individual steps of the workflow on your own computer, you will need to install Docker. After these are installed, you will need to edit the pipeline configuration file to supply the locations of your data and execute a simple Nextflow command, such as:

nextflow run -resume -r main mriffle/nf-skyline-dia-ms -c pipeline.config

The entire workflow will be run automatically, downloading Docker images as necessary, and the results will be written to the results directory. See How to Install the Workflow for more details on how to install Nextflow and Docker, How to Run the Workflow for more details on how to run the workflow, and Output & Results for more details on how to retrieve the results.
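If you launch the workflow from a script rather than by hand, the command shown above can be assembled programmatically. The following is a minimal sketch, assuming Nextflow is already on your PATH; the helper name nextflow_command is hypothetical. The flags are the ones from the command above: -resume reuses cached results from a previous run, -r selects the repository revision, and -c names the configuration file.

```python
import shlex


def nextflow_command(config_path="pipeline.config", revision="main", resume=True):
    """Build the argument list for the nextflow run command shown above."""
    cmd = ["nextflow", "run"]
    if resume:
        cmd.append("-resume")  # reuse cached task results where possible
    cmd += ["-r", revision, "mriffle/nf-skyline-dia-ms", "-c", config_path]
    return cmd


# Print the command as a single shell-safe string for inspection.
print(shlex.join(nextflow_command()))

# To actually launch the run, pass the list to subprocess, for example:
#   import subprocess
#   subprocess.run(nextflow_command(), check=True)
```

Passing the command as a list avoids shell quoting problems when the configuration file path contains spaces.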

Workflow Components

The workflow is made up of the following software components, each of which may be run multiple times for different tasks.

PanoramaWeb (https://panoramaweb.org/home/project-begin.view)

Users may optionally use WebDAV URLs as locations for input data files in PanoramaWeb. The workflow will automatically download files as necessary.

msconvert (https://proteowizard.sourceforge.io/)

If users supply RAW files as input, they will be converted to mzML using msconvert (unless use_vendor_raw is set). Bruker .d.zip inputs bypass msconvert and are extracted to .d directories that are passed directly to DIA-NN or Skyline.

EncyclopeDIA (http://www.searlelab.org/software/encyclopedia/index.html)

When search_engine = 'encyclopedia', EncyclopeDIA is used in three parts of the pipeline:
1. If the user supplies a BLIB spectral library, EncyclopeDIA will be used to convert that to a DLIB.
2. EncyclopeDIA is used to search narrow window DIA data and generate a chromatogram library.
3. EncyclopeDIA is used to quantify peptides and proteins.

DIA-NN (https://github.com/vdemichev/DiaNN)

When search_engine = 'diann', DIA-NN performs the search. It can use a user-supplied spectral library, a Carafe-generated library, or run in library-free mode where it predicts a library from the FASTA. DIA-NN is the only search engine that supports Bruker .d.zip input and multi-batch runs.

Cascadia (https://github.com/Noble-Lab/Cascadia)

When search_engine = 'cascadia', Cascadia performs de novo peptide identification and produces its own spectral library and FASTA. User-supplied spectral libraries are ignored and batch mode is not supported.

Carafe (https://github.com/Noble-Lab/Carafe)

Optionally generates a spectral library before the main search when carafe.spectra_dir (or the legacy carafe.spectra_file) is set. The generated library overrides any user-supplied spectral_library for downstream search. Carafe accepts .mzML, .raw, and Bruker .d.zip inputs; .raw files are converted to mzML and .d.zip files are extracted to .d directories before Carafe runs.

PDC Client (https://proteomic.datacommons.cancer.gov/)

When pdc.study_id is set, the workflow downloads RAW or Bruker .d.zip files and study metadata from the Proteomic Data Commons. PDC studies are searched with DIA-NN (the only search engine compatible with the PDC branch outside msconvert_only mode).

Skyline (https://skyline.ms/project/home/begin.view)

Skyline imports MS data and search results from any of the three search engines (or the user-supplied library in no-search mode) into a Skyline template document. The document is annotated with replicate metadata, optionally minimized, and used to run any user-supplied .skyr reports.

DIA-QC report tooling (https://github.com/ajmaurais/DIA_QC_report)

When qc_report.skip is false, this tooling generates a normalized precursor/protein quality report (HTML and/or PDF) from Skyline report exports. Batch reports and PDC gene-level reports use the same database.