Workflow Overview
These documents describe a standardized Nextflow workflow for processing DIA mass spectrometry data to quantify peptides and proteins. The source code for the workflow can be found at: https://github.com/mriffle/nf-skyline-dia-ms.
This workflow supports three search engines: DIA-NN, Encyclopedia, and Cascadia for performing de novo searches. Each search engine works as a drop-in replacement for the other, supporting all the same pre- and post-analysis steps. In all cases, the workflow supports converting RAW files, integrating with PanoramaWeb (ProteomeXchange) and Proteomic Data Commons, and will generate a Skyline document suitable for visualization and analysis in Skyline.
Supported input file formats
The workflow accepts the following MS input file formats for quant_spectra_dir and chromatogram_library_spectra_dir:
.mzML— supported by all search engines..raw(Thermo) — supported by all search engines. Files are converted to mzML using msconvert unlessuse_vendor_rawis enabled..d.zip(Bruker) — a zipped Bruker.ddirectory. The workflow extracts these to.ddirectories rather than running msconvert. Bruker ``.d.zip`` input is only supported when ``search_engine`` is ``’diann’`` or ``null`` (no-search, Skyline-only). EncyclopeDIA and Cascadia do not read Bruker data.
All matched files in a single directory must share one extension; mixing formats within a batch is not supported.
Cascadia workflow:
The workflow will perform de novo identification of peptides using user-supplied DIA RAW (or mzML) files. The workflow will generate a Skyline document where users may visualize the de novo results and export integrated peak areas for the identified peptides.
DIA-NN workflow:
The workflow will quantify peptides and proteins using user-supplied DIA RAW, mzML, or Bruker .d.zip files, a FASTA file, and a spectral
library (optional). If the user does not specify a spectral library, DIA-NN will be run in “library-free” mode, where
it will create its own library using AI. Finally the workflow will generate a Skyline document using the quantified peptides
and proteins.
EncyclopeDIA workflow:
This workflow is summarized in the following article:
Chromatogram libraries improve peptide detection and quantification by data independent acquisition mass spectrometry. Searle BC, Pino LK, Egertson JD, Ting YS, Lawrence RT, MacLean BX, Villén J, MacCoss MJ. Nat Commun. 2018 Dec 3;9(1):5128. (https://pubmed.ncbi.nlm.nih.gov/30510204/)
The workflow will quantify peptides and proteins using user-supplied DIA RAW (or mzML) files, FASTA file, and spectral library. If the experimental design includes generation of a chromatogram library using narrow window DIA data, the workflow will first generate the chromatogram library (Figure 1A) and use that as input to the next phase (Figure 1B) to quantify peptides and proteins. If the experimental design does not include this, the user-supplied spectral library is used as input for quantifying peptides and proteins. Finally the workflow will generate a Skyline document using the quantified peptides and proteins.
Other run modes
No-search mode (
search_engine = null) — the search step is skipped and the user-suppliedspectral_library,fasta, andquant_spectra_dirfiles are imported directly into Skyline. Useful when you already have a curated library and only need quantification + Skyline import.msconvert-only mode (
msconvert_only = true) — the workflow resolves MS inputs (downloading and converting RAW or extracting Bruker.d.zipas needed), optionally uploads them to PanoramaWeb, and exits. No search, library generation, or Skyline document is produced.Carafe library generation (
carafe.spectra_dirset, or — when running against PDC —carafe.pdc_files/carafe.pdc_n_filesset) — Carafe runs before the main search to build a spectral library from your data. The resulting library overrides any user-suppliedspectral_library. With PDC input, Carafe consumes a subset of the PDC quant download (no separate upload required).PDC input (
pdc.study_idset) — RAW or Bruker.d.zipfiles and study metadata are downloaded from the Proteomic Data Commons instead of from local/Panorama paths. Outsidemsconvert_onlymode, PDC requiressearch_engine = 'diann'.
The workflow is summarized graphically as:
Figure 1. An overview of the computational pipeline implemented by this workflow. (A) the optional generation of a chromatogram library that can be fed into part (B) for peptide and protein quantification using DIA. If part (A) is not run, a user-supplied spectral library or chromatogram library may be used for quantification in part (B).
How to Run
This workflow uses the Nextflow standardized workflow platform. The Nextflow platform emphasizes ease of use, workflow portability, and containerization of the individual steps. To run this workflow, you do not need to install any of the software components of the workflow. There is no need to worry about installing necessary software libraries, version incompatibilities, or compiling or installing complex and fickle software.
To run the workflow you need only install Nextflow, which is relatively simple. To run the individual steps of the workflow on your own computer, you will need to install Docker. After these are installed, you will need to edit the pipeline configuration file to supply the locations of your data and execute a simple Nextflow command, such as:
nextflow run -resume -r main mriffle/nf-skyline-dia-ms -c pipeline.config
The entire workflow will be run automatically, downloading Docker images as necessary, and the results output to
the results directory. See How to Install the Workflow for more details on how to install Nextflow and Docker. See
How to Run the Workflow for more details on how to run the workflow. And see Output & Results for more details on how to
retrieve the results.
Workflow Components
The workflow is made up of the following software components, each may be run multiple times for different tasks.
PanoramaWeb (https://panoramaweb.org/home/project-begin.view)
Users may optionally use WebDAV URLs as locations for input data files in PanoramaWeb. The workflow will automatically download files as necessary.
msconvert (https://proteowizard.sourceforge.io/)
If users supply RAW files as input, they will be converted to mzML using msconvert (unless
use_vendor_rawis set). Bruker.d.zipinputs bypass msconvert and are extracted to.ddirectories that are passed directly to DIA-NN or Skyline.EncyclopeDIA (http://www.searlelab.org/software/encyclopedia/index.html)
When
search_engine = 'encyclopedia', EncyclopeDIA is used in three parts of the pipeline:If the user supplies a BLIB spectral library, EncyclopeDIA will be used to convert that to a DLIB.
EncyclopeDIA is used to search narrow window DIA data and generate a chromatogram library.
EncyclopeDIA is used to quantify peptides and proteins.
DIA-NN (https://github.com/vdemichev/DiaNN)
When
search_engine = 'diann', DIA-NN performs the search. It can use a user-supplied spectral library, a Carafe-generated library, or run in library-free mode where it predicts a library from the FASTA. DIA-NN is the only search engine that supports Bruker.d.zipinput and multi-batch runs.Cascadia (https://github.com/Noble-Lab/Cascadia)
When
search_engine = 'cascadia', Cascadia performs de novo peptide identification and produces its own spectral library and FASTA. User-supplied spectral libraries are ignored and batch mode is not supported.Carafe (https://github.com/Noble-Lab/Carafe)
Optionally generates a spectral library before the main search when
carafe.spectra_dir(or the legacycarafe.spectra_file) is set. The generated library overrides any user-suppliedspectral_libraryfor downstream search. Carafe accepts.mzML,.raw, and Bruker.d.zipinputs;.rawfiles are converted to mzML and.d.zipfiles are extracted to.ddirectories before Carafe runs.PDC Client (https://proteomic.datacommons.cancer.gov/)
When
pdc.study_idis set, the workflow downloads RAW or Bruker.d.zipfiles and study metadata from the Proteomic Data Commons. PDC studies are searched with DIA-NN (the only search engine compatible with the PDC branch outsidemsconvert_onlymode).Skyline (https://skyline.ms/project/home/begin.view)
Skyline imports MS data and search results from any of the three search engines (or the user-supplied library in no-search mode) into a Skyline template document. The document is annotated with replicate metadata, optionally minimized, and used to run any user-supplied
.skyrreports.DIA-QC report tooling (https://github.com/ajmaurais/DIA_QC_report)
When
qc_report.skipisfalse, this tooling generates a normalized precursor/protein quality report (HTML and/or PDF) from Skyline report exports. Batch reports and PDC gene-level reports use the same database.