R-software multiMS-toolbox © 2012+

Introduction

multiMS-toolbox is a software toolbox for efficiently searching for differences between mass-spectrometry samples from long-term experiments. Each sample is expected to have been measured in several runs. The software then allows you to:

match the appropriate peaks or peak clusters in the spectra of different runs and different samples and, if required, replace the peaks of the same isotope group by one group peak,
select an appropriate normalization method and run a principal component analysis (PCA, [Pearson 1901]) on the processed data,
group samples together and assign them the same shape or the same color in graphs, then draw the PCA scores plot (data samples plot) and loadings plot for each pair of the 3 dominant principal components, and draw the 3D PCA scores plot,
export and analyze each PCA component, draw the graphs of the most important PCA loadings and the most important peaks in the spectra,
run an analysis of variance (ANOVA) for each principal component with samples grouped by the same shape or color in graphs, and draw the appropriate graphs,
draw other graphs for later analysis, such as the most important changes of absolute and relative intensities or areas of the peaks and peak clusters, and output all the results to csv, txt, pdf (or tiff, png) files.

Installation on MS Windows

The software is distributed as a ZIP archive containing the "multiMS-toolbox" application folder. The main file of the toolbox is "multiMS-toolbox.R".

To use it, you must first install the R-software system (https://www.r-project.org/) on your computer. To create 3D PCA plots and visualize the match map of matched peaks, also install the R-software packages rgl and ggplot2 (select Packages → Install package(s)… from the R GUI menu). To check the characters in filenames, also install the R-software package stringi. To create average spectra plots, also install the R-software package reshape2. If you want to use normalization by the best matching exponential line, also install the R-software package minpack.lm.
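
The optional packages can also be installed from the R command line instead of the GUI menu. The following is a minimal sketch using the standard installed.packages() and install.packages() functions; the variable names are illustrative, and installation is only attempted in an interactive session:

```r
# Optional packages named in this section.
needed <- c("rgl", "ggplot2", "stringi", "reshape2", "minpack.lm")

# Keep only the packages that are not installed yet.
missingPkgs <- setdiff(needed, rownames(installed.packages()))

# Install the missing ones (requires an internet connection and a CRAN mirror).
if (length(missingPkgs) > 0 && interactive()) {
  install.packages(missingPkgs)
}
```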

After starting the R-software, change to the directory where you installed the toolbox with the setwd() command (to see the current directory, run the getwd() command from the R command line). For example, when multiMS-toolbox is installed in D:\multiMS-toolbox, run

> getwd()

[1] "D:/"

> setwd("D:/multiMS-toolbox")

Then load the toolbox with the source() command:

> source("multiMS-toolbox.R")

You can also associate the .RData extension with the R for Windows GUI front-end and then start the R environment by double-clicking the 1blank.RData file (a blank workspace); R then starts with the working directory set to the directory of that file.

To run the demo examples for the aging effect of proteinaceous binders, change to the directory where the example files are stored:

> setwd("examples")

> setwd("protbind")

And then run either of these commands:

> demoLowProteins1()

or, depending on the selected normalization method (see Implemented functions for details),

> demoNormalizedLowProteins1()

> demoNormalizedLowProteins2()

> demoNormalizedLowProteins3()

For the full spectrum analysis available from version 2.0, you can also run

> demoFullSpectraNormalizedLowProteins1()

> demoFullSpectraNormalizedLowProteins2()

To run the demo examples for bacteria mass spectra, change to the directory where the example files are stored:

> setwd("D:/multiMS-toolbox")

> setwd("examples")

> setwd("bacteria")

And then run the command

> demoHighProteins1()

or, when normalization is used, run

> demoNormalizedHighProteins1()

For the full spectrum analysis available from version 2.0, you can also run

> demoFullSpectraNormalizedHighProteins1()

> demoFullSpectraNormalizedHighProteins2()

All the outputs are printed and drawn in the R GUI and stored as csv, pdf and txt files in the current directory.

Each of these demo functions only calls the core function runPCA, which has two required parameters (lowMz, highMz) and several optional parameters. See the help for this function in the Implemented functions section. The functions can be called from the R command window; each parameter should be passed by name together with its value. String values must be enclosed in quotes, e.g.

> runPCA(csvfile="filesAll.csv", lowMz=900.0, highMz=2000.0)

Alternatively, you can specify all parameters in a configuration file (e.g. "config.R") and pass only the path to that file:

> runPCA(paramsFile="configPeaksOnlyLowProteinsNormalize1.R")
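
As an illustration, a configuration file might look like the sketch below. It assumes that the file is an ordinary R script that sets parameters by assignment; this is an assumption, not something stated here, so check the configuration files shipped with the examples for the exact format. The parameter names are those described in the Configuration file parameters section, and the values are only examples:

```r
# config.R (hypothetical content, assuming plain R assignments are used)
csvfile   <- "filesAll.csv"  # list of data files to process
csvsep    <- ","             # column separator
csvdec    <- "."             # decimal point character
lowMz     <- 900.0           # required: lowest m/z value used
highMz    <- 2000.0          # required: highest m/z value used
normalize <- 2               # normalize the sum of matched peak intensities
label     <- "demo"          # example experiment name
```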

If you find the next section too difficult, try our common use-case examples for multiMS-toolbox.

The current version of multiMS-toolbox was successfully tested on Windows 7 64-bit with R 3.5.0 (64-bit environment).

Configuration file parameters

Parameters for the file with MS peaks and spectrum information

csvfile – Excel’s csv file with the csvsep column separator. The default value is "filesAll.csv". The file should have column headers on the first line and at least the following columns:

filesName – containing the names of data files to process.
filesColorProperty – vector of string representations of graph colors for given files; the same string for two different files means that their data points will be drawn with the same color.
filesShapeProperty – vector of string representations of graph point shapes for given files; the same string for two different files means that their data points will be drawn with the same shape.
filesSpectrum – containing the names of spectrum files for given data files; this column is required if findRealValuesForMissingPeaks is set to 1, if normalize is set to 1 or 3, or if useFullSpectra is set to 1.

csvsep – the delimiter character between the columns in the read csvfile and in the written output files. The default value is ",".

csvdec – the decimal point character in read input files and written output files. For English, set it to ".". The default value is ".".
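
To make the expected layout of the csvfile concrete, the sketch below writes a small example file and reads it back with base R. The data file names and the group labels ("fresh", "aged", "egg") are made up for illustration, and read.csv() is used here only to show the structure; it is not necessarily how the toolbox reads the file:

```r
csvsep <- ","
csvdec <- "."

# Hypothetical list of files; the file names and group labels are made up.
csvfile <- file.path(tempdir(), "filesAll.csv")
writeLines(c(
  "filesName,filesColorProperty,filesShapeProperty,filesSpectrum",
  "sample1_run1.txt,fresh,egg,sample1_run1_spectrum.txt",
  "sample1_run2.txt,fresh,egg,sample1_run2_spectrum.txt",
  "sample2_run1.txt,aged,egg,sample2_run1_spectrum.txt"
), csvfile)

files <- read.csv(csvfile, sep = csvsep, dec = csvdec, stringsAsFactors = FALSE)

# Files sharing a filesColorProperty value would be drawn in the same color.
colorGroups <- table(files$filesColorProperty)
print(colorGroups)
```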

Crop spectrum parameters

lowMz – required parameter: the lowest used and displayed m/z value. No default value.

highMz – required parameter: the highest used and displayed m/z value. No default value.

Peak intensity and spectrum normalization parameters

normalize – input data normalization method:

0 – not normalized.
1 – peak intensities or areas are normalized by median of spectrum intensity ratios of each data spectrum to the template spectrum passed in the normalizedTemplateSpectrumFor1 parameter or to the first data sample spectrum.
2 – sum of all matched peak intensities or areas is normalized to the same value (sum of the first data sample).
3 – sum of the whole (cropped) spectrum area is normalized to the same value (sum of the first data sample).
4 – spectrum is divided by best matching exponential line, this option is implemented only for full spectrum analysis.
5 – each intensity is scaled among the samples to have standard deviation equal to 1, this option is implemented only for full spectrum analysis.

The default value is 0.
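
The arithmetic behind some of these methods can be sketched on made-up numbers. The block below is only an illustration of methods 1, 2 and 5 as they are described above, not the toolbox's own code; all variable names and values are hypothetical:

```r
# Rows are samples, columns are matched peaks (made-up intensities).
intens <- matrix(c(10, 20, 30,
                   20, 40, 60,
                    5, 10, 45), nrow = 3, byrow = TRUE)

# Method 2: scale each sample so that its peak sum equals that of the first sample.
target <- sum(intens[1, ])
norm2 <- intens * (target / rowSums(intens))

# Method 5 (described for full spectra): divide each column by its standard
# deviation across samples, so every column ends up with standard deviation 1.
norm5 <- sweep(intens, 2, apply(intens, 2, sd), "/")

# Method 1, reduced to its core: the median of the point-wise ratios between
# a sample spectrum and the template spectrum gives one scaling factor.
template <- c(1, 2, 4, 2, 1)
spectrum <- c(2.2, 4, 8, 3.8, 2)
ratio <- median(spectrum / template)
norm1 <- intens[2, ] / ratio
```

In this example the second sample is roughly twice as intense as the template, so method 1 divides its peak intensities by a factor of 2.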

normalizedTemplateSpectrumFor1 – filename of the spectrum, which will be used as the template spectrum. If set to NULL then the first data sample spectrum will be used instead. The parameter is used only if normalize is set to 1 (normalization by median of spectrum intensity ratios of each data spectrum to template spectrum). The default value is NULL.

normalizeLowMz – the m/z start value of the spectrum normalization interval, valid only if normalize is set to 1 or a higher value. The default value is NULL, i.e. the same as lowMz.

normalizeHighMz – the m/z end value of the spectrum normalization interval, valid only if normalize is set to 1 or a higher value. The default value is NULL, i.e. the same as highMz.

useFullSpectra – 0 means run PCA on peaks, 1 means run PCA on full spectrum data; several other options such as areaBased or deisotoping are then switched off. If this option is set, then the filesSpectrum column is required in the input csvfile. The default value is 0.

areaBased – which peak values to use for the PCA:

0 – peak intensities.
1 – peak areas. Assuming Gaussian distribution of peak intensities for each peak, areas are computed as

area = (full width at half maximum) × (peak intensity).

When used, fwhm and int columns are required in the data files.
2 – peak areas or partial peak areas. If partial areas are proportional to the whole area, PCA can be run only on partial peak areas (this holds for the Gaussian distribution for the intensities of each peak). When used, the area column is required in the data files.

The default value is 1. The parameter is valid only if useFullSpectra is set to 0.
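
For areaBased=1, the computation amounts to one multiplication per peak. The sketch below uses made-up peak values; the comparison with the exact Gaussian area is added only to show that the two differ by a constant factor, which is the same for every peak:

```r
# Made-up peaks with the columns required for areaBased = 1.
peaks <- data.frame(mz = c(1000.5, 1100.2), int = c(100, 50), fwhm = c(0.20, 0.25))

# Area as described above: product of FWHM and peak intensity.
peaks$area <- peaks$fwhm * peaks$int

# Exact area of a Gaussian with the same height and FWHM, for comparison.
gaussFactor <- sqrt(pi / (4 * log(2)))   # constant, approximately 1.0645
peaks$gaussArea <- gaussFactor * peaks$area
```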

deisotoping – 0 means all peaks are used, 1 means each cluster is replaced by a single peak with the m/z value of the first peak in the cluster. The intensity or area of this peak is then the sum of the processed intensities or areas of all the peaks within the cluster. When used, the deisotoping_grp column is required in the data files. The parameter is valid only if useFullSpectra is set to 0.
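
The effect of deisotoping=1 can be illustrated on a small made-up peak list. This is a sketch under the assumption that the "first peak" of a cluster is the one with the lowest m/z; the toolbox's actual implementation may differ:

```r
# Made-up peaks: two isotope clusters ("1", "2") and one unclustered peak.
peaks <- data.frame(
  mz  = c(1000.5, 1001.5, 1002.5, 1100.2, 1200.7, 1201.7),
  int = c(100, 60, 20, 50, 80, 40),
  deisotoping_grp = c("1", "1", "1", "None", "2", "2"),
  stringsAsFactors = FALSE
)

# Peaks outside any cluster are kept unchanged.
single <- peaks[peaks$deisotoping_grp == "None", c("mz", "int")]

# Each cluster collapses to one peak: lowest m/z, summed intensity.
clustered <- peaks[peaks$deisotoping_grp != "None", ]
groups <- split(clustered, clustered$deisotoping_grp)
merged <- do.call(rbind, lapply(groups, function(g) {
  data.frame(mz = min(g$mz), int = sum(g$int))
}))

result <- rbind(single, merged)
result <- result[order(result$mz), ]
rownames(result) <- NULL
print(result)
```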

sn_cut – signal to noise ratio threshold. The default value is 0.0. When used with the value ≥ 0.0, the sn column is required in the data files. The parameter is valid only if useFullSpectra is set to 0.

maxDistance1 – the maximum m/z distance where peaks are treated as of the same m/z value. The default value is 0.3. The parameter is valid only if useFullSpectra is set to 0.

maxDistance2 – the maximum m/z distance where already matched groups of peaks (matched among several files by maxDistance1) will be treated as only one peak. The value is used only if it is higher than the maxDistance1 value. For more information about matching the peaks, see the Remarks section of the matchPeaks function. The default value is 0.51. The parameter is valid only if useFullSpectra is set to 0.

useRelativeMaxDistance – 0 means the maxDistance1 and maxDistance2 parameters are treated as absolute size of the interval to search, 1 means the maxDistance1 and maxDistance2 parameters are treated as multiplication coefficients. The absolute size of the interval to search is then computed as maxDistance1 (or maxDistance2) multiplied by the m/z value of the peak. For peaks with large m/z values, you can use for example:
useRelativeMaxDistance=1, maxDistance1=0.00015, maxDistance2=0.000255
The default value is 0. The parameter is valid only if useFullSpectra is set to 0.
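
The difference between the absolute and the relative setting can be sketched with a small helper. The function name matchTolerance is hypothetical and not part of the toolbox; it only reproduces the multiplication described above:

```r
# Size of the search interval for a peak at a given m/z (hypothetical helper).
matchTolerance <- function(mz, maxDistance, useRelativeMaxDistance = 0) {
  if (useRelativeMaxDistance == 1) {
    maxDistance * mz                   # relative: coefficient times m/z
  } else {
    rep(maxDistance, length(mz))       # absolute: same size everywhere
  }
}

mz <- c(1000, 2000, 10000)
absTol <- matchTolerance(mz, 0.3)
relTol <- matchTolerance(mz, 0.00015, useRelativeMaxDistance = 1)
```

With the relative setting and maxDistance1=0.00015, the interval grows from 0.15 at m/z 1000 to 1.5 at m/z 10000, while the absolute setting keeps it at 0.3 throughout.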

findRealValuesForMissingPeaks – if set to 1, then for missing peaks (no match in the given data file) the absolute intensity value is approximated from the original spectrum file instead of being set to 0 (i.e. sn=0.0 intensity value for baseline subtracted intensity), or their area is approximated from the intensity found in the original spectrum file and from the minimum fwhm found among the matched peaks of the given m/z. If this option is set, then the filesSpectrum column is required in the input csvfile. The default value is 1. The parameter is valid only if useFullSpectra is set to 0.
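
One way to picture this approximation is the sketch below. It assumes linear interpolation between neighbouring spectrum points and an area estimate formed as the product of fwhm and intensity; both are assumptions made for illustration, and all values are made up:

```r
# Made-up spectrum around a peak that was not detected in this data file.
spectrum <- data.frame(mz  = c(1000.0, 1000.2, 1000.4, 1000.6),
                       int = c(5, 9, 7, 4))
missingMz <- 1000.3

# Assumption: linear interpolation between the neighbouring spectrum points.
approxInt <- approx(spectrum$mz, spectrum$int, xout = missingMz)$y

# Area estimate from the interpolated intensity and the minimum fwhm found
# among the peaks matched at this m/z in the other files (made-up values).
matchedFwhm <- c(0.22, 0.18, 0.25)
approxArea <- min(matchedFwhm) * approxInt
```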

fullSpectraDivide1MzBy – all the available spectrum data are interpolated from lowMz to highMz, and each interval of 1 m/z is interpolated at fullSpectraDivide1MzBy intermediate values. The default value is 50. The parameter is valid only if useFullSpectra is set to 1.
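
The resampling onto a common m/z grid can be sketched with base R's approx(). The block assumes linear interpolation and a grid that includes both lowMz and highMz; neither detail is stated above, and the spectrum values are made up:

```r
lowMz <- 900
highMz <- 902
fullSpectraDivide1MzBy <- 50

# Common m/z grid with a step of 1 / fullSpectraDivide1MzBy.
grid <- seq(lowMz, highMz, by = 1 / fullSpectraDivide1MzBy)

# Made-up spectrum sampled at irregular m/z points.
spectrum <- data.frame(mz = c(899.5, 900.4, 901.1, 902.3), int = c(1, 3, 2, 4))

# Assumption: linear interpolation onto the common grid.
onGrid <- approx(spectrum$mz, spectrum$int, xout = grid)$y
```

Once every spectrum is resampled onto the same grid, the spectra can be stacked into one matrix with one row per sample, which is the form a PCA needs.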

fullSpectraMzTemplate:

if set to any file name, then full spectra are interpolated at the m/z points read from the first column of the given file, restricted to the <lowMz, highMz> interval.
if set to 1, then full spectra are interpolated at the m/z points read from the first sample spectrum file, restricted to the <lowMz, highMz> interval.
if set to -1, then full spectra are assumed to be already interpolated and only values inside the <lowMz, highMz> interval are used.

The default value is NULL, i.e. use fullSpectraDivide1MzBy instead. The parameter is valid only if useFullSpectra is set to 1.

Experiment output parameters

label – character string to print in graphs and to use for file names (i.e. the name of the experiment). The default value is "".

numOfPCAComponents – number of principal components to show and to draw their graphs. The default value is 3.

itemsLabelAtMost – in the PCA scores plot, this parameter specifies up to how many data points the plot will still be drawn with a label for each data point; in the PCA loadings plot, it specifies how many of the most extreme points will be plotted with their m/z values. If the PCA scores plot contains at most itemsLabelAtMost data points, then the data points will be plotted with their labels. Each label is a number representing the read order of the given data point (data line in the original csvfile). In the PCA loadings plot, only the itemsLabelAtMost most extreme data points are plotted with their m/z values. The default value is 25.
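
The two uses of itemsLabelAtMost can be sketched as follows. The block assumes that "most extreme" means the largest absolute loading, which is an interpretation rather than something stated above; the loadings and the sample count are made up:

```r
itemsLabelAtMost <- 3

# Made-up loadings of one principal component, named by m/z.
loadings <- c("1000.5" = 0.05, "1100.2" = -0.62, "1200.7" = 0.40,
              "1300.1" = -0.02, "1400.9" = 0.66)

# Loadings plot: pick the points that would be annotated with their m/z
# (assumption: "most extreme" = largest absolute loading).
extreme <- head(names(sort(abs(loadings), decreasing = TRUE)), itemsLabelAtMost)

# Scores plot: label the points only if there are few enough of them.
numSamples <- 12
labelScores <- numSamples <= itemsLabelAtMost
```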

legendColorPropertyLabel – the string shown in graph legends for grouping of samples based on colors (the optional filesColorProperty column in the csvfile). The default value is "Colors".

legendShapePropertyLabel – the string shown in graph legends for grouping of samples based on shapes (the optional filesShapeProperty column in the csvfile). The default value is "Shapes".

pdfFileWidthCm – the width of produced file outputs (in cm). The default value is 20.

pdfFileHeightCm – the height of produced file outputs (in cm). The default value is 20.

outputdev – the extension (type) of produced file outputs. The default value is "pdf" for PDF files; other available formats are for example "tiff" for uncompressed TIFF or "png" for PNG files.

dpi – dpi value for rasterized file outputs (tiff or png). The default value is 300.
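
How these four parameters could map onto R's standard graphics devices is sketched below. The helper openOutputDevice is hypothetical and not part of the toolbox; note that pdf() expects its size in inches, so the centimetre values are divided by 2.54, while png() and tiff() accept centimetres directly together with a resolution:

```r
# Hypothetical helper: open a graphics device matching the output parameters.
openOutputDevice <- function(file, outputdev = "pdf",
                             widthCm = 20, heightCm = 20, dpi = 300) {
  switch(outputdev,
    pdf  = pdf(file, width = widthCm / 2.54, height = heightCm / 2.54),
    png  = png(file, width = widthCm, height = heightCm, units = "cm", res = dpi),
    tiff = tiff(file, width = widthCm, height = heightCm, units = "cm",
                res = dpi, compression = "none"),
    stop("unsupported output device: ", outputdev)
  )
}

# Usage: draw a placeholder plot into a 20 cm x 20 cm PDF file.
outFile <- file.path(tempdir(), "example.pdf")
openOutputDevice(outFile, outputdev = "pdf")
plot(1:10, main = "example")
invisible(dev.off())
```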

Speed processing parameters

fast – 0 means compute and show all outputs, 1 means some long-running, time-consuming but non-essential outputs are omitted. The default value is 1.

Required format of peak and spectrum data files

MS Data files:

Data files, whose names are listed in the csvfile, should have the structure:

The first line should contain the names of the data columns and every other line should contain the data for one peak. At least the columns mz and int are required. Depending on the settings, the following columns are also required: sn (when sn_cut has a value ≥ 0.0), fwhm (when areaBased=1), area (when areaBased=2), and deisotoping_grp (when deisotoping=1). The column delimiter should be the tab character.

MS Data file columns:

mz – m/z of the peak.

int – peak intensity after preprocessing.

sn – signal to noise ratio of the peak.

fwhm – full width at half maximum of the peak.

deisotoping_grp – either "None", if not a part of any peak cluster, or the number of the peak cluster.

area – area or partial area of the peak.

MS Spectrum file columns:

The spectrum is read from the first two data columns (assuming no header line). The first column contains the m/z values and the second column contains the spectrum values (processed peak intensities). The spectrum data files need not be sampled at exactly the same m/z points. The column delimiter should be the tab character.
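
Both file layouts can be illustrated by writing two small example files and reading them back with base R. The file names and all values are made up, and read.table() is used only to show the structure; it is not necessarily how the toolbox reads the files:

```r
# Hypothetical peak data file: header line, tab-separated, one peak per line.
dataFile <- file.path(tempdir(), "sample1_run1.txt")
writeLines(c(
  "mz\tint\tsn\tfwhm\tdeisotoping_grp\tarea",
  "1000.5\t100\t12.0\t0.20\t1\t21.3",
  "1001.5\t60\t7.5\t0.21\t1\t13.4",
  "1100.2\t50\t6.1\t0.25\tNone\t13.3"
), dataFile)
peaks <- read.table(dataFile, header = TRUE, sep = "\t", dec = ".",
                    stringsAsFactors = FALSE)

# Hypothetical spectrum file: no header, m/z in column 1, intensity in column 2.
spectrumFile <- file.path(tempdir(), "sample1_run1_spectrum.txt")
writeLines(c("1000.0\t5", "1000.2\t9", "1000.4\t7"), spectrumFile)
spectrum <- read.table(spectrumFile, header = FALSE, sep = "\t", dec = ".")[, 1:2]
names(spectrum) <- c("mz", "int")
```

Because deisotoping_grp mixes cluster numbers with the text "None", the column is read as character data in this sketch.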

Implemented functions

runPCA function

Reads the data from the csvfile and normalizes them, matches the appropriate peaks, runs the PCA using Singular Value Decomposition, draws the graphs for components, computes ANOVA for each PCA component and group of data samples, exports the results.
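
The statistical core of this pipeline, PCA by singular value decomposition followed by an ANOVA on the component scores, can be sketched in base R on simulated data. This is not the code of runPCA; the centering step, the group names and all variable names are assumptions made for illustration:

```r
set.seed(1)

# Simulated data: 10 samples in two groups, 6 matched peaks.
group <- factor(rep(c("fresh", "aged"), each = 5))
x <- matrix(rnorm(10 * 6), nrow = 10)
x[group == "aged", 1:2] <- x[group == "aged", 1:2] + 3

# PCA via SVD of the column-centered data matrix (centering is an assumption).
centered <- scale(x, center = TRUE, scale = FALSE)
dec <- svd(centered)
scores <- dec$u %*% diag(dec$d)          # sample coordinates (scores plot)
loadings <- dec$v                        # peak contributions (loadings plot)
explained <- dec$d^2 / sum(dec$d^2)      # fraction of variance per component

# One-way ANOVA of the first component against the sample grouping.
fit <- aov(pc1 ~ group, data = data.frame(pc1 = scores[, 1], group = group))
pValue <- summary(fit)[[1]][["Pr(>F)"]][1]
```

Up to the sign of each component, the scores computed this way agree with those returned by R's prcomp(), which also uses a singular value decomposition.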

Allowed parameters:

paramsFile – configuration file to load parameters from. The default values are overridden by the values read from the configuration file, and those can in turn be overridden by additional parameters passed to the runPCA() function. Use forward slashes to delimit the path in Windows, e.g. "C:/multiMS-toolbox/examples/bacteria/configNotNormalized.R". The default value is NULL, i.e. no configuration file.
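
The order of precedence (defaults, then configuration file, then arguments of the call) can be pictured with base R's modifyList(). This is only a sketch of the described behavior with made-up values, not the toolbox's own mechanism:

```r
# Three sources of parameter values, from lowest to highest priority.
defaults   <- list(csvsep = ",", normalize = 0, numOfPCAComponents = 3)
fromConfig <- list(normalize = 2, numOfPCAComponents = 5)  # read from paramsFile
fromCall   <- list(numOfPCAComponents = 2)                 # passed to runPCA()

# Later lists override earlier ones, entry by entry.
effective <- modifyList(modifyList(defaults, fromConfig), fromCall)
str(effective)
```

Here csvsep keeps its default, normalize comes from the configuration file, and numOfPCAComponents comes from the call.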