[1]:
# Setup
import emzed
import os

# path to files
data_folder = os.path.join(os.getcwd(), "tutorial_data")
path_aa_peaks_table = os.path.join(
    data_folder, "amino_acids_table.csv"
)
path_pm_aa_std = os.path.join(data_folder, "AA_std_5uM.mzML")
path_sample_caulobacter = os.path.join(
    data_folder, "AA_sample_caulobacter.mzML"
)
path_sample_arabidopsis = os.path.join(
    data_folder, "AA_sample_arabidopsis.mzML"
)
start remote ip in /root/.emzed3/pyopenms_venv
pyopenms client started.
connected to pyopenms client.

2 First steps: a targeted analysis example

In this section, we will introduce some basic emzed features and data types required to perform a simple targeted workflow. Given example uses LC-MS measurements of proteinogenic amino acids ([M+H]+ ions) with an LTQ-Orbitrap instrument. It comprises a standard mixture containing 17 amino acids, and extracts of the two model organisms Arabidopsis thaliana and Caulobacter crescentus. If you want to run the code an commands introduced in this tutorial on your own, here you can download a functional emzed3 installation and here the required sample .mzML files.

€€ TODO: update link to sample files €€

Goals

  • Getting familiar with the emzed environment

  • First steps with emzed core data structures PeakMap and Table

  • First steps with emzed modules

  • Learning how to load, visualize and inspect raw data -> the peak map

  • Getting familiar with the iPython console

  • Basic operations with emzed Table

  • First steps with the editor

  • Using simple functions to structure workflow code

emzed.spyder environment

Emzed.spyder has all Spyder 4 functionalities and therefore can also be used as default environment for emzed independed projects (if you are not familiar with Spyder click here). In addition, emzed.spyder includes already the complete emzed3 functionalities.

spyder_environment

emzed name space

Everything starts with emzed.* emzed_dot

In the IPython console, you can directly access the emzed3 modules and its commands. All emzed functionalities are directly accessible.

Here is a short overview of the ``emzed`` modules:

  • ``align``: aligning retention time, m/z and samples

  • ``adducts``: provides addduct information

  • ``annotate``: features adduct annotation

  • ``ext``: accessing downloadable emzed extensions (i.e. mzmine2)

  • ``db``: provides integrated pubchem database

  • ``chemistry``: provides chemical information

  • ``gui``: provides predefined user dialogs

  • ``io``: load and save data

  • ``ms2``: supports simplified data access

  • ``quantification``: allows LC-MS peak integration

  • ``targeted``: features targeted peaks extraction

An example: To obtain a list of single charged adducts in the positive mode, execute the command:

[2]:
emzed.adducts.positive_single_charged
[2]:
id adduct_name m_multiplier adduct_add adduct_sub z sign_z
int str int str str int int
19 M+ 1 1 1
20 M+H 1 H 1 1
21 M+NH4 1 NH4 1 1
22 M+Na 1 Na 1 1
23 M+H-2H2O 1 H (H2O)2 1 1
24 M+H-H2O 1 H H2O 1 1
25 M+K 1 K 1 1
26 M+ACN+H 1 C2H3NH 1 1
27 M+2ACN+H 1 (C2H3N)2H 1 1
28 M+ACN+Na 1 (C2H3N)1Na 1 1
29 M+2Na-H 1 Na2 H 1 1
30 M+Li 1 Li 1 1
31 M+CH3OH+H 1 CH3OHH 1 1
32 M+2K-H 1 K2 H 1 1
33 M+IsoProp+H 1 (C3H8O)1H 1 1
34 M+IsoProp+Na+H 1 (C3H8O)1NaH 1 1
35 M+DMSO+H 1 (C2H6OS)1H 1 1
36 2M+H 2 H 1 1
37 2M+NH4 2 NH4 1 1
38 2M+Na 2 Na 1 1
39 2M+K 2 K 1 1
40 2M+ACN+H 2 (C2H3N)1H 1 1
41 2M+ACN+Na 2 (C2H3N)1Na 1 1

Note: In the IPython console you can directly access the command documentation (docstring) by adding a ``?`` at the end.

Inspecting raw data

To inspect raw data (mzML, mzXML files) two commands are required:

pm = emzed.io.load_peak_map()emzed.gui.inspect(pm)

peakmap_explorer Figure 2: Visualizing the raw data of an example file. emzed peak map explorer provides data visualization as heat map with retention time values (RT) on x-axis and m/z values on y-axis, corresponding total ion chromatogram (TIC, sum of all m/z peaks), and the overlay of all m/z spectra.

The output is an emzed Table containing all information required to determine the corresponding molecular mass of an adduct ion.

Peak extraction with emzed Table

Table is the core data structure of emzed. It appears as typical table with rows and columns. If you are familiar with pandas data frames you will see some similar properties. Each column has a unique name shown in the first row and contains values of a defined data types. In the following we will use Table to visualize peaks of LC-MS raw data measurements.

We will start by loading a csv file containing mz data of selected amino acids as Table. Note, it is possible to save Tables as and load Tables from CSV end XLS(X) files enabling straightforward creation of tables for targeted peaks extraction.

[3]:
aa_peaks_table = emzed.io.load_csv(
    path_aa_peaks_table
)  # we provide the path of .csv file...
# Without arguments a user dialog will open.
aa_peaks_table[:5]  # list indexing on table rows
[3]:
compound mf m0 mz
str str float MzType
Phenylalanine C9H11NO2 165.079000  166.086300
Leucine C6H13NO2 131.094600  132.101900
Tryptophane C11H12N2O2 204.089900  205.097200
Isoleucine C6H13NO2 131.094600  132.101900
Methionine C5H11NO2S 149.051100  150.058300

When printing the table, you get name, type and the formatted content of each table column. In our case we have columns of types int, str, and float. Concerning column names there are two different groups: predefined column names and user defined. Predefined columns have fixed names, types, formats and they are required for most Table operations with MS data. You can find a complete list of predefined column names link More general information about the Table is available via the command:

[4]:
aa_peaks_table.summary()
[4]:
id name type format nones len min max distinct values
int str str str int int float float int
0 compound str %s 0 21 - - 21
1 mf str %s 0 21 - - 19
2 m0 float %f 0 21 75.032000 204.089900 19
3 mz MzType %11.6f 0 21 76.039300 205.097200 19

For each column the command provides content information i.e. name, type, format.

We can use now given table to inspect the amino acid peaks of our standard file.

To visualize LC-MS peaks a Table requires the columns ``id``, ``mzmin``, ``mzmax``, ``rtmin``, ``rtmax``, and ``peakmap``. These columns are used to define the LC-MS peaks that will later be extracted and integrated. mzmin and mzmax are the lower and upper bounds of the mz values, while rtmin and rtmax define the retention time boundaries of the peak. The peakmap column contains the LC-MS peak map, i.e. the raw data.

We will use Table method ``add_column`` to add those columns to aa_table:

[5]:
aa_peaks_table.add_column(
    "mzmin", aa_peaks_table.mz - 0.003, emzed.MzType
)
aa_peaks_table.add_column(
    "mzmax", aa_peaks_table.mz + 0.003, emzed.MzType
)

add_column requires three arguments: column name, the values you want to add, and the value type. In given example we add a new column mzmin using existing column mz: aa.mz - 0.003, meaning, we take the values of column mz in table aa and subtract 0.003. In other words, we can directly perform operations on columns. Note, it is mandatory to define the ``type_`` argument. Also note, rt and mz columns have a specific emzed type.

Next, we have to add the columns rtmin and rtmax. Currently we do not know the retention times (RT) of the different amino acid peaks. One way to access RTs is inspecting again the peak map and then manually extract the values. For now, we will just set rtmin and rtmax such that they cover the entire length of the run. To do so, we first load the PeakMap object with:

[6]:
# we load the peakmap from path path_aa_std
peakmap_std = emzed.io.load_peak_map(path_pm_aa_std)

Similar to Table, PeakMap provide a multitude of methods. The PeakMap method ``rt_range()`` returns a tuple with the first value corresponding to the RT of the first measured spectrum and the second to the last. Please note that RT values are given in seconds. We can extract rtmin and rtmax by the command:

[7]:
rtmin, rtmax = peakmap_std.rt_range()
print(f"rtmin = {rtmin}, rtmax = {rtmax}")
rtmin = 0.3225, rtmax = 599.8278

Now we can add the columns rtmin and rtmax to the table:

[8]:
aa_peaks_table.add_column_with_constant_value(
    "rtmin", rtmin, emzed.RtType
)
aa_peaks_table.add_column_with_constant_value(
    "rtmax", rtmax, emzed.RtType
)

You might have realized, that we used the method add_column_with_constant_value since Table distinguish between an iterable with potential different row values and a constant value, respectively. Ignoring the difference will cause an error.

Next, we have to add the peak map to the table:

[9]:
aa_peaks_table.add_column_with_constant_value(
    "peakmap", peakmap_std, emzed.PeakMap
)

Finally the column ``id``, providing unique row identification values, is added:

[10]:
aa_peaks_table.add_enumeration()
[11]:
aa_peaks_table[:10]  # or aa_table.print_(max_rows=10)
[11]:
id compound mf m0 mz mzmin mzmax rtmin rtmax
int str str float MzType MzType MzType RtType RtType
0 Phenylalanine C9H11NO2 165.079000  166.086300  166.083300  166.089300   0.01 m  10.00 m
1 Leucine C6H13NO2 131.094600  132.101900  132.098900  132.104900   0.01 m  10.00 m
2 Tryptophane C11H12N2O2 204.089900  205.097200  205.094200  205.100200   0.01 m  10.00 m
3 Isoleucine C6H13NO2 131.094600  132.101900  132.098900  132.104900   0.01 m  10.00 m
4 Methionine C5H11NO2S 149.051100  150.058300  150.055300  150.061300   0.01 m  10.00 m
5 Valine C5H11NO2 117.079000  118.086300  118.083300  118.089300   0.01 m  10.00 m
6 Proline C5H9NO2 115.063300  116.070600  116.067600  116.073600   0.01 m  10.00 m
7 Tyrosine C9H11NO3 181.073900  182.081200  182.078200  182.084200   0.01 m  10.00 m
8 Cysteine C3H7NO2S 121.019700  122.027000  122.024000  122.030000   0.01 m  10.00 m
9 Alanine C3H7NO2 89.047700   90.055000   90.052000   90.058000   0.01 m  10.00 m
[12]:
aa_peaks_table.summary()
[12]:
id name type format nones len min max distinct values
int str str str int int float float int
0 id int %d 0 21 0.000000 20.000000 21
1 compound str %s 0 21 - - 21
2 mf str %s 0 21 - - 19
3 m0 float %f 0 21 75.032000 204.089900 19
4 mz MzType %11.6f 0 21 76.039300 205.097200 19
5 mzmin MzType %11.6f 0 21 76.036300 205.094200 19
6 mzmax MzType %11.6f 0 21 76.042300 205.100200 19
7 rtmin RtType rt_formatter 0 21 0.322500 0.322500 1
8 rtmax RtType rt_formatter 0 21 599.827800 599.827800 1
9 peakmap PeakMap None 0 21 - - 1

Now the table is ready for inspection. Just type:

emzed.gui.inspect(aa_peaks_table)

amino_acid_table Figure 2: Visualization of an emzed Table using the Table Explorer. The inset on the bottom right shows the Peakmap Explorer that opens up after double clicking an entry in the column peakmap.

Some comments

  • LC-MS Peaks Tables require columns with fixed names and types.

  • Required columns ``mz``, ``mzmin``, ``mzmax``, ``rtmin``, ``rtmax`` and ``peakmap`` have emzed specific types, which can be directly accessed via emzed.MzType and emzed.RtType. They come along with special formatting. In particular, rt values are given in seconds, but shown in minutes.

  • The column peakmap allows keeping track to the raw data.

Defining EIC-peaks. Since we did not yet define the accurate retention time (RT) window, each row shows the complete EIC of the corresponding amino acid. The easiest way to adapt the RT windows is using the emzed ``integrate`` method, which not only calculates the peak area but also features interactive definition of RT windows. Until now, we applied only Table methods like i.e. add_column operating in place, meaning the method modifies the table but does not create a new Table object. This is memory efficient because we do not have to copy the object after each operation. integrate is not a Table method and will return a new object. We will first explore how to integrate: ~~~ emzed.quantification.integrate(peak_table, peak_shape_model)
~~~

The function ``integrate`` requires the two mandatory arguments peak_table and peak_shape_model. Available peak integration models are:`linear <https://en.wikipedia.org/wiki/Trapezoidal_rule>`__,`sgolay <https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter>`__, ``no_integration``,`asym_gauss <https://en.wikipedia.org/wiki/Skew_normal_distribution>`__ and `emg <https://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution>`__. The two models linear and sgolay are both using trapezoidal rule as numerical approximation method and do not assume a defined peak shape. The sgolay method includes noise filtering step prior to integration. In contrast, asym_gauss and emg_exact model are assuming distinct, Gaussian-related distributions. Last but not least, no_integration returns None as the area. Note, integrate has a large number of optional predefined arguments that we can simply keep as they are and that we will discuss later. Hence, we can integrate aa_table as follows:

[13]:
aa_peaks_table = emzed.quantification.integrate(
    aa_peaks_table, "linear", show_progress=False
)
aa_peaks_table.summary()
[13]:
id name type format nones len min max distinct values
int str str str int int float float int
0 id int %d 0 21 0.000000 20.000000 21
1 compound str %s 0 21 - - 21
2 mf str %s 0 21 - - 19
3 m0 float %f 0 21 75.032000 204.089900 19
4 mz MzType %11.6f 0 21 76.039300 205.097200 19
5 mzmin MzType %11.6f 0 21 76.036300 205.094200 19
6 mzmax MzType %11.6f 0 21 76.042300 205.100200 19
7 rtmin RtType rt_formatter 0 21 0.322500 0.322500 1
8 rtmax RtType rt_formatter 0 21 599.827800 599.827800 1
9 peakmap PeakMap None 0 21 - - 1
10 peak_shape_model str %s 0 21 - - 1
11 area float %.2e 0 21 0.000000 27402460.271635 19
12 rmse float %.2e 0 21 0.000000 0.000000 1
13 model object None 0 21 - - 19
14 valid_model bool bool_formatter 0 21 - - 1

The function integrate adds 5 additional columns to the table. the columns area and rmse are showing peak area and the mean root square error (rmse) of the area. The latter describes the rmse of the selected model named column peak_shape_model in comparison with the linear model as reference. For each peaks, the column model contains ab object with its individual integration parameters. Finally, the column valid_model is internal and can be ignored.

After integration, inspecting the resulting Table t gives:

emzed.gui.inspect(aa_peaks_table)integrated_table

Now, we are able to adapt the RT window width for each peak manually: We change the peak window boarders with the mouse and we adapt rtmin and rtmax by simply reintegrating the peak (pressing the Update area button). adapt_rt_by_integration

Summarized. With the two Table methods ``add_column`` and ``add_column_with_constant_value`` we could build an emzed Table that allows extracting and visualizing EICs from a raw data file using a predefined list of compounds given as .csv file. We can use the function ``integrate`` and the interactive Table explorer to manually select peaks within EICs resulting in a peak table containing individual RT/mz windows for all peaks.

Simple scripts

It is quite cumbersome to execute required commands one by one in the IPython console and organizing the commands in a script is a much better way to proceed. The easiest way to do so is writing all the command we have executed so far into a Python script. Spyder provides a history log showing us all commands we executed so far.

command_history

We can proceed by simply copy paste the commands shown in history log into the editor and remove all commands that are not required. We end up with a script similar to the following:

import emzed # loading data peakmap_std = emzed.io.load_peak_map() aa_peaks_table=emzed.io.load_csv() # creating peaks table rtmin, rtmax = peakmap_std.rt_range() aa_peaks_table.add_column('mzmin', aa_peaks_table.mz - 0.003, emzed.MzType) aa_peaks_table.add_column('mzmax', aa_peaks_table.mz + 0.003, emzed.MzType) aa_peaks_table.add_column_with_constant_value('rtmin', rtmin, emzed.RtType) aa_peaks_table.add_column_with_constant_value('rtmax', rtmax, emzed.RtType) aa_peaks_table.add_column_with_constant_value('peakmap', peakmap_std, emzed.PeakMap) aa_peaks_table.add_enumeration() # integrate peaks table aa_peaks_table = emzed.quantification.integrate(aa_peaks_table, 'linear') # adapt retention of peaks manually emzed.gui.inspect(aa_peaks_table)

You can execute the code by pressing F5. When executing the script for the first time a cofiguration window will pop up. Just press the window ok button and the script gets executed.

There are several disadvantages of direct code execution. Since all intermediate results end up in the workspace, longer scripts can provoke messy workspaces. Code readability counts: Long unstructured scripts are hard to understand in particular, when the workflow is not a linear sequence of commands. Finally, you cannot execute only parts of the script and each time you have to run the complete script. To increase code reproducibility we can organize functional blocks of our scripts into Pythin functions, i.e. a function building a peaks_table that allow manual RT window definition:

[14]:
def build_peak_table(t, peakmap, mztol=0.003):
    """builds table defining EICs inspectable in peakmap
    t: Table with required column `mz`
    peakmap: emzed PeakMap
    """
    t.add_column("mzmin", t.mz - mztol, emzed.MzType)
    t.add_column("mzmax", t.mz + mztol, emzed.MzType)
    rtmin, rtmax = peakmap_std.rt_range()
    t.add_column_with_constant_value("rtmin", rtmin, emzed.RtType)
    t.add_column_with_constant_value("rtmax", rtmax, emzed.RtType)
    t.add_enumeration()
    return emzed.quantification.integrate(t, "linear")

We can load the function by pressing again ``F5`` and it is available in the IPython console.

Batch extraction of amino acids. The sample containing amino acid standards allowed us to define the amino acid peaks via mzmin, mzmax, rtmin, rtmax. To perform a targeted extraction of amino acid peaks from other samples, we simply have to replace the peak map of the standard in the peaks table, by the sample peak map, and reintegrate the table. Since the method replace operates in place we would modify our peaks with each extraction and the original peak map containing the standards will get lost. To avoid this we will create a copy of aa_peak_table.

[15]:
peak_table = aa_peaks_table.copy()
sample = emzed.io.load_peak_map(path_sample_caulobacter)
peak_table.replace_column_with_constant_value(
    "peakmap", sample, emzed.PeakMap
)
t_sample = emzed.quantification.integrate(
    peak_table, "linear", show_progress=False
)

Let’s inspect the resulting Table t_sample:

integrated sample

For a greater number of samples it makes sense to write a little routine to extract amino acids from a batch of samples that could look like this:

def extract_peaks(peak_table, peakmap_sample): pt = peak_table.copy) pt.replace_column_with_constant_value('peakmap', peakmap_sample, emzed.PeakMap) return emzed.quantification.integrate(pt, 'linear') # load peak table peak_table = emzed.io.load_table() # load sample batch samples = emzed.io.load_peak_maps() # extract peaks from all samples with Python list comprehension results = [extract_peaks(peak_table, sample) for sample in samples] # To inspect peak extraction result use list indexing (i.e. last sample): emzed.gui.inspect(results[-1])

Back to top

© Copyright 2012-2024 ETH Zurich
Last build 2024-03-25 10:41:42.995953.
Created using Sphinx 7.2.6.