[1]:
# Setup
import emzed
import os
# path to files
data_folder = os.path.join(os.getcwd(), "tutorial_data")
path_aa_peaks_table = os.path.join(
data_folder, "amino_acids_table.csv"
)
path_pm_aa_std = os.path.join(data_folder, "AA_std_5uM.mzML")
path_sample_caulobacter = os.path.join(
data_folder, "AA_sample_caulobacter.mzML"
)
path_sample_arabidopsis = os.path.join(
data_folder, "AA_sample_arabidopsis.mzML"
)
start remote ip in /root/.emzed3/pyopenms_venv
pyopenms client started.
connected to pyopenms client.
In this section, we will introduce some basic emzed features and data types required to perform a simple targeted workflow. Given example uses LC-MS measurements of proteinogenic amino acids ([M+H]+ ions) with an LTQ-Orbitrap instrument. It comprises a standard mixture containing 17 amino acids, and extracts of the two model organisms Arabidopsis thaliana and Caulobacter crescentus. If you want to run the code an commands introduced in this tutorial on your own,
here you can download a functional emzed3 installation and here the required sample .mzML
files.
€€ TODO: update link to sample files €€
Getting familiar with the emzed environment
First steps with emzed core data structures PeakMap and Table
First steps with emzed modules
Learning how to load, visualize and inspect raw data -> the peak map
Getting familiar with the iPython console
Basic operations with emzed Table
First steps with the editor
Using simple functions to structure workflow code
Emzed.spyder has all Spyder 4 functionalities and therefore can also be used as default environment for emzed independed projects (if you are not familiar with Spyder click here). In addition, emzed.spyder includes already the complete emzed3 functionalities.
Everything starts with emzed.*
In the IPython console, you can directly access the emzed3 modules and its commands. All emzed functionalities are directly accessible.
Here is a short overview of the ``emzed`` modules:
``align``: aligning retention time, m/z and samples
``adducts``: provides addduct information
``annotate``: features adduct annotation
``ext``: accessing downloadable emzed
extensions (i.e. mzmine2)
``db``: provides integrated pubchem database
``chemistry``: provides chemical information
``gui``: provides predefined user dialogs
``io``: load and save data
``ms2``: supports simplified data access
``quantification``: allows LC-MS peak integration
``targeted``: features targeted peaks extraction
An example: To obtain a list of single charged adducts in the positive mode, execute the command:
[2]:
emzed.adducts.positive_single_charged
[2]:
id | adduct_name | m_multiplier | adduct_add | adduct_sub | z | sign_z |
---|---|---|---|---|---|---|
int | str | int | str | str | int | int |
19 | M+ | 1 | 1 | 1 | ||
20 | M+H | 1 | H | 1 | 1 | |
21 | M+NH4 | 1 | NH4 | 1 | 1 | |
22 | M+Na | 1 | Na | 1 | 1 | |
23 | M+H-2H2O | 1 | H | (H2O)2 | 1 | 1 |
24 | M+H-H2O | 1 | H | H2O | 1 | 1 |
25 | M+K | 1 | K | 1 | 1 | |
26 | M+ACN+H | 1 | C2H3NH | 1 | 1 | |
27 | M+2ACN+H | 1 | (C2H3N)2H | 1 | 1 | |
28 | M+ACN+Na | 1 | (C2H3N)1Na | 1 | 1 | |
29 | M+2Na-H | 1 | Na2 | H | 1 | 1 |
30 | M+Li | 1 | Li | 1 | 1 | |
31 | M+CH3OH+H | 1 | CH3OHH | 1 | 1 | |
32 | M+2K-H | 1 | K2 | H | 1 | 1 |
33 | M+IsoProp+H | 1 | (C3H8O)1H | 1 | 1 | |
34 | M+IsoProp+Na+H | 1 | (C3H8O)1NaH | 1 | 1 | |
35 | M+DMSO+H | 1 | (C2H6OS)1H | 1 | 1 | |
36 | 2M+H | 2 | H | 1 | 1 | |
37 | 2M+NH4 | 2 | NH4 | 1 | 1 | |
38 | 2M+Na | 2 | Na | 1 | 1 | |
39 | 2M+K | 2 | K | 1 | 1 | |
40 | 2M+ACN+H | 2 | (C2H3N)1H | 1 | 1 | |
41 | 2M+ACN+Na | 2 | (C2H3N)1Na | 1 | 1 |
Note: In the IPython console you can directly access the command documentation (docstring) by adding a ``?`` at the end.
To inspect raw data (mzML, mzXML files) two commands are required:
pm = emzed.io.load_peak_map()emzed.gui.inspect(pm)Figure 2: Visualizing the raw data of an example file. emzed peak map explorer provides data visualization as heat map with retention time values (RT) on x-axis and m/z values on y-axis, corresponding total ion chromatogram (TIC, sum of all m/z peaks), and the overlay of all m/z spectra.
The output is an emzed Table containing all information required to determine the corresponding molecular mass of an adduct ion.
Table is the core data structure of emzed. It appears as typical table with rows and columns. If you are familiar with pandas data frames you will see some similar properties. Each column has a unique name shown in the first row and contains values of a defined data types. In the following we will use Table to visualize peaks of LC-MS raw data measurements.
We will start by loading a csv file containing mz data of selected amino acids as Table. Note, it is possible to save Tables as and load Tables from CSV end XLS(X) files enabling straightforward creation of tables for targeted peaks extraction.
[3]:
aa_peaks_table = emzed.io.load_csv(
path_aa_peaks_table
) # we provide the path of .csv file...
# Without arguments a user dialog will open.
aa_peaks_table[:5] # list indexing on table rows
[3]:
compound | mf | m0 | mz |
---|---|---|---|
str | str | float | MzType |
Phenylalanine | C9H11NO2 | 165.079000 | 166.086300 |
Leucine | C6H13NO2 | 131.094600 | 132.101900 |
Tryptophane | C11H12N2O2 | 204.089900 | 205.097200 |
Isoleucine | C6H13NO2 | 131.094600 | 132.101900 |
Methionine | C5H11NO2S | 149.051100 | 150.058300 |
When printing the table, you get name, type and the formatted content of each table column. In our case we have columns of types int
, str
, and float
. Concerning column names there are two different groups: predefined column names and user defined. Predefined columns have fixed names, types, formats and they are required for most Table operations with MS data. You can find a complete list of predefined column names link More general information about the Table is
available via the command:
[4]:
aa_peaks_table.summary()
[4]:
id | name | type | format | nones | len | min | max | distinct values |
---|---|---|---|---|---|---|---|---|
int | str | str | str | int | int | float | float | int |
0 | compound | str | %s | 0 | 21 | - | - | 21 |
1 | mf | str | %s | 0 | 21 | - | - | 19 |
2 | m0 | float | %f | 0 | 21 | 75.032000 | 204.089900 | 19 |
3 | mz | MzType | %11.6f | 0 | 21 | 76.039300 | 205.097200 | 19 |
For each column the command provides content information i.e. name, type, format.
We can use now given table to inspect the amino acid peaks of our standard file.
To visualize LC-MS peaks a Table requires the columns ``id``, ``mzmin``, ``mzmax``, ``rtmin``, ``rtmax``, and ``peakmap``. These columns are used to define the LC-MS peaks that will later be extracted and integrated. mzmin
and mzmax
are the lower and upper bounds of the mz values, while rtmin
and rtmax
define the retention time boundaries of the peak. The peakmap
column contains the LC-MS peak map, i.e. the raw data.
We will use Table method ``add_column`` to add those columns to aa_table
:
[5]:
aa_peaks_table.add_column(
"mzmin", aa_peaks_table.mz - 0.003, emzed.MzType
)
aa_peaks_table.add_column(
"mzmax", aa_peaks_table.mz + 0.003, emzed.MzType
)
add_column
requires three arguments: column name, the values you want to add, and the value type. In given example we add a new column mzmin
using existing column mz
: aa.mz - 0.003, meaning, we take the values of column mz in table aa and subtract 0.003. In other words, we can directly perform operations on columns. Note, it is mandatory to define the ``type_`` argument. Also note, rt and mz columns have a specific emzed type.
Next, we have to add the columns rtmin
and rtmax
. Currently we do not know the retention times (RT) of the different amino acid peaks. One way to access RTs is inspecting again the peak map and then manually extract the values. For now, we will just set rtmin
and rtmax
such that they cover the entire length of the run. To do so, we first load the PeakMap object with:
[6]:
# we load the peakmap from path path_aa_std
peakmap_std = emzed.io.load_peak_map(path_pm_aa_std)
Similar to Table, PeakMap provide a multitude of methods. The PeakMap method ``rt_range()`` returns a tuple with the first value corresponding to the RT of the first measured spectrum and the second to the last. Please note that RT values are given in seconds. We can extract rtmin
and rtmax
by the command:
[7]:
rtmin, rtmax = peakmap_std.rt_range()
print(f"rtmin = {rtmin}, rtmax = {rtmax}")
rtmin = 0.3225, rtmax = 599.8278
Now we can add the columns rtmin
and rtmax
to the table:
[8]:
aa_peaks_table.add_column_with_constant_value(
"rtmin", rtmin, emzed.RtType
)
aa_peaks_table.add_column_with_constant_value(
"rtmax", rtmax, emzed.RtType
)
You might have realized, that we used the method add_column_with_constant_value
since Table distinguish between an iterable with potential different row values and a constant value, respectively. Ignoring the difference will cause an error.
Next, we have to add the peak map to the table:
[9]:
aa_peaks_table.add_column_with_constant_value(
"peakmap", peakmap_std, emzed.PeakMap
)
Finally the column ``id``, providing unique row identification values, is added:
[10]:
aa_peaks_table.add_enumeration()
[11]:
aa_peaks_table[:10] # or aa_table.print_(max_rows=10)
[11]:
id | compound | mf | m0 | mz | mzmin | mzmax | rtmin | rtmax |
---|---|---|---|---|---|---|---|---|
int | str | str | float | MzType | MzType | MzType | RtType | RtType |
0 | Phenylalanine | C9H11NO2 | 165.079000 | 166.086300 | 166.083300 | 166.089300 | 0.01 m | 10.00 m |
1 | Leucine | C6H13NO2 | 131.094600 | 132.101900 | 132.098900 | 132.104900 | 0.01 m | 10.00 m |
2 | Tryptophane | C11H12N2O2 | 204.089900 | 205.097200 | 205.094200 | 205.100200 | 0.01 m | 10.00 m |
3 | Isoleucine | C6H13NO2 | 131.094600 | 132.101900 | 132.098900 | 132.104900 | 0.01 m | 10.00 m |
4 | Methionine | C5H11NO2S | 149.051100 | 150.058300 | 150.055300 | 150.061300 | 0.01 m | 10.00 m |
5 | Valine | C5H11NO2 | 117.079000 | 118.086300 | 118.083300 | 118.089300 | 0.01 m | 10.00 m |
6 | Proline | C5H9NO2 | 115.063300 | 116.070600 | 116.067600 | 116.073600 | 0.01 m | 10.00 m |
7 | Tyrosine | C9H11NO3 | 181.073900 | 182.081200 | 182.078200 | 182.084200 | 0.01 m | 10.00 m |
8 | Cysteine | C3H7NO2S | 121.019700 | 122.027000 | 122.024000 | 122.030000 | 0.01 m | 10.00 m |
9 | Alanine | C3H7NO2 | 89.047700 | 90.055000 | 90.052000 | 90.058000 | 0.01 m | 10.00 m |
[12]:
aa_peaks_table.summary()
[12]:
id | name | type | format | nones | len | min | max | distinct values |
---|---|---|---|---|---|---|---|---|
int | str | str | str | int | int | float | float | int |
0 | id | int | %d | 0 | 21 | 0.000000 | 20.000000 | 21 |
1 | compound | str | %s | 0 | 21 | - | - | 21 |
2 | mf | str | %s | 0 | 21 | - | - | 19 |
3 | m0 | float | %f | 0 | 21 | 75.032000 | 204.089900 | 19 |
4 | mz | MzType | %11.6f | 0 | 21 | 76.039300 | 205.097200 | 19 |
5 | mzmin | MzType | %11.6f | 0 | 21 | 76.036300 | 205.094200 | 19 |
6 | mzmax | MzType | %11.6f | 0 | 21 | 76.042300 | 205.100200 | 19 |
7 | rtmin | RtType | rt_formatter | 0 | 21 | 0.322500 | 0.322500 | 1 |
8 | rtmax | RtType | rt_formatter | 0 | 21 | 599.827800 | 599.827800 | 1 |
9 | peakmap | PeakMap | None | 0 | 21 | - | - | 1 |
Now the table is ready for inspection. Just type:
emzed.gui.inspect(aa_peaks_table) Figure 2: Visualization of an emzed Table using the Table Explorer. The inset on the bottom right shows the Peakmap Explorer that opens up after double clicking an entry in the column peakmap
.
Some comments
LC-MS Peaks Tables require columns with fixed names and types.
Required columns ``mz``, ``mzmin``, ``mzmax``, ``rtmin``, ``rtmax`` and ``peakmap`` have emzed specific types, which can be directly accessed via emzed.MzType
and emzed.RtType
. They come along with special formatting. In particular, rt values are given in seconds, but shown in minutes.
The column peakmap
allows keeping track to the raw data.
add_column
operating in place, meaning the method modifies the table but does not create a new
Table object. This is memory efficient because we do not have to copy the object after each operation. integrate
is not a Table method and will return a new object. We will first explore how to integrate: ~~~ emzed.quantification.integrate(peak_table, peak_shape_model)The function ``integrate`` requires the two mandatory arguments peak_table
and peak_shape_model
. Available peak integration models are:`linear
<https://en.wikipedia.org/wiki/Trapezoidal_rule>`__,`sgolay
<https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter>`__, ``no_integration``,`asym_gauss
<https://en.wikipedia.org/wiki/Skew_normal_distribution>`__ and `emg
<https://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution>`__.
The two models linear
and sgolay
are both using trapezoidal rule as numerical approximation method and do not assume a defined peak shape. The sgolay
method includes noise filtering step prior to integration. In contrast, asym_gauss
and emg_exact
model are assuming distinct, Gaussian-related distributions. Last but not least, no_integration
returns None
as the area. Note, integrate
has a large number of optional predefined arguments that we can simply keep as they
are and that we will discuss later. Hence, we can integrate aa_table
as follows:
[13]:
aa_peaks_table = emzed.quantification.integrate(
aa_peaks_table, "linear", show_progress=False
)
aa_peaks_table.summary()
[13]:
id | name | type | format | nones | len | min | max | distinct values |
---|---|---|---|---|---|---|---|---|
int | str | str | str | int | int | float | float | int |
0 | id | int | %d | 0 | 21 | 0.000000 | 20.000000 | 21 |
1 | compound | str | %s | 0 | 21 | - | - | 21 |
2 | mf | str | %s | 0 | 21 | - | - | 19 |
3 | m0 | float | %f | 0 | 21 | 75.032000 | 204.089900 | 19 |
4 | mz | MzType | %11.6f | 0 | 21 | 76.039300 | 205.097200 | 19 |
5 | mzmin | MzType | %11.6f | 0 | 21 | 76.036300 | 205.094200 | 19 |
6 | mzmax | MzType | %11.6f | 0 | 21 | 76.042300 | 205.100200 | 19 |
7 | rtmin | RtType | rt_formatter | 0 | 21 | 0.322500 | 0.322500 | 1 |
8 | rtmax | RtType | rt_formatter | 0 | 21 | 599.827800 | 599.827800 | 1 |
9 | peakmap | PeakMap | None | 0 | 21 | - | - | 1 |
10 | peak_shape_model | str | %s | 0 | 21 | - | - | 1 |
11 | area | float | %.2e | 0 | 21 | 0.000000 | 27402460.271635 | 19 |
12 | rmse | float | %.2e | 0 | 21 | 0.000000 | 0.000000 | 1 |
13 | model | object | None | 0 | 21 | - | - | 19 |
14 | valid_model | bool | bool_formatter | 0 | 21 | - | - | 1 |
The function integrate
adds 5 additional columns to the table. the columns area and rmse are showing peak area and the mean root square error (rmse) of the area. The latter describes the rmse of the selected model named column peak_shape_model in comparison with the linear model as reference. For each peaks, the column model contains ab object with its individual integration parameters. Finally, the column valid_model is internal and can be ignored.
After integration, inspecting the resulting Table t
gives:
Now, we are able to adapt the RT window width for each peak manually: We change the peak window boarders with the mouse and we adapt rtmin
and rtmax
by simply reintegrating the peak (pressing the Update area
button).
Summarized. With the two Table methods ``add_column`` and ``add_column_with_constant_value`` we could build an emzed Table that allows extracting and visualizing EICs from a raw data file using a predefined list of compounds given as .csv
file. We can use the function ``integrate`` and the interactive Table explorer to manually select peaks within EICs resulting in a peak table containing individual RT/mz windows for all peaks.
It is quite cumbersome to execute required commands one by one in the IPython console and organizing the commands in a script is a much better way to proceed. The easiest way to do so is writing all the command we have executed so far into a Python script. Spyder provides a history log showing us all commands we executed so far.
We can proceed by simply copy paste the commands shown in history log into the editor and remove all commands that are not required. We end up with a script similar to the following:
import emzed # loading data peakmap_std = emzed.io.load_peak_map() aa_peaks_table=emzed.io.load_csv() # creating peaks table rtmin, rtmax = peakmap_std.rt_range() aa_peaks_table.add_column('mzmin', aa_peaks_table.mz - 0.003, emzed.MzType) aa_peaks_table.add_column('mzmax', aa_peaks_table.mz + 0.003, emzed.MzType) aa_peaks_table.add_column_with_constant_value('rtmin', rtmin, emzed.RtType) aa_peaks_table.add_column_with_constant_value('rtmax', rtmax, emzed.RtType) aa_peaks_table.add_column_with_constant_value('peakmap', peakmap_std, emzed.PeakMap) aa_peaks_table.add_enumeration() # integrate peaks table aa_peaks_table = emzed.quantification.integrate(aa_peaks_table, 'linear') # adapt retention of peaks manually emzed.gui.inspect(aa_peaks_table)You can execute the code by pressing F5
. When executing the script for the first time a cofiguration window will pop up. Just press the window ok button and the script gets executed.
There are several disadvantages of direct code execution. Since all intermediate results end up in the workspace, longer scripts can provoke messy workspaces. Code readability counts: Long unstructured scripts are hard to understand in particular, when the workflow is not a linear sequence of commands. Finally, you cannot execute only parts of the script and each time you have to run the complete script. To increase code reproducibility we can organize functional blocks of our scripts into Pythin functions, i.e. a function building a peaks_table that allow manual RT window definition:
[14]:
def build_peak_table(t, peakmap, mztol=0.003):
"""builds table defining EICs inspectable in peakmap
t: Table with required column `mz`
peakmap: emzed PeakMap
"""
t.add_column("mzmin", t.mz - mztol, emzed.MzType)
t.add_column("mzmax", t.mz + mztol, emzed.MzType)
rtmin, rtmax = peakmap_std.rt_range()
t.add_column_with_constant_value("rtmin", rtmin, emzed.RtType)
t.add_column_with_constant_value("rtmax", rtmax, emzed.RtType)
t.add_enumeration()
return emzed.quantification.integrate(t, "linear")
We can load the function by pressing again ``F5`` and it is available in the IPython
console.
Batch extraction of amino acids. The sample containing amino acid standards allowed us to define the amino acid peaks via mzmin
, mzmax
, rtmin
, rtmax
. To perform a targeted extraction of amino acid peaks from other samples, we simply have to replace the peak map of the standard in the peaks table, by the sample peak map, and reintegrate the table. Since the method replace operates in place we would modify our peaks with each extraction and the original peak map containing
the standards will get lost. To avoid this we will create a copy of aa_peak_table
.
[15]:
peak_table = aa_peaks_table.copy()
sample = emzed.io.load_peak_map(path_sample_caulobacter)
peak_table.replace_column_with_constant_value(
"peakmap", sample, emzed.PeakMap
)
t_sample = emzed.quantification.integrate(
peak_table, "linear", show_progress=False
)
Let’s inspect the resulting Table t_sample
:
For a greater number of samples it makes sense to write a little routine to extract amino acids from a batch of samples that could look like this:
def extract_peaks(peak_table, peakmap_sample): pt = peak_table.copy) pt.replace_column_with_constant_value('peakmap', peakmap_sample, emzed.PeakMap) return emzed.quantification.integrate(pt, 'linear') # load peak table peak_table = emzed.io.load_table() # load sample batch samples = emzed.io.load_peak_maps() # extract peaks from all samples with Python list comprehension results = [extract_peaks(peak_table, sample) for sample in samples] # To inspect peak extraction result use list indexing (i.e. last sample): emzed.gui.inspect(results[-1])
© Copyright 2012-2024 ETH Zurich
Last build 2024-03-25 10:41:42.995953.
Created using Sphinx 7.2.6.