Getting started with emzed modules

This is a short tour which shall give you a first impression of working with emzed modules. We suppose that you followed the instructions at Installation to get emzed installed and that you did the first steps as described in Getting started with emzed workbench.

As Python is a programming language with a very clear syntax, even a novice Python programmer will very likely be able to grasp the examples below. Indeed, Python is very easy to learn. You find a comprehensive list of Python tutorials at http://wiki.python.org/moin/BeginnersGuide

We mark the user input in the following examples with a leading >>> , as in the following code example:

>>> import emzed
>>> print emzed.mass.C13
13.003355

Before you try out the examples below change the working directory to the folder emzed_files/example_scripts which you find in your personal home directory.

Working with Peak Maps

emzed allows loading, inspecting and basic filtering of LC-MS(/MS) data files in different formats. To load your single data files to the workspace use the command loadPeakMap. If you are unsure how to use a command, you can get some help as follows:

>>> help(emzed.io.loadPeakMap)
Help on function loadPeakMap in module emzed.io.load_utils:

loadPeakMap(path=None)
    loads mzXML, mzML and mzData files
    
    If *path* is missing, a dialog for file selection is opened
    instead.

So if we call the function without giving a specific path, you will be asked to choose a file. Please choose the file example1.mzXML from your current folder:

>>> ds = emzed.io.loadPeakMap() 

You can access the spectra in this peak map using Python:

>>> firstSpec = ds.spectra[0]
>>> print firstSpec.rt, firstSpec.msLevel, firstSpec.polarity
725.765930176 1 -
>>> lastSpec = ds.spectra[-1]
>>> print lastSpec.rt
1212.01330566

Internally, retention times are always stored as seconds.

The variable ds will appear in the variable explorer. You see that we have 2278 spectra in this dataset and you can visualize them simply by double clicking on ds.

_images/peakmap_variable_explorer.png

Alternatively use the command

>>> emzed.gui.inspect(pm) 
_images/inspect_peakmap1.png

The upper plot shows the TIC and the lower plot the ms spectrum indicated by the bar with the center dot.

_images/inspect_peakmap2.png
  1. You can move the bar in the upper chromatogram plot with the mouse by clicking the bar. m/z values and intensities of mass peaks in the chosen spectrum are depicted in the lower plot.
  2. You can extract an ion chromatogram by entering data into the two input fields for providing a central m/z value and a half window width w/2 and then pressing Select. If you press the right button during moving the mouse the plots will zoom in or out. Pressing the backspace key will reset the active plot. Furthermore, you can measure peak relations by dragging the mouse in the lower plot.

Extracting chromatographic peaks

emzed includes three peak detection algorithms: MetaboFeatureFinder from [openms] and two of the XCMS [xcms] package: centwave [centwave] and matched filters. Accepted input file formats are mzML, mzXML, and mzData. The output file format is emzed-specific and has the file extension .table.

We continue with an example of MetaboFeatureFinder algorithm for high resolution LC-MS MS-1-data. Analysing MS-n for n=2 data is possible too, please look at the SRM/MRM example workflow mentioned at FAQ (Frequently Asked Questions):

You can start the MetaboFeatureFinder feature detector by typing

>>> tables = emzed.batches.runMetaboFeatureFinder("*.mzXML", destination=".", configid="std")

The feature detector needs a few minutes depending on the power of your computer, we omitted the verbose output. We predefined a combination of parameters with the identifier tour in order to simplify the instructions. In general various parameters can be provided individually. For getting (a lot of) details use the Python help system

>>> help(emzed.batches.runMetaboFeatureFinder) 

The return value tables is a list containing three tables, you see them in the variable explorer.

_images/tableListVarBrowser.png

Please open the table list by double clicking the variable tables in the variable explorer.

_images/table_explorer.png
  1. Now you can select a specific table using the Choose Table menu at the top of the window. In each table parameters of detected peaks are depicted row wise. You can visualize corresponding Extracted Ion Chromatograms (EIC) and mass spectra by clicking to the left of a row. Table entries are editable (just double click to a certain cell) and all modifications are stored in place. Notice that the original peak map is linked to the table and the underlying spectral data is accessible.
  2. If you click with the right mouse button to the left of a row you see a context menu with commands for manipulating whole rows. All manipulations to the table can be undone using this context menu or the commands below the Edit menu at the top of the window.

Integrating Peaks

To reduce the runtime in the following demonstration we will extract peaks with a quality above 1e-2:

>>> tab1, tab2, tab3 = tables
>>> print len(tab1)
3840
>>> tab1 = tab1.filter(tab1.quality > 1e-2)
>>> print len(tab1)
46
>>> tab2 = tab2.filter(tab2.quality > 1e-2)

Detected Peaks can be integrated. To perform peak integration columns rtmin, rtmax, mzmin, and mzmax are mandatory. We use the EMG integrator:

>>> tabInt = emzed.utils.integrate(tab1, 'emg_exact')

INFO: as the table has les thann 500 rows, we switch to one cpu mode
integrate table using 1 processes

0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
needed 0.2 seconds

If you open the dialog for tabInt you can select a row, as is shown in Figure A.

_images/table_integrate.png
  1. For all integrated peaks area and rmse values are added automatically to the table. As EMG fits a function to the EIC, you see this function in the chromatogram plot.
  2. You can manually reintegrate individual EIC peaks by adapting the rt bounds in the chromatogram plot, then choosing one of the provided integration methods and pressing Integrate. The result will be plotted in the chromatogram plot and the corresponding row is updated.

Aligning Features

The retention time alignment is performed by the Pose Clustering alignment algorithm [poseclustering] implemented in OpenMS [openms].

>>> tablesAligned = emzed.align.rtAlign(tables, destination=".") 

In this simple use case all tables are aligned to the table with the most peaks.

To visualize the rt shift on tables we will now combine two tables before and after alignment. Users which are familiar to relational databases will recognize the JOIN statement from the SQL language. More information about combining and filtering tables will be given below at Working with Tables.

>>> before = tab1.join(tab2, tab1.mz.approxEqual(tab2.mz, 3 * emzed.MMU) & tab1.rt.approxEqual(tab2.rt, 30 * emzed.SECONDS))
10 21 32 43 54 65 76 86 97

Open the window for table before and sort the table to ascending sn values and click on column with id 191.

Now repeat the same procedure for the same tables after retention time alignment:

>>> tabA1, tabA2, tabA3 = tablesAligned
>>> tabA1 = tabA1.filter(tabA1.quality > 1e-2)
>>> tabA2 = tabA2.filter(tabA2.quality > 1e-2)
>>> after = tabA1.join(tabA2, tabA1.mz.approxEqual(tabA2.mz, 3 * emzed.MMU) & tabA1.rt.approxEqual(tabA2.rt, 30 * emzed.SECONDS))
10 21 32 43 54 65 76 86 97

Open now the table after, sort again and choose the same row as above.

_images/rtalignment.png

The plot shows the overlay of two EIC peaks of the same compound in two different samples before (A) and after (B) retention time alignment.

Working with Tables

This section demonstrates some operations on tables, which are a central data structure in emzed, you have already seen them above as peak tables.

An easy way to create tables is to parse a csv file. This is how the content of example.csv looks like:

>>> print open("example.csv").read()
name;mf
water;H2O
sodium;NaCl
fullerene;C60
cryptonite;Kr

We load this file

>>> substances = emzed.io.loadCSV("example.csv")
>>> print substances
name       mf      
str        str     
------     ------  
water      H2O     
sodium     NaCl    
fullerene  C60     
cryptonite Kr      

and print some information about the columns of the table:

>>> substances.info()

id       name     type     format   diff values nones   
int      str      str      str      int         int     
------   ------   ------   ------   ------      ------  
0        name     str      '%s'     4           0       
1        mf       str      '%s'     4           0       

That is the table has two columns named name and mf and both contain data of type str.

If the table is to complex or large for printing, you can open a dialog by clicking to the substances entry in the variable explorer or from the command line:

>>> emzed.gui.inspect(substances)  

Adding a new, computed column is easy. Here we want to generate a new column m0 which contains the mono-isotopic masses corresponding to the contents of the mf column. Converting a molecular formula to the corresponding mono-isotopic weight can be done by the function emzed.mass.of:

>>> print emzed.mass.of("H2O")
18.0105650638

Generating the new column m0 is done by applying this function to the column substances.mf:

>>> substances.addColumn("m0", substances.mf.apply(emzed.mass.of))
>>> print substances
name       mf       m0       
str        str      float    
------     ------   ------   
water      H2O      18.01057 
sodium     NaCl     57.95862 
fullerene  C60      720.00000
cryptonite Kr       -        

Now we want to add some extra information to substances, this information is stored in information.csv:

>>> print open("information.csv").read()
mf;onEarth
H2O;1
NaCl;1
HCl;1
Kr;0

>>> info = emzed.io.loadCSV("information.csv")
>>> print info
mf       onEarth 
str      int     
------   ------  
H2O      1       
NaCl     1       
HCl      1       
Kr       0       

As you can see emzed.io.loadCSV recognized that the column info.onEarth only contains integers.

To combine both tables we use an SQL-like LEFT JOIN to match rows with the same molecular formula:

>>> joined = substances.leftJoin(info, substances.mf== info.mf) 
>>> print joined
name       mf       m0        mf__0    onEarth__0
str        str      float     str      int       
------     ------   ------    ------   ------    
water      H2O      18.01057  H2O      1         
sodium     NaCl     57.95862  NaCl     1         
fullerene  C60      720.00000 -        -         
cryptonite Kr       -         Kr       0         

To simplfiy the table we do some cleanup:

>>> joined.dropColumns("mf__0")
>>> joined.renameColumn("onEarth__0", "onEarth")
>>> print joined
name       mf       m0        onEarth 
str        str      float     int     
------     ------   ------    ------  
water      H2O      18.01057  1       
sodium     NaCl     57.95862  1       
fullerene  C60      720.00000 -       
cryptonite Kr       -         0       

To restrict to substances which are known to exist on earth we can do:

>>> common = joined.filter(joined.onEarth == 1)
>>> print common
name     mf       m0       onEarth 
str      str      float    int     
------   ------   ------   ------  
water    H2O      18.01057 1       
sodium   NaCl     57.95862 1       

The emzed.db module contains some databases, e.g. the substances from PubChem [pubchem] categorized as metabolomic compounds. These databases are hold in tables:

>>> pc = emzed.db.load_pubchem()
try to load pubchem db from /Users/uweschmitt/emzed2_files/tables_of_version_2.7.5/pubchem.table
>>> print pc.filter(pc.cid <= 5)
m0          mw         cid      mf       iupac                                       synonyms                                                     url                                                       is_in_kegg is_in_hmdb is_in_bioycyc inchi                                                              
float       float      int      str      str                                         str                                                          str                                                       int        int        int           str                                                                
------      ------     ------   ------   ------                                      ------                                                       ------                                                    ------     ------     ------        ------                                                             
75.0684143  75.109660  4        C3H9NO   1-aminopropan-2-ol                          1-Aminopropan-2-ol;1-AMINO-2-PROPANOL;Monoisopropanolamin... http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=4 1          1          0             InChI=1S/C3H9NO/c1-3(5)2-4/h3,5H,2,4H2,1H3                         
169.0140107 169.073082 5        C3H8NO5P (3-amino-2-oxopropyl) dihydrogen phosphate  3-amino-2-oxopropyl dihydrogen phosphate;3-Amino-2-oxopro... http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=5 1          0          1             InChI=1S/C3H8NO5P/c4-1-3(5)2-9-10(6,7)8/h1-2,4H2,(H2,6,7,8)        
203.1157595 203.235580 1        C9H17NO4 3-acetyloxy-4-(trimethylazaniumyl)butanoate acetylcarnitine;Acetyl-DL-carnitine;DL-O-Acetylcarnitine;... http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=1 0          1          1             InChI=1S/C9H17NO4/c1-7(11)14-8(5-9(12)13)6-10(2,3)4/h8H,5-6H2,1-4H3

>>> pc.info()

id       name          type     format                                         diff values nones   
int      str           str      str                                            int         int     
------   ------        ------   ------                                         ------      ------  
0        m0            float    '%.7f'                                         36498       38      
1        mw            float    '%.6f'                                         36535       0       
2        cid           int      '%s'                                           36535       0       
3        mf            str      '%s'                                           36534       0       
4        iupac         str      '%s'                                           36535       0       
5        synonyms      str      "str(o) if len(o) < 60 else str(o)[:57]+'...'" 34608       0       
6        url           str      '%s'                                           36535       0       
7        is_in_kegg    int      '%d'                                           2           0       
8        is_in_hmdb    int      '%d'                                           2           0       
9        is_in_bioycyc int      '%d'                                           2           0       
10       inchi         str      '%s'                                           36535       0       
>>> emzed.gui.inspect(pc)  

Before we go on we remove some column because printint the full columns makes this tutorial less readable:

>>> pc.dropColumns("synonyms", "url", "inchi")
>>> print pc.filter(pc.cid <= 5)
m0          mw         cid      mf       iupac                                       is_in_kegg is_in_hmdb is_in_bioycyc
float       float      int      str      str                                         int        int        int          
------      ------     ------   ------   ------                                      ------     ------     ------       
75.0684143  75.109660  4        C3H9NO   1-aminopropan-2-ol                          1          1          0            
169.0140107 169.073082 5        C3H8NO5P (3-amino-2-oxopropyl) dihydrogen phosphate  1          0          1            
203.1157595 203.235580 1        C9H17NO4 3-acetyloxy-4-(trimethylazaniumyl)butanoate 0          1          1            

Before matching our data against the large PubChem table, we build an index on this table in order to speed up the following leftJoin call. Building an index is done by sorting the corresponding column:

>>> pc.sortBy("m0")
>>> matched = joined.leftJoin(pc, (joined.onEarth == 1) & joined.m0.approxEqual(pc.m0, 15 * emzed.MMU))
25 50 75 100
>>> print len(matched), "rows in matched"
7 rows in matched
>>> matched.info()

id       name             type     format   diff values nones   
int      str              str      str      int         int     
------   ------           ------   ------   ------      ------  
0        name             str      '%s'     4           0       
1        mf               str      '%s'     4           0       
2        m0               float    '%.5f'   4           1       
3        onEarth          int      '%d'     3           1       
4        m0__0            float    '%.7f'   6           2       
5        mw__0            float    '%.6f'   6           2       
6        cid__0           int      '%s'     6           2       
7        mf__0            str      '%s'     6           2       
8        iupac__0         str      '%s'     6           2       
9        is_in_kegg__0    int      '%d'     2           2       
10       is_in_hmdb__0    int      '%d'     3           2       
11       is_in_bioycyc__0 int      '%d'     3           2       

Before we inspect the result we cleanup the column names

>>> matched.cleanupPostfixes()
>>> matched.info()

id       name          type     format   diff values nones   
int      str           str      str      int         int     
------   ------        ------   ------   ------      ------  
0        name          str      '%s'     4           0       
1        mf            str      '%s'     4           0       
2        m0            float    '%.5f'   4           1       
3        onEarth       int      '%d'     3           1       
4        m0__0         float    '%.7f'   6           2       
5        mw            float    '%.6f'   6           2       
6        cid           int      '%s'     6           2       
7        mf__0         str      '%s'     6           2       
8        iupac         str      '%s'     6           2       
9        is_in_kegg    int      '%d'     2           2       
10       is_in_hmdb    int      '%d'     3           2       
11       is_in_bioycyc int      '%d'     3           2       
>>> print matched
name       mf       m0        onEarth  m0__0      mw        cid      mf__0    iupac                  is_in_kegg is_in_hmdb is_in_bioycyc
str        str      float     int      float      float     int      str      str                    int        int        int          
------     ------   ------    ------   ------     ------    ------   ------   ------                 ------     ------     ------       
water      H2O      18.01057  1        18.0105651 18.015280 962      H2O      oxidane                1          1          1            
water      H2O      18.01057  1        18.0105651 20.027604 24602    H2O      None                   1          0          0            
water      H2O      18.01057  1        18.0105651 17.018946 10129877 H2O      oxidane                1          0          0            
sodium     NaCl     57.95862  1        57.9586220 58.442769 5234     ClNa     sodium;chloride        1          0          1            
sodium     NaCl     57.95862  1        57.9586220 57.447436 23667643 ClNa     sodium-22(1+);chloride 1          0          0            
fullerene  C60      720.00000 -        -          -         -        -        -                      -          -          -            
cryptonite Kr       -         0        -          -         -        -        -                      -          -          -            

>>> emzed.gui.inspect(matched)  

Accessing Chemical Data

The mass module provides the masses of an electron, a proton or a neutron and all all important elements

>>> print emzed.mass.e # electron
0.00054857990946
>>> print emzed.mass.C, emzed.mass.C12, emzed.mass.C13
12.0 12.0 13.003355

Additionally, it provides functions to calculate masses of molecules from their sum formula

>>> print emzed.mass.of("C6H2O6")
169.985140064

One can consider isotopes too:

>>> print emzed.mass.of("[13]C")
13.003355
>>> print emzed.mass.of("[13]C6H2O6")
176.005270064
>>> print emzed.mass.of("[13]C3[12]C3H2O6")
172.995205064

The elements module provides information of important elements

>>> print emzed.elements.C
{'massnumber': 12, 'm0': 12.0, 'number': 6, 'name': 'Carbon'}
>>> print emzed.elements.C13
{'abundance': 0.010700000000000001, 'name': 'Carbon', 'number': 6, 'mass': 13.003355}

abundance is a module which provides the natural abundances of common elements

>>> print emzed.abundance.C
{12: 0.9893000000000001, 13: 0.010700000000000001}

Generating isotope patterns

As the Table objects provide powerful matchings, all we need to analyze isotope patterns occurring in feature tables is a way to generate tables containing these data. emzed.utils.isotopeDistributionTable does this

>>> tab = emzed.utils.isotopeDistributionTable("C4S4", minp=0.01)
>>> print tab
mf                   mass       abundance
str                  float      float    
------               ------     ------   
[12]C4 [32]S4        175.888283 0.7779   
[12]C4 [32]S3 [33]S1 176.887670 0.0249   
[12]C3 [13]C1 [32]S4 176.891638 0.0337   
[12]C4 [32]S3 [34]S1 177.884079 0.1406   
-                    -          0.0229   

Non natural distributions as in marker experiments can be simulated too

>>> iso = emzed.utils.isotopeDistributionTable("C4S4", C=dict(C12=0.5, C13=0.5))
>>> iso.replaceColumn("abundance", iso.abundance / iso.abundance.sum() * 100.0)
>>> print iso
mf                          mass       abundance
str                         float      float    
------                      ------     ------   
[12]C4 [32]S4               175.888283 5.08     
[12]C3 [13]C1 [32]S4        176.891638 20.30    
[12]C2 [13]C2 [32]S4        177.894993 30.45    
[12]C3 [13]C1 [32]S3 [34]S1 178.887434 3.67     
[12]C1 [13]C3 [32]S4        178.898348 20.30    
[12]C2 [13]C2 [32]S3 [34]S1 179.890789 5.51     
[13]C4 [32]S4               179.901703 5.08     
[12]C1 [13]C3 [32]S3 [34]S1 180.894144 3.67     
-                           -          5.94     

The method can simulate the resolution of the used mass analyzer

>>> tab = emzed.utils.isotopeDistributionTable("C4S4", R=10000, minp=0.01)
>>> print tab
mf       mass       abundance
str      float      float    
------   ------     ------   
C4S4     175.888283 0.7977   
C4S4     176.889972 0.0580   
C4S4     177.884079 0.1442   

Matching isotope patterns now works like this

>>> iso = emzed.utils.isotopeDistributionTable("H2O", minp=1e-3)
>>> iso.addEnumeration()
>>> print iso
id       mf           mass      abundance
int      str          float     float    
------   ------       ------    ------   
0        [16]O1 [1]H2 18.010565 0.9973   
1        [18]O1 [1]H2 20.014819 0.0020   
2        -            -         0.0006   

>>> common.dropColumns("onEarth")
>>> matched = iso.leftJoin(common, iso.mass.approxEqual(common.m0, 1 * emzed.MMU))
33 66 100
>>> print matched
id       mf           mass      abundance name__0  mf__0    m0__0   
int      str          float     float     str      str      float   
------   ------       ------    ------    ------   ------   ------  
0        [16]O1 [1]H2 18.010565 0.9973    water    H2O      18.01057
1        [18]O1 [1]H2 20.014819 0.0020    -        -        -       
2        -            -         0.0006    -        -        -       

Statistical Analysis

The framework provides two methods for comparing two datasets by analysis of variance: classical one way ANOVA and non parametric Kruskal Wallis analysis. These methods work on tables like this

>>> group1 = [1.0, 0.9, 1.2, 1.4, 2.1]
>>> group2 = [1.0, 2.2, 2.3, 1.9, 2.8, 2.3]
>>> 
>>> t = emzed.utils.toTable("measurement", group1 + group2)
>>> 
>>> indices = [1] * len(group1) + [2] * len(group2)
>>> print indices
[1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
>>> 
>>> t.addColumn("group", indices)
>>> print t
measurement group   
float       int     
------      ------  
1.00000     1       
0.90000     1       
1.20000     1       
1.40000     1       
2.10000     1       
1.00000     2       
2.20000     2       
2.30000     2       
1.90000     2       
2.80000     2       
2.30000     2       

emzed.stats.oneWayAnova returns the corresponding F and p value, emzed.stats.kruskalWallis the H and p value

>>> F, p = emzed.stats.oneWayAnova(t.group, t.measurement)
>>> print p
0.0480721067923
>>> H, p = emzed.stats.kruskalWallis(t.group, t.measurement)
>>> print p
0.0541290285797

Quering METLIN web service

As the METLIN Rest interface is out ouf order at this juncture we removed the former examples and leave this paragraph empty.

Building graphical interfaces

Beyond the Table-Explorer emzed.gui.inspect and the PeakMap-Explorer emzed.gui.inspect assisted work-flows request certain parameters and decisions at certain processing steps. To support this emzed has an builder for input forms.

The following dialog is created by the simple commands below:

_images/dialogbuilder.png
>>> b = emzed.gui.DialogBuilder(title="Please provide data")
>>> b.addInstruction("For Algorithm A please provide")
>>> b.addInt("Level")
>>> b.addFloat("Threshold")
>>> b.addFileOpen("Input File")
>>> print b.show()                            
(10, 1.02, 'C:/Dokumente und Einstellungen/e001.mzML')