API Tables and Expressions

Tables

class emzed.core.data_types.table.Table(colNames, colTypes, colFormats, rows=None, title=None, meta=None)[source]

A table holds rows of the same length. Each column of the table has a name, a type and format information, which indicates how to render values in this column. Furthermore, a table has a title and meta information, which is a dictionary.

column types can be any python type.

format can be:

  • a string interpolation string, e.g. “%.2f”
  • None, which suppresses rendering of this column
  • python code which renders an object of name o, e.g. str(o)+"x"
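
The three kinds of format can be illustrated with a small stand-alone sketch. The helper render below is hypothetical and not part of emzed; in particular, treating every format that contains "%" as an interpolation string is an assumption made here only to tell the cases apart.

```python
def render(value, fmt):
    """Hypothetical helper: render value according to a column format."""
    if fmt is None:
        # None suppresses rendering of the column
        return None
    if "%" in fmt:
        # assumption: a "%" marks a string interpolation format
        return fmt % value
    # otherwise fmt is Python code which refers to the object as "o"
    return eval(fmt, {"o": value})


as_float = render(3.14159, "%.2f")     # "3.14"
as_code = render(5, 'str(o)+"x"')      # "5x"
hidden = render(5, None)               # None
```
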
__getitem__(ix)[source]

supports creating a new table from a subset of current rows using brackets notation.

Example:

t = emzed.utils.toTable("a", [1, 2, 3])
t[0].a.values == [1]
t[:1].a.values == [1]
t[1:].a.values == [2, 3]
t[:].a.values == [1, 2, 3]
t[::-1].a.values == [3, 2, 1]

# now:
t[:] == t.copy()

# revert
t_reverted = t[::-1]

For selecting rows, the index may also be a list or numpy array of booleans or integers:

t[[True, False, False]] == t[0]
t[[0, 1]] == t[:2]
t[numpy.arange(0, 2)] == t[:2]

Selection of columns is supported as well, so expressions like t[:, 1:2], t[:, [0, 1]] work as expected.
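
The row selection rules can be mimicked on a plain Python list. This is only a sketch of the semantics described above; select_rows is a hypothetical helper, not an emzed function, and it ignores column selection.

```python
def select_rows(rows, ix):
    """Hypothetical helper mirroring the row selection rules above."""
    if isinstance(ix, slice):
        return rows[ix]
    if isinstance(ix, int):
        # a single index still yields a (one-row) selection, as in t[0]
        return [rows[ix]]
    ix = list(ix)
    if ix and all(isinstance(i, bool) for i in ix):
        # boolean mask: keep the rows whose flag is True
        return [row for row, keep in zip(rows, ix) if keep]
    # otherwise a sequence of integer positions
    return [rows[i] for i in ix]


rows = [1, 2, 3]
first = select_rows(rows, 0)                        # [1]
reverted = select_rows(rows, slice(None, None, -1)) # [3, 2, 1]
masked = select_rows(rows, [True, False, False])    # [1]
picked = select_rows(rows, [0, 1])                  # [1, 2]
```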

__getstate__()[source]

for internal use: filters some attributes for pickling.

__iter__()[source]

allows iteration over the rows of a table.

For example:

for row in table:
    print row.mz, row["rt"]
    row.rt = 1.0
    row["mz"] *= 1.01
__setstate__(dd)[source]

for internal use: builds table from pickled data.

addColumn(name, what, type_=None, format_='', insertBefore=None, insertAfter=None)[source]

adds a column in place.

For the meaning of the parameters see replaceColumn()

If you do not want the column to be added at the end, you can use insertBefore, which may be the name of an existing column or an integer index (negative values are allowed).

addConstantColumn(name, value, type_=None, format_='', insertBefore=None, insertAfter=None)[source]

see addColumn().

value is inserted into the column regardless of its class. So the cases distinguished in addColumn() are not considered, which is useful, e.g., if you want to set all cells in a column to a list.

addEnumeration(colName='id', insertBefore=None, insertAfter=None, startWith=0)[source]

adds an enumerated column as the first column of the table in place.

If colName is not given, the default name is “id”.

Enumeration starts with startWith, which defaults to zero.

addRow(row, doResetInternals=True)[source]

adds a new row to the table, checks if values in row are of expected type or can be converted to this type.

Raises an AssertionError if the length of row does not match the number of columns of the table.

append(*tables)[source]

appends tables horizontally to the existing table in place. Can be called as

t1.append(t2, t3)
t1.append([t2, t3])
t1.append(t2, [t3, t4])

the column names and the column types have to match ! column format is taken from the original table.

appendTable(other)[source]

horizontal inplace appending

apply(fun, args, keep_nones=False)[source]

Allows computing a new column from a function with multiple arguments.

>>> import emzed
>>> t = emzed.utils.toTable("a", [1, 2, 3], type_=int)
>>> t.addColumn("b", [None, 5, 1], type_=int)
>>> print t
a        b       
int      int     
------   ------  
1        -       
2        5       
3        1       

>>> t.addColumn("c", t.apply(max, (t.a, t.b, 4)), type_=int)
>>> print t
a        b        c       
int      int      int     
------   ------   ------  
1        -        -       
2        5        5       
3        1        4       

Here missing values (None values) are not passed to the function. To change this behaviour the keep_nones parameter should be set to True.
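
The handling of missing values can be sketched without emzed. The function apply_rowwise is a hypothetical stand-in which works on plain lists. It assumes that list arguments play the role of columns and that everything else is a constant.

```python
def apply_rowwise(fun, args, keep_nones=False):
    """Hypothetical stand-in for Table.apply working on plain lists."""
    n = max(len(a) for a in args if isinstance(a, list))
    # constants are repeated so that every argument has n entries
    columns = [a if isinstance(a, list) else [a] * n for a in args]
    result = []
    for values in zip(*columns):
        if not keep_nones and any(v is None for v in values):
            # a missing input gives a missing output, fun is not called
            result.append(None)
        else:
            result.append(fun(*values))
    return result


a = [1, 2, 3]
b = [None, 5, 1]
c = apply_rowwise(max, (a, b, 4))   # [None, 5, 4], as in column c above
```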

buildEmptyClone(cols=None)[source]

returns empty table with same names, types, formatters, title and meta data

cleanupPostfixes()[source]

removes postfixes from those columns where this change does not introduce duplicate names.

collapse(*col_names, **kw)[source]

collapses a table by grouping according to the columns col_names. This creates a subtable for every group, but in “standard” mode this is not memory efficient if the number of groups is in the range of several thousands. As this method is mostly used for preparing a table before visual inspection, an “efficient” mode (efficient=True) is available, where the subtables may be inspected, but access to the columns via attribute access is no longer available.

>>> import emzed
>>> t = emzed.utils.toTable("group", (1, 1, 2, 3))
>>> t.addColumn("values", range(4))
>>> print(t)
group    values  
int      int     
------   ------  
1        0       
1        1       
2        2       
3        3       

>>> tn = t.collapse("group")
>>> print(tn)
group    collapsed                          
int      emzed.core.data_types.table.Table  
------   ------                             
1        <TProxy to 0x114026090 with 2 rows>
2        <TProxy to 0x114026090 with 1 row> 
3        <TProxy to 0x114026090 with 1 row> 

>>> tn = t.collapse("group", efficient=True)
>>> print(tn)
group    collapsed                          
int      emzed.core.data_types.table.Table  
------   ------                             
1        <TProxy to 0x114026090 with 2 rows>
2        <TProxy to 0x114026090 with 1 row> 
3        <TProxy to 0x114026090 with 1 row> 
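
The grouping behind collapse can be sketched with plain tuples. This only illustrates which rows end up in which group; group_rows is a hypothetical helper and says nothing about how emzed builds its TProxy objects.

```python
from collections import OrderedDict


def group_rows(rows, key_index):
    """Hypothetical helper: collect rows sharing the same key value."""
    groups = OrderedDict()
    for row in rows:
        groups.setdefault(row[key_index], []).append(row)
    return groups


# (group, values) rows of the example table above
rows = [(1, 0), (1, 1), (2, 2), (3, 3)]
groups = group_rows(rows, 0)
sizes = [(key, len(sub)) for key, sub in groups.items()]
# sizes == [(1, 2), (2, 1), (3, 1)]
```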

compressPeakMaps()[source]

sometimes duplicate peakmaps occur as different objects in a table, that is: different id() but same content. this function removes duplicates and replaces different instances of the same data by one particular instance.

copy()[source]

returns a ‘semi-deep’ copy of the table

dropColumns(*patterns)[source]

removes columns whose names match one of the given patterns from the table. Wildcards such as “?” and “*” are allowed.

Works in place

Example: tab.dropColumns("mz", "rt", "rmse", "*__0")
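
Which names such patterns select can be tried out with the standard library. The sketch assumes shell-style wildcard semantics as implemented by fnmatch; whether emzed uses this module internally is not stated here, and remaining_columns is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def remaining_columns(col_names, *patterns):
    """Hypothetical helper: names which match none of the patterns."""
    return [name for name in col_names
            if not any(fnmatchcase(name, p) for p in patterns)]


names = ["id", "mz", "rt", "rmse", "mz__0", "rt__0"]
kept = remaining_columns(names, "mz", "rt", "rmse", "*__0")
# kept == ["id"]
```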

ensureColNames(*names)[source]

convenient function to assert existence of column names

enumerateBy(*column_names)[source]

returns a list of numbers for enumerating the rows grouped by the given column names

For example: If we have a table t with entries:

v1    v2
str   int
----- -----
a      3
a      3
b      3
b      7

Then the following statements are true:

t.enumerateBy("v1") == [0, 0, 1, 1]
t.enumerateBy("v2") == [0, 0, 0, 1]
t.enumerateBy("v1", "v2") == [0, 0, 1, 2]

This can be used as:

t.addColumn("v1_v2_id", t.enumerateBy("v1", "v2"), insertBefore=0)

which yields:

v1_v2_id  v1    v2
int       str   int
--------  ----- -----
0         a      3
0         a      3
1         b      3
2         b      7
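
The numbering can be reproduced with a few lines of plain Python. The sketch assumes that groups are numbered in order of first appearance, which is consistent with the example above; enumerate_by is a hypothetical helper working on tuples.

```python
def enumerate_by(rows, key_indices):
    """Hypothetical helper: one number per distinct key, per row."""
    ids = {}
    result = []
    for row in rows:
        key = tuple(row[i] for i in key_indices)
        # a key gets the next free number when it is seen first
        result.append(ids.setdefault(key, len(ids)))
    return result


# (v1, v2) rows of the example table
rows = [("a", 3), ("a", 3), ("b", 3), ("b", 7)]
by_v1 = enumerate_by(rows, [0])        # [0, 0, 1, 1]
by_v2 = enumerate_by(rows, [1])        # [0, 0, 0, 1]
by_both = enumerate_by(rows, [0, 1])   # [0, 0, 1, 2]
```
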
extractColumns(*names)[source]

extracts the given column names names and returns a new table with these columns

Example: t.extractColumns("id", "name")

fastJoin(other, column_name, column_name_other=None, rel_tol=None, abs_tol=None)[source]

Fast joining for combining tables based on equality of a given column. .fastJoin is more restricted than the regular .join method but much faster.

For example: t1 and t2 have a column named id. Then the call:

tn = t1.fastJoin(t2, "id")

yields the same result as:

tn = t1.join(t2, t1.id == t2.id)

The column name column_name_other can be used if table other does not have a column column_name for matching. Then this column name is used for other instead.

You can use rel_tol or abs_tol for approximate matching of numerical values.

Remark:

For a more flexible but still fast way to join on exact or approximate matches use the .equals expression, which allows and/or for more complex match conditions:

tn = t.join(t2, t.mz.equals(t2.mz, rel_tol=5e-6) & t.rt.equals(t2.rt, abs_tol=30))
fastLeftJoin(other, column_name, column_name_other=None, rel_tol=None, abs_tol=None)[source]

Same optimization as fastJoin described above, but performs a fast leftJoin instead.

filter(expr, debug=False)[source]

builds a new table with rows selected according to expr. E.g.

table.filter(table.mz >= 100.0)
table.filter((table.mz >= 100.0) & (table.rt <= 20))

findMatchingRows(filters)[source]

accepts a list of column names and functions operating on those columns, and returns the indices of the matching rows.

Example:

t.findMatchingRows([("mz", lambda mz: 100 <= mz <= 200),
                    ("rt", lambda rt: 200 <= rt <= 1000)])

computes the row indices of all rows where mz and rt are in the given ranges.
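
The filtering logic can be sketched on plain tuples. find_matching_rows is a hypothetical helper which takes the column names explicitly; it assumes that a row matches only if all filter functions return a true value.

```python
def find_matching_rows(rows, col_names, filters):
    """Hypothetical helper: indices of rows passing all filters."""
    indices = []
    for i, row in enumerate(rows):
        if all(fun(row[col_names.index(name)]) for name, fun in filters):
            indices.append(i)
    return indices


col_names = ["mz", "rt"]
rows = [(150.0, 300.0), (250.0, 300.0), (120.0, 50.0), (199.0, 999.0)]
filters = [("mz", lambda mz: 100 <= mz <= 200),
           ("rt", lambda rt: 200 <= rt <= 1000)]
matching = find_matching_rows(rows, col_names, filters)
# matching == [0, 3]
```
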
findPostfixes()[source]

returns postfixes of column names

static from_numpy_array(data, col_names, col_types, col_formats, title=None, meta=None)[source]

Creates a Table from a numpy 2dim array. Arguments are given in the same manner as in the Table constructor __init__()

numpy.nan values are converted to None values for indicating missing values. numpy numerical types are converted to python types.

static from_pandas(df, title=None, meta=None, types=None, formats=None)[source]

creates Table from pandas DataFrame

  • title and meta are parameters according to the constructor of Table,

    see __init__()

  • types is a mapping of column names to python types. If this parameter is missing, types are derived automatically from the pandas column types; column names missing in this mapping are handled in the same way.

    This parameter is intended for non standard or special type conversions.

  • formats is a mapping for declaring column formats. The keys can be column_names, python types, or a mixture of them. Types are resolved in the first place, then column names.

    The values of this mapping are similar to the format declarations used in __init__()

    For missing declarations, or a missing formats argument the formats are guessed from the column types.

Comment: type resolution is done before determining the formats, so values in the types dictionary may appear as keys in formats.

getColFormat(name)[source]

returns format of column name

getColFormats()[source]

returns a copied list of column formats, one can operate on this list without corrupting the underlying table.

getColNames()[source]

returns a copied list of column names, one can operate on this list without corrupting the underlying table

getColType(name)[source]

returns type of column name

getColTypes()[source]

returns a copied list of column types, one can operate on this list without corrupting the underlying table

getColumn(name)[source]

returns ColumnExpression object for column name. to get the values of the column you can use table.getColumn("index").values.

See: ColumnExpression

getFormat(colName)[source]

returns the format for the given column colName

getIndex(colName)[source]

gets the integer index of the column colName.

Example: table.rows[0][table.getIndex("mz")]

getType(colName)[source]

returns the type for the given column colName

getValue(row, colName, default=None)[source]

returns value of column colName in a given row

Example: table.getValue(table.rows[0], "mz")

getValues(row)[source]

returns the content of the row as an enhanced dictionary mapping column names to values. The values can be accessed by key or as attributes.

Example:

row = table.getValues(table.rows[0])
print row["mz"]
print row.mz
getVisibleCols()[source]

returns a list with the names of the columns which are visible. that is: the corresponding format is not None

hasColumn(name)[source]

checks if column with given name name exists

hasColumns(*names)[source]

checks if columns with given names exist.

Example: tab.hasColumns("rt", "rtmin", "rtmax")

info()[source]

prints some table information and some table statistics

join(t, expr=True, debug=False, title=None)[source]

joins two tables.

So if you have two tables t1 and t2 as

id   mz
0    100.0
1    200.0
2    300.0

and

id   mz      rt
0    100.0   10.0
1    110.0   20.0
2    200.0   30.0

Then the result of t1.join(t2, (t1.mz >= t2.mz - 20) & (t1.mz <= t2.mz + 20)) is

id_1   mz_1    id_2   mz_2    rt
0      100.0   0      100.0   10.0
0      100.0   1      110.0   20.0
1      200.0   2      200.0   30.0

If you do not provide an expression, this method returns the full cross product.
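
The semantics of such a join can be written down as a nested loop over plain tuples. This is a naive sketch which ignores column naming and emzed's optimizations; join and close_mz are hypothetical names.

```python
def join(rows1, rows2, condition):
    """Hypothetical helper: all row pairs for which condition holds."""
    return [r1 + r2 for r1 in rows1 for r2 in rows2 if condition(r1, r2)]


t1 = [(0, 100.0), (1, 200.0), (2, 300.0)]                     # (id, mz)
t2 = [(0, 100.0, 10.0), (1, 110.0, 20.0), (2, 200.0, 30.0)]   # (id, mz, rt)


def close_mz(r1, r2):
    # same as (t1.mz >= t2.mz - 20) & (t1.mz <= t2.mz + 20)
    return r2[1] - 20 <= r1[1] <= r2[1] + 20


joined = join(t1, t2, close_mz)             # the three rows shown above
cross = join(t1, t2, lambda r1, r2: True)   # full cross product, 9 rows
```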

leftJoin(t, expr=True, debug=False, title=None)[source]

performs a left join, also known as left outer join, of two tables.

It works similarly to join() but keeps non-matching rows of the first table. So if we take the example from join(), then the result of t1.leftJoin(t2, (t1.mz >= t2.mz - 20) & (t1.mz <= t2.mz + 20)) is

id_1   mz_1    id_2   mz_2    rt
0      100.0   0      100.0   10.0
0      100.0   1      110.0   20.0
1      200.0   2      200.0   30.0
2      300.0   None   None    None

If you do not provide an expression, this method returns the full cross product.
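
The difference to a plain join is the padding of unmatched rows with None values. The following naive sketch on plain tuples illustrates this; left_join is a hypothetical helper and not the emzed implementation.

```python
def left_join(rows1, rows2, condition):
    """Hypothetical helper: join which keeps unmatched rows of rows1."""
    width = len(rows2[0]) if rows2 else 0
    result = []
    for r1 in rows1:
        matches = [r1 + r2 for r2 in rows2 if condition(r1, r2)]
        # rows without a partner are kept and padded with None values
        result.extend(matches or [r1 + (None,) * width])
    return result


t1 = [(0, 100.0), (1, 200.0), (2, 300.0)]                     # (id, mz)
t2 = [(0, 100.0, 10.0), (1, 110.0, 20.0), (2, 200.0, 30.0)]   # (id, mz, rt)

result = left_join(t1, t2, lambda r1, r2: r2[1] - 20 <= r1[1] <= r2[1] + 20)
# the last row is (2, 300.0, None, None, None)
```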

static load(path)[source]

loads a table stored with store()

Note: as this is a static method, it has to be called as tab = Table.load("xzy.table")

static loadCSV(path, sep=';', keepNone=False, dashIsNone=True, **specialFormats)[source]

loads a csv file from path. The column separator is given by sep. If keepNone is set to True, “None” strings in the file are kept as strings. Otherwise these strings are converted to Python None values. specialFormats collects keyword arguments for setting formats of columns.

Example: Table.loadCSV("abc.csv", mz="%.3f")

maxPostfix()[source]

returns last postfix in sorted postfixes of column names

static mergeTables(tables, reference_table=None, force_merge=False)[source]

merges tables, e.g.:

>>> import emzed
>>> t1 = emzed.utils.toTable("a", [1], type_=int)
>>> t2 = t1.copy()
>>> t1.addColumn("b", 3, type_=int)
>>> t2.addColumn("c", 5, type_=int)
>>> 
>>> print t1
a        b       
int      int     
------   ------  
1        3       

>>> print t2
a        c       
int      int     
------   ------  
1        5       

>>> t3 = emzed.utils.mergeTables([t1, t2])
>>> print t3
a        b        c       
int      int      int     
------   ------   ------  
1        3        -       
1        -        5       

in case of conflicting names, name orders, types or formats you can try force_merge=True or provide a reference table. This reference table just provides information about the wanted column names, types and formats and is merged into the result only if it appears in tables.
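
How the rows of the merged table are filled can be sketched for the conflict-free case. The helper merge_rows is hypothetical; it works on (column names, rows) pairs and ignores types, formats and the reference table.

```python
def merge_rows(tables):
    """Hypothetical helper: tables is a list of (col_names, rows) pairs."""
    merged_names = []
    for names, _ in tables:
        for name in names:
            if name not in merged_names:
                merged_names.append(name)
    merged_rows = []
    for names, rows in tables:
        for row in rows:
            values = dict(zip(names, row))
            # columns unknown to a table are filled with None
            merged_rows.append(tuple(values.get(n) for n in merged_names))
    return merged_names, merged_rows


t1 = (["a", "b"], [(1, 3)])
t2 = (["a", "c"], [(1, 5)])
names, rows = merge_rows([t1, t2])
# names == ["a", "b", "c"], rows == [(1, 3, None), (1, None, 5)]
```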

minPostfix()[source]

returns first postfix in sorted postfixes of column names

numRows()[source]

returns the number of rows

print_(w=8, out=None, title=None, max_lines=None)[source]

Prints the table to the console. w is the width of the columns. If you want to print to a file or stream instead, you can use the out parameter, e.g. t.print_(out=sys.stderr). If you supply a title string, it will be printed above the content of the table.

removePostfixes(*postfixes)[source]

removes postfixes. throws exception if this process produces duplicate column names.

if you omit the postfixes parameter, all “__xyz” postfixes are removed. Otherwise the given postfixes are stripped from the column names.

renameColumns(*dicts, **keyword_args)[source]

renames columns in place.

So if you want to rename “mz_1” to “mz” and “rt_1” to “rt”, use table.renameColumns(mz_1="mz", rt_1="rt")

For memorization read the = as ->.

Sometimes it is helpful to build dictionaries which describe the renaming. In this case you can pass an arbitrary number of dictionaries, and even additional keyword arguments (if you want) to this method.

Example:

table.renameColumns(dict(a="b", c="d"))
table.renameColumns(dict(a="b"), dict(c="d"))
table.renameColumns(dict(a="b"), c="d")
renamePostfixes(**kw)[source]

Renames postfixes given as key-value arguments.

Example: tab.renamePostfixes(__0 = "_new")

replaceColumn(name, what, type_=None, format_='')[source]

replaces column name in place.

  • name is name of the new column
  • type_ is one of the valid types described above. If type_ == None the method tries to guess the type from what.
  • format_ is a format string as “%d” or None or an executable string with python code. If you use format_="" the method will try to determine a default format for the type.

For the values what you can use

  • an expression (see Expression) as table.addColumn("diffrt", table.rtmax-table.rtmin)
  • a callback with signature callback(table, row, name)
  • a constant value
  • a list with the correct length.
replaceSelectedRows(name, what, rowIndices)[source]

replace values in given rowIndices and given column name with constant value what

requireColumn(name)[source]

throws exception if column with name name does not exist

this method only exists for compatibility with older emzed code.

resetInternals()[source]

internal method

must be called after manipulation of

  • self._colNames
  • self._colFormats

or

  • self.rows
setCellValue(row_indices, col_indices, values)[source]

row_indices is either a list of ints or a single int

col_indices is either a list of ints or a single int

values is either a single value or a list of lists where the lengths must match the given indices.

setColFormat(name, format_)[source]

sets format of column name to format_.

setColType(name, type_)[source]

sets type of column name to type type_.

setRow(idx, row)[source]

replaces row idx with row. be careful: we only check the length, not the types!

setValue(row, colName, value, slow_but_checked=True)[source]

sets value of column colName in a given row.

Example: table.setValue(table.rows[0], "mz", 252.83332)

sortPermutation(colNames, ascending=True)[source]

sorts the table with respect to the column(s) given by colNames in place. ascending is a boolean and tells if the values are sorted in ascending or descending order.

This is important to build an internal index for faster queries with Table.filter, Table.leftJoin and Table.join.

For building an internal index, ascending must be True, you can have only one index per table.

splitBy(*colNames, **kw)[source]

generates a list of subtables, where the columns given by colNames are unique.

If we have a table t1 as

a   b   c
1   1   1
1   1   2
2   1   3
2   2   4

res = t1.splitBy("a", "b") results in a table res[0] as

a   b   c
1   1   1
1   1   2

res[1] which is like

a   b   c
2   1   3

and res[2] which is

a   b   c
2   2   4
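
The splitting can be sketched on plain tuples. Grouping on both a and b yields three groups of rows, grouping on a alone yields two. split_by is a hypothetical helper which assumes that groups keep the order of first appearance.

```python
def split_by(rows, key_indices):
    """Hypothetical helper: lists of rows sharing the same key values."""
    groups = []
    positions = {}
    for row in rows:
        key = tuple(row[i] for i in key_indices)
        if key not in positions:
            # first row with this key opens a new group
            positions[key] = len(groups)
            groups.append([])
        groups[positions[key]].append(row)
    return groups


rows = [(1, 1, 1), (1, 1, 2), (2, 1, 3), (2, 2, 4)]   # columns a, b, c
by_a = split_by(rows, [0])        # two groups
by_ab = split_by(rows, [0, 1])    # three groups
```
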
splitByIter(*colNames, **kw)[source]

generator which yields a TProxy for every iteration.

splitVertically(columns)[source]

this method separates a table vertically based on column names. the method returns two tables: the first one contains the columns matching the given column names, the second is the remainder.

the argument columns is either a string or a list of strings. Globbing is allowed! So for example:

tleft, tright = t.splitVertically(["rt*", "mz*"])

will return two tables. The left one has columns where the names start with “mz” or “rt”, the right one contains the remainder.
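
Which column names go to which side can be sketched with the standard library. The sketch assumes shell-style globbing as implemented by fnmatch; split_names is a hypothetical helper operating on names only.

```python
from fnmatch import fnmatchcase


def split_names(col_names, patterns):
    """Hypothetical helper: (matching names, remaining names)."""
    if isinstance(patterns, str):
        # a single pattern may be given as a plain string
        patterns = [patterns]
    left = [n for n in col_names if any(fnmatchcase(n, p) for p in patterns)]
    right = [n for n in col_names if n not in left]
    return left, right


names = ["id", "mz", "mzmin", "rt", "rtmax", "area"]
left, right = split_names(names, ["rt*", "mz*"])
# left == ["mz", "mzmin", "rt", "rtmax"], right == ["id", "area"]
```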

static stackTables(tables)[source]

dumb and fast version of Table.mergeTables if all tables have common column names, types and formats unless they are empty.

store(path, forceOverwrite=False, compressed=True, peakmap_cache_folder=None, atomic=False)[source]

Writes the table in binary format. All information is stored, including the corresponding peak maps.

The file name extension in path must be .table.

forceOverwrite must be set to True to overwrite an existing file.

compressed replaces duplicate copies of the same peakmap by a single one to save space on disk.

peakmap_cache_folder is a folder. If provided, the table data and the peakmaps are stored separately, so the table file can be loaded much faster and the peakmaps are lazily loaded only if one tries to access their spectra. This speeds up workflows, but the developer must take care of consistency: if the peakmap folder is deleted, the table may become useless!

Later the file can be loaded with load()

storeCSV(path, as_printed=True, row_indices=None)[source]

writes the table in .csv format. The path has to end with ‘.csv’.

As .csv is a text format all binary information is lost !

supportedPostfixes(colNamesToSupport)[source]

For a table with column names ["rt", "rtmin", "rtmax", "rt1", "rtmin1"]

supportedPostfixes(["rt"]) returns ["", "min", "max", "1", "min1"],

supportedPostfixes(["rt", "rtmin"]) returns ["", "1"],

supportedPostfixes(["rt", "rtmax"]) returns [""].
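
The rule behind these results is that a postfix is supported if every given name combined with this postfix is an existing column name. A plain Python sketch, with supported_postfixes as a hypothetical stand-alone function:

```python
def supported_postfixes(col_names, names_to_support):
    """Hypothetical helper: postfixes valid for all given names."""
    first = names_to_support[0]
    # candidates are whatever follows the first name in a column name
    candidates = [n[len(first):] for n in col_names if n.startswith(first)]
    return [p for p in candidates
            if all(name + p in col_names for name in names_to_support)]


cols = ["rt", "rtmin", "rtmax", "rt1", "rtmin1"]
for_rt = supported_postfixes(cols, ["rt"])
for_rt_rtmin = supported_postfixes(cols, ["rt", "rtmin"])
for_rt_rtmax = supported_postfixes(cols, ["rt", "rtmax"])
```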

toOpenMSFeatureMap()[source]

converts table to pyopenms FeatureMap type.

static toTable(colName, iterable, format_='', type_=None, title='', meta=None)[source]

generates a one-column table from an iterable, e.g. from a list, colName is name for the column.

  • if type_ is not given a common type for all values is determined,
  • if format_ is not given, a default format for type_ is used.

further one can provide a title and meta data

to_pandas(do_format=False)[source]

converts table to pandas DataFrame object

transformColumnNames(transformation)[source]

transformation is a function mapping a string to a string. this is used to modify the column names of the given table in-place.

uniqueRows(byColumns=None)[source]

extracts table with unique rows. Two rows are equal if all fields, including invisible columns (those with format_==None) are equal.

Expressions

Working with tables relies on so-called Expressions.

class emzed.core.data_types.expressions.BaseExpression(left, right)[source]

BaseClass for Expressions. For two Expressions t1 and t2 this class generates new Expressions as follows:

  • Comparison Operators:

    • t1 <= t2
    • t1 < t2
    • t1 >= t2
    • t1 > t2
    • t1 == t2
    • t1 != t2
  • Algebraic Operators:

    • t1 + t2
    • t1 - t2
    • t1 * t2
    • t1 / t2
  • Logic Operators:

    • t1 & t2
    • t1 | t2
    • t1 ^  t2
    • ~t1

    Note

    In Python the operators &, | and ^ bind more tightly than the comparison operators, so you have to use parentheses like (t1 <= t2) & (t1 > t3)
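
The effect of operator precedence can be checked with plain integers, without any emzed objects. The variable names are chosen for illustration only.

```python
x = 5

# intended meaning: x > 1 and x < 3, which is False for x == 5
with_parens = (x > 1) & (x < 3)

# without parentheses Python parses this as x > (1 & x) < 3,
# that is 5 > 1 < 3, which is True
without_parens = x > 1 & x < 3
```

The two results differ, which is why the parentheses are needed in expressions passed to filter() or join().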

aggregate(efun, res_type=None, ignore_none=True, default_empty=None)[source]

creates aggregate expression for aggregation function efun

allFalse

This is an aggregation expression which evaluates an expression to true if all values “represent” false.

Example: tab.rt.allFalse

allNone

This is an aggregation expression which evaluates an expression to true if all values are Nones

Example: tab.rt.allNone

allTrue

This is an aggregation expression which evaluates an expression to true if all values “represent” true.

Example: tab.rt.allTrue

anyFalse

This is an aggregation expression which evaluates an expression to true if any values “represent” false.

Example: tab.rt.anyFalse

anyNone

This is an aggregation expression which evaluates an expression to true if at least one value is None.

Example: tab.rt.anyNone

anyTrue

This is an aggregation expression which evaluates an expression to true if any value “represents” true.

Example: tab.rt.anyTrue

apply(fun, filter_nones=True)[source]

t.apply(fun) results in an expression which applies fun to the values in t if evaluated.

Example: tab.addColumn("amplitude", tab.time.apply(sin))

As None values indicate “unknown” value, the function is applied only to not None values in the columns, unless you specify filter_nones=False.

approxEqual(what, tol)[source]

a.approxEqual(b, tol) tests if |a-b| <= tol.

Example: tab.mz.approxEqual(metabolites.mz, 0.001)

callMethod(name, args=())[source]

calls method named name on values of given column or expression result. args can be used to pass parameters to the method call.

>>> import emzed
>>> t = emzed.utils.toTable("a", ("1", "23"))
>>> t.addColumn("l", t.a.callMethod("__len__"), type_=int)
>>> t.addColumn("x", t.a.callMethod("startswith", ("1",)), type_=bool)
>>> print t
a        l        x       
str      int      bool    
------   ------   ------  
1        1        True    
23       2        False   
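
What callMethod computes per value can be sketched with getattr on a plain list. call_method is a hypothetical helper; it does not model how emzed treats None values.

```python
def call_method(values, name, args=()):
    """Hypothetical helper: call the method called name on every value."""
    return [getattr(v, name)(*args) for v in values]


a = ["1", "23"]
lengths = call_method(a, "__len__")             # [1, 2]
starts = call_method(a, "startswith", ("1",))   # [True, False]
```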

contains(other)[source]

a.contains(b) tests if b in a.

containsElement(element)[source]

For a string valued expression a which represents a molecular formula the expressions a.containsElement(element) tests if the given element is contained in a.

Example: tab.mf.containsElement("Na")

containsOnlyElements(elements)[source]

elements is either a list of strings where each item is a chemical symbol, or a string composed of such symbols.

count

This is an aggregation expression which evaluates a column expression to the number of values in the column.

Example: tab.id.count

replaces the len expression.

countNone

This is an aggregation expression which evaluates a column expression to the number of None values in it.

countNotNone

This is an aggregation expression which evaluates a column expression to the number of values != None in it.

count_different

This is an aggregation expression which evaluates a column expression to the number of different values in the column.

Example: tab.id.count_different

equals(other, abs_tol=None, rel_tol=None)[source]

fast comparison for equality, maybe with some numerical tolerance.

For example:

tn = t.join(t2, t.mz.equals(t2.mz, rel_tol=5e-6) & t.rt.equals(t2.rt, abs_tol=30))

Attention: This operation only works if the first arg of the join (here t2) appears as the table in the first argument (here t2.mz) of equals. Else an exception will be thrown !

hasNone

This is an aggregation expression which evaluates a column expression to True if and only if the column contains a None value.

ifNotNoneElse(other)[source]

a.ifNotNoneElse(b) evaluates to a if a is not None, else it evaluates to b.

Example: t.replaceColumn("rt", t.rt.ifNotNoneElse(t.rt_default))
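
The element-wise meaning of ifNotNoneElse can be sketched on two plain lists. if_not_none_else is a hypothetical helper and the values are made up for illustration.

```python
def if_not_none_else(a, b):
    """Hypothetical helper: take the value from a, fall back to b for None."""
    return [y if x is None else x for x, y in zip(a, b)]


rt = [10.0, None, 30.0]
rt_default = [1.0, 2.0, 3.0]
filled = if_not_none_else(rt, rt_default)
# filled == [10.0, 2.0, 30.0]
```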

inRange(minv, maxv)[source]

a.inRange(low, up) tests if low <= a <= up.

Example: tab.rt.inRange(60, 120)

isIn(li)[source]

a.isIn(li) tests if the value of a is contained in a list li.

Example: tab.id.isIn([1,2,3])

isNone()[source]

This expression returns True for None values which indicate “missing value”

isNotNone()[source]

This expression returns False for None values which indicate “missing value”

len

This expression is deprecated. Please use count instead.

loadFileFromPath(type_=None)[source]

inserts Blob column by reading files from paths listed in given column.

max

This is an aggregation expression which evaluates an expression to its maximal value.

Example: tab.rt.max

mean

This is an aggregation expression which evaluates an expression to its mean.

Example: tab.area.mean

median

This is an aggregation expression which evaluates an expression to its median.

Example: tab.area.median

min

This is an aggregation expression which evaluates an expression to its minimal value.

Example: tab.rt.min

pow(exp)[source]

a.pow(b) evaluates to a**b.

Example: tab.rmse.pow(2)

startswith(other)[source]

For two string valued expressions a and b the expression a.startswith(b) evaluates if the string a starts with the string b. The latter might be a fixed string, as tab.mf.startswith("H2")

std

This is an aggregation expression which evaluates an expression to its standard deviation.

Example: tab.area.std

sum

This is an aggregation expression which evaluates an expression to its sum.

Example: tab.area.sum

thenElse(then, else_)[source]

a.thenElse(b, c) evaluates to b if a is True, if not it evaluates to c.

toTable(colName, fmt=None, type_=None, title='', meta=None)[source]

Generates a one column Table from an expression.

Example: tab = substances.name.toTable("name")

uniqueNotNone

This is an aggregation expression. If applied to an expression t, t.uniqueNotNone evaluates to v if v is the only value in t apart from None. Otherwise it raises an exception.

Example: tab.peakmap.uniqueNotNone
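
The rule can be sketched on a plain list. unique_not_none is a hypothetical helper; raising a ValueError is an assumption of this sketch, as the exception type used by emzed is not stated above.

```python
def unique_not_none(values):
    """Hypothetical helper: the single distinct value besides None."""
    distinct = set(v for v in values if v is not None)
    if len(distinct) != 1:
        # assumption: the exception type is chosen for this sketch only
        raise ValueError("expected exactly one distinct value besides None")
    return distinct.pop()


value = unique_not_none([None, "pm", None, "pm"])   # "pm"
```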

class emzed.core.data_types.expressions.ColumnExpression(table, colname, idx, type_, rows=None)[source]

A ColumnExpression is the simplest form of an Expression. It is generated from a Table t as t.x or by calling t.getColumn("x").

__iadd__(value)[source]

Allows inplace modification of a Table column.

Example: tab.id += 1

__idiv__(value)[source]

Allows inplace modification of a Table column.

Example: tab.area /= 3.141

__imul__(value)[source]

Allows inplace modification of a Table column.

Example: tab.area *= 2

__isub__(value)[source]

Allows inplace modification of a Table column.

Example: tab.id -= 1

modify(operation)[source]

Allows inplace modification of a Table column.

Example: tab.time.modify(sin) replaces the content of column time by its sin value.