Skip to content

Table internals

emzed.Table is a typed, expression-based tabular data structure. Compared to a plain pandas DataFrame it offers strict column types, domain-specific numeric types, and a lazy evaluation model backed by SQLite.

Storage: SQLite, not pandas

Every Table is backed by a SQLite database, either in-memory or as a file on disk. There is no pandas or numpy array underneath. This has a few practical consequences:

  • Tables saved to disk can be re-opened without loading all data into RAM.
  • Operations that look like in-memory transformations (filtering, sorting, column projection) are actually SQL queries.
  • A Table loaded from a file keeps a live connection to that file; changes on disk are visible immediately.

The internal column names used in the database (e.g. col_1, col_2) are mapped to user-facing names via a metadata dictionary, so renaming a column does not require altering the schema.

Lazy views and consolidation

Three Table methods — filter(), sort_by(), and extract_columns() — do not copy data. Instead, they create a SQL VIEW on top of the existing table and return a new Table wrapping that view. These views are composable: applying filter() on an already-filtered Table adds another WHERE clause to the chain.

Materialising a view chain into a real table is called consolidation. These three methods consolidate automatically before returning (controlled by the @consolidate_per_default decorator), but passing keep_view=True defers this. Explicit consolidation is also available via Table.consolidate().

Consolidation does the following:

  1. Opens a new SQLite connection — in-memory when path=None, file-backed otherwise.
  2. Creates a fresh data table with the correct schema in the target.
  3. Temporarily attaches source and target databases to each other.
  4. Transfers any custom SQLite functions registered on the source connection to the target. This step is essential when a view was built using an apply()-based expression: those expressions are backed by SQLite user-defined functions, and without transferring them the SELECT used to copy the data would fail.
  5. Executes a single INSERT INTO … SELECT FROM SOURCE.<view> to materialise the data.
  6. Detaches both databases, saves table metadata, and copies any BLOB-stored objects (nested Tables, PeakMaps) that the rows reference.
  7. Returns a new FullTableModel — a real, mutable, writable table.

All mutations (adding or removing rows, adding or replacing columns) are blocked on views and raise a TypeError asking the caller to consolidate first. The following operations also trigger consolidation internally, without requiring the caller to do so explicitly:

  • join() — consolidates the left table if it is a view before joining
  • split_by() and groupby() iteration — each chunk is returned consolidated
  • Table.load() — consolidates after reading from disk

When a mutable table is modified (row added or deleted, column added or replaced), it immediately invalidates all views derived from it. Invalidation does two things: the underlying SQL VIEW is dropped from the database, and an _is_valid flag is set to False. Every model method is guarded by a decorator that checks this flag and raises a clear ValueError"this view / model is not valid any more" — rather than silently returning stale or incorrect data.

The view registry on the parent model holds ordinary (strong) references to its child views, so views are kept alive as long as the parent is alive or until they are invalidated. Once invalidated, the view object still exists in Python but is inert: every operation on the corresponding Table raises the error above.

Column types

Column types are declared at table creation and enforced throughout the table's lifetime. The supported types are:

Type Notes
int, float, bool, str Standard scalar types
MzType Float subtype for m/z values; displays with 6 decimal places
RtType Float subtype for retention times; stored in seconds, displayed in minutes
PeakMap MS data; stored as a serialised BLOB, not queryable via expressions
Table Nested Table; stored as a BLOB
object Arbitrary picklable Python object; stored as a BLOB

MzType and RtType participate in arithmetic and comparisons just like float. PeakMap, Table, and object columns block arithmetic and comparison operations with an explicit error — they can only be retrieved row-by-row.

BLOB columns are serialised with dill (a pickle-compatible library that handles a wider range of Python objects).

Expression system

Column expressions are the primary way to derive new columns or filter rows. When you write:

t.filter(t.rt > 60.0)
t.add_column("area_norm", t.area / t.area.max(), float)

the right-hand sides are not evaluated immediately in Python. Instead, accessing a column on a Table (e.g. t.rt) returns a ColumnAccessor object. Operators on that object build up an expression tree. When the expression is used in a Table operation, the tree is compiled to a SQL fragment and executed by SQLite.

User-defined Python functions can also be used in expressions via expression.apply(func, return_type). These are registered as SQLite user-defined functions so the same SQL execution path is used throughout.

Multiprocessing and in-memory vs file-backed tables

Tables come in two storage modes:

  • In-memory (path=None) — the SQLite database lives entirely in RAM. Fast and self-contained, but the table cannot be shared with worker processes. Passing an in-memory table to a multiprocessing worker raises a PicklingError immediately.
  • File-backed — the SQLite database is a regular file on disk. File-backed tables can be sent to worker processes: on pickling, the SQLite connection's __getstate__ returns only the URI and bookkeeping data (no live sqlite3 handle). When the bytes are unpickled in the worker, __setstate__ runs sqlite3.connect(uri, uri=True) against the same file, giving the worker an independent connection.

A few important consequences of the file-backed model:

  • Workers receive a read-only snapshot in practice — there is no coordination layer, so concurrent writes from multiple processes would race and corrupt the database. Treat file-backed tables as read-only when used across processes.
  • General pickling (e.g. pickle.dumps(table)) is intentionally blocked and raises NotImplementedError. Only the multiprocessing serialisation path is allowed; the code detects the caller and rejects everything else.
  • Views derived from a file-backed table reference the same file, so a worker that receives a filtered view still opens the original database file.

Resource management and garbage collection

Two separate cleanup mechanisms keep memory and disk usage under control.

Connection reference counting. Multiple Table objects can share the same underlying SQLite connection. A module-level reference counter tracks how many FullTableModel instances are using each connection URI. When a model is garbage-collected, a weakref.finalize callback fires, decrements the counter, and closes the connection only when the count reaches zero. This avoids both premature closes and connection leaks without relying on __del__.

BLOB column GC. PeakMap and Table values stored in BLOB columns are not embedded directly in the SQLite row. Instead, each object is kept in a separate registry and the row stores only a unique ID. Whenever a cell holding one of these values is removed or overwritten — e.g. by delete_rows, a column drop, or a cell update — the table runs an internal garbage collector that scans all remaining IDs still referenced by any row and calls _remove_unused_references() on the PeakMap and Table registries to free objects that are no longer reachable. This prevents large MS objects from accumulating in the database after the cells referring to them are gone.

Group-by aggregations

Table.groupby(column) returns a GroupBy helper rather than a new Table. Aggregation methods on that helper (min, max, sum, mean, …) compile to SQL GROUP BY queries and return a new consolidated Table.