Table internals¶
emzed.Table is a typed, expression-based tabular data structure. Compared
to a plain pandas DataFrame it offers strict column types, domain-specific
numeric types, and a lazy evaluation model backed by SQLite.
Storage: SQLite, not pandas¶
Every Table is backed by a SQLite database, either in-memory or as a file on disk. There is no pandas or numpy array underneath. This has a few practical consequences:
- Tables saved to disk can be re-opened without loading all data into RAM.
- Operations that look like in-memory transformations (filtering, sorting, column projection) are actually SQL queries.
- A Table loaded from a file keeps a live connection to that file; changes on disk are visible immediately.
The internal column names used in the database (e.g. col_1, col_2) are
mapped to user-facing names via a metadata dictionary, so renaming a column
does not require altering the schema.
Lazy views and consolidation¶
Three Table methods — filter(), sort_by(), and extract_columns() — do
not copy data. Instead, they create a SQL VIEW on top of the existing
table and return a new Table wrapping that view. These views are composable:
applying filter() on an already-filtered Table adds another WHERE clause to
the chain.
Materialising a view chain into a real table is called consolidation.
These three methods consolidate automatically before returning (controlled by
the @consolidate_per_default decorator), but passing keep_view=True defers
this. Explicit consolidation is also available via Table.consolidate().
Consolidation does the following:
- Opens a new SQLite connection — in-memory when
path=None, file-backed otherwise. - Creates a fresh
datatable with the correct schema in the target. - Temporarily attaches source and target databases to each other.
- Transfers any custom SQLite functions registered on the source connection to
the target. This step is essential when a view was built using an
apply()-based expression: those expressions are backed by SQLite user-defined functions, and without transferring them theSELECTused to copy the data would fail. - Executes a single
INSERT INTO … SELECT FROM SOURCE.<view>to materialise the data. - Detaches both databases, saves table metadata, and copies any BLOB-stored objects (nested Tables, PeakMaps) that the rows reference.
- Returns a new
FullTableModel— a real, mutable, writable table.
All mutations (adding or removing rows, adding or replacing columns) are
blocked on views and raise a TypeError asking the caller to consolidate
first. The following operations also trigger consolidation internally, without
requiring the caller to do so explicitly:
join()— consolidates the left table if it is a view before joiningsplit_by()andgroupby()iteration — each chunk is returned consolidatedTable.load()— consolidates after reading from disk
When a mutable table is modified (row added or deleted, column added or
replaced), it immediately invalidates all views derived from it. Invalidation
does two things: the underlying SQL VIEW is dropped from the database, and an
_is_valid flag is set to False. Every model method is guarded by a
decorator that checks this flag and raises a clear ValueError — "this view /
model is not valid any more" — rather than silently returning stale or
incorrect data.
The view registry on the parent model holds ordinary (strong) references to its child views, so views are kept alive as long as the parent is alive or until they are invalidated. Once invalidated, the view object still exists in Python but is inert: every operation on the corresponding Table raises the error above.
Column types¶
Column types are declared at table creation and enforced throughout the table's lifetime. The supported types are:
| Type | Notes |
|---|---|
int, float, bool, str |
Standard scalar types |
MzType |
Float subtype for m/z values; displays with 6 decimal places |
RtType |
Float subtype for retention times; stored in seconds, displayed in minutes |
PeakMap |
MS data; stored as a serialised BLOB, not queryable via expressions |
Table |
Nested Table; stored as a BLOB |
object |
Arbitrary picklable Python object; stored as a BLOB |
MzType and RtType participate in arithmetic and comparisons just like
float. PeakMap, Table, and object columns block arithmetic and
comparison operations with an explicit error — they can only be retrieved
row-by-row.
BLOB columns are serialised with dill (a pickle-compatible library that handles a wider range of Python objects).
Expression system¶
Column expressions are the primary way to derive new columns or filter rows. When you write:
the right-hand sides are not evaluated immediately in Python. Instead,
accessing a column on a Table (e.g. t.rt) returns a ColumnAccessor object.
Operators on that object build up an expression tree. When the expression is
used in a Table operation, the tree is compiled to a SQL fragment and executed
by SQLite.
User-defined Python functions can also be used in expressions via
expression.apply(func, return_type). These are registered as SQLite
user-defined functions so the same SQL execution path is used throughout.
Multiprocessing and in-memory vs file-backed tables¶
Tables come in two storage modes:
- In-memory (
path=None) — the SQLite database lives entirely in RAM. Fast and self-contained, but the table cannot be shared with worker processes. Passing an in-memory table to amultiprocessingworker raises aPicklingErrorimmediately. - File-backed — the SQLite database is a regular file on disk. File-backed
tables can be sent to worker processes: on pickling, the SQLite connection's
__getstate__returns only the URI and bookkeeping data (no livesqlite3handle). When the bytes are unpickled in the worker,__setstate__runssqlite3.connect(uri, uri=True)against the same file, giving the worker an independent connection.
A few important consequences of the file-backed model:
- Workers receive a read-only snapshot in practice — there is no coordination layer, so concurrent writes from multiple processes would race and corrupt the database. Treat file-backed tables as read-only when used across processes.
- General pickling (e.g.
pickle.dumps(table)) is intentionally blocked and raisesNotImplementedError. Only themultiprocessingserialisation path is allowed; the code detects the caller and rejects everything else. - Views derived from a file-backed table reference the same file, so a worker that receives a filtered view still opens the original database file.
Resource management and garbage collection¶
Two separate cleanup mechanisms keep memory and disk usage under control.
Connection reference counting. Multiple Table objects can share the same
underlying SQLite connection. A module-level reference counter tracks how many
FullTableModel instances are using each connection URI. When a model is
garbage-collected, a weakref.finalize callback fires, decrements the counter,
and closes the connection only when the count reaches zero. This avoids both
premature closes and connection leaks without relying on __del__.
BLOB column GC. PeakMap and Table values stored in BLOB columns are
not embedded directly in the SQLite row. Instead, each object is kept in a
separate registry and the row stores only a unique ID. Whenever a cell
holding one of these values is removed or overwritten — e.g. by
delete_rows, a column drop, or a cell update — the table runs an internal
garbage collector that scans all remaining IDs still referenced by any row
and calls _remove_unused_references() on the PeakMap and Table
registries to free objects that are no longer reachable. This prevents
large MS objects from accumulating in the database after the cells
referring to them are gone.
Group-by aggregations¶
Table.groupby(column) returns a GroupBy helper rather than a new Table.
Aggregation methods on that helper (min, max, sum, mean, …) compile to
SQL GROUP BY queries and return a new consolidated Table.