In [1]:

%%capture
# Setup
import os

# paths to files
data_folder = os.path.join(os.getcwd(), "tutorial_data")

13 Out-of-memory Processing with Tables and PeakMaps

The Challenge: Many large data sets

In some applications, the RAM of your computer limits the analysis you want to perform. For example, aligning many peak tables that each contain many peaks can exceed the available memory. emzed is designed to handle this challenge gracefully by providing an out-of-memory processing capability for its core objects: Table and PeakMap.

The Core Concept: On-Disk Data Storage

The key to emzed's out-of-memory strategy is its ability to store Table and PeakMap data on your computer's hard drive in a highly efficient SQLite database file, rather than keeping it all in RAM.

In-Memory: By default, or when you load data without specifying a target file path, emzed objects are created in memory. This is fast for smaller objects but limited by your available RAM.
On-Disk: When you save a Table or PeakMap to a file (with the file extension .table or .peakmap), the data resides on your disk. emzed can then interact with this on-disk database without loading the entire dataset into memory at once.
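
The on-disk idea can be illustrated with Python's built-in sqlite3 module. This is only a sketch of the general principle, not emzed's actual file layout: the file name, the table name `peaks` and its columns are assumptions made up for the illustration. The point is that the database engine evaluates the filter and the aggregate, and only the small result reaches Python.

```python
import os
import sqlite3
import tempfile

# Hypothetical file name and schema, chosen only for this illustration.
# They do not reflect how emzed organises .table or .peakmap files.
db_path = os.path.join(tempfile.mkdtemp(), "peaks_demo.db")

# Write 2000 rows to a database file on disk.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE peaks (id INTEGER PRIMARY KEY, mz REAL, intensity REAL)")
conn.executemany(
    "INSERT INTO peaks (mz, intensity) VALUES (?, ?)",
    ((100.0 + 0.5 * i, float(i % 7)) for i in range(2000)),
)
conn.commit()
conn.close()

# Reopen the file and let SQLite compute an aggregate over a subset.
# Only the single result row is converted into Python objects.
conn = sqlite3.connect(db_path)
n_high, total_high = conn.execute(
    "SELECT COUNT(*), SUM(intensity) FROM peaks WHERE mz >= ?", (800.0,)
).fetchone()
conn.close()

print(n_high, total_high)
```

The same query on an in-memory list would require all 2000 rows to be present as Python objects first; here the rows stay in the file.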

`open()` vs. `load()`: The Key to Out-of-Memory Access

Understanding the difference between open and load methods is fundamental to leveraging emzed's out-of-memory capabilities.

load(): This method, such as emzed.io.load_peak_map(path), typically reads the entire source file (like an mzML) and loads its contents into memory. If you save it to a target file, it writes the full data there. For large files, this initial step can consume a lot of memory.
open(): This method, such as emzed.Table.open(path) or emzed.PeakMap.open(path), is the workhorse for out-of-memory processing. It does not load the data into memory. Instead, it opens a connection to the on-disk database file and provides you with a Table or PeakMap object that acts as a lightweight interface to that data.

open() does not work directly on native .mzML or similar raw data files. Such files first need to be converted to a .peakmap file, as shown in the PeakMap example below.
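
The contrast between the two access styles can be sketched with plain sqlite3. The helper names `create_demo_db`, `load_all` and `LazyRows` are hypothetical and exist only in this sketch; they are not emzed functions and do not claim to mirror emzed's internals. `load_all` plays the role of an eager load, `LazyRows` the role of a lightweight handle that fetches rows on demand.

```python
import os
import sqlite3
import tempfile


def create_demo_db(path, n_rows):
    """Create a small demo database (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE spectra (id INTEGER PRIMARY KEY, rt REAL)")
    conn.executemany(
        "INSERT INTO spectra (rt) VALUES (?)",
        ((float(i),) for i in range(n_rows)),
    )
    conn.commit()
    conn.close()


def load_all(path):
    """Eager access: read every row into a Python list, then close the file."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT id, rt FROM spectra ORDER BY id").fetchall()
    finally:
        conn.close()


class LazyRows:
    """Lazy access: hold only a connection; rows are fetched while iterating."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)

    def __len__(self):
        return self._conn.execute("SELECT COUNT(*) FROM spectra").fetchone()[0]

    def __iter__(self):
        # The cursor hands out rows one at a time as the loop advances.
        return iter(self._conn.execute("SELECT id, rt FROM spectra ORDER BY id"))

    def close(self):
        self._conn.close()


demo_path = os.path.join(tempfile.mkdtemp(), "spectra_demo.db")
create_demo_db(demo_path, 1000)

eager_rows = load_all(demo_path)  # all 1000 tuples are held in memory
lazy_rows = LazyRows(demo_path)   # no rows fetched yet

first_three = []
for row in lazy_rows:
    first_three.append(row)
    if len(first_three) == 3:
        break  # stop early; the remaining rows never become Python objects

n_rows = len(lazy_rows)  # answered by the database, not by counting a list
lazy_rows.close()
```

As with the emzed objects described above, the lazy handle has to be closed explicitly to release the file.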

Practical Example: PeakMap

In [2]:

import emzed

large_raw_file_path = os.path.join(data_folder, "AA_sample_arabidopsis.mzML")
assert os.path.exists(large_raw_file_path)

on_disk_peakmap_path = "large_data.peakmap"

# create on-disk version of peakmap without loading full peakmap first:
if not os.path.exists(on_disk_peakmap_path):
    emzed.PeakMap.load(
        large_raw_file_path, target_db_file=on_disk_peakmap_path, overwrite=True
    )

pm = emzed.PeakMap.open(on_disk_peakmap_path)
print(pm.is_in_memory())
print()
print(pm.summary())

False

info              ms_level  value             
str               str       str               
----------------  --------  ------------------
rt range          1         0.0m .. 10.0m     
mz range          1         100.001 .. 999.944
num_scans         1         1054              
polarities                  {'+'}             
ms_chromatograms            0

In [3]:

# This 'extract' operation creates a view. No data is loaded yet.
high_mass_peaks = pm.extract(mzmin=800)

# Now, when we iterate, the data is fetched from the disk on-demand.
total_intensity = 0
for spectrum in high_mass_peaks:  # This triggers the lazy loading
    if spectrum.peaks.shape[0] > 0:
        total_intensity += spectrum.peaks[:, 1].sum()

print(f"Total intensity for peaks above 800 m/z: {total_intensity:.2f}")
# release file handle:
pm.close()

Total intensity for peaks above 800 m/z: 204001622.35

Practical Example: Table

In [4]:

# Open an existing on-disk Table and work with views of it.

table_path = os.path.join(data_folder, "AA_sample_arabidopsis.table")
assert os.path.exists(table_path)

# instead of loading, we open the table to operate on disk:
t_disk = emzed.Table.open(table_path)
print("is table in memory?", t_disk.is_in_memory())

# don't create copy when filtering:
sub_table = t_disk.filter(t_disk.m0 < 150, keep_view=True)
print("is sub_table in memory?", sub_table.is_in_memory())

# another on-disk view:
first_rows = sub_table[:5]
print("is first_rows in memory?", first_rows.is_in_memory())

# some operations offer an extra path parameter to create a new
# sqlite db instead of loading into memory:
final_table = emzed.Table.stack_tables(
    [first_rows, sub_table], path="stacked_tables.table", overwrite=True
)
print("is final_table in memory?", final_table.is_in_memory())

is table in memory? False
is sub_table in memory? False
is first_rows in memory? False

is final_table in memory? False

Best Practices and Performance

Use .open() for Large Files: Always prefer emzed.PeakMap.open() and emzed.Table.open() when interacting with large, pre-existing .peakmap or .table files.
Disk I/O vs. RAM: While out-of-memory processing saves RAM, it relies on disk input/output (I/O), which is inherently slower than accessing memory. For very complex and iterative algorithms, you might notice a performance difference. A fast SSD will significantly improve performance compared to a traditional hard drive.
Consolidate Results: If you create a complex chain of filtered views, it can sometimes be beneficial to .save() or .consolidate() the intermediate result to a new on-disk file. This can optimize the underlying database queries for subsequent steps.
Check with is_in_memory(): You can always check if your object is working in-memory or on-disk by calling the .is_in_memory() method.
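
The idea behind consolidating a chain of views can be sketched with plain sqlite3. This is an analogy under stated assumptions, not a description of what emzed's .save() or .consolidate() do internally: the file names, the `peaks` table and the view names are hypothetical. Two filters are stacked as SQL views, and the end of the chain is then written out as an ordinary table in a second database file, so later queries no longer have to re-evaluate the filters.

```python
import os
import sqlite3
import tempfile

# Hypothetical file names and schema, used only for this illustration.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "source_demo.db")
target_path = os.path.join(workdir, "consolidated_demo.db")

conn = sqlite3.connect(source_path)
conn.execute("CREATE TABLE peaks (id INTEGER PRIMARY KEY, mz REAL, area REAL)")
conn.executemany(
    "INSERT INTO peaks (mz, area) VALUES (?, ?)",
    ((100.0 + i, float(i)) for i in range(500)),
)
conn.commit()

# Two stacked filters, each defined on top of the previous one.
conn.execute("CREATE VIEW low_mass AS SELECT * FROM peaks WHERE mz < 150")
conn.execute("CREATE VIEW low_mass_big AS SELECT * FROM low_mass WHERE area >= 40")

# Materialise the end of the chain as a plain table in a second file.
conn.execute("ATTACH DATABASE ? AS target", (target_path,))
conn.execute("CREATE TABLE target.peaks AS SELECT * FROM low_mass_big")
conn.commit()
conn.execute("DETACH DATABASE target")
conn.close()

# The new file is independent of the source and of the view definitions.
result = sqlite3.connect(target_path)
consolidated_rows = result.execute(
    "SELECT mz, area FROM peaks ORDER BY mz"
).fetchall()
result.close()

print(len(consolidated_rows))
```

Whether this pays off depends on how often the intermediate result is reused: writing the new file costs disk I/O once, while every later query avoids the nested filters.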