%%capture
# Setup
import os
# paths to files
data_folder = os.path.join(os.getcwd(), "tutorial_data")
13 Out-of-memory Processing with Tables and PeakMaps
The Challenge: Many large data sets
In some applications, the RAM of your computer can limit the analysis you want to perform. For example, aligning many peak tables, each with many peaks, can exceed the available memory.
emzed is designed to handle this challenge gracefully by providing an out-of-memory processing capability for its core objects: Table and PeakMap.
The Core Concept: On-Disk Data Storage
The key to emzed's out-of-memory strategy is its ability to store Table and PeakMap data on your computer's hard drive in a highly efficient SQLite database file, rather than keeping it all in RAM.
- In-Memory: By default, or when you explicitly load data without specifying a file path to save to, emzed objects are created in-memory. This is fast for smaller objects but limited by your available RAM.
- On-Disk: When you save a Table or PeakMap to a file (using the file extension .table or .peakmap), the data resides on your disk. emzed can then interact with this on-disk database without needing to load the entire dataset into memory at once.
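The difference between the two storage modes can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch of the general idea only, not emzed's actual implementation: the table name, the column layout, and the file name peaks_demo.db are hypothetical and chosen purely for illustration.

```python
import os
import sqlite3
import tempfile

# Hypothetical peak data (id, mz, intensity), made up for this illustration.
rows = [(i, 100.0 + i, 1000.0 * i) for i in range(1000)]


def fill(conn):
    """Create a small demo table and insert the demo rows."""
    conn.execute(
        "CREATE TABLE peaks (id INTEGER PRIMARY KEY, mz REAL, intensity REAL)"
    )
    conn.executemany("INSERT INTO peaks VALUES (?, ?, ?)", rows)
    conn.commit()


# In-memory database: lives in RAM and is gone once the connection closes.
mem_conn = sqlite3.connect(":memory:")
fill(mem_conn)
mem_conn.close()

# On-disk database: the rows are written to a file and survive closing.
db_path = os.path.join(tempfile.mkdtemp(), "peaks_demo.db")
disk_conn = sqlite3.connect(db_path)
fill(disk_conn)
disk_conn.close()

# Reopening the file later: the query is answered by the database engine,
# so Python only receives the single result value, not all rows.
disk_conn = sqlite3.connect(db_path)
n_high = disk_conn.execute(
    "SELECT COUNT(*) FROM peaks WHERE mz > 1000"
).fetchone()[0]
disk_conn.close()
print(n_high)
```

The on-disk variant trades some speed for the ability to work with data that would not fit into RAM, which is the same trade-off described for .table and .peakmap files above.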
open() vs. load(): The Key to Out-of-Memory Access
Understanding the difference between open and load methods is fundamental to leveraging emzed's out-of-memory capabilities.
- load(): This method, such as emzed.io.load_peak_map(path), typically reads the entire source file (like an mzML) and loads its contents into memory. If you save it to a target file, it writes the full data there. For large files, this initial step can consume a lot of memory.
- open(): This method, such as emzed.Table.open(path) or emzed.PeakMap.open(path), is the workhorse for out-of-memory processing. It does not load the data into memory. Instead, it opens a connection to the on-disk database file and provides you with a Table or PeakMap object that acts as a lightweight interface to that data.
If you want to access a peakmap in this way, you cannot operate on native .mzML or similar raw data files. This needs an initial conversion step as shown in the example below.
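The contrast between eager loading and lazy, on-demand access can be sketched with plain sqlite3. The helper names load_all and iter_rows, the schema, and the file name are hypothetical and are not part of emzed; the sketch only mirrors the load/open distinction conceptually.

```python
import os
import sqlite3
import tempfile

# Build a small demo database file (hypothetical schema, illustration only).
db_path = os.path.join(tempfile.mkdtemp(), "spectra_demo.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE peaks (scan INTEGER, mz REAL, intensity REAL)")
conn.executemany(
    "INSERT INTO peaks VALUES (?, ?, ?)",
    [(i // 10, 100.0 + i, 2.0) for i in range(10_000)],
)
conn.commit()
conn.close()


def load_all(path):
    """'load'-style access: materialize every row as a Python list."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT scan, mz, intensity FROM peaks").fetchall()
    finally:
        conn.close()


def iter_rows(path, mz_min):
    """'open'-style access: stream matching rows from disk on demand."""
    conn = sqlite3.connect(path)
    try:
        cursor = conn.execute(
            "SELECT mz, intensity FROM peaks WHERE mz >= ?", (mz_min,)
        )
        for row in cursor:
            yield row
    finally:
        conn.close()


# Eager: all 10,000 rows are held in RAM at once.
all_rows = load_all(db_path)

# Lazy: rows are fetched while summing and never stored as a full list.
total = sum(intensity for _, intensity in iter_rows(db_path, 10_000.0))
print(len(all_rows), total)
```

With the lazy variant, memory use stays roughly constant regardless of how many rows the file contains, at the cost of reading from disk during iteration.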
Practical Example: PeakMap
import emzed
large_raw_file_path = os.path.join(data_folder, "AA_sample_arabidopsis.mzML")
assert os.path.exists(large_raw_file_path)
on_disk_peakmap_path = "large_data.peakmap"
# create on-disk version of peakmap without loading full peakmap first:
if not os.path.exists(on_disk_peakmap_path):
    emzed.PeakMap.load(
        large_raw_file_path, target_db_file=on_disk_peakmap_path, overwrite=True
    )
pm = emzed.PeakMap.open(on_disk_peakmap_path)
print(pm.is_in_memory())
print()
print(pm.summary())
start remote ip in /builds/K1B9c9w9Z/0/sispub/emzed/emzed.ethz.ch/.venv/share/pyopenms_venv
pyopenms client started. connected to pyopenms client.
False
info              ms_level  value
str               str       str
----------------  --------  ------------------
rt range          1         0.0m .. 10.0m
mz range          1         100.001 .. 999.944
num_scans         1         1054
polarities                  {'+'}
ms_chromatograms            0
# This 'extract' operation creates a view. No data is loaded yet.
high_mass_peaks = pm.extract(mzmin=800)
# Now, when we iterate, the data is fetched from the disk on-demand.
total_intensity = 0
for spectrum in high_mass_peaks:  # This triggers the lazy loading
    if spectrum.peaks.shape[0] > 0:
        total_intensity += spectrum.peaks[:, 1].sum()
print(f"Total intensity for peaks above 800 m/z: {total_intensity:.2f}")
# release file handle:
pm.close()
Total intensity for peaks above 800 m/z: 204001622.35
Practical Example: Table
# Open an existing on-disk Table.
table_path = os.path.join(data_folder, "AA_sample_arabidopsis.table")
assert os.path.exists(table_path)
# instead of loading, we open the table to operate on disk:
t_disk = emzed.Table.open(table_path)
print("is table in memory?", t_disk.is_in_memory())
# don't create copy when filtering:
sub_table = t_disk.filter(t_disk.m0 < 150, keep_view=True)
print("is sub_table in memory?", sub_table.is_in_memory())
# another on-disk view:
first_rows = sub_table[:5]
print("is first_rows in memory?", first_rows.is_in_memory())
# some operations offer an extra path parameter to create a new
# sqlite db instead of loading into memory:
final_table = emzed.Table.stack_tables(
    [first_rows, sub_table], path="stacked_tables.table", overwrite=True
)
print("is final_table in memory?", final_table.is_in_memory())
is table in memory? False
is sub_table in memory? False
is first_rows in memory? False
is final_table in memory? False
Best Practices and Performance
- Use .open() for Large Files: Always prefer emzed.PeakMap.open() and emzed.Table.open() when interacting with large, pre-existing .peakmap or .table files.
- Disk I/O vs. RAM: While out-of-memory processing saves RAM, it relies on disk input/output (I/O), which is inherently slower than accessing memory. For very complex and iterative algorithms, you might notice a performance difference. A fast SSD will significantly improve performance compared to a traditional hard drive.
- Consolidate Results: If you create a complex chain of filtered views, it can sometimes be beneficial to .save() or .consolidate() the intermediate result to a new on-disk file. This can optimize the underlying database queries for subsequent steps.
- Check with is_in_memory(): You can always check whether your object is working in-memory or on-disk by calling the .is_in_memory() method.
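The idea behind consolidating a chain of views can be sketched with SQL views in sqlite3. How emzed represents views internally is not described here, so treat this as an analogy under that assumption; the view names, table names, and file name are hypothetical.

```python
import os
import sqlite3
import tempfile

# Demo database file with made-up peak data (illustration only).
db_path = os.path.join(tempfile.mkdtemp(), "views_demo.db")
conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE peaks (id INTEGER PRIMARY KEY, mz REAL, intensity REAL)"
)
conn.executemany(
    "INSERT INTO peaks (mz, intensity) VALUES (?, ?)",
    [(100.0 + i, float(i)) for i in range(1000)],
)

# A chain of views: every query on the last view is re-evaluated
# through all filters against the base table.
conn.execute("CREATE VIEW low_mz AS SELECT * FROM peaks WHERE mz < 150")
conn.execute(
    "CREATE VIEW low_mz_intense AS SELECT * FROM low_mz WHERE intensity >= 40"
)

# "Consolidating": write the filtered result into its own table so later
# queries read the stored rows directly instead of re-running the filters.
conn.execute("CREATE TABLE consolidated AS SELECT * FROM low_mz_intense")
conn.commit()

n_view = conn.execute("SELECT COUNT(*) FROM low_mz_intense").fetchone()[0]
n_table = conn.execute("SELECT COUNT(*) FROM consolidated").fetchone()[0]
conn.close()
print(n_view, n_table)
```

Both queries return the same rows; the consolidated table costs extra disk space but avoids repeating the filter chain in every subsequent step.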