[1]:
# Setup
import emzed
import os
# path to files
data_folder = os.path.join(os.getcwd(), "tutorial_data")
path_table_caulobacter = os.path.join(
data_folder, "t_caulo_ffmetabo.table"
)
path_table_arabidopsis = os.path.join(
data_folder, "t_ara_ffmetabo.table"
)
ref_path = os.path.join(data_folder, "mz_calibration_table.csv")
t_caulo = emzed.io.load_table(path_table_caulobacter)
t_ara = emzed.io.load_table(path_table_arabidopsis)
t_ref = emzed.io.load_csv(ref_path)
start remote ip in /root/.emzed3/pyopenms_venv
pyopenms client started.
connected to pyopenms client.
The emzed3 ``alignment`` module provides functions that allow correcting for instrument drifts (mz, RT), which occured during data acquisition. Note, all alignment processes operate on LC-MS peak level and a preceeding peak detection step is required.
There are multiple reasons (e.g. sample matrix, variation in buffer composition) provoking retention time shift between samples analyzed with the same LC-MS method. Depending on the observed retention time differences, aligning can enhance the mapping of identical peaks in different samples significantly, as the required retention time tolerance can be reduced. In principle, the algorithm maps peaks of a reference table and a feature table with similar mz values and within a user defined retention time window. Based on peak mapping, a function \(\Delta RT(RT)\) is created and a curve fit is applied to the curve. Subsequently, retention time of each spectrum in the peak map is shifted as function of the fit curve. Hence, the quality of retention time shift depends a lot on the number of correct peak mappings and on the applied fitting.
In emzed, the emzed.align.rt_align
function is used for RT alignment. The algorithm originates from OpenMS and you can find some information about it here. RT alignment in emzed requires the specific column names feature_id
, rt
, mz
, and intensity
present in run_feature_finder_metabo
output tables. If you want to align peak tables
generated with the mzmine2 extension, you have to rename column names accordingly!
The emzed.align.rt_align
function has a number of arguments that require some explanations:
rt_align(tables, reference_table=None, destination=None, reset_integration=False, n_peaks=-1, max_rt_difference=100, max_mz_difference=0.3, max_mz_difference_pair_finder=0.5, model='b_spline', **model_parameters)
tables
: A list of feature tables created with e.g. run_feature_finder_metabo.
reference_table
: Extra reference table, if None the table with most features among tables is taken. A reference Table is in particular useful when using a global reference for routine measurements. Aligning all your measurements to the same table significantly enhances long term sample comparison over different batches. Note, a significant peak overlap between reference_table and tables is required for good retention time alignment.
destination
: Target folder for quality plots \(\Delta RT(RT)\) including the fitting curve (see Figure below). For each feature table a quality plot is created. Such plots are in particular usefull for parameter tuning (e.g. checking whether the reference table is suitable).
reset_integration
: Sets all column values related to integration (i.e. area) to None
if present. Else the alignment will fail for already integrated tables. Why resetting peak integration? Because changing the rt values of spectra can have an influence on peak bounderies, meaning the peak areas are no longer correct. If possible, do integrate peaks after retention time alignment!
n_peaks
: The maximal number of peaks matched by matching algorithm, where -1 means all peaks. The value can be float between 0 and 1 to refer to fraction of peaks in reference table. Peaks are selected by their intensity. In general, the default value of -1 works well.
max_rt_difference
: Max allowed difference of rt values in seconds for searching matching features. The value should be adapted to the data set. An overestimation will significantly lower the alignment quality.
max_mz_difference_pair_finder
: Max allowed difference in mz values for pair finding.
model
: The two available fitting models are b-spline and lowess. Both fitting models can be adapted to obtain good fitting results since lowess allows using cubic spline as fitting function. However, b-spline applies least square, whereas lowess uses weighted least square for error minimization. Hence, when appplying lowess, data points closer to the fitting will be weighted stronger, which
makes the approach less sensitive to outliers and should be advantageous with more noisy data.
model_parameters
: Keyword parameters depending on provided value for model
.
model='b_spline'
- `num_nodes`: Number of break points of fitted spline (default: 5). Note, that more points result in splines with higher variation and that `num_modes` must be uneven.
model='lowess'
span
: Value between 0 and 1, larger values lead to more smoothing (default is 2/3). The smaller the value the more local regression segments will be defined.
interpolation_type
: either "cspline"
or "linear"
(default: "linear"
)
Again, we will use the Arabidopsis and Caulobacter cell extract samples to demonstrate RT alignment. The two examples originate from two different batches. We can observe a significant retention time shift increasing with increasing retention time (see Figure 6 below).
Figure 6. Selected peaks from Caulobacter (left peak) and Arabidopsis (right peak) samples to demonstrate the retentions time shift.
We can observe a maximal retention time shifts for late eluting peaks of about 10 s. Since the number of common peaks of the two example files is low, we can extract all peaks used for the alignment process by a simple join with max_rt_difference = 10
as absolute RT tolerance and max_mz_pair_difference = 0.005
as absolute mz tolerance: €€ correct rt in_range!!€€
[2]:
mz_condition = t_caulo.mz.approx_equal(t_ara.mz, 0.005, 0)
rt_condition = t_caulo.rt.in_range(t_ara.rt - 20.0, t_ara.rt + 5.0)
t_common = t_caulo.join(t_ara, mz_condition & rt_condition)
print(f"number of common peaks: {len(t_common)}")
number of common peaks: 27
[3]:
import matplotlib.pyplot as plt
# we calculate the delta rt
t_common.add_or_replace_column(
"delta_rt", t_common.rt - t_common.rt__0, float
)
rt = t_common.rt.to_list()
delta_rt = t_common.delta_rt.to_list()
plt.plot(rt, delta_rt, "*")
plt.xlabel("Retention time (s)")
plt.ylabel("$\Delta$ RT (s)")
plt.show()
The plot indicates a more or less linear retention time difference decrease. We can now apply the RT alignment function using the two different fitting models with default fitting model settings and check the fitting quality.
[4]:
rt_align = emzed.align.rt_align
tables = [t_caulo, t_ara]
tables_spline = rt_align(
tables,
max_rt_difference=20.0,
max_mz_difference=0.005,
max_mz_difference_pair_finder=0.005,
) # default model is b-spline
tables_lowess = rt_align(
tables,
max_rt_difference=20.0,
max_mz_difference=0.005,
max_mz_difference_pair_finder=0.005,
model="lowess",
)
reference_map is <noname>
reference_map is <noname>
Alignment results:
Figure 7: Fitting curve (top) and selected LC-MS peaks of aligned samples (bottom) after RT alignment obtained with lowess (left) and b-spline (right).
For given example, lowess results a slightly better alignment. However, as already mentioned, the two samples have very few peaks in common meaning choosing the default number of nodes for b-spline fitting (5) tends to overfit. When reducing the number of nodes to 3 we obtain fitting results very similar to those obtained with lowess. Analogous, we can drastically reduce the robustness of the lowess fit by lowering the span value from 0.66 to 0.3
thus increasing the local weights drastically, which is similar to increasing num_nodes
with b_spline fit.
Figure 8: Fitting curve (top) and selected LC-MS peaks of aligned samples (bottom) after RT alignment. \(RT\) vs \(\Delta RT\) plots of b_spline fit with num_nodes = 7 and lowess fit with span = 0.3.
The given example with very few peaks in common was selected to simplify the demonstration of fit models and corresponding parameters. Typically, the number of common peaks is much higher as shown in Figure 9.
Figure 9: Typical retention time alignment using default fit model settings (“b-spline” , num_nodes = 5)
emzed3 provides the possibility to perform a “post acquisition mass calibration” based on known m/z peaks to correct for m/z instrument drifts. The mass accuracy stability depends on the type of intrument and can vary more or less significantly over time and noticeable mass drifts might be observed.
t_aligned = mz_align(table, reference_table, tol=0.015, destination=None, min_r2=0.95, max_tol=0.001, min_points=5, interactive=False)
The functions performs affine linear mz value correction for feature tables with function arguments
table
: a peaks table with required columns
reference table
: Table defining reference peaks with required the columns
name
compound names
mz
calculated mz values of compounds
rtmin
, rtmax
, the expected retention time window
polarity
the polarity of the compound ion (+/-)
tol
: maximal tolerated difference of reference and feature table mz values to include compound peak into fitting procsess.
destination
: Target folder for quality plots \(\Delta RT(RT)\) including the fitting curve (see Figure 7). For each feature table a quality plot is created.
min_r2
: Minimal quality of linear regression. Fitting results with \(R^{2} < min_r2\) will be ignored.
max_tol
: maximal allowed m/z diffence between calculated and corrected m/z values. Fitting results containing calibration peaks with \(abs(mz_corr-mz_calc) > max\) will be ignored.
interactive
: The interactive mode allows peak exclusion from the fitting curve by the user.
In chapter 3 we extracted amino acids by a targeted approach. We can use amino acid peaks to demonstrate mz alignment.
[5]:
t_ref
[5]:
name | mf | m0 | mz | rt | rtmin | rtmax | polarity |
---|---|---|---|---|---|---|---|
str | str | float | MzType | RtType | RtType | RtType | str |
Phenylalanine | C9H11NO2 | 165.079000 | 166.086300 | 2.27 m | 2.14 m | 2.40 m | + |
Leucine | C6H13NO2 | 131.094600 | 132.101900 | 1.65 m | 1.52 m | 1.78 m | + |
Tryptophane | C11H12N2O2 | 204.089900 | 205.097200 | 2.35 m | 2.22 m | 2.48 m | + |
Valine | C5H11NO2 | 117.079000 | 118.086300 | 3.27 m | 3.14 m | 3.40 m | + |
Tyrosine | C9H11NO3 | 181.073900 | 182.081200 | 3.63 m | 3.50 m | 3.76 m | + |
Alanine | C3H7NO2 | 89.047700 | 90.055000 | 4.00 m | 3.87 m | 4.13 m | + |
Threonine | C4H9NO3 | 119.058200 | 120.065500 | 4.61 m | 4.48 m | 4.74 m | + |
[6]:
mz_align = emzed.align.mz_align
t_mz_aligned = mz_align(t_caulo, t_ref)
t1 = t_caulo.extract_columns("id", "mz")
t2 = t_mz_aligned.extract_columns("id", "mz")
comp = t1.join(t2, t1.id == t2.id)
comp[10:100:10]
5 MATCHES FROM REFERENCE
NUMPOINTS= 5 goodness=0.014
[6]:
id | mz | id__0 | mz__0 |
---|---|---|---|
int | MzType | int | MzType |
10 | 104.070310 | 10 | 104.070613 |
20 | 112.075332 | 20 | 112.075635 |
30 | 118.085986 | 30 | 118.086289 |
40 | 128.106696 | 40 | 128.106997 |
50 | 132.101695 | 50 | 132.101996 |
60 | 141.102055 | 60 | 141.102356 |
70 | 148.060255 | 70 | 148.060555 |
80 | 151.122885 | 80 | 151.123185 |
90 | 153.102109 | 90 | 153.102408 |
[7]:
comp[-101:-1:10]
[7]:
id | mz | id__0 | mz__0 |
---|---|---|---|
int | MzType | int | MzType |
1048 | 554.857002 | 1048 | 554.857275 |
1058 | 562.352317 | 1058 | 562.352589 |
1068 | 564.358909 | 1068 | 564.359182 |
1078 | 570.005582 | 1078 | 570.005854 |
1088 | 577.371586 | 1088 | 577.371858 |
1098 | 578.038838 | 1098 | 578.039109 |
1108 | 584.379706 | 1108 | 584.379977 |
1118 | 586.377630 | 1118 | 586.377901 |
1128 | 591.373498 | 1128 | 591.373769 |
1138 | 592.891840 | 1138 | 592.892110 |
Comments: 1. The instrument was already well calibrated and hence we obtain a neglectable mz correction in the sub milli mass unit range. 2. Selected calibration peaks are not ideal, since they do not span the complete mz range. Best practice would be to spike the samples with calibrants and corresponding LC-MS peaks do not overlap with sample peaks.
Comparing peaks in tables of different samples is quite cumbersome since identic peaks having different ids in different samples. So far, to compare peaks we applied the table methods join
, left_join
or filter
. The alignment module provides the function ``align_peaks`` that assigns a global peak id to all samples of interest. Depending on observed mz and rt shifts it is recommended to perform rt and mz alignment prior to peak alignment.
[8]:
help(emzed.align.align_peaks)
Help on function align_peaks in module emzed.align.sample_alignment:
align_peaks(tables, mz_tol, rt_tol)
Assigns global peak ids over samples sucht that peaks which differ by
given tolerances are are assigned to the sampe global peak id.
Tables are modified in-place by adding a new columns ``global_peak_id``.
:param tables: list or tuple of peak tables with columns "id", "mz" and "rt" at
least.
:param mz_tol: mz tolerance in Da.
:param rt_tol: rt tolerance in seconds.
:returns: None
A new id global_peak_id is assigned to all tables in place. Required table column names are id, mz, rt
. Depending on observed retention time and m/z shift of the data set, retention time and mass alignment should be performed prior to peak aligning. As an example we apply align_peaks
to retetion time aligned samples tables_lowess:
[9]:
emzed.align.align_peaks(tables_lowess, mz_tol=0.001, rt_tol=2.0)
t_all = emzed.Table.stack_tables(tables_lowess).sort_by(
"global_peak_id"
)
t_all[:10]
number of total peaks : 1600
number of global peaks ids: 1583
[9]:
id | global_peak_id | feature_id | feature_size | mz | mzmin | mzmax | rt | rtmin | rtmax | intensity | quality | fwhm | z | source |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
int | int | int | int | MzType | MzType | MzType | RtType | RtType | RtType | float | float | RtType | int | str |
10 | 0 | 9 | 1 | 104.070310 | 104.070198 | 104.070457 | 2.65 m | 2.60 m | 2.73 m | 5.82e+05 | 1.31e-04 | 0.08 m | 0 | AA_sample_caulobacter.mzML |
1 | 0 | 1 | 1 | 104.070530 | 104.070404 | 104.070633 | 2.65 m | 2.55 m | 2.75 m | 2.93e+06 | 3.62e-03 | 0.10 m | 0 | AA_sample_arabidopsis.mzML |
11 | 1 | 10 | 1 | 104.106672 | 104.106583 | 104.106750 | 1.41 m | 1.36 m | 1.45 m | 2.72e+05 | 6.10e-05 | 0.05 m | 0 | AA_sample_caulobacter.mzML |
2 | 1 | 2 | 2 | 104.106882 | 104.106766 | 104.107101 | 1.42 m | 1.38 m | 1.94 m | 1.89e+07 | 2.40e-02 | 0.04 m | 1 | AA_sample_arabidopsis.mzML |
5 | 2 | 5 | 1 | 92.663911 | 92.663506 | 92.664627 | 0.89 m | 0.86 m | 1.01 m | 2.98e+05 | 6.69e-05 | 0.05 m | 0 | AA_sample_caulobacter.mzML |
7 | 3 | 7 | 1 | 99.091387 | 99.091187 | 99.091454 | 1.14 m | 1.11 m | 1.22 m | 1.18e+06 | 2.65e-04 | 0.04 m | 0 | AA_sample_caulobacter.mzML |
14 | 4 | 13 | 1 | 105.042087 | 105.041962 | 105.042160 | 3.87 m | 3.86 m | 3.89 m | 4.95e+04 | 1.11e-05 | 0.03 m | 0 | AA_sample_caulobacter.mzML |
8 | 5 | 8 | 2 | 102.127362 | 102.127266 | 102.127548 | 0.93 m | 0.90 m | 1.39 m | 6.95e+07 | 1.62e-02 | 0.04 m | 1 | AA_sample_caulobacter.mzML |
13 | 6 | 12 | 1 | 105.042006 | 105.041916 | 105.042137 | 3.56 m | 3.43 m | 3.72 m | 1.27e+06 | 2.84e-04 | 0.15 m | 0 | AA_sample_caulobacter.mzML |
4 | 7 | 3 | 1 | 104.106940 | 104.106728 | 104.107063 | 2.09 m | 1.95 m | 3.02 m | 5.69e+05 | 7.04e-04 | 0.20 m | 0 | AA_sample_arabidopsis.mzML |
© Copyright 2012-2024 ETH Zurich
Last build 2024-03-25 10:41:42.995953.
Created using Sphinx 7.2.6.