emzed.chemistry

formula_table(min_mass, max_mass, *, mass_c=None, mass_h=None, mass_n=None, mass_o=None, mass_p=None, mass_s=None, c_range=None, h_range=None, n_range=None, o_range=None, p_range=None, s_range=None, apply_rules=True, apply_rule_1=True, apply_rule_2=True, apply_rule_4=True, apply_rule_5=True, apply_rule_6=True, rule_45_range='extended')

This is a Python version of the HR2 formula generator for CHNOPS; see https://fiehnlab.ucdavis.edu/projects/seven-golden-rules

This function generates a table of molecular formulas consisting of the elements C, H, N, O, P and S with a mass in the range [min_mass, max_mass]. For each element one can provide either a fixed count or an inclusive range of atom counts to consider.

Restricting atom counts, e.g. c_range=(0, 100), can speed up the process considerably.
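
Why range restrictions matter can be seen in a naive enumeration, where the number of candidates is the product of the sizes of all atom-count ranges. The following is a minimal sketch, not emzed's implementation: the names MONO_MASS and enumerate_formulas are hypothetical, the masses are rounded monoisotopic values, and none of the seven golden rules are applied.

```python
from itertools import product

# Rounded monoisotopic masses in u (illustrative values, not read from emzed).
MONO_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "P": 30.97376163,
    "S": 31.97207100,
}


def enumerate_formulas(min_mass, max_mass, ranges):
    """Yield (counts, mass) for every atom-count combination in the mass window.

    ranges maps an element symbol to an inclusive (low, high) tuple.
    """
    symbols = list(ranges)
    spans = [range(low, high + 1) for low, high in ranges.values()]
    for counts in product(*spans):
        mass = sum(n * MONO_MASS[s] for s, n in zip(symbols, counts))
        if min_mass <= mass <= max_mass:
            yield dict(zip(symbols, counts)), mass


# Tight ranges around glucose (C6H12O6, about 180.0634 u): only 45 candidates.
narrow = {
    "C": (5, 7), "H": (10, 14), "N": (0, 0),
    "O": (5, 7), "P": (0, 0), "S": (0, 0),
}
hits = list(enumerate_formulas(180.06, 180.07, narrow))
```

Widening every range to (0, 100) would raise the candidate count of this brute-force loop to 101 to the power of 6, which is why tight ranges pay off.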

MolecularFormula

Represent a molecular formula as both string and element-count mapping.

as_dict()

Return the molecular formula as a plain dict mapping atoms to counts.

as_string()

Return the normalized molecular-formula string or None if invalid.

mass(**specialisations)

Calculate the exact mass of the formula.

Parameters:

Name Type Description Default
specialisations

optional isotope overrides such as C=12.0 or C=mass.C12 for unresolved elements.

{}

Returns:

Type Description

exact mass as float or None if an element/isotope is unknown.
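
The relationship between the element-count mapping and the mass calculation can be sketched as follows. This is an illustration under stated assumptions, not the MolecularFormula implementation: parse_formula and exact_mass are hypothetical helpers, only flat formulas such as "C6H12O6" are handled, and the keyword overrides simply replace the default mass of an element.

```python
import re

# Rounded monoisotopic masses in u (illustrative values, not read from emzed).
MONO_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def parse_formula(mf):
    """Parse a flat formula such as 'C6H12O6' into {symbol: count}."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", mf):
        raise ValueError(f"cannot parse formula {mf!r}")
    counts = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", mf):
        # A missing count means one atom; repeated symbols are summed.
        counts[symbol] = counts.get(symbol, 0) + int(digits or 1)
    return counts


def exact_mass(mf, **overrides):
    """Return the mass of mf, or None if an element has no known mass."""
    masses = {**MONO_MASS, **overrides}
    total = 0.0
    for symbol, count in parse_formula(mf).items():
        if symbol not in masses:
            return None
        total += count * masses[symbol]
    return total


glucose = exact_mass("C6H12O6")
# Override the carbon mass, here with the rounded mass of carbon-13.
labelled = exact_mass("C2", C=13.00335484)
```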

compute_centroids(mf, explained_abundance, *, abundances=None)

Computes a table with the theoretical MS peaks of a molecular formula.

Usage examples:

compute_centroids("C6S2", 0.995)
compute_centroids("C6S2", 0.995, abundances=dict(C={12: 0.5, 13: 0.5}))

Parameters:

Name Type Description Default
mf

molecular sum formula.

required
explained_abundance

stopping criterion, a value between 0 and 1.

required
abundances

override natural abundances.

None

Returns:

Type Description

table with columns id, mf, m0, abundance.
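
The role of explained_abundance as a stopping criterion can be illustrated with a small isotope-pattern calculation. The sketch below is an assumption about the general technique (repeated convolution of per-atom isotope distributions), not a description of emzed's algorithm. The name isotope_pattern, the ISOTOPES layout and the rounded isotope data are all hypothetical; note that emzed's abundances argument is keyed by mass number, unlike the (mass, abundance) pairs used here.

```python
from collections import defaultdict

# (mass in u, natural abundance) per isotope; rounded, illustrative values.
ISOTOPES = {
    "C": [(12.0, 0.9893), (13.00335484, 0.0107)],
    "S": [
        (31.97207100, 0.9499),
        (32.97145876, 0.0075),
        (33.96786690, 0.0425),
        (35.96708076, 0.0001),
    ],
}


def isotope_pattern(counts, explained_abundance, isotopes=ISOTOPES, digits=6):
    """Return [(mass, abundance)] ordered by decreasing abundance.

    Peaks are collected until their summed abundance reaches
    explained_abundance (or until no peaks are left).
    """
    peaks = {0.0: 1.0}
    for symbol, n in counts.items():
        for _ in range(n):
            # Convolve the current pattern with one more atom of this element.
            merged = defaultdict(float)
            for mass, prob in peaks.items():
                for iso_mass, iso_prob in isotopes[symbol]:
                    merged[round(mass + iso_mass, digits)] += prob * iso_prob
            peaks = merged
    ranked = sorted(peaks.items(), key=lambda item: -item[1])
    result, total = [], 0.0
    for mass, prob in ranked:
        result.append((mass, prob))
        total += prob
        if total >= explained_abundance:
            break
    return result


pattern = isotope_pattern({"C": 6, "S": 2}, 0.995)
# An artificial 50/50 carbon mix, analogous to overriding natural abundances.
half = isotope_pattern({"C": 2}, 0.7, isotopes={"C": [(12.0, 0.5), (13.0, 0.5)]})
```

A higher explained_abundance keeps more low-intensity peaks, so the result grows as the value approaches 1.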

measured_centroids(mf, R, explained_abundance, *, abundances=None)

Computes a table with the theoretical MS peaks of a molecular formula as they would be measured at resolution R.

Usage examples:

measured_centroids("C6S2", 200_000, 0.995)
measured_centroids("C6S2", 200_000, 0.995, abundances=dict(C={12: 0.5, 13: 0.5}))

Parameters:

Name Type Description Default
mf

molecular sum formula.

required
R

resolution, defined as m/z / FWHM.

required
explained_abundance

stopping criterion, a value between 0 and 1.

required
abundances

override natural abundances.

None

Returns:

Type Description

table with columns id, R, mf, m0, abundance.
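
To get a feeling for the effect of R, assume the usual definition R = (m/z) / FWHM, so that the peak width at a given position is (m/z) / R. The sketch below uses hypothetical helpers (fwhm_at, is_resolved) and a rough rule of thumb, namely that two peaks count as resolved when they are more than one FWHM apart. How emzed actually merges unresolved peaks is not specified here.

```python
def fwhm_at(mz, resolution):
    """Full width at half maximum of a peak at position mz for resolution R."""
    return mz / resolution


def is_resolved(mz_a, mz_b, resolution):
    """Rule of thumb: peaks are resolved if they are more than one FWHM apart."""
    center = (mz_a + mz_b) / 2
    return abs(mz_a - mz_b) > fwhm_at(center, resolution)


# Monoisotopic mass of C6S2 (rounded) and two isotopologue peaks one
# nominal mass unit higher: one carbon-13 versus one sulfur-33.
m0 = 135.944142
peak_13c = m0 + 1.00335484
peak_33s = m0 + 0.99938776

resolved_high = is_resolved(peak_13c, peak_33s, 200_000)
resolved_low = is_resolved(peak_13c, peak_33s, 20_000)
```

The two peaks are about 0.004 apart. At R = 200_000 the FWHM near m/z 137 is roughly 0.0007, so they are separated; at R = 20_000 it is roughly 0.007, so under this criterion they would appear as one measured peak.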

plot_profile(mf, R, explained_abundance, *, path=None, abundances=None)

Plots the theoretical MS peaks of a molecular formula.

Usage examples:

plot_profile("C6S2", 200_000, 0.995)
plot_profile("C6S2", 200_000, 0.995, path="profile.png")
plot_profile("C6S2", 200_000, 0.995, abundances=dict(C={12: 0.5, 13: 0.5}))

Parameters:

Name Type Description Default
mf

molecular sum formula.

required
R

resolution, defined as m/z / FWHM.

required
explained_abundance

stopping criterion, a value between 0 and 1.

required
path

path to save plot to. If not provided a plot window will pop up.

None
abundances

override natural abundances.

None

Lazy access to natural isotope abundances.

The module forwards isotope abundances such as C12 and element abundance maps such as C via __getattr__ without loading the full element table up front.

__dir__()

forward attributes for autocompletion

Predefined adduct tables and convenience subsets for targeted annotation.

The module exposes:

  • all: every predefined adduct
  • charge-based subsets such as positive and negative
  • single-adduct tables addressable as Python identifiers such as M_plus_H

Convenience access to exact masses and common particle masses.

The module exposes a small set of particle masses directly (e, p, n) and lazily forwards element and isotope masses via __getattr__ so that expressions such as emzed.mass.C12 work without preloading the full elements table.

__dir__()

forward attributes for autocompletion

of(mf, **specialisation)

Calculate the exact mass for a molecular formula.

Parameters:

Name Type Description Default
mf

molecular formula string.

required
specialisation

optional isotope specialisations forwarded to emzed.chemistry.MolecularFormula.mass.

{}

Returns:

Type Description

exact mass as float.

DelayedElementsTable

Lazy proxy for the elements table to avoid loading it at import time.

create_abundance_dict(symbols, abundances)

Create the lazy-access abundance dictionary exported by abundance.

create_elements_table(symbols, atomic_numbers, abundances, atomic_masses, average_masses)

Create the tabular element/isotope representation used by elements.

create_mass_dict(symbols, atomic_masses, average_masses)

Create the lazy-access mass dictionary exported by emzed.chemistry.mass.

fix_pubchem_entries(name, isotopes, masses, abundances)

Normalize a few known inconsistencies in the bundled PubChem export.

load_elements()

Load element masses and abundances from bundled OpenMS and PubChem data.

Returns:

Type Description

tuple (table, mass_dict, abundance_dict, masses).

load_elements_pubchem()

Load bundled element/isotope data derived from PubChem JSON resources.

parse_float(txt)

Parse a number from PubChem text, which can be something like "1.23 45 (23)" or '[1.2211, 1.2212]'.
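
One plausible way to handle both input shapes is sketched below. The interpretation is an assumption, not documented behaviour: spaces are treated as digit-group separators, a trailing parenthesised uncertainty is dropped, and a bracketed interval is replaced by its midpoint. The name parse_float_sketch is hypothetical.

```python
def parse_float_sketch(txt):
    """Convert PubChem-style numeric text to a float (assumed semantics)."""
    txt = txt.strip()
    if txt.startswith("[") and txt.endswith("]"):
        # Interval such as '[1.2211, 1.2212]': use the midpoint.
        low, high = (float(part) for part in txt[1:-1].split(","))
        return (low + high) / 2
    # Value such as '1.23 45 (23)': drop the uncertainty, join digit groups.
    number = txt.split("(", 1)[0]
    return float(number.replace(" ", ""))


value = parse_float_sketch("1.23 45 (23)")
midpoint = parse_float_sketch("[1.2211, 1.2212]")
```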

Helpers for downloading and assembling PubChem-derived compound tables.

Downloader

Parallel downloader for PubChem summary data filtered to emzed's sources.

download(limit=None, result_path=None)

Download PubChem summary JSON data and compress it into one archive.

Parameters:

Name Type Description Default
limit

optional maximum number of compound identifiers to download.

None
result_path

optional output path for the compressed JSON archive.

None

Returns:

Type Description

path to the generated .gz file.

PubChemAccessor

Thin wrapper around the PubChem E-Utilities search and summary endpoints.

get_count(search_term=None)

Count the number of compounds for a given search term.

Parameters:

Name Type Description Default
search_term

If no search term is provided, the search covers compounds that come from KEGG, HMDB or Biocyc and carry no charge. For more complicated searches, such as restricting the search term to certain fields, the term can be constructed manually using the search form at https://www.ncbi.nlm.nih.gov/pccompound/advanced

None

Returns:

Type Description

integer number

get_identifiers(start=0, end=1000000, search_term=None, source=None)

Get compound identifiers for a given search term. These can be used later to retrieve details of the compounds.

Parameters:

Name Type Description Default
start

fetch results starting at the given index.

0
end

fetch results up to the given index.

1000000
search_term

If no search term is provided, the search covers compounds that come from KEGG, HMDB or Biocyc and carry no charge. For more complicated searches, such as restricting the search term to certain fields, the term can be constructed manually using the search form at https://www.ncbi.nlm.nih.gov/pccompound/advanced

None
source

if the user does not provide a search term, one can restrict fetching identifiers to a specified source, such as 'HMDB' only.

None

Returns:

Type Description

list of strings

get_summary_data(ids)

Fetches data for given compound ids.

Parameters:

Name Type Description Default
ids

list of compound ids. You can use get_identifiers to determine identifiers by searching for terms and metadata first.

required

Returns:

Type Description

dictionary mapping each id to a dictionary with keys 'cid', 'molecularweight', 'molecularformula', 'iupacname', 'inchi', 'inchikey', 'canonicalsmiles' and 'synonymlist'.

register_pubchem_api_key(email_address, api_key, *, overwrite=False)

The API key is required if you want to download larger amounts of data or if you make more than 3 requests per second.

  1. You have to create a user account at https://www.ncbi.nlm.nih.gov

  2. To create the key, go to the “Settings” page of your NCBI account. (Hint: after signing in, simply click on your NCBI username in the upper right corner of any NCBI page.)

  3. You’ll see a new “API Key Management” area. Click the “Create an API Key” button, and copy the resulting key.

Parameters:

Name Type Description Default
email_address

your valid email address.

required
api_key

valid API key

required
overwrite

overwrite existing data.

False

assemble_table(gz_file, path=None)

Convert a downloaded PubChem JSON archive into an emzed.Table.

Parameters:

Name Type Description Default
gz_file

path to the compressed JSON archive produced by Downloader.

required
path

optional output path for the resulting table database.

None

Returns:

Type Description

emzed.Table.

fast_m0(mf)

Estimate the monoisotopic mass of a formula string using bundled element data.

Parameters:

Name Type Description Default
mf

molecular formula string.

required

Returns:

Type Description

monoisotopic mass as float or None if an element is unknown.

retry(n)

Retry a function up to n times with a short delay between attempts.
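
A retry helper of this kind is usually written as a decorator factory. The sketch below is a generic version under stated assumptions and not emzed's code: the name retry_sketch, the delay parameter and the choice to re-raise the last exception after the final attempt are all illustrative.

```python
import functools
import time


def retry_sketch(n, delay=0.1):
    """Decorator factory: call the wrapped function up to n times."""
    if n < 1:
        raise ValueError("n must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(n):
                try:
                    return func(*args, **kwargs)
                except Exception as error:
                    last_error = error
                    if attempt < n - 1:
                        time.sleep(delay)  # short pause before the next try
            raise last_error

        return wrapper

    return decorator


calls = []


@retry_sketch(3, delay=0)
def flaky():
    """Fails on the first two calls, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = flaky()
```

A variant that returns None instead of raising after the last attempt would be an equally reasonable design for callers that prefer to check the result.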

Utility script for regenerating the bundled PubChem element isotope data.

fetch(number)

Extract isotope masses and abundances for one element from PubChem.

Parameters:

Name Type Description Default
number

atomic number to request from PubChem.

required

Returns:

Type Description

tuple (number, name, isotopes, masses, abundances).

get(number)

Fetch the raw PubChem element record for one atomic number.

Parameters:

Name Type Description Default
number

atomic number to request from PubChem.

required

Returns:

Type Description

parsed JSON data or None if repeated requests fail.

main()

Download bundled element isotope data for the supported atomic-number range.