emzed.chemistry¶

`formula_table(min_mass, max_mass, *, mass_c=None, mass_h=None, mass_n=None, mass_o=None, mass_p=None, mass_s=None, c_range=None, h_range=None, n_range=None, o_range=None, p_range=None, s_range=None, apply_rules=True, apply_rule_1=True, apply_rule_2=True, apply_rule_4=True, apply_rule_5=True, apply_rule_6=True, rule_45_range='extended')` ¶

This is a Python version of the HR2 formula generator for CHNOPS, see https://fiehnlab.ucdavis.edu/projects/seven-golden-rules

This function generates a table of molecular formulas consisting of the elements C, H, N, O, P and S whose mass lies in the range [min_mass, max_mass]. For each element one can provide a given count or an inclusive range of atom counts to consider.

Restricting atom counts, e.g. `c_range=(0, 100)`, can speed up the process tremendously.
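Why tighter atom-count ranges help can be seen in a brute-force sketch. The code below is not emzed's implementation and applies none of the seven golden rules; the names `MONO`, `enumerate_formulas` and `search_space_size` are hypothetical, and the masses are rounded monoisotopic values.

```python
from itertools import product

# Rounded monoisotopic masses in u (illustrative values, not emzed's tables).
MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074,
        "O": 15.994915, "P": 30.973762, "S": 31.972071}


def enumerate_formulas(min_mass, max_mass, ranges):
    """Yield (formula, mass) for every atom-count combination in the mass window.

    `ranges` maps each element symbol to an inclusive (low, high) tuple.
    """
    symbols = list(MONO)
    spans = [range(ranges[s][0], ranges[s][1] + 1) for s in symbols]
    for counts in product(*spans):
        mass = sum(n * MONO[s] for s, n in zip(symbols, counts))
        if min_mass <= mass <= max_mass:
            formula = "".join(
                f"{s}{n if n > 1 else ''}" for s, n in zip(symbols, counts) if n
            )
            yield formula, mass


def search_space_size(ranges):
    """Number of atom-count combinations a brute-force search has to visit."""
    size = 1
    for low, high in ranges.values():
        size *= high - low + 1
    return size


narrow = dict(C=(0, 3), H=(0, 8), N=(0, 1), O=(0, 2), P=(0, 0), S=(0, 0))
hits = dict(enumerate_formulas(46.0, 46.1, narrow))
print(sorted(hits))
print(search_space_size(narrow))
```

The number of candidates is the product of the range widths, so halving the range of a single element halves the work of a naive search.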

`MolecularFormula` ¶

Represent a molecular formula as both string and element-count mapping.

`as_dict()` ¶

Return the molecular formula as a plain dict mapping atoms to counts.

`as_string()` ¶

Return the normalized molecular-formula string or None if invalid.

`mass(**specialisations)` ¶

Calculate the exact mass of the formula.

Parameters:

Name	Type	Description	Default
`specialisations`		optional isotope overrides such as `C=12.0` or `C=mass.C12` for unresolved elements.	`{}`

Returns:

Type	Description
	exact mass as `float` or `None` if an element/isotope is unknown.
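The idea behind per-element overrides can be sketched in plain Python. This is an assumption about the general approach, not emzed's code: `parse_formula` and `exact_mass` are hypothetical helpers, the mass table is a rounded excerpt, and the parser handles only flat formulas such as `C6H12O6`.

```python
import re

# Rounded monoisotopic masses in u (illustrative excerpt, not emzed's data).
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
C13 = 13.00335484  # mass of the carbon-13 isotope


def parse_formula(mf):
    """Split a flat formula string into an element -> count mapping."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", mf):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def exact_mass(mf, **specialisations):
    """Sum atom masses; keyword arguments replace the default mass per element."""
    total = 0.0
    for symbol, count in parse_formula(mf).items():
        atom_mass = specialisations.get(symbol, MONO.get(symbol))
        if atom_mass is None:
            return None  # unknown element and no override given
        total += count * atom_mass
    return total


print(exact_mass("CO2"))
print(exact_mass("CO2", C=C13))
print(exact_mass("Xx2"))
```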

`compute_centroids(mf, explained_abundance, *, abundances=None)` ¶

Computes a table with the theoretical MS peaks of a molecular formula.

Usage examples:

compute_centroids("C6S2", 0.995)
compute_centroids("C6S2", 0.995, abundances=dict(C={12: 0.5, 13: 0.5}))

Parameters:

Name	Description	Default
`mf`	molecular sum formula.	required
`explained_abundance`	stopping criterion, value is between 0 and 1.	required
`abundances`	override natural abundances.	`None`

Returns:

Type	Description
	table with columns `id`, `mf`, `m0`, `abundance`.
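One way to read `explained_abundance` is as a cumulative cut-off: peaks are collected, most abundant first, until their summed abundance reaches the threshold. The sketch below illustrates that reading on nominal mass numbers only; it is an assumption, not emzed's algorithm, and `convolve` and `nominal_pattern` are hypothetical names.

```python
from collections import defaultdict


def convolve(dist_a, dist_b):
    """Combine two {mass_number: probability} distributions."""
    out = defaultdict(float)
    for mass_a, p_a in dist_a.items():
        for mass_b, p_b in dist_b.items():
            out[mass_a + mass_b] += p_a * p_b
    return dict(out)


def nominal_pattern(counts, abundances, explained_abundance):
    """Return (mass_number, abundance) peaks covering `explained_abundance`."""
    dist = {0: 1.0}
    for symbol, n in counts.items():
        for _ in range(n):  # add one atom at a time
            dist = convolve(dist, abundances[symbol])
    # Keep the most abundant peaks until the threshold is reached.
    kept, covered = [], 0.0
    for mass_number, p in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= explained_abundance:
            break
        kept.append((mass_number, p))
        covered += p
    return sorted(kept)


# Two carbon atoms with the artificial 50/50 abundances from the usage example.
print(nominal_pattern({"C": 2}, {"C": {12: 0.5, 13: 0.5}}, 0.7))
```

Working on nominal mass numbers lumps together isotopologues that share a mass number, so this sketch is coarser than a table with exact `m0` values.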

`measured_centroids(mf, R, explained_abundance, *, abundances=None)` ¶

Computes a table with the theoretical measured MS peaks of a molecular formula.

Usage examples:

measured_centroids("C6S2", 200_000, 0.995)
measured_centroids("C6S2", 200_000, 0.995, abundances=dict(C={12: 0.5, 13: 0.5}))

Parameters:

Name	Description	Default
`mf`	molecular sum formula.	required
`R`	resolution, defined as m/z / FWHM.	required
`explained_abundance`	stopping criterion, value is between 0 and 1.	required
`abundances`	override natural abundances.	`None`

Returns:

Type	Description
	table with columns `id`, `R`, `mf`, `m0`, `abundance`.
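With resolution defined as m/z divided by FWHM, the peak width at a given m/z is m/z divided by R, and peaks closer together than that width cannot be told apart. The sketch below merges such peaks into an abundance-weighted centroid. It is a simplified illustration under that assumption, not emzed's peak model; `fwhm` and `merge_unresolved` are hypothetical names.

```python
def fwhm(mz, resolution):
    """Peak width (full width at half maximum) implied by R = m/z / FWHM."""
    return mz / resolution


def merge_unresolved(peaks, resolution):
    """Merge neighbouring (mz, abundance) peaks that lie closer than one FWHM."""
    merged = []
    for mz, abundance in sorted(peaks):
        if merged and mz - merged[-1][0] < fwhm(mz, resolution):
            prev_mz, prev_abundance = merged[-1]
            total = prev_abundance + abundance
            # The merged peak sits at the abundance-weighted mean position.
            merged[-1] = ((prev_mz * prev_abundance + mz * abundance) / total, total)
        else:
            merged.append((mz, abundance))
    return merged


peaks = [(100.000, 0.5), (100.001, 0.5), (101.000, 0.1)]
print(fwhm(400.0, 200_000))
print(merge_unresolved(peaks, 10_000))     # first two peaks merge
print(merge_unresolved(peaks, 1_000_000))  # all three stay separate
```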

`plot_profile(mf, R, explained_abundance, *, path=None, abundances=None)` ¶

Plots the theoretical MS peaks of a molecular formula.

Usage examples:

plot_profile("C6S2", 200_000, 0.995)
plot_profile("C6S2", 200_000, 0.995, path="profile.png")
plot_profile("C6S2", 200_000, 0.995, abundances=dict(C={12: 0.5, 13: 0.5}))

Parameters:

Name	Description	Default
`mf`	molecular sum formula.	required
`R`	resolution, defined as m/z / FWHM.	required
`explained_abundance`	stopping criterion, value is between 0 and 1.	required
`path`	path to save plot to. If not provided a plot window will pop up.	`None`
`abundances`	override natural abundances.	`None`

Natural isotope abundances.

Provides isotope abundance values such as C12 and element abundance maps such as C.

`__dir__()` ¶

Forward attributes for autocompletion.

Predefined adduct tables and convenience subsets for targeted annotation.

The module exposes:

all: every predefined adduct
charge-based subsets such as positive and negative
single-adduct tables addressable as Python identifiers such as M_plus_H

Exact-mass utilities for formulas, particles, and isotopes.

`__dir__()` ¶

Lazily forward attributes for autocompletion.

`of(mf, **specialisation)` ¶

Calculate the exact mass for a molecular formula.

Parameters:

Name	Type	Description	Default
`mf`		molecular formula string.	required
`specialisation`		optional isotope specialisations forwarded to `emzed.chemistry.MolecularFormula.mass`.	`{}`

Returns:

Type	Description
	exact mass as `float`.

`DelayedElementsTable` ¶

Lazy proxy for the elements table to avoid loading it at import time.
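The lazy-proxy pattern itself is small. The following is a generic sketch, not the `DelayedElementsTable` source: `LazyProxy` and `load_table` are hypothetical, and a plain dict stands in for the elements table.

```python
class LazyProxy:
    """Defer an expensive load until the first attribute access."""

    def __init__(self, loader):
        self._loader = loader
        self._target = None
        self._loaded = False

    def _resolve(self):
        if not self._loaded:
            self._target = self._loader()
            self._loaded = True
        return self._target

    def __getattr__(self, name):
        # Only called when normal lookup fails; private names are not forwarded
        # to avoid recursion on partially initialised objects.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._resolve(), name)

    def __dir__(self):
        # Expose the target's attributes so autocompletion works.
        return dir(self._resolve())


calls = []


def load_table():
    calls.append("loaded")
    return {"C": 12.0, "H": 1.00782503}


proxy = LazyProxy(load_table)
print(calls)            # nothing loaded yet
print(proxy.get("C"))   # first access triggers the load
print(calls)
```

Implicit special-method lookups such as `proxy["C"]` bypass `__getattr__`, so a full proxy would also define the dunder methods it needs.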

`create_abundance_dict(symbols, abundances)` ¶

Create the lazy-access abundance dictionary exported by abundance.

`create_elements_table(symbols, atomic_numbers, abundances, atomic_masses, average_masses)` ¶

Create the tabular element/isotope representation used by elements.

`create_mass_dict(symbols, atomic_masses, average_masses)` ¶

Create the lazy-access mass dictionary exported by `emzed.chemistry.mass`.

`fix_pubchem_entries(name, isotopes, masses, abundances)` ¶

Normalize a few known inconsistencies in the bundled PubChem export.

`load_elements()` ¶

Load element masses and abundances from bundled OpenMS and PubChem data.

Returns:

Type	Description
	tuple `(table, mass_dict, abundance_dict, masses)`.

`load_elements_pubchem()` ¶

Load bundled element/isotope data derived from PubChem JSON resources.

`parse_float(txt)` ¶

Parse a float from PubChem text, which can look like "1.23 45 (23)" or '[1.2211, 1.2212]'.
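A possible way to handle the two input shapes is sketched below. How emzed resolves them is not stated here, so the choices are assumptions: digit groups are joined, a trailing uncertainty in parentheses is dropped, and an interval is reduced to its midpoint. `parse_pubchem_number` is a hypothetical name.

```python
import re


def parse_pubchem_number(txt):
    """Turn strings like "1.23 45 (23)" or "[1.2211, 1.2212]" into a float."""
    txt = txt.strip()
    if txt.startswith("[") and txt.endswith("]"):
        # Assumption: represent an interval by its midpoint.
        low, high = (float(part) for part in txt[1:-1].split(","))
        return (low + high) / 2.0
    txt = re.sub(r"\(.*?\)", "", txt)    # drop the uncertainty, e.g. "(23)"
    return float(txt.replace(" ", ""))   # join grouped digits


print(parse_pubchem_number("1.23 45 (23)"))
print(parse_pubchem_number("[1.2211, 1.2212]"))
```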

Helpers for downloading and assembling PubChem-derived compound tables.

`Downloader` ¶

Parallel downloader for PubChem summary data filtered to emzed's sources.

`download(limit=None, result_path=None)` ¶

Download PubChem summary JSON data and compress it into one archive.

Parameters:

Name	Type	Description	Default
`limit`		optional maximum number of compound identifiers to download.	`None`
`result_path`		optional output path for the compressed JSON archive.	`None`

Returns:

Type	Description
	path to the generated `.gz` file.

`PubChemAccessor` ¶

Thin wrapper around the PubChem E-Utilities search and summary endpoints.

`get_count(search_term=None)` ¶

Count the number of compounds for a given search term.

Parameters:

Name	Type	Description	Default
`search_term`		If no search term is provided, the search covers compounds that come from KEGG, HMDB or Biocyc and carry no charge. For more complicated searches, such as restricting the search term to certain fields, the term can be constructed manually using the search form at https://www.ncbi.nlm.nih.gov/pccompound/advanced	`None`

Returns:

Type	Description
	integer number

`get_identifiers(start=0, end=1000000, search_term=None, source=None)` ¶

Get compound identifiers for a given search term. These can be used later to retrieve details of compounds.

Parameters:

Name	Description	Default
`start`	fetch results starting at the given index.	`0`
`end`	fetch results up to the given index.	`1000000`
`search_term`	If no search term is provided, the search covers compounds that come from KEGG, HMDB or Biocyc and carry no charge. For more complicated searches, such as restricting the search term to certain fields, the term can be constructed manually using the search form at https://www.ncbi.nlm.nih.gov/pccompound/advanced	`None`
`source`	if the user does not provide a search term, one can restrict fetching identifiers to a specified source, such as 'HMDB' only.	`None`

Returns:

Type	Description
	list of strings
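Paging through a large result set with `start` and `end` maps naturally onto the `retstart` and `retmax` parameters of the NCBI E-Utilities `esearch` endpoint. The sketch below only builds the request URLs and performs no network access. Whether emzed pages this way is an assumption, `esearch_url` and `page_urls` are hypothetical helpers, and `end` is treated as exclusive here.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, start, page_size, api_key=None):
    """Build one esearch request for the PubChem Compound database."""
    params = {
        "db": "pccompound",
        "term": term,
        "retstart": start,     # index of the first hit to return
        "retmax": page_size,   # number of hits to return
        "retmode": "json",
    }
    if api_key is not None:
        params["api_key"] = api_key
    return f"{ESEARCH}?{urlencode(params)}"


def page_urls(term, start=0, end=1000, page_size=500):
    """Split the index range [start, end) into page-sized requests."""
    return [
        esearch_url(term, offset, min(page_size, end - offset))
        for offset in range(start, end, page_size)
    ]


for url in page_urls("glucose", start=0, end=1200, page_size=500):
    query = parse_qs(urlparse(url).query)
    print(query["retstart"][0], query["retmax"][0])
```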

`get_summary_data(ids)` ¶

Fetches data for given compound ids.

Parameters:

Name	Type	Description	Default
`ids`		list of compound ids. You can use `get_identifiers` to determine identifiers by searching for terms and metadata first.	required

Returns:

Type	Description
	dictionary mapping each id to a dictionary with keys 'cid', 'molecularweight', 'molecularformula', 'iupacname', 'inchi', 'inchikey', 'canonicalsmiles' and 'synonymlist'.

`register_pubchem_api_key(email_address, api_key, *, overwrite=False)` ¶

The API key is required if you want to download larger amounts of data or if you make more than 3 requests per second.

You have to create a user account at https://www.ncbi.nlm.nih.gov
To create the key, go to the “Settings” page of your NCBI account. (Hint: after signing in, simply click on your NCBI username in the upper right corner of any NCBI page.)
You’ll see a new “API Key Management” area. Click the “Create an API Key” button, and copy the resulting key.

Parameters:

Name	Description	Default
`email_address`	your valid email address.	required
`api_key`	valid API key	required
`overwrite`	overwrite existing data.	`False`

`assemble_table(gz_file, path=None)` ¶

Convert a downloaded PubChem JSON archive into an emzed.Table.

Parameters:

Name	Type	Description	Default
`gz_file`		path to the compressed JSON archive produced by `Downloader`.	required
`path`		optional output path for the resulting table database.	`None`

Returns:

Type	Description
	`emzed.Table`.

`fast_m0(mf)` ¶

Estimate the monoisotopic mass of a formula string using bundled element data.

Parameters:

Name	Type	Description	Default
`mf`		molecular formula string.	required

Returns:

Type	Description
	monoisotopic mass as `float` or `None` if an element is unknown.

`retry(n)` ¶

Retry a function up to n times with a short delay between attempts.
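A decorator of this kind can be sketched as follows. The actual delay, the exceptions caught and the behaviour after the last failure are not documented here, so this version assumes a configurable `delay` and returns `None` once all attempts fail; `retry_up_to` is a hypothetical name chosen to avoid confusion with the real `retry`.

```python
import functools
import time


def retry_up_to(n, delay=0.1):
    """Decorator factory: call the wrapped function at most n times."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(n):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt < n - 1:
                        time.sleep(delay)  # short pause before the next attempt
            return None  # assumption: signal failure with None

        return wrapper

    return decorator


attempts = []


@retry_up_to(3, delay=0)
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError("temporary failure")
    return "ok"


print(flaky())
print(len(attempts))
```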

Utility script for regenerating the bundled PubChem element isotope data.

`fetch(number)` ¶

Extract isotope masses and abundances for one element from PubChem.

Parameters:

Name	Type	Description	Default
`number`		atomic number to request from PubChem.	required

Returns:

Type	Description
	tuple `(number, name, isotopes, masses, abundances)`.

`get(number)` ¶

Fetch the raw PubChem element record for one atomic number.

Parameters:

Name	Type	Description	Default
`number`		atomic number to request from PubChem.	required

Returns:

Type	Description
	parsed JSON data or `None` if repeated requests fail.

`main()` ¶

Download bundled element isotope data for the supported atomic-number range.
