OnToma documentation¶
Introduction¶
OnToma is a Python module which maps the disease or phenotype terms to EFO, the ontology used in the Open Targets platform.
Note
More precisely, only a subset of EFO is used in the Open Targets platform, called EFO slim. It is released alongside with every new version of EFO.
OnToma supports two kinds of inputs: either identifiers from other ontologies (e.g. OMIM:102900
), or strings (e.g. pyruvate kinase hyperactivity
).
The way OnToma operates is by trying a series of lookups in EFO and also querying tools from the EBI ontology stack such as OxO and ZOOMA.
For each input, it will return anywhere from zero to multiple matches in EFO. This version of OnToma only returns mappings which are considered of high quality.
For each input type, you can run OnToma from command line or directly from other Python code.
Installing OnToma¶
pip install --upgrade ontoma
Running OnToma from CLI¶
In this mode, the input is a file or STDIN, with one input per line:
echo -e 'asthma\npyruvate kinase hyperactivity' | ontoma --input-type string
echo -e 'OMIM:102900\nOMIM:104310' | ontoma --input-type ontology
The output format is a TSV file, containing one query-to-EFO mapping per line. The columns to include are configurable via the --columns
flag and can include the following:
query
: The original query, e.g. asthmaid_normalised
: The EFO term ID in the internal normalised CURIE representation as used by OnToma, e.g. EFO:0000270.id_ot_schema
: The EFO term ID as supported by the Open Targets JSON schema, e.g. EFO_0000270.id_full_uri
: The full EFO term URI, e.g. http://www.ebi.ac.uk/efo/EFO_0000270.label
: The normalised (lower case) label as specified in EFO, e.g. asthma.
By default, two columns are included: query
and id_ot_schema
.
In case no results were found, the query will be missing from the output. In case multiple results were found, the query will appear multiple times in the output.
You can read about additional flags by running ontoma --help
.
Running OnToma from Python code¶
from ontoma import OnToma
otmap = OnToma()
result_code = otmap.find_term('OMIM:102900', code=True)
result_string = otmap.find_term('asthma')
# Columns are available as attributes.
print(result.id_ot_schema) # Prints 'EFO_0000270'.
Speeding up subsequent OnToma runs¶
When you initialise an OnToma client, it needs to download and parse the latest EFO OT slim release. Depending on your internet connection, it may take anywhere from 10 seconds to a few minutes.
To speed up subsequent OnToma runs, you can specify a cache directory to avoid doing this every time. This can be supplied by a --cache-dir
option in the CLI or by a cache_dir
parameter in the Python interface.
Contents¶
Development¶
Set up environment¶
git clone https://github.com/opentargets/OnToma.git
python3 -m venv env
source env/bin/activate
pip install --editable .
Install development packages¶
python3 -m pip install --upgrade \
build pip twine \
pytest \
sphinx sphinx-rtd-theme
Adding a dependency¶
Installation dependencies are stored in the setup.py
file, in the install_requires
section.
Releasing a new version¶
- Modify the version in the
VERSION
file. - Add a tag:
git tag $(cat VERSION) && git push origin --tags
. - Create a release on GitHub.
- Generate distribution archives:
python3 -m build
. - Upload the release:
python3 -m twine upload dist/*
. Use the usual login and password for PyPi.
Performing a comparison benchmark¶
This can be used in case of major updates to OnToma algorithm to make sure they don’t break things significantly.
- Call
benchmark/sample.sh
to create a random sample of input strings. This will fetch 100 random records each from ClinVar, PhenoDigm, and gene2phenotype sources. Together they are thought to cover a wide range of use case well. Due to deduplication, the final filesample.txt
may contain slightly less than 300 records. - Optionally, to add some additional samples from PanelApp JSON file, you can do:
jq 'keys[]' diseaseToEfo_results.json | tr -d '"' | shuf -n 100 >> sample.txt; sort -uf -o sample.txt sample.txt
. - Process the file so that the final output format is a two column TSV file with the following columns: original query; full URI for an EFO term. Do not include the header.
- For OnToma pre-v1.0.0, use:
time ontoma sample.txt - | awk -F$'\t' '{if (length($2) != 0) {print $1 "\t" $2}}' | tail -n+2 > ontoma_old.txt
. - For OnToma v1.0.0+, use:
time ontoma --cache-dir EFO_CACHE --columns query,id_full_uri <sample.txt | tail -n+2 >ontoma_new.txt
.
- For OnToma pre-v1.0.0, use:
- Collate the results by using
benchmark/collate.py --old ontoma_old.txt --new ontoma_new.txt > benchmark_comparison.txt
. Load the resulting file into Google Sheets and manually mark the mappings as good/bad/uncertain. Exact, letter to letter matches are marked as good automatically.
OnToma interface¶
Main interface class.
-
class
ontoma.interface.
OnToma
(cache_dir=None, efo_release='latest')¶ Open Targets ontology mapping wrapper. Please refer to documentation for usage details.
-
filter_identifiers_by_efo_current
(normalised_identifiers)¶ Returns a subset of the idenfitiers which are in EFO and not marked as obsolete.
-
find_term
(query: str, code: bool = False, suggest: bool = False) → list¶ For a given query (an ontology identifier or a string), find matches in EFO Open Targets slim.
The algorithm operates in a series of steps. If a given step is successful, the result is returned immediately, and the remaining steps are not executed. If all steps fail to provide a match, None is returned. Note that in general more than one mapping can be returned. This can happen for complex traits which require more than one ontology term to represent them.
If the code flag is specified, it is assumed that the query is an ontology identifier, such as ‘OMIM:615632’, and the following steps are attempted: 1. See if the term is already in EFO OT slim OWL file. 2. Match terms by cross-references (hasDbXref) from the OWL file. 3. Mapping from the manual cross-reference database. 4. Request through OxO with a distance of 2.
If the query is a string, the following steps are attempted: 5. Exact name match from EFO OT slim OWL file. 6. Exact synonym (hasExactSynonym) from the OWL file. 7. Mapping from the manual string-to-ontology database. 8. High confidence mapping from ZOOMA with default parameters.
The following functionality is planned, but not yet implemented. — If the query is a string, and additionally the suggest flag is specified, additional steps are attempted: 9. Inexact synonyms (hasRelatedSynonym) from the OWL file. 10. Any confidence mapping from ZOOMA with default parameters.
- Args:
- query: Either an ontology identifier, or the disease/phenotype string to be matched to an EFO code. code: Whether to treat the query as an ontology identifier. suggest: Whether to report low quality mappings which are not guaranteed to be contained in EFO OT slim.
- Returns:
- A list of values dependent on the verbose flag (either strings with ontology identifiers, or a dictionary of additional information). The list will be empty if no hits were identified.
-
get_label_from_efo
(normalised_identifier)¶
-
step01_owl_identifier_match
(normalised_identifier)¶ If the term is already present in EFO, return it as is.
-
step02_owl_db_xref
(normalised_identifier)¶ If there are terms in EFO referenced by the hasDbXref field to the query, return them.
-
step03_manual_xref
(normalised_identifier)¶ Look for the queried term in the manual ontology-to-ontology mapping list.
-
step04_oxo_query
(normalised_identifier)¶ Find cross-references using OxO.
-
step05_owl_name_match
(normalised_string)¶ Find EFO terms which match the string query exactly.
-
step06_owl_exact_synonym
(normalised_string)¶ Find EFO terms which have the query as an exact synonym.
-
step07_manual_mapping
(normalised_string)¶ Find the query in the manual string-to-ontology mapping database.
-
step08_zooma_high_confidence
(normalised_string)¶
-
step10_zooma_any
(normalised_string)¶
-
ZOOMA wrapper¶
ZOOMA API wrapper.
-
class
ontoma.zooma.
ZoomaClient
(zooma_base='https://www.ebi.ac.uk/spot/zooma/v2/api')¶ Simple client to query ZOOMA. Will look in all curated datasources and perform a fuzzy search in EFO. Only HIGH quality mappings are considered.
-
search
(query_string)¶ Query ZOOMA and return all high confidence mappings.
-