CHEBI Predicates
This notebook is intended as an explanatory guide to the importance of edge types (predicates) in ontologies.
Citric acid and its ion forms
For this guide, we are going to look at citric acid and its conjugate forms such as citrate(3-). These chemical entities are very similar and in fact are readily interchangeable in cells.
Biologists and biochemists may talk of “citric acid” and “citrate” interchangeably.
We can see the corresponding CHEBI entries:
Citric acid
Citrate(3-)
Accessing CHEBI through OAK
There are different ways to access CHEBI; here we will use the sqlite adapter. See also part 7 of the tutorial.
We will use the selector sqlite:obo:chebi to access CHEBI.
We will be using the command line interface via Jupyter for this tutorial, but the equivalent operations can be done via Python.
First we will set up a Jupyter alias.
[1]:
%alias chebi runoak -i sqlite:obo:chebi
If we wanted to do the equivalent on the command line, we would do:
alias chebi="runoak -i sqlite:obo:chebi"
Basic lookup
Next we will do some basic lookup. The first time you run this, it may take some time, as the sqlite file is downloaded. Subsequent operations will be faster.
[3]:
chebi info "citric acid"
CHEBI:30769 ! citric acid
Term metadata
To check we have the right term, let’s look at all of the CHEBI metadata, including mappings and chemical formulae:
[4]:
chebi term-metadata "citric acid"
IAO:0000115: A tricarboxylic acid that is propane-1,2,3-tricarboxylic acid bearing
a hydroxy substituent at position 2. It is an important metabolite in the pathway
of all aerobic organisms.
id: CHEBI:30769
obo:chebi/charge: '0'
obo:chebi/formula: C6H8O7
obo:chebi/inchi: InChI=1S/C6H8O7/c7-3(8)1-6(13,5(11)12)2-4(9)10/h13H,1-2H2,(H,7,8)(H,9,10)(H,11,12)
obo:chebi/inchikey: KRKNYBCHXYNGOX-UHFFFAOYSA-N
obo:chebi/mass: '192.123'
obo:chebi/monoisotopicmass: '192.02700'
obo:chebi/smiles: OC(=O)CC(O)(CC(O)=O)C(O)=O
oio:hasAlternativeId:
- CHEBI:23322
- CHEBI:3727
- CHEBI:41523
oio:hasDbXref:
- BPDB:1359
- Beilstein:782061
- CAS:77-92-9
- DrugBank:DB04272
- Drug_Central:666
- Gmelin:4240
- HMDB:HMDB0000094
- KEGG:C00158
- KEGG:D00037
- KNApSAcK:C00007619
- MetaCyc:CIT
- PDBeChem:CIT
- PMID:11762832
- PMID:11782123
- PMID:11857437
- PMID:14537820
- PMID:15311880
- PMID:15934243
- PMID:16232627
- PMID:17190852
- PMID:17357118
- PMID:17604395
- PMID:18298573
- PMID:18960216
- PMID:19288211
- PMID:22115968
- PMID:22192423
- PMID:22264346
- PMID:22373571
- PMID:22509852
- Reaxys:782061
- Wikipedia:Citric_Acid
oio:hasExactSynonym:
- 2-hydroxypropane-1,2,3-tricarboxylic acid
- CITRIC ACID
- Citric acid
oio:hasOBONamespace: chebi_ontology
oio:hasRelatedSynonym:
- 2-Hydroxy-1,2,3-propanetricarboxylic acid
- 2-Hydroxytricarballylic acid
- 3-Carboxy-3-hydroxypentane-1,5-dioic acid
- Citronensaeure
- E330
- H3cit
oio:id: CHEBI:30769
oio:inSubset: obo:chebi#3_STAR
rdfs:label: citric acid
---
We can do the same thing for the same chemical in a different protonation state, citrate(3-):
[6]:
chebi term-metadata "citrate(3-)"
IAO:0000115: A tricarboxylic acid trianion, obtained by deprotonation of the three
carboxy groups of citric acid.
id: CHEBI:16947
obo:chebi/charge: '-3'
obo:chebi/formula: C6H5O7
obo:chebi/inchi: InChI=1S/C6H8O7/c7-3(8)1-6(13,5(11)12)2-4(9)10/h13H,1-2H2,(H,7,8)(H,9,10)(H,11,12)/p-3
obo:chebi/inchikey: KRKNYBCHXYNGOX-UHFFFAOYSA-K
obo:chebi/mass: '189.09970'
obo:chebi/monoisotopicmass: '189.00517'
obo:chebi/smiles: OC(CC([O-])=O)(CC([O-])=O)C([O-])=O
oio:hasAlternativeId:
- CHEBI:13999
- CHEBI:23321
- CHEBI:42563
oio:hasDbXref:
- Beilstein:1884707
- CAS:126-44-3
- Gmelin:4239
- KEGG:C00158
- PDBeChem:FLC
- Reaxys:1884707
oio:hasExactSynonym: 2-hydroxypropane-1,2,3-tricarboxylate
oio:hasOBONamespace: chebi_ontology
oio:hasRelatedSynonym:
- 2-hydroxy-1,2,3-propanetricarboxylate
- 2-hydroxy-1,2,3-propanetricarboxylate(3-)
- 2-hydroxy-1,2,3-propanetricarboxylic acid, ion(3-)
- 2-hydroxytricarballylate
- CITRATE ANION
- cit
- cit(3-)
- citrate
oio:id: CHEBI:16947
oio:inSubset: obo:chebi#3_STAR
rdfs:label: citrate(3-)
---
Computing similarity
There are various ways to measure chemical similarity.
Here we are using an ontology library, not a chemical library like RDKit, so we measure the similarity of two terms with respect to their shared parentage in the ontology.
We will use the similarity command that measures semantic similarity.
Like many OAK operations, it is parameterized by a predicates option. For example:
--predicates rdfs:subClassOf
This can be shortened to:
-p i
This instructs OAK to use only the is-a relationship when computing parentage.
Note that many libraries don’t provide any option here, and only allow is-a relationships.
The similarity command takes two term lists separated by @. Here we just want to do a simple pairwise comparison, so we specify one term on either side:
[8]:
chebi similarity -p i "citrate(3-)" @ "citric acid"
ancestor_id: CHEBI:37577
ancestor_information_content: 1.792616548579986
ancestor_label: heteroatomic molecular entity
jaccard_similarity: 0.25
object_id: CHEBI:30769
object_label: citric acid
phenodigm_score: 0.6694431545284457
subject_id: CHEBI:16947
subject_label: citrate(3-)
---
What is this telling us?
The Jaccard similarity is 0.25, which is very low.
The most recent common ancestor is the very general and abstract-sounding heteroatomic molecular entity, which has a low information content of 1.8.
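To make the information content figure more concrete, here is a minimal sketch assuming the common definition IC = -log2(p), where p is the fraction of ontology terms subsumed by the ancestor. Whether OAK uses exactly this normalisation is an assumption, the function name is hypothetical, and the counts are made up for illustration.

```python
import math

def information_content(n_subsumed: int, n_total: int) -> float:
    """IC = -log2(p), where p is the fraction of terms subsumed by the term."""
    if not 0 < n_subsumed <= n_total:
        raise ValueError("need 0 < n_subsumed <= n_total")
    return -math.log2(n_subsumed / n_total)

# Hypothetical counts, not taken from CHEBI.
general = information_content(50_000, 200_000)  # broad term covering a quarter of all terms
specific = information_content(25, 200_000)     # narrow term covering a handful of terms
print(general, specific)
```

Under this definition a broad ancestor scores close to zero and a narrow one scores high, which is why a low-IC common ancestor signals a weak match.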
Why is the score so low?
Remember at the start of this guide we looked at the chemical structures, which are almost identical. And biologically these are interchangeable. Why is the similarity so low?
Investigating ontological oddities is one of the strengths of OAK. We can take a number of different approaches, but the easiest is to start by just visualizing the terms and their ancestors:
[9]:
chebi viz -p i "citrate(3-)" "citric acid" -o output/citrate.png
As can be seen, the is-a graphs of these two terms are almost completely separated. You have to go all the way up to heteroatomic molecular entity to find the common ancestor, just like the similarity output told us.
So what’s going on?
So what’s going on here? Is there something missing from CHEBI?
In fact, CHEBI is like this by design, and we see the same pattern/template repeated for all acids.
But all is not lost: CHEBI has other relationships we can use here.
Which leads us to one of the main lessons when using ontologies:
Always make use of the full range of edge types
CHEBI has many other edge types we can use here.
Currently OAK doesn’t have a quick way of summarizing edge statistics, but we can do this easily with a SQL query on the sqlite database we downloaded earlier, querying the Edge table:
[3]:
!echo "SELECT predicate, count(*) FROM edge GROUP BY predicate;" | sqlite3 $HOME/.data/oaklib/chebi.db
BFO:0000051|3947
RO:0000087|42533
obo:chebi#has_functional_parent|18459
obo:chebi#has_parent_hydride|1752
obo:chebi#is_conjugate_acid_of|8340
obo:chebi#is_conjugate_base_of|8340
obo:chebi#is_enantiomer_of|2700
obo:chebi#is_substituent_group_from|1279
obo:chebi#is_tautomer_of|1846
rdfs:subClassOf|235113
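The same aggregate can also be run from Python with the standard sqlite3 module. The sketch below is an illustration only: it builds a tiny in-memory table instead of opening the downloaded file, the subject/predicate/object column layout is an assumption, and the EX: identifiers are made up.

```python
import sqlite3

# Toy stand-in for the downloaded database (assumed three-column edge table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [
        ("EX:acid", "rdfs:subClassOf", "EX:carboxylic_acid"),
        ("EX:anion", "rdfs:subClassOf", "EX:carboxylate"),
        ("EX:acid", "EX:is_conjugate_acid_of", "EX:anion"),
    ],
)

# Same GROUP BY as the shell one-liner, with a stable ordering added.
rows = conn.execute(
    "SELECT predicate, count(*) FROM edge GROUP BY predicate ORDER BY predicate"
).fetchall()
for predicate, n in rows:
    print(f"{predicate}|{n}")
conn.close()
```

To run it against the real data, replace ":memory:" with the path to the downloaded chebi.db and drop the CREATE and INSERT statements.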
CHEBI mostly uses its own relationship types, plus a few from RO. We can query what these are:
[6]:
chebi info BFO:0000051 RO:0000087
BFO:0000051 ! has part
RO:0000087 ! has role
Next let’s try again, using the viz command, but this time adding a different predicate.
We can specify a list of predicates separated by , with the --predicates option on most commands:
[9]:
chebi viz -p "i,obo:chebi#is_conjugate_acid_of" "citrate(3-)" "citric acid" -o output/citrate-conj-acid-of.png
This time the terms are much closer together. However, they are not “next” to each other, which brings us to another lesson:
Number of hops is often meaningless with ontologies
A common metric with graph operations is counting number of hops. However, for knowledge graphs, this metric can be misleading or meaningless. It may be tempting to do something like “weighting” predicates but this is always ad-hoc.
With ontologies, predicates have meaning and we want this to be taken into account.
Calculating similarity using all predicates
In OAK the default is usually to use all predicates.
Thus we can simply ask for the overall similarity between citric acid and the 3- form, without a predicates option:
[11]:
chebi similarity "citrate(3-)" @ "citric acid"
ancestor_id: CHEBI:133748
ancestor_information_content: 12.881018338799942
ancestor_label: citrate anion
jaccard_similarity: 0.6526315789473685
object_id: CHEBI:30769
object_label: citric acid
phenodigm_score: 2.899406721538221
subject_id: CHEBI:16947
subject_label: citrate(3-)
---
This is much better than before, reflecting the true biochemical similarity between these.
The Jaccard similarity is 0.65, still not great.
The MRCA is the more meaningful citrate anion, which has a higher IC of 12.88.
So how did OAK calculate this?
Here OAK made use of all edge types in the CHEBI relation graph. This is in contrast to many methods that only make use of is-a relationships. Restricting to is-a might be a reasonable baked-in assumption if you are only doing similarity on HPO (but even then it can be limiting).
For other ontologies, we need to make use of other predicates.
At this stage you may be thinking: “Ah! All ontologies are DAGs, so OAK is using the CHEBI DAG here!”
This brings us to our next point:
Ontologies are not DAGs
This is a common misconception. Ontologies are not DAGs, no matter what you may have previously heard.
In general, when it comes to ontologies, avoid baking in assumptions generalized from a few cases.
Often ontologies will be released in a form that is guaranteed to be a DAG because so many tools mistakenly assume an ontology is a DAG. But if you are using one of these dumbed down forms of an ontology you are missing useful information.
Let’s take a look at CHEBI again. This time we will use two different relationship types (predicates), and exclude is-a:
[13]:
chebi viz CHEBI:16947 CHEBI:30769 -p "obo:chebi#is_conjugate_acid_of,obo:chebi#is_conjugate_base_of" -o output/citrate-not-a-dag.png
This is definitely not a DAG.
In fact there is no reason to assume that the structure of an ontology or a knowledge graph will be a DAG. A lot of relationships in real life are inherently cyclic, and this definitely holds for chemistry: we have cyclic structures and cyclic relationships, and chemicals cycle through these different protonation states.
So how do we handle these?
At this point you might be thinking it makes no sense to use measures like semantic similarity over cyclic graphs, or that it may be necessary to include ad-hoc measures like maximum distance. But this isn’t the case.
Because ontology graphs are existential graphs over concepts, where the existence of the subject depends on the existence of the object, you can still use algorithms designed with concepts of “ancestors” and “descendants”. The overall structure of the relation graph will still (in general) follow a pattern of converging towards more general concepts.
There are two broad approaches:
naive graph walking, with cycle checks
use the relation graph
The first is trivial to implement: write the traversal as you normally would, but remove the assumption of acyclicity.
The OAK sqlite adapter uses the second approach, making use of Relation Graph.
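As a minimal sketch of the first approach, the traversal below keeps a visited set so that cycles cannot cause an endless loop, and it only follows edges whose predicate is in the requested set. The edge list, predicate names and EX: identifiers are made up for illustration; this is not how OAK itself is implemented.

```python
from collections import deque

# Hypothetical toy edges (subject, predicate, object); identifiers are made up.
EDGES = [
    ("EX:acid", "is_a", "EX:carboxylic_acid"),
    ("EX:anion", "is_a", "EX:carboxylate"),
    ("EX:acid", "conjugate_acid_of", "EX:anion"),
    ("EX:anion", "conjugate_base_of", "EX:acid"),  # closes a cycle
]

def ancestors(start, predicates, edges=EDGES):
    """Return every node reachable from `start` over the given predicates."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in edges:
            if s == node and p in predicates and o not in seen:
                seen.add(o)  # the visited set doubles as the cycle check
                queue.append(o)
    return seen

isa_only = ancestors("EX:acid", {"is_a"})
all_preds = ancestors("EX:acid", {"is_a", "conjugate_acid_of", "conjugate_base_of"})
print(sorted(isa_only))
print(sorted(all_preds))
```

With is-a only, the toy acid never reaches the anion branch; with the conjugate predicates added, it does, and the cycle back to the starting term is handled without special casing.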
Relation Graph
Relation graph is a tool for calculating the closure of ontology relationships. Unlike naive graph walking, it takes into account the semantics of the ontology and of predicates, as intended by the producers of these ontologies.
More formally, RG materializes the entailment of all SubClassOf axioms, including those axioms that have existential restrictions on the right hand side.
Relation graph can be obtained and installed from its github repo.
[15]:
!relation-graph --help
relation-graph
Usage: relation-graph [options]
--usage <bool>
Print usage and exit
--help | -h <bool>
Print help message and exit
--ontology-file <filename>
Input OWL ontology
--output-file <filename>
File to stream output triples to.
--mode <RDF|OWL>
Configure style of triples to be output. RDF mode is the default; each existential relation is collapsed to a single direct triple.
--property <IRI>
Property to restrict output relations to. Provide option multiple times for multiple properties.
--properties-file <filename>
File containing line-separated property IRIs to restrict output relations to.
--output-subclasses <bool>
Include entailed rdfs:subClassOf or owl:equivalentClass relations in output (default false)
--reflexive-subclasses <bool>
When outputting rdfs:subClassOf, include relations to self for every class (default true)
--equivalence-as-subclass <bool>
When outputting equivalent classes, output reciprocal rdfs:subClassOf triples instead of owl:equivalentClass triples (default true)
--output-classes <bool>
Output any triples where classes are subjects (default true)
--output-individuals <bool>
Output triples where individuals are subjects, with classes as objects (default false)
--disable-owl-nothing <bool>
Disable inference of unsatisfiable classes by the whelk reasoner (default false)
--verbose <bool>
Set log level to INFO
SemSQL Builds have relation-graph pre-computed
If you access an ontology via the sqlite method, it will make use of an ontology loaded in using the SemSQL schema, which has relation-graph precomputed.
We can take a look first at CL:
[17]:
%alias cl runoak -i sqlite:obo:cl
The OAK relationships command will query all relationships (by default “outgoing”) for an entity.
If you pass in --include-entailed it will include entailed (inferred by a reasoner) relationships, here coming from RG:
[24]:
cl relationships --include-entailed astrocyte > output/astrocyte-rg.tsv
An RG can be large, so we will explore it with pandas:
[26]:
import pandas as pd
df = pd.read_csv("output/astrocyte-rg.tsv", sep="\t")
df
[26]:
| subject | predicate | object | subject_label | predicate_label | object_label |
---|---|---|---|---|---|---|
0 | CL:0000127 | BFO:0000050 | BFO:0000002 | astrocyte | part of | continuant |
1 | CL:0000127 | BFO:0000050 | BFO:0000004 | astrocyte | part of | independent continuant |
2 | CL:0000127 | BFO:0000050 | BFO:0000040 | astrocyte | part of | material entity |
3 | CL:0000127 | BFO:0000050 | CARO:0000000 | astrocyte | part of | anatomical entity |
4 | CL:0000127 | BFO:0000050 | CARO:0000006 | astrocyte | part of | material anatomical entity |
... | ... | ... | ... | ... | ... | ... |
299 | CL:0000127 | rdfs:subClassOf | CL:0000127 | astrocyte | None | astrocyte |
300 | CL:0000127 | rdfs:subClassOf | CL:0000255 | astrocyte | None | eukaryotic cell |
301 | CL:0000127 | rdfs:subClassOf | CL:0000548 | astrocyte | None | animal cell |
302 | CL:0000127 | rdfs:subClassOf | CL:0002319 | astrocyte | None | neural cell |
303 | CL:0000127 | rdfs:subClassOf | CL:0002371 | astrocyte | None | somatic cell |
304 rows × 6 columns
These are all guaranteed correct (according to the semantics of classes and object properties in the ontology).
They are not guaranteed to be useful. There are many trivial edges, e.g.:
every astrocyte is part of some material entity
every astrocyte is a subtype of a continuant
But these less useful ones will “fall out in the wash” when we use them in methods like semantic similarity.
We can query the EntailedEdge table more directly:
[30]:
!echo "SELECT * FROM entailed_edge WHERE predicate='BFO:0000050' AND object LIKE 'UBERON:%' LIMIT 5" | sqlite3 $HOME/.data/oaklib/cl.db
UBERON:0001612|BFO:0000050|UBERON:0000477
UBERON:0001612|BFO:0000050|UBERON:0010000
UBERON:0001612|BFO:0000050|UBERON:0000055
UBERON:0001612|BFO:0000050|UBERON:0003509
UBERON:0001612|BFO:0000050|UBERON:0001637
OAK uses Relation Graph in semantic similarity
The OAK SQL backend will use RG when calculating semantic similarity:
There is no need to worry whether the structure is a DAG or a tree.
There is no need to implement “hops” or ad-hoc mechanisms.
All semantic similarity measures then become trivial operations on RG, parameterizable by a semantic predicate:
Jaccard is simply the number of ancestors the two terms share in the RG divided by the number of ancestors in the union of their ancestor sets.
MRCA, IC etc. work as expected.
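These operations can be sketched in a few lines once ancestor sets are available. The example below uses made-up reflexive ancestor sets and made-up IC values, and picks the common ancestor with the highest IC; that selection rule is an assumption about how the MRCA is chosen, not a description of OAK's code.

```python
# Hypothetical reflexive ancestor sets, as they might be read from entailed edges.
ANCESTORS = {
    "EX:acid": {"EX:acid", "EX:anion_group", "EX:carboxylic_acid", "EX:entity"},
    "EX:anion": {"EX:anion", "EX:anion_group", "EX:carboxylate", "EX:entity"},
}

# Hypothetical information content values for the shared ancestors.
IC = {"EX:entity": 0.0, "EX:anion_group": 12.9}

def jaccard(a, b):
    """Size of the shared ancestor set divided by the size of the union."""
    shared = ANCESTORS[a] & ANCESTORS[b]
    union = ANCESTORS[a] | ANCESTORS[b]
    return len(shared) / len(union)

def most_informative_common_ancestor(a, b):
    """Common ancestor with the highest IC, or None if nothing is shared."""
    shared = ANCESTORS[a] & ANCESTORS[b]
    if not shared:
        return None
    return max(shared, key=lambda term: IC[term])

score = jaccard("EX:acid", "EX:anion")
best = most_informative_common_ancestor("EX:acid", "EX:anion")
print(score, best)
```

Changing the predicate set only changes which ancestors end up in each set; the similarity functions themselves stay the same.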
Relation Graph is only as good as its inputs
Garbage in, garbage out:
If ontologies include false axioms, RG will give false results.
If ontologies are incomplete, RG will give incomplete answers.
Usually for most ontologies we care about in a project like Monarch, we have decent QC in place, and in general most methods should be resilient to this.