OAK Developers Tutorial

This tutorial is primarily for Python Developers who wish to use OAK in their applications. These include applications such as:

  • building ontology-driven data portals

  • creating curation tools

  • data science and machine learning applications

  • web services

Some basic knowledge of the overall architecture and capabilities of OAK is assumed.

You may want to start with the slides on the command line here: https://doi.org/10.5281/zenodo.7708963

Or part 1 of the tutorial here: https://incatools.github.io/ontology-access-kit/intro/tutorial01.html

There is a video of the walkthrough of this tutorial: https://www.youtube.com/watch?v=nVTWazO_Gu0

How to follow this tutorial

The easiest way to run this tutorial is to clone the repo and run locally:

  1. clone the repo here https://github.com/INCATools/ontology-access-kit/

  2. cd ontology-access-kit

  3. poetry install

  4. poetry run jupyter notebook

Alternatively, everything here should work on a fresh install of oak from pypi. You will need to make sure the test files from tests/input are accessible.

Some of the examples work with these test files - others will work with versions of ontologies on the web.

Change directory so that test files are directly accessible

Note: this is necessary if you are running from a checkout of the OAK repo, since this notebook is in a subfolder

[1]:
%cd ..
/Users/cjm/repos/ontology-access-kit

The OAK documentation makes heavy use of some of the unit test files in the tests/input folder.

This include a little mini test subset of GO, available in different formats for the purposes of testing different adapters:

[2]:
!ls tests/input/go-nucleus.*
tests/input/go-nucleus.cx      tests/input/go-nucleus.ofn
tests/input/go-nucleus.db      tests/input/go-nucleus.owl
tests/input/go-nucleus.json    tests/input/go-nucleus.owl.ttl
tests/input/go-nucleus.obo

If you are an OAK core developer it helps to be aware of these files, as you will likely be writing new unit tests.

If one the other hand you just want to use OAK in your own code you don’t need to know anything about these except that they are handy for quick testing.

Running Examples from the OAK sphinx docs

The sphinx docs include code examples, these are visible from the >>>s

For example, in:

https://incatools.github.io/ontology-access-kit/packages/interfaces/basic

You can see sections like this:

images/oak-docs-code.png

If you click the “copy” button it will copy the code only (no >>>s, and no output) such that you can paste directly into a Python REPL or a notebook (provided the paths are preserved).

E.g. try copying this section from the docs

>>> from oaklib import get_adapter
>>> adapter = get_adapter('tests/input/go-nucleus.db')
>>> print(adapter.label("GO:0005634"))
[4]:
from oaklib import get_adapter
adapter = get_adapter('tests/input/go-nucleus.db')
print(adapter.label("GO:0005634"))
nucleus

Hurray! All examples throughout the OAK docs should work

Note that you can play with using different input formats - this should give the same results:

[6]:
from oaklib import get_adapter
adapter = get_adapter('tests/input/go-nucleus.obo')
print(adapter.label("GO:0005634"))
nucleus

Working with whole ontologies

Almost all the examples in this tutorial make use of pre-made sqlite versions of ontologies.

These are specified using selector syntax:

sqlite:obo:ONTID

E.g.

sqlite:obo:cl

When you use this for the first time it will download and cache the file (using pystow), so there may be an initial lag

[8]:
from oaklib import get_adapter
adapter = get_adapter("sqlite:obo:cl")
print(adapter.label("CL:0000540"))
neuron

Hurray! We successfully fetched the label (name) for a class ID in the cell ontology!

Note that the label method is part of the BasicOntologyInterface

BasicOntologyInterface

The BasicOntologyInterface provides basic methods that encompass the majority of what most people need to do when working with ontologies - lookups of various kinds as well as simple graph operations.

OAK has the architectural concept of separating interfaces from implementations. It helps to read about this concept, but for now you don’t need to worry about it. We are using the sql adapter which fully implements almost all the existing OAK interfaces

Fetching ancestors

Next we are going to fetch ancestors.

Note: it would help to review tutorial part 1 to understand basic concepts of edges, ancestors, and predicates.

[9]:
from oaklib.datamodels.vocabulary import IS_A
for anc in adapter.ancestors("CL:0000540", predicates=[IS_A]):
    print(anc, adapter.label(anc))
CL:0000540 neuron
BFO:0000002 continuant
BFO:0000004 independent continuant
BFO:0000040 material entity
CL:0000000 cell
CL:0000003 native cell
CL:0000211 electrically active cell
CL:0000255 eukaryotic cell
CL:0000393 electrically responsive cell
CL:0000404 electrically signaling cell
CL:0000540 neuron
CL:0000548 animal cell
CL:0002319 neural cell
CL:0002371 somatic cell

Fetching descendants

Let’s try working with descendants.

This time we are going to demonstrate how OAK deals with collections.

note we expect the following codenotto work

[10]:
neurons = adapter.descendants("CL:0000540", predicates=[IS_A])
try:
    print(len(neurons))
except(Exception) as e:
    print(f"PROBLEM: {e}")
PROBLEM: object of type 'generator' has no len()

Why didn’t this work? What does object of type 'generator' has no len() mean?

To understand why we will mention a key concept in OAK, that of the iterator

Iterators

OAK methods rarely return lists - instead they return iterators. This means that code is better adaptable to use cases where you want to work with potentially very large lists or you want to stream results. See:

See best practice

However, if you don’t care about this you can simple use list(...) to get a list:

[11]:
neurons = list(adapter.descendants("CL:0000540", predicates=[IS_A]))
print(len(neurons))
454

You can also cast to set() to use set operations like intersections.

For example, let’s say we want to compose our neuron query above with a query to fetch all things in the forebrain, to get all neurons in the forebrain:

[12]:
from oaklib.datamodels.vocabulary import IS_A, PART_OF
parts_of_forebrain = set(adapter.descendants("UBERON:0001890", predicates=[IS_A, PART_OF]))
print(len(parts_of_forebrain))
219

You may be wondering what Uberon terms are doing here given that we requested the cell ontology in get_adapter

One under-appreciated fact of OBO is that many ontologies are in fact mini “knowledge graphs”, linking out to nodes in other ontologies. See

OK next lets do an intersection between the two lists

[13]:
for cell in parts_of_forebrain.intersection(neurons):
    print(cell, adapter.label(cell))
CL:1001435 periglomerular cell
CL:4023040 L2/3-6 intratelencephalic projecting glutamatergic cortical neuron
CL:1001571 hippocampal pyramidal neuron
CL:1001502 mitral cell
CL:1001434 olfactory bulb interneuron
CL:4023048 L4/5 intratelencephalic projecting glutamatergic neuron of the primary motor cortex
CL:1001505 parvocellular neurosecretory cell
CL:4023008 intratelencephalic-projecting glutamatergic cortical neuron
CL:4023049 L5 intratelencephalic projecting glutamatergic neuron of the primary motor cortex
CL:4023047 L2/3 intratelencephalic projecting glutamatergic neuron of the primary motor cortex
CL:4023081 inverted L6 intratelencephalic projecting glutamatergic neuron of the primary motor cortex (Mmus)
CL:4023050 L6 intratelencephalic projecting glutamatergic neuron of the primary motor cortex
CL:1001503 olfactory bulb tufted cell
CL:4023080 stellate L6 intratelencephalic projecting glutamatergic neuron of the primary motor cortex (Mmus)

This is all the neurons that are part of the forebrain.

Readers familiar with OWL and Protege may like to think of this as similar to a DL query for the expression neuron and part-of some forebrain – there are some theoretical differences we won’t get into here but for practical purposes the results should be the same.

Relationships

The above example uses the ancestors and descendants query in BasicOntologyInterface.

We can also get the relationships using the relationships method:

[15]:
for rel in adapter.relationships(["CL:0000540"]):
    print(rel)
('CL:0000540', 'RO:0002215', 'GO:0019226')
('CL:0000540', 'rdfs:subClassOf', 'BFO:0000040')
('CL:0000540', 'rdfs:subClassOf', 'CL:0000393')
('CL:0000540', 'rdfs:subClassOf', 'CL:0000404')
('CL:0000540', 'rdfs:subClassOf', 'CL:0002319')
[17]:
for _s, p, o in adapter.relationships(["CL:0000540"]):
    print(f" {p} {adapter.label(p)} {o} {adapter.label(o)}")
 RO:0002215 capable of GO:0019226 transmission of nerve impulse
 rdfs:subClassOf None BFO:0000040 material entity
 rdfs:subClassOf None CL:0000393 electrically responsive cell
 rdfs:subClassOf None CL:0000404 electrically signaling cell
 rdfs:subClassOf None CL:0002319 neural cell
[18]:
for s, p, _o in list(adapter.relationships(objects=["CL:0000540"]))[0:10]:
    print(f" {p} FROM: {s} {adapter.label(s)}")
 rdfs:subClassOf FROM: CL:0000028 CNS neuron (sensu Nematoda and Protostomia)
 rdfs:subClassOf FROM: CL:0000029 neural crest derived neuron
 rdfs:subClassOf FROM: CL:0000099 interneuron
 rdfs:subClassOf FROM: CL:0000102 polymodal neuron
 rdfs:subClassOf FROM: CL:0000104 multipolar neuron
 rdfs:subClassOf FROM: CL:0000105 pseudounipolar neuron
 rdfs:subClassOf FROM: CL:0000106 unipolar neuron
 rdfs:subClassOf FROM: CL:0000108 cholinergic neuron
 rdfs:subClassOf FROM: CL:0000109 adrenergic neuron
 rdfs:subClassOf FROM: CL:0000110 peptidergic neuron
[14]:
for _s, _p, o in adapter.relationships(["CL:0000540"], predicates=[PART_OF], include_entailed=True):
    print(o, adapter.label(o))
BFO:0000002 continuant
BFO:0000004 independent continuant
BFO:0000040 material entity
CARO:0000000 anatomical entity
CARO:0000006 material anatomical entity
CARO:0030000 biological entity
UBERON:0000061 anatomical structure
UBERON:0000465 material anatomical entity
UBERON:0000467 anatomical system
UBERON:0000468 multicellular organism
UBERON:0001016 nervous system
UBERON:0001062 anatomical entity
UBERON:0010000 multicellular anatomical structure

This includes all the entailed part-of relationships from neuron, including trivial ones (“every neuron is part of a material entity”)

[19]:
for s, _p, o in list(adapter.relationships(objects=["CL:0000540"], predicates=[IS_A], include_entailed=True))[0:10]:
    print(s, adapter.label(s))
CL:0000705 R6 photoreceptor cell
CL:4023108 oxytocin-secreting magnocellular cell
CL:0004240 WF1 amacrine cell
CL:0004242 WF3-1 amacrine cell
CL:1000380 type 1 vestibular sensory cell of epithelium of macula of saccule of membranous labyrinth
CL:1001582 lateral ventricle neuron
CL:4023128 rostral periventricular region of the third ventricle KDNy neuron
CL:0003020 retinal ganglion cell C outer
CL:4023094 tufted pyramidal neuron
CL:4023057 cerebellar inhibitory GABAergic interneuron

Creating a Data Frame for Relationships

Next we will see how to create a small data frame for relationships for forebrain neurons:

[20]:
import pandas as pd

forebrain_neurons = parts_of_forebrain.intersection(neurons)

objs = []
for s, p, o in adapter.relationships(forebrain_neurons):
    objs.append({"s": s, "s_label": adapter.label(s),
                 "p": p, "p_label": adapter.label(p),
                 "o": o, "o_label": adapter.label(o)})

df = pd.DataFrame(objs)
df
[20]:
s s_label p p_label o o_label
0 CL:1001434 olfactory bulb interneuron BFO:0000050 part of UBERON:0002264 olfactory bulb
1 CL:1001434 olfactory bulb interneuron RO:0002100 has soma location UBERON:0002264 olfactory bulb
2 CL:1001434 olfactory bulb interneuron rdfs:subClassOf None CL:0000101 sensory neuron
3 CL:1001434 olfactory bulb interneuron rdfs:subClassOf None CL:0000402 CNS interneuron
4 CL:1001434 olfactory bulb interneuron rdfs:subClassOf None CL:0012001 neuron of the forebrain
5 CL:1001435 periglomerular cell RO:0002100 has soma location UBERON:0005377 olfactory bulb glomerular layer
6 CL:1001435 periglomerular cell rdfs:subClassOf None CL:1001434 olfactory bulb interneuron
7 CL:1001502 mitral cell RO:0002100 has soma location UBERON:0004186 olfactory bulb mitral cell layer
8 CL:1001502 mitral cell rdfs:subClassOf None CL:1001434 olfactory bulb interneuron
9 CL:1001503 olfactory bulb tufted cell BFO:0000050 part of UBERON:0005376 olfactory bulb external plexiform layer
10 CL:1001503 olfactory bulb tufted cell rdfs:subClassOf None CARO:0000000 anatomical entity
11 CL:1001503 olfactory bulb tufted cell rdfs:subClassOf None CL:0000540 neuron
12 CL:1001505 parvocellular neurosecretory cell BFO:0000050 part of UBERON:0001930 paraventricular nucleus of hypothalamus
13 CL:1001505 parvocellular neurosecretory cell RO:0002215 capable of GO:0030103 vasopressin secretion
14 CL:1001505 parvocellular neurosecretory cell rdfs:subClassOf None CARO:0000000 anatomical entity
15 CL:1001505 parvocellular neurosecretory cell rdfs:subClassOf None CL:0000167 peptide hormone secreting cell
16 CL:1001505 parvocellular neurosecretory cell rdfs:subClassOf None CL:0000381 neurosecretory neuron
17 CL:1001505 parvocellular neurosecretory cell rdfs:subClassOf None CL:2000030 hypothalamus cell
18 CL:1001571 hippocampal pyramidal neuron BFO:0000050 part of UBERON:0002313 hippocampus pyramidal layer
19 CL:1001571 hippocampal pyramidal neuron rdfs:subClassOf None CL:0002608 hippocampal neuron
20 CL:1001571 hippocampal pyramidal neuron rdfs:subClassOf None CL:4023111 cerebral cortex pyramidal neuron
21 CL:4023008 intratelencephalic-projecting glutamatergic co... RO:0000053 bearer of PATO:0070034 intratelencephalic projecting
22 CL:4023008 intratelencephalic-projecting glutamatergic co... rdfs:subClassOf None CL:0000679 glutamatergic neuron
23 CL:4023008 intratelencephalic-projecting glutamatergic co... rdfs:subClassOf None CL:0010012 cerebral cortex neuron
24 CL:4023040 L2/3-6 intratelencephalic projecting glutamate... rdfs:subClassOf None CL:4023008 intratelencephalic-projecting glutamatergic co...
25 CL:4023047 L2/3 intratelencephalic projecting glutamaterg... RO:0002100 has soma location UBERON:0001384 primary motor cortex
26 CL:4023047 L2/3 intratelencephalic projecting glutamaterg... RO:0002100 has soma location UBERON:8440000 cortical layer II/III
27 CL:4023047 L2/3 intratelencephalic projecting glutamaterg... rdfs:subClassOf None CL:4023040 L2/3-6 intratelencephalic projecting glutamate...
28 CL:4023048 L4/5 intratelencephalic projecting glutamaterg... RO:0002100 has soma location UBERON:0001384 primary motor cortex
29 CL:4023048 L4/5 intratelencephalic projecting glutamaterg... RO:0002100 has soma location UBERON:8440001 cortical layer IV/V
30 CL:4023048 L4/5 intratelencephalic projecting glutamaterg... rdfs:subClassOf None CL:4023040 L2/3-6 intratelencephalic projecting glutamate...
31 CL:4023049 L5 intratelencephalic projecting glutamatergic... RO:0002100 has soma location UBERON:0001384 primary motor cortex
32 CL:4023049 L5 intratelencephalic projecting glutamatergic... RO:0002100 has soma location UBERON:0005394 cortical layer V
33 CL:4023049 L5 intratelencephalic projecting glutamatergic... rdfs:subClassOf None CL:4023040 L2/3-6 intratelencephalic projecting glutamate...
34 CL:4023050 L6 intratelencephalic projecting glutamatergic... RO:0000053 bearer of PATO:0070019 untufted pyramidal morphology
35 CL:4023050 L6 intratelencephalic projecting glutamatergic... RO:0002100 has soma location UBERON:0005395 cortical layer VI
36 CL:4023050 L6 intratelencephalic projecting glutamatergic... rdfs:subClassOf None CL:2000049 primary motor cortex pyramidal cell
37 CL:4023050 L6 intratelencephalic projecting glutamatergic... rdfs:subClassOf None CL:4023040 L2/3-6 intratelencephalic projecting glutamate...
38 CL:4023080 stellate L6 intratelencephalic projecting glut... RO:0000053 bearer of PATO:0070020 stellate pyramidal morphology
39 CL:4023080 stellate L6 intratelencephalic projecting glut... rdfs:subClassOf None CL:4023050 L6 intratelencephalic projecting glutamatergic...
40 CL:4023081 inverted L6 intratelencephalic projecting glut... RO:0000053 bearer of PATO:0070021 inverted pyramidal morphology
41 CL:4023081 inverted L6 intratelencephalic projecting glut... rdfs:subClassOf None CL:4023050 L6 intratelencephalic projecting glutamatergic...

Aliases

The BasicOntologyInterface has a deliberately simple datamodel for aliases that can be expressed by returning simple strings and tuples. Later on we can see how to leverage the more advanced OBO Graphs data model

[21]:
adapter.entity_aliases("CL:0000540")
[21]:
['nerve cell', 'neuron']
[22]:
for pred, alias in adapter.alias_relationships("CL:0000540"):
    print(pred, alias)
rdfs:label neuron
oio:hasExactSynonym nerve cell

Mappings

Similar to aliases, the BasicOntologyInterface has a very simple model of mappings. Later on we will see how we can use the MappingProviderInterface to get more granular information.

[25]:
for pred, xref in adapter.simple_mappings_by_curie("CL:0000202"):
    print(pred, xref)
oio:hasDbXref FMA:62364

Subsets

See Subsets in the OAK Glossary.

Subsets allow terms to be placed into groups outside the hierarchy for different purposes.

To illustrate we will switch our example to use GO which has a rich variety of subsets

[26]:
go_adapter = get_adapter("sqlite:obo:go")
[27]:
for subset in go_adapter.subsets():
    print(subset)
chebi_ph7_3
3_STAR
1_STAR
goslim_plant
goslim_pir
goslim_flybase_ribbon
goslim_chembl
goslim_agr
goslim_metagenomics
goslim_yeast
goslim_pombe
gocheck_do_not_annotate
goslim_generic
goslim_drosophila
goslim_candida
prokaryote_subset
gocheck_do_not_manually_annotate
goslim_synapse
goslim_mouse
SOFA
Alliance_of_Genome_Resources
biosapiens

note this includes subsets for ontologies that have been merged in

[28]:
for e in list(go_adapter.subset_members("goslim_generic"))[0:10]:
    print(e, go_adapter.label(e))
GO:0000228 nuclear chromosome
GO:0000278 mitotic cell cycle
GO:0000910 cytokinesis
GO:0001618 virus receptor activity
GO:0002181 cytoplasmic translation
GO:0002376 immune system process
GO:0003012 muscle system process
GO:0003013 circulatory system process
GO:0003014 renal system process
GO:0003016 respiratory system process

Plotting how GO subsets inter-relate

Now we are ready for a simple mini application - showing commonalities between

[29]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sets = []

for subset in go_adapter.subsets():
    members = set([x for x in go_adapter.subset_members(subset) if x.startswith("GO:")])
    if members:
        sets.append((subset, members))


# Number of sets
N = len(sets)

# Initialize an empty matrix to store the number of members in common
intersection_matrix = np.zeros((N, N))

# Calculate the intersections between each pair of sets
for i in range(N):
    for j in range(N):
        intersection_matrix[i, j] = len(sets[i][1].intersection(sets[j][1]))

# Get the set names
set_names = [s[0] for s in sets]

# Create a pandas DataFrame with the intersection matrix and set names as index and columns
intersection_df = pd.DataFrame(intersection_matrix, index=set_names, columns=set_names)

# Plot the clustermap with dendrograms
sns.clustermap(intersection_df, annot=True, cmap='viridis', fmt='g', figsize=(8, 6))
plt.title('Clustermap of Common Members Between Sets with Dendrograms', y=1.03)
plt.show()

../_images/examples_Developers-Tutorial_47_0.png

Search Interface

So far we have been doing basic lookup information, assuming we know the ID in advance.

What if we don’t know the ID but just have a label, or if we don’t even have a particular concept in mind, and just want to search?

If so, the SearchInterface is your friend!

Lookup by label

[30]:
for result in adapter.basic_search("neuron"):
    print(result)
CL:0000540

now let’s try searching for the capitalized form:

[31]:
len(list(adapter.basic_search("Neuron")))
[31]:
0

uh oh!

By design, the default is case sensitive. But we can pass a SearchConfiguration to make search more customizable.

You can read more about the SearchConfiguration datamodel here:

a note on data models the BasicOntologyInterface is designed to work without any particular data model, returning only simple lists and tuples. Other interfaces typically need to work with more sophisticated structures, so we use data models here.

[32]:
from oaklib.datamodels.search import SearchConfiguration, SearchTermSyntax, SearchProperty
[33]:
config = SearchConfiguration(force_case_insensitive=True)
[34]:
len(list(adapter.basic_search("Neuron", config)))
[34]:
1
[35]:
len(list(adapter.basic_search("NeUrOn", config)))
[35]:
1

We can also do regexes, starts-with, ends with etc (but see below for caveat)

[37]:
config = SearchConfiguration(syntax=SearchTermSyntax.STARTS_WITH)
[38]:
for result in adapter.basic_search("neuron", config):
    print(result, adapter.label(result))
CARO:0001001 neuron projection bundle
CL:0000006 neuronal receptor cell
CL:0000095 neuron associated cell
CL:0000123 neuron associated cell (sensu Vertebrata)
CL:0000130 neuron associated cell (sensu Nematoda and Protostomia)
CL:0000540 neuron
CL:0000555 neuronal brush cell
CL:0002611 neuron of the dorsal spinal cord
CL:0002612 neuron of the ventral spinal cord
CL:0002614 neuron of the substantia nigra
CL:0012001 neuron of the forebrain
GO:0001764 neuron migration
GO:0019228 neuronal action potential
GO:0030182 neuron differentiation
GO:0031175 neuron projection development
GO:0032589 neuron projection membrane
GO:0042551 neuron maturation
GO:0043005 neuron projection
GO:0043025 neuronal cell body
GO:0044306 neuron projection terminus
GO:0048666 neuron development
GO:0048812 neuron projection morphogenesis
GO:0051402 neuron apoptotic process
GO:0060705 neuron differentiation involved in salivary gland development
GO:0070050 neuron cellular homeostasis
GO:0070997 neuron death
GO:0106027 neuron projection organization
GO:0120111 neuron projection cytoplasm
GO:0150099 neuron-glial cell signaling
PATO:0070033 neuron projection quality
PR:000005460 neuronal acetylcholine receptor subunit alpha-7
PR:000044062 neuronal acetylcholine receptor subunit alpha-7, signal peptide removed form
PR:000044063 neuronal acetylcholine receptor subunit alpha-7, signal peptide removed form (human)
PR:P36544 neuronal acetylcholine receptor subunit alpha-7 (human)
PR:P49582 neuronal acetylcholine receptor subunit alpha-7 (mouse)
UBERON:0000122 neuron projection bundle
UBERON:0004904 neuron projection bundle connecting eye with brain

now we can try a regex:

[39]:
config = SearchConfiguration(syntax=SearchTermSyntax.REGULAR_EXPRESSION)
for result in adapter.basic_search("^neuron", config):
    print(result, adapter.label(result))
CARO:0001001 neuron projection bundle
CL:0000006 neuronal receptor cell
CL:0000095 neuron associated cell
CL:0000123 neuron associated cell (sensu Vertebrata)
CL:0000130 neuron associated cell (sensu Nematoda and Protostomia)
CL:0000540 neuron
CL:0000555 neuronal brush cell
CL:0002611 neuron of the dorsal spinal cord
CL:0002612 neuron of the ventral spinal cord
CL:0002614 neuron of the substantia nigra
CL:0012001 neuron of the forebrain
GO:0001764 neuron migration
GO:0019228 neuronal action potential
GO:0030182 neuron differentiation
GO:0031175 neuron projection development
GO:0032589 neuron projection membrane
GO:0042551 neuron maturation
GO:0043005 neuron projection
GO:0043025 neuronal cell body
GO:0044306 neuron projection terminus
GO:0048666 neuron development
GO:0048812 neuron projection morphogenesis
GO:0051402 neuron apoptotic process
GO:0060705 neuron differentiation involved in salivary gland development
GO:0070050 neuron cellular homeostasis
GO:0070997 neuron death
GO:0106027 neuron projection organization
GO:0120111 neuron projection cytoplasm
GO:0150099 neuron-glial cell signaling
PATO:0070033 neuron projection quality
PR:000005460 neuronal acetylcholine receptor subunit alpha-7
PR:000044062 neuronal acetylcholine receptor subunit alpha-7, signal peptide removed form
PR:000044063 neuronal acetylcholine receptor subunit alpha-7, signal peptide removed form (human)
PR:P36544 neuronal acetylcholine receptor subunit alpha-7 (human)
PR:P49582 neuronal acetylcholine receptor subunit alpha-7 (mouse)
UBERON:0000122 neuron projection bundle
UBERON:0004904 neuron projection bundle connecting eye with brain

Caveat on regexes

If your adapter is talking to sqlite, then the regex must be of a form that can be translated to a LIKE query

(OAK takes care of this translation - as a developer you should only care about the interface, not implementation)

In future we may have strategies to allow more powerful lexical search with sqlite…

Searching on mapped identifiers

You can search on arbitrary properties, such as synonyms or even mapped identifiers (object_id in SSSOM lingo)

[40]:
config = SearchConfiguration(properties=[SearchProperty.MAPPED_IDENTIFIER])
for result in adapter.basic_search("FMA:62364", config):
    print(result, adapter.label(result))
CL:0000202 auditory hair cell
CL:4023120 cochlea auditory hair cell

SSSOM Mappings

Up above we saw that the default datamodel for mappings in OAK is simple. For more advanced operations, you can use:

MappingProviderInterface

This makes use of the https://w3id.org/sssom data model

[41]:
neurons = list(adapter.descendants("CL:0000540", predicates=[IS_A]))
mappings = list(adapter.sssom_mappings(neurons))
[42]:
len(mappings)
[42]:
186
[43]:
print(mappings[0])
Mapping(subject_id='CL:0000099', predicate_id='oio:hasDbXref', object_id='BTO:0003811', mapping_justification='semapv:UnspecifiedMatching', subject_label=None, subject_category=None, predicate_label=None, predicate_modifier=None, object_label=None, object_category=None, author_id=[], author_label=[], reviewer_id=[], reviewer_label=[], creator_id=[], creator_label=[], license=None, subject_type=None, subject_source='CL', subject_source_version=None, object_type=None, object_source='BTO', object_source_version=None, mapping_provider=None, mapping_source=None, mapping_cardinality=None, mapping_tool=None, mapping_tool_version=None, mapping_date=None, confidence=None, curation_rule=[], curation_rule_text=[], subject_match_field=[], object_match_field=[], match_string=[], subject_preprocessing=[], object_preprocessing=[], semantic_similarity_score=None, semantic_similarity_measure=None, see_also=[], other=None, comment=None)
[44]:
from linkml_runtime.dumpers import yaml_dumper
[45]:
print(yaml_dumper.dumps(mappings[0:2]))
- subject_id: CL:0000099
  predicate_id: oio:hasDbXref
  object_id: BTO:0003811
  mapping_justification: semapv:UnspecifiedMatching
  subject_source: CL
  object_source: BTO
- subject_id: CL:0000099
  predicate_id: oio:hasDbXref
  object_id: FBbt:00005125
  mapping_justification: semapv:UnspecifiedMatching
  subject_source: CL
  object_source: FBbt

Text Annotation

Interface: TextAnnotatorInterface

The text annotator uses the https://w3id.org/linkml/text-annotator data model. This models each annotation as an TextAnnotation object with fields such as subject_start and subject_end (marking the span in the text) and object_id and object_label (the matched concept):

[46]:
for ann in adapter.annotate_text("this is a goblet cell from the intestinal epithelium"):
    print(ann.subject_start, ann.subject_end, ann.object_id, ann.object_label)
18 21 CARO:0000013 cell
18 21 CL:0000000 cell
9 9 CHEBI:15339 A
11 11 CHEBI:15428 G
18 18 CHEBI:27594 C
11 21 CL:0000160 goblet cell
1 2 PR:000016301 TH
1 2 PR:P07101 TH
1 2 PR:P24529 Th
1 2 UBERON:0001897 Th
43 52 UBERON:0000483 epithelium
32 52 UBERON:0001277 intestinal epithelium

OBO Graph Interface

OboGraphInterface

[47]:
graph = adapter.ancestor_graph(["CL:0000540"], predicates=[IS_A, PART_OF])
[48]:
len(graph.nodes)
[48]:
25
[49]:
len(graph.edges)
[49]:
29

Exporting subgraphs to GraphViz

See also part 5 of the tutorial

[50]:
from oaklib.utilities.obograph_utils import graph_to_image
[51]:
graph_to_image(graph, seeds=["CL:0000540"], imgfile="examples/output/neuron-v1.png")

img

Adding a stylesheet

The graph above is a little plain and boring looking. We can spice it up using a StyleMap.

For now we will use the standard stylemap in src/oaklib/conf/obograph-style.json:

[52]:
from oaklib.utilities.obograph_utils import default_stylemap_path
[53]:
graph_to_image(graph, seeds=["CL:0000540"], imgfile="examples/output/neuron-v2.png", stylemap=default_stylemap_path())

img

Working with annotations

[54]:
hp = get_adapter("src/oaklib/conf/hpoa-g2p-input-spec.yaml")
[55]:
len(list(hp.associations()))
[55]:
238269
[56]:
for assoc in list(hp.associations())[0:15]:
    print(assoc, hp.label(assoc.object))
Association(subject='NCBIGene:ncbi_gene_id', predicate=None, object='hpo_id', property_values=[]) None
Association(subject='NCBIGene:10', predicate=None, object='HP:0000007', property_values=[]) Autosomal recessive inheritance
Association(subject='NCBIGene:10', predicate=None, object='HP:0001939', property_values=[]) Abnormality of metabolism/homeostasis
Association(subject='NCBIGene:16', predicate=None, object='HP:0002460', property_values=[]) Distal muscle weakness
Association(subject='NCBIGene:16', predicate=None, object='HP:0002451', property_values=[]) Limb dystonia
Association(subject='NCBIGene:16', predicate=None, object='HP:0010871', property_values=[]) Sensory ataxia
Association(subject='NCBIGene:16', predicate=None, object='HP:0009886', property_values=[]) Trichorrhexis nodosa
Association(subject='NCBIGene:16', predicate=None, object='HP:0002421', property_values=[]) Poor head control
Association(subject='NCBIGene:16', predicate=None, object='HP:0001298', property_values=[]) Encephalopathy
Association(subject='NCBIGene:16', predicate=None, object='HP:0001290', property_values=[]) Generalized hypotonia
Association(subject='NCBIGene:16', predicate=None, object='HP:0001273', property_values=[]) Abnormal corpus callosum morphology
Association(subject='NCBIGene:16', predicate=None, object='HP:0001268', property_values=[]) Mental deterioration
Association(subject='NCBIGene:16', predicate=None, object='HP:0002599', property_values=[]) Head titubation
Association(subject='NCBIGene:16', predicate=None, object='HP:0001284', property_values=[]) Areflexia
Association(subject='NCBIGene:16', predicate=None, object='HP:0001250', property_values=[]) Seizure
[58]:
# Fetch sensory ataxia genes (including those annotated to is-a descendants of the term)
ataxia_assocs = list(hp.associations(objects=["HP:0010871"], object_closure_predicates=[IS_A]))
len(ataxia_assocs)
[58]:
15
[59]:
genes = list(set([assoc.subject for assoc in ataxia_assocs]))
len(genes)
[59]:
15
[60]:
genes[0:5]
[60]:
['NCBIGene:56652',
 'NCBIGene:5428',
 'NCBIGene:16',
 'NCBIGene:1959',
 'NCBIGene:57716']
[61]:
node_normalizer = get_adapter("translator:")
uniprot_ids = set()
for gene in genes:
    for m in node_normalizer.sssom_mappings([gene], source="UniProtKB"):
        uniprot_ids.add(m.object_id)
uniprot_ids
[61]:
{'UniProtKB:A0A024RDV7',
 'UniProtKB:A0A140VJE4',
 'UniProtKB:A0A2R8Y4V4',
 'UniProtKB:A0A2R8Y746',
 'UniProtKB:A0A2U3TZU2',
 'UniProtKB:A0A514TP98',
 'UniProtKB:A0A5F9ZI26',
 'UniProtKB:A0A8I5KYI5',
 'UniProtKB:A8KA82',
 'UniProtKB:A8MU75',
 'UniProtKB:B2RB38',
 'UniProtKB:B4DE36',
 'UniProtKB:E5KNU5',
 'UniProtKB:E5KSY5',
 'UniProtKB:O00505',
 'UniProtKB:P06744',
 'UniProtKB:P11161',
 'UniProtKB:P25189',
 'UniProtKB:P49588',
 'UniProtKB:P54098',
 'UniProtKB:P54802',
 'UniProtKB:Q01453',
 'UniProtKB:Q13217',
 'UniProtKB:Q6FH25',
 'UniProtKB:Q8TF17',
 'UniProtKB:Q96K19',
 'UniProtKB:Q96RR1',
 'UniProtKB:Q9BXM0',
 'UniProtKB:Q9H5I5',
 'UniProtKB:Q9H6V3',
 'UniProtKB:Q9Y5Y0'}
[63]:
go = get_adapter("src/oaklib/conf/go-human-input-spec.yaml")
[64]:
results = list(go.enriched_classes(uniprot_ids, object_closure_predicates=[IS_A, PART_OF], autolabel=True))
[66]:
for result in results:
    print(f"{result.class_id} '{result.class_label}' {result.p_value_adjusted:0.2e}")
GO:0042552 'myelination' 9.72e-04
GO:0008366 'axon ensheathment' 1.06e-03
GO:0007272 'ensheathment of neurons' 1.06e-03
GO:0007422 'peripheral nervous system development' 6.38e-03
GO:0014037 'Schwann cell differentiation' 2.75e-02
[67]:
terms = [r.class_id for r in results]
graph = go.ancestor_graph(terms, predicates=[IS_A, PART_OF])
[68]:
graph_to_image(graph, seeds=terms, imgfile="examples/output/go-enrichment-from-hp.png", stylemap=default_stylemap_path())

img

[ ]: