OAK validate-definitions command

This notebook is intended as a supplement to the main OAK CLI docs.

It provides examples for the validate-definitions command, which is part of a suite of validate commands.

Help Option

You can get help on any OAK command using the --help option:

[16]:
!runoak validate-definitions --help
Usage: runoak validate-definitions [OPTIONS] [TERMS]...

  Checks presence and structure of text definitions.

  To run:

      runoak validate-definitions -i db/uberon.db -o results.tsv

  By default this will apply basic text mining of text definitions to check
  against machine actionable OBO text definition guideline rules. This can
  result in an initial lag - to skip this, and ONLY perform checks for
  *presence* of definitions, use --skip-text-annotation:

  Example: -------

      runoak validate-definitions -i db/uberon.db --skip-text-annotation

  Like most OAK commands, this accepts lists of terms or term queries as
  arguments. You can pass in a CURIE list to selectively validate individual
  classes

  Example: -------

       runoak validate-definitions -i db/cl.db CL:0002053

  Only on CL identifiers:

      runoak validate-definitions -i db/cl.db i^CL:

  Only on neuron hierarchy:

      runoak validate-definitions -i db/cl.db .desc//p=i neuron

  Output format:

  This command emits objects conforming to the OAK validation datamodel. See
  https://incatools.github.io/ontology-access-kit/datamodels for more on OAK
  datamodels.

  The default serialization of the datamodel is CSV.

  Notes: -----

  This command is largely redundant with the validate command, but is useful
  for targeted validation focused solely on definitions

Options:
  --skip-text-annotation / --no-skip-text-annotation
                                  If true, do not parse text annotations
                                  [default: no-skip-text-annotation]
  -C, --configuration-file TEXT   Path to a configuration file. This is
                                  typically a YAML file, but may be a JSON
                                  file
  --adapter-mapping TEXT          Multiple prefix=selector pairs, e.g.
                                  --adapter-mapping uberon=db/uberon.db
  -O, --output-type TEXT          Desired output type
  -o, --output FILENAME           Output file, e.g. obo file
  --help                          Show this message and exit.

Example: Validation over Test Ontology

To illustrate this command we will use a deliberately altered version of a subset of GO.

We will query the subset of terms that are descendants of cellular_component using the query .desc//p=i "cellular_component"

[17]:
!runoak -i simpleobo:input/validate-defs-test.obo validate-definitions -C input/validate-definition-conf.yaml .desc//p=i "cellular_component" -o output/validate-definitions.output.tsv

The output is a TSV file with a summary of the issues found.

We can load this into a pandas dataframe for further analysis. This also has the advantage of displaying tables nicely in Jupyter notebooks such as this one.

If you are running this on the command line, you may prefer to use your own TSV processing tools, or simply load the file into Google Sheets.
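For command-line use without pandas, the same kind of file can be summarised with Python's standard library. The sketch below is an illustration rather than part of OAK: `count_by_type` is a hypothetical helper, and the inline sample stands in for output/validate-definitions.output.tsv with only a few of its columns.

```python
import csv
import io
from collections import Counter


def count_by_type(tsv_handle):
    """Count validation results per value of the 'type' column."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return Counter(row["type"] for row in reader)


# Inline sample standing in for the real TSV file (subset of columns).
SAMPLE_TSV = (
    "type\tsubject\tinfo\n"
    "oaklib.om:DCC#S3\tGO:0043231\tCannot parse genus and differentia\n"
    "oaklib.om:DCC#S0\tGO:0012505\tMissing text definition\n"
    "oaklib.om:DCC#S3\tGO:0099568\tCannot parse genus and differentia\n"
)

counts = count_by_type(io.StringIO(SAMPLE_TSV))
for result_type, n in counts.most_common():
    print(f"{result_type}\t{n}")
```

With a real file, the handle would come from `open(path, newline="")` instead of `io.StringIO`.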

[18]:
import pandas as pd
df = pd.read_csv("output/validate-definitions.output.tsv", sep="\t")
df
[18]:
type subject subject_label severity instantiates predicate object object_str source info
0 oaklib.om:DCC#S3 GO:0043231 intracellular membrane-bounded organelle WARNING NaN IAO:0000115 NaN Organized structure of distinctive morphology ... NaN Cannot parse genus and differentia
1 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle NaN NaN IAO:0000115 NaN NaN NaN Logical definition element not found in text: ...
2 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle NaN NaN IAO:0000115 NaN NaN NaN Logical definition element not found in text: ...
3 oaklib.om:DCC#S3 GO:0099568 cytoplasmic region WARNING NaN IAO:0000115 NaN Any (proper) part of the cytoplasm of a single... NaN Cannot parse genus and differentia
4 oaklib.om:DCC#S3 GO:0099738 cell cortex region NaN NaN IAO:0000115 NaN complete extent of cell cortex NaN Did not match whole text: cell cortex < comple...
5 oaklib.om:DCC#S11 GO:0099738 cell cortex region NaN NaN IAO:0000115 NaN underlies some some region of the plasma membrane NaN Wrong position, 'cell cortex' not in 'underlie...
6 oaklib.om:DCC#S3 GO:0071944 cell periphery WARNING NaN IAO:0000115 NaN The part of a cell encompassing the cell corte... NaN Cannot parse genus and differentia
7 oaklib.om:DCC#S11 GO:0031090 organelle membrane NaN NaN IAO:0000115 NaN is one of the two lipid bilayers of an organel... NaN Logical definition element not found in text: ...
8 oaklib.om:DCC#S3 GO:0043229 intracellular organelle WARNING NaN IAO:0000115 NaN Organized structure of distinctive morphology ... NaN Cannot parse genus and differentia
9 oaklib.om:DCC#S11 GO:0043229 intracellular organelle NaN NaN IAO:0000115 NaN NaN NaN Logical definition element not found in text: ...
10 oaklib.om:DCC#S11 GO:0043229 intracellular organelle NaN NaN IAO:0000115 NaN NaN NaN Logical definition element not found in text: ...
11 oaklib.om:DCC#S3 GO:0031967 organelle envelope WARNING NaN IAO:0000115 NaN A double membrane structure enclosing an organ... NaN Cannot parse genus and differentia
12 oaklib.om:DCC#S3 GO:0031975 envelope WARNING NaN IAO:0000115 NaN A multilayered structure surrounding all or pa... NaN Cannot parse genus and differentia
13 oaklib.om:DCC#Any GO:0098590 plasma membrane region INFO NaN IAO:0000115 NaN A membrane that is a (regional) part of the pl... NaN No problems with definition
14 oaklib.om:DCC#S0 GO:0012505 endomembrane system ERROR NaN IAO:0000115 NaN NaN NaN Missing text definition
15 oaklib.om:DCC#S3 GO:0005622 intracellular anatomical structure WARNING NaN IAO:0000115 NaN A component of a cell contained within (but no... NaN Cannot parse genus and differentia
16 oaklib.om:DCC#S3 GO:9999998 fake term for testing pmid type WARNING NaN IAO:0000115 NaN fake definition to test retracted typo in refe... NaN Cannot parse genus and differentia
17 oaklib.om:DCC#S3 GO:0043227 membrane-bounded organelle WARNING NaN IAO:0000115 NaN Organized structure of distinctive morphology ... NaN Cannot parse genus and differentia
18 oaklib.om:DCC#S11 GO:0043227 membrane-bounded organelle NaN NaN IAO:0000115 NaN NaN NaN Logical definition element not found in text: ...
19 oaklib.om:DCC#S11 GO:0005938 cell cortex NaN NaN IAO:0000115 NaN region of a cell NaN Logical definition element not found in text: ...
20 oaklib.om:DCC#S11 GO:0005938 cell cortex NaN NaN IAO:0000115 NaN lies just beneath the plasma membrane and ofte... NaN Logical definition element not found in text: ...
21 oaklib.om:DCC#S7 GO:0009579 thylakoid NaN NaN IAO:0000115 NaN The structure in a plant cell that is known as... NaN Circular, thylakoid (GO:0009579 in definition
22 oaklib.om:DCC#S3 GO:9999999 fake term for testing retraction WARNING NaN IAO:0000115 NaN fake definition to test retracted reference NaN Cannot parse genus and differentia
23 oaklib.om:DCC#S3 GO:0005575 cellular_component WARNING NaN IAO:0000115 NaN A location, relative to cellular compartments ... NaN Cannot parse genus and differentia
24 oaklib.om:DCC#Any GO:0005634 nucleus INFO NaN IAO:0000115 NaN A membrane-bounded organelle of eukaryotic cel... NaN No problems with definition
25 oaklib.om:DCC#S3 GO:0016020 membrane WARNING NaN IAO:0000115 NaN A lipid bilayer along with all the proteins an... NaN Cannot parse genus and differentia
26 oaklib.om:DCC#Any GO:0110165 cellular anatomical entity INFO NaN IAO:0000115 NaN A part of a cellular organism that is either a... NaN No problems with definition
27 oaklib.om:DCC#Any GO:0005635 nuclear envelope INFO NaN IAO:0000115 NaN A double lipid bilayer that is part of the nuc... NaN No problems with definition
28 oaklib.om:DCC#Any GO:0005886 plasma membrane INFO NaN IAO:0000115 NaN The membrane surrounding a cell that separates... NaN No problems with definition
29 oaklib.om:DCC#S1 GO:0005773 vacuole NaN NaN IAO:0000115 NaN NaN NaN Definiendum should not appear at the start
30 oaklib.om:DCC#S11 GO:0031965 nuclear membrane NaN NaN IAO:0000115 NaN envelope NaN Logical definition element not found in text: ...
31 oaklib.om:DCC#S1 GO:0005737 cytoplasm NaN NaN IAO:0000115 NaN NaN NaN Definiendum should not appear at the start
32 oaklib.om:DCC#Any GO:0034357 photosynthetic membrane INFO NaN IAO:0000115 NaN A membrane enriched in complexes formed of rea... NaN No problems with definition
33 oaklib.om:DCC#S3 GO:0043226 organelle WARNING NaN IAO:0000115 NaN Organized structure of distinctive morphology ... NaN Cannot parse genus and differentia
34 oaklib.om:DCC#S20.1 GO:9999998 fake term for testing pmid type ERROR NaN IAO:0000115 PMID:9999999999999 fake definition to test retracted typo in refe... NaN publication not found: PMID:9999999999999
35 oaklib.om:DCC#S20.2 GO:9999999 fake term for testing retraction ERROR NaN IAO:0000115 PMID:19717156 NaN NaN publication is retracted: A role for plasma tr...

The rows conform to ValidationResults in the OAK ontology-metadata data model.

The values of the type field are from the DefinitionConstraintComponent enumeration.

These values are in turn modeled on the taxonomy in Seppälä, Ruttenberg, and Smith, Guidelines for writing definitions in ontologies.
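As a reading aid, the codes seen in this notebook can be paired with the messages that accompany them in the output above. The mapping below is assembled only from the info column of this run; it is not the official description of each enumeration value, and `OBSERVED_MESSAGES`, `short_code` and `describe` are hypothetical names.

```python
# Messages observed in the 'info' column of this notebook's output, keyed by
# the short code after '#'. S3 and S11 also appear with other messages in the
# table; only one representative message is kept for each code.
OBSERVED_MESSAGES = {
    "Any": "No problems with definition",
    "S0": "Missing text definition",
    "S1": "Definiendum should not appear at the start",
    "S3": "Cannot parse genus and differentia",
    "S7": "Circular",
    "S11": "Logical definition element not found in text",
    "S20.1": "publication not found",
    "S20.2": "publication is retracted",
}


def short_code(type_value):
    """Return the part after '#', e.g. 'oaklib.om:DCC#S3' -> 'S3'."""
    return type_value.rsplit("#", 1)[-1]


def describe(type_value):
    """Look up the observed message for a type value, if there is one."""
    return OBSERVED_MESSAGES.get(short_code(type_value), "(not seen in this run)")


print(describe("oaklib.om:DCC#S20.1"))
```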

[19]:
df["type"].unique()
[19]:
array(['oaklib.om:DCC#S3', 'oaklib.om:DCC#S11', 'oaklib.om:DCC#Any',
       'oaklib.om:DCC#S0', 'oaklib.om:DCC#S7', 'oaklib.om:DCC#S1',
       'oaklib.om:DCC#S20.1', 'oaklib.om:DCC#S20.2'], dtype=object)
[20]:
df.groupby("type").size().reset_index(name='counts')
[20]:
type counts
0 oaklib.om:DCC#Any 6
1 oaklib.om:DCC#S0 1
2 oaklib.om:DCC#S1 2
3 oaklib.om:DCC#S11 10
4 oaklib.om:DCC#S20.1 1
5 oaklib.om:DCC#S20.2 1
6 oaklib.om:DCC#S3 14
7 oaklib.om:DCC#S7 1

Next we'll keep only the most informative columns:

[21]:
df = df[["type", "subject", "subject_label", "object_str", "info"]]
df
[21]:
type subject subject_label object_str info
0 oaklib.om:DCC#S3 GO:0043231 intracellular membrane-bounded organelle Organized structure of distinctive morphology ... Cannot parse genus and differentia
1 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle NaN Logical definition element not found in text: ...
2 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle NaN Logical definition element not found in text: ...
3 oaklib.om:DCC#S3 GO:0099568 cytoplasmic region Any (proper) part of the cytoplasm of a single... Cannot parse genus and differentia
4 oaklib.om:DCC#S3 GO:0099738 cell cortex region complete extent of cell cortex Did not match whole text: cell cortex < comple...
5 oaklib.om:DCC#S11 GO:0099738 cell cortex region underlies some some region of the plasma membrane Wrong position, 'cell cortex' not in 'underlie...
6 oaklib.om:DCC#S3 GO:0071944 cell periphery The part of a cell encompassing the cell corte... Cannot parse genus and differentia
7 oaklib.om:DCC#S11 GO:0031090 organelle membrane is one of the two lipid bilayers of an organel... Logical definition element not found in text: ...
8 oaklib.om:DCC#S3 GO:0043229 intracellular organelle Organized structure of distinctive morphology ... Cannot parse genus and differentia
9 oaklib.om:DCC#S11 GO:0043229 intracellular organelle NaN Logical definition element not found in text: ...
10 oaklib.om:DCC#S11 GO:0043229 intracellular organelle NaN Logical definition element not found in text: ...
11 oaklib.om:DCC#S3 GO:0031967 organelle envelope A double membrane structure enclosing an organ... Cannot parse genus and differentia
12 oaklib.om:DCC#S3 GO:0031975 envelope A multilayered structure surrounding all or pa... Cannot parse genus and differentia
13 oaklib.om:DCC#Any GO:0098590 plasma membrane region A membrane that is a (regional) part of the pl... No problems with definition
14 oaklib.om:DCC#S0 GO:0012505 endomembrane system NaN Missing text definition
15 oaklib.om:DCC#S3 GO:0005622 intracellular anatomical structure A component of a cell contained within (but no... Cannot parse genus and differentia
16 oaklib.om:DCC#S3 GO:9999998 fake term for testing pmid type fake definition to test retracted typo in refe... Cannot parse genus and differentia
17 oaklib.om:DCC#S3 GO:0043227 membrane-bounded organelle Organized structure of distinctive morphology ... Cannot parse genus and differentia
18 oaklib.om:DCC#S11 GO:0043227 membrane-bounded organelle NaN Logical definition element not found in text: ...
19 oaklib.om:DCC#S11 GO:0005938 cell cortex region of a cell Logical definition element not found in text: ...
20 oaklib.om:DCC#S11 GO:0005938 cell cortex lies just beneath the plasma membrane and ofte... Logical definition element not found in text: ...
21 oaklib.om:DCC#S7 GO:0009579 thylakoid The structure in a plant cell that is known as... Circular, thylakoid (GO:0009579 in definition
22 oaklib.om:DCC#S3 GO:9999999 fake term for testing retraction fake definition to test retracted reference Cannot parse genus and differentia
23 oaklib.om:DCC#S3 GO:0005575 cellular_component A location, relative to cellular compartments ... Cannot parse genus and differentia
24 oaklib.om:DCC#Any GO:0005634 nucleus A membrane-bounded organelle of eukaryotic cel... No problems with definition
25 oaklib.om:DCC#S3 GO:0016020 membrane A lipid bilayer along with all the proteins an... Cannot parse genus and differentia
26 oaklib.om:DCC#Any GO:0110165 cellular anatomical entity A part of a cellular organism that is either a... No problems with definition
27 oaklib.om:DCC#Any GO:0005635 nuclear envelope A double lipid bilayer that is part of the nuc... No problems with definition
28 oaklib.om:DCC#Any GO:0005886 plasma membrane The membrane surrounding a cell that separates... No problems with definition
29 oaklib.om:DCC#S1 GO:0005773 vacuole NaN Definiendum should not appear at the start
30 oaklib.om:DCC#S11 GO:0031965 nuclear membrane envelope Logical definition element not found in text: ...
31 oaklib.om:DCC#S1 GO:0005737 cytoplasm NaN Definiendum should not appear at the start
32 oaklib.om:DCC#Any GO:0034357 photosynthetic membrane A membrane enriched in complexes formed of rea... No problems with definition
33 oaklib.om:DCC#S3 GO:0043226 organelle Organized structure of distinctive morphology ... Cannot parse genus and differentia
34 oaklib.om:DCC#S20.1 GO:9999998 fake term for testing pmid type fake definition to test retracted typo in refe... publication not found: PMID:9999999999999
35 oaklib.om:DCC#S20.2 GO:9999999 fake term for testing retraction NaN publication is retracted: A role for plasma tr...

Missing Definitions

The simplest way to fail a definition check is to have no definition at all. We can list all the terms with missing definitions:

[22]:
df[df["type"] == "oaklib.om:DCC#S0"]

[22]:
type subject subject_label object_str info
14 oaklib.om:DCC#S0 GO:0012505 endomembrane system NaN Missing text definition

Of course, in the real ontology this term has a definition; it is missing only in the deliberately altered test file.
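In the full table above, the missing definition and both reference problems carry severity ERROR. A script that should fail when such rows are present could filter on that column. The sketch below uses only the standard library; `error_rows` and the inline sample are illustrative assumptions, not OAK functionality.

```python
import csv
import io


def error_rows(tsv_handle):
    """Return the rows whose 'severity' column is exactly 'ERROR'."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row for row in reader if row.get("severity") == "ERROR"]


# Inline sample with a subset of the columns from the report.
SAMPLE_TSV = (
    "type\tsubject\tseverity\tinfo\n"
    "oaklib.om:DCC#S0\tGO:0012505\tERROR\tMissing text definition\n"
    "oaklib.om:DCC#S3\tGO:0016020\tWARNING\tCannot parse genus and differentia\n"
    "oaklib.om:DCC#S1\tGO:0005773\t\tDefiniendum should not appear at the start\n"
)

errors = error_rows(io.StringIO(SAMPLE_TSV))
for row in errors:
    print(row["subject"], row["info"])
# A wrapper script could call sys.exit(1) when `errors` is non-empty.
```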

Non genus-differentia structure

The OAK validate-definitions command follows SRS (Seppälä, Ruttenberg, and Smith) and assumes that good definitions have a genus-differentia structure.

We can see the ones that fail this (S3):

[23]:
df[df["type"] == "oaklib.om:DCC#S3"]
[23]:
type subject subject_label object_str info
0 oaklib.om:DCC#S3 GO:0043231 intracellular membrane-bounded organelle Organized structure of distinctive morphology ... Cannot parse genus and differentia
3 oaklib.om:DCC#S3 GO:0099568 cytoplasmic region Any (proper) part of the cytoplasm of a single... Cannot parse genus and differentia
4 oaklib.om:DCC#S3 GO:0099738 cell cortex region complete extent of cell cortex Did not match whole text: cell cortex < comple...
6 oaklib.om:DCC#S3 GO:0071944 cell periphery The part of a cell encompassing the cell corte... Cannot parse genus and differentia
8 oaklib.om:DCC#S3 GO:0043229 intracellular organelle Organized structure of distinctive morphology ... Cannot parse genus and differentia
11 oaklib.om:DCC#S3 GO:0031967 organelle envelope A double membrane structure enclosing an organ... Cannot parse genus and differentia
12 oaklib.om:DCC#S3 GO:0031975 envelope A multilayered structure surrounding all or pa... Cannot parse genus and differentia
15 oaklib.om:DCC#S3 GO:0005622 intracellular anatomical structure A component of a cell contained within (but no... Cannot parse genus and differentia
16 oaklib.om:DCC#S3 GO:9999998 fake term for testing pmid type fake definition to test retracted typo in refe... Cannot parse genus and differentia
17 oaklib.om:DCC#S3 GO:0043227 membrane-bounded organelle Organized structure of distinctive morphology ... Cannot parse genus and differentia
22 oaklib.om:DCC#S3 GO:9999999 fake term for testing retraction fake definition to test retracted reference Cannot parse genus and differentia
23 oaklib.om:DCC#S3 GO:0005575 cellular_component A location, relative to cellular compartments ... Cannot parse genus and differentia
25 oaklib.om:DCC#S3 GO:0016020 membrane A lipid bilayer along with all the proteins an... Cannot parse genus and differentia
33 oaklib.om:DCC#S3 GO:0043226 organelle Organized structure of distinctive morphology ... Cannot parse genus and differentia

Many of these are actual definitions rather than ones manipulated for test purposes.

There is room for valid disagreement about whether rewriting some of these in genus-differentia form would improve things for either users or annotators. Arguably, at least the subtypes of organelle could simply state how they are differentiated from organelles in general, rather than repeating the somewhat wordy “Organized structure of distinctive morphology…”.
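To make the genus-differentia pattern concrete, here is a deliberately naive sketch that splits a definition of the form "A <genus> that <differentia>". It is not how OAK performs this check (the help text describes text mining of the definitions). The regular expression, the name `split_genus_differentia`, and the first example sentence are assumptions made for illustration; the second example is taken from the table above.

```python
import re

# Naive pattern: optional article, a genus phrase, a relative pronoun, then
# the differentia. Real definitions are far more varied than this.
PATTERN = re.compile(
    r"^(?:(?:A|An|The)\s+)?(?P<genus>.+?)\s+(?:that|which)\s+(?P<differentia>.+?)\.?$",
    re.IGNORECASE,
)


def split_genus_differentia(definition):
    """Return (genus, differentia), or None if the naive pattern fails."""
    match = PATTERN.match(definition.strip())
    if match is None:
        return None
    return match.group("genus"), match.group("differentia")


# Made-up definition that fits the pattern.
print(split_genus_differentia("A membrane that surrounds an organelle."))
# Text from the S3 row for GO:0099738, which has no relative clause.
print(split_genus_differentia("complete extent of cell cortex"))
```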

Circular definitions

Definitions that mention the term being defined are flagged as circular (S7):

[24]:
df[df["type"] == "oaklib.om:DCC#S7"]
[24]:
type subject subject_label object_str info
21 oaklib.om:DCC#S7 GO:0009579 thylakoid The structure in a plant cell that is known as... Circular, thylakoid (GO:0009579 in definition
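A rough approximation of this check is to test whether the term's own label occurs in its definition text. The sketch below assumes simple case-insensitive whole-word matching; OAK's actual detection may work differently, `mentions_own_label` is a hypothetical helper, and both definition strings are made up because the real text is truncated in the table.

```python
import re


def mentions_own_label(label, definition):
    """True if the term's label occurs as a whole word in its definition."""
    pattern = r"\b" + re.escape(label) + r"\b"
    return re.search(pattern, definition, flags=re.IGNORECASE) is not None


# Made-up definition texts used only to exercise the helper.
print(mentions_own_label("thylakoid", "The structure known as a thylakoid."))
print(mentions_own_label("nucleus", "A membrane-bounded organelle."))
```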

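Not following convention

By convention, a definition should not begin with the term being defined (the definiendum). Terms that break this rule are reported as S1: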

[25]:
df[df["type"] == "oaklib.om:DCC#S1"]
[25]:
type subject subject_label object_str info
29 oaklib.om:DCC#S1 GO:0005773 vacuole NaN Definiendum should not appear at the start
31 oaklib.om:DCC#S1 GO:0005737 cytoplasm NaN Definiendum should not appear at the start
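The rule behind S1 can be approximated by checking whether a definition opens with the term's own label. This is a minimal sketch under stated assumptions: a leading article is ignored, matching is case-insensitive, and `starts_with_label` is a hypothetical helper rather than OAK's implementation. The example definitions are made up, since the table does not show the text for these rows.

```python
def starts_with_label(label, definition):
    """True if the definition begins with the term's own label.

    A leading article ("a", "an", "the") is skipped before comparing, and
    surrounding punctuation is stripped from each word.
    """
    strip_chars = ".,;:()"
    words = [w.strip(strip_chars) for w in definition.lower().split()]
    if words and words[0] in {"a", "an", "the"}:
        words = words[1:]
    label_words = label.lower().split()
    if not label_words:
        return False
    return words[: len(label_words)] == label_words


# Made-up definition texts used only to exercise the helper.
print(starts_with_label("vacuole", "A vacuole is a closed structure."))
print(starts_with_label("cytoplasm", "All of the contents of a cell."))
```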

Definition Reference Issues

Typos in PMIDs

[26]:
df[df["type"] == "oaklib.om:DCC#S20.1"]

[26]:
type subject subject_label object_str info
34 oaklib.om:DCC#S20.1 GO:9999998 fake term for testing pmid type fake definition to test retracted typo in refe... publication not found: PMID:9999999999999
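The S20.1 result above is reported as "publication not found". A cheaper syntactic pre-check can catch some typos before any lookup is attempted. The sketch below assumes a PMID is a positive integer of at most eight digits with no leading zero; treat that limit as an assumption that may not hold in the future. `looks_like_pmid` is a hypothetical helper, not part of OAK.

```python
import re

# Assumption: a PMID is 1-8 digits with no leading zero.
PMID_CURIE = re.compile(r"^PMID:[1-9][0-9]{0,7}$")


def looks_like_pmid(curie):
    """True if the CURIE has the assumed PMID shape."""
    return PMID_CURIE.match(curie) is not None


print(looks_like_pmid("PMID:9708911"))
print(looks_like_pmid("PMID:9999999999999"))
```

A well-formed identifier can still be wrong, so this only complements the lookup-based check.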

Retracted publications

[27]:
df[df["type"] == "oaklib.om:DCC#S20.2"]

[27]:
type subject subject_label object_str info
35 oaklib.om:DCC#S20.2 GO:9999999 fake term for testing retraction NaN publication is retracted: A role for plasma tr...

Using LLMs to validate definitions

In this example we will use an LLM to validate the definition of this GO catalytic activity term:

[Term]
id: GO:0000010
name: trans-hexaprenyltranstransferase activity
namespace: molecular_function
alt_id: GO:0036422
def: "Catalysis of the reaction: (2E,6E)-farnesyl diphosphate + 4 isopentenyl diphosphate = 4 diphosphate + all-trans-heptaprenyl diphosphate." [PMID:9708911, RHEA:27794]
synonym: "all-trans-heptaprenyl-diphosphate synthase activity" RELATED [EC:2.5.1.30]
synonym: "HepPP synthase activity" RELATED [EC:2.5.1.30]
synonym: "heptaprenyl diphosphate synthase activity" RELATED []
synonym: "heptaprenyl pyrophosphate synthase activity" RELATED [EC:2.5.1.30]
synonym: "heptaprenyl pyrophosphate synthetase activity" RELATED [EC:2.5.1.30]
xref: EC:2.5.1.30
xref: MetaCyc:TRANS-HEXAPRENYLTRANSTRANSFERASE-RXN
xref: RHEA:27794
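The references for a definition sit in the bracketed list at the end of the def: line. As a simplified sketch, they can be pulled out with a regular expression; `definition_xrefs` is a hypothetical helper, and the pattern ignores less common cases such as xrefs with quoted descriptions.

```python
import re

# OBO 'def:' tag: quoted text followed by a bracketed, comma-separated xref list.
DEF_LINE = re.compile(r'^def:\s*"(?P<text>(?:[^"\\]|\\.)*)"\s*\[(?P<xrefs>[^\]]*)\]')


def definition_xrefs(def_line):
    """Return the list of xref CURIEs from a single OBO 'def:' line."""
    match = DEF_LINE.match(def_line)
    if match is None:
        return []
    return [x.strip() for x in match.group("xrefs").split(",") if x.strip()]


line = (
    'def: "Catalysis of the reaction: (2E,6E)-farnesyl diphosphate + '
    "4 isopentenyl diphosphate = 4 diphosphate + all-trans-heptaprenyl "
    'diphosphate." [PMID:9708911, RHEA:27794]'
)
print(definition_xrefs(line))
```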

The definition cites two references (PMID:9708911 and RHEA:27794):

[13]:
!runoak --stacktrace -i llm:{claude-3-opus}:simpleobo:input/validate-defs-test.obo validate-definitions -C input/validate-definition-conf.yaml GO:0000010 -O yaml -o output/validate-definitions.llm.yaml
[14]:
import yaml
# Use a context manager so the report file is closed after loading.
with open("output/validate-definitions.llm.yaml") as f:
    report = yaml.safe_load(f)
[15]:
for k, v in report.items():
    # Print long values as an indented block; str() guards non-string values.
    if len(str(v)) > 50:
        lines = str(v).split("\n")
        lines = [f"  {line}" for line in lines]
        lines = [""] + lines + [""]
        v = "\n".join(lines)
    print(f"{k}: {v}")
type: https://w3id.org/oak/ontology-metadata/DCC.S20
subject: GO:0000010
severity: INFO
predicate: IAO:0000115
object_str:
  id: PMID:9708911
  title: Biological significance of the side chain length of ubiquinone in Saccharomyces
    cerevisiae.
  abstract: Ubiquinone (UQ), an important component of the electron transfer system,
    is constituted of a quinone structure and a side chain isoprenoid. The side chain
    length of UQ differs between microorganisms, and this difference has been used for
    taxonomic study. In this study, we have addressed the importance of the length of
    the side chain of UQ for cells, and examined the effect of chain length by producing
    UQs with isoprenoid chain lengths between 5 and 10 in Saccharomyces cerevisiae.
    To make the different UQ species, different types of prenyl diphosphate synthases
    were expressed in a S. cerevisiae COQ1 mutant defective for hexaprenyl diphosphate
    synthesis. As a result, we found that the original species of UQ (in this case UQ-6)
    had maximum functionality. However, we found that other species of UQ could replace
    UQ-6. Thus a broad spectrum of different UQ species are biologically functional
    in yeast cells, although cells seem to display a preference for their own particular
    type of UQ.
  publication_type: Research Support, Non-U.S. Gov't


info:
  The term "trans-hexaprenyltranstransferase activity" has a HIGH level of alignment with the cited reference PMID:9708911. The abstract supports the definition well, as evidenced by these key points:

  1. The study examines the importance of the side chain length of ubiquinone (UQ) in Saccharomyces cerevisiae, which directly relates to the activity of trans-hexaprenyltranstransferase.

  2. The abstract mentions "hexaprenyl diphosphate synthesis" in S. cerevisiae, which is the product of trans-hexaprenyltranstransferase activity.

  3. The study found that the original species of UQ (UQ-6) had maximum functionality in yeast cells, suggesting a preference for the hexaprenyl side chain length produced by trans-hexaprenyltranstransferase.

  No sections of the abstract misalign with or contradict the term definition. The definition is appropriately specific, focusing on the enzyme's activity without providing additional details about its structure or cellular role.

definition:
  Catalysis of the reaction: (2E,6E)-farnesyl diphosphate + 4 isopentenyl diphosphate = 4 diphosphate + all-trans-heptaprenyl diphosphate.

definition_source: PMID:9708911

COMMENTARY

Note that because this report is generated by an LLM, the output can differ on every run.

In some runs the LLM fails to recognize that the paper is indeed about trans-hexaprenyltranstransferase activity. Even then the output is useful, as it shows us that the abstract is not directly about this activity.
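Because the output varies between runs, one way to gauge its stability is to repeat the command and tally the alignment level stated in each info text. The sketch below assumes the info text contains a phrase like "a HIGH level of alignment", as in the single run shown above. The other level names (MEDIUM, LOW), the helper `alignment_level`, and the sample strings are guesses for illustration, not observed output.

```python
import re
from collections import Counter

# Assumed phrasing, based on the one report shown in this notebook.
LEVEL = re.compile(r"\b(HIGH|MEDIUM|LOW)\s+level of alignment\b")


def alignment_level(info_text):
    """Return the stated alignment level, or None if the phrase is absent."""
    match = LEVEL.search(info_text)
    return match.group(1) if match else None


# Made-up info strings standing in for the 'info' field of repeated runs.
infos = [
    'The term "X" has a HIGH level of alignment with the cited reference.',
    'The term "X" has a LOW level of alignment with the cited reference.',
    "Free-form text without the expected phrase.",
]
tally = Counter(alignment_level(text) for text in infos)
print(tally)
```

Runs that land under None are worth reading by hand, since free-form LLM text may not follow the assumed phrasing.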
