{ "cells": [ { "cell_type": "markdown", "id": "0a28b88d-4deb-4d0a-a110-f27adf077e23", "metadata": {}, "source": [ "# OAK validate-definitions command\n", "\n", "This notebook is intended as a supplement to the [main OAK CLI docs](https://incatools.github.io/ontology-access-kit/cli.html).\n", "\n", "This notebook provides examples for the `validate-definitions` command.\n", "This forms part of a suite of *validate* commands.\n", " \n", "## Help Option\n", "\n", "You can get help on any OAK command using `--help`" ] }, { "cell_type": "code", "execution_count": 16, "id": "c223f678-f82f-4b06-8e19-1a5b7323e571", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:50:27.966036Z", "start_time": "2024-04-15T00:50:25.530846Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: runoak validate-definitions [OPTIONS] [TERMS]...\n", "\n", " Checks presence and structure of text definitions.\n", "\n", " To run:\n", "\n", " runoak validate-definitions -i db/uberon.db -o results.tsv\n", "\n", " By default this will apply basic text mining of text definitions to check\n", " against machine actionable OBO text definition guideline rules. This can\n", " result in an initial lag - to skip this, and ONLY perform checks for\n", " *presence* of definitions, use --skip-text-annotation:\n", "\n", " Example: -------\n", "\n", " runoak validate-definitions -i db/uberon.db --skip-text-annotation\n", "\n", " Like most OAK commands, this accepts lists of terms or term queries as\n", " arguments. 
You can pass in a CURIE list to selectively validate individual\n", " classes\n", "\n", " Example: -------\n", "\n", " runoak validate-definitions -i db/cl.db CL:0002053\n", "\n", " Only on CL identifiers:\n", "\n", " runoak validate-definitions -i db/cl.db i^CL:\n", "\n", " Only on neuron hierarchy:\n", "\n", " runoak validate-definitions -i db/cl.db .desc//p=i neuron\n", "\n", " Output format:\n", "\n", " This command emits objects conforming to the OAK validation datamodel. See\n", " https://incatools.github.io/ontology-access-kit/datamodels for more on OAK\n", " datamodels.\n", "\n", " The default serialization of the datamodel is CSV.\n", "\n", " Notes: -----\n", "\n", " This command is largely redundant with the validate command, but is useful\n", " for targeted validation focused solely on definitions\n", "\n", "Options:\n", " --skip-text-annotation / --no-skip-text-annotation\n", " If true, do not parse text annotations\n", " [default: no-skip-text-annotation]\n", " -C, --configuration-file TEXT Path to a configuration file. This is\n", " typically a YAML file, but may be a JSON\n", " file\n", " --adapter-mapping TEXT Multiple prefix=selector pairs, e.g.\n", " --adapter-mapping uberon=db/uberon.db\n", " -O, --output-type TEXT Desired output type\n", " -o, --output FILENAME Output file, e.g. 
obo file\n", " --help Show this message and exit.\n" ] } ], "source": [ "!runoak validate-definitions --help" ] }, { "cell_type": "markdown", "id": "01f38163-db22-4c51-ae46-10e8b8e6d53c", "metadata": {}, "source": [ "## Example: Validation over Test Ontology\n", "\n", "To illustrate this command we will use a deliberately altered version of a subset of GO.\n", "\n", "We will restrict validation to the is-a descendants of `cellular_component`, using the query `.desc//p=i \"cellular_component\"`" ] }, { "cell_type": "code", "execution_count": 17, "id": "c9b86e52-87a7-449c-baac-81981e7ce632", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:50:30.655424Z", "start_time": "2024-04-15T00:50:27.968820Z" } }, "outputs": [], "source": [ "!runoak -i simpleobo:input/validate-defs-test.obo validate-definitions -C input/validate-definition-conf.yaml .desc//p=i \"cellular_component\" -o output/validate-definitions.output.tsv" ] }, { "cell_type": "markdown", "id": "27c1668fc8d1a8de", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "The output is a TSV file with a summary of the issues found.\n", "\n", "We can load this into a pandas DataFrame for further analysis. This also has the advantage of\n", "displaying tables nicely in Jupyter notebooks such as this one.\n", "\n", "If you are running this on the command line, you may prefer to use your own TSV processing tools,\n", "or simply to load the file into Google Sheets." ] }, { "cell_type": "code", "execution_count": 18, "id": "5fc9b15d-cc81-400a-8660-f92491baa120", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:50:30.953116Z", "start_time": "2024-04-15T00:50:30.658190Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typesubjectsubject_labelseverityinstantiatespredicateobjectobject_strsourceinfo
0oaklib.om:DCC#S3GO:0043231intracellular membrane-bounded organelleWARNINGNaNIAO:0000115NaNOrganized structure of distinctive morphology ...NaNCannot parse genus and differentia
1oaklib.om:DCC#S11GO:0043231intracellular membrane-bounded organelleNaNNaNIAO:0000115NaNNaNNaNLogical definition element not found in text: ...
2oaklib.om:DCC#S11GO:0043231intracellular membrane-bounded organelleNaNNaNIAO:0000115NaNNaNNaNLogical definition element not found in text: ...
3oaklib.om:DCC#S3GO:0099568cytoplasmic regionWARNINGNaNIAO:0000115NaNAny (proper) part of the cytoplasm of a single...NaNCannot parse genus and differentia
4oaklib.om:DCC#S3GO:0099738cell cortex regionNaNNaNIAO:0000115NaNcomplete extent of cell cortexNaNDid not match whole text: cell cortex < comple...
5oaklib.om:DCC#S11GO:0099738cell cortex regionNaNNaNIAO:0000115NaNunderlies some some region of the plasma membraneNaNWrong position, 'cell cortex' not in 'underlie...
6oaklib.om:DCC#S3GO:0071944cell peripheryWARNINGNaNIAO:0000115NaNThe part of a cell encompassing the cell corte...NaNCannot parse genus and differentia
7oaklib.om:DCC#S11GO:0031090organelle membraneNaNNaNIAO:0000115NaNis one of the two lipid bilayers of an organel...NaNLogical definition element not found in text: ...
8oaklib.om:DCC#S3GO:0043229intracellular organelleWARNINGNaNIAO:0000115NaNOrganized structure of distinctive morphology ...NaNCannot parse genus and differentia
9oaklib.om:DCC#S11GO:0043229intracellular organelleNaNNaNIAO:0000115NaNNaNNaNLogical definition element not found in text: ...
10oaklib.om:DCC#S11GO:0043229intracellular organelleNaNNaNIAO:0000115NaNNaNNaNLogical definition element not found in text: ...
11oaklib.om:DCC#S3GO:0031967organelle envelopeWARNINGNaNIAO:0000115NaNA double membrane structure enclosing an organ...NaNCannot parse genus and differentia
12oaklib.om:DCC#S3GO:0031975envelopeWARNINGNaNIAO:0000115NaNA multilayered structure surrounding all or pa...NaNCannot parse genus and differentia
13oaklib.om:DCC#AnyGO:0098590plasma membrane regionINFONaNIAO:0000115NaNA membrane that is a (regional) part of the pl...NaNNo problems with definition
14oaklib.om:DCC#S0GO:0012505endomembrane systemERRORNaNIAO:0000115NaNNaNNaNMissing text definition
15oaklib.om:DCC#S3GO:0005622intracellular anatomical structureWARNINGNaNIAO:0000115NaNA component of a cell contained within (but no...NaNCannot parse genus and differentia
16oaklib.om:DCC#S3GO:9999998fake term for testing pmid typeWARNINGNaNIAO:0000115NaNfake definition to test retracted typo in refe...NaNCannot parse genus and differentia
17oaklib.om:DCC#S3GO:0043227membrane-bounded organelleWARNINGNaNIAO:0000115NaNOrganized structure of distinctive morphology ...NaNCannot parse genus and differentia
18oaklib.om:DCC#S11GO:0043227membrane-bounded organelleNaNNaNIAO:0000115NaNNaNNaNLogical definition element not found in text: ...
19oaklib.om:DCC#S11GO:0005938cell cortexNaNNaNIAO:0000115NaNregion of a cellNaNLogical definition element not found in text: ...
20oaklib.om:DCC#S11GO:0005938cell cortexNaNNaNIAO:0000115NaNlies just beneath the plasma membrane and ofte...NaNLogical definition element not found in text: ...
21oaklib.om:DCC#S7GO:0009579thylakoidNaNNaNIAO:0000115NaNThe structure in a plant cell that is known as...NaNCircular, thylakoid (GO:0009579 in definition
22oaklib.om:DCC#S3GO:9999999fake term for testing retractionWARNINGNaNIAO:0000115NaNfake definition to test retracted referenceNaNCannot parse genus and differentia
23oaklib.om:DCC#S3GO:0005575cellular_componentWARNINGNaNIAO:0000115NaNA location, relative to cellular compartments ...NaNCannot parse genus and differentia
24oaklib.om:DCC#AnyGO:0005634nucleusINFONaNIAO:0000115NaNA membrane-bounded organelle of eukaryotic cel...NaNNo problems with definition
25oaklib.om:DCC#S3GO:0016020membraneWARNINGNaNIAO:0000115NaNA lipid bilayer along with all the proteins an...NaNCannot parse genus and differentia
26oaklib.om:DCC#AnyGO:0110165cellular anatomical entityINFONaNIAO:0000115NaNA part of a cellular organism that is either a...NaNNo problems with definition
27oaklib.om:DCC#AnyGO:0005635nuclear envelopeINFONaNIAO:0000115NaNA double lipid bilayer that is part of the nuc...NaNNo problems with definition
28oaklib.om:DCC#AnyGO:0005886plasma membraneINFONaNIAO:0000115NaNThe membrane surrounding a cell that separates...NaNNo problems with definition
29oaklib.om:DCC#S1GO:0005773vacuoleNaNNaNIAO:0000115NaNNaNNaNDefiniendum should not appear at the start
30oaklib.om:DCC#S11GO:0031965nuclear membraneNaNNaNIAO:0000115NaNenvelopeNaNLogical definition element not found in text: ...
31oaklib.om:DCC#S1GO:0005737cytoplasmNaNNaNIAO:0000115NaNNaNNaNDefiniendum should not appear at the start
32oaklib.om:DCC#AnyGO:0034357photosynthetic membraneINFONaNIAO:0000115NaNA membrane enriched in complexes formed of rea...NaNNo problems with definition
33oaklib.om:DCC#S3GO:0043226organelleWARNINGNaNIAO:0000115NaNOrganized structure of distinctive morphology ...NaNCannot parse genus and differentia
34oaklib.om:DCC#S20.1GO:9999998fake term for testing pmid typeERRORNaNIAO:0000115PMID:9999999999999fake definition to test retracted typo in refe...NaNpublication not found: PMID:9999999999999
35oaklib.om:DCC#S20.2GO:9999999fake term for testing retractionERRORNaNIAO:0000115PMID:19717156NaNNaNpublication is retracted: A role for plasma tr...
\n", "
" ], "text/plain": [ " type subject subject_label \\\n", "0 oaklib.om:DCC#S3 GO:0043231 intracellular membrane-bounded organelle \n", "1 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle \n", "2 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle \n", "3 oaklib.om:DCC#S3 GO:0099568 cytoplasmic region \n", "4 oaklib.om:DCC#S3 GO:0099738 cell cortex region \n", "5 oaklib.om:DCC#S11 GO:0099738 cell cortex region \n", "6 oaklib.om:DCC#S3 GO:0071944 cell periphery \n", "7 oaklib.om:DCC#S11 GO:0031090 organelle membrane \n", "8 oaklib.om:DCC#S3 GO:0043229 intracellular organelle \n", "9 oaklib.om:DCC#S11 GO:0043229 intracellular organelle \n", "10 oaklib.om:DCC#S11 GO:0043229 intracellular organelle \n", "11 oaklib.om:DCC#S3 GO:0031967 organelle envelope \n", "12 oaklib.om:DCC#S3 GO:0031975 envelope \n", "13 oaklib.om:DCC#Any GO:0098590 plasma membrane region \n", "14 oaklib.om:DCC#S0 GO:0012505 endomembrane system \n", "15 oaklib.om:DCC#S3 GO:0005622 intracellular anatomical structure \n", "16 oaklib.om:DCC#S3 GO:9999998 fake term for testing pmid type \n", "17 oaklib.om:DCC#S3 GO:0043227 membrane-bounded organelle \n", "18 oaklib.om:DCC#S11 GO:0043227 membrane-bounded organelle \n", "19 oaklib.om:DCC#S11 GO:0005938 cell cortex \n", "20 oaklib.om:DCC#S11 GO:0005938 cell cortex \n", "21 oaklib.om:DCC#S7 GO:0009579 thylakoid \n", "22 oaklib.om:DCC#S3 GO:9999999 fake term for testing retraction \n", "23 oaklib.om:DCC#S3 GO:0005575 cellular_component \n", "24 oaklib.om:DCC#Any GO:0005634 nucleus \n", "25 oaklib.om:DCC#S3 GO:0016020 membrane \n", "26 oaklib.om:DCC#Any GO:0110165 cellular anatomical entity \n", "27 oaklib.om:DCC#Any GO:0005635 nuclear envelope \n", "28 oaklib.om:DCC#Any GO:0005886 plasma membrane \n", "29 oaklib.om:DCC#S1 GO:0005773 vacuole \n", "30 oaklib.om:DCC#S11 GO:0031965 nuclear membrane \n", "31 oaklib.om:DCC#S1 GO:0005737 cytoplasm \n", "32 oaklib.om:DCC#Any GO:0034357 photosynthetic membrane \n", "33 
oaklib.om:DCC#S3 GO:0043226 organelle \n", "34 oaklib.om:DCC#S20.1 GO:9999998 fake term for testing pmid type \n", "35 oaklib.om:DCC#S20.2 GO:9999999 fake term for testing retraction \n", "\n", " severity instantiates predicate object \\\n", "0 WARNING NaN IAO:0000115 NaN \n", "1 NaN NaN IAO:0000115 NaN \n", "2 NaN NaN IAO:0000115 NaN \n", "3 WARNING NaN IAO:0000115 NaN \n", "4 NaN NaN IAO:0000115 NaN \n", "5 NaN NaN IAO:0000115 NaN \n", "6 WARNING NaN IAO:0000115 NaN \n", "7 NaN NaN IAO:0000115 NaN \n", "8 WARNING NaN IAO:0000115 NaN \n", "9 NaN NaN IAO:0000115 NaN \n", "10 NaN NaN IAO:0000115 NaN \n", "11 WARNING NaN IAO:0000115 NaN \n", "12 WARNING NaN IAO:0000115 NaN \n", "13 INFO NaN IAO:0000115 NaN \n", "14 ERROR NaN IAO:0000115 NaN \n", "15 WARNING NaN IAO:0000115 NaN \n", "16 WARNING NaN IAO:0000115 NaN \n", "17 WARNING NaN IAO:0000115 NaN \n", "18 NaN NaN IAO:0000115 NaN \n", "19 NaN NaN IAO:0000115 NaN \n", "20 NaN NaN IAO:0000115 NaN \n", "21 NaN NaN IAO:0000115 NaN \n", "22 WARNING NaN IAO:0000115 NaN \n", "23 WARNING NaN IAO:0000115 NaN \n", "24 INFO NaN IAO:0000115 NaN \n", "25 WARNING NaN IAO:0000115 NaN \n", "26 INFO NaN IAO:0000115 NaN \n", "27 INFO NaN IAO:0000115 NaN \n", "28 INFO NaN IAO:0000115 NaN \n", "29 NaN NaN IAO:0000115 NaN \n", "30 NaN NaN IAO:0000115 NaN \n", "31 NaN NaN IAO:0000115 NaN \n", "32 INFO NaN IAO:0000115 NaN \n", "33 WARNING NaN IAO:0000115 NaN \n", "34 ERROR NaN IAO:0000115 PMID:9999999999999 \n", "35 ERROR NaN IAO:0000115 PMID:19717156 \n", "\n", " object_str source \\\n", "0 Organized structure of distinctive morphology ... NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 Any (proper) part of the cytoplasm of a single... NaN \n", "4 complete extent of cell cortex NaN \n", "5 underlies some some region of the plasma membrane NaN \n", "6 The part of a cell encompassing the cell corte... NaN \n", "7 is one of the two lipid bilayers of an organel... NaN \n", "8 Organized structure of distinctive morphology ... 
NaN \n", "9 NaN NaN \n", "10 NaN NaN \n", "11 A double membrane structure enclosing an organ... NaN \n", "12 A multilayered structure surrounding all or pa... NaN \n", "13 A membrane that is a (regional) part of the pl... NaN \n", "14 NaN NaN \n", "15 A component of a cell contained within (but no... NaN \n", "16 fake definition to test retracted typo in refe... NaN \n", "17 Organized structure of distinctive morphology ... NaN \n", "18 NaN NaN \n", "19 region of a cell NaN \n", "20 lies just beneath the plasma membrane and ofte... NaN \n", "21 The structure in a plant cell that is known as... NaN \n", "22 fake definition to test retracted reference NaN \n", "23 A location, relative to cellular compartments ... NaN \n", "24 A membrane-bounded organelle of eukaryotic cel... NaN \n", "25 A lipid bilayer along with all the proteins an... NaN \n", "26 A part of a cellular organism that is either a... NaN \n", "27 A double lipid bilayer that is part of the nuc... NaN \n", "28 The membrane surrounding a cell that separates... NaN \n", "29 NaN NaN \n", "30 envelope NaN \n", "31 NaN NaN \n", "32 A membrane enriched in complexes formed of rea... NaN \n", "33 Organized structure of distinctive morphology ... NaN \n", "34 fake definition to test retracted typo in refe... NaN \n", "35 NaN NaN \n", "\n", " info \n", "0 Cannot parse genus and differentia \n", "1 Logical definition element not found in text: ... \n", "2 Logical definition element not found in text: ... \n", "3 Cannot parse genus and differentia \n", "4 Did not match whole text: cell cortex < comple... \n", "5 Wrong position, 'cell cortex' not in 'underlie... \n", "6 Cannot parse genus and differentia \n", "7 Logical definition element not found in text: ... \n", "8 Cannot parse genus and differentia \n", "9 Logical definition element not found in text: ... \n", "10 Logical definition element not found in text: ... 
\n", "11 Cannot parse genus and differentia \n", "12 Cannot parse genus and differentia \n", "13 No problems with definition \n", "14 Missing text definition \n", "15 Cannot parse genus and differentia \n", "16 Cannot parse genus and differentia \n", "17 Cannot parse genus and differentia \n", "18 Logical definition element not found in text: ... \n", "19 Logical definition element not found in text: ... \n", "20 Logical definition element not found in text: ... \n", "21 Circular, thylakoid (GO:0009579 in definition \n", "22 Cannot parse genus and differentia \n", "23 Cannot parse genus and differentia \n", "24 No problems with definition \n", "25 Cannot parse genus and differentia \n", "26 No problems with definition \n", "27 No problems with definition \n", "28 No problems with definition \n", "29 Definiendum should not appear at the start \n", "30 Logical definition element not found in text: ... \n", "31 Definiendum should not appear at the start \n", "32 No problems with definition \n", "33 Cannot parse genus and differentia \n", "34 publication not found: PMID:9999999999999 \n", "35 publication is retracted: A role for plasma tr... " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"output/validate-definitions.output.tsv\", sep=\"\\t\")\n", "df" ] }, { "cell_type": "markdown", "id": "f4209133-fd5c-4ecd-a0c4-a5dc4cb8a57a", "metadata": {}, "source": [ "The rows conform to ValidationResults in the [OAK ontology-metadata](https://w3id.org/oak/ontology-metadata/) data model.\n", "\n", "The values of the `type` field come from the [DefinitionConstraintComponent](https://w3id.org/oak/ontology-metadata/DefinitionConstraintComponent) enumeration.\n", "\n", "These values are in turn modeled on the taxonomy in Seppälä, Ruttenberg, and Smith, [Guidelines for writing definitions in ontologies](https://philpapers.org/archive/SEPGFW.pdf)."
] }, { "cell_type": "code", "execution_count": 19, "id": "421c556c-df3e-4281-914b-613e3d467036", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:50:30.958784Z", "start_time": "2024-04-15T00:50:30.954200Z" } }, "outputs": [ { "data": { "text/plain": [ "array(['oaklib.om:DCC#S3', 'oaklib.om:DCC#S11', 'oaklib.om:DCC#Any',\n", " 'oaklib.om:DCC#S0', 'oaklib.om:DCC#S7', 'oaklib.om:DCC#S1',\n", " 'oaklib.om:DCC#S20.1', 'oaklib.om:DCC#S20.2'], dtype=object)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"type\"].unique()" ] }, { "cell_type": "code", "execution_count": 20, "id": "aea2cfe0-70bf-4b76-89e2-2bfdbdd3a084", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:50:30.966660Z", "start_time": "2024-04-15T00:50:30.962252Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typecounts
0oaklib.om:DCC#Any6
1oaklib.om:DCC#S01
2oaklib.om:DCC#S12
3oaklib.om:DCC#S1110
4oaklib.om:DCC#S20.11
5oaklib.om:DCC#S20.21
6oaklib.om:DCC#S314
7oaklib.om:DCC#S71
\n", "
" ], "text/plain": [ " type counts\n", "0 oaklib.om:DCC#Any 6\n", "1 oaklib.om:DCC#S0 1\n", "2 oaklib.om:DCC#S1 2\n", "3 oaklib.om:DCC#S11 10\n", "4 oaklib.om:DCC#S20.1 1\n", "5 oaklib.om:DCC#S20.2 1\n", "6 oaklib.om:DCC#S3 14\n", "7 oaklib.om:DCC#S7 1" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"type\").size().reset_index(name='counts')" ] }, { "cell_type": "markdown", "id": "f28d70f482239b30", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "Next we'll filter out less informative columns" ] }, { "cell_type": "code", "execution_count": 21, "id": "c1df05dd32082e69", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:50:30.994801Z", "start_time": "2024-04-15T00:50:30.971926Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typesubjectsubject_labelobject_strinfo
0oaklib.om:DCC#S3GO:0043231intracellular membrane-bounded organelleOrganized structure of distinctive morphology ...Cannot parse genus and differentia
1oaklib.om:DCC#S11GO:0043231intracellular membrane-bounded organelleNaNLogical definition element not found in text: ...
2oaklib.om:DCC#S11GO:0043231intracellular membrane-bounded organelleNaNLogical definition element not found in text: ...
3oaklib.om:DCC#S3GO:0099568cytoplasmic regionAny (proper) part of the cytoplasm of a single...Cannot parse genus and differentia
4oaklib.om:DCC#S3GO:0099738cell cortex regioncomplete extent of cell cortexDid not match whole text: cell cortex < comple...
5oaklib.om:DCC#S11GO:0099738cell cortex regionunderlies some some region of the plasma membraneWrong position, 'cell cortex' not in 'underlie...
6oaklib.om:DCC#S3GO:0071944cell peripheryThe part of a cell encompassing the cell corte...Cannot parse genus and differentia
7oaklib.om:DCC#S11GO:0031090organelle membraneis one of the two lipid bilayers of an organel...Logical definition element not found in text: ...
8oaklib.om:DCC#S3GO:0043229intracellular organelleOrganized structure of distinctive morphology ...Cannot parse genus and differentia
9oaklib.om:DCC#S11GO:0043229intracellular organelleNaNLogical definition element not found in text: ...
10oaklib.om:DCC#S11GO:0043229intracellular organelleNaNLogical definition element not found in text: ...
11oaklib.om:DCC#S3GO:0031967organelle envelopeA double membrane structure enclosing an organ...Cannot parse genus and differentia
12oaklib.om:DCC#S3GO:0031975envelopeA multilayered structure surrounding all or pa...Cannot parse genus and differentia
13oaklib.om:DCC#AnyGO:0098590plasma membrane regionA membrane that is a (regional) part of the pl...No problems with definition
14oaklib.om:DCC#S0GO:0012505endomembrane systemNaNMissing text definition
15oaklib.om:DCC#S3GO:0005622intracellular anatomical structureA component of a cell contained within (but no...Cannot parse genus and differentia
16oaklib.om:DCC#S3GO:9999998fake term for testing pmid typefake definition to test retracted typo in refe...Cannot parse genus and differentia
17oaklib.om:DCC#S3GO:0043227membrane-bounded organelleOrganized structure of distinctive morphology ...Cannot parse genus and differentia
18oaklib.om:DCC#S11GO:0043227membrane-bounded organelleNaNLogical definition element not found in text: ...
19oaklib.om:DCC#S11GO:0005938cell cortexregion of a cellLogical definition element not found in text: ...
20oaklib.om:DCC#S11GO:0005938cell cortexlies just beneath the plasma membrane and ofte...Logical definition element not found in text: ...
21oaklib.om:DCC#S7GO:0009579thylakoidThe structure in a plant cell that is known as...Circular, thylakoid (GO:0009579 in definition
22oaklib.om:DCC#S3GO:9999999fake term for testing retractionfake definition to test retracted referenceCannot parse genus and differentia
23oaklib.om:DCC#S3GO:0005575cellular_componentA location, relative to cellular compartments ...Cannot parse genus and differentia
24oaklib.om:DCC#AnyGO:0005634nucleusA membrane-bounded organelle of eukaryotic cel...No problems with definition
25oaklib.om:DCC#S3GO:0016020membraneA lipid bilayer along with all the proteins an...Cannot parse genus and differentia
26oaklib.om:DCC#AnyGO:0110165cellular anatomical entityA part of a cellular organism that is either a...No problems with definition
27oaklib.om:DCC#AnyGO:0005635nuclear envelopeA double lipid bilayer that is part of the nuc...No problems with definition
28oaklib.om:DCC#AnyGO:0005886plasma membraneThe membrane surrounding a cell that separates...No problems with definition
29oaklib.om:DCC#S1GO:0005773vacuoleNaNDefiniendum should not appear at the start
30oaklib.om:DCC#S11GO:0031965nuclear membraneenvelopeLogical definition element not found in text: ...
31oaklib.om:DCC#S1GO:0005737cytoplasmNaNDefiniendum should not appear at the start
32oaklib.om:DCC#AnyGO:0034357photosynthetic membraneA membrane enriched in complexes formed of rea...No problems with definition
33oaklib.om:DCC#S3GO:0043226organelleOrganized structure of distinctive morphology ...Cannot parse genus and differentia
34oaklib.om:DCC#S20.1GO:9999998fake term for testing pmid typefake definition to test retracted typo in refe...publication not found: PMID:9999999999999
35oaklib.om:DCC#S20.2GO:9999999fake term for testing retractionNaNpublication is retracted: A role for plasma tr...
\n", "
" ], "text/plain": [ " type subject subject_label \\\n", "0 oaklib.om:DCC#S3 GO:0043231 intracellular membrane-bounded organelle \n", "1 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle \n", "2 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle \n", "3 oaklib.om:DCC#S3 GO:0099568 cytoplasmic region \n", "4 oaklib.om:DCC#S3 GO:0099738 cell cortex region \n", "5 oaklib.om:DCC#S11 GO:0099738 cell cortex region \n", "6 oaklib.om:DCC#S3 GO:0071944 cell periphery \n", "7 oaklib.om:DCC#S11 GO:0031090 organelle membrane \n", "8 oaklib.om:DCC#S3 GO:0043229 intracellular organelle \n", "9 oaklib.om:DCC#S11 GO:0043229 intracellular organelle \n", "10 oaklib.om:DCC#S11 GO:0043229 intracellular organelle \n", "11 oaklib.om:DCC#S3 GO:0031967 organelle envelope \n", "12 oaklib.om:DCC#S3 GO:0031975 envelope \n", "13 oaklib.om:DCC#Any GO:0098590 plasma membrane region \n", "14 oaklib.om:DCC#S0 GO:0012505 endomembrane system \n", "15 oaklib.om:DCC#S3 GO:0005622 intracellular anatomical structure \n", "16 oaklib.om:DCC#S3 GO:9999998 fake term for testing pmid type \n", "17 oaklib.om:DCC#S3 GO:0043227 membrane-bounded organelle \n", "18 oaklib.om:DCC#S11 GO:0043227 membrane-bounded organelle \n", "19 oaklib.om:DCC#S11 GO:0005938 cell cortex \n", "20 oaklib.om:DCC#S11 GO:0005938 cell cortex \n", "21 oaklib.om:DCC#S7 GO:0009579 thylakoid \n", "22 oaklib.om:DCC#S3 GO:9999999 fake term for testing retraction \n", "23 oaklib.om:DCC#S3 GO:0005575 cellular_component \n", "24 oaklib.om:DCC#Any GO:0005634 nucleus \n", "25 oaklib.om:DCC#S3 GO:0016020 membrane \n", "26 oaklib.om:DCC#Any GO:0110165 cellular anatomical entity \n", "27 oaklib.om:DCC#Any GO:0005635 nuclear envelope \n", "28 oaklib.om:DCC#Any GO:0005886 plasma membrane \n", "29 oaklib.om:DCC#S1 GO:0005773 vacuole \n", "30 oaklib.om:DCC#S11 GO:0031965 nuclear membrane \n", "31 oaklib.om:DCC#S1 GO:0005737 cytoplasm \n", "32 oaklib.om:DCC#Any GO:0034357 photosynthetic membrane \n", "33 
oaklib.om:DCC#S3 GO:0043226 organelle \n", "34 oaklib.om:DCC#S20.1 GO:9999998 fake term for testing pmid type \n", "35 oaklib.om:DCC#S20.2 GO:9999999 fake term for testing retraction \n", "\n", " object_str \\\n", "0 Organized structure of distinctive morphology ... \n", "1 NaN \n", "2 NaN \n", "3 Any (proper) part of the cytoplasm of a single... \n", "4 complete extent of cell cortex \n", "5 underlies some some region of the plasma membrane \n", "6 The part of a cell encompassing the cell corte... \n", "7 is one of the two lipid bilayers of an organel... \n", "8 Organized structure of distinctive morphology ... \n", "9 NaN \n", "10 NaN \n", "11 A double membrane structure enclosing an organ... \n", "12 A multilayered structure surrounding all or pa... \n", "13 A membrane that is a (regional) part of the pl... \n", "14 NaN \n", "15 A component of a cell contained within (but no... \n", "16 fake definition to test retracted typo in refe... \n", "17 Organized structure of distinctive morphology ... \n", "18 NaN \n", "19 region of a cell \n", "20 lies just beneath the plasma membrane and ofte... \n", "21 The structure in a plant cell that is known as... \n", "22 fake definition to test retracted reference \n", "23 A location, relative to cellular compartments ... \n", "24 A membrane-bounded organelle of eukaryotic cel... \n", "25 A lipid bilayer along with all the proteins an... \n", "26 A part of a cellular organism that is either a... \n", "27 A double lipid bilayer that is part of the nuc... \n", "28 The membrane surrounding a cell that separates... \n", "29 NaN \n", "30 envelope \n", "31 NaN \n", "32 A membrane enriched in complexes formed of rea... \n", "33 Organized structure of distinctive morphology ... \n", "34 fake definition to test retracted typo in refe... \n", "35 NaN \n", "\n", " info \n", "0 Cannot parse genus and differentia \n", "1 Logical definition element not found in text: ... \n", "2 Logical definition element not found in text: ... 
\n", "3 Cannot parse genus and differentia \n", "4 Did not match whole text: cell cortex < comple... \n", "5 Wrong position, 'cell cortex' not in 'underlie... \n", "6 Cannot parse genus and differentia \n", "7 Logical definition element not found in text: ... \n", "8 Cannot parse genus and differentia \n", "9 Logical definition element not found in text: ... \n", "10 Logical definition element not found in text: ... \n", "11 Cannot parse genus and differentia \n", "12 Cannot parse genus and differentia \n", "13 No problems with definition \n", "14 Missing text definition \n", "15 Cannot parse genus and differentia \n", "16 Cannot parse genus and differentia \n", "17 Cannot parse genus and differentia \n", "18 Logical definition element not found in text: ... \n", "19 Logical definition element not found in text: ... \n", "20 Logical definition element not found in text: ... \n", "21 Circular, thylakoid (GO:0009579 in definition \n", "22 Cannot parse genus and differentia \n", "23 Cannot parse genus and differentia \n", "24 No problems with definition \n", "25 Cannot parse genus and differentia \n", "26 No problems with definition \n", "27 No problems with definition \n", "28 No problems with definition \n", "29 Definiendum should not appear at the start \n", "30 Logical definition element not found in text: ... \n", "31 Definiendum should not appear at the start \n", "32 No problems with definition \n", "33 Cannot parse genus and differentia \n", "34 publication not found: PMID:9999999999999 \n", "35 publication is retracted: A role for plasma tr... 
" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df[[\"type\", \"subject\", \"subject_label\", \"object_str\", \"info\"]]\n", "df" ] }, { "cell_type": "markdown", "id": "8ad6ef24d0daf11f", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Missing Definitions\n", "\n", "This is the most trivial way to fail a definition check - not to include one. We can see all the missing definitions:\n" ] }, { "cell_type": "code", "execution_count": 22, "id": "381e7c7da587668e", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:50:31.048081Z", "start_time": "2024-04-15T00:50:30.979466Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>type</th><th>subject</th><th>subject_label</th><th>object_str</th><th>info</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>14</th><td>oaklib.om:DCC#S0</td><td>GO:0012505</td><td>endomembrane system</td><td>NaN</td><td>Missing text definition</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " type subject subject_label object_str \\\n", "14 oaklib.om:DCC#S0 GO:0012505 endomembrane system NaN \n", "\n", " info \n", "14 Missing text definition " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"type\"] == \"oaklib.om:DCC#S0\"]\n" ] }, { "cell_type": "markdown", "id": "f8844c7876451383", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "Of course, in the real ontology this term has a definition." ] }, { "cell_type": "markdown", "id": "c098cdf7a5665add", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Non genus-differentia structure\n", "\n", "The OAK `validate-definitions` command follows [SRS](https://philpapers.org/archive/SEPGFW.pdf) and assumes that good definitions follow a genus-differentia structure.\n", "\n", "We can list the definitions that fail this check (S3):" ] }, { "cell_type": "code", "execution_count": 23, "id": "9cf1490c83491596", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:50:31.052182Z", "start_time": "2024-04-15T00:50:30.987744Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>type</th><th>subject</th><th>subject_label</th><th>object_str</th><th>info</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>oaklib.om:DCC#S3</td><td>GO:0043231</td><td>intracellular membrane-bounded organelle</td><td>Organized structure of distinctive morphology ...</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>3</th><td>oaklib.om:DCC#S3</td><td>GO:0099568</td><td>cytoplasmic region</td><td>Any (proper) part of the cytoplasm of a single...</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>4</th><td>oaklib.om:DCC#S3</td><td>GO:0099738</td><td>cell cortex region</td><td>complete extent of cell cortex</td><td>Did not match whole text: cell cortex &lt; comple...</td></tr>\n",
" <tr><th>6</th><td>oaklib.om:DCC#S3</td><td>GO:0071944</td><td>cell periphery</td><td>The part of a cell encompassing the cell corte...</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>8</th><td>oaklib.om:DCC#S3</td><td>GO:0043229</td><td>intracellular organelle</td><td>Organized structure of distinctive morphology ...</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>11</th><td>oaklib.om:DCC#S3</td><td>GO:0031967</td><td>organelle envelope</td><td>A double membrane structure enclosing an organ...</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>12</th><td>oaklib.om:DCC#S3</td><td>GO:0031975</td><td>envelope</td><td>A multilayered structure surrounding all or pa...</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>15</th><td>oaklib.om:DCC#S3</td><td>GO:0005622</td><td>intracellular anatomical structure</td><td>A component of a cell contained within (but no...</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>16</th><td>oaklib.om:DCC#S3</td><td>GO:9999998</td><td>fake term for testing pmid type</td><td>fake definition to test retracted typo in refe...</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>17</th><td>oaklib.om:DCC#S3</td><td>GO:0043227</td><td>membrane-bounded organelle</td><td>Organized structure of distinctive morphology ...</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>22</th><td>oaklib.om:DCC#S3</td><td>GO:9999999</td><td>fake term for testing retraction</td><td>fake definition to test retracted reference</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>23</th><td>oaklib.om:DCC#S3</td><td>GO:0005575</td><td>cellular_component</td><td>A location, relative to cellular compartments ...</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>25</th><td>oaklib.om:DCC#S3</td><td>GO:0016020</td><td>membrane</td><td>A lipid bilayer along with all the proteins an...</td><td>Cannot parse genus and differentia</td></tr>\n",
" <tr><th>33</th><td>oaklib.om:DCC#S3</td><td>GO:0043226</td><td>organelle</td><td>Organized structure of distinctive morphology ...</td><td>Cannot parse genus and differentia</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " type subject subject_label \\\n", "0 oaklib.om:DCC#S3 GO:0043231 intracellular membrane-bounded organelle \n", "3 oaklib.om:DCC#S3 GO:0099568 cytoplasmic region \n", "4 oaklib.om:DCC#S3 GO:0099738 cell cortex region \n", "6 oaklib.om:DCC#S3 GO:0071944 cell periphery \n", "8 oaklib.om:DCC#S3 GO:0043229 intracellular organelle \n", "11 oaklib.om:DCC#S3 GO:0031967 organelle envelope \n", "12 oaklib.om:DCC#S3 GO:0031975 envelope \n", "15 oaklib.om:DCC#S3 GO:0005622 intracellular anatomical structure \n", "16 oaklib.om:DCC#S3 GO:9999998 fake term for testing pmid type \n", "17 oaklib.om:DCC#S3 GO:0043227 membrane-bounded organelle \n", "22 oaklib.om:DCC#S3 GO:9999999 fake term for testing retraction \n", "23 oaklib.om:DCC#S3 GO:0005575 cellular_component \n", "25 oaklib.om:DCC#S3 GO:0016020 membrane \n", "33 oaklib.om:DCC#S3 GO:0043226 organelle \n", "\n", " object_str \\\n", "0 Organized structure of distinctive morphology ... \n", "3 Any (proper) part of the cytoplasm of a single... \n", "4 complete extent of cell cortex \n", "6 The part of a cell encompassing the cell corte... \n", "8 Organized structure of distinctive morphology ... \n", "11 A double membrane structure enclosing an organ... \n", "12 A multilayered structure surrounding all or pa... \n", "15 A component of a cell contained within (but no... \n", "16 fake definition to test retracted typo in refe... \n", "17 Organized structure of distinctive morphology ... \n", "22 fake definition to test retracted reference \n", "23 A location, relative to cellular compartments ... \n", "25 A lipid bilayer along with all the proteins an... \n", "33 Organized structure of distinctive morphology ... \n", "\n", " info \n", "0 Cannot parse genus and differentia \n", "3 Cannot parse genus and differentia \n", "4 Did not match whole text: cell cortex < comple... 
\n", "6 Cannot parse genus and differentia \n", "8 Cannot parse genus and differentia \n", "11 Cannot parse genus and differentia \n", "12 Cannot parse genus and differentia \n", "15 Cannot parse genus and differentia \n", "16 Cannot parse genus and differentia \n", "17 Cannot parse genus and differentia \n", "22 Cannot parse genus and differentia \n", "23 Cannot parse genus and differentia \n", "25 Cannot parse genus and differentia \n", "33 Cannot parse genus and differentia " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"type\"] == \"oaklib.om:DCC#S3\"]" ] }, { "cell_type": "markdown", "id": "27f9e7b747b071de", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "Many of these are actual definitions rather than ones manipulated for test purposes.\n", "\n", "There is room for valid disagreement about whether rewriting some of these following genus-differentia form would improve things for either users or annotators. Arguably at least the subtypes of organelle could simply state how they are differentiated from organelles in general rather than repeating the somewhat wordy _\"Organized structure of distinctive morphology...\"_" ] }, { "cell_type": "markdown", "id": "c56d3a9c531e5a09", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Circular definitions" ] }, { "cell_type": "code", "execution_count": 24, "id": "adcbad5fae63e7fb", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:50:31.052559Z", "start_time": "2024-04-15T00:50:30.994899Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>type</th><th>subject</th><th>subject_label</th><th>object_str</th><th>info</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>21</th><td>oaklib.om:DCC#S7</td><td>GO:0009579</td><td>thylakoid</td><td>The structure in a plant cell that is known as...</td><td>Circular, thylakoid (GO:0009579 in definition</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " type subject subject_label \\\n", "21 oaklib.om:DCC#S7 GO:0009579 thylakoid \n", "\n", " object_str \\\n", "21 The structure in a plant cell that is known as... \n", "\n", " info \n", "21 Circular, thylakoid (GO:0009579 in definition " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"type\"] == \"oaklib.om:DCC#S7\"]" ] }, { "cell_type": "markdown", "id": "34eb55cf06afa332", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Not following convention" ] }, { "cell_type": "code", "execution_count": 25, "id": "cf4d18796842b46", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:50:31.062863Z", "start_time": "2024-04-15T00:50:31.004181Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>type</th><th>subject</th><th>subject_label</th><th>object_str</th><th>info</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>29</th><td>oaklib.om:DCC#S1</td><td>GO:0005773</td><td>vacuole</td><td>NaN</td><td>Definiendum should not appear at the start</td></tr>\n",
" <tr><th>31</th><td>oaklib.om:DCC#S1</td><td>GO:0005737</td><td>cytoplasm</td><td>NaN</td><td>Definiendum should not appear at the start</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " type subject subject_label object_str \\\n", "29 oaklib.om:DCC#S1 GO:0005773 vacuole NaN \n", "31 oaklib.om:DCC#S1 GO:0005737 cytoplasm NaN \n", "\n", " info \n", "29 Definiendum should not appear at the start \n", "31 Definiendum should not appear at the start " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"type\"] == \"oaklib.om:DCC#S1\"]" ] }, { "cell_type": "markdown", "id": "4c5189bd46804bd8", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Definition Reference Issues\n", "\n", "### Typos in PMIDs\n" ] }, { "cell_type": "code", "execution_count": 26, "id": "35e1f10deba2c6c9", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:51:38.780848Z", "start_time": "2024-04-15T00:51:38.770256Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>type</th><th>subject</th><th>subject_label</th><th>object_str</th><th>info</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>34</th><td>oaklib.om:DCC#S20.1</td><td>GO:9999998</td><td>fake term for testing pmid type</td><td>fake definition to test retracted typo in refe...</td><td>publication not found: PMID:9999999999999</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " type subject subject_label \\\n", "34 oaklib.om:DCC#S20.1 GO:9999998 fake term for testing pmid type \n", "\n", " object_str \\\n", "34 fake definition to test retracted typo in refe... \n", "\n", " info \n", "34 publication not found: PMID:9999999999999 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"type\"] == \"oaklib.om:DCC#S20.1\"]\n" ] }, { "cell_type": "markdown", "id": "7a288d8fc507acc4", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "### Retracted publications" ] }, { "cell_type": "code", "execution_count": 27, "id": "f5245d99ab0864d5", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T00:52:02.693591Z", "start_time": "2024-04-15T00:52:02.687692Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>type</th><th>subject</th><th>subject_label</th><th>object_str</th><th>info</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>35</th><td>oaklib.om:DCC#S20.2</td><td>GO:9999999</td><td>fake term for testing retraction</td><td>NaN</td><td>publication is retracted: A role for plasma tr...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " type subject subject_label \\\n", "35 oaklib.om:DCC#S20.2 GO:9999999 fake term for testing retraction \n", "\n", " object_str info \n", "35 NaN publication is retracted: A role for plasma tr... " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"type\"] == \"oaklib.om:DCC#S20.2\"]\n" ] }, { "cell_type": "markdown", "id": "7e8d97bc6e6c20b0", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Using LLMs to validate definitions\n", "\n", "For this example we will use an LLM to validate this GO catalytic activity:\n", "\n", "```yaml\n", "[Term]\n", "id: GO:0000010\n", "name: trans-hexaprenyltranstransferase activity\n", "namespace: molecular_function\n", "alt_id: GO:0036422\n", "def: \"Catalysis of the reaction: (2E,6E)-farnesyl diphosphate + 4 isopentenyl diphosphate = 4 diphosphate + all-trans-heptaprenyl diphosphate.\" [PMID:9708911, RHEA:27794]\n", "synonym: \"all-trans-heptaprenyl-diphosphate synthase activity\" RELATED [EC:2.5.1.30]\n", "synonym: \"HepPP synthase activity\" RELATED [EC:2.5.1.30]\n", "synonym: \"heptaprenyl diphosphate synthase activity\" RELATED []\n", "synonym: \"heptaprenyl pyrophosphate synthase activity\" RELATED [EC:2.5.1.30]\n", "synonym: \"heptaprenyl pyrophosphate synthetase activity\" RELATED [EC:2.5.1.30]\n", "xref: EC:2.5.1.30\n", "xref: MetaCyc:TRANS-HEXAPRENYLTRANSTRANSFERASE-RXN\n", "xref: RHEA:27794\n", "```\n", "\n", "There are two references for this:\n", "\n", " - the publication [PMID:9708911](https://pubmed.ncbi.nlm.nih.gov/9708911/)\n", " - the RHEA reaction [RHEA:27794](https://www.rhea-db.org/reaction?id=27794)" ] }, { "cell_type": "code", "execution_count": 13, "id": "4e29eb9d8ff5df4c", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T01:00:28.475900Z", "start_time": "2024-04-15T01:00:13.437742Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "!runoak 
--stacktrace -i llm:{claude-3-opus}:simpleobo:input/validate-defs-test.obo validate-definitions -C input/validate-definition-conf.yaml GO:0000010 -O yaml -o output/validate-definitions.llm.yaml" ] }, { "cell_type": "code", "execution_count": 14, "id": "69f6da5532285cf9", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T01:01:41.771699Z", "start_time": "2024-04-15T01:01:41.744373Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import yaml\n", "\n", "# Load the YAML report written by the previous command\n", "with open(\"output/validate-definitions.llm.yaml\") as f:\n", "    report = yaml.safe_load(f)" ] }, { "cell_type": "code", "execution_count": 15, "id": "b35f8ffab12b1b6b", "metadata": { "ExecuteTime": { "end_time": "2024-04-15T01:09:34.475682Z", "start_time": "2024-04-15T01:09:34.465369Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "type: https://w3id.org/oak/ontology-metadata/DCC.S20\n", "subject: GO:0000010\n", "severity: INFO\n", "predicate: IAO:0000115\n", "object_str: \n", " id: PMID:9708911\n", " title: Biological significance of the side chain length of ubiquinone in Saccharomyces\n", " cerevisiae.\n", " abstract: Ubiquinone (UQ), an important component of the electron transfer system,\n", " is constituted of a quinone structure and a side chain isoprenoid. The side chain\n", " length of UQ differs between microorganisms, and this difference has been used for\n", " taxonomic study. In this study, we have addressed the importance of the length of\n", " the side chain of UQ for cells, and examined the effect of chain length by producing\n", " UQs with isoprenoid chain lengths between 5 and 10 in Saccharomyces cerevisiae.\n", " To make the different UQ species, different types of prenyl diphosphate synthases\n", " were expressed in a S. cerevisiae COQ1 mutant defective for hexaprenyl diphosphate\n", " synthesis. 
As a result, we found that the original species of UQ (in this case UQ-6)\n", " had maximum functionality. However, we found that other species of UQ could replace\n", " UQ-6. Thus a broad spectrum of different UQ species are biologically functional\n", " in yeast cells, although cells seem to display a preference for their own particular\n", " type of UQ.\n", " publication_type: Research Support, Non-U.S. Gov't\n", " \n", "\n", "info: \n", " The term \"trans-hexaprenyltranstransferase activity\" has a HIGH level of alignment with the cited reference PMID:9708911. The abstract supports the definition well, as evidenced by these key points:\n", " \n", " 1. The study examines the importance of the side chain length of ubiquinone (UQ) in Saccharomyces cerevisiae, which directly relates to the activity of trans-hexaprenyltranstransferase.\n", " \n", " 2. The abstract mentions \"hexaprenyl diphosphate synthesis\" in S. cerevisiae, which is the product of trans-hexaprenyltranstransferase activity.\n", " \n", " 3. The study found that the original species of UQ (UQ-6) had maximum functionality in yeast cells, suggesting a preference for the hexaprenyl side chain length produced by trans-hexaprenyltranstransferase.\n", " \n", " No sections of the abstract misalign with or contradict the term definition. 
The definition is appropriately specific, focusing on the enzyme's activity without providing additional details about its structure or cellular role.\n", "\n", "definition: \n", " Catalysis of the reaction: (2E,6E)-farnesyl diphosphate + 4 isopentenyl diphosphate = 4 diphosphate + all-trans-heptaprenyl diphosphate.\n", "\n", "definition_source: PMID:9708911\n" ] } ], "source": [ "# Print each report field; long string values go on their own indented block\n", "for k, v in report.items():\n", "    if isinstance(v, str) and len(v) > 50:\n", "        lines = v.split(\"\\n\")\n", "        lines = [f\" {line}\" for line in lines]\n", "        lines = [\"\"] + lines + [\"\"]\n", "        v = \"\\n\".join(lines)\n", "    print(f\"{k}: {v}\")" ] }, { "cell_type": "markdown", "id": "233f8a645b3517f2", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "__COMMENTARY__\n", "\n", "Note that because this is an LLM, the output can differ on every run.\n", "\n", "In some runs the LLM fails to recognize that the paper is indeed about trans-hexaprenyltranstransferase activity. Even then the output is useful, because it shows us that the abstract is not directly about this activity." ] }, { "cell_type": "code", "execution_count": null, "id": "df16f8ef-a274-4c8c-a1a5-bbef76597842", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 5 }