{
"cells": [
{
"cell_type": "markdown",
"id": "0a28b88d-4deb-4d0a-a110-f27adf077e23",
"metadata": {},
"source": [
"# OAK validate-definitions command\n",
"\n",
"This notebook is intended as a supplement to the [main OAK CLI docs](https://incatools.github.io/ontology-access-kit/cli.html).\n",
"\n",
"This notebook provides examples for the `validate-definitions` command,\n",
"which is part of a suite of *validate* commands.\n",
"\n",
"## Help Option\n",
"\n",
"You can get help on any OAK command using `--help`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "c223f678-f82f-4b06-8e19-1a5b7323e571",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:50:27.966036Z",
"start_time": "2024-04-15T00:50:25.530846Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Usage: runoak validate-definitions [OPTIONS] [TERMS]...\n",
"\n",
" Checks presence and structure of text definitions.\n",
"\n",
" To run:\n",
"\n",
" runoak validate-definitions -i db/uberon.db -o results.tsv\n",
"\n",
" By default this will apply basic text mining of text definitions to check\n",
" against machine actionable OBO text definition guideline rules. This can\n",
" result in an initial lag - to skip this, and ONLY perform checks for\n",
" *presence* of definitions, use --skip-text-annotation:\n",
"\n",
" Example: -------\n",
"\n",
" runoak validate-definitions -i db/uberon.db --skip-text-annotation\n",
"\n",
" Like most OAK commands, this accepts lists of terms or term queries as\n",
" arguments. You can pass in a CURIE list to selectively validate individual\n",
" classes\n",
"\n",
" Example: -------\n",
"\n",
" runoak validate-definitions -i db/cl.db CL:0002053\n",
"\n",
" Only on CL identifiers:\n",
"\n",
" runoak validate-definitions -i db/cl.db i^CL:\n",
"\n",
" Only on neuron hierarchy:\n",
"\n",
" runoak validate-definitions -i db/cl.db .desc//p=i neuron\n",
"\n",
" Output format:\n",
"\n",
" This command emits objects conforming to the OAK validation datamodel. See\n",
" https://incatools.github.io/ontology-access-kit/datamodels for more on OAK\n",
" datamodels.\n",
"\n",
" The default serialization of the datamodel is CSV.\n",
"\n",
" Notes: -----\n",
"\n",
" This command is largely redundant with the validate command, but is useful\n",
" for targeted validation focused solely on definitions\n",
"\n",
"Options:\n",
" --skip-text-annotation / --no-skip-text-annotation\n",
" If true, do not parse text annotations\n",
" [default: no-skip-text-annotation]\n",
" -C, --configuration-file TEXT Path to a configuration file. This is\n",
" typically a YAML file, but may be a JSON\n",
" file\n",
" --adapter-mapping TEXT Multiple prefix=selector pairs, e.g.\n",
" --adapter-mapping uberon=db/uberon.db\n",
" -O, --output-type TEXT Desired output type\n",
" -o, --output FILENAME Output file, e.g. obo file\n",
" --help Show this message and exit.\n"
]
}
],
"source": [
"!runoak validate-definitions --help"
]
},
{
"cell_type": "markdown",
"id": "01f38163-db22-4c51-ae46-10e8b8e6d53c",
"metadata": {},
"source": [
"## Example: Validation over Test Ontology\n",
"\n",
"To illustrate this command we will use a deliberately altered version of a subset of GO.\n",
"\n",
"We will restrict validation to the descendants of `cellular_component` using the query `.desc//p=i \"cellular_component\"`."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "c9b86e52-87a7-449c-baac-81981e7ce632",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:50:30.655424Z",
"start_time": "2024-04-15T00:50:27.968820Z"
}
},
"outputs": [],
"source": [
"!runoak -i simpleobo:input/validate-defs-test.obo validate-definitions -C input/validate-definition-conf.yaml .desc//p=i \"cellular_component\" -o output/validate-definitions.output.tsv"
]
},
{
"cell_type": "markdown",
"id": "27c1668fc8d1a8de",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"The output is a TSV file with a summary of the issues found.\n",
"\n",
"We can load this into a pandas dataframe for further analysis. This also has the advantage of\n",
"displaying tables nicely in Jupyter notebooks such as this one.\n",
"\n",
"If you are running this on the command line, you may prefer to use your own TSV processing tools,\n",
"or to simply load the file into Google Sheets."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "5fc9b15d-cc81-400a-8660-f92491baa120",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:50:30.953116Z",
"start_time": "2024-04-15T00:50:30.658190Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" type subject subject_label \\\n",
"0 oaklib.om:DCC#S3 GO:0043231 intracellular membrane-bounded organelle \n",
"1 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle \n",
"2 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle \n",
"3 oaklib.om:DCC#S3 GO:0099568 cytoplasmic region \n",
"4 oaklib.om:DCC#S3 GO:0099738 cell cortex region \n",
"5 oaklib.om:DCC#S11 GO:0099738 cell cortex region \n",
"6 oaklib.om:DCC#S3 GO:0071944 cell periphery \n",
"7 oaklib.om:DCC#S11 GO:0031090 organelle membrane \n",
"8 oaklib.om:DCC#S3 GO:0043229 intracellular organelle \n",
"9 oaklib.om:DCC#S11 GO:0043229 intracellular organelle \n",
"10 oaklib.om:DCC#S11 GO:0043229 intracellular organelle \n",
"11 oaklib.om:DCC#S3 GO:0031967 organelle envelope \n",
"12 oaklib.om:DCC#S3 GO:0031975 envelope \n",
"13 oaklib.om:DCC#Any GO:0098590 plasma membrane region \n",
"14 oaklib.om:DCC#S0 GO:0012505 endomembrane system \n",
"15 oaklib.om:DCC#S3 GO:0005622 intracellular anatomical structure \n",
"16 oaklib.om:DCC#S3 GO:9999998 fake term for testing pmid type \n",
"17 oaklib.om:DCC#S3 GO:0043227 membrane-bounded organelle \n",
"18 oaklib.om:DCC#S11 GO:0043227 membrane-bounded organelle \n",
"19 oaklib.om:DCC#S11 GO:0005938 cell cortex \n",
"20 oaklib.om:DCC#S11 GO:0005938 cell cortex \n",
"21 oaklib.om:DCC#S7 GO:0009579 thylakoid \n",
"22 oaklib.om:DCC#S3 GO:9999999 fake term for testing retraction \n",
"23 oaklib.om:DCC#S3 GO:0005575 cellular_component \n",
"24 oaklib.om:DCC#Any GO:0005634 nucleus \n",
"25 oaklib.om:DCC#S3 GO:0016020 membrane \n",
"26 oaklib.om:DCC#Any GO:0110165 cellular anatomical entity \n",
"27 oaklib.om:DCC#Any GO:0005635 nuclear envelope \n",
"28 oaklib.om:DCC#Any GO:0005886 plasma membrane \n",
"29 oaklib.om:DCC#S1 GO:0005773 vacuole \n",
"30 oaklib.om:DCC#S11 GO:0031965 nuclear membrane \n",
"31 oaklib.om:DCC#S1 GO:0005737 cytoplasm \n",
"32 oaklib.om:DCC#Any GO:0034357 photosynthetic membrane \n",
"33 oaklib.om:DCC#S3 GO:0043226 organelle \n",
"34 oaklib.om:DCC#S20.1 GO:9999998 fake term for testing pmid type \n",
"35 oaklib.om:DCC#S20.2 GO:9999999 fake term for testing retraction \n",
"\n",
" severity instantiates predicate object \\\n",
"0 WARNING NaN IAO:0000115 NaN \n",
"1 NaN NaN IAO:0000115 NaN \n",
"2 NaN NaN IAO:0000115 NaN \n",
"3 WARNING NaN IAO:0000115 NaN \n",
"4 NaN NaN IAO:0000115 NaN \n",
"5 NaN NaN IAO:0000115 NaN \n",
"6 WARNING NaN IAO:0000115 NaN \n",
"7 NaN NaN IAO:0000115 NaN \n",
"8 WARNING NaN IAO:0000115 NaN \n",
"9 NaN NaN IAO:0000115 NaN \n",
"10 NaN NaN IAO:0000115 NaN \n",
"11 WARNING NaN IAO:0000115 NaN \n",
"12 WARNING NaN IAO:0000115 NaN \n",
"13 INFO NaN IAO:0000115 NaN \n",
"14 ERROR NaN IAO:0000115 NaN \n",
"15 WARNING NaN IAO:0000115 NaN \n",
"16 WARNING NaN IAO:0000115 NaN \n",
"17 WARNING NaN IAO:0000115 NaN \n",
"18 NaN NaN IAO:0000115 NaN \n",
"19 NaN NaN IAO:0000115 NaN \n",
"20 NaN NaN IAO:0000115 NaN \n",
"21 NaN NaN IAO:0000115 NaN \n",
"22 WARNING NaN IAO:0000115 NaN \n",
"23 WARNING NaN IAO:0000115 NaN \n",
"24 INFO NaN IAO:0000115 NaN \n",
"25 WARNING NaN IAO:0000115 NaN \n",
"26 INFO NaN IAO:0000115 NaN \n",
"27 INFO NaN IAO:0000115 NaN \n",
"28 INFO NaN IAO:0000115 NaN \n",
"29 NaN NaN IAO:0000115 NaN \n",
"30 NaN NaN IAO:0000115 NaN \n",
"31 NaN NaN IAO:0000115 NaN \n",
"32 INFO NaN IAO:0000115 NaN \n",
"33 WARNING NaN IAO:0000115 NaN \n",
"34 ERROR NaN IAO:0000115 PMID:9999999999999 \n",
"35 ERROR NaN IAO:0000115 PMID:19717156 \n",
"\n",
" object_str source \\\n",
"0 Organized structure of distinctive morphology ... NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 Any (proper) part of the cytoplasm of a single... NaN \n",
"4 complete extent of cell cortex NaN \n",
"5 underlies some some region of the plasma membrane NaN \n",
"6 The part of a cell encompassing the cell corte... NaN \n",
"7 is one of the two lipid bilayers of an organel... NaN \n",
"8 Organized structure of distinctive morphology ... NaN \n",
"9 NaN NaN \n",
"10 NaN NaN \n",
"11 A double membrane structure enclosing an organ... NaN \n",
"12 A multilayered structure surrounding all or pa... NaN \n",
"13 A membrane that is a (regional) part of the pl... NaN \n",
"14 NaN NaN \n",
"15 A component of a cell contained within (but no... NaN \n",
"16 fake definition to test retracted typo in refe... NaN \n",
"17 Organized structure of distinctive morphology ... NaN \n",
"18 NaN NaN \n",
"19 region of a cell NaN \n",
"20 lies just beneath the plasma membrane and ofte... NaN \n",
"21 The structure in a plant cell that is known as... NaN \n",
"22 fake definition to test retracted reference NaN \n",
"23 A location, relative to cellular compartments ... NaN \n",
"24 A membrane-bounded organelle of eukaryotic cel... NaN \n",
"25 A lipid bilayer along with all the proteins an... NaN \n",
"26 A part of a cellular organism that is either a... NaN \n",
"27 A double lipid bilayer that is part of the nuc... NaN \n",
"28 The membrane surrounding a cell that separates... NaN \n",
"29 NaN NaN \n",
"30 envelope NaN \n",
"31 NaN NaN \n",
"32 A membrane enriched in complexes formed of rea... NaN \n",
"33 Organized structure of distinctive morphology ... NaN \n",
"34 fake definition to test retracted typo in refe... NaN \n",
"35 NaN NaN \n",
"\n",
" info \n",
"0 Cannot parse genus and differentia \n",
"1 Logical definition element not found in text: ... \n",
"2 Logical definition element not found in text: ... \n",
"3 Cannot parse genus and differentia \n",
"4 Did not match whole text: cell cortex < comple... \n",
"5 Wrong position, 'cell cortex' not in 'underlie... \n",
"6 Cannot parse genus and differentia \n",
"7 Logical definition element not found in text: ... \n",
"8 Cannot parse genus and differentia \n",
"9 Logical definition element not found in text: ... \n",
"10 Logical definition element not found in text: ... \n",
"11 Cannot parse genus and differentia \n",
"12 Cannot parse genus and differentia \n",
"13 No problems with definition \n",
"14 Missing text definition \n",
"15 Cannot parse genus and differentia \n",
"16 Cannot parse genus and differentia \n",
"17 Cannot parse genus and differentia \n",
"18 Logical definition element not found in text: ... \n",
"19 Logical definition element not found in text: ... \n",
"20 Logical definition element not found in text: ... \n",
"21 Circular, thylakoid (GO:0009579 in definition \n",
"22 Cannot parse genus and differentia \n",
"23 Cannot parse genus and differentia \n",
"24 No problems with definition \n",
"25 Cannot parse genus and differentia \n",
"26 No problems with definition \n",
"27 No problems with definition \n",
"28 No problems with definition \n",
"29 Definiendum should not appear at the start \n",
"30 Logical definition element not found in text: ... \n",
"31 Definiendum should not appear at the start \n",
"32 No problems with definition \n",
"33 Cannot parse genus and differentia \n",
"34 publication not found: PMID:9999999999999 \n",
"35 publication is retracted: A role for plasma tr... "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"output/validate-definitions.output.tsv\", sep=\"\\t\")\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "f4209133-fd5c-4ecd-a0c4-a5dc4cb8a57a",
"metadata": {},
"source": [
"The rows conform to ValidationResults in the [OAK ontology-metadata](https://w3id.org/oak/ontology-metadata/) data model.\n",
"\n",
"The values of the type field are from the [DefinitionConstraintComponent](https://w3id.org/oak/ontology-metadata/DefinitionConstraintComponent) enumeration.\n",
"\n",
"The enumeration itself is modeled on the taxonomy from Seppälä, Ruttenberg, and Smith, [Guidelines for writing definitions in ontologies](https://philpapers.org/archive/SEPGFW.pdf)."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "421c556c-df3e-4281-914b-613e3d467036",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:50:30.958784Z",
"start_time": "2024-04-15T00:50:30.954200Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array(['oaklib.om:DCC#S3', 'oaklib.om:DCC#S11', 'oaklib.om:DCC#Any',\n",
" 'oaklib.om:DCC#S0', 'oaklib.om:DCC#S7', 'oaklib.om:DCC#S1',\n",
" 'oaklib.om:DCC#S20.1', 'oaklib.om:DCC#S20.2'], dtype=object)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"type\"].unique()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "aea2cfe0-70bf-4b76-89e2-2bfdbdd3a084",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:50:30.966660Z",
"start_time": "2024-04-15T00:50:30.962252Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" type counts\n",
"0 oaklib.om:DCC#Any 6\n",
"1 oaklib.om:DCC#S0 1\n",
"2 oaklib.om:DCC#S1 2\n",
"3 oaklib.om:DCC#S11 10\n",
"4 oaklib.om:DCC#S20.1 1\n",
"5 oaklib.om:DCC#S20.2 1\n",
"6 oaklib.om:DCC#S3 14\n",
"7 oaklib.om:DCC#S7 1"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(\"type\").size().reset_index(name='counts')"
]
},
{
"cell_type": "markdown",
"id": "f28d70f482239b30",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"Next, we'll drop the less informative columns."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "c1df05dd32082e69",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:50:30.994801Z",
"start_time": "2024-04-15T00:50:30.971926Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" type subject subject_label \\\n",
"0 oaklib.om:DCC#S3 GO:0043231 intracellular membrane-bounded organelle \n",
"1 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle \n",
"2 oaklib.om:DCC#S11 GO:0043231 intracellular membrane-bounded organelle \n",
"3 oaklib.om:DCC#S3 GO:0099568 cytoplasmic region \n",
"4 oaklib.om:DCC#S3 GO:0099738 cell cortex region \n",
"5 oaklib.om:DCC#S11 GO:0099738 cell cortex region \n",
"6 oaklib.om:DCC#S3 GO:0071944 cell periphery \n",
"7 oaklib.om:DCC#S11 GO:0031090 organelle membrane \n",
"8 oaklib.om:DCC#S3 GO:0043229 intracellular organelle \n",
"9 oaklib.om:DCC#S11 GO:0043229 intracellular organelle \n",
"10 oaklib.om:DCC#S11 GO:0043229 intracellular organelle \n",
"11 oaklib.om:DCC#S3 GO:0031967 organelle envelope \n",
"12 oaklib.om:DCC#S3 GO:0031975 envelope \n",
"13 oaklib.om:DCC#Any GO:0098590 plasma membrane region \n",
"14 oaklib.om:DCC#S0 GO:0012505 endomembrane system \n",
"15 oaklib.om:DCC#S3 GO:0005622 intracellular anatomical structure \n",
"16 oaklib.om:DCC#S3 GO:9999998 fake term for testing pmid type \n",
"17 oaklib.om:DCC#S3 GO:0043227 membrane-bounded organelle \n",
"18 oaklib.om:DCC#S11 GO:0043227 membrane-bounded organelle \n",
"19 oaklib.om:DCC#S11 GO:0005938 cell cortex \n",
"20 oaklib.om:DCC#S11 GO:0005938 cell cortex \n",
"21 oaklib.om:DCC#S7 GO:0009579 thylakoid \n",
"22 oaklib.om:DCC#S3 GO:9999999 fake term for testing retraction \n",
"23 oaklib.om:DCC#S3 GO:0005575 cellular_component \n",
"24 oaklib.om:DCC#Any GO:0005634 nucleus \n",
"25 oaklib.om:DCC#S3 GO:0016020 membrane \n",
"26 oaklib.om:DCC#Any GO:0110165 cellular anatomical entity \n",
"27 oaklib.om:DCC#Any GO:0005635 nuclear envelope \n",
"28 oaklib.om:DCC#Any GO:0005886 plasma membrane \n",
"29 oaklib.om:DCC#S1 GO:0005773 vacuole \n",
"30 oaklib.om:DCC#S11 GO:0031965 nuclear membrane \n",
"31 oaklib.om:DCC#S1 GO:0005737 cytoplasm \n",
"32 oaklib.om:DCC#Any GO:0034357 photosynthetic membrane \n",
"33 oaklib.om:DCC#S3 GO:0043226 organelle \n",
"34 oaklib.om:DCC#S20.1 GO:9999998 fake term for testing pmid type \n",
"35 oaklib.om:DCC#S20.2 GO:9999999 fake term for testing retraction \n",
"\n",
" object_str \\\n",
"0 Organized structure of distinctive morphology ... \n",
"1 NaN \n",
"2 NaN \n",
"3 Any (proper) part of the cytoplasm of a single... \n",
"4 complete extent of cell cortex \n",
"5 underlies some some region of the plasma membrane \n",
"6 The part of a cell encompassing the cell corte... \n",
"7 is one of the two lipid bilayers of an organel... \n",
"8 Organized structure of distinctive morphology ... \n",
"9 NaN \n",
"10 NaN \n",
"11 A double membrane structure enclosing an organ... \n",
"12 A multilayered structure surrounding all or pa... \n",
"13 A membrane that is a (regional) part of the pl... \n",
"14 NaN \n",
"15 A component of a cell contained within (but no... \n",
"16 fake definition to test retracted typo in refe... \n",
"17 Organized structure of distinctive morphology ... \n",
"18 NaN \n",
"19 region of a cell \n",
"20 lies just beneath the plasma membrane and ofte... \n",
"21 The structure in a plant cell that is known as... \n",
"22 fake definition to test retracted reference \n",
"23 A location, relative to cellular compartments ... \n",
"24 A membrane-bounded organelle of eukaryotic cel... \n",
"25 A lipid bilayer along with all the proteins an... \n",
"26 A part of a cellular organism that is either a... \n",
"27 A double lipid bilayer that is part of the nuc... \n",
"28 The membrane surrounding a cell that separates... \n",
"29 NaN \n",
"30 envelope \n",
"31 NaN \n",
"32 A membrane enriched in complexes formed of rea... \n",
"33 Organized structure of distinctive morphology ... \n",
"34 fake definition to test retracted typo in refe... \n",
"35 NaN \n",
"\n",
" info \n",
"0 Cannot parse genus and differentia \n",
"1 Logical definition element not found in text: ... \n",
"2 Logical definition element not found in text: ... \n",
"3 Cannot parse genus and differentia \n",
"4 Did not match whole text: cell cortex < comple... \n",
"5 Wrong position, 'cell cortex' not in 'underlie... \n",
"6 Cannot parse genus and differentia \n",
"7 Logical definition element not found in text: ... \n",
"8 Cannot parse genus and differentia \n",
"9 Logical definition element not found in text: ... \n",
"10 Logical definition element not found in text: ... \n",
"11 Cannot parse genus and differentia \n",
"12 Cannot parse genus and differentia \n",
"13 No problems with definition \n",
"14 Missing text definition \n",
"15 Cannot parse genus and differentia \n",
"16 Cannot parse genus and differentia \n",
"17 Cannot parse genus and differentia \n",
"18 Logical definition element not found in text: ... \n",
"19 Logical definition element not found in text: ... \n",
"20 Logical definition element not found in text: ... \n",
"21 Circular, thylakoid (GO:0009579 in definition \n",
"22 Cannot parse genus and differentia \n",
"23 Cannot parse genus and differentia \n",
"24 No problems with definition \n",
"25 Cannot parse genus and differentia \n",
"26 No problems with definition \n",
"27 No problems with definition \n",
"28 No problems with definition \n",
"29 Definiendum should not appear at the start \n",
"30 Logical definition element not found in text: ... \n",
"31 Definiendum should not appear at the start \n",
"32 No problems with definition \n",
"33 Cannot parse genus and differentia \n",
"34 publication not found: PMID:9999999999999 \n",
"35 publication is retracted: A role for plasma tr... "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df[[\"type\", \"subject\", \"subject_label\", \"object_str\", \"info\"]]\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "8ad6ef24d0daf11f",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Missing Definitions\n",
"\n",
"The simplest way to fail a definition check is to omit the definition entirely. Missing definitions are reported with type `S0`:\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "381e7c7da587668e",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:50:31.048081Z",
"start_time": "2024-04-15T00:50:30.979466Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" type subject subject_label object_str \\\n",
"14 oaklib.om:DCC#S0 GO:0012505 endomembrane system NaN \n",
"\n",
" info \n",
"14 Missing text definition "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df[\"type\"] == \"oaklib.om:DCC#S0\"]\n"
]
},
{
"cell_type": "markdown",
"id": "f8844c7876451383",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"In the real ontology this term does have a definition; it is missing only from the test input used here."
]
},
{
"cell_type": "markdown",
"id": "c098cdf7a5665add",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Non genus-differentia structure\n",
"\n",
"The OAK `validate-definitions` command follows the [SRS](https://philpapers.org/archive/SEPGFW.pdf) guidelines and assumes that good definitions have a genus-differentia structure.\n",
"\n",
"The definitions that fail this check are reported with type `S3`:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "9cf1490c83491596",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:50:31.052182Z",
"start_time": "2024-04-15T00:50:30.987744Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" type subject subject_label \\\n",
"0 oaklib.om:DCC#S3 GO:0043231 intracellular membrane-bounded organelle \n",
"3 oaklib.om:DCC#S3 GO:0099568 cytoplasmic region \n",
"4 oaklib.om:DCC#S3 GO:0099738 cell cortex region \n",
"6 oaklib.om:DCC#S3 GO:0071944 cell periphery \n",
"8 oaklib.om:DCC#S3 GO:0043229 intracellular organelle \n",
"11 oaklib.om:DCC#S3 GO:0031967 organelle envelope \n",
"12 oaklib.om:DCC#S3 GO:0031975 envelope \n",
"15 oaklib.om:DCC#S3 GO:0005622 intracellular anatomical structure \n",
"16 oaklib.om:DCC#S3 GO:9999998 fake term for testing pmid type \n",
"17 oaklib.om:DCC#S3 GO:0043227 membrane-bounded organelle \n",
"22 oaklib.om:DCC#S3 GO:9999999 fake term for testing retraction \n",
"23 oaklib.om:DCC#S3 GO:0005575 cellular_component \n",
"25 oaklib.om:DCC#S3 GO:0016020 membrane \n",
"33 oaklib.om:DCC#S3 GO:0043226 organelle \n",
"\n",
" object_str \\\n",
"0 Organized structure of distinctive morphology ... \n",
"3 Any (proper) part of the cytoplasm of a single... \n",
"4 complete extent of cell cortex \n",
"6 The part of a cell encompassing the cell corte... \n",
"8 Organized structure of distinctive morphology ... \n",
"11 A double membrane structure enclosing an organ... \n",
"12 A multilayered structure surrounding all or pa... \n",
"15 A component of a cell contained within (but no... \n",
"16 fake definition to test retracted typo in refe... \n",
"17 Organized structure of distinctive morphology ... \n",
"22 fake definition to test retracted reference \n",
"23 A location, relative to cellular compartments ... \n",
"25 A lipid bilayer along with all the proteins an... \n",
"33 Organized structure of distinctive morphology ... \n",
"\n",
" info \n",
"0 Cannot parse genus and differentia \n",
"3 Cannot parse genus and differentia \n",
"4 Did not match whole text: cell cortex < comple... \n",
"6 Cannot parse genus and differentia \n",
"8 Cannot parse genus and differentia \n",
"11 Cannot parse genus and differentia \n",
"12 Cannot parse genus and differentia \n",
"15 Cannot parse genus and differentia \n",
"16 Cannot parse genus and differentia \n",
"17 Cannot parse genus and differentia \n",
"22 Cannot parse genus and differentia \n",
"23 Cannot parse genus and differentia \n",
"25 Cannot parse genus and differentia \n",
"33 Cannot parse genus and differentia "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df[\"type\"] == \"oaklib.om:DCC#S3\"]"
]
},
{
"cell_type": "markdown",
"id": "27f9e7b747b071de",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"Many of these are actual definitions rather than ones manipulated for test purposes.\n",
"\n",
"There is room for valid disagreement about whether rewriting some of these in genus-differentia form would help either users or annotators. Arguably, the subtypes of organelle at least could simply state how they differ from organelles in general, rather than repeating the somewhat wordy _\"Organized structure of distinctive morphology...\"_"
]
},
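{
"cell_type": "markdown",
"id": "b7d2e4a19c3f4e58",
"metadata": {},
"source": [
"To make the idea of genus-differentia form concrete, here is a minimal sketch of the kind of pattern such a check looks for. It is an illustration only: the regular expression and the helper name `split_genus_differentia` are assumptions made for this example, and this is not how OAK itself parses definitions. The first example text is invented; the second is the `object_str` reported for GO:0099738 above.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Hypothetical pattern: \"<article> <genus> that|which <differentia>\".\n",
"# A crude stand-in for illustration, not the parser OAK uses.\n",
"GENUS_DIFFERENTIA = re.compile(\n",
"    r\"^(?:an?|the|any)\\s+(?P<genus>.+?)\\s+(?:that|which)\\s+(?P<differentia>.+)$\",\n",
"    re.IGNORECASE,\n",
")\n",
"\n",
"\n",
"def split_genus_differentia(definition):\n",
"    \"\"\"Return (genus, differentia), or None if the text does not fit the pattern.\"\"\"\n",
"    match = GENUS_DIFFERENTIA.match(definition.strip().rstrip(\".\"))\n",
"    if match is None:\n",
"        return None\n",
"    return match.group(\"genus\"), match.group(\"differentia\")\n",
"\n",
"\n",
"for text in [\n",
"    \"A membrane that surrounds a cell.\",  # invented example\n",
"    \"complete extent of cell cortex\",  # object_str for GO:0099738 above\n",
"]:\n",
"    print(text, \"->\", split_genus_differentia(text))\n",
"```\n",
"\n",
"A definition that names its parent class first and then says what sets it apart fits the pattern; one that starts straight in with a description does not."
]
},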
{
"cell_type": "markdown",
"id": "c56d3a9c531e5a09",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Circular definitions"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "adcbad5fae63e7fb",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:50:31.052559Z",
"start_time": "2024-04-15T00:50:30.994899Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" type subject subject_label \\\n",
"21 oaklib.om:DCC#S7 GO:0009579 thylakoid \n",
"\n",
" object_str \\\n",
"21 The structure in a plant cell that is known as... \n",
"\n",
" info \n",
"21 Circular, thylakoid (GO:0009579 in definition "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df[\"type\"] == \"oaklib.om:DCC#S7\"]"
]
},
{
"cell_type": "markdown",
"id": "34eb55cf06afa332",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Not following convention"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "cf4d18796842b46",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:50:31.062863Z",
"start_time": "2024-04-15T00:50:31.004181Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" type subject subject_label object_str \\\n",
"29 oaklib.om:DCC#S1 GO:0005773 vacuole NaN \n",
"31 oaklib.om:DCC#S1 GO:0005737 cytoplasm NaN \n",
"\n",
" info \n",
"29 Definiendum should not appear at the start \n",
"31 Definiendum should not appear at the start "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df[\"type\"] == \"oaklib.om:DCC#S1\"]"
]
},
{
"cell_type": "markdown",
"id": "4c5189bd46804bd8",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Definition Reference Issues\n",
"\n",
"### Typos in PMIDs\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "35e1f10deba2c6c9",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:51:38.780848Z",
"start_time": "2024-04-15T00:51:38.770256Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" type subject subject_label \\\n",
"34 oaklib.om:DCC#S20.1 GO:9999998 fake term for testing pmid type \n",
"\n",
" object_str \\\n",
"34 fake definition to test retracted typo in refe... \n",
"\n",
" info \n",
"34 publication not found: PMID:9999999999999 "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df[\"type\"] == \"oaklib.om:DCC#S20.1\"]\n"
]
},
{
"cell_type": "markdown",
"id": "7a288d8fc507acc4",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"### Retracted publications"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "f5245d99ab0864d5",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T00:52:02.693591Z",
"start_time": "2024-04-15T00:52:02.687692Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" type subject subject_label \\\n",
"35 oaklib.om:DCC#S20.2 GO:9999999 fake term for testing retraction \n",
"\n",
" object_str info \n",
"35 NaN publication is retracted: A role for plasma tr... "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df[\"type\"] == \"oaklib.om:DCC#S20.2\"]\n"
]
},
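{
"cell_type": "markdown",
"id": "c4e19a7d52b84f03",
"metadata": {},
"source": [
"The per-type filters above look at one rule at a time. As a complementary view, the following minimal sketch tallies results per rule type. It is self-contained and uses a hand-copied subset of the rows shown above (`rows`) rather than the notebook's `df`; the helper name `tally_by_type` is an assumption made for this illustration.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Stand-in rows copied from the results table above (subset only).\n",
"rows = [\n",
"    {\"type\": \"oaklib.om:DCC#S0\", \"subject\": \"GO:0012505\"},\n",
"    {\"type\": \"oaklib.om:DCC#S7\", \"subject\": \"GO:0009579\"},\n",
"    {\"type\": \"oaklib.om:DCC#S1\", \"subject\": \"GO:0005773\"},\n",
"    {\"type\": \"oaklib.om:DCC#S1\", \"subject\": \"GO:0005737\"},\n",
"]\n",
"\n",
"\n",
"def tally_by_type(results):\n",
"    \"\"\"Count validation results per rule type, most frequent first.\"\"\"\n",
"    return Counter(row[\"type\"] for row in results).most_common()\n",
"\n",
"\n",
"for rule, count in tally_by_type(rows):\n",
"    print(rule, count)\n",
"```\n",
"\n",
"On the notebook's DataFrame, `df[\"type\"].value_counts()` produces the equivalent tally."
]
},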
{
"cell_type": "markdown",
"id": "7e8d97bc6e6c20b0",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Using LLMs to validate definitions\n",
"\n",
"In this example we use an LLM to validate the definition of the following GO catalytic activity term:\n",
"\n",
"```\n",
"[Term]\n",
"id: GO:0000010\n",
"name: trans-hexaprenyltranstransferase activity\n",
"namespace: molecular_function\n",
"alt_id: GO:0036422\n",
"def: \"Catalysis of the reaction: (2E,6E)-farnesyl diphosphate + 4 isopentenyl diphosphate = 4 diphosphate + all-trans-heptaprenyl diphosphate.\" [PMID:9708911, RHEA:27794]\n",
"synonym: \"all-trans-heptaprenyl-diphosphate synthase activity\" RELATED [EC:2.5.1.30]\n",
"synonym: \"HepPP synthase activity\" RELATED [EC:2.5.1.30]\n",
"synonym: \"heptaprenyl diphosphate synthase activity\" RELATED []\n",
"synonym: \"heptaprenyl pyrophosphate synthase activity\" RELATED [EC:2.5.1.30]\n",
"synonym: \"heptaprenyl pyrophosphate synthetase activity\" RELATED [EC:2.5.1.30]\n",
"xref: EC:2.5.1.30\n",
"xref: MetaCyc:TRANS-HEXAPRENYLTRANSTRANSFERASE-RXN\n",
"xref: RHEA:27794\n",
"```\n",
"\n",
"The definition cites two references:\n",
"\n",
" - the publication [PMID:9708911](https://pubmed.ncbi.nlm.nih.gov/9708911/)\n",
" - the RHEA reaction [RHEA:27794](https://www.rhea-db.org/reaction?id=27794)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "4e29eb9d8ff5df4c",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T01:00:28.475900Z",
"start_time": "2024-04-15T01:00:13.437742Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"!runoak --stacktrace -i llm:{claude-3-opus}:simpleobo:input/validate-defs-test.obo validate-definitions -C input/validate-definition-conf.yaml GO:0000010 -O yaml -o output/validate-definitions.llm.yaml"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "69f6da5532285cf9",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T01:01:41.771699Z",
"start_time": "2024-04-15T01:01:41.744373Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"import yaml\n",
"\n",
"with open(\"output/validate-definitions.llm.yaml\") as f:\n",
"    report = yaml.safe_load(f)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "b35f8ffab12b1b6b",
"metadata": {
"ExecuteTime": {
"end_time": "2024-04-15T01:09:34.475682Z",
"start_time": "2024-04-15T01:09:34.465369Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"type: https://w3id.org/oak/ontology-metadata/DCC.S20\n",
"subject: GO:0000010\n",
"severity: INFO\n",
"predicate: IAO:0000115\n",
"object_str: \n",
" id: PMID:9708911\n",
" title: Biological significance of the side chain length of ubiquinone in Saccharomyces\n",
" cerevisiae.\n",
" abstract: Ubiquinone (UQ), an important component of the electron transfer system,\n",
" is constituted of a quinone structure and a side chain isoprenoid. The side chain\n",
" length of UQ differs between microorganisms, and this difference has been used for\n",
" taxonomic study. In this study, we have addressed the importance of the length of\n",
" the side chain of UQ for cells, and examined the effect of chain length by producing\n",
" UQs with isoprenoid chain lengths between 5 and 10 in Saccharomyces cerevisiae.\n",
" To make the different UQ species, different types of prenyl diphosphate synthases\n",
" were expressed in a S. cerevisiae COQ1 mutant defective for hexaprenyl diphosphate\n",
" synthesis. As a result, we found that the original species of UQ (in this case UQ-6)\n",
" had maximum functionality. However, we found that other species of UQ could replace\n",
" UQ-6. Thus a broad spectrum of different UQ species are biologically functional\n",
" in yeast cells, although cells seem to display a preference for their own particular\n",
" type of UQ.\n",
" publication_type: Research Support, Non-U.S. Gov't\n",
" \n",
"\n",
"info: \n",
" The term \"trans-hexaprenyltranstransferase activity\" has a HIGH level of alignment with the cited reference PMID:9708911. The abstract supports the definition well, as evidenced by these key points:\n",
" \n",
" 1. The study examines the importance of the side chain length of ubiquinone (UQ) in Saccharomyces cerevisiae, which directly relates to the activity of trans-hexaprenyltranstransferase.\n",
" \n",
" 2. The abstract mentions \"hexaprenyl diphosphate synthesis\" in S. cerevisiae, which is the product of trans-hexaprenyltranstransferase activity.\n",
" \n",
" 3. The study found that the original species of UQ (UQ-6) had maximum functionality in yeast cells, suggesting a preference for the hexaprenyl side chain length produced by trans-hexaprenyltranstransferase.\n",
" \n",
" No sections of the abstract misalign with or contradict the term definition. The definition is appropriately specific, focusing on the enzyme's activity without providing additional details about its structure or cellular role.\n",
"\n",
"definition: \n",
" Catalysis of the reaction: (2E,6E)-farnesyl diphosphate + 4 isopentenyl diphosphate = 4 diphosphate + all-trans-heptaprenyl diphosphate.\n",
"\n",
"definition_source: PMID:9708911\n"
]
}
],
"source": [
"# Indent long (multi-line) values so that the report is easier to read\n",
"for k, v in report.items():\n",
"    if len(str(v)) > 50:\n",
"        lines = str(v).split(\"\\n\")\n",
"        lines = [f\"    {line}\" for line in lines]\n",
"        lines = [\"\"] + lines + [\"\"]\n",
"        v = \"\\n\".join(lines)\n",
"    print(f\"{k}: {v}\")"
]
},
{
"cell_type": "markdown",
"id": "233f8a645b3517f2",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"__COMMENTARY__\n",
"\n",
"Note that because this is an LLM, the output can differ on every run.\n",
"\n",
"In some runs the LLM fails to recognize that the paper is indeed about trans-hexaprenyltranstransferase activity. Even then, the output is useful, because it shows us that the abstract is not directly about this activity."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}