{ "cells": [ { "cell_type": "markdown", "id": "1cd4a3da-5c5c-46ce-9423-3b7a48b7f6ca", "metadata": {}, "source": [ "# OAK statistics command\n", "\n", "This notebook is intended as a supplement to the [main OAK CLI docs](https://incatools.github.io/ontology-access-kit/cli.html).\n", "\n", "This notebook provides examples for the `statistics` command, which can be used to calculate basic descriptive statistics\n", "for an ontology\n", "\n", "## Help Option\n", "\n", "You can get help on any OAK command using `--help`" ] }, { "cell_type": "code", "execution_count": 1, "id": "8940e44b-f1fc-4440-88ba-8064c33a48e6", "metadata": { "ExecuteTime": { "end_time": "2024-03-26T23:26:04.707363Z", "start_time": "2024-03-26T23:26:01.937733Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: runoak statistics [OPTIONS] [BRANCHES]...\r\n", "\r\n", " Shows all descriptive/summary statistics\r\n", "\r\n", " Example: ------- runoak -i sqlite:obo:pr statistics\r\n", "\r\n", " By default, this will show combined summary statistics for all terms\r\n", "\r\n", " You can also break down the statistics in two ways:\r\n", "\r\n", " - by a collection of branch roots\r\n", "\r\n", " - by a metadata property (e.g. oio:hasOBONamespace, rdfs:isDefinedBy)\r\n", "\r\n", " - by prefix (e.g. GO, PR, CL, OBI)\r\n", "\r\n", " Example: ------- runoak -i sqlite:obo:pr statistics -p\r\n", " oio:hasOBONamespace\r\n", "\r\n", " Note: the oio:hasOBONamespace is *not* the same as the ID prefix, it is a\r\n", " field that is used by a subset of ontologies to partition classes into broad\r\n", " groupings, similar to subsets. Its use is non-standard, yet a lot of\r\n", " ontologies use this as the main partitioning mechanism.\r\n", "\r\n", " A note on bundled ontologies:\r\n", "\r\n", " The standard release many OBO ontologies \"bundles\" parts of other ontologies\r\n", " (formally, the release product includes a merged imports closure of import\r\n", " modules). 
This can complicate generation of statistics. A naive count of all\r\n", " classes in the main OBI release will include not only \"native\" OBI classes,\r\n", " but also classes from other ontologies that are bundled in the release.\r\n", "\r\n", " For bundled ontologies, we recommend some kind of partitioning, such as via\r\n", " defined roots, or via the CURIE prefix, using the ``--group-by-prefix``\r\n", " option.\r\n", "\r\n", " Output formats:\r\n", "\r\n", " The recommended output types for this command are yaml, json, or csv. The\r\n", " default output type is yaml, following the SummaryStatistics data model.\r\n", " This is naturally nested, as the statistics includes faceted groupings (e.g.\r\n", " edge counts are broken down by predicate). When specifying a flat format\r\n", " like csv, this is flattened into a single table, with dynamic column names.\r\n", "\r\n", " Change statistics:\r\n", "\r\n", " You can optionally combine the ontology statistics with a change summary\r\n", " relative to another ontology, using the ``--compare-with`` option.\r\n", "\r\n", " Example: ------- runoak -i v2.obo statistics --group-by-obo-namespace\r\n", " --compare-with v1.obo\r\n", "\r\n", " This will also include change stats broken down by KGCL change types. 
If a\r\n", " group-by option is specified, these will be grouped accordingly.\r\n", "\r\n", " Python API:\r\n", "\r\n", " https://incatools.github.io/ontology-access-kit/interfaces/summary-\r\n", " statistics\r\n", "\r\n", " Data model:\r\n", "\r\n", " https://w3id.org/oak/summary-statistics\r\n", "\r\n", "Options:\r\n", " -O, --output-type [obo|obojson|ofn|rdf|json|yaml|fhirjson|csv|tsv|nl]\r\n", " Desired output type\r\n", " --group-by-property TEXT group summaries by a metadata property, e.g.\r\n", " rdfs:isDefinedBy\r\n", " --group-by-obo-namespace / --no-group-by-obo-namespace\r\n", " shortcut for --group-by-property\r\n", " oio:hasOBONamespace (note this is distinct\r\n", " from the ID namespace) [default: no-group-\r\n", " by-obo-namespace]\r\n", " --group-by-prefix / --no-group-by-prefix\r\n", " shortcut for --group-by-property sh:prefix.\r\n", " Groups by the prefix of the CURIE [default:\r\n", " no-group-by-prefix]\r\n", " --group-by-defined-by / --no-group-by-defined-by\r\n", " shortcut for --group-by-property\r\n", " rdfs:isDefinedBy. This may be inferred from\r\n", " prefix if not set explicitly [default: no-\r\n", " group-by-defined-by]\r\n", " --include-residuals / --no-include-residuals\r\n", " If true include an OTHER category for terms\r\n", " that do not have the property\r\n", " -X, --compare-with TEXT Compare with another ontology\r\n", " -P, --has-prefix TEXT filter based on a prefix, e.g. OBI\r\n", " -o, --output FILENAME Output file, e.g. 
obo file\r\n", " --help Show this message and exit.\r\n" ] } ], "source": [ "!runoak statistics --help" ] }, { "cell_type": "markdown", "id": "ed1c706e-82e0-4168-bdaf-8bd96a3cd72a", "metadata": {}, "source": [ "## Set up an alias\n", "\n", "For convenience, we will set up an alias for use in this notebook:" ] }, { "cell_type": "code", "execution_count": 18, "id": "c7350ec6-6070-45f5-a058-0da9e29ae086", "metadata": { "ExecuteTime": { "end_time": "2024-03-27T00:29:08.956988Z", "start_time": "2024-03-27T00:29:08.951340Z" } }, "outputs": [], "source": [ "alias chebi runoak -i sqlite:obo:chebi" ] }, { "cell_type": "markdown", "source": [ "## Calculating summary statistics (default YAML output)\n", "\n", "We can calculate the summary stats using the `statistics` command. The output is quite lengthy,\n", "so we will use `--output` (`-o`) to direct it to a YAML file:" ], "metadata": { "collapsed": false }, "id": "4020400699697b8e" }, { "cell_type": "code", "execution_count": 19, "id": "81046b24-3811-4f7c-9eb1-e4e36bff370d", "metadata": { "ExecuteTime": { "end_time": "2024-03-27T00:30:20.090591Z", "start_time": "2024-03-27T00:29:10.137939Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:root:bad mapping: KEGG_COMPOUND \r\n", "WARNING:root:bad mapping: IUPAC\r\n", "WARNING:root:bad mapping: ChemIDplus\r\n", "WARNING:root:bad mapping: UniProt\r\n", "WARNING:root:bad mapping: DrugCentral\r\n", "WARNING:root:bad mapping: LINCS\r\n", "WARNING:root:bad mapping: KEGG_DRUG\r\n", "WARNING:root:bad mapping: ChEBI\r\n", "WARNING:root:bad mapping: ChEMBL\r\n", "WARNING:root:bad mapping: DrugBank\r\n", "WARNING:root:bad mapping: WHO_MedNet\r\n", "WARNING:root:bad mapping: PDBeChem\r\n", "WARNING:root:bad mapping: NIST_Chemistry_WebBook\r\n", "WARNING:root:bad mapping: PPDB\r\n", "WARNING:root:bad mapping: LIPID_MAPS\r\n", "WARNING:root:bad mapping: IUPHAR\r\n", "WARNING:root:bad mapping: HMDB\r\n", "WARNING:root:bad mapping: SUBMITTER\r\n", 
"WARNING:root:bad mapping: MetaCyc\r\n", "WARNING:root:bad mapping: JCBN\r\n", "WARNING:root:bad mapping: GlyTouCan\r\n", "WARNING:root:bad mapping: KNApSAcK\r\n", "WARNING:root:bad mapping: IUBMB\r\n", "WARNING:root:bad mapping: CBN\r\n", "WARNING:root:bad mapping: Alan_Wood's_Pesticides\r\n", "WARNING:root:bad mapping: GlyGen\r\n", "WARNING:root:bad mapping: KEGG_GLYCAN\r\n", "WARNING:root:bad mapping: RESID\r\n", "WARNING:root:bad mapping: PubChem\r\n", "WARNING:root:bad mapping: FooDB\r\n", "WARNING:root:bad mapping: VSDB\r\n", "WARNING:root:bad mapping: UM-BBD\r\n", "WARNING:root:bad mapping: MolBase\r\n", "WARNING:root:bad mapping: COMe\r\n", "WARNING:root:bad mapping: Beilstein\r\n", "WARNING:root:bad mapping: Patent\r\n", "WARNING:root:bad mapping: PDB\r\n", "WARNING:root:bad mapping: SMID\r\n" ] } ], "source": [ "chebi statistics -o output/chebi.stats.yaml" ] }, { "cell_type": "markdown", "source": [ "__Note__ CHEBI has a lot of bad xrefs, hence the output" ], "metadata": { "collapsed": false }, "id": "c23707ed5c1a794f" }, { "cell_type": "markdown", "source": [ "## Exploring the output\n", "\n", "Let's look at the top of the YAML file:" ], "metadata": { "collapsed": false }, "id": "a42ba18a15b14b48" }, { "cell_type": "code", "execution_count": 20, "id": "065686c3-24b6-4752-b290-2eb110f8913b", "metadata": { "ExecuteTime": { "end_time": "2024-03-27T00:30:20.236768Z", "start_time": "2024-03-27T00:30:20.092802Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id: AllOntologies\r\n", "ontologies:\r\n", "- id: obo:chebi.owl\r\n", " version: obo:chebi/231/chebi.owl\r\n", "was_generated_by:\r\n", " started_at_time: '2024-03-26T17:29:56.627143'\r\n", " was_associated_with: OAK\r\n", " acted_on_behalf_of: cjm\r\n", "class_count: 217549\r\n", "deprecated_class_count: 18650\r\n", "non_deprecated_class_count: 198899\r\n", "class_count_with_text_definitions: 53575\r\n", "class_count_without_text_definitions: 163974\r\n", "object_property_count: 
10\r\n", "annotation_property_count: 37\r\n", "named_individual_count: 0\r\n", "subset_count: 3\r\n", "rdf_triple_count: 6860047\r\n", "subclass_of_axiom_count: 368285\r\n", "equivalent_classes_axiom_count: 0\r\n", "edge_count_by_predicate:\r\n", " BFO:0000051:\r\n", " facet: BFO:0000051\r\n", " filtered_count: 4029\r\n", " RO:0000087:\r\n", " facet: RO:0000087\r\n", " filtered_count: 43636\r\n", " obo:chebi#has_functional_parent:\r\n", " facet: obo:chebi#has_functional_parent\r\n", " filtered_count: 19632\r\n", " obo:chebi#has_parent_hydride:\r\n", " facet: obo:chebi#has_parent_hydride\r\n", " filtered_count: 1799\r\n", " obo:chebi#is_conjugate_acid_of:\r\n", " facet: obo:chebi#is_conjugate_acid_of\r\n", " filtered_count: 8484\r\n", " obo:chebi#is_conjugate_base_of:\r\n", " facet: obo:chebi#is_conjugate_base_of\r\n", " filtered_count: 8484\r\n", " obo:chebi#is_enantiomer_of:\r\n", " facet: obo:chebi#is_enantiomer_of\r\n", " filtered_count: 2754\r\n", " obo:chebi#is_substituent_group_from:\r\n", " facet: obo:chebi#is_substituent_group_from\r\n", " filtered_count: 1287\r\n", " obo:chebi#is_tautomer_of:\r\n", " facet: obo:chebi#is_tautomer_of\r\n", " filtered_count: 1886\r\n", " rdfs:subClassOf:\r\n", " facet: rdfs:subClassOf\r\n" ] } ], "source": [ "!head -50 output/chebi.stats.yaml" ] }, { "cell_type": "markdown", "source": [ "Like all objects produced by OAK, there is a data dictionary / data model. 
The one for ontology statistics\n", "is [https://w3id.org/oak/summary-statistics](https://w3id.org/oak/summary-statistics);\n", "you can use this link to browse the documentation.\n", "\n", "**A well-defined data dictionary is necessary for communicating aggregate statistics accurately**.\n", "When ontology sizes are reported informally, it is often ambiguous whether *number of terms* means:\n", "\n", "- number of *classes*, *classes plus relationship types*, or *classes plus some other elements*\n", "- including or excluding deprecated (obsolete) entities\n", "\n", "The OAK summary statistics data dictionary aims to provide a **standard for ontology reporting**.\n", "\n", "YAML allows nesting, which is a natural way to group things; for example:\n", "\n", "```yaml\n", "edge_count_by_predicate:\n", "  BFO:0000051:\n", "    facet: BFO:0000051\n", "    filtered_count: 4003\n", "  RO:0000087:\n", "    facet: RO:0000087\n", "    filtered_count: 43082\n", "```\n", "\n", "This says that there are 4003 has-part (BFO:0000051) and 43082 has-role (RO:0000087) [relationships](https://incatools.github.io/ontology-access-kit/glossary.html#term-Relationship).\n", "\n", "See the [OAK guide to relationships](https://incatools.github.io/ontology-access-kit/guide/relationships-and-graphs.html)\n", "to understand more.\n", "\n", "## Mapping Stats\n", "\n", "Further on in the YAML we can see mapping stats. See [https://w3id.org/sssom](https://w3id.org/sssom) to\n", "understand the OAK mapping data model.\n", "\n", "These are broken down:\n", "\n", "- by mapping predicate (for many ontologies this is only `oio:hasDbXref`)\n", "- by mapping object source (i.e. 
the database or ontology that is mapped to)" ], "metadata": { "collapsed": false }, "id": "66b8945aa523b8d5" }, { "cell_type": "code", "execution_count": 21, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mapping_statement_count_by_predicate:\r\n", " oio:hasDbXref:\r\n", " facet: oio:hasDbXref\r\n", " filtered_count: 345271\r\n", "mapping_statement_count_by_object_source:\r\n", " BFO:\r\n", " facet: BFO\r\n", " filtered_count: 1\r\n", " RO:\r\n", " facet: RO\r\n", " filtered_count: 1\r\n", " KNApSAcK:\r\n", " facet: KNApSAcK\r\n", " filtered_count: 5185\r\n", " KEGG:\r\n", " facet: KEGG\r\n", " filtered_count: 22228\r\n", " CAS:\r\n", " facet: CAS\r\n", " filtered_count: 28938\r\n", " KEGG_COMPOUND:\r\n", " facet: KEGG_COMPOUND\r\n", " filtered_count: 19870\r\n", " Beilstein:\r\n", " facet: Beilstein\r\n", " filtered_count: 9187\r\n", " IUPAC:\r\n", " facet: IUPAC\r\n", " filtered_count: 61013\r\n", " ChemIDplus:\r\n", " facet: ChemIDplus\r\n", " filtered_count: 33383\r\n", " UniProt:\r\n", " facet: UniProt\r\n", " filtered_count: 16047\r\n", " LINCS:\r\n", " facet: LINCS\r\n", " filtered_count: 41392\r\n", " Drug_Central:\r\n", " facet: Drug_Central\r\n", " filtered_count: 3784\r\n", " DrugCentral:\r\n", " facet: DrugCentral\r\n", " filtered_count: 6202\r\n", " Wikipedia:\r\n", "--\r\n", "mapping_statement_count_subject_by_object_source:\r\n", " BFO:\r\n", " facet: BFO\r\n", " filtered_count: 1\r\n", " RO:\r\n", " facet: RO\r\n", " filtered_count: 1\r\n", " KNApSAcK:\r\n", " facet: KNApSAcK\r\n", " filtered_count: 5091\r\n", " KEGG:\r\n", " facet: KEGG\r\n", " filtered_count: 20233\r\n", " CAS:\r\n", " facet: CAS\r\n", " filtered_count: 28615\r\n", " KEGG_COMPOUND:\r\n", " facet: KEGG_COMPOUND\r\n", " filtered_count: 19870\r\n", " Beilstein:\r\n", " facet: Beilstein\r\n", " filtered_count: 8704\r\n", " IUPAC:\r\n", " facet: IUPAC\r\n", " filtered_count: 61013\r\n", " ChemIDplus:\r\n", " facet: ChemIDplus\r\n", " filtered_count: 33383\r\n", " 
UniProt:\r\n", " facet: UniProt\r\n", " filtered_count: 16047\r\n", " LINCS:\r\n", " facet: LINCS\r\n", " filtered_count: 41389\r\n", " Drug_Central:\r\n", " facet: Drug_Central\r\n", " filtered_count: 3783\r\n", " DrugCentral:\r\n", " facet: DrugCentral\r\n", " filtered_count: 6202\r\n", " Wikipedia:\r\n" ] } ], "source": [ "!grep -A40 ^mapping_statement_count output/chebi.stats.yaml" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-03-27T00:30:20.378726Z", "start_time": "2024-03-27T00:30:20.237175Z" } }, "id": "688b55507ca72f41" }, { "cell_type": "markdown", "source": [ "As expected, CHEBI does not make use of SKOS mapping predicates, and mappings\n", "are dominated by databases such as KEGG and CAS.\n" ], "metadata": { "collapsed": false }, "id": "65c80c02acc5a77d" }, { "cell_type": "markdown", "source": [ "## TSV Output\n", "\n", "YAML is not a very natural format for doing further data science or statistical processing.\n", "\n", "For that, we can use the `csv` output type (which actually defaults to tab-separated output):" ], "metadata": { "collapsed": false }, "id": "61767f28d19545d" }, { "cell_type": "code", "execution_count": 9, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:root:bad mapping: KEGG_COMPOUND\r\n", "WARNING:root:bad mapping: IUPAC\r\n", "WARNING:root:bad mapping: ChemIDplus\r\n", "WARNING:root:bad mapping: UniProt\r\n", "WARNING:root:bad mapping: DrugCentral\r\n", "WARNING:root:bad mapping: LINCS\r\n", "WARNING:root:bad mapping: KEGG_DRUG\r\n", "WARNING:root:bad mapping: ChEBI\r\n", "WARNING:root:bad mapping: ChEMBL\r\n", "WARNING:root:bad mapping: DrugBank\r\n", "WARNING:root:bad mapping: WHO_MedNet\r\n", "WARNING:root:bad mapping: PDBeChem\r\n", "WARNING:root:bad mapping: NIST_Chemistry_WebBook\r\n", "WARNING:root:bad mapping: LIPID_MAPS\r\n", "WARNING:root:bad mapping: IUPHAR\r\n", "WARNING:root:bad mapping: HMDB\r\n", "WARNING:root:bad mapping: SUBMITTER\r\n", "WARNING:root:bad mapping: MetaCyc\r\n", 
"WARNING:root:bad mapping: JCBN\r\n", "WARNING:root:bad mapping: GlyTouCan\r\n", "WARNING:root:bad mapping: KNApSAcK\r\n", "WARNING:root:bad mapping: IUBMB\r\n", "WARNING:root:bad mapping: EMBL\r\n", "WARNING:root:bad mapping: CBN\r\n", "WARNING:root:bad mapping: Alan_Wood's_Pesticides\r\n", "WARNING:root:bad mapping: GlyGen\r\n", "WARNING:root:bad mapping: PPDB\r\n", "WARNING:root:bad mapping: KEGG_GLYCAN\r\n", "WARNING:root:bad mapping: RESID\r\n", "WARNING:root:bad mapping: PubChem\r\n", "WARNING:root:bad mapping: FooDB\r\n", "WARNING:root:bad mapping: VSDB\r\n", "WARNING:root:bad mapping: UM-BBD\r\n", "WARNING:root:bad mapping: MolBase\r\n", "WARNING:root:bad mapping: COMe\r\n", "WARNING:root:bad mapping: EBI_Industry_Programme\r\n", "WARNING:root:bad mapping: EuroFIR\r\n", "WARNING:root:bad mapping: Beilstein\r\n", "WARNING:root:bad mapping: Patent\r\n", "WARNING:root:bad mapping: PDB\r\n", "WARNING:root:bad mapping: SMID\r\n" ] } ], "source": [ "chebi statistics -o output/chebi.stats.tsv -O csv" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-03-27T00:07:55.650586Z", "start_time": "2024-03-27T00:07:30.421752Z" } }, "id": "d35e47fb825f3f00" }, { "cell_type": "markdown", "source": [ "To illustrate this we will use pandas:" ], "metadata": { "collapsed": false }, "id": "88340e60db1a177b" }, { "cell_type": "code", "execution_count": 11, "outputs": [ { "data": { "text/plain": " id compared_with agents class_count deprecated_class_count \\\n0 AllOntologies NaN NaN 185295 18628 \n\n non_deprecated_class_count class_count_with_text_definitions \\\n0 166667 53049 \n\n class_count_without_text_definitions object_property_count \\\n0 132246 10 \n\n annotation_property_count ... \\\n0 37 ... 
\n\n mapping_statement_count_subject_by_object_source_CTX \\\n0 3 \n\n mapping_statement_count_subject_by_object_source_SMID \\\n0 307 \n\n class_count_by_subset_1_STAR class_count_by_subset_2_STAR \\\n0 2945 102919 \n\n class_count_by_subset_3_STAR was_generated_by_started_at_time \\\n0 60803 2024-03-26T17:07:33.778117 \n\n was_generated_by_was_associated_with was_generated_by_acted_on_behalf_of \\\n0 OAK cjm \n\n ontologies_id ontologies_version \n0 obo:chebi.owl obo:chebi/226/chebi.owl \n\n[1 rows x 177 columns]", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
idcompared_withagentsclass_countdeprecated_class_countnon_deprecated_class_countclass_count_with_text_definitionsclass_count_without_text_definitionsobject_property_countannotation_property_count...mapping_statement_count_subject_by_object_source_CTXmapping_statement_count_subject_by_object_source_SMIDclass_count_by_subset_1_STARclass_count_by_subset_2_STARclass_count_by_subset_3_STARwas_generated_by_started_at_timewas_generated_by_was_associated_withwas_generated_by_acted_on_behalf_ofontologies_idontologies_version
0AllOntologiesNaNNaN18529518628166667530491322461037...33072945102919608032024-03-26T17:07:33.778117OAKcjmobo:chebi.owlobo:chebi/226/chebi.owl
\n

1 rows × 177 columns

\n
" }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"output/chebi.stats.tsv\", sep=\"\\t\")\n", "df\n" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-03-27T00:08:27.895520Z", "start_time": "2024-03-27T00:08:27.863507Z" } }, "id": "c3381ed2ce4edaf2" }, { "cell_type": "markdown", "source": [ "This format is useful if you have multiple ontologies (see later).\n", "But for a single ontology it's more convenient to melt this:" ], "metadata": { "collapsed": false }, "id": "dd1382a9dd752048" }, { "cell_type": "code", "execution_count": 13, "outputs": [ { "data": { "text/plain": " Property Value\n0 id AllOntologies\n1 compared_with NaN\n2 agents NaN\n3 class_count 185295\n4 deprecated_class_count 18628\n5 non_deprecated_class_count 166667\n6 class_count_with_text_definitions 53049\n7 class_count_without_text_definitions 132246\n8 object_property_count 10\n9 annotation_property_count 37\n10 named_individual_count 0\n11 subset_count 3\n12 rdf_triple_count 6158555\n13 subclass_of_axiom_count 330989\n14 equivalent_classes_axiom_count 0\n15 entailed_edge_count_by_predicate {}\n16 distinct_synonym_count 332744\n17 synonym_statement_count 346486\n18 class_count_by_category {}\n19 contributor_summary {}\n20 change_summary {}\n21 merged_class_query 18559\n22 deprecated_property_count 0\n23 edge_count_by_predicate_BFO:0000051 4003\n24 edge_count_by_predicate_RO:0000087 43082\n25 edge_count_by_predicate_has_functional_parent 18664\n26 edge_count_by_predicate_has_parent_hydride 1764\n27 edge_count_by_predicate_is_conjugate_acid_of 8434\n28 edge_count_by_predicate_is_conjugate_base_of 8434\n29 edge_count_by_predicate_is_enantiomer_of 2728\n30 edge_count_by_predicate_is_substituent_group_from 1284\n31 edge_count_by_predicate_is_tautomer_of 1884\n32 edge_count_by_predicate_rdfs:subClassOf 240712\n33 edge_count_by_predicate_rdfs:subPropertyOf 6\n34 
synonym_statement_count_by_predicate_hasExactS... 100585\n35 synonym_statement_count_by_predicate_hasRelate... 234002\n36 mapping_statement_count_by_predicate_hasDbXref 317151\n37 mapping_statement_count_by_object_source_BFO 1\n38 mapping_statement_count_by_object_source_RO 1\n39 mapping_statement_count_by_object_source_KNApSAcK 5152", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
PropertyValue
0idAllOntologies
1compared_withNaN
2agentsNaN
3class_count185295
4deprecated_class_count18628
5non_deprecated_class_count166667
6class_count_with_text_definitions53049
7class_count_without_text_definitions132246
8object_property_count10
9annotation_property_count37
10named_individual_count0
11subset_count3
12rdf_triple_count6158555
13subclass_of_axiom_count330989
14equivalent_classes_axiom_count0
15entailed_edge_count_by_predicate{}
16distinct_synonym_count332744
17synonym_statement_count346486
18class_count_by_category{}
19contributor_summary{}
20change_summary{}
21merged_class_query18559
22deprecated_property_count0
23edge_count_by_predicate_BFO:00000514003
24edge_count_by_predicate_RO:000008743082
25edge_count_by_predicate_has_functional_parent18664
26edge_count_by_predicate_has_parent_hydride1764
27edge_count_by_predicate_is_conjugate_acid_of8434
28edge_count_by_predicate_is_conjugate_base_of8434
29edge_count_by_predicate_is_enantiomer_of2728
30edge_count_by_predicate_is_substituent_group_from1284
31edge_count_by_predicate_is_tautomer_of1884
32edge_count_by_predicate_rdfs:subClassOf240712
33edge_count_by_predicate_rdfs:subPropertyOf6
34synonym_statement_count_by_predicate_hasExactS...100585
35synonym_statement_count_by_predicate_hasRelate...234002
36mapping_statement_count_by_predicate_hasDbXref317151
37mapping_statement_count_by_object_source_BFO1
38mapping_statement_count_by_object_source_RO1
39mapping_statement_count_by_object_source_KNApSAcK5152
\n
" }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mdf = df.melt(var_name='Property', value_name='Value')\n", "mdf[0:40]" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-03-27T00:10:22.008072Z", "start_time": "2024-03-27T00:10:21.985735Z" } }, "id": "1e71a74c64f01ea3" }, { "cell_type": "markdown", "source": [ "Note this uses a very generic way of flattening the yaml so some columns make less sense out of context - \n", "e.g. the \"agent\" field belongs to a parent object that describes what \"agent\" generated the stats\n", "(TODO: this should say \"oaklib\")" ], "metadata": { "collapsed": false }, "id": "4415b1c56ef20078" }, { "cell_type": "markdown", "source": [ "## Multi-ontology merges\n", "\n", "Many OBO ontologies bundle portions of other ontologies with their main release. This can\n", "be confusing! For more details see [OWL Format Variants](https://oboacademy.github.io/obook/explanation/owl-format-variants/)\n", "in the obook.\n", "\n", "As an example, consider naively calculating stats for the standard release of the\n", "Cell Ontology (CL):" ], "metadata": { "collapsed": false }, "id": "1c59d85909ca59b1" }, { "cell_type": "code", "execution_count": 14, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:root:bad mapping: GSE137537\r\n", "WARNING:root:bad mapping: 10.1007/s004180050142\r\n", "WARNING:root:bad mapping: NIFSTD\r\n", "WARNING:root:bad mapping: Noradrenergic_cell_group_A6&oldid=981960774\r\n", "WARNING:root:bad mapping: _Chapter_3\r\n", "WARNING:root:bad mapping: A12.2.15.042\r\n", "WARNING:root:bad mapping: PHENOSCAPE\r\n", "id: AllOntologies\r\n", "ontologies:\r\n", "- id: obo:cl.owl\r\n", " version: obo:cl/releases/2023-09-21/cl.owl\r\n", " version_info: '2023-09-21'\r\n", "was_generated_by:\r\n", " started_at_time: '2024-03-26T17:16:02.669245'\r\n", " was_associated_with: OAK\r\n", " acted_on_behalf_of: cjm\r\n", "class_count: 28330\r\n", 
"deprecated_class_count: 261\r\n", "non_deprecated_class_count: 28069\r\n", "class_count_with_text_definitions: 15110\r\n", "class_count_without_text_definitions: 13220\r\n", "object_property_count: 297\r\n", "annotation_property_count: 241\r\n", "named_individual_count: 18\r\n", "subset_count: 63\r\n", "rdf_triple_count: 681623\r\n", "subclass_of_axiom_count: 44142\r\n" ] } ], "source": [ "!runoak -i sqlite:obo:cl statistics | head -20" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-03-27T00:16:05.708936Z", "start_time": "2024-03-27T00:16:00.362529Z" } }, "id": "86416e4ad56be0c7" }, { "cell_type": "markdown", "source": [ "Looking at this you might think CL has 28k classes. In fact, this is the total number of\n", "classes in the ontology as defined by OWL, where here \"ontology\" means the merged\n", "product that includes bits of GO, Uberon, etc. Confusing, huh?\n", "\n", "Ideally the OBO Foundry would move towards making *base files* the default, but in the absence of this,\n", "we have a few options:\n", "\n", "* Filtering by prefix (using `-P`)\n", "* Grouping using some property such as the prefix.\n", "\n", "We'll try the latter\n" ], "metadata": { "collapsed": false }, "id": "40386173a317f262" }, { "cell_type": "code", "execution_count": 22, "outputs": [], "source": [ "!runoak -i sqlite:obo:cl statistics --group-by-prefix -o output/cl.stats.grouped.tsv -O csv" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-03-27T00:36:40.859841Z", "start_time": "2024-03-27T00:34:13.090696Z" } }, "id": "1099d9e9542df418" }, { "cell_type": "code", "execution_count": 23, "outputs": [ { "data": { "text/plain": " id compared_with agents class_count deprecated_class_count \\\n0 \n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 
\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
idcompared_withagentsclass_countdeprecated_class_countnon_deprecated_class_countclass_count_with_text_definitionsclass_count_without_text_definitionsobject_property_countannotation_property_count...class_count_by_subset_non_informativeclass_count_by_subset_organ_slimclass_count_by_subset_pheno_slimclass_count_by_subset_phenotype_rcnclass_count_by_subset_uberon_slimclass_count_by_subset_unverified_taxonomic_groupingclass_count_by_subset_upper_levelclass_count_by_subset_vertebrate_coremapping_statement_count_by_object_source_GORELmapping_statement_count_subject_by_object_source_GOREL
0<httpNaNNaN0000001...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1<httpsNaNNaN0000001...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2BFONaNNaN150159660...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
3BSPONaNNaN00000240...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
4CARONaNNaN2002020000...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
5CHEBINaNNaN12301231810500...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
6CLNaNNaN29692492720255541430...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
7GONaNNaN7265272637264100...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
8IAONaNNaN60642023...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
9NCBITaxonNaNNaN1380138013800...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
10OMONaNNaN0000002...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
11PATONaNNaN1850185184100...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
12PRNaNNaN7480748747100...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
13RONaNNaN1011024016...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
14UBERONNaNNaN467004670430836200...47.0136.01373.03.0809.01.049.0448.0NaNNaN
15citoNaNNaN0000001...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
16dceNaNNaN0000006...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
17dctermsNaNNaN0000007...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
18foafNaNNaN0000002...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
19oboNaNNaN1010001024116...NaNNaNNaNNaNNaNNaNNaNNaN2.02.0
20oioNaNNaN00000061...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
21owlNaNNaN0000001...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
22rdfsNaNNaN0000003...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
23skosNaNNaN0000001...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
24xsdNaNNaN0000000...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n

25 rows × 512 columns

\n" }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"output/cl.stats.grouped.tsv\", sep=\"\\t\")\n", "df" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-03-27T00:36:40.903385Z", "start_time": "2024-03-27T00:36:40.861941Z" } }, "id": "206115e9a39e8ddf" }, { "cell_type": "markdown", "source": [ "Here we can see the numbers broken down by prefix, and hence by source ontology. The number of classes in the CL row is now accurate.\n", "Note that the numbers in the other rows do not reflect totals for each external ontology as a whole;\n", "they count only the portion that has been merged into CL.\n" ], "metadata": { "collapsed": false }, "id": "12c9fcd0a9363258" }, { "cell_type": "markdown", "source": [ "## Diff stats\n", "\n", "You can also use `--compare-with` to compare stats with a different release of an ontology. Note this\n", "is effectively the same as running `diff` with `--statistics`. See the `diff` command docs for details." ], "metadata": { "collapsed": false }, "id": "bc153c4a21629345" }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [], "metadata": { "collapsed": false }, "id": "76d6c523e691af29" } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 5 }