OAK statistics command

This notebook is intended as a supplement to the main OAK CLI docs.

This notebook provides examples for the statistics command, which can be used to calculate basic descriptive statistics for an ontology.

Help Option

You can get help on any OAK command using --help

[1]:
!runoak statistics --help
Usage: runoak statistics [OPTIONS] [BRANCHES]...

  Shows all descriptive/summary statistics

  Example: -------     runoak -i sqlite:obo:pr statistics

  By default, this will show combined summary statistics for all terms

  You can also break down the statistics in two ways:

  - by a collection of branch roots

  - by a metadata property (e.g. oio:hasOBONamespace, rdfs:isDefinedBy)

  - by prefix (e.g. GO, PR, CL, OBI)

  Example: -------     runoak -i sqlite:obo:pr statistics -p
  oio:hasOBONamespace

  Note: the oio:hasOBONamespace is *not* the same as the ID prefix, it is a
  field that is used by a subset of ontologies to partition classes into broad
  groupings, similar to subsets. Its use is non-standard, yet a lot of
  ontologies use this as the main partitioning mechanism.

  A note on bundled ontologies:

  The standard release many OBO ontologies "bundles" parts of other ontologies
  (formally, the release product includes a merged imports closure of import
  modules). This can complicate generation of statistics. A naive count of all
  classes in the main OBI release will include not only "native" OBI classes,
  but also classes from other ontologies that are bundled in the release.

  For bundled ontologies, we recommend some kind of partitioning, such as via
  defined roots, or via the CURIE prefix, using the ``--group-by-prefix``
  option.

  Output formats:

  The recommended output types for this command are yaml, json, or csv. The
  default output type is yaml, following the SummaryStatistics data model.
  This is naturally nested, as the statistics includes faceted groupings (e.g.
  edge counts are broken down by predicate). When specifying a flat format
  like csv, this is flattened into a single table, with dynamic column names.

  Change statistics:

  You can optionally combine the ontology statistics with a change summary
  relative to another ontology, using the ``--compare-with`` option.

  Example: -------     runoak -i v2.obo statistics --group-by-obo-namespace
  --compare-with v1.obo

  This will also include change stats broken down by KGCL change types. If a
  group-by option is specified, these will be grouped accordingly.

  Python API:

     https://incatools.github.io/ontology-access-kit/interfaces/summary-
     statistics

  Data model:

     https://w3id.org/oak/summary-statistics

Options:
  -O, --output-type [obo|obojson|ofn|rdf|json|yaml|fhirjson|csv|tsv|nl]
                                  Desired output type
  --group-by-property TEXT        group summaries by a metadata property, e.g.
                                  rdfs:isDefinedBy
  --group-by-obo-namespace / --no-group-by-obo-namespace
                                  shortcut for --group-by-property
                                  oio:hasOBONamespace (note this is distinct
                                  from the ID namespace)  [default: no-group-
                                  by-obo-namespace]
  --group-by-prefix / --no-group-by-prefix
                                  shortcut for --group-by-property sh:prefix.
                                  Groups by the prefix of the CURIE  [default:
                                  no-group-by-prefix]
  --group-by-defined-by / --no-group-by-defined-by
                                  shortcut for --group-by-property
                                  rdfs:isDefinedBy. This may be inferred from
                                  prefix if not set explicitly  [default: no-
                                  group-by-defined-by]
  --include-residuals / --no-include-residuals
                                  If true include an OTHER category for terms
                                  that do not have the property
  -X, --compare-with TEXT         Compare with another ontology
  -P, --has-prefix TEXT           filter based on a prefix, e.g. OBI
  -o, --output FILENAME           Output file, e.g. obo file
  --help                          Show this message and exit.

Set up an alias

For convenience we will set up some aliases for use in this notebook

[18]:
alias chebi runoak -i sqlite:obo:chebi

Calculating summary statistics (default YAML output)

We can calculate the summary stats using the statistics command. The output is quite lengthy, so we will use --output (-o) to write it to a YAML file:

[19]:
chebi statistics -o output/chebi.stats.yaml
WARNING:root:bad mapping: KEGG_COMPOUND
WARNING:root:bad mapping: IUPAC
WARNING:root:bad mapping: ChemIDplus
WARNING:root:bad mapping: UniProt
WARNING:root:bad mapping: DrugCentral
WARNING:root:bad mapping: LINCS
WARNING:root:bad mapping: KEGG_DRUG
WARNING:root:bad mapping: ChEBI
WARNING:root:bad mapping: ChEMBL
WARNING:root:bad mapping: DrugBank
WARNING:root:bad mapping: WHO_MedNet
WARNING:root:bad mapping: PDBeChem
WARNING:root:bad mapping: NIST_Chemistry_WebBook
WARNING:root:bad mapping: PPDB
WARNING:root:bad mapping: LIPID_MAPS
WARNING:root:bad mapping: IUPHAR
WARNING:root:bad mapping: HMDB
WARNING:root:bad mapping: SUBMITTER
WARNING:root:bad mapping: MetaCyc
WARNING:root:bad mapping: JCBN
WARNING:root:bad mapping: GlyTouCan
WARNING:root:bad mapping: KNApSAcK
WARNING:root:bad mapping: IUBMB
WARNING:root:bad mapping: CBN
WARNING:root:bad mapping: Alan_Wood's_Pesticides
WARNING:root:bad mapping: GlyGen
WARNING:root:bad mapping: KEGG_GLYCAN
WARNING:root:bad mapping: RESID
WARNING:root:bad mapping: PubChem
WARNING:root:bad mapping: FooDB
WARNING:root:bad mapping: VSDB
WARNING:root:bad mapping: UM-BBD
WARNING:root:bad mapping: MolBase
WARNING:root:bad mapping: COMe
WARNING:root:bad mapping: Beilstein
WARNING:root:bad mapping: Patent
WARNING:root:bad mapping: PDB
WARNING:root:bad mapping: SMID

Note that CHEBI contains many xrefs that OAK reports as bad mappings, hence the long list of warnings above.

Exploring the output

Let’s look at the top of the YAML file:

[20]:
!head -50 output/chebi.stats.yaml
id: AllOntologies
ontologies:
- id: obo:chebi.owl
  version: obo:chebi/231/chebi.owl
was_generated_by:
  started_at_time: '2024-03-26T17:29:56.627143'
  was_associated_with: OAK
  acted_on_behalf_of: cjm
class_count: 217549
deprecated_class_count: 18650
non_deprecated_class_count: 198899
class_count_with_text_definitions: 53575
class_count_without_text_definitions: 163974
object_property_count: 10
annotation_property_count: 37
named_individual_count: 0
subset_count: 3
rdf_triple_count: 6860047
subclass_of_axiom_count: 368285
equivalent_classes_axiom_count: 0
edge_count_by_predicate:
  BFO:0000051:
    facet: BFO:0000051
    filtered_count: 4029
  RO:0000087:
    facet: RO:0000087
    filtered_count: 43636
  obo:chebi#has_functional_parent:
    facet: obo:chebi#has_functional_parent
    filtered_count: 19632
  obo:chebi#has_parent_hydride:
    facet: obo:chebi#has_parent_hydride
    filtered_count: 1799
  obo:chebi#is_conjugate_acid_of:
    facet: obo:chebi#is_conjugate_acid_of
    filtered_count: 8484
  obo:chebi#is_conjugate_base_of:
    facet: obo:chebi#is_conjugate_base_of
    filtered_count: 8484
  obo:chebi#is_enantiomer_of:
    facet: obo:chebi#is_enantiomer_of
    filtered_count: 2754
  obo:chebi#is_substituent_group_from:
    facet: obo:chebi#is_substituent_group_from
    filtered_count: 1287
  obo:chebi#is_tautomer_of:
    facet: obo:chebi#is_tautomer_of
    filtered_count: 1886
  rdfs:subClassOf:
    facet: rdfs:subClassOf

Like all objects produced by OAK, the statistics follow a data dictionary / data model. The one for ontology stats is https://w3id.org/oak/summary-statistics; you can use this link to browse the documentation.

A well-defined data dictionary is necessary for communicating aggregate statistics accurately. When ontology sizes are reported informally, it’s often ambiguous whether “number of terms” means:

  • number of classes, classes plus relationship types, or classes plus some other elements

  • including or excluding deprecated (obsolete) entities

The OAK summary statistics data dictionary aims to provide a standard for ontology reporting.
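As a small illustration of why precise definitions help, the counts reported above can be cross-checked against each other. The sketch below copies a few values from the YAML output shown earlier into a plain dict (in practice you would more likely load the file with a YAML parser) and checks that the deprecated/non-deprecated counts and the with/without-definition counts each add up to class_count. The helper name check_partitions is made up for this example and is not part of OAK.

```python
# Values copied from the head of output/chebi.stats.yaml shown above.
stats = {
    "class_count": 217549,
    "deprecated_class_count": 18650,
    "non_deprecated_class_count": 198899,
    "class_count_with_text_definitions": 53575,
    "class_count_without_text_definitions": 163974,
}


def check_partitions(s):
    """Report which pairs of counts sum to class_count (hypothetical helper)."""
    total = s["class_count"]
    return {
        "deprecation": (
            s["deprecated_class_count"] + s["non_deprecated_class_count"] == total
        ),
        "definitions": (
            s["class_count_with_text_definitions"]
            + s["class_count_without_text_definitions"]
            == total
        ),
    }


result = check_partitions(stats)
print(result)
```

If either check failed, that would suggest the two numbers were computed over different sets of classes, which is exactly the kind of ambiguity a shared data dictionary is meant to avoid.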

YAML allows for nesting, which is a natural way to group things. For example:

edge_count_by_predicate:
  BFO:0000051:
    facet: BFO:0000051
    filtered_count: 4003
  RO:0000087:
    facet: RO:0000087
    filtered_count: 43082

This says that there are 4003 has-part (BFO:0000051) and 43082 has-role (RO:0000087) relationships.

See the OAK guide to relationships to understand more.
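To make the nesting concrete, here is a minimal sketch of how a faceted grouping like the one above could be flattened into single-level keys, in the spirit of the flat formats mentioned in the command help. The function flatten_facets is hypothetical and is not OAK's actual implementation; the input dict simply mirrors the YAML fragment.

```python
def flatten_facets(name, facets):
    """Turn {key: {"facet": key, "filtered_count": n}} into {"<name>_<key>": n}.

    Hypothetical helper for illustration; OAK's own flattening may differ.
    """
    return {f"{name}_{key}": value["filtered_count"] for key, value in facets.items()}


# Mirrors the edge_count_by_predicate fragment shown above.
edge_count_by_predicate = {
    "BFO:0000051": {"facet": "BFO:0000051", "filtered_count": 4003},
    "RO:0000087": {"facet": "RO:0000087", "filtered_count": 43082},
}

flat = flatten_facets("edge_count_by_predicate", edge_count_by_predicate)
for column, count in flat.items():
    print(column, count)
```

Each facet becomes its own key, so the number of keys depends on which predicates occur in the ontology.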

Mapping Stats

Further on in the YAML we can see mapping stats. See https://w3id.org/sssom to understand the OAK mapping data model.

These are broken down:

  • by mapping predicate (for many ontologies this is only oio:hasDbXref)

  • by mapping object source (i.e. the database or ontology that is mapped to)

[21]:
!grep -A40 ^mapping_statement_count output/chebi.stats.yaml
mapping_statement_count_by_predicate:
  oio:hasDbXref:
    facet: oio:hasDbXref
    filtered_count: 345271
mapping_statement_count_by_object_source:
  BFO:
    facet: BFO
    filtered_count: 1
  RO:
    facet: RO
    filtered_count: 1
  KNApSAcK:
    facet: KNApSAcK
    filtered_count: 5185
  KEGG:
    facet: KEGG
    filtered_count: 22228
  CAS:
    facet: CAS
    filtered_count: 28938
  KEGG_COMPOUND:
    facet: KEGG_COMPOUND
    filtered_count: 19870
  Beilstein:
    facet: Beilstein
    filtered_count: 9187
  IUPAC:
    facet: IUPAC
    filtered_count: 61013
  ChemIDplus:
    facet: ChemIDplus
    filtered_count: 33383
  UniProt:
    facet: UniProt
    filtered_count: 16047
  LINCS:
    facet: LINCS
    filtered_count: 41392
  Drug_Central:
    facet: Drug_Central
    filtered_count: 3784
  DrugCentral:
    facet: DrugCentral
    filtered_count: 6202
  Wikipedia:
--
mapping_statement_count_subject_by_object_source:
  BFO:
    facet: BFO
    filtered_count: 1
  RO:
    facet: RO
    filtered_count: 1
  KNApSAcK:
    facet: KNApSAcK
    filtered_count: 5091
  KEGG:
    facet: KEGG
    filtered_count: 20233
  CAS:
    facet: CAS
    filtered_count: 28615
  KEGG_COMPOUND:
    facet: KEGG_COMPOUND
    filtered_count: 19870
  Beilstein:
    facet: Beilstein
    filtered_count: 8704
  IUPAC:
    facet: IUPAC
    filtered_count: 61013
  ChemIDplus:
    facet: ChemIDplus
    filtered_count: 33383
  UniProt:
    facet: UniProt
    filtered_count: 16047
  LINCS:
    facet: LINCS
    filtered_count: 41389
  Drug_Central:
    facet: Drug_Central
    filtered_count: 3783
  DrugCentral:
    facet: DrugCentral
    filtered_count: 6202
  Wikipedia:

As expected, CHEBI does not make use of SKOS mapping predicates, and mappings are dominated by databases such as KEGG and CAS.
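The two per-source blocks in the output can be compared directly. The sketch below copies a few of the values shown above into dicts and computes, for each source, how many mapping statements exceed the subject count. It assumes that mapping_statement_count_subject_by_object_source counts distinct subjects with at least one mapping to that source; check the data model documentation before relying on that reading.

```python
# Values copied from the grep output above (a subset of sources).
statements_by_source = {
    "KNApSAcK": 5185,
    "KEGG": 22228,
    "CAS": 28938,
    "Beilstein": 9187,
    "IUPAC": 61013,
}
# Assumption: these are counts of distinct subjects per source.
subjects_by_source = {
    "KNApSAcK": 5091,
    "KEGG": 20233,
    "CAS": 28615,
    "Beilstein": 8704,
    "IUPAC": 61013,
}

# Statements beyond one per subject, i.e. subjects mapped more than once.
extra = {
    source: statements_by_source[source] - subjects_by_source[source]
    for source in statements_by_source
}

# Sources ordered by number of mapping statements, largest first.
ranked = sorted(statements_by_source.items(), key=lambda item: item[1], reverse=True)

for source, count in ranked:
    print(f"{source}\t{count}\t{extra[source]}")
```

Under that assumption, a difference of zero means every subject has exactly one mapping to the source, while a positive difference means some subjects map to it more than once.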

TSV Output

YAML is not a very natural format for doing further data science or statistical processing.

For that we can use the csv option (which actually defaults to tsv…)

[9]:
chebi statistics -o output/chebi.stats.tsv -O csv
WARNING:root:bad mapping: KEGG_COMPOUND
WARNING:root:bad mapping: IUPAC
WARNING:root:bad mapping: ChemIDplus
WARNING:root:bad mapping: UniProt
WARNING:root:bad mapping: DrugCentral
WARNING:root:bad mapping: LINCS
WARNING:root:bad mapping: KEGG_DRUG
WARNING:root:bad mapping: ChEBI
WARNING:root:bad mapping: ChEMBL
WARNING:root:bad mapping: DrugBank
WARNING:root:bad mapping: WHO_MedNet
WARNING:root:bad mapping: PDBeChem
WARNING:root:bad mapping: NIST_Chemistry_WebBook
WARNING:root:bad mapping: LIPID_MAPS
WARNING:root:bad mapping: IUPHAR
WARNING:root:bad mapping: HMDB
WARNING:root:bad mapping: SUBMITTER
WARNING:root:bad mapping: MetaCyc
WARNING:root:bad mapping: JCBN
WARNING:root:bad mapping: GlyTouCan
WARNING:root:bad mapping: KNApSAcK
WARNING:root:bad mapping: IUBMB
WARNING:root:bad mapping: EMBL
WARNING:root:bad mapping: CBN
WARNING:root:bad mapping: Alan_Wood's_Pesticides
WARNING:root:bad mapping: GlyGen
WARNING:root:bad mapping: PPDB
WARNING:root:bad mapping: KEGG_GLYCAN
WARNING:root:bad mapping: RESID
WARNING:root:bad mapping: PubChem
WARNING:root:bad mapping: FooDB
WARNING:root:bad mapping: VSDB
WARNING:root:bad mapping: UM-BBD
WARNING:root:bad mapping: MolBase
WARNING:root:bad mapping: COMe
WARNING:root:bad mapping: EBI_Industry_Programme
WARNING:root:bad mapping: EuroFIR
WARNING:root:bad mapping: Beilstein
WARNING:root:bad mapping: Patent
WARNING:root:bad mapping: PDB
WARNING:root:bad mapping: SMID

To illustrate this we will use pandas:

[11]:
import pandas as pd
df = pd.read_csv("output/chebi.stats.tsv", sep="\t")
df

[11]:
id compared_with agents class_count deprecated_class_count non_deprecated_class_count class_count_with_text_definitions class_count_without_text_definitions object_property_count annotation_property_count ... mapping_statement_count_subject_by_object_source_CTX mapping_statement_count_subject_by_object_source_SMID class_count_by_subset_1_STAR class_count_by_subset_2_STAR class_count_by_subset_3_STAR was_generated_by_started_at_time was_generated_by_was_associated_with was_generated_by_acted_on_behalf_of ontologies_id ontologies_version
0 AllOntologies NaN NaN 185295 18628 166667 53049 132246 10 37 ... 3 307 2945 102919 60803 2024-03-26T17:07:33.778117 OAK cjm obo:chebi.owl obo:chebi/226/chebi.owl

1 rows × 177 columns

This format is useful if you have multiple ontologies (see later). But for a single ontology it’s more convenient to melt this:

[13]:
mdf = df.melt(var_name='Property', value_name='Value')
mdf[0:40]
[13]:
Property Value
0 id AllOntologies
1 compared_with NaN
2 agents NaN
3 class_count 185295
4 deprecated_class_count 18628
5 non_deprecated_class_count 166667
6 class_count_with_text_definitions 53049
7 class_count_without_text_definitions 132246
8 object_property_count 10
9 annotation_property_count 37
10 named_individual_count 0
11 subset_count 3
12 rdf_triple_count 6158555
13 subclass_of_axiom_count 330989
14 equivalent_classes_axiom_count 0
15 entailed_edge_count_by_predicate {}
16 distinct_synonym_count 332744
17 synonym_statement_count 346486
18 class_count_by_category {}
19 contributor_summary {}
20 change_summary {}
21 merged_class_query 18559
22 deprecated_property_count 0
23 edge_count_by_predicate_BFO:0000051 4003
24 edge_count_by_predicate_RO:0000087 43082
25 edge_count_by_predicate_has_functional_parent 18664
26 edge_count_by_predicate_has_parent_hydride 1764
27 edge_count_by_predicate_is_conjugate_acid_of 8434
28 edge_count_by_predicate_is_conjugate_base_of 8434
29 edge_count_by_predicate_is_enantiomer_of 2728
30 edge_count_by_predicate_is_substituent_group_from 1284
31 edge_count_by_predicate_is_tautomer_of 1884
32 edge_count_by_predicate_rdfs:subClassOf 240712
33 edge_count_by_predicate_rdfs:subPropertyOf 6
34 synonym_statement_count_by_predicate_hasExactS... 100585
35 synonym_statement_count_by_predicate_hasRelate... 234002
36 mapping_statement_count_by_predicate_hasDbXref 317151
37 mapping_statement_count_by_object_source_BFO 1
38 mapping_statement_count_by_object_source_RO 1
39 mapping_statement_count_by_object_source_KNApSAcK 5152

Note that this uses a very generic way of flattening the YAML, so some columns make less sense out of context. For example, the “agent” field belongs to a parent object that describes which “agent” generated the stats (TODO: this should say “oaklib”).
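Because the flattened column names encode the original nesting, a facet can be recovered by stripping the shared prefix. The following sketch uses only the standard library and a tiny inline stand-in for output/chebi.stats.tsv (four columns, with values taken from the table above) so that it runs on its own; the real file has many more columns.

```python
import csv
import io

# Tiny stand-in for output/chebi.stats.tsv: one header line, one data row.
tsv = (
    "id\tclass_count\t"
    "edge_count_by_predicate_BFO:0000051\tedge_count_by_predicate_RO:0000087\n"
    "AllOntologies\t185295\t4003\t43082\n"
)

# The file holds a single row, so take the first record.
row = next(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# Keep only the columns for one facet and strip the shared prefix.
prefix = "edge_count_by_predicate_"
edge_counts = {
    column[len(prefix):]: int(value)
    for column, value in row.items()
    if column.startswith(prefix)
}
print(edge_counts)
```

With the melted pandas frame from the previous cell, a similar selection should be possible with mdf[mdf["Property"].str.startswith(prefix)].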

Multi-ontology merges

Many OBO ontologies bundle portions of other ontologies with their main release. This can be confusing! For more details see OWL Format Variants in the obook.

As an example, consider naively calculating stats for the standard release of the Cell Ontology (CL):

[14]:
!runoak -i sqlite:obo:cl statistics | head -20
WARNING:root:bad mapping: GSE137537
WARNING:root:bad mapping: 10.1007/s004180050142
WARNING:root:bad mapping: NIFSTD
WARNING:root:bad mapping: Noradrenergic_cell_group_A6&oldid=981960774
WARNING:root:bad mapping: _Chapter_3
WARNING:root:bad mapping: A12.2.15.042
WARNING:root:bad mapping: PHENOSCAPE
id: AllOntologies
ontologies:
- id: obo:cl.owl
  version: obo:cl/releases/2023-09-21/cl.owl
  version_info: '2023-09-21'
was_generated_by:
  started_at_time: '2024-03-26T17:16:02.669245'
  was_associated_with: OAK
  acted_on_behalf_of: cjm
class_count: 28330
deprecated_class_count: 261
non_deprecated_class_count: 28069
class_count_with_text_definitions: 15110
class_count_without_text_definitions: 13220
object_property_count: 297
annotation_property_count: 241
named_individual_count: 18
subset_count: 63
rdf_triple_count: 681623
subclass_of_axiom_count: 44142

Looking at this you might think CL has 28k classes. In fact, this is the total number of classes in the ontology as defined by OWL, where here “ontology” means the merged product that includes bits of GO, Uberon, etc. Confusing, huh?
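The effect is easy to see if you count identifiers by their CURIE prefix. The sketch below uses a short hand-picked list of CURIEs as a stand-in for the classes in a merged release; the list and the helper prefix_of are illustrative only and do not come from OAK.

```python
from collections import Counter

# Illustrative stand-in for the classes found in a merged release.
curies = ["CL:0000540", "CL:0000000", "GO:0005634", "UBERON:0000955"]


def prefix_of(curie):
    """Return the part of a CURIE before the first colon (hypothetical helper)."""
    return curie.split(":", 1)[0]


counts = Counter(prefix_of(curie) for curie in curies)
print(counts)
```

A naive total counts all four classes, even though only the two with the CL prefix are native to the Cell Ontology.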

Ideally the OBO Foundry would move towards making base files the default, but in the absence of this, we have a few options:

  • Filtering by prefix (using -P)

  • Grouping using some property such as the prefix.

We’ll try the latter:

[22]:
!runoak -i sqlite:obo:cl statistics --group-by-prefix -o output/cl.stats.grouped.tsv -O csv
[23]:
df = pd.read_csv("output/cl.stats.grouped.tsv", sep="\t")
df
[23]:
id compared_with agents class_count deprecated_class_count non_deprecated_class_count class_count_with_text_definitions class_count_without_text_definitions object_property_count annotation_property_count ... class_count_by_subset_non_informative class_count_by_subset_organ_slim class_count_by_subset_pheno_slim class_count_by_subset_phenotype_rcn class_count_by_subset_uberon_slim class_count_by_subset_unverified_taxonomic_grouping class_count_by_subset_upper_level class_count_by_subset_vertebrate_core mapping_statement_count_by_object_source_GOREL mapping_statement_count_subject_by_object_source_GOREL
0 <http NaN NaN 0 0 0 0 0 0 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 <https NaN NaN 0 0 0 0 0 0 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 BFO NaN NaN 15 0 15 9 6 6 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 BSPO NaN NaN 0 0 0 0 0 24 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 CARO NaN NaN 20 0 20 20 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 CHEBI NaN NaN 123 0 123 18 105 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 CL NaN NaN 2969 249 2720 2555 414 3 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 GO NaN NaN 7265 2 7263 7264 1 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 IAO NaN NaN 6 0 6 4 2 0 23 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NCBITaxon NaN NaN 138 0 138 0 138 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 OMO NaN NaN 0 0 0 0 0 0 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 PATO NaN NaN 185 0 185 184 1 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 PR NaN NaN 748 0 748 747 1 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 RO NaN NaN 1 0 1 1 0 240 16 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 UBERON NaN NaN 4670 0 4670 4308 362 0 0 ... 47.0 136.0 1373.0 3.0 809.0 1.0 49.0 448.0 NaN NaN
15 cito NaN NaN 0 0 0 0 0 0 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 dce NaN NaN 0 0 0 0 0 0 6 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 dcterms NaN NaN 0 0 0 0 0 0 7 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 foaf NaN NaN 0 0 0 0 0 0 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 obo NaN NaN 10 10 0 0 10 24 116 ... NaN NaN NaN NaN NaN NaN NaN NaN 2.0 2.0
20 oio NaN NaN 0 0 0 0 0 0 61 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
21 owl NaN NaN 0 0 0 0 0 0 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22 rdfs NaN NaN 0 0 0 0 0 0 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
23 skos NaN NaN 0 0 0 0 0 0 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24 xsd NaN NaN 0 0 0 0 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

25 rows × 512 columns

Here we can see the numbers broken down by ontology. The number of classes in the CL row is now accurate. Note that the numbers in the other rows don’t reflect totals for each external ontology as a whole; they only count the portion that has been merged into CL.
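To put the CL row in context, the per-prefix class counts can be ranked and the native share computed. The sketch below copies the non-zero class_count values from the table above into a dict so that it is self-contained; with the DataFrame loaded in the notebook you would read the same values from the id and class_count columns instead.

```python
# Non-zero class_count values copied from the grouped table above.
class_count_by_prefix = {
    "BFO": 15,
    "CARO": 20,
    "CHEBI": 123,
    "CL": 2969,
    "GO": 7265,
    "IAO": 6,
    "NCBITaxon": 138,
    "PATO": 185,
    "PR": 748,
    "RO": 1,
    "UBERON": 4670,
    "obo": 10,
}

# Total over the rows of the grouped table.
total = sum(class_count_by_prefix.values())

# Fraction of those classes that carry the native CL prefix.
native_share = class_count_by_prefix["CL"] / total

# Prefixes ordered by class count, largest first.
ranked = sorted(class_count_by_prefix.items(), key=lambda item: item[1], reverse=True)

print(f"CL share of grouped classes: {native_share:.1%}")
for prefix, count in ranked[:3]:
    print(prefix, count)
```

In this release the GO and UBERON rows each contain more classes than the CL row, which is why the ungrouped total is so misleading.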

Diff stats

You can also use --compare-with to compare stats with a different release of an ontology. Note this is effectively the same as running diff with --statistics. See the diff docs for details.
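As a rough, manual counterpart, two statistics outputs can also be compared by simple subtraction. The sketch below uses the values from the two CHEBI runs shown earlier in this notebook (the TSV run on release 226 and the YAML run on release 231). This is plain arithmetic on summary counts and is not what --compare-with does, which reports changes by KGCL change type.

```python
# Counts from the earlier TSV run (obo:chebi/226/chebi.owl).
older = {
    "class_count": 185295,
    "deprecated_class_count": 18628,
    "non_deprecated_class_count": 166667,
    "rdf_triple_count": 6158555,
    "subclass_of_axiom_count": 330989,
}
# Counts from the earlier YAML run (obo:chebi/231/chebi.owl).
newer = {
    "class_count": 217549,
    "deprecated_class_count": 18650,
    "non_deprecated_class_count": 198899,
    "rdf_triple_count": 6860047,
    "subclass_of_axiom_count": 368285,
}

# Difference per statistic, newer minus older.
delta = {key: newer[key] - older[key] for key in older}

for key, change in delta.items():
    print(f"{key}\t{change:+d}")
```

Such deltas only say how much each count moved; they cannot distinguish, for example, a renamed class from one class being removed and another added.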
