OAK statistics command

This notebook is intended as a supplement to the main OAK CLI docs.

This notebook provides examples for the statistics command, which can be used to calculate basic descriptive statistics for an ontology.

Help Option

You can get help on any OAK command using --help:

[1]:

!runoak statistics --help

Usage: runoak statistics [OPTIONS] [BRANCHES]...

  Shows all descriptive/summary statistics

  Example: -------     runoak -i sqlite:obo:pr statistics

  By default, this will show combined summary statistics for all terms

  You can also break down the statistics in two ways:

  - by a collection of branch roots

  - by a metadata property (e.g. oio:hasOBONamespace, rdfs:isDefinedBy)

  - by prefix (e.g. GO, PR, CL, OBI)

  Example: -------     runoak -i sqlite:obo:pr statistics -p
  oio:hasOBONamespace

  Note: the oio:hasOBONamespace is *not* the same as the ID prefix, it is a
  field that is used by a subset of ontologies to partition classes into broad
  groupings, similar to subsets. Its use is non-standard, yet a lot of
  ontologies use this as the main partitioning mechanism.

  A note on bundled ontologies:

  The standard release many OBO ontologies "bundles" parts of other ontologies
  (formally, the release product includes a merged imports closure of import
  modules). This can complicate generation of statistics. A naive count of all
  classes in the main OBI release will include not only "native" OBI classes,
  but also classes from other ontologies that are bundled in the release.

  For bundled ontologies, we recommend some kind of partitioning, such as via
  defined roots, or via the CURIE prefix, using the ``--group-by-prefix``
  option.

  Output formats:

  The recommended output types for this command are yaml, json, or csv. The
  default output type is yaml, following the SummaryStatistics data model.
  This is naturally nested, as the statistics includes faceted groupings (e.g.
  edge counts are broken down by predicate). When specifying a flat format
  like csv, this is flattened into a single table, with dynamic column names.

  Change statistics:

  You can optionally combine the ontology statistics with a change summary
  relative to another ontology, using the ``--compare-with`` option.

  Example: -------     runoak -i v2.obo statistics --group-by-obo-namespace
  --compare-with v1.obo

  This will also include change stats broken down by KGCL change types. If a
  group-by option is specified, these will be grouped accordingly.

  Python API:

     https://incatools.github.io/ontology-access-kit/interfaces/summary-
     statistics

  Data model:

     https://w3id.org/oak/summary-statistics

Options:
  -O, --output-type [obo|obojson|ofn|rdf|json|yaml|fhirjson|csv|tsv|nl]
                                  Desired output type
  --group-by-property TEXT        group summaries by a metadata property, e.g.
                                  rdfs:isDefinedBy
  --group-by-obo-namespace / --no-group-by-obo-namespace
                                  shortcut for --group-by-property
                                  oio:hasOBONamespace (note this is distinct
                                  from the ID namespace)  [default: no-group-
                                  by-obo-namespace]
  --group-by-prefix / --no-group-by-prefix
                                  shortcut for --group-by-property sh:prefix.
                                  Groups by the prefix of the CURIE  [default:
                                  no-group-by-prefix]
  --group-by-defined-by / --no-group-by-defined-by
                                  shortcut for --group-by-property
                                  rdfs:isDefinedBy. This may be inferred from
                                  prefix if not set explicitly  [default: no-
                                  group-by-defined-by]
  --include-residuals / --no-include-residuals
                                  If true include an OTHER category for terms
                                  that do not have the property
  -X, --compare-with TEXT         Compare with another ontology
  -P, --has-prefix TEXT           filter based on a prefix, e.g. OBI
  -o, --output FILENAME           Output file, e.g. obo file
  --help                          Show this message and exit.

Set up an alias

For convenience, we will set up an alias for use in this notebook:

[18]:

alias chebi runoak -i sqlite:obo:chebi

Calculating summary statistics (default YAML output)

We can calculate the summary stats using the statistics command. The output is quite lengthy, so we will use --output (-o) to direct it to a YAML file:

[19]:

chebi statistics -o output/chebi.stats.yaml

WARNING:root:bad mapping: KEGG_COMPOUND
WARNING:root:bad mapping: IUPAC
WARNING:root:bad mapping: ChemIDplus
WARNING:root:bad mapping: UniProt
WARNING:root:bad mapping: DrugCentral
WARNING:root:bad mapping: LINCS
WARNING:root:bad mapping: KEGG_DRUG
WARNING:root:bad mapping: ChEBI
WARNING:root:bad mapping: ChEMBL
WARNING:root:bad mapping: DrugBank
WARNING:root:bad mapping: WHO_MedNet
WARNING:root:bad mapping: PDBeChem
WARNING:root:bad mapping: NIST_Chemistry_WebBook
WARNING:root:bad mapping: PPDB
WARNING:root:bad mapping: LIPID_MAPS
WARNING:root:bad mapping: IUPHAR
WARNING:root:bad mapping: HMDB
WARNING:root:bad mapping: SUBMITTER
WARNING:root:bad mapping: MetaCyc
WARNING:root:bad mapping: JCBN
WARNING:root:bad mapping: GlyTouCan
WARNING:root:bad mapping: KNApSAcK
WARNING:root:bad mapping: IUBMB
WARNING:root:bad mapping: CBN
WARNING:root:bad mapping: Alan_Wood's_Pesticides
WARNING:root:bad mapping: GlyGen
WARNING:root:bad mapping: KEGG_GLYCAN
WARNING:root:bad mapping: RESID
WARNING:root:bad mapping: PubChem
WARNING:root:bad mapping: FooDB
WARNING:root:bad mapping: VSDB
WARNING:root:bad mapping: UM-BBD
WARNING:root:bad mapping: MolBase
WARNING:root:bad mapping: COMe
WARNING:root:bad mapping: Beilstein
WARNING:root:bad mapping: Patent
WARNING:root:bad mapping: PDB
WARNING:root:bad mapping: SMID

Note that CHEBI has a lot of bad xrefs, which is why so many warnings are printed.

Exploring the output

Let’s look at the top of the YAML file:

[20]:

!head -50 output/chebi.stats.yaml

id: AllOntologies
ontologies:
- id: obo:chebi.owl
  version: obo:chebi/231/chebi.owl
was_generated_by:
  started_at_time: '2024-03-26T17:29:56.627143'
  was_associated_with: OAK
  acted_on_behalf_of: cjm
class_count: 217549
deprecated_class_count: 18650
non_deprecated_class_count: 198899
class_count_with_text_definitions: 53575
class_count_without_text_definitions: 163974
object_property_count: 10
annotation_property_count: 37
named_individual_count: 0
subset_count: 3
rdf_triple_count: 6860047
subclass_of_axiom_count: 368285
equivalent_classes_axiom_count: 0
edge_count_by_predicate:
  BFO:0000051:
    facet: BFO:0000051
    filtered_count: 4029
  RO:0000087:
    facet: RO:0000087
    filtered_count: 43636
  obo:chebi#has_functional_parent:
    facet: obo:chebi#has_functional_parent
    filtered_count: 19632
  obo:chebi#has_parent_hydride:
    facet: obo:chebi#has_parent_hydride
    filtered_count: 1799
  obo:chebi#is_conjugate_acid_of:
    facet: obo:chebi#is_conjugate_acid_of
    filtered_count: 8484
  obo:chebi#is_conjugate_base_of:
    facet: obo:chebi#is_conjugate_base_of
    filtered_count: 8484
  obo:chebi#is_enantiomer_of:
    facet: obo:chebi#is_enantiomer_of
    filtered_count: 2754
  obo:chebi#is_substituent_group_from:
    facet: obo:chebi#is_substituent_group_from
    filtered_count: 1287
  obo:chebi#is_tautomer_of:
    facet: obo:chebi#is_tautomer_of
    filtered_count: 1886
  rdfs:subClassOf:
    facet: rdfs:subClassOf

Like all objects produced by OAK, the statistics follow a data dictionary / data model. The one for ontology stats is https://w3id.org/oak/summary-statistics; you can use this link to browse the documentation.

A well-defined data dictionary is necessary for communicating aggregate statistics accurately. When ontology sizes are reported informally, it is often ambiguous whether “number of terms” means:

number of classes, classes plus relationship types, or classes plus some other elements
including or excluding deprecated (obsolete) entities

The OAK summary statistics data dictionary aims to provide a standard for ontology reporting.
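
As a quick illustration of why these distinctions matter, the counts in the YAML excerpt above can be checked against each other. This is a minimal sketch using values copied by hand from that output (it does not read the file). The name `stats` is only for illustration, and the assumption that the deprecated/non-deprecated and with/without-definition counts each partition `class_count` is inferred from the numbers, not taken from the data model documentation.

```python
# Values copied by hand from the ChEBI YAML excerpt above (not loaded from the file).
stats = {
    "class_count": 217549,
    "deprecated_class_count": 18650,
    "non_deprecated_class_count": 198899,
    "class_count_with_text_definitions": 53575,
    "class_count_without_text_definitions": 163974,
}

# Deprecated plus non-deprecated classes should give the total class count.
by_status = stats["deprecated_class_count"] + stats["non_deprecated_class_count"]

# Classes with plus without text definitions should also give the total.
by_definition = (
    stats["class_count_with_text_definitions"]
    + stats["class_count_without_text_definitions"]
)

print(by_status == stats["class_count"])
print(by_definition == stats["class_count"])
```

Both checks hold for this release, so an informal "number of terms" could reasonably be quoted as either 217549 or 198899 depending on whether obsolete classes are counted.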

YAML allows for nesting, which is a natural way to group things; for example:

edge_count_by_predicate:
  BFO:0000051:
    facet: BFO:0000051
    filtered_count: 4003
  RO:0000087:
    facet: RO:0000087
    filtered_count: 43082

This says that there are 4003 has-part (BFO:0000051) and 43082 has-role (RO:0000087) relationships.

See the OAK guide to relationships to understand more.

Mapping Stats

Further on in the YAML we can see mapping stats. See https://w3id.org/sssom to understand the OAK mapping data model.

These are broken down:

by mapping predicate (for many ontologies this is only oio:hasDbXref)
by mapping object source (i.e. the database or ontology that is mapped to)

[21]:

!grep -A40 ^mapping_statement_count output/chebi.stats.yaml

mapping_statement_count_by_predicate:
  oio:hasDbXref:
    facet: oio:hasDbXref
    filtered_count: 345271
mapping_statement_count_by_object_source:
  BFO:
    facet: BFO
    filtered_count: 1
  RO:
    facet: RO
    filtered_count: 1
  KNApSAcK:
    facet: KNApSAcK
    filtered_count: 5185
  KEGG:
    facet: KEGG
    filtered_count: 22228
  CAS:
    facet: CAS
    filtered_count: 28938
  KEGG_COMPOUND:
    facet: KEGG_COMPOUND
    filtered_count: 19870
  Beilstein:
    facet: Beilstein
    filtered_count: 9187
  IUPAC:
    facet: IUPAC
    filtered_count: 61013
  ChemIDplus:
    facet: ChemIDplus
    filtered_count: 33383
  UniProt:
    facet: UniProt
    filtered_count: 16047
  LINCS:
    facet: LINCS
    filtered_count: 41392
  Drug_Central:
    facet: Drug_Central
    filtered_count: 3784
  DrugCentral:
    facet: DrugCentral
    filtered_count: 6202
  Wikipedia:
--
mapping_statement_count_subject_by_object_source:
  BFO:
    facet: BFO
    filtered_count: 1
  RO:
    facet: RO
    filtered_count: 1
  KNApSAcK:
    facet: KNApSAcK
    filtered_count: 5091
  KEGG:
    facet: KEGG
    filtered_count: 20233
  CAS:
    facet: CAS
    filtered_count: 28615
  KEGG_COMPOUND:
    facet: KEGG_COMPOUND
    filtered_count: 19870
  Beilstein:
    facet: Beilstein
    filtered_count: 8704
  IUPAC:
    facet: IUPAC
    filtered_count: 61013
  ChemIDplus:
    facet: ChemIDplus
    filtered_count: 33383
  UniProt:
    facet: UniProt
    filtered_count: 16047
  LINCS:
    facet: LINCS
    filtered_count: 41389
  Drug_Central:
    facet: Drug_Central
    filtered_count: 3783
  DrugCentral:
    facet: DrugCentral
    filtered_count: 6202
  Wikipedia:

As expected, CHEBI does not make use of SKOS mapping predicates, and mappings are dominated by databases such as KEGG and CAS.
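
To see which sources carry the most mapping statements among those shown, the counts can be ranked. The sketch below uses only the sources visible in the truncated grep output above, copied by hand. Sources further down the file (such as Wikipedia, whose count is cut off) are not included, so this is not a full ranking.

```python
# Statement counts per object source, copied from the truncated excerpt above.
counts = {
    "BFO": 1, "RO": 1, "KNApSAcK": 5185, "KEGG": 22228, "CAS": 28938,
    "KEGG_COMPOUND": 19870, "Beilstein": 9187, "IUPAC": 61013,
    "ChemIDplus": 33383, "UniProt": 16047, "LINCS": 41392,
    "Drug_Central": 3784, "DrugCentral": 6202,
}

# Sort sources by statement count, largest first.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

for source, n in ranked[:5]:
    print(f"{source}\t{n}")
```

Note that `Drug_Central` and `DrugCentral` are reported as separate facets because the grouping is by the literal source string.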

TSV Output

YAML is not a very natural format for further data science or statistical processing.

For that we can use the csv option (which actually defaults to tsv…)

[9]:

chebi statistics -o output/chebi.stats.tsv -O csv

WARNING:root:bad mapping: KEGG_COMPOUND
WARNING:root:bad mapping: IUPAC
WARNING:root:bad mapping: ChemIDplus
WARNING:root:bad mapping: UniProt
WARNING:root:bad mapping: DrugCentral
WARNING:root:bad mapping: LINCS
WARNING:root:bad mapping: KEGG_DRUG
WARNING:root:bad mapping: ChEBI
WARNING:root:bad mapping: ChEMBL
WARNING:root:bad mapping: DrugBank
WARNING:root:bad mapping: WHO_MedNet
WARNING:root:bad mapping: PDBeChem
WARNING:root:bad mapping: NIST_Chemistry_WebBook
WARNING:root:bad mapping: LIPID_MAPS
WARNING:root:bad mapping: IUPHAR
WARNING:root:bad mapping: HMDB
WARNING:root:bad mapping: SUBMITTER
WARNING:root:bad mapping: MetaCyc
WARNING:root:bad mapping: JCBN
WARNING:root:bad mapping: GlyTouCan
WARNING:root:bad mapping: KNApSAcK
WARNING:root:bad mapping: IUBMB
WARNING:root:bad mapping: EMBL
WARNING:root:bad mapping: CBN
WARNING:root:bad mapping: Alan_Wood's_Pesticides
WARNING:root:bad mapping: GlyGen
WARNING:root:bad mapping: PPDB
WARNING:root:bad mapping: KEGG_GLYCAN
WARNING:root:bad mapping: RESID
WARNING:root:bad mapping: PubChem
WARNING:root:bad mapping: FooDB
WARNING:root:bad mapping: VSDB
WARNING:root:bad mapping: UM-BBD
WARNING:root:bad mapping: MolBase
WARNING:root:bad mapping: COMe
WARNING:root:bad mapping: EBI_Industry_Programme
WARNING:root:bad mapping: EuroFIR
WARNING:root:bad mapping: Beilstein
WARNING:root:bad mapping: Patent
WARNING:root:bad mapping: PDB
WARNING:root:bad mapping: SMID

To illustrate this we will use pandas:

[11]:

import pandas as pd
df = pd.read_csv("output/chebi.stats.tsv", sep="\t")
df

[11]:

	id	compared_with	agents	class_count	deprecated_class_count	non_deprecated_class_count	class_count_with_text_definitions	class_count_without_text_definitions	object_property_count	annotation_property_count	...	mapping_statement_count_subject_by_object_source_CTX	mapping_statement_count_subject_by_object_source_SMID	class_count_by_subset_1_STAR	class_count_by_subset_2_STAR	class_count_by_subset_3_STAR	was_generated_by_started_at_time	was_generated_by_was_associated_with	was_generated_by_acted_on_behalf_of	ontologies_id	ontologies_version
0	AllOntologies	NaN	NaN	185295	18628	166667	53049	132246	10	37	...	3	307	2945	102919	60803	2024-03-26T17:07:33.778117	OAK	cjm	obo:chebi.owl	obo:chebi/226/chebi.owl

1 rows × 177 columns

This format is useful if you have multiple ontologies (see later). But for a single ontology it’s more convenient to melt this:

[13]:

mdf = df.melt(var_name='Property', value_name='Value')
mdf[0:40]

[13]:

	Property	Value
0	id	AllOntologies
1	compared_with	NaN
2	agents	NaN
3	class_count	185295
4	deprecated_class_count	18628
5	non_deprecated_class_count	166667
6	class_count_with_text_definitions	53049
7	class_count_without_text_definitions	132246
8	object_property_count	10
9	annotation_property_count	37
10	named_individual_count	0
11	subset_count	3
12	rdf_triple_count	6158555
13	subclass_of_axiom_count	330989
14	equivalent_classes_axiom_count	0
15	entailed_edge_count_by_predicate	{}
16	distinct_synonym_count	332744
17	synonym_statement_count	346486
18	class_count_by_category	{}
19	contributor_summary	{}
20	change_summary	{}
21	merged_class_query	18559
22	deprecated_property_count	0
23	edge_count_by_predicate_BFO:0000051	4003
24	edge_count_by_predicate_RO:0000087	43082
25	edge_count_by_predicate_has_functional_parent	18664
26	edge_count_by_predicate_has_parent_hydride	1764
27	edge_count_by_predicate_is_conjugate_acid_of	8434
28	edge_count_by_predicate_is_conjugate_base_of	8434
29	edge_count_by_predicate_is_enantiomer_of	2728
30	edge_count_by_predicate_is_substituent_group_from	1284
31	edge_count_by_predicate_is_tautomer_of	1884
32	edge_count_by_predicate_rdfs:subClassOf	240712
33	edge_count_by_predicate_rdfs:subPropertyOf	6
34	synonym_statement_count_by_predicate_hasExactS...	100585
35	synonym_statement_count_by_predicate_hasRelate...	234002
36	mapping_statement_count_by_predicate_hasDbXref	317151
37	mapping_statement_count_by_object_source_BFO	1
38	mapping_statement_count_by_object_source_RO	1
39	mapping_statement_count_by_object_source_KNApSAcK	5152

Note that this uses a very generic way of flattening the YAML, so some columns make less sense out of context. For example, the “agent” field belongs to a parent object that describes which “agent” generated the stats (TODO: this should say “oaklib”).
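
The flattening can be pictured as joining each nested key path with underscores. The function below is a hypothetical sketch of that idea, not OAK's actual implementation: `flatten` is an invented name, and OAK's real column names differ in detail (for example, the table above shows `has_functional_parent`, not the full `obo:chebi#has_functional_parent`). The sketch assumes that each facet object contributes only its `filtered_count`, which matches the column values seen above.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into {joined_key: value}, joining keys with '_'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict) and "filtered_count" in value:
            # Facet object: keep only its count under the joined key.
            flat[name] = value["filtered_count"]
        elif isinstance(value, dict) and value:
            # Non-empty nested dict: recurse with the extended prefix.
            flat.update(flatten(value, name))
        else:
            # Scalars and empty dicts are kept as-is.
            flat[name] = value
    return flat


# A small nested structure using values from the tables above.
nested = {
    "class_count": 185295,
    "edge_count_by_predicate": {
        "BFO:0000051": {"facet": "BFO:0000051", "filtered_count": 4003},
        "RO:0000087": {"facet": "RO:0000087", "filtered_count": 43082},
    },
    "change_summary": {},
}

row = flatten(nested)
for column, value in row.items():
    print(column, value)
```

Because the column names are derived from whatever facets occur in the data, two ontologies will generally produce different sets of columns.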

Multi-ontology merges

Many OBO ontologies bundle portions of other ontologies with their main release. This can be confusing! For more details see OWL Format Variants in the obook.

As an example, consider naively calculating stats for the standard release of the Cell Ontology (CL):

[14]:

!runoak -i sqlite:obo:cl statistics | head -20

WARNING:root:bad mapping: GSE137537
WARNING:root:bad mapping: 10.1007/s004180050142
WARNING:root:bad mapping: NIFSTD
WARNING:root:bad mapping: Noradrenergic_cell_group_A6&oldid=981960774
WARNING:root:bad mapping: _Chapter_3
WARNING:root:bad mapping: A12.2.15.042
WARNING:root:bad mapping: PHENOSCAPE
id: AllOntologies
ontologies:
- id: obo:cl.owl
  version: obo:cl/releases/2023-09-21/cl.owl
  version_info: '2023-09-21'
was_generated_by:
  started_at_time: '2024-03-26T17:16:02.669245'
  was_associated_with: OAK
  acted_on_behalf_of: cjm
class_count: 28330
deprecated_class_count: 261
non_deprecated_class_count: 28069
class_count_with_text_definitions: 15110
class_count_without_text_definitions: 13220
object_property_count: 297
annotation_property_count: 241
named_individual_count: 18
subset_count: 63
rdf_triple_count: 681623
subclass_of_axiom_count: 44142

Looking at this you might think CL has 28k classes. In fact, this is the total number of classes in the ontology as defined by OWL, where here “ontology” means the merged product that includes bits of GO, Uberon, etc. Confusing, huh?

Ideally the OBO Foundry would move towards making base files the default, but in the absence of this, we have a few options:

Filtering by prefix (using -P)
Grouping using some property such as the prefix.

We’ll try the latter:

[22]:

!runoak -i sqlite:obo:cl statistics --group-by-prefix -o output/cl.stats.grouped.tsv -O csv

[23]:

df = pd.read_csv("output/cl.stats.grouped.tsv", sep="\t")
df

[23]:

	id	compared_with	agents	class_count	deprecated_class_count	non_deprecated_class_count	class_count_with_text_definitions	class_count_without_text_definitions	object_property_count	annotation_property_count	...	class_count_by_subset_non_informative	class_count_by_subset_organ_slim	class_count_by_subset_pheno_slim	class_count_by_subset_phenotype_rcn	class_count_by_subset_uberon_slim	class_count_by_subset_unverified_taxonomic_grouping	class_count_by_subset_upper_level	class_count_by_subset_vertebrate_core	mapping_statement_count_by_object_source_GOREL	mapping_statement_count_subject_by_object_source_GOREL
0	<http	NaN	NaN	0	0	0	0	0	0	1	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	<https	NaN	NaN	0	0	0	0	0	0	1	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	BFO	NaN	NaN	15	0	15	9	6	6	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	BSPO	NaN	NaN	0	0	0	0	0	24	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	CARO	NaN	NaN	20	0	20	20	0	0	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	CHEBI	NaN	NaN	123	0	123	18	105	0	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	CL	NaN	NaN	2969	249	2720	2555	414	3	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	GO	NaN	NaN	7265	2	7263	7264	1	0	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	IAO	NaN	NaN	6	0	6	4	2	0	23	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	NCBITaxon	NaN	NaN	138	0	138	0	138	0	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	OMO	NaN	NaN	0	0	0	0	0	0	2	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	PATO	NaN	NaN	185	0	185	184	1	0	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12	PR	NaN	NaN	748	0	748	747	1	0	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13	RO	NaN	NaN	1	0	1	1	0	240	16	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	UBERON	NaN	NaN	4670	0	4670	4308	362	0	0	...	47.0	136.0	1373.0	3.0	809.0	1.0	49.0	448.0	NaN	NaN
15	cito	NaN	NaN	0	0	0	0	0	0	1	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
16	dce	NaN	NaN	0	0	0	0	0	0	6	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
17	dcterms	NaN	NaN	0	0	0	0	0	0	7	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
18	foaf	NaN	NaN	0	0	0	0	0	0	2	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
19	obo	NaN	NaN	10	10	0	0	10	24	116	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.0	2.0
20	oio	NaN	NaN	0	0	0	0	0	0	61	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
21	owl	NaN	NaN	0	0	0	0	0	0	1	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
22	rdfs	NaN	NaN	0	0	0	0	0	0	3	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
23	skos	NaN	NaN	0	0	0	0	0	0	1	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
24	xsd	NaN	NaN	0	0	0	0	0	0	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

25 rows × 512 columns

Here we can see the numbers broken down by ontology. The number of classes in the CL row is now accurate. Note that the numbers for the other ontologies do not reflect totals for each external ontology as a whole; they are just the counts of what has been merged into CL.
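
For example, the native CL counts can be pulled out of the grouped table by selecting the row whose `id` is `CL`. The sketch below uses the standard-library `csv` module on a small inline excerpt (three rows and four columns copied from the table above), so it runs without the output file. With pandas, the equivalent selection on the full file would be `df[df["id"] == "CL"]`.

```python
import csv
import io

# A few rows and columns copied from the grouped table above (tab-separated).
excerpt = (
    "id\tclass_count\tdeprecated_class_count\tnon_deprecated_class_count\n"
    "CL\t2969\t249\t2720\n"
    "GO\t7265\t2\t7263\n"
    "UBERON\t4670\t0\t4670\n"
)

rows = list(csv.DictReader(io.StringIO(excerpt), delimiter="\t"))

# Pick the row for the CL prefix and read its non-deprecated class count.
cl = next(r for r in rows if r["id"] == "CL")
native = int(cl["non_deprecated_class_count"])
print(native)
```

This gives 2720 non-deprecated native CL classes, compared with the 28069 non-deprecated classes reported for the merged release.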

Diff stats

You can also use --compare-with to compare stats with a different release of an ontology. Note this is effectively the same as running diff with --statistics. See the diff docs for details.
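
If you want to script such a comparison, one option is to assemble the command line in Python and run it with `subprocess`. The sketch below only builds the argument list. The helper name `compare_command` and the output path are hypothetical; the file names `v1.obo` and `v2.obo` and the flags come from the help text above.

```python
import shlex


def compare_command(new, old, output):
    """Build a runoak argument list comparing `new` against `old`."""
    return [
        "runoak", "-i", new,
        "statistics",
        "--compare-with", old,
        "-o", output,
    ]


cmd = compare_command("v2.obo", "v1.obo", "output/v2.stats.yaml")
print(shlex.join(cmd))

# To execute it (requires runoak on the PATH and both input files present):
#   import subprocess
#   subprocess.run(cmd, check=True)
```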
