command: crawl

This notebook is intended as a supplement to the main OAK CLI docs.

It provides examples for the crawl command, which walks across multiple ontologies and endpoints by hopping over edges and mappings between terms.

Help Option

You can get help on any OAK command using the --help option.

[1]:
!runoak crawl --help
Usage: runoak crawl [OPTIONS] [TERMS]...

  Crawl one or more ontologies, hopping over edges and mappings

  Crawl is a powerful command that allows for multi-ontology traversal,
  particularly on mapping paths. Multiple ontologies and ontology sources
  (e.g. BioPortal, OLS) provide mappings between terms. No single ontology is
  likely to have a complete source. Using crawl, you can walk across the union
  of mappings in all ontologies, with custom rules for each ontology (e.g.
  normalizing prefixes).

  Documentation for this command will be provided in a separate notebook.

Options:
  -o, --output TEXT               Path to output file
  -O, --output-type TEXT          Desired output type
  --autolabel / --no-autolabel    If set, results will automatically have
                                  labels assigned  [default: autolabel]
  -M, --maps-to-source TEXT       Return only mappings with subject or object
                                  source equal to this
  --mapper TEXT                   A selector for an adapter that is to be used
                                  for the main lookup operation
  --unmelt / --no-unmelt          Use a wide table for display.  [default: no-
                                  unmelt]
  --adapters TEXT                 A comma-separated list of adapters
  --allowed-prefixes TEXT         A comma-separated list of prefixes to
                                  traverse over
  --mapping-predicates TEXT       A comma-separated list of mapping predicates
                                  to traverse over
  --viz / --no-viz                If true then draw a graph  [default: no-viz]
  -d, --directory TEXT            Directory to write output files
  --whole-ontology / --no-whole-ontology
                                  Run over whole ontology  [default: no-whole-
                                  ontology]
  -C, --config-yaml TEXT
  --help                          Show this message and exit.
[2]:
!cat input/mapping-crawler-config.yaml
adapter_configs:
  OMIM:
  DOID:
    prefix_normalization_map:
      "MIM:PS": "OMIMPS:"
      MIM: OMIM
      UMLS_CUI: UMLS
      SNOMEDCT_US_2023_03_01: SCTID
      NCI: NCIT
  ORDO:
    prefix_normalization_map:
      MeSH: MESH
  NCIT:
  GARD:
adapter_specs:
  OMIM: /Users/cjm/repos/semantic-sql/db/omim.db
allowed_prefixes: [OMIM, DOID, ORDO, OMIMPS, GARD, NCIT, OMIMPS]
mapping_predicates:
  - oio:hasDbXref
  - skos:exactMatch
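The prefix_normalization_map entries above rewrite the prefixes that one source uses into the prefixes used elsewhere, so that xrefs such as MIM:214100 and OMIM:214100 can be treated as the same node during the crawl. How OAK applies the map internally is not shown in this notebook. The sketch below is one plausible reading, offered as an assumption: keys containing a colon (such as "MIM:PS") are treated as string prefixes of the whole CURIE, and bare keys (such as MIM) are matched against the part before the colon. The function name normalize_curie and the input MIM:PS214100 are illustrative only.

```python
from typing import Dict


def normalize_curie(curie: str, prefix_map: Dict[str, str]) -> str:
    """Rewrite the start of a CURIE using the longest matching key in prefix_map.

    Hypothetical helper: this is an assumed interpretation of the config,
    not OAK's actual implementation.
    """
    # Try longer keys first so that "MIM:PS" takes priority over "MIM".
    for old in sorted(prefix_map, key=len, reverse=True):
        new = prefix_map[old]
        if ":" in old:
            # Key spans the colon (e.g. "MIM:PS"): plain string-prefix replacement.
            if curie.startswith(old):
                return new + curie[len(old):]
        else:
            # Bare key (e.g. "MIM"): compare against the prefix before the colon.
            prefix, sep, local_id = curie.partition(":")
            if sep and prefix == old:
                return f"{new}:{local_id}"
    # No rule matched: leave the CURIE unchanged.
    return curie


# The DOID map copied from the config file above.
doid_map = {
    "MIM:PS": "OMIMPS:",
    "MIM": "OMIM",
    "UMLS_CUI": "UMLS",
    "SNOMEDCT_US_2023_03_01": "SCTID",
    "NCI": "NCIT",
}

for xref in ["MIM:214100", "MIM:PS214100", "NCI:C34568", "DOID:905"]:
    print(xref, "->", normalize_curie(xref, doid_map))
```

Sorting the keys by length is a design choice made here so that the more specific "MIM:PS" rule is tried before the general MIM rule.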
[4]:
!runoak  crawl -C input/mapping-crawler-config.yaml -d output/refsum-analysis GARD:4648  --viz -o output/refsum.png
[Image: graph drawn by --viz for the crawl from GARD:4648, written to output/refsum.png]
[8]:
!runoak  crawl -C input/mapping-crawler-config.yaml -d output/refsum-analysis --whole-ontology GARD:4648 GARD:6322
/Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/sssom/util.py:162: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df.replace("", np.nan, inplace=True)
[9]:
!find output/refsum-analysis
output/refsum-analysis
output/refsum-analysis/clique_summary.csv
output/refsum-analysis/clique_results.csv
output/refsum-analysis/cliques
output/refsum-analysis/cliques/GARD_4648.sssom.tsv
output/refsum-analysis/cliques/GARD_6322.sssom.tsv
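The listing above shows one SSSOM file per seed term under the cliques directory. For analysis it can be convenient to stack these into a single table. The following is a minimal sketch using pandas. The helper name load_cliques is hypothetical, and recovering the seed CURIE from the file name (GARD_4648 to GARD:4648) is an assumption based only on the two file names shown.

```python
from pathlib import Path

import pandas as pd


def load_cliques(directory: str) -> pd.DataFrame:
    """Concatenate every *.sssom.tsv under <directory>/cliques into one table."""
    frames = []
    for path in sorted(Path(directory, "cliques").glob("*.sssom.tsv")):
        # Skip any '#'-prefixed metadata lines at the top of the file.
        df = pd.read_csv(path, sep="\t", comment="#")
        # Assumed naming convention: GARD_4648.sssom.tsv -> seed GARD:4648.
        df["seed"] = path.name[: -len(".sssom.tsv")].replace("_", ":", 1)
        frames.append(df)
    # Return an empty frame rather than failing when no files are found.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

For the run above, load_cliques("output/refsum-analysis") would read GARD_4648.sssom.tsv and GARD_6322.sssom.tsv and tag each row with its seed.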
[10]:
import pandas as pd
[12]:
pd.read_csv("output/refsum-analysis/clique_summary.csv")
[12]:
Unnamed: 0 mapping_count entity_count average_incoherency max_incoherency incoherency_GARD incoherency_ORDO incoherency_OMIM incoherency_DOID incoherency_NCIT incoherency_OMIMPS
0 count 2.000000 2.000000 2.000000 2.00000 2.00000 2.000000 1.0 2.00000 2.000000 1.0
1 mean 145.500000 44.500000 7.900000 12.50000 12.50000 2.000000 24.0 12.50000 0.500000 0.0
2 std 194.454365 55.861436 11.172287 17.67767 17.67767 2.828427 NaN 17.67767 0.707107 NaN
3 min 8.000000 5.000000 0.000000 0.00000 0.00000 0.000000 24.0 0.00000 0.000000 0.0
4 25% 76.750000 24.750000 3.950000 6.25000 6.25000 1.000000 24.0 6.25000 0.250000 0.0
5 50% 145.500000 44.500000 7.900000 12.50000 12.50000 2.000000 24.0 12.50000 0.500000 0.0
6 75% 214.250000 64.250000 11.850000 18.75000 18.75000 3.000000 24.0 18.75000 0.750000 0.0
7 max 283.000000 84.000000 15.800000 25.00000 25.00000 4.000000 24.0 25.00000 1.000000 0.0
[13]:
pd.read_csv("output/refsum-analysis/clique_results.csv")
[13]:
seed name mapping_count entity_count entities entity_labels predicates mapping_sources average_incoherency max_incoherency sources incoherency_GARD incoherency_ORDO incoherency_OMIM incoherency_DOID incoherency_NCIT incoherency_OMIMPS
0 GARD:4648 NaN 283 84 ['OMIM:614867', 'ORDO:772', 'OMIM:614873', 'GA... {'GARD:4648': 'Infantile Refsum disease', 'ORD... ['rdfs:subClassOf', 'oio:hasDbXref', 'skos:exa... ['obo:ORDO', 'obo:GARD', 'obo:DOID', 'obo:OMIM'] 15.8 25 ['GARD', 'ORDO', 'OMIM', 'DOID', 'NCIT'] 25 4 24.0 25 1 NaN
1 GARD:6322 NaN 8 5 ['ORDO:98249', 'DOID:13359', 'NCIT:C34568', 'O... {'DOID:13359': 'Ehlers-Danlos syndrome', 'GARD... ['oio:hasDbXref', 'skos:exactMatch'] ['obo:DOID', 'obo:GARD'] 0.0 0 ['DOID', 'GARD', 'ORDO', 'OMIMPS', 'NCIT'] 0 0 NaN 0 0 0.0
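In the table above, columns such as entities, predicates, mapping_sources and sources appear to hold Python list literals stored as text, and entity_labels appears to hold a dict literal. After read_csv these are plain strings. The sketch below shows one way to turn them back into Python objects with ast.literal_eval. The helper name parse_literal_columns is hypothetical, and the demo frame reuses values from the GARD:6322 row rather than reading the real file.

```python
import ast

import pandas as pd


def parse_literal_columns(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy of df with the given text columns parsed as Python literals."""
    out = df.copy()
    for col in columns:
        # literal_eval only accepts literals (lists, dicts, strings, numbers),
        # so it is safer than eval; non-string cells such as NaN are left as-is.
        out[col] = out[col].apply(
            lambda v: ast.literal_eval(v) if isinstance(v, str) else v
        )
    return out


# Small in-memory stand-in for clique_results.csv (values from the GARD:6322 row).
demo = pd.DataFrame(
    {
        "seed": ["GARD:6322"],
        "predicates": ["['oio:hasDbXref', 'skos:exactMatch']"],
        "sources": ["['DOID', 'GARD', 'ORDO', 'OMIMPS', 'NCIT']"],
    }
)
parsed = parse_literal_columns(demo, ["predicates", "sources"])
print(parsed.loc[0, "sources"])
```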
[15]:
pd.read_csv("output/refsum-analysis/cliques/GARD_4648.sssom.tsv", sep="\t", comment="#")

[15]:
subject_id subject_label predicate_id object_id object_label mapping_justification mapping_source other
0 DOID:0080476 peroxisome biogenesis disorder 1A oio:hasDbXref OMIM:214100 peroxisome biogenesis disorder 1a (zellweger) semapv:UnspecifiedMatching obo:DOID distance: 6, direction: -1
1 DOID:0080476 peroxisome biogenesis disorder 1A oio:hasDbXref OMIM:214100 peroxisome biogenesis disorder 1a (zellweger) semapv:UnspecifiedMatching obo:DOID distance: 7, direction: 1
2 DOID:0080476 peroxisome biogenesis disorder 1A rdfs:subClassOf DOID:905 Zellweger syndrome semapv:ManualMappingCuration obo:DOID NaN
3 DOID:0080477 peroxisome biogenesis disorder 2A oio:hasDbXref OMIM:214110 peroxisome biogenesis disorder 2a (zellweger) semapv:UnspecifiedMatching obo:DOID distance: 6, direction: -1
4 DOID:0080477 peroxisome biogenesis disorder 2A oio:hasDbXref OMIM:214110 peroxisome biogenesis disorder 2a (zellweger) semapv:UnspecifiedMatching obo:DOID distance: 7, direction: 1
... ... ... ... ... ... ... ... ...
278 ORDO:912 Zellweger syndrome oio:hasDbXref OMIM:614887 peroxisome biogenesis disorder 13a (zellweger) semapv:UnspecifiedMatching obo:ORDO distance: 3, direction: 1
279 ORDO:912 Zellweger syndrome oio:hasDbXref OMIM:614887 peroxisome biogenesis disorder 13a (zellweger) semapv:UnspecifiedMatching obo:ORDO distance: 4, direction: -1
280 ORDO:912 Zellweger syndrome oio:hasDbXref OMIM:617370 peroxisome biogenesis disorder 10b semapv:UnspecifiedMatching obo:ORDO distance: 2, direction: -1
281 ORDO:912 Zellweger syndrome oio:hasDbXref OMIM:617370 peroxisome biogenesis disorder 10b semapv:UnspecifiedMatching obo:ORDO distance: 3, direction: 1
282 ORDO:912 Zellweger syndrome rdfs:subClassOf ORDO:79189 Peroxisome biogenesis disorder semapv:ManualMappingCuration obo:ORDO NaN

283 rows × 8 columns
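The other column above packs two values into one string (for example "distance: 6, direction: -1") and is NaN for the rdfs:subClassOf rows. To filter or sort on these values, it helps to split them into numeric columns. The following is a minimal sketch that assumes the format seen in the rows shown. The helper name split_other is hypothetical, and the demo frame copies three values from the output instead of reading the file.

```python
import pandas as pd


def split_other(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric distance and direction columns."""
    # Named groups become the column names of the extracted frame;
    # rows that do not match (including NaN) yield missing values.
    pattern = r"distance:\s*(?P<distance>-?\d+),\s*direction:\s*(?P<direction>-?\d+)"
    extracted = df["other"].str.extract(pattern)
    out = df.copy()
    out["distance"] = pd.to_numeric(extracted["distance"])
    out["direction"] = pd.to_numeric(extracted["direction"])
    return out


# In-memory stand-in for a few rows of GARD_4648.sssom.tsv.
demo_mappings = pd.DataFrame(
    {
        "subject_id": ["DOID:0080476", "DOID:0080476", "DOID:0080476"],
        "object_id": ["OMIM:214100", "OMIM:214100", "DOID:905"],
        "other": ["distance: 6, direction: -1", "distance: 7, direction: 1", None],
    }
)
split = split_other(demo_mappings)
print(split[["object_id", "distance", "direction"]])
```

With the numeric columns in place, an expression such as split[split["distance"] <= 6] selects the mappings found within a given number of hops.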
