{
"cells": [
{
"cell_type": "markdown",
"id": "0f6c4513",
"metadata": {},
"source": [
"# command: crawl\n",
"\n",
"This notebook is intended as a supplement to the [main OAK CLI docs](https://incatools.github.io/ontology-access-kit/cli.html).\n",
"\n",
"This notebook provides examples for the `crawl` command, which is used to walk over multiple ontologies and endpoints.\n",
"\n",
"## Help Option\n",
"\n",
"You can get help on any OAK command using `--help`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "65db4b53",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Usage: runoak crawl [OPTIONS] [TERMS]...\n",
"\n",
" Crawl one or more ontologies, hopping over edges and mappings\n",
"\n",
" Crawl is a powerful command that allows for multi-ontology traversal,\n",
" particularly on mapping paths. Multiple ontologies and ontology sources\n",
" (e.g. BioPortal, OLS) provide mappings between terms. No single ontology is\n",
" likely to have a complete source. Using crawl, you can walk across the union\n",
" of mappings in all ontologies, with custom rules for each ontology (e.g.\n",
" normalizing prefixes).\n",
"\n",
" Documentation for this command will be provided in a separate notebook.\n",
"\n",
"Options:\n",
" -o, --output TEXT Path to output file\n",
" -O, --output-type TEXT Desired output type\n",
" --autolabel / --no-autolabel If set, results will automatically have\n",
" labels assigned [default: autolabel]\n",
" -M, --maps-to-source TEXT Return only mappings with subject or object\n",
" source equal to this\n",
" --mapper TEXT A selector for an adapter that is to be used\n",
" for the main lookup operation\n",
" --unmelt / --no-unmelt Use a wide table for display. [default: no-\n",
" unmelt]\n",
" --adapters TEXT A comma-separated list of adapters\n",
" --allowed-prefixes TEXT A comma-separated list of prefixes to\n",
" traverse over\n",
" --mapping-predicates TEXT A comma-separated list of mapping predicates\n",
" to traverse over\n",
" --viz / --no-viz If true then draw a graph [default: no-viz]\n",
" -d, --directory TEXT Directory to write output files\n",
" --whole-ontology / --no-whole-ontology\n",
" Run over whole ontology [default: no-whole-\n",
" ontology]\n",
" -C, --config-yaml TEXT\n",
" --help Show this message and exit.\n"
]
}
],
"source": [
"!runoak crawl --help"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d0fdbd5f-8d04-4e85-86c8-5ef0f7873ac7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"adapter_configs:\n",
" OMIM:\n",
" DOID:\n",
" prefix_normalization_map:\n",
" \"MIM:PS\": \"OMIMPS:\"\n",
" MIM: OMIM\n",
" UMLS_CUI: UMLS\n",
" SNOMEDCT_US_2023_03_01: SCTID\n",
" NCI: NCIT\n",
" ORDO:\n",
" prefix_normalization_map:\n",
" MeSH: MESH\n",
" NCIT:\n",
" GARD:\n",
"adapter_specs:\n",
" OMIM: /Users/cjm/repos/semantic-sql/db/omim.db\n",
"allowed_prefixes: [OMIM, DOID, ORDO, OMIMPS, GARD, NCIT, OMIMPS]\n",
"mapping_predicates:\n",
" - oio:hasDbXref\n",
" - skos:exactMatch\n"
]
}
],
"source": [
"!cat input/mapping-crawler-config.yaml"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "7b051587-1399-4b7b-ad76-a6ca64496d1e",
"metadata": {},
"outputs": [],
"source": [
"!runoak crawl -C input/mapping-crawler-config.yaml -d output/refsum-analysis GARD:4648 --viz -o output/refsum.png"
]
},
{
"cell_type": "markdown",
"id": "408e4bc2-a14f-44bb-ac06-bcee10862675",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9091520d-4615-4943-ac57-df20cde245f8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/sssom/util.py:162: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" df.replace(\"\", np.nan, inplace=True)\n",
"/Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/sssom/util.py:162: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" df.replace(\"\", np.nan, inplace=True)\n"
]
}
],
"source": [
"!runoak crawl -C input/mapping-crawler-config.yaml -d output/refsum-analysis --whole-ontology GARD:4648 GARD:6322"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "1d81be3c-d986-4a40-9fbc-ec6b904eaf06",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"output/refsum-analysis\n",
"output/refsum-analysis/clique_summary.csv\n",
"output/refsum-analysis/clique_results.csv\n",
"output/refsum-analysis/cliques\n",
"output/refsum-analysis/cliques/GARD_4648.sssom.tsv\n",
"output/refsum-analysis/cliques/GARD_6322.sssom.tsv\n"
]
}
],
"source": [
"!find output/refsum-analysis"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "ea17b5d8-f758-4f52-a17a-b8ad142572a9",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "16ac02c7-d026-4aec-9ce2-198bf0d05289",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" mapping_count | \n",
" entity_count | \n",
" average_incoherency | \n",
" max_incoherency | \n",
" incoherency_GARD | \n",
" incoherency_ORDO | \n",
" incoherency_OMIM | \n",
" incoherency_DOID | \n",
" incoherency_NCIT | \n",
" incoherency_OMIMPS | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" count | \n",
" 2.000000 | \n",
" 2.000000 | \n",
" 2.000000 | \n",
" 2.00000 | \n",
" 2.00000 | \n",
" 2.000000 | \n",
" 1.0 | \n",
" 2.00000 | \n",
" 2.000000 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1 | \n",
" mean | \n",
" 145.500000 | \n",
" 44.500000 | \n",
" 7.900000 | \n",
" 12.50000 | \n",
" 12.50000 | \n",
" 2.000000 | \n",
" 24.0 | \n",
" 12.50000 | \n",
" 0.500000 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" std | \n",
" 194.454365 | \n",
" 55.861436 | \n",
" 11.172287 | \n",
" 17.67767 | \n",
" 17.67767 | \n",
" 2.828427 | \n",
" NaN | \n",
" 17.67767 | \n",
" 0.707107 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" min | \n",
" 8.000000 | \n",
" 5.000000 | \n",
" 0.000000 | \n",
" 0.00000 | \n",
" 0.00000 | \n",
" 0.000000 | \n",
" 24.0 | \n",
" 0.00000 | \n",
" 0.000000 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 25% | \n",
" 76.750000 | \n",
" 24.750000 | \n",
" 3.950000 | \n",
" 6.25000 | \n",
" 6.25000 | \n",
" 1.000000 | \n",
" 24.0 | \n",
" 6.25000 | \n",
" 0.250000 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 50% | \n",
" 145.500000 | \n",
" 44.500000 | \n",
" 7.900000 | \n",
" 12.50000 | \n",
" 12.50000 | \n",
" 2.000000 | \n",
" 24.0 | \n",
" 12.50000 | \n",
" 0.500000 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 6 | \n",
" 75% | \n",
" 214.250000 | \n",
" 64.250000 | \n",
" 11.850000 | \n",
" 18.75000 | \n",
" 18.75000 | \n",
" 3.000000 | \n",
" 24.0 | \n",
" 18.75000 | \n",
" 0.750000 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 7 | \n",
" max | \n",
" 283.000000 | \n",
" 84.000000 | \n",
" 15.800000 | \n",
" 25.00000 | \n",
" 25.00000 | \n",
" 4.000000 | \n",
" 24.0 | \n",
" 25.00000 | \n",
" 1.000000 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unnamed: 0 mapping_count entity_count average_incoherency \\\n",
"0 count 2.000000 2.000000 2.000000 \n",
"1 mean 145.500000 44.500000 7.900000 \n",
"2 std 194.454365 55.861436 11.172287 \n",
"3 min 8.000000 5.000000 0.000000 \n",
"4 25% 76.750000 24.750000 3.950000 \n",
"5 50% 145.500000 44.500000 7.900000 \n",
"6 75% 214.250000 64.250000 11.850000 \n",
"7 max 283.000000 84.000000 15.800000 \n",
"\n",
" max_incoherency incoherency_GARD incoherency_ORDO incoherency_OMIM \\\n",
"0 2.00000 2.00000 2.000000 1.0 \n",
"1 12.50000 12.50000 2.000000 24.0 \n",
"2 17.67767 17.67767 2.828427 NaN \n",
"3 0.00000 0.00000 0.000000 24.0 \n",
"4 6.25000 6.25000 1.000000 24.0 \n",
"5 12.50000 12.50000 2.000000 24.0 \n",
"6 18.75000 18.75000 3.000000 24.0 \n",
"7 25.00000 25.00000 4.000000 24.0 \n",
"\n",
" incoherency_DOID incoherency_NCIT incoherency_OMIMPS \n",
"0 2.00000 2.000000 1.0 \n",
"1 12.50000 0.500000 0.0 \n",
"2 17.67767 0.707107 NaN \n",
"3 0.00000 0.000000 0.0 \n",
"4 6.25000 0.250000 0.0 \n",
"5 12.50000 0.500000 0.0 \n",
"6 18.75000 0.750000 0.0 \n",
"7 25.00000 1.000000 0.0 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(\"output/refsum-analysis/clique_summary.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f3a06b4b-00f0-4b61-a612-2647338a83ad",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" seed | \n",
" name | \n",
" mapping_count | \n",
" entity_count | \n",
" entities | \n",
" entity_labels | \n",
" predicates | \n",
" mapping_sources | \n",
" average_incoherency | \n",
" max_incoherency | \n",
" sources | \n",
" incoherency_GARD | \n",
" incoherency_ORDO | \n",
" incoherency_OMIM | \n",
" incoherency_DOID | \n",
" incoherency_NCIT | \n",
" incoherency_OMIMPS | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" GARD:4648 | \n",
" NaN | \n",
" 283 | \n",
" 84 | \n",
" ['OMIM:614867', 'ORDO:772', 'OMIM:614873', 'GA... | \n",
" {'GARD:4648': 'Infantile Refsum disease', 'ORD... | \n",
" ['rdfs:subClassOf', 'oio:hasDbXref', 'skos:exa... | \n",
" ['obo:ORDO', 'obo:GARD', 'obo:DOID', 'obo:OMIM'] | \n",
" 15.8 | \n",
" 25 | \n",
" ['GARD', 'ORDO', 'OMIM', 'DOID', 'NCIT'] | \n",
" 25 | \n",
" 4 | \n",
" 24.0 | \n",
" 25 | \n",
" 1 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" GARD:6322 | \n",
" NaN | \n",
" 8 | \n",
" 5 | \n",
" ['ORDO:98249', 'DOID:13359', 'NCIT:C34568', 'O... | \n",
" {'DOID:13359': 'Ehlers-Danlos syndrome', 'GARD... | \n",
" ['oio:hasDbXref', 'skos:exactMatch'] | \n",
" ['obo:DOID', 'obo:GARD'] | \n",
" 0.0 | \n",
" 0 | \n",
" ['DOID', 'GARD', 'ORDO', 'OMIMPS', 'NCIT'] | \n",
" 0 | \n",
" 0 | \n",
" NaN | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" seed name mapping_count entity_count \\\n",
"0 GARD:4648 NaN 283 84 \n",
"1 GARD:6322 NaN 8 5 \n",
"\n",
" entities \\\n",
"0 ['OMIM:614867', 'ORDO:772', 'OMIM:614873', 'GA... \n",
"1 ['ORDO:98249', 'DOID:13359', 'NCIT:C34568', 'O... \n",
"\n",
" entity_labels \\\n",
"0 {'GARD:4648': 'Infantile Refsum disease', 'ORD... \n",
"1 {'DOID:13359': 'Ehlers-Danlos syndrome', 'GARD... \n",
"\n",
" predicates \\\n",
"0 ['rdfs:subClassOf', 'oio:hasDbXref', 'skos:exa... \n",
"1 ['oio:hasDbXref', 'skos:exactMatch'] \n",
"\n",
" mapping_sources average_incoherency \\\n",
"0 ['obo:ORDO', 'obo:GARD', 'obo:DOID', 'obo:OMIM'] 15.8 \n",
"1 ['obo:DOID', 'obo:GARD'] 0.0 \n",
"\n",
" max_incoherency sources \\\n",
"0 25 ['GARD', 'ORDO', 'OMIM', 'DOID', 'NCIT'] \n",
"1 0 ['DOID', 'GARD', 'ORDO', 'OMIMPS', 'NCIT'] \n",
"\n",
" incoherency_GARD incoherency_ORDO incoherency_OMIM incoherency_DOID \\\n",
"0 25 4 24.0 25 \n",
"1 0 0 NaN 0 \n",
"\n",
" incoherency_NCIT incoherency_OMIMPS \n",
"0 1 NaN \n",
"1 0 0.0 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(\"output/refsum-analysis/clique_results.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8c511b79-c03e-4fd4-ba77-080c8d06ee46",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" subject_id | \n",
" subject_label | \n",
" predicate_id | \n",
" object_id | \n",
" object_label | \n",
" mapping_justification | \n",
" mapping_source | \n",
" other | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" DOID:0080476 | \n",
" peroxisome biogenesis disorder 1A | \n",
" oio:hasDbXref | \n",
" OMIM:214100 | \n",
" peroxisome biogenesis disorder 1a (zellweger) | \n",
" semapv:UnspecifiedMatching | \n",
" obo:DOID | \n",
" distance: 6, direction: -1 | \n",
"
\n",
" \n",
" 1 | \n",
" DOID:0080476 | \n",
" peroxisome biogenesis disorder 1A | \n",
" oio:hasDbXref | \n",
" OMIM:214100 | \n",
" peroxisome biogenesis disorder 1a (zellweger) | \n",
" semapv:UnspecifiedMatching | \n",
" obo:DOID | \n",
" distance: 7, direction: 1 | \n",
"
\n",
" \n",
" 2 | \n",
" DOID:0080476 | \n",
" peroxisome biogenesis disorder 1A | \n",
" rdfs:subClassOf | \n",
" DOID:905 | \n",
" Zellweger syndrome | \n",
" semapv:ManualMappingCuration | \n",
" obo:DOID | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" DOID:0080477 | \n",
" peroxisome biogenesis disorder 2A | \n",
" oio:hasDbXref | \n",
" OMIM:214110 | \n",
" peroxisome biogenesis disorder 2a (zellweger) | \n",
" semapv:UnspecifiedMatching | \n",
" obo:DOID | \n",
" distance: 6, direction: -1 | \n",
"
\n",
" \n",
" 4 | \n",
" DOID:0080477 | \n",
" peroxisome biogenesis disorder 2A | \n",
" oio:hasDbXref | \n",
" OMIM:214110 | \n",
" peroxisome biogenesis disorder 2a (zellweger) | \n",
" semapv:UnspecifiedMatching | \n",
" obo:DOID | \n",
" distance: 7, direction: 1 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 278 | \n",
" ORDO:912 | \n",
" Zellweger syndrome | \n",
" oio:hasDbXref | \n",
" OMIM:614887 | \n",
" peroxisome biogenesis disorder 13a (zellweger) | \n",
" semapv:UnspecifiedMatching | \n",
" obo:ORDO | \n",
" distance: 3, direction: 1 | \n",
"
\n",
" \n",
" 279 | \n",
" ORDO:912 | \n",
" Zellweger syndrome | \n",
" oio:hasDbXref | \n",
" OMIM:614887 | \n",
" peroxisome biogenesis disorder 13a (zellweger) | \n",
" semapv:UnspecifiedMatching | \n",
" obo:ORDO | \n",
" distance: 4, direction: -1 | \n",
"
\n",
" \n",
" 280 | \n",
" ORDO:912 | \n",
" Zellweger syndrome | \n",
" oio:hasDbXref | \n",
" OMIM:617370 | \n",
" peroxisome biogenesis disorder 10b | \n",
" semapv:UnspecifiedMatching | \n",
" obo:ORDO | \n",
" distance: 2, direction: -1 | \n",
"
\n",
" \n",
" 281 | \n",
" ORDO:912 | \n",
" Zellweger syndrome | \n",
" oio:hasDbXref | \n",
" OMIM:617370 | \n",
" peroxisome biogenesis disorder 10b | \n",
" semapv:UnspecifiedMatching | \n",
" obo:ORDO | \n",
" distance: 3, direction: 1 | \n",
"
\n",
" \n",
" 282 | \n",
" ORDO:912 | \n",
" Zellweger syndrome | \n",
" rdfs:subClassOf | \n",
" ORDO:79189 | \n",
" Peroxisome biogenesis disorder | \n",
" semapv:ManualMappingCuration | \n",
" obo:ORDO | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
283 rows × 8 columns
\n",
"
"
],
"text/plain": [
" subject_id subject_label predicate_id \\\n",
"0 DOID:0080476 peroxisome biogenesis disorder 1A oio:hasDbXref \n",
"1 DOID:0080476 peroxisome biogenesis disorder 1A oio:hasDbXref \n",
"2 DOID:0080476 peroxisome biogenesis disorder 1A rdfs:subClassOf \n",
"3 DOID:0080477 peroxisome biogenesis disorder 2A oio:hasDbXref \n",
"4 DOID:0080477 peroxisome biogenesis disorder 2A oio:hasDbXref \n",
".. ... ... ... \n",
"278 ORDO:912 Zellweger syndrome oio:hasDbXref \n",
"279 ORDO:912 Zellweger syndrome oio:hasDbXref \n",
"280 ORDO:912 Zellweger syndrome oio:hasDbXref \n",
"281 ORDO:912 Zellweger syndrome oio:hasDbXref \n",
"282 ORDO:912 Zellweger syndrome rdfs:subClassOf \n",
"\n",
" object_id object_label \\\n",
"0 OMIM:214100 peroxisome biogenesis disorder 1a (zellweger) \n",
"1 OMIM:214100 peroxisome biogenesis disorder 1a (zellweger) \n",
"2 DOID:905 Zellweger syndrome \n",
"3 OMIM:214110 peroxisome biogenesis disorder 2a (zellweger) \n",
"4 OMIM:214110 peroxisome biogenesis disorder 2a (zellweger) \n",
".. ... ... \n",
"278 OMIM:614887 peroxisome biogenesis disorder 13a (zellweger) \n",
"279 OMIM:614887 peroxisome biogenesis disorder 13a (zellweger) \n",
"280 OMIM:617370 peroxisome biogenesis disorder 10b \n",
"281 OMIM:617370 peroxisome biogenesis disorder 10b \n",
"282 ORDO:79189 Peroxisome biogenesis disorder \n",
"\n",
" mapping_justification mapping_source other \n",
"0 semapv:UnspecifiedMatching obo:DOID distance: 6, direction: -1 \n",
"1 semapv:UnspecifiedMatching obo:DOID distance: 7, direction: 1 \n",
"2 semapv:ManualMappingCuration obo:DOID NaN \n",
"3 semapv:UnspecifiedMatching obo:DOID distance: 6, direction: -1 \n",
"4 semapv:UnspecifiedMatching obo:DOID distance: 7, direction: 1 \n",
".. ... ... ... \n",
"278 semapv:UnspecifiedMatching obo:ORDO distance: 3, direction: 1 \n",
"279 semapv:UnspecifiedMatching obo:ORDO distance: 4, direction: -1 \n",
"280 semapv:UnspecifiedMatching obo:ORDO distance: 2, direction: -1 \n",
"281 semapv:UnspecifiedMatching obo:ORDO distance: 3, direction: 1 \n",
"282 semapv:ManualMappingCuration obo:ORDO NaN \n",
"\n",
"[283 rows x 8 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_csv(\"output/refsum-analysis/cliques/GARD_4648.sssom.tsv\", sep=\"\\t\", comment=\"#\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}