{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": [ "# Value Set Expansion examples\n", "\n", "Value sets are enumerated permissible values (aka subsets) that are used for purposes such as filtering ontologies or defining\n", "a subset of terms that are valid for data entry for a particular field.\n", "\n", "Dynamical (extensional) value sets are value sets that are defined by a query (including boolean graph queries) rather than a fixed list of terms. Because handling dynamic value sets at runtime can be complex, a value set expander (materializer) can be used to turn a dynamic value set into a static one.\n", "\n", "Value sets are found in formalisms such as LinkML and FHIR. Currently the OAK Value Set expander only supports LinkML value sets.\n", "\n", "See:\n", "\n", " - https://linkml.io/linkml/schemas/enums.html#dynamic-enums\n" ], "id": "fd22cbb0ed1b4e0d" }, { "metadata": {}, "cell_type": "markdown", "source": [ "## VSKit command\n", "\n", "Currently there is a command with a single subcommand (others may be added later) " ], "id": "81dfe256dcd95176" }, { "metadata": { "ExecuteTime": { "end_time": "2025-03-06T23:17:32.099351Z", "start_time": "2025-03-06T23:17:28.978048Z" } }, "cell_type": "code", "source": [ "%%bash\n", "vskit --help" ], "id": "e078aa69337c7f16", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: vskit [OPTIONS] COMMAND [ARGS]...\n", "\n", " Run the ValueSet CLI.\n", "\n", "Options:\n", " -v, --verbose\n", " -q, --quiet TEXT\n", " --help Show this message and exit.\n", "\n", "Commands:\n", " expand Expand a value set.\n" ] } ], "execution_count": 25 }, { "metadata": { "ExecuteTime": { "end_time": "2025-03-06T23:03:41.275468Z", "start_time": "2025-03-06T23:03:38.289192Z" } }, "cell_type": "code", "source": [ "%%bash\n", "vskit expand --help" ], "id": "b7dad9a30666f102", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: vskit expand [OPTIONS] [VALUE_SET_NAMES]...\n", "\n", " Expand a value set. EXPERIMENTAL.\n", "\n", " This will expand an *intentional value set* (aka *dynamic enum*), running a\n", " query against an ontology backend or backends to materialize the value set\n", " (permissible values).\n", "\n", " Currently the value set must be specified as LinkML, but in future this will\n", " be possible with other specifications such as FHIR ValueSet objects.\n", "\n", " Each expression in a dynamic enum has a *source ontology*, this is specified\n", " as a CURIE such as:\n", "\n", " - obo:mondo - bioregistry:wikidata\n", "\n", " These can be mapped to specific OAK selectors. By default, any obo prefix is\n", " mapped to the semsql implementation of that. You can use a configuration\n", " file to map to other backends, such as BioPortal or Wikidata. However, note\n", " that not all backends are capable of being able to render all value sets.\n", "\n", " Examples:\n", "\n", " vskit expand -c config.yaml -s schema.yaml -o expanded.yaml\n", " my_value_set1 my_value_set2\n", "\n", " Custom permissible value syntax:\n", "\n", " vskit expand -s schema.yaml -o expanded.yaml --pv-syntax '{label} [{id}]\n", " my_value_set1\n", "\n", "Options:\n", " -c, --config PATH\n", " -s, --schema PATH\n", " -o, --output PATH\n", " --pv-syntax TEXT Enter a LinkML structured_pattern.syntax-style string\n", " --help Show this message and exit.\n" ] } ], "execution_count": 3 }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Example Value Set\n", "\n", "The test inputs folder has an example of multiple different kinds of value sets. " ], "id": "2528e1ebb6e342fe" }, { "metadata": { "collapsed": true, "ExecuteTime": { "end_time": "2025-03-06T23:04:54.938997Z", "start_time": "2025-03-06T23:04:54.656867Z" } }, "cell_type": "code", "source": [ "%%bash\n", "yq . ../../../tests/input/value_set_example.yaml" ], "id": "initial_id", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id: https://w3id.org/linkml/examples/enums\n", "title: Dynamic Enums Example\n", "name: dynamicenums-example\n", "description: This demonstrates the use of dynamic enums\n", "license: https://creativecommons.org/publicdomain/zero/1.0/\n", "prefixes:\n", " linkml: https://w3id.org/linkml/\n", " ex: https://w3id.org/linkml/examples/enums/\n", " sh: https://w3id.org/shacl/\n", " bioregistry: https://bioregistry.io/registry/\n", " MONDO: http://purl.obolibrary.org/obo/MONDO_\n", " NCIT: http://purl.obolibrary.org/obo/NCIT_\n", " loinc: http://loinc.org/\n", "default_prefix: ex\n", "default_range: string\n", "default_curi_maps:\n", " - semweb_context\n", "emit_prefixes:\n", " - linkml\n", " - rdf\n", " - rdfs\n", " - xsd\n", " - owl\n", "imports:\n", " - linkml:types\n", "#==================================\n", "# Classes #\n", "#==================================\n", "classes:\n", " HumanSample:\n", " slots:\n", " - name\n", " - disease\n", "#==================================\n", "# Slots #\n", "#==================================\n", "slots:\n", " name:\n", " range: string\n", " disease:\n", " range: HumanDisease\n", " vital_status:\n", " enum_range:\n", " permissible_values:\n", " LIVING:\n", " DEAD:\n", " UNDEAD:\n", "#==================================\n", "# Enums\n", "#==================================\n", "enums:\n", " GoMembrane:\n", " pv_formula: CURIE\n", " reachable_from:\n", " include_self: true\n", " source_ontology: obo:go\n", " source_nodes:\n", " - GO:0016020 ## membrane\n", " OnlyInEukaryotes:\n", " reachable_from:\n", " source_ontology: obo:go\n", " source_nodes:\n", " - NCBITaxon:2759 ## Eukaryota\n", " relationship_types:\n", " - rdfs:subClassOf\n", " - RO:0002162 ## in taxon\n", " - BFO:0000050 ## part of\n", " MembraneExcludingEukaryotes:\n", " inherits: GoMembrane\n", " minus:\n", " - inherits: OnlyInEukaryotes\n", " Disease:\n", " reachable_from:\n", " source_ontology: bioregistry:mondo\n", " source_nodes:\n", " - MONDO:0000001 ## disease or disorder\n", " is_direct: false\n", " relationship_types:\n", " - rdfs:subClassOf\n", " minus:\n", " permissible_values:\n", " root_node:\n", " meaning: MONDO:0000001 ## disease or disorder\n", " HumanDisease:\n", " description: Extends the Disease value set, including NCIT neoplasms, excluding non-human diseases\n", " inherits:\n", " - Disease\n", " include:\n", " - reachable_from:\n", " source_ontology: bioregistry:ncit\n", " source_nodes:\n", " - NCIT:C3262\n", " minus:\n", " - reachable_from:\n", " source_ontology: bioregistry:mondo\n", " source_nodes:\n", " - MONDO:0005583 ## non-human animal disease\n", " relationship_types:\n", " - rdfs:subClassOf\n", " - permissible_values:\n", " NOT_THIS_ONE:\n", " meaning: MONDO:9999\n", " description: Example of excluding a single node\n", " LoincExample:\n", " enum_uri: http://hl7.org/fhir/ValueSet/example-intensional\n", " see_also:\n", " - https://build.fhir.org/valueset-example-intensional.json.html\n", " include:\n", " - reachable_from:\n", " source_ontology: \"loinc:\"\n", " source_nodes:\n", " - loinc:LP43571-6\n", " is_direct: true\n", " minus:\n", " concepts:\n", " - LOINC:5932-9\n", " HCAExample:\n", " see_also:\n", " - https://github.com/linkml/linkml/issues/274\n", " include:\n", " - reachable_from:\n", " source_ontology: bioregistry:go\n", " source_nodes:\n", " - GO:0007049\n", " - GO:0022403\n", " include_self: false\n", " relationship_types:\n", " - rdfs:subClassOf\n", " minus:\n", " concepts:\n", " - LOINC:5932-9\n", " BodyPartEnum:\n", " reachable_from:\n", " source_ontology: obo:cl\n", " source_nodes:\n", " - CL:0000540 ## neuron\n", " include_self: false\n", " relationship_types:\n", " - rdfs:subClassOf\n", " Brand:\n", " enum_uri: wikidata:Q431289\n", " include:\n", " - reachable_from:\n", " source_ontology: bioregistry:wikidata\n", " source_nodes:\n", " - wikidata:Q431289\n", " include_self: false\n", " relationship_types:\n", " - wdp:P31\n", " - wdp:P279\n", " SerumCholesterolExample:\n", " description: >\n", " This is an example value set that includes all the LOINC codes for serum/plasma cholesterol from v2.36.\n", "\n", " code_set: http://hl7.org/fhir/ValueSet/serum-cholesterol\n", " code_set_version: \"1.0.0\"\n", " pv_formula: CODE\n", " include:\n", " - concepts:\n", " - LP43571-6\n", " minus:\n", " - concepts:\n", " - 5932-9\n", " reachable_from:\n", " source_ontology: http://loinc.org\n", " source_nodes:\n", " - LP43571-6\n", " relationship_types: null\n", " is_direct: true\n", " include_self: true\n", " traverse_up: false\n", " concepts:\n", " - http://loinc.org/LP43571-6\n" ] } ], "execution_count": 4 }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Example value set: membranes in GO\n", "\n", "Let's examine the value set called GoMembrane:" ], "id": "6ffaeb55a89d466d" }, { "metadata": { "ExecuteTime": { "end_time": "2025-03-06T23:05:52.307822Z", "start_time": "2025-03-06T23:05:52.257979Z" } }, "cell_type": "code", "source": [ "%%bash\n", "yq .enums.GoMembrane ../../../tests/input/value_set_example.yaml" ], "id": "95b7c62285e232e0", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pv_formula: CURIE\n", "reachable_from:\n", " include_self: true\n", " source_ontology: obo:go\n", " source_nodes:\n", " - GO:0016020 ## membrane\n" ] } ], "execution_count": 7 }, { "metadata": {}, "cell_type": "markdown", "source": [ "You can see this is defined as a simple query that selects all terms that are a subclass of GO:0016020 (membrane);\n", "i.e. an ontology *branch*. Other value sets are more complex involving boolean combinations." ], "id": "a8475127c6933759" }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Configuration\n", "\n", "Because the LinkML language is independent of OAK we need to bind the logical names used for vocabularies to OAK selector syntax. This is done in a configuration file." ], "id": "f769a128fda58a57" }, { "metadata": { "ExecuteTime": { "end_time": "2025-03-06T23:10:04.991527Z", "start_time": "2025-03-06T23:10:04.988669Z" } }, "cell_type": "code", "source": [ "CONFIG = \"\"\"\n", "resource_resolvers:\n", " obo:go:\n", " shorthand: sqlite:obo:go\n", "\"\"\".strip()" ], "id": "132ec5988d505b11", "outputs": [], "execution_count": 14 }, { "metadata": { "ExecuteTime": { "end_time": "2025-03-06T23:10:05.458553Z", "start_time": "2025-03-06T23:10:05.454510Z" } }, "cell_type": "code", "source": [ "with open(\"output/vskit-config.yaml\", \"w\") as f:\n", " f.write(CONFIG)" ], "id": "ae50716c701d5a11", "outputs": [], "execution_count": 15 }, { "metadata": { "ExecuteTime": { "end_time": "2025-03-06T23:10:05.956660Z", "start_time": "2025-03-06T23:10:05.906222Z" } }, "cell_type": "code", "source": [ "%%bash\n", "yq . output/vskit-config.yaml" ], "id": "f438e0fdb3c7dac2", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "resource_resolvers:\n", " obo:go:\n", " shorthand: sqlite:obo:go\n" ] } ], "execution_count": 16 }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Expansion\n", "\n", "Now we will expand the value set, into a new materialized file" ], "id": "e3bd301602f082e6" }, { "metadata": { "ExecuteTime": { "end_time": "2025-03-06T23:11:53.399909Z", "start_time": "2025-03-06T23:11:47.974578Z" } }, "cell_type": "code", "source": [ "%%bash\n", "vskit expand --config output/vskit-config.yaml --schema ../../../tests/input/value_set_example.yaml GoMembrane -o output/GoMembrane.yaml" ], "id": "590e18f2e7c0348c", "outputs": [], "execution_count": 19 }, { "metadata": {}, "cell_type": "markdown", "source": "We can see the expanded value set below.", "id": "d9726db4ce454ece" }, { "metadata": { "ExecuteTime": { "end_time": "2025-03-06T23:13:02.610020Z", "start_time": "2025-03-06T23:13:02.548368Z" } }, "cell_type": "code", "source": [ "%%bash\n", "yq .enums.GoMembrane.permissible_values output/GoMembrane.yaml | head -20" ], "id": "7321163259ab74d6", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GO:0120201:\n", " text: GO:0120201\n", " description: Stack of disc membranes located inside a cone photoreceptor outer segment, and containing densely packed molecules of opsin photoreceptor proteins that traverse the lipid bilayer. Cone disc membranes arise as evaginations of the ciliary membrane during the development of the cone outer segment and remain contiguous with the ciliary membrane.\n", " meaning: GO:0120201\n", " title: cone photoreceptor disc membrane\n", "GO:0060171:\n", " text: GO:0060171\n", " description: The portion of the plasma membrane surrounding a stereocilium.\n", " meaning: GO:0060171\n", " title: stereocilium membrane\n", "GO:0042717:\n", " text: GO:0042717\n", " description: The lipid bilayer associated with a plasma membrane-derived chromatophore; surrounds chromatophores that form complete vesicles.\n", " meaning: GO:0042717\n", " title: plasma membrane-derived chromatophore membrane\n", "GO:0035579:\n", " text: GO:0035579\n", " description: The lipid bilayer surrounding a specific granule, a granule with a membranous, tubular internal structure, found primarily in mature neutrophil cells. Most are released into the extracellular fluid. Specific granules contain lactoferrin, lysozyme, vitamin B12 binding protein and elastase.\n", " meaning: GO:0035579\n", " title: specific granule membrane\n" ] } ], "execution_count": 22 }, { "metadata": {}, "cell_type": "markdown", "source": "note that technically the key value does not need repeated, but the default serialization is to include it.", "id": "3567b839829d6102" }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Customizing permissible values\n", "\n", "We can use `--pv-syntax` to customize the permissible value serialization. A python-style format string is used to specify how the permissible values are serialized. The default is `{id}`." ], "id": "9d3cf1b6c58e8d6f" }, { "metadata": { "ExecuteTime": { "end_time": "2025-03-06T23:14:44.022902Z", "start_time": "2025-03-06T23:14:38.305675Z" } }, "cell_type": "code", "source": [ "%%bash\n", "vskit expand --pv-syntax '{label} [{id}]' --config output/vskit-config.yaml --schema ../../../tests/input/value_set_example.yaml GoMembrane -o output/GoMembrane2.yaml" ], "id": "fd090ead695dd3f0", "outputs": [], "execution_count": 23 }, { "metadata": { "ExecuteTime": { "end_time": "2025-03-06T23:14:49.706273Z", "start_time": "2025-03-06T23:14:49.633459Z" } }, "cell_type": "code", "source": [ "%%bash\n", "yq .enums.GoMembrane.permissible_values output/GoMembrane2.yaml | head -20" ], "id": "4b413c636f18f94b", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cone photoreceptor disc membrane [GO:0120201]:\n", " text: cone photoreceptor disc membrane [GO:0120201]\n", " description: Stack of disc membranes located inside a cone photoreceptor outer segment, and containing densely packed molecules of opsin photoreceptor proteins that traverse the lipid bilayer. Cone disc membranes arise as evaginations of the ciliary membrane during the development of the cone outer segment and remain contiguous with the ciliary membrane.\n", " meaning: GO:0120201\n", " title: cone photoreceptor disc membrane\n", "stereocilium membrane [GO:0060171]:\n", " text: stereocilium membrane [GO:0060171]\n", " description: The portion of the plasma membrane surrounding a stereocilium.\n", " meaning: GO:0060171\n", " title: stereocilium membrane\n", "plasma membrane-derived chromatophore membrane [GO:0042717]:\n", " text: plasma membrane-derived chromatophore membrane [GO:0042717]\n", " description: The lipid bilayer associated with a plasma membrane-derived chromatophore; surrounds chromatophores that form complete vesicles.\n", " meaning: GO:0042717\n", " title: plasma membrane-derived chromatophore membrane\n", "specific granule membrane [GO:0035579]:\n", " text: specific granule membrane [GO:0035579]\n", " description: The lipid bilayer surrounding a specific granule, a granule with a membranous, tubular internal structure, found primarily in mature neutrophil cells. Most are released into the extracellular fluid. Specific granules contain lactoferrin, lysozyme, vitamin B12 binding protein and elastase.\n", " meaning: GO:0035579\n", " title: specific granule membrane\n" ] } ], "execution_count": 24 }, { "metadata": {}, "cell_type": "code", "outputs": [], "execution_count": null, "source": "", "id": "3a548a5eccc56788" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }