Value Set Expansion examples
Value sets are enumerated permissible values (aka subsets) that are used for purposes such as filtering ontologies or defining a subset of terms that are valid for data entry for a particular field.
Dynamic (intensional) value sets are value sets that are defined by a query (including boolean graph queries) rather than by a fixed list of terms. Because handling dynamic value sets at runtime can be complex, a value set expander (materializer) can be used to turn a dynamic value set into a static (extensional) one.
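To make the distinction concrete, here is a minimal sketch of what an expander conceptually does, assuming a toy in-memory ontology. The `PARENTS` graph, the `EX:` term IDs, and the `expand_reachable_from` helper are all hypothetical illustrations; they are not part of OAK or LinkML.

```python
from collections import deque

# Hypothetical toy ontology: child term -> list of parent terms (is-a edges).
PARENTS = {
    "EX:plasma_membrane": ["EX:membrane"],
    "EX:organelle_membrane": ["EX:membrane"],
    "EX:nuclear_membrane": ["EX:organelle_membrane"],
    "EX:nucleus": ["EX:organelle"],
}


def expand_reachable_from(source_nodes, include_self=True):
    """Materialize a dynamic value set: every term that descends from a source node."""
    # Invert the graph so we can walk from parents down to children.
    children = {}
    for child, parents in PARENTS.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    # Breadth-first walk collecting all descendants of the source nodes.
    descendants = set()
    queue = deque(source_nodes)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in descendants:
                descendants.add(child)
                queue.append(child)

    if include_self:
        descendants.update(source_nodes)
    return sorted(descendants)


# The query ("everything under EX:membrane") becomes a fixed list of terms.
static_value_set = expand_reachable_from(["EX:membrane"])
print(static_value_set)
```

The dynamic definition is just the query (`source_nodes` plus `include_self`); the static value set is the list the query returns at expansion time.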
Value sets are found in formalisms such as LinkML and FHIR. Currently the OAK Value Set expander only supports LinkML value sets.
VSKit command
Currently there is one command, vskit, with a single subcommand, expand (others may be added later).
[25]:
%%bash
vskit --help
Usage: vskit [OPTIONS] COMMAND [ARGS]...
Run the ValueSet CLI.
Options:
-v, --verbose
-q, --quiet TEXT
--help Show this message and exit.
Commands:
expand Expand a value set.
[3]:
%%bash
vskit expand --help
Usage: vskit expand [OPTIONS] [VALUE_SET_NAMES]...
Expand a value set. EXPERIMENTAL.
This will expand an *intentional value set* (aka *dynamic enum*), running a
query against an ontology backend or backends to materialize the value set
(permissible values).
Currently the value set must be specified as LinkML, but in future this will
be possible with other specifications such as FHIR ValueSet objects.
Each expression in a dynamic enum has a *source ontology*, this is specified
as a CURIE such as:
- obo:mondo
- bioregistry:wikidata
These can be mapped to specific OAK selectors. By default, any obo prefix is
mapped to the semsql implementation of that. You can use a configuration
file to map to other backends, such as BioPortal or Wikidata. However, note
that not all backends are capable of being able to render all value sets.
Examples:
vskit expand -c config.yaml -s schema.yaml -o expanded.yaml
my_value_set1 my_value_set2
Custom permissible value syntax:
vskit expand -s schema.yaml -o expanded.yaml --pv-syntax '{label} [{id}]
my_value_set1
Options:
-c, --config PATH
-s, --schema PATH
-o, --output PATH
--pv-syntax TEXT Enter a LinkML structured_pattern.syntax-style string
--help Show this message and exit.
Example Value Set
The test inputs folder contains an example schema with several different kinds of value sets.
[4]:
%%bash
yq . ../../../tests/input/value_set_example.yaml
id: https://w3id.org/linkml/examples/enums
title: Dynamic Enums Example
name: dynamicenums-example
description: This demonstrates the use of dynamic enums
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  linkml: https://w3id.org/linkml/
  ex: https://w3id.org/linkml/examples/enums/
  sh: https://w3id.org/shacl/
  bioregistry: https://bioregistry.io/registry/
  MONDO: http://purl.obolibrary.org/obo/MONDO_
  NCIT: http://purl.obolibrary.org/obo/NCIT_
  loinc: http://loinc.org/
default_prefix: ex
default_range: string
default_curi_maps:
  - semweb_context
emit_prefixes:
  - linkml
  - rdf
  - rdfs
  - xsd
  - owl
imports:
  - linkml:types
#==================================
# Classes #
#==================================
classes:
  HumanSample:
    slots:
      - name
      - disease
#==================================
# Slots #
#==================================
slots:
  name:
    range: string
  disease:
    range: HumanDisease
  vital_status:
    enum_range:
      permissible_values:
        LIVING:
        DEAD:
        UNDEAD:
#==================================
# Enums
#==================================
enums:
  GoMembrane:
    pv_formula: CURIE
    reachable_from:
      include_self: true
      source_ontology: obo:go
      source_nodes:
        - GO:0016020 ## membrane
  OnlyInEukaryotes:
    reachable_from:
      source_ontology: obo:go
      source_nodes:
        - NCBITaxon:2759 ## Eukaryota
      relationship_types:
        - rdfs:subClassOf
        - RO:0002162 ## in taxon
        - BFO:0000050 ## part of
  MembraneExcludingEukaryotes:
    inherits: GoMembrane
    minus:
      - inherits: OnlyInEukaryotes
  Disease:
    reachable_from:
      source_ontology: bioregistry:mondo
      source_nodes:
        - MONDO:0000001 ## disease or disorder
      is_direct: false
      relationship_types:
        - rdfs:subClassOf
    minus:
      permissible_values:
        root_node:
          meaning: MONDO:0000001 ## disease or disorder
  HumanDisease:
    description: Extends the Disease value set, including NCIT neoplasms, excluding non-human diseases
    inherits:
      - Disease
    include:
      - reachable_from:
          source_ontology: bioregistry:ncit
          source_nodes:
            - NCIT:C3262
    minus:
      - reachable_from:
          source_ontology: bioregistry:mondo
          source_nodes:
            - MONDO:0005583 ## non-human animal disease
          relationship_types:
            - rdfs:subClassOf
      - permissible_values:
          NOT_THIS_ONE:
            meaning: MONDO:9999
            description: Example of excluding a single node
  LoincExample:
    enum_uri: http://hl7.org/fhir/ValueSet/example-intensional
    see_also:
      - https://build.fhir.org/valueset-example-intensional.json.html
    include:
      - reachable_from:
          source_ontology: "loinc:"
          source_nodes:
            - loinc:LP43571-6
          is_direct: true
    minus:
      concepts:
        - LOINC:5932-9
  HCAExample:
    see_also:
      - https://github.com/linkml/linkml/issues/274
    include:
      - reachable_from:
          source_ontology: bioregistry:go
          source_nodes:
            - GO:0007049
            - GO:0022403
          include_self: false
          relationship_types:
            - rdfs:subClassOf
    minus:
      concepts:
        - LOINC:5932-9
  BodyPartEnum:
    reachable_from:
      source_ontology: obo:cl
      source_nodes:
        - CL:0000540 ## neuron
      include_self: false
      relationship_types:
        - rdfs:subClassOf
  Brand:
    enum_uri: wikidata:Q431289
    include:
      - reachable_from:
          source_ontology: bioregistry:wikidata
          source_nodes:
            - wikidata:Q431289
          include_self: false
          relationship_types:
            - wdp:P31
            - wdp:P279
  SerumCholesterolExample:
    description: >
      This is an example value set that includes all the LOINC codes for serum/plasma cholesterol from v2.36.
    code_set: http://hl7.org/fhir/ValueSet/serum-cholesterol
    code_set_version: "1.0.0"
    pv_formula: CODE
    include:
      - concepts:
          - LP43571-6
    minus:
      - concepts:
          - 5932-9
    reachable_from:
      source_ontology: http://loinc.org
      source_nodes:
        - LP43571-6
      relationship_types: null
      is_direct: true
      include_self: true
      traverse_up: false
    concepts:
      - http://loinc.org/LP43571-6
Example value set: membranes in GO
Let’s examine the value set called GoMembrane:
[7]:
%%bash
yq .enums.GoMembrane ../../../tests/input/value_set_example.yaml
pv_formula: CURIE
reachable_from:
  include_self: true
  source_ontology: obo:go
  source_nodes:
    - GO:0016020 ## membrane
You can see this is defined as a simple query that selects all terms that are a subclass of GO:0016020 (membrane), together with the term itself (include_self: true); i.e. an ontology branch. Other value sets are more complex, involving boolean combinations.
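As a rough illustration of what a boolean combination means, the sketch below treats each expanded value set as a plain Python set, in the spirit of MembraneExcludingEukaryotes in the example schema (which inherits GoMembrane and subtracts OnlyInEukaryotes). The term IDs and the `combine` helper are hypothetical, and this is an assumption about the semantics rather than a description of how the OAK expander is implemented.

```python
# Hypothetical, already-expanded value sets (the term IDs are made up for illustration).
go_membrane = {"EX:0001", "EX:0002", "EX:0003"}
only_in_eukaryotes = {"EX:0002", "EX:0009"}


def combine(include=(), minus=()):
    """Union all included sets, then remove every term found in a minus set."""
    result = set().union(*include)  # a new set, so the inputs are not mutated
    for excluded in minus:
        result -= excluded
    return result


membrane_excluding_eukaryotes = combine(
    include=[go_membrane],
    minus=[only_in_eukaryotes],
)
print(sorted(membrane_excluding_eukaryotes))  # ['EX:0001', 'EX:0003']
```

Under this reading, `include` and `inherits` act like set union and `minus` acts like set difference over the materialized terms.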
Configuration
Because the LinkML language is independent of OAK, we need to bind the logical names used for vocabularies to OAK selector syntax. This is done in a configuration file.
[14]:
CONFIG = """
resource_resolvers:
  obo:go:
    shorthand: sqlite:obo:go
""".strip()
[15]:
with open("output/vskit-config.yaml", "w") as f:
    f.write(CONFIG)
[16]:
%%bash
yq . output/vskit-config.yaml
resource_resolvers:
  obo:go:
    shorthand: sqlite:obo:go
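The binding step can be pictured as a simple lookup with a fallback. The sketch below is a hypothetical illustration only: `resolve_selector` is not an OAK function, the dict mirrors the configuration file by hand, and the fallback rule (any `obo:` prefix maps to a `sqlite:obo:` selector) is an assumption based on the help text's statement that obo prefixes default to the semsql implementation.

```python
# Hypothetical mirror of the configuration file as a Python dict.
CONFIG_DICT = {"resource_resolvers": {"obo:go": {"shorthand": "sqlite:obo:go"}}}


def resolve_selector(source_ontology, config):
    """Return an OAK selector string for a source ontology name (illustrative logic only)."""
    resolvers = config.get("resource_resolvers", {})
    if source_ontology in resolvers:
        # An explicit binding in the configuration takes precedence.
        return resolvers[source_ontology]["shorthand"]
    prefix, _, local = source_ontology.partition(":")
    if prefix == "obo" and local:
        # Assumed default: obo prefixes map to the semsql (sqlite) backend.
        return f"sqlite:obo:{local}"
    raise ValueError(f"No resolver configured for {source_ontology}")


print(resolve_selector("obo:go", CONFIG_DICT))  # sqlite:obo:go
print(resolve_selector("obo:cl", {}))           # sqlite:obo:cl
```

A source ontology with no explicit binding and no obo prefix (for example bioregistry:wikidata) raises an error in this sketch, which is why such sources need an entry in the configuration file.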
Expansion
Now we will expand the value set into a new, materialized file.
[19]:
%%bash
vskit expand --config output/vskit-config.yaml --schema ../../../tests/input/value_set_example.yaml GoMembrane -o output/GoMembrane.yaml
We can see the expanded value set below.
[22]:
%%bash
yq .enums.GoMembrane.permissible_values output/GoMembrane.yaml | head -20
GO:0120201:
  text: GO:0120201
  description: Stack of disc membranes located inside a cone photoreceptor outer segment, and containing densely packed molecules of opsin photoreceptor proteins that traverse the lipid bilayer. Cone disc membranes arise as evaginations of the ciliary membrane during the development of the cone outer segment and remain contiguous with the ciliary membrane.
  meaning: GO:0120201
  title: cone photoreceptor disc membrane
GO:0060171:
  text: GO:0060171
  description: The portion of the plasma membrane surrounding a stereocilium.
  meaning: GO:0060171
  title: stereocilium membrane
GO:0042717:
  text: GO:0042717
  description: The lipid bilayer associated with a plasma membrane-derived chromatophore; surrounds chromatophores that form complete vesicles.
  meaning: GO:0042717
  title: plasma membrane-derived chromatophore membrane
GO:0035579:
  text: GO:0035579
  description: The lipid bilayer surrounding a specific granule, a granule with a membranous, tubular internal structure, found primarily in mature neutrophil cells. Most are released into the extracellular fluid. Specific granules contain lactoferrin, lysozyme, vitamin B12 binding protein and elastase.
  meaning: GO:0035579
  title: specific granule membrane
Note that, technically, the key does not need to be repeated in the text field, but the default serialization includes it.
Customizing permissible values
We can use --pv-syntax to customize the permissible value serialization. A Python-style format string is used to specify how the permissible values are serialized. The default is {id}.
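For readers unfamiliar with Python format strings, the snippet below shows how such a template fills in with a term's fields, using plain `str.format`. The `render_pv` helper is hypothetical, and whether vskit uses `str.format` internally is an assumption; only the `{id}` and `{label}` placeholders come from the examples in this notebook.

```python
def render_pv(pv_syntax, term):
    """Fill a permissible-value template with the fields of a term (illustrative helper)."""
    return pv_syntax.format(**term)


# One term taken from the expanded GoMembrane value set.
term = {"id": "GO:0060171", "label": "stereocilium membrane"}

print(render_pv("{id}", term))            # GO:0060171
print(render_pv("{label} [{id}]", term))  # stereocilium membrane [GO:0060171]
```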
[23]:
%%bash
vskit expand --pv-syntax '{label} [{id}]' --config output/vskit-config.yaml --schema ../../../tests/input/value_set_example.yaml GoMembrane -o output/GoMembrane2.yaml
[24]:
%%bash
yq .enums.GoMembrane.permissible_values output/GoMembrane2.yaml | head -20
cone photoreceptor disc membrane [GO:0120201]:
  text: cone photoreceptor disc membrane [GO:0120201]
  description: Stack of disc membranes located inside a cone photoreceptor outer segment, and containing densely packed molecules of opsin photoreceptor proteins that traverse the lipid bilayer. Cone disc membranes arise as evaginations of the ciliary membrane during the development of the cone outer segment and remain contiguous with the ciliary membrane.
  meaning: GO:0120201
  title: cone photoreceptor disc membrane
stereocilium membrane [GO:0060171]:
  text: stereocilium membrane [GO:0060171]
  description: The portion of the plasma membrane surrounding a stereocilium.
  meaning: GO:0060171
  title: stereocilium membrane
plasma membrane-derived chromatophore membrane [GO:0042717]:
  text: plasma membrane-derived chromatophore membrane [GO:0042717]
  description: The lipid bilayer associated with a plasma membrane-derived chromatophore; surrounds chromatophores that form complete vesicles.
  meaning: GO:0042717
  title: plasma membrane-derived chromatophore membrane
specific granule membrane [GO:0035579]:
  text: specific granule membrane [GO:0035579]
  description: The lipid bilayer surrounding a specific granule, a granule with a membranous, tubular internal structure, found primarily in mature neutrophil cells. Most are released into the extracellular fluid. Specific granules contain lactoferrin, lysozyme, vitamin B12 binding protein and elastase.
  meaning: GO:0035579
  title: specific granule membrane