Value Set Expansion examples

Value sets are enumerated permissible values (aka subsets) that are used for purposes such as filtering ontologies or defining a subset of terms that are valid for data entry for a particular field.

Dynamic (intensional) value sets are defined by a query (including boolean graph queries) rather than by a fixed list of terms. Because handling dynamic value sets at runtime can be complex, a value set expander (materializer) can be used to turn a dynamic value set into a static one.
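
As a rough illustration of what "materializing" means, the sketch below expands a query of the form "everything reachable from these source nodes" over a tiny in-memory graph. The graph, the EX: identifiers and the expand function are made up for this example; this is not the OAK expander itself, which runs the query against an ontology backend.

```python
from collections import deque

# Toy ontology: child -> list of is-a parents. All IDs are hypothetical.
PARENTS = {
    "EX:plasma_membrane": ["EX:membrane"],
    "EX:organelle_membrane": ["EX:membrane"],
    "EX:nuclear_membrane": ["EX:organelle_membrane"],
    "EX:nucleus": ["EX:organelle"],
}


def expand(source_nodes, include_self=True):
    """Return a static, sorted list of terms that are descendants of the source nodes."""
    # Invert the child -> parents map so we can walk downwards from the sources.
    children = {}
    for child, parents in PARENTS.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    found = set()
    queue = deque(source_nodes)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in found:
                found.add(child)
                queue.append(child)

    if include_self:
        found.update(source_nodes)
    return sorted(found)


print(expand(["EX:membrane"]))
```

The result is a plain list that can be stored alongside the schema, so downstream tools do not need to run the query themselves.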

Value sets are found in formalisms such as LinkML and FHIR. Currently the OAK Value Set expander only supports LinkML value sets.

VSKit command

Currently the vskit command has a single subcommand, expand; others may be added later.

[25]:
%%bash
vskit --help
Usage: vskit [OPTIONS] COMMAND [ARGS]...

  Run the ValueSet CLI.

Options:
  -v, --verbose
  -q, --quiet TEXT
  --help            Show this message and exit.

Commands:
  expand  Expand a value set.
[3]:
%%bash
vskit expand --help
Usage: vskit expand [OPTIONS] [VALUE_SET_NAMES]...

  Expand a value set. EXPERIMENTAL.

  This will expand an *intentional value set* (aka *dynamic enum*), running a
  query against an ontology backend or backends to materialize the value set
  (permissible values).

  Currently the value set must be specified as LinkML, but in future this will
  be possible with other specifications such as FHIR ValueSet objects.

  Each expression in a dynamic enum has a *source ontology*, this is specified
  as a CURIE such as:

  - obo:mondo - bioregistry:wikidata

  These can be mapped to specific OAK selectors. By default, any obo prefix is
  mapped to the semsql implementation of that. You can use a configuration
  file to map to other backends, such as BioPortal or Wikidata. However, note
  that not all backends are capable of being able to render all value sets.

  Examples:

      vskit expand -c config.yaml -s schema.yaml -o expanded.yaml
      my_value_set1 my_value_set2

  Custom permissible value syntax:

      vskit expand -s schema.yaml -o expanded.yaml --pv-syntax '{label} [{id}]
      my_value_set1

Options:
  -c, --config PATH
  -s, --schema PATH
  -o, --output PATH
  --pv-syntax TEXT   Enter a LinkML structured_pattern.syntax-style string
  --help             Show this message and exit.

Example Value Set

The test inputs folder contains an example schema with several different kinds of value sets.

[4]:
%%bash
yq . ../../../tests/input/value_set_example.yaml
id: https://w3id.org/linkml/examples/enums
title: Dynamic Enums Example
name: dynamicenums-example
description: This demonstrates the use of dynamic enums
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  linkml: https://w3id.org/linkml/
  ex: https://w3id.org/linkml/examples/enums/
  sh: https://w3id.org/shacl/
  bioregistry: https://bioregistry.io/registry/
  MONDO: http://purl.obolibrary.org/obo/MONDO_
  NCIT: http://purl.obolibrary.org/obo/NCIT_
  loinc: http://loinc.org/
default_prefix: ex
default_range: string
default_curi_maps:
  - semweb_context
emit_prefixes:
  - linkml
  - rdf
  - rdfs
  - xsd
  - owl
imports:
  - linkml:types
#==================================
# Classes                         #
#==================================
classes:
  HumanSample:
    slots:
      - name
      - disease
#==================================
# Slots                           #
#==================================
slots:
  name:
    range: string
  disease:
    range: HumanDisease
  vital_status:
    enum_range:
      permissible_values:
        LIVING:
        DEAD:
        UNDEAD:
#==================================
# Enums
#==================================
enums:
  GoMembrane:
    pv_formula: CURIE
    reachable_from:
      include_self: true
      source_ontology: obo:go
      source_nodes:
        - GO:0016020 ## membrane
  OnlyInEukaryotes:
    reachable_from:
      source_ontology: obo:go
      source_nodes:
        - NCBITaxon:2759 ## Eukaryota
      relationship_types:
        - rdfs:subClassOf
        - RO:0002162 ## in taxon
        - BFO:0000050 ## part of
  MembraneExcludingEukaryotes:
    inherits: GoMembrane
    minus:
      - inherits: OnlyInEukaryotes
  Disease:
    reachable_from:
      source_ontology: bioregistry:mondo
      source_nodes:
        - MONDO:0000001 ## disease or disorder
      is_direct: false
      relationship_types:
        - rdfs:subClassOf
    minus:
      permissible_values:
        root_node:
          meaning: MONDO:0000001 ## disease or disorder
  HumanDisease:
    description: Extends the Disease value set, including NCIT neoplasms, excluding non-human diseases
    inherits:
      - Disease
    include:
      - reachable_from:
          source_ontology: bioregistry:ncit
          source_nodes:
            - NCIT:C3262
    minus:
      - reachable_from:
          source_ontology: bioregistry:mondo
          source_nodes:
            - MONDO:0005583 ## non-human animal disease
          relationship_types:
            - rdfs:subClassOf
      - permissible_values:
          NOT_THIS_ONE:
            meaning: MONDO:9999
            description: Example of excluding a single node
  LoincExample:
    enum_uri: http://hl7.org/fhir/ValueSet/example-intensional
    see_also:
      - https://build.fhir.org/valueset-example-intensional.json.html
    include:
      - reachable_from:
          source_ontology: "loinc:"
          source_nodes:
            - loinc:LP43571-6
          is_direct: true
    minus:
      concepts:
        - LOINC:5932-9
  HCAExample:
    see_also:
      - https://github.com/linkml/linkml/issues/274
    include:
      - reachable_from:
          source_ontology: bioregistry:go
          source_nodes:
            - GO:0007049
            - GO:0022403
          include_self: false
          relationship_types:
            - rdfs:subClassOf
    minus:
      concepts:
        - LOINC:5932-9
  BodyPartEnum:
    reachable_from:
      source_ontology: obo:cl
      source_nodes:
        - CL:0000540 ## neuron
      include_self: false
      relationship_types:
        - rdfs:subClassOf
  Brand:
    enum_uri: wikidata:Q431289
    include:
      - reachable_from:
          source_ontology: bioregistry:wikidata
          source_nodes:
            - wikidata:Q431289
          include_self: false
          relationship_types:
            - wdp:P31
            - wdp:P279
  SerumCholesterolExample:
    description: >
      This is an example value set that includes all the LOINC codes for serum/plasma cholesterol from v2.36.

    code_set: http://hl7.org/fhir/ValueSet/serum-cholesterol
    code_set_version: "1.0.0"
    pv_formula: CODE
    include:
      - concepts:
          - LP43571-6
    minus:
      - concepts:
          - 5932-9
    reachable_from:
      source_ontology: http://loinc.org
      source_nodes:
        - LP43571-6
      relationship_types: null
      is_direct: true
      include_self: true
      traverse_up: false
    concepts:
      - http://loinc.org/LP43571-6

Example value set: membranes in GO

Let’s examine the value set called GoMembrane:

[7]:
%%bash
yq .enums.GoMembrane ../../../tests/input/value_set_example.yaml
pv_formula: CURIE
reachable_from:
  include_self: true
  source_ontology: obo:go
  source_nodes:
    - GO:0016020 ## membrane

You can see this is defined as a simple query that selects all terms that are subclasses of GO:0016020 (membrane), i.e. an ontology branch; include_self: true keeps the root term in the result. Other value sets are more complex and involve boolean combinations.
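
The boolean combinations can be read as ordinary set operations over already-expanded term lists. The sketch below mirrors the shape of MembraneExcludingEukaryotes (inherit one value set, subtract another). The combine function and the EX: term IDs are hypothetical, and treating inherits and include as a union followed by subtraction of minus is an assumption about the intended semantics rather than a copy of the expander's code.

```python
def combine(inherits=(), include=(), minus=()):
    """Union the inherited and included term sets, then subtract the minus sets."""
    result = set()
    for terms in inherits:
        result |= set(terms)
    for terms in include:
        result |= set(terms)
    for terms in minus:
        result -= set(terms)
    return result


# Hypothetical, already-expanded value sets.
go_membrane = {"EX:1", "EX:2", "EX:3"}
only_in_eukaryotes = {"EX:3", "EX:9"}

membrane_excluding_eukaryotes = combine(
    inherits=[go_membrane],
    minus=[only_in_eukaryotes],
)
print(sorted(membrane_excluding_eukaryotes))
```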

Configuration

Because the LinkML language is independent of OAK, we need to bind the logical names used for vocabularies to OAK selector syntax. This is done in a configuration file.

[14]:
CONFIG = """
resource_resolvers:
  obo:go:
    shorthand: sqlite:obo:go
""".strip()
[15]:
with open("output/vskit-config.yaml", "w") as f:
    f.write(CONFIG)
[16]:
%%bash
yq . output/vskit-config.yaml
resource_resolvers:
  obo:go:
    shorthand: sqlite:obo:go
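
To make the binding concrete, here is a minimal sketch of how a source_ontology name could be turned into a selector using a configuration shaped like the one above. The resolve_selector function and its lookup order are assumptions made for illustration. The fallback for obo: names follows the vskit help text, which says that any obo prefix is mapped to the semsql implementation by default.

```python
# Same structure as the YAML configuration, written as a Python dict.
example_config = {
    "resource_resolvers": {
        "obo:go": {"shorthand": "sqlite:obo:go"},
    }
}


def resolve_selector(source_ontology, config):
    """Map a logical vocabulary name to an OAK selector string (hypothetical helper)."""
    resolvers = config.get("resource_resolvers", {})
    if source_ontology in resolvers:
        # An explicit entry in the configuration wins.
        return resolvers[source_ontology]["shorthand"]
    prefix, _, name = source_ontology.partition(":")
    if prefix == "obo" and name:
        # Assumed default: obo names fall back to the semsql (sqlite) backend.
        return f"sqlite:obo:{name}"
    raise KeyError(f"No resolver configured for {source_ontology!r}")


print(resolve_selector("obo:go", example_config))
print(resolve_selector("obo:cl", example_config))
```

Names with no explicit entry and no obo: prefix, such as bioregistry:wikidata, raise an error in this sketch, which is where a configuration entry would be needed.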

Expansion

Now we will expand the value set into a new materialized file.

[19]:
%%bash
vskit expand --config output/vskit-config.yaml --schema ../../../tests/input/value_set_example.yaml GoMembrane -o output/GoMembrane.yaml

We can see the expanded value set below.

[22]:
%%bash
yq .enums.GoMembrane.permissible_values output/GoMembrane.yaml | head -20
GO:0120201:
  text: GO:0120201
  description: Stack of disc membranes located inside a cone photoreceptor outer segment, and containing densely packed molecules of opsin photoreceptor proteins that traverse the lipid bilayer. Cone disc membranes arise as evaginations of the ciliary membrane during the development of the cone outer segment and remain contiguous with the ciliary membrane.
  meaning: GO:0120201
  title: cone photoreceptor disc membrane
GO:0060171:
  text: GO:0060171
  description: The portion of the plasma membrane surrounding a stereocilium.
  meaning: GO:0060171
  title: stereocilium membrane
GO:0042717:
  text: GO:0042717
  description: The lipid bilayer associated with a plasma membrane-derived chromatophore; surrounds chromatophores that form complete vesicles.
  meaning: GO:0042717
  title: plasma membrane-derived chromatophore membrane
GO:0035579:
  text: GO:0035579
  description: The lipid bilayer surrounding a specific granule, a granule with a membranous, tubular internal structure, found primarily in mature neutrophil cells. Most are released into the extracellular fluid. Specific granules contain lactoferrin, lysozyme, vitamin B12 binding protein and elastase.
  meaning: GO:0035579
  title: specific granule membrane

Note that, technically, the key does not need to be repeated in the text field, but the default serialization includes it.

Customizing permissible values

We can use --pv-syntax to customize the permissible value serialization. A Python-style format string specifies how each permissible value is serialized. The default is {id}.
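
The effect of such a template can be previewed with plain Python string formatting. This sketch assumes the template is filled in with str.format-style substitution of the id and label fields; the render_pv helper is hypothetical, and the term is one of the GO terms from the expansion shown earlier.

```python
def render_pv(template, term):
    """Fill a permissible value template from a term's fields (hypothetical helper)."""
    return template.format(**term)


term = {"id": "GO:0060171", "label": "stereocilium membrane"}

print(render_pv("{id}", term))             # default-style key
print(render_pv("{label} [{id}]", term))   # custom key with label and ID
```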

[23]:
%%bash
vskit expand --pv-syntax '{label} [{id}]' --config output/vskit-config.yaml --schema ../../../tests/input/value_set_example.yaml GoMembrane -o output/GoMembrane2.yaml
[24]:
%%bash
yq .enums.GoMembrane.permissible_values output/GoMembrane2.yaml | head -20
cone photoreceptor disc membrane [GO:0120201]:
  text: cone photoreceptor disc membrane [GO:0120201]
  description: Stack of disc membranes located inside a cone photoreceptor outer segment, and containing densely packed molecules of opsin photoreceptor proteins that traverse the lipid bilayer. Cone disc membranes arise as evaginations of the ciliary membrane during the development of the cone outer segment and remain contiguous with the ciliary membrane.
  meaning: GO:0120201
  title: cone photoreceptor disc membrane
stereocilium membrane [GO:0060171]:
  text: stereocilium membrane [GO:0060171]
  description: The portion of the plasma membrane surrounding a stereocilium.
  meaning: GO:0060171
  title: stereocilium membrane
plasma membrane-derived chromatophore membrane [GO:0042717]:
  text: plasma membrane-derived chromatophore membrane [GO:0042717]
  description: The lipid bilayer associated with a plasma membrane-derived chromatophore; surrounds chromatophores that form complete vesicles.
  meaning: GO:0042717
  title: plasma membrane-derived chromatophore membrane
specific granule membrane [GO:0035579]:
  text: specific granule membrane [GO:0035579]
  description: The lipid bilayer surrounding a specific granule, a granule with a membranous, tubular internal structure, found primarily in mature neutrophil cells. Most are released into the extracellular fluid. Specific granules contain lactoferrin, lysozyme, vitamin B12 binding protein and elastase.
  meaning: GO:0035579
  title: specific granule membrane