Text Annotator Interface

class oaklib.interfaces.text_annotator_interface.TextAnnotatorInterface(resource: oaklib.resource.OntologyResource | None = None, strict: bool = False, _multilingual: bool | None = None, autosave: bool = <factory>, exclude_owl_top_and_bottom: bool = <factory>, ontology_metamodel_mapper: oaklib.mappers.ontology_metadata_mapper.OntologyMetadataMapper | None = None, _converter: curies.api.Converter | None = None, auto_relax_axioms: bool | None = None, cache_lookups: bool = False, property_cache: oaklib.utilities.keyval_cache.KeyValCache = <factory>, _edge_index: oaklib.indexes.edge_index.EdgeIndex | None = None, _entailed_edge_index: oaklib.indexes.edge_index.EdgeIndex | None = None)

Finds occurrences of ontology terms in text.

This interface defines methods for performing Concept Recognition (CR), also known as grounding, on text.

For example, given a text:

“the mitochondrion of hippocampal neurons”

An annotator might recognize the concepts “mitochondrion” and “hippocampus neuron” from GO and CL respectively.

Different adapters may choose to implement this differently. The default implementation is to build a simple textual index from an ontology, using all labels and synonyms, and to perform simple string matching.
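The default strategy described above (index all labels and synonyms, then match strings) can be sketched in plain Python. This is a minimal illustration and not oaklib's implementation: `TERMS`, `build_index`, and `match_terms` are hypothetical names, the IDs and labels are taken from the doctest further down this page, and the synonym is included only for illustration. The sketch reports Python-style offsets (0-based, exclusive end), which need not match the offsets an adapter returns.

```python
import re
from typing import Dict, Iterator, List, Tuple

# Sample terms: each ID maps to its label followed by any synonyms.
TERMS: Dict[str, List[str]] = {
    "GO:0005634": ["nucleus", "cell nucleus"],
    "GO:0016020": ["membrane"],
    "GO:0043226": ["organelle"],
}


def build_index(terms: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each lower-cased label or synonym to its term ID."""
    index: Dict[str, str] = {}
    for term_id, names in terms.items():
        for name in names:
            index[name.lower()] = term_id
    return index


def match_terms(text: str, index: Dict[str, str]) -> Iterator[Tuple[str, str, int, int]]:
    """Yield (term_id, matched_text, start, end) for each whole-word match."""
    lowered = text.lower()
    for name, term_id in index.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, lowered):
            yield term_id, text[m.start():m.end()], m.start(), m.end()


index = build_index(TERMS)
text = "The nucleus is a organelle with a membrane"
# Sort by start offset so the results read left to right.
hits = sorted(match_terms(text, index), key=lambda hit: hit[2])
for hit in hits:
    print(*hit)
```

A real adapter would typically add normalization (case, punctuation, stemming) and handle overlapping matches, which this sketch ignores.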

Adapters that talk to a remote endpoint may leverage more advanced strategies, and may obviate the need for a local indexing step. For example, the BioPortal adapter uses the OntoPortal annotate endpoint, which is pre-indexed over the more than 1,000 ontologies in BioPortal.

All return payloads conform to the TextAnnotation data model.

Additional plugins may be available to provide more advanced functionality.

For more advanced extraction use cases, see:

  • OntoGPT - LLM-based NER and schema extraction

lexical_index: LexicalIndex | None = None

If present, some implementations may choose to use this index.

cache_directory: str | None = None

If present, some implementations may choose to cache any ontology indexes here. These may be reused in subsequent invocations; it is up to the user to manage this cache.

rule_collection: MappingRuleCollection | None = None

Mapping rules to apply to the results of the annotation, including synonymizer rules.

annotate_text(text: str, configuration: TextAnnotationConfiguration | None = None) → Iterable[TextAnnotation]

Annotate a piece of text.

>>> from oaklib import get_adapter
>>> adapter = get_adapter("tests/input/go-nucleus.obo")
>>> for annotation in adapter.annotate_text("The nucleus is a organelle with a membrane"):
...     print(annotation.object_id, annotation.object_label, annotation.subject_start, annotation.subject_end)
GO:0005634 nucleus 5 11
GO:0016020 membrane 35 42
GO:0043226 organelle 18 26

Parameters:
  • text – Text to be annotated.

  • configuration – Text annotation configuration.

Yield:

TextAnnotation objects, one for each term occurrence found in the text.
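The offsets printed in the doctest above appear to be 1-based and inclusive; this is an inference from the sample output, not a documented guarantee. The sketch below checks that reading by slicing the input text with the printed `subject_start` and `subject_end` values. The helper `span_text` is a hypothetical name introduced for the example.

```python
text = "The nucleus is a organelle with a membrane"

# (object_id, object_label, subject_start, subject_end), copied from the doctest output.
rows = [
    ("GO:0005634", "nucleus", 5, 11),
    ("GO:0016020", "membrane", 35, 42),
    ("GO:0043226", "organelle", 18, 26),
]


def span_text(text: str, start: int, end: int) -> str:
    """Return the span for 1-based, inclusive offsets (an assumption, see above)."""
    return text[start - 1:end]


recovered = [span_text(text, start, end) for _, _, start, end in rows]
for (term_id, label, _, _), span in zip(rows, recovered):
    print(term_id, label, repr(span))
```

Under that assumption, each recovered span equals the label of the matched term.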

annotate_file(text_file: TextIOWrapper, configuration: TextAnnotationConfiguration | None = None) → Iterator[TextAnnotation]

Annotate text in a file.

Parameters:
  • text_file – Text file that is iterated line-by-line.

  • configuration – Text annotation configuration, defaults to None.

Yield:

Annotation of each line.
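Line-by-line annotation can be pictured as a thin loop around a per-text annotator. The sketch below is self-contained and does not call oaklib: `toy_annotate_text` is a hypothetical stand-in for `annotate_text`, `annotate_lines` is a hypothetical name, and skipping blank lines is a choice made for this sketch rather than documented behaviour.

```python
import io
from typing import Callable, Iterable, Iterator, List, TextIO, Tuple

Annotation = Tuple[str, str]  # hypothetical stand-in: (term_id, label)

# IDs and labels taken from the annotate_text doctest on this page.
VOCAB = {
    "nucleus": "GO:0005634",
    "membrane": "GO:0016020",
    "organelle": "GO:0043226",
}


def toy_annotate_text(text: str) -> List[Annotation]:
    """Naive substring matcher standing in for a real annotator."""
    lowered = text.lower()
    return [(term_id, label) for label, term_id in VOCAB.items() if label in lowered]


def annotate_lines(
    text_file: TextIO, annotate: Callable[[str], Iterable[Annotation]]
) -> Iterator[Annotation]:
    """Annotate each non-empty line and yield the results lazily."""
    for line in text_file:
        line = line.strip()
        if not line:
            continue  # sketch-only choice: ignore blank lines
        yield from annotate(line)


handle = io.StringIO("The nucleus has a membrane\n\nAn organelle\n")
results = list(annotate_lines(handle, toy_annotate_text))
print(results)
```

Because the function is a generator, a large file is processed one line at a time rather than being read into memory.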

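annotate_tabular_file(text_file: TextIOWrapper, delimiter: str | None = None, configuration: TextAnnotationConfiguration | None = None, match_column: str | None = None, result_column: str = 'matched_id', result_label_column: str = 'matched_label', match_multiple=False, include_unmatched=True) → Iterator[Dict[str, str]]

Annotate text in a tabular file.

Parameters:
  • text_file – Tabular text file that is iterated row-by-row.

  • delimiter – Column delimiter, defaults to None.

  • configuration – Text annotation configuration, defaults to None.

  • match_column – Name of the column containing the text to annotate, defaults to None.

  • result_column – Name of the column that receives the matched term ID, defaults to 'matched_id'.

  • result_label_column – Name of the column that receives the matched term label, defaults to 'matched_label'.

  • match_multiple – Defaults to False.

  • include_unmatched – Defaults to True.

Yield:

Rows as dictionaries, with the annotation results added.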
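The signature suggests that this method reads delimited rows, annotates the text in `match_column`, and writes the match into `result_column` and `result_label_column`. The sketch below illustrates that reading with the standard library only. It is an assumption based on the parameter names and defaults, not oaklib's implementation: `annotate_rows` and `toy_annotate_text` are hypothetical names, and the handling of `match_multiple` and `include_unmatched` is a guess at their intent.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List, TextIO, Tuple

# IDs and labels taken from the annotate_text doctest on this page.
VOCAB = {"nucleus": "GO:0005634", "membrane": "GO:0016020"}


def toy_annotate_text(text: str) -> List[Tuple[str, str]]:
    """Naive substring matcher returning (term_id, label) pairs."""
    lowered = text.lower()
    return [(term_id, label) for label, term_id in VOCAB.items() if label in lowered]


def annotate_rows(
    text_file: TextIO,
    annotate: Callable[[str], List[Tuple[str, str]]],
    match_column: str,
    delimiter: str = "\t",
    result_column: str = "matched_id",
    result_label_column: str = "matched_label",
    match_multiple: bool = False,
    include_unmatched: bool = True,
) -> Iterator[Dict[str, str]]:
    """Yield one dict per (row, match); assumed semantics, see lead-in."""
    reader = csv.DictReader(text_file, delimiter=delimiter)
    for row in reader:
        matches = list(annotate(row[match_column]))
        if not match_multiple:
            matches = matches[:1]  # keep only the first match per row
        if not matches and include_unmatched:
            yield {**row, result_column: "", result_label_column: ""}
        for term_id, label in matches:
            yield {**row, result_column: term_id, result_label_column: label}


data = "id\tdescription\n1\tnucleus and membrane\n2\tunknown thing\n"
default_rows = list(annotate_rows(io.StringIO(data), toy_annotate_text, "description"))
all_matches = list(
    annotate_rows(
        io.StringIO(data),
        toy_annotate_text,
        "description",
        match_multiple=True,
        include_unmatched=False,
    )
)
print(default_rows)
print(all_matches)
```

With the defaults, every input row produces exactly one output row, which keeps the result aligned with the input table.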

class oaklib.datamodels.text_annotator.TextAnnotation(*args, _if_missing: Callable[[JsonObj, str], Tuple[bool, Any]] | None = None, **kwargs)

An individual text annotation.