Text Annotator Interface

class oaklib.interfaces.text_annotator_interface.TextAnnotatorInterface(resource: ~oaklib.resource.OntologyResource | None = None, strict: bool = False, _multilingual: bool | None = None, autosave: bool = <factory>, exclude_owl_top_and_bottom: bool = <factory>, ontology_metamodel_mapper: ~oaklib.mappers.ontology_metadata_mapper.OntologyMetadataMapper | None = None, _converter: ~curies.api.Converter | None = None, auto_relax_axioms: bool | None = None, cache_lookups: bool = False, property_cache: ~oaklib.utilities.keyval_cache.KeyValCache = <factory>, _edge_index: ~oaklib.indexes.edge_index.EdgeIndex | None = None, _entailed_edge_index: ~oaklib.indexes.edge_index.EdgeIndex | None = None, _prefix_map: ~typing.Mapping[str, str] | None = None)[source]

Finds occurrences of ontology terms in text.

This interface defines methods for performing Concept Recognition (CR), also called grounding, on text.

For example, given a text:

“the mitochondrion of hippocampal neurons”

An annotator might recognize the concepts “mitochondrion” and “hippocampus neuron” from GO and CL respectively.

Different adapters may choose to implement this differently. The default implementation is to build a simple textual index from an ontology, using all labels and synonyms, and to perform simple string matching.
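The default strategy described above can be illustrated with a minimal sketch. This is not oaklib's implementation: the names Match, build_index, annotate, and TOY_TERMS are hypothetical, the term data is a toy example, and the 1-based inclusive offsets are an assumption chosen to agree with the annotate_text doctest on this page.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a text annotation; the field names mirror
# those printed in the annotate_text doctest on this page.
Match = namedtuple("Match", "object_id object_label subject_start subject_end")


def build_index(terms):
    """Map each lowercased label or synonym to (term ID, primary label)."""
    index = {}
    for term_id, (label, synonyms) in terms.items():
        for name in [label, *synonyms]:
            index[name.lower()] = (term_id, label)
    return index


def annotate(text, index):
    """Yield a Match for every whole-word occurrence of an indexed string."""
    for name, (term_id, label) in index.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Assumption: 1-based inclusive offsets, as in the doctest output.
            yield Match(term_id, label, m.start() + 1, m.end())


# Toy data for illustration only: ID -> (label, synonyms).
TOY_TERMS = {
    "GO:0005634": ("nucleus", ["cell nucleus"]),
    "GO:0016020": ("membrane", []),
}

toy_index = build_index(TOY_TERMS)
results = sorted(
    annotate("The cell nucleus has a membrane", toy_index),
    key=lambda match: match.subject_start,
)
for match in results:
    print(match.object_id, match.object_label, match.subject_start, match.subject_end)
```

Note that this sketch reports overlapping matches (both "cell nucleus" and "nucleus" hit the same term); a real annotator may resolve such overlaps differently.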

Adapters that talk to a remote endpoint may leverage more advanced strategies, and may obviate the need for a local indexing step. For example, the BioPortal adapter uses the OntoPortal annotate endpoint, which is pre-indexed over the more than 1000 ontologies in BioPortal.

All return payloads conform to the TextAnnotation data model:

https://w3id.org/oak/text-annotator

Additional plugins may be available to provide more advanced functionality:

OAK SciSpacy plugin - provides a spaCy pipeline component
OAK OGER plugin - provides an OGER pipeline component

For more advanced extraction use cases, see:

OntoGPT - LLM-based NER and schema extraction

lexical_index: LexicalIndex | None = None: If present, some implementations may choose to use this index.

cache_directory: str | None = None: If present, some implementations may choose to cache any ontology indexes here. These may be reused in subsequent invocations; it is up to the user to manage this cache.

rule_collection: MappingRuleCollection | None = None: Mapping rules to apply to the results of the annotation, including synonymizer rules.

annotate_text(text: str, configuration: TextAnnotationConfiguration | None = None) → Iterable[TextAnnotation][source]

Annotate a piece of text.

>>> from oaklib import get_adapter
>>> adapter = get_adapter("tests/input/go-nucleus.obo")
>>> for annotation in adapter.annotate_text("The nucleus is a organelle with a membrane"):
...     print(annotation.object_id, annotation.object_label, annotation.subject_start, annotation.subject_end)
GO:0005634 nucleus 5 11
GO:0016020 membrane 35 42
GO:0043226 organelle 18 26

Parameters:

text – Text to be annotated.
configuration – Text annotation configuration.

Yield:

The annotations found in the text, one TextAnnotation at a time.

annotate_file(text_file: TextIOWrapper, configuration: TextAnnotationConfiguration | None = None) → Iterator[TextAnnotation][source]

Annotate text in a file.

Parameters:

text_file – Text file that is iterated line-by-line.
configuration – Text annotation configuration, defaults to None.

Yield:

The annotations found on each line.
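The line-by-line behaviour described for annotate_file can be sketched without oaklib as follows. The names annotate_lines and toy_annotate are hypothetical stand-ins, and skipping blank lines is a choice made for this sketch, not something the documentation above states.

```python
import io
from typing import Callable, Iterable, Iterator, TextIO


def annotate_lines(text_file: TextIO, annotate: Callable[[str], Iterable]) -> Iterator:
    """Yield every annotation produced for each non-empty line of the file."""
    for line in text_file:
        line = line.strip()
        if not line:
            # Assumption of this sketch: blank lines are skipped.
            continue
        yield from annotate(line)


def toy_annotate(text: str):
    """Hypothetical annotator: reports 'nucleus' when the line contains it."""
    if "nucleus" in text:
        yield ("GO:0005634", "nucleus")


handle = io.StringIO("The nucleus is an organelle\n\nNo terms here\nA second nucleus\n")
found = list(annotate_lines(handle, toy_annotate))
print(found)
```

Because the function is a generator, lines are only read and annotated as the caller consumes results, which keeps memory use flat for large files.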

annotate_tabular_file(text_file: TextIOWrapper, delimiter: str | None = None, configuration: TextAnnotationConfiguration | None = None, match_column: str | None = None, result_column: str = 'matched_id', result_label_column: str = 'matched_label', match_multiple=False, include_unmatched=True) → Iterator[Dict[str, str]][source]

Annotate text in a tabular file.

Parameters:

text_file – Text file that is iterated line-by-line.
configuration – Text annotation configuration, defaults to None.

Yield:

Each annotated row, as a dictionary.
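The signature of annotate_tabular_file suggests that each row is read as a dictionary, the text in match_column is annotated, and the result is written to result_column and result_label_column. The sketch below shows one plausible reading of those parameters. It is an assumption-laden illustration, not the library's code: annotate_rows, toy_annotate, and TOY_LABELS are hypothetical, and the semantics given to match_multiple and include_unmatched are inferred from their names only.

```python
import csv
import io

# Toy lookup for illustration only: label -> term ID.
TOY_LABELS = {"nucleus": "GO:0005634", "membrane": "GO:0016020"}


def toy_annotate(text):
    """Hypothetical annotator: yields (ID, label) for each known word."""
    for word in text.lower().split():
        if word in TOY_LABELS:
            yield TOY_LABELS[word], word


def annotate_rows(text_file, annotate, match_column, delimiter="\t",
                  result_column="matched_id", result_label_column="matched_label",
                  match_multiple=False, include_unmatched=True):
    """Yield one dictionary per output row, with the match columns filled in."""
    reader = csv.DictReader(text_file, delimiter=delimiter)
    for row in reader:
        matched = False
        for term_id, label in annotate(row[match_column]):
            matched = True
            yield {**row, result_column: term_id, result_label_column: label}
            if not match_multiple:
                # Assumed meaning: keep only the first match for each row.
                break
        if not matched and include_unmatched:
            # Assumed meaning: pass unmatched rows through with empty results.
            yield {**row, result_column: "", result_label_column: ""}


table = io.StringIO("sample\tdescription\ns1\tnucleus membrane\ns2\tunknown thing\n")
rows = list(annotate_rows(table, toy_annotate, match_column="description"))
for row in rows:
    print(row)
```

With match_multiple=True, this sketch would emit two rows for sample s1, one per matched term; with include_unmatched=False, the s2 row would be dropped.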

class oaklib.datamodels.text_annotator.TextAnnotation(*args, _if_missing: Callable[[JsonObj, str], Tuple[bool, Any]] | None = None, **kwargs)[source]: An individual text annotation