{ "cells": [ { "cell_type": "markdown", "id": "0f6c4513", "metadata": {}, "source": [ "# OAK enrichment command\n", "\n", "This notebook is intended as a supplement to the [main OAK CLI docs](https://incatools.github.io/ontology-access-kit/cli.html).\n", "\n", "This notebook provides examples for the `enrichment` command which produces a summary of ontology classes that are enriched in the associations for an input set of entities.\n", "\n", "See also the end of the [Command Line Tutorial](https://doi.org/10.5281/zenodo.7708963)\n", "\n", "## Help Option\n", "\n", "You can get help on any OAK command using `--help`" ] }, { "cell_type": "code", "execution_count": 1, "id": "65db4b53", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: runoak enrichment [OPTIONS] [TERMS]...\n", "\n", " Run class enrichment analysis.\n", "\n", " Given a sample file of identifiers (e.g. gene IDs), plus a set of\n", " associations (e.g. gene to term associations, return the terms that are\n", " over-represented in the sample set.\n", "\n", " Example:\n", "\n", " runoak -i sqlite:obo:uberon -g gene2anat.txt -G g2t enrichment -U my-\n", " genes.txt -O csv\n", "\n", " This runs an enrichment using Uberon on my-genes.txt, using the\n", " gene2anat.txt file as the association file (assuming simple gene-to-term\n", " format). The output is in CSV format.\n", "\n", " It is recommended you always provide a background set, including all the\n", " entity identifiers considered in the experiment.\n", "\n", " You can specify --filter-redundant to filter out redundant terms. 
This will\n", " block reporting of any terms that are either subsumed by or subsume a lower\n", " p-value term that is already reported.\n", "\n", " For a full example, see:\n", "\n", " https://github.com/INCATools/ontology-access-\n", " kit/blob/main/notebooks/Commands/Enrichment.ipynb\n", "\n", " Note that it is possible to run \"pseudo-enrichments\" on term lists only by\n", " passing no associations and using --ontology-only. This creates a fake\n", " association set that is simply reflexive relations between each term and\n", " itself. This can be useful for summarizing term lists, but note that\n", " P-values may not be meaningful.\n", "\n", "Options:\n", " -o, --output FILENAME Output file, e.g. obo file\n", " -p, --predicates TEXT A comma-separated list of predicates. This\n", " may be a shorthand (i, p) or CURIE\n", " --autolabel / --no-autolabel If set, results will automatically have\n", " labels assigned [default: autolabel]\n", " -O, --output-type TEXT Desired output type\n", " -o, --output FILENAME Output file, e.g. obo file\n", " --ontology-only / --no-ontology-only\n", " If true, perform a pseudo-enrichment\n", " analysis treating each term as an\n", " association to itself. [default: no-\n", " ontology-only]\n", " --cutoff FLOAT The cutoff for the p-value; any p-values\n", " greater than this are not reported.\n", " [default: 0.05]\n", " -U, --sample-file FILENAME file containing input list of entity IDs\n", " (e.g. gene IDs) [required]\n", " -B, --background-file FILENAME file containing background list of entity\n", " IDs (e.g. 
gene IDs)\n", " --association-predicates TEXT A comma-separated list of predicates for the\n", " association relation\n", " --filter-redundant / --no-filter-redundant\n", " If true, filter out redundant terms\n", " --help Show this message and exit.\n" ] } ], "source": [ "!runoak enrichment --help" ] }, { "cell_type": "markdown", "id": "c8878ac5", "metadata": {}, "source": [ "## Download example file and setup\n", "\n", "We will use the HPO gene-to-phenotype association file:" ] }, { "cell_type": "code", "execution_count": 2, "id": "12a41f0d", "metadata": {}, "outputs": [], "source": [ "!curl -L -s http://purl.obolibrary.org/obo/hp/hpoa/genes_to_phenotype.txt > input/hpoa_g2p.tsv" ] }, { "cell_type": "markdown", "id": "d57ac006", "metadata": {}, "source": [ "Next, we will set up an `hp` alias for running OAK against HPO:" ] }, { "cell_type": "code", "execution_count": 3, "id": "dc71c543", "metadata": {}, "outputs": [], "source": [ "alias hp runoak -i sqlite:obo:hp" ] }, { "cell_type": "markdown", "id": "6033aa66", "metadata": {}, "source": [ "Test this out by querying for associations for a particular gene.\n", "\n", "We need to pass in the association file we downloaded, as well as specify the file type (with `-G`):" ] }, { "cell_type": "code", "execution_count": 4, "id": "2cfa1be8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "subject\tsubject_label\tpredicate\tobject\tobject_label\tproperty_values\tpredicate_label\tnegated\tpublications\tprimary_knowledge_source\taggregator_knowledge_source\tsubject_closure\tsubject_closure_label\tobject_closure\tobject_closure_label\n", "NCBIGene:8192\tNone\tNone\tHP:0001250\tNone\t\tNone\tNone\t\tNone\tNone\t\t\t\t\n", "NCBIGene:8192\tNone\tNone\tHP:0000013\tNone\t\tNone\tNone\t\tNone\tNone\t\t\t\t\n", "NCBIGene:8192\tNone\tNone\tHP:0000007\tNone\t\tNone\tNone\t\tNone\tNone\t\t\t\t\n", "NCBIGene:8192\tNone\tNone\tHP:0010464\tNone\t\tNone\tNone\t\tNone\tNone\t\t\t\t\n", 
"NCBIGene:8192\tNone\tNone\tHP:0008232\tNone\t\tNone\tNone\t\tNone\tNone\t\t\t\t\n", "NCBIGene:8192\tNone\tNone\tHP:0011969\tNone\t\tNone\tNone\t\tNone\tNone\t\t\t\t\n", "NCBIGene:8192\tNone\tNone\tHP:0004322\tNone\t\tNone\tNone\t\tNone\tNone\t\t\t\t\n", "NCBIGene:8192\tNone\tNone\tHP:0000786\tNone\t\tNone\tNone\t\tNone\tNone\t\t\t\t\n", "NCBIGene:8192\tNone\tNone\tHP:0000815\tNone\t\tNone\tNone\t\tNone\tNone\t\t\t\t\n" ] } ], "source": [ "hp -G hpoa_g2p -g input/hpoa_g2p.tsv associations -Q subject NCBIGene:8192 -O csv | head" ] }, { "cell_type": "markdown", "id": "c9047f55", "metadata": {}, "source": [ "## Enrichment\n", "\n", "We will perform enrichment using a set of genes known to be associated with Ehler-Danlos Syndrome (EDS).\n", "\n", "The gene list is here:\n", "\n", "- [input/eds-genes-ncbigene.tsv](input/eds-genes-ncbigene.tsv)\n", "\n", "Let's take a look at them:" ] }, { "cell_type": "code", "execution_count": 5, "id": "a4c62f21-d6ec-4b34-82bc-3338cbb94ebe", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tlabel\n", "NCBIGene:7148\tTNXB\n", "NCBIGene:715\tC1R\n", "NCBIGene:716\tC1S\n", "NCBIGene:126792\tB3GALT6\n", "NCBIGene:55033\tFKBP14\n", "NCBIGene:91252\tSLC39A13\n", "NCBIGene:29940\tDSE\n", "NCBIGene:1303\tCOL12A1\n", "NCBIGene:9509\tADAMTS2\n", "NCBIGene:1278\tCOL1A2\n", "NCBIGene:1281\tCOL3A1\n", "NCBIGene:1289\tCOL5A1\n", "NCBIGene:1290\tCOL5A2\n", "NCBIGene:84627\tZNF469\n", "NCBIGene:113189\tCHST14\n", "NCBIGene:165\tAEBP1\n", "NCBIGene:5351\tPLOD1\n", "NCBIGene:11285\tB4GALT7\n", "NCBIGene:11107\tPRDM5\n" ] } ], "source": [ "!cat input/eds-genes-ncbigene.tsv" ] }, { "cell_type": "markdown", "id": "4a538b5d-55c6-4773-91be-8fd726ede6cf", "metadata": {}, "source": [ "### Running the `enrichment` command\n", "\n", "Next we will run the command itself. 
Note that we use two sets of parameters:\n", "\n", "- global OAK parameters:\n", "  - the format of the associations (`-G`), here the HPOA gene-to-phenotype format\n", "  - the path to the association file (`-g`), here the g2p file we downloaded earlier\n", "- local parameters for the `enrichment` command:\n", "  - the set of genes to test for enrichment (via `-U` or `--sample-file`)\n", "  - the output format for the results (via `-O` or `--output-type`); here a TSV, but this could also be YAML or RDF\n", "  - the `--autolabel` option, which runs additional HPO queries to fill in the name of each term\n", "  - the output file (via `-o` or `--output`)" ] }, { "cell_type": "code", "execution_count": 6, "id": "ab30433e", "metadata": {}, "outputs": [], "source": [ "hp -G hpoa_g2p -g input/hpoa_g2p.tsv enrichment -U input/eds-genes-ncbigene.tsv -O csv --autolabel -o output/eds-genes-enriched.tsv" ] }, { "cell_type": "markdown", "id": "503f9542-e25f-4ddc-bc70-2c52e2ffe34b", "metadata": {}, "source": [ "### Examining the results\n", "\n", "A convenient way to look at TSVs in a notebook such as this one is to load them into a pandas dataframe.\n", "Note, however, that in most scenarios where you use the command line, this would *not* be wrapped in a notebook,\n", "and you could use your favorite TSV/CSV tool to explore the results." ] }, { "cell_type": "code", "execution_count": 7, "id": "3a4205f9", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 8, "id": "3f49d300", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>class_id</th><th>p_value</th><th>class_label</th><th>rank</th><th>p_value_adjusted</th><th>false_discovery_rate</th><th>fold_enrichment</th><th>probability</th><th>sample_count</th><th>sample_total</th><th>background_count</th><th>background_total</th><th>ancestor_of_more_informative_result</th><th>descendant_of_more_informative_result</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>HP:0000974</td><td>2.121426e-37</td><td>Hyperextensible skin</td><td>1</td><td>2.895747e-34</td><td>NaN</td><td>NaN</td><td>NaN</td><td>19</td><td>19</td><td>68</td><td>5011</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>1</th><td>HP:0001075</td><td>1.245666e-36</td><td>Atrophic scars</td><td>2</td><td>1.700334e-33</td><td>NaN</td><td>NaN</td><td>NaN</td><td>17</td><td>19</td><td>37</td><td>5011</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>2</th><td>HP:0008067</td><td>9.834712e-29</td><td>Abnormally lax or hyperextensible skin</td><td>3</td><td>1.342438e-25</td><td>NaN</td><td>NaN</td><td>NaN</td><td>19</td><td>19</td><td>177</td><td>5011</td><td>True</td><td>NaN</td></tr>\n",
"<tr><th>3</th><td>HP:0100699</td><td>1.916985e-28</td><td>Scarring</td><td>4</td><td>2.616685e-25</td><td>NaN</td><td>NaN</td><td>NaN</td><td>19</td><td>19</td><td>183</td><td>5011</td><td>True</td><td>NaN</td></tr>\n",
"<tr><th>4</th><td>HP:0004334</td><td>7.330381e-28</td><td>Dermal atrophy</td><td>5</td><td>1.000597e-24</td><td>NaN</td><td>NaN</td><td>NaN</td><td>17</td><td>19</td><td>102</td><td>5011</td><td>True</td><td>NaN</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>159</th><td>HP:0010488</td><td>3.040076e-05</td><td>Aplasia/Hypoplasia of the palmar creases</td><td>160</td><td>4.149704e-02</td><td>NaN</td><td>NaN</td><td>NaN</td><td>3</td><td>19</td><td>17</td><td>5011</td><td>NaN</td><td>True</td></tr>\n",
"<tr><th>160</th><td>HP:0000014</td><td>3.062457e-05</td><td>Abnormality of the bladder</td><td>161</td><td>4.180254e-02</td><td>NaN</td><td>NaN</td><td>NaN</td><td>9</td><td>19</td><td>494</td><td>5011</td><td>True</td><td>NaN</td></tr>\n",
"<tr><th>161</th><td>HP:0011844</td><td>3.186952e-05</td><td>Abnormal appendicular skeleton morphology</td><td>162</td><td>4.350189e-02</td><td>NaN</td><td>NaN</td><td>NaN</td><td>16</td><td>19</td><td>1853</td><td>5011</td><td>True</td><td>NaN</td></tr>\n",
"<tr><th>162</th><td>HP:0033353</td><td>3.286903e-05</td><td>Abnormal blood vessel morphology</td><td>163</td><td>4.486623e-02</td><td>NaN</td><td>NaN</td><td>NaN</td><td>12</td><td>19</td><td>967</td><td>5011</td><td>True</td><td>True</td></tr>\n",
"<tr><th>163</th><td>HP:0025323</td><td>3.300320e-05</td><td>Abnormal arterial physiology</td><td>164</td><td>4.504937e-02</td><td>NaN</td><td>NaN</td><td>NaN</td><td>6</td><td>19</td><td>177</td><td>5011</td><td>True</td><td>True</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>164 rows × 14 columns</p>\n", "