How does the Taxonomic Name Recognition algorithm work in BHL?

When new or updated page text is added to BHL, that text is sent to the API for the Global Names Recognition and Discovery (GNRD) service. See https://gnrd.globalnames.org/ for more information.
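As a rough illustration of what such a request could look like, the sketch below builds a GNRD request URL for a fragment of page text. The `name_finder.json` path is taken from the `token_url` in the example response further down this page, and the option names mirror the `parameters` object echoed in that response. The `text` parameter name, the pipe-separated list format, and the helper name `build_gnrd_request_url` are assumptions for illustration, not BHL's actual code.

```python
from urllib.parse import urlencode

# Path taken from the token_url shown in the example response on this page.
GNRD_ENDPOINT = "https://gnrd.globalnames.org/name_finder.json"


def build_gnrd_request_url(page_text, preferred_sources=(12, 169)):
    """Build a request URL for one page of text (hypothetical helper)."""
    params = {
        # Assumption: the page text is passed in a parameter named "text".
        "text": page_text,
        # These option names mirror the "parameters" object in the response.
        "all_data_sources": "true",
        "best_match_only": "true",
        "detect_language": "true",
        # Assumption: multiple data source IDs are joined with "|".
        "preferred_data_sources": "|".join(str(s) for s in preferred_sources),
    }
    return GNRD_ENDPOINT + "?" + urlencode(params)


print(build_gnrd_request_url("Strix varia"))
```

The IDs 12 and 169 are the preferred data sources shown in the example response (EOL and uBio NameBank).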

The GNRD service does the following:

  1. Analyzes the text
  2. Identifies text strings that might be scientific names
  3. Compares potential names to multiple repositories of scientific names (such as EOL and Catalogue of Life) to identify known names
  4. Compiles the results
  5. Sends the results back

Questions about the details of how all of this is done should be directed to Global Names Architecture.

Here is an example GNRD response to a fragment of text that includes the name “Strix varia” (the fields that BHL evaluates and/or stores are identified after the example):

{
  "file": null,
  "names": [
    {
      "size": 10,
      "verbatim": "Strix varia",
      "offsetEnd": 10,
      "offsetStart": 0,
      "scientificName": "Strix varia"
    }
  ],
  "total": 1,
  "status": 200,
  "unique": false,
  "engines": [
    "TaxonFinder",
    "NetiNeti"
  ],
  "verbatim": true,
  "input_url": null,
  "token_url": "http://gnrd.globalnames.org/name_finder.json?token=5tn0fs96bb",
  "parameters": {
    "engine": 0,
    "return_content": false,
    "best_match_only": true,
    "data_source_ids": [],
    "detect_language": true,
    "all_data_sources": true,
    "preferred_data_sources": [
      12,
      169
    ]
  },
  "data_sources": [],
  "execution_time": {
    "total_duration": 0.10292387008666992,
    "find_names_duration": 0.0025997161865234375,
    "names_resolution_duration": 0.09673190116882324,
    "text_preparation_duration": 0.003592252731323242
  },
  "resolved_names": [
    {
      "results": [
        {
          "score": 0.988,
          "gni_uuid": "0183c26f-dc58-59b8-a703-edb842ceebf3",
          "prescore": "3|0|0",
          "taxon_id": "57075",
          "match_type": 1,
          "imported_at": "2018-07-05T21:09:32Z",
          "match_value": "Exact string match",
          "name_string": "Strix varia",
          "edit_distance": 0,
          "canonical_form": "Strix varia",
          "data_source_id": 4,
          "data_source_title": "NCBI",
          "classification_path": "|Eukaryota|Opisthokonta|Metazoa|Eumetazoa|Bilateria|Deuterostomia|Chordata|Craniata|Vertebrata|Gnathostomata|Teleostomi|Euteleostomi|Sarcopterygii|Dipnotetrapodomorpha|Tetrapoda|Amniota|Sauropsida|Sauria|Archelosauria|Archosauria|Dinosauria|Saurischia|Theropoda|Coelurosauria|Aves|Neognathae|Strigiformes|Strigidae|Strix|Strix varia",
          "classification_path_ids": "131567|2759|33154|33208|6072|33213|33511|7711|89593|7742|7776|117570|117571|8287|1338369|32523|32524|8457|32561|1329799|8492|436486|436489|436491|436492|8782|8825|30458|30459|36304|57075",
          "classification_path_ranks": "|superkingdom||kingdom||||phylum|subphylum|||||superclass||||||||||||class|superorder|order|family|genus|species"
        }
      ],
      "is_known_name": true,
      "preferred_results": [
        {
          "url": "http://eol.org/pages/1045909",
          "score": 0.988,
          "gni_uuid": "0183c26f-dc58-59b8-a703-edb842ceebf3",
          "local_id": "1045909",
          "prescore": "3|0|0",
          "taxon_id": "20633053",
          "match_type": 1,
          "imported_at": "2012-05-08T03:07:12Z",
          "match_value": "Exact string match",
          "name_string": "Strix varia",
          "edit_distance": 0,
          "canonical_form": "Strix varia",
          "data_source_id": 12,
          "data_source_title": "EOL",
          "classification_path": "",
          "classification_path_ids": "",
          "classification_path_ranks": ""
        },
        {
          "url": "http://www.ubio.org/browser/details.php?namebankID=3852582",
          "score": 0.988,
          "gni_uuid": "0183c26f-dc58-59b8-a703-edb842ceebf3",
          "local_id": "urn:lsid:ubio.org:namebank:3852582",
          "prescore": "3|0|0",
          "taxon_id": "97060978",
          "global_id": "urn:lsid:ubio.org:namebank:3852582",
          "match_type": 1,
          "imported_at": "2013-05-31T15:56:10Z",
          "match_value": "Exact string match",
          "name_string": "Strix varia",
          "edit_distance": 0,
          "canonical_form": "Strix varia",
          "data_source_id": 169,
          "data_source_title": "uBio NameBank",
          "classification_path": "|Strix varia",
          "classification_path_ids": "",
          "classification_path_ranks": "kingdom|"
        }
      ],
      "in_curated_sources": false,
      "data_sources_number": 23,
      "supplied_name_string": "Strix varia"
    }
  ]
}

BHL evaluates the “preferred_results”, “results”, and “match_type” fields to decide which data to keep and which to discard. Responses can include very uncertain, or “fuzzy”, matches, and BHL does not keep those.
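A minimal sketch of this kind of evaluation follows. It assumes that “preferred_results” is used whenever it contains entries, that “results” is only a fallback, and that a match is kept only if its match_type is in an accepted set. The function name `select_results` and the accepted set are hypothetical. The value 1 is the code shown for “Exact string match” in the example above. BHL's real cut-off is not given on this page.

```python
# Placeholder: 1 is the "Exact string match" code from the example response.
# Which other codes BHL accepts is not documented here.
ACCEPTED_MATCH_TYPES = {1}


def select_results(resolved_name, accepted=ACCEPTED_MATCH_TYPES):
    """Pick the matches to keep for one entry of "resolved_names"."""
    preferred = resolved_name.get("preferred_results") or []
    fallback = resolved_name.get("results") or []
    # Assumption: "results" is consulted only when nothing is preferred.
    candidates = preferred if preferred else fallback
    # Discard uncertain ("fuzzy") matches by their match_type code.
    return [r for r in candidates if r.get("match_type") in accepted]


resolved = {
    "results": [{"match_type": 1, "data_source_title": "NCBI"}],
    "preferred_results": [
        {"match_type": 1, "data_source_title": "EOL"},
        {"match_type": 1, "data_source_title": "uBio NameBank"},
    ],
}
for match in select_results(resolved):
    print(match["data_source_title"])
```

With the trimmed-down input above, only the EOL and uBio NameBank matches are returned.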

Once the names to keep are identified, the following data fields are read from the response and stored in BHL:

  • name_string
  • canonical_form
  • gni_uuid (the Global Names ID for the name)
  • data_source_title (the repository in which the name string was matched)
  • local_id (the ID for the name in the repository in which it was matched)
  • taxon_id (used in place of the local_id, if no local_id value exists)
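The field list above, including the taxon_id fallback, could be read from a single match roughly as follows. This is a sketch only. The function name `to_stored_record` and the combined `identifier` key are made up for illustration and say nothing about BHL's actual storage schema.

```python
def to_stored_record(result):
    """Copy the stored fields out of one GNRD match (hypothetical helper)."""
    return {
        "name_string": result.get("name_string"),
        "canonical_form": result.get("canonical_form"),
        "gni_uuid": result.get("gni_uuid"),
        "data_source_title": result.get("data_source_title"),
        # Use local_id when present, otherwise fall back to taxon_id.
        "identifier": result.get("local_id") or result.get("taxon_id"),
    }


eol_match = {
    "name_string": "Strix varia",
    "canonical_form": "Strix varia",
    "gni_uuid": "0183c26f-dc58-59b8-a703-edb842ceebf3",
    "data_source_title": "EOL",
    "local_id": "1045909",
    "taxon_id": "20633053",
}
print(to_stored_record(eol_match))
```

For a match with no local_id, such as the NCBI entry in the example response, the same function would return the taxon_id instead.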

So, in this example, BHL would store the following information (a name and three identifiers for that name):

  • name_string: Strix varia
  • canonical_form: Strix varia
  • gni_uuid: 0183c26f-dc58-59b8-a703-edb842ceebf3
  • data_source_title: EOL
  • local_id: 1045909
  • data_source_title: uBio NameBank
  • local_id: urn:lsid:ubio.org:namebank:3852582

Note that the “preferred_results” section is favored over the “results” section. That is why NCBI information is not included in what BHL stores.
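One way to picture “a name and three identifiers” is as a single record per name that carries the Global Names ID plus one ID per matched repository. The sketch below groups the two preferred matches from the example in that way. The `StoredName` class and the `collapse` function are hypothetical and are not a description of BHL's database design.

```python
from dataclasses import dataclass, field


@dataclass
class StoredName:
    name_string: str
    canonical_form: str
    gni_uuid: str
    # Maps data_source_title to the ID used in that repository.
    source_ids: dict = field(default_factory=dict)


def collapse(matches):
    """Group the kept matches for one name into a single record."""
    first = matches[0]
    stored = StoredName(
        first["name_string"], first["canonical_form"], first["gni_uuid"]
    )
    for match in matches:
        # local_id is used when present, taxon_id otherwise.
        repo_id = match.get("local_id") or match.get("taxon_id")
        stored.source_ids[match["data_source_title"]] = repo_id
    return stored


shared = {
    "name_string": "Strix varia",
    "canonical_form": "Strix varia",
    "gni_uuid": "0183c26f-dc58-59b8-a703-edb842ceebf3",
}
preferred = [
    {**shared, "data_source_title": "EOL", "local_id": "1045909"},
    {
        **shared,
        "data_source_title": "uBio NameBank",
        "local_id": "urn:lsid:ubio.org:namebank:3852582",
    },
]
stored = collapse(preferred)
print(stored.gni_uuid, stored.source_ids)
```

The three identifiers are the gni_uuid and the two entries in `source_ids`.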
