How does the Taxonomic Name Recognition algorithm work in BHL?

Question

Bianca Crowley · Accepted Answer

When new or updated page text is added to BHL, that text is used as input for the gnfinder tool from Global Names. See https://github.com/gnames/gnfinder for more information. The gnfinder tool does the following:

Analyzes the text
Identifies text strings that might be scientific names
Compares potential names to multiple repositories of scientific names (such as EOL and Catalogue of Life) to identify known names
Compiles the results
Outputs the results

Questions about the details of how each of these steps is done should be directed to Global Names Architecture. Here is an example gnfinder response to a fragment of text that includes the name “Strix varia”. NOTE: This example uses a response from gnfinder version 0.11.1. Other versions of the tool may format the response differently. The fields that are evaluated and/or stored by BHL are listed after the example:

{
  "metadata": {
    "date": "2020-06-24T16:39:42.4189206-05:00",
    "gnfinderVersion": "v0.11.1",
    "withBayes": true,
    "tokensAround": 0,
    "language": "eng",
    "detectLanguage": false,
    "totalWords": 462,
    "totalCandidates": 68,
    "totalNames": 7
  },
  "names": [
    {
      "cardinality": 2,
      "verbatim": "Strix varia,",
      "name": "Strix varia",
      "odds": 550545.1983958198,
      "start": 2296,
      "end": 2308,
      "annotationNomenType": "NO_ANNOT",
      "annotation": "",
      "verification": {
        "bestResult": {
          "dataSourceId": 1,
          "dataSourceTitle": "Catalogue of Life",
          "taxonId": "3809730",
          "matchedName": "Strix varia Barton, 1799",
          "matchedCardinality": 2,
          "matchedCanonicalSimple": "Strix varia",
          "matchedCanonicalFull": "Strix varia",
          "classificationPath": "Animalia|Chordata|Aves|Strigiformes|Strigidae|Strix|Strix varia",
          "classificationRank": "kingdom|phylum|class|order|family|genus|species",
          "classificationIds": "3939792|3940184|3944244|3944475|3944476|4195146|3809730",
          "matchType": "ExactCanonicalMatch"
        },
        "dataSourcesNum": 28,
        "dataSourceQuality": "HasCuratedSources",
        "retries": 1
      }
    }
  ]
}

The “dataSourceQuality”, “matchType”, and “odds” fields are evaluated to determine which names to keep and which to discard (responses can include some very uncertain, or “fuzzy”, matches that BHL does not keep). Once the names to keep are identified, the following data fields are read from the response and stored in BHL:

name
matchedName
matchedCanonicalFull
dataSourceId (the ID of the repository in which the name string was matched)
dataSourceTitle (the title of the repository in which the name string was matched)
localId (the ID for the name in the repository in which it was matched)
taxonId (used in place of the localId, if no localId value exists)
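
The keep-or-discard decision described above can be illustrated with a short Python sketch. This is not BHL's actual code: the function name keep_name, the MIN_ODDS threshold, and the two accepted-value sets are assumptions made for illustration, since the cutoffs BHL really applies are not stated here. Only the three field names and the sample values come from the example response.

```python
# Illustrative sketch only. The threshold and accepted values are
# hypothetical; BHL's real criteria are not documented in this answer.
MIN_ODDS = 100.0                                # assumed cutoff
ACCEPTED_QUALITY = {"HasCuratedSources"}        # value seen in the sample
ACCEPTED_MATCH_TYPES = {"ExactCanonicalMatch"}  # value seen in the sample


def keep_name(entry, min_odds=MIN_ODDS):
    """Return True if a gnfinder name entry passes all three checks."""
    verification = entry.get("verification") or {}
    best = verification.get("bestResult") or {}
    return (
        entry.get("odds", 0.0) >= min_odds
        and verification.get("dataSourceQuality") in ACCEPTED_QUALITY
        and best.get("matchType") in ACCEPTED_MATCH_TYPES
    )


# Abbreviated copy of the "Strix varia" entry from the example response.
sample = {
    "name": "Strix varia",
    "odds": 550545.1983958198,
    "verification": {
        "bestResult": {"matchType": "ExactCanonicalMatch"},
        "dataSourceQuality": "HasCuratedSources",
    },
}

print(keep_name(sample))
```

An entry that lacks a verification block, or whose odds fall below the assumed cutoff, is rejected by this sketch.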

So, in this example, BHL would store the following information (a name and three identifiers for that name):

name: Strix varia
matchedName: Strix varia Barton, 1799
matchedCanonicalFull: Strix varia
dataSourceId: 1
dataSourceTitle: Catalogue of Life
taxonId: 3809730
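
As a rough illustration of how the stored fields could be read from a response entry, here is a minimal Python sketch. The function name extract_record and the output key "identifier" are hypothetical, and the sketch assumes that localId, when present, sits in bestResult alongside taxonId (the example response does not include a localId, so its location is an assumption).

```python
def extract_record(entry):
    """Collect the fields BHL stores from one gnfinder name entry.

    Assumption: localId, when present, appears inside bestResult.
    """
    best = entry["verification"]["bestResult"]
    return {
        "name": entry["name"],
        "matchedName": best["matchedName"],
        "matchedCanonicalFull": best["matchedCanonicalFull"],
        "dataSourceId": best["dataSourceId"],
        "dataSourceTitle": best["dataSourceTitle"],
        # Prefer localId; fall back to taxonId when localId is missing or empty.
        "identifier": best.get("localId") or best.get("taxonId"),
    }


# Abbreviated copy of the "Strix varia" entry from the example response.
sample = {
    "name": "Strix varia",
    "verification": {
        "bestResult": {
            "dataSourceId": 1,
            "dataSourceTitle": "Catalogue of Life",
            "taxonId": "3809730",
            "matchedName": "Strix varia Barton, 1799",
            "matchedCanonicalFull": "Strix varia",
        }
    },
}

record = extract_record(sample)
for key, value in record.items():
    print(f"{key}: {value}")
```

Because the sample has no localId, the identifier in the resulting record is the taxonId value, matching the stored data listed above.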