How does the Taxonomic Name Recognition algorithm work in BHL?

When new or updated page text is added to BHL, that text is used as input for the gnfinder tool from Global Names. See https://github.com/gnames/gnfinder for more information.

The gnfinder tool does the following:

  1. Analyzes the text
  2. Identifies text strings that might be scientific names
  3. Compares potential names to multiple repositories of scientific names (such as EOL and Catalogue of Life) to identify known names
  4. Compiles the results
  5. Outputs the results

Questions about the details of how all of this is done should be directed to Global Names Architecture.

Here is an example gnfinder response to a fragment of text that includes the name “Strix varia”.

NOTE: This example uses a response from gnfinder version 0.11.1. Other versions of the tool may format the response differently.

Fields that are evaluated and/or stored by BHL are described after the example:

{
  "metadata": {
    "date": "2020-06-24T16:39:42.4189206-05:00",
    "gnfinderVersion": "v0.11.1",
    "withBayes": true,
    "tokensAround": 0,
    "language": "eng",
    "detectLanguage": false,
    "totalWords": 462,
    "totalCandidates": 68,
    "totalNames": 7
  },
  "names": [
    {
      "cardinality": 2,
      "verbatim": "Strix varia,",
      "name": "Strix varia",
      "odds": 550545.1983958198,
      "start": 2296,
      "end": 2308,
      "annotationNomenType": "NO_ANNOT",
      "annotation": "",
      "verification": {
        "bestResult": {
          "dataSourceId": 1,
          "dataSourceTitle": "Catalogue of Life",
          "taxonId": "3809730",
          "matchedName": "Strix varia Barton, 1799",
          "matchedCardinality": 2,
          "matchedCanonicalSimple": "Strix varia",
          "matchedCanonicalFull": "Strix varia",
          "classificationPath": "Animalia|Chordata|Aves|Strigiformes|Strigidae|Strix|Strix varia",
          "classificationRank": "kingdom|phylum|class|order|family|genus|species",
          "classificationIds": "3939792|3940184|3944244|3944475|3944476|4195146|3809730",
          "matchType": "ExactCanonicalMatch"
        },
        "dataSourcesNum": 28,
        "dataSourceQuality": "HasCuratedSources",
        "retries": 1
      }
    }
  ]
}

The “dataSourceQuality”, “matchType”, and “odds” fields are evaluated to decide which names to keep and which to discard. Responses can include very uncertain, or “fuzzy”, matches, and BHL does not keep those.
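
As a rough illustration of this kind of filtering, the Python sketch below checks the three fields on a single entry from the "names" array. The accepted values and the odds threshold are hypothetical placeholders, because the actual criteria BHL applies are not stated here. The function name keep_name is likewise an assumption made for the example.

```python
# Hypothetical acceptance rules: BHL's real criteria are not documented here.
ACCEPTED_QUALITY = {"HasCuratedSources"}        # value seen in the example response
ACCEPTED_MATCH_TYPES = {"ExactCanonicalMatch"}  # value seen in the example response
MIN_ODDS = 100.0                                # placeholder threshold


def keep_name(entry: dict) -> bool:
    """Return True if a gnfinder name entry passes the illustrative checks."""
    verification = entry.get("verification") or {}
    best = verification.get("bestResult") or {}
    return (
        verification.get("dataSourceQuality") in ACCEPTED_QUALITY
        and best.get("matchType") in ACCEPTED_MATCH_TYPES
        and entry.get("odds", 0.0) >= MIN_ODDS
    )


# Trimmed copy of the "Strix varia" entry from the example response.
strix = {
    "name": "Strix varia",
    "odds": 550545.1983958198,
    "verification": {
        "bestResult": {"matchType": "ExactCanonicalMatch"},
        "dataSourceQuality": "HasCuratedSources",
    },
}

print(keep_name(strix))  # True
```

An entry with no "verification" block, or with odds below the placeholder threshold, is rejected by this sketch.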

Once the names to keep are identified, the following data fields are read from the response and stored in BHL:

  • name
  • matchedName
  • matchedCanonicalFull
  • dataSourceId (ID for the repository in which the name string was matched)
  • dataSourceTitle (the repository in which the name string was matched)
  • localId (the ID for the name in the repository in which it was matched)
  • taxonId (used in place of the localId, if no localId value exists)

So, in this example, BHL would store the following information (three forms of the name, plus identifiers for the data source and for the name within that source):

  • name: Strix varia
  • matchedName: Strix varia Barton, 1799
  • matchedCanonicalFull: Strix varia
  • dataSourceId: 1
  • dataSourceTitle: Catalogue of Life
  • taxonId: 3809730
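
The field selection described above, including the fallback from localId to taxonId, could be sketched in Python as follows. This is a minimal illustration and not BHL's actual code: the function name fields_to_store and the output key "nameId" are hypothetical, and the sketch assumes the entry has already been accepted and contains a "bestResult" block.

```python
def fields_to_store(entry: dict) -> dict:
    """Pick the values to store from one accepted gnfinder name entry."""
    best = entry["verification"]["bestResult"]
    # Prefer localId; fall back to taxonId when localId is missing or empty.
    name_id = best.get("localId") or best.get("taxonId")
    return {
        "name": entry["name"],
        "matchedName": best["matchedName"],
        "matchedCanonicalFull": best["matchedCanonicalFull"],
        "dataSourceId": best["dataSourceId"],
        "dataSourceTitle": best["dataSourceTitle"],
        "nameId": name_id,  # hypothetical key holding localId or taxonId
    }


# Trimmed copy of the "Strix varia" entry from the example response.
strix = {
    "name": "Strix varia",
    "verification": {
        "bestResult": {
            "dataSourceId": 1,
            "dataSourceTitle": "Catalogue of Life",
            "taxonId": "3809730",
            "matchedName": "Strix varia Barton, 1799",
            "matchedCanonicalFull": "Strix varia",
        }
    },
}

stored = fields_to_store(strix)
print(stored["nameId"])  # 3809730 (taxonId, because the entry has no localId)
```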