
Per-label precision biased upward in cross-label scenarios (IV, VI) #110

@martinlaursen-indsigtai

Description


Hi, and thanks for maintaining nervaluate. I've used it a lot, but I believe I've found an error that biases per-label precision upward in the cross-label scenarios (IV, VI).

In compute_metrics, Scenarios IV and VI (where pred.label != true.label) only increment counters keyed by true.label:

# Scenario IV
evaluation_agg_entities_type[true.e_type]['strict']['incorrect'] += 1
# evaluation_agg_entities_type[pred.e_type]['strict']['spurious'] += 1   # commented-out

# Scenario VI
evaluation_agg_entities_type[true.e_type]['strict']['incorrect'] += 1

Nothing is added to evaluation_agg_entities_type[pred.e_type]. As a result, predictions of label L_p that were wrong because the gold entity had a different label are missing from actual[L_p] (= correct + incorrect + spurious), so per-label precision for L_p is biased upward.

Example:
count | gold | pred
10 | A | A
3 | A | B
7 | B | A
10 | B | B

The model predicts A 17 times (10 correct, 7 wrong). Expected P[A] = 10/17. nervaluate reports P[A] = 10/13, because the 7 cross-label-wrong predictions of A are attributed to incorrect[B] rather than incorrect[A].
Mirror result: nervaluate reports P[B] = 10/17 while the correct value is P[B] = 10/13. The two labels' real precisions get silently swapped.
This only shows up when reading results_per_tag.
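To make the arithmetic concrete, here is a minimal model of the current gold-side-only counter logic (plain Python, not nervaluate's actual code), applied to the confusion counts above:

from collections import defaultdict

# (gold_label, pred_label) -> count, taken from the example table
confusion = {("A", "A"): 10, ("A", "B"): 3, ("B", "A"): 7, ("B", "B"): 10}

counters = defaultdict(lambda: {"correct": 0, "incorrect": 0, "spurious": 0})
for (gold, pred), n in confusion.items():
    if gold == pred:
        counters[gold]["correct"] += n
    else:
        # Scenarios IV/VI: only the gold-side label is charged
        counters[gold]["incorrect"] += n

for label in ("A", "B"):
    c = counters[label]
    actual = c["correct"] + c["incorrect"] + c["spurious"]
    print(f"P[{label}] = {c['correct']}/{actual}")
# Prints P[A] = 10/13 and P[B] = 10/17, matching the swapped values reported above.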

The README's note about scenarios IV/VI argues the prediction shouldn't be marked spurious (since a gold entity exists, just with a different type). But that justification doesn't address which label's incorrect counter the pair should land in. Currently it lands in only one (the gold side), and the pred-side denominator silently loses the contribution.

The suggested fix is to split the per-label incorrect counter into two:

  • incorrect_as_gold[L]: a gold entity of label L was wrongly handled. Used in the recall denominator (possible[L] = correct + incorrect_as_gold + missed).
  • incorrect_as_pred[L]: a prediction of label L was wrong. Used in the precision denominator (actual[L] = correct + incorrect_as_pred + spurious).

For each Scenario IV/VI pair, increment incorrect_as_gold[true.label] AND incorrect_as_pred[pred.label]. Same-label Scenario V increments both counters for the same label. See the sketch below.
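A minimal sketch of this counter scheme, using the counter names proposed above (not a drop-in patch for compute_metrics), verified against the example:

from collections import defaultdict

confusion = {("A", "A"): 10, ("A", "B"): 3, ("B", "A"): 7, ("B", "B"): 10}

counters = defaultdict(lambda: {
    "correct": 0, "incorrect_as_gold": 0, "incorrect_as_pred": 0,
    "missed": 0, "spurious": 0,
})
for (gold, pred), n in confusion.items():
    if gold == pred:
        counters[gold]["correct"] += n  # Scenario I: exact match
    else:
        # Scenarios IV/VI: charge both sides of the pair
        counters[gold]["incorrect_as_gold"] += n   # hurts recall for the gold label
        counters[pred]["incorrect_as_pred"] += n   # hurts precision for the pred label
# (Boundary-error scenarios are omitted here; the example has none.)

for label in ("A", "B"):
    c = counters[label]
    actual = c["correct"] + c["incorrect_as_pred"] + c["spurious"]
    possible = c["correct"] + c["incorrect_as_gold"] + c["missed"]
    print(f"P[{label}] = {c['correct']}/{actual}, "
          f"R[{label}] = {c['correct']}/{possible}")
# Prints P[A] = 10/17, R[A] = 10/13 and P[B] = 10/13, R[B] = 10/17, as expected.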

The same pattern exists in davidsbatista/NER-Evaluation (https://github.com/davidsbatista/NER-Evaluation), from which this fork inherits, so the fix would benefit both lineages.

This is distinct from #66 (which addresses spurious tags being spread across all classes, a different per-label precision bias with a different cause), and unrelated to the multi-counting fixes in #39/#40.

Happy to send a PR if a fix would be welcome.
