Hi, and thanks for maintaining nervaluate. I've used it a lot, but I believe I've found a bug that biases per-label precision upward in the cross-label scenarios (IV and VI).
In compute_metrics, Scenarios IV and VI (where pred.label != true.label) only increment counters keyed by true.label:
```python
# Scenario IV
evaluation_agg_entities_type[true.e_type]['strict']['incorrect'] += 1
# evaluation_agg_entities_type[pred.e_type]['strict']['spurious'] += 1  # commented out in the source

# Scenario VI
evaluation_agg_entities_type[true.e_type]['strict']['incorrect'] += 1
```
Nothing is added to evaluation_agg_entities_type[pred.e_type]. As a result, predictions of label L_p that were wrong because the gold was a different label are missing from actual[L_p] (= correct + incorrect + spurious), so per-label precision for L_p is biased upward.
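For concreteness, per-label strict precision is derived from these counters along the following lines (a paraphrased sketch of the computation, not nervaluate's verbatim code):

```python
def strict_precision(counts: dict) -> float:
    # actual = every prediction attributed to this label. A cross-label-wrong
    # prediction of L_p currently never reaches incorrect[L_p], so this
    # denominator is too small and precision comes out too high.
    actual = counts["correct"] + counts["incorrect"] + counts["spurious"]
    return counts["correct"] / actual if actual else 0.0
```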
Example:
| count | gold | pred |
|-------|------|------|
| 10    | A    | A    |
| 3     | A    | B    |
| 7     | B    | A    |
| 10    | B    | B    |
The model predicts A 17 times (10 correct, 7 wrong). Expected P[A] = 10/17. nervaluate reports P[A] = 10/13, because the 7 cross-label-wrong predictions of A are attributed to incorrect[B] rather than incorrect[A].
Mirror result: nervaluate reports P[B] = 10/17, while the correct value is 10/13. In this example, the two labels' reported precisions are silently swapped relative to their true values.
The aggregate scores are unaffected, since the pair is still counted exactly once overall; the bias only shows up when reading results_per_tag.
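A minimal reproduction sketch, assuming the classic list-of-dicts input format; the exact return shape of evaluate() varies across releases, so the per-tag unpacking below may need adjusting:

```python
from nervaluate import Evaluator

# One single-entity "document" per (gold, pred) pair from the table above;
# spans are identical, so cross-label pairs hit Scenario IV under strict.
pairs = [("A", "A")] * 10 + [("A", "B")] * 3 + [("B", "A")] * 7 + [("B", "B")] * 10
true = [[{"label": g, "start": 0, "end": 1}] for g, _ in pairs]
pred = [[{"label": p, "start": 0, "end": 1}] for _, p in pairs]

out = Evaluator(true, pred, tags=["A", "B"]).evaluate()
per_tag = out[1]  # per-tag results; position may differ by version

print(per_tag["A"]["strict"]["precision"])  # 10/13 ≈ 0.769, expected 10/17 ≈ 0.588
print(per_tag["B"]["strict"]["precision"])  # 10/17 ≈ 0.588, expected 10/13 ≈ 0.769
```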
The README's note about scenarios IV/VI argues the prediction shouldn't be marked spurious (since gold exists, with a different type). But that justification doesn't address which label's incorrect counter the pair should land in. Currently it lands in only one (gold-side), and the pred-side denominator silently loses the contribution.
My suggested fix is to split the per-label incorrect counter into two:
- incorrect_as_gold[L]: a gold entity of label L was matched by a wrong prediction. Feeds the recall denominator (possible[L] = correct + incorrect_as_gold + missed).
- incorrect_as_pred[L]: a prediction of label L was wrong. Feeds the precision denominator (actual[L] = correct + incorrect_as_pred + spurious).
For each scenario IV / VI pair, increment incorrect_as_gold[true.label] AND incorrect_as_pred[pred.label]. Same-label scenario V increments both for the same label.
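A self-contained sketch of the proposed bookkeeping (the counter names come from this proposal, not from existing nervaluate identifiers):

```python
from collections import defaultdict

per_label = defaultdict(lambda: {
    "correct": 0, "incorrect_as_gold": 0, "incorrect_as_pred": 0,
    "missed": 0, "spurious": 0,
})

def record_cross_label(true_label: str, pred_label: str) -> None:
    # Scenarios IV/VI: a gold entity exists but the predicted label differs.
    per_label[true_label]["incorrect_as_gold"] += 1  # hurts recall of the gold label
    per_label[pred_label]["incorrect_as_pred"] += 1  # hurts precision of the predicted label

def precision(label: str) -> float:
    c = per_label[label]
    actual = c["correct"] + c["incorrect_as_pred"] + c["spurious"]
    return c["correct"] / actual if actual else 0.0

def recall(label: str) -> float:
    c = per_label[label]
    possible = c["correct"] + c["incorrect_as_gold"] + c["missed"]
    return c["correct"] / possible if possible else 0.0

# Replaying the example table:
per_label["A"]["correct"] += 10
per_label["B"]["correct"] += 10
for _ in range(3):
    record_cross_label("A", "B")
for _ in range(7):
    record_cross_label("B", "A")

print(precision("A"))  # 10/17 ≈ 0.588, as expected
print(precision("B"))  # 10/13 ≈ 0.769, as expected
```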
The same pattern exists in davidsbatista/NER-Evaluation (https://github.com/davidsbatista/NER-Evaluation), which this library was forked from, so the fix would benefit both lineages.
This is distinct from #66, which addresses spurious counts being spread across all classes (a different per-label precision bias with a different cause), and unrelated to the multi-counting fixes in #39/#40.
Happy to send a PR if a fix would be welcome.