Hi, and thanks for maintaining nervaluate. I've used it a lot, but I believe I've found a bug that biases per-label precision upward in the cross-label scenarios (IV and VI).
In compute_metrics, Scenarios IV and VI (where pred.label != true.label) only increment counters keyed by true.label:
```python
# Scenario IV
evaluation_agg_entities_type[true.e_type]['strict']['incorrect'] += 1
# evaluation_agg_entities_type[pred.e_type]['strict']['spurious'] += 1  # commented out in the source

# Scenario VI
evaluation_agg_entities_type[true.e_type]['strict']['incorrect'] += 1
```
Nothing is added to evaluation_agg_entities_type[pred.e_type]. As a result, predictions of label L_p that were wrong because the gold was a different label are missing from actual[L_p] (= correct + incorrect + spurious), so per-label precision for L_p is biased upward.
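For concreteness, per-label strict precision is derived from these counters along the following lines (a paraphrased sketch of the computation, not nervaluate's verbatim code):

```python
def strict_precision(counts: dict) -> float:
    # actual = every prediction attributed to this label. A cross-label-wrong
    # prediction of L_p currently never reaches incorrect[L_p], so this
    # denominator is too small and precision comes out too high.
    actual = counts["correct"] + counts["incorrect"] + counts["spurious"]
    return counts["correct"] / actual if actual else 0.0
```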
Example:
| count | gold | pred |
|-------|------|------|
| 10    | A    | A    |
| 3     | A    | B    |
| 7     | B    | A    |
| 10    | B    | B    |
The model predicts A 17 times (10 correct, 7 wrong). Expected P[A] = 10/17. nervaluate reports P[A] = 10/13, because the 7 cross-label-wrong predictions of A are attributed to incorrect[B] rather than incorrect[A].
Mirror result: nervaluate reports P[B] = 10/17, while the correct value is 10/13. In this example, the two labels' reported precisions are silently swapped relative to their true values.
The aggregate scores are unaffected, since the pair is still counted exactly once overall; the bias only shows up when reading results_per_tag.
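A minimal reproduction sketch, assuming the classic list-of-dicts input format; the exact return shape of evaluate() varies across releases, so the per-tag unpacking below may need adjusting:

```python
from nervaluate import Evaluator

# One single-entity "document" per (gold, pred) pair from the table above;
# spans are identical, so cross-label pairs hit Scenario IV under strict.
pairs = [("A", "A")] * 10 + [("A", "B")] * 3 + [("B", "A")] * 7 + [("B", "B")] * 10
true = [[{"label": g, "start": 0, "end": 1}] for g, _ in pairs]
pred = [[{"label": p, "start": 0, "end": 1}] for _, p in pairs]

out = Evaluator(true, pred, tags=["A", "B"]).evaluate()
per_tag = out[1]  # per-tag results; position may differ by version

print(per_tag["A"]["strict"]["precision"])  # 10/13 ≈ 0.769, expected 10/17 ≈ 0.588
print(per_tag["B"]["strict"]["precision"])  # 10/17 ≈ 0.588, expected 10/13 ≈ 0.769
```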
The README's note about scenarios IV/VI argues the prediction shouldn't be marked spurious (since gold exists, with a different type). But that justification doesn't address which label's incorrect counter the pair should land in. Currently it lands in only one (gold-side), and the pred-side denominator silently loses the contribution.
My suggested fix is to split the per-label incorrect counter into two:
- incorrect_as_gold[L]: a gold entity of label L was matched by a wrong prediction. Feeds the recall denominator (possible[L] = correct + incorrect_as_gold + missed).
- incorrect_as_pred[L]: a prediction of label L was wrong. Feeds the precision denominator (actual[L] = correct + incorrect_as_pred + spurious).
For each scenario IV / VI pair, increment incorrect_as_gold[true.label] AND incorrect_as_pred[pred.label]. Same-label scenario V increments both for the same label.
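A self-contained sketch of the proposed bookkeeping (the counter names come from this proposal, not from existing nervaluate identifiers):

```python
from collections import defaultdict

per_label = defaultdict(lambda: {
    "correct": 0, "incorrect_as_gold": 0, "incorrect_as_pred": 0,
    "missed": 0, "spurious": 0,
})

def record_cross_label(true_label: str, pred_label: str) -> None:
    # Scenarios IV/VI: a gold entity exists but the predicted label differs.
    per_label[true_label]["incorrect_as_gold"] += 1  # hurts recall of the gold label
    per_label[pred_label]["incorrect_as_pred"] += 1  # hurts precision of the predicted label

def precision(label: str) -> float:
    c = per_label[label]
    actual = c["correct"] + c["incorrect_as_pred"] + c["spurious"]
    return c["correct"] / actual if actual else 0.0

def recall(label: str) -> float:
    c = per_label[label]
    possible = c["correct"] + c["incorrect_as_gold"] + c["missed"]
    return c["correct"] / possible if possible else 0.0

# Replaying the example table:
per_label["A"]["correct"] += 10
per_label["B"]["correct"] += 10
for _ in range(3):
    record_cross_label("A", "B")
for _ in range(7):
    record_cross_label("B", "A")

print(precision("A"))  # 10/17 ≈ 0.588, as expected
print(precision("B"))  # 10/13 ≈ 0.769, as expected
```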
The same pattern exists in davidsbatista/NER-Evaluation (https://github.com/davidsbatista/NER-Evaluation), which this library was forked from, so the fix would benefit both lineages.
This is distinct from #66, which addresses spurious counts being spread across all classes (a different per-label precision bias with a different cause), and unrelated to the multi-counting fixes in #39/#40.
Happy to send a PR if a fix would be welcome.