TAC KBP 2015 : Event Detection and Coreference Tasks for English

Scoring Guideline

Evaluation Tools (all tools, with documentation, sample data, and test cases)

  • The Improved Up-to-date Converter, Scoring and Validation Package tar.gz zip
    • The improved scorer is version 1.7, which improves the alignment algorithm. The new algorithm penalizes false positives more heavily, which should reflect a system's performance more accurately.
    • Note that the new version will generally give a lower score compared with the older version, due to the increased penalty for false positives.
    • View and report issues on GitHub
  • Package used for the official scoring: zip
    • This package was used to produce the official KBP scores. The scorer has since been updated; for future publications, please refer to the new scorer.
    • Scorer version 1.6.3 is a faster version of the official scorer; its score output is identical to that of the older version.

News

  • Dec 11: Announcing the updated scorer.
    • See the download section above for more details.
  • Aug 28: Scorer will now canonicalize the types to reduce inconsistencies arising from the annotation guidelines. The following normalization is applied to all attributes before comparison.
    • All attributes from gold and system will be lowercased.
    • All whitespace and punctuation will be removed.
    • For example, Contact_Broadcast will be converted to contactbroadcast.
    • These changes should not create any scoring differences if you have already followed the attribute names in the guidelines or training data.
  • Aug 12: Scorer is now v1.6, converter is now version 1.0.2. It is highly advised to update to the current versions even if you have updated recently. The new versions include the following changes.
    • Converter :
      • The training data now also annotates mentions in anchor text; the converter now supports these annotations.
      • The default file extensions have been changed to match the training data suffixes.
      • File extensions are now configurable.
    • Scorer :
      • Due to the large number of mentions with multiple annotations, computing the optimal alignment between system and gold is infeasible. We fall back to a greedy method with arbitrary tie-breaking, which may not reach the best possible alignment in some cases.
      • We exclude MUC from the document-average score. This does not affect the final score on coreference. MUC is designed to produce 0 when both gold and system contain only singletons, while the rest of the metrics produce non-zero scores. See this issue for details.
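
The attribute canonicalization announced above (lowercase everything, strip whitespace and punctuation) can be sketched in a few lines. This is an illustrative reimplementation, not the scorer's actual code; the function name is an assumption.

```python
import re
import string

def canonicalize(attribute):
    """Lowercase an attribute and strip all whitespace and punctuation,
    mirroring the canonicalization the scorer applies before comparing
    gold and system attributes. (Illustrative; not the scorer's API.)"""
    attribute = attribute.lower()
    # Remove any whitespace or punctuation character.
    return re.sub(r"[\s" + re.escape(string.punctuation) + r"]", "", attribute)

# The example from the announcement:
assert canonicalize("Contact_Broadcast") == "contactbroadcast"
```

With this normalization, Contact_Broadcast, contact_broadcast, and "Contact Broadcast" all compare equal, which is why systems that already follow the guideline attribute names see no score change.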
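
The greedy alignment fallback mentioned in the Aug 12 scorer notes can be sketched as follows. The data shapes and names here are illustrative assumptions, not the scorer's actual implementation; the point is that greedy matching with arbitrary tie-breaking can miss the globally optimal alignment.

```python
def greedy_align(scores):
    """Greedily align gold and system mentions: repeatedly take the
    highest-scoring unmatched (gold, system) pair. Ties are broken
    arbitrarily (here, by iteration order), so the result may not be
    the best possible alignment.

    scores: dict mapping (gold_id, system_id) -> overlap score.
    Returns a list of (gold_id, system_id, score) triples.
    """
    pairs = sorted(scores.items(), key=lambda kv: -kv[1])
    used_gold, used_sys, alignment = set(), set(), []
    for (g, s), score in pairs:
        if score > 0 and g not in used_gold and s not in used_sys:
            alignment.append((g, s, score))
            used_gold.add(g)
            used_sys.add(s)
    return alignment

# Greedy takes the 0.9 pair first, blocking both 0.8 pairs (total 0.9),
# even though matching g1-s2 and g2-s1 would total 1.6.
scores = {("g1", "s1"): 0.9, ("g1", "s2"): 0.8, ("g2", "s1"): 0.8}
assert greedy_align(scores) == [("g1", "s1", 0.9)]
```

An exact method (e.g. solving the assignment problem) would find the optimal matching, but as the note explains, the large number of multiply-annotated mentions makes that infeasible here.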