
Fix MeanAverageRecall: compute mAR@K using top-K detections per image (COCO-compliant)#2136

Open
stop1one wants to merge 8 commits into roboflow:develop from stop1one:fix/mAR-at-K-per-image

Conversation


stop1one commented Feb 4, 2026

Supersedes #1967

Description

This PR fixes the calculation of mAR@K in MeanAverageRecall to comply with the COCO evaluation protocol.
Previously, the implementation selected the top-K predictions globally across all images, rather than per image.
According to the COCO evaluation protocol, mAR@K should be calculated by considering the top-K highest-confidence detections for each image.

This issue is tracked in #1966.

To resolve this, I modified the _compute and _compute_average_recall_for_classes functions to filter each image's statistics down to its top-K highest-confidence predictions before concatenating them and computing the confusion matrix.
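For illustration, here is a minimal sketch of the difference (hypothetical helper names, not the library's internal code), contrasting the COCO-compliant per-image selection with the previous global selection:

```python
import numpy as np

# Hypothetical sketch, not the library's internal code.

# COCO-compliant: keep the top-K most confident detections in EACH image.
def top_k_per_image(confidences_per_image: list[np.ndarray], k: int) -> list[np.ndarray]:
    return [np.sort(conf)[::-1][:k] for conf in confidences_per_image]

# Previous (buggy) behaviour: rank all detections jointly and keep the
# K most confident across the whole dataset.
def top_k_global(confidences_per_image: list[np.ndarray], k: int) -> np.ndarray:
    return np.sort(np.concatenate(confidences_per_image))[::-1][:k]

confs = [np.array([0.9, 0.8, 0.3]), np.array([0.7, 0.6, 0.5])]
print(top_k_per_image(confs, 2))  # [0.9, 0.8] and [0.7, 0.6]: each image keeps 2
print(top_k_global(confs, 2))     # [0.9, 0.8]: the second image loses everything
```

Under global selection, images with confident predictions crowd out the rest of the dataset, which deflates recall for every other image.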

No new dependencies are required for this change.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this change been tested? Please provide a test case or example of how you tested the change.

I tested the change by running the metric on a dataset with varying numbers of predictions per image and verified that, for each image, only the top-K predictions (by confidence) were used in the mAR@K calculation.

Any specific deployment considerations

No special deployment considerations are required.

Docs

  • Docs updated? What were the changes: N/A


stop1one commented Feb 4, 2026

@Borda I accidentally closed the previous PR (#1967).
I will continue the work here, focusing on fixing the mAR@K calculation issue.

Regarding the implementation, please check my last question there.
I'll start fixing this issue as soon as you provide your feedback.


codecov bot commented Feb 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72%. Comparing base (cd0aac2) to head (dc6edc3).

❌ Your project check has failed because the head coverage (72%) is below the target coverage (95%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff           @@
##           develop   #2136   +/-   ##
=======================================
  Coverage       72%     72%           
=======================================
  Files           61      61           
  Lines         7249    7246    -3     
=======================================
+ Hits          5246    5249    +3     
+ Misses        2003    1997    -6     

stop1one marked this pull request as ready for review February 5, 2026 01:15
stop1one requested a review from SkalskiP as a code owner February 5, 2026 01:15

stop1one commented Feb 5, 2026

I've finished the implementation and resolved all conflicts.

I've added a simple unit test with synthetic data to validate the mAR@K calculation.

Test setup:

  • 15 images in total.
  • Each image has ≤ 5 bounding boxes/detections.
  • Therefore, mAR@10 and mAR@100 should be identical, since $K=10$ already exceeds the maximum number of detections per image (a sketch of this test follows below).
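A minimal sketch of the shape of this test (the data setup is abbreviated, the import path and result attribute names are assumptions, and this is not the exact test added in the PR):

```python
import numpy as np
import supervision as sv
from supervision.metrics import MeanAverageRecall  # assumed import path

# Two toy images with at most 2 detections each; the real test uses 15
# images with up to 5 boxes. Since K=10 already exceeds the per-image
# detection count, mAR@10 and mAR@100 must coincide.
targets = [
    sv.Detections(xyxy=np.array([[0.0, 0.0, 10.0, 10.0]]), class_id=np.array([0])),
    sv.Detections(xyxy=np.array([[5.0, 5.0, 15.0, 15.0]]), class_id=np.array([0])),
]
predictions = [
    sv.Detections(
        xyxy=np.array([[0.0, 0.0, 10.0, 10.0], [40.0, 40.0, 50.0, 50.0]]),
        class_id=np.array([0, 0]),
        confidence=np.array([0.9, 0.4]),
    ),
    sv.Detections(
        xyxy=np.array([[6.0, 6.0, 14.0, 14.0]]),
        class_id=np.array([0]),
        confidence=np.array([0.7]),
    ),
]

metric = MeanAverageRecall()
for pred, tgt in zip(predictions, targets):
    metric.update(pred, tgt)
result = metric.compute()

# Attribute names on the result object are assumed for illustration:
np.testing.assert_almost_equal(result.mAR_at_10, result.mAR_at_100, decimal=5)
```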

Result with the original (buggy) implementation:

E           AssertionError: 
E           Arrays are not almost equal to 5 decimals
E           
E           Mismatched elements: 2 / 3 (66.7%)
E           Max absolute difference among violations: 0.23173375
E           Max relative difference among violations: 0.80613893
E            ACTUAL: array([0.05573, 0.52786, 0.63622])
E            DESIRED: array([0.28746, 0.63622, 0.63622])

As shown above, mAR@10 (0.52786) ≠ mAR@100 (0.63622), which is incorrect: the original code was applying top-K filtering across the whole dataset rather than per image. The fix in this PR corrects that behaviour.

Borda requested a review from Copilot February 5, 2026 09:10
Borda added the bug label Feb 5, 2026

Copilot AI left a comment


Pull request overview

This PR fixes a critical bug in the MeanAverageRecall metric calculation to comply with the COCO evaluation protocol. Previously, the implementation selected the top-K predictions globally across all images, rather than per image as specified by COCO. This fix ensures that mAR@K is calculated by considering the top-K highest-confidence detections for each image independently.

Changes:

  • Modified the _compute method to track prediction positions within each image using indices instead of confidence scores
  • Updated _compute_average_recall_for_classes to filter predictions by per-image rank before computing the confusion matrix (see the sketch after this list)
  • Removed the max_detections parameter from _compute_confusion_matrix since filtering now happens upstream
  • Added comprehensive integration tests with 15 test images covering various scenarios
  • Fixed a duplicate error handling statement (code cleanup)
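A minimal sketch of the per-image rank idea from the first two bullets above (hypothetical names; the actual change lives in mean_average_recall.py):

```python
import numpy as np

# Hypothetical illustration: record each prediction's rank within its own
# image; mAR@K filtering then reduces to keeping rows with rank < K,
# applied BEFORE the per-image statistics are concatenated.
def per_image_ranks(confidences: np.ndarray) -> np.ndarray:
    order = np.argsort(-confidences)      # indices sorted by descending confidence
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(order))  # rank 0 = most confident in this image
    return ranks

image_confs = [np.array([0.9, 0.4, 0.7]), np.array([0.8, 0.6])]
k = 2
kept = [c[per_image_ranks(c) < k] for c in image_confs]  # per-image top-K
stats = np.concatenate(kept)  # concatenate only after filtering
print(stats)  # [0.9 0.7 0.8 0.6]
```

Tracking ranks rather than raw confidence values keeps the K cutoff exact per image, even when confidences tie across images.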

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File descriptions:

  • src/supervision/metrics/mean_average_recall.py: core logic fix, changed from global top-K to per-image top-K by tracking prediction indices within each image and filtering before confusion matrix computation
  • tests/metrics/test_mean_average_recall.py: added a comprehensive test suite with 15 test images covering various detection scenarios, including perfect detections, mismatches, and empty predictions

