
Fix MeanAverageRecall: compute mAR@K using top-K detections per image (COCO-compliant)#2136

Open
stop1one wants to merge 8 commits into roboflow:develop from stop1one:fix/mAR-at-K-per-image

Conversation


stop1one commented Feb 4, 2026

Supersedes #1967

Description

This PR fixes the calculation of mAR@K in MeanAverageRecall to comply with the COCO evaluation protocol.
Previously, the implementation selected the top-K predictions globally across all images, rather than per image.
According to the COCO evaluation protocol, mAR@K should be calculated by considering the top-K highest-confidence detections for each image.

This issue is tracked in #1966.

To resolve this, I modified the _compute and _compute_average_recall_for_classes functions to filter each image's statistics down to its top-K highest-confidence predictions before concatenating them and computing the confusion matrix.
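For illustration, here is a minimal sketch of the difference (hypothetical helper names, not the library's internal code), contrasting the COCO-compliant per-image selection with the previous global selection:

```python
import numpy as np

# Hypothetical sketch, not the library's internal code.

# COCO-compliant: keep the top-K most confident detections in EACH image.
def top_k_per_image(confidences_per_image: list[np.ndarray], k: int) -> list[np.ndarray]:
    return [np.sort(conf)[::-1][:k] for conf in confidences_per_image]

# Previous (buggy) behaviour: rank all detections jointly and keep the
# K most confident across the whole dataset.
def top_k_global(confidences_per_image: list[np.ndarray], k: int) -> np.ndarray:
    return np.sort(np.concatenate(confidences_per_image))[::-1][:k]

confs = [np.array([0.9, 0.8, 0.3]), np.array([0.7, 0.6, 0.5])]
print(top_k_per_image(confs, 2))  # [0.9, 0.8] and [0.7, 0.6]: each image keeps 2
print(top_k_global(confs, 2))     # [0.9, 0.8]: the second image loses everything
```

Under global selection, images with confident predictions crowd out the rest of the dataset, which deflates recall for every other image.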

No new dependencies are required for this change.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this change been tested? Please provide a test case or example of how you tested the change.

I tested the change by running the metric on a dataset with varying numbers of predictions per image and verified that, for each image, only the top-K predictions (by confidence) were used in the mAR@K calculation.

Any specific deployment considerations

No special deployment considerations are required.

Docs

  • Docs updated? What were the changes: N/A


stop1one commented Feb 4, 2026

@Borda I accidentally closed the previous PR (#1967).
I will continue the work here, focusing on fixing the mAR@K calculation issue.

Regarding the implementation, please check my last question there.
I'll start fixing this issue as soon as you provide your feedback.


codecov bot commented Feb 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72%. Comparing base (cd0aac2) to head (dc6edc3).

❌ Your project check has failed because the head coverage (72%) is below the target coverage (95%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff           @@
##           develop   #2136   +/-   ##
=======================================
  Coverage       72%     72%           
=======================================
  Files           61      61           
  Lines         7249    7246    -3     
=======================================
+ Hits          5246    5249    +3     
+ Misses        2003    1997    -6     

stop1one marked this pull request as ready for review February 5, 2026 01:15
stop1one requested a review from SkalskiP as a code owner February 5, 2026 01:15

stop1one commented Feb 5, 2026

I've finished the implementation and resolved all conflicts.

I've added a simple unit test with synthetic data to validate the mAR@K calculation.

Test setup:

  • 15 images in total.
  • Each image has ≤ 5 bounding boxes/detections.
  • Therefore, mAR@10 and mAR@100 should be identical, since $K=10$ already exceeds the maximum number of detections per image (a sketch of this test follows below).
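A minimal sketch of the shape of this test (the data setup is abbreviated, the import path and result attribute names are assumptions, and this is not the exact test added in the PR):

```python
import numpy as np
import supervision as sv
from supervision.metrics import MeanAverageRecall  # assumed import path

# Two toy images with at most 2 detections each; the real test uses 15
# images with up to 5 boxes. Since K=10 already exceeds the per-image
# detection count, mAR@10 and mAR@100 must coincide.
targets = [
    sv.Detections(xyxy=np.array([[0.0, 0.0, 10.0, 10.0]]), class_id=np.array([0])),
    sv.Detections(xyxy=np.array([[5.0, 5.0, 15.0, 15.0]]), class_id=np.array([0])),
]
predictions = [
    sv.Detections(
        xyxy=np.array([[0.0, 0.0, 10.0, 10.0], [40.0, 40.0, 50.0, 50.0]]),
        class_id=np.array([0, 0]),
        confidence=np.array([0.9, 0.4]),
    ),
    sv.Detections(
        xyxy=np.array([[6.0, 6.0, 14.0, 14.0]]),
        class_id=np.array([0]),
        confidence=np.array([0.7]),
    ),
]

metric = MeanAverageRecall()
for pred, tgt in zip(predictions, targets):
    metric.update(pred, tgt)
result = metric.compute()

# Attribute names on the result object are assumed for illustration:
np.testing.assert_almost_equal(result.mAR_at_10, result.mAR_at_100, decimal=5)
```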

Result with the original (buggy) implementation:

E           AssertionError: 
E           Arrays are not almost equal to 5 decimals
E           
E           Mismatched elements: 2 / 3 (66.7%)
E           Max absolute difference among violations: 0.23173375
E           Max relative difference among violations: 0.80613893
E            ACTUAL: array([0.05573, 0.52786, 0.63622])
E            DESIRED: array([0.28746, 0.63622, 0.63622])

As shown above, mAR@10 (0.52786) ≠ mAR@100 (0.63622), which is incorrect: the original code was applying top-K filtering across the whole dataset rather than per image. The fix in this PR corrects that behaviour.

Borda requested a review from Copilot February 5, 2026 09:10
Borda added the bug label Feb 5, 2026

Copilot AI left a comment


Pull request overview

This PR fixes a critical bug in the MeanAverageRecall metric calculation to comply with the COCO evaluation protocol. Previously, the implementation selected the top-K predictions globally across all images, rather than per image as specified by COCO. This fix ensures that mAR@K is calculated by considering the top-K highest-confidence detections for each image independently.

Changes:

  • Modified the _compute method to track prediction positions within each image using indices instead of confidence scores
  • Updated _compute_average_recall_for_classes to filter predictions by per-image rank before computing the confusion matrix (see the sketch after this list)
  • Removed the max_detections parameter from _compute_confusion_matrix since filtering now happens upstream
  • Added comprehensive integration tests with 15 test images covering various scenarios
  • Fixed a duplicate error handling statement (code cleanup)
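A minimal sketch of the per-image rank idea from the first two bullets above (hypothetical names; the actual change lives in mean_average_recall.py):

```python
import numpy as np

# Hypothetical illustration: record each prediction's rank within its own
# image; mAR@K filtering then reduces to keeping rows with rank < K,
# applied BEFORE the per-image statistics are concatenated.
def per_image_ranks(confidences: np.ndarray) -> np.ndarray:
    order = np.argsort(-confidences)      # indices sorted by descending confidence
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(order))  # rank 0 = most confident in this image
    return ranks

image_confs = [np.array([0.9, 0.4, 0.7]), np.array([0.8, 0.6])]
k = 2
kept = [c[per_image_ranks(c) < k] for c in image_confs]  # per-image top-K
stats = np.concatenate(kept)  # concatenate only after filtering
print(stats)  # [0.9 0.7 0.8 0.6]
```

Tracking ranks rather than raw confidence values keeps the K cutoff exact per image, even when confidences tie across images.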

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File descriptions:

  • src/supervision/metrics/mean_average_recall.py: core logic fix, changed from global top-K to per-image top-K by tracking prediction indices within each image and filtering before confusion matrix computation
  • tests/metrics/test_mean_average_recall.py: added a comprehensive test suite with 15 test images covering various detection scenarios, including perfect detections, mismatches, and empty predictions

