|
|
|
|
|
|
Robot decision-making increasingly relies on data-driven human prediction models when operating around people. While these models are known to mispredict in out-of-distribution interactions, only a subset of prediction errors impact downstream robot performance. We propose characterizing such system-level prediction failures via the mathematical notion of regret: high-regret interactions are precisely those in which mispredictions degraded closed-loop robot performance. We further introduce a probabilistic generalization of regret that calibrates failure detection across disparate deployment contexts and renders regret compatible with reward-based and reward-free (e.g., generative) planners. In simulated autonomous driving interactions, we showcase that our system-level failure metric can be used offline to automatically extract closed-loop human-robot interactions that state-of-the-art generative human predictors and robot planners previously struggled with. We further find that the very presence of high-regret data during human predictor fine-tuning is highly predictive of robot re-deployment performance improvements. Furthermore, fine-tuning with the informative but significantly smaller high-regret data (23% of deployment data) is competitive with fine-tuning on the full deployment dataset, indicating a promising avenue for efficiently mitigating system-level human-robot interaction failures. |
We formalize system-level prediction failures via the mathematical notion of regret. Canonically, regret measures the reward difference between the optimal action the robot could have taken in hindsight and the action it actually executed under uncertainty. However, relying on reward functions to compute regret is inherently incompatible with generative planners that are reward-free, and the resulting values may be miscalibrated between disparate contexts with different reward scales. Consider the two simulated interactions above. Although the robot's actions were suboptimal in both scenarios, canonical regret cannot identify that driving toward a stopped truck (top) is far more severely suboptimal than an unnecessarily aggressive overtake (bottom), and assigns the scenarios similar regrets of 11.4 and 11.7, respectively. To remedy this, we derive a generalized regret metric that evaluates the quality of a decision in likelihood space rather than by its absolute reward. This probabilistic interpretation normalizes the quality of a decision relative to the context-dependent behavior distribution and provides a principled way to place all decision comparisons on the same scale: a value between zero and one. Furthermore, the mapping to probabilities renders the metric compatible with reward-free generative planners. Referring back to the figure above, our generalized metric can disambiguate between the two contexts, assigning the top scenario a regret of 0.56 and the bottom scenario a regret of 0.34.
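As a concrete (and deliberately simplified) sketch of the idea, the snippet below contrasts canonical regret with one plausible quantile-style instantiation of the generalized metric: the executed decision is scored against sampled alternatives, and regret is the fraction of alternatives that would have done better in hindsight. The `score_fn` argument is an assumption standing in for either a reward function or a generative planner's log-likelihood; the paper's exact derivation may differ.

```python
import numpy as np

def canonical_regret(reward_fn, executed_plan, candidate_plans, human_future):
    """Reward gap between the best candidate in hindsight and the executed plan."""
    hindsight = np.array([reward_fn(p, human_future) for p in candidate_plans])
    return hindsight.max() - reward_fn(executed_plan, human_future)

def generalized_regret(score_fn, executed_plan, candidate_plans, human_future):
    """Quantile-style regret in [0, 1]: the fraction of candidate decisions that
    score better in hindsight than the decision the robot actually executed.
    score_fn may be a reward or a log-likelihood, so no reward scale is needed."""
    scores = np.array([score_fn(p, human_future) for p in candidate_plans])
    executed = score_fn(executed_plan, human_future)
    return float(np.mean(scores > executed))
```

Under this reading, a value near one means almost every alternative would have served the robot better in hindsight (the stopped-truck scenario), while a value near zero means the executed decision was already close to the best available, regardless of the absolute reward scale of the context.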
We showcase the capability of our method on hardware with an Interbotix LoCoBot, using a reward-free generative planner (details in the paper) that takes the human's heading and a goal location as input. The policy, trained in simulation, outputs a joint human-robot trajectory that serves as both a prediction and a plan. We deploy the policy in the real world and use onboard sensing (LiDAR and RGB-D) to record the robot's and human's states during the interaction.
|
|
|
We also investigate our generalized regret metric in the autonomous driving setting, using closed-loop simulation to obtain reactive human behaviors. The robot's prediction module is an ego-conditioned Agentformer model trained on NuScenes, and the planner is a cost-based MPC with hand-tuned rewards that takes predictions as input and outputs robot actions. The robot was deployed in 100 closed-loop interactions on scenarios drawn from NuScenes.
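To make the planning side concrete, here is a minimal numpy sketch of a cost-based MPC step in the spirit described above; the cost terms, weights, and safety radius are illustrative assumptions rather than the planner's actual hand-tuned reward.

```python
import numpy as np

def mpc_step(candidate_trajs, predicted_humans, goal,
             w_goal=1.0, w_coll=10.0, safety_radius=2.0):
    """Pick the lowest-cost robot trajectory given predicted human trajectories.

    candidate_trajs:  (K, T, 2) candidate robot xy trajectories
    predicted_humans: (N, T, 2) predicted xy trajectories of surrounding agents
    goal:             (2,) target xy position
    """
    # Progress cost: distance from each candidate's final state to the goal.
    goal_cost = np.linalg.norm(candidate_trajs[:, -1] - goal, axis=-1)

    # Collision cost: penalize entering the safety radius of any predicted agent.
    # dists has shape (K, N, T): candidate k vs. predicted agent n at time t.
    dists = np.linalg.norm(candidate_trajs[:, None] - predicted_humans[None], axis=-1)
    coll_cost = np.maximum(safety_radius - dists, 0.0).sum(axis=(1, 2))

    costs = w_goal * goal_cost + w_coll * coll_cost
    return candidate_trajs[np.argmin(costs)]
```

Because the robot's executed behavior depends on the predictions only through this cost, a misprediction matters only when it changes which candidate the planner selects, which is exactly the system-level effect our regret metric is designed to capture.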
We compare the scenes deemed failures by our generalized regret metric (GRM) against three baselines: canonical regret (RM), ADE, a component-level metric, and Task-relevant Failure Detection from Farid et al. (TRFD), an open-loop system-level metric. For GRM, RM, and ADE, we take the scenarios in the top p = 20% quantile of each metric as the identified failures; TRFD does not output a score but rather a binary label of prediction failure.
We find that GRM and RM overlap on 95% of the scenarios identified as failures. However, the scenario uniquely identified by GRM induces a safety-critical failure, whereas the scenario uniquely identified by RM is merely inefficient. This illustrates that the mapping to probability space indeed calibrates decisions across disparate contexts. The ADE metric identifies 35% of the same scenarios as the two regret metrics. One scenario uniquely identified by ADE is shown above, where the prediction error is dominated by mispredictions of a vehicle in another lane that cannot affect the robot's performance, supporting our claim that component-level metrics are not sufficient for evaluating system-level performance. TRFD labels every interaction as anomalous, which we hypothesize is due to the distribution shift between real driving logs and simulated driving behavior. Overall, we find that both regret metrics consistently identify system-level prediction failures, with our generalized regret metric being less sensitive to deployment context.
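The failure-extraction step itself reduces to simple thresholding on per-scenario scores. The sketch below shows the assumed procedure; the `load_scores` helper is hypothetical.

```python
import numpy as np

def top_quantile_failures(scores, p=0.2):
    """Indices of scenarios whose score falls in the top-p quantile."""
    scores = np.asarray(scores)
    threshold = np.quantile(scores, 1.0 - p)
    return set(np.flatnonzero(scores >= threshold))

def overlap(flagged_a, flagged_b):
    """Fraction of flagged_a's failures that are also flagged by flagged_b."""
    return len(flagged_a & flagged_b) / max(len(flagged_a), 1)

# grm, rm, ade = load_scores(...)          # hypothetical: one score per deployed scenario
# failures_grm = top_quantile_failures(grm)
# failures_rm = top_quantile_failures(rm)
# overlap(failures_grm, failures_rm)       # ~0.95 in our experiments
```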
We also showcase one potential use of the data extracted by our generalized regret metric: fine-tuning human trajectory predictors.
Different subsets of 17 interactions from the deployment data were used to fine-tune the Agentformer model, which was then redeployed in closed-loop simulation.
Qualitatively, as the proportion of high-regret (orange) data in the fine-tuning dataset increases, the robot's behavior improves (example above). Quantitatively, in terms of reductions in collision cost and regret, fine-tuning on high-regret data outperforms fine-tuning on low-regret (grey) and random subsets of the deployment data, and is competitive with fine-tuning on the entire deployment dataset despite using 77% less data.
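For reference, the subset construction described above can be sketched as follows; the sampling procedure and argument names are illustrative assumptions, with only the subset size (17 interactions) and the top-p = 20% split taken from the experiments above.

```python
import numpy as np

def build_finetune_subset(scenario_ids, regrets, n_subset=17,
                          frac_high_regret=1.0, p=0.2, seed=0):
    """Assemble a fine-tuning subset in which `frac_high_regret` of the examples
    come from the high-regret (top-p quantile) pool and the rest from the
    low-regret pool. Names and sampling scheme are illustrative."""
    rng = np.random.default_rng(seed)
    ids, regrets = np.asarray(scenario_ids), np.asarray(regrets)

    threshold = np.quantile(regrets, 1.0 - p)
    high_pool, low_pool = ids[regrets >= threshold], ids[regrets < threshold]

    n_high = min(int(round(frac_high_regret * n_subset)), len(high_pool))
    chosen_high = rng.choice(high_pool, size=n_high, replace=False)
    chosen_low = rng.choice(low_pool, size=n_subset - n_high, replace=False)
    return np.concatenate([chosen_high, chosen_low])
```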
Acknowledgements |