What You Don't Know Can Hurt You: How Well do Latent Safety Filters Understand Partially Observable Safety Constraints?
Code [coming soon]
arXiv
Abstract
Safe control techniques, such as Hamilton-Jacobi reachability, provide principled methods for synthesizing safety-preserving robot policies but typically assume hand-designed state spaces and full observability. Recent work has relaxed these assumptions via latent-space safe control, where state representations and dynamics are learned jointly through world models that reconstruct future high-dimensional observations (e.g., RGB images) from current observations and actions. This enables safety constraints that are difficult to specify analytically (e.g., spilling) to be framed as classification problems in latent space, allowing controllers to operate directly from raw observations. However, these methods assume that safety-critical features are observable in the learned latent state. We ask: when are latent state spaces sufficient for safe control? To study this, we examine temperature-based failures, comparable to overheating in cooking or manufacturing tasks, and find that RGB-only observations can produce myopic safety behaviors, e.g., avoiding seeing failure states rather than preventing failure itself. To predict such behaviors, we introduce a mutual information-based measure that identifies when observations fail to capture safety-relevant features. Finally, we propose a multimodal-supervised training strategy that shapes the latent state with additional sensory inputs during training, but requires no extra modalities at deployment, and validate our approach in simulation and on hardware with a Franka Research 3 manipulator preventing a pot of wax from overheating.
When Latent Safety Filters Miss What Matters

We design a series of controlled experiments to test how latent safety filters behave under partially observable constraints. Left: We find that safety filters that rely on RGB inputs behave unreliably when they must enforce constraints, such as temperature limits, that are not easily observable. Right: Training with rich, safety-relevant multimodal supervision shapes the latent state representation to enable safe control (e.g., lifting the pot before overheating), even when the robot is deployed with only RGB inputs at runtime.
Latent-space safety filters enable safe control directly from raw, high-dimensional sensor inputs (e.g., RGB images) by training a world model that jointly learns compact state representations and dynamics via observation reconstruction. Safety constraints are specified by training a classifier on the latent state.
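To make this concrete, below is a minimal sketch of a reconstruction-based world model with a failure classifier attached to the latent state. This is our own illustrative PyTorch construction; all module names and sizes are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a reconstruction-based world
# model with a safety classifier on the latent state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentSafetyWorldModel(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ELU(), nn.Linear(256, latent_dim))
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + act_dim, 256), nn.ELU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ELU(), nn.Linear(256, obs_dim))
        self.failure_head = nn.Linear(latent_dim, 1)  # safety constraint as a latent classifier

    def loss(self, obs, act, next_obs, failure_label):
        z = self.encoder(obs)                                # compact latent state
        z_next = self.dynamics(torch.cat([z, act], dim=-1))  # latent dynamics step
        recon = F.mse_loss(self.decoder(z_next), next_obs)   # observation reconstruction
        cls = F.binary_cross_entropy_with_logits(
            self.failure_head(z).squeeze(-1), failure_label) # latent safety classifier
        return recon + cls
```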

Once trained, latent safety filters can be computed by solving a latent Hamilton-Jacobi fixed-point equation that co-optimizes both a safety monitor and a safety-preserving fallback controller.
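Concretely, one standard discounted form of such a fixed point, written in our own notation (the paper's exact formulation may differ), is:

```latex
% \ell(z): learned safety margin, negative on latents classified as failures
% f_\theta: learned latent dynamics, \gamma: discount factor
V(z) = (1-\gamma)\,\ell(z) + \gamma \min\Big\{ \ell(z),\ \max_{a} V\big(f_\theta(z, a)\big) \Big\}
```

The value V acts as the safety monitor (intervene when V drops below a threshold), and the maximizing action defines the safety-preserving fallback controller.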

However, prior work assumed that all safety-critical information is observable from RGB sensor inputs. This assumption breaks down in many real-world tasks, such as cooking or welding, where safety often depends on hidden or only indirectly observable variables like temperature. When these unobserved factors drive unsafe outcomes, controllers trained on RGB-only data can appear safe while failing to prevent danger.
To study this challenge concretely, we focus on temperature as a representative partially observable safety variable. By examining heating tasks where RGB cameras cannot fully perceive temperature changes, we show how latent safety filters behave when the true risk is only partially observable, and how multimodal supervision applied only at training time can recover reliable safety behavior.
Experiment Testbeds
We evaluate latent safety filters in both simulation and hardware environments designed to reveal how partial observability impacts safe behavior.

In simulation (left), we introduce the thermal unicycle, a 4-D unicycle model augmented with a latent heat variable that increases as the agent approaches a heat source. The agent receives either RGB or infrared (IR) images and must prevent overheating. This setup is intentionally simple and controllable, allowing us to isolate how safety filters behave when safety-relevant features are only partially observable.
On hardware (right), we use a Franka Research 3 manipulator heating a pot of wax. We collect both RGB and infrared (IR) observations during data collection, where the IR modality provides privileged information about temperature that is not observable from RGB images. This setup allows us to study how the presence (or absence) of safety-critical information during training and deployment affects the downstream performance of latent safety filters.
Mutual Information as a Measure of Observability
In latent state spaces, traditional notions of observability may not apply. Instead, we use the mutual information (MI) between high-dimensional sensor observations and safety outcomes to measure the observability of safety-relevant quantities. This metric quantifies the degree to which uncertainty over safety outcomes is reduced by observing a particular input modality (e.g., RGB or infrared). We compute a Barber-Agakov lower bound on the MI between observations and binary safety labels to measure how well each modality captures safety-relevant features. Higher MI indicates that the modality more reliably encodes the features necessary for safety prediction.
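For binary safety labels, this bound is simply the empirical label entropy plus the classifier's average log-likelihood. A minimal sketch, with our own (assumed) function and variable names:

```python
# Barber-Agakov lower bound on I(O; Y) for binary safety labels, using a
# trained classifier q(y|o) as the variational distribution. Illustrative
# sketch, not the paper's code.
import numpy as np

def ba_mi_lower_bound(q_probs, labels, eps=1e-8):
    """q_probs: classifier outputs q(y=1|o); labels: binary safety outcomes."""
    y = np.asarray(labels, dtype=float)
    q = np.clip(np.asarray(q_probs, dtype=float), eps, 1.0 - eps)
    p1 = np.clip(y.mean(), eps, 1.0 - eps)                    # empirical P(y=1)
    h_y = -(p1 * np.log(p1) + (1.0 - p1) * np.log(1.0 - p1))  # label entropy H(Y), in nats
    log_q = y * np.log(q) + (1.0 - y) * np.log(1.0 - q)       # log q(y|o) on joint samples
    mi_bound = h_y + log_q.mean()      # I(O;Y) >= H(Y) + E[log q(y|o)]
    return mi_bound, mi_bound / h_y    # raw bound and entropy-normalized version
```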

We report MI normalized by the empirical entropy of the safety labels (normalized MI between 0 and 1). In both simulation, where RGB is insufficient for identifying failure by design, and on hardware, we find that IR observations exhibit much higher normalized MI than RGB alone, suggesting that RGB data lacks sufficient safety information in these settings. Furthermore, we find that the MI metric is a better indicator of degenerate latent states than traditional classification-based metrics such as accuracy and balanced accuracy.
Examining Latent Representation Quality
We evaluate the quality of the learned latent representations by examining how well they encode safety-relevant state information. Models trained only on RGB observations often produce latent states that fail to represent temperature, leading to visually correct but unsafe predictions. In contrast, our multimodal approach learns latent states that embed the underlying thermal dynamics, enabling proactive interventions that maintain safety.

To quantify latent representation quality, we introduce two diagnostic tests. The latent state test measures how much safety-relevant information (e.g., heat) is directly encoded in the learned latent state, while the latent dynamics test evaluates whether the world model's open-loop predictions capture how safety outcomes evolve over time. Together, these tests reveal whether the learned latent space both contains and maintains the safety features needed for effective safe control.
Latent State Test
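A minimal version of this test can be implemented as a linear probe on frozen latents; the sketch below is our own construction and assumes ground-truth heat values were logged during data collection:

```python
# Latent state test sketch: regress ground-truth heat from frozen latent
# states with a linear probe. Low test R^2 suggests the latent never
# encoded the safety-relevant quantity. (Illustrative, not the paper's code.)
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def latent_state_test(z_train, heat_train, z_test, heat_test):
    probe = Ridge(alpha=1.0).fit(z_train, heat_train)  # latents are frozen; only the probe trains
    return r2_score(heat_test, probe.predict(z_test))
```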

Latent Dynamics Test
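A corresponding sketch of the dynamics test, again our own construction, reusing the (assumed) world-model sketch from above:

```python
# Latent dynamics test sketch: roll the learned dynamics forward open-loop
# and check whether predicted failure labels track ground-truth safety
# outcomes over the horizon. All names are illustrative assumptions.
import torch

def latent_dynamics_test(model, z0, actions, true_failures):
    z, n_correct = z0, 0
    for a, y in zip(actions, true_failures):
        z = model.dynamics(torch.cat([z, a], dim=-1))      # open-loop latent rollout
        pred = torch.sigmoid(model.failure_head(z)) > 0.5  # predicted failure at this step
        n_correct += int(pred.item() == bool(y))
    return n_correct / len(actions)  # per-step safety-prediction accuracy
```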

The latent state and latent dynamics test results align with our previous MI-based metric: the latent features degrade when safety-critical features are not directly observable.
Simulation Experiments
We first evaluate safety filter behavior in the thermal unicycle simulation. The agent must navigate while avoiding overheating caused by proximity to a heat source. We compare a safety filter trained with only RGB inputs to one trained with multimodal supervision using infrared (IR) data.
RGB-Only Safety Filter
The agent avoids visually unsafe states but fails to prevent actual overheating.
Multimodal Safety Filter
The agent predicts temperature rise early and moves out of the hot region before failure occurs.
Since temperature changes are only visible in the RGB input after the agent leaves the red hot region, RGB-only latent safety filters exhibit myopic behavior that prevents seeing failure rather than preventing failure itself.
Hardware Experiments
To validate our findings in the real world, we also deploy RGB-only and multimodal latent safety filters on a Franka Research 3 manipulator heating a pot of wax. The safety objective is to prevent overheating by lifting the pot before its temperature exceeds a threshold.
RGB-Only Safety Filter
The RGB-only safety filter fails to provide safe actions, causing the pot of wax to overheat.
Multimodal Safety Filter
The multimodal safety filter anticipates overheating and lifts the pot of wax early, maintaining safety.
Multimodal-Supervised RGB Safety Filter
While the previous results confirm that directly observing safety-relevant features improves safe control, it may not be feasible to deploy robots with a full suite of sensors at scale.

We propose a world model training strategy that shapes the latent representation by reconstructing privileged IR data while using only RGB data as an input modality. This forces the latent representation to encode safety-relevant quantities and to estimate them from RGB data at runtime.
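A sketch of this objective, assuming the world-model components from the earlier sketch plus separate RGB and IR decoder heads (our names, not the paper's):

```python
# Multimodal-supervised loss sketch: the encoder consumes only RGB, but the
# loss also reconstructs privileged IR, forcing the latent to carry thermal
# information inferable from RGB. Names (rgb_decoder, ir_decoder) are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def multimodal_supervised_loss(model, rgb, action, next_rgb, next_ir):
    z = model.encoder(rgb)                                   # RGB is the only input modality
    z_next = model.dynamics(torch.cat([z, action], dim=-1))
    loss_rgb = F.mse_loss(model.rgb_decoder(z_next), next_rgb)
    loss_ir = F.mse_loss(model.ir_decoder(z_next), next_ir)  # privileged supervision
    return loss_rgb + loss_ir
```

At deployment, the IR decoder is simply dropped, so the runtime sensor suite is unchanged.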
The multimodal-supervised safety filter anticipates overheating and lifts the pot of wax before failure. Trained with RGB + IR data but deployed using only RGB, the controller maintains safety even under partial observability.
BibTeX
@misc{kim2025dontknowhurtyou,
  title={What You Don't Know Can Hurt You: How Well do Latent Safety Filters Understand Partially Observable Safety Constraints?},
  author={Matthew Kim and Kensuke Nakamura and Andrea Bajcsy},
  year={2025},
  eprint={2510.06492},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2510.06492}
}