RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations
An offline fine-tuning framework with no extra real-robot interaction

Abstract

A One-Shot HIL Run, Amortised Across Many Workstations

Human-in-the-loop (HIL) reinforcement learning systems achieve near-perfect success on the workstation where they are trained, but collapse when the same robot is moved to a workstation a few metres away, because new lamp positions and window light shift the visual input distribution. Re-collecting demonstrations and re-running HIL on every workstation is incompatible with deployment, and naively fine-tuning on shifted-light data triggers catastrophic forgetting of the source workstation. To close this cross-domain gap, we present RoHIL, an offline fine-tuning framework that uses no extra real-robot interaction.

RoHIL combines (i) a world-model-based image relighter that re-synthesises the visual stream of source-workstation trajectories under multiple virtual HDRI environments, leaving actions and rewards real; (ii) Illumination-Retention Replay (IRR), a data-level anti-forgetting mechanism that interleaves relit adaptation transitions with original-light retention transitions to preserve source-workstation Bellman coverage; and (iii) an anchored Bellman–actor regulariser that constrains representation and policy drift from the original source-workstation policy. Across four real-robot manipulation tasks under significant cross-workstation illumination variations, RoHIL substantially improves shifted-light performance where standard HIL-RL collapses, while preserving source-workstation performance — eliminating the need to re-collect data and retrain for every new workstation and environment.

Motivation

HIL Policies Collapse When the Lights Change

A HIL-SERL policy trained at one workstation reaches near-perfect success there, but moving the same robot a few metres away, under different lamp positions and daylight, breaks it. The rollouts below show the source-workstation policy on four tasks after the illumination shift, with no other change. Pixel jitter cannot reproduce these structured changes, demo re-collection is incompatible with deployment, and naive fine-tuning destroys the source policy. RoHIL is designed to close this gap without any new robot interaction.

All rollouts on this page are played at 3× real-time speed.

(a) RAM Insertion.

Source policy mis-aligns under shifted illumination.

(b) USB Insertion.

Critic over-estimates Q on relit pixels; insertion fails.

(c) Circuit Breaker.

Visual encoder drifts; gripper misses the latch.

(d) Table Wiping.

Specular highlights confuse object detection.

Framework Overview

Three Stages: Relight, Retain, Anchor

Starting from an online-best HIL-SERL policy that degrades under lighting shifts, RoHIL runs an offline robust fine-tune in three coupled stages. The visual encoder is exposed to illumination-diverse evidence via world-model relighting, while replay-buffer balancing and frozen-source anchors jointly suppress source-domain forgetting at the data, representation, and policy levels.

RoHIL framework overview.

RoHIL overview. (a) A world-model relighter re-synthesises the visual stream of source trajectories under multiple HDRI environments while preserving actions and rewards. (b) Offline robust fine-tune with critic $L_{\text{critic}} = L_{\text{Bellman}} + L_{\text{feat}}$ and actor $L_{\text{actor}} = L_{\text{SAC}} + L_{\text{mse}}$, anchored on a frozen source policy. (c) Illumination-Retention Replay buffers interleave relit adaptation and original-light retention transitions to preserve source-workstation Bellman coverage.

Stage 1

World-Model Relighting

A diffusion-based HDRI-conditioned video relighter re-illuminates recorded RGB streams under four virtual lighting environments. Actions, rewards, and termination labels stay real; only the visual stream changes — giving illumination-diverse observations from a single source-workstation collection.
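The data flow of this stage can be sketched as follows. This is a minimal illustration, not the RoHIL implementation: the `Transition` fields, the `relight_dataset` helper, and the per-frame `gain` function are hypothetical names, and `gain` is only a toy stand-in for the diffusion-based video relighter. The point it shows is that each relit copy swaps the observations and keeps the recorded action, reward, and termination label untouched.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Sequence

Frame = List[List[float]]  # toy grayscale image: rows of intensities in [0, 1]


@dataclass(frozen=True)
class Transition:
    obs: Frame
    action: tuple
    reward: float
    done: bool
    next_obs: Frame
    light_id: str = "original"


def relight_dataset(
    transitions: Sequence[Transition],
    relighters: Dict[str, Callable[[Frame], Frame]],
) -> List[Transition]:
    """Return one relit copy of every transition per lighting environment.

    Only obs and next_obs are re-synthesised; action, reward, and done
    are copied verbatim from the real recording.
    """
    relit = []
    for name, relight in relighters.items():
        for t in transitions:
            relit.append(
                replace(t, obs=relight(t.obs), next_obs=relight(t.next_obs), light_id=name)
            )
    return relit


def gain(factor: float) -> Callable[[Frame], Frame]:
    """Toy stand-in for a relighter (assumption): scale intensities, clip to [0, 1]."""
    def apply(frame: Frame) -> Frame:
        return [[min(1.0, max(0.0, p * factor)) for p in row] for row in frame]
    return apply


source = [
    Transition(obs=[[0.2, 0.4]], action=(0.1, -0.3), reward=1.0, done=True,
               next_obs=[[0.3, 0.5]])
]
relit = relight_dataset(source, {"hdri_dim": gain(0.5), "hdri_bright": gain(1.5)})
```

With four lighting environments instead of the two toy ones here, a single source collection would yield four relit copies of every transition alongside the original.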

Stage 2

Illumination-Retention Replay

IRR keeps the RLPD 50/50 demo/RL split but introduces a retention coefficient α within the RL half, which mixes original-light retention transitions with relit adaptation transitions. This preserves Bellman coverage of the source workstation while supplying an illumination-adaptation signal. A sweep over α identifies α = 0.75 as the joint optimum.
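One way to picture the batch composition is the sketch below. It rests on an assumption the page does not spell out: that α is the share of the RL half drawn from original-light retention transitions, with the remaining 1 − α drawn from relit transitions. The function names, the batch size of 256, and the final shuffle are illustrative choices, not details taken from RoHIL.

```python
import random
from typing import Dict, List, Sequence


def irr_batch_counts(batch_size: int, alpha: float) -> Dict[str, int]:
    """Split a batch into demo, retention, and relit counts.

    Assumption: alpha is the fraction of the RL half that comes from
    original-light (retention) transitions.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_demo = batch_size // 2           # RLPD-style 50/50 demo/RL split
    n_rl = batch_size - n_demo
    n_retain = round(alpha * n_rl)     # original-light retention share
    n_relit = n_rl - n_retain          # relit adaptation share
    return {"demo": n_demo, "retain": n_retain, "relit": n_relit}


def sample_irr_batch(
    demo: Sequence, retain: Sequence, relit: Sequence,
    batch_size: int, alpha: float, rng: random.Random,
) -> List:
    """Draw one mixed batch, sampling each buffer with replacement."""
    counts = irr_batch_counts(batch_size, alpha)
    batch = (
        rng.choices(demo, k=counts["demo"])
        + rng.choices(retain, k=counts["retain"])
        + rng.choices(relit, k=counts["relit"])
    )
    rng.shuffle(batch)  # interleave the three sources within the batch
    return batch


counts = irr_batch_counts(256, 0.75)
rng = random.Random(0)
batch = sample_irr_batch(["demo"] * 10, ["retain"] * 10, ["relit"] * 10, 256, 0.75, rng)
```

Under this reading, a batch of 256 at α = 0.75 holds 128 demo, 96 retention, and 32 relit transitions. If α instead denotes the relit share, the last two counts swap.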

Stage 3

Anchored Bellman–Actor Regulariser

A frozen-source feature anchor ($L_{\text{feat}}$) constrains visual-encoder drift, while a reference-action mean anchor ($L_{\text{mse}}$) regularises actor pre-tanh means on expert states. Both anchors decay with training to allow late-stage plasticity.
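The two anchored objectives can be sketched on scalars and plain lists as below. Several things here are assumptions rather than details from the page: the exponential half-life schedule (the page only says the anchors decay), the use of a mean squared error for the feature anchor, the default weights, and all function names. The Bellman and SAC terms are passed in as precomputed numbers so the sketch stays focused on the anchors.

```python
from typing import Sequence


def anchor_weight(step: int, w0: float, half_life: float) -> float:
    """Anchor weight that halves every `half_life` steps (assumed schedule)."""
    return w0 * 0.5 ** (step / half_life)


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def critic_loss(bellman_loss: float, feat: Sequence[float],
                feat_frozen: Sequence[float], step: int,
                w0: float = 1.0, half_life: float = 10_000) -> float:
    """L_critic = L_Bellman + w(step) * L_feat.

    L_feat pulls the current encoder features toward those of the
    frozen source encoder on the same observation.
    """
    return bellman_loss + anchor_weight(step, w0, half_life) * mse(feat, feat_frozen)


def actor_loss(sac_loss: float, pre_tanh_mean: Sequence[float],
               ref_pre_tanh_mean: Sequence[float], step: int,
               w0: float = 1.0, half_life: float = 10_000) -> float:
    """L_actor = L_SAC + w(step) * L_mse.

    L_mse pulls the actor's pre-tanh mean toward a reference mean
    (assumed to come from the frozen source policy) on expert states.
    """
    return sac_loss + anchor_weight(step, w0, half_life) * mse(pre_tanh_mean, ref_pre_tanh_mean)


early = critic_loss(0.5, [1.0, 2.0], [1.0, 0.0], step=0)       # anchor at full weight
later = critic_loss(0.5, [1.0, 2.0], [1.0, 0.0], step=10_000)  # anchor weight halved
```

In this toy case the feature gap contributes 2.0 to the critic loss at step 0 and 1.0 after one half-life, so the same amount of drift is penalised less as training proceeds.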

Experimental Results

Robust to a Full Sweep of Illumination Intensity

We evaluate RoHIL against five HIL-compatible baselines (BC, ACT, HG-DAgger, HIL-SERL, IBRL) on four real-robot manipulation tasks under a controlled lighting-shift gradient. RoHIL achieves the highest success rate on every task at the headline 60% shift, and stays essentially flat across the full intensity sweep, whereas the baselines decay roughly monotonically.

Lighting-shift gradient on all four tasks

Inference success across lighting-shift gradient.

Lighting-shift sweep. Each column is one task; rows are success rate, intervention rate, and mean episode duration. Horizontal axis: shift intensity from 0% to 100%. RoHIL (dark blue) remains near saturation across the sweep, while HIL-SERL, HG-DAgger, and IBRL all degrade roughly monotonically with shift intensity.

Cross-method comparison at 60% illumination shift

Cross-method comparison table at 60% illumination shift.

Headline results at 60% shift (60 trajectories per cell). SR is success rate; T is mean successful-episode duration in seconds. Best per task in bold. RoHIL reaches the highest SR on every task (RAM 1.00, USB 1.00, wiping 0.87, breaker 0.83), while baselines fall below 0.45 on RAM and below 0.50 on the breaker.
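To gauge how much resolution 60 trajectories per cell give, a Wilson score interval is one standard option. The sketch below is not part of the reported analysis. The success counts are inferred from the rounded rates (1.00 as 60 of 60, 0.87 as 52 of 60), which is an assumption, and `wilson_interval` is an illustrative helper name.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Counts inferred from rounded success rates over 60 trajectories (assumption).
ram_low, ram_high = wilson_interval(60, 60)
wipe_low, wipe_high = wilson_interval(52, 60)
```

For a cell with 60 successes out of 60, the lower bound comes out at roughly 0.94, so a perfect score at this sample size is still compatible with a true success rate a few points below 1.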

Experimental Demos

RoHIL Rollouts Under Shifted Lighting

Below are real-robot rollouts of the RoHIL-fine-tuned policy on each task, executed under workstation lighting that the source policy fails on. The same offline framework is used across all four tasks; no per-task tuning is applied beyond the source-workstation HIL-SERL budgets reported in the paper.


(a) RAM Insertion. 100 consecutive successes

Memory-stick alignment under shifted illumination.

(b) USB Insertion. 100 consecutive successes

Sub-millimetre socket alignment under cross-workstation light.

(c) Circuit Breaker. 25 consecutive successes

Latch engagement under specular highlights.

(d) Table Wiping. 10 consecutive successes

Long-horizon wiping under daylight + spotlight shift.

Experimental Setup

Hardware, Cameras, and Lighting Conditions

All experiments run on a Franka Emika Panda arm with wrist-mounted RealSense cameras and per-task camera configurations. Cross-workstation illumination shift is realised by sweeping a discrete intensity gradient that combines task-light spotlight reconfigurations and daylight-direction changes.

Per-task camera and workspace layout

Per-task hardware setup with camera placement.

Hardware setup. For each of the four tasks, we annotate the wrist cameras (red), workspace lights (yellow), and side cameras when present. Camera baselines and light placement differ per task, but the same offline RoHIL pipeline applies without modification.

Illumination-shift gradient

Light-intensity gradient across the cross-workstation evaluation sweep.

Lighting-shift gradient. Discrete steps along the deployment-time illumination-shift sweep, including HDRI re-rendering, task-light spotlight reconfiguration, and natural-window-light variations. The 60% shift is used as the headline cross-workstation evaluation point in the cross-method comparison.