Human-in-the-loop reinforcement learning (HIL-RL) systems achieve near-perfect success on the workstation where they are trained, but collapse when the same robot is moved to a workstation a few metres away, because new lamp positions and window light shift the visual input distribution. Re-collecting demonstrations and re-running HIL on every workstation is incompatible with deployment, and naively fine-tuning on shifted-light data triggers catastrophic forgetting of the source workstation. To close this cross-domain gap, we present RoHIL, an offline fine-tuning framework that uses no extra real-robot interaction.
RoHIL combines (i) a world-model-based image relighter that re-synthesises the visual stream of source-workstation trajectories under multiple virtual HDRI environments, leaving actions and rewards real; (ii) Illumination-Retention Replay (IRR), a data-level anti-forgetting mechanism that interleaves relit adaptation transitions with original-light retention transitions to preserve source-workstation Bellman coverage; and (iii) an anchored Bellman–actor regulariser that constrains representation and policy drift from the original source-workstation policy. Across four real-robot manipulation tasks under significant cross-workstation illumination variations, RoHIL substantially improves shifted-light performance where standard HIL-RL collapses, while preserving source-workstation performance — eliminating the need to re-collect data and retrain for every new workstation and environment.
A HIL-SERL policy trained at one workstation reaches near-perfect success there, but moving the same robot a few metres away — under different lamp positions and daylight — breaks it. Below: rollouts of the source-workstation policy on four tasks after illumination shift, with no other change. Pixel jitter cannot reproduce these structured changes, demo re-collection is incompatible with deployment, and naive fine-tuning destroys the source policy. RoHIL is designed to close this gap without any new robot interaction.
All rollouts on this page are played at 3× real-time speed.
Source policy mis-aligns under shifted illumination.
Critic over-estimates Q on relit pixels; insertion fails.
Visual encoder drifts; gripper misses the latch.
Specular highlights confuse object detection.
Starting from an online-best HIL-SERL policy that degrades under lighting shifts, RoHIL runs an offline robust fine-tune in three coupled stages. The visual encoder is exposed to illumination-diverse evidence via world-model relighting, while replay-buffer balancing and frozen-source anchors jointly suppress source-domain forgetting at the data, representation, and policy levels.
RoHIL overview. (a) A world-model relighter re-synthesises the visual stream of source trajectories under multiple HDRI environments while preserving actions and rewards. (b) Offline robust fine-tune with critic loss L_critic = L_Bellman + L_feat and actor loss L_actor = L_SAC + L_mse, anchored on a frozen source policy. (c) Illumination-Retention Replay buffers interleave relit adaptation and original-light retention transitions to preserve source-workstation Bellman coverage.
A diffusion-based HDRI-conditioned video relighter re-illuminates recorded RGB streams under four virtual lighting environments. Actions, rewards, and termination labels stay real; only the visual stream changes — giving illumination-diverse observations from a single source-workstation collection.
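The data flow described above (swap the pixels, keep everything else real) can be illustrated with a minimal sketch. The `Transition` layout, the `relight_dataset` helper, the HDRI names, and the `toy_relighter` (a simple brightness gain standing in for the diffusion-based relighter) are all hypothetical placeholders for illustration, not the actual RoHIL implementation.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]  # toy grayscale image with pixel values in [0, 1]


@dataclass(frozen=True)
class Transition:
    """Hypothetical transition record; field names are assumptions."""
    obs: Image
    action: Tuple[float, ...]
    reward: float
    next_obs: Image
    done: bool
    light_id: str = "original"


def relight_dataset(
    transitions: Sequence[Transition],
    relighter: Callable[[Image, str], Image],
    hdri_ids: Sequence[str],
) -> List[Transition]:
    """Return one relit copy of every transition per HDRI environment.

    Only obs / next_obs are re-synthesised; action, reward and done are
    copied unchanged from the real recording.
    """
    relit = []
    for hdri in hdri_ids:
        for t in transitions:
            relit.append(
                replace(
                    t,
                    obs=relighter(t.obs, hdri),
                    next_obs=relighter(t.next_obs, hdri),
                    light_id=hdri,
                )
            )
    return relit


# Stand-in for the learned relighter: a per-environment brightness gain.
# The four names and gains are made up for this example.
TOY_GAINS = {"hdri_0": 0.5, "hdri_1": 0.75, "hdri_2": 1.25, "hdri_3": 1.5}


def toy_relighter(img: Image, hdri_id: str) -> Image:
    gain = TOY_GAINS[hdri_id]
    return [[min(1.0, px * gain) for px in row] for row in img]
```

With four HDRI environments, each recorded transition yields four relit copies that share the original action, reward, and termination label, so the relit data can be added to a replay buffer alongside the original-light data without any new robot interaction.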
IRR keeps the RLPD 50/50 demo/RL split but adds a retention coefficient α: the RL half mixes original-light and relit transitions, preserving Bellman coverage of the source workstation while supplying an illumination-adaptation signal. A sweep over α identifies α = 0.75 as the joint optimum.
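A minimal sketch of how such a batch could be composed is given below. It assumes that α is the fraction of the RL half drawn from original-light retention transitions, with the remaining 1 − α drawn from relit transitions; the text above does not state which side α weights, so this direction is an assumption. The function names and buffer layout are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def irr_batch_sizes(batch_size: int, alpha: float) -> Tuple[int, int, int]:
    """Split a batch into (demo, original-light retention, relit) counts.

    Half the batch comes from demonstrations (the RLPD 50/50 split).
    ASSUMPTION: alpha is the share of the RL half taken from
    original-light transitions; the rest of the RL half is relit.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_demo = batch_size // 2
    n_rl = batch_size - n_demo
    n_retain = round(alpha * n_rl)
    n_relit = n_rl - n_retain
    return n_demo, n_retain, n_relit


def sample_irr_batch(
    demo_buf: Sequence,
    retain_buf: Sequence,
    relit_buf: Sequence,
    batch_size: int,
    alpha: float,
    rng: random.Random,
) -> List:
    """Sample with replacement from the three buffers and shuffle."""
    n_demo, n_retain, n_relit = irr_batch_sizes(batch_size, alpha)
    batch = (
        rng.choices(demo_buf, k=n_demo)
        + rng.choices(retain_buf, k=n_retain)
        + rng.choices(relit_buf, k=n_relit)
    )
    rng.shuffle(batch)
    return batch
```

Under this assumption, a batch of 256 with α = 0.75 contains 128 demonstration transitions, 96 original-light transitions, and 32 relit transitions. Fixing the counts per batch rather than sampling each transition's source independently keeps the mixture exact in every gradient step.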
A frozen-source feature anchor (L_feat) constrains visual-encoder drift, while a reference-action mean anchor (L_mse) regularises actor pre-tanh means on expert states. Both anchors decay with training to allow late-stage plasticity.
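One way to picture the two anchors is as decaying squared-error penalties added to the Bellman and SAC objectives. The sketch below is illustrative only: the mean-squared-error form of the feature anchor, the exponential half-life schedule, and every default value are assumptions, since the text above says only that the anchors decay with training.

```python
from typing import Sequence


def anchor_weight(step: int, w0: float, half_life: float) -> float:
    """Exponentially decaying anchor weight (ASSUMED schedule)."""
    return w0 * 0.5 ** (step / half_life)


def mean_squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def critic_loss(
    bellman_loss: float,
    feat: Sequence[float],
    feat_frozen: Sequence[float],
    step: int,
    w0: float = 1.0,
    half_life: float = 10_000.0,
) -> float:
    """Bellman loss plus a feature anchor.

    The anchor penalises the distance between the current encoder
    features and those of the frozen source encoder on the same input.
    """
    l_feat = mean_squared_error(feat, feat_frozen)
    return bellman_loss + anchor_weight(step, w0, half_life) * l_feat


def actor_loss(
    sac_loss: float,
    pre_tanh_mean: Sequence[float],
    ref_pre_tanh_mean: Sequence[float],
    step: int,
    w0: float = 1.0,
    half_life: float = 10_000.0,
) -> float:
    """SAC actor loss plus a reference-action mean anchor.

    The anchor penalises the distance between the actor's pre-tanh mean
    and the frozen source policy's pre-tanh mean on an expert state.
    """
    l_mse = mean_squared_error(pre_tanh_mean, ref_pre_tanh_mean)
    return sac_loss + anchor_weight(step, w0, half_life) * l_mse
```

Early in fine-tuning the anchor terms dominate whenever the encoder or actor moves away from the frozen source, which limits forgetting; as the weight decays, the Bellman and SAC terms take over and the policy is free to adapt further.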
We evaluate RoHIL against five HIL-compatible baselines (BC, ACT, HG-DAgger, HIL-SERL, IBRL) on four real-robot manipulation tasks under a controlled lighting-shift gradient. RoHIL achieves the highest success rate on every task at the headline 60% shift, and stays essentially flat across the full intensity sweep, whereas the baselines decay roughly monotonically.
Lighting-shift sweep. Each column is one task; rows are success rate, intervention rate, and mean episode duration. Horizontal axis: shift intensity from 0% to 100%. RoHIL (dark blue) remains near saturation across the sweep, while HIL-SERL, HG-DAgger, and IBRL all degrade roughly monotonically with shift intensity.
Headline at 60% shift (60 trajectories per cell). SR is success rate; T is mean successful-episode duration in seconds. Best per task in bold. RoHIL reaches the highest SR on every task (RAM 1.00, USB 1.00, wiping 0.87, breaker 0.83), while baselines fall below 0.45 on RAM and below 0.5 on the breaker.
Below are real-robot rollouts of the RoHIL-fine-tuned policy on each task, executed under workstation lighting that the source policy fails on. The same offline framework is used across all four tasks; no per-task tuning is applied beyond the source-workstation HIL-SERL budgets reported in the paper.
Memory-stick alignment under shifted illumination.
Sub-millimetre socket alignment under cross-workstation light.
Latch engagement under specular highlights.
Long-horizon wiping under daylight + spotlight shift.
All experiments run on a Franka Emika Panda arm with wrist-mounted RealSense cameras and per-task camera configurations. Cross-workstation illumination shift is realised by sweeping a discrete intensity gradient that combines task-light spotlight reconfigurations and daylight-direction changes.
Hardware setup. For each of the four tasks, we annotate the wrist cameras (red), workspace lights (yellow), and side cameras when present. Camera baselines and light placement differ per task, but the same offline RoHIL pipeline applies without modification.
Lighting-shift gradient. Discrete steps along the deployment-time illumination-shift sweep, including HDRI re-rendering, task-light spotlight reconfiguration, and natural-window-light variations. The 60% shift is used as the headline cross-workstation evaluation point in the cross-method comparison.