From the course: Security Risks in AI and Machine Learning: Categorizing Attacks and Failure Modes

Reward hacking

- Many years ago, a dog heard a child's cries from the banks of the Seine River in Paris. The dog jumped into the water, saved the child by safely bringing it to shore, and that dog was very well rewarded by the locals. The dog received a lot of positive attention, and a beef steak. A couple of days later, a similar thing happened, and the dog again saved a child from the water, and was again rewarded. But when this started to be an almost daily occurrence, the locals decided to take a closer look. Well, it turns out that the dog had started pushing the children into the water so it could save them and get the reward. Today, we would say that dog was reward hacking, and that's something that machines can do, too, especially in reinforcement learning, RL, systems. Consider a floor-cleaning robot that is rewarded every time a mess is cleaned up. Designer expectation is that the ML system driving the robot would be incented to clean up messes quickly and thoroughly. But like that dog in France, the system could start to reward-hack by creating messes that it would then clean up. Systems don't always have to create a problem to solve, though. Reward hacking can also show up in unexpected ways as the system pursues the reward, like an AI human simulation game, where the avatars or players require energy to function. Designers assign energy points to food, but if they don't carefully limit what qualifies as food, it's possible to have digital sims eating their electronic pets to stay powered up. In these examples, no external attacker caused the system to malfunction. The flaws are in the design of the system itself and how the reinforcement learning is rewarded, which is why, without careful design considerations, RL systems, in their drive for rewards, could cause unexpected and messy outcomes.

Contents