adporn.net Reward hacking - Security Risks in AI and Machine Learning: Categorizing Attacks and Failure Modes Video Tutorial | LinkedIn Learning, formerly Lynda.com

LinkedIn and 3rd parties use essential and non-essential cookies to provide, secure, analyze and improve our Services, and to show you relevant ads (including professional and job ads) on and off LinkedIn. Learn more in our Cookie Policy.

Select Accept to consent or Reject to decline non-essential cookies for this use. You can update your choices at any time in your settings.

Start free trial Sign in

From the course: Security Risks in AI and Machine Learning: Categorizing Attacks and Failure Modes

Reward hacking

From the course: Security Risks in AI and Machine Learning: Categorizing Attacks and Failure Modes

Reward hacking

“

- Many years ago, a dog heard a child's cries from the banks of the Seine River in Paris. The dog jumped into the water, saved the child by safely bringing it to shore, and that dog was very well rewarded by the locals. The dog received a lot of positive attention, and a beef steak. A couple of days later, a similar thing happened, and the dog again saved a child from the water, and was again rewarded. But when this started to be an almost daily occurrence, the locals decided to take a closer look. Well, it turns out that the dog had started pushing the children into the water so it could save them and get the reward. Today, we would say that dog was reward hacking, and that's something that machines can do, too, especially in reinforcement learning, RL, systems. Consider a floor-cleaning robot that is rewarded every time a mess is cleaned up. Designer expectation is that the ML system driving the robot would be incented to clean up messes quickly and thoroughly. But like that dog in France, the system could start to reward-hack by creating messes that it would then clean up. Systems don't always have to create a problem to solve, though. Reward hacking can also show up in unexpected ways as the system pursues the reward, like an AI human simulation game, where the avatars or players require energy to function. Designers assign energy points to food, but if they don't carefully limit what qualifies as food, it's possible to have digital sims eating their electronic pets to stay powered up. In these examples, no external attacker caused the system to malfunction. The flaws are in the design of the system itself and how the reinforcement learning is rewarded, which is why, without careful design considerations, RL systems, in their drive for rewards, could cause unexpected and messy outcomes.

Contents