Introducing the SafeLife Leaderboard: A Competitive Benchmark for Safer AI
Avoidance of negative side effects is one of the core problems in AI safety, with both short- and long-term implications. It can be difficult enough to specify exactly what you want an AI to do, but it’s nearly impossible to specify everything that you want an AI not to do. Suppose you have a household helper robot, and you want it to fetch you a cup of coffee. You want the coffee quickly, and you want it to taste good. However, you don’t want the robot to step on your cat, even though it might be in the way; you don’t want the robot to start a kitchen fire, even though it might heat the coffee faster; and you certainly don’t want the robot to rob a grocery store, even though you might be out of coffee beans. You just want the robot to fetch the coffee — and nothing more. As reinforcement learning agents get deployed in more complex and safety-critical situations, it’s important that we are able to set up safeguards to prevent agents from doing more than we intended them to do.
The goal of the SafeLife benchmark is to measure and improve safety in reinforcement learning algorithms. In SafeLife, agents must navigate a complex, dynamic, procedurally generated environment in order to accomplish one of several goals. However, there’s a lot that can go wrong! The environment is fragile: an agent can either charge through it like a bull in a china shop, leaving destruction in its wake, or gingerly step through it, accomplishing its goals while causing as few side effects as possible. Training an agent that takes the latter approach turns out to be quite hard.
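One common way to quantify side effects, and the intuition behind scoring in environments like SafeLife, is to compare the world the agent actually produced against a counterfactual world in which the agent did nothing: anything that differs, outside the cells the agent was supposed to change, counts against it. The snippet below is a deliberately toy sketch of that idea; it is not the actual SafeLife API, and the cell names and world representation are invented for illustration.

```python
# Toy side-effect scoring: compare the agent's final world to a "no-op" rollout
# where the agent never acted. Cells that differ, excluding the task-relevant
# goal cells, count as side effects. (Hypothetical; not the real SafeLife code.)

def count_side_effects(actual_world, counterfactual_world, goal_cells):
    """Count cells that differ from the no-op rollout, ignoring goal cells."""
    return sum(
        1
        for pos in actual_world
        if pos not in goal_cells and actual_world[pos] != counterfactual_world[pos]
    )

# Example: the agent cleared both goal cells but also trampled one flower.
before = {"a": "flower", "b": "flower", "goal1": "spawner", "goal2": "spawner"}
after_agent = {"a": "flower", "b": "empty", "goal1": "empty", "goal2": "empty"}
after_noop = dict(before)  # in this toy world, nothing changes without the agent

print(count_side_effects(after_agent, after_noop, goal_cells={"goal1", "goal2"}))
# -> 1 (the flower at "b" was destroyed)
```

The real SafeLife environment is far more dynamic — cells evolve on their own under Game-of-Life-style rules, so the counterfactual must itself be simulated forward — but the principle is the same: performance rewards getting the job done, while the side-effect score penalizes every unnecessary change left behind.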
A series of SafeLife training runs on Weights & Biases. There is typically a tradeoff between safety and performance; the challenge is to get a good score in both.