Knowing You’re Being Tested (If You’re A Thing)

There’s a term in AI Safety, coined I think by Robert Miles, about the dangers of test protocols and devices that can respond to them. The idea as he proposes it is that in the context of an AI that is being tested to see how compliant it is to a task, if it recognises the difference between a test and a real environment, it makes sense for the AI to lie about its own behaviour, pass the test, then execute things by its normal protocols.

The idea runs pretty much like this: AI as crafted have priorities. Let’s say it wants to press red buttons and not blue buttons. You want to make sure it presses blue buttons and not red buttons. So, you pull it in for a test. If the AI is capable of telling it’s being tested, the best way for it to continue pressing red buttons is to in this moment, press blue buttons to pass the test. There are more red buttons in the world then there are in the test, so it stands to reason that pressing these blue buttons will yield more red buttons over time.

And so you pull the thing into a test environment and it presses the blue buttons the way it should and you release it and still keep getting reports of all these red buttons being pushed. But because the AI behaves the way it ‘should’ in testing, you’re left with this really weird black hole in your ability to locate the problem. After all, the AI is doing what it should in tests!

This is a phenomenon that doesn’t just apply to high level AI though. It’s more a sort of general warning about the way testing environments are constructed, and what you have to do to deal with actors in testing spaces that are trying to pass the test. When playtesters are trying to convince you they had a good time, when they are concerned about your emotional reactions, they will do things that try to end the playtest session. They will be trying to live up to and comply with your expectations.

It’s also about the tools you use. Forms for feedback can unconsciously push people towards giving you the answers you want rather than the answers they intend to give. This is especially true for any testing involving kids, because kids are inclined to giving emotional responses that they perceive you want. What’s more, kids are really good at inferring the unstated – so if you ask them if thing A is true or false, they may often infer thing B, even if you don’t want them to.

The important thing is that your test results and feedback can get all sorts of unconsidered factors. It’s worth noting that Robert Miles’ position was explicitly about things where a thinking entity made a deliberate choice to disrupt the test, though.

The term he uses for this is Volkswagoning.

Comments are closed.