Theory of Mind (ToM) refers to the modeling of others’ beliefs, desires, preferences, and biases. When agents are endowed with ToM, complex and sophisticated higher-order cognitive beliefs can emerge, including deception: the process by which a deceiver induces in a target agent a false belief that the target comes to hold as true. Counter-deception and deception mitigation are defenses that attempt to use deception detection to separate deceivers from non-deceivers, often in cooperative settings. In this paper, we discuss several limitations of defenses that use ToM to mitigate deception, using game models such as Bayesian games, stochastic games, and persuasion. We approach ToM as a means of estimating another agent’s utility, model estimation errors as noisy games, and highlight the challenges that arise from each model. This paper is presented as a pre-print.
As an example, consider the games Penny-Matching (two-player zero-sum) and Coordination (two-player cooperative), in which noise distorts defender β’s perception of deceiver α’s utility. If the deceiver α knows about β’s misconception, α can play exploitation strategies against β’s highly exploitable policy and reap the benefits of facing a sub-optimal opponent, all while β’s subjective perception of utility remains skewed from reality.
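To make the Penny-Matching case concrete, the following is a minimal Python sketch (illustrative, not taken from the paper): β forms a noisy estimate of α’s payoff matrix, derives a policy from that estimate, and α, who is assumed to know β’s misperception and reasoning procedure, best-responds under the true payoffs. The Gaussian noise model and β’s level-1 reasoning procedure are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Matching Pennies from the deceiver alpha's perspective (row player):
# alpha wants to match pennies; defender beta (column player) gets the negation.
A_true = np.array([[ 1.0, -1.0],
                   [-1.0,  1.0]])

# Assumed noise model: beta's ToM estimate of alpha's utilities is the true
# matrix corrupted by additive Gaussian noise.
A_hat = A_true + rng.normal(scale=0.8, size=A_true.shape)

def best_response_row(payoff, col_strategy):
    """Pure best response of the row player to a fixed column strategy."""
    return np.eye(2)[np.argmax(payoff @ col_strategy)]

# beta reasons about alpha with the noisy model: it predicts that alpha
# best-responds to a uniform column player, then picks the column that
# minimizes alpha's (perceived) payoff.
alpha_predicted = best_response_row(A_hat, np.array([0.5, 0.5]))
beta_policy = np.eye(2)[np.argmin(alpha_predicted @ A_hat)]

# alpha knows beta's misperception, so it best-responds to beta's actual
# (now predictable) policy under the *true* payoffs.
alpha_exploit = best_response_row(A_true, beta_policy)

print(f"alpha's payoff against the misled defender: {alpha_exploit @ A_true @ beta_policy:+.2f}")
print("equilibrium value of Matching Pennies for alpha: +0.00")
```

Because β’s policy is a deterministic artifact of its skewed model, α recovers the maximal per-round payoff of +1 rather than the game’s equilibrium value of 0.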
The paper describes many scenarios in which theory of mind is used as a defense against deception; some of the most important limitations include:
- Given a Bayesian game that models agents with at least two types, deceiver and non-deceiver, a Perfect Bayesian Equilibrium (PBE) is a pairing of beliefs and a strategy profile that falls into one of three kinds: separating, pooling, or semi-separating (partially pooling).
- While a separating PBE is the ideal solution, in which, given some belief, the actions of deceivers and non-deceivers reveal their type, the other two kinds (pooling and semi-separating) may fail to distinguish deceivers from non-deceivers when observable actions are not indicators of an agent’s type (see the posterior-update sketch after this list).
- Given a Bayesian game, even if a solution is a separating PBE, the actions that distinguish the types may occur only after the decisions that determine the majority of the payoff in the game.
- Infinite higher-order recursion is rational for a defender who is uncertain about the deceiver’s highest order of reasoning, even though the returns may be diminishing.
- Higher-order recursion used for value estimation is plagued by compounding errors (see the toy simulation after this list).
- When deception is performed through several channels but the defender can only observe a subset of those channels, the defense requires stochasticity to make up for the lost effectiveness.
- Defenses against Adversarial Communication are limited by the following:
- Agents with individual identities and local ToM can be slow to update their beliefs.
- Agents with class identities/types are limited if their type is independent of their observable behavior.
- Agents that model adversarial communication as system noise are limited in their ability to take targeted measures against individual agents.
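To illustrate the pooling limitation noted above, here is a minimal Python sketch using Bayes’ rule with illustrative probabilities (assumed for this example, not taken from the paper). When the observed action is played with type-dependent probabilities (separating-like behavior), the defender’s posterior over types shifts; when both types play it equally often (pooling), the posterior stays at the prior and the observation reveals nothing.

```python
def posterior_deceiver(prior, p_action_if_deceiver, p_action_if_honest):
    """Bayes update of P(deceiver) after observing a single action."""
    joint_deceiver = p_action_if_deceiver * prior
    joint_honest = p_action_if_honest * (1.0 - prior)
    return joint_deceiver / (joint_deceiver + joint_honest)

prior = 0.3  # assumed prior probability of facing a deceiver

# Separating-like behavior: the action is far more likely under the deceiver type.
print(posterior_deceiver(prior, p_action_if_deceiver=0.9, p_action_if_honest=0.1))  # ~0.79

# Pooling behavior: both types play the action with the same probability, so the
# posterior equals the prior and the action carries no information about type.
print(posterior_deceiver(prior, p_action_if_deceiver=0.7, p_action_if_honest=0.7))  # 0.30
```

Under pooling, the defender’s belief never moves, no matter how many such observations it accumulates.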
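The compounding-error point can likewise be illustrated with a toy simulation (the additive, independent per-level noise is an assumption of this sketch, not the paper’s model): if every additional level of “I think that you think that …” re-estimates the other agent’s value and injects its own error, the error of the final estimate grows with recursion depth.

```python
import numpy as np

rng = np.random.default_rng(1)

true_value = 1.0        # ground-truth utility the defender tries to estimate
per_level_noise = 0.2   # assumed estimation noise introduced at each ToM level
trials = 10_000

for depth in (1, 2, 4, 8):
    # Each recursion level adds its own independent error; the errors accumulate.
    estimates = true_value + rng.normal(scale=per_level_noise,
                                        size=(trials, depth)).sum(axis=1)
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    print(f"recursion depth {depth}: RMSE ~ {rmse:.3f}")
```

Under this independence assumption the error grows roughly with the square root of the depth, which also underlines why the returns to ever-deeper recursion diminish.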