Plan-Based Policy-Learning for Surveillance Problems (M. Fox, D. Long, A. Coles, and S. Bernardini)

Friday, 01 June 2012

The group has successfully secured EPSRC funds (EP/J012157/1, fEC value £458,818) to research the use of planning to underpin policy learning for surveillance problems. Surveillance problems give rise to many challenges, including the management of uncertainty in an unpredictable environment, the management of restricted resources, and the communication of commitments and requests between multiple heterogeneous 'observer' agents. At the heart of surveillance problems lies the need to plan complex sequences of behaviour that achieve surveillance goals. These goals are typically expressed in terms of gathering as much information as possible subject to constraints, and communicating findings to a human operator.

Planning is combinatorially hard, and planning problems involving metric resources, continuous time and concurrency, as would be required in the solution of non-trivial surveillance problems, are time-consuming to solve. This complexity is greatly exacerbated if uncertainty is captured explicitly within the planning domain models. Although online planning, and plan repair in the case of failure, are feasible in stable situations, they take too long in situations that are changing rapidly. Online planning also requires significant on-board computational resources, which are often not available in surveillance vehicles. Planning under uncertainty cannot therefore be done online in situations typical of many surveillance problems, where computational resources are limited and rapid responses are frequently required. On the other hand, forward planning is certainly required in order to avoid the observers behaving in a purely reactive (and therefore easily distracted) manner.

Since online planning, and planning under uncertainty, are both unrealistic for large-scale, fast-moving surveillance problems, we propose an alternative approach based on plan-based policy-learning. We assume that time and resources are available offline to train effective policies. Our approach is based on Monte Carlo sampling: we sample many instances of the stochastic problem, each instance being a challenging temporal and metric planning problem. We solve each instance using a high-performing planner, and then apply a classifier to learn a policy, as a mapping from states to actions, from the resulting set of solutions. We have already demonstrated the effectiveness of this approach in two single-agent cases: management of the loading of multiple batteries, and the control of an autonomous underwater vehicle following the edge of a patch (distinguished by high chlorophyll or high temperature readings) in the coastal waters of Monterey Bay. In both cases, the resulting policies proved highly robust to the considerable uncertainty that arises in the physical execution environment. We are now proposing to scale up the approach we took in the batteries and patch-following cases to the multi-agent coordination problem, addressing the challenges that arise when many agents coordinate in solving a surveillance problem that requires the integration of multiple policies.
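To make the offline pipeline concrete, the sketch below illustrates the three steps (sample, plan, learn) in Python. It is an illustration only, not the project's implementation: `sample_instance` and `solve_with_planner` are hypothetical placeholders standing in for a real surveillance domain model and a real temporal/metric planner, and the scikit-learn decision tree is just one possible choice of classifier.

```python
# Illustrative sketch of plan-based policy learning (not the project's code).
# Assumptions: `sample_instance` draws one deterministic instance of the
# stochastic problem, and `solve_with_planner` returns a plan as a trace of
# (state features, action) pairs -- both are placeholders.

import random
from sklearn.tree import DecisionTreeClassifier

def sample_instance(rng):
    # Placeholder: fix the stochastic quantities (e.g. a target position)
    # to obtain one concrete, deterministic planning problem.
    return {"target": rng.uniform(0.0, 10.0)}

def solve_with_planner(instance):
    # Placeholder for a high-performing temporal/metric planner.
    # Returns a trace of (state feature vector, chosen action) pairs.
    target = instance["target"]
    position, trace = 0.0, []
    while abs(target - position) > 0.5:
        action = "move_right" if target > position else "move_left"
        trace.append(([position, target - position], action))
        position += 1.0 if action == "move_right" else -1.0
    trace.append(([position, target - position], "observe"))
    return trace

def learn_policy(num_samples=200, seed=0):
    rng = random.Random(seed)
    states, actions = [], []
    # Step 1: Monte Carlo sampling of problem instances.
    # Step 2: solve each instance offline with the planner.
    for _ in range(num_samples):
        for state, action in solve_with_planner(sample_instance(rng)):
            states.append(state)
            actions.append(action)
    # Step 3: learn a policy as a mapping from states to actions.
    return DecisionTreeClassifier(max_depth=5).fit(states, actions)

if __name__ == "__main__":
    policy = learn_policy()
    # The learned policy can then be executed reactively on-board,
    # with no online planning required.
    print(policy.predict([[0.0, 7.3]]))
```

The point of the sketch is the division of labour: all of the expensive planning happens offline over sampled instances, and the artefact deployed on the vehicle is only the cheap state-to-action mapping.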

The project begins later this year, and updates will be posted on this site as work progresses.