Recent Projects
|
Online Planning for Ad Hoc Autonomous Agent Teams |
We propose a novel online planning algorithm for ad hoc team settings: challenging situations in which an agent must collaborate with unknown teammates without prior coordination. Our approach is based on constructing and solving a series of stage games, and then using biased adaptive play to choose actions. The utility function in each stage game is estimated via Monte-Carlo tree search using the UCT algorithm. We establish the convergence of the algorithm analytically and show that it performs well in a variety of ad hoc team domains. |
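As a rough illustration of the utility-estimation step, the sketch below combines UCB1-style action selection with Monte-Carlo rollouts, in the spirit of UCT; the rollout simulator and the set of candidate joint actions are hypothetical placeholders rather than the published interface.

```python
import math

def ucb1_select(stats, total, c=1.41):
    """Pick the joint action maximizing mean value plus an exploration bonus."""
    best, best_score = None, float("-inf")
    for action, (value_sum, visits) in stats.items():
        if visits == 0:
            return action                      # always try unvisited actions first
        score = value_sum / visits + c * math.sqrt(math.log(total) / visits)
        if score > best_score:
            best, best_score = action, score
    return best

def estimate_stage_game_utilities(state, joint_actions, rollout, n_sims=1000):
    """Estimate the utility of each joint action by repeated sampled rollouts."""
    stats = {a: [0.0, 0] for a in joint_actions}   # action -> [value sum, visit count]
    for t in range(1, n_sims + 1):
        a = ucb1_select(stats, total=t)
        ret = rollout(state, a)                    # sampled return from a simulator (assumed)
        stats[a][0] += ret
        stats[a][1] += 1
    return {a: (s / n if n else 0.0) for a, (s, n) in stats.items()}
```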
Rollout Sampling Policy Iteration for Decentralized POMDPs |
We present decentralized rollout sampling policy iteration (DecRSPI), a new algorithm for multi-agent decision problems formalized as DEC-POMDPs. DecRSPI is designed to improve scalability and tackle problems that lack an explicit model. The algorithm uses Monte-Carlo methods to generate a sample of reachable belief states, and then computes a joint policy for each belief state based on the rollout estimations. A new policy representation allows us to represent solutions compactly. The key benefits of the algorithm are its linear time complexity in the number of agents, its bounded memory usage, and its good solution quality. It can solve larger problems that are intractable for existing planning algorithms. Experimental results confirm the effectiveness and scalability of the approach. |
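A minimal sketch of the two Monte-Carlo ingredients, belief sampling and rollout evaluation, is given below, assuming access to a generative model; the function names are hypothetical stand-ins, not DecRSPI's actual interface.

```python
# Assumed generative-model interface: step(state, joint_action) returns
# (next_state, joint_observation, reward); sample_initial_state() draws a state.

def sample_reachable_beliefs(sample_initial_state, behavior_policy, step,
                             horizon, n_particles=100):
    """Approximate the beliefs reachable at each step with particle sets."""
    particles = [sample_initial_state() for _ in range(n_particles)]
    beliefs = [list(particles)]
    for t in range(horizon - 1):
        particles = [step(s, behavior_policy(t, s))[0] for s in particles]
        beliefs.append(list(particles))        # a particle set stands in for a belief
    return beliefs

def rollout_value(state, joint_action, step, default_policy, depth):
    """Monte-Carlo estimate of the return of taking joint_action in state."""
    s, _, r = step(state, joint_action)
    total = r
    for t in range(1, depth):
        s, _, r = step(s, default_policy(t, s))
        total += r
    return total
```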
Approximate Dynamic Programming for DEC-POMDPs |
Dynamic programming is a classical offline planning technique that builds up a set of possible policies backwards from the last step to the first. This is accomplished by backing up the possible policies at each step and pruning those that have lower value for all states of the domain and all possible policies of the other agents. Unfortunately, the number of possible policies grows doubly exponentially with the horizon. Memory-bounded techniques have shown great promise in solving complex multi-agent planning problems modeled as DEC-POMDPs. Much of the performance gain can be attributed to pruning techniques that alleviate the complexity of the exhaustive backup step. Despite these improvements, state-of-the-art algorithms can still handle only a relatively small pool of candidate policies, which limits solution quality on some benchmark problems. We developed a new algorithm, Point-Based Policy Generation, which avoids searching the entire joint policy space altogether. The key observation is that the best joint policy for each reachable belief state can be constructed directly, instead of first producing a large set of candidates. The methodology is to parameterize the policy space and search over the parameterized space using mathematical techniques such as linear programming. We also provide an efficient approximate implementation of this operation using local search. The experimental results show that our solution technique improves performance significantly in terms of both runtime and solution quality.
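The local-search approximation can be pictured as coordinate ascent over the agents' choices at a single belief point. The sketch below illustrates that idea with a hypothetical joint-value oracle and leaves out the linear-programming formulation; it is not the exact procedure used in the paper.

```python
def local_search_joint_policy(belief, agent_actions, joint_value, max_rounds=50):
    """Greedy coordinate ascent over per-agent choices at one belief point.

    agent_actions: list of candidate choices per agent (assumed given);
    joint_value(belief, choices): hypothetical oracle for the joint value.
    """
    choices = [actions[0] for actions in agent_actions]    # arbitrary starting point
    for _ in range(max_rounds):
        improved = False
        for i, actions in enumerate(agent_actions):
            best_choice, best_value = choices[i], joint_value(belief, choices)
            for a in actions:
                trial = list(choices)
                trial[i] = a                                # vary one agent, fix the others
                value = joint_value(belief, trial)
                if value > best_value:
                    best_choice, best_value = a, value
            if best_choice != choices[i]:
                choices[i] = best_choice
                improved = True
        if not improved:                                    # no agent can improve: stop
            break
    return choices
```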
Another bottleneck of the dynamic programming approaches is the policy evaluation step. To evaluate all possible joint policies, previous approaches compute the exact value function for every state and are therefore very inefficient for problems with large state spaces. A careful analysis shows that only a small set of states is actually reachable and needs to be considered. Trial-based approaches offer an efficient way to solve single-agent MDPs and POMDPs. These approaches allow agents to focus their computations on regions of the environment they encounter during the trials, leading to significant computational savings. We present a novel trial-based dynamic programming algorithm for DEC-POMDPs that extends these benefits to multi-agent settings. The algorithm uses trial-based methods for both belief generation and policy evaluation. Policy improvement is implemented efficiently using linear programming and a sub-policy reuse technique that helps bound the amount of memory. The results show that the proposed algorithm produces significant value improvements and is much faster than the best existing planning algorithms. |
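The sketch below illustrates the trial-based evaluation idea: a fixed joint policy is evaluated only on states actually visited during sampled trials rather than over the entire state space. The generative-model and policy interfaces shown are hypothetical assumptions, not the algorithm's published interface.

```python
def trial_based_evaluation(sample_initial_state, step, joint_policy,
                           horizon, n_trials=200):
    """Return the average trial return and the set of states encountered.

    Assumed interfaces: step(state, actions) -> (next_state, observations, reward);
    joint_policy.n_agents and joint_policy.act(t, histories) -> joint action.
    """
    visited = set()
    total_return = 0.0
    for _ in range(n_trials):
        s = sample_initial_state()
        histories = tuple(() for _ in range(joint_policy.n_agents))
        for t in range(horizon):
            visited.add(s)                                  # only these states get evaluated
            actions = joint_policy.act(t, histories)        # local actions from local histories
            s, observations, reward = step(s, actions)
            histories = tuple(h + (o,) for h, o in zip(histories, observations))
            total_return += reward
    return total_return / n_trials, visited
```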
Multi-Agent Online Planning with Communication |
Most existing algorithms for solving DEC-POMDPs operate offline, generating a complete policy before execution begins that specifies what action to take in any runtime situation. While good performance can be achieved using these algorithms, they often take significant time (e.g., more than a day) to solve modest problems. The reason is that they need to consider all possible policies of the other agents (or a sufficiently large set of policies) in order to preserve solution quality. Moreover, small changes in the environment's dynamics require recomputing the full policy. In contrast, online algorithms only need to plan the current action and thus can be much faster. Furthermore, online planning can better handle emergencies and unforeseen situations, which makes online approaches applicable in many domains for which offline approaches are not adequate. However, implementing online algorithms for decentralized multi-agent systems is very challenging. Since the underlying system state as well as the observations of the other agents are not available during execution time, each agent must reason about all possible histories that could be observed by the other agents and how they may affect its own action selection. In addition, online algorithms must often meet real-time constraints, greatly reducing the available planning time compared with offline methods. Due to these difficulties, work on online planning for DEC-POMDPs has been sparse.
We propose a new online algorithm for planning under uncertainty in multi-agent settings. The algorithm helps overcome the high computational complexity of solving such problems offline. The key challenges are to maintain coordinated behavior with little or no communication and, when communication is allowed, to optimize value with minimal communication. The algorithm addresses these challenges by generating identical conditional plans based on common knowledge and communicating only when history inconsistency is detected, allowing communication to be postponed if necessary. To be suitable for online operation, the algorithm computes good local policies using a new and fast local search method implemented with linear programming. Moreover, it bounds the amount of memory used at each step and can be applied to problems with arbitrary horizons. The experimental results confirm that the algorithm can solve problems that are too large for the best existing offline planning algorithms, and that it outperforms the best online method, producing much higher value with much less communication in most cases. The algorithm also proves effective when the communication channel is imperfect (periodically unavailable). These results contribute to the scalability of decision-theoretic planning in multi-agent settings. |
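The communicate-on-inconsistency idea can be sketched as follows, with hypothetical names for the shared plan, the anticipated-history pool, and the broadcast primitive; this illustrates the principle rather than the algorithm's actual interface.

```python
def choose_action(agent_id, local_history, shared_plan, anticipated_histories, broadcast):
    """Act from the commonly computed plan; synchronize only on inconsistency.

    shared_plan[agent_id] maps anticipated local histories to actions;
    anticipated_histories[agent_id] is the pool covered by the common plan;
    broadcast(agent_id, history) is an assumed communication primitive.
    """
    if local_history not in anticipated_histories[agent_id]:
        # The common-knowledge plan did not anticipate this local history:
        # broadcast it so that all agents can rebuild a consistent joint plan.
        broadcast(agent_id, local_history)
        return None                     # caller replans after synchronization
    return shared_plan[agent_id][local_history]
```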
Solving Large-Scale and Sparse-Reward DEC-POMDPs with Correlation-MDPs |
Within a group of cooperating agents, the decision making of an individual agent depends on the actions of the other agents. Much effort has been made to solve this problem under additional assumptions about the agents' communication abilities. However, in some real-world applications communication is limited, and these assumptions are rarely satisfied. A recently developed alternative is to employ a correlation device that correlates the agents' behavior without exchanging information during execution. In our work, we apply a correlation device to large-scale and sparse-reward domains. As a basis we use the framework of infinite-horizon DEC-POMDPs, which represents policies as joint stochastic finite-state controllers. To solve a problem of this kind, a correlation device is first computed by solving a Correlation Markov Decision Process (Correlation-MDP) and then used to improve the local controller of each agent. This method allows us to achieve a tradeoff between computational complexity and the quality of the approximation. In addition, we demonstrate that adversarial problems can be solved by encoding information about opponents' behavior in the correlation device. We have successfully implemented the proposed method in our 2D simulated robot soccer team, and its performance at RoboCup-2006 was encouraging. |
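As an illustration of how a correlation device is used at execution time, the sketch below simulates joint stochastic finite-state controllers whose action and node-transition distributions are conditioned on a shared device state; the dictionary layout is a hypothetical representation of the quantities computed offline, not the paper's format.

```python
import random

def sample(dist):
    """Sample an outcome from a {outcome: probability} dictionary."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r <= acc:
            return outcome
    return outcome                             # guard against rounding error

def run_correlated_controllers(device, controllers, step, state, horizon):
    """Simulate the correlated joint controller and return the total reward.

    Assumed layout: device["next"][c] and device["initial"] describe the device;
    each controller has "initial", "action"[(node, c)], and "next"[(node, c, a, o)];
    step(state, actions) -> (next_state, observations, reward) is a generative model.
    """
    c = device["initial"]
    nodes = [ctrl["initial"] for ctrl in controllers]
    total = 0.0
    for _ in range(horizon):
        # Each agent acts from its own controller node and the common device state.
        actions = [sample(ctrl["action"][(q, c)]) for ctrl, q in zip(controllers, nodes)]
        state, observations, reward = step(state, actions)
        total += reward
        nodes = [sample(ctrl["next"][(q, c, a, o)])
                 for ctrl, q, a, o in zip(controllers, nodes, actions, observations)]
        c = sample(device["next"][c])          # device evolves independently of observations
    return total
```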