Friday, July 10, 2020 - 9:00am to 10:30am
Location: Virtual Presentation, Remote Access Enabled - Zoom
Speaker: Rui Silva, Ph.D. Student (https://ruitsilva.com/)
Counterfactual MDPs: Extending MDPs for Planning Beyond Direct Control
Planning under uncertainty using Markov decision processes (MDPs) requires a model of the environment specifying the probabilistic effects of the agent's actions. The agent's optimal policy is then computed from such a model. As such, when planning, the agent assumes the environment may only change as a direct result of its actions. In this thesis we lift that assumption, allowing the agent to reason over the counterfactual "What if the world were different?" Effectively, we allow the agent to reason over alternative configurations of the world, where more rewarding optimal policies may exist, and over the cost required to shift the original environment to these alternative worlds.
Our goal is to endow the agent with the ability to plan over the possibility of actually operating in such modified configurations of the world, if beneficial. We introduce Counterfactual MDPs, a new class of MDPs that allows the agent to reason, and plan, over the aforementioned counterfactual. Solving a Counterfactual MDP consists of maximizing the expected value/cost trade-off over possible changes to the world. We show that this optimization problem is NP-hard, and propose P-Iteration, a gradient-based algorithm for solving Counterfactual MDPs. We then extend the Counterfactual MDP model to allow the agent to reason and plan over the uncertainty underlying possible changes to the world. The resulting model, dubbed Stochastic Outcomes Counterfactual MDPs, assumes the uncertainty associated with changing the world follows a distribution whose parameters the agent can reason over and control, resulting in a new optimization problem. We show how to extend the P-Iteration algorithm to this new class of problems.
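The value/cost trade-off above can be illustrated with a minimal sketch. The announcement does not give P-Iteration's actual update rule, so the following uses plain finite-difference gradient ascent on a hypothetical toy problem: a 2-state MDP in which a parameter theta (the probability that an action succeeds in reaching a rewarding state) can be changed away from its original value, at a quadratic cost.

```python
import numpy as np

# Illustrative sketch only: a toy stand-in for the value/cost trade-off,
# not the thesis's P-Iteration algorithm. GAMMA, COST, and THETA0 are
# hypothetical parameters chosen for this example.
GAMMA = 0.9    # discount factor
COST = 2.0     # weight of the cost of changing the world
THETA0 = 0.1   # the original world's success probability

def value(theta, iters=500):
    """Optimal value at the start state, via value iteration."""
    # P[s, a, s']: 2 states, 2 actions.
    P = np.zeros((2, 2, 2))
    P[0, 0, 0] = 1.0            # action 0 stays put
    P[1, 0, 1] = 1.0
    P[0, 1, 1] = theta          # action 1 reaches state 1 w.p. theta
    P[0, 1, 0] = 1.0 - theta
    P[1, 1, 1] = 1.0
    r = np.array([0.0, 1.0])    # reward for arriving in each state
    V = np.zeros(2)
    for _ in range(iters):
        Q = P @ (r + GAMMA * V)  # Q[s, a] = sum_s' P[s,a,s'](r[s'] + g V[s'])
        V = Q.max(axis=1)
    return V[0]

def objective(theta):
    # Expected value of the modified world, minus the cost of changing it.
    return value(theta) - COST * (theta - THETA0) ** 2

# Finite-difference gradient ascent over theta, projected onto [0, 1].
theta, lr, eps = THETA0, 0.05, 1e-4
for _ in range(200):
    g = (objective(theta + eps) - objective(theta - eps)) / (2 * eps)
    theta = min(1.0, max(0.0, theta + lr * g))

print(round(theta, 2))  # settles strictly between THETA0 and 1
```

The agent concludes it is worth paying to make action 1 more reliable than in the original world, but not worth pushing its success probability all the way to 1, since the marginal gain in value eventually falls below the marginal cost of further change.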
Finally, we demonstrate the applicability of Counterfactual MDPs and the performance of the proposed algorithms in multiple scenarios. In particular, we show the significant performance improvements that arise from allowing the agent to reason and plan over alternative configurations of the world.
Manuela Veloso (Co-Chair)
Francisco S. Melo (Co-Chair)
Daniel Borrajo (Universidad Carlos III de Madrid)
Pedro Lima (Instituto Superior Técnico, Lisboa)
Zoom Participation Enabled. See announcement.