Published in: Proceedings of the International Conference on Agents and Artificial Intelligence

DOI: 10.5220/0005195500660077

Exploration versus exploitation trade-off in infinite horizon Pareto Multi-armed bandits algorithms

Proceedings article published in 2015 by Madalina Drugan and Bernard Manderick.
This paper is available in a repository.


Abstract

Multi-objective multi-armed bandits (MOMAB) are multi-armed bandits (MAB) extended to reward vectors. We use the Pareto dominance relation to assess the quality of reward vectors, as opposed to scalarization functions. In this paper, we study the exploration vs exploitation trade-off in infinite horizon MOMAB algorithms. Single-objective MABs explore the suboptimal arms and exploit a single optimal arm. MOMABs also explore the suboptimal arms, but in addition they need to exploit all optimal arms fairly. We study the exploration vs exploitation trade-off of the Pareto UCB1 algorithm, and we extend UCB2, another popular infinite horizon MAB algorithm, to reward vectors using the Pareto dominance relation. We analyse the properties of the proposed MOMAB algorithms in terms of upper regret bounds, and we experimentally compare their exploration vs exploitation trade-off on a bi-objective Bernoulli environment coming from control theory.
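
To make the idea concrete, the sketch below illustrates a Pareto UCB1-style selection rule as described in the abstract: an exploration bonus is added to every objective of each arm's empirical mean vector, and an arm is then drawn uniformly at random from the Pareto-optimal set of these index vectors, so that all optimal arms are exploited fairly. This is a simplified, hedged reconstruction, not the paper's exact algorithm; in particular, the bonus term in the paper's Pareto UCB1 involves additional constants (e.g. the number of objectives and the size of the Pareto front), which are omitted here.

```python
import math
import random

def dominates(u, v):
    """True if reward vector u Pareto-dominates v: u is at least as good
    in every objective and strictly better in at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(vectors):
    """Indices of the vectors that are not dominated by any other vector."""
    return [i for i, u in enumerate(vectors)
            if not any(dominates(v, u) for j, v in enumerate(vectors) if j != i)]

def pareto_ucb1_choice(means, counts, t):
    """One round of a Pareto UCB1-style rule (illustrative sketch).

    means  -- list of empirical mean reward vectors, one per arm
    counts -- list of pull counts, one per arm
    t      -- total number of pulls so far
    """
    if 0 in counts:
        return counts.index(0)  # play each arm once before using the index
    indices = []
    for mu, n in zip(means, counts):
        bonus = math.sqrt(2.0 * math.log(t) / n)  # simplified exploration bonus
        indices.append([m + bonus for m in mu])
    # pick uniformly among the Pareto-optimal index vectors,
    # exploiting all optimal arms fairly rather than a single one
    return random.choice(pareto_front(indices))
```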