Q-learning with Nearest Neighbors

Shah, Devavrat and Xie, Qiaomin
arXiv preprint arXiv:1802.03900, 2018

We consider the problem of model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernels, when only a single sample path of the system is available. We focus on the classical approach of Q-learning where the goal is to learn the optimal Q-function. We propose the Nearest Neighbor Q-Learning approach that utilizes nearest neighbor regression method to learn the Q function. We provide finite sample analysis of the convergence rate using this method. In particular, we establish that the algorithm is guaranteed to output an ϵ-accurate estimate of the optimal Q-function with high probability using a number of observations that depends polynomially on ϵ and the model parameters. To establish our results, we develop a robust version of stochastic approximation results; this may be of interest in its own right.