68 Robot Learning

context of RL is provided by Dearden et al. (1998; 1999), who applied Q-learning in a Bayesian framework with an application to the exploration-exploitation trade-off. Poupart et al. (2006) present an approach for efficient online learning and exploration in a Bayesian context; they ascribe Bayesian RL to POMDPs. Apart from that, statistical uncertainty consideration is similar to, but strictly demarcated from, other issues that deal with uncertainty and risk. Consider the work of Heger (1994) and of Geibel (2001), which deals with risk in the context of undesirable states. Mihatsch & Neuneier (2002) developed a method to incorporate the inherent stochasticity of the MDP. Most closely related to our approach is the recent independent work by Delage & Mannor (2007), who solved the percentile optimisation problem by convex optimisation and applied it to the exploration-exploitation trade-off. They suppose special priors on the MDP's parameters, whereas the present work has no such requirements and can be applied in the more general context of RL methods.

2. Bellman iteration and uncertainty propagation

Our concept of incorporating uncertainty into RL consists in applying UP to the Bellman iteration (Schneegass et al., 2008),

$$Q^m(s_i, a_j) := (T Q^{m-1})(s_i, a_j) \qquad (5)$$
$$= \sum_{k=1}^{|S|} P(s_k \mid s_i, a_j) \left( R(s_i, a_j, s_k) + \gamma V^{m-1}(s_k) \right), \qquad (6)$$

here for discrete MDPs. For policy evaluation we have $V^m(s) := Q^m(s, \pi(s))$, with $\pi$ the used policy, and for policy iteration $V^m(s) := \max_{a \in A} Q^m(s, a)$ (section 1.1). Thereby we assume a finite number of states $s_i$, $i \in \{1, \ldots, |S|\}$, and actions $a_j$, $j \in \{1, \ldots, |A|\}$. The Bellman iteration converges, with $m \to \infty$, to the optimal Q-function, which is appropriate to the estimators $P$ and $R$. In the general stochastic case, which will be important later, we set $V^m(s) := \sum_{a \in A} \pi(s, a) Q^m(s, a)$, with $\pi(s, a)$ the probability of choosing $a$ in $s$.
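As an illustration of the iteration in equations (5) and (6), the following sketch implements the policy-iteration case ($V^{m-1}(s) = \max_a Q^{m-1}(s,a)$) for a small discrete MDP. The array shapes, the toy transition model, and the function name are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

def bellman_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate Q^m(s_i,a_j) = sum_k P(s_k|s_i,a_j) (R(s_i,a_j,s_k) + gamma V^{m-1}(s_k)),
    with V^{m-1}(s) = max_a Q^{m-1}(s,a) (the policy-iteration case).

    P[i, j, k] = P(s_k | s_i, a_j), R[i, j, k] = R(s_i, a_j, s_k)."""
    n_s, n_a, _ = P.shape
    Q = np.zeros((n_s, n_a))
    for _ in range(max_iter):
        V = Q.max(axis=1)  # V^{m-1}(s_k)
        # Expected reward-plus-discounted-value under P, per (s_i, a_j):
        Q_new = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

# Toy 2-state, 2-action MDP (made up for illustration); reward depends
# only on the successor state here.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.ones((2, 2, 1)) * np.array([0.0, 1.0])
Q_star = bellman_iteration(P, R, gamma=0.9)
```

At convergence, `Q_star` is a fixed point of the operator $T$, i.e. applying one more backup leaves it (numerically) unchanged.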
To obtain the uncertainty of the approached Q-function, the technique of UP is applied in parallel to the Bellman iteration. With given covariance matrices $\mathrm{Cov}(P)$, $\mathrm{Cov}(R)$, and $\mathrm{Cov}(P, R)$ for the transition
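The core of first-order uncertainty propagation is the Gaussian error-propagation rule: for $y = f(x)$, $\mathrm{Cov}(y) \approx J\,\mathrm{Cov}(x)\,J^\top$ with $J$ the Jacobian of $f$. The sketch below applies this rule to a single Bellman backup $Q(s,a) = \sum_k P_k (R_k + \gamma V_k)$ for one fixed state-action pair, treating the stacked vector $(P, R)$ as the uncertain input; the variable names, shapes, and toy numbers are assumptions for illustration, not the chapter's implementation.

```python
import numpy as np

def backup(P, R, V, gamma=0.9):
    """One Bellman backup for a fixed (s, a): sum_k P_k (R_k + gamma V_k)."""
    return np.dot(P, R + gamma * V)

def propagate_uncertainty(P, R, V, cov_PR, gamma=0.9):
    """First-order UP: variance of the backup given the joint covariance
    cov_PR of the stacked input vector (P_1..P_k, R_1..R_k)."""
    # Jacobian of the backup w.r.t. (P, R):
    #   d/dP_k = R_k + gamma V_k,   d/dR_k = P_k
    J = np.concatenate([R + gamma * V, P])[None, :]  # shape (1, 2k)
    return J @ cov_PR @ J.T                          # 1x1 variance of Q(s,a)

P = np.array([0.7, 0.3])                 # estimated transition probabilities
R = np.array([1.0, 0.0])                 # estimated rewards
V = np.array([2.0, 1.0])                 # current value estimates (held fixed)
cov_PR = np.diag([0.01, 0.01, 0.04, 0.04])  # independent input uncertainties
var_Q = propagate_uncertainty(P, R, V, cov_PR)
```

Running UP in parallel with the Bellman iteration amounts to applying this Jacobian step at every iteration, so the input covariances of $P$ and $R$ are carried through to a covariance of the Q-function itself.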