Mobile Robots Navigation (2008, part 15), p. 548

The effective reinforcement is used to update the connection weights between PCL and PU in AC (i.e., the reward expectation associated with a place) and also between Actor units and map nodes (i.e., the reward expectations associated with different orientations). In the first case we use

$$W_{PCL}(t+1) = W_{PCL}(t) + \beta \, r(t) \, E_{PCL}(t) \tag{5}$$

where $\beta$ is the learning rate and $E_{PCL}$ is the $1 \times n$ matrix of eligibility traces corresponding to the connections between PCL and PU in AC. In the second case we use

$$W_{N_k}(t+1) = W_{N_k}(t) + \beta \, r(t) \, E_{N_k}(t), \quad \forall \text{ map node } k \tag{6}$$

where $W_{N_k}$ is the vector of connection weights between map node $k$ and a maximum of eight Actor units, and $E_{N_k}$ is the vector of eligibility traces corresponding to a maximum of eight Actor units.

As shown in (5) and (6), both learning rules depend on the eligibility of the connections. At the beginning of every trial in a given experiment, the eligibility traces in AC and in the Actor units are initialized to 0. At each time step $t$ in a trial, the eligibility traces in AC are increased in the connections between PU and the most active neurons within PCL, but only when the action executed by the animat at time $t-1$ allowed it to perceive the goal:

$$E_{PCL}(t) = E_{PCL}(t-1) + \chi \, PC(t) \tag{7}$$

where $\chi$ is the increment parameter and $PC$ stores the activity pattern registered by the collection of neurons in PCL. Also at time step $t$, the eligibility trace $e$ of the connection between the active map node $n_a$ and the Actor unit corresponding to the current animat orientation $dir$ is increased by $\tau$, as described by (8):

$$e^{dir}_{n_a}(t) = e^{dir}_{n_a}(t-1) + \tau \tag{8}$$

Finally, after updating the connection weights between PCL and AC and between Actor units and map nodes at any time step $t$ in the trial, all eligibilities decay at certain rates $\lambda$ and $\sigma$ respectively, as shown in (9):

$$E_{PCL}(t) = \lambda \, E_{PCL}(t-1), \qquad E_{N_k}(t) = \sigma \, E_{N_k}(t-1), \quad \forall \text{ map node } k \tag{9}$$

The use of the Actor-Critic architecture enables the estimation of reward expectation values of different locations in the environment where maximum expectations correspond to locations from where the goal is .
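The update cycle in rules (5) to (9) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation. The class name `EligibilityLearner`, the method names, the zero initial weights and the default parameter values are all hypothetical. The parameter names `beta`, `chi`, `tau`, `lam` and `sigma` stand for the learning rate, the two trace increments and the two decay rates. The sketch also assumes that `pc` already holds the activity of the most active PCL neurons, with 0 elsewhere.

```python
N_DIRECTIONS = 8  # the text allows at most eight Actor units per map node


class EligibilityLearner:
    """Illustrative container for weights and eligibility traces (hypothetical name)."""

    def __init__(self, n_pcl, n_nodes):
        # Assumption: weights start at zero; the text gives no initial values.
        self.w_pcl = [0.0] * n_pcl
        self.w_nodes = [[0.0] * N_DIRECTIONS for _ in range(n_nodes)]
        self.start_trial()

    def start_trial(self):
        """Reset all eligibility traces to 0, as done at the start of every trial."""
        self.e_pcl = [0.0] * len(self.w_pcl)
        self.e_nodes = [[0.0] * N_DIRECTIONS for _ in self.w_nodes]

    def step(self, r, pc, node, direction, goal_perceived,
             beta=0.1, chi=0.5, tau=1.0, lam=0.8, sigma=0.8):
        """One time step: raise traces, update weights, then decay traces."""
        # Rule (7): raise the PCL traces only if the previous action let the
        # animat perceive the goal.
        if goal_perceived:
            self.e_pcl = [e + chi * a for e, a in zip(self.e_pcl, pc)]
        # Rule (8): raise the trace of the active node / current orientation.
        self.e_nodes[node][direction] += tau
        # Rule (5): update the PCL-to-PU weights with the effective reinforcement r.
        self.w_pcl = [w + beta * r * e for w, e in zip(self.w_pcl, self.e_pcl)]
        # Rule (6): update the Actor-unit weights of every map node.
        for k, traces in enumerate(self.e_nodes):
            self.w_nodes[k] = [w + beta * r * e
                               for w, e in zip(self.w_nodes[k], traces)]
        # Rule (9): decay all traces after the weight updates.
        self.e_pcl = [lam * e for e in self.e_pcl]
        self.e_nodes = [[sigma * e for e in traces] for traces in self.e_nodes]
```

A caller would create one `EligibilityLearner`, call `start_trial()` at the beginning of each trial, and call `step(...)` once per time step with the current effective reinforcement. Plain lists are used instead of an array library only to keep the sketch dependency-free.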