Hybrid MDP based integrated hierarchical Q-learning. (English) Zbl 1267.68177

Summary: As a widely used reinforcement learning method, Q-learning is bedeviled by the curse of dimensionality: the computational complexity grows dramatically with the size of the state-action space. To combat this difficulty, an integrated hierarchical Q-learning framework is proposed based on a hybrid Markov decision process (MDP) that uses temporal abstraction instead of the simple MDP. The learning process is naturally organized into multiple levels, e.g., a quantitative (lower) level and a qualitative (upper) level, which are modeled as an MDP and a semi-MDP (SMDP), respectively. This hierarchical control architecture constitutes a hybrid MDP as the model of hierarchical Q-learning, bridging the two levels of learning. The proposed hierarchical Q-learning scales up well and speeds up learning through the upper-level learning process, and hence provides an effective integrated learning and control scheme for complex problems. Several experiments are carried out on a puzzle problem in a gridworld environment and on a navigation control problem for a mobile robot. The experimental results demonstrate the effectiveness and efficiency of the proposed approach.
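To make the two-level structure concrete, the following is a minimal, illustrative Python sketch of hierarchical Q-learning on a toy gridworld: the upper (qualitative) level selects subgoal options and is updated with the SMDP Q-learning rule, discounting by the option's duration, while the lower (quantitative) level learns primitive moves toward the active subgoal with ordinary Q-learning. The grid layout, subgoals, rewards, and parameter values are assumptions made for illustration, not the paper's exact formulation.

import random

GRID = 5                                       # 5x5 gridworld (assumed)
GOAL = (4, 4)                                  # terminal state
SUBGOALS = [(2, 2), (4, 4)]                    # options available to the upper level (assumed)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # primitive moves
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1             # illustrative learning parameters

q_upper = {}   # qualitative level: (state, option) -> value, updated as an SMDP
q_lower = {}   # quantitative level: (option, state, action) -> value, updated as an MDP

def q_u(s, o): return q_upper.get((s, o), 0.0)
def q_l(o, s, a): return q_lower.get((o, s, a), 0.0)

def step(s, a):
    # Apply a primitive action; the agent stays in place at the walls.
    ns = (min(max(s[0] + a[0], 0), GRID - 1), min(max(s[1] + a[1], 0), GRID - 1))
    return ns, (1.0 if ns == GOAL else -0.01)  # assumed reward scheme

def eps_greedy(value_of, choices):
    return random.choice(choices) if random.random() < EPS else max(choices, key=value_of)

for episode in range(500):
    s = (0, 0)
    while s != GOAL:
        # Upper (qualitative) level: pick a subgoal option for the current state.
        o = eps_greedy(lambda opt: q_u(s, opt), SUBGOALS)
        s0, ret, tau = s, 0.0, 0
        # Lower (quantitative) level: ordinary Q-learning over primitive actions,
        # driven by a pseudo-reward for reaching the chosen subgoal.
        while s != o and s != GOAL and tau < 20:
            a = eps_greedy(lambda act: q_l(o, s, act), ACTIONS)
            ns, r = step(s, a)
            pseudo_r = 1.0 if ns == o else r
            target = pseudo_r + GAMMA * max(q_l(o, ns, b) for b in ACTIONS)
            q_lower[(o, s, a)] = q_l(o, s, a) + ALPHA * (target - q_l(o, s, a))
            ret += (GAMMA ** tau) * r          # accumulated discounted external reward
            s, tau = ns, tau + 1
        # SMDP Q-learning update for the completed option of duration tau.
        best_next = max(q_u(s, opt) for opt in SUBGOALS)
        q_upper[(s0, o)] = q_u(s0, o) + ALPHA * (ret + (GAMMA ** tau) * best_next - q_u(s0, o))

In the paper's setting the upper level would operate on qualitative (abstract) states rather than raw grid cells; the sketch only shows how the SMDP-style update of the upper level is coupled to the MDP-style update of the lower level.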

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68Q32 Computational learning theory

References:

[1] Sutton R, Barto A G. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998. 133–156
[2] Feng Z Y, Liang L T, Tan L, et al. Q-learning based heterogenous network self-optimization for reconfigurable network with CPC assistance. Sci China Ser F-Inf Sci, 2009, 52: 2360–2368 · Zbl 1181.68193 · doi:10.1007/s11432-009-0223-5
[3] He P, Jagannathan S. Reinforcement learning-based output feedback control of nonlinear systems with input constraints. IEEE Trans Syst Man Cybern Part B-Cybern, 2005, 35: 150–154 · doi:10.1109/TSMCB.2004.840124
[4] Kondo T, Ito K. A reinforcement learning with evolutionary state recruitment strategy for autonomous mobile robots control. Robot Auton Syst, 2004, 46: 111–124 · doi:10.1016/j.robot.2003.11.006
[5] Morimoto J, Doya K. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robot Auton Syst, 2001, 36: 37–51 · Zbl 1014.68179 · doi:10.1016/S0921-8890(01)00113-0
[6] Chen C, Dong D. Grey system based reactive navigation of mobile robots using reinforcement learning. Int J Innov Comp Inf Control, 2010, 6: 789–800
[7] Cheng D Z. Advances in automation and control research in China. Sci China Ser F-Inf Sci, 2009, 52: 1954–1963 · Zbl 1182.93002 · doi:10.1007/s11432-009-0198-2
[8] Yung N H C, Ye C. An intelligent mobile vehicle navigator based on fuzzy logic and reinforcement learning. IEEE Trans Syst Man Cybern Part B-Cybern, 1999, 29: 314–321 · doi:10.1109/3477.752807
[9] Montesanto A, Tascini G, Puliti P, et al. Navigation with memory in a partially observable environment. Robot Auton Syst, 2006, 54: 84–94 · Zbl 0982.68736 · doi:10.1016/j.robot.2005.09.015
[10] Sutton R. Learning to predict by the methods of temporal differences. Mach Learn, 1988, 3: 9–44
[11] Watkins C J C H, Dayan P. Q-learning. Mach Learn, 1992, 8: 279–292
[12] Bertsekas D P, Tsitsiklis J N. Neuro-dynamic Programming. Belmont: Athena Scientific, 1996. 36–51
[13] Chen C, Dong D, Chen Z. Grey reinforcement learning for incomplete information processing. Lect Notes Comput Sci, 2006, 3959: 399–407 · Zbl 1178.68398 · doi:10.1007/11750321_38
[14] Dong D, Chen C, Li H, et al. Quantum reinforcement learning. IEEE Trans Syst Man Cybern Part B-Cybern, 2008, 38: 1207–1220 · doi:10.1109/TSMCB.2008.925743
[15] Dong D, Chen C, Tarn T J, et al. Incoherent control of quantum systems with wavefunction controllable subspaces via quantum reinforcement learning. IEEE Trans Syst Man Cybern Part B-Cybern, 2008, 38: 957–962 · doi:10.1109/TSMCB.2008.926603
[16] Chen C, Dong D, Chen Z. Quantum computation for action selection using reinforcement learning. Int J Quantum Inf, 2006, 4: 1071–1083 · Zbl 1107.81303 · doi:10.1142/S0219749906002419
[17] Dong D, Chen C, Chen Z, et al. Quantum mechanics helps in learning for more intelligent robots. Chin Phys Lett, 2006, 23: 1691–1694 · doi:10.1088/0256-307X/23/7/010
[18] Dong D, Chen C, Zhang C, et al. Quantum robot: structure, algorithms and applications. Robotica, 2006, 24: 513–521 · doi:10.1017/S0263574705002596
[19] Peng J, Williams R J. Incremental multi-step Q-learning. Mach Learn, 1996, 22: 283–291
[20] Mahadevan S. Average reward reinforcement learning: Foundations, algorithms and empirical results. Mach Learn, 1996, 22: 159–195 · Zbl 1099.68692
[21] Althaus P, Christensen H I. Smooth task switching through behavior competition. Robot Auton Syst, 2003, 44: 241–249 · doi:10.1016/S0921-8890(03)00074-5
[22] Hallerdal M, Hallam J. Behavior selection on a mobile robot using W-learning. In: Hallam B, Floreano D, Hallam J, et al., eds. From Animals to Animats 7: Proceedings of the Seventh International Conference on Simulation of Adaptive Behavior, Edinburgh, UK, 2002. 93–102
[23] Wiering M, Schmidhuber J. HQ-Learning. Adapt Behav, 1997, 6: 219–246 · doi:10.1177/105971239700600202
[24] Barto A G, Mahadevan S. Recent advances in hierarchical reinforcement learning. Discret Event Dyn Syst-Theory Appl, 2003, 13: 41–77 · Zbl 1018.93035 · doi:10.1023/A:1022140919877
[25] Chen C, Chen Z. Reinforcement learning for mobile robot: From reaction to deliberation. J Syst Eng Electron, 2005, 16: 611–617
[26] Tsitsiklis J N, Van Roy B. An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Control, 1997, 42: 674–690 · Zbl 0914.93075 · doi:10.1109/9.580874
[27] Sutton R S, McAllester D, Singh S, et al. Policy gradient methods for reinforcement learning with function approximation. Adv Neural Inf Process Syst, 2000, 12: 1057–1063
[28] Ormoneit D, Sen S. Kernel-based reinforcement learning. Mach Learn, 2002, 49: 161–178 · Zbl 1014.68069 · doi:10.1023/A:1017928328829
[29] Sutton R, Precup D, Singh S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif Intell, 1999, 112: 181–211 · Zbl 0996.68151 · doi:10.1016/S0004-3702(99)00052-1
[30] Parr R, Russell S. Reinforcement learning with hierarchies of machines. Adv Neural Inf Process Syst, 1998, 10: 1043–1049
[31] Dietterich T G. Hierarchical reinforcement learning with the MAXQ value function decomposition. J Artif Intell Res, 2000, 13: 227–303 · Zbl 0963.68085
[32] Theocharous G. Hierarchical learning and planning in partially observable Markov decision processes. Dissertation for Doctoral Degree. East Lansing: Michigan State University, USA, 2002. 30–72
[33] Chen C, Li H, Dong D. Hybrid control for autonomous mobile robot navigation-a hierarchical Q-learning algorithm. IEEE Robot Autom Mag, 2008, 15: 37–47 · doi:10.1109/MRA.2008.921541
[34] Kuipers B. Qualitative Reasoning: Modeling and Simulation with Incomplete Knowledge. Cambridge: MIT Press, 1994. 1–27
[35] Berleant D, Kuipers B. Qualitative and quantitative simulation: Bridging the gap. Artif Intell, 1997, 95: 215–255 · Zbl 0894.68173 · doi:10.1016/S0004-3702(97)00050-7
[36] Guo M Z, Liu Y, Malec J. A new Q-learning algorithm based on the Metropolis criterion. IEEE Trans Syst Man Cybern Part B-Cybern, 2004, 34: 2140–2143 · doi:10.1109/TSMCB.2004.832154
[37] Dong D, Chen C, Chu J, et al. Robust quantum-inspired reinforcement learning for robot navigation. IEEE-ASME Trans Mechatron, 2011, in press