
Multi-agent reinforcement learning algorithm to solve a partially-observable multi-agent problem in disaster response. (English) Zbl 1487.90661

Summary: Disaster response operations typically involve multiple decision-makers, and each decision-maker must make its decisions given only incomplete information on the current situation. To account for these characteristics (decision making by multiple decision-makers with partial observations toward a shared objective), we formulate the decision problem as a decentralized partially observable Markov decision process (dec-POMDP) model. Because optimally solving a dec-POMDP model is well known to be difficult, multi-agent reinforcement learning (MARL) has been used as a solution technique. However, typical MARL algorithms are not always effective at solving dec-POMDP models. Motivated by evidence from single-agent RL, we propose a MARL algorithm augmented by pretraining; specifically, we use behavioral cloning (BC) to pretrain a neural network. We verify the effectiveness of the proposed method by solving a dec-POMDP model of a decentralized selective patient admission problem. Experimental results on three disaster scenarios show that the proposed method is a viable approach to solving dec-POMDP problems and that augmenting MARL with BC pretraining appears to offer advantages over plain MARL in terms of both solution quality and computation time.
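The pipeline the summary describes (pretrain a policy by behavioral cloning on expert demonstrations, then continue training with MARL) can be sketched for a single agent as follows. This is an illustrative sketch only, not the authors' implementation: the linear softmax policy, the synthetic "expert," and all dimensions are hypothetical stand-ins for the paper's neural network and demonstration data.

```python
import numpy as np

# Hedged sketch: behavioral cloning (BC) pretraining of one agent's policy
# on (observation, action) demonstration pairs, standing in for the
# pretraining stage that precedes MARL fine-tuning. All data is synthetic.

rng = np.random.default_rng(0)
OBS_DIM, N_ACTIONS, N_DEMOS = 4, 3, 256

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Synthetic demonstrations: a hypothetical linear "expert" labels each
# observation with its argmax action (in the paper, demonstrations come
# from a centralized expert policy instead).
obs = rng.normal(size=(N_DEMOS, OBS_DIM))
W_expert = rng.normal(size=(OBS_DIM, N_ACTIONS))
acts = np.argmax(obs @ W_expert, axis=1)

W = np.zeros((OBS_DIM, N_ACTIONS))  # agent's policy parameters, pre-BC

def nll(W):
    """Mean negative log-likelihood of expert actions under the policy."""
    p = softmax(obs @ W)
    return -np.log(p[np.arange(N_DEMOS), acts]).mean()

initial_loss = nll(W)  # equals log(N_ACTIONS) for the uniform policy
for _ in range(200):   # plain batch gradient descent on the BC loss
    grad_logits = softmax(obs @ W)
    grad_logits[np.arange(N_DEMOS), acts] -= 1.0  # softmax CE gradient
    W -= 0.5 * (obs.T @ grad_logits) / N_DEMOS
final_loss = nll(W)
```

After this stage, the pretrained parameters would initialize the MARL learner (an actor-critic in the paper's setting) rather than starting exploration from scratch, which is the mechanism the summary credits for the gains in solution quality and computation time.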

MSC:

90C90 Applications of mathematical programming
90B36 Stochastic scheduling theory in operations research
68T20 Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)

Software:

Adam
Full Text: DOI

References:

[1] Amato, C.; Dibangoye, J. S.; Zilberstein, S., Incremental policy generation for finite-horizon Dec-POMDPs, ICAPS (2009)
[2] Argon, N. T.; Ziya, S.; Righter, R., Scheduling impatient jobs in a clearing system with insights on patient triage in mass casualty incidents, Probability in the Engineering and Informational Sciences, 22, 3, 301-332 (2008) · Zbl 1211.90105
[3] Bernstein, D. S.; Givan, R.; Immerman, N.; Zilberstein, S., The complexity of decentralized control of Markov decision processes, Mathematics of Operations Research, 27, 4, 819-840 (2002) · Zbl 1082.90593
[4] Cha, M.-i.; Kim, G. W.; Kim, C. H.; Choa, M.; Choi, D. H.; Kim, I.; Lee, K. H., A study on the disaster medical response during the Mauna Ocean Resort gymnasium collapse, Journal of The Korean Society of Emergency Medicine, 28, 1, 97-108 (2017)
[5] Chan, C. W.; Farias, V. F.; Bambos, N.; Escobar, G. J., Optimizing intensive care unit discharge decisions with patient readmissions, Operations Research, 60, 6, 1323-1341 (2012) · Zbl 1269.90150
[6] Chan, T. C.; Killeen, J.; Griswold, W.; Lenert, L., Information technology and emergency medical care during disasters, Academic Emergency Medicine, 11, 11, 1229-1236 (2004)
[7] Cheng, C.-A.; Yan, X.; Wagener, N.; Boots, B., Fast policy learning through imitation and reinforcement, Conference on uncertainty in artificial intelligence (2018), Monterey, California, USA
[8] Cohen, I.; Mandelbaum, A.; Zychlinski, N., Minimizing mortality in a mass casualty event: fluid networks in support of modeling and staffing, IIE Transactions, 46, 7, 728-741 (2014)
[9] Cruz Jr, G. V.; Du, Y.; Taylor, M. E., Pre-training neural networks with human demonstrations for deep reinforcement learning, ALA 2018 workshop (2018)
[10] Dibangoye, J. S.; Amato, C.; Buffet, O.; Charpillet, F., Optimally solving Dec-POMDPs as continuous-state MDPs, Journal of Artificial Intelligence Research, 55, 443-497 (2016) · Zbl 1352.68220
[11] Einav, S.; Aharonson-Daniel, L.; Weissman, C.; Freund, H. R.; Peleg, K.; Israel Trauma Group, In-hospital resource utilization during multiple casualty incidents, Annals of Surgery, 243, 4, 533 (2006)
[12] ResearchPaper168.
[13] arXiv:1602.02672.
[14] Foerster, J. N.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S., Counterfactual multi-agent policy gradients, AAAI conference on artificial intelligence, 2974-2982 (2018), AAAI Press: AAAI Press New Orleans, Louisiana, USA
[15] arXiv:1802.05313.
[16] Gerchak, Y.; Gupta, D.; Henig, M., Reservation planning for elective surgery under uncertain demand for emergency surgery, Management Science, 42, 3, 321-334 (1996) · Zbl 0884.90106
[17] Glorot, X.; Bengio, Y., Understanding the difficulty of training deep feedforward neural networks, Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249-256 (2010)
[18] Green, L. V.; Savin, S.; Wang, B., Managing patient service in a diagnostic medical facility, Operations Research, 54, 1, 11-25 (2006) · Zbl 1167.90562
[19] Gupta, J. K.; Egorov, M.; Kochenderfer, M., Cooperative multi-agent control using deep reinforcement learning, International conference on autonomous agents and multiagent systems, 66-83 (2017), Springer: Springer Sao Paulo, Brazil
[20] Hansen, E. A.; Bernstein, D. S.; Zilberstein, S., Dynamic programming for partially observable stochastic games, Proceedings of the 19th national conference on artificial intelligence, 709-715 (2004), AAAI Press: AAAI Press San Jose, California, USA
[21] Auf der Heide, E., The importance of evidence-based disaster planning, Annals of Emergency Medicine, 47, 1, 34-49 (2006)
[22] Helm, J. E.; AhmadBeygi, S.; Van Oyen, M. P., Design and analysis of hospital admission control for operational effectiveness, Production and Operations Management, 20, 3, 359-374 (2011)
[23] Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Osband, I., Deep Q-learning from demonstrations, Thirty-second AAAI conference on artificial intelligence (2018)
[24] Hick, J. L.; Hanfling, D.; Cantrill, S. V., Allocating scarce resources in disasters: emergency department principles, Annals of Emergency Medicine, 59, 3, 177-187 (2012)
[25] Hogan, D. E.; Waeckerle, J. F.; Dire, D. J.; Lillibridge, S. R., Emergency department impact of the Oklahoma City terrorist bombing, Annals of Emergency Medicine, 34, 2, 160-167 (1999)
[26] Huh, W. T.; Liu, N.; Truong, V.-A., Multiresource allocation scheduling in dynamic environments, Manufacturing & Service Operations Management, 15, 2, 280-291 (2013)
[27] Jacobson, E. U.; Argon, N. T.; Ziya, S., Priority assignment in emergency response, Operations Research, 60, 4, 813-832 (2012) · Zbl 1260.90103
[28] Jenkins, J. L.; McCarthy, M. L.; Sauer, L. M.; Green, G. B.; Stuart, S.; Thomas, T. L.; Hsu, E. B., Mass-casualty triage: Time for an evidence-based approach, Prehospital and Disaster Medicine, 23, 1, 3-8 (2008)
[29] Kang, B.; Jie, Z.; Feng, J., Policy optimization with demonstrations, International conference on machine learning, 2474-2483 (2018)
[30] Kang, S.; Yun, S. H.; Jung, H. M.; Kim, J. H.; Han, S. B.; Kim, J. S.; Paik, J. H., An evaluation of the disaster medical system after an accident which occurred after a bus fell off the Incheon Bridge, Journal of the Korean Society of Emergency Medicine, 24, 1, 1-6 (2013)
[31] Kilic, A.; Dincer, M. C.; Gokce, M. A., Determining optimal treatment rate after a disaster, Journal of the Operational Research Society, 65, 7, 1053-1067 (2014)
[32] Kingma, D. P.; Ba, J., Adam: A method for stochastic optimization, International conference on learning representations (2015), Ithaca: Ithaca San Diego, USA
[33] Konda, V. R.; Tsitsiklis, J. N., Actor-critic algorithms, Advances in neural information processing systems 12, 1008-1014 (2000), MIT Press: MIT Press Denver, Colorado
[34] Lakshminarayanan, A. S.; Ozair, S.; Bengio, Y., Reinforcement learning with few expert demonstrations, Nips workshop on deep learning for action and interaction (2016), Barcelona, Spain
[35] Lee, H.-R.; Lee, T., Markov decision process model for patient admission decision at an emergency department under a surge demand, Flexible Services and Manufacturing Journal, 30, 1-2, 98-122 (2018)
[36] Lee, H.-R.; Lee, T., Improved cooperative multi-agent reinforcement learning algorithm augmented by mixing demonstrations from centralized policy, Proceedings of the 18th international conference on autonomous agents and multiagent systems, 1089-1098 (2019), International Foundation for Autonomous Agents and Multiagent Systems
[37] Li, D.; Glazebrook, K. D., An approximate dynamic programming approach to the development of heuristics for the scheduling of impatient jobs in a clearing system, Naval Research Logistics (NRL), 57, 3, 225-236 (2010) · Zbl 1188.90118
[38] Li, D.; Glazebrook, K. D., A Bayesian approach to the triage problem with imperfect classification, European Journal of Operational Research, 215, 1, 169-180 (2011) · Zbl 1237.90106
[39] Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I., Multi-agent actor-critic for mixed cooperative-competitive environments, Advances in neural information processing systems 30, 6379-6390 (2017), Curran Associates, Inc: Curran Associates, Inc Long beach, California, USA
[40] Manoj, B. S.; Baker, A. H., Communication challenges in emergency response, Communications of the ACM, 50, 3, 51-53 (2007)
[41] Mills, A. F.; Argon, N. T.; Ziya, S., Resource-based patient prioritization in mass-casualty incidents, Manufacturing & Service Operations Management, 15, 3, 361-377 (2013)
[42] Mills, A. F.; Argon, N. T.; Ziya, S., Dynamic distribution of patients to medical facilities in the aftermath of a disaster, Operations Research (2018) · Zbl 1443.90223
[43] Nair, A.; McGrew, B.; Andrychowicz, M.; Zaremba, W.; Abbeel, P., Overcoming exploration in reinforcement learning with demonstrations, 2018 IEEE international conference on robotics and automation (ICRA), 6292-6299 (2018), IEEE: IEEE Brisbane, QLD, Australia
[44] Szer, D.; Charpillet, F.; Zilberstein, S., MAA*: A heuristic search algorithm for solving decentralized POMDPs, Proceedings of the twenty-first conference on uncertainty in artificial intelligence, 576-583 (2005), AUAI Press: AUAI Press Edinburgh, Scotland
[45] Oliehoek, F. A.; Amato, C., A concise introduction to decentralized POMDPs (2016), Springer · Zbl 1355.68005
[46] Oliehoek, F. A.; Spaan, M. T.; Vlassis, N., Optimal and approximate Q-value functions for decentralized POMDPs, Journal of Artificial Intelligence Research, 32, 289-353 (2008) · Zbl 1182.68261
[47] Omidshafiei, S.; Pazis, J.; Amato, C.; How, J. P.; Vian, J., Deep decentralized multi-task multi-agent reinforcement learning under partial observability, Proceedings of the 34th international conference on machine learning, 70, 2681-2690 (2017), PMLR: PMLR Sydney, Australia
[48] Park, Y.; Jeong, I.; Seo, J.; Kim, J., A study on the construction of a disaster situation management system in Korea based on government 3.0 directive, WIT Transactions on The Built Environment, 150, 59-66 (2015)
[49] Peleg, K.; Kellermann, A. L., Enhancing hospital surge capacity for mass casualty events, JAMA, 302, 5, 565-567 (2009)
[50] Rajeswaran, A.; Kumar, V.; Gupta, A.; Vezzani, G.; Schulman, J.; Todorov, E.; Levine, S., Learning complex dexterous manipulation with deep reinforcement learning and demonstrations, Proceedings of robotics: Science and systems (2018), Pittsburgh, Pennsylvania, USA
[51] Ramirez-Nafarrate, A.; Hafizoglu, A. B.; Gel, E. S.; Fowler, J. W., Optimal control policies for ambulance diversion, European Journal of Operational Research, 236, 1, 298-312 (2014) · Zbl 1338.90498
[52] Repoussis, P. P.; Paraskevopoulos, D. C.; Vazacopoulos, A.; Hupert, N., Optimizing emergency preparedness and resource utilization in mass-casualty incidents, European Journal of Operational Research, 255, 2, 531-544 (2016) · Zbl 1346.90384
[53] Ross, S.; Bagnell, D., Efficient reductions for imitation learning, Proceedings of the thirteenth international conference on artificial intelligence and statistics, 661-668 (2010)
[54] Sacco, W. J.; Navin, D. M.; Fiedler, K. E.; Waddell II, R. K.; Long, W. B.; Buckman Jr, R. F., Precise formulation and evidence-based application of resource-constrained triage, Academic Emergency Medicine, 12, 8, 759-770 (2005)
[55] Sacco, W. J.; Navin, D. M.; Waddell, R. K.; Fiedler, K. E.; Long, W. B.; Buckman Jr, R. F., A new resource-constrained triage method applied to victims of penetrating injury, Journal of Trauma and Acute Care Surgery, 63, 2, 316-325 (2007)
[56] Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P., High-dimensional continuous control using generalized advantage estimation, International conference on learning representations (2016), San Juan, Puerto Rico
[57] Seo, J.; Jeong, I.; Park, Y.; Kim, J.; Lim, J., Development of open platform for enhancing disaster risk management, 2015 2nd international conference on information and communication technologies for disaster management (ICT-DM), 287-288 (2015), IEEE
[58] Sokat, K. Y.; Dolinskaya, I. S.; Smilowitz, K.; Bank, R., Incomplete information imputation in limited data environments with application to disaster response, European Journal of Operational Research, 269, 2, 466-485 (2018) · Zbl 1388.90084
[59] Subramanian, K.; Isbell Jr, C. L.; Thomaz, A. L., Exploration from demonstration for interactive reinforcement learning, Proceedings of the 2016 international conference on autonomous agents & multiagent systems, 447-456 (2016), International Foundation for Autonomous Agents and Multiagent Systems
[60] Sun, W.; Bagnell, J. A.; Boots, B., Truncated horizon policy search: Combining reinforcement learning & imitation learning, International conference on learning representations (2018)
[61] Sun, W.; Venkatraman, A.; Gordon, G. J.; Boots, B.; Bagnell, J. A., Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction, International conference on machine learning, 3309-3318 (2017)
[62] Sung, I.; Lee, T., Optimal allocation of emergency medical resources in a mass casualty incident: Patient prioritization by column generation, European Journal of Operational Research, 252, 2, 623-634 (2016) · Zbl 1346.90863
[63] Szer, D.; Charpillet, F.; Zilberstein, S., Maa*: a heuristic search algorithm for solving decentralized POMDPs, Proceedings of the twenty-first conference on uncertainty in artificial intelligence, 576-583 (2005), AUAI Press: AUAI Press Edinburgh, Scotland
[64] Tampuu, A.; Matiisen, T.; Kodelja, D.; Kuzovkin, I.; Korjus, K.; Aru, J.; Vicente, R., Multiagent cooperation and competition with deep reinforcement learning, PLoS One, 12, 4, e0172395 (2017)
[65] Thompson, S.; Nunez, M.; Garfinkel, R.; Dean, M. D., Or practice efficient short-term allocation and reallocation of patients to floors of a hospital during demand surges, Operations Research, 57, 2, 261-273 (2009)
[66] Timbie, J. W.; Ringel, J. S.; Fox, D. S.; Pillemer, F.; Waxman, D. A.; Moore, M.; Kellermann, A. L., Systematic review of strategies to manage and allocate scarce resources during mass casualty events, Annals of Emergency Medicine, 61, 6, 677-689 (2013)
[67] Vecerík, M.; Hester, T.; Scholz, J.; Wang, F.; Pietquin, O.; Piot, B.; Riedmiller, M. A., Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards, CoRR (2017)
[68] Waeckerle, J. F., Disaster planning and response, New England Journal of Medicine, 324, 12, 815-821 (1991)
[69] Wilson, D. T.; Hawe, G. I.; Coates, G.; Crouch, R. S., A multi-objective combinatorial model of casualty processing in major incident response, European Journal of Operational Research, 230, 3, 643-655 (2013) · Zbl 1317.90133
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.