Abstract
The scale of modern cloud systems and the demands of their users call for autonomous approaches to self-healing. When no operator is in the loop for self-healing actions, it is crucial to ensure that the actions taken are safe and effective. In this paper we propose SASH (Safe Autonomous Self-Healing), which uses surrogate models to estimate the safety and effectiveness of self-healing actions. SASH uses system metrics, configuration parameters, domain information and the available actions to decide on the best fault remediation action or combination of actions. The performance of the chosen action(s) is then verified by a validation block, which updates the knowledge base with how the actions performed for that fault. This data is in turn used to update the safety and effectiveness estimation algorithm. The results show that the framework remediates faults using a low number of actions while protecting against unsafe actions.
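The loop described in the abstract (estimate safety and effectiveness, pick an action, validate it, update the knowledge base) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `KnowledgeBase`, `select_action` and `min_safety`, the fault and action labels, and the use of smoothed outcome frequencies in place of the paper's surrogate models are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Hypothetical store of validated outcomes per (fault, action) pair."""
    # (fault, action) -> [trials, remediated, unsafe]
    stats: dict = field(default_factory=lambda: defaultdict(lambda: [0, 0, 0]))

    def record(self, fault, action, remediated, unsafe):
        """Validation step: store how the action performed for this fault."""
        entry = self.stats[(fault, action)]
        entry[0] += 1
        entry[1] += int(remediated)
        entry[2] += int(unsafe)

    def effectiveness(self, fault, action):
        """Smoothed remediation rate (assumed stand-in for a surrogate model)."""
        trials, remediated, _ = self.stats[(fault, action)]
        return (remediated + 1) / (trials + 2)

    def safety(self, fault, action):
        """Smoothed probability that the action has no unsafe outcome."""
        trials, _, unsafe = self.stats[(fault, action)]
        return 1 - (unsafe + 1) / (trials + 2)


def select_action(kb, fault, actions, min_safety=0.4):
    """Return the most effective action whose estimated safety is acceptable.

    Returns None when every candidate falls below the safety threshold.
    """
    safe = [a for a in actions if kb.safety(fault, a) >= min_safety]
    if not safe:
        return None
    return max(safe, key=lambda a: kb.effectiveness(fault, a))


if __name__ == "__main__":
    kb = KnowledgeBase()
    for _ in range(3):
        kb.record("high_latency", "restart_service", remediated=True, unsafe=False)
        kb.record("high_latency", "reboot_host", remediated=True, unsafe=True)
    candidates = ["reboot_host", "restart_service", "scale_out"]
    print(select_action(kb, "high_latency", candidates))
```

In this sketch an action that fixes the fault but has repeatedly been flagged unsafe is filtered out before effectiveness is compared, which mirrors the abstract's emphasis on protection against unsafe actions. Untried actions start at a neutral estimate of 0.5 for both quantities, an arbitrary choice made here so that they remain eligible.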
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
White, G., Custode, L.L., O’Brien, O. (2023). SASH: Safe Autonomous Self-Healing. In: Troya, J., et al. Service-Oriented Computing – ICSOC 2022 Workshops. ICSOC 2022. Lecture Notes in Computer Science, vol 13821. Springer, Cham. https://doi.org/10.1007/978-3-031-26507-5_12
DOI: https://doi.org/10.1007/978-3-031-26507-5_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26506-8
Online ISBN: 978-3-031-26507-5
eBook Packages: Computer Science, Computer Science (R0)