SASH: Safe Autonomous Self-Healing

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13821)

Included in the following conference series:

International Conference on Service-Oriented Computing

Abstract

Given the large scale of modern cloud systems and the demands users place on them, there is a need for autonomous approaches to self-healing. When no operator is in the loop for self-healing actions, it is crucial to ensure that the actions taken are safe and effective. In this paper we propose SASH: Safe Autonomous Self-Healing, which uses surrogate models to estimate the safety and effectiveness of self-healing actions. SASH uses system metrics, configuration parameters, domain information and available actions to decide on the best fault remediation action or combination of actions. The performance of the action(s) is then verified by a validation block that updates the knowledge base with how the actions performed for that fault. This data is then used to update the safety and effectiveness estimation algorithm. The results show that the framework successfully remediates faults with a low number of actions and with protection against unsafe actions.
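
The loop described in the abstract (estimate, select, validate, update) can be illustrated with a minimal sketch. The abstract does not specify the surrogate models, so this sketch substitutes a simple Laplace-smoothed success frequency as a stand-in estimator. The names `KnowledgeBase`, `choose_action`, `safety_threshold`, and the fault and action labels are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Hypothetical store of past outcomes per (fault, action) pair."""
    # Each record is [trials, safe_count, effective_count].
    records: dict = field(
        default_factory=lambda: defaultdict(lambda: [0, 0, 0])
    )

    def update(self, fault, action, was_safe, was_effective):
        """Record a validated outcome (the 'validation block' feedback)."""
        rec = self.records[(fault, action)]
        rec[0] += 1
        rec[1] += int(was_safe)
        rec[2] += int(was_effective)

    def estimate(self, fault, action):
        """Laplace-smoothed (safety, effectiveness) estimates in [0, 1].

        Stand-in for a surrogate model: an unseen pair yields (0.5, 0.5).
        """
        trials, safe, effective = self.records[(fault, action)]
        return (safe + 1) / (trials + 2), (effective + 1) / (trials + 2)


def choose_action(kb, fault, actions, safety_threshold=0.5):
    """Pick the most effective action whose estimated safety clears the
    threshold; return None if no action is considered safe enough."""
    best, best_eff = None, -1.0
    for action in actions:
        safety, eff = kb.estimate(fault, action)
        if safety >= safety_threshold and eff > best_eff:
            best, best_eff = action, eff
    return best


# Usage with made-up outcomes: both actions fixed the fault twice,
# but "reboot_host" was judged unsafe both times.
kb = KnowledgeBase()
for _ in range(2):
    kb.update("high_latency", "restart_service", was_safe=True, was_effective=True)
    kb.update("high_latency", "reboot_host", was_safe=False, was_effective=True)

chosen = choose_action(kb, "high_latency", ["restart_service", "reboot_host"])
print(chosen)
```

In this sketch the safety estimate acts as a hard filter and effectiveness only ranks the actions that pass it, which is one plausible reading of "protection against unsafe actions"; the paper may combine the two estimates differently.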


Author information

Authors and Affiliations

Huawei Ireland Research Centre, Townsend St, Dublin 2, D02 R156, Ireland
Gary White, Leonardo Lucio Custode & Owen O’Brien

Corresponding author

Correspondence to Gary White.

Editor information

Editors and Affiliations

University of Malaga, Málaga, Spain
Javier Troya
Politecnico di Milano, Milano, Italy
Raffaela Mirandola
University of Castilla-La Mancha, Albacete, Spain
Elena Navarro
University of the Republic, Montevideo, Uruguay
Andrea Delgado
University of Seville, Sevilla, Spain
Sergio Segura
University of Cádiz, Cádiz, Spain
Guadalupe Ortiz
Faculty of Informatics, Universita della Svizzera Italiana, Lugano, Switzerland
Cesare Pautasso
Karlsruhe Institute of Technology, Karlsruhe, Germany
Christian Zirpins
University of Seville, Seville, Spain
Pablo Fernández
ISA, Universidad de Sevilla, Sevilla, Spain
Antonio Ruiz-Cortés

About this paper

Cite this paper

White, G., Custode, L.L., O’Brien, O. (2023). SASH: Safe Autonomous Self-Healing. In: Troya, J., et al. Service-Oriented Computing – ICSOC 2022 Workshops. ICSOC 2022. Lecture Notes in Computer Science, vol 13821. Springer, Cham. https://doi.org/10.1007/978-3-031-26507-5_12

DOI: https://doi.org/10.1007/978-3-031-26507-5_12
Published: 19 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26506-8
Online ISBN: 978-3-031-26507-5
eBook Packages: Computer Science, Computer Science (R0)
