DashChef: A Metric Recommendation Service for Online Systems Using Graph Learning

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14784)

Included in the following conference series:

International Conference on Engineering of Complex Computer Systems

Abstract

Effective maintenance is critical to ensuring the high availability of modern online systems. Today's software maintenance techniques for online systems rely heavily on metrics, which are time series data that describe the real-time state of a system from various perspectives. Typically, software engineers build dashboards from metrics to aid software maintenance. Although several efforts have been devoted to metric analysis for automatic software maintenance, the primary step, i.e., dashboard generation, remains largely manual. In this paper, we develop a metric recommendation service that automates dashboard generation and greatly eases the burden of maintaining an online system. Specifically, we analyze the needs of two essential steps of online system maintenance, i.e., anomaly detection and fault diagnosis, and design a metric recommendation mechanism for each. Graph learning techniques are employed to automate metric recommendation. Our experiments demonstrate that the proposed approach achieves an F1-score of 0.912 in selecting metrics for anomaly detection and an accuracy of 0.859 in retrieving metrics for fault diagnosis, significantly outperforming the compared baselines.
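As an illustration of how an F1-score for metric selection can be read, the following minimal sketch compares a recommended metric set against a ground-truth set. The evaluation protocol of the paper is not visible in this preview, so the set-based formulation, the function name `f1_score`, and the metric names are all assumptions made for illustration only.

```python
def f1_score(recommended: set[str], relevant: set[str]) -> float:
    """Set-based F1 of recommended metrics against a ground-truth set.

    Assumption: a recommendation counts as correct when the metric name
    appears in the ground-truth set. This is not taken from the paper.
    """
    true_positives = len(recommended & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(recommended)
    recall = true_positives / len(relevant)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical metric names, chosen only for this example.
recommended = {"cpu_usage", "mem_usage", "disk_io", "net_rx"}
relevant = {"cpu_usage", "mem_usage", "net_rx", "request_latency"}

score = f1_score(recommended, relevant)
print(f"F1 = {score:.3f}")
```

Here three of the four recommended metrics are relevant and three of the four relevant metrics are recovered, so precision and recall are both 0.75 and so is the F1-score.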

Z. He and T. Huang—Co-first authors of this work.

Acknowledgments

The research is supported by the National Natural Science Foundation of China (No. 62272495) and the Guangdong Basic and Applied Basic Research Foundation (No. 2023B1515020054), and sponsored by Tencent.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Zilong He & Pengfei Chen
Tencent, Shenzhen, China
Tao Huang, Ruipeng Li & Rui Wang
School of Software Engineering, Sun Yat-sen University, Guangzhou, China
Zibin Zheng

Corresponding author

Correspondence to Pengfei Chen.

Editor information

Editors and Affiliations

School of Electrical Engineering and Computer Science, Faculty of Engineering, Architecture and Information Technology, University of Queensland, Brisbane, QLD, Australia
Guangdong Bai
Information Systems Architecture Science Research Division, National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
Fuyuki Ishikawa
IRIT, University of Toulouse, Toulouse, France
Yamine Ait-Ameur
Department of Computer Science, University of Cyprus, Nicosia, Cyprus
George A. Papadopoulos

Cite this paper

He, Z., Huang, T., Chen, P., Li, R., Wang, R., Zheng, Z. (2025). DashChef: A Metric Recommendation Service for Online Systems Using Graph Learning. In: Bai, G., Ishikawa, F., Ait-Ameur, Y., Papadopoulos, G.A. (eds) Engineering of Complex Computer Systems. ICECCS 2024. Lecture Notes in Computer Science, vol 14784. Springer, Cham. https://doi.org/10.1007/978-3-031-66456-4_1

DOI: https://doi.org/10.1007/978-3-031-66456-4_1
Published: 29 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-66455-7
Online ISBN: 978-3-031-66456-4
eBook Packages: Computer Science, Computer Science (R0)
