DashChef: A Metric Recommendation Service for Online Systems Using Graph Learning

  • Conference paper
Engineering of Complex Computer Systems (ICECCS 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14784)


Abstract

To ensure the high availability of modern online systems, effective maintenance is of critical importance. Today’s software maintenance techniques for online systems rely heavily on metrics, which are time series data that describe the real-time state of a system from various perspectives. Typically, software engineers build dashboards from metrics to aid software maintenance. Although several efforts have been devoted to metric analysis for automatic software maintenance, the primary step, i.e., dashboard generation, remains largely manual. In this paper, we develop a metric recommendation service that automates dashboard generation and greatly eases the burden of maintaining an online system. Specifically, we analyze the needs of two essential steps of online system maintenance, i.e., anomaly detection and fault diagnosis, and design a metric recommendation mechanism for each. Graph learning techniques are employed to automate metric recommendation. Our experiments demonstrate that the proposed approach achieves an F1-score of 0.912 in selecting metrics for anomaly detection and an accuracy of 0.859 in retrieving metrics for fault diagnosis, significantly outperforming the compared baselines.

Z. He and T. Huang—Co-first authors of this work.

Acknowledgments

The research is supported by the National Natural Science Foundation of China (No. 62272495) and the Guangdong Basic and Applied Basic Research Foundation (No. 2023B1515020054), and sponsored by Tencent.

Author information

Corresponding author

Correspondence to Pengfei Chen.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

He, Z., Huang, T., Chen, P., Li, R., Wang, R., Zheng, Z. (2025). DashChef: A Metric Recommendation Service for Online Systems Using Graph Learning. In: Bai, G., Ishikawa, F., Ait-Ameur, Y., Papadopoulos, G.A. (eds) Engineering of Complex Computer Systems. ICECCS 2024. Lecture Notes in Computer Science, vol 14784. Springer, Cham. https://doi.org/10.1007/978-3-031-66456-4_1

  • DOI: https://doi.org/10.1007/978-3-031-66456-4_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-66455-7

  • Online ISBN: 978-3-031-66456-4

  • eBook Packages: Computer Science, Computer Science (R0)
