Canopy: An End-to-End Performance Tracing And Analysis System

Published: 14 October 2017

Abstract

This paper presents Canopy, Facebook's end-to-end performance tracing infrastructure. Canopy records causally related performance data across the end-to-end execution path of requests, including from browsers, mobile applications, and backend services. Canopy processes traces in near real-time, derives user-specified features, and outputs to performance datasets that aggregate across billions of requests. Using Canopy, Facebook engineers can query and analyze performance data in real-time. Canopy addresses three challenges we have encountered in scaling performance analysis: supporting the range of execution and performance models used by different components of the Facebook stack; supporting interactive ad-hoc analysis of performance data; and enabling deep customization by users, from sampling traces to extracting and visualizing features. Canopy currently records and processes over 1 billion traces per day. We discuss how Canopy has evolved to apply to a wide range of scenarios, and present case studies of its use in solving various performance challenges.

Supplementary Material

MP4 File (canopy.mp4)

Published In

SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles
October 2017
677 pages
ISBN:9781450350853
DOI:10.1145/3132747
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SOSP '17

Acceptance Rates

Overall Acceptance Rate 131 of 716 submissions, 18%
