×

GEODIS: towards the optimization of data locality-aware job scheduling in geo-distributed data centers. (English) Zbl 1386.68025

Summary: Today, data-intensive applications rely on geographically distributed systems to leverage data collection, storing and processing. Data locality has been seen as a prominent technique to improve application performance and reduce the impact of network latency by scheduling jobs directly in the nodes hosting the data to be processed. MapReduce and Dryad are examples of frameworks which exploit locality by splitting jobs into multiple tasks that are dispatched to process portions of data locally. However, as the ecosystem of big data analysis has shifted from single clusters to span geo-distributed data centers, it is unavoidable that data may still be transferred through the network in order reduce the schedule length. Nevertheless, there is a lack of mechanism to efficiently blend data locality and inter-data center data transfer requirement in the existing scheduling techniques to address data-intensive processing across dispersed data centers. Therefore, the objective of this work is to propose and solve the makespan optimization problem for data-intensive job scheduling on geo-distributed data centers. To this end, we first formulate the task placement and the data access as a linear programming and use the GLPK solver to solve it. We then present a low complexity heuristic scheduling algorithm called GeoDis which allows data locality to cope with the data transfer requirement to achieve a greater performance on the makespan. The experiments with various realistic traces and synthetic generated workload show that GeoDis can reduce makespan of processing jobs by 44% as compared to the state-of-the-art algorithms and remain within 91% closer to the optimal solution by the LP solver.

MSC:

68M20 Performance evaluation, queueing, and scheduling in the context of computer systems
68M14 Distributed systems
90B35 Deterministic scheduling theory in operations research
90C05 Linear programming
90C27 Combinatorial optimization
90C46 Optimality conditions and duality in mathematical programming
Full Text: DOI

References:

[1] Abad CL, Lu Y, Campbell RH (2011) Dare: adaptive data replication for efficient cluster scheduling. In: 2011 IEEE international conference on cluster computing, pp 159-168. doi:10.1109/CLUSTER.2011.26
[2] Abawajy, JH; Deris, MM, Data replication approach with consistency guarantee for data grid, IEEE Trans Comput, 63, 2975-2987, (2014) · Zbl 1364.68146 · doi:10.1109/TC.2013.183
[3] AWS: Amazon Web Service (2006). http://aws.amazon.com
[4] Ananthanarayanan G, Ghodsi A, Shenker S, Stoica I (2013) Effective straggler mitigation: attack of the clones. In: Presented as part of the 10th USENIX symposium on networked systems design and implementation (NSDI 13). USENIX, Lombard, IL, pp 185-198. https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/ananthanarayanan
[5] Ananthanarayanan G, Kandula S, Greenberg A, Stoica I, Lu Y, Saha B, Harris E (2010) Reining in the outliers in map-reduce clusters using mantri. In: Proceedings of the 9th USENIX conference on operating systems design and implementation, OSDI’10. USENIX Association, Berkeley, CA, USA, pp 265-278. http://dl.acm.org/citation.cfm?id=1924943.1924962
[6] Anikode LR, Tang B (2011) Integrating scheduling and replication in data grids with performance guarantee. In: Global telecommunications conference (GLOBECOM 2011), 2011 IEEE, pp 1-6. doi:10.1109/GLOCOM.2011.6134492
[7] Breslau L, Cao P, Fan L, Phillips G, Shenker S (1999) Web caching and Zipf-like distributions: evidence and implications. In: INFOCOM ’99. Eighteenth annual joint conference of the IEEE computer and communications societies. Proceedings. IEEE, vol 1, pp 126-134. doi:10.1109/INFCOM.1999.749260
[8] Cameron DG, Carvajal-Schiaffino R, Millar AP, Nicholson C, Stockinger K, Zini F (2003) Evaluating scheduling and replica optimisation strategies in optorsim. In: Proceedings. First Latin American Web Congress, pp 52-59 (2003). doi:10.1109/GRID.2003.1261698
[9] Cardosa M, Wang C, Nangia A, Chandra A, Weissman J (2011) Exploring MapReduce efficiency with highly-distributed data. In: Proceedings of the second international workshop on MapReduce and its applications, MapReduce ’11, ACM, New York, NY, USA, pp 27-34. doi:10.1145/1996092.1996100
[10] Cavallo M, Modica GD, Polito C, Tomarchio O (2016) Application profiling in hierarchical Hadoop for geo-distributed computing environments. In: 2016 IEEE symposium on computers and communication (ISCC), pp 555-560. doi:10.1109/ISCC.2016.7543796
[11] Chen, W; Paik, I; Li, Z, Cost-aware streaming workflow allocation on geo-distributed data centers, IEEE Trans Comput, 66, 256-271, (2016) · Zbl 1364.68106 · doi:10.1109/TC.2016.2595579
[12] Chen Y, Ganapathi A, Griffith R, Katz R (2011) The case for evaluating mapreduce performance using workload suites. In: 2011 IEEE 19th annual international symposium on modelling, analysis, and simulation of computer and telecommunication systems, pp 390-399. doi:10.1109/MASCOTS.2011.12 · Zbl 1364.68146
[13] Cheng, D; Rao, J; Guo, Y; Jiang, C; Zhou, X, Improving performance of heterogeneous mapreduce clusters with adaptive task tuning, IEEE Trans Parallel Distrib Syst, 28, 774-786, (2017) · doi:10.1109/TPDS.2016.2594765
[14] Dean, J; Ghemawat, S, Mapreduce: simplified data processing on large clusters, Commun ACM, 51, 107-113, (2008) · doi:10.1145/1327452.1327492
[15] Elghirani A, Subrata R, Zomaya AY (2007) Intelligent scheduling and replication in datagrids: a synergistic approach. In: Seventh IEEE international symposium on cluster computing and the grid (CCGrid ’07), pp 179-182. doi:10.1109/CCGRID.2007.65
[16] Garg N, Kumar A, Pandit V (2007) Order scheduling models: hardness and algorithms. In: Proceedings of the 27th international conference on foundations of software technology and theoretical computer science, FSTTCS’07, Springer, Berlin, pp 96-107 · Zbl 1135.90345
[17] Google Compute Engine (2011). https://cloud.google.com/compute/
[18] Greenberg, A; Hamilton, J; Maltz, DA; Patel, P, The cost of a cloud: research problems in data center networks, SIGCOMM Comput Commun Rev, 39, 68-73, (2008) · doi:10.1145/1496091.1496103
[19] Apache Hadoop Project (2013). http://hadoop.apache.org · Zbl 1243.68129
[20] Heintz, B; Chandra, A; Sitaraman, RK; Weissman, J, End-to-end optimization for geo-distributed mapreduce, IEEE Trans Cloud Comput, 4, 293-306, (2016) · doi:10.1109/TCC.2014.2355225
[21] Herodotou H, Dong F, Babu S (2011) No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics. In: Proceedings of the 2nd ACM symposium on cloud computing, SOCC ’11, ACM, New York, NY, USA, pp 18:1-18:14. doi:10.1145/2038916.2038934
[22] Hu Z, Li B, Luo J (2016) Flutter: scheduling tasks closer to data across geo-distributed datacenters. In: IEEE INFOCOM 2016 - the 35th annual IEEE international conference on computer communications, pp 1-9. doi:10.1109/INFOCOM.2016.7524469
[23] Hung CC, Golubchik L, Yu M (2015) Scheduling jobs across geo-distributed datacenters. In: Proceedings of the sixth acm symposium on cloud computing, SoCC ’15, ACM, New York, NY, USA , pp 111-124. doi:10.1145/2806777.2806780
[24] Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2007 Eurosys conference. Association for Computing Machinery, Inc., Lisbon, Portugal. http://research.microsoft.com/apps/pubs/default.aspx?id=63785
[25] Jalaparti V, Ballani H, Costa P, Karagiannis T, Rowstron A (2012) Bridging the tenant-provider gap in cloud services. In: Proceedings of the third ACM symposium on cloud computing, SoCC ’12, ACM, New York, NY, USA, pp 10:1-10:14. doi:10.1145/2391229.2391239
[26] Jalaparti, V; Bodik, P; Menache, I; Rao, S; Makarychev, K; Caesar, M, Network-aware scheduling for data-parallel jobs: plan when you can, SIGCOMM Comput Commun Rev, 45, 407-420, (2015) · doi:10.1145/2829988.2787488
[27] Jin Y, Gao Y, Qian Z, Zhai M, Peng H, Lu S (2016) Workload-aware scheduling across geo-distributed data centers. In: 2016 IEEE Trustcom/BigDataSE/ISPA, pp 1455-1462. doi:10.1109/TrustCom.2016.0228
[28] Jolfaei F, Haghighat AT (2012) The impact of bandwidth and storage space on job scheduling and data replication strategies in data grids. In: Computing technology and information management (ICCM), 2012 8th international conference on, vol 1, pp 283-288
[29] Kloudas, K; Mamede, M; Preguiça, N; Rodrigues, R, Pixida: optimizing data parallel jobs in wide-area data analytics, Proc VLDB Endow, 9, 72-83, (2015) · doi:10.14778/2850578.2850582
[30] Koshiba Y, Chen W, Yamada Y, Tanaka T, Paik I (2015) Investigation of network traffic in geo-distributed data centers. In: 2015 IEEE 7th international conference on awareness science and technology (iCAST), pp 174-179 (2015). doi:10.1109/ICAwST.2015.7314042
[31] Kwok, YK; Ahmad, I, Fastest: a practical low-complexity algorithm for compile-time assignment of parallel programs to multiprocessors, IEEE Trans Parallel Distrib Syst, 10, 147-159, (1999) · doi:10.1109/71.752781
[32] Lee, YC; Zomaya, AY, Practical scheduling of bag-of-tasks applications on grids with dynamic resilience, IEEE Trans Comput, 56, 815-825, (2007) · Zbl 1390.68135 · doi:10.1109/TC.2007.1042
[33] Li, P; Guo, S; Miyazaki, T; Liao, X; Jin, H; Zomaya, A; Wang, K, Traffic-aware geo-distributed big data analytics with predictable job completion time, IEEE Trans Parallel Distrib Syst, 28, 1785-1796, (2016) · doi:10.1109/TPDS.2016.2626285
[34] Li P, Guo S, Yu S, Zhuang W (2015) Cross-cloud mapreduce for big data. IEEE Trans Cloud Comput 26(3):1-14. doi:10.1109/TCC.2015.2474385
[35] Li S, Lu Q, Zhang W, Zhu L (2015) A mapreduce cluster deployment optimization framework with geo-distributed data. In: 2015 IEEE 12th Intl Conf on ubiquitous intelligence and computing and 2015 IEEE 12th intl conf on autonomic and trusted computing and 2015 IEEE 15th intl conf on scalable computing and communications and its associated workshops (UIC-ATC-ScalCom), pp 943-949. doi:10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.179
[36] Li, W., Yang, Y., Yuan, D.: A novel cost-effective dynamic data replication strategy for reliability in cloud data centres. In: Dependable, autonomic and secure computing (DASC), 2011 IEEE ninth international conference on, pp 496-502. doi:10.1109/DASC.2011.95
[37] Liao X, Gao Z, Ji W, Wang Y (2015) An enforcement of real time scheduling in spark streaming. In: Green computing conference and sustainable computing conference (IGSC), 2015 sixth international, pp 1-6. doi:10.1109/IGCC.2015.7393730
[38] Lin W, Qian Z, Xu J, Yang S, Zhou J, Zhou L (2016) Streamscope: continuous reliable distributed processing of big data streams. In: 13th USENIX symposium on networked systems design and implementation (NSDI 16), USENIX Association, Santa Clara, CA, pp 439-453. https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/lin
[39] Makhorin A (2012) Gnu linear programming kit, version 4.52. http://www.gnu.org/software/glpk/glpk.html
[40] Mandal A, Xin Y, Baldine I, Ruth P, Heerman C, Chase J, Orlikowski V, Yumerefendi A (2011) Provisioning and evaluating multi-domain networked clouds for Hadoop-based applications. In: 2011 IEEE third international conference on cloud computing technology and science, pp 690-697. doi:10.1109/CloudCom.2011.107 · Zbl 1390.68135
[41] Microsoft Azure (2010). https://azure.microsoft.com/
[42] Nguyen VH, Tuong NH, Tran VH, Thoai N (2013) An MILP-based makespan minimization model for single-machine scheduling problem with splitable jobs and availability constraints. In: Computing, management and telecommunications (ComManTel), 2013 international conference on, pp 397-400. doi:10.1109/ComManTel.2013.6482427
[43] Pu, Q; Ananthanarayanan, G; Bodik, P; Kandula, S; Akella, A; Bahl, P; Stoica, I, Low latency geo-distributed data analytics, SIGCOMM Comput Commun Rev, 45, 421-434, (2015) · doi:10.1145/2829988.2787505
[44] Pu Q, Ananthanarayanan G, Bodik P, Kandula S, Akella A, Bahl P, Stoica I (2015) Low latency geo-distributed data analytics. In: Proceedings of the 2015 ACM conference on special interest group on data communication, SIGCOMM ’15, ACM, New York, NY, USA, pp 421-434. doi:10.1145/2785956.2787505
[45] Rackspace (1998). https://www.rackspace.com/
[46] Schrage L (1968) A proof of the optimality of the shortest remaining processing time discipline. Oper Res 16(3):687-690. https://www.rackspace.com/ · Zbl 0237.60039
[47] Sih, GC; Lee, EA, A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures, IEEE Trans Parallel Distrib Syst, 4, 175-187, (1993) · doi:10.1109/71.207593
[48] Sooezi N, Abrishami S, Lotfian M (2015) Scheduling data-driven workflows in multi-cloud environment. In: 2015 IEEE 7th international conference on cloud computing technology and science (CloudCom), pp 163-167. doi:10.1109/CloudCom.2015.95
[49] Apache Spark? (2013). http://spark.apache.org/
[50] Toosi AN, Buyya R (2015) A fuzzy logic-based controller for cost and energy efficient load balancing in geo-distributed data centers. In: 2015 IEEE/ACM 8th international conference on utility and cloud computing (UCC), pp 186-194. doi:10.1109/UCC.2015.35 · Zbl 1364.68106
[51] Tripathi, R; Vignesh, S; Tamarapalli, V; Medhi, D, Cost efficient design of fault tolerant geo-distributed data centers, IEEE Trans Network Service Manag, 14, 289-301, (2017) · doi:10.1109/TNSM.2017.2691007
[52] Tudoran, R; Costan, A; Antoniu, G, Overflow: multi-site aware big data management for scientific workflows on clouds, IEEE Trans Cloud Comput, 4, 76-89, (2016) · doi:10.1109/TCC.2015.2440254
[53] Venugopal, S; Buyya, R, An SCP-based heuristic approach for scheduling distributed data-intensive applications on global grids, J Parallel Distrib Comput, 68, 471-487, (2008) · Zbl 1243.68129 · doi:10.1016/j.jpdc.2007.07.004
[54] Vulimiri A, Curino C, Godfrey PB, Jungblut T, Karanasos K, Padhye J, Varghese G (2015) Wanalytics: geo-distributed analytics for a data intensive world. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD ’15, ACM, New York, NY, USA, pp 1087-1092. doi:10.1145/2723372.2735365
[55] Vulimiri A, Curino C, Godfrey PB, Jungblut T, Padhye J, Varghese G (2015) Global analytics in the face of bandwidth and regulatory constraints. In: 12th usenix symposium on networked systems design and implementation (NSDI 15), USENIX Association, Oakland, CA, pp 323-336. https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/vulimiri
[56] Wang L, Tao J, Ranjan R, Marten H, Streit A, Chen J, Chen D (2013) G-Hadoop: mapreduce across distributed data centers for data-intensive computing. Future Gener Comput Syst 29(3):739-750. doi:10.1016/j.future.2012.09.001. Special section: recent developments in high performance computing and security
[57] Zarina M, Ahmad F, bin Mohd Rose AN, Nordin M, Deris MM (2013) Job scheduling for dynamic data replication strategy in heterogeneous federation data grid systems. In: Informatics and applications (ICIA), 2013 second international conference on, pp 203-206. doi:10.1109/ICoIA.2013.6650256
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.