# American Institute of Mathematical Sciences

doi: 10.3934/jimo.2018014

## Performance optimization of parallel-distributed processing with checkpointing for cloud environment

 1 Graduate School of Informatics, Kyoto University, Yoshida-Hommachi, Sakyo-ku, Kyoto 606-8501, Japan 2 Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan

Received  December 2016 Revised  August 2017 Published  January 2018

Citation: Tsuguhito Hirai, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance optimization of parallel-distributed processing with checkpointing for cloud environment. Journal of Industrial & Management Optimization, doi: 10.3934/jimo.2018014
##### References:
 [1] L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, California, 2009. doi: 10.2200/S00193ED1V01Y200905CAC006. [2] T. C. Bressoud and M. A. Kozuch, Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009). doi: 10.1109/CLUSTR.2009.5289185. [3] C. L. P. Chen and C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014), 314-347. doi: 10.1016/j.ins.2014.01.015. [4] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010). [5] J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006), 303-312. doi: 10.1016/j.future.2004.11.016. [6] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), 107-113. doi: 10.1145/1327452.1327492. [7] J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009). [8] J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010), 72-77. doi: 10.1145/1629175.1629198. [9] S. Di, Y. Robert, F. Vivien, D. Kondo, C. -L. Wang and F. Cappello, Optimization of cloud task processing with checkpoint-restart mechanism in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013). doi: 10.1145/2503210.2503217. [10] L. Fialho, D. Rexachs and E. Luque, What is missing in current checkpoint interval models?, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 322-332. doi: 10.1109/ICDCS.2011.12. [11] B. Javadi, D. Kondo, A. Iosup and D. Epema, The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013), 1208-1223. doi: 10.1016/j.jpdc.2013.04.002. [12] H. Jin, Y. Chen, H. Zhu and X.-H. Sun, Optimizing HPC fault-tolerant environment: An analytical approach, Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010), 525-534. doi: 10.1109/ICPP.2010.80. [13] A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, C. Fetzer and A. Brito, Low-overhead fault tolerance for high-throughput data processing systems, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 689-699. doi: 10.1109/ICDCS.2011.29. [14] P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145,2011. doi: 10.6028/NIST.SP.800-145. [15] M. Taifi, J. Y. Shi and A. Khreishah, SpotMPI: A framework for auction-based HPC computing using Amazon spot instances, Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011), 109-120. doi: 10.1007/978-3-642-24669-2_11. [16] J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974), 530-531. doi: 10.1145/361147.361115. [17] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker and I. Stoica, Discretized streams: Fault-tolerant streaming computation at scale, Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013), 423-438. doi: 10.1145/2517349.2522737.

show all references

##### References:
 [1] L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, California, 2009. doi: 10.2200/S00193ED1V01Y200905CAC006. [2] T. C. Bressoud and M. A. Kozuch, Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009). doi: 10.1109/CLUSTR.2009.5289185. [3] C. L. P. Chen and C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014), 314-347. doi: 10.1016/j.ins.2014.01.015. [4] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010). [5] J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006), 303-312. doi: 10.1016/j.future.2004.11.016. [6] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), 107-113. doi: 10.1145/1327452.1327492. [7] J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009). [8] J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010), 72-77. doi: 10.1145/1629175.1629198. [9] S. Di, Y. Robert, F. Vivien, D. Kondo, C. -L. Wang and F. Cappello, Optimization of cloud task processing with checkpoint-restart mechanism in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013). doi: 10.1145/2503210.2503217. [10] L. Fialho, D. Rexachs and E. Luque, What is missing in current checkpoint interval models?, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 322-332. doi: 10.1109/ICDCS.2011.12. [11] B. Javadi, D. Kondo, A. Iosup and D. Epema, The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013), 1208-1223. doi: 10.1016/j.jpdc.2013.04.002. [12] H. Jin, Y. Chen, H. Zhu and X.-H. Sun, Optimizing HPC fault-tolerant environment: An analytical approach, Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010), 525-534. doi: 10.1109/ICPP.2010.80. [13] A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, C. Fetzer and A. Brito, Low-overhead fault tolerance for high-throughput data processing systems, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 689-699. doi: 10.1109/ICDCS.2011.29. [14] P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145,2011. doi: 10.6028/NIST.SP.800-145. [15] M. Taifi, J. Y. Shi and A. Khreishah, SpotMPI: A framework for auction-based HPC computing using Amazon spot instances, Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011), 109-120. doi: 10.1007/978-3-642-24669-2_11. [16] J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974), 530-531. doi: 10.1145/361147.361115. [17] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker and I. Stoica, Discretized streams: Fault-tolerant streaming computation at scale, Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013), 423-438. doi: 10.1145/2517349.2522737.
Processing of a subtask with checkpointing method
Mean task-processing time with respect to the number of checkpoints for various $M$ ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Mean task-processing time with respect to the number of checkpoints for various $b$ ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Mean task-processing time with respect to the number of checkpoints for various $c$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Mean task-processing time with respect to the number of checkpoints for various $f$ ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Mean task-processing time with respect to the number of checkpoints for various $r$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $c = 300$ [sec]): Comparison between the results of analysis and simulation
Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison between the results of previous and proposal analyses and simulation
Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison among three distributions for the time intervals between consecutive worker failures
Mean task-processing time with respect to small $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $f = 1$ to $7$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Parameter set.
 Parameter Description Value $M$ Number of workers $10$ to $1,000$ $b$ Subtask-processing time $6$ to $120$ [hour] $c$ Time to make a checkpoint $30$ to $3,000$ [sec] $f$ Mean time between worker failures $7$ to $180$ [day] $r$ Time to resume a failed subtask $30$ to $3,000$ [sec] $K$ Number of checkpoints $0$ to $30$
 Parameter Description Value $M$ Number of workers $10$ to $1,000$ $b$ Subtask-processing time $6$ to $120$ [hour] $c$ Time to make a checkpoint $30$ to $3,000$ [sec] $f$ Mean time between worker failures $7$ to $180$ [day] $r$ Time to resume a failed subtask $30$ to $3,000$ [sec] $K$ Number of checkpoints $0$ to $30$
 [1] Tsuguhito Hirai, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance analysis of large-scale parallel-distributed processing with backup tasks for cloud computing. Journal of Industrial & Management Optimization, 2014, 10 (1) : 113-129. doi: 10.3934/jimo.2014.10.113 [2] Kyosuke Hashimoto, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance analysis of backup-task scheduling with deadline time in cloud computing. Journal of Industrial & Management Optimization, 2015, 11 (3) : 867-886. doi: 10.3934/jimo.2015.11.867 [3] Bin Zheng, Min Fan, Mengqi Liu, Shang-Chia Liu, Yunqiang Yin. Parallel-machine scheduling with potential disruption and positional-dependent processing times. Journal of Industrial & Management Optimization, 2017, 13 (2) : 697-711. doi: 10.3934/jimo.2016041 [4] Jiping Tao, Zhijun Chao, Yugeng Xi. A semi-online algorithm and its competitive analysis for a single machine scheduling problem with bounded processing times. Journal of Industrial & Management Optimization, 2010, 6 (2) : 269-282. doi: 10.3934/jimo.2010.6.269 [5] Weidong Bao, Haoran Ji, Xiaomin Zhu, Ji Wang, Wenhua Xiao, Jianhong Wu. ACO-based solution for computation offloading in mobile cloud computing. Big Data & Information Analytics, 2016, 1 (1) : 1-13. doi: 10.3934/bdia.2016.1.1 [6] Leiyang Wang, Zhaohui Liu. Heuristics for parallel machine scheduling with batch delivery consideration. Journal of Industrial & Management Optimization, 2014, 10 (1) : 259-273. doi: 10.3934/jimo.2014.10.259 [7] Hongtruong Pham, Xiwen Lu. The inverse parallel machine scheduling problem with minimum total completion time. Journal of Industrial & Management Optimization, 2014, 10 (2) : 613-620. doi: 10.3934/jimo.2014.10.613 [8] Arminda Moreno-Díaz, Gabriel de Blasio, Moreno-Díaz Jr.. Distributed, layered and reliable computing nets to represent neuronal receptive fields. Mathematical Biosciences & Engineering, 2014, 11 (2) : 343-361. doi: 10.3934/mbe.2014.11.343 [9] Sze-Bi Hsu, Christopher A. Klausmeier, Chiu-Ju Lin. Analysis of a model of two parallel food chains. Discrete & Continuous Dynamical Systems - B, 2009, 12 (2) : 337-359. doi: 10.3934/dcdsb.2009.12.337 [10] Louis Caccetta, Syarifah Z. Nordin. Mixed integer programming model for scheduling in unrelated parallel processor system with priority consideration. Numerical Algebra, Control & Optimization, 2014, 4 (2) : 115-132. doi: 10.3934/naco.2014.4.115 [11] Le Thi Hoai An, Tran Duc Quynh, Kondo Hloindo Adjallah. A difference of convex functions algorithm for optimal scheduling and real-time assignment of preventive maintenance jobs on parallel processors. Journal of Industrial & Management Optimization, 2014, 10 (1) : 243-258. doi: 10.3934/jimo.2014.10.243 [12] Zhichao Geng, Jinjiang Yuan. Scheduling family jobs on an unbounded parallel-batch machine to minimize makespan and maximum flow time. Journal of Industrial & Management Optimization, 2017, 13 (5) : 1-22. doi: 10.3934/jimo.2018017 [13] Ran Ma, Jiping Tao. An improved 2.11-competitive algorithm for online scheduling on parallel machines to minimize total weighted completion time. Journal of Industrial & Management Optimization, 2018, 14 (2) : 497-510. doi: 10.3934/jimo.2017057 [14] Chengxin Luo. Single machine batch scheduling problem to minimize makespan with controllable setup and jobs processing times. Numerical Algebra, Control & Optimization, 2015, 5 (1) : 71-77. doi: 10.3934/naco.2015.5.71 [15] Chuanli Zhao, Yunqiang Yin, T. C. E. Cheng, Chin-Chia Wu. Single-machine scheduling and due date assignment with rejection and position-dependent processing times. Journal of Industrial & Management Optimization, 2014, 10 (3) : 691-700. doi: 10.3934/jimo.2014.10.691 [16] Ji-Bo Wang, Mengqi Liu, Na Yin, Ping Ji. Scheduling jobs with controllable processing time, truncated job-dependent learning and deterioration effects. Journal of Industrial & Management Optimization, 2017, 13 (2) : 1025-1039. doi: 10.3934/jimo.2016060 [17] Xianyu Yu, Dar-Li Yang, Dequn Zhou, Peng Zhou. Multi-machine scheduling with interval constrained position-dependent processing times. Journal of Industrial & Management Optimization, 2018, 14 (2) : 803-815. doi: 10.3934/jimo.2017076 [18] Yazheng Dang, Fanwen Meng, Jie Sun. Convergence analysis of a parallel projection algorithm for solving convex feasibility problems. Numerical Algebra, Control & Optimization, 2016, 6 (4) : 505-519. doi: 10.3934/naco.2016023 [19] Omer Faruk Yilmaz, Mehmet Bulent Durmusoglu. A performance comparison and evaluation of metaheuristics for a batch scheduling problem in a multi-hybrid cell manufacturing system with skilled workforce assignment. Journal of Industrial & Management Optimization, 2017, 13 (5) : 1-31. doi: 10.3934/jimo.2018007 [20] Andrey Yu. Verisokin, Darya V. Verveyko, Eugene B. Postnikov, Anastasia I. Lavrova. Wavelet analysis of phase clusters in a distributed biochemical system. Conference Publications, 2011, 2011 (Special) : 1404-1412. doi: 10.3934/proc.2011.2011.1404

2016 Impact Factor: 0.994