Boundary estimation of the reliability of cluster systems based on the decomposition of the Markov model with limited recovery of nodes with accumulated failures
https://doi.org/10.17586/2226-1494-2025-25-3-574-583
Abstract
This work investigates boundary estimates of the reliability of a cluster consisting of many nodes, each of which can be in a large number of states that differ in the functions the node can still perform and in the average time needed to restore it to a healthy state. Estimating the reliability of such a cluster with Markov processes is difficult already at the stage of constructing the state-transition diagram because of its large dimension. The difficulty grows further when node recovery is limited, so that nodes requiring recovery form a queue. The proposed approach overcomes this difficulty. What distinguishes it is that it decomposes the Markov model of the cluster and refines the upper and lower boundary estimates of cluster reliability step by step, taking into account how the other nodes of the cluster slow down the recovery of each node. A particular feature of the approach is that the decomposition singles out one cluster node and builds its Markov model with added waiting states, in which the recovery of the node is delayed while the queue of other, previously failed nodes is served. Once the probabilities of all states of the selected node have been determined from its Markov model, and given that all cluster nodes are identical, the average delays until the remaining, previously failed nodes are restored to a serviceable state are computed using the hypothesis enumeration formula. These average delays are then used in the next stage of calculating the Markov model of the node, refining the delay before recovery of the selected node can start that is caused by the recovery queue of the remaining nodes.
Based on the proposed model, the availability coefficient is estimated for a cluster consisting of a large number of structurally complex nodes, each characterized by a variety of states that differ in performance and in the time needed to restore the node to its initial working condition. Owing to the decomposition, the model avoids the avalanche-like growth in the complexity of the cluster model as the number of nodes and the number of their states increase. The calculations performed show that the proposed boundary estimates converge for a cluster with a large number of structurally complex nodes. The results can be used to assess reliability and to justify the choice of cluster structure and of maintenance and recovery disciplines under accumulating failures, taking into account limited recovery resources that lead to queues of failed elements awaiting restoration. The model can also be used to analyze how the accumulation of failures in different cluster nodes affects delays in servicing the incoming request stream.
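The staged refinement described in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' model: it assumes a node with only three states (working, waiting for repair, under repair), a single repair facility, exponential failure and repair rates `lam` and `mu`, and a crude delay estimate in which each of the other nodes is independently failed with probability one minus the current node availability. All function names and numeric values are hypothetical. The sketch only shows how an optimistic start (no waiting) and a pessimistic start (every other node ahead in the queue) produce upper and lower availability estimates that tighten at each stage.

```python
def node_availability(lam, mu, wait):
    """Steady-state availability of one node (assumed 3-state model).

    The node cycles through: working (mean 1/lam), waiting for the
    repair facility (mean `wait`), under repair (mean 1/mu).
    Steady-state probabilities are proportional to mean sojourn times.
    """
    up, repair = 1.0 / lam, 1.0 / mu
    return up / (up + wait + repair)


def mean_wait(n_nodes, mu, avail):
    """Assumed delay estimate: each of the other n_nodes - 1 nodes is
    failed with probability 1 - avail and occupies the single repair
    facility for 1/mu on average before the selected node is served."""
    return (n_nodes - 1) * (1.0 - avail) / mu


def refine_bounds(lam, mu, n_nodes, tol=1e-9, max_steps=1000):
    """Refine lower and upper availability estimates stage by stage."""
    w_lo = 0.0                    # optimistic start: no queueing delay
    w_hi = (n_nodes - 1) / mu     # pessimistic start: all others queued ahead
    for step in range(1, max_steps + 1):
        a_upper = node_availability(lam, mu, w_lo)
        a_lower = node_availability(lam, mu, w_hi)
        if a_upper - a_lower < tol:
            break
        # Feed the state probabilities of this stage into the delay
        # estimate that the next stage of the node model will use.
        w_lo = mean_wait(n_nodes, mu, a_upper)
        w_hi = mean_wait(n_nodes, mu, a_lower)
    return a_lower, a_upper, step


if __name__ == "__main__":
    # Hypothetical parameters: failure rate 0.001/h, repair rate 0.1/h, 10 nodes.
    lower, upper, steps = refine_bounds(lam=0.001, mu=0.1, n_nodes=10)
    print(f"availability in [{lower:.9f}, {upper:.9f}] after {steps} stages")
```

In this toy setting the delay estimate grows with the waiting time fed into it, so the optimistic sequence of delays can only increase and the pessimistic one can only decrease, and the two availability estimates bracket a common limit. The model in the paper has many more node states and its own delay formula, so this sketch conveys only the shape of the iteration.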
About the Authors
V. A. Bogatyrev
Russian Federation
Vladimir A. Bogatyrev — D.Sc., Professor
Saint Petersburg, 190000;
Professor
Saint Petersburg, 197101
sc 7006571069
S. V. Bogatyrev
Russian Federation
Stanislav V. Bogatyrev — Consulting Engineer
Saint Petersburg, 195027;
PhD Student
Saint Petersburg, 197101
sc 57183002200
A. V. Bogatyrev
Russian Federation
Anatoly V. Bogatyrev — PhD, Consulting Engineer
Saint Petersburg, 195027
sc 56549712700
For citations:
Bogatyrev V.A., Bogatyrev S.V., Bogatyrev A.V. Boundary estimation of the reliability of cluster systems based on the decomposition of the Markov model with limited recovery of nodes with accumulated failures. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2025;25(3):574-583. (In Russ.) https://doi.org/10.17586/2226-1494-2025-25-3-574-583