This paper discusses four types of system failure that may occur in a distributed system. It considers which, if any, of these four types also apply to a centralized system, and it describes how to resolve two of the four.
Keywords: system failure, distributed system, centralized system, resolution, information technology
System failure occurs, quite literally, when a system fails to meet the requirements that have been set for it (Berk, 2012), regardless of the type of system in which the failure occurs. A system generally takes one of two forms: distributed or centralized. A distributed system consists of a group of autonomous computers that are connected through a network and use middleware to coordinate their activity, sharing resources in such a way that they are perceived as a single computing facility (Emmerich, 1997). System failure in a distributed system may take many forms, including but not limited to split brain, inconsistent failure detection, site failures, and delayed messages (Ventura Networks, 2003).
A split brain occurs when computers within a distributed system lose the ability to communicate with each other but continue operating independently, which results in conflicting changes to the data and the potential corruption of that data (Ventura Networks, 2003).
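The divergence described above can be illustrated with a minimal sketch. The `Replica` class, the site names, and the `balance` key are all hypothetical and exist only for illustration; the sketch assumes a simple model in which each write is copied to whichever peers are still reachable.

```python
class Replica:
    """A single node holding its own copy of a key-value store."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, peers):
        # Apply the write locally, then copy it to every peer still reachable.
        self.data[key] = value
        for peer in peers:
            peer.data[key] = value


a = Replica("site-a")
b = Replica("site-b")

# Healthy network: a write on one replica reaches the other.
a.write("balance", 100, peers=[b])

# Partition: each side keeps accepting writes but cannot reach the other.
a.write("balance", 80, peers=[])
b.write("balance", 130, peers=[])

# The two copies of the same record now disagree.
diverged = a.data["balance"] != b.data["balance"]
print(a.data, b.data, diverged)
```

Once the network is restored, neither copy can be treated as authoritative without some further rule or human decision.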
Inconsistent failure detection occurs when failures within the system are not detected properly, when different sets of failures are detected, or when failures are detected in an incorrect order (Ventura Networks, 2003).
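One way this can arise is sketched below, under the assumption (not stated in the source) that failures are detected through heartbeat timeouts and that two observers hold different records of when each node was last heard from. The function name, node names, and timestamps are hypothetical.

```python
def detect_failures(last_heartbeat, now, timeout):
    """Return the set of nodes whose last heartbeat is older than `timeout` seconds."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


# Hypothetical heartbeat timestamps (in seconds) recorded by two observers.
# Observer 2 missed node "c"'s latest heartbeat, so its record for "c" is stale.
observer_1 = {"a": 98.0, "b": 91.0, "c": 99.0}
observer_2 = {"a": 98.0, "b": 91.0, "c": 93.0}

now = 100.0
view_1 = detect_failures(observer_1, now, timeout=5.0)
view_2 = detect_failures(observer_2, now, timeout=5.0)

# The observers agree that "b" has failed but disagree about "c".
print(sorted(view_1), sorted(view_2), view_1 == view_2)
```

Both observers apply the same rule, yet they reach different conclusions about which nodes have failed, which is the "different sets of failures" case.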
Site failures occur when all computers at a particular site cease to function, for example because every machine loses power, because a component fails, or because of a power surge (Ventura Networks, 2003).
Delayed messages arise when message delivery is “not synchronized with the detection of failures that occur in the system.” If part of the system fails and messages are consequently not received in their proper order, the result is processing problems and delays in responding to requests made of the system (Ventura Networks, 2003). Of the four problems described, all but split brain may occur in a centralized system as well, where they can cause data corruption, slow response times, and loss of data. To better understand these issues, the remainder of this paper takes a more in-depth look at two of them, split brain and site failures, and describes how each can be resolved.
Split brain scenarios may be resolved through conflict management in the database itself. One program that detects such conflicts is CouchDB, which recognizes when two different revisions exist for the same document and flags the document as being in conflict. The designated administrator is then responsible for determining which revision the system should use. Once that decision has been made, it is applied through CouchDB so that both sides of the system hold the same information. Alternatively, the administrator may decide that the revisions from both locations need to be kept and choose to merge the documents; the issue is still resolved, because both sides will again contain the same information. An administrator who does not wish to rely on a program such as CouchDB to resolve the issue may write a curl script that performs the same steps; however, this may take slightly longer, given that programs such as CouchDB already have the necessary logic in place (CouchDB, 2013).
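The detect, decide, and apply steps described above can be sketched with an in-memory model. This is not CouchDB's API: the dictionaries, the `rev` field, and the `find_conflicts` and `resolve` helpers are hypothetical stand-ins chosen only to show the shape of the workflow.

```python
def find_conflicts(side_a, side_b):
    """Return ids of documents present on both sides with different revisions."""
    return sorted(
        doc_id
        for doc_id in side_a.keys() & side_b.keys()
        if side_a[doc_id]["rev"] != side_b[doc_id]["rev"]
    )


def resolve(side_a, side_b, doc_id, chosen):
    """Write the administrator's chosen (or merged) document to both sides."""
    side_a[doc_id] = dict(chosen)
    side_b[doc_id] = dict(chosen)


# Hypothetical state after a partition: both sides edited "invoice-7".
side_a = {
    "invoice-7": {"rev": "2-a", "total": 80},
    "invoice-8": {"rev": "1-x", "total": 15},
}
side_b = {
    "invoice-7": {"rev": "2-b", "total": 130},
    "invoice-8": {"rev": "1-x", "total": 15},
}

conflicts = find_conflicts(side_a, side_b)

# Here the administrator keeps side A's revision; a merge would instead
# build a new document from both revisions and pass it as `chosen`.
for doc_id in conflicts:
    resolve(side_a, side_b, doc_id, chosen=side_a[doc_id])

print(conflicts, side_a == side_b)
```

Whichever choice the administrator makes, the outcome is the same: after resolution, both sides hold identical documents.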
Site failures, as previously mentioned, occur when all the computers at a specific location cease to function, which may happen for many different reasons. If the site failure is the result of a power outage alone, the company must contact the power company to determine how long the outage will take to resolve. If the outage is the fault of the power company, the site can guard against future occurrences by purchasing and installing a generator that allows the data center to keep functioning for a given amount of time while the power company works to restore service. This is a common preventive measure at most data centers today. If the loss of power is due to a tripped circuit breaker, the breaker simply needs to be reset and the systems restarted; however, the systems should be turned on one at a time in order to determine what caused the original overload. If the cause was a power surge, for example from lightning striking a transformer, the company may wish to install a whole-building surge protector to prevent the issue from recurring. Finally, if the site failure is the result of a component issue, each machine must be turned on one at a time and put through a diagnostic process to determine the specific cause of the issue and where it lies, with faulty components replaced as they are found in order to prevent another total site failure.
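The one-at-a-time restart procedure can be expressed as a short sketch. The `power_on` and `run_diagnostic` hooks, the machine names, and the diagnostic results are all hypothetical; in practice these hooks would interact with real hardware or management tooling.

```python
def staged_restart(machines, power_on, run_diagnostic):
    """Power machines on one at a time and collect those that fail diagnostics.

    `power_on` and `run_diagnostic` are caller-supplied, hypothetical hooks;
    `run_diagnostic` returns a list of faulty component names (empty if healthy).
    """
    faulty = {}
    for machine in machines:
        power_on(machine)
        problems = run_diagnostic(machine)
        if problems:
            faulty[machine] = problems
    return faulty


# Stand-in hooks for demonstration only.
powered = []
fake_results = {"db-01": [], "db-02": ["power supply"], "web-01": []}

report = staged_restart(
    machines=["db-01", "db-02", "web-01"],
    power_on=powered.append,
    run_diagnostic=lambda m: fake_results[m],
)
print(powered, report)
```

Restarting machines sequentially rather than all at once means that a fault can be attributed to the specific machine that was powered on when it appeared.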
Understanding the different ways in which a distributed system may fail makes it possible to create contingency plans for dealing with a failure or for preventing it from occurring in the future. The failures described here are not the only ways in which a system may fail, though they are some of the most common. When all administrators and technicians have a firm grasp of the different possibilities, they are able to take a potential disaster and resolve it quickly, efficiently, and with minimal interruption to day-to-day activities, which is the end goal of any administrator.
- Berk, J. (2012). Systems failure analysis. Retrieved from http://www.jhberkandassociates.com/systems_failure_analysis.htm
- CouchDB. (2012). Conflict management. Retrieved from http://guide.couchdb.org/draft/conflicts.html
- Emmerich, W. (1997). Distributed system principles. Retrieved from http://www0.cs.ucl.ac.uk/staff/ucacwxe/lectures/ds98-99/dsee3.pdf
- Ventura Networks. (2003, October 23). Failure scenarios and mistakes commonly found in distributed systems. Retrieved from http://www.venturanetworksinc.com/failure_scenarios_whitepaper.pdf