Handling Byzantine Faults

Dealing with Byzantine Faults: CS 686 Final Project, brought to you by Chris Sosa

Overview: Motivation in Dependable Systems; Common Types of Byzantine Faults; Solutions in Real Systems

The Myths: (1) Hardware cannot be “traitorous”! (anthropomorphic model; any system with consensus is susceptible). (2) It’s never happened before (often misclassified; Legionnaire's Disease).

The Awful Truth: Time-Triggered Architecture (radioactive fault injection to one node; messed up timing protocol (SOS); formed cliques until system failed). Quad Redundant Control System (no message exchange; lots of redundancy; one fault propagated to look like many). Professor Knight’s Computer.

Trends in Dependable Systems: Device Physics (smaller and faster not always better; cosmic rays, etc.); Movement to Distributed Topologies; Usage of Commercial off-the-shelf (COTS) Technology

Common Types of Observed Faults: Value (issues related to digital values being the extremes of an analog signal; propagation); Temporal (different observations at the same time; synchronization doesn’t help very much); Value + Temporal
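
A value fault of the kind listed above can be sketched in a few lines. This is an illustrative model only: the receiver names, threshold voltages, and the 3.3 V rail are assumptions, not taken from any system discussed in the slides. It shows how one marginal analog level can be read as different digital values by different receivers.

```python
# Hypothetical receivers whose input thresholds differ slightly
# (assumed values, e.g. from manufacturing tolerance).
THRESHOLDS = {"A": 1.4, "B": 1.5, "C": 1.6}

def decode(voltage, threshold):
    """Interpret an analog voltage as a digital bit."""
    return 1 if voltage >= threshold else 0

def observe(voltage):
    """Return the bit each receiver sees for the same wire voltage."""
    return {name: decode(voltage, t) for name, t in THRESHOLDS.items()}

clean_high = observe(3.3)   # every receiver reads 1
clean_low = observe(0.0)    # every receiver reads 0
marginal = observe(1.5)     # receivers disagree on the same signal
```

With a clean rail voltage all three receivers agree; with the marginal 1.5 V level, A and B read 1 while C reads 0, so a single faulty sender appears to have sent different values to different nodes.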

Solutions (1) Full Exchange: Uses classical Byzantine agreement; SPIDER bus (ROBUS) design
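
For reference, classical Byzantine agreement with oral messages can be sketched as the recursive exchange below. It is a minimal sketch of the textbook OM(m) algorithm, not the SPIDER/ROBUS implementation; the node numbering, the `DEFAULT` order, and the traitor's behaviour in `send` are assumptions chosen to keep the run deterministic.

```python
from collections import Counter

DEFAULT = "RETREAT"  # assumed fallback when no strict majority exists

def send(sender, receiver, value, traitors):
    """Loyal nodes relay faithfully. A traitor (assumed behaviour) tells
    even-numbered receivers ATTACK and odd-numbered receivers RETREAT."""
    if sender in traitors:
        return "ATTACK" if receiver % 2 == 0 else "RETREAT"
    return value

def majority(votes):
    """Strict majority of the votes, or DEFAULT if there is none."""
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) / 2 else DEFAULT

def om(m, commander, lieutenants, value, traitors):
    """Return {lieutenant: decided value} after running OM(m)."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {l: send(commander, l, value, traitors) for l in lieutenants}
    if m == 0:
        return received
    # Step 2: each lieutenant relays what it received via OM(m - 1).
    relayed = {
        i: om(m - 1, i, [j for j in lieutenants if j != i], received[i], traitors)
        for i in lieutenants
    }
    # Step 3: each lieutenant votes over its own value and the relayed ones.
    return {
        j: majority([received[j]] + [relayed[i][j] for i in lieutenants if i != j])
        for j in lieutenants
    }

# Four nodes, one traitorous lieutenant (node 3): loyal nodes 1 and 2 agree.
four_nodes = om(1, 0, [1, 2, 3], "ATTACK", traitors={3})
# Three nodes, one traitor (node 2): loyal node 1 can no longer obey node 0.
three_nodes = om(1, 0, [1, 2], "ATTACK", traitors={2})
```

In the four-node run the loyal lieutenants both decide ATTACK. In the three-node run lieutenant 1 sees one ATTACK and one RETREAT, finds no majority, and falls back to the default, which contradicts its loyal commander.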

Solutions (2) Hierarchical: Uses a hierarchy of different fault-tolerant techniques, including Byzantine agreement; seen with fail-stop processors. SAFEbus: communication backplane for the Boeing 777; uses two buses which are themselves dual redundant (different forms of parity detect errors); uses self-checking pairs on top of the buses.
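
The self-checking pair idea can be illustrated with a minimal sketch. This is not SAFEbus code; the function name `self_checking_pair` and the example lanes are hypothetical. Two lanes compute the same result, and the output is released only if they agree, so a faulty lane produces silence rather than a wrong value.

```python
from typing import Callable, Optional

def self_checking_pair(
    lane_a: Callable[[int], int],
    lane_b: Callable[[int], int],
    x: int,
) -> Optional[int]:
    """Run both lanes on the same input; emit a value only if they agree.

    Returning None models the pair going silent (fail-stop behaviour).
    """
    a = lane_a(x)
    b = lane_b(x)
    return a if a == b else None

def healthy(x: int) -> int:
    return 2 * x

def corrupted(x: int) -> int:
    # Hypothetical fault: one bit of the result is flipped.
    return (2 * x) ^ 1

good_output = self_checking_pair(healthy, healthy, 21)
silenced = self_checking_pair(healthy, corrupted, 21)
```

The design choice is that higher layers only have to handle a node that stops, which is a much simpler fault than a node that lies.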

Solutions (3) Filtering: Targets propagation of Byzantine faults. Tries to either mask faults by forcing output to some straight value (removes value-type faults) or segment the system into Fault Containment Regions (FCRs), with protections placed at their boundaries to stop propagation.
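
Masking by forcing the output to a straight value can be sketched as a filter stage that re-drives the signal at a full rail level before it fans out. The rail voltages, the thresholds, and the function names here are assumptions for illustration; a real filter must also be built so that it does not itself pass marginal levels.

```python
V_LOW, V_HIGH = 0.0, 3.3                # assumed rail voltages
RECEIVER_THRESHOLDS = [1.4, 1.5, 1.6]   # assumed, slightly mismatched

def mask(voltage, filter_threshold=1.65):
    """Re-drive the signal at a full rail level (the 'straight value')."""
    return V_HIGH if voltage >= filter_threshold else V_LOW

def decode(voltage, threshold):
    """Interpret an analog voltage as a digital bit."""
    return 1 if voltage >= threshold else 0

def downstream_views(voltage, filtered):
    """Bits seen by each receiver, with or without the filter stage."""
    v = mask(voltage) if filtered else voltage
    return [decode(v, t) for t in RECEIVER_THRESHOLDS]

unfiltered = downstream_views(1.5, filtered=False)  # receivers disagree
filtered = downstream_views(1.5, filtered=True)     # receivers agree
```

Without the filter the marginal 1.5 V level is read inconsistently; with it, every receiver sees the same bit. The value may still be wrong, but it is no longer a Byzantine value fault.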

Ignorance is not Bliss: Ignoring Byzantine faults can invalidate the failure model (propagation of one fault can be disastrous; no amount of redundancy can help). Large economic factor: possible costs of recall and redeployment.

Conclusions: Byzantine faults are real! There are problems with ignoring them. No amount of redundancy can tolerate them without message exchange. Three categories of solutions deal with them: full exchange, hierarchical, and filtering.

BGP (Byzantine Generals Problem) Quick Review: The algorithm is expensive: each processor has to broadcast its values for many rounds, then chooses the majority value. Requires n > 3f, where f is the number of failures and n is the number of processors. With signed messages: can tolerate more failures; still expensive.
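
The bound and the cost quoted above can be made concrete with a little arithmetic. The sketch below assumes the textbook oral-messages algorithm OM(m), which needs n ≥ 3f + 1 processors and f + 1 rounds; the function names are illustrative, and the message count follows from the recursive structure of OM(m) rather than from any particular system.

```python
def max_faults(n: int) -> int:
    """Largest f that satisfies n > 3f."""
    return (n - 1) // 3

def min_processors(f: int) -> int:
    """Smallest n that satisfies n > 3f."""
    return 3 * f + 1

def rounds(f: int) -> int:
    """OM(f) uses f + 1 rounds of message exchange."""
    return f + 1

def message_count(n: int, m: int) -> int:
    """Messages sent by OM(m) among n nodes (one commander, n - 1 lieutenants).

    The commander sends n - 1 messages, then each lieutenant acts as the
    commander of an OM(m - 1) instance among the remaining n - 1 nodes.
    """
    if m == 0:
        return n - 1
    return (n - 1) + (n - 1) * message_count(n - 1, m - 1)

for f in (1, 2, 3):
    n = min_processors(f)
    print(f"f={f}: n={n}, rounds={rounds(f)}, messages={message_count(n, f)}")
```

Tolerating one fault takes 4 processors and 9 messages; two faults take 7 processors and 156 messages; three faults take 10 processors and 3609 messages. That growth is why the exchange is called expensive.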
