Tolerating Byzantine Behavior in Distributed Systems, Cylab

Miguel Correia

University of Lisboa, LASIGE / Navigators group
CyLab / CMU, December 2007

Motivation
• Every year thousands of new vulnerabilities appear and zillions of attacks and intrusions happen
  – Doing the best we know and can (security best practices, etc.) is essential but not enough
• Systems with high societal importance are becoming “online”
  – Critical infrastructures: gas, water, power, …
  – Controlled by computers indirectly connected to the Internet

Intrusion Tolerance
• (also called Byzantine Fault Tolerance)
• Apply the Fault Tolerance paradigm in the domain of Security
• Do the best we know to protect systems
• …but vulnerabilities still remain…
• Tolerate the intrusions that still occur

I-T: an example
[Figure: clients exchange Request/Reply messages with an I-T distributed service (NFS, DNS, on-line CA, Web server, etc.) built from N servers; labels on the slide: Redundancy, Diversity, 0-Day vulnerability.]

Outline
• Hybrid system models and Wormholes
• I-T State machine replication
• Randomized I-T protocols
• Primary-backup vs decentralized protocols
• Conclusions

Hybrid system models and Wormholes

Homogeneous system models
• Most work on I-T assumes a homogeneous system model; typically:
  – Asynchronous (no bounds on delays)
  – Byzantine/arbitrary faults, including attacks/intrusions
[Figure: Host 1, Host 2, …, Host n, each running processes on top of an OS, connected by a payload network.]

Hybrid system models
• We proposed and are interested in hybrid system models. For instance:
  – Asynchronous/Byzantine as before (red) +
  – Wormhole that is secure/tamperproof (green)
[Figure: Host 1, Host 2, …, Host n, each running processes on top of an OS and each with a local wormhole; the local wormholes are linked by an optional wormhole control channel; the hosts are connected by a payload network.]

Question 1: practical?
• Yes, it models several current systems:
• PCs with Trusted Platform Modules (TPM)
  – https://www.trustedcomputinggroup.org/
• PCs with SmartCards
• DIY: PCs with virtual machines (Xen, VMware)
• DIY: PCs with hardware appliances

Question 2: why model?
• Why not do research about PCs + SmartCards or TPMs or…?
• In our research we want:
  – Expressive models of real systems
  – Sound theoretical basis for proofs of correctness
  – Enablers for building new algorithms
• For practical minds:
  – We don’t want to be restricted to what can be done with SmartCards or TPMs…

Question 3: model what?
• In this talk:
  – “insecure system + secure subsystem”
• But there are other possibilities, e.g.,
  – “untimely system + timely subsystem”
  – A. Casimiro, P. Veríssimo, Timely Computing Base

I-T State machine replication

State machine replication basics
• SMR is a mechanism to make any deterministic service fault-tolerant
[Figure: clients exchange Request/Reply messages with an I-T distributed service built from N servers.]

SMR definition
• Servers are state machines:
  – state variables, commands
• Basic idea: make all servers follow the same sequence of states, i.e., enforce:
  – Initial state: all servers start in the same state
  – Agreement: all servers execute the same commands
  – Total order: all servers execute the commands in the same order
  – Determinism: the same command executed in the same initial state generates the same final state
[Slide label: Atomic multicast protocol]

Main Contribution
• There is a maximum number f of servers that can be faulty for the system to remain correct
• With a homogeneous system model (asynchronous Byzantine):
  – Minimum: N=3f+1 servers
  – 4 servers to tolerate 1 faulty, 7 to tolerate 2 faulty, …
• With a hybrid system model (secure wormhole in servers; not in clients):
  – Minimum: N=2f+1 servers
  – 3 to tolerate 1 faulty, 5 to tolerate 2 faulty, …
  – This reduction has a huge impact on the system cost: hardware, software, administration (diversity)

Trusted Ordering Wormhole
• The TOW is a wormhole that serves
specifically to implement a 2f+1 I-T atomic multicast
• Provides a single service with two purposes:
  – Says when a message can be delivered (which is when f+1 servers have it)
  – Says the order in which it must be delivered
• API:
  – TOW_sent: “I sent a message”
  – TOW_received: “I received a message”
• Output:
  – TOW_decide: “You can deliver the message, order is n”

2f+1 Atomic multicast w/ TOW
• H(M): a collision-resistant hash function
[Figure: message diagram for N=3, f=1 with servers S0, S1, S2 and message M1. Labels on the slide: sent H(M1); received H(M1) (twice); decide H(M1),1 (twice); f+1 servers have M1; order = 1; message delivery. Slide note: the TOW works the same way with more messages.]

Performance of I-T SMR
• Nice runs
• Bad runs

I-T SMR Research trends
• BFT, Castro and Liskov (OSDI’99)
  – First efficient I-T SMR system
• Increasing speed:
  – FaB Paxos (DSN’05), Q/U (SOSP’05), HQ (OSDI’06), Zyzzyva (SOSP’07)
• Reducing window of vulnerability:
  – BFT-PR (TOCS’02), Sousa et al. (SAC’06)
• Reducing number of replicas:
  – this work (SRDS’04), BFT2F (NSDI’07), A2M-PBFT-EA (SOSP’07)

Randomized I-T protocols

Motivation
• Randomized Byzantine FT agreement protocols:
  – Introduced in 1983: Ben-Or (PODC), Rabin (FOCS)
  – Since then many others appeared…
• But from a practical point of view:
  – Ben-Or style protocols (“local coins”) → run in an exponential expected number of communication steps
  – Rabin style protocols (“shared coin”) → rely on public-key crypto
• DS folklore: work in the area is theoretical; protocols too slow for most applications…
• …but are they really slow?

RITAS
• First, we designed an arguably efficient stack of randomized I-T protocols, RITAS (no wormhole)
  – No signatures, asynchronous, decentralized, n=3f+1
• Then we implemented them and evaluated their performance…
  – LANs, PlanetLab, wireless (PCs and PDAs)

Local coins vs Shared coin
• Binary consensus protocols evaluated:
  – Bracha’s (84), expected number of rounds O(2^(n-f)), no crypto
  – ABBA (01), expected number of
rounds constant, public-key crypto
• Testbed
  – 10/100/1000 Mbps local-area network (LAN)
  – 11 Dell PowerEdge 850 computers (2.8 GHz, 2 GB RAM)
  – Linux 2.6.11

Latency
• Shared Coin always has much higher latency

Latency (µs) [1000 Mbps, no faults]
Proposal distribution   Protocol   n=4     n=7     n=10
Uniform                 Local      824     2187    4132
Uniform                 Shared     21590   31315   43633
Corrosive               Local      2453    6172    12075
Corrosive               Shared     33834   38529   55169
Random                  Local      2056    5812    11501
Random                  Shared     24320   36325   49206

Throughput
• Local Coin is always better than Shared Coin
• Local Coin is affected by Byzantine faults
• Shared Coin is not affected by Byzantine faults

Maximum Throughput (decisions/s)
Faultload      Protocol   n=4   n=7   n=10
Failure-free   Local      450   170   80
Failure-free   Shared     13    9     8
Crash          Local      600   225   110
Crash          Shared     31    25    20
Byzantine      Local      330   87    30
Byzantine      Shared     16    9     8

Number of Rounds
• The average number of rounds is very low
• The performance is similar for the failure-free and crash faultloads
• The protocols always terminate with one round in the crash faultload
• Shared Coin is more robust in the Byzantine faultload
• Theoretical expected result is 128 rounds

Number of Rounds until Decision
Faultload      Protocol   n=4     n=7     n=10
Failure-free   Local      1.004   1.005   1.009
Failure-free   Shared     1.013   1.018   1.010
Crash          Local      1.000   1.000   1.000
Crash          Shared     1.000   1.000   1.000
Byzantine      Local      1.462   1.569   2.289
Byzantine      Shared     1.016   1.017   1.012

Randomized Atomic Broadcast
[Slide label: Bracha’84]
• Is it fast/practical?
• Testbed
  – 100 Mbps LAN
  – 4 nodes (Pentium III PCs, 500 MHz, 128 MB RAM)
  – Linux (kernel version 2.6.15)

Throughput
• No faults, n=4
• Byzantine faults: throughput almost not affected
[Figure: throughput plots; values labeled on the slide, in order: ~721 msgs/s, ~711 msgs/s, ~650 msgs/s, ~634 msgs/s, ~460 msgs/s, ~465 msgs/s.]

Primary-based vs decentralized protocols

Faster RITAS?
• We wanted RITAS to be faster; best candidate for improvement: Binary Consensus (bottom)
  – Fastest RITAS’s BC (Bracha 84): decentralized, n=3f+1, O(n^3) message complexity, no signatures
• Decentralized algorithms that solve asynchronous Byzantine BC can be built with and only with:
  1.
More Processes: n = 5f+1, O(n^2) message complexity and no signatures
  2. More Messages: n = 3f+1, O(o) message complexity (n^2 < o = n^2 f) and no signatures
  3. Signatures: n = 3f+1, O(n^2) message complexity and using signatures
• To improve RITAS: option 2, message complexity O(n^2 f)

State machine replication revisited
• For decentralized consensus algorithms, best:
  – n = 3f+1, O(o) message complexity (n^2 < o = n^2 f), no signatures
• But for a primary-based SMR like BFT:
  – n = 3f+1, O(n^2) message complexity, no signatures
• SMR with n=2f+1:
  – Requires distributed “heavy” wormhole
  – Decentralized (but not randomized)
• What about a primary-based SMR?
  – n=2f+1? “Lighter” wormhole?

Conclusions

Conclusions (1)
• Intrusion tolerance: a new paradigm for more secure distributed systems
• Hybrid system models and Wormholes
  – Model reality as sound basis for proofs of correctness
  – Enablers for building new algorithms…
  – … without getting tied to current devices
• First solution for I-T state-machine replication with only 2f+1 replicas

Conclusions (2)
• Randomized I-T protocols
  – Experimentation contradicted DS folklore
  – Protocols are practical
  – Local coin protocols are fast/practical but scale worse than shared-coin protocols
• Primary-based vs decentralized protocols
  – Primary-based protocols have to recover from a faulty leader
  – But decentralized protocols have constraints that do not apply to primary-based ones

Thank you. Questions?
http://www.di.fc.ul.pt/~mpc/
http://www.navigators.di.fc.ul.pt/
• Some related publications:
  – M. Correia, N. F. Neves, P. Veríssimo. How to Tolerate Half Less One Byzantine Nodes in Practical Distributed Systems. IEEE SRDS 2004
  – N. F. Neves, M. Correia, P. Veríssimo. Solving Vector Consensus with a Wormhole. IEEE TPDS 16-12, Dec. 2005
  – M. Correia, N. F. Neves, L. C. Lung, P. Veríssimo. Low Complexity Byzantine-Resilient Consensus. Distributed Computing, 17-3, Mar. 2005
  – P. Veríssimo. Travelling through Wormholes: a new look at Distributed Systems Models.
SIGACT News 37-1, 2006
  – M. Correia, N. F. Neves, P. Veríssimo. From Consensus to Atomic Broadcast: Time-Free Byzantine-Resistant Protocols without Signatures. Computer Journal 41-1, Jan. 2006
  – H. Moniz, N. F. Neves, M. Correia, P. Veríssimo. Randomized Intrusion-Tolerant Asynchronous Services. DSN 2006
  – A. Bessani, M. Correia, H. Moniz, N. F. Neves, P. Veríssimo. When 3f+1 is not Enough: Tradeoffs for Decentralized Asynchronous Byzantine Consensus. DISC 2007
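
Illustrative sketch (not part of the original slides). To make the replica counts from the "Main Contribution" slide and the decision rule from the "Trusted Ordering Wormhole" slide more concrete, the following is a minimal single-process Python toy. Everything in it is an assumption made for illustration: the names min_replicas, ToyTOW, sent, received and H are hypothetical, SHA-256 is merely one possible collision-resistant hash, and the real TOW is a distributed, tamperproof component whose protocol is not reproduced here. The toy only mimics two stated properties: a message is decided once f+1 servers have it, and decided messages get consecutive order numbers.

```python
import hashlib


def min_replicas(f: int, hybrid: bool) -> int:
    """Minimum number of servers given on the slides for tolerating f faults.

    Homogeneous (asynchronous Byzantine) model: 3f+1.
    Hybrid model (secure wormhole in the servers): 2f+1.
    """
    return 2 * f + 1 if hybrid else 3 * f + 1


def H(message: bytes) -> str:
    """Assumed stand-in for the collision-resistant hash H(M)."""
    return hashlib.sha256(message).hexdigest()


class ToyTOW:
    """Hypothetical local stand-in for the Trusted Ordering Wormhole."""

    def __init__(self, f: int):
        self.f = f
        self.holders = {}    # message hash -> set of server ids that have it
        self.order = {}      # message hash -> assigned order number
        self.next_order = 1

    def _report(self, server_id: int, msg_hash: str):
        # Record that this server has the message (duplicates count once).
        self.holders.setdefault(msg_hash, set()).add(server_id)
        # Decide once f+1 distinct servers have the message.
        if msg_hash not in self.order and len(self.holders[msg_hash]) >= self.f + 1:
            self.order[msg_hash] = self.next_order
            self.next_order += 1
        # Mimics TOW_decide: (hash, order) if decided, otherwise None.
        if msg_hash in self.order:
            return (msg_hash, self.order[msg_hash])
        return None

    def sent(self, server_id: int, msg_hash: str):
        """Mimics TOW_sent: "I sent a message"."""
        return self._report(server_id, msg_hash)

    def received(self, server_id: int, msg_hash: str):
        """Mimics TOW_received: "I received a message"."""
        return self._report(server_id, msg_hash)


if __name__ == "__main__":
    tow = ToyTOW(f=1)                 # N = 3, f = 1 as on the slide
    h1 = H(b"M1")
    print(tow.sent(1, h1))            # one server has M1: not decided yet
    print(tow.received(0, h1))        # f+1 servers have M1: decided
```

The sketch deliberately ignores everything that makes the real problem hard (asynchrony, malicious servers lying to the wormhole, and agreement among the local wormholes); it is only meant to show the arithmetic and the f+1 threshold in executable form.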
