CSE 124 Lecture Notes - Lecture 17: Fail-Safe, Distributed Computing, Memory Address

Fault Tolerance? - building reliable systems from unreliable components
3 Steps:
1. Detect errors
2. Contain errors
3. Mask errors - ensure the system operates correctly despite the error
Why is Fault Tolerance Hard?
- Say one bit in DRAM fails…
- it flips a bit in a memory address the kernel is writing to
- causes BIG memory error elsewhere
- a client can’t read from FS
cascading failures!
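
The first two steps of the chain above can be sketched in a few lines. This is only a toy model, not how a real kernel manages memory: the `memory` array, the address 8, and the choice of bit 5 are all made up for illustration.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted, modelling a single DRAM bit flip."""
    return value ^ (1 << bit)

memory = bytearray(64)                        # toy "physical memory" (hypothetical size)
intended_addr = 8                             # where the kernel means to write
corrupted_addr = flip_bit(intended_addr, 5)   # one flipped bit in the address

memory[corrupted_addr] = 0xFF                 # the write lands somewhere else entirely

print(intended_addr, corrupted_addr)                   # 8 40
print(memory[intended_addr], memory[corrupted_addr])   # 0 255
```

A single flipped bit moves the write 32 bytes away, so the intended location is never updated and an unrelated location is silently overwritten.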
So What To Do?
1. Do nothing: silently return the failure
2. Fail FAST: detect the failure and report at interface
3. Fail safe: transform incorrect behavior or values into “acceptable” ones
4. Mask the failure: operate despite failure
- when errors are common, mask them by using error-correcting codes for bit flips, or by replicating data in multiple places
- useful for “high-density” hard drives, which are more likely to have errors
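
The four options can be compared on a single corrupted value. The sketch below is an illustration under stated assumptions, not the lecture's own code: detection uses a one-bit parity check, masking uses a majority vote over three replicas, and every name (`GOOD`, `BAD`, `CorruptionError`, the function names) is hypothetical.

```python
from collections import Counter

GOOD = 0b1010_1010
BAD = GOOD ^ 0b0000_0100          # the same value with one flipped bit

def parity(x: int) -> int:
    """Parity bit of x: 1 if the number of set bits is odd, else 0."""
    return bin(x).count("1") % 2

class CorruptionError(Exception):
    """Raised when a stored value fails its parity check."""

def do_nothing(value, stored_parity):
    # 1. Do nothing: the corrupted value is passed along unnoticed.
    return value

def fail_fast(value, stored_parity):
    # 2. Fail fast: detect the error and report it at the interface.
    if parity(value) != stored_parity:
        raise CorruptionError("parity mismatch")
    return value

def fail_safe(value, stored_parity, default=0):
    # 3. Fail safe: replace the bad value with an "acceptable" one.
    if parity(value) != stored_parity:
        return default
    return value

def mask(replicas):
    # 4. Mask: keep working by taking the majority value among replicas.
    value, _count = Counter(replicas).most_common(1)[0]
    return value

stored_parity = parity(GOOD)
print(do_nothing(BAD, stored_parity) == GOOD)   # False
print(fail_safe(BAD, stored_parity))            # 0
print(mask([GOOD, BAD, GOOD]) == GOOD)          # True
```

Parity can only detect an odd number of flipped bits, and the majority vote only masks the error while at most one of the three replicas is wrong. Real error-correcting codes and replication schemes make the same kind of tradeoff with stronger guarantees.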
Masking Failures
- we mask failures on one server via
- atomic ops
- logging and recovery
- in a distributed system w/ multiple servers, we might replicate “some or all”
- but if you have replicated servers
- keep them consistent in a “fault-tolerant” way
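
For the single-server case, logging and recovery might look roughly like the sketch below. It assumes that appending one record to the log is atomic and durable, which a real system would have to guarantee separately. The in-memory `log` list, the account names, and the `crash_midway` flag are hypothetical stand-ins used only to simulate a crash.

```python
import json

def apply(state, record):
    """Apply one logged transfer to the state, both halves together."""
    state[record["src"]] -= record["amount"]
    state[record["dst"]] += record["amount"]

def transfer(log, state, src, dst, amount, crash_midway=False):
    # 1. Append the intent to the log before touching the state.
    log.append(json.dumps({"src": src, "dst": dst, "amount": amount}))
    # 2. Update the state; a crash here leaves it half-applied.
    state[src] -= amount
    if crash_midway:
        raise RuntimeError("simulated crash")
    state[dst] += amount

def recover(log, initial):
    """Rebuild a consistent state by replaying every logged transfer."""
    state = dict(initial)
    for line in log:
        apply(state, json.loads(line))
    return state

initial = {"alice": 100, "bob": 0}
log, state = [], dict(initial)
try:
    transfer(log, state, "alice", "bob", 30, crash_midway=True)
except RuntimeError:
    pass

print(state)                   # {'alice': 70, 'bob': 0}
print(recover(log, initial))   # {'alice': 70, 'bob': 30}
```

After the simulated crash the live state has lost 30 units, but replaying the log from the initial state restores a result in which the transfer happened as a whole.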
Safety - “bad things” don’t happen
- no stopped or “deadlocked” states
- no “error” states
Liveness - “good things” happen
- eventually (no starvation)
- not “inherently” a good thing - maybe you always want to give priority to a certain port
- but you risk starvation!
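
The starvation risk can be seen in a small simulation. This is a sketch with made-up names: two queues stand in for a high-priority and a low-priority port, and `strict_priority` and `round_robin` are hypothetical schedulers, not anything defined in the lecture.

```python
from collections import deque

def strict_priority(high, low, steps):
    """Always serve the high-priority queue first; low waits until high is empty."""
    served = []
    for _ in range(steps):
        if high:
            served.append(high.popleft())
        elif low:
            served.append(low.popleft())
    return served

def round_robin(high, low, steps):
    """Alternate which queue gets the first chance on each step."""
    served = []
    queues = [high, low]
    for i in range(steps):
        first, second = queues[i % 2], queues[(i + 1) % 2]
        if first:
            served.append(first.popleft())
        elif second:
            served.append(second.popleft())
    return served

def make_queues():
    # A busy high-priority port and one waiting low-priority request.
    return deque(f"H{i}" for i in range(10)), deque(["L0"])

print(strict_priority(*make_queues(), steps=5))   # ['H0', 'H1', 'H2', 'H3', 'H4']
print(round_robin(*make_queues(), steps=5))       # ['H0', 'L0', 'H1', 'H2', 'H3']
```

Under strict priority the request `L0` is never served while the high-priority port stays busy, which is a liveness failure. Round-robin serves it on the second step, at the cost of delaying some high-priority traffic.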
Often a Tradeoff
- safety is VERY important in banking transactions
Motivation: Sending Money
