CSE 124 Lecture Notes - Lecture 17: Fail-Safe, Distributed Computing, Memory Address

turquoiseplatypus250

14 May 2018

School: UCSD

Department: Computer Science & Engineering

Course: CSE 124

Professor: George Porter

Fault Tolerance? - building reliable systems from unreliable components

3 Steps:

1. Detect errors

2. Contain errors

3. Mask errors - ensure the system operates correctly despite the error

Why is Fault Tolerance Hard?

- Say one bit in DRAM fails…

- it flips a bit in a memory address the kernel is writing to

- causes BIG memory error elsewhere

- a client can’t read from FS

→ cascading failures!
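
The scenario above can be sketched in a few lines. This is only an illustration, assuming a hypothetical 16-cell flat "memory" and a helper named flip_bit; neither comes from the lecture. It shows how a single flipped address bit sends a write to an unrelated cell.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit position inverted (XOR with a one-bit mask)."""
    return value ^ (1 << bit)


# Hypothetical flat "memory" of 16 cells; cell 3 is where the kernel means to write.
memory = [0] * 16
intended_address = 3  # 0b0011

# One flipped bit (bit 3) turns address 3 into a completely different address.
corrupted_address = flip_bit(intended_address, 3)  # 0b1011

# The write lands in a cell that belongs to something else,
# while the intended cell silently keeps its stale value.
memory[corrupted_address] = 0xFF
```

Nothing at the point of the write signals an error; the damage only shows up later, when whoever owns the other cell reads it.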

So What To Do?

1. Do nothing: silently return the failure

2. Fail FAST: detect the failure and report it at the interface

3. Fail safe: transform incorrect behavior or values into “acceptable” ones

4. Mask the failure: operate despite failure

- when errors are common, mask them by using error-correcting codes for bit flips and by replicating data in multiple places

- useful for “high-density” hard drives, which are more likely to err
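
A minimal sketch of the four options applied to reading one stored block. The names (read_do_nothing, read_fail_fast, read_fail_safe, read_masked, CorruptBlockError) are made up for illustration. It also assumes a CRC32 checksum for detection and plain replication for masking; a checksum only detects corruption, so this is not an error-correcting code.

```python
import zlib


class CorruptBlockError(Exception):
    """Raised when a block's contents do not match its stored checksum."""


def checksum(data: bytes) -> int:
    return zlib.crc32(data)


def read_do_nothing(block):
    # 1. Do nothing: hand back whatever is stored, corrupt or not.
    data, _stored = block
    return data


def read_fail_fast(block):
    # 2. Fail fast: detect the mismatch and report it at the interface.
    data, stored = block
    if checksum(data) != stored:
        raise CorruptBlockError("checksum mismatch")
    return data


def read_fail_safe(block, default=b""):
    # 3. Fail safe: replace a bad value with an "acceptable" default.
    try:
        return read_fail_fast(block)
    except CorruptBlockError:
        return default


def read_masked(replicas):
    # 4. Mask: return the first replica that passes its checksum.
    for block in replicas:
        try:
            return read_fail_fast(block)
        except CorruptBlockError:
            continue
    raise CorruptBlockError("all replicas corrupt")


good = (b"balance=100", checksum(b"balance=100"))
bad = (b"balance=900", checksum(b"balance=100"))  # data changed after checksum was taken

masked_value = read_masked([bad, good, good])
```

With these inputs, option 1 returns the corrupt bytes, option 2 raises, option 3 returns the default, and option 4 returns the intact copy. Masking still fails if every replica is corrupt, which is why it is paired with detection.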

Masking Failures

- we mask failures on one server via

- atomic ops

- logging and recovery

- in a distributed system w/ multiple servers, we might replicate “some or all”

- but if you have replicated servers

- keep them consistent in a “fault-tolerant” way

Safety - “bad things” don’t happen

- no stopped or “deadlocked” states

- no “error” states

Liveness - “good things” happen

- eventually (no starvation)

- not “inherently” a good thing - maybe you always want to give priority to a certain port

- but you risk starvation!
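
The starvation risk can be seen by comparing two toy schedulers. This is a sketch under assumptions: two request queues (a "high" one for the favored port and a "low" one for everything else), and the function names strict_priority and round_robin are chosen here for illustration.

```python
from collections import deque


def strict_priority(high, low, steps):
    """Serve up to `steps` requests, always preferring the high-priority queue."""
    served = []
    for _ in range(steps):
        if high:
            served.append(high.popleft())
        elif low:
            served.append(low.popleft())
    return served


def round_robin(high, low, steps):
    """Serve up to `steps` requests, alternating between the non-empty queues."""
    served = []
    queues = [high, low]
    turn = 0
    for _ in range(steps):
        for offset in range(2):
            queue = queues[(turn + offset) % 2]
            if queue:
                served.append(queue.popleft())
                turn = (turn + offset + 1) % 2
                break
    return served


def make_queues():
    # The favored port always has work waiting; the other port has two requests.
    return deque(f"H{i}" for i in range(6)), deque(["L0", "L1"])


priority_order = strict_priority(*make_queues(), steps=4)
fair_order = round_robin(*make_queues(), steps=4)
```

Under strict priority the low queue is never served while the high queue stays busy, so liveness is lost for those requests. Round robin serves both, at the cost of delaying the favored port.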

Often a Tradeoff

- safety is VERY important in banking transactions

Motivation: Sending Money
