A register file consists of a set of registers that can be read and written by supplying the number of the register to be accessed.
A pipeline latch is similar. Since both register files and pipeline latches contain relatively few addressable state
elements, they are built using standard digital logic, including multiplexors. Because the size of the multiplexors can get out of
hand, larger memories are built using either _SRAMs_ (static random access memories) or _DRAMs_ (dynamic
random access memories).
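As a toy illustration (mine, not the text's), the register-file abstraction, an array of state elements read and written by register number, can be sketched as follows; the register count and width are arbitrary assumptions.

```python
class RegisterFile:
    """A toy model of a register file: a small set of registers that are
    read and written by supplying the number of the register to access."""

    def __init__(self, num_registers=32, width_bits=64):
        self.width_mask = (1 << width_bits) - 1
        self.regs = [0] * num_registers       # the addressable state elements

    def read(self, regnum):
        return self.regs[regnum]              # hardware selects this entry with a multiplexor

    def write(self, regnum, value):
        self.regs[regnum] = value & self.width_mask

rf = RegisterFile()
rf.write(5, 0xDEADBEEF)
print(hex(rf.read(5)))                        # prints 0xdeadbeef
```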
Large SRAMs cannot be built the way a register file is: in a register file, a 32-to-1 multiplexor is practical, but the
64K-to-1 multiplexor that would be required for a 64K x 1 SRAM (64K entries, each 1 bit wide) is totally impractical.
Instead, large memories are implemented with a shared output line, called a _bit line_, which multiple
memory cells in the memory can assert. We do not explain how multiple sources can drive a single line.
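To put numbers on the comparison (just the arithmetic implied above): a 32-entry register file needs a 32-to-1 read multiplexor with

$$ \log_2 32 = 5 $$

select bits, while a 64K x 1 SRAM has

$$ 64\text{K} = 2^{16} = 65{,}536 $$

entries, so a multiplexor-based read port would need 65,536 inputs and 16 select bits, roughly 2,000 times wider than the register-file case.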
SRAM: a large array of storage cells that are accessed like registers. An SRAM memory cell requires 4 to 6 transistors
per bit and holds stored data as long as it is powered on.
DRAM: uses only one transistor per bit, but must be periodically refreshed to prevent the loss of stored data.
Both SRAMs and DRAMs lose data if disconnected from power; they are _volatile_ memories.
Synchronous SRAMs and DRAMs provide cleaner interfaces, and are becoming increasingly popular.
It is impossible to build a bistable element with only one transistor. To allow single-transistor memory cells, which
provide the highest storage density and lowest cost per bit, a DRAM stores data on tiny capacitors.
Reading destroys the cell contents, so such a _destructive readout_ of data must be immediately followed by a write
operation to restore the original values.
Leakage of charge from tiny capacitors causes data to be lost after a fraction of a second. Hence, special
circuitry is required that periodically refreshes the memory contents. DRAMs are cheaper, but slower, than SRAMs.
In addition, DRAM writeback operations interfere with the memory bandwidth offered by a DRAM bank.
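As a deliberately simplified sketch of the behavior just described (my own toy model with an arbitrary retention time, not a real DRAM design), the following shows leaking charge, destructive readout with immediate write-back, and periodic refresh.

```python
class ToyDRAMCell:
    """A toy single-capacitor DRAM cell: the stored charge leaks away over
    time, and reading the cell is destructive."""

    RETENTION_TICKS = 64              # arbitrary: charge survives this many ticks

    def __init__(self):
        self.bit = 0
        self.charge_age = None        # None means no valid charge is stored

    def write(self, bit):
        self.bit = bit
        self.charge_age = 0           # a write fully recharges the capacitor

    def tick(self):
        """Advance time by one tick; old charge eventually leaks away."""
        if self.charge_age is not None:
            self.charge_age += 1
            if self.charge_age > self.RETENTION_TICKS:
                self.charge_age = None        # data lost

    def read(self):
        """Destructive readout: sensing drains the capacitor, so the value
        must be written back immediately to restore it."""
        if self.charge_age is None:
            raise RuntimeError("data lost: cell was not refreshed in time")
        value = self.bit
        self.charge_age = None        # the readout destroyed the charge
        self.write(value)             # ... so restore it right away
        return value

    def refresh(self):
        """Periodic refresh: rewrite the cell before its charge leaks away."""
        if self.charge_age is not None:
            self.write(self.bit)

cell = ToyDRAMCell()
cell.write(1)
for _ in range(50):                   # within the retention window: data survives
    cell.tick()
print(cell.read())                    # 1 (and read() wrote the bit back)
```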
We spoke of synchronous DRAM (SDRAM) as offering several benefits. One enhanced implementation is double-
data-rate DRAM (DDR-DRAM). This doubles the transfer rate from the DRAM by using both edges of the clock
signal to trigger actions. Another enhancement, Rambus DRAM (RDRAM), is slightly too complicated to explain in an introductory treatment.
Both SRAMs and DRAMs require power to keep stored data intact. This type of _volatile_ memory must be
supplemented by some form of _nonvolatile_ memory so that data and programs are not lost when power is
interrupted. This is typically accomplished by providing a small _read-only memory_ (ROM) in addition to the disk.

Powerful Idea
Since computer systems are globally clocked, we can think of processor-memory interconnect, and memory
itself, as collectively forming a giant memory pipeline. (Note: memory pipelines are distinct from pipelined
memories). This allows memory-access latency to be spread across multiple pipeline stages, thus increasing
the memory throughput (bandwidth). When we were dealing with instruction pipelines, we took for granted that we
could supply a new instruction to the pipeline in every clock cycle. For a memory pipeline to work well, we need to
supply a new memory reference to the memory pipeline in every cycle. If we don't, the memory pipeline fills with bubbles,
and the delivered bandwidth will fall.
(Note: if there are no bubbles, we tolerate _all_ the memory latency; if there are few bubbles, we tolerate, maybe,
90% of latency; it's not a binary situation). Unfortunately, killer micros lack the capacity (the "pumping engine") to
supply a new memory reference in every cycle. As a result, killer micros are famously _latency intolerant_
processors. In fact, overcoming the Memory Wall requires both interconnect/memory redesign and processor redesign.
We will discuss this later when we introduce _Little's law_. But just to give you an idea now, if the
interconnect/memory latency is 400 cycles, and you desire that a new memory word be delivered to the processor
every cycle, then you must sustain 400 outstanding memory references in every cycle. This is totally impossible if
the processor spends much of its day stalled, as killer micros are wont to do.
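To preview the arithmetic (Little's law itself is introduced later), with the numbers above:

$$ \text{outstanding references} \;=\; \text{throughput} \times \text{latency} \;=\; 1\ \tfrac{\text{word}}{\text{cycle}} \times 400\ \text{cycles} \;=\; 400 . $$

Read the other way around, a processor able to keep only, say, 8 references in flight can sustain at most 8/400 = 2% of the desired bandwidth, no matter how capable the memory pipeline is.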
Most books present a memory hierarchy that was developed to remedy the weak spots of killer micros. In the 1990s,
it is not unfair to speak of the codesign of increasingly elaborate caches to complement increasingly elaborate
instruction pipelines. The standard model is registers, L1$, L2$, L3$, DRAM, etc., etc. This is drawn as a pyramid to
show the greater capacity in bits as we move down the hierarchy. As you drop down the hierarchy, the bandwidth
decreases and the latency increases. Much emphasis is placed on the relationship between cost and capacity:
large main memories are implemented using slower and cheaper DRAM, while smaller caches are implemented using
faster and more expensive SRAM. It seems only logical to present the user with as much memory as is available in
the cheapest technology, while providing the illusion that one can access memory with the latency and bandwidth of the
most expensive technology. The presented memory hierarchy appears natural, and not at all ad hoc.
As we will see, this illusion has a probabilistic component. If one can move data up and down through the memory
hierarchy so that it is _very likely_ that data is close to the processor when the processor needs it (or them),
then the illusion can be maintained.
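To make the probabilistic component concrete, here is a minimal back-of-the-envelope model (my illustration; the numbers are arbitrary). Suppose an access finds its data in the near, fast part of the hierarchy with probability p, costing T_near cycles, and otherwise goes far, costing T_far cycles:

$$ T_{\text{avg}} \;=\; p \cdot T_{\text{near}} + (1 - p) \cdot T_{\text{far}} . $$

With T_near = 2, T_far = 400, and p = 0.99, the average is about 6 cycles; at p = 0.9 it is already about 42 cycles. The illusion survives only while p stays very close to 1.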
The real reason to have any memory hierarchy at all is that it is _one_ tried and tested technique to push
back the Memory Wall. But there are fateful choices to be made in designing any memory hierarchy. We will see some of this.
Exploiting a memory hierarchy is classified as a _latency avoidance_ mechanism because if,
statistically speaking, data is always close, then there is no latency to worry about.
But memory hierarchies are not divinely ordained, and caches are not the only way to build memory hierarchies.
Vector architectures were the first modern processors to do away with caches altogether. Vector processors
have vector loads and vector stores in which large numbers of memory references are sent to memory at the
same time. Keeping large numbers of memory references outstanding at (practically) all times is classified as a
_latency-tolerance_ mechanism, in the sense that latency is still there but, for reasons we will explore, processor
performance isn't harmed.
Slightly older GPUs did not have caches at all. Given that GPUs run programs with high levels of data parallelism, there
was no need to waste precious chip resources on caches since the high levels of data parallelism (think vector
processors) were able to tolerate the long latency to DRAM.
The truth is slightly more complicated. Some programs don't have enough data parallelism to hide all DRAM latency,
which is a problem if this is the only latency-handling technique available to the processor. Moreover (this may seem
curious at first), other
programs have unpredictable data-reuse patterns that defy software-controlled scratchpad memories.
software-controlled scratchpad memory: an on-chip memory that tries to beat caches at their own game by replacing
robotic hardware control with more flexible software control; it also attempts to move data up and down through the
memory hierarchy, just like caches do.
In truth, both scratchpad memories and caches try to exploit data-reuse locality that is sometimes present in programs.
In its Fermi architecture (2009), NVIDIA decided to add caches to its GPUs. Fermi now has configurable 64-KB
private first-level caches for each of its on-chip quasi-vector processors, and a 768-KB shared second-level cache
for the GPU chip as a whole. But there's more. Each private cache (64 KBs) can be split either as a 48-KB cache
and a 16-KB local scratchpad memory, or as a 16-KB cache and a 48-KB local scratchpad memory. NVIDIA simply
doesn't trust a hardware-controlled cache to do the best job in all cases. Moreover, NVIDIA messes with the capacity
dogma of conventional memory hierarchies.
The total size of the processor registers (16 * 128 KBs = 2,048 KBs) is greater than the total size of all the level-1
caches put together (16 * 48 KBs = 768 KBs), which is also the size of the level-2 cache (768 KBs). Both the
large register sets and the ample scratchpad memories are, in part, an attempt to rely on a hardware-controlled cache
as little as possible. The cache is only there to handle situations where the data-reuse patterns become too
complicated for software control.

Hierarchy Logic
In my opinion, it is helpful to understand the logic of memory hierarchies independently of whether caches even exist.
In the short run, this is necessary if we hope to imagine less toxic ways of organizing and using caches. In the long
run, it is necessary if we hope to imagine completely novel, deep memory hierarchies that transcend the notion of a
cache altogether.
Simplest model: the memory hierarchy has only two levels that can store data, a near level and a far level. The conceptual
difference is that it is expensive for the processor to acquire data from the far level.
Note that no one has specified whether the near level consists of scalar register files, vector register files, scratchpad
memories, caches, some combination of the above, or something else entirely. For now, we abstract from these details.
A program, because of its algorithmic structure, may (or may not) have a high degree of _data reuse_. This simply
means it is statistically likely that, whenever the program instructs the processor to acquire a datum from the far
level, the program will subsequently, within a reasonable time window, reuse the datum multiple times. Of course, if this
reuse is obvious, we should just stick the datum in a register. If the reuse is not obvious, the program we write may
blithely attempt to reacquire the datum from the far level. Common names for the program attribute 'high degree of
data reuse' are _data-reuse locality_ and _temporal locality_.
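As a concrete illustration of the two-level model and of temporal locality, here is a small Python sketch (my own; the capacities, the LRU policy, and the reference streams are arbitrary assumptions) that replays a reference stream against a near level of limited capacity and counts how many acquisitions must go to the far level.

```python
from collections import OrderedDict

def far_level_accesses(references, near_capacity):
    """Replay a stream of data references against a near level of limited
    capacity (managed here with a simple LRU policy) and count how many
    references must be acquired from the expensive far level."""
    near = OrderedDict()                  # datum -> None, ordered by recency of use
    far_accesses = 0
    for datum in references:
        if datum in near:
            near.move_to_end(datum)       # reuse: datum already near, no far access
        else:
            far_accesses += 1             # must acquire the datum from the far level
            near[datum] = None
            if len(near) > near_capacity:
                near.popitem(last=False)  # evict the least recently used datum
    return far_accesses

# A stream with high temporal locality: the same few data are reused many times.
reusing = [d for _ in range(100) for d in ("a", "b", "c", "d")]
# A stream with no reuse: every reference names a new datum.
streaming = list(range(400))

print(far_level_accesses(reusing, near_capacity=8))    # 4: only first touches go far
print(far_level_accesses(streaming, near_capacity=8))  # 400: every reference goes far
```

With enough near-level capacity to hold the reused data, only the first touch of each datum pays the far-level cost; with no reuse, every single reference does.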
Suppose some mechanism suspects that a datum about to be acquired from the far level will be used multiple times.
Then it is simple common sense to make and keep a copy of the datum in the near level, and to preserve this copy as
long as possible. If