
COMP 228 Lecture 7: lec 7


Concordia University
Computer Sci.
COMP 228
David K Probst

State Aggregates

A register file consists of a set of registers that can be read and written by supplying the number of the register to be accessed. A pipeline latch is similar. Since both register files and pipeline latches contain relatively few addressable state elements, they are built using standard digital logic, including multiplexors. Because the size of the multiplexors can get out of hand, larger memories are built using either _SRAMs_ (static random access memories) or _DRAMs_ (dynamic random access memories). A large SRAM cannot be built like a register file: the 32-to-1 multiplexor in a register file is practical, but the 64K-to-1 multiplexor that would be required for a 64K x 1 SRAM (64K entries, each 1 bit wide) is totally impractical. Instead, large memories are implemented with a shared output line, called a _bit line_, which multiple memory cells in the memory can assert. We do not explain here how multiple sources can drive a single line.

An SRAM is a large array of storage cells that are accessed like registers. An SRAM memory cell requires 4-6 transistors per bit, and holds stored data as long as it is powered on. A DRAM uses only one transistor per bit, but must be periodically refreshed to prevent the loss of stored data. Both SRAMs and DRAMs lose data if disconnected from power; they are _volatile_ memories. Synchronous SRAMs and DRAMs provide cleaner interfaces, and are becoming increasingly popular.

It is impossible to build a bistable element with only one transistor. To allow single-transistor memory cells, which provide the highest storage density and lowest cost per bit, a DRAM stores data on tiny capacitors. Reading destroys the cell contents, so such a _destructive readout_ of data must be immediately followed by a write operation to restore the original values. Leakage of charge from the tiny capacitors causes data to be lost after a fraction of a second. Hence, special circuitry is required that periodically refreshes the memory contents.
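The read/write-by-register-number interface described above can be sketched behaviorally. The following is a minimal illustrative model (the class name, sizes, and masking behavior are assumptions for the sketch, not from the lecture); in hardware the read would be the 32-to-1 multiplexor discussed above.

```python
class RegisterFile:
    """Behavioral sketch of a small register file (illustrative, not from the lecture).

    Hardware implements the read as an N-to-1 multiplexor over the register
    outputs; practical at 32 entries, impractical at 64K.
    """
    def __init__(self, num_regs=32, width_bits=32):
        self.mask = (1 << width_bits) - 1   # values wider than the register are truncated
        self.regs = [0] * num_regs

    def read(self, reg_num):
        # Mux: select one register's value by its number.
        return self.regs[reg_num]

    def write(self, reg_num, value):
        self.regs[reg_num] = value & self.mask

rf = RegisterFile()
rf.write(5, 0xDEADBEEF)
assert rf.read(5) == 0xDEADBEEF
```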
DRAMs are cheaper, but slower, than SRAMs. In addition, DRAM writeback operations interfere with the memory bandwidth offered by a DRAM bank. We spoke of synchronous DRAM (SDRAM) as offering several benefits. One enhanced implementation is double-data-rate DRAM (DDR DRAM), which doubles the transfer rate from the DRAM by using both edges of the clock signal to trigger actions. Another enhancement, Rambus DRAM (RDRAM), is slightly too complicated to explain in an introductory course. Both SRAMs and DRAMs require power to keep stored data intact. This type of _volatile_ memory must be supplemented by some form of _nonvolatile_ memory so that data and programs are not lost when power is interrupted. This is typically accomplished by providing a small _read-only memory_ (ROM) in addition to the disk.

Powerful Idea

Since computer systems are globally clocked, we can think of the processor-memory interconnect, and memory itself, as collectively forming a giant memory pipeline. (Note: memory pipelines are distinct from pipelined memories.) This allows memory-access latency to be spread across multiple pipeline stages, thus increasing the memory throughput (bandwidth). When we were dealing with instruction pipelines, we took for granted that we could supply a new instruction to the pipeline in every clock cycle. For a memory pipeline to work well, we need to supply a new memory reference to the memory pipeline in every cycle. If we don't, the memory pipeline fills with bubbles, and the delivered bandwidth will fall. (Note: if there are no bubbles, we tolerate _all_ the memory latency; if there are few bubbles, we tolerate, maybe, 90% of the latency; it's not a binary situation.) Unfortunately, killer micros lack the capacity (the "pumping engine") to supply a new memory reference in every cycle. As a result, killer micros are famously _latency intolerant_ processors. In fact, overcoming the Memory Wall requires both interconnect/memory redesign and processor redesign.
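The effect of DDR's use of both clock edges can be seen with a quick peak-bandwidth calculation. The clock rate and bus width below are illustrative assumptions, not figures from the lecture:

```python
def peak_bandwidth_bytes_per_s(clock_hz, bus_width_bits, double_data_rate):
    """Peak transfer rate of a memory interface.

    DDR triggers a transfer on both clock edges, so it performs two
    transfers per clock cycle instead of one.
    """
    transfers_per_cycle = 2 if double_data_rate else 1
    return clock_hz * transfers_per_cycle * bus_width_bits // 8

# Illustrative numbers: a 100 MHz clock driving a 64-bit bus.
sdr = peak_bandwidth_bytes_per_s(100_000_000, 64, False)  # 800 MB/s
ddr = peak_bandwidth_bytes_per_s(100_000_000, 64, True)   # 1600 MB/s
assert ddr == 2 * sdr
```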
We will discuss this later when we introduce _Little's law_. But just to give you an idea now, if the interconnect/memory latency is 400 cycles, and you desire that a new memory word be delivered to the processor every cycle, then you must sustain 400 outstanding memory references in every cycle. This is totally impossible if the processor spends much of its day stalled, as killer micros are wont to do.

Memory-Hierarchy Diversity

Most books present a memory hierarchy that was developed to remedy the weak spots of killer micros. In the 1990s, it was not unfair to speak of codesign of increasingly elaborate caches to complement increasingly elaborate instruction pipelines. The standard model is registers, L1$, L2$, L3$, DRAM, etc. This is drawn as a pyramid to show the greater capacity in bits as we move down the hierarchy. As you drop down the hierarchy, the bandwidth decreases and the latency increases. Much emphasis is placed on the relationship between cost and capacity: large main memories are implemented using slower and cheaper DRAM, while smaller caches are implemented using faster and more expensive SRAM. It seems only logical to present the user with as much memory as is available in the cheapest technology, while providing the illusion that one can access memory with the latency and bandwidth of the most expensive technology. The presented memory hierarchy appears natural, and not at all ad hoc. As we will see, this illusion has a probabilistic component. If one can move data up and down through the memory hierarchy so that it is _very likely_ that data is close to the processor when the processor needs it, then the illusion can be maintained. The real reason to have any memory hierarchy at all is that it is _one_ tried and tested technique to push back the Memory Wall. But there are fateful choices to be made in designing any memory hierarchy. We will see some of this.
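The 400-outstanding-references figure above is exactly Little's law applied to a memory pipeline: required concurrency = latency x throughput. A one-line check of the lecture's example:

```python
def required_concurrency(latency_cycles, words_per_cycle):
    """Little's law for a memory pipeline: to sustain a given delivery rate
    across a given latency, this many references must be in flight at once."""
    return latency_cycles * words_per_cycle

# The lecture's example: 400-cycle interconnect/memory latency,
# one new memory word delivered to the processor every cycle.
assert required_concurrency(400, 1) == 400
```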
Exploiting a memory hierarchy is classified as a _latency avoidance_ mechanism because if, statistically speaking, data is always close, then there is no latency to worry about. But memory hierarchies are not divinely ordained, and caches are not the only way to build memory hierarchies. Vector architectures were the first modern processors to do away with caches altogether. Vector processors have vector loads and vector stores in which large numbers of memory references are sent to memory at the same time. Keeping large numbers of memory references outstanding at (practically) all times is classified as a _latency-tolerance_ mechanism, in the sense that the latency is still there but, for reasons we will explore, processor performance isn't harmed. Slightly older GPUs did not have caches at all. Given that GPUs run programs with high levels of data parallelism, there was no need to waste precious chip resources on caches, since the high levels of data parallelism (think vector processors) were able to tolerate the long latency to DRAM. The truth is slightly more complicated. Some programs don't have enough data parallelism to hide all the DRAM latency, which hurts if this is the only latency-handling technique available to the processor. Moreover (this may seem curious at first), other programs have unpredictable data-reuse patterns that defy software-controlled scratchpad memories. A software-controlled scratchpad memory is an on-chip memory that tries to beat caches at their own game by replacing robotic hardware control with more flexible software control; it also attempts to move data up and down through the memory hierarchy, just like caches do. In truth, both scratchpad memories and caches try to exploit the data-reuse locality that is sometimes present in programs. In its Fermi architecture (2009), NVIDIA decided to add caches to its GPUs.
Fermi has configurable 64-KB private first-level caches for each of its on-chip quasi-vector processors, and a 768-KB shared second-level cache for the GPU chip as a whole. But there's more. Each private 64 KB can be split either as a 48-KB cache and a 16-KB local scratchpad memory, or as a 16-KB cache and a 48-KB local scratchpad memory. NVIDIA simply doesn't trust a hardware-controlled cache to do the best job in all cases. Moreover, NVIDIA messes with the capacity dogma of conventional memory hierarchies. The total size of the processor registers (16 * 128 KB = 2,048 KB) is greater than the total size of all the level-1 caches put together (16 * 48 KB = 768 KB), which is also the size of the level-2 cache (768 KB). Both the large register sets and the ample scratchpad memories are, in part, an attempt to rely on a hardware-controlled cache as little as possible. The cache is only there to handle situations where the data-reuse patterns become too complicated for software control.

Hierarchy Logic

In my opinion, it is helpful to understand the logic of memory hierarchies independently of whether caches even exist. In the short run, this is necessary if we hope to imagine less toxic ways of organizing and using caches. In the long run, it is necessary if we hope to imagine completely novel, deep memory hierarchies that transcend the notion of cache. In the simplest model, a memory hierarchy has only two levels that can store data: a near level and a far level. The conceptual difference is that it is expensive for the processor to acquire data from the far level. Note that no one has specified whether the near level consists of scalar register files, vector register files, scratchpad memories, caches, some combination of the above, or something else entirely. For now, we abstract from these details. A program, because of its algorithmic structure, may (or may not) have a high degree of _data reuse_.
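The Fermi capacity inversion described above is just arithmetic on the figures in the text, and is easy to verify (16 processors with 128-KB register sets, 48-KB maximum L1 cache split, 768-KB L2):

```python
# Fermi capacity figures from the notes.
num_processors = 16
registers_kb_each = 128
l1_cache_kb_each = 48   # largest cache portion of the 64-KB split
l2_kb = 768

total_registers_kb = num_processors * registers_kb_each  # 2,048 KB
total_l1_kb = num_processors * l1_cache_kb_each          # 768 KB

# Registers dwarf the caches: the opposite of the conventional pyramid.
assert total_registers_kb > total_l1_kb
assert total_l1_kb == l2_kb
```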
This simply means it is statistically likely that, whenever the program instructs the processor to acquire a datum from the far level, the program will subsequently, within a reasonable time window, reuse the datum multiple times. Of course, if this reuse is obvious, we should just stick the datum in a register. If the reuse is not obvious, the program we write may blithely attempt to reacquire the datum from the far level. Common names for the program attribute 'high degree of data reuse' are _data-reuse locality_ and _temporal locality_. Suppose some mechanism suspects that a datum about to be acquired from the far level will be used multiple times. Then it is simple common sense to make and keep a copy of the datum in the near level, and to preserve this copy as long as possible.
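The two-level near/far model can be sketched in a few lines. The class name, near-level capacity, and the 1-cycle/400-cycle costs below are illustrative assumptions (the 400-cycle far latency echoes the earlier Little's law example); the point is only that keeping a copy in the near level makes reuse cheap:

```python
class TwoLevelMemory:
    """Illustrative near/far model: the near level keeps copies of data that
    would otherwise be reacquired from the expensive far level."""
    def __init__(self, near_capacity, near_cost=1, far_cost=400):
        self.near_capacity = near_capacity
        self.near = {}                 # datum address -> copy kept near
        self.near_cost, self.far_cost = near_cost, far_cost
        self.cycles = 0

    def acquire(self, addr):
        if addr in self.near:          # reuse: the copy is close, cheap
            self.cycles += self.near_cost
        else:                          # pay the far-level latency
            self.cycles += self.far_cost
            if len(self.near) >= self.near_capacity:
                self.near.pop(next(iter(self.near)))  # evict oldest copy
            self.near[addr] = True     # keep a copy in the near level

mem = TwoLevelMemory(near_capacity=4)
for _ in range(10):      # high temporal locality: reuse the same datum
    mem.acquire(0)
# one far access (400 cycles) + nine near accesses (9 cycles)
assert mem.cycles == 409
```

With no data reuse (ten distinct addresses), the same ten accesses would cost 4,000 cycles: that gap is what the hierarchy's probabilistic illusion rests on.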