Parallel Programming: for Multicore and Cluster Systems - P4

Innovations in hardware architecture, like hyper-threading or multicore processors, mean that parallel computing resources are available for inexpensive desktop computers. In only a few years, many standard software products will be based on concepts of parallel programming implemented on such hardware, and the range of applications will be much broader than that of scientific computing, up to now the main application area for parallel computing.

2 Parallel Computer Architecture

. in the cache. If so, the data is loaded from the cache and no memory access is necessary. Therefore, memory accesses that hit in the cache are significantly faster than memory accesses that require a load from main memory. Since fast memory is expensive, several levels of caches are typically used, starting from a small, fast, and expensive level 1 (L1) cache over several stages (L2, L3) to the large but slow main memory. For a typical processor architecture, an access to the L1 cache takes only 2-4 cycles, whereas an access to main memory can take up to several hundred cycles. The primary goal of cache organization is to reduce the average memory access time as far as possible and to achieve an access time as close as possible to that of the L1 cache. Whether this can be achieved depends on the memory access behavior of the program considered (see Sect. ). Caches are used for single-processor computers, but they also play an important role in SMPs and parallel computers with different memory organizations. SMPs provide a shared address space. If shared data is used by multiple processors, it may be replicated in multiple caches to reduce access latencies. Each processor should have a coherent view of the memory system, i.e., any read access should return the most recently written value, no matter which processor has issued the corresponding write operation.
A coherent view would be destroyed if a processor p changed the value of a memory address in its local cache without writing this value back to main memory. If another processor q later read this memory address, it would not get the most recently written value. But even if p writes the value back to main memory, this may not be sufficient if q holds a copy of the same memory location in its local cache. In this case, it is also necessary to update the copy in the local cache of q. The problem of providing a coherent view of the memory system is often referred to as the cache coherence problem. To