Memory Model Of A Computer
RAM comes in thousands of varieties, defined by a whole digital cornucopia of features. The form factor of the memory module, the type of memory chip on the module, the RAM speed, and other factors all determine exactly what sort of RAM your PC is currently packing. If you're in the market for an upgrade, knowing the exact specs of your RAM will help ensure compatibility.
DDR4
One of the key differences in RAM types you may encounter is between DDR3 and DDR4 RAM, two similar generations of double-data-rate SDRAM (synchronous dynamic random-access memory).
DDR4 operates at a lower voltage than DDR3, but the voltage difference won't be significant for home use. It can, however, make an impact in large-scale computing, such as running servers. While DDR3 speed maxes out at 2133 megatransfers per second (MT/s), DDR4 starts at 2133 MT/s.
Task Manager
Perhaps the most straightforward RAM check you can perform on your PC is via the good old Task Manager. On Windows 10 and earlier versions of the OS, press Ctrl+Alt+Delete and select Task Manager (or press Ctrl+Shift+Esc to open it directly), then click the Performance tab. Here, you'll see a breakdown of your system memory.
This will tell you how many gigabytes of RAM your computer currently has. It will also display the RAM speed, such as 1600 or 1333 MT/s, and its form factor. Most modern desktop PCs use DIMM (dual in-line memory module) RAM, while laptops use SODIMM (small-outline dual in-line memory module) RAM.
CPU-Z RAM Checker Software
CPUID's CPU-Z has served as a longtime standard RAM checker tool for PCs, and it's still free to download and use for Windows 10. After downloading the software from CPUID's site, installing it, and launching it, click the Memory tab to get your PC's detailed RAM specs. CPU-Z not only lists your PC's RAM type (such as DDR3 or DDR4), it lists its speed, size, number of operating channels, NB frequency, DRAM frequency, and even more down-to-the-bone stats, ranging from the RAM's command rate to its row refresh cycle time. Additionally, CPU-Z provides information on your PC's processor name and number, its codename and cache levels, and real-time measurement of each core's internal frequency and the memory frequency.
Memory Model Java
Thomas Sterling, Maciej Brodowicz, in, 2018
7.2.2 Thread Variables
OpenMP is a shared-memory model allowing direct access to global variables by all threads of a user process. To support the SPMD modality of control, in which all concurrent threads run the same code block simultaneously, OpenMP also provides private variables.
These have the same syntactic names, but their scope is limited to the thread in which they are used. Private variables of the same name have different values in each thread in which they occur. A frequently occurring example is the use of index variables to access elements of a vector or array. While all threads use the same index variable name, typically "i", when accessing an element of the shared vector, perhaps "x[i]", the range of values of the index variable differs across the separate concurrent threads. For this to be possible, the index variable has to be private rather than shared. In fact, this particular idiom is so common that the default for such loop index variables is private, although for most other variables the default is shared. Directive clauses are available for users to set these properties of variables explicitly.
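As a minimal sketch of this idiom (the array name, size, and loop body are illustrative and not from the text), the loop index in an OpenMP parallel for is private by default, while the array it indexes is shared:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void)
{
    double x[N];

    /* x is shared by all threads; the loop index i is private by default,
       so each thread iterates over its own chunk of the index range. */
    #pragma omp parallel for shared(x)   /* shared(x) is the default, shown for clarity */
    for (int i = 0; i < N; i++) {
        x[i] = 2.0 * i;
    }

    printf("x[N-1] = %f\n", x[N - 1]);
    return 0;
}
```

Compiled with an OpenMP-enabled compiler (for example, gcc -fopenmp), each thread executes its own share of the iterations of i while all threads write into the single shared array x.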
Another variant on how variables may be used relates to reduction operators such as sum or product. In this specialized case, the reduction variable is a mix of private and global, as discussed in detail in Section 7.5.
Applications in which units operate relatively autonomously are natural candidates for message-passing communication. For example, a home control system has one microcontroller per household device (lamp, thermostat, faucet, appliance, and so on). The devices must communicate relatively infrequently; furthermore, their physical separation is large enough that we would not naturally think of them as sharing a central pool of memory.
Passing communication packets among the devices is a natural way to describe coordination between these devices. Message passing is the natural implementation of communication in many 8-bit microcontrollers that do not normally operate with external memory.
Queues
A queue is a common form of message passing. The queue uses a FIFO discipline and holds records that represent messages. The FreeRTOS.org system provides a set of queue functions. It allows queues to be created and deleted so that the system may have as many queues as necessary.
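A minimal sketch of this pattern using the standard FreeRTOS queue API (the task names, message record, and queue length are illustrative, not from the text):

```c
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"

/* Hypothetical message record passed between a producer and a consumer task. */
typedef struct {
    int device_id;
    int command;
} msg_t;

static QueueHandle_t xMsgQueue;

static void vProducerTask(void *pvParameters)
{
    (void)pvParameters;
    msg_t msg = { .device_id = 1, .command = 42 };
    for (;;) {
        /* Block for up to 10 ms if the queue is full. */
        xQueueSend(xMsgQueue, &msg, pdMS_TO_TICKS(10));
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}

static void vConsumerTask(void *pvParameters)
{
    (void)pvParameters;
    msg_t msg;
    for (;;) {
        /* Block until a message arrives; records are delivered in FIFO order. */
        if (xQueueReceive(xMsgQueue, &msg, portMAX_DELAY) == pdPASS) {
            /* ...act on msg.device_id / msg.command... */
        }
    }
}

int main(void)
{
    /* Queue holding up to 8 records of type msg_t. */
    xMsgQueue = xQueueCreate(8, sizeof(msg_t));
    xTaskCreate(vProducerTask, "prod", configMINIMAL_STACK_SIZE, NULL, 1, NULL);
    xTaskCreate(vConsumerTask, "cons", configMINIMAL_STACK_SIZE, NULL, 1, NULL);
    vTaskStartScheduler();
    for (;;) {}
}
```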
Figure 12.5b. Multibank shared memory model.
The analysis of any one of these architectures would follow a methodology similar to that of the single-CPU case described previously. The system we choose to model here is the multiprocessor case. This is more indicative of realistic systems, where multiple servers are interconnected to serve several users. If we decide to model the system shown in Figure 12.5b, we have essentially the problem of memory allocation to processors.
A processor can have all of the memory, none of the memory, or anything in between. Allocations are done in units of entire memory modules; that is, a CPU cannot share a memory module with another CPU during a cycle. On each CPU cycle, each processor makes a memory request. If there is a free memory module meeting the CPU's request, the request is filled; otherwise, the CPU must wait until the next cycle.
Each memory module for which there is an access request can fill only one request per cycle. When several processors make requests to the same memory module, only one is served (chosen at random from those requesting); the other processors will repeat the same request on the next cycle. New memory requests for each processor are chosen randomly from the M memory modules using a uniform distribution.
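A minimal Monte Carlo sketch of this contention model (not from the text; the processor count, module count, and cycle count are illustrative) estimates the average number of requests served per cycle:

```c
#include <stdio.h>
#include <stdlib.h>

#define P       8        /* number of processors     */
#define M       8        /* number of memory modules */
#define CYCLES  100000   /* simulated CPU cycles     */

int main(void)
{
    int pending[P];      /* module requested by each processor, -1 = none */
    long served = 0;

    srand(12345);        /* fixed seed for a reproducible estimate */
    for (int p = 0; p < P; p++) pending[p] = -1;

    for (long c = 0; c < CYCLES; c++) {
        int count[M] = {0};
        int winner[M];

        /* Each processor without an outstanding request picks a module
           uniformly at random; blocked processors repeat their request. */
        for (int p = 0; p < P; p++) {
            if (pending[p] < 0)
                pending[p] = rand() % M;
        }

        /* For each module, choose one requester uniformly at random
           (reservoir-style selection among those requesting it). */
        for (int p = 0; p < P; p++) {
            int m = pending[p];
            count[m]++;
            if (rand() % count[m] == 0)
                winner[m] = p;
        }
        for (int m = 0; m < M; m++) {
            if (count[m] > 0) {
                pending[winner[m]] = -1;   /* served; issues a new request next cycle */
                served++;
            }
        }
    }

    printf("average requests served per cycle: %.3f\n", (double)served / CYCLES);
    return 0;
}
```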
Dong Ping Zhang, in, 2015
1.5 Threads and Shared Memory
A running program (called a process) may consist of multiple subprograms, each of which maintains its own independent control flow, and which as a group are allowed to run concurrently. These subprograms are called threads. All of the threads executing within a process share some resources (e.g., memory, open files, global variables), but also have independent local storage (e.g., stack, automatic variables). Threads communicate with each other using variables allocated in the globally shared address space. Communication requires synchronization constructs to ensure that multiple threads are not updating the same memory location at the same time. In shared memory systems, all processors have the ability to view the same address space (i.e., they view the same global memory). A key feature of the shared memory model is that the programmer is not responsible for managing data movement. For these types of systems, it is especially important that the programmer and hardware have a well-defined agreement concerning updates to global data shared between threads. This agreement is called a memory consistency model.
Memory consistency models are often supported in programming languages using higher-level constructs such as mutexes and semaphores, or acquire/release semantics as in the case of Java, C/C++11, and OpenCL. By having the programmer explicitly inform the hardware when certain types of synchronization must be performed, the hardware is able to execute concurrent programs with much greater efficiency. As the number of processor cores increases, there is a significant cost to supporting shared memory in hardware. The length of wires (latency, power), the number of interfaces between hardware structures, and the number of shared bus clients quickly become limiting factors.
The extra hardware required typically grows exponentially in complexity as additional processors are added. This has slowed the introduction of multicore and multiprocessor systems at the low end, and it has limited the number of cores working together in a consistent shared memory system to relatively low numbers, because shared buses and coherence protocol overheads become bottlenecks. More relaxed shared memory systems scale better, although in all cases, scaling the number of cores in a shared memory system comes at the cost of complicated and expensive interconnects. Most multicore CPU platforms support shared memory in one form or another. OpenCL supports execution on shared memory devices.
Jang, in, 2017
1 Introduction
Thread communication and synchronization play an important role in parallel computing. Thread communication refers to information exchange among concurrent threads, either via shared memory or via a message-passing mechanism.
In shared memory systems, a thread can update variables and make them visible to other threads. In contrast, a message passing mechanism allows threads in separate memory spaces to communicate with each other.
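A minimal sketch of one thread making an update visible to another in a shared-memory system (not from the text; it assumes C11 <stdatomic.h> with POSIX threads, and the variable names are illustrative):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int payload;              /* ordinary shared data         */
static atomic_int ready = 0;     /* flag used to publish payload */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                              /* write the data   */
    atomic_store_explicit(&ready, 1, memory_order_release);    /* then publish it  */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    /* Spin until the flag is observed; the acquire load guarantees that
       the earlier write to payload is visible once ready == 1 is seen. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;
    printf("payload = %d\n", payload);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```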
Graphics processing units (GPUs) adopt a shared memory model; thus threads communicate through memories such as local and global memory. As with other shared memory parallel systems, GPUs encounter the following main issues associated with shared memory access.
Mutual exclusion: Concurrently running threads may attempt to update the same memory location simultaneously.
Without a mechanism to guarantee mutual exclusion, shared variables may become corrupted, causing a nondeterministic result. Such a scenario is called a race condition or data race. One way to avoid it is to protect shared memory using atomic operations. An atomic operation grants serialized exclusive access to shared data. Another way is to divide a program into phases so that threads in a phase can update shared data safely, without any interference from the execution of other phases. A barrier divides a program into phases by ensuring that all threads reach the barrier before any of them can proceed past it; therefore, the phases before and after the barrier cannot be concurrent with each other [1].
Ordering: Shared memory may appear inconsistent among threads without an ordering policy. This type of problem usually arises when the compiler or hardware reorders load and store operations within a thread or among threads. A memory consistency model is used to resolve this issue. GPUs adopt a relaxed consistency model, where ordering is guaranteed only when the programmer explicitly uses a mechanism called a fence.
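A minimal OpenCL C kernel sketch illustrating both mechanisms (not taken from the chapter; the kernel name, arguments, and reduction pattern are illustrative): barriers separate the phases of a per-work-group reduction, and an atomic add serializes updates to a shared global total.

```c
/* Assumptions: the host initializes *total to 0, allocates the local scratch
   buffer with one int per work-item, and uses a power-of-two work-group size. */
__kernel void sum_reduce(__global const int *input,
                         __global int *total,
                         __local  int *scratch)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);            /* phase boundary: all loads done */

    /* Simple tree reduction within the work-group. */
    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);        /* each step is a separate phase */
    }

    /* One work-item per group merges its partial sum into the global result
       atomically, so concurrent work-groups cannot corrupt the total. */
    if (lid == 0)
        atomic_add(total, scratch[0]);
}
```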
Thread synchronization is a mechanism for resolving the problems that arise from concurrent access to shared memory. It prevents data corruption caused by unconstrained shared memory accesses. In this chapter, we present the thread communication and synchronization mechanisms found in modern GPUs.
We also discuss the performance implications of some techniques and include examples. Understanding thread communication and synchronization is essential not only to writing correct code but also to fully exploiting the GPU's parallel computation power. Throughout the chapter, we use OpenCL terminology, except when Compute Unified Device Architecture (CUDA)-specific techniques are presented. The rest of the chapter is structured as follows: Section 2 discusses coarse-grained synchronization operations using explicit barriers and the implicit hardware thread batch. Section 3 presents native atomic functions available for shared variables with regular data types.
Section 4 presents fine-grained synchronization operations supported in OpenCL 2.0.
Maurice Herlihy, Sergio Rajsbaum, in, 2014
14.1 Model
As usual, a model of computation is given by a set of process names, a communications medium such as shared memory or message passing, a timing model such as synchronous or asynchronous, and a failure model given by an adversary. In this chapter, we restrict our attention to asynchronous shared-memory models with the wait-free adversary (any subset of processes can fail). Recall that a protocol (I, P, Ξ) solves a (colored) task (I, O, Δ) if there is a color-preserving simplicial map δ: P → O carried by Δ: for every simplex σ in I, δ(Ξ(σ)) ⊆ Δ(σ). Operationally, each process finishes the protocol in a local state that is a vertex of P of matching color and then applies δ to choose an output value at a vertex of matching color in O. As before, a reduction is defined in terms of two models of computation, a model R (called the real model) and a model V (called the virtual model).
They have the same set of process names and the same colored input complex I, but their protocol complexes may differ. For example, the real model might be snapshot memory, whereas the simulated model might be layered immediate snapshot memory.
Scott, in, 2009
12.2.1 Communication and Synchronization
In any concurrent programming model, two of the most crucial issues to be addressed are communication and synchronization. Communication refers to any mechanism that allows one thread to obtain information produced by another. Communication mechanisms for imperative programs are generally based on either shared memory or message passing. In a shared-memory programming model, some or all of a program's variables are accessible to multiple threads.
For a pair of threads to communicate, one of them writes a value to a variable and the other simply reads it. In a message-passing programming model, threads have no common state. For a pair of threads to communicate, one of them must perform an explicit send operation to transmit data to another. Synchronization refers to any mechanism that allows the programmer to control the relative order in which operations occur in different threads. Synchronization is generally implicit in message-passing models: a message must be sent before it can be received. If a thread attempts to receive a message that has not yet been sent, it will wait for the sender to catch up. Synchronization is generally not implicit in shared-memory models: unless we do something special, a "receiving" thread could read the "old" value of a variable, before it has been written by the "sender."
Design & Implementation: Hardware and Software Communication
As described in Section 12.1.2, the distinction between shared memory and message passing applies not only to languages and libraries but also to computer hardware. It is important to note that the model of communication and synchronization provided by the language or library need not necessarily agree with that of the underlying hardware. It is easy to implement message passing on top of shared-memory hardware.
With a little more effort, one can also implement shared memory on top of message-passing hardware. Systems in this latter camp are sometimes referred to as software distributed shared memory (S-DSM). In both shared-memory and message-based programs, synchronization can be implemented either by spinning (also called busy-waiting) or by blocking. In busy-wait synchronization, a thread runs a loop in which it keeps reevaluating some condition until that condition becomes true (e.g., until a message queue becomes nonempty or a shared variable attains a particular value), presumably as a result of action in some other thread, running on some other processor.
Note that busy-waiting makes no sense on a uniprocessor: we cannot expect a condition to become true while we are monopolizing a resource (the processor) required to make it true. (A thread on a uniprocessor may sometimes busy-wait for the completion of I/O, but that's a different situation: the I/O device runs in parallel with the processor.) In blocking synchronization (also called scheduler-based synchronization), the waiting thread voluntarily relinquishes its processor to some other thread. Before doing so, it leaves a note in some data structure associated with the synchronization condition. A thread that makes the condition true at some point in the future will find the note and take action to make the blocked thread run again.
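A minimal sketch contrasting the two approaches (not from the text; it assumes C11 atomics and POSIX threads, and all names are illustrative): the first waiter spins on a flag, while the second blocks on a condition variable and is woken by the thread that makes the condition true.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <pthread.h>

/* --- Busy-wait (spinning) ------------------------------------------ */
static atomic_bool flag = false;

void spin_wait(void)
{
    while (!atomic_load(&flag))    /* keep reevaluating the condition */
        ;                          /* burns CPU until another thread sets the flag */
}

void spin_signal(void)
{
    atomic_store(&flag, true);
}

/* --- Blocking (scheduler-based) ------------------------------------ */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

void block_wait(void)
{
    pthread_mutex_lock(&lock);
    while (!ready)                       /* "leave a note" and give up the CPU */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

void block_signal(void)
{
    pthread_mutex_lock(&lock);
    ready = true;
    pthread_cond_signal(&cond);          /* find the note; wake the blocked thread */
    pthread_mutex_unlock(&lock);
}

/* Small demonstration of the blocking variant; the spinning pair is used analogously. */
static void *waiter_thread(void *arg)
{
    (void)arg;
    block_wait();
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, waiter_thread, NULL);
    block_signal();
    pthread_join(t, NULL);
    return 0;
}
```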
We will consider synchronization again briefly in Section 12.2.4, and then more thoroughly in Section 12.3.
Jack Dongarra, Mustafa Tikir, in, 2008
4.1.0.4 The PGAS Languages
In contrast to the message-passing model, the Partitioned Global Address Space (PGAS) languages provide each process direct access to a single globally addressable space. Each process has local memory and access to the shared memory [5]. This model is distinguishable from a symmetric shared-memory model in that shared memory is logically partitioned, so a notion of near and far memory is explicit in each of the languages. This allows programmers to control the layout of shared arrays and of more complex pointer-based structures. The PGAS model is realized in three existing languages, each presented as an extension to a familiar base language: UPC (Unified Parallel C) [58] for C, Co-Array Fortran (CAF) [65] for Fortran, and Titanium [68] for Java. The three PGAS languages make references to shared memory explicit in the type system, which means that a pointer or reference to shared memory has a type that is distinct from references to local memory.
These mechanisms differ across the languages in subtle ways, but in all three cases the ability to statically separate local and global references has proven important in performance tuning. On machines lacking hardware support for global memory, a global pointer encodes a node identifier along with a memory address, and when the pointer is dereferenced, the runtime must deconstruct this pointer representation and test whether the node is the local one. This overhead is significant for local references and is avoided in all three languages for expressions that are statically known to be local. This allows the compiler to generate code that uses a simpler (address-only) representation and avoids the test on dereference. These three PGAS languages share with the strict message-passing model a number of processes fixed at job start time, with identifiers for each process. This results in a one-to-one mapping between processes and memory partitions and allows for very simple runtime support, since the runtime has only a fixed number of processes to manage and these typically correspond to the underlying hardware processors.
The languages run on shared memory hardware, distributed memory clusters and hybrid architectures. Each of these languages is the focus of current compiler research and implementation activities, and a number of applications rely on them. All three languages continue to evolve based on application demand and implementation experience, a history that is useful in understanding requirements for the HPCS languages.
UPC and Titanium have a set of collective communication operations that gang the processes together to perform reductions, broadcasts, and other global operations, and there is a proposal to add such support to CAF. UPC and Titanium do not allow collectives or barrier synchronization to be done on subsets of processes, but this feature is often requested by users. UPC has parallel I/O support modeled after MPI's, and Titanium has bulk I/O facilities as well as support to checkpoint data structures, based on Java's serialization interface.
All three languages also have support for critical regions, and there are experimental efforts to provide atomic operations. The distributed array support in all three languages is fairly rigid, a reaction to the implementation challenges that plagued the High Performance Fortran (HPF) effort. In UPC, distributed arrays may be blocked, but there is only a single blocking factor that must be a compile-time constant; in CAF, the blocking factors appear in separate ‘co-dimensions’; and Titanium does not have built-in support for distributed arrays, but they are programmed in libraries and applications using global pointers and a built-in all-to-all operation for exchanging pointers. There is an ongoing tension in this area of language design, most visible in the active UPC community, between the generality of distributed array support and the desire to avoid significant runtime overhead.
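A minimal UPC sketch of a blocked shared array (not from the text; the array, blocking factor, and loop are illustrative, with the blocking factor a compile-time constant as the language requires):

```c
#include <upc.h>
#include <stdio.h>

#define B 64                         /* single compile-time blocking factor */

/* Shared array distributed across threads in blocks of B elements;
   THREADS appears once in the dimension so the declaration is valid whether
   the thread count is fixed at compile time or at job launch. */
shared [B] double a[B * THREADS];

int main(void)
{
    int i;

    /* Each thread updates only the elements with affinity to it;
       the affinity expression &a[i] makes the ownership explicit. */
    upc_forall (i = 0; i < B * THREADS; i++; &a[i]) {
        a[i] = 2.0 * i;
    }

    upc_barrier;                     /* make all updates visible before reading */

    if (MYTHREAD == 0)
        printf("a[last] = %f (computed by %d threads)\n",
               a[B * THREADS - 1], THREADS);

    return 0;
}
```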