This textbook, Microeconomics: Theory Through Applications, centers on student needs and expectations through two premises: … Students are motivated to study economics if they see that it relates to their own lives. Many books claim to present economics in a way that is digestible for students; Russell and Andrew have truly created one from scratch. This textbook will help you increase students' economic literacy both by developing their aptitude for economic thinking and by presenting key insights about economics that every educated individual should know.
Applications Ahead of Theory: The authors present all the theory that is standard in principles books, but by beginning with applications, students learn why that theory is needed. Each chapter is built around a particular business or policy application, such as minimum wages, the stock exchange, and auctions.
Why take this approach? Traditional courses focus too much on abstract theory relative to the interests and capabilities of the average undergraduate: students are rarely engaged, and the formal theory is never integrated into the way they think about economic issues. Traditional books are also organized around theoretical constructs that mean nothing to students. This book instead introduces tools and ideas as and when they are needed.
Each chapter is designed with two goals: first, to engage students with an application they care about; second, to use that application as a vehicle for teaching the principles of economics. Learning through Repetition: Important tools appear over and over again, allowing students to learn from repetition and to see how one framework can be useful in many different contexts. Each piece of economic theory in this text is first introduced and explained in the context of a specific application. Most are reused in other chapters, so students see them in action on multiple occasions. As students progress through the book, they accumulate a set of techniques and ideas.
A Student's Table of Contents vs. An Instructor's Table of Contents: There is no better proof that Russell and Andrew have created a book aimed specifically at educating students about economics than their two tables of contents. The Student's Table of Contents speaks to students, piquing their interest and drawing them into the economics, while the Instructor's Table of Contents is organized around the economics to better help you structure your teaching (and frankly, you don't need to be made excited by economics; you already are).
He has taught principles of economics at many of these universities, as well as numerous courses to PhD students. He received his PhD from the University of Pennsylvania and was elected a Fellow of the Econometric Society. Andrew received his undergraduate degree in economics from the University of Dublin, Trinity College, and his PhD in economics from Yale University before joining Melbourne Business School. Andrew has consulting experience in the areas of marketing, economics, and strategy.
He has worked with clients in Australia, Europe, and throughout the Asia-Pacific region. He has extensive experience in the pharmaceutical industry and has also worked with firms in the consumer goods and consulting sectors. His research interests include state-dependent pricing models, environmental economics, coordination games, and consumer boycotts. His work is widely cited in economics journals.
Following this line of thought, we present an EMD-based symbolization method to extract the detailed features of the original PIR signal and generate five symbol sequences for further analysis. Because the details of the PIR signals usually have small time scales and the EMD always arranges the IMF components in ascending order of their time scales [ 14 ], the details of the original PIR signal are primarily extracted into the first few IMF components.
For this reason, we can just use the first five IMF components for further feature extraction. To extract the feature information contained in the generated IMF components, as shown in Figure 6 , the second step of the proposed symbolization method is partitioning the data space of the IMF components for generating five symbol sequences.
In the partitioning procedure, the data space of the i th IMF component C i is first partitioned into a number of mutually exclusive and exhaustive regions using the maximum entropy partition method [ 18 ], which ensures that information-rich regions of the IMF component are partitioned more finely and regions with sparse information more coarsely.
Finally, each data point in C i is assigned to a particular symbol according to the region into which it falls; in this way, we generate a symbol sequence S i from the i th IMF component. By repeating this partitioning procedure on each generated IMF component, we obtain five symbol sequences, as shown in Figure 6 c. It has been shown in [ 20 ] that the essential robust features of the original signal can be extracted into a symbol sequence through an appropriate partition, but these features still cannot be used directly for classification.
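As a rough illustration of this partitioning and symbolization step, the following Python sketch splits one IMF component into equal-occupancy regions and maps each sample to a symbol. The function names and sample values are ours, not the paper's, and the quantile-based split is only one common realization of a maximum entropy partition:

```python
# Sketch of the maximum-entropy partitioning and symbolization step.

def max_entropy_partition(data, n_symbols):
    """Place region boundaries at empirical quantiles so that each region
    holds roughly the same number of points: dense (information-rich)
    regions get finer boundaries, sparse regions coarser ones."""
    ordered = sorted(data)
    n = len(ordered)
    return [ordered[(i * n) // n_symbols] for i in range(1, n_symbols)]

def symbolize(data, boundaries):
    """Map each sample to the index of the region it falls in."""
    symbols = []
    for x in data:
        s = 0
        while s < len(boundaries) and x >= boundaries[s]:
            s += 1
        symbols.append(s)
    return symbols

# A toy IMF component and its 4-symbol sequence:
imf = [0.1, -0.4, 0.05, 0.3, -0.2, 0.25, -0.05, 0.15]
bounds = max_entropy_partition(imf, 4)
seq = symbolize(imf, bounds)   # one symbol per sample, alphabet {0, 1, 2, 3}
```

Repeating this on each of the five IMF components yields the five symbol sequences described above.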
Therefore, further analysis is needed to generate appropriate feature descriptors for the original PIR signal. An example of constructing a PFSA from a symbol sequence: the left-hand side of the figure shows the generation of the state sequence and the right-hand side shows the constructed PFSA, in which each circle indicates a state of the PFSA and each arrow indicates a transition between states.
Then we use the generated state sequence to estimate the probability of a transition between two states according to Equation 5. After obtaining the transition probability matrix, we calculate its left eigenvectors and generate one feature descriptor of the original PIR signal according to Equation 7. By repeating this feature descriptor generation procedure on each of the five symbol sequences, we generate five feature descriptors for each PIR signal.
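Equations 5 and 7 are not reproduced above, so the following sketch relies on common symbolic-dynamics conventions rather than the paper's exact formulas: transition probabilities are estimated by counting consecutive state pairs, and the dominant left eigenvector of the resulting stochastic matrix (its stationary distribution, approximated here by power iteration) serves as the feature descriptor:

```python
def transition_matrix(states, n_states):
    """Estimate P[i][j] = Pr(next state j | current state i) by counting
    consecutive pairs in the state sequence, then row-normalizing."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1.0
    for row in counts:
        total = sum(row)
        if total > 0:
            for j in range(n_states):
                row[j] /= total
    return counts

def stationary_distribution(P, iters=200):
    """Approximate the dominant left eigenvector of P (the long-run state
    visit probabilities) by power iteration; used here as the feature
    descriptor of one symbol sequence."""
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        s = sum(v)
        v = [x / s for x in v]
    return v

# A toy state sequence over 4 states and its descriptor:
seq = [2, 0, 1, 3, 0, 3, 1, 2, 0, 1, 2, 3]
P = transition_matrix(seq, 4)
descriptor = stationary_distribution(P)   # sums to 1
```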
Because the IMF components are numerically orthogonal to each other [ 14 ] and the symbolization procedures of these IMF components are mutually independent (Section 3), the five feature descriptors capture complementary information. After generating five feature descriptors for the original PIR signal, we next need to fuse the feature information they contain in order to classify the original PIR signal. Therefore, inspired by the boosting algorithm [ 22 ], we propose a weighted voting classification strategy with five classifiers.
The proposed classification strategy is shown in Figure 8. First, we separately classify the generated feature descriptors with five classifiers and obtain five reference classes for the original PIR signal. Then, according to these reference classes, the five classifiers vote for the final class of the original PIR signal through a weighted voting mechanism. Construction of the weighted voting classification strategy, where C i indicates the i th classifier. Specifically, we assign higher voting weights to classifiers with low classification errors, whereas classifiers with high classification errors receive lower voting weights.
In this way, a more accurate classifier can play a greater role in determining the final class of the original PIR signal. Algorithm 1 gives the voting weight determination procedure of the proposed classification strategy. Train the j th classifier and calculate its classification error e j , defined by Equation 8, with the feature set F j using the five-fold cross-validation procedure recommended by [ 23 ], where C j indicates the discriminant function of the j th classifier and I is the indicator function.
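The paper's Equations 8 and 9 are not shown above, so the weight formula in this sketch is an assumption: a common boosting-style choice that maps lower classification error to higher voting weight, followed by the weighted vote itself:

```python
import math

def voting_weights(errors):
    """Boosting-style weights, normalized to sum to one. The log-odds
    form log((1 - e) / e) is our assumption; the paper's Equation 9
    may differ, but any monotone mapping from low error to high
    weight has the same effect."""
    raw = [math.log((1.0 - e) / e) for e in errors]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_vote(predictions, weights):
    """Each classifier votes for its predicted class with its weight;
    the class with the largest total weight wins."""
    tally = {}
    for cls, w in zip(predictions, weights):
        tally[cls] = tally.get(cls, 0.0) + w
    return max(tally, key=tally.get)

# Hypothetical cross-validation errors for the five classifiers:
w = voting_weights([0.05, 0.10, 0.20, 0.15, 0.08])
final = weighted_vote(["human", "human", "dog", "human", "dog"], w)
```

Here the three "human" votes carry more combined weight than the two "dog" votes, so the final class is "human".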
To verify the effectiveness of the proposed method, we establish two databases of human and nonhuman PIR signals.
The contents of these two databases are shown in Table 1. Note that the PIR signals in Database 1 are collected from adults, dogs, and the warm wind generated by an air conditioner in different indoor environments, whereas those in Database 2 are collected from adults, dogs, and geese in different outdoor environments. All PIR signals are collected with a commercial PIR detector and a data acquisition card using a sampling frequency of Hz and a duration of 2 s.
With these two databases, we perform multiple experiments to compare the performance of the proposed method with the WPE [ 6 ], AR [ 7 ], and SDF-based [ 24 ] (without wavelet preprocessing) feature extraction methods. The experiments on Database 1 are used to verify the effectiveness of the proposed method for indoor intrusion detection.
We choose the dog and warm wind as the nonhuman targets because the false alarms of the PIR detector in the indoor environment are mainly caused by pets and heating apparatus. Generally, it may be more reasonable to use a one-class classifier for human and nonhuman recognition, but because there is only a small range of disturbance sources for the PIR detector in the indoor environment, we can, besides one-class classifiers, also adopt the support vector machine (SVM) as the classifier of the proposed classification strategy to obtain higher classification accuracy.
Because SVM is a binary classifier, as shown in Figure 9 , we adopt a two-layer recognition procedure to deal with the three-class (i.e., adult, dog, and warm wind) recognition problem. Each layer contains a feature extraction and classification procedure as elaborated in Section 3. As described in Section 3, to determine the voting weights of the classifiers, we first divide Database 1 equally into two datasets, termed Subset 1 and Subset 2, respectively; we then use Subset 1 to execute the voting weight determination procedure and Subset 2 to perform the comparative experiments.
The classification errors of the classifiers in each recognition layer are summarized in Table 2 , where SVM j indicates the j th SVM of the weighted voting classification strategy and RL m indicates the m th recognition layer. From these classification errors, we calculate the voting weights of the classifiers according to Equation 9; the voting weights are summarized in Table 3 , and their sum has been normalized to one. Classification errors of the classifiers in each recognition layer for the recognition of adults, dogs, and warm wind in the indoor environment. Voting weights of the classifiers in each recognition layer for the recognition of adults, dogs, and warm wind in the indoor environment.
After determining the voting weights, we use the leave-one-out cross-validation procedure recommended by [ 23 ] to optimize the remaining parameters. The parameter tuning and performance evaluation of the compared methods also use the above-mentioned leave-one-out procedure. The experimental results are listed in Table 4 as a confusion matrix, where the rows are the actual classes and the columns are the predicted classes. Recall, precision, and the overall recognition accuracy of each method in the comparative experiments on Database 1.
We can see from Table 5 that, for the recognition of the warm wind, the proposed method shows only slightly better performance than the other three methods, but for the recognition of adults and dogs it performs remarkably better: only two human targets are missed and no false alarms emerge. EMD is a self-adaptive signal decomposition method; unlike the wavelet decomposition, it needs no basis functions [ 14 ], so the proposed method avoids the performance degradation caused by a bad choice of wavelet bases.
EMD is also a suitable tool for non-stationary signal processing; therefore, the proposed method can provide a more reliable recognition result than AR-based feature extraction, because the PIR signals are sometimes non-stationary. As for SDF, it has the advantages of low computational complexity and memory requirements [ 9 ], but because it coarse-grains the original PIR signal, it may lose the feature information contained in the signal details, which is closely related to the body shape features of the human and nonhuman targets. The proposed method can be viewed as an improvement of SDF: because EMD makes it possible to analyze the detailed components of the original PIR signal separately, the proposed method can effectively extract the discriminable features of human and nonhuman PIR signals.
However, we cannot claim that our method is better than the other three methods for all applications, because the introduction of EMD also increases computational complexity and memory requirements. The requirements of recognition accuracy and computing speed must be weighed against each other. The experiments on Database 2 verify the effectiveness of the proposed method for outdoor pedestrian detection.
Similar to the experiments on Database 1, to determine the voting weights of the classifiers in the weighted voting classification strategy, we divide Database 2 equally into two subsets, termed Subset 3 and Subset 4, respectively. Subset 3 is used to perform the voting weight determination procedure elaborated in Section 3. The classification errors of the classifiers are summarized in Table 6 , where SVM j indicates the j th SVM of the weighted voting classification strategy and RL m indicates the m th recognition layer.
From these classification errors, we calculate the voting weights of the classifiers using Equation 9; the normalized voting weights are listed in Table 7. Classification errors of the classifiers for the recognition of adults, dogs, and geese in the outdoor environment. Voting weights of the classifiers for the recognition of adults, dogs, and geese in the outdoor environment.
After determining the voting weights, to verify the effectiveness of the proposed method for outdoor pedestrian detection, we implement the proposed method on Subset 4 and compare its performance with the WPE, AR-based, and SDF-based feature extraction methods. The parameter selection and performance evaluation of these methods are identical to those in Section 4.
The classification results are summarized in Table 8. The recall, precision, and overall recognition accuracy of each method are listed in Table 9. Recall, precision, and the overall recognition accuracy of each method in the comparative experiments on Database 2. We can see from Table 8 and Table 9 that the proposed method obtains the highest recognition accuracy: only three human targets are missed and no false alarms emerge. Although the most common nonhuman subjects in our applications are dogs and geese, or other wild animals with similar body shapes, there may be more disturbance sources in other applications.
Therefore, we recommend that readers use a flexible strategy to deal with the multiclass classification problem, for example, combining all nonhuman targets together and classifying them against the human ones, or adding more layers to the recognition procedure elaborated in Section 4. One-class classifiers, such as support vector data description (SVDD) [ 25 ], are also recommended.
To reduce PIR detector false alarms, after analyzing the inherent differences between human and nonhuman PIR signals using a mathematical model of a PIR detector, we first propose an EMD-based symbolization method for generating five symbol sequences containing the detailed feature information of the original PIR signals. Then, we construct multiple PFSA from the generated symbol sequences to extract five feature descriptors of the original PIR signal. Finally, we use a weighted voting classification strategy to fuse the features represented by the feature descriptors and classify the original PIR signals.
To verify the effectiveness of the proposed method, we executed comparative experiments on two databases, comparing the proposed method with the WPE, SDF-based, and AR-based feature extraction methods. The experimental results show that the proposed method can effectively reduce the false alarms of a PIR detector. The authors express deep appreciation to A.
Ray and S. Bahrampour for providing the SDF code, and thank P. Flandrin for providing the EMD toolbox. The authors are also grateful to L. Yu and G. Zhao for their sincere assistance in the experimental data collection. Finally, we want to thank the editors and reviewers for their constructive and valuable suggestions. The work presented in this paper is a collaborative development by four authors. Jiaduo Zhao designed the methods and experiments, developed the data collection modules, and performed the comparative experiments.
Weiguo Gong defined the research theme and guided the data analysis. Yuzhen Tang and Weihong Li performed the data collection and data analysis. Weak scaling is a measure of how the time to solution changes as more processors are added to a system with a fixed problem size per processor; i.e., the overall problem size grows with the number of processors. Weak scaling is often equated with Gustafson's Law, which states that in practice, the problem size scales with the number of processors. Because of this, the maximum speedup S of a program is S = N + (1 - N)s, where N is the number of processors and s is the fraction of total execution time spent on the serial part of the program. Another way of looking at Gustafson's Law is that it is not the problem size that remains constant as we scale up the system but rather the execution time.
Note that Gustafson's Law assumes that the ratio of serial to parallel execution remains constant, reflecting additional cost in setting up and handling the larger problem. Understanding which type of scaling is most applicable to an application is an important part of estimating speedup. For some applications the problem size will remain constant and hence only strong scaling is applicable. An example would be modeling how two molecules interact with each other, where the molecule sizes are fixed.
For other applications, the problem size will grow to fill the available processors. Examples include modeling fluids or structures as meshes or grids and some Monte Carlo simulations, where increasing the problem size provides increased accuracy. Having understood the application profile, the developer should understand how the problem size would change if the computational performance changes and then apply either Amdahl's or Gustafson's Law to determine an upper bound for the speedup.
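As a quick numerical illustration of the two bounds (Gustafson's formula for S as stated earlier, and Amdahl's Law for strong scaling), the serial fraction and processor count below are arbitrary examples:

```python
# Illustrative upper bounds on speedup for N processors, given the
# fraction s of execution time that is serial.

def amdahl_speedup(s, n):
    """Strong scaling (fixed total problem size): S = 1 / (s + (1 - s)/N)."""
    return 1.0 / (s + (1.0 - s) / n)

def gustafson_speedup(s, n):
    """Weak scaling (problem size grows with N): S = N + (1 - N) * s."""
    return n + (1.0 - n) * s

# With 10% serial work on 64 processors:
strong = amdahl_speedup(0.10, 64)     # ~8.8: the serial fraction dominates
weak = gustafson_speedup(0.10, 64)    # ~57.7: near-linear scaling
```

The gap between the two numbers shows why identifying which scaling regime applies matters so much when estimating speedup.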
There are several key strategies for parallelizing sequential code. While the details of how to apply these strategies to a particular application are a complex and problem-specific topic, the general themes listed here apply regardless of whether we are parallelizing code to run on multicore CPUs or on CUDA GPUs. The most straightforward approach to parallelizing an application is to leverage existing libraries that take advantage of parallel architectures on our behalf. The key here is that libraries are most useful when they match well with the needs of the application.
Thrust provides a rich collection of data parallel primitives such as scan, sort, and reduce, which can be composed together to implement complex algorithms with concise, readable source code. By describing your computation in terms of these high-level abstractions you provide Thrust with the freedom to select the most efficient implementation automatically.
As a result, Thrust can be utilized in rapid prototyping of CUDA applications, where programmer productivity matters most, as well as in production, where robustness and absolute performance are crucial. Another common approach to parallelization of sequential codes is to make use of parallelizing compilers.
Often this means the use of directive-based approaches, where the programmer uses a pragma or other similar notation to provide hints to the compiler about where parallelism can be found, without needing to modify or adapt the underlying code itself. By exposing parallelism to the compiler, directives allow the compiler to do the detailed work of mapping the computation onto the parallel architecture. The details of managing the accelerator device are handled implicitly by an OpenACC-enabled compiler and runtime. We can then launch the resulting kernel onto the GPU and retrieve the results without requiring major rewrites to the rest of our application.
This approach is most straightforward when the majority of the total running time of our application is spent in a few relatively isolated portions of the code. More difficult to parallelize are applications with a very flat profile - i.e., applications where the time spent is spread fairly evenly across a wide portion of the code base. For the latter variety of application, some degree of code refactoring to expose the inherent parallelism might be necessary, but keep in mind that this refactoring work will tend to benefit all future architectures, CPU and GPU alike, so it is well worth the effort should it become necessary.
Obtaining the right answer is clearly the principal goal of all computation. On parallel systems, it is possible to run into difficulties not typically found in traditional serial-oriented programming. These include threading issues, unexpected values due to the way floating-point values are computed, and challenges arising from differences in the way CPU and GPU processors operate. This chapter examines issues that can affect the correctness of returned data and points to appropriate solutions. A key aspect of correctness verification for modifications to any existing program is to establish some mechanism whereby previous known-good reference outputs from representative inputs can be compared to new results.
After each change is made, ensure that the results match using whatever criteria apply to the particular algorithm. Some will expect bitwise identical results, which is not always possible, especially where floating-point arithmetic is concerned; see Numerical Accuracy and Precision regarding numerical accuracy. For other algorithms, implementations may be considered correct if they match the reference within some small epsilon.
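A minimal sketch of such an epsilon-based comparison; the tolerance values are illustrative, not prescribed, and should be chosen per algorithm:

```python
import math

# Compare a new result against a known-good reference within a
# tolerance instead of demanding bitwise identity.

def within_epsilon(reference, result, rel=1e-5, abs_tol=1e-8):
    """True if every element of `result` matches `reference` within
    the given relative/absolute tolerances."""
    return all(math.isclose(r, x, rel_tol=rel, abs_tol=abs_tol)
               for r, x in zip(reference, result))

reference = [1.0, 2.0, 3.0]
new_result = [1.0000001, 1.9999999, 3.0000002]
ok = within_epsilon(reference, new_result)          # True: within tolerance
bad = within_epsilon(reference, [1.0, 2.0, 3.1])    # False: 3.1 is too far off
```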
Note that the process used for validating numerical results can easily be extended to validate performance results as well. We want to ensure that each change we make is correct and that it improves performance (and by how much). Checking these things frequently as an integral part of our cyclical APOD process will help ensure that we achieve the desired results as rapidly as possible. A useful counterpart to the reference comparisons described above is to structure the code itself in such a way that it is readily verifiable at the unit level.
For example, many kernels have complex addressing logic for accessing memory in addition to their actual computation. If we validate our addressing logic separately prior to introducing the bulk of the computation, then this will simplify any later debugging efforts. Note that the CUDA compiler considers any device code that does not contribute to a write to global memory as dead code subject to elimination, so we must at least write something out to global memory as a result of our addressing logic in order to successfully apply this strategy.
If there are differences, then those differences will be seen early and can be understood in the context of a simple function. Incorrect or unexpected results arise principally from issues of floating-point accuracy due to the way floating-point values are computed and stored. The following sections explain the principal items of interest. Devices of compute capability 1.
Results obtained using double-precision arithmetic will frequently differ from the same operation performed via single-precision arithmetic due to the greater precision of the former and due to rounding issues. Therefore, it is important to be sure to compare values of like precision and to express the results within a certain tolerance rather than expecting them to be exact.
Each floating-point arithmetic operation involves a certain amount of rounding. Consequently, the order in which arithmetic operations are performed is important. When you parallelize computations, you potentially change the order of operations and therefore the parallel results might not match sequential results. This limitation is not specific to CUDA, but an inherent part of parallel computation on floating-point values. When comparing the results of computations of float variables between the host and device, make sure that promotions to double precision on the host do not account for different numerical results.
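The order-dependence of floating-point addition is easy to demonstrate; this self-contained snippet uses deliberately extreme magnitudes to make the effect visible:

```python
# The same four values summed in two different orders give two
# different results, because the intermediate rounding differs.
a = [1.0, 1e100, 1.0, -1e100]

left_to_right = sum(a)                     # 0.0: each 1.0 is absorbed by 1e100
regrouped = (a[1] + a[3]) + (a[0] + a[2])  # 2.0: the large terms cancel first
```

A parallel reduction regroups additions in exactly this way, which is why its result can legitimately differ from a sequential sum.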
For example, if the code segment. However, if the code were performed on the host, the literal 1.
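The code segment referred to above is elided here; the following Python sketch reproduces the same promotion effect by explicitly rounding values to single precision (the `to_float32` helper is ours, standing in for a device float variable):

```python
import struct

def to_float32(x):
    """Round a Python double to the nearest IEEE-754 single-precision
    value, mimicking what a float variable holds on the device."""
    return struct.unpack('f', struct.pack('f', x))[0]

host = 0.1 * 3.0   # both operands held in double precision on the host
device = to_float32(to_float32(0.1) * to_float32(3.0))   # single-precision path

# host and device differ in their low-order bits, purely because the
# operands were held at different precisions before multiplying.
```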
If, however, the literal 1. To ensure that computations use single-precision arithmetic, always use float literals. In addition to accuracy, the conversion between doubles and floats (and vice versa) has a detrimental effect on performance, as discussed in Instruction Optimization. One of the key differences is the fused multiply-add (FMA) instruction, which combines multiply-add operations into a single instruction execution.
Its result will often differ slightly from results obtained by doing the two operations separately. The results of these calculations can frequently differ from pure bit operations performed on the CUDA device. To get a closer match between values, set the x86 host processor to use regular double or single precision (64 bits and 32 bits, respectively). When attempting to optimize CUDA code, it pays to know how to measure performance accurately and to understand the role that bandwidth plays in performance measurement.
It then explores how bandwidth affects performance metrics and how to mitigate some of the challenges it poses. This section examines the functionality, advantages, and pitfalls of both approaches. The details of various CPU timing approaches are outside the scope of this document, but developers should always be aware of the resolution their timing calls provide. All kernel launches are asynchronous, as are memory-copy functions with the Async suffix on their names. Although it is also possible to synchronize the CPU thread with a particular stream or event on the GPU, these synchronization functions are not suitable for timing code in streams other than the default stream.
Because the driver may interleave execution of CUDA calls from other non-default streams, calls in other streams may be included in the timing. Because the default stream, stream 0, exhibits serializing behavior for work on the device an operation in the default stream can begin only after all preceding calls in any stream have completed; and no subsequent operation in any stream can begin until it finishes , these functions can be used reliably for timing in the default stream.
Be aware that CPU-to-GPU synchronization points such as those mentioned in this section imply a stall in the GPU's processing pipeline and should thus be used sparingly to minimize their performance impact. The CUDA event API provides calls that create and destroy events, record events (via timestamp), and convert timestamp differences into a floating-point value in milliseconds. How to time code using CUDA events illustrates their use. Here cudaEventRecord is used to place the start and stop events into the default stream, stream 0.
The device will record a timestamp for the event when it reaches that event in the stream. The cudaEventElapsedTime function returns the time elapsed between the recording of the start and stop events. This value is expressed in milliseconds and has a resolution of approximately half a microsecond. Like the other calls in this listing, their specific operation, parameters, and return values are described in the CUDA Toolkit Reference Manual. Note that the timings are measured on the GPU clock, so the timing resolution is operating-system-independent.
Bandwidth - the rate at which data can be transferred - is one of the most important gating factors for performance. Almost all changes to code should be made in the context of how they affect bandwidth. As described in Memory Optimizations of this guide, bandwidth can be dramatically affected by the choice of memory in which data is stored, how the data is laid out and the order in which it is accessed, as well as other factors.
To measure performance accurately, it is useful to calculate theoretical and effective bandwidth. When the latter is much lower than the former, design or implementation details are likely to reduce bandwidth, and it should be the primary goal of subsequent optimization efforts to increase it. Theoretical bandwidth can be calculated using hardware specifications available in the product literature. In this calculation, the memory clock rate is converted to Hz, multiplied by the interface width (divided by 8, to convert bits to bytes), and multiplied by 2 due to the double data rate.
Effective bandwidth is calculated by timing specific program activities and by knowing how data is accessed by the program. For example, to compute the effective bandwidth of a x matrix copy, the number of elements is multiplied by the size of each element (4 bytes for a float), multiplied by 2 (because of the read and write), and divided by 10^9 (or 1,024^3) to obtain GB of memory transferred. For devices with compute capability of 2. The following throughput metrics can be displayed in the Details or Detail Graphs view:
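The effective-bandwidth calculation above can be sketched as a small helper; the matrix size and elapsed time below are hypothetical, chosen only to exercise the formula:

```python
# Effective bandwidth = (bytes read + bytes written) / 10^9 / time-in-seconds.

def effective_bandwidth_gbs(bytes_read, bytes_written, seconds):
    return (bytes_read + bytes_written) / 1e9 / seconds

n = 2048                          # hypothetical n x n matrix of floats
bytes_per_direction = n * n * 4   # 4 bytes per float element
bw = effective_bandwidth_gbs(bytes_per_direction, bytes_per_direction, 1.2e-3)
# bw is roughly 28 GB/s for this made-up timing
```

Comparing such a measured figure against the theoretical bandwidth from the previous paragraph indicates how much headroom is left for optimization.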
The Requested Global Load Throughput and Requested Global Store Throughput values indicate the global memory throughput requested by the kernel and therefore correspond to the effective bandwidth obtained by the calculation shown under Effective Bandwidth Calculation. Because the minimum memory transaction size is larger than most word sizes, the actual memory throughput required for a kernel can include the transfer of data not used by the kernel. For global memory accesses, this actual throughput is reported by the Global Load Throughput and Global Store Throughput values.
It's important to note that both numbers are useful. The actual memory throughput shows how close the code is to the hardware limit, and a comparison of the effective or requested bandwidth to the actual bandwidth presents a good estimate of how much bandwidth is wasted by suboptimal coalescing of memory accesses see Coalesced Access to Global Memory.
For global memory accesses, this comparison of requested memory bandwidth to actual memory bandwidth is reported by the Global Memory Load Efficiency and Global Memory Store Efficiency metrics. Memory optimizations are the most important area for performance. The goal is to maximize the use of the hardware by maximizing bandwidth. Bandwidth is best served by using as much fast memory and as little slow-access memory as possible.
This chapter discusses the various kinds of memory on the host and device and how best to set up data items to use the memory effectively. The peak theoretical bandwidth between the device memory and the GPU is much higher than the peak bandwidth between the host and the device. Hence, for best overall application performance, it is important to minimize data transfer between the host and the device, even if that means running kernels on the GPU that do not demonstrate any speedup compared with running them on the host CPU.
Intermediate data structures should be created in device memory, operated on by the device, and destroyed without ever being mapped by the host or copied to host memory. Also, because of the overhead associated with each transfer, batching many small transfers into one larger transfer performs significantly better than making each transfer separately, even if doing so requires packing non-contiguous regions of memory into a contiguous buffer and then unpacking after the transfer.
Finally, higher bandwidth between the host and the device is achieved when using page-locked or pinned memory, as discussed in the CUDA C Programming Guide and the Pinned Memory section of this document. Page-locked or pinned memory transfers attain the highest bandwidth between the host and the device. For regions of system memory that have already been pre-allocated, cudaHostRegister can be used to pin the memory on-the-fly without the need to allocate a separate buffer and copy the data into it.
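A minimal sketch of the two ways to obtain pinned host memory described above (the buffer names and nBytes are illustrative, not from the original listing):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

void pinnedMemoryExample(size_t nBytes)
{
    // Allocate memory that is pinned from the start.
    float *a_h;
    cudaHostAlloc((void **)&a_h, nBytes, cudaHostAllocDefault);

    // Alternatively, pin a pre-existing pageable allocation in place,
    // without allocating a separate buffer and copying into it.
    float *b_h = (float *)malloc(nBytes);
    cudaHostRegister(b_h, nBytes, cudaHostRegisterDefault);

    /* ... transfers using a_h and b_h now attain pinned-memory bandwidth ... */

    cudaHostUnregister(b_h);  // unpin before freeing
    free(b_h);
    cudaFreeHost(a_h);
}
```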
Pinned memory should not be overused.
Excessive use can reduce overall system performance because pinned memory is a scarce resource, but how much is too much is difficult to know in advance. Furthermore, the pinning of system memory is a heavyweight operation compared to most normal system memory allocations, so as with all optimizations, test the application and the systems it runs on for optimal performance parameters. Data transfers between the host and the device using cudaMemcpy are blocking transfers; that is, control is returned to the host thread only after the data transfer is complete. The cudaMemcpyAsync function is a non-blocking variant of cudaMemcpy in which control is returned immediately to the host thread.
In contrast with cudaMemcpy, the asynchronous transfer version requires pinned host memory (see Pinned Memory), and it contains an additional argument, a stream ID. A stream is simply a sequence of operations that are performed in order on the device. Operations in different streams can be interleaved and in some cases overlapped - a property that can be used to hide data transfers between the host and the device. Asynchronous transfers enable overlap of data transfers with computation in two different ways.
On all CUDA-enabled devices, it is possible to overlap host computation with asynchronous data transfers and with device computations. For example, Overlapping computation and data transfers demonstrates how host computation in the routine cpuFunction is performed while data is transferred to the device and a kernel using the device is executed. The last argument to the cudaMemcpyAsync function is the stream ID, which in this case uses the default stream, stream 0.
The kernel also uses the default stream, and it will not begin execution until the memory copy completes; therefore, no explicit synchronization is needed. Because the memory copy and the kernel both return control to the host immediately, the host function cpuFunction overlaps their execution.
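The pattern just described can be sketched as follows; a_h is pinned host memory, and a_d, size, grid, block, kernel, and cpuFunction are assumed to be defined elsewhere (illustrative names):

```cuda
// "Overlapping computation and data transfers" pattern: both device
// operations use the default stream (ID 0) and return control to the host
// immediately, so the host work in cpuFunction() runs concurrently.
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);  // default stream: starts after the copy completes
cpuFunction();                 // overlaps the transfer and kernel execution
```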
In Overlapping computation and data transfers, the memory copy and kernel execution occur sequentially. On devices that are capable of concurrent copy and compute, it is possible to overlap kernel execution on the device with data transfers between the host and the device. On devices that have this capability, the overlap once again requires pinned host memory, and, in addition, the data transfer and kernel must use different, non-default streams (streams with non-zero stream IDs).
Non-default streams are required for this overlap because memory copy, memory set functions, and kernel calls that use the default stream begin only after all preceding calls on the device in any stream have completed, and no operation on the device in any stream commences until they are finished. Concurrent copy and execute illustrates the basic technique.
In this code, two streams are created and used in the data transfer and kernel executions as specified in the last arguments of the cudaMemcpyAsync call and the kernel's execution configuration. Concurrent copy and execute demonstrates how to overlap kernel execution with asynchronous data transfer. This technique could be used when the data dependency is such that the data can be broken into chunks and transferred in multiple stages, launching multiple kernels to operate on each chunk as it arrives.
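A sketch consistent with that description (a_d, a_h, size, grid, block, kernel, and otherData_d are assumed to be defined elsewhere; the kernel must not depend on the data being transferred):

```cuda
// "Concurrent copy and execute" pattern: the transfer and the kernel are
// issued into different non-default streams, so on devices with a copy
// engine they can proceed concurrently.
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d);
```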
Sequential copy and execute and Staged concurrent copy and execute demonstrate this. They produce equivalent results. The first segment shows the reference sequential implementation, which transfers and operates on an array of N floats where N is assumed to be evenly divisible by nThreads. Staged concurrent copy and execute shows how the transfer and kernel execution can be broken up into nStreams stages. This approach permits some overlapping of the data transfer and execution. Because execution within a stream occurs sequentially, none of the kernels will launch until the data transfers in their respective streams complete.
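The two versions just contrasted can be sketched as follows; stream[] is assumed to be an array of nStreams previously created streams, and a_d, a_h, N, and nThreads are illustrative names:

```cuda
// Sequential copy and execute (reference version): one blocking transfer,
// then one kernel launch over the whole array.
cudaMemcpy(a_d, a_h, N * sizeof(float), cudaMemcpyHostToDevice);
kernel<<<N / nThreads, nThreads>>>(a_d);

// Staged concurrent copy and execute: the array is split into nStreams
// chunks; each chunk's transfer and kernel share a stream, so chunk i's
// kernel can overlap chunk i+1's transfer on capable devices.
size_t chunkBytes = N * sizeof(float) / nStreams;
for (int i = 0; i < nStreams; i++) {
    int offset = i * (N / nStreams);
    cudaMemcpyAsync(a_d + offset, a_h + offset, chunkBytes,
                    cudaMemcpyHostToDevice, stream[i]);
    kernel<<<N / (nThreads * nStreams), nThreads, 0, stream[i]>>>(a_d + offset);
}
```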
Current GPUs can simultaneously process asynchronous data transfers and execute kernels. GPUs with a single copy engine can perform one asynchronous data transfer while executing kernels, whereas GPUs with two copy engines can simultaneously perform one asynchronous data transfer from the host to the device, one asynchronous data transfer from the device to the host, and execute kernels. Note that it is not possible to overlap a blocking transfer with an asynchronous transfer, because the blocking transfer occurs in the default stream, so it will not begin until all previous CUDA calls complete.
Nor will it allow any other CUDA call to begin until it has completed. A diagram depicting the timeline of execution for the two code segments is shown in Figure 1; nStreams is equal to 4 for Staged concurrent copy and execute in the bottom half of the figure. For this example, it is assumed that the data transfer and kernel execution times are comparable. Zero copy is a feature that was added in version 2.0 of the CUDA Toolkit. It enables GPU threads to directly access host memory. For this purpose, it requires mapped pinned (non-pageable) memory.
On integrated GPUs (i.e., GPUs with the integrated field of the CUDA device properties structure set to 1), mapped pinned memory is always a performance gain, because it avoids superfluous copies: integrated GPU and CPU memory are physically the same. On discrete GPUs, mapped pinned memory is advantageous only in certain cases. Because the data is not cached on the GPU, mapped pinned memory should be read or written only once, and the global loads and stores that read and write the memory should be coalesced. Zero copy can be used in place of streams because kernel-originated data transfers automatically overlap kernel execution without the overhead of setting up and determining the optimal number of streams. The host code in Zero-copy host code shows how zero copy is typically set up. In this code, the canMapHostMemory field of the structure returned by cudaGetDeviceProperties is used to check that the device supports mapping host memory to the device's address space.
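A sketch of a typical zero-copy setup along the lines described (nBytes, gridSize, blockSize, and kernel are assumed to be defined elsewhere):

```cuda
// Typical zero-copy host code: check for support, request mapped pinned
// memory, and obtain the device-side pointer to the mapped allocation.
float *a_h, *a_map;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory)
    exit(0);                               // device cannot map host memory
cudaSetDeviceFlags(cudaDeviceMapHost);     // must precede context creation
cudaHostAlloc((void **)&a_h, nBytes, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&a_map, a_h, 0);
kernel<<<gridSize, blockSize>>>(a_map);    // GPU threads access host memory
```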
Note that cudaSetDeviceFlags must be called prior to setting a device or making a CUDA call that requires state (that is, essentially, before a context is created). Page-locked mapped host memory is allocated using cudaHostAlloc, and the pointer to the mapped device address space is obtained via the function cudaHostGetDevicePointer. Devices of compute capability 2.0 and later support a special addressing mode called Unified Virtual Addressing (UVA). With UVA, the host memory and the device memories of all installed supported devices share a single virtual address space.
Prior to UVA, an application had to keep track of which pointers referred to device memory and for which device and which referred to host memory as a separate bit of metadata or as hard-coded information in the program for each pointer. Using UVA, on the other hand, the physical memory space to which a pointer points can be determined simply by inspecting the value of the pointer using cudaPointerGetAttributes. Under UVA, pinned host memory allocated with cudaHostAlloc will have identical host and device pointers, so it is not necessary to call cudaHostGetDevicePointer for such allocations.
Host memory allocations pinned after-the-fact via cudaHostRegister, however, will continue to have different device pointers than their host pointers, so cudaHostGetDevicePointer remains necessary in that case. CUDA devices use several memory spaces, which have different characteristics that reflect their distinct usages in CUDA applications. These memory spaces include global, local, shared, texture, and registers, as shown in Figure 2. Of these different memory spaces, global memory is the most plentiful; see Features and Technical Specifications of the CUDA C Programming Guide for the amounts of memory available in each memory space at each compute capability level.
Global, local, and texture memory have the greatest access latency, followed by constant memory, shared memory, and the register file. The various principal traits of the memory types are shown in Table 1. In the case of texture access, if a texture reference is bound to a linear array in global memory, then the device code can write to the underlying array.
Texture references that are bound to CUDA arrays can be written to via surface-write operations by binding a surface to the same underlying CUDA array storage. Reading from a texture while writing to its underlying global memory array in the same kernel launch should be avoided because the texture caches are read-only and are not invalidated when the associated global memory is modified.
Perhaps the single most important performance consideration in programming for CUDA-capable GPU architectures is the coalescing of global memory accesses. Global memory loads and stores by threads of a warp are coalesced by the device into as few as one transaction when certain access requirements are met. The access requirements for coalescing depend on the compute capability of the device and are documented in the CUDA C Programming Guide. For devices of compute capability 2.x, the concurrent accesses of the threads of a warp coalesce into a number of transactions equal to the number of cache lines necessary to service all of the threads of the warp.
By default, all accesses are cached through L1 as 128-byte lines. For scattered access patterns, to reduce overfetch, it can sometimes be useful to cache only in L2, which caches shorter 32-byte segments (see the CUDA C Programming Guide). For devices of compute capability 3.x, accesses to global memory are by default cached only in L2; some devices of compute capability 3.x also permit opt-in caching of global loads in L1. Accessing memory in a coalesced way is even more important when ECC is turned on. Scattered accesses increase ECC memory transfer overhead, especially when writing data to global memory. Coalescing concepts are illustrated in the following simple examples.
These examples assume compute capability 2.x, that accesses are cached through L1 (the default behavior on those devices), and that accesses are for 4-byte words, unless otherwise noted. The first and simplest case of coalescing can be achieved by any CUDA-enabled device: the k-th thread accesses the k-th word in a cache line. Not all threads need to participate. For example, if the threads of a warp access adjacent 4-byte words (e.g., adjacent float values), a single cache line services the entire warp. Such a pattern is shown in Figure 3. This access pattern results in a single 128-byte L1 transaction, indicated by the red rectangle.
If some words of the line had not been requested by any thread (such as if several threads had accessed the same word or if some threads did not participate in the access), all data in the cache line is fetched anyway. Furthermore, if accesses by the threads of the warp had been permuted within this segment, still only one 128-byte L1 transaction would have been performed by a device with compute capability 2.x. If sequential threads in a warp access memory that is sequential but not aligned with the cache lines, two 128-byte L1 cache lines will be requested, as shown in Figure 4.
For non-caching transactions (i.e., those that bypass L1 and use only the L2 cache), memory is serviced in 32-byte L2 segments. In Figure 5, we see an example of this: the same access pattern from Figure 4 is used, but now L1 caching is disabled, so now five 32-byte L2 segments are needed to satisfy the request. Therefore, choosing sensible thread block sizes, such as multiples of the warp size (i.e., 32), facilitates memory accesses by warps that are properly aligned with cache lines. Consider what would happen to the memory addresses accessed by the second, third, and subsequent thread blocks if the thread block size were not a multiple of warp size, for example. It is easy and informative to explore the ramifications of misaligned accesses using a simple copy kernel, such as the one in A copy kernel that illustrates misaligned accesses.
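A minimal sketch of such a copy kernel (the name offsetCopy is illustrative), in which a nonzero offset shifts every warp's accesses off the cache-line boundary:

```cuda
// Copies idata to odata with the access window shifted by `offset` words.
// With offset = 0 the accesses are aligned and fully coalesced; other
// offsets force extra cache lines or segments to be fetched per warp.
__global__ void offsetCopy(float *odata, float *idata, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}
```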
In A copy kernel that illustrates misaligned accesses, data is copied from the input array idata to the output array, both of which exist in global memory. The kernel is executed within a loop in host code that varies the parameter offset from 0 to 32. Figure 4 and Figure 5 correspond to misalignments in the cases of caching and non-caching memory accesses, respectively. For the NVIDIA Tesla M2090, global memory accesses with no offset or with offsets that are multiples of 32 words result in a single L1 cache line transaction or 4 L2 cache segment loads (for non-L1-caching loads).
In this particular example, that effect is not apparent, however, because adjacent warps reuse the cache lines their neighbors fetched. So while the impact is still evident in the case of caching loads, it is not as great as we might have expected. It would have been more so if adjacent warps had not exhibited such a high degree of reuse of the over-fetched cache lines. As seen above, in the case of misaligned sequential accesses, the caches of compute capability 2.x devices help a lot to mitigate the performance impact. It may be different with non-unit-strided accesses, however, and this is a pattern that occurs frequently when dealing with multidimensional data or matrices.
For this reason, ensuring that as much as possible of the data in each cache line fetched is actually used is an important part of performance optimization of memory accesses on these devices. To illustrate the effect of strided access on effective bandwidth, see the kernel strideCopy in A kernel to illustrate non-unit stride data copy , which copies data with a stride of stride elements between threads from idata to odata.
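A minimal sketch of the strideCopy kernel as described (the exact original listing is not reproduced here):

```cuda
// Copies data with a stride of `stride` elements between threads: as the
// stride grows, each warp touches more cache lines or segments for the same
// amount of useful data, wasting bandwidth.
__global__ void strideCopy(float *odata, float *idata, int stride)
{
    int xid = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    odata[xid] = idata[xid];
}
```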
Figure 7 illustrates such a situation; in this case, threads within a warp access words in memory with a stride of 2. This action leads to a load of two L1 cache lines (or eight L2 cache segments in non-caching mode) per warp on the Tesla M2090 (compute capability 2.0). As the stride increases, the effective bandwidth decreases until the point where 32 lines of cache are loaded for the 32 threads in a warp, as indicated in Figure 8.
As illustrated in Figure 8 , non-unit-stride global memory accesses should be avoided whenever possible. One method for doing so utilizes shared memory, which is discussed in the next section. Because it is on-chip, shared memory has much higher bandwidth and lower latency than local and global memory - provided there are no bank conflicts between the threads, as detailed in the following section.
To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules banks that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans n distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is n times as high as the bandwidth of a single bank.
However, if multiple addresses of a memory request map to the same memory bank, the accesses are serialized. The hardware splits a memory request that has bank conflicts into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of separate memory requests. The one exception here is when multiple threads in a warp address the same shared memory location, resulting in a broadcast.
To minimize bank conflicts, it is important to understand how memory addresses map to memory banks and how to optimally schedule memory requests. On devices of compute capability 2.x, each bank has a bandwidth of 32 bits every two clock cycles, and successive 32-bit words are assigned to successive banks. The warp size is 32 threads and the number of banks is also 32, so bank conflicts can occur between any threads in the warp. See Compute Capability 2.x in the CUDA C Programming Guide for further details. On devices of compute capability 3.x, each bank has a bandwidth of 64 bits every clock cycle. There are two different banking modes: either successive 32-bit words (in 32-bit mode) or successive 64-bit words (64-bit mode) are assigned to successive banks. See Compute Capability 3.x in the CUDA C Programming Guide for further details. Shared memory enables cooperation between threads in a block.
When multiple threads in a block use the same data from global memory, shared memory can be used to access the data from global memory only once. Shared memory can also be used to avoid uncoalesced memory accesses by loading and storing data in a coalesced pattern from global memory and then reordering it in shared memory. Aside from memory bank conflicts, there is no penalty for non-sequential or unaligned accesses by a warp in shared memory.
The following examples illustrate these uses with a matrix multiplication C = AB, where A is of dimension Mxw, B of dimension wxN, and C of dimension MxN. To keep the kernels simple, M and N are multiples of 32, and w is 32 for devices of compute capability 2.x. A natural decomposition of the problem is to use a block and tile size of wxw threads. Therefore, in terms of wxw tiles, A is a column matrix, B is a row matrix, and C is their outer product; see Figure 9. To do this, the simpleMultiply kernel (Unoptimized matrix multiplication) calculates the output elements of a tile of matrix C.
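A minimal sketch of such an unoptimized kernel, consistent with the description below (the kernel name and TILE_DIM follow the surrounding text; the exact original listing is not reproduced here):

```cuda
#define TILE_DIM 32  // tile width w, stated as 32 in the text

// Unoptimized matrix multiplication: each thread computes one element of a
// wxw tile of C = AB, reading A and B directly from global memory.
__global__ void simpleMultiply(float *a, float *b, float *c, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row * TILE_DIM + i] * b[i * N + col];
    }
    c[row * N + col] = sum;
}
```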
In Unoptimized matrix multiplication, a, b, and c are pointers to global memory for the matrices A, B, and C, respectively; blockDim.x, blockDim.y, and TILE_DIM are all equal to w. Each thread in the wxw-thread block calculates one element in a tile of C. The for loop over i multiplies a row of A by a column of B, which is then written to C. The effective bandwidth of this kernel is only about 6 GB/s on the test device. To analyze performance, it is necessary to consider how warps access global memory in the for loop. Each warp of threads calculates one row of a tile of C, which depends on a single row of A and an entire tile of B, as illustrated in Figure 10. For each iteration i of the for loop, the threads in a warp read a row of the B tile, which is a sequential and coalesced access for all compute capabilities.
However, for each iteration i, all threads in a warp read the same value from global memory for matrix A, as the index row*TILE_DIM+i is constant within a warp. Even though such an access requires only one transaction on devices of compute capability 2.0 or higher, there is wasted bandwidth in the transaction, because only one 4-byte word out of the 32 words in the cache line is used. The performance on a device of any compute capability can be improved by reading a tile of A into shared memory, as shown in Using shared memory to improve the global memory load efficiency in matrix multiplication. In Using shared memory to improve the global memory load efficiency in matrix multiplication, each element in a tile of A is read from global memory only once, in a fully coalesced fashion (with no wasted bandwidth), to shared memory.
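A sketch of that improvement (kernel name illustrative): the tile of A is staged in shared memory once, coalesced, and then re-read from shared memory inside the loop.

```cuda
#define TILE_DIM 32  // tile width w, assumed 32 as in the text

// Shared memory caches the A tile: each element is loaded from global
// memory exactly once, then broadcast from shared memory in the loop.
__global__ void coalescedMultiply(float *a, float *b, float *c, int N)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    aTile[threadIdx.y][threadIdx.x] = a[row * TILE_DIM + threadIdx.x];
    __syncthreads();  // ensure the tile is fully loaded before use

    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[threadIdx.y][i] * b[i * N + col];
    }
    c[row * N + col] = sum;
}
```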
Within each iteration of the for loop, a value in shared memory is broadcast to all threads in a warp. This kernel has an effective bandwidth of about 7 GB/s, a modest improvement. This illustrates the use of shared memory as a user-managed cache when the hardware L1 cache eviction policy does not match up well with the needs of the application, or when L1 cache is not used for reads from global memory. A further improvement can be made to how Using shared memory to improve the global memory load efficiency in matrix multiplication deals with matrix B.
In calculating each of the rows of a tile of matrix C, the entire tile of B is read. The repeated reading of the B tile can be eliminated by reading it into shared memory once (Improvement by reading additional data into shared memory). The effective bandwidth of this routine improves markedly. Note that the performance improvement is not due to improved coalescing in either case, but to avoiding redundant transfers from global memory. The results of the various optimizations are summarized in Table 2.
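The version with both tiles staged in shared memory, as just described, might look like the following sketch (kernel name illustrative):

```cuda
#define TILE_DIM 32

// Both the A tile and the B tile are staged in shared memory, eliminating
// the repeated global memory reads of the B tile.
__global__ void sharedABMultiply(float *a, float *b, float *c, int N)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM];
    __shared__ float bTile[TILE_DIM][TILE_DIM];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    aTile[threadIdx.y][threadIdx.x] = a[row * TILE_DIM + threadIdx.x];
    bTile[threadIdx.y][threadIdx.x] = b[threadIdx.y * N + col];
    __syncthreads();  // bTile elements are read by warps that did not write them

    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
    }
    c[row * N + col] = sum;
}
```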
A variant of the previous matrix multiplication can be used to illustrate how strided accesses to global memory, as well as shared memory bank conflicts, are handled. In Unoptimized handling of strided accesses to global memory, the (row, col) element of C is obtained by taking the dot product of the row-th and col-th rows of A; that is, C = AA^T. The effective bandwidth for this kernel is only about 3 GB/s because of the strided reads of A. The way to avoid strided access is to use shared memory as before, except in this case a warp reads a row of A into a column of a shared memory tile, as shown in An optimized handling of strided accesses using coalesced reads from global memory.
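The unoptimized C = AA^T kernel described above might look like the following sketch (kernel name illustrative):

```cuda
#define TILE_DIM 32

// Unoptimized C = AA^T: in the second factor a[col * TILE_DIM + i],
// consecutive threads in a warp (consecutive col) read global memory with
// a stride of TILE_DIM elements, which is uncoalesced.
__global__ void simpleTransposeMultiply(float *a, float *c, int M)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row * TILE_DIM + i] * a[col * TILE_DIM + i];
    }
    c[row * M + col] = sum;
}
```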
An optimized handling of strided accesses using coalesced reads from global memory uses the shared transposedTile to avoid uncoalesced accesses in the second term of the dot product and the shared aTile technique from the previous example to avoid uncoalesced accesses in the first term. The effective bandwidth of this kernel improves, but by less than might be expected; the cause of the difference is shared memory bank conflicts. The reads of elements in transposedTile within the for loop are free of conflicts, because threads of each half warp read across rows of the tile, resulting in unit stride across the banks.
However, bank conflicts occur when copying the tile from global memory into shared memory. To enable the loads from global memory to be coalesced, data are read from global memory sequentially. However, this requires writing to shared memory in columns, and because of the use of wxw tiles in shared memory, this results in a stride between threads of w banks - every thread of the warp hits the same bank.
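A sketch of the transposed-tile kernel (name illustrative); the extra padding column in the transposedTile declaration is the remedy for the bank conflicts just described:

```cuda
#define TILE_DIM 32

// Coalesced C = AA^T: each warp reads a row of A from global memory
// (coalesced) and writes it into a *column* of transposedTile. Without the
// +1 padding, every thread of the warp would hit the same shared memory
// bank on that column write, a w-way conflict.
__global__ void coalescedTransposeMultiply(float *a, float *c, int M)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM];
    __shared__ float transposedTile[TILE_DIM][TILE_DIM + 1];  // padded column

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    aTile[threadIdx.y][threadIdx.x] = a[row * TILE_DIM + threadIdx.x];
    transposedTile[threadIdx.x][threadIdx.y] =
        a[(blockIdx.x * blockDim.x + threadIdx.y) * TILE_DIM + threadIdx.x];
    __syncthreads();

    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[threadIdx.y][i] * transposedTile[i][threadIdx.x];
    }
    c[row * M + col] = sum;
}
```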
Recall that w is selected as 32 for devices of compute capability 2.x. These many-way bank conflicts are very expensive. The simple remedy is to pad the shared memory array so that it has an extra column, declaring it as, for example, transposedTile[TILE_DIM][TILE_DIM+1]. After this change, the bank conflicts are eliminated and the effective bandwidth improves accordingly. The results of these optimizations are summarized in Table 3. These results should be compared with those in Table 2. As can be seen from these tables, judicious use of shared memory can dramatically improve performance.
The examples in this section have illustrated three reasons to use shared memory: to enable coalesced accesses to global memory (especially to avoid large strides), to eliminate redundant loads from global memory, and to serve as a user-managed cache when the hardware caching policy does not fit the application. Local memory is so named because its scope is local to the thread, not because of its physical location.