Multi-core Architectures: Heterogeneous processors

INTRODUCTION: A heterogeneous processor integrates a mix of “big” and “small” cores, and thus can potentially achieve the benefits of both. Several usages motivate this design:

Parallel processing: with a few big and many small cores, the processor can deliver higher performance at possibly the same or lower power than an iso-area homogeneous design.

Power savings: the processor uses small cores to save power. For example, it can operate in two modes: a high-power mode in which all cores are available, and a low-power mode in which applications run only on the small cores, trading performance for lower power.

Accelerator: unlike the previous models, where the big cores have higher performance and even more features, in this model, the small cores implement special instructions, such as vector processing, which are unavailable on the big cores. Thus, applications can use the small cores as accelerators for these operations.

Heterogeneous Architectures:
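(1) Design Space: We classify heterogeneous architectures into two types: performance asymmetry and functional asymmetry. The former refers to architectures where cores differ in performance (and power) due to different clock speeds, cache sizes, microarchitectures, and so forth. Applications run correctly on any core, but can have different performance. The latter refers to architectures where cores differ in functionality, for example in the instructions they support, so software may run correctly on one core but fail on another.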

(2) OS Challenges: There are two sets of challenges:

Correctness: OSes typically query processor features on the bootstrap processor (BSP) and assume the same for every core. This assumption becomes invalid for heterogeneous processors. With instruction-based asymmetry, software can fail on one core but succeed on another. This needs to be handled properly to ensure correct execution.

Performance: Even when software runs correctly, obtaining high performance can be challenging. With performance asymmetry, an immediate challenge is how applications can share the high-performance cores fairly, especially when they belong to different users. OS scheduling should also enable consistent application performance across different runs. Otherwise, a thread may execute on a fast core in one run but a slow one in another, causing performance variations. Scheduling is further complicated as threads can perform differently on different cores. In general, one would expect higher performance on a faster core; however, for I/O-bound applications, this may not be true. Choosing the right thread-to-core mappings can be challenging.
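The correctness challenge above can be illustrated with a small sketch. It is not the mechanism used in the work described here; it only shows why reading features from the BSP alone is unsafe. The core numbering, the feature names, and both helper functions are hypothetical.

```python
# Hypothetical per-core feature sets; names and values are illustrative only.
core_features = {
    0: {"sse2", "sse3"},            # big core, assumed to be the BSP
    1: {"sse2", "sse3"},            # big core
    2: {"sse2", "sse3", "sse4_1"},  # small core with an extra instruction set
}

def common_features(features_by_core):
    """Features that are safe to assume on every core."""
    return set.intersection(*features_by_core.values())

def cores_supporting(features_by_core, feature):
    """Cores on which code using `feature` can run without faulting."""
    return sorted(c for c, f in features_by_core.items() if feature in f)

# Querying only the BSP (core 0) would hide the fact that core 2 differs.
print(sorted(common_features(core_features)))
print(cores_supporting(core_features, "sse4_1"))
```

In this toy setup, a thread that uses the extra instruction set can only run on core 2, which a BSP-only query would never reveal.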

Supporting Performance Asymmetry

Quantifying CPU Performance: An essential component of our algorithms is to assign a performance rating per CPU such that we can estimate performance differences if a thread is to run on different CPUs. There are various ways to obtain CPU ratings. Our design allows the OS to run a simple benchmark of its choice at boot time and set a default rating for each CPU. When the system is up, the OS or user can run complex benchmarks such as SPEC CPU* to override the default ratings if desired. The processor manufacturer can also provide CPU ratings, which the OS can use as the default. All of these approaches produce the same result, i.e., a static rating per CPU. If the rating of a CPU is X times higher than the rating of another CPU, we say this CPU is X times faster.
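As a rough illustration of static ratings, the sketch below stores one rating per CPU, lets measured values override the boot-time defaults, and computes the "X times faster" ratio. The rating values and the function names are made up for this example and do not come from the original work.

```python
def override_ratings(defaults, measured):
    """Return ratings in which measured values replace boot-time defaults."""
    merged = dict(defaults)
    merged.update(measured)
    return merged

def relative_speed(ratings, cpu_a, cpu_b):
    """How many times faster cpu_a is than cpu_b, according to the ratings."""
    return ratings[cpu_a] / ratings[cpu_b]

# Hypothetical defaults from a simple boot-time benchmark.
boot_ratings = {0: 3.0, 1: 3.0, 2: 2.0, 3: 2.0}

# Hypothetical later override for CPUs 2 and 3 from a more complex benchmark.
ratings = override_ratings(boot_ratings, {2: 1.5, 3: 1.5})

print(relative_speed(ratings, 0, 2))
```

Whichever source supplies the numbers, the scheduler only ever sees one static rating per CPU.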

Faster-First Scheduling: If two CPUs are idle and a thread can run on both of them, we always run it on the faster CPU. The algorithm consists of two components:

Initial placement: When scheduling a thread for the first time after its creation, if two CPUs are idle, we always choose the faster one to run it. If none is idle, our algorithm has no effect and the OS performs its normal action, typically selecting the most lightly loaded CPU.

Dynamic migration: During execution, a faster CPU can become idle. If any thread is running on a slower CPU, we preempt it and move it to the faster CPU. Thus, if the total number of threads is less than or equal to the number of faster CPUs, every thread can run on a faster CPU and achieve maximum performance.
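The two components above can be sketched as follows. This is a simplified model, not the Linux implementation: the data structures and function names are hypothetical, and picking the thread on the slowest CPU as the one to migrate is an assumption of this sketch.

```python
def initial_placement(ratings, idle_cpus):
    """Pick the fastest idle CPU, or None so the OS uses its normal policy."""
    if not idle_cpus:
        return None
    return max(idle_cpus, key=lambda cpu: ratings[cpu])

def dynamic_migration(ratings, running, idle_cpu):
    """Move a thread from a slower CPU onto `idle_cpu`, which just became idle.

    `running` maps cpu -> thread name and is updated in place.
    Returns (thread, source_cpu), or None if no thread runs on a slower CPU.
    """
    slower = [cpu for cpu in running if ratings[cpu] < ratings[idle_cpu]]
    if not slower:
        return None
    src = min(slower, key=lambda cpu: ratings[cpu])  # assumption: slowest first
    thread = running.pop(src)
    running[idle_cpu] = thread
    return thread, src

# Hypothetical ratings: CPUs 0 and 1 are fast, CPUs 2 and 3 are slow.
ratings = {0: 3.0, 1: 3.0, 2: 2.0, 3: 2.0}

print(initial_placement(ratings, {1, 2, 3}))  # the only fast idle CPU wins

running = {2: "A", 3: "B"}
print(dynamic_migration(ratings, running, 0))
print(running)
```

Returning None from initial_placement mirrors the case where no CPU is idle and the OS falls back to its normal action.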

Instruction-based Asymmetry: To emulate the accelerator usage model in Section 1, we configure the small cores with a 2 GHz frequency, resulting in a 32% lower SPEC CPU2006* rating than the big cores.

Fault-and-migrate performance: We perform three experiments for the three instruction-asymmetry benchmarks. First, we run the non-SSE4.1 version by pinning it on a big core, which gives the performance of running on a homogeneous system of big cores without SSE4.1. Second, we run the SSE4.1 version without pinning. With faster-first scheduling, it starts on a big core; on an SSE4.1 instruction, it faults and migrates to a small core and later back to a big core. Thus, the benchmark migrates back and forth between the big and small cores, allowing us to evaluate overheads of fault-and-migrate. To evaluate the impact of T, we repeat this experiment with T equal to 1, 2, 4, and 8 ticks, where one tick in our system is 4 ms. Finally, to emulate a costly design of homogeneous big cores with SSE4.1, we re-configure each small core to have equivalent performance to the big core. By pinning the SSE4.1 version of each benchmark to this core, we get an upper bound for any heterogeneous configuration with fault-and-migrate.
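To give a feel for how T could affect migration counts, here is a toy simulation. The text above does not define T precisely, so this sketch ASSUMES that after a fault the thread stays on the small core until T consecutive ticks pass without an SSE4.1-style instruction. The trace and the function are hypothetical; only the 4 ms tick and the T values come from the text.

```python
TICK_MS = 4  # one tick is 4 ms, as stated in the text

def count_migrations(trace, t_ticks):
    """Count core migrations for a per-tick instruction trace.

    trace: list of booleans, True when the thread issues an instruction
    that only the small core supports during that tick.
    ASSUMPTION: the thread returns to the big core after t_ticks
    consecutive ticks on the small core without such an instruction.
    """
    core = "big"
    quiet = 0
    migrations = 0
    for needs_small in trace:
        if needs_small:
            if core == "big":
                core = "small"  # fault on the big core, then migrate
                migrations += 1
            quiet = 0
        elif core == "small":
            quiet += 1
            if quiet >= t_ticks:
                core = "big"    # faster-first moves the thread back
                migrations += 1
                quiet = 0
    return migrations

# Hypothetical trace: special instructions at ticks 0 and 3 only.
trace = [True, False, False, True, False, False, False, False]
for t in (1, 2, 4, 8):
    print(f"T={t} ticks ({t * TICK_MS} ms): {count_migrations(trace, t)} migrations")
```

Under this assumption, a small T returns the thread to the big core sooner but causes more back-and-forth migration, while a large T keeps it on the slower core longer.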

Conclusion: Heterogeneous architectures provide a cost-effective solution for improving both single-thread performance and multi-thread throughput. However, they also face significant challenges in the OS design, which traditionally assumes only homogeneous hardware. This paper presents a set of algorithms that allow the OS to effectively manage heterogeneous CPUs.

Our fault-and-migrate algorithm enables the OS to transparently support instruction-based asymmetry. Faster-first scheduling improves performance by allowing applications to use faster cores whenever possible. Finally, DWRR (distributed weighted round-robin) allows applications to fairly share CPU resources, enabling good individual application performance and system throughput. We have implemented these algorithms in Linux 2.6.24 and evaluated them on an actual heterogeneous platform. Our results demonstrated that, with incremental changes, we can modify an existing OS to effectively manage heterogeneous hardware and achieve high performance for a wide range of applications.