
Software Hint-Driven Data Management for Hybrid Memory in Mobile Systems

Published: 14 January 2022

Abstract

Hybrid memory systems, composed of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of current mobile applications. Recently emerging NVM technologies, such as phase-change memory (PCM), memristor, and 3D XPoint, offer higher capacity density, minimal static power consumption, and lower cost per GB. However, NVM has longer access latency and limited write endurance compared to DRAM. The differing characteristics of these distinct memory classes pose a new challenge for memory system design.
Ideally, pages should be placed or migrated between the two types of memories according to the data objects’ access properties. Prior system software approaches exploit program information from the OS, but at the cost of high software latency incurred by the associated kernel processes. Hardware approaches can avoid these latencies; however, hardware’s vision is constrained to a short time window of recent memory requests due to limited on-chip resources.
In this work, we propose OpenMem: a hardware-software cooperative approach that combines the execution time advantages of pure hardware approaches with the data object properties in a global scope. First, we built a hardware-based memory manager unit (HMMU) that can learn the short-term access patterns by online profiling, and execute data migration efficiently. Then, we built a heap memory manager for the heterogeneous memory systems that allows the programmer to directly customize each data object’s allocation to a favorable memory device within the presumed object life cycle. With the programmer’s hints guiding the data placement at allocation time, data objects with similar properties will be congregated to reduce unnecessary page migrations.
We implemented the whole system on an FPGA board with embedded ARM processors. Testing with a set of benchmark applications from SPEC 2017 and PARSEC shows that OpenMem reduces energy consumption by 44.6% with only a 16% performance degradation compared to an all-DRAM memory system. The amount of writes to the NVM is reduced by 14% versus the HMMU-only design, extending the NVM device lifetime.

1 Introduction

Recently, the memory footprint of many mobile applications has expanded dramatically. Catering to this demand, mobile devices are being equipped with ever-larger DRAM. For example, consider the Samsung Galaxy S series flagship phones. In these devices, the DRAM capacity has expanded by 16X over the past 10 years [18].
Such continued scaling of DRAM capacity is becoming untenable due to both cost/economics and energy consumption. DRAM requires constant refreshing of its stored data, drawing substantial energy. Compared to desktop/server-class computers, mobile devices are highly sensitive to cost, size, and energy budgets.
Non-volatile memory (NVM) technologies, including Intel 3D XPoint [20], memristor [15], and phase-change memory (PCM) [34], have emerged as alternative memory technologies to DRAM. These new memory technologies promise an order of magnitude higher density [9] and minimal background power consumption. Their access delay is typically within an order of magnitude of DRAM’s, while write accesses have considerably higher overheads. Such different characteristics require an essential redesign of the memory system architecture, in both data management policies and mechanisms.
Ideally, one would prefer to place data objects with high locality and access frequency in the DRAM to ensure good performance. Meanwhile, objects of larger size shall be placed on the NVM, which has a higher capacity and could help reduce or avoid swapping between memory and storage devices. Furthermore, heavily written objects should likely be placed on the DRAM, to avoid the high write overheads of NVM devices in terms of both latency and power consumption (depending on a particular technology, which also helps to prolong the NVM device lifetime).
One must also reconsider the mechanism used to address the new memory device. In the era of magnetic disks and flash drives, moving data between memory and storage was managed by the operating system. Here, the OS page fault handler consists of a chain of kernel processes and incurs thousands of cycles of delay. These delays were hidden behind the lengthy access times of those old technologies. Emerging NVM access latency is within an order of magnitude of DRAM’s; thus the incurred OS management overhead becomes untenable [16].
Some existing works advocate using DRAM as a hardware-managed cache for NVM [10, 33]. This approach implies a high hardware cost for metadata management and imposes noticeable capacity and bandwidth constraints. Other groups propose pure software, OS-managed approaches [16, 17, 24, 39]. As we have discussed, these approaches imply tremendous slowdowns due to software overheads incurred by procedures such as context switches and page fault handling.
We first built a hardware memory management unit (HMMU) [40] to execute data migration without prolonged software overheads. That work also implemented a set of data placement/migration policies, so it can track live memory requests at runtime and adapt to changes in memory access behavior immediately. Being implemented in hardware with a limited state budget, however, the HMMU’s vision is constrained to a short time window. For instance, we designed a bloom filter to record the most recently accessed pages, which performs best (lowest false positive rate) with a maximum of 128 entries but degrades rapidly once the inserted entries exceed this number. While the point of saturation varies for different monitored activities, hardware counters generally have finite bit widths and saturate after recording a certain number of transactions. This hardware resource limitation prevents the HMMU from capturing complex long-term access patterns. Prior work [22] examining several major benchmark suites showed that a short sampling window is incapable of catching global memory access behaviors even in relatively steady application phases. Moreover, due to the nature of online profiling, such an HMMU cannot discern data object characteristics before observing several associated memory requests; thus it is incapable of deciding the favorable memory device at the time of allocation. In the worst case, data objects with opposite properties could end up allocated to the same page. If these data objects are referenced in an interleaved sequence, the online profiler might be misled into moving the shared page back and forth between memories. Undesired page swaps waste substantial energy and exacerbate the write endurance issues of NVM devices.
Generally, a given program’s authors have a better understanding of the program’s data structure usage and behavior. For example, will this data be revisited frequently after allocation? Will it be intensively read or written? Alternately, in cases where the programmer is not available or not as well trained, profiling can be used to produce data placement hints to a similar effect [2]. This article presents OpenMem, a hardware/software cooperative system combining the benefits of HMMUs with the deeper insights into memory usage that the programmer and profiling can provide. In experiments using profiling, we show OpenMem significantly saves 44.6% energy compared to the all DRAM system, with merely 16% performance degradation. Compared to the HMMU, we successfully reduce the writes to NVM by 14%.

2 Background and Motivation

Table 1 shows the characteristics of recent DRAM [27] and NVM [23] memory technologies respectively. As can be seen, there are significant differences in latency and energy consumption between these classes of memory. Memory access behaviors are highly diverse between data objects; hence a prudent data placement/migration decision becomes critical to exploit heterogeneous memory systems given their divergent access properties. Data objects should be placed on the proper memory device to achieve the best overall performance and least energy. For example, frequently read data shall be stored on DRAM for its shorter access delay, along with more intensively written objects due to high NVM write overheads and limited endurance. As for the large data objects with rare accesses, they should be placed on NVM to leverage its high capacity and low static power consumption. Programmer’s domain knowledge [13, 25, 28] or application profiling [13, 32, 41, 42] is critical in identifying the memory access pattern of data objects and can help to reach better decisions on data placement and migration. In the following sections, we will discuss some related works on the three key components of the hybrid memory system: dynamic data profiling, programmer’s hints and data migration mechanism, and show how they motivate us to find comprehensive software/hardware approaches to better exploit heterogeneous memory.
Table 1. DDR4 and 3D XPoint Technologies Comparison [23, 27]

Technology         DDR4       3D XPoint
Read Latency       50 ns      100 ns
Write Latency      50 ns      300 ns
Read Energy        4.2 nJ     1.28 nJ
Write Energy       3.5 nJ     8.7 nJ
Background Power   30 mW/GB   ~0
Table 2. Arguments Description

Argument    Type      Value Range
kind        memkind   DRAM: Prefer DRAM allocation
                      Default: Do not care about memory type
                      NVM: Prefer NVM allocation
retention   unsigned  (0, 7]: Retention period

2.1 Dynamic Data Profiling

Runtime profiling has been widely used to improve performance by analyzing memory accesses during application execution. Prior work [42] employed the Linux kernel’s built-in hot/cold page lists for its own profiling algorithms. Furthermore, a variety of algorithms derived from the CLOCK algorithm [11] have also been proposed [35]. Others used performance counters to identify the memory access patterns of different program phases [41]. While these techniques can improve performance to a certain extent, they only provide a system-level, general evaluation that lacks fine-grained information for each page.
In some proposals, the activity of each page is tracked by inserting metadata bits as profiling counters into page table entries [12] or the TLB [24]. However, these approaches still require OS intervention to handle page table updates or TLB shootdowns. The high latencies incurred by these processes are discussed further in a later section.
In our prior work, we proposed a Hardware Memory Management Unit (HMMU) [40], which has its own page table that redirects each physical page address to the corresponding memory device frame address. Each table entry also has several metadata bits that count the number of recent read/write accesses to allow for dynamic data profiling. Owing to the efficiency of its pure hardware design, the HMMU is able to update these profiling counters instantly upon each access and make coordinated page-migration decisions. Unlike the other works mentioned above, the HMMU keeps these internal activities completely hidden from the OS.
These online profiling approaches, however, are limited by the amount of profiling storage available in either the TLB’s metadata or the internal state of the HMMU, and thus are constrained to profiling recent memory accesses within a small time window. As a consequence, online profiling is unable to detect complex memory access patterns that span a long range of time.
The footprint size and time scope needed for effective profiling are illustrated in Figure 1. The figure shows the access volume heatmap generated by profiling applications 531.deepsjeng and 519.lbm from SPEC 2017, respectively. The darker dots/lines represent higher counts of accesses to the corresponding memory region. The straight black lines in the figures indicate that the applications intensively access one memory area within a short time (the dark dots comprising the lines), but quickly move forward and rarely look back. Given an HMMU with a short window of memory access tracking, it will frequently promote the pages that are currently heavily accessed but will not enjoy the benefits of fast access for future requests, as these pages will not be touched again before eventually being replaced by other pages. In other words, the unawareness of the poor data spatial locality leads to frequent page swaps as the application sweeps through its large streaming footprint.
Fig. 1. Memory access profiling.
Moreover, since profiling only starts after pages are allocated and referenced, data objects are initially placed on memory devices obliviously during allocation. Thus, items with opposite access patterns, say heavily read vs. write-intensive, could end up sharing the same page, which then exhibits mixed behaviors and becomes difficult to profile. Even in the best case, it takes the profiler several rounds before it gains enough information to decide on a page migration. All these page migrations could have been avoided if the data had been placed in the correct memory type in the first place. Therefore, we propose to integrate software hints into the allocation policy.

2.2 Software-Hints/Static Data Classification

Prior work [22] has shown that the sampling window size should be greater than a billion instructions for data objects greater than the page size (4KB). Shorter sampling epochs limit the profiler’s vision to the “local” variability, rather than the long-term memory access patterns along with program phase changes. With a comprehensive understanding of the program, programmer knowledge or offline memory access profiling (broadly we label these together as “software hints”) has been shown to be crucial for better exploiting heterogeneous memory.
Based on system-level heuristic information, Mogul [28] suggests migrating operating system cold pages to the slow memory based on page types and file types. Meswani et al. proposed TLM Malloc [25] to allow programmers to explicitly allocate memory buffers on different classes of memories. Dulloor et al. [13] developed a runtime profiling framework to place program data structures in heterogeneous memory. Mukkara et al. guided data placement in caches with static data classification based on the application’s memory accesses, to reduce potential data movement [29].
To date, however, we are unaware of any HW/SW full-stack solution in the hybrid memory domain to coordinate the software-hints information with the highly efficient hardware memory controller to achieve the optimal data placement and migration.

2.3 Data Migration

Efficient data migration is critical for heterogeneous memory once we have identified the features of different data structures through user knowledge or data profiling. Prior works proposed both software and hardware approaches for efficient data migration. Generally, software data migration approaches require extensions of OS kernel functions or data structures [6, 12, 24, 28, 35, 38, 42] or runtime/helper threads [41]. These OS-driven page migrations incur significant overheads: the process is usually triggered by a system interrupt, followed by a context switch and kernel interrupt handling. A TLB shootdown might also be necessary after the page mapping is updated. The associated delays, which were once negligible compared to traditional storage media such as spinning disks, now become dominant factors as NVM devices have much shorter access latencies. Researchers have tried different ways to mitigate the overheads of page migration: Wu et al. constrained page migration to happen only at the end of each phase of the application [41]. Yan et al. chose to move a bundle of pages at one time to amortize the cost of OS intervention [42].
The hardware-based solutions typically use DRAM as an inclusive cache for the NVM device and need enormous space for tag storage, as NVM capacity is much larger than DRAM. Efforts were made to address this problem by either shrinking the tag size [19], or reorganizing the tag/data-entry structure [21, 26]. Using DRAM as a cache for NVM also adds additional access latency, and sacrifices the ability of parallel accesses to the two memory devices, since every request must go through DRAM first. Moreover, the total available memory becomes less as the DRAM space is invisible to user applications.
We successfully solved both of these issues via an HMMU, which addresses the two memory devices in one flat address space, executing page migration transparently behind a translation layer visible only to the HMMU. Throughout the whole process, the OS does not need to intervene, and no changes are made to the TLB or the OS page table. We propose to combine software hints (obtained either through domain knowledge or static data profiling) with the HMMU data migration scheme to better guide object-level data placement and migration on heterogeneous memory.

3 Design

OpenMem is a full-stack solution that enables the adoption of static profiling and/or user knowledge of data object reference patterns to inform placement and migration within the HMMU. The hardware/software cooperative architecture encompasses both online and offline data profiling to execute data migration efficiently and effectively.

3.1 OpenMem Architecture Overview

Figure 2 shows the complete system architecture. Software hints (gathered either via profiling or by explicit programmer insertion) are encoded during data object allocation. By default, our memory allocator groups objects of a given type (i.e., high writes, latency sensitive, low writes, and so on) and assigns them to a DRAM-preferred page (page S1 in Figure 2) or an NVM-preferred page (page S2) based upon the flag variables set as arguments to the memory allocation system calls. These arguments convey the preferred memory type and the presumed life cycle of the given data object (see Section 3.2 for further details on the API). These flags are communicated to the HMMU during page mapping via a custom HMMU device driver. The device driver converts the arguments into special mark bits, encoded into the memory request messages sent to the HMMU. The corresponding decoding module inside the HMMU extracts the hint information and passes this metadata to the data management policy component of the HMMU. This component executes data placement based on a set of comprehensive policies covering different scenarios. When no free page is available, the HMMU scans the metadata of the internal redirection page table for cold pages to be swapped out of DRAM. After mapping is finished, the memory requests are forwarded to the allocated frames on the specific type of memory device. Besides the injection of software hints, the HMMU still dynamically profiles memory access patterns and has its own page migration policy. A brief introduction of the HMMU and its interaction with the software-hint information is presented in later sections.
Fig. 2. OpenMem system architecture.
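To make the flow concrete, the sketch below illustrates one plausible way the hints from Table 2 could be packed into the mark bits that the device driver attaches to a page-mapping request. This is an illustrative assumption, not the authors’ actual encoding; all names are hypothetical, and the real driver/HMMU interface is not specified at this level of detail in the text.

/* Hypothetical encoding of the software hints (Table 2) into mark bits
 * forwarded to the HMMU by the device driver. Illustrative only. */
#include <stdint.h>

enum openmem_kind { OPENMEM_DEFAULT = 0, OPENMEM_DRAM = 1, OPENMEM_NVM = 2 };

/* 2 bits select the preferred memory type; 3 bits carry the retention
 * period in (0, 7]; the remaining bits of the byte are reserved. */
static inline uint8_t openmem_encode_hint(enum openmem_kind kind,
                                          unsigned retention)
{
    if (retention > 7)
        retention = 7;                      /* clamp to the valid range */
    return (uint8_t)((kind & 0x3u) | ((retention & 0x7u) << 2));
}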

3.2 Memory Allocator API

We created a set of memory allocation APIs that allow the programmer/profiler to specify the desired memory type for the pertinent data objects. As discussed in Section 2.1, the baseline HMMU is poor at detecting long-term, complex access patterns, as its limited hardware resources only support monitoring of short-term behavior. With the help of software hints, we can identify the data objects that have recurring accesses across intervals. Such data should preferentially be kept in DRAM. The HMMU profiles data objects by observing the received memory requests, so it carries no relevant information at the time of allocation. Such oblivious allocation often places mixed data types together, negatively impacting data management throughout the objects’ lifetimes. Now that software hints are provided during allocation, we can group similar data into the same page and avoid potentially unnecessary data migration.
Our memory allocator is an extension of Memkind by Intel [1]. We changed the underlying memory mapping functions to ensure that the application memory was allocated to the HMMU’s memory. The APIs are compatible with the default glibc malloc function:
Memory Allocator API.
The programmer/profiler can either label particular data objects as DRAM-preferred, or leave the flag blank, allowing the allocator to handle it with default settings. Furthermore, the “retention” argument is provided as a software hint to indicate how long this data object should be held in DRAM.
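The exact function signature is not reproduced above, so the sketch below is an assumption modeled on Intel memkind’s memkind_malloc(kind, size), extended with the two arguments from Table 2. The kind constants, the openmem_malloc name, and the wrapper body are hypothetical; how the retention hint actually reaches the HMMU (via the device driver described in Section 3.1) is omitted.

/* Sketch of a hint-aware allocation call (assumed names and prototype). */
#include <stddef.h>
#include <memkind.h>   /* the allocator extends Intel memkind [1] */

/* Hypothetical kinds standing in for the Table 2 values. */
extern memkind_t OPENMEM_KIND_DRAM, OPENMEM_KIND_DEFAULT, OPENMEM_KIND_NVM;

/* kind:      preferred memory type (DRAM, Default, or NVM)
 * retention: presumed DRAM life cycle, an integer in (0, 7] */
static void *openmem_malloc(memkind_t kind, size_t size, unsigned retention)
{
    (void)retention;   /* in OpenMem this hint is forwarded to the HMMU driver */
    return memkind_malloc(kind, size);
}

For example, a write-intensive buffer larger than a page might be allocated as openmem_malloc(OPENMEM_KIND_DRAM, 64 * 4096, 7), requesting DRAM with the longest retention period.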

3.3 Generating Software Hints Through Code Profiling

3.3.1 Memory Accesses After Cache Filtering.

Although OpenMem can enable software hints via direct programmer insertion as well as static code profiling, we note that we are not the programmers for the various workloads examined. Thus, we rely entirely upon static profiling to generate software hints. In particular, we run each benchmark with Valgrind [30] to determine how their data structures use memory. We customized the tools Cachegrind and DHAT to collect and analyze the memory accesses after cache filtering. Although, as we show, this approach is effective, it’s likely that an experienced programmer could produce even better results.

3.3.2 Target Objects Selection Criteria.

\begin{equation} (W_{wr}V_{wr} + W_{rd}V_{rd})/\text{object size} \gt \theta . \tag{1} \end{equation}
OpenMem is mainly designed for page-level granularity, so we discarded objects smaller than the page size. We evaluated the remaining target objects using Equation (1), based on two major metrics: the read/write ratio and the accumulated amount of written bytes. Hence, we assigned weight \(W_{wr}\) to write volume \(V_{wr}\) and \(W_{rd}\) to read volume \(V_{rd}\). The values of these weights should be selected based on platform-specific configurations, such as the proportion of DRAM/NVM capacities and the target performance/energy budget. We chose \(W_{wr} : W_{rd} = 2.8\) for our experiments. We examined the score distribution among all data objects and marked those with a high positive deviation from the mean value, specifically above the threshold percentile \(\theta\), as “DRAM-preferred” at malloc time, as described in Section 3.2. As for the streaming data in applications such as 531.deepsjeng (Figure 1(a)), we mark it as “NVM-preferred”, since it would likely be visited only once during its time in DRAM, making migration useless. The selection threshold \(\theta\) is highly dependent on the number of created objects and their sizes, which exhibit wide variability among benchmark applications. For instance, 531.deepsjeng has only 4 objects in total, while streamcluster created more than 9,700 objects. Therefore, we cannot apply a universal \(\theta\) value to classify data objects for all applications.
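The selection rule can be read as a simple per-object filter over the offline profile. The sketch below is illustrative; the data structure and the way \(\theta\) is chosen from the score distribution are assumptions, with the weights set to the ratio used in our experiments.

/* Sketch of the Equation (1) object filter over offline profile data. */
#include <stddef.h>
#include <stdbool.h>

#define PAGE_SIZE 4096

struct obj_profile {
    size_t size;         /* allocated bytes */
    double read_bytes;   /* V_rd */
    double write_bytes;  /* V_wr */
};

/* W_wr : W_rd = 2.8, as in the experiments. */
static const double W_WR = 2.8, W_RD = 1.0;

static bool prefer_dram(const struct obj_profile *o, double theta)
{
    if (o->size < PAGE_SIZE)
        return false;                   /* below page-level granularity */
    double score = (W_WR * o->write_bytes + W_RD * o->read_bytes) / (double)o->size;
    return score > theta;               /* theta: per-application threshold */
}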

3.3.3 Retention Period.

We added a retention argument to memory allocation to allow the profiler/programmer to limit the amount of time pages are held in DRAM, permitting some adaptivity should they become less active due to program phase changes.
We modified the Cachegrind simulator to count the read and written bytes of each object within every epoch. Each object has a linked list to which a new entry is appended after every epoch ends. We used the global instruction counter as an approximate time measurement. We counted the epochs with more read/written bytes than a predefined threshold and divided this count by the application runtime. As the results in the next section show, this somewhat naïve form of hinting likely leaves considerable gains on the table versus having the actual programmer generate the memory hints.
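A small sketch of this derivation is shown below. The mapping from the object’s active fraction of the runtime onto the (0, 7] retention range is an assumption on our part; the text above only describes counting hot epochs and dividing by the runtime.

/* Sketch: derive a retention hint from per-epoch byte counts (assumed scaling). */
static unsigned estimate_retention(const double *bytes_per_epoch,
                                   unsigned num_epochs,
                                   double hot_threshold)
{
    unsigned hot = 0;
    for (unsigned i = 0; i < num_epochs; i++)
        if (bytes_per_epoch[i] > hot_threshold)
            hot++;                      /* epochs in which the object stayed hot */

    double frac = num_epochs ? (double)hot / (double)num_epochs : 0.0;
    unsigned retention = (unsigned)(frac * 7.0 + 0.5);
    return retention ? retention : 1;   /* keep the hint within (0, 7] */
}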
We implemented the new malloc functions as an extension to Intel’s Memkind project [1], which is a jemalloc-like memory allocator library. We also changed the underlying memory mapping functions to ensure that the application memory was allocated to the HMMU’s memory. All these modifications are hidden behind the API, so minimal changes are needed in the benchmark source code.

3.4 Hardware Memory Management Unit

The software hints collected from the software stack are forwarded to an HMMU [40]. The HMMU is a memory controller that receives memory requests from the CPU and addresses NVM and DRAM devices in one flat memory space. It has an internal redirection page table that translates the incoming physical address to the mapped memory frame on the memory device.1 The recent activities of each page are captured in metadata bits of the page table entry. Based on this profiling information, we designed and implemented a comprehensive policy for both data placement and migration. The built-in DMA engine can move data directly between the two memory devices, without blocking incoming requests from the CPU. These operations are all hidden from the OS and are executed highly efficiently in hardware.

3.4.1 Page Swap.

To prevent profiling counter saturation, we reset the counters periodically. The HMMU tracks the number of references to each page in the last epoch. Once the number grows above the threshold value, a full-page migration is triggered. The threshold value is also learned online to cope with phase changes of the running application. In the background, the HMMU scans the entire DRAM space with a pseudo-random pointer for choosing the target page to be migrated. It also records the most recently accessed pages in a bloom filter to preserve locality, protecting them from being swapped out.
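The sketch below models this trigger in software for illustration. The per-entry fields, function names, and the exact migration call are assumptions; the real HMMU implements this logic in FPGA hardware alongside the bloom filter described above.

/* Illustrative model of the epoch-based page-swap trigger (Section 3.4.1). */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct hmmu_pte {
    uint32_t frame;        /* mapped frame in DRAM or NVM */
    uint16_t refs_epoch;   /* references seen in the current epoch */
    bool     in_dram;
};

/* Called on each access: promote NVM-resident pages whose epoch
 * reference count crosses the online-learned threshold. */
static bool should_migrate_to_dram(struct hmmu_pte *pte, uint16_t threshold)
{
    pte->refs_epoch++;
    return !pte->in_dram && pte->refs_epoch > threshold;
}

/* Called at the end of each epoch to prevent counter saturation. */
static void reset_epoch_counters(struct hmmu_pte *table, size_t n)
{
    for (size_t i = 0; i < n; i++)
        table[i].refs_epoch = 0;
}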

3.4.2 Adaptive Threshold.

Data locality may vary starkly across different phases of an application. Therefore, we maintain several metrics to evaluate the efficiency of page swaps. We calculate the average number of references to the most recently visited pages. When these numbers increase, the threshold is lowered to allow more data movement. When they decrease, the threshold is raised to suppress migration between memories and save time and energy; this is commonly seen in streaming applications, which traverse a vast range of data only once.

3.5 Data Management Policy

When the HMMU receives the first memory request to a “DRAM-preferred” page, it assigns a free page in DRAM if available. Otherwise, the requested address is translated to the currently mapped memory frame in the NVM device. It also sets a special bit for this page in the HMMU’s page table, which will trigger the page migration to DRAM next time it gets referenced.
Since DRAM pages are a limited resource, mechanisms are required to ensure that pages marked as “DRAM-preferred” are not locked into DRAM if, in fact, they are not being written often, or if DRAM pressure is otherwise high. We built a counter into the page table entry to reflect the presumed life cycle of the data objects allocated to the pertinent page. When a page is mapped or migrated to a frame in DRAM, the HMMU sets this counter with the value passed via the “retention” argument in the API (see Section 3.2). Since we focus primarily on bulk memory allocations that absorb a large amount of written/read bytes, such flagged pages are usually fully allocated in one single call to homogeneous data objects. In the rare case where a page is partitioned among different data objects, we override the counter value with the largest retention period of all pertinent allocations. During every refresh epoch, the counter decrements by 1 if no write requests are received. When the counter decreases to zero, the page loses its special status and becomes a normal page susceptible to being swapped back to NVM. The detailed algorithm is listed in Algorithm 1.
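As a rough illustration of this bookkeeping (the paper’s Algorithm 1 is not reproduced here), the sketch below ages a hinted DRAM-resident page by one each write-free epoch and drops its protected status when the counter reaches zero. Field and function names are assumed.

/* Sketch of the retention-counter aging described above (assumed names). */
struct hinted_page {
    unsigned retention_left;   /* epochs of protected DRAM residency remaining */
    unsigned writes_epoch;     /* write requests seen in the current epoch */
    int      dram_preferred;   /* set from the software hint at mapping time */
};

static void retention_epoch_tick(struct hinted_page *p)
{
    if (p->dram_preferred) {
        if (p->writes_epoch == 0 && p->retention_left > 0)
            p->retention_left--;        /* no writes this epoch: age the page */
        if (p->retention_left == 0)
            p->dram_preferred = 0;      /* becomes a normal, evictable page */
    }
    p->writes_epoch = 0;                /* start the next epoch */
}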

3.6 Hardware/Software Coordination

Since programmers/users are not necessarily aware of the actual memory access patterns of all their data objects, especially after cache filtering, their memory-type preferences are passed to the HMMU only as hints, rather than as the determinant of data management. Thus, the system is free to ignore some or all of these hints, for instance when the requested DRAM space exceeds the actual DRAM size. Throughout the different phases of an application, if the number of references or writes declines, a data object will “expire” and be swapped back to NVM, even though it was marked as “DRAM-preferred” by the user at allocation. This coordination between the HMMU and the software stack gives the design both long-term and short-term scope in data management.

3.7 Adaptive Throttling of Data Migration

When pages are marked with software hints, they have a higher priority in the competition for the limited DRAM pages. These demands add stress to the memory system and may lead to thrashing among the other pages once the memory footprint grows beyond a certain extent. Frequent page swaps waste memory bandwidth and energy, negating any benefit from leveraging the hints. Therefore, we need a way to throttle data migration.
A typical scenario is streaming applications, which traverse a vast range of addresses without recurrence. To deal with such applications lacking locality, we designed an adaptive throttling mechanism on top of our data migration policy. For full-page migration, the HMMU counts the references to each page after it is migrated to DRAM, in each refresh epoch. If the average for the last 128 accessed DRAM pages is larger than that of the last refresh epoch, we lower the bar for full-page migration: such applications have strong spatial locality, and it is worthwhile to perform full-page migration, as the following accesses all hit in DRAM with shorter latency. Alternatively, if pages are referenced less frequently than before, we raise the threshold to suppress page migration.
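The adjustment rule can be summarized by the small sketch below; the step sizes and the lower bound on the threshold are illustrative assumptions, as the text above only specifies the direction of the adjustment.

/* Sketch of the adaptive throttling rule (assumed step sizes). */
static unsigned adapt_migration_threshold(unsigned threshold,
                                          double avg_refs_last128,
                                          double avg_refs_prev_epoch)
{
    if (avg_refs_last128 > avg_refs_prev_epoch) {
        if (threshold > 1)
            threshold--;   /* strong reuse after migration: migrate more eagerly */
    } else {
        threshold++;       /* streaming-like behavior: suppress page swaps */
    }
    return threshold;
}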

4 Evaluation

In this section, we present the evaluation of OpenMem hybrid memory systems. First, we present the experimental methodology. Then, we discuss the performance results. Finally, we analyze some of the more interesting data points.

4.1 Methodology

4.1.1 Emulation Platform.

Evaluating the proposed system presents several unique challenges because we aim to test the whole system stack, comprising not only the CPU but also the memory controller, memory devices, and interconnections. Furthermore, since this project involves hybrid memory, accurate modeling of DRAM is required. Much of the prior work in the processor memory domain relies on software simulation as the primary evaluation framework, with tools such as ChampSim [7] and gem5 [5]. However, detailed software simulators that meet our goals impose huge simulation-time slowdowns versus real hardware. Moreover, there are often questions about the fidelity of results produced by arbitrary additions to software simulators [31].
Thus, we elected to emulate the HMMU architecture on an FPGA platform. The system is illustrated in Figure 3. FPGAs provide the flexibility to develop and test sophisticated memory management policies, while their hardware-like nature provides near-native simulation speed. The FPGA communicates with the ARM Cortex-A57 processor via a high-speed PCI Express link and manages the two memory modules (DRAM and NVM) directly. The DRAM and NVM memories are mapped into the physical memory space via a PCI BAR (Base Address Register) window. From the perspective of the CPU, they appear as available memory resources, the same as other regions of this unified space. Note that CPU caching is still enabled on the mapped memory space, owing to the PCIe BAR configuration.
Fig. 3. Emulation system based on FPGA.
Our platform emulates the NVM access delays by adding stall cycles to the operations executed in the FPGA to access external DRAM. We first measured the round-trip time, in FPGA cycles, to access the external DRAM DIMM, and then scaled the number of stall cycles according to the speed ratio between DRAM and 3D XPoint, as described in Table 1. Thus, we have one DRAM DIMM running at full speed and the other DRAM DIMM emulating the approximate speed of NVM. Hence the platform is not constrained to any specific type of NVM, but rather allows us to study and compare behaviors across arbitrary combinations of hybrid memories. Since 3D XPoint is one of the few commercialized and mass-produced NVM technologies (Intel Optane series), we chose it for the experiments presented in the following sections.
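The sketch below shows how such stall counts could be derived from the measured DRAM round trip and the latency ratios in Table 1. The exact calibration procedure used on the platform is not detailed above, so this derivation is an assumption for illustration.

/* Sketch: extra stall cycles so a DRAM access appears as an NVM access,
 * using the Table 1 latencies (DRAM 50 ns; 3D XPoint 100 ns read, 300 ns write). */
#include <stdint.h>

#define DRAM_LATENCY_NS   50
#define NVM_READ_NS       100
#define NVM_WRITE_NS      300

static uint32_t nvm_stall_cycles(uint32_t measured_dram_cycles, int is_write)
{
    uint32_t target_ns = is_write ? NVM_WRITE_NS : NVM_READ_NS;
    /* scale the measured round trip by the NVM/DRAM latency ratio */
    uint32_t target_cycles = measured_dram_cycles * target_ns / DRAM_LATENCY_NS;
    return target_cycles - measured_dram_cycles;   /* added on top of the DRAM access */
}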
The detailed system specification is listed in Table 3.
Table 3. Emulation System Specification

Component        Description
CPU              ARM Cortex-A57 @ 2.0 GHz, 8 cores, ARM v8 architecture
L1 I-Cache       48 KB instruction cache, 3-way set-associative
L1 D-Cache       32 KB data cache, 2-way set-associative
L2 Cache         1 MB, 16-way set-associative, 64 B cache line size
Interconnection  PCI Express Gen3 (8.0 Gbps)
Memory           128 MB DDR4 + 1 GB NVM
OS               Linux version 4.1.8

4.1.2 Workloads.

We initially considered several mobile-specific benchmark suites, including CoreMark [36] and AndEBench [14] from EEMBC. We found, however, that these suites are largely out of date and do not accurately represent the large application footprints found on modern mobile systems. Also, in some cases, they are only available as closed source [14] and thus are unusable in our infrastructure. The Android platform has its own memory management environment with garbage collection deployed by the Android Runtime (ART), so it is incompatible with our design. Therefore, we use applications from the recently released SPEC CPU 2017 benchmark suite [37]. To emulate memory-intensive workloads for the future mobile space, we selected only those applications that require a working set larger than the fast memory in our system. To diversify the workloads, we also added a few benchmark applications from the PARSEC benchmark suite. The details of the tested benchmark applications are listed in Table 4.
Table 4. Tested Workloads [4, 37]

Benchmark      Description                                                           Memory footprint
SPEC 2017
508.namd       Molecular dynamics                                                    172 MB
510.parest     Biomedical imaging: optical tomography with finite elements           413 MB
538.imagick    Image manipulation                                                    287 MB
557.xz         General data compression                                              727 MB
PARSEC
blackscholes   Option pricing with the Black-Scholes partial differential equation   610 MB
facesim        Simulates the motions of a human face                                 298 MB
freqmine       Frequent itemset mining                                               624 MB
ocean          Large-scale ocean movements simulation (HPC)                          222 MB
streamcluster  Online clustering of an input stream                                  219 MB

4.1.3 Designs Under Test.

The tested data management policies are listed as below. We controlled the available memory size by limiting the address range accessible to the memory controller.
Static: A baseline policy in which host-requested pages are randomly assigned to fast and slow memory. This serves as a nominal, worst-case memory performance.
Static-SWHints: static/offline data profiling only. The data objects are allocated to the memory type specified by software hints and stay at the allocated memory frame until they are freed.
HMMU-only: Dynamic/online profiling only, provided by the HMMU.
OpenMem: Both dynamic and static profiling enabled. Our proposed hardware/software cooperative data management policy, which incorporates the global data objects’ properties conveyed by software hints, and the short-term access patterns profiled by the HMMU. Based on this auxiliary info, the HMMU conducts a set of comprehensive data management policies. In particular, the data marked as “DRAM preferred” is given a higher priority in the competition for DRAM memory resources but will not retain the DRAM pages after the designated period expires.
AllDRAM: A baseline policy with sufficient DRAM provisioned to serve the entire working set, so no page movement is needed. This indicates a nominal, best-case but impractical upper performance bound. In the emulation system, we achieved it by making the DRAM address space larger than the peak memory footprint of the application.

4.1.4 Evaluation Objectives and Metrics.

We evaluated the tested designs in the following aspects and metrics:
(1)
Energy consumption (nJ): The total energy consumed during the target application’s execution, which comprises both dynamic and static power consumption.
(2)
Memory reads/writes (bytes): We count all accesses to the DRAM and NVM devices, including those incurred by data migrations. This factor contributes to the dynamic power consumption of the memory system. Writes to NVM are also critical to NVM device wear-out.
(3)
Runtime performance (s): Indicates application execution efficiency and also determines the DRAM background power consumption.
The absolute numbers in these metrics exhibit wide variance among different applications, whereas our focus is the tested designs rather than the applications’ characteristics. Therefore, we normalized the raw data against that of the selected baseline design under each application, respectively. The normalized results present a comprehensive performance comparison between tested designs across all applications in a single chart.

4.2 Results

4.2.1 Energy Saving.

The energy budget is highly restricted for mobile computing and embedded systems, so it is a primary factor to consider in system design. Emerging NVM has negligible background power compared to traditional DRAM technology, giving it an enormous advantage in energy saving. Here, we examine the energy consumption of all tested policies under the benchmark applications. We integrated a few counters inside the HMMU to accurately record the total number of read/write unit transactions (of cache line size) issued to DRAM and NVM, respectively, during application execution. These counts capture all read/write operations incurred by both demand memory requests and data migration. Besides the dynamic power, we also calculated the static energy consumed during the application runtime, using the specification data in Table 1.
We also estimated the overheads of the HMMU, which are dominated by the on-chip memory constituting the page table: the emulation system has 2 GB of memory with a page size of 4 KB, so we need \(\log_2(\text{2 GB}/\text{4 KB}) + 5 = 24\) metadata bits (3 bytes) per page table entry, and the total hardware cost of the HMMU is 1.5 MB. The energy consumption of this on-chip memory is negligible compared to that of the memory DIMMs.
We normalize the energy consumption of our policies to that of the AllDRAM configuration and present the results in Figure 4. The figure shows that OpenMem has the lowest energy consumption, \(\sim\)44.6% less than the AllDRAM configuration. The HMMU without software hints achieves a \(\sim\)39.7% energy saving. Even without the underlying dynamic HMMU policies, static allocation still improves significantly after adopting the software hints, as Static-SWHints beats the baseline static allocation by 10% in terms of energy saving. This evidence validates the value of software hints in saving energy and extending the lifetime of NVM devices.
Fig. 4. Energy consumption.

4.2.2 Writes Reduction and NVM Lifetime Saving.

Writes in NVM technologies have 3x the latency and dissipate 8x the energy of reads [8]. Furthermore, NVM devices are susceptible to write-induced wear-out. Figure 5 shows the number of writes to the NVM for the four tested configurations, with all data normalized against the baseline random static allocation. Here, we count not only the write accesses generated by the CPU but also the writes induced by data migration triggered inside the HMMU. The figure shows that OpenMem (7.9%) outperforms the HMMU-only (9.0%) design by a margin of \(\sim\)14%. The gap expands to 92% when compared to static allocation. The vast majority of writes were absorbed in DRAM after the pages migrated from NVM, where those write accesses would otherwise have landed. Writes to NVM decreased in 7 out of the 9 applications, with the largest improvement found in blackscholes, which sees a 25% drop in overall NVM writes. Furthermore, the portion induced by data migration was reduced remarkably, by 28%, compared to the HMMU-only design without software hints. The total number of writes (including DRAM) also diminished with these benchmarks, owing to less data migration.
Fig. 5. Writes to NVM.

4.2.3 Runtime Performance.

Figure 7 shows the speedup attained by the different designs for the given benchmarks. We normalized all the data to the runtime of the AllDRAM configuration, which is the upper bound for all policies. The geometric mean performance across all applications is (in ascending order): Static (35.1%), Static-SWHints (39.9%), HMMU-only (74.8%), OpenMem (84.1%). This result meets our expectation that OpenMem has the best overall performance since it employs both the HMMU and knowledge from the programmers. Moreover, it achieves 97% of AllDRAM, if the two outliers, ocean and streamcluster are excluded. We explore further details of memory accesses in Figure 6 to show the source of performance gain. Although DRAM only comprises 1/8 of the total memory capacity, we see it captures the vast majority of read requests. This explains why OpenMem generally approximates AllDRAM performance. Compared to the HMMU-only, OpenMem was able to absorb more writes in the DRAM. The blackscholes application, for instance, saw a 14% higher hit rate in the DRAM for those write requests sent from the CPU.
Fig. 6. Memory access breakdown.
Fig. 7. Performance speedup.

4.3 Specific Benchmark Analysis and Discussion

Ocean and streamcluster are the two benchmarks that do not benefit much from OpenMem in terms of energy saving and runtime performance. Ocean’s data structures are designed in a way that prevents contiguous memory allocation, leading to poor locality. Its cache miss rate is the second highest across all PARSEC benchmark applications. Moreover, this workload is highly sensitive to cache capacity, and its miss rate is two orders of magnitude higher for small caches. Streamcluster is a data-mining application; it is reported to have the most core-to-bus data transfers [3]. Consequently, streamcluster is very sensitive to memory bandwidth and access latency, and its performance is expected to degrade as we replace 7/8 of the DRAM in the system with NVM, which has substantially higher latency. Both applications have poor data locality, and their performance declines drastically with small CPU cache capacity. Thus, we consider the performance loss to be mainly attributable to other bottlenecks in our testing system rather than to the memory management itself. OpenMem obtained the largest edge over the HMMU-only policy in terms of NVM write reduction on two benchmarks, 508.namd (18%) and ocean (22%). We inspected the number of marked pages that remained in DRAM after each application completed and found that these two have the highest percentages: 508.namd (46%) and ocean (69%). This metric shows, to some extent, how closely the software hints match the actual access patterns of the marked data objects throughout the application. As we mentioned in Section 3.6, the HMMU still makes independent data management decisions even after accepting the software hints. Thus, when marked data objects go cold as the application switches to the next phase, they do not retain the precious DRAM resources. We infer that the benefits of software hints are maximized for those data objects that keep receiving frequent write references for the whole duration.

5 Conclusions

Emerging NVM technologies provide higher capacity and lower static power consumption than traditional DRAM. These features are promising for addressing the dilemma of the memory system in mobile computing and embedded systems: new applications have ever-increasing memory footprints, while the limited battery life prohibits DRAM from scaling. NVM, however, tends to have longer access latency and write endurance issues. We first created a purely hardware memory management unit (HMMU) for hybrid memory, but later observed that hardware resource limits impair its ability to detect long-term memory access patterns. Therefore, we proposed OpenMem, a hardware/software cooperative solution that incorporates software hints about data properties. We customized the memkind allocator [1] to allow users to designate the target memory device. We also co-designed the HMMU so that the information collected from the software stack can effectively collaborate with the existing hardware policies. We tested our schemes with benchmarks selected from the SPEC 2017 and PARSEC suites. Experimental results show that OpenMem consumes a remarkable 44.6% less energy than an impractical all-DRAM configuration. We also further reduced writes to the NVM by 14% versus the HMMU-only design. OpenMem achieved 84.1% of the performance of the all-DRAM configuration with only 1/8 of the DRAM capacity, which is 12.4% better than the HMMU-only design.

Footnote
1. The HMMU’s page table is orthogonal to the OS page table and invisible to it. In this system, the OS only sees a large, flat physical memory space.

References

[1]
Lukasz Anaczkowski. 2019. User extensible heap manager for heterogeneous memory platforms and mixed memory policies. Retrieved November 2020 from http://memkind.github.io.
[2]
Emery D. Berger, Benjamin G. Zorn, and Kathryn S. McKinley. 2002. Reconsidering custom memory allocation. In Proceedings of the 17th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications 37, 11 (Nov. 2002), 1–12. DOI:https://doi.org/10.1145/583854.582421
[3]
M. Bhadauria, V. M. Weaver, and S. A. McKee. 2009. Understanding PARSEC performance on contemporary CMPs. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC). 98–107.
[4]
Christian Bienia. 2011. Benchmarking Modern Multiprocessors. Ph.D. Dissertation. Princeton University.
[5]
Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The Gem5 simulator. ACM SIGARCH Computer Architecture News 39, 2 (Aug. 2011), 1–7. DOI:https://doi.org/10.1145/2024716.2024718
[6]
Santiago Bock, Bruce R. Childers, Rami Melhem, and Daniel Mossé. 2014. Concurrent page migration for mobile systems with OS-Managed hybrid memory. In Proceedings of the 11th ACM Conference on Computing Frontiers. Association for Computing Machinery, New York, NY, Article 31, 10 pages. DOI:https://doi.org/10.1145/2597917.2597924
[7]
ChampSim. 2016. Champsim. Retrieved November 2020 from https://github.com/ChampSim/ChampSim.
[8]
An Chen. 2016. A review of emerging non-volatile memory (NVM) technologies and applications. Solid-State Electronics 125 (2016), 25–38. DOI:https://doi.org/10.1016/j.sse.2016.07.006
[9]
Jeongdong Choe. 2017. Intel 3D XPoint memory die removed from intel optane PCM. Retrieved November 2020 from https://www.techinsights.com/blog/intel-3d-xpoint-memory-die-removed-intel-optanetm-pcm-phase-change-memory.
[10]
C. C. Chou, A. Jaleel, and M. K. Qureshi. 2014. CAMEO: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache. In Proceedings of the 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. 1–12. DOI:https://doi.org/10.1109/MICRO.2014.63
[11]
F. J. Corbato. 1968. A paging experiment with the multics system. MIT Project MAC.
[12]
Gaurav Dhiman, Raid Ayoub, and Tajana Rosing. 2009. PDRAM: A hybrid PRAM and DRAM main memory system. In Proceedings of the 46th Annual Design Automation Conference. ACM, New York, NY, 664–669. DOI:https://doi.org/10.1145/1629911.1630086
[13]
Subramanya R. Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan Sundaram, Nadathur Satish, Rajesh Sankaran, Jeff Jackson, and Karsten Schwan. 2016. Data tiering in heterogeneous memory systems. In Proceedings of the 11th European Conference on Computer Systems. Association for Computing Machinery, New York, NY, Article 15, 16 pages. DOI:https://doi.org/10.1145/2901318.2901344
[14]
EEMBC. 2015. An EEMBC benchmark for android devices. Retrieved November 2020 from http://www.eembc.org/andebench.
[15]
K. Eshraghian, Kyoung-Rok Cho, O. Kavehei, Soon-Ku Kang, D. Abbott, and Sung-Mo Steve Kang. 2011. Memristor MOS content addressable memory (MCAM): Hybrid architecture for future high performance search engines. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19, 8 (Aug 2011), 1407–1417. DOI:https://doi.org/10.1109/TVLSI.2010.2049867
[16]
Viacheslav Fedorov, Jinchun Kim, Mian Qin, Paul V. Gratz, and A. L. Narasimha Reddy. 2017. Speculative paging for future NVM storage. In Proceedings of the International Symposium on Memory Systems. ACM, New York, NY, 399–410. DOI:https://doi.org/10.1145/3132402.3132409
[17]
Ahmad Hassan, Hans Vandierendonck, and Dimitrios S. Nikolopoulos. 2015. Software-managed energy-efficient hybrid DRAM/NVM main memory. In Proceedings of the 12th ACM International Conference on Computing Frontiers. ACM, New York, NY, USA, Article 23, 8 pages. DOI:https://doi.org/10.1145/2742854.2742886
[18]
Rik Henderson and Max Langridge. 2020. From galaxy s to galaxy S20. Retrieved November 2020 from https://www.pocket-lint.com/phones/news/samsung/136736-timeline-of-samsung-galaxy-flagship-android-phones-in-pictures.
[19]
Cheng-Chieh Huang and Vijay Nagarajan. 2014. ATCache: Reducing DRAM cache latency via a small SRAM tag cache. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation. ACM, New York, NY, 51–60. DOI:https://doi.org/10.1145/2628071.2628089
[20]
INTEL CORPORATION. 2016. Intel optane technology. Retrieved November 2020 from https://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html.
[21]
D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi. 2014. Unison cache: A scalable and effective die-stacked DRAM cache. In Proceedings of the 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. 25–37. DOI:https://doi.org/10.1109/MICRO.2014.51
[22]
Xu Ji, Chao Wang, Nosayba El-Sayed, Xiaosong Ma, Youngjae Kim, Sudharshan S. Vazhkudai, Wei Xue, and Daniel Sanchez. 2017. Understanding object-level memory access patterns across the spectrum. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, New York, NY, Article 25, 12 pages. DOI:https://doi.org/10.1145/3126908.3126917
[23]
Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. 2009. Architecting phase change memory as a scalable dram alternative. In Proceedings of the 36th Annual International Symposium on Computer Architecture. ACM, New York, NY, 2–13. DOI:https://doi.org/10.1145/1555754.1555758
[24]
Haikun Liu, Yujie Chen, Xiaofei Liao, Hai Jin, Bingsheng He, Long Zheng, and Rentong Guo. 2017. Hardware/Software cooperative caching for hybrid DRAM/NVM memory architectures. In Proceedings of the International Conference on Supercomputing. Association for Computing Machinery, New York, NY, Article 26, 10 pages. DOI:https://doi.org/10.1145/3079079.3079089
[25]
Mitesh R. Meswani, Gabriel H. Loh, Sergey Blagodurov, David Roberts, John Slice, and Mike Ignatowski. 2014. Toward efficient programmer-managed two-level memory hierarchies in exascale computers. In Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing. IEEE Press, 9–16. DOI:https://doi.org/10.1109/Co-HPC.2014.8
[26]
J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan. 2012. Enabling efficient and scalable hybrid memories using fine-granularity DRAM cache management. IEEE Computer Architecture Letters 11, 2 (2012), 61–64. DOI:https://doi.org/10.1109/L-CA.2012.2
[27]
Inc. Micron Technology. 2017. Calculating Memory Power for DDR4 SDRAM. Technical Report.
[28]
Jeffrey C. Mogul, Eduardo Argollo, Mehul Shah, and Paolo Faraboschi. 2009. Operating system support for NVM+DRAM hybrid main memory. In Proceedings of the 12th Conference on Hot Topics in Operating Systems. USENIX Association, 14.
[29]
Anurag Mukkara, Nathan Beckmann, and Daniel Sanchez. 2016. Whirlpool: Improving dynamic cache management with static data classification. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems. Association for Computing Machinery, New York, NY, 113–127. DOI:https://doi.org/10.1145/2872362.2872363
[30]
Nicholas Nethercote and Julian Seward. 2007. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation. Association for Computing Machinery, New York, NY, 89–100. DOI:https://doi.org/10.1145/1250734.1250746
[31]
T. Nowatzki, J. Menon, C. Ho, and K. Sankaralingam. 2015. Architectural simulators considered harmful. IEEE Micro 35, 6 (Nov 2015), 4–12. DOI:https://doi.org/10.1109/MM.2015.74
[32]
Antonio J. Peña and Pavan Balaji. 2014. Toward the efficient use of multiple explicitly managed memory subsystems. In Proceedings of the 2014 IEEE International Conference on Cluster Computing. 123–131. DOI:https://doi.org/10.1109/CLUSTER.2014.6968756
[33]
Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. 2009. Scalable high performance main memory system using phase-change memory technology. In Proceedings of the 36th Annual International Symposium on Computer Architecture. ACM, New York, NY, 24–33. DOI:https://doi.org/10.1145/1555754.1555760
[34]
S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner, Y. Chen, R. M. Shelby, M. Salinga, D. Krebs, S. Chen, H. Lung, and C. H. Lam. 2008. Phase-change random access memory: A scalable technology. IBM Journal of Research and Development 52, 4.5 (July 2008), 465–479. DOI:https://doi.org/10.1147/rd.524.0465
[35]
Reza Salkhordeh and Hossein Asadi. 2016. An operating system level data migration scheme in hybrid DRAM-NVM memory architecture. In Proceedings of the 2016 Design, Automation Test in Europe Conference Exhibition. 936–941.
[36]
Shay Gal-On and Markus Levy. 2012. Exploring CoreMark—a benchmark maximizing simplicity and efficacy. Retrieved November 2020 from https://www.eembc.org/techlit/articles/coremark-whitepaper.pdf.
[37]
SPEC. 2017. SPEC CPU2017 documentation. Retrieved November 2020 from https://www.spec.org/cpu2017/Docs/.
[38]
Ahsen J. Uppal and Mitesh R. Meswani. 2015. Towards workload-aware page cache replacement policies for hybrid memories. In Proceedings of the 2015 International Symposium on Memory Systems. Association for Computing Machinery, New York, NY, 206–219. DOI:https://doi.org/10.1145/2818950.2818978
[39]
Z. Wang, Z. Gu, and Z. Shao. 2014. Optimizated allocation of data variables to PCM/DRAM-based hybrid main memory for real-time embedded systems. IEEE Embedded Systems Letters 6, 3 (Sep. 2014), 61–64. DOI:https://doi.org/10.1109/LES.2014.2325878
[40]
F. Wen, M. Qin, P. V. Gratz, and A. L. N. Reddy. 2020. Hardware memory management for future mobile hybrid memory systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 11 (2020), 3627–3637. DOI:https://doi.org/10.1109/TCAD.2020.3012213
[41]
Kai Wu, Yingchao Huang, and Dong Li. 2017. Unimem: Runtime data management on non-volatile memory-based heterogeneous main memory. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, New York, NY, Article 58, 14 pages. DOI:https://doi.org/10.1145/3126908.3126923
[42]
Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee. 2019. Nimble page management for tiered memory systems. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems. Association for Computing Machinery, New York, NY, 331–345. DOI:https://doi.org/10.1145/3297858.3304024
