1. Development of Core Source Technologies for a General-Purpose Operating System
Capable of Reducing Energy Consumption by 30% or More
January Regular Meeting
Kookmin University
Youngman Kim, Jaeil Han
2. Contents
Research Activities
Part 1 : Assigned Papers for Presentation (3 papers)
1. A new model for the system and devices latency
2. Cross-Layer Frameworks for Constrained Power and Resources
Management of Embedded Systems
3. Automatic Run-Time Selection of Power Policies for Operating Systems
Part 2 : Kookmin University Research
1. Performance Evaluation of Parallel Applications on Next Generation
Memory Architecture with Power-Aware Paging Method
2. PFFS : A Scalable Flash Memory File System for the Hybrid
Architecture of Phase-change RAM and NAND Flash
3. Address Translation Technique for Large NAND Flash Memory using
Page Level Mapping
4. Performance Optimization Techniques for Legacy File Systems on
Flash Memory
3. 1. Research Activities
Research Contents
Reading papers related to energy saving
Analyzing simulators
Papers
A new model for the system and devices latency
Automatic Run-Time Selection of Power Policies for Operating Systems
Cross-Layer Frameworks for Constrained Power and Resources Management of Embedded Systems
Address Translation Technique for Large NAND Flash Memory
Performance Optimization Techniques for Legacy File Systems on Flash Memory
A Scalable Flash Memory File System for the Hybrid Architecture of Phase-change RAM and NAND Flash
An Energy Efficient Cache Design Using Spin Torque Transfer (STT) RAM
Performance Evaluation of Parallel Applications on Next Generation Memory
Architecture with Power-Aware Paging Method
6. WHAT IS LATENCY ?
• “In a computer system, latency is often used to mean any delay or waiting that
increases real or perceived response time beyond the response time desired.”
• “Specific contributors to computer latency include mismatches in data speed
between the microprocessor and input/output devices and inadequate data buffers.”
• “Within a computer, latency can be removed or "hidden" by such techniques as
prefetching (anticipating the need for data input requests) and multithreading, or
using parallelism across multiple execution threads.”
• Source: http://searchciomidmarket.techtarget.com/definition/latency
7. TERMINOLOGY (Texas Instruments)
• Latency: time to react to an external event, e.g. time spent executing the handler
code after an IRQ, or time spent executing driver code after an external wake-up
event.
• HW latency: latency introduced by the HW to transition between power states.
• SW latency: time for the SW to execute low-power transition code, e.g. IP block
save & restore, cache flush/invalidate, etc.
• System: ‘everything needed to execute the kernel code’, e.g. on OMAP3,
system = CPU0 + CORE (main memory, caches, IRQ controller...).
• Per-device latency: latency of a device (or peripheral). The per-device PM QoS
framework controls device states based on the allowed per-device latency.
• Cpuidle: framework that controls the CPU low-power states (= C-states) based on
the allowed system latency. Note: it is being abused to control the system state.
• PM runtime: framework that allows the dynamic switching of resources.
8. HOW TO SPECIFY THE ALLOWED LATENCY
• The PM QoS framework allows the kernel and user space to specify the allowed
latency.
• The framework calculates the aggregated constraint value and calls the registered
platform-specific handlers in order to apply the constraints at a lower level.
9. PM QoS FRAMEWORK
• PM QoS is a framework developed by Intel.
• It allows kernel code and applications to set their requirements in terms of:
• CPU DMA latency.
• Network latency.
• According to these requirements, PM QoS allows kernel drivers to adjust their
power management.
• See Documentation/power/pm_qos_interface.txt.
• http://free-electrons.com/kerneldoc/latest/power/pm_qos_interface.txt
• Still in very early deployment (only 4 drivers in 2.6.36).
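As a concrete illustration: pm_qos_interface.txt documents a userspace interface in
which writing a 32-bit microsecond value to /dev/cpu_dma_latency holds a CPU/DMA
latency constraint for as long as the file descriptor stays open. A minimal sketch:

    /* Hold a 100 us CPU/DMA latency constraint via the PM QoS misc device;
     * the constraint is dropped when the file descriptor is closed.      */
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    int main(void)
    {
        int32_t max_latency_us = 100;   /* allowed latency: 100 us        */

        int fd = open("/dev/cpu_dma_latency", O_WRONLY);
        if (fd < 0)
            return 1;
        write(fd, &max_latency_us, sizeof(max_latency_us));

        /* ... latency-sensitive work; C-states whose exit latency exceeds
         * the constraint are avoided while fd remains open ...           */

        close(fd);                      /* constraint removed             */
        return 0;
    }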
10. What is the key point of controlling the latency ?
• The point is to dynamically optimize the power consumption of all system
components.
• Knowing the allowed latency (from the constraints) and the expected worst-case
latency makes it possible to choose the optimum power state.
11. OMAP
• OMAP (Open Multimedia Applications Platform), developed by Texas Instruments,
is a category of proprietary systems on chips (SoCs) for portable and mobile
multimedia applications.
• OMAP devices generally include a general-purpose ARM architecture processor
core plus one or more specialized co-processors.
• Earlier OMAP variants commonly featured a variant of the Texas Instruments
TMS320 series digital signal processor.
• The OMAP family consists of three product groups classified by performance and
intended application:
• High-performance applications processors
• Basic multimedia applications processors
• Integrated modem and applications processors
18. PROBLEM
• There is no concept of ‘overall latency’.
• No interdependency between PM frameworks
Ex. on OMAP3: cpuidle manages only a subset of the power domains (MPU, CORE).
Ex. on OMAP3: per-device PM QoS manages the other power domains.
No relation between the frameworks; each framework has its own latency numbers.
• Some system settings are not included in the model
Mainly because of the (lack of) SW support at the time of the measurement session.
Ex. on OMAP3: voltage scaling in low power modes, sys_clkreq, sys_offmode and
the interaction with the Power IC.
• Dynamic nature of the system settings
The measured numbers are for a fixed setup, with predefined system settings.
The measured numbers are constant.
19. SOLUTION PROPOSAL
• Overall latency calculation.
• We need a model which breaks down the overall latency into the latencies from
every contributor:
Latency = Latency_SW + Latency_HW
Latency = Latency_SW + Latency_SoC + Latency_ExternalHW
• Latency_SW : time for the SW to save/restore the context of an IP block.
• Latency_SoC : time for the SoC HW to change an IP block state.
• Latency_ExternalHW : time to stop/restart external HW (e.g. external crystal
oscillator, external power supply, ...).
• Note: every latency factor may be divided into smaller factors, e.g. on OMAP a
DPLL can feed multiple power domains.
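To make the model concrete, here is a minimal sketch (not from the paper; all names
are hypothetical) of how a governor could sum the three contributors per power state
and pick the deepest state whose worst-case latency still fits the allowed latency:

    /* Sum the per-state latency contributors and pick the deepest state
     * whose worst-case wake-up latency satisfies the PM QoS constraint. */
    struct power_state {
        const char *name;
        unsigned latency_sw_us;       /* context save/restore code        */
        unsigned latency_soc_us;      /* SoC HW state transition          */
        unsigned latency_ext_us;      /* external oscillator/supply, etc. */
    };

    static unsigned total_latency(const struct power_state *s)
    {
        return s->latency_sw_us + s->latency_soc_us + s->latency_ext_us;
    }

    /* states[] is ordered from shallowest to deepest. */
    const struct power_state *pick_state(const struct power_state *states,
                                         int n, unsigned allowed_us)
    {
        const struct power_state *best = &states[0];
        for (int i = 1; i < n; i++)
            if (total_latency(&states[i]) <= allowed_us)
                best = &states[i];
        return best;
    }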
22. Problems
• Existing studies on power management make an implicit assumption
• Only one policy can be used to save power
• Hence, those studies focus on finding the best policies for unique request patterns
23. HAPPI (Homogeneous Architecture for Power Policy Integration)
• HAPPI is currently capable of supporting power policies for disk, DVD-ROM, and
network devices
• But it can easily be extended to support other I/O devices
• Must provide (see the sketch below):
• A function that predicts idleness and controls a device’s power state.
• A function that accepts a trace of device accesses, determines the actions the
control function would take, and returns the energy consumption and access delay
from the actions.
24. HAPPI (Homogeneous Architecture for Power Policy Integration)
• If a policy is selected by HAPPI to manage the power state of a specific device,
it is considered active
• Each device is assigned only one active policy at any time
• Whenever the device is accessed, HAPPI captures the size and time of the access
• It also records the energy and delay for each device
26. Implementation
• Linux 2.6.5
• Policies and evaluators are implemented as kernel modules
• The experimental hardware is not fully ACPI compliant
• So they implement a function that returns the power, transition energy, and
transition delay for each state of each device
• Policies need these values to compute the power consumed in each state
27. Experiments
• Fujitsu laptop hard disk (HDD)
• Samsung DVD drive (DVD)
• NetXtreme integrated wired network card (NIC)
[Table: power states for each device]
28. Experiments
• Workload
1. Web browsing + buffered media playback from DVD
2. Download video and buffered media playback from disk
3. CVS checkout from remote repository
4. E-mail synchronization + unbuffered media playback from DVD
5. Kernel compile
30. Exponential Prediction
• Formulation: I_{n+1} = a * i_n + (1 - a) * I_n
• I_n : the last predicted value
• i_n : the latest idle period
• a : a constant attenuation factor in the range between 0 and 1
• If a = 0, then I_{n+1} = I_n
• If a = 1, then I_{n+1} = i_n
• So, typically a = 1/2
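A one-line implementation of this predictor (a sketch, with a fixed to 1/2 as the
slide suggests):

    /* Exponential-average idle-time prediction:
     * I_{n+1} = a * i_n + (1 - a) * I_n, with a = 1/2.                  */
    double predict_next_idle(double last_pred /* I_n */,
                             double last_idle /* i_n */)
    {
        const double a = 0.5;         /* attenuation factor              */
        return a * last_idle + (1.0 - a) * last_pred;
    }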
32. Experiments
• Result
[Table: estimated energy consumption for each policy on each device, for workloads 1–5]
[Table: selected policies for each device at each evaluation, for workloads 1–5]
33. Conclusion
• Experiments indicate that policy selection is highly adaptive to workload and
hardware types, supporting the authors’ claim that automatic policy selection is
necessary to achieve better energy savings
36. Problems
• This paper proposes a solution (an architecture and a low-power paging algorithm)
to reduce energy consumption in HPC (High Performance Computing) systems
• It aims to demonstrate that the low-power paging algorithm can improve HPC
performance and reduce energy consumption
37. SOLUTION
• Replace a part of DRAM with MRAM.
• Conduct simulations to evaluate the performance and energy consumption of
several application benchmarks.
• Make a trace file of the memory accesses in each application benchmark by using
the Valgrind profiling tool.
• For each memory access that incurs a miss, collect the memory address and
profiling results, which are the access counts on all the memory pages.
• With the trace files, they replay the behavior of the applications with their
event-driven simulator.
38. HOW CAN THEY SOLVE IT ?
• They propose a hybrid memory architecture and power-aware swapping.
• Use MRAM as main memory beside DRAM due to its higher access speed and low
power consumption.
• Use FLASH as a fast random-access swap device due to its faster random-access
read speed.
• Use the MRAM hit rate and a threshold in the Low-Power Paging Algorithm to
manage the swapping interaction between DRAM/MRAM and FLASH, thereby
improving performance and reducing energy.
40. Low-Power Paging Algorithm
[Figure 2 – Algorithmic flow of the proposed paging algorithm: profiling results are
used to allocate hot pages on MRAM; while the application runs, each L2 cache miss
updates the counters (MRAM Hit++, Memory Access++) and the ratio
MRAM Hit Rate = MRAM Hit / Memory Access; on a page fault, if the MRAM hit rate
exceeds the threshold, the least recently used page on DRAM is swapped out;
otherwise, the least recently used page on DRAM or MRAM is swapped out.]
- A trace file also includes profiling results, which are the access counts on all the
memory pages.
- Profiling: the per-page memory access frequency of a given application throughout
its execution, obtained by a pre-execution trial or by sampling with HW assist.
- With the trace file, they replay the behavior of the application with their
event-driven simulator.
41. Why do they need that algorithm ?
• The first, simple algorithm works as follows:
• The hottest pages are pinned down and allocated on MRAM so that they are never
swapped out.
• The remaining pages are allocated onto DRAM and use LRU-based swapping with
flash memory.
• In some cases this simple algorithm increased application execution time under
LRU swapping.
• Excessive swaps slowed down the application considerably.
42. Why do they need that algorithm ?
• To resolve this situation, they extend the algorithm by introducing a metric called
the MRAM hit rate and its threshold, so that applications exhibiting lower locality
may use both MRAM and DRAM as swappable main memory.
• Thr = α × MRAM_SIZE / TOTAL_SIZE
• α (≈ 1) is a configurable parameter used to determine the threshold.
• Several preliminary experiments have shown that a threshold value of 0.9 seems
to work for the NAS and other HPC applications.
43. CORE IDEAS OF LOW POWER PAGING ALGORITHM
• The MRAM hit rate is a dynamic value that indicates the ratio of the access count
on MRAM to the accesses to all the memory at each point in execution time.
• If the ratio is large, we can decide that accesses to MRAM have sufficient locality
such that the pages should be pinned down.
• On the other hand, if the ratio is small, the application lacks locality and thus the
entire main memory should be seen as swappable (see the sketch below).
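A minimal sketch of this bookkeeping (hypothetical names, not the paper's code):
counters are updated on every L2 cache miss, and on a page fault the hit rate is
compared against the threshold to choose the victim scope:

    /* Track the MRAM hit rate and decide the swap-victim scope.          */
    static unsigned long mram_hits, mem_accesses;

    /* Called on every L2 cache miss. */
    void account_access(int on_mram)
    {
        mem_accesses++;
        if (on_mram)
            mram_hits++;
    }

    /* Called on a page fault. Thr = alpha * MRAM_SIZE / TOTAL_SIZE,
     * with alpha ~= 1 (0.9 in the slides' preliminary experiments).
     * Returns nonzero if only DRAM pages should be swap candidates.      */
    int swap_from_dram_only(double threshold)
    {
        double hit_rate = (double)mram_hits / (double)mem_accesses;
        /* High hit rate: MRAM pages show good locality, keep them pinned
         * and swap only from DRAM. Low hit rate: treat all of main memory
         * (DRAM and MRAM) as swappable.                                  */
        return hit_rate > threshold;
    }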
44. CONCLUSION
• Aggressively reducing DRAM capacity can reduce energy consumption, even with
swapping.
• The energy consumption can be reduced to 25% by reducing DRAM capacity.
46. NAND Flash Memory
• NAND flash memory structure
• Page (2KB) : read and write unit
• Block (64 pages = 128KB) : erase unit
• NAND flash memory, the beauty
• Non-volatility
• Fast access time (no seek latency)
• Low power consumption
• Relatively large capacity
• Shock resistance
• NAND flash memory, the beast
• Erase before write : a page must be erased first in order to update data on
that page
• Slow write : supports only page-level writes, which are 10x slower than reads
• Limited lifetime : guaranteed for 100K ~ 1M erase cycles
Ref) K9F1G08X0A Datasheet
47. Feature of PRAM
Source: Motoyuki Ooishi, Nikkei Electronics Asia, Oct. 2007
PRAM memory
Random access memory
Non-volatile memory
Low leakage energy
High density: 4x denser than DRAM
Limited endurance
48. NAND flash memory VS. PRAM
Feature             PRAM[1]         NOR             SLC NAND        MLC NAND[2]
Volatility          Non-volatile    Non-volatile    Non-volatile    Non-volatile
Random access       Yes             Yes             No              No
Unit of write       Word (2 byte)   Word (2 byte)   Page (2 Kbyte)  Page (2 Kbyte)
Read speed          50 ns/word      100 ns/word     25 us/page      60 us/page
Write speed         5 us/word       11.5 us/word    200 us/page     800 us/page
Erase speed         N/A             0.7 s/64KB      2 ms/128KB      1.5 ms/128KB
Program endurance   10^8            10^5            10^6            10^5
Size                32 MByte        32 MByte        ~1 GB           4 GB+
Others              -               Serial program  Serial program  Paired page damage
1. KPS5615EZM Data Sheet, 2. K9G8G08U0M Data Sheet
49. JFFS2 (Journaling Flash File System)
• Developed by Red Hat in 2001
• Originally designed for NOR flash memory
• Supports data compression
– Good for reducing total page writes
– Additional computational overhead
• Log-structured file system
– Any file system modification is appended to the log
• Scalability problem
– Needs a full scan at mount time
– Manages all metadata in main memory
• Directory structure, file indexing structure
[Figure: JFFS scan area]
Ref. D. Woodhouse, “JFFS: The journaling flash file system,” presented at the Ottawa
Linux Symposium, 2001.
50. YAFFS2 (Yet Another Flash File System)
• Developed by Aleph One in 2003
• Designed specifically for NAND flash memory
– Uses the spare region to store the file metadata
• Log-structured file system
– Any file system modification is appended to the log
• Scalability problem
– Needs to scan the entire spare region
• Reduced mounting time compared with JFFS2
– Manages all metadata in main memory
• Directory structure, file indexing structure
[Figure: YAFFS scan area]
Ref. http://www.yaffs.net/
51. CFFS (Core Flash File System)
• Developed by CORE Lab in 2006
• Log-structured file system
– Any file system modification is appended to the log
• Metadata separation
– Metadata and data are written to different blocks in NAND flash
– Scanning only the metadata blocks → reduced mounting time
• Stores the file indexing structure in NAND flash memory
– Reduces the main memory usage
– Manages the directory structure in main memory
• CFFS limitations
– Needs extra metadata write operations
• Updating the file index in NAND flash memory
– Wear-leveling problem
• Metadata blocks are updated more frequently
[Figure: CFFS scan area]
Ref. S. H. Lim and K. H. Park, “An efficient NAND flash file system for flash memory
storage,” IEEE Transactions on Computers, vol. 55, no. 7, pp. 906–912, 2006.
52. Previous flash file systems
         Feature                    Pros.                    Cons.
JFFS2    • LFS approach             • Reliable               • Metadata update overhead
[2001]   • Data compression                                  • Scalability problem
         • Node management                                   • Node management overhead
YAFFS2   • LFS approach             • Reduced mounting time  • Metadata update overhead
[2003]   • Using spare region                                • Scalability problem
CFFS     • LFS approach             • Reduced mounting time  • Metadata update overhead
[2006]   • Metadata separation      • Reduced GC overhead    • Scalability problem remaining
         • File indexing in NAND                             • Extra write overhead
                                                             • Wear-leveling problem
54. Scalability problems
1. Scan area comparison
[Figure: scan area vs. non-scan area; JFFS, YAFFS >> CFFS]
2. Use of main memory
Accessing a file ‘/dir/a.txt’: Open(“/dir/a.txt”) → i-number → location of inode →
location of data
Type of index                       JFFS, YAFFS            CFFS
1. Find i-number using path name    In-memory directory    In-memory directory
2. Find inode using i-number        In-memory inode map    In-memory inode map
3. Find file data                   In-memory file index   In-NAND file index
56. PFFS Scalability: Mounting time
• PFFS has a minimal, fixed mounting time
– All metadata are connected from the root directory in PRAM
– PFFS does not need to scan the NAND flash memory
[Figure: scan area comparison; JFFS, YAFFS >> CFFS > PFFS]
57. PFFS Scalability: Memory use
• PFFS uses no DRAM main memory for its metadata structures
– Most of the metadata structures of PFFS are contained in PRAM
Accessing a file ‘/dir/a.txt’: Open(“/dir/a.txt”) → i-number → location of inode →
location of data
Type of index                     JFFS, YAFFS           CFFS                 PFFS
1. Find i-number using path name  In-memory directory   In-memory directory  In-PRAM directory
2. Find inode using i-number      In-memory inode map   In-memory inode map  Simple calculation
3. Find file data                 In-memory file index  In-NAND file index   In-PRAM data pointers
[Figure: main memory use comparison]
58. Evaluation
• CPU: Samsung S3C2413 (ARM 926EJ)
• Mem: 64MB DRAM
• 1GB MLC NAND, 32MB PRAM
• [Table: NAND flash memory characteristics]
• Benchmark: PostMark
• Benchmark for short-lived, small-file read/write performance
• Comparison with YAFFS2
61. Conclusion
• PFFS solves the scalability problems of previous flash file
systems by using the hybrid architecture of PRAM and NAND
flash memory
• Mounting time and memory usage of PFFS are O(1)
• The performance of PFFS is 25% better than YAFFS2 for small
file writes
63. Problems
• In a page-level mapping scheme, data can be relocated at page granularity
• But its disadvantage is the large size of the mapping table
• Ex) In a 64GB SSD:
• If using block-level mapping, the size of the mapping table is 512KB
• If using page-level mapping, the size of the mapping table is 64MB
• So most actual commercial SSDs use a hybrid scheme based on the block-level
mapping scheme (the arithmetic is sketched below)
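The arithmetic behind these numbers, assuming 4KB pages, 512KB blocks, and 4-byte
mapping entries (the figures used elsewhere in these slides):

    /* Mapping-table sizes for a 64 GB SSD: one 4-byte entry per page
     * (page-level) vs. one per block (block-level).                     */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long long CAPACITY = 64ULL << 30;  /* 64 GB       */
        const unsigned long PAGE  = 4UL << 10;            /* 4 KB page   */
        const unsigned long BLOCK = 512UL << 10;          /* 512 KB block*/
        const unsigned long ENTRY = 4;                    /* 4 B entry   */

        /* 16M pages * 4 B = 64 MB (= 65536 KB)                          */
        printf("page-level table:  %llu KB\n", CAPACITY / PAGE  * ENTRY >> 10);
        /* 128K blocks * 4 B = 512 KB                                    */
        printf("block-level table: %llu KB\n", CAPACITY / BLOCK * ENTRY >> 10);
        return 0;
    }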
64. Page Level Mapping scheme address translation techniques
• The entire mapping table is maintained in the NAND
• Frequently used parts of the mapping table are cached in DRAM
• Uses an FTL-TLB and an FTL mapping directory structure
65. Page table management in a Demand Paging Memory System
• Using page-level mapping in NAND flash memory is similar to using a demand
paging scheme in the memory system
66. FTL-TLB
• Manages the mapping table in units of sections
• A section stores the mapping table entries for one NAND flash block
• Ex) If a block has 128 pages, the size of a section is 128 × 4B = 512B
• The number of sections is the same as the total number of blocks in the NAND
(see the translation sketch below)
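A minimal sketch of the resulting address translation (helper names are hypothetical;
a section is located by dividing the logical page number by the pages per block):

    /* FTL-TLB lookup, assuming 128 pages per block and 4-byte physical
     * page numbers.                                                     */
    #include <stdint.h>

    #define PAGES_PER_BLOCK 128

    typedef struct {
        uint32_t ppn[PAGES_PER_BLOCK]; /* mapping entries for one block  */
        int      dirty;                /* updated, not yet written back  */
    } section_t;

    /* Hypothetical helpers provided by the FTL. */
    section_t *tlb_lookup(uint32_t section_no);  /* NULL on miss         */
    section_t *tlb_fill(uint32_t section_no);    /* load section from
                                                    NAND, evicting (and
                                                    writing back) a
                                                    victim if needed     */

    uint32_t ftl_translate(uint32_t lpn)
    {
        uint32_t section_no = lpn / PAGES_PER_BLOCK; /* which section    */
        uint32_t offset     = lpn % PAGES_PER_BLOCK; /* entry inside it  */

        section_t *s = tlb_lookup(section_no);
        if (s == NULL)                               /* FTL-TLB miss     */
            s = tlb_fill(section_no);
        return s->ppn[offset];
    }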
67. FTL Mapping Directory
• The FTL mapping directory is allocated in DRAM
• The FTL mapping directory has one entry per section, recording:
• Whether the section is cached in the FTL-TLB
• Whether the section has been updated relative to the copy in NAND
69. Evaluation
• Workload
• The target is a 64GB SSD
• Use VirtualBox with a 64GB HDD and Windows XP
• Collect access traces
• Daily_usage and multi_program represent typical usage environments
• Install_update is a Windows update plus program installation
• Large_file is copying large files
Trace Requests Data size [MB]
Daily_usage 545031 10270.78
Multi_program 309262 3070.669
Install_update 1022856 14072.22
Large_file 45593 2810.333
71. CFLRU (Clean First LRU)
• If all page frames hold clean pages (or all hold dirty pages), CFLRU behaves the
same as the LRU algorithm.
• Consider the case where the page frames contain both dirty and clean pages.
• CFLRU divides the LRU list into two regions.
• The working region consists of recently used pages, and most cache hits are
generated in this region.
• The clean-first region consists of pages which are candidates for eviction.
• CFLRU first selects a clean page to evict in the clean-first region.
• If there is no clean page in this region, the dirty page at the end of the LRU list
is evicted (see the sketch below).
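A minimal sketch of CFLRU victim selection (hypothetical structures; W is the size of
the clean-first region at the LRU end of the list):

    /* CFLRU victim selection over a doubly linked LRU list.             */
    #include <stddef.h>

    typedef struct frame {
        struct frame *prev, *next;    /* prev points toward the MRU end  */
        int dirty;
    } frame_t;

    typedef struct {
        frame_t *mru, *lru;           /* list ends                       */
        int window;                   /* clean-first region size W       */
    } lru_list_t;

    frame_t *cflru_victim(lru_list_t *l)
    {
        frame_t *f = l->lru;
        /* 1. Scan the clean-first region (W frames from the LRU end)
         *    and evict the first clean page found: no flash write.      */
        for (int i = 0; f != NULL && i < l->window; i++, f = f->prev)
            if (!f->dirty)
                return f;
        /* 2. No clean page in the region: fall back to plain LRU and
         *    evict the dirty page at the very end of the list.          */
        return l->lru;
    }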
73. LRU-WSR (LRU - Write Sequence Reordering)
• LRU-WSR introduces two concepts: the cold-dirty page and the cold flag.
• If a page is dirty and its cold flag is set, the page is regarded as a cold-dirty page.
• LRU-WSR uses a page list L and one additional flag per page, the cold flag
(see the sketch below).
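A minimal sketch of LRU-WSR victim selection (hypothetical structures; a hot dirty
page gets a second chance before becoming cold-dirty):

    /* LRU-WSR victim selection: each frame carries a dirty bit and a
     * cold flag; the list helper is hypothetical.                       */
    typedef struct wframe {
        struct wframe *prev, *next;
        int dirty;
        int cold;                     /* set when given a second chance  */
    } wframe_t;

    typedef struct {
        wframe_t *mru, *lru;
    } wlist_t;

    void move_to_mru(wlist_t *l, wframe_t *f);   /* hypothetical helper  */

    wframe_t *lru_wsr_victim(wlist_t *l)
    {
        for (;;) {
            wframe_t *f = l->lru;
            if (!f->dirty || f->cold)
                return f;             /* clean, or cold-dirty: evict     */
            /* Hot dirty page: second chance. Mark it cold and move it
             * to the MRU end; a later reference clears the cold flag,
             * otherwise it becomes a cold-dirty eviction candidate.     */
            f->cold = 1;
            move_to_mru(l, f);
        }
    }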
78. Evaluation
• Memory usage
• A 64GB SSD has 131072 blocks of 512KB size
• An entry in the FTL mapping directory uses 6B
• So the directory size is 131072 × 6B = 768KB

             Full page mapping   512KB FTL-TLB        1024KB FTL-TLB
             table               (+768KB directory)   (+768KB directory)
64GB SSD     64MB                1280KB (1.9%)        1792KB (2.7%)

(Percentages are relative to the full 64MB page mapping table.)
79. Conclusion
• Although the FTL-TLB uses only 512KB, the cache hit ratio is over 90%
• The cache overhead is under 2%
• Memory usage is only 1.9% of the full mapping table
81. Problems
• There is little research on file system optimization for flash
• Legacy cluster allocation schemes designed for hard disks are not suitable
• A hard disk can update data in place
• But flash cannot
82. Solutions
• AFCA (Anti-Fragmentation Cluster Allocation)
• A new definition of fragmentation for flash
• Data invalidation scheme
• If data is not used any more, the file system notifies the FTL, reducing
unnecessary overhead
83. AFCA (Anti-Fragmentation Cluster Allocation)
• File fragmentation
• The minimum number of logical blocks needed to store the file: N
• The number of logical blocks actually used: n
• If n > N, the file is fragmented
• Free space fragmentation
• The minimum number of logical blocks that could hold the free space: M
• The number of logical blocks over which the free space is spread: m
• If m > M, the free space is fragmented (both tests are sketched below)
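A minimal sketch of both fragmentation tests, assuming a fixed number of clusters
per logical block (the constant and names are hypothetical):

    /* AFCA fragmentation tests.                                          */
    #define CLUSTERS_PER_LBLOCK 128

    /* ceil(clusters / CLUSTERS_PER_LBLOCK): minimum logical blocks that
     * could hold this many clusters (N or M in the slides).              */
    static unsigned min_lblocks(unsigned clusters)
    {
        return (clusters + CLUSTERS_PER_LBLOCK - 1) / CLUSTERS_PER_LBLOCK;
    }

    /* File fragmentation: the file's clusters occupy n logical blocks,
     * against the minimum N = min_lblocks(file_clusters).                */
    int file_fragmented(unsigned file_clusters, unsigned n_blocks_used)
    {
        return n_blocks_used > min_lblocks(file_clusters);
    }

    /* Free-space fragmentation: the free clusters are spread over m
     * logical blocks, against the minimum M = min_lblocks(free_clusters).*/
    int freespace_fragmented(unsigned free_clusters, unsigned m_blocks_spanned)
    {
        return m_blocks_spanned > min_lblocks(free_clusters);
    }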
86. AFCA (Anti-Fragmentation Cluster Allocation)
• Considerations
• If a file is larger than a logical block, allocate it logical block by logical block;
this is good for reducing file fragmentation
• Only after all clusters in a block are allocated is the next logical block allocated;
this is good for reducing free space fragmentation
• A file is initially considered a small file; once it exceeds the threshold, it is
considered a large file
87. AFCA (Anti-Fragmentation Cluster Allocation)
• Free logical blocks (F-logical blocks)
• All clusters in the logical block are in the unused state
• Logical blocks for small files (S-logical blocks)
• Logical blocks for large files (L-logical blocks)
89. Data invalidation scheme
• If a sector is not used any more, the file system notifies the FTL
• The FTL marks the sector as invalid data in the page mapping table
(see the sketch below)
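A minimal sketch of such a notification hook, assuming a page-level mapping table
(names are hypothetical; conceptually similar to TRIM):

    /* Invalidate a logical page in the FTL so that garbage collection
     * need not copy its stale data.                                      */
    #include <stdint.h>

    #define INVALID_PPN 0xFFFFFFFFu

    extern uint32_t page_map[];           /* lpn -> ppn mapping table     */
    void mark_page_invalid(uint32_t ppn); /* hypothetical: lets GC skip
                                             this physical page           */

    /* Called by the file system when a logical sector is freed,
     * e.g. on file deletion.                                             */
    void ftl_invalidate(uint32_t lpn)
    {
        uint32_t ppn = page_map[lpn];
        if (ppn != INVALID_PPN) {
            mark_page_invalid(ppn);       /* stale physical page          */
            page_map[lpn] = INVALID_PPN;  /* drop the logical mapping     */
        }
    }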
94. Conclusion
• When AFCA is used
• Fragmentation is reduced by up to 53%
• Performance is improved by up to 46%
• When data invalidation is used
• Write performance is improved by up to 22%