Development Project for Core Source Technologies of a General-Purpose Operating System
Capable of Reducing Energy Consumption by 30% or More
January Regular Meeting
Kookmin University
Kim Young-man, Han Jae-il
Contents
• Research Activities
Part 1: Assigned Papers for Presentation (3 papers)
1. A new model for the system and devices latency
2. Cross-Layer Frameworks for Constrained Power and Resources Management of Embedded Systems
3. Automatic Run-Time Selection of Power Policies for Operating Systems
Part 2: Kookmin University Research
1. Performance Evaluation of Parallel Applications on Next Generation Memory Architecture with Power-Aware Paging Method
2. PPFS: A Scalable Flash Memory File System for the Hybrid Architecture of Phase-change RAM and NAND Flash
3. Address Translation Technique for Large NAND Flash Memory using Page Level Mapping
4. Performance Optimization Techniques for Legacy File Systems on Flash Memory
1. Research Activities
Research Contents
Reading papers related to energy saving
Analyzing simulators
Papers
A new model for the system and devices latency
Automatic Run-Time Selection of Power Policies for Operating Systems
Cross-Layer Frameworks for Constrained Power and Resources Management of Embedded Systems
Address Translation Technique for Large NAND Flash Memory
Performance Optimization Techniques for Legacy File Systems on Flash Memory
A Scalable Flash Memory File System for the Hybrid Architecture of Phase-change RAM and NAND Flash
An Energy Efficient Cache Design Using Spin Torque Transfer (STT) RAM
Performance Evaluation of Parallel Applications on Next Generation Memory Architecture with Power-Aware Paging Method
Part 1
Assigned Papers for Presentation (3 papers)
1. A new model for the system and devices latency
WHAT IS LATENCY?
• "In a computer system, latency is often used to mean any delay or waiting that increases real or perceived response time beyond the response time desired."
• "Specific contributors to computer latency include mismatches in data speed between the microprocessor and input/output devices and inadequate data buffers."
• "Within a computer, latency can be removed or 'hidden' by such techniques as prefetching (anticipating the need for data input requests) and multithreading, or using parallelism across multiple execution threads."
• Source: http://searchciomidmarket.techtarget.com/definition/latency
TERMINOLOGY.
(Texas Instruments)
• Latency: time to react to an external event, e.g. time spent to execute the handler code after an IRQ, or time spent to execute driver code from an external wake-up event.
• HW latency: latency introduced by the HW to transition between power states.
• SW latency: time for the SW to execute low power transition code, e.g. IP block save & restore, cache flush/invalidate, etc.
• System: 'everything needed to execute the kernel code', e.g. on OMAP3, system = CPU0 + CORE (main memory, caches, IRQ controller...).
• Per-device latency: latency of a device (or peripheral). The per-device PM QoS framework controls device states based on the allowed device latencies.
• Cpuidle: framework that controls the CPUs' low power states (= C-states) based on the allowed system latency. Note: it is being abused to control the system state.
• PM runtime: framework that allows dynamic switching of resources.
HOW TO SPECIFY THE ALLOWED LATENCY.
• The PM QoS framework allows the kernel and user space to specify the allowed latency.
• The framework calculates the aggregated constraint value and calls the registered platform-specific handlers in order to apply the constraints at a lower level.
PM QoS FRAMEWORK.
• PM QoS is a framework developed by Intel.
• It allows kernel code and applications to set their requirements (see the usage sketch below) in terms of:
• CPU DMA latency.
• Network latency.
• According to these requirements, PM QoS allows kernel drivers to adjust their power management.
• See Documentation/power/pm_qos_interface.txt.
• http://free-electrons.com/kerneldoc/latest/power/pm_qos_interface.txt
• Still in very early deployment (only 4 drivers in 2.6.36).
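As a concrete illustration, the sketch below shows how a user-space process can register a CPU DMA latency constraint through the /dev/cpu_dma_latency device described in pm_qos_interface.txt; the constraint stays in force only while the file descriptor remains open. The 20 microsecond value is an arbitrary example, not a recommendation.

```c
/* Minimal sketch: register a CPU DMA latency constraint via PM QoS.
 * The request is held as long as the file descriptor stays open;
 * closing it (or exiting) removes the constraint. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int32_t max_latency_us = 20;   /* example value: allow at most 20 us */
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);
    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return 1;
    }
    if (write(fd, &max_latency_us, sizeof(max_latency_us)) < 0) {
        perror("write");
        close(fd);
        return 1;
    }
    /* ... latency-sensitive work runs here; cpuidle avoids C-states
     * whose exit latency exceeds the requested 20 us ... */
    sleep(10);
    close(fd);                     /* constraint is dropped here */
    return 0;
}
```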
What is the key point of controlling the latency?
• The point is to dynamically optimize the power consumption of all system components.
• Knowing the allowed latency (from the constraints) and the expected worst-case latency makes it possible to choose the optimum power state, as sketched below.
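A hedged sketch of that decision: given the aggregated allowed latency and a table of per-state worst-case exit latencies, pick the deepest state that still fits. The state names and numbers here are invented for illustration, not real OMAP figures.

```c
/* Sketch: choose the deepest power state whose worst-case exit
 * latency does not exceed the allowed latency. The table is ordered
 * shallow -> deep; all values are illustrative. */
struct power_state {
    const char *name;
    unsigned int exit_latency_us;  /* expected worst-case latency */
    unsigned int power_uw;         /* power consumed in this state */
};

static const struct power_state states[] = {
    { "ON",        0,    1000 },
    { "INACTIVE",  50,    400 },
    { "RETENTION", 500,   100 },
    { "OFF",       5000,   10 },
};

const struct power_state *pick_state(unsigned int allowed_latency_us)
{
    const struct power_state *best = &states[0];
    for (unsigned int i = 0; i < sizeof(states) / sizeof(states[0]); i++)
        if (states[i].exit_latency_us <= allowed_latency_us)
            best = &states[i];     /* deepest state that still fits */
    return best;
}
```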
OMAP.
• OMAP (Open Multimedia Applications Platform), developed by Texas Instruments, is a category of proprietary systems on chips (SoCs) for portable and mobile multimedia applications.
• OMAP devices generally include a general-purpose ARM architecture processor core plus one or more specialized co-processors.
• Earlier OMAP variants commonly featured a variant of the Texas Instruments TMS320 series digital signal processor.
• The OMAP family consists of three product groups classified by performance and intended application:
• High-performance applications processors
• Basic multimedia applications processors
• Integrated modem and applications processors
OMAP.
(Figure: TI OMAP3530 on BeagleBoard, annotated)
OMAP.
(Figure: TI OMAP4430 on PandaBoard, annotated)
CURRENT MODEL.
(Figure slides: the current latency model)
LATENCY FIGURE.
(Figure slides: measured latency figures)
PROBLEM.
• There is no concept of 'overall latency'.
• No interdependency between PM frameworks
– Ex. on OMAP3: cpuidle manages only a subset of the power domains (MPU, CORE).
– Ex. on OMAP3: per-device PM QoS manages the other power domains.
– No relation between the frameworks; each framework has its own latency numbers.
• Some system settings are not included in the model
– Mainly because of the (lack of) SW support at the time of the measurement session.
– Ex. on OMAP3: voltage scaling in low power modes, sys_clkreq, sys_offmode and the interaction with the PowerIC.
• Dynamic nature of the system settings
– The measured numbers are for a fixed setup, with predefined system settings.
– The measured numbers are constant.
SOLUTION PROPOSAL.
• Overall latency calculation.
• We need a model which breaks down the overall latency into the latencies from every contributor:
Latency = Latency_SW + Latency_HW
Latency = Latency_SW + Latency_SoC + Latency_ExternalHW
• Latency_SW: time for the SW to save/restore the context of an IP block.
• Latency_SoC: time for the SoC HW to change an IP block's state.
• Latency_ExternalHW: time to stop/restart external HW (e.g. external crystal oscillator, external power supply, ...).
• Note: every latency factor may be divided into smaller factors, e.g. on OMAP a DPLL can feed multiple power domains (see the sketch below).
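The proposed breakdown can be read as a simple additive model over contributors; a minimal sketch, with field and function names invented here rather than taken from the paper:

```c
/* Sketch of the proposed overall-latency model:
 * latency = SW + SoC + external HW, where the overall figure is the
 * sum over contributors (e.g. one entry per IP block, DPLL, or
 * power domain). */
struct latency_contrib {
    unsigned int sw_us;          /* context save/restore of an IP block */
    unsigned int soc_us;         /* SoC HW state change of the IP block */
    unsigned int external_hw_us; /* e.g. crystal oscillator, power supply */
};

unsigned int overall_latency_us(const struct latency_contrib *c, int n)
{
    unsigned int total = 0;
    for (int i = 0; i < n; i++)
        total += c[i].sw_us + c[i].soc_us + c[i].external_hw_us;
    return total;
}
```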
NEW MODEL.
(Figure: the new latency model)
3. Automatic Run-Time Selection of Power Policies for Operating Systems
2013-01-08
Lee Jae-yeol
Problems
• Existing studies on power management make an implicit assumption
• Only one policy can be used to save power
• Hence, those studies focus on finding the best policies for unique request patterns
HAPPI (Homogeneous Architecture for Power Policy Integration)
• HAPPI is currently capable of supporting power policies for disk, DVD-ROM, and network devices
• But it can easily be extended to support other I/O devices
• Each policy must provide (see the interface sketch below):
• A function that predicts idleness and controls a device's power state.
• A function that accepts a trace of device accesses, determines the actions the control function would take, and returns the energy consumption and access delay from the actions.
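The two required functions suggest a plug-in interface along these lines. This is a sketch with invented names; HAPPI's actual API is not shown in the slides.

```c
/* Sketch of the per-policy interface HAPPI requires: a control
 * function that predicts idleness and drives the device's power
 * state, and an evaluator that replays an access trace and reports
 * the energy and delay the policy would have produced.
 * All names here are illustrative. */
struct device_access {
    double time_s;   /* time of the access */
    long   size;     /* size of the access */
};

struct policy_result {
    double energy_j; /* estimated energy consumption */
    double delay_s;  /* estimated access delay */
};

struct happi_policy {
    const char *name;
    /* Called on device events; predicts idleness and sets the
     * device's power state. */
    void (*control)(void *device);
    /* Replays a trace of accesses and returns the estimated cost. */
    struct policy_result (*evaluate)(const struct device_access *trace,
                                     int n_accesses);
};
```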
HAPPI (Homogeneous Architecture for Power Policy Integration)
• If a policy is selected by HAPPI to manage the power state of a specific device, it is considered active
• Each device is assigned only one active policy at any time
• Whenever the device is accessed, HAPPI captures the size and time of the access
• It also records the energy and delay for each device
HAPPI (Homogeneous Architecture for Power Policy Integration)
• Policy Selection (figure)
Implementation
• Linux 2.6.5
• Policies and evaluators are implemented as kernel modules
• The experimental hardware is not fully ACPI compliant
• So they implement a function that returns the power, transition energy, and transition delay for each state of each device
• Policies need these values to compute the power consumed in each state
Experiments
• Fujitsu laptop hard disk (HDD)
• Samsung DVD drive (DVD)
• NetXtreme integrated wired network card (NIC)
(Table: power states for devices)
Experiments
• Workload
1. Web browsing + buffered media playback from DVD
2. Download video and buffered media playback from disk
3. CVS checkout from remote repository
4. E-mail synchronization + unbuffered media playback from DVD
5. Kernel compile
Experiments
• Policies
• Null
• 2-competitive timeout
• Exponential prediction
• Adaptive timeout
Exponential Prediction
• Formulation: I(n+1) = a * i(n) + (1 - a) * I(n)
• I(n): the last predicted value
• i(n): the latest idle period
• a: a constant attenuation factor in the range between 0 and 1
• If a = 0, then I(n+1) = I(n)
• If a = 1, then I(n+1) = i(n)
• So, typically a = 1/2
Exponential Prediction
Worked example (a = 1/2, initial prediction I(1) = 10; reproduced by the sketch below):
Actual idle i(n):     6   4   6   4   13   13   13   ...
Prediction I(n):  10  8   6   6   5   9    11   12   ...
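A minimal sketch that reproduces the worked example above; the initial prediction of 10 is taken from the table.

```c
/* Exponential-average idle-period prediction:
 * I[n+1] = a * i[n] + (1 - a) * I[n], with a = 1/2.
 * Reproduces the table: predictions 10 8 6 6 5 9 11 12. */
#include <stdio.h>

int main(void)
{
    double actual_idle[] = { 6, 4, 6, 4, 13, 13, 13 };
    double a = 0.5;         /* attenuation factor */
    double prediction = 10; /* initial prediction I(1), as in the table */

    printf("%g ", prediction);
    for (int n = 0; n < 7; n++) {
        prediction = a * actual_idle[n] + (1 - a) * prediction;
        printf("%g ", prediction);
    }
    printf("\n");  /* prints: 10 8 6 6 5 9 11 12 */
    return 0;
}
```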
Experiments
• Result
(Figures: estimated energy consumption for each policy on each device for workloads 1-5, and the policies selected for each device at each evaluation)
Conclusion
• Experiments indicate that policy selection is highly adaptive to workload and hardware types, supporting the authors' claim that automatic policy selection is necessary to achieve better energy savings
Part 2
Kookmin University Research
1. Performance Evaluation of Parallel Applications on Next Generation Memory Architecture with Power-Aware Paging Method
Problems
• This paper proposes a solution (an architecture and a low-power paging algorithm) to reduce energy consumption in HPC (High Performance Computing) systems
• It aims to demonstrate that the low-power paging algorithm can improve HPC performance and reduce energy consumption
SOLUTION
• Replace a part of DRAM with MRAM.
• Conduct simulations to evaluate the performance and energy consumption of several application benchmarks.
• Make a trace file of memory accesses in each application benchmark by using the Valgrind profiling tool.
• For each memory access that incurs a miss, collect the memory address and profiling results, which are the access counts for all the memory pages.
• With the trace files, they replay the behavior of each application with their event-driven simulator.
HOW CAN THEY SOLVE IT?
• They propose a hybrid memory architecture and power-aware swapping.
• Use MRAM as main memory beside DRAM due to its higher access speed and low power consumption.
• Use FLASH as a fast random-access swap device due to its faster random-access read speed.
• Use the MRAM hit rate and a threshold in the Low Power Paging Algorithm to manage the swapping interaction between DRAM/MRAM and FLASH, thereby improving performance and reducing energy.
Proposition – Hybrid Memory Architecture and Power-Aware Swapping
(Figure: overview of the proposed low-power memory architecture. CPUs sit above L1 and L2 caches; main memory is split into MRAM, which holds the hotter pages with the larger number of accesses, and DRAM, which holds the colder pages; FLASH serves as the swap device.)
Low-Power Paging Algorithm
(Figure 2 – Algorithmic flow of the proposed paging algorithm, sketched in code below:)
• Allocate hot pages on MRAM according to the profiling result, then run the application.
• On each L2 cache miss, increment Memory Access; if the access falls on MRAM, also increment MRAM Hit.
• MRAM Hit Rate = MRAM Hit / Memory Access.
• On a page fault: if MRAM Hit Rate > Threshold, swap out the least recently used page on DRAM only; otherwise, swap out the least recently used page on DRAM or MRAM.
– A trace file also includes profiling results, which are access counts for all the memory pages.
– Profiling: the per-page memory access frequency of a given application throughout its execution, obtained by a pre-execution trial or by sampling with HW assist.
– With the trace file, they replay the behavior of the application with their event-driven simulator (memory access → L2 cache miss → collect → profiling).
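A sketch of the swap-victim decision in the flow above, with names invented here:

```c
/* Sketch of the swap-victim decision in the low-power paging
 * algorithm: while the MRAM hit rate stays above the threshold,
 * only DRAM pages are eligible for eviction; once locality drops,
 * MRAM pages become swappable too. */
enum victim_pool { VICTIM_DRAM_ONLY, VICTIM_DRAM_OR_MRAM };

struct paging_stats {
    unsigned long mram_hits;      /* accesses served by MRAM */
    unsigned long total_accesses; /* all L2-miss memory accesses */
};

enum victim_pool choose_victim_pool(const struct paging_stats *s,
                                    double threshold)
{
    double hit_rate = s->total_accesses
                    ? (double)s->mram_hits / s->total_accesses
                    : 0.0;
    return hit_rate > threshold ? VICTIM_DRAM_ONLY : VICTIM_DRAM_OR_MRAM;
}

/* Threshold from the paper: Thr = alpha * MRAM_SIZE / TOTAL_SIZE,
 * with alpha ~ 1; about 0.9 worked for the NAS benchmarks. */
```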
Why do they need that algorithm?
• The first, simple algorithm works as follows:
• The hottest pages are pinned down on MRAM so that they are never swapped out.
• The remaining pages are allocated onto DRAM and use LRU-based swapping with flash memory.
• In some cases this simple algorithm increased application execution time with the LRU swapping algorithm.
• Excessive swaps slow down the application considerably.
Why do they need that algorithm?
• To resolve this situation, they extend the algorithm by introducing a metric called the MRAM hit rate and its threshold, so that applications exhibiting lower locality may use both MRAM and DRAM as swappable main memory.
• Thr = α × MRAM_SIZE / TOTAL_SIZE
• α (≈1) is a configurable parameter used to determine the threshold.
• Several preliminary experiments have shown that a threshold value of 0.9 seems to work for the NAS and other HPC applications.
CORE IDEAS OF THE LOW POWER PAGING ALGORITHM.
• The MRAM hit rate is a dynamic value that indicates the ratio of the access counts onto MRAM versus the memory accesses to all memory at each point in execution time.
• If the ratio is large, accesses to MRAM have sufficient locality and the pages should stay pinned down.
• On the other hand, if the ratio is small, the application lacks locality and thus the entire main memory should be seen as swappable.
CONCLUSION.
• Aggressively reducing DRAM capacity can reduce energy consumption, even with swapping.
• The energy consumption can be reduced to 25% by reducing DRAM capacity.
2. PPFS: A Scalable Flash Memory File System for the Hybrid Architecture of Phase-change RAM and NAND Flash
NAND Flash Memory
• NAND flash memory structure
• Page (2KB): read and write unit
• Block (64 pages = 128KB): erase unit
• NAND flash memory: the beauty
• Non-volatility
• Fast access time (no seek latency)
• Low power consumption
• Relatively large capacity
• Shock resistance
• NAND flash memory: the beast
• Erase before write: a page must be erased first in order to update data on that page
• Slow write: supports only page-level writes, about 10x slower than reads
• Limited lifetime: 100K ~ 1M erase cycles guaranteed
Ref) K9F1G08X0A Datasheet
Feature of PRAM
Source: Motoyuki Ooishi, Nikkei Electronics Asia, Oct. 2007
• PRAM memory
• Random access memory
• Non-volatile memory
• Low leakage energy
• High density: 4x denser than DRAM
• Limited endurance
NAND flash memory VS. PRAM
1. KPS5615EZM Data Sheet, 2. K9G8G08U0M Data Sheet

Feature           | PRAM(1)       | NOR            | SLC NAND       | MLC NAND(2)
Volatility        | Non-volatile  | Non-volatile   | Non-volatile   | Non-volatile
Random access     | Yes           | Yes            | No             | No
Unit of write     | Word (2 byte) | Word (2 byte)  | Page (2 Kbyte) | Page (2 Kbyte)
Read speed        | 50 ns/word    | 100 ns/word    | 25 us/page     | 60 us/page
Write speed       | 5 us/word     | 11.5 us/word   | 200 us/page    | 800 us/page
Erase speed       | N/A           | 0.7 s/64KB     | 2 ms/128KB     | 1.5 ms/128KB
Program endurance | 10^8          | 10^5           | 10^6           | 10^5
Size              | 32 MByte      | 32 MByte       | ~1 GB          | 4 GB+
Others            |               |                | Serial program | Serial program, paired page damage
JFFS2 (Journaling Flash File System)
• Developed by Red Hat (eCos) in 2001
• Originally designed for NOR flash memory
• Supports data compression
– Good for reducing total page writes
– Additional computational overhead
• Log-structured file system
– Any file system modification is appended to the log
• Scalability problem
– Needs a full scan at mount time
– Manages all metadata in main memory
• Directory structure, file indexing structure
(Figure: JFFS scan area)
Ref. D. Woodhouse, "JFFS: The journaling flash file system," presented at the Ottawa Linux Symposium, 2001.
YAFFS2 (Yet Another Flash File System)
• Developed by Aleph One in 2003
• Designed specifically for NAND flash memory
– Uses the spare region to store the file metadata
• Log-structured file system
– Any file system modification is appended to the log
• Scalability problem
– Needs to scan the entire spare region
• Reduced mounting time compared with JFFS2
– Manages all metadata in main memory
• Directory structure, file indexing structure
(Figure: YAFFS scan area)
Ref. http://www.yaffs.net/
CFFS (Core Flash File System)
• Developed by CORE Lab in 2006
• Log-structured file system
– Any file system modification is appended to the log
• Metadata separation
– Metadata and data are written to different blocks in NAND flash
– Scanning only the metadata blocks → reduced mounting time
• Stores the file indexing structure in NAND flash memory
– Reduces main memory usage
– Manages the directory structure in main memory
• CFFS limitations
– Needs extra metadata write operations
• Updating the file index in NAND flash memory
– Wear-leveling problem
• Metadata blocks are updated more frequently
(Figure: CFFS scan area)
Ref. S. H. Lim and K. H. Park, "An efficient NAND flash file system for flash memory storage," IEEE Transactions on Computers, vol. 55, no. 7, pp. 906–912, 2006.
Previous flash file systems
JFFS2 [2001]
– Feature: LFS approach; data compression; node management
– Pros: reliable
– Cons: metadata update overhead; scalability problem; node management overhead
YAFFS2 [2003]
– Feature: LFS approach; using spare region
– Pros: reduced mounting time
– Cons: metadata update overhead; scalability problem
CFFS [2006]
– Feature: LFS approach; metadata separation; file indexing in NAND
– Pros: reduced mounting time; reduced GC overhead
– Cons: metadata update overhead; scalability problem remaining; extra write overhead; wear-leveling problem
Metadata update problems
(Figure: each metadata update writes 512B or 2KiB)
Scalability problems
1. Scan area comparison (figure: scan area vs. non-scan area; the earlier file systems scan far more (>>))
2. Use of main memory (see the table below)
Accessing a file '/dir/a.txt': Open("/dir/a.txt") → i-number → location of inode → location of data

Type of Index                    | JFFS, YAFFS          | CFFS
1. Find i-number using path name | In-memory directory  | In-memory directory
2. Find inode using i-number     | In-memory inode map  | In-memory inode map
3. Find file data                | In-memory file index | In-NAND file index

(Figure: scan areas of JFFS, YAFFS, and CFFS)
Solution of metadata update
(Figure: each metadata update now writes only 2 bytes)
PFFS Scalability: Mounting time
• PFFS has a minimized, fixed mounting time
– All metadata are connected from the root directory in PRAM
– PFFS does not need to scan the NAND flash memory
(Figure: scan area vs. non-scan area comparison across JFFS, YAFFS, CFFS, and PFFS; PFFS scans the least)
PFFS Scalability: Memory use
• PFFS uses no DRAM main memory for its metadata structures
– Most of the metadata structures of PFFS are contained in PRAM

Accessing a file '/dir/a.txt': Open("/dir/a.txt") → i-number → location of inode → location of data

Type of Index                    | JFFS, YAFFS          | CFFS                | PFFS
1. Find i-number using path name | In-memory directory  | In-memory directory | In-PRAM directory
2. Find inode using i-number     | In-memory inode map  | In-memory inode map | Simple calculation
3. Find file data                | In-memory file index | In-NAND file index  | In-PRAM data pointers

(Figure: main memory use comparison)
Evaluation
• CPU: Samsung S3C2413 (ARM926EJ)
• Mem: 64MB DRAM
• 1GB MLC NAND, 32MB PRAM
• NAND flash memory characteristics (table in figure)
• Benchmark: PostMark
• Benchmark for short-lived, small-file read/write performance
• Comparison with YAFFS2
(Figures: evaluation results)
Conclusion
• PFFS solves the scalability problems of previous flash file systems by using the hybrid architecture of PRAM and NAND flash memory
• Mounting time and memory usage of PFFS are O(1)
• The performance of PFFS is 25% better than YAFFS2 for small file writes
3. Address Translation Technique for Large NAND Flash Memory using Page Level Mapping
Problems
• In a page-level mapping scheme, data can be relocated at page granularity
• But the disadvantage is the large size of the mapping table
• Ex) In a 64GB SSD (arithmetic sketched below):
• With block-level mapping, the mapping table is 512KB
• With page-level mapping, the mapping table is 64MB
• So most actual commercial SSDs use a hybrid scheme based on block-level mapping
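The 512KB and 64MB figures follow directly from the device geometry. A sketch of the arithmetic, assuming 4KB pages, 512KB blocks, and 4-byte mapping entries, which are the assumptions that reproduce the quoted numbers:

```c
/* Arithmetic behind the slide's numbers, assuming 4KB pages,
 * 512KB blocks, and 4-byte mapping entries. */
#include <stdio.h>

int main(void)
{
    unsigned long long ssd_bytes   = 64ULL << 30;  /* 64GB SSD */
    unsigned long long page_bytes  = 4ULL  << 10;  /* 4KB page */
    unsigned long long block_bytes = 512ULL << 10; /* 512KB block */
    unsigned long long entry_bytes = 4;            /* one mapping entry */

    unsigned long long page_tbl  = ssd_bytes / page_bytes  * entry_bytes;
    unsigned long long block_tbl = ssd_bytes / block_bytes * entry_bytes;

    printf("page-level table : %llu MB\n", page_tbl  >> 20); /* 64 MB */
    printf("block-level table: %llu KB\n", block_tbl >> 10); /* 512 KB */
    return 0;
}
```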
Page Level Mapping Scheme Address Translation Techniques
• The entire mapping table is maintained in NAND
• Frequently used parts of the mapping table are cached in DRAM
• Uses the FTL-TLB and FTL mapping directory structures
Page Table Management in a Demand Paging Memory System
• Using page-level mapping in NAND flash memory is similar to using a demand paging scheme in the memory system
FTL-TLB
• Manages the mapping table in units of sections
• A section stores the mapping table entries for one NAND flash block
• Ex) If a block has 128 pages, the size of a section is 128 * 4B = 512B
• The number of sections equals the total number of blocks in NAND
FTL Mapping Directory
• The FTL mapping directory is allocated in DRAM
• The FTL mapping directory has one entry per section (see the sketch below), recording:
• Whether the section is cached in the FTL-TLB
• Whether the section has been updated but not yet written back to NAND
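Each directory entry tracks one section's status and location. A sketch with invented field names; the paper's actual entries are 6 bytes each, while this illustration packs the same roles into 32 bits:

```c
/* Sketch of an FTL mapping directory entry: one entry per section
 * (one section = the page-mapping entries for one NAND block).
 * Field names and layout are illustrative; the paper uses 6-byte
 * entries. */
struct ftl_map_dir_entry {
    unsigned int cached : 1;    /* section currently cached in FTL-TLB? */
    unsigned int dirty  : 1;    /* section updated but not yet written
                                 * back to NAND? */
    unsigned int location : 30; /* where the section lives: FTL-TLB slot
                                 * if cached, NAND page otherwise */
};

/* A section for a 128-page block holds 128 4-byte entries (512B);
 * a 64GB SSD with 512KB blocks has 131072 sections. */
```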
Architecture
(Figure: FTL-TLB and mapping directory architecture, with a sector lookup example)
Evaluation
• Workload
• The target is a 64GB SSD
• Used VirtualBox with a 64GB HDD and Windows XP
• Collected access traces
• Daily_usage and Multi_program represent typical environments
• Install_update is a Windows update plus program installations
• Large_file is copying large files

Trace          | Requests | Data size [MB]
Daily_usage    | 545031   | 10270.78
Multi_program  | 309262   | 3070.669
Install_update | 1022856  | 14072.22
Large_file     | 45593    | 2810.333
Evaluation
• Replacement algorithms
• OPT (the optimal algorithm)
• LRU
• LRFU (Least Recently/Frequently Used)
• LRU2 (LRU-K)
• LIRS (Low Inter-reference Recency Set)
• CFLRU (Clean-First LRU)
• LRU-WSR (LRU-Write Sequence Reordering)
CFLRU (Clean-First LRU)
• If all page frames hold only clean pages or only dirty pages, CFLRU behaves the same as LRU.
• When the cache holds a mix of clean and dirty pages (see the sketch below):
• CFLRU divides the LRU list into two regions.
• The working region consists of recently used pages; most cache hits are generated in this region.
• The clean-first region consists of pages which are candidates for eviction.
• CFLRU first selects a clean page to evict from the clean-first region.
• If there is no clean page in this region, the dirty page at the end of the LRU list is evicted.
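A sketch of the victim choice, with the list held MRU-to-LRU and the clean-first window size as a parameter (names are ours); it reproduces the evictions in the worked example on the next slide with a window of 4:

```c
/* Sketch of CFLRU victim selection over an LRU list stored
 * MRU-first: prefer the clean page closest to the LRU end within
 * the clean-first region; if that region holds no clean page,
 * fall back to plain LRU (evict the last page, dirty or not). */
struct cache_page {
    int id;
    int dirty;   /* 1 = dirty, 0 = clean */
};

/* pages[0] is MRU, pages[n-1] is LRU; the clean-first region is the
 * last `window` entries. Returns the index of the victim. */
int cflru_pick_victim(const struct cache_page *pages, int n, int window)
{
    for (int i = n - 1; i >= n - window && i >= 0; i--)
        if (!pages[i].dirty)
            return i;            /* clean page nearest the LRU end */
    return n - 1;                /* no clean page: evict the true LRU */
}
```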
CFLRU (Clean-First LRU)
Worked example (MRU on the left; C = clean page, D = dirty page; the last four slots form the clean-first region, marked after "|"):

Initial:   P1=5(D) P2=2(C) P3=3(D) P4=7(C) | P5=1(C) P6=4(D) P7=6(C) P8=8(D)
Access 9:  evict P7 (clean) → P7=9(C) P1=5(D) P2=2(C) P3=3(D) | P4=7(C) P5=1(C) P6=4(D) P8=8(D)
Access 10: evict P5 (clean) → P5=10(C) P7=9(C) P1=5(D) P2=2(C) | P3=3(D) P4=7(C) P6=4(D) P8=8(D)
Access 11: evict P4 (clean) → P4=11(C) P5=10(C) P7=9(C) P1=5(D) | P2=2(C) P3=3(D) P6=4(D) P8=8(D)
Access 12: evict P2 (clean) → P2=12(C) P4=11(C) P5=10(C) P7=9(C) | P1=5(D) P3=3(D) P6=4(D) P8=8(D)
LRU-WSR (LRU-Write Sequence Reordering)
• Two key concepts: the cold dirty page and the cold flag.
• If a page is dirty and its cold flag is set, the page is regarded as a cold dirty page.
• LRU-WSR uses a page list and one additional flag per page, the cold flag (sketch below; worked example on the next slide).
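A sketch of the eviction rule matching the worked example that follows: dirty pages without the cold flag get a second chance (flag set, moved to MRU), while clean pages and cold dirty pages are evicted. Names are ours.

```c
/* Sketch of LRU-WSR victim selection: scan from the LRU end; a
 * clean page or a cold dirty page (cold flag set) is evicted; a
 * dirty page without the cold flag gets a second chance: its cold
 * flag is set and it is moved to the MRU end. */
#include <string.h>

struct wsr_page {
    int id;
    int dirty;      /* 1 = dirty */
    int cold_flag;  /* second-chance marker for dirty pages */
};

/* list[0] is MRU, list[n-1] is LRU; returns the evicted page. */
struct wsr_page lru_wsr_evict(struct wsr_page *list, int n)
{
    for (;;) {
        struct wsr_page victim = list[n - 1];
        if (!victim.dirty || victim.cold_flag) {
            /* clean or cold dirty: evict it */
            return victim;
        }
        /* dirty, not cold: set the cold flag, move to the MRU end */
        victim.cold_flag = 1;
        memmove(&list[1], &list[0], (n - 1) * sizeof(list[0]));
        list[0] = victim;
    }
}
```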
LRU-WSR (LRU-Write Sequence Reordering)
Worked example (MRU on the left; C = clean, D = dirty, Cf = cold flag):

Initial:   P1=5(D,Cf=0) P2=2(C,0) P3=3(D,0) P4=7(C,0) P5=1(C,0) P6=4(D,1) P7=6(C,0) P8=8(D,0)
Access 9:  P8 is dirty and not cold → set Cf=1, move to MRU; then evict P7 (clean)
           → P7=9(C,0) P8=8(D,1) P1=5(D,0) P2=2(C,0) P3=3(D,0) P4=7(C,0) P5=1(C,0) P6=4(D,1)
Access 10: evict P6 (cold dirty)
           → P6=10(C,0) P7=9(C,0) P8=8(D,1) P1=5(D,0) P2=2(C,0) P3=3(D,0) P4=7(C,0) P5=1(C,0)
Access 11: evict P5 (clean)
           → P5=11(C,0) P6=10(C,0) P7=9(C,0) P8=8(D,1) P1=5(D,0) P2=2(C,0) P3=3(D,0) P4=7(C,0)
Access 12: evict P4 (clean)
           → P4=12(C,0) P5=11(C,0) P6=10(C,0) P7=9(C,0) P8=8(D,1) P1=5(D,0) P2=2(C,0) P3=3(D,0)
Evaluation
• Cache hit ratio
(Figures: cache hit ratio for Daily_usage, Multi_program, Install_update, and Large_file; over 95% in all cases except Large_file)
Evaluation
• Overhead
(Figures: overhead for Daily_usage, Multi_program, Install_update, and Large_file; in most workloads, when the cache size is 512KB or more, the overhead is less than 2%)
Evaluation
• Memory usage
• A 64GB SSD has 131072 blocks of 512KB
• An entry in the FTL mapping directory uses 6B
• So the directory size is 131072 × 6B = 768KB

         | Full page mapping table | 512KB FTL-TLB | 1024KB FTL-TLB
64GB SSD | 64MB                    | 1280KB (1.9%) | 1792KB (2.7%)

(Each FTL-TLB figure includes the 768KB directory: 512KB + 768KB = 1280KB; 1024KB + 768KB = 1792KB.)
Conclusion
• Although the FTL-TLB uses only 512KB, the cache hit ratio is over 90%
• The cache overhead is under 2%
• Memory usage is only 1.9% of the full mapping table
4. Performance Optimization Techniques for Legacy File Systems on Flash Memory
Problems
• No research on file system optimization for flash
• The legacy cluster allocation scheme designed for hard disks is not suitable
• A hard disk can update data in place
• But flash memory cannot
Solutions
• AFCA (Anti-Fragmentation Cluster Allocation)
• New fragmentation criteria for flash
• Data invalidation scheme
• If data is no longer used, the file system notifies the FTL to reduce unnecessary overhead
AFCA (Anti-Fragmentation Cluster Allocation)
• File fragmentation (sketch of both checks below)
• N: the minimum number of logical blocks needed to store the file
• n: the number of logical blocks actually used
• If n > N, the file is fragmented
• Free space fragmentation
• M: the minimum number of logical blocks that could hold the free space
• m: the number of logical blocks over which the free space is actually spread
• If m > M, the free space is fragmented
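The two fragmentation conditions can be checked directly; a minimal sketch, with names invented here:

```c
/* Sketch of AFCA's two fragmentation tests:
 * - a file is fragmented when it occupies more logical blocks (n)
 *   than the minimum needed to store it (N);
 * - free space is fragmented when it is spread over more logical
 *   blocks (m) than the minimum that could contain it (M). */
struct frag_info {
    int min_blocks;   /* N for a file, M for free space */
    int used_blocks;  /* n for a file, m for free space */
};

static inline int is_fragmented(const struct frag_info *f)
{
    return f->used_blocks > f->min_blocks;
}
```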
AFCA (Anti-Fragmentation Cluster Allocation)
(Figure: examples. A free space with M = 4 spread over m = 5 blocks is fragmented; a file with N = 2 stored in n = 3 blocks is fragmented; cases with n = N or m = M are not fragmented.)
AFCA (Anti-Fragmentation Cluster Allocation)
(Figure: basic cluster allocation (BCA) vs. AFCA)
AFCA (Anti-Fragmentation Cluster Allocation)
• Considerations
• If a file is larger than a logical block, allocate it in whole logical blocks; this reduces file fragmentation
• Allocate all clusters in a block before allocating the next logical block; this reduces free space fragmentation
• A file is initially considered small; once it exceeds a threshold, it is considered large
AFCA (Anti-Fragmentation Cluster Allocation)
• Free logical blocks (F-logical blocks): all clusters in the logical block are unused
• Logical blocks for small files (S-logical blocks)
• Logical blocks for large files (L-logical blocks)
(Figure: state transitions. An F-logical block becomes an S-logical block when clusters are allocated to a small file, or an L-logical block for a large file; a block may return some clusters and become free again.)
Data invalidation scheme
• If a sector is no longer used, the file system notifies the FTL (sketch below)
• The FTL marks the sector as invalid data in the page mapping table
(Figures: data invalidation scheme)
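The scheme amounts to a TRIM-like notification from the file system down to the FTL; a sketch with invented names:

```c
/* Sketch of the data invalidation scheme: when the file system
 * frees sectors, it notifies the FTL, which marks the corresponding
 * entries invalid in its page mapping table so the stale data is
 * not copied during garbage collection. All names are illustrative. */
#define INVALID_PPN 0xFFFFFFFFu

struct ftl {
    unsigned int *page_map;   /* logical sector -> physical page */
    unsigned long n_sectors;
};

void ftl_invalidate(struct ftl *ftl, unsigned long sector,
                    unsigned long count)
{
    for (unsigned long s = sector; s < sector + count &&
                                   s < ftl->n_sectors; s++)
        ftl->page_map[s] = INVALID_PPN;  /* mark as invalid data */
}
```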
Evaluation
• Uses Ext2, kernel 2.4, and a NAND flash emulator
• Page size is 2KB
• Block size is 128KB
• The FTL is Z-FTL, based on block mapping
Result
(Figure: BCA vs. AFCA results)
Conclusion
• With AFCA
• Fragmentation is reduced by up to 53%
• Performance is improved by up to 46%
• With data invalidation
• Write performance is improved by up to 22%
Thanks
Q & A

Weitere ähnliche Inhalte

Was ist angesagt?

Real time Operating System
Real time Operating SystemReal time Operating System
Real time Operating SystemTech_MX
 
LCU14-410: How to build an Energy Model for your SoC
LCU14-410: How to build an Energy Model for your SoCLCU14-410: How to build an Energy Model for your SoC
LCU14-410: How to build an Energy Model for your SoCLinaro
 
2009.08 grid peer-slides
2009.08 grid peer-slides2009.08 grid peer-slides
2009.08 grid peer-slidesYehia El-khatib
 
RTOS implementation
RTOS implementationRTOS implementation
RTOS implementationRajan Kumar
 
Rtos concepts
Rtos conceptsRtos concepts
Rtos conceptsanishgoel
 
Runtime Reconfigurable Network-on-chips for FPGA-based Devices
Runtime Reconfigurable Network-on-chips for FPGA-based DevicesRuntime Reconfigurable Network-on-chips for FPGA-based Devices
Runtime Reconfigurable Network-on-chips for FPGA-based DevicesMugdha2289
 
Rtos princples adn case study
Rtos princples adn case studyRtos princples adn case study
Rtos princples adn case studyvanamali_vanu
 
RTOS for Embedded System Design
RTOS for Embedded System DesignRTOS for Embedded System Design
RTOS for Embedded System Designanand hd
 
How to Measure RTOS Performance
How to Measure RTOS Performance How to Measure RTOS Performance
How to Measure RTOS Performance mentoresd
 
Efficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachEfficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachjemin lee
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05Rajesh Gupta
 
Real Time Kernels
Real Time KernelsReal Time Kernels
Real Time KernelsArnav Soni
 
Os rtos.ppt
Os rtos.pptOs rtos.ppt
Os rtos.pptrahul km
 
Indian Contribution towards Parallel Processing
Indian Contribution towards Parallel ProcessingIndian Contribution towards Parallel Processing
Indian Contribution towards Parallel ProcessingAjil Jose
 
Report on hyperthreading
Report on hyperthreadingReport on hyperthreading
Report on hyperthreadingdeepakmarndi
 

Was ist angesagt? (20)

Real time Operating System
Real time Operating SystemReal time Operating System
Real time Operating System
 
LCU14-410: How to build an Energy Model for your SoC
LCU14-410: How to build an Energy Model for your SoCLCU14-410: How to build an Energy Model for your SoC
LCU14-410: How to build an Energy Model for your SoC
 
2009.08 grid peer-slides
2009.08 grid peer-slides2009.08 grid peer-slides
2009.08 grid peer-slides
 
How to choose an RTOS?
How to choose an RTOS?How to choose an RTOS?
How to choose an RTOS?
 
RTOS implementation
RTOS implementationRTOS implementation
RTOS implementation
 
Rtos concepts
Rtos conceptsRtos concepts
Rtos concepts
 
Rtos Concepts
Rtos ConceptsRtos Concepts
Rtos Concepts
 
Runtime Reconfigurable Network-on-chips for FPGA-based Devices
Runtime Reconfigurable Network-on-chips for FPGA-based DevicesRuntime Reconfigurable Network-on-chips for FPGA-based Devices
Runtime Reconfigurable Network-on-chips for FPGA-based Devices
 
Rtos princples adn case study
Rtos princples adn case studyRtos princples adn case study
Rtos princples adn case study
 
RTOS for Embedded System Design
RTOS for Embedded System DesignRTOS for Embedded System Design
RTOS for Embedded System Design
 
How to Measure RTOS Performance
How to Measure RTOS Performance How to Measure RTOS Performance
How to Measure RTOS Performance
 
Efficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachEfficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approach
 
Rtos
RtosRtos
Rtos
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05
 
Real Time Kernels
Real Time KernelsReal Time Kernels
Real Time Kernels
 
Os rtos.ppt
Os rtos.pptOs rtos.ppt
Os rtos.ppt
 
Indian Contribution towards Parallel Processing
Indian Contribution towards Parallel ProcessingIndian Contribution towards Parallel Processing
Indian Contribution towards Parallel Processing
 
RTOS Basic Concepts
RTOS Basic ConceptsRTOS Basic Concepts
RTOS Basic Concepts
 
Report on hyperthreading
Report on hyperthreadingReport on hyperthreading
Report on hyperthreading
 
Rtos part2
Rtos part2Rtos part2
Rtos part2
 

Andere mochten auch

Img 0002
Img 0002Img 0002
Img 0002zamchar
 
105申請入學說明會(上網)
105申請入學說明會(上網)105申請入學說明會(上網)
105申請入學說明會(上網)君 陳
 
Managerial Accounting Garrison Noreen Brewer Chapter 10
Managerial Accounting Garrison Noreen Brewer Chapter 10Managerial Accounting Garrison Noreen Brewer Chapter 10
Managerial Accounting Garrison Noreen Brewer Chapter 10Asif Hasan
 
You raise me up violin
You raise me up violinYou raise me up violin
You raise me up violinSuni Aguado
 
Labour management ppt
Labour management  pptLabour management  ppt
Labour management pptAnit Datta
 

Andere mochten auch (9)

Syed Khaleel Ahmed
Syed Khaleel AhmedSyed Khaleel Ahmed
Syed Khaleel Ahmed
 
Img 0002
Img 0002Img 0002
Img 0002
 
Mostafa guda
Mostafa gudaMostafa guda
Mostafa guda
 
CFF-VRPANEL
CFF-VRPANELCFF-VRPANEL
CFF-VRPANEL
 
105申請入學說明會(上網)
105申請入學說明會(上網)105申請入學說明會(上網)
105申請入學說明會(上網)
 
Ejercicios base datos
Ejercicios base datosEjercicios base datos
Ejercicios base datos
 
Managerial Accounting Garrison Noreen Brewer Chapter 10
Managerial Accounting Garrison Noreen Brewer Chapter 10Managerial Accounting Garrison Noreen Brewer Chapter 10
Managerial Accounting Garrison Noreen Brewer Chapter 10
 
You raise me up violin
You raise me up violinYou raise me up violin
You raise me up violin
 
Labour management ppt
Labour management  pptLabour management  ppt
Labour management ppt
 

Ähnlich wie 참여기관_발표자료-국민대학교 201301 정기회의

Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Amazon Web Services
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCoburn Watson
 
Introduction to embedded System.pptx
Introduction to embedded System.pptxIntroduction to embedded System.pptx
Introduction to embedded System.pptxPratik Gohel
 
Maxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialmadhuinturi
 
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSFAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSMaurvi04
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitJinwon Lee
 
How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration? How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration? Deepak Shankar
 
load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940Samsung Electronics
 
Approximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsApproximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsSabidur Rahman
 
Run-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environmentsRun-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environmentsNECST Lab @ Politecnico di Milano
 
Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0Michael Christofferson
 
MK Sistem Operasi.pdf
MK Sistem Operasi.pdfMK Sistem Operasi.pdf
MK Sistem Operasi.pdfwisard1
 
Memory and Performance Isolation for a Multi-tenant Function-based Data-plane
Memory and Performance Isolation for a Multi-tenant Function-based Data-planeMemory and Performance Isolation for a Multi-tenant Function-based Data-plane
Memory and Performance Isolation for a Multi-tenant Function-based Data-plane AJAY KHARAT
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Sarwan ali
 
Real time operating system
Real time operating systemReal time operating system
Real time operating systemKhuram Shahzad
 
Daniel dauwe ece 561 Trial 3
Daniel dauwe   ece 561 Trial 3Daniel dauwe   ece 561 Trial 3
Daniel dauwe ece 561 Trial 3cinedan
 

Ähnlich wie 참여기관_발표자료-국민대학교 201301 정기회의 (20)

Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
Introduction to embedded System.pptx
Introduction to embedded System.pptxIntroduction to embedded System.pptx
Introduction to embedded System.pptx
 
Maxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorial
 
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSFAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration? How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration?
 
load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940load-balancing-method-for-embedded-rt-system-20120711-0940
load-balancing-method-for-embedded-rt-system-20120711-0940
 
Real time operating systems
Real time operating systemsReal time operating systems
Real time operating systems
 
Approximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsApproximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithms
 
Run-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environmentsRun-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environments
 
Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0Four Ways to Improve Linux Performance IEEE Webinar, R2.0
Four Ways to Improve Linux Performance IEEE Webinar, R2.0
 
MK Sistem Operasi.pdf
MK Sistem Operasi.pdfMK Sistem Operasi.pdf
MK Sistem Operasi.pdf
 
Memory and Performance Isolation for a Multi-tenant Function-based Data-plane
Memory and Performance Isolation for a Multi-tenant Function-based Data-planeMemory and Performance Isolation for a Multi-tenant Function-based Data-plane
Memory and Performance Isolation for a Multi-tenant Function-based Data-plane
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler
 
Real time operating system
Real time operating systemReal time operating system
Real time operating system
 
Daniel dauwe ece 561 Trial 3
Daniel dauwe   ece 561 Trial 3Daniel dauwe   ece 561 Trial 3
Daniel dauwe ece 561 Trial 3
 

참여기관_발표자료-국민대학교 201301 정기회의

  • 1. 에너지 30% 이상 절감 가능한 범용 운영체제 핵심 원천 기술 개발 과제 1월 정기회의 국민대학교 김영만, 한재일
  • 2. Contents  Research Activities Part 1 : 발표 배정논문 (3편) 1. A new model for the system and devices latency 2. Cross-Layer Frameworks for Constrained Power and Resources Management of Embedded Systems 3. Automatic Run-Time Selection of Power Polices for Operating Systems Part 2 : 국민대 연구내용 1. Performance Evaluation of Parallel Applications on Next Generation Memory Architecture with Power-Aware Paging Method 2. PPFS : A Scalable Flash Memory File System for the Hybrid Architecture of Phase-change RAM and NAND Flash 3. Address Translation Technique for Large NAND Flash Memory using Page Level Mapping 4. Performance Optimization Techniques for Legacy File Systems on Flash Memory
  • 3. 1. Research Activities Research Contents 에너지 절감과 관련된 논문 읽기 시뮬레이터 분석 Papers A new model for the system and devices latency Automatic Run-Time Selection of Power Policies for Operating Systems Cross-Layer Frameworks for Constrained Power and Resources Management of Embedded Systems Address Translation Technique for Large NAND Flash Memory Performance Optimization Techniques for Legacy File Systems on Flash Memory A Scalable Flash Memory File System for the Hybrid Architecture of Phase-change RAM and NAND Flash An Energy Efficient Cache Design Using Spin Torque Transfer (STT) RAM Performance Evaluation of Parallel Applications on Next Generation Memory Architecture with Power- Aware Paging Method
  • 6. WHAT IS LATENCY ? • “ In a computer system, latency is often used to mean any del ay or waiting that increases real or perceived response time b eyond the response time desired. “ • “ Specific contributors to computer latency include mismatche s in data speed between the microprocessor and input/output devices and inadequate data buffers.” • “ Within a computer, latency can be removed or "hidden" by s uch techniques as prefetching (anticipating the need for data i nput requests) and multithreading, or using parallelism across multiple execution threads.“ • Source: http://searchciomidmarket.techtarget.com/definition/ latency
  • 7. TERMINOLOGY. (Texas Instrument) • Latency: time to react to an external event, e.g. time spent to execut e the handler code after an IRQ, time spent to execute driver code fr om an external wake-up event. • HW latency: latency introduced by the HW to transition between po wer states. • SW latency: time for the SW to execute low power transition code, e .g. IP block save & restore, caches flush/invalidate etc. • System: ‘everything needed to execute the kernel code', e.g. on OM AP3, system = CPU0 + CORE (main memory, caches, IRQ controller...). • Per-device latency: latency of a device (or peripheral). The per-devic e PM QoS framework allows to control the devices states from the al lowed devices latency. • Cpuidle: framework that controls the CPUs low power states (=C-stat es), from the allowed system latency. Note : Is being abused to contr ol the system state. • PM runtime: framework that allows the dynamic switching of resour ces.
  • 8. HOW TO SPECIFY THE ALLOWED LATE NCY. • The PM QoS framework allows the kernel and user to specify the allowed latency. • The framework calculates the aggregated constraint value and calls the registered platform-specific handlers in order to apply the constraints at lower level.
  • 9. PM QoS FRAMEWORK. • PM QoS is a framework developed by Intel. • It allows kernel code and applications to set their requirement s in terms of: • CPU DMA latency. • Network latency. • According to these requirements, PM QoS allows kernel driver s to adjust their power management. • See Documentation/power/pm_qos_interface.txt. • http://free-electrons.com/kerneldoc/latest/power/pm_qos_interface.txt • Still in very early deployment (only 4 drivers in 2.6.36).
  • 10. What is the key point of control ling the latency ? • The point is to dynamically optimize the power consumption o f all system components. • Knowing the allowed latency (from the constraints) and the ex pected worst-case latency allows to choose the optimum pow er state.
  • 11. OMAP. • OMAP (Open Multimedia Applications Platform) developed b y Texas Instruments is a category of proprietary system on chi ps (SoCs) for portable and mobile multimedia applications. • OMAP devices generally include a general-purpose ARM archi tecture processor core plus one or more specialized co-proces sors. • Earlier OMAP variants commonly featured a variant of the Tex as Instruments TMS320 series digital signal processor. • The OMAP family consists of three product groups classified b y performance and intended application: • High-performance applications processors • Basic multimedia applications processors • Integrated modem and applications processors
  • 12. OMAP. TI OMAP3530 on BeagleBoard described
  • 13. OMAP. TI OMAP4430 on PandaBoard described
  • 18. PROBLEM. • There is no concept of ‘overall latency’. • No interdependency between PM frameworks  Ex. on OMAP3 : cpuidle manages only a subset of the power domains (MPU, CORE).  Ex. on OMAP3 per-device PM QoS manages the other power domain s.  No relation between the frameworks, each framework has its own lat ency numbers. • Some system settings are not included in the model  Mainly because of the (lack of) SW support at the time of the measur ement session.  Ex. On OMAP3 : voltage scaling in low power modes, sys_clkreq, sys_ offmode and the interaction with the PowerIC. • Dynamic nature of the system settings  The measured numbers are for a fixed setup, with predefined system settings.  The measured numbers are constant.
  • 19. SOLUTION PROPOSAL. • Overall latency calculation. • We need a model which breaks down the overall latency into t he latencies from every contributor: Latency = latencySW + latencyHW Latency = latencySW + latencySoC + latencyExternal HW • LatencySW : time for the SW to save/restore the context of an I P block. • LatencySoC : time for the SoC HW to change an IP block state. • LatencyExternal HW : time to stop/restart external HW. (Ex: extern al crystal oscillator, external power supply …) • Note: every latency factor maybe be divided into smaller facto rs. E.g: On OMAP a DPLL can feed multiple power domains.
  • 22. Problems • Existing studies one power management make an implicit assumption • Only one policy can be used to save power • Hence, those studies focus on finding the best polices for unique request patterns
  • 23. HAPPI(Homogeneous Architecture for Power Policy Integration) • HAPPI is currently capable of supporting power policies for disk, DVD-ROM, and network devices • But it can easily be extended to support other I/O devices • Must provide • A function that predicts idleness and controls a device’s power state. • A function that accepts a trace of device accesses, determines the actions the control function would take, and returns the energy consumption and access delay from the actions.
  • 24. HAPPI(Homogeneous Architecture for Power Policy Integration) • If policy is selected to manage the power state of a specific device by HAPPI, it is considered activity • Each device is assigned only one active policy at anytime • Whenever the device is accessed, HAPPI captures the size and time of the access • Also records the energy and delay for each device
  • 25. HAPPI(Homogeneous Architecture for Power Policy Integration) • Policy Selection
  • 26. Implementation • Linux 2.6.5 • Policies and evaluators are implemented as kernel module • Experimental hardware is not fully ACPI compliant • So they implement a function that returns the power, transition energy and transition delay for each state of each device • Policies need these values to compute the power consumed in each state
  • 27. Experiments • Fujitsu laptop hard disk(HDD) • Samsung DVD drive(DVD) • NetXtreme integrated wired network card(NIC) Power states for devices
  • 28. Experiments • Workload 1. Web browsing + buffered media playback from DVD 2. Download video and buffered media playback from disk 3. CVS checkout from remote repository 4. E-mail synchronization + unbuffered media playback from DVD 5. Kernel compile
  • 29. Experiments • Policies • Null • 2-competitive timeout • Exponential prediction • Adaptive timeout
  • 30. Exponential Prediction • Formulation • In : the last predicted value • in : the latest idle period • a : a constant attenuation factor in the range between 0 to 1 • If a = 0, then In+1 = In • If a = 1, then In = in • So, typically a = 1/2
  • 31. Exponential Prediction In in Actual Idle(in) 6 4 6 4 13 13 13 … Prediction(In) 10 8 6 6 5 9 11 12 …
  • 32. Experiments • Result Estimated energy consumption for each policy on devices for experimental workload Selected policies for devices at each evaluation Workload 1 2 3 4 5 Workload 1 2 3 4 5
  • 33. Conclusion • experiments indicate that policy selection is highly adaptive to workload and hardware types, supporting our claim that automatic policy selection is necessary to achieve better energy savings
  • 36. Problems • This paper propose solution (architecture and low power paging algorithm) to reduce energy consumption in HPC (High Performance Computing) systems • To demonstrate low power paging algorithm can improve HPC performance and reduce energy consumption
  • 37. SOLUTION • Replace a part of DRAM with MRAM. • Conduct simulation to evaluate the performance and energy consumption of several application benchmark. • Make a trace file of memory access in each application benchmark by using the Valgrind profiling tool. • For each memory access that incurs miss, we collect memory address and profiling results, which are access count on all the memory pages. • With the trace files, they replay behavior of application with our event-driven simulator.
  • 38. HOW CAN THEY SOLVE ? • They propose hybrid memory architecture and power aware s wapping. • Use MRAM as main memory beside DRAM due to its higher ac cess speed and low power consumption. • Use FLASH as fast random-access swap device due to its faster random access read speed. • Use MRAM hit rate and threshold in Low Power Paging Algorit hm to mange the swapping interaction between DRAM/MRA M and FLASH. Therefore can improve performance and reduce energy.
  • 39. Proposition– HybridMemoryArchitecture andPower AwareSwapping CPUs L1 CACHE L2 CACHE MRAM DRAM HOTTER PAGE COLDER PAGE FLASH SWAP Overview of Proposed Low Power Memory Architecture MainMemory Larger number of access. SWAP SWAP SWAP CACHE
  • 40. Low-Power Paging Algorithm Allocate Hot Pages on MRAM Profiling Result Application Running Page Fault Memory Access on MRAM MRAM Hit++ Memory Access++ MRAM Hit Rate <-- MRAM Hit / Memory Access MRAM Hit Rate > Threshold Swap Out the Last Recently Used Page on DRAM Swap Out the Last Recently Used Page on DRAM or MRAM L2 Cache Miss No Yes No No Yes Yes Figure 2 – Algorithmic Flow of Proposed Paging Algorithm - A trace file also includes profiling results, which are access counts on all the memory pages. - Profiling: the per page memory access frequency of a given application throughout its execution. - Pre execution trial or sampling with HW assist. With the trace file, we replay behavior of application with our event-driven simulator. - Memory access  L2 Cache Miss  Collect  Profiling
  • 41. Why they need that algorith m? • First simple algorithm works as follows: • We pin downs the hottest pages so that they are never swap out and allocated on MRAM. • The remaining pages are allocated onto DRAM and use LRU based swapping with flash memory. • This simple algorithm in some case increased application exec ution time with LRU swapping algorithm. • Excessive swaps, slowing down the application considerably.
  • 42. Why they need that algorith m? • To resolve this situation, we extend our algorithm by introduce a metric called MRAM hit rate and its threshold so that applica tions exhibiting lower locality may use both MRAM and DRAM as swappable main memory. • Thr = α x MRAM_SIZE / TOTAL_SIZE. • α (≈1) is a configurable parameter to be used to determine the threshold. • Several preliminary experiments have shown that a threshold value of 0.9 seems to work for the NAS and other HPC applicat ions.
  • 43. CORE IDEAS OF LOW POWER PAGING ALGORITHM. • MRAM hit rate is a dynamic value that indicates the ratio of th e access counts onto MRAM versus memory access to all the memory at each point in execution time. • If the ratio is large, we can decide that accesses to MRAM has sufficient locality such that pages should be pinned down. • On the other hand, if the ratio is small, the application lacks of locality and thus the entire main memory should be seen as s wappable.
  • 44. CONCLUSION. • Reduce DRAM capacity aggressively can reduce energy consu mption, even with swapping. • The energy consumption can be reduced to 25% by reducing D RAM capacity.
  • 46. NAND Flash Memory • NAND flash memory structure • Page (2KB) : Read and Write Unit • Block (64 pages = 128KB) : Erase Unit • NAND flash memory is beauty • Non-volatility • Fast access time (No seek latency) • Low power consumption • Relatively large capacity • Shock-resistance • NAND flash memory is beast • Erase before write : The page should be erased first in order to update data on that page • Slow write : Support only page-level write and 10x slower than read • Limited life time : Ensure 100K ~ 1M erase cycles Ref) K9F1G08X0A Datasheet
  • 47. Feature of PRAM Source: Motoyuki Ooishi, Nikkei Electronics Asia, Oct. 2007  PRAM memory  Random access memory  Non-volatile memory  Low leakage energy  High density: 4x denser than DRAM  Limited endurance
  • 48. NAND flash memory VS. PRAM 1. KPS5615EZM Data Sheet, 2. K9G8G08U0M Data Sheet PRAM1 NOR SLC NAND MLC NAND2 Volatility Non-volatile Non-volatile Non-volatile Non-volatile Random access Yes Yes No No Unit of write Word (2byte) Word (2byte) Page (2Kbyte) Page (2Kbyte) Read speed 50ns/word 100ns/word 25us/page 60us/page Write speed 5us/word 11.5us/word 200us/page 800us/page Erase speed N/A 0.7s/64KB 2ms/128KB 1.5ms/128KB Program endurance 108 105 106 105 Size 32MByte 32MByte ~1GB 4GB+ Others • Serial program • Serial program • Paired page damage
  • 49. JFFS2(Journaling Flash File System) • Developed by Redhat eCos in 2001 • Designed for NOR flash memory at the first time • Supporting data compression – Good for reducing total page write – Additional computational overhead • Log-structured File system – Any file system modification is appended to the log • Scalability problem – Need full scan at a mount time – Manage all metadata in main memory • Directory structure, File indexing structure Scan area JFFS Ref. D. Woodhouse, “JFFS: The journaling flash file system,” presented at the Ottawa Linux Symposium, 2001.
  • 50. YAFFS2 (Yet Another Flash File System) • Developed by Aleph One in 2003 • Designed specifically for NAND flash memory – Use spare region to store the file metadata • Log-structured File system – Any file system modification is appended to the log • Scalability problem – Need to scan entire spare region • Reduced mounting time comparing with JFFS2 – Manage all metadata in main memory • Directory structure, File indexing structure Scan area YAFFS Ref. http://www.yaffs.net/
  • 51. CFFS (Core Flash File System) • Developed By CORE Lab in 2006 • Log-structured File system – Any file system modification is appended to the log • Metadata separation – Metadata and data is written to different blocks in NAND flash – Scanning only the metadata blocks  Reduced mounting time • Store file indexing structure in NAND flash memory – Reduce the main memory usage – Manage directory structure in main memory • CFFS2 limitation – Need extra metadata write operation • Updating file index in NAND flash memory – Wear-leveling problem • Metadata block is updated more frequently Scan area CFFS2 Ref. S. H. Lim and K. H. Park, “An efficient nand flash file system for flash memory storage,” IEEE Transactions on Computers, vol. 55, no. 7, pp. 906–912, 2006.
  • 52. Previous flash file systems Feature Pros. Cons. JFFS2 [2001] • LFS approach • Data compression • Node management • Reliable • Metadata update overhead • Scalability problem • Node management overhead YAFFS2 [2003] • LFS approach • Using spare region • Reduced mounting time • Metadata update overhead • Scalability problem CFFS [2006] • LFS approach • Metadata separation • File indexing in NAND • Reduced mounting time • Reduced GC overhead • Metadata update overhead • Scalability problem remaining • Extra write overhead • Wear-leveling problem
  • 54. Scalability problems 2. Use of main memory Scan area Non-scan area 1. Scan area comparison >> Open(“/dir/a.txt”) i-number Location of inode Location of data Accessing a file ‘/dir/a.txt’ Type of Index JFFS, YAFFS CFFS 1. Find i-number using path name In memory directory In memory directory 2. Find inode using i- number In memory inode map In memory inode map 3. Find file data In memory file index In NAND file index YAFFS CFFSJFFS
  • 55. Solution of Metadata update Write 2Btyte
  • 56. PFFS Scalability: Mounting time • PFFS has minimized and fixed mounting time – All metadata are connected from root directory in PRAM – PFFS does not need to scan the NAND flash memory YAFFS CFFSJFFS PFFS Scan area Non-scan area >> > Scan area comparison
  • 57. PFFS Scalability: Memory use • PFFS use no DRAM main memory for metadata structure – Most of metadata structures of PFFS are contained in PRAM Type of Index JFFS, YAFFS CFFS PFFS 1. Find i-number using path name In memory directory In memory directory In-PRAM directory 2. Find inode using i-number In memory inode map In memory inode map Simple calculation 3. Find file data In memory file index In NAND file index In-PRAM data pointers Open(“/dir/a.txt”) i-number Location of inode Location of data Accessing a file ‘/dir/a.txt’ Main memory use
• 58. Evaluation
• CPU: Samsung S3C2413 (ARM926EJ)
• Memory: 64 MB DRAM
• Storage: 1 GB MLC NAND, 32 MB PRAM
• NAND flash memory characteristics (table shown in slide)
• Benchmark: PostMark, measuring short-lived, small-file read/write performance
• Compared against YAFFS2
• 61. Conclusion
• PFFS solves the scalability problems of previous flash file systems by using a hybrid architecture of PRAM and NAND flash memory
• Mounting time and memory usage of PFFS are O(1)
• PFFS performs 25% better than YAFFS2 on small-file writes
• 63. Problems
• With page-level mapping, data can be relocated at page granularity
• The disadvantage is the size of the mapping table
• Example: a 64 GB SSD
– Block-level mapping: the mapping table is 512 KB (131072 blocks × 4 B per entry)
– Page-level mapping: the mapping table is 64 MB (16 M pages of 4 KB × 4 B per entry)
• Hence most commercial SSDs use a hybrid scheme based on block-level mapping
• 64. Address translation techniques for the page-level mapping scheme
• The entire mapping table is kept in NAND
• Frequently used parts of the mapping table are cached in DRAM
• Uses an FTL-TLB together with an FTL mapping directory structure
• 65. Page table management in a demand-paging memory system
• Page-level mapping in NAND flash and demand paging in a virtual memory system are structurally similar: both keep the full table in slow storage and cache the hot entries (page table ↔ mapping table, TLB ↔ FTL-TLB)
• 66. FTL-TLB
• The mapping table is managed in units of sections
• A section stores the mapping entries for one NAND flash block
• Example: if a block has 128 pages, a section is 128 × 4 B = 512 B
• The number of sections equals the number of blocks in NAND
• 67. FTL mapping directory
• The FTL mapping directory is allocated in DRAM
• It has one entry per section, recording:
– whether the section is currently cached in the FTL-TLB
– whether the section has been updated but not yet written back to NAND
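The two structures fit together roughly as in the C sketch below, built from the slides' parameters (128 pages per block, 4 B entries, 131072 blocks); TLB_SLOTS and tlb_load_section() are assumed names standing in for the cache geometry and the NAND read-plus-eviction path, not the paper's interface:

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGES_PER_BLOCK 128      /* slide 66: one section covers one block      */
    #define NUM_SECTIONS    131072   /* = number of NAND blocks (64 GB / 512 KB)    */
    #define TLB_SLOTS       1024     /* assumed: 1024 sections * 512 B = 512 KB TLB */

    struct ftl_dir_entry {           /* one per section, kept in DRAM (slide 67)    */
        uint32_t tlb_slot;           /* where the section sits in the FTL-TLB       */
        bool     cached;             /* is the section currently in the FTL-TLB?    */
        bool     dirty;              /* updated in DRAM but not yet written to NAND? */
    };

    static struct ftl_dir_entry ftl_dir[NUM_SECTIONS];
    static uint32_t ftl_tlb[TLB_SLOTS][PAGES_PER_BLOCK];   /* cached sections */

    /* assumed helper: fetch a section from the NAND map area into a TLB slot,
     * evicting (and writing back, if dirty) another section as needed */
    extern uint32_t tlb_load_section(uint32_t section);

    /* translate a logical page number to a physical page number */
    uint32_t ftl_translate(uint32_t lpn)
    {
        uint32_t section = lpn / PAGES_PER_BLOCK;   /* which section holds the entry */
        uint32_t offset  = lpn % PAGES_PER_BLOCK;

        if (!ftl_dir[section].cached) {             /* FTL-TLB miss */
            ftl_dir[section].tlb_slot = tlb_load_section(section);
            ftl_dir[section].cached   = true;
        }
        return ftl_tlb[ftl_dir[section].tlb_slot][offset];  /* FTL-TLB hit */
    }

The directory itself stays small (one entry per block, see slide 78), while the bulky per-page entries live in NAND except for the cached sections.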
• 69. Evaluation
• Workload
– Target is a 64 GB SSD
– Access traces collected with VirtualBox running Windows XP on a 64 GB virtual HDD
– Daily_usage and Multi_program represent typical desktop environments
– Install_update covers Windows updates and program installation
– Large_file copies large files

Trace          | Requests | Data size [MB]
Daily_usage    | 545031   | 10270.78
Multi_program  | 309262   | 3070.669
Install_update | 1022856  | 14072.22
Large_file     | 45593    | 2810.333
• 70. Evaluation
• Replacement algorithms compared:
– OPT (the optimal algorithm)
– LRU
– LRFU (Least Recently/Frequently Used)
– LRU-2 (LRU-K)
– LIRS (Low Inter-reference Recency Set)
– CFLRU (Clean-First LRU)
– LRU-WSR (LRU with Write Sequence Reordering)
• 71. CFLRU (Clean-First LRU)
• If all page frames are clean, or all are dirty, CFLRU behaves exactly like LRU
• When the cache holds a mix of clean and dirty pages, CFLRU divides the LRU list into two regions:
– The working region holds recently used pages; most cache hits occur here
– The clean-first region holds the pages that are candidates for eviction
• CFLRU first evicts a clean page from the clean-first region, since a clean page needs no flash write-back
• If there is no clean page in that region, the dirty page at the end of the LRU list is evicted
• 72. CFLRU example (8 frames, listed MRU → LRU; the last 4 form the clean-first region; entries are logical page(state), C = clean, D = dirty)
Initial:    5(D) 2(C) 3(D) 7(C) | 1(C) 4(D) 6(C) 8(D)
Access 9  → evict 6(C), the clean page nearest the LRU end → 9(C) 5(D) 2(C) 3(D) | 7(C) 1(C) 4(D) 8(D)
Access 10 → evict 1(C) → 10(C) 9(C) 5(D) 2(C) | 3(D) 7(C) 4(D) 8(D)
Access 11 → evict 7(C) → 11(C) 10(C) 9(C) 5(D) | 2(C) 3(D) 4(D) 8(D)
Access 12 → evict 2(C) → 12(C) 11(C) 10(C) 9(C) | 5(D) 3(D) 4(D) 8(D)
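The eviction rule fits in a few lines of C. This is a minimal sketch, assuming a doubly linked LRU list whose tail is the LRU end and whose last `window` frames form the clean-first region:

    #include <stdbool.h>
    #include <stddef.h>

    struct frame {
        struct frame *prev, *next;   /* LRU list: head = MRU end, tail = LRU end */
        bool dirty;
    };

    struct frame *cflru_victim(struct frame *lru_tail, size_t window)
    {
        /* 1. scan the clean-first region from the LRU end, preferring clean
         *    pages, since evicting them costs no flash write-back */
        struct frame *p = lru_tail;
        for (size_t i = 0; i < window && p != NULL; i++, p = p->prev)
            if (!p->dirty)
                return p;

        /* 2. no clean page in the region: fall back to plain LRU and give
         *    up the dirty page at the very end of the list */
        return lru_tail;
    }

With window = 4 this reproduces the eviction order of the example above: 6(C), 1(C), 7(C), 2(C).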
• 73. LRU-WSR (LRU with Write Sequence Reordering)
• LRU-WSR maintains an LRU page list plus one extra bit per page, the cold flag
• A dirty page whose cold flag is set is regarded as a cold dirty page
• Victim selection at the LRU end:
– a clean page, or a cold dirty page, is evicted immediately
– a dirty page whose cold flag is clear gets a second chance: its cold flag is set and it is moved to the MRU end
• When a dirty page is referenced again, its cold flag is cleared
• 74. LRU-WSR example (8 frames, listed MRU → LRU; entries are logical page(state, cold flag))
Initial:   5(D,0) 2(C,0) 3(D,0) 7(C,0) 1(C,0) 4(D,1) 6(C,0) 8(D,0)
Access 9:  8(D) at the LRU end is hot dirty → cold flag set, moved to MRU; 6(C) is clean → evicted
  → 9(C,0) 8(D,1) 5(D,0) 2(C,0) 3(D,0) 7(C,0) 1(C,0) 4(D,1)
Access 10: 4(D,1) is a cold dirty page → evicted
  → 10(C,0) 9(C,0) 8(D,1) 5(D,0) 2(C,0) 3(D,0) 7(C,0) 1(C,0)
Access 11: 1(C) is clean → evicted
  → 11(C,0) 10(C,0) 9(C,0) 8(D,1) 5(D,0) 2(C,0) 3(D,0) 7(C,0)
Access 12: 7(C) is clean → evicted
  → 12(C,0) 11(C,0) 10(C,0) 9(C,0) 8(D,1) 5(D,0) 2(C,0) 3(D,0)
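The corresponding second-chance loop, again a sketch over the same kind of list; list_move_to_mru() is an assumed helper that unlinks a frame and reinserts it at the MRU end, updating the tail:

    #include <stdbool.h>

    struct wsr_frame {
        struct wsr_frame *prev, *next;  /* LRU list: head = MRU end, tail = LRU end */
        bool dirty;
        bool cold;                      /* the cold flag */
    };

    /* assumed helper: unlink f and reinsert it at the MRU end, updating *tail */
    extern void list_move_to_mru(struct wsr_frame **tail, struct wsr_frame *f);

    struct wsr_frame *lru_wsr_victim(struct wsr_frame **tail)
    {
        for (;;) {
            struct wsr_frame *v = *tail;
            if (!v->dirty || v->cold)
                return v;                /* clean, or cold dirty: evict it      */
            v->cold = true;              /* hot dirty page: second chance,      */
            list_move_to_mru(tail, v);   /* reorder its write toward the MRU end */
        }
    }

On the example above this first gives 8(D) its second chance (setting its cold flag) and then evicts 6(C), matching the Access 9 step. The loop always terminates: every hot dirty page it passes becomes cold, so a victim is found within one sweep of the list.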
• 75. Evaluation: cache hit ratio
• The cache hit ratio is over 95% in all cases except Large_file
(Figures: cache hit ratio for Daily_usage, Multi_program, Install_update, Large_file)
• 77. Evaluation: overhead
• For most workloads, once the cache size reaches 512 KB the overhead is less than 2%
(Figures: overhead for Daily_usage, Multi_program, Install_update, Large_file)
• 78. Evaluation: memory usage
• A 64 GB SSD has 131072 blocks of 512 KB
• Each FTL mapping directory entry uses 6 B, so the directory is 131072 × 6 B = 768 KB

Configuration                  | DRAM used                | vs. full mapping table
Full page mapping table        | 64 MB                    | 100%
512 KB FTL-TLB + directory     | 512 + 768 = 1280 KB      | 1.9%
1024 KB FTL-TLB + directory    | 1024 + 768 = 1792 KB     | 2.7%
• 79. Conclusion
• Even though the FTL-TLB uses only 512 KB, the cache hit ratio is over 90%
• The caching overhead is under 2%
• Memory usage is only 1.9% of the full page mapping table
• 81. Problems
• Little research exists on file system optimization for flash memory
• Legacy cluster allocation schemes designed for hard disks are not suitable
• A hard disk can update data in place; flash memory cannot
• 82. Solutions
• AFCA (Anti-Fragmentation Cluster Allocation)
– A fragmentation definition suited to flash, plus an allocation scheme that avoids it
• Data invalidation scheme
– When data is no longer used, the file system notifies the FTL so it can skip unnecessary copy overhead
• 83. AFCA (Anti-Fragmentation Cluster Allocation): fragmentation definitions
• File fragmentation
– N: the minimum number of logical blocks needed to store the file
– n: the number of logical blocks actually used
– If n > N, the file is fragmented
• Free-space fragmentation
– M: the minimum number of logical blocks that could hold the free space
– m: the number of logical blocks over which the free space is actually spread
– If m > M, the free space is fragmented
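Both tests reduce to comparing an actual block count against a ceiling. A small C helper, assuming cluster-granularity accounting (the names are illustrative):

    #include <stdbool.h>

    /* minimum number of logical blocks able to hold `clusters` clusters (N or M) */
    static unsigned min_blocks(unsigned clusters, unsigned clusters_per_block)
    {
        return (clusters + clusters_per_block - 1) / clusters_per_block;  /* ceil */
    }

    /* file fragmentation: n > N (blocks used exceed the minimum needed);
     * free-space fragmentation is the same test with free clusters (m > M) */
    static bool is_fragmented(unsigned clusters, unsigned used_blocks,
                              unsigned clusters_per_block)
    {
        return used_blocks > min_blocks(clusters, clusters_per_block);
    }

For example, a file that would fit in N = 2 blocks but is spread over n = 3 blocks (as in the figure on the next slide) makes is_fragmented() return true.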
• 84. AFCA: fragmentation examples
(Figure: example layouts, e.g. an unfragmented file with N = 2, n = 2 vs. a fragmented file with N = 2, n = 3; unfragmented free space with M = 2, m = 2 vs. fragmented free space with M = 2, m = 3 and M = 4, m = 5)
• 86. AFCA: considerations
• If a file is larger than a logical block, allocate it in units of whole logical blocks; this reduces file fragmentation
• Allocate all clusters within the current logical block before moving to the next one; this reduces free-space fragmentation
• A file is initially treated as a small file; once it exceeds a threshold it is treated as a large file
• 87. AFCA: logical block types
• Free logical block (F-logical block): all clusters in the logical block are unused
• S-logical block: holds clusters of small files
• L-logical block: holds large files
• 88. AFCA: block state transitions
(Figure: transitions between F-, S-, and L-logical blocks; cluster allocation for a small file turns an F-block into an S-block, a large file takes F-blocks as L-blocks, and unused clusters can be returned)
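Putting the considerations and block types together, a C sketch of the allocation decision; SMALL_FILE_THRESHOLD and the pick_from()/free_clusters() helpers are assumptions for illustration, not the paper's interface:

    enum lblock_kind { F_BLOCK, S_BLOCK, L_BLOCK };

    struct lblock;                                           /* logical block descriptor */
    extern struct lblock *pick_from(enum lblock_kind kind);  /* assumed per-kind pools   */
    extern unsigned free_clusters(const struct lblock *b);

    #define SMALL_FILE_THRESHOLD 16   /* clusters; illustrative threshold value */

    struct lblock *afca_pick_block(unsigned file_clusters)
    {
        if (file_clusters <= SMALL_FILE_THRESHOLD) {
            /* small files share S-logical blocks, filling one block completely
             * before touching the next, to limit free-space fragmentation */
            struct lblock *b = pick_from(S_BLOCK);
            if (b && free_clusters(b) > 0)
                return b;
            return pick_from(F_BLOCK);   /* a fresh F-block becomes an S-block */
        }
        /* large files get whole logical blocks (F -> L), so they are stored
         * contiguously and file fragmentation stays low */
        return pick_from(F_BLOCK);
    }

The key design choice is simply to keep small-file clusters and large-file extents in separate logical blocks, so neither kind of allocation scatters the other.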
• 89. Data invalidation scheme
• When a sector is no longer used, the file system notifies the FTL
• The FTL marks that sector's entry invalid in the page mapping table, so garbage collection does not copy dead data
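This is the same idea later standardized as the ATA TRIM / SCSI UNMAP command. A minimal C sketch of the FTL side, assuming a flat page-level map as this slide's wording suggests (all names are illustrative):

    #include <stdint.h>

    #define INVALID_PPN 0xFFFFFFFFu

    extern uint32_t page_map[];               /* logical -> physical page map */
    extern uint16_t valid_pages[];            /* valid-page count per block   */
    extern uint32_t ppn_to_block(uint32_t ppn);

    /* called by the file system when a sector's data is no longer needed */
    void ftl_invalidate(uint32_t lpn)
    {
        uint32_t ppn = page_map[lpn];
        if (ppn == INVALID_PPN)
            return;                           /* nothing mapped, nothing to do */
        valid_pages[ppn_to_block(ppn)]--;     /* GC will copy one page fewer   */
        page_map[lpn] = INVALID_PPN;          /* drop the stale mapping        */
    }

Dropping the mapping eagerly is what saves the overhead: garbage collection never wastes a copy on a page the file system has already discarded.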
• 92. Evaluation
• Environment: Ext2 on Linux kernel 2.4 with a NAND flash emulator
• Page size: 2 KB; block size: 128 KB
• FTL: Z-FTL, based on block mapping
• 94. Conclusion
• With AFCA:
– fragmentation is reduced by up to 53%
– performance is improved by up to 46%
• With data invalidation:
– write performance is improved by up to 22%