CSense: A Stream-Processing Toolkit for Robust and High-Rate Mobile Sensing Applications
1. University of Iowa | Mobile Sensing Laboratory
CSense: A Stream-Processing Toolkit
for Robust and High-Rate Mobile
Sensing Applications
IPSN 2014
Farley Lai, Syed Shabih Hasan, Austin Laugesen, Octav Chipara
Department of Computer Science
2. University of Iowa | Mobile Sensing Laboratory |
Mobile Sensing Applications (MSAs)
CSense Toolkit 2
Speaker
Models
Speech
Recording
VAD
Feature
Extraction
HTTP
Upload
Sitting
Standing
Walking
Running
Climbing Stairs
…
Bluetooth
Data
Collection
Feature
Extraction
Activity
Classification
Speaker Identification
Activity Recognition
3. University of Iowa | Mobile Sensing Laboratory |
• Mobile sensing applications are difficult to implement on
Android devices
– concurrency
– high frame rates
– robustness
• Resource limitations and Java VM worsen these problems
– additional cost of virtualization
– significant overhead of garbage collection
Challenges
4. University of Iowa | Mobile Sensing Laboratory |
• Support for MSAs
– SeeMon, Coordinator: constrained queries
– JigSaw: customized pipelines
CSense instead provides a high-level stream programming
abstraction that is general and suitable for a broad range of MSAs
• CSense builds on prior data flow models
– Synchronous data flows: static scheduling and optimizations
• e.g., StreamIt, Lustre
– Async. data flows: more flexible but have lower performance
• e.g., Click, XStream/Wavescript
Related Work
5. University of Iowa | Mobile Sensing Laboratory |
• Programming model
• Compiler
• Run-time environment
• Evaluation
CSense Toolkit
6. University of Iowa | Mobile Sensing Laboratory |
• Applications modeled as Stream Flow Graphs (SFG)
– builds on prior work on asynchronous data flow graphs
– incorporates novel features to support MSAs
Programming Model
addComponent("audio", new AudioComponentC(rateInHz, 16));
addComponent("rmsClassifier", new RMSClassifierC(rms));
addComponent("mfcc", new MFCCFeaturesG(speechT, featureT));
...
link("audio", "rmsClassifier");
toTap("rmsClassifier::below");
link("rmsClassifier::above", "mfcc::sin");
fromMemory("mfcc::fin");
...
create components
wire components
7. University of Iowa | Mobile Sensing Laboratory |
• Goal: Reduce memory overhead introduced by garbage
collection and copy operations
• Pass-by-reference semantics
– allows for sharing data between components
• Explicit inclusion of memory management in SFGs
– focuses programmer’s attention on memory operations
– enables static analysis by tracking data exchanges globally
– allows for efficient implementation
Memory Management
8. University of Iowa | Mobile Sensing Laboratory |
• Data flows from sources, through links, to taps
• Implementation:
– sources implement memory pools that hold several frames
– reference counters are used to track sharing of frames
– taps decrement reference counters
Memory Management
(Figure: three memory pools holding audio data, MFCCs, and filenames)
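The source/tap mechanics above can be sketched as a minimal reference-counted frame pool. This is an illustration only; the class and method names are ours, not the CSense API:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: a source-owned pool of pre-allocated frames with
// reference counting, so a frame returns to the pool only once every
// sharing component has released it at a tap.
public class FramePool {
    public static final class Frame {
        public final float[] data;
        final AtomicInteger refs = new AtomicInteger(0);
        final FramePool owner;
        Frame(FramePool owner, int size) { this.owner = owner; this.data = new float[size]; }
        // A component that forwards the frame to an additional consumer retains it.
        public void retain() { refs.incrementAndGet(); }
        // A tap releases the frame; the last release recycles it into the pool.
        public void release() {
            if (refs.decrementAndGet() == 0) owner.pool.offer(this);
        }
    }

    final BlockingQueue<Frame> pool;

    public FramePool(int frames, int frameSize) {
        pool = new ArrayBlockingQueue<>(frames);
        for (int i = 0; i < frames; i++) pool.offer(new Frame(this, frameSize));
    }

    // The source takes a frame with an initial reference count of one.
    // Returns null when the pool is exhausted; a real source would block.
    public Frame acquire() {
        Frame f = pool.poll();
        if (f == null) return null;
        f.refs.set(1);
        return f;
    }

    public int available() { return pool.size(); }
}
```

Because frames are pre-allocated and recycled, no allocation happens on the data path, which is what keeps garbage collection out of the steady state.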
9. University of Iowa | Mobile Sensing Laboratory |
• Goal: Expressive concurrency model that may be analyzed
statically
• Components are partitioned into execution domains
– components in the same domain are executed on a thread
– frame exchanges between domains are mediated using shared
queues
• Other data sharing between components goes through a tuple space
• Concurrency is specified as constraints
– NEW_DOMAIN / SAME_DOMAIN
– heuristic assignment of components to domains to minimize data
exchanges between domains
• Static analysis may identify some data races
Concurrency Model
10. University of Iowa | Mobile Sensing Laboratory |
Concurrency Model
getComponent("audio").setThreading(Threading.NEW_DOMAIN);
getComponent("httpPost").setThreading(Threading.NEW_DOMAIN);
getComponent("mfcc").setThreading(Threading.SAME_DOMAIN);
Compiler transformation
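The downstream expansion the compiler performs when assigning components to domains can be sketched roughly as follows. This is an illustrative reimplementation, not the CSense compiler's actual code:

```java
import java.util.*;

// Illustrative sketch: partition the components of a stream-flow graph
// into execution domains. Components annotated NEW_DOMAIN seed a fresh
// domain; every other component inherits the domain of an upstream
// neighbor, expanding in the downstream direction.
public class DomainPartitioner {
    public static Map<String, Integer> partition(
            List<String> topoOrder,            // components in topological order
            Map<String, List<String>> edges,   // component -> downstream components
            Set<String> newDomain) {           // components annotated NEW_DOMAIN
        Map<String, Integer> domain = new LinkedHashMap<>();
        int next = 0;
        for (String c : topoOrder) {
            // seed a new domain for NEW_DOMAIN components and for sources
            // that no upstream neighbor has claimed yet
            if (newDomain.contains(c) || !domain.containsKey(c)) {
                domain.put(c, next++);
            }
            for (String d : edges.getOrDefault(c, List.of())) {
                // downstream components join this domain unless they
                // declare a new one themselves; a component with several
                // upstream neighbors keeps the first assignment
                if (!newDomain.contains(d)) domain.putIfAbsent(d, domain.get(c));
            }
        }
        return domain;
    }
}
```

On a linear audio → rms → mfcc → httpPost pipeline with NEW_DOMAIN on audio and httpPost, this yields two domains split exactly at the httpPost link, which is where the compiler would insert the shared queue.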
11. University of Iowa | Mobile Sensing Laboratory |
• Goal: Promote component reuse across MSAs
• A rich type system that extends Java’s type system
– most components use generic types
– insight: frame sizes are essential in configuring components
• detect configuration errors / optimization opportunities
Type System
VectorC energyT = TypeC.newFloatVector();
energyT.addConstraint(Constraint.GT(8000));
energyT.addConstraint(Constraint.LT(24000));
VectorC speechT = TypeC.newFloatVector(128);
VectorC featureT = TypeC.newFloatVector(11);
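A minimal sketch of how such frame-size constraints might be represented and checked. The names here are hypothetical; the real TypeC/Constraint API is richer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Illustrative mini version (not the CSense API) of frame-size
// constraints: a port type carries predicates such as GT/LT/EQ that a
// concrete frame size must satisfy, letting a compiler reject bad
// configurations before any code runs.
public class FrameType {
    private final List<IntPredicate> constraints = new ArrayList<>();

    public FrameType gt(int n) { constraints.add(s -> s > n); return this; }
    public FrameType lt(int n) { constraints.add(s -> s < n); return this; }
    public FrameType eq(int n) { constraints.add(s -> s == n); return this; }

    // true when the candidate frame size satisfies every constraint
    public boolean admits(int frameSize) {
        return constraints.stream().allMatch(p -> p.test(frameSize));
    }
}
```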
12. University of Iowa | Mobile Sensing Laboratory |
• Not all configurations can be implemented efficiently
Flow Analysis
Constraints: energyT > 8000; energyT < 24000; speechT = 128; featureT = 11

              energyT             speechT
Inefficient   10,000              128
Efficient     10,240 (128 * 80)   128
13. University of Iowa | Mobile Sensing Laboratory |
• Not all configurations can be implemented efficiently
Flow Analysis
Mrms = 1, Mmfcc = 80
An efficient implementation exists when
Mrms * energyT = Mmfcc * speechT
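The efficiency condition can be checked directly: with Mrms = 1, converting energyT frames into 128-sample speechT frames avoids remainder copies exactly when the energyT frame size is a multiple of 128. A small sketch (illustrative, not CSense code):

```java
// Illustrative check of the efficiency condition above: an energyT frame
// of size s converts to speechT frames without leftover copying exactly
// when s is a multiple of the speechT frame size (then Mrms = 1 and
// Mmfcc = s / speechT).
public class FlowCheck {
    public static boolean efficient(int energyT, int speechT) {
        return energyT % speechT == 0;
    }

    // Smallest frame size strictly inside (lower, upper) that admits an
    // efficient conversion, or -1 if none exists.
    public static int smallestEfficient(int lower, int upper, int speechT) {
        for (int s = lower + 1; s < upper; s++)
            if (s % speechT == 0) return s;
        return -1;
    }
}
```

Note that the smallest admissible size for this one link is 8,064 (63 * 128); the slides settle on 10,240 because the compiler optimizes frame sizes for the whole application, not for a single link in isolation.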
14. University of Iowa | Mobile Sensing Laboratory |
• Goal: determine which configurations have efficient frame
conversions
• Problem may be formulated as an integer linear program
– constraints: generated from type constraints
– optimization: minimize total memory usage
– solution: specifies frame sizes and multipliers for application
• An efficient frame conversion may not exist
– the compiler relaxes conversion rules
Flow Analysis
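As a stand-in for the integer linear program, a brute-force search over frame sizes and multipliers for the single rms-to-mfcc link of the example conveys the same idea. A real compiler would hand these constraints to an ILP solver, and the memory objective here is deliberately crude:

```java
// Illustrative stand-in for the ILP: search frame sizes and multipliers
// satisfying the type constraints and the conversion equality
//   mRms * energyT = mMfcc * speechT,
// minimizing a crude total-memory objective.
public class FlowSolver {
    // returns {energyT, mRms, mMfcc}, or null when no efficient
    // configuration exists within the multiplier bound
    public static int[] solve(int lower, int upper, int speechT, int maxMult) {
        int[] best = null;
        long bestMem = Long.MAX_VALUE;
        for (int mRms = 1; mRms <= maxMult; mRms++)
            for (int mMfcc = 1; mMfcc <= maxMult; mMfcc++) {
                long total = (long) mMfcc * speechT;   // the common multiple
                if (total % mRms != 0) continue;       // energyT must be integral
                long energyT = total / mRms;
                if (energyT <= lower || energyT >= upper) continue;
                long mem = energyT + speechT;          // crude memory objective
                if (mem < bestMem) {
                    bestMem = mem;
                    best = new int[]{(int) energyT, mRms, mMfcc};
                }
            }
        return best;
    }
}
```

When no assignment satisfies the equality, the search returns null; this is the case where, per the slide, the compiler relaxes the conversion rules and falls back to copying.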
16. University of Iowa | Mobile Sensing Laboratory |
• Components exchange data using push/pull semantics
• Runtime includes a scheduler for each domain
– task queue + event queue
– wake lock – for power management
CSense Runtime
(Figure: two domain schedulers, each with its own task queue and event queue, sharing a memory pool)
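The per-domain scheduler described above can be sketched as a loop over a task queue and a delayed-event queue. This is illustrative only; the wake-lock interaction is reduced to a comment since it is Android-specific:

```java
import java.util.concurrent.*;

// Illustrative sketch of a per-domain scheduler: a task queue for
// run-as-soon-as-possible work, a timer queue for delayed events, and a
// bounded sleep when idle (on Android, this idle point is where the
// scheduler would decide whether to release its wake lock).
public class DomainScheduler implements Runnable {
    private final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();
    private final DelayQueue<Event> events = new DelayQueue<>();
    private volatile boolean running = true;

    static final class Event implements Delayed {
        final Runnable action;
        final long deadline;
        Event(Runnable action, long delayMs) {
            this.action = action;
            this.deadline = System.nanoTime() + delayMs * 1_000_000L;
        }
        public long getDelay(TimeUnit u) {
            return u.convert(deadline - System.nanoTime(), TimeUnit.NANOSECONDS);
        }
        public int compareTo(Delayed o) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                                o.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    public void post(Runnable task) { tasks.offer(task); }
    public void postDelayed(Runnable task, long delayMs) { events.offer(new Event(task, delayMs)); }
    public void shutdown() { running = false; tasks.offer(() -> {}); }

    @Override public void run() {
        while (running) {
            Event e;
            while ((e = events.poll()) != null) e.action.run(); // expired events first
            try {
                // sleep at most 10 ms when idle, so delayed events are
                // picked up promptly on the next pass
                Runnable t = tasks.poll(10, TimeUnit.MILLISECONDS);
                if (t != null) t.run();
            } catch (InterruptedException ie) {
                return;
            }
        }
    }
}
```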
17. University of Iowa | Mobile Sensing Laboratory |
• Micro benchmarks evaluate the runtime performance
– synchronization primitives + memory management
• Implemented three MSAs using CSense
– Speaker identification
– Activity recognition
– Audiology application
• Setup
– Galaxy Nexus, TI OMAP 4460 ARM A9@1.2 GHz, 1 GB
– Android 4.2
– MATLAB 2012b and MATLAB Coder 2.3
Evaluation
18. University of Iowa | Mobile Sensing Laboratory |
• Scheduler: memory management + synchronization primitives
• Memory management options
– GC: garbage collection
– MP: memory pool
• Concurrent access to queues and memory pools
– L: Java reentrant lock
– C: CSense atomic variable based synchronization primitives
Producer-Consumer Benchmark
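The "C" configuration's atomic-variable synchronization can be sketched as a compare-and-swap spin guard. This illustrates the technique only; it is not the CSense primitive itself:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: guard a shared resource with an atomic
// compare-and-swap spin lock, so a contending thread retries instead of
// being suspended by the kernel. Thread suspension (and the implicit
// allocation behind reentrant locks) is the cost the benchmark measures.
public class SpinGuard {
    private final AtomicInteger held = new AtomicInteger(0);

    public void lock() {
        // retry until the CAS from 0 to 1 succeeds, without blocking
        while (!held.compareAndSet(0, 1)) {
            Thread.onSpinWait();
        }
    }

    public void unlock() { held.set(0); }
}
```

Spinning trades a little CPU under contention for avoiding context switches, which pays off when critical sections are as short as a queue push or pool access.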
19. University of Iowa | Mobile Sensing Laboratory |
Producer-Consumer Throughput
• Garbage collection overhead limits scalability
• Concurrency primitives have a significant impact on performance
(Figure annotations: 13.8x improvement from memory pools, a further 30% from the CSense primitives, 19x total)
20. University of Iowa | Mobile Sensing Laboratory |
• Reentrant locks incur GC due to implicit allocations
• CSense runtime has low garbage collection overhead
Producer-Consumer GC Overhead
no garbage collection (in this benchmark)
21. University of Iowa | Mobile Sensing Laboratory |
• Benefits of flow analysis
• Runtime overhead
MFCC Benchmark
22. University of Iowa | Mobile Sensing Laboratory |
• Flow analysis eliminates unnecessary memory copies
• Benefits of larger but efficient frame allocations
– reduced number of component invocations and disk I/O overhead
– increased cache locality
MFCC Benchmark CPU Usage
45% decrease
23. University of Iowa | Mobile Sensing Laboratory |
• Runtime overhead is low for a wide range of data rates
MFCC Runtime Overhead
1.83%
2.39%
24. University of Iowa | Mobile Sensing Laboratory |
• Programming model
– efficient memory management
– flexible concurrency model
– rich type system
• Compiler
– whole-application configuration & optimization
– static and flow analyses
• Efficient runtime environment
• Evaluation
– implemented three typical MSAs
– benchmarks indicate significant performance improvements
• 19X throughput boost compared with naïve Java baseline
• 45% CPU time reduced with flow analysis
• Low garbage collection overhead
Conclusions
25. University of Iowa | Mobile Sensing Laboratory |
• National Science Foundation (NeTS grant #1144664)
• Carver Foundation (grant #14-43555)
Acknowledgements
26. University of Iowa | Mobile Sensing Laboratory |
• Runtime scheduler overhead of a complex six-domain application that accesses both phone sensors and remote Shimmer motes over Bluetooth
ActiSense Benchmark
28. University of Iowa | Mobile Sensing Laboratory |
• Overall domain scheduler overhead is small despite a longer
pipeline
ActiSense CPU Usage
(Figure annotations: Shimmer at 50 Hz, Phone at 60 Hz)
29. University of Iowa | Mobile Sensing Laboratory |
AudioSense
30. University of Iowa | Mobile Sensing Laboratory |
AudioSense
Speaker notes
With the popularity of smart devices, there is increasing demand for mobile sensing applications that capture and analyze physical activities, social interactions, and ambient information from rich sensors. Here are two typical mobile sensing applications: the top one is Speaker Identification and the bottom one is Activity Recognition. Both applications work in a similar way. First, they collect sensor data, which may be local or remote. Next, features are extracted from the sensor data. Finally, the features may be used to perform real-time classification or uploaded to a remote server for offline recognition.
Though these applications are conceptually straightforward, they are not trivial for programmers to implement efficiently, due to the following challenges. The first challenge is concurrency: MSAs are inherently multi-threaded because sensor reading, network communication, and interaction with users and the environment may happen concurrently, and multi-threading is usually error-prone due to data races and even deadlocks. The next challenge is high frame rates. For example, audio and video sources tend to produce a large amount of data constantly, which stresses memory management. The third challenge is robustness. Mobile sensing applications are usually expected to run long-term data collection in the background; it would be unacceptable to bother users with crashes or restarts. So far, our main target is the Android platform, where the underlying Java virtual machine worsens these problems because of higher computational overhead and non-deterministic garbage collection. Therefore, we propose the CSense toolkit to address these challenges without sacrificing performance.
Before introducing the design of CSense, I would like to go through the related work. First, in terms of support for MSAs, prior work like SeeMon, Coordinator, and JigSaw requires programmers to use special constructs to develop specific types of sensing applications. CSense, on the other hand, provides a high-level stream programming abstraction that is general and suitable for a broad range of MSAs. Second, CSense builds on data flow models, which fall into two categories. One is the synchronous data flow, like StreamIt and Lustre, which enables static scheduling and optimizations; however, if you need to process asynchronous events, you are on your own to adapt it. The other is the asynchronous data flow, like Click and XStream, which provides asynchronous constructs but sacrifices some performance. CSense adopts the asynchronous data flow model but improves its performance with compile-time analysis.
For the remainder of the talk, I will introduce the CSense programming model, compiler, and runtime environment. Evaluation results will be presented afterwards.
Here is the programming model. An MSA is represented as an SFG, a directed acyclic graph whose nodes are implemented as components connected through input and output ports. The Speaker Identification application is shown as an example. The following Java code segment shows how we create and wire the components.
What differs between CSense and previous work is the focus on memory management. The goal here is to reduce the memory overhead introduced by garbage collection and copy operations. The CSense programming model not only adopts pass-by-reference semantics to facilitate data sharing between components but also makes memory management explicit in the SFG, which focuses programmers' attention on memory operations, tracks data exchanges globally, and allows for an efficient implementation.
Let's take a look at the memory management in the Speaker Identification example. In an SFG, only two special kinds of components, called sources and taps, are allowed to perform memory management. Sources implement memory pools and pre-allocate frames. During execution, frames are taken from the pools and flow from sources, through links, to taps; the tap puts the frame back into the memory pool to ensure no leaks. In this example, there are three sources: the audio component, S1, and S2. The data flows follow the colored links and reach the corresponding taps. If a frame is shared between components, its associated reference counter is incremented; when the frame reaches a tap, the counter is decremented, and once it reaches zero, the frame is returned to its memory pool.
As for the concurrency challenge, the goal is to expose a concurrency model that may be analyzed statically. The idea is to partition the components of an SFG into execution domains. A domain is a connected subgraph of components executed on a single thread. Any frame exchange between domains is mediated by a shared queue; other data sharing between components goes through a tuple space. Currently, the CSense programming model provides several concurrency constraints, such as NEW_DOMAIN and SAME_DOMAIN. Based on the domain partition information, it is possible for compiler analysis to identify data races.
Here is the same example. The audio and httpPost components declare new domains. The domain partitioning starts from these two components and expands in the downstream direction, adding each remaining component to the domain of an adjacent component. After partitioning the SFG, the first four components, including S1, S2, and T1 to T3, are in one domain, while httpPost and T4 are in the other. With this information, the compiler can transform the graph by inserting a shared queue between the two domains for data exchange. Another concurrency option is SAME_DOMAIN. This annotation is used for a group component composed of several related subcomponents; it makes sense to place those in the same domain to avoid cross-domain data exchange overhead.
Next, we introduce the type system, which extends Java's generic types. It is designed to ensure the correctness of component composition and to facilitate efficient component reuse across applications. In an SFG, all input and output ports are typed and allow programmers to specify frame size constraints; the frame size is the amount of data produced or consumed at once by a component through a port. Here is the code segment showing how to specify the type constraints.
Apparently, many frame size configurations satisfy the constraints; however, not all of them can be implemented efficiently. In this example, let's focus on the output port typed energyT and the input port typed speechT. The energyT type constrains the output frame size to be greater than 8,000 and less than 24,000, while speechT constrains the input frame size to be exactly 128. Now consider the following two configurations. The first sets the energyT frame size to 10,000, which is not a multiple of 128, so there is no efficient frame size conversion: the frame remainder causes additional memory copies. In contrast, the second configuration sets the energyT frame size to 10,240, which divides into exactly 80 speechT frames and therefore allows an efficient frame size conversion.
Now, to make this general, we introduce the concept of a multiplier: the number of executions it takes a component to produce or consume an entire frame. What the flow analysis does is find constrained frame sizes and multipliers that result in a common multiple. The common multiple is the resulting frame size to allocate, and it is represented as an equality constraint added implicitly by the compiler.
Next, to apply the flow analysis to the entire SFG, the compiler formulates an integer program by collecting the constraints for each pair of connected input/output ports. The compiler then calls an external solver to derive a solution for the frame sizes and multipliers, with the objective of minimizing total memory usage. If no such solution exists, the compiler falls back to inefficient configurations and shows a warning; the programmer may relax the constraints to obtain an efficient solution.
So, in summary, given all the information about the SFG of an MSA, the CSense compiler first performs static analysis to prevent composition errors, memory usage errors, and race conditions. Second, the compiler applies flow analysis to derive whole-application frame size configurations for the components. Third, the compiler may transform the SFG by inserting shared queues between domains and type converters between incompatible pairs of input/output ports. In addition, connected MATLAB components may be coalesced. A MATLAB component is created by wrapping the C code that MATLAB Coder generates for a MATLAB function; coalescing combines the MATLAB functions first and then generates a single component, reducing data exchange overhead between the Java space and the native space. Finally, the compiler generates the target Android application code, which links against the native MATLAB functions.
After the application is installed on the target device, it is executed by the CSense runtime, which drives the data flow from sources, through components, to taps. The runtime includes a scheduler for each domain. A scheduler maintains a task queue, an event queue, and an Android wake lock. The task queue allows a component to be scheduled for execution as soon as possible; the event queue allows a component to schedule a delayed event to be processed at a specified time. The wake lock ties into Android power management: whenever no application holds a wake lock, the Android device is soon put into deep sleep. In our schedulers, if the task queue is empty, the scheduler decides whether to release the wake lock and go to sleep.
Next, I am going to present the CSense runtime performance evaluation based on several benchmarks. In addition, we have implemented three MSAs to validate CSense: speaker identification; activity recognition; and a hearing aid survey application for audiology that combines subjective questionnaires with objective data collection to capture the listening context. Here is the experimental setup: we use a Galaxy Nexus running Android 4.2, with MATLAB and MATLAB Coder.
Our first benchmark, producer-consumer, evaluates the performance of data exchange between two domains via a shared queue. We are especially interested in the impact of different memory management options and synchronization primitives. For memory management, there are two configurations for allocating frames: GC stands for garbage collection, where frames are created on demand; MP stands for memory pool, where frames are pre-allocated and reused. As for concurrent access to the shared queue and memory pool, configuration L stands for the Java reentrant lock, and configuration C stands for the CSense atomic-variable-based synchronization primitives, which use the hardware compare-and-swap instruction so that a thread retries acquiring access to a shared resource instead of being suspended on failure.
This figure shows the throughput. The x-axis represents the production rate and the y-axis the consumption rate; ideally, both should be equal. As you can see, GC with L leads to the lowest throughput. Replacing GC with memory pools improves throughput by 13.8 times; replacing the Java reentrant lock with the CSense synchronization primitives improves it by a further 30%. The total throughput improvement is about 19x. This is mainly because GC and the Java reentrant lock cause frequent thread suspensions and context switches. In summary, garbage collection overhead limits scalability, and concurrency primitives have a significant impact on performance.
Next, we want to further understand the garbage collection overhead. In this figure, the x-axis is the production rate and the y-axis is the time spent in garbage collection. As you can see, with memory pools and the CSense synchronization primitives, it is possible to achieve zero garbage collection. If only memory pools are used, the Java reentrant lock still incurs garbage collection because of implicit object creation. In summary, the CSense runtime incurs little garbage collection overhead.
Next, we evaluate the benefits of flow analysis and the runtime overhead in the MFCC benchmark, which is the Speaker Identification application simplified by removing the httpPost component.
Here, we show the benefits of flow analysis as a reduction in CPU usage. In the left figure, the x-axis is the audio sampling rate and the y-axis is the total CPU usage of the benchmark. As you can see, with flow analysis, the total CPU usage is reduced by up to 45% at the highest sampling rate. To further understand this reduction, we break the total down into per-component CPU usage in the right figure, where the x-axis lists the components and the y-axis shows each component's CPU usage. For the MFCC component, flow analysis eliminates unnecessary memory copies and increases cache locality. For the other components, flow analysis leads to larger but efficient frame allocations that reduce the number of component invocations and the disk I/O overhead, especially for components writing to storage.
Finally, we want to understand the CSense runtime overhead. The overhead is computed by subtracting the sum of the component CPU times from the total application CPU time. In the figure, the x-axis represents the sampling rate; the y-axis of the bottom figure shows the overhead as a percentage of total CPU time. As you can see, the percentage is low and does not grow with the workload. In the top figure, we further decompose the runtime overhead into scheduler overhead and sleep overhead. The sleep overhead is incurred when the scheduler calls sleep(), and should be small; the scheduler overhead is the time spent passing frames between components and accessing memory pools. Clearly, the scheduler overhead is even smaller than the sleep overhead. We therefore conclude that the runtime overhead is low for a wide range of data rates.
Alright, I have introduced the main design of the CSense toolkit. In conclusion, the CSense programming model provides efficient memory management, a flexible concurrency model, and a rich type system. The CSense compiler performs whole-application optimization based on static and flow analyses. The CSense runtime is efficient, with low overhead, and is integrated with Android wake locks. We have implemented three typical MSAs to validate CSense, and the benchmarks indicate significant performance improvements from memory pools, the CSense synchronization primitives, and flow analysis.
We especially thank and acknowledge our funding sources.Now, I think it’s time to take your questions.
Accelerometer pipelines involve intensive operations. Domain CPU usage grows with sampling rates and the length of pipelines. Shimmer pipelines involve more components and thus more overhead. Making predictions every second induces a smaller superframe size. (Figure: domain CPU time; Phone at 60 Hz, Shimmer at 50 Hz.)
Electronic surveys; ambient sound samples and GPS. Deployed for six months as part of a clinical study. Reliability = uploaded / collected. Reliability of zero occurred when the server was offline due to power outages; reliability below 100% occurred when participants moved out of wireless signal coverage in the study area. (Figure: reliability during weeklong deployments.)