The document discusses lessons learned from developing a web browser on the Raspberry Pi. Some key optimizations included progressive tiled rendering to improve scrolling performance, avoiding unnecessary image format conversions, using disk caching for images, reducing memory copies for video playback, and supporting hardware decoding and scaling of images and video. Other optimizations involved improving responsiveness, managing tabs and memory usage, speeding up startup, and fixing issues for the Raspberry Pi's ARMv6 processor. The main lessons highlighted were the importance of profiling, watching for relevant upstream improvements, eliminating unnecessary processes, optimizing just-in-time, utilizing all available platform resources, and carefully managing memory and resource allocation.
5. Do developers ever have enough memory and performance?
DEVIEW 2014 5
We are always hungry.
http://images.google.com
6. Optimization
• Dictionary definition
‣ Make the best or most effective use of a situation or resource
‣ In short: improve performance & use resources efficiently
• Usually difficult and tedious work
• Depends on the developer’s experience & know-how
7. Possible approaches
1. Use better hardware, including a faster CPU/GPU & more memory
2. Use parallel programming to take advantage of multi-core CPUs
3. Use the GPU through OpenGL/ES to improve rendering performance
4. Just turning off the screen and going outside to play…?
9. Raspberry Pi is a good example of such a poor environment
• Old single core CPU
‣ ARMv6, 700MHz
• Very limited system memory
‣ 512MB shared with GPU
• Not redundant storage
• Bad OpenGL ES integration with windowing system.
10. All problems come from here
http://en.wikipedia.org/wiki/Raspberry_Pi#mediaviewer/File:Raspberrypi_pcb_overview_v04.svg
13. Requirements
A modern & fast HTML5 browser
• Multi-Tab browsing
• HTML5 & CSS3
• HTML5 Video/Audio support (YouTube should run well)
• Responsive user interface
• Low memory footprint
15. Achievements
We’ve improved WebKit1 + Epiphany
• Progressive tiled rendering for smoother scrolling
• Avoid useless image format conversions
• Disk image cache
• Reduction of the number of memory copies to play videos
• Memory pressure handler support by using cgroup
• Better YouTube support, including on-demand loading of embedded YouTube videos for a much faster page load
16. Achievements
We’ve improved WebKit1 + Epiphany
• Faster fullscreen playback using dispmanx directly
• Hardware decoding of image & video through OMX
• Hardware scaling of video through gst-omx
• More responsive UI & scrolling even under heavy load
• Memory & CPU friendly tab management
• Startup is 3x faster
• Javascript JIT fixes for ARMv6
18. Progressive tiled rendering for smoother scrolling
• Scrolling doesn’t block even if the content is not yet available; instead we fill the area with a checkered pattern.
http://ariya.ofilabs.com/2011/06/progressive-rendering-via-tiled-backing-store.html
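The non-blocking scroll described on this slide can be sketched roughly as follows. This is an illustration only, not the WebKit implementation: the class name, the 256-pixel tile size and the `"checker"` stand-in for the checkered bitmap are all assumptions.

```python
from dataclasses import dataclass, field

TILE = 256           # tile edge in pixels (assumed value)
CHECKER = "checker"  # stand-in for the checkered placeholder bitmap


@dataclass
class TiledBackingStore:
    """Keeps rendered tiles and remembers which ones still need painting."""
    tiles: dict = field(default_factory=dict)    # (col, row) -> rendered content
    pending: list = field(default_factory=list)  # tiles waiting to be rendered

    def visible_tiles(self, x, y, width, height):
        """Return the tile coordinates covered by the viewport rectangle."""
        cols = range(x // TILE, (x + width - 1) // TILE + 1)
        rows = range(y // TILE, (y + height - 1) // TILE + 1)
        return [(c, r) for r in rows for c in cols]

    def scroll_to(self, x, y, width, height):
        """Never blocks: tiles that are not ready are shown as a checker pattern."""
        frame = {}
        for key in self.visible_tiles(x, y, width, height):
            if key in self.tiles:
                frame[key] = self.tiles[key]
            else:
                frame[key] = CHECKER
                if key not in self.pending:
                    self.pending.append(key)
        return frame

    def render_one(self, paint):
        """Meant to run in idle time: paint a single pending tile, then return."""
        if self.pending:
            key = self.pending.pop(0)
            self.tiles[key] = paint(key)
```

Painting one tile per idle callback keeps each unit of work small, so the scroll handler can run again between tiles.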
19. Avoid useless image format conversions
• Try to use internal buffers with the same depth (16 or 32 bits) to avoid format conversions
‣ Raspberry Pi uses a 16-bit (RGB16_565) buffer by default.
‣ By default, images (JPEG, PNG, GIF) and video were decoded into 32-bit (ARGB32) buffers.
‣ With matching depths, we could use a cairo image surface, which can be painted quickly to the target.
[Diagram: bit depths of the JPEG, PNG, GIF and video buffers, the TBS (16bit) and the GtkWidget (16bit)]
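To see why a depth mismatch costs CPU, here is a small sketch of what painting a 32-bit ARGB32 buffer onto a 16-bit RGB565 target involves. It is not cairo or pixman code; the function names are hypothetical, and pixels are modelled as plain integers.

```python
def argb32_to_rgb565(pixel):
    """Pack one 32-bit ARGB pixel into 16-bit RGB565, dropping alpha."""
    r = (pixel >> 16) & 0xFF
    g = (pixel >> 8) & 0xFF
    b = pixel & 0xFF
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def paint(source, source_depth, target_depth):
    """Return (pixels, conversions); equal depths need only a plain copy."""
    if source_depth == target_depth:
        return list(source), 0
    if (source_depth, target_depth) == (32, 16):
        # Every single pixel has to be unpacked, shifted and repacked.
        return [argb32_to_rgb565(p) for p in source], len(source)
    raise ValueError("unsupported depth pair in this sketch")
```

With matching depths the conversion count is zero, which is the saving the slide refers to.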
20. Disk image cache
• We enhanced WebKit’s disk image cache module for POSIX systems.
• Decoded images are kept in memory-mapped files as caches
• Saves CPU by avoiding repeated decoding
• Saves memory by using local disk space
• Not a magic wand: big images over 20KB, animated GIFs
[Diagram: decoded image, local disk space, physical memory]
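A minimal sketch of the idea, assuming decoded pixels are written once to a file and then mapped read-only. The class name, the SHA-1 file naming and the `decode` callback are hypothetical and are not taken from WebKit.

```python
import hashlib
import mmap
import os


class DiskImageCache:
    """Stores decoded pixels in files and maps them back in on demand."""

    def __init__(self, directory):
        self.directory = directory
        self.decode_count = 0  # how many real decodes happened

    def _path(self, url):
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, name + ".pixels")

    def get(self, url, decode):
        """Return an mmap of the decoded pixels, decoding only on a miss."""
        path = self._path(url)
        if not os.path.exists(path):
            self.decode_count += 1
            with open(path, "wb") as f:
                f.write(decode(url))  # decode() must return non-empty bytes
        with open(path, "rb") as f:
            # File-backed pages can be dropped by the kernel under memory
            # pressure and read back from disk later without re-decoding.
            return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

The second request for the same URL skips the decoder entirely, which corresponds to the CPU saving on the slide; the memory saving comes from the pages being backed by the file rather than by anonymous memory.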
21. Reduction of the number of memory copies to play video
• The video needs to be blitted to the screen, and that involved memory copies for no reason.
• If the cairo surface of the backing store is in system memory, cairo creates an additional surface which wraps a SHM pixmap and copies into this pixmap before copying into the final drawable.
‣ cairo_surface_create_similar
• When the GdkWindow already has a cairo surface which wraps an X drawable, it is friendly to cairo image surfaces.
‣ Ensured that by calling gdk_cairo_create
‣ cairo_surface_create_similar_image
• When used correctly, this prevents cairo from calling XShmCreatePixmap every time the backing store is copied to the window.
• Available from gtk+3.10
22. [Diagram: two versions of the video path. Labels: Video, gst buffer, cairo surface for video, cairo surfaces for TBS, cairo image surface for video, cairo image surfaces for TBS, SHM pixmap, GtkWidget]
23. Memory pressure handler support through cgroups
• Control groups (cgroups) is a Linux kernel feature to limit, account for, and isolate the resource usage (CPU, memory, disk I/O etc.) of process groups.
‣ Merged into kernel version 2.6.24
‣ Resource limiting : groups can be set to not exceed a set memory limit
‣ Prioritization : some groups may get a larger share of CPU or disk I/O throughput
‣ Accounting : to measure how much resources certain systems use
‣ Control : freezing groups or checkpointing and restarting.
• We implemented a memory pressure handler for POSIX systems in WebKit by using cgroups.
• When the RPi system comes under memory pressure, we free all unnecessary caches and memory and also run the garbage collector, depending on the pressure level, to avoid OOM.
• Not a magic wand : what if the OOM is caused by other applications, not the browser?
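The escalation by pressure level might look like the sketch below. It is not the WebKit handler: it polls the cgroup v1 files `memory.usage_in_bytes` and `memory.limit_in_bytes` instead of subscribing to kernel notifications, and the 80% / 95% thresholds, the function names and the mapping of levels to actions are assumptions. The cgroup directory is passed in by the caller.

```python
import os


def read_int(path):
    """Read a single integer from a cgroup control file."""
    with open(path) as f:
        return int(f.read().strip())


def pressure_level(cgroup_dir, medium=0.80, critical=0.95):
    """Classify memory pressure from usage relative to the group's limit."""
    usage = read_int(os.path.join(cgroup_dir, "memory.usage_in_bytes"))
    limit = read_int(os.path.join(cgroup_dir, "memory.limit_in_bytes"))
    ratio = usage / limit
    if ratio >= critical:
        return "critical"
    if ratio >= medium:
        return "medium"
    return "low"


def respond(level, free_caches, collect_garbage):
    """Escalate with the level and return the names of the actions taken."""
    actions = []
    if level in ("medium", "critical"):
        free_caches()
        actions.append("free_caches")
    if level == "critical":
        collect_garbage()
        actions.append("collect_garbage")
    return actions
```

As the slide notes, this only helps when the browser itself holds the memory; freeing browser caches does nothing about another process that is exhausting the system.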
24. Better YouTube support
• HTML5 video is required.
• YouTube has its own heavy UI
• Inject some simple JavaScript code which gets the URL of the video stream and creates a <video> for it.
• Get thumbnails through the YouTube Data API, and get the video in a way similar to youtube-dl
• This allows us to block some extra JS on YouTube that was using a lot of CPU
• Block the comment section on YouTube, since it took 30 seconds to fully load.
• Embedded YouTube videos took too much time to load as well.
• We just load a fake placeholder showing the thumbnail and a fake play button.
• When a user clicks on it, the real video is actually loaded. This made loading pages with a lot of videos much faster.
30.
<video width="xxx" height="yyy" src="A video URL extracted from youtube" controls></video>
31.
var posterData = download_webpage(
    'http://gdata.youtube.com/feeds/api/videos/' + this.videoId + '?v=2&alt=json');
1. Show a thumbnail and a fake play button.
2. On click, inject the video wrapper.
3. The actual video is then loaded.
Pretty useful for heavy pages embedding many YouTube videos.
var url = 'http://www.youtube.com/watch?v=' + video_id + '&gl=US&hl=en&has_verified=1';
var video_webpage = download_webpage(url);
32. Faster fullscreen playback using dispmanx directly
• Fullscreen mode is a largely self-contained feature.
‣ It just shows the video and the controls.
‣ Nothing needs to be done except copying decoded video frames and drawing the controls if necessary.
‣ The backing store does not need to be updated at all in fullscreen mode.
• Dispmanx
‣ Part of the VideoCore library
‣ A windowing system in the process of being deprecated in favor of OpenWF
‣ Provides useful APIs for creating layers the GPU understands and for scaling or moving those layers.
• We wrote the raw video data directly into a dispmanx plane and scaled it to fit the screen on the GPU.
• Not updating the backing store and scaling video on the GPU save a lot of CPU.
• A fake cursor is required because the GPU plane is poorly integrated into the windowing system.
34.
• Dispmanx plane 4: Cursor
‣ A fake cursor image
• Dispmanx plane 2, 3: Controls
‣ Filled with controls images
• Dispmanx plane 1: Video
‣ Filled with raw video data. Scaling is performed by the GPU.
• TBS: Cairo surface in GtkWidget
‣ Completely hidden by the video plane, so we don’t need to update it at all.
35. Hardware decoding of image & video through OpenMAX
• Raspberry Pi supports OpenMAX (shortened to “OMX”)
• OpenMAX
‣ A set of C-language programming interfaces that provides abstractions for routines especially useful for audio, video, and still image processing.
‣ Provides 3 layers of interfaces: AL (application layer), IL (integration layer) and DL (development layer)
• OpenMAX DL is especially useful for decoding images and video.
‣ AC : Audio Codecs (MP3 decoder & AAC decoder components) - cannot be used because of a licensing issue!
‣ IC : Image Codecs (JPEG components)
‣ IP : Image Processing (generic image processing functions)
‣ SP : Signal Processing (generic audio processing functions)
‣ VC : Video Codecs (H.264 & MP4 components)
• JPEG is decoded with OMX in WebKit
• gst-omx is used to decode video with OMX in GStreamer.
‣ http://cgit.freedesktop.org/gstreamer/gst-omx
36. Hardware scaling of video through gst-omx
• Video on the web is often not displayed at its natural size, so it needs to be scaled.
• We enhanced gst-omx to scale the video through OMX as well.
<video width="760" height="340" controls>
37. More responsive UI and scrolling even under heavy load
• Progressive tiled backing store.
‣ Progressive tile-based rendering on scroll, as mobile browsers do
‣ With the TBS we reduce the absolute amount of drawing, so UI events get more chances to be handled.
• Suspend JavaScript and animations while scrolling
‣ WebKit1 runs JS and rendering on a single thread in a single process, so we could not get scroll events while JS was running.
‣ This is not perfect yet, since we could not stop JavaScript functions that were already running.
• Tune priorities among events
‣ Make sure UI event handling has a higher priority than other work.
‣ Tweak event priorities very carefully. The right choice is quite conditional.
‣ ex) Wiggling the mouse may starve drawing events.
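The priority tuning and its starvation risk can be illustrated with a toy event loop. This is a sketch, not the GLib main loop that GTK actually uses; the class name and the priority values are assumptions, with a lower number meaning more urgent.

```python
import heapq
import itertools

PRIORITY_INPUT = 0   # UI events run first (assumed value)
PRIORITY_DRAW = 10   # drawing runs when no input is pending (assumed value)


class MainLoop:
    """Dispatches the most urgent pending event first."""

    def __init__(self):
        self._queue = []
        self._order = itertools.count()  # keeps FIFO order within one priority

    def post(self, priority, name):
        heapq.heappush(self._queue, (priority, next(self._order), name))

    def run_one(self):
        """Pop and return the name of the most urgent event, or None."""
        if not self._queue:
            return None
        return heapq.heappop(self._queue)[2]


# Starvation demo: one draw is queued, then a new mouse event arrives
# before every dispatch. The draw is never reached.
loop = MainLoop()
loop.post(PRIORITY_DRAW, "draw")
handled = []
for _ in range(5):
    loop.post(PRIORITY_INPUT, "mouse-move")
    handled.append(loop.run_one())
```

After the loop above, `handled` contains only mouse events, which is the wiggling-mouse problem from the slide in miniature.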
38. Memory & CPU friendly tab management
• Unload tabs if too many (more than 3) are in use.
• Slow down JavaScript on background tabs.
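One way to picture the tab unloading is the sketch below. The limit of 3 comes from the slide; choosing the least recently used tab as the one to unload is an assumption, as are the class and attribute names.

```python
from collections import OrderedDict

MAX_LOADED_TABS = 3  # threshold quoted on the slide


class TabManager:
    """Keeps at most max_loaded tabs in memory; the rest keep only their URL."""

    def __init__(self, max_loaded=MAX_LOADED_TABS):
        self.max_loaded = max_loaded
        self.loaded = OrderedDict()  # url -> page state, least recently used first
        self.unloaded = set()        # tabs whose page was dropped from memory

    def activate(self, url):
        """Bring a tab to the foreground, reloading it if it was unloaded."""
        self.unloaded.discard(url)
        self.loaded[url] = "loaded"
        self.loaded.move_to_end(url)
        while len(self.loaded) > self.max_loaded:
            victim, _ = self.loaded.popitem(last=False)
            self.unloaded.add(victim)
```

An unloaded tab stays visible in the tab bar and is simply loaded again the next time it is activated.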
39. Start up is 3x faster
• Optimized Adblock
‣ Adblock is built into Epiphany and is loaded automatically at startup.
‣ Use regular expressions only when needed.
‣ Reuse parsed regular expressions instead of recreating the same one every time.
‣ Load Adblock filters asynchronously.
‣ Avoid running the tool that converts Epiphany config files from one version to another when it is not needed.
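The two regex-related points can be sketched like this. It does not reproduce Epiphany’s Adblock code: the filter representation as `(literal_hint, regex_pattern)` pairs and the function names are hypothetical.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def compiled(pattern):
    """Compile each filter pattern once and reuse it afterwards."""
    return re.compile(pattern)


def is_blocked(url, filters):
    """filters is a list of (literal_hint, regex_pattern) pairs."""
    for hint, pattern in filters:
        # Cheap substring test first; the regex runs only when it could match.
        if hint not in url:
            continue
        if compiled(pattern).search(url):
            return True
    return False
```

For example, with `filters = [("ads", r"//ads\d*\.")]` a URL without the substring `ads` never reaches the regex engine at all.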
40. Javascript JIT fixes for ARMv6
• Backported latest JIT related changes into our working WebKit.
• Bug fix for ARMv6
42. Lesson 1. Profiling, Profiling & Profiling
• Measuring CPU, memory and time will show you the way to go.
• Profiling depends heavily on the developer’s experience.
• Do not hesitate to share your know-how with your colleagues.
• Do not be afraid of learning new tools.
• Ex) The perf tool is very useful on Linux.
‣ Install relevant debug packages
‣ sudo apt-get install linux-tools
‣ sudo perf record -a -g -o perf.data
‣ sudo perf report -g -i perf.data
44. Lesson 2. Keep watching upstream
ARMv6 is not a popular AP nowadays. Nobody cares. BUT…
• You’re not the only one concerned about the problem!
• JIT compiler enabled on ARMv6
• Optimized pixman and libav for ARMv6
45. Lesson 3. Suspect useless, stupid and repeated things
• Paint directly instead of using a timer-based drawing mechanism.
• Disk image cache
• Reduction of the number of memory copies to play video
• Unique feature, fullscreen mode
• Avoid useless image format conversions
46. Lesson 4. Just In Time
• Progressive tiled backing store.
• Suspend javascript and animations if necessary.
• Optimized Adblock
47. Lesson 5. If it is hackish but feasible, it is O.K.
• Used mobile version pages for some sites.
• Better YouTube support by injecting custom video tag wrapper.
• Faster fullscreen video.
48. Lesson 6. Utilize all available resources in the platform
• Disk image cache
‣ Trade-off between memory and local disk space.
• OMX (OpenMAX) for decoding video and images
‣ Decodes video through the GPU, not the CPU.
• OMX for scaling video and images
‣ Scales video through the GPU, not the CPU.
49. Lesson 7. Careful resource reallocation
• Throttle video to at most 30fps.
• Tune priorities among events
• Memory pressure handler by using cgroup
• Unload tabs if too many are in use.
• Slow down javascript on background tabs.
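As an example of the first item, a frame-rate cap can be as simple as dropping frames that arrive too soon after the last one shown. This is a sketch under assumptions: the class name is hypothetical, and timestamps are integer microseconds to avoid floating-point rounding.

```python
class FrameThrottle:
    """Lets a frame through only if enough time has passed since the last one."""

    def __init__(self, max_fps=30):
        self.min_interval_us = 1_000_000 // max_fps
        self.last_shown_us = None

    def should_show(self, timestamp_us):
        """timestamp_us is the frame's presentation time in microseconds."""
        if (self.last_shown_us is not None
                and timestamp_us - self.last_shown_us < self.min_interval_us):
            return False  # too soon: drop this frame
        self.last_shown_us = timestamp_us
        return True
```

Fed with a 60fps stream, this shows every other frame; a stream at 30fps or below passes through unchanged.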
50. Conclusion
• Optimization is literally finding the best solution to fit your purpose or platform.
• It depends on your situation, so there can be many different ways to do it.
• A software engineer should not expect better hardware to do the work for them.
• No magic, no universal solution for optimization
• Imagine your own way; don’t be afraid of trying your idea.