Victor Erukhimov: OpenVX, mixAR Moscow, Sept '15

© Copyright Khronos Group 2014
Vision Acceleration
mixAR, September 2015
Victor Erukhimov
Itseez, Itseez3D

Itseez
• Real time computer vision solutions on
embedded platforms:
– Mobile products: Itseez3D, Facense
– Automotive: driver assistance systems
– Ecosystem: OpenCV, OpenVX

S C A N N E R
VICTOR ERUKHIMOV
victor.erukhimov@itseez3d.com
Capture the world in 3D!

Embedded
vision
challenges
•Intense and power hungry computations
•Need to run in real-time on
embedded/mobile/wearable devices
•Very few specialized hardware products
•Software ecosystem not ready for embedded real-
time scenarios


Khronos Connects Software to Silicon
Open Consortium creating
ROYALTY-FREE, OPEN STANDARD
APIs for hardware acceleration
Defining the roadmap for
low-level silicon interfaces
needed on every platform
Graphics, compute, rich media,
vision, sensor and camera
processing
Rigorous specifications AND
conformance tests for cross-
vendor portability
Acceleration APIs
BY the Industry
FOR the Industry
Well over a BILLION people use Khronos APIs Every Day…
http://accelerateyourworld.org/

Khronos Standards
Visual Computing
- 3D Graphics
- Heterogeneous Parallel Computing
3D Asset Handling
- 3D authoring asset interchange
- 3D asset transmission format
with compression
Acceleration in HTML5
- 3D in browser – no Plug-in
- Heterogeneous computing for JavaScript
Over 100 companies defining royalty-free
APIs to connect software to silicon
Sensor Processing
- Vision Acceleration
- Camera Control
- Sensor Fusion

Mobile Vision Acceleration = New Experiences
Augmented
Reality
Face, Body and
Gesture Tracking
Computational
Photography and
Videography
3D Scene/Object
Reconstruction
Need for advanced sensors
and the acceleration to
process them

Visual Computing = Graphics PLUS Vision
Real-time GPU Compute
Research project on GPU-accelerated laptop
High-Quality Reflections, Refractions, and Caustics in Augmented
Reality and their Contribution to Visual Coherence
P. Kán, H. Kaufmann, Institute of Software Technology and Interactive
Systems, Vienna University of Technology, Vienna, Austria
https://www.youtube.com/watch?v=i2MEwVZzDaA
Imagery
Data
Vision
Processing
Graphics
Processing
Enhanced sensor
and vision
capability deepens
the interaction
between real and
virtual worlds

Vision Pipeline Challenges and Opportunities
Sensor Proliferation
Diverse sensor awareness of the user and surroundings
• Light / Proximity
• 2 cameras
• 3 microphones
• Touch
• Position
- GPS
- WiFi (fingerprint)
- Cellular trilateration
- NFC/Bluetooth Beacons
• Accelerometer
• Magnetometer
• Gyroscope
• Pressure / Temp / Humidity
Growing Camera Diversity
Capturing color, range and lightfields
• Camera sensors >20MPix
• Novel sensor configurations
• Stereo pairs
• Plenoptic Arrays
• Active Structured Light
• Active TOF
Diverse Vision Processors
Driving for high performance and low power
• Multi-core CPUs
• Programmable GPUs
• DSPs and DSP arrays
• Camera ISPs
• Dedicated vision IP blocks
Flexible sensor and camera control to generate required image stream
Use best processing available for image stream processing – with code portability
Control/fuse vision data by/with all other sensor data on device

Vision Processing Power Efficiency
• Depth sensors = significant processing
- Generate/use environmental information
• Wearables will need ‘always-on’ vision
- With smaller thermal limit / battery than phones!
• GPUs have 10x the imaging power efficiency of CPUs
- GPUs architected for efficient pixel handling
• Traditional cameras have dedicated hardware
- ISP = Image Signal Processor – on all SOCs today
• SOCs have space for more transistors
- But can’t turn on at same time = Dark Silicon
• Potential for dedicated sensor/vision silicon
- Can trigger full CPU/GPU complex
[Figure: power efficiency vs. computation flexibility. Multi-core CPU (x1), GPU compute (x10), dedicated hardware (x100). Other labels: Advanced Sensors, Wearables]
But how to program specialized processors?
Performance and Functional Portability

OpenVX – Power Efficient Vision Acceleration
• Out-of-the-Box vision acceleration framework
- Enables low-power, real-time applications
- Targeted at mobile and embedded platforms
• Functional Portability
- Tightly defined specification
- Full conformance tests
• Performance portability across diverse HW
- Higher-level abstraction hides hardware details
- ISPs, Dedicated hardware, DSPs and DSP arrays,
GPUs, Multi-core CPUs …
• Enables low-power, always-on acceleration
- Can run solely on dedicated vision hardware
- Does not require full SOC CPU/GPU complex to
be powered on
[Figure: several Application blocks above several Vision Accelerator blocks]

OpenVX Graphs – The Key to Efficiency
• Vision processing directed graphs for power and performance efficiency
- Each Node can be implemented in software or accelerated hardware
- Nodes may be fused by the implementation to eliminate memory transfers
- Processing can be tiled to keep data entirely in local memory/cache
• VXU Utility Library for access to single nodes
- Easy way to start using OpenVX by calling each node independently
• EGLStreams can provide data and event interop with other Khronos APIs
- BUT use of other Khronos APIs is not mandated
Example OpenVX Graph
[Figure: native camera control, a graph of four OpenVX nodes, and downstream application processing]

OpenVX 1.0 Function Overview
• Core data structures
- Images and Image Pyramids
- Processing Graphs, Kernels, Parameters
• Image Processing
- Arithmetic, Logical, and statistical operations
- Multichannel Color and BitDepth Extraction and Conversion
- 2D Filtering and Morphological operations
- Image Resizing and Warping
• Core Computer Vision
- Pyramid computation
- Integral Image computation
• Feature Extraction and Tracking
- Histogram Computation and Equalization
- Canny Edge Detection
- Harris and FAST Corner detection
- Sparse Optical Flow
• OpenVX 1.0 defines framework for creating, managing and executing graphs
• Focused set of widely used functions that are readily accelerated
• OpenVX Specification Is Extensible
- Implementers can add functions as extensions
- Widely used extensions adopted into future versions of the core
- Khronos maintains extension registry

Example Graph - Stereo Machine Vision
[Figure: OpenVX graph. Inputs: Camera 1, Camera 2. Nodes: two "Stereo Rectify with Remap" nodes, Compute Depth Map (User Node), Image Pyramid, Delay, Compute Optical Flow, Detect and track objects (User Node). Output: object coordinates]
Tiling extension enables user nodes (extensions) to also optimally run in local memory

OpenVX and OpenCV are Complementary
(Each row compares OpenCV with OpenVX)
• Governance
- OpenCV: Community driven open source with no formal specification
- OpenVX: Formal specification defined and implemented by hardware vendors
• Conformance
- OpenCV: No conformance tests for consistency, and every vendor implements a different subset
- OpenVX: Full conformance test suite / process creates a reliable acceleration platform
• Portability
- OpenCV: APIs can vary depending on processor
- OpenVX: Hardware abstracted for portability
• Scope
- OpenCV: Very wide; 1000s of imaging and vision functions; multiple camera APIs/interfaces
- OpenVX: Tight focus on hardware accelerated functions for mobile vision; use external camera API
• Efficiency
- OpenCV: Memory-based architecture; each operation reads and writes memory
- OpenVX: Graph-based execution; optimizable computation, data transfer
• Use Case
- OpenCV: Rapid experimentation
- OpenVX: Production development & deployment

OpenVX Announcement
• Finalized OpenVX 1.0.1 specification released June 2015
- www.khronos.org/openvx
• Full conformance test suite and Adopters Program immediately available
- $20K Adopters fee ($15K for members) – working group reviews submitted results
- Test suite exercises graph framework and functionality of each OpenVX 1.0 node
- Approved Conformant implementations can use the OpenVX trademark
• Open source sample implementation of OpenVX 1.0.1 released

Khronos APIs for Vision Processing
GPU Compute Shaders (OpenGL 4.X and OpenGL ES 3.1)
Pervasively available on almost any mobile device or OS
Easy integration into graphics apps – no vision/compute API interop needed
Program in GLSL not C
Limited to acceleration on a single GPU
OpenCL: General Purpose Heterogeneous Programming Framework
Flexible, low-level access to any devices with OpenCL compiler
Single programming and run-time framework for CPUs, GPUs, DSPs, hardware
Open standard for any device or OS – being used as a backend by many languages and frameworks
Needs full compiler stack and IEEE precision
OpenVX: Out of the Box Vision Framework - Operators and graph framework library
Can run some or all nodes on dedicated hardware – no compiler needed
Higher-level abstraction means easier performance portability to diverse hardware
Graph optimization opens up possibility of low-power, always-on vision acceleration
Fixed set of operators – but can be extended
It is possible to use OpenCL or GLSL to build OpenVX Nodes on programmable devices!

Example : Feature tracking Graph
[Figure: Image capture API → RGB frame → color convert → frameYUV → channel extract → frameGray → pyramid → pyr 0 → optical flow pyrLK → array of keypoints → Display API. Delay objects: pyr_delay (pyramids pyr 0, pyr -1) and pts_delay (keypoint arrays pts 0, pts -1)]

First processing : Keypoint detection
[Figure: RGB image → color convert → frameYUV → channel extract → frameGray → Harris corner and pyramid, with results stored in pts_delay and pyr_delay]

Context & Data Objects Creation
// Create the ‘OpenVX world’
vx_context context = vxCreateContext();
// Create image objects necessary for the RGB -> Y transformation
vx_image frameYUV = vxCreateImage(context, width_, height_, VX_DF_IMAGE_IYUV);
vx_image frameGray = vxCreateImage(context, width_, height_, VX_DF_IMAGE_U8);
// Image pyramids for two successive frames are necessary for the computation.
// A delay object with 2 slots is created for this purpose
vx_pyramid pyr_exemplar = vxCreatePyramid(context, 4, VX_SCALE_PYRAMID_HALF,
width_, height_, VX_DF_IMAGE_U8);
vx_delay pyr_delay = vxCreateDelay(context, (vx_reference)pyr_exemplar, 2);
vxReleasePyramid(&pyr_exemplar);
// Tracked points need to be stored for two successive frames.
// A delay object with 2 slots is created for this purpose
vx_array pts_exemplar = vxCreateArray(context, VX_TYPE_KEYPOINT, 2000);
vx_delay pts_delay = vxCreateDelay(context, (vx_reference)pts_exemplar, 2);
vxReleaseArray(&pts_exemplar);

Initial step: Keypoint Detection
// RGB to Y conversion
vxuColorConvert(context, frameRGB, frameYUV);
vxuChannelExtract (context, frameYUV, VX_CHANNEL_Y, frameGray);
// Keypoint detection : Harris corner
vx_float32 strength_thresh = 0.0f;
vx_scalar s_strength_thresh = vxCreateScalar(context, VX_TYPE_FLOAT32, &strength_thresh);
vx_float32 min_distance = 3.0f;
vx_scalar s_min_distance = vxCreateScalar(context, VX_TYPE_FLOAT32, &min_distance);
vx_float32 k_sensitivity = 0.04f;
vx_scalar s_k_sensitivity = vxCreateScalar(context, VX_TYPE_FLOAT32, &k_sensitivity);
vx_int32 gradientSize = 3;
vx_int32 blockSize = 3;
vxuHarrisCorners(context, frameGray, s_strength_thresh, s_min_distance,
s_k_sensitivity, gradientSize, blockSize,
(vx_array)vxGetReferenceFromDelay(pts_delay, -1), 0 );
// Create the first pyramid needed for optical flow
vxuGaussianPyramid(context, frameGray, (vx_pyramid)vxGetReferenceFromDelay(pyr_delay, -1));
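The `k_sensitivity` value passed to the Harris detector above is the constant in the textbook Harris corner response. The sketch below is plain C, not OpenVX code: `harris_response` is a hypothetical helper, and the normalization that an OpenVX implementation applies is left out. It only shows how the sensitivity constant separates corners from edges.

```c
#include <assert.h>

/*
 * Textbook Harris response for one pixel (illustration only, not the
 * OpenVX implementation). Inputs are gradient products summed over the
 * block window: a = sum(Ix*Ix), b = sum(Ix*Iy), c = sum(Iy*Iy).
 */
static float harris_response(float a, float b, float c, float k)
{
    assert(k >= 0.0f);               /* sensitivity is a small positive constant */
    float det = a * c - b * b;       /* determinant of the structure tensor */
    float trace = a + c;             /* trace of the structure tensor */
    return det - k * trace * trace;  /* large positive value: corner */
}
```

With k = 0.04, strong gradients in both directions give a positive response, while a gradient in only one direction (an edge) gives a negative one. A pixel would then be kept as a keypoint only if its response exceeds the strength threshold.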

Feature tracking: Graph Creation
vx_graph graph = vxCreateGraph(context);
// RGB to Y conversion nodes
vx_node cvt_color_node = vxColorConvertNode(graph, frame, frameYUV);
vx_node ch_extract_node = vxChannelExtractNode(graph, frameYUV, VX_CHANNEL_Y,
frameGray);
// Pyramid image node
vx_node pyr_node = vxGaussianPyramidNode(graph, frameGray,
(vx_pyramid) vxGetReferenceFromDelay(pyr_delay, 0));
// Lucas-Kanade optical flow node
// Note: keypoints of the previous frame are also given as 'new points estimates'
vx_float32 lk_epsilon = 0.01f;
vx_scalar s_lk_epsilon = vxCreateScalar(context, VX_TYPE_FLOAT32, &lk_epsilon);
vx_uint32 lk_num_iters = 5;
vx_scalar s_lk_num_iters = vxCreateScalar(context, VX_TYPE_UINT32, &lk_num_iters);
vx_bool lk_use_init_est = vx_false_e;
vx_scalar s_lk_use_init_est = vxCreateScalar(context, VX_TYPE_BOOL, &lk_use_init_est);
vx_node opt_flow_node = vxOpticalFlowPyrLKNode(graph,
(vx_pyramid) vxGetReferenceFromDelay(pyr_delay, -1),
(vx_pyramid) vxGetReferenceFromDelay(pyr_delay, 0),
(vx_array) vxGetReferenceFromDelay(pts_delay, -1),
(vx_array) vxGetReferenceFromDelay(pts_delay, -1),
(vx_array) vxGetReferenceFromDelay(pts_delay, 0),
VX_TERM_CRITERIA_BOTH, s_lk_epsilon, s_lk_num_iters,
s_lk_use_init_est, 10);
vxReleaseScalar(&s_lk_epsilon);
vxReleaseScalar(&s_lk_num_iters);
vxReleaseScalar(&s_lk_use_init_est);

Feature tracking: Execution
// Context & data creation
// <…>
// Grab the first image
// <…>
// Keypoints detection
// <…>
// Graph creation
// <…>
// Graph verification (mandatory before executing the graph)
vxVerifyGraph(graph);
// MAIN PROCESSING LOOP
for (;;) {
    // Grab next frame
    // <…>
    // Set the new graph input
    vxSetParameterByIndex(cvt_color_node, 0, (vx_reference)new_frame);
    // Process graph
    vxProcessGraph(graph);
    // ‘Age’ pyramid and keypoint delay objects for the next frame processing
    vxAgeDelay(pyr_delay);
    vxAgeDelay(pts_delay);
}
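The loop relies on delay aging: after each frame, the data at index 0 becomes the data at index -1 without being copied. A minimal plain-C sketch of that behavior follows. It assumes a two-slot ring, and `toy_delay`, `toy_delay_ref` and `toy_delay_age` are hypothetical names for illustration, not OpenVX types or calls.

```c
#include <assert.h>

#define SLOTS 2

/* Toy stand-in for a delay object: a ring of slots plus a rotating origin. */
typedef struct {
    int slot[SLOTS];  /* payload; a real delay holds pyramids or arrays */
    int current;      /* physical slot that logical index 0 maps to */
} toy_delay;

/* Map a logical index (0 = current, -1 = previous) to its slot. */
static int *toy_delay_ref(toy_delay *d, int index)
{
    assert(index <= 0 && index > -SLOTS);
    return &d->slot[(d->current - index) % SLOTS];
}

/* Age the delay: what was at index 0 is now at index -1; no data moves. */
static void toy_delay_age(toy_delay *d)
{
    d->current = (d->current + SLOTS - 1) % SLOTS;
}
```

After aging, index 0 points at the slot that held the oldest data, so the next graph execution overwrites it with the new frame's results.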

OpenVX 1.0.1 extensions
• Tiling extension: more efficient processing of graphs with user nodes
- Provisional spec released
• XML Schema extension: cross-platform graph saving and loading
- Provisional spec released

OpenVX™ User Kernel Tiling
Extension Specification
Motivation and Overview

Tile-based processing
[Figure: an HD input image in large external memory is moved by a DMA engine into small local memory (ping/pong buffers); accelerators apply F1 and then F2, and the HD output image is written back. Annotated access costs: 200 cycles vs. 1 cycle]
• Optimal performance & memory utilization
• Full intermediate (purple) image never exists
• Tedious programming
OpenVX Code
vx_context context = vxCreateContext();
vx_graph graph = vxCreateGraph(context);
vx_image input = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);
vx_image output = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);
// Virtual images belong to a graph, so the graph is created first
vx_image intermediate = vxCreateVirtualImage(graph, 640, 480,
VX_DF_IMAGE_U8);
// vxF1Node / vxF2Node stand for any two node-creation functions
vx_node F1 = vxF1Node(graph, input, intermediate);
vx_node F2 = vxF2Node(graph, intermediate, output);
vxVerifyGraph(graph);
vxProcessGraph(graph);
[Figure: a context containing a graph: input → F1 → intermediate → F2 → output]
OpenVX handles the tiling!

OpenVX 1.0 tiling and user kernels
• An implementation of OpenVX 1.0 can already do tiled processing with the standard kernels
– The user/programmer just needs to be sure to declare intermediate images as “virtual”
– “Virtual” indicates the user will not try to access the intermediate results, so they do not need to be fully allocated/constructed
• Users can already create their own kernels per the existing OpenVX 1.0 specification
– There is a User Kernel section in the OpenVX 1.0 Advanced Framework API section
– But the image data for these user-defined kernels cannot be “tiled”
– Note: a “kernel” is analogous to a C++ “class” and a “node” is analogous to an “instance”. The use of kernels versus nodes enables object-oriented programming within the C programming language
• The new User Kernel Tiling Extension is only needed for tiled processing of user-defined kernels
– The user/programmer needs to provide additional information about their kernel to enable the OpenVX implementation to properly decompose the image into tiles and run the user node on these tiles
– The User Kernel Tiling Extension defines an API that can be used to provide this additional information
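The kernel/node analogy can be sketched in plain C. Everything below is hypothetical and is not the OpenVX user kernel API: `toy_kernel`, `toy_node` and their helpers only illustrate how one kernel (the "class") can back several nodes (the "instances"), each bound to its own data.

```c
#include <assert.h>
#include <stddef.h>

/* A "kernel" plays the role of a class: a name plus the function to run. */
typedef struct {
    const char *name;
    int (*func)(const int *in, int *out, size_t n);
} toy_kernel;

/* A "node" plays the role of an instance: one kernel bound to its data. */
typedef struct {
    const toy_kernel *kernel;
    const int *in;
    int *out;
    size_t n;
} toy_node;

/* Hypothetical user function: doubles every element. */
static int double_func(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) out[i] = 2 * in[i];
    return 0;
}

/* Instantiate a node from a kernel and its parameters. */
static toy_node toy_create_node(const toy_kernel *k, const int *in,
                                int *out, size_t n)
{
    assert(k != NULL && k->func != NULL);
    toy_node node = { k, in, out, n };
    return node;
}

/* Run a node by dispatching through its kernel's function pointer. */
static int toy_run_node(const toy_node *node)
{
    return node->kernel->func(node->in, node->out, node->n);
}
```

Two nodes created from the same kernel can be chained, for example with the first node's output buffer used as the second node's input.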

The User Kernel Tiling Extension
1. The user writes the kernel function to be executed on each tile
– The OpenVX runtime will call this function on a specific tile during vxProcessGraph()
– The extension defines macros this function can use to determine information about the given tile and its parent image
– E.g., the tile’s height and width, the tile’s (x, y) location in the parent image, and the parent image’s height and width
2. The user adds this new kernel to the OpenVX system via vxAddTilingKernel()
– vxAddTilingKernel() takes a name, a pointer to the user’s function, and the number of kernel parameters
3. The user describes each of the kernel’s parameters via vxAddParameterToKernel()
– This is the same function used to describe non-tiled user kernel parameters
4. The user tells OpenVX about its pixel-access behavior via vxSetKernelAttribute()
– Must set the output block size, input neighborhood size, and border mode
5. The user calls vxFinalizeKernel() to indicate that the kernel description is complete

Required user tiling kernel attributes
• VX_KERNEL_ATTRIBUTE_OUTPUT_TILE_BLOCK_SIZE
– The size of the region the user’s kernel prefers to write on each loop iteration
– The OpenVX implementation will ensure that the tile sizes are a multiple of this block size
– Except possibly at the edges of the image
• VX_KERNEL_ATTRIBUTE_INPUT_NEIGHBORHOOD
– The “extra” input pixels needed to compute an output block
– E.g., a pixelwise function has an input neighborhood of 0 on all sides
– A 3x3 filter has a neighborhood of 1, and a 5x5 filter has a neighborhood of 2 (on all sides)
• VX_KERNEL_ATTRIBUTE_BORDER
– Indicates whether the kernel function can correctly handle the odd-sized tiles near the edges of the image (VX_BORDER_MODE_SELF) or
not (VX_BORDER_MODE_UNDEFINED)
• Examples:
– tileBlocksize = (1, 1), Neighborhood = (0, 0, 0, 0): e.g., pixelwise add
– tileBlocksize = (1, 1), Neighborhood = (1, 1, 1, 1): e.g., 3x3 box filter
– tileBlocksize = (1, 1), Neighborhood = (2, 2, 2, 2): e.g., 5x5 box filter
– tileBlocksize = (4, 4), Neighborhood = (0, 0, 0, 0): e.g., 4x4 pixelate
– tileBlocksize = (4, 1), Neighborhood = (2, 2, 2, 2): e.g., SIMD-optimized 5x5 box that writes 4 pixels/cycle
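The input neighborhood determines how much of the input image has to be fetched for each output tile. The plain-C sketch below shows that arithmetic. The `rect` and `neighborhood` structs and the `input_region` helper are hypothetical names for illustration, not the extension's own types, and clipping to the image is just one possible way to treat the borders.

```c
#include <assert.h>

/* Hypothetical types for this sketch; not the extension's own structs. */
typedef struct { int x, y, w, h; } rect;
typedef struct { int left, right, top, bottom; } neighborhood;

/* Clamp v into the closed range [lo, hi]. */
static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Input pixels needed to produce output tile `out`, clipped to the image. */
static rect input_region(rect out, neighborhood nb, int img_w, int img_h)
{
    assert(out.w > 0 && out.h > 0);
    int x0 = clampi(out.x - nb.left, 0, img_w);
    int y0 = clampi(out.y - nb.top, 0, img_h);
    int x1 = clampi(out.x + out.w + nb.right, 0, img_w);
    int y1 = clampi(out.y + out.h + nb.bottom, 0, img_h);
    rect r = { x0, y0, x1 - x0, y1 - y0 };
    return r;
}
```

For a 5x5 filter (neighborhood 2 on all sides), a 64x32 interior tile needs a 68x36 input region, while the same tile at the top-left corner of the image can only get 66x34 because two rows and two columns fall outside the image.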

Additional optimization
• The user may provide two versions of the function for the user kernel
• The fast version and the flexible version
• The OpenVX implementation will only call the fast function when it’s “safe”
– The tile size is a whole-number multiple of the output tile block size
– The input neighborhood doesn’t extend beyond the boundaries of the input image
• The fast version of the function doesn’t have to check any edge conditions
– Computes efficiently without conditional checks and branches
• The flexible version needs to make the appropriate checks to handle the edge conditions
• There is a relationship between the fast function, flexible function, and border mode
– Read the spec
[Figure: image tiles labeled Fast and Flexible]
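The two "safe" conditions listed on this slide can be written as a single predicate. The plain-C sketch below is hypothetical: `can_use_fast` and its structs are illustrative names, and the actual dispatch rules, including the interaction with the border mode, are defined by the extension specification.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for this sketch; not the extension's own structs. */
typedef struct { int x, y, w, h; } rect;
typedef struct { int left, right, top, bottom; } neighborhood;

/* True when a tile meets both "safe" conditions for the fast function. */
static bool can_use_fast(rect tile, int block_w, int block_h,
                         neighborhood nb, int img_w, int img_h)
{
    assert(block_w > 0 && block_h > 0);
    /* Condition 1: tile size is a whole-number multiple of the block size. */
    bool whole_blocks = (tile.w % block_w == 0) && (tile.h % block_h == 0);
    /* Condition 2: the input neighborhood stays inside the input image. */
    bool inside = tile.x - nb.left >= 0 &&
                  tile.y - nb.top >= 0 &&
                  tile.x + tile.w + nb.right <= img_w &&
                  tile.y + tile.h + nb.bottom <= img_h;
    return whole_blocks && inside;
}
```

A runtime could call the fast function when this predicate holds and fall back to the flexible function otherwise, for example for tiles that touch the image border.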

NVIDIA VisionWorks is Integrating OpenVX
• VisionWorks library contains diverse vision and imaging primitives
• Will leverage OpenVX for optimized primitive execution
• Can extend VisionWorks nodes through GPU-accelerated primitives
• Provided with sample library of fully accelerated pipelines
[Figure: VisionWorks software stack. Tegra K1; CUDA libraries and GPU libraries; VisionWorks primitives (classifier, corner detection, 3rd party, …); VisionWorks framework; vision pipeline samples (object detection, SLAM, 3rd party pipelines, …); applications and middleware]

Khronos APIs for Augmented Reality
[Figure: Khronos APIs in an AR pipeline. Advanced camera control and stream generation; vision processing; sensor fusion of MEMS sensors; 3D rendering and video composition on GPU; audio rendering; application on CPUs, GPUs and DSPs. EGLStream streams data between APIs; precision timestamps on all sensor samples]
AR needs not just advanced sensor processing, vision acceleration, computation and rendering, but also all these subsystems working efficiently together

Summary
• Khronos is building a trio of interoperating APIs for portable / power-efficient
vision and sensor processing
• OpenVX 1.0 specification is now finalized and released
- Full conformance tests and Adopters program immediately available
- Khronos open source sample implementation by end of 2014
- First commercial implementations already close to shipping
• Any company is welcome to join Khronos
to influence the direction of mobile and embedded vision processing!
- $15K annual membership fee for access to all Khronos API working groups
- Well-defined IP framework protects your IP and conformant implementations
• More Information
- www.khronos.org
- ntrevett@nvidia.com
- @neilt3d

Background Material

Need for Camera Control API - OpenKCAM
• Advanced control of ISP and camera subsystem – with cross-platform portability
- Generate sophisticated image stream for advanced imaging & vision apps
• No platform API currently fulfills all developer requirements
- Portable access to growing sensor diversity: e.g. depth sensors and sensor arrays
- Cross sensor synch: e.g. synch of camera and MEMS sensors
- Advanced, high-frequency per-frame burst control of camera/sensor: e.g. ROI
- Multiple input, output re-circulating streams with RAW, Bayer or YUV Processing
[Figure: Image Signal Processor (ISP), EGLStreams, Image/Vision Applications]
Defines control of Sensor, Color Filter Array, Lens, Flash, Focus, Aperture
Auto Exposure (AE), Auto White Balance (AWB), Auto Focus (AF)

OpenKCAM is FCAM-based
• FCAM (2010) Stanford/Nokia, open source
• Capture stream of camera images with precision control
- A pipeline that converts requests into image stream
- All parameters packed into the requests - no visible state
- Programmer has full control over sensor settings for each frame in stream
• Control over focus and flash
- No hidden daemon running
• Control ISP
- Can access supplemental
statistics from ISP if available
• No global state
- State travels with image requests
- Every pipeline stage may have different state
- Enables fast, deterministic state changes
Khronos coordinating with
MIPI on camera control and
data formats

Sensor Industry Fragmentation …

Sensor Data Types
• Raw sensor data
- Acceleration, Magnetic Field, Angular Rates
- Pressure, Ambient Light, Proximity, Temperature, Humidity, RGB light, UV light
- Heart rate, Blood Oxygen Level, Skin Hydration, Breathalyzer
• Fused sensor data
- Orientation (Quaternion or Euler Angles)
- Gravity, Linear Acceleration, Position
• Contextual awareness
- Device Motion: general movement of the device: still, free-fall, …
- Carry: how the device is being held by a user: in pocket, in hand, …
- Posture: how the body holding the device is positioned: standing, sitting, step, …
- Transport: about the environment around the device: in elevator, in car, …

Low-level Sensor Abstraction API
Apps Need Sophisticated
Access to Sensor Data
Without coding to specific
sensor hardware
Apps request semantic sensor information
StreamInput defines possible requests, e.g.
Read Physical or Virtual Sensors e.g. “Game Quaternion”
Context detection e.g. “Am I in an elevator?”
StreamInput processing graph provides
optimized sensor data stream
High-value, smart sensor fusion middleware can connect
to apps in a portable way
Apps can gain ‘magical’ situational awareness
Advanced Sensors Everywhere
Multi-axis motion/position, quaternions,
context-awareness, gestures, activity
monitoring, health and environmental sensors
Sensor Discoverability
Sensor Code Portability
