This document provides an overview of the Kinect SDK, including hardware requirements, the development environment, the image and depth APIs, skeletal tracking, audio processing, speech recognition, and the sample applications. It covers the Kinect sensor's depth sensors, RGB camera, and microphone array, and shows how to access depth data, skeletal data, and audio streams from the Kinect in code.
2. Agenda
Installing and Using the Kinect Sensor
Setting up your Development Environment
Camera Fundamentals
Working with Depth Data
Skeletal Tracking Fundamentals
Audio Fundamentals
3. Hardware
Computer with a dual-core, 2.66-GHz or faster processor
2 GB of RAM
Windows 7-compatible graphics card that supports DirectX 9.0c
4. Kinect Sensor
3D depth sensors
RGB camera
Motorized tilt
Multi-array microphone
5. Development Environment
Microsoft Visual Studio 2010 Express or other Visual Studio 2010 edition
.NET Framework 4.0
SDK: http://research.microsoft.com/kinectsdk
DirectX samples:
Microsoft DirectX® SDK, June 2010 or a later version
Current runtime for Microsoft DirectX® 9
Speech samples:
Microsoft Speech Platform Runtime, version 10.2 (x86 edition)
Microsoft Kinect Speech Platform (US-English version)
Microsoft Speech Platform Software Development Kit, version 10.2 (x86 edition)
8. Depth Image
Array of bytes (ImageFrame.Image.Bits)
Ordered left to right, top to bottom
Each value is the distance for that pixel in mm (850 to 4,000 mm)
0 means unknown; shadows, low reflectivity, and high reflectivity are among the common causes
Player index: 0 = no player, 1 = skeleton 0, 2 = skeleton 1
9. Depth Data
2 bytes per pixel (16 bits)
Depth stream (distance per pixel): shift the second byte left by 8 bits
Distance(0,0) = (int)(Bits[0] | Bits[1] << 8);
DepthAndPlayerIndex stream (includes the player index): shift the first byte right by 3 bits (its low 3 bits are the player index) and the second byte left by 5 bits
Distance(0,0) = (int)(Bits[0] >> 3 | Bits[1] << 5);
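The indexing and bit arithmetic above can be collected into a small helper. A minimal C# sketch (the class and method names are my own; only the byte layout and shifts come from the slides):

```csharp
using System;

static class DepthDecoder
{
    // Two bytes per pixel, left to right, top to bottom:
    // byte offset of pixel (x, y) in the Bits array.
    public static int PixelOffset(int x, int y, int width)
    {
        return (y * width + x) * 2;
    }

    // Depth-only stream: distance in mm, low byte first.
    public static int GetDistance(byte lo, byte hi)
    {
        return lo | (hi << 8);
    }

    // DepthAndPlayerIndex stream: the low 3 bits of the first
    // byte hold the player index, the remaining 13 bits the distance.
    public static int GetDistanceWithPlayer(byte lo, byte hi)
    {
        return (lo >> 3) | (hi << 5);
    }

    // 0 = no player, 1 = skeleton 0, 2 = skeleton 1.
    public static int GetPlayerIndex(byte lo)
    {
        return lo & 0x07;
    }
}
```

For a 320 × 240 depth frame, the distance at (x, y) would be GetDistance(Bits[DepthDecoder.PixelOffset(x, y, 320)], Bits[DepthDecoder.PixelOffset(x, y, 320) + 1]).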
13. Joint Data
Maximum of two players tracked at once
Six player proposals
Each player has a set of joints, each an <x, y, z> position in meters
Tracking state per joint:
Tracked
Inferred (occluded, clipped, or low-confidence joints)
Not tracked (rare, but your code must check for this state)
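Defensive handling of the three states can be sketched as follows (the enum mirrors the states listed above; the names are illustrative, not the SDK's exact types):

```csharp
using System;

// Mirrors the three per-joint tracking states listed in the slides.
enum JointState { Tracked, Inferred, NotTracked }

static class JointFilter
{
    // Tracked    -> draw normally
    // Inferred   -> draw, but flag as low confidence (occluded/clipped)
    // NotTracked -> skip entirely; rare, but must be handled
    public static bool ShouldDraw(JointState state, out bool lowConfidence)
    {
        lowConfidence = (state == JointState.Inferred);
        return state != JointState.NotTracked;
    }
}
```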
15. Audio Processing
Four-microphone array with hardware-based audio processing:
Multichannel echo cancellation (MEC)
Sound position tracking
Other digital signal processing (noise suppression and reduction)
19. Samples
NUI Skeletal Viewer, C++ and C#
Shape Game demo, C#
Raw audio capture, C++
Audio filtering, C++
Echo cancellation, C++
Audio recording, C#
Speech, C#
The Speech Platform Runtime and SDK must be the x86 editions. The Microsoft Kinect Speech Platform uses the same speech recognition (acoustic model) as Xbox. The .NET 4.0 System.Speech namespace can also be used, but it is not as up to date.
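The command-recognition flow in the speech samples is grammar-based: the application registers a small set of phrases and reacts when one is recognized with enough confidence. A simplified, SDK-free stand-in for that flow (the real samples use Microsoft.Speech's SpeechRecognitionEngine and Grammar types; CommandGrammar and the 0.8 threshold here are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for a Microsoft.Speech command grammar:
// a fixed phrase set plus a confidence gate on recognition results.
class CommandGrammar
{
    readonly Dictionary<string, Action> commands =
        new Dictionary<string, Action>(StringComparer.OrdinalIgnoreCase);

    public void Add(string phrase, Action handler)
    {
        commands[phrase] = handler;
    }

    // Plays the role of a SpeechRecognized callback: run the
    // handler only when the phrase is known and confidence is high.
    public bool OnRecognized(string text, double confidence)
    {
        Action handler;
        if (confidence >= 0.8 && commands.TryGetValue(text, out handler))
        {
            handler();
            return true;
        }
        return false;
    }
}
```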
Colour and depth stream range: 4 to 11.5 feet (1.2 to 3.5 meters)
Skeletal tracking range: 4 to 11.5 feet (1.2 to 3.5 meters)
Viewing angle: 43° vertical by 57° horizontal field of view
Mechanized tilt range (vertical): ±28°
Frame rate (depth and colour stream): 30 frames per second (FPS)
Resolution, depth stream: QVGA (320 × 240)
Resolution, colour stream: VGA (640 × 480)
Audio format: 16-kHz, 16-bit mono pulse code modulation (PCM)
Audio input characteristics: a four-microphone array with 24-bit analogue-to-digital converter (ADC) and Kinect-resident signal processing such as echo cancellation and noise suppression
WPF, event-driven RGB & depth frames, camera tilt
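Camera tilt is set through the runtime's elevation-angle property, which should only be given values inside the mechanical range, so it is worth clamping the request first. A small sketch using the ±28° figure quoted in these slides (the helper is my own, not part of the SDK):

```csharp
using System;

static class CameraTilt
{
    // Mechanized tilt range quoted in these slides: +/-28 degrees.
    public const int MinAngle = -28;
    public const int MaxAngle = 28;

    // Clamp a requested angle before assigning it to the SDK's
    // camera elevation-angle property.
    public static int Clamp(int requested)
    {
        return Math.Max(MinAngle, Math.Min(MaxAngle, requested));
    }
}
```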
Skeletal Viewer (C++ and C#)
The Kinect sensor includes two cameras: one delivers depth information and the other delivers color data. The NUI API enables applications to access and manipulate this data. The SkeletalViewer sample uses the NUI API to render data from the Kinect sensor’s cameras as images on the screen. The managed sample uses WPF to render captured images, and the native application uses DirectX.

ShapeGame—Creating a Game with Audio and Skeletal Tracking (C#)
Displays the tracked skeletons of two players together with shapes falling from the sky. Players can control the shapes by moving and by speaking commands.

AudioCaptureRaw—Raw Audio Capture (C++)
The Kinect sensor’s audio component is a four-element microphone array. The AudioCaptureRaw sample uses the Windows Audio Session API (WASAPI) to capture the raw audio stream from the Kinect sensor’s microphone array and write it to a .wav file.

MicArrayEchoCancellation—Acoustic Echo Cancellation, Beamforming, and Source Localization (C++)
The primary way for C++ applications to access the Kinect sensor’s microphone array is through the MSRKinectAudio DirectX Media Object (DMO). The MSRKinectAudio DMO supports all standard microphone-array functionality and adds support for beamforming and source localization. The MicArrayEchoCancellation sample shows how to use the MSRKinectAudio DMO in a DirectShow graph. It uses acoustic echo cancellation to record a high-quality audio stream, and beamforming and source localization to determine the selected beam and the direction to the sound source.
MFAudioFilter—Media Foundation Audio Filter (C++)
Shows how to capture an audio stream from the Kinect sensor’s microphone array by using the MSRKinectAudio DMO in filter mode in a Windows Media Foundation topology.

RecordAudio—Recording an Audio Stream and Monitoring Direction (C#)
Demonstrates how to capture an audio stream from the Kinect sensor’s microphone array and monitor the currently selected beam and sound source direction.

Speech—Recognizing Voice Commands (C#)
Demonstrates how to use the Kinect sensor’s microphone array with the Microsoft.Speech API to recognize voice commands.