This study investigated how auditory and visual information is integrated when presented through an augmented-reality (AR) device. Subjects were shown a visual cue, a small green square, at 0°, -5°, or -10° on the Sony SmartEyeGlass, and were presented with white noise through headphones at target angles from +45° to -45°. Subjects responded whether the sound came from the direction of the visual cue. The results showed that the two audio rendering methods, amplitude-based panning and HRTF-based rendering, did not produce significant differences for the tested AR device. This suggests that simpler panning methods can be used to save computational cost in AR applications with simple visual displays. Future work will explore integration in the median plane, integration of perceived distance, and more complex visual cues using other AR devices such as Microsoft HoloLens.
Investigating integration between auditory and visual location information presented in an augmented-reality (AR) device
Hiraku Okumura¹,², Sushrut Khiwadkar¹, and Sungyoung Kim¹
¹ Rochester Institute of Technology, Rochester, NY, USA; ² Yamaha Corporation, Hamamatsu, Japan
Background

Devices | Visual | Auditory
VR HMD (Oculus Rift, HTC Vive, Sony PlayStation VR, etc.) | Immersive | Immersive
AR HMD (Microsoft HoloLens, etc.) | Immersive | Immersive
AR glasses (Sony SmartEyeGlass, Epson Moverio, etc.) | Simple | Immersive? Simple?
The rapid growth of virtual reality (VR) and augmented reality (AR) technologies allows more people to use head-mounted displays (HMDs) and smart glasses in daily life. HMDs can provide users with immersive visual information, whereas smart glasses display relatively simple visual information.
Method

We presented a target visual object, a small green square, on the AR glasses, Sony SmartEyeGlass (Fig. 1), at 0° (face-forward center), -5°, and -10° counterclockwise from the center. Auditory stimuli (white noise) were presented through headphones at target locations in the horizontal plane from +45° to -45° in 5° steps.
The subjects were forced to choose between Yes and No in response to the question: "Is the sound coming from the direction of the video cue?" The resulting set of cue/sound combinations is sketched below.
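As a minimal illustration (not the authors' actual experiment code), the sketch below enumerates the stimulus conditions implied by the description above: three video cue angles crossed with nineteen sound target angles, each judged with a forced-choice Yes/No for each rendering method. The presentation order and number of repetitions are assumptions.

```python
import itertools
import random

# Video cue angles on the SmartEyeGlass (degrees, negative = counterclockwise)
video_cue_angles = [0, -5, -10]

# Sound target angles in the horizontal plane: +45 to -45 degrees in 5-degree steps
sound_angles = list(range(45, -50, -5))          # 19 angles

# One trial per (method, cue, sound) triple; randomized order is an assumption,
# not stated on the poster.
conditions = list(itertools.product(["panning", "hrtf"], video_cue_angles, sound_angles))
random.shuffle(conditions)

print(len(conditions))   # 2 methods x 3 cues x 19 sound angles = 114 trials
for method, cue, sound in conditions[:3]:
    print(f"method={method}, video cue={cue} deg, sound target={sound} deg -> "
          "ask: 'Is the sound coming from the direction of the video cue?' (Yes/No)")
```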
Reference

[1] Kaneko, S., et al., "Ear Shape Modeling for 3D Audio and Acoustic Virtual Reality: The Shape-Based Average HRTF," in Audio Engineering Society 61st International Conference on Audio for Games, February 2016.
Result
(Results figure: subjects' responses for the panning-based and HRTF-based rendering methods at video cue angles of 0°, -5°, and -10° in the horizontal plane.)
(Fig. 2 system configuration: the experiment app runs on a host Android tablet, which drives the video cue on the Sony SmartEyeGlass over Bluetooth; a laptop PC running MAX, connected over a wireless network, renders the audio cue at target angles from +45° to -45° through AKG K550 headphones.)
By integrating proper auditory information, we can enhance perceived immersion as well as improve clarity (and thus understandability) of visual information. In audio engineering, binaural techniques based on the HRTF (Head-Related Transfer Function) are popular rendering methods for headphones. HRTF-based techniques can place virtual sound sources all around a user and create a more immersive auditory image, but they require higher computational cost. In contrast, conventional amplitude-based panning methods can generate virtual sound sources between loudspeakers (or between the two headphone channels) at much lower computational cost. How different are these two approaches in the context of AR?
To answer this question, we conducted a study comparing two audio rendering techniques, HRTF-based and panning-based. The objective of the study was to quantitatively investigate the possible benefits of employing panning-based rendering for AR devices.
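As context for the cost comparison above, here is a minimal sketch of HRTF-based headphone rendering: the source signal is convolved with a left and a right head-related impulse response (HRIR) for the target direction, so the per-sample cost scales with the HRIR length, whereas amplitude panning needs only one gain multiply per ear per sample. The HRIRs below are crude placeholders; the study itself used the shape-based average HRTF dataset [1].

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48000
source = np.random.randn(fs)            # 1 s of white noise (the auditory stimulus)

# Placeholder HRIRs for one target azimuth (e.g., 512 taps per ear).
# A real renderer would look these up in a measured HRTF dataset such as [1].
hrir_left = np.zeros(512);  hrir_left[0] = 1.0
hrir_right = np.zeros(512); hrir_right[32] = 0.7   # crude delay/attenuation stand-in

# Binaural rendering = one convolution per ear.
left = fftconvolve(source, hrir_left)[:len(source)]
right = fftconvolve(source, hrir_right)[:len(source)]
binaural = np.stack([left, right], axis=-1)
```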
• Audio rendering
  • HRTF dataset: shape-based average HRTF (30 people) [1]
  • Panning method: sine-cosine panning law (see the sketch below)
• AR device: Sony SmartEyeGlass
  • Screen size (angle of view): ±10° (horizontal) x ±5° (vertical)
  • Display color: monochrome (green)
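A minimal sketch of the sine-cosine panning law listed above, assuming the target azimuth is mapped linearly from the experiment's ±45° range onto the 0°-90° panning angle and that negative azimuths correspond to the listener's left; the exact mapping used in the study is not stated on the poster.

```python
import numpy as np

def sine_cosine_pan(azimuth_deg: float) -> tuple[float, float]:
    """Left/right gains for a target azimuth in [-45, +45] degrees.

    Assumes negative azimuth = listener's left and a linear mapping of the
    +/-45 degree range onto the 0-90 degree panning angle (constant power:
    gL**2 + gR**2 == 1).
    """
    phi = np.deg2rad(azimuth_deg + 45.0)    # -45 deg -> 0, +45 deg -> pi/2
    return float(np.cos(phi)), float(np.sin(phi))

# Example: the 19 sound target angles used in the experiment
for az in range(45, -50, -5):
    gl, gr = sine_cosine_pan(az)
    print(f"{az:+3d} deg -> L gain {gl:.3f}, R gain {gr:.3f}")
```

Applying these two gains to the mono white-noise signal yields the panned stereo stimulus; compare this with the per-ear convolution in the HRTF sketch above.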
Fig.1 Sony SmartEyeGlass (eyewear and controller)
Fig.2 System configuration
Fig.3 The user interface for subjects
Discussion & Future work

The results show that the two methods, amplitude-based panning and HRTF-based rendering, did not produce any significant differences for the tested AR device. This implies that we can employ simple panning methods for AR applications that mainly display simple visual information, and thereby save computational cost.
As a first step, the current study concentrated on audio/visual interaction in the horizontal plane. We plan to extend this work to integration in the median plane as well as integration of perceived distance. Furthermore, other types of AR devices, such as Microsoft HoloLens, are being tested for multimodal integration with more complex visual cues.