Paper & Abstract
Blind people typically access videos via audio descriptions (AD) crafted by sighted describers who comprehend, select, and describe crucial visual content in the videos. 360° video is an emerging storytelling medium that enables immersive experiences people may not be able to have in everyday life. However, the omnidirectional nature of 360° videos makes it challenging for describers to perceive the holistic visual content and interpret spatial information that is essential to create immersive ADs for blind people. Through a formative study with a professional describer, we identified key challenges in describing 360° videos and iteratively designed OmniScribe, a system that supports the authoring of immersive ADs for 360° videos. OmniScribe uses AI-generated content-awareness overlays for describers to better grasp 360° video content. Furthermore, OmniScribe enables describers to author spatial AD and immersive labels for blind users to consume the videos immersively with our mobile prototype. In a study with 11 professional and novice describers, we demonstrated the value of OmniScribe in the authoring workflow, and a study with 8 blind participants revealed the promise of immersive AD over standard AD for 360° videos. Finally, we discuss the implications of promoting 360° video accessibility.
Web Authoring Interface
Authoring Immersive Labels
To author spatial ADs, the describer can use the brushing tool to paint the sound path of the selected description on the equirectangular view. OmniScribe then transforms the 2D painted path into 3D spherical coordinates, which are visualized in the video views during later playback and used to render spatialized sound in OmniScribe's mobile prototype.
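As a rough illustration of this transformation, the sketch below maps a painted pixel on the equirectangular image to a direction on the unit sphere. The axis convention and function names are assumptions for illustration, not OmniScribe's actual implementation.

```python
import math

def equirect_to_sphere(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a 3D direction on the unit sphere.

    Assumes u grows rightward (longitude) and v grows downward (latitude);
    the axis convention here is illustrative only.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi    # -pi .. pi
    lat = (0.5 - v / height) * math.pi         # -pi/2 .. pi/2
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = -math.cos(lat) * math.cos(lon)         # -z as "front", a common graphics convention
    return (x, y, z)

# A brushed path is then just the per-point mapping:
painted_path = [(512, 300), (540, 310), (575, 322)]   # pixels on a 1024x512 frame
spherical_path = [equirect_to_sphere(u, v, 1024, 512) for u, v in painted_path]
```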
OmniScribe reserves an AD slot for each scene, allowing the describer to provide more details about it. Scenes are automatically detected and segmented once the video is loaded. In the mobile prototype, BVI users are notified of each new scene through vibration and can manually play the scene description.
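The paper does not detail the scene-detection method; the following is a minimal histogram-difference sketch with OpenCV, purely to illustrate what this preprocessing step could look like.

```python
import cv2

def detect_scene_cuts(video_path, threshold=0.5):
    """Return frame indices where consecutive frames differ sharply.

    A simple illustrative detector; OmniScribe's actual scene
    segmentation method may differ.
    """
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Correlation near 1 means similar frames; a sharp drop suggests a cut.
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < threshold:
                cuts.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts
```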
OmniScribe enables the describer to select crucial objects and describe them; we call these object descriptions. The moving path of each object is prepopulated using object tracking in the preprocessing stage, and the audio path of the object description is automatically mapped to it. Thus, describers do not need to spatialize object descriptions manually with the brushing tool described above.
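Below is a small sketch of how a tracked object path could be turned into a per-frame sound direction. The tracker output format is an assumption, and it reuses the illustrative equirect_to_sphere() from the earlier sketch.

```python
def object_audio_path(tracked_boxes, width, height):
    """Turn per-frame tracked boxes into a per-frame sound direction.

    `tracked_boxes` maps frame index -> (x, y, w, h) in equirectangular pixels,
    e.g. as produced by an off-the-shelf tracker in the preprocessing stage.
    """
    path = {}
    for frame_idx, (x, y, w, h) in sorted(tracked_boxes.items()):
        cx, cy = x + w / 2.0, y + h / 2.0      # box center in pixels
        path[frame_idx] = equirect_to_sphere(cx, cy, width, height)
    return path

# The object description's audio can then be panned along this path during
# playback, so the describer never has to brush it by hand.
```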
Content-Awareness Components
OmniScribe uses a rectangular view indicator in the equirectangular view to roughly indicate what is presented in the normal field of view (NFOV). The view indicator can be panned in either the equirectangular view or the NFOV and is synchronized across the two. In the NFOV, we also added section control widgets for six sections: the top, bottom, left, right, front, and back views, which allow describers to focus on a desired section by clicking its tag or shifting with the arrows.
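One way to think of the six sections is as yaw/pitch presets that the widgets cycle through; the angles and ordering below are assumptions for illustration, not OmniScribe's actual values.

```python
# Illustrative yaw/pitch presets (degrees) for the six section widgets.
SECTIONS = {
    "front":  (0.0,    0.0),
    "back":   (180.0,  0.0),
    "left":   (-90.0,  0.0),
    "right":  (90.0,   0.0),
    "top":    (0.0,   90.0),
    "bottom": (0.0,  -90.0),
}

SECTION_ORDER = ["front", "right", "back", "left", "top", "bottom"]

def shift_section(current, step=1):
    """Cycle to the next/previous section, mirroring the arrow widgets."""
    i = SECTION_ORDER.index(current)
    return SECTION_ORDER[(i + step) % len(SECTION_ORDER)]
```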
OmniScribe visualizes the detected objects on a circular map by centering the viewer and placing iconic representations of the objects around them. Once an icon is clicked, the user is automatically guided to the corresponding object in the other video views. A viewing compass indicates the facing direction and the corresponding field of view.
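The placement of icons around the centered viewer can be sketched as a simple bearing-to-position conversion; the layout convention below is an assumption, not the system's exact rendering code.

```python
import math

def icon_position(obj_yaw_deg, viewer_yaw_deg, radius=100.0):
    """Place an object icon on a viewer-centered circular map.

    Angles are in degrees; the viewer's facing direction is rendered straight up.
    """
    rel = math.radians(obj_yaw_deg - viewer_yaw_deg)
    x = radius * math.sin(rel)      # right of center
    y = -radius * math.cos(rel)     # up is negative y in screen coordinates
    return (x, y)
```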
The bounding boxes of detected objects serve as another cue for users to observe the visual flow, follow specific content, or infer the number of objects. OmniScribe therefore presents the object bounding boxes in a separate visual overlay, which also lets describers easily author object descriptions.
The equirectangular image encodes all 360° of information in a 2D format that is hard to take in at once. To increase visual awareness, OmniScribe enhances the contours of salient objects by outlining them with green strokes.
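A minimal sketch of this overlay, assuming a binary saliency mask from some salient-object-detection model (which model OmniScribe uses is not shown here):

```python
import cv2
import numpy as np

def outline_salient_objects(frame, saliency_mask, thickness=3):
    """Draw green contours around salient regions of an equirectangular frame.

    `saliency_mask` is assumed to be a 0/255 mask produced upstream by a
    salient-object-detection model.
    """
    mask = (saliency_mask > 127).astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    overlay = frame.copy()
    cv2.drawContours(overlay, contours, -1, (0, 255, 0), thickness)  # BGR green
    return overlay
```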
Mobile Prototype
Using our mobile prototype, BVI people can listen to spatial ADs during video playback. The smartphone vibrates to notify users of scene transitions, and users can then proactively listen to the scene description by tapping the screen to pause the video. After a scene description finishes playing, users can explore the spatially anchored object descriptions by turning around.
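As a simplified illustration of orientation-aware playback, the sketch below computes constant-power stereo gains from the relative azimuth between the phone's facing direction and a description's anchored direction; the mobile prototype's actual spatialization engine may differ.

```python
import math

def stereo_gains(source_yaw_deg, device_yaw_deg):
    """Constant-power stereo gains for a spatially anchored description.

    The description sounds louder in the ear it is anchored toward,
    given the phone's current facing direction.
    """
    rel = math.radians(source_yaw_deg - device_yaw_deg)      # relative azimuth
    pan = max(-1.0, min(1.0, math.sin(rel)))                  # -1 = hard left, +1 = hard right
    angle = (pan + 1.0) * math.pi / 4.0                       # 0 .. pi/2
    left, right = math.cos(angle), math.sin(angle)
    return left, right
```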
FULL CITATION
Ruei-Che Chang, Chao-Hsien Ting, Chia-Sheng Hung, Wan-Chen Lee, Liang-Jin Chen, Yu-Tzu Chao, Bing-Yu Chen, and Anhong Guo. 2022. OmniScribe: Authoring Immersive Audio Descriptions for 360° Videos. In The 35th Annual ACM Symposium on User Interface Software and Technology (UIST ’22), October 29-November 2, 2022, Bend, OR, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3526113.3545613
@inproceedings{omniscribe,
  author    = {Chang, Ruei-Che and Ting, Chao-Hsien and Hung, Chia-Sheng and Lee, Wan-Chen and Chen, Liang-Jin and Chao, Yu-Tzu and Chen, Bing-Yu and Guo, Anhong},
  title     = {OmniScribe: Authoring Immersive Audio Descriptions for 360° Videos},
  year      = {2022},
  isbn      = {9781450393201},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3526113.3545613},
  doi       = {10.1145/3526113.3545613},
  booktitle = {The 35th Annual ACM Symposium on User Interface Software and Technology},
  numpages  = {14},
  keywords  = {360° video, audio description, virtual reality, multimedia, accessibility, Blind, visual impairment, sonification, computer vision, mobile},
  location  = {Bend, Oregon, USA},
  series    = {UIST '22}
}