Paper & Abstract



Blind people typically access videos via audio descriptions (AD) crafted by sighted describers who comprehend, select, and describe crucial visual content in the videos. 360° video is an emerging storytelling medium that enables immersive experiences that people may not be able to reach in everyday life. However, the omnidirectional nature of 360° videos makes it challenging for describers to perceive the holistic visual content and interpret spatial information that is essential to create immersive ADs for blind people. Through a formative study with a professional describer, we identified key challenges in describing 360° videos and iteratively designed OmniScribe, a system that supports the authoring of immersive ADs for 360° videos. OmniScribe uses AI-generated content-awareness overlays for describers to better grasp 360° video content. Furthermore, OmniScribe enables describers to author spatial AD and immersive labels for blind users to consume the videos immersively with our mobile prototype. In a study with 11 professional and novice describers, we demonstrated the value of OmniScribe in the authoring workflow, and a study with 8 blind participants revealed the promise of immersive AD over standard AD for 360° videos. Finally, we discuss the implications of promoting 360° video accessibility.

Web Authoring Interface

This figure gives an overview of the OmniScribe interface.
        a. Two video views. The left half of the figure is an augmented equirectangular view with a clock meter on top, and a normal field-of-view (NFOV) video view sits below it. In the equirectangular view, three people are making formula milk for elephants.
        b. A content map in the lower-left corner presents dynamic objects and the user's viewing angle. In this frame, three people stand around the viewer: one at ten o'clock, one at two o'clock, and one at three o'clock.
        c. A description authoring panel in the upper-right corner is used to author standard AD as well as scene and object descriptions. Each description has a textbox with widgets for playing the audio description, creating a description, and estimating its duration.
        d. A timeline panel in the lower-right corner helps users visualize the scenes, described objects, audio content, and ADs. From left to right, four toggles select the video file, section division, object tracking, and saliency. Below them, four rows show, from top to bottom, timelines for the scene, background sound, speech in the original video, and the user's descriptions. A region below these rows visualizes the objects that have immersive labels.

Authoring Immersive Labels

Authoring Spatial Audio Descriptions (Spatial AD)
To author spatial ADs, the describer can use the brushing tool to paint the sound paths of the selected description on the equirectangular view. OmniScribe then transforms the 2D painted path into 3D spherical coordinates to be visualized on the video view during future playbacks and for rendering immersive sound in OmniScribe's mobile prototype.
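The page does not spell out the exact conversion, but a minimal sketch of the standard equirectangular-to-sphere mapping (function name, frame size, and axis convention are assumptions) could look like this:

```python
import math

def equirect_to_sphere(x, y, width, height):
    """Map a pixel (x, y) on an equirectangular frame to spherical coordinates
    and a 3D unit vector (the axis convention here is an assumption)."""
    lon = (x / width) * 2.0 * math.pi - math.pi   # longitude in [-pi, pi]
    lat = math.pi / 2.0 - (y / height) * math.pi  # latitude in [-pi/2, pi/2]
    # Unit vector with +Z forward, +X right, +Y up (one common convention).
    vx = math.cos(lat) * math.sin(lon)
    vy = math.sin(lat)
    vz = math.cos(lat) * math.cos(lon)
    return lon, lat, (vx, vy, vz)

# A painted sound path is simply a list of pixels sampled along the brush stroke.
path_3d = [equirect_to_sphere(px, py, 3840, 1920) for px, py in [(960, 480), (1200, 500)]]
```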


Authoring Scene Descriptions
OmniScribe reserves an AD slot for each scene, allowing the describer to provide more details about it. Scenes are automatically detected and segmented once the video is loaded. In the mobile prototype, BVI users are notified of each new scene through vibration and can manually play scene descriptions.
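As an illustration of the automatic scene segmentation step, a stand-in using the open-source PySceneDetect library (the system's actual detector is not named here) could look like:

```python
# Illustrative only: PySceneDetect as a stand-in for OmniScribe's scene segmentation.
from scenedetect import detect, ContentDetector

scenes = detect("360_video.mp4", ContentDetector())  # list of (start, end) timecodes
for start, end in scenes:
    print(f"Scene from {start.get_timecode()} to {end.get_timecode()}")
    # One scene-description AD slot would be reserved per detected scene here.
```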



Authoring Object Descriptions
OmniScribe enables the describer to select crucial objects and describe them; we call these object descriptions. The object's moving path is prepopulated using object tracking in the preprocessing stage, and the audio path of the object description is automatically mapped to that moving path. Thus, users do not need to spatialize object descriptions manually with the brushing tool described above.
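A minimal sketch of how a tracked object's path can drive the description's audio path (the track format and field names are assumptions):

```python
import math

def audio_path_from_track(track, width, height):
    """Convert per-frame bounding boxes from a tracker into per-frame
    (time, longitude, latitude) anchors for a spatialized description.
    `track` is assumed to be a list of (time_sec, x, y, w, h) boxes in pixels."""
    anchors = []
    for t, x, y, w, h in track:
        cx, cy = x + w / 2.0, y + h / 2.0              # box center in pixels
        lon = (cx / width) * 2.0 * math.pi - math.pi   # longitude in [-pi, pi]
        lat = math.pi / 2.0 - (cy / height) * math.pi  # latitude in [-pi/2, pi/2]
        anchors.append((t, lon, lat))
    return anchors
```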


Content-Awareness Components

View Control Widgets
OmniScribe uses a rectangular view indicator in the equirectangular view to roughly indicate what is presented in the normal field of view (NFOV). The view indicator can be panned in either the equirectangular view or the NFOV and stays synchronized across the two. In the NFOV, we also added section control widgets for the six sections (top, bottom, left, right, front, and back views), which allow users to focus on a desired section by clicking its section tag or shifting with the arrows.
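A rough sketch of how a viewing direction and field of view can be turned into the rectangular indicator on the equirectangular frame (angles in degrees; the rectangle is only an approximation near the poles and the ±180° seam):

```python
def view_indicator_rect(yaw_deg, pitch_deg, hfov_deg, vfov_deg, width, height):
    """Approximate the NFOV footprint as a rectangle on the equirectangular frame."""
    x_center = (yaw_deg + 180.0) / 360.0 * width
    y_center = (90.0 - pitch_deg) / 180.0 * height
    rect_w = hfov_deg / 360.0 * width
    rect_h = vfov_deg / 180.0 * height
    # Wrap-around at the seam is ignored in this sketch.
    return (x_center - rect_w / 2, y_center - rect_h / 2, rect_w, rect_h)
```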


Content Map
OmniScribe visualizes the detected objects on a circular map by centering the viewer and placing iconic representations of the objects around them. Once an icon is clicked, the user is automatically guided to the clicked object in the other video views. A viewing compass is rendered to indicate the facing direction and the corresponding field of view.
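One way to place an object icon on the circular content map, assuming each object's azimuth relative to the viewer is known (the angle convention and map geometry are assumptions):

```python
import math

def content_map_position(azimuth_deg, map_center, map_radius):
    """Place an icon on the circular content map: 0 degrees = straight ahead
    (up on the map), angles increase clockwise, like the clock positions above."""
    theta = math.radians(azimuth_deg)
    cx, cy = map_center
    x = cx + map_radius * math.sin(theta)
    y = cy - map_radius * math.cos(theta)   # screen y grows downward
    return x, y

# Example: the person "at three o'clock" sits 90 degrees clockwise from the front.
print(content_map_position(90.0, (100, 100), 80))   # -> (180.0, 100.0)
```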



Object Tracking Overlay
Visualizing bounding boxes for detected objects gives users another cue to observe the visual flow, follow specific content, or infer the number of objects. OmniScribe therefore presents the object bounding boxes as another visual overlay. The bounding boxes also allow users to easily author object descriptions.
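A minimal sketch of rendering the tracking overlay with OpenCV (the box and label format is an assumption):

```python
import cv2

def draw_tracking_overlay(frame, boxes):
    """Draw labeled bounding boxes for tracked objects on one equirectangular frame.
    `boxes` is assumed to be a list of (label, x, y, w, h) tuples in integer pixels."""
    for label, x, y, w, h in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 255), 2)
        cv2.putText(frame, label, (x, max(0, y - 5)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 255), 2)
    return frame
```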



Saliency Overlay
The equirectangular image encodes all 360° information in a 2D format that is hard to take in at once. Therefore, we aimed to increase visual awareness by enhancing the contours of salient objects: OmniScribe outlines them with green strokes.
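A minimal sketch of outlining salient regions in green with OpenCV, assuming a per-frame saliency map is already available from the preprocessing model:

```python
import cv2
import numpy as np

def draw_saliency_overlay(frame, saliency_map, threshold=0.5):
    """Outline salient regions with green strokes. `saliency_map` is assumed to be
    a float array in [0, 1] with the same width/height as `frame`."""
    mask = (saliency_map >= threshold).astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cv2.drawContours(frame, contours, -1, (0, 255, 0), 2)  # green in BGR
    return frame
```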




Section Division Overlay
To help the describer reason about spatial layout, OmniScribe overlays the boundaries and labels of the six sections (front, back, left, right, top, and bottom) on the equirectangular view, making it easier to map visual content to spatial directions when authoring descriptions.

Mobile Prototype

Using our mobile prototype, BVI people can listen to spatial ADs during the video playback. The smartphone will vibrate to notify users of scene transitions, and users can then proactively access and listen to the scene descriptions by tapping the screen to pause the video. After the playback of a scene description is finished, users can explore the spatially-anchored object descriptions by turning around.

This figure demonstrates the mobile prototype. Using headphones and a smartphone, a BVI user (the woman in this picture) can access object descriptions by orienting herself toward objects anchored in 3D space. In this figure, she hears the description “A baby elephant is drinking milk from …” when turning her head to the right, and “The keeper is feeding the baby elephant.” when turning her head to the left.
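A minimal sketch of the head-orientation interaction, assuming each object description is anchored at an azimuth and is triggered when the user faces it within a tolerance (the anchors, threshold, and selection logic are assumptions):

```python
def pick_description(head_yaw_deg, anchors, tolerance_deg=30.0):
    """Return the anchored object description the user is currently facing.
    `anchors` is assumed to be a list of (azimuth_deg, description_text) pairs."""
    best = None
    for azimuth, text in anchors:
        # Smallest absolute angular difference between facing direction and anchor.
        diff = abs((head_yaw_deg - azimuth + 180.0) % 360.0 - 180.0)
        if diff <= tolerance_deg and (best is None or diff < best[0]):
            best = (diff, text)
    return best[1] if best else None

anchors = [(90.0, "A baby elephant is drinking milk from …"),
           (-90.0, "The keeper is feeding the baby elephant.")]
print(pick_description(80.0, anchors))   # facing right -> the baby elephant description
```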

[Full Demo Video]

FULL CITATION

Ruei-Che Chang, Chao-Hsien Ting, Chia-Sheng Hung, Wan-Chen Lee, Liang-Jin Chen, Yu-Tzu Chao, Bing-Yu Chen, and Anhong Guo. 2022. OmniScribe: Authoring Immersive Audio Descriptions for 360° Videos. In The 35th Annual ACM Symposium on User Interface Software and Technology (UIST ’22), October 29-November 2, 2022, Bend, OR, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3526113.3545613


@inproceedings{omniscribe,
  author    = {Chang, Ruei-Che and Ting, Chao-Hsien and Hung, Chia-Sheng and Lee, Wan-Chen and Chen, Liang-Jin and Chao, Yu-Tzu and Chen, Bing-Yu and Guo, Anhong},
  title     = {OmniScribe: Authoring Immersive Audio Descriptions for 360° Videos},
  year      = {2022},
  isbn      = {9781450393201},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3526113.3545613},
  doi       = {10.1145/3526113.3545613},
  booktitle = {The 35th Annual ACM Symposium on User Interface Software and Technology},
  numpages  = {14},
  keywords  = {360° video, audio description, virtual reality, multimedia, accessibility, blind, visual impairment, sonification, computer vision, mobile},
  location  = {Bend, Oregon, USA},
  series    = {UIST '22}
}