Motion Design Principles for Accessible Video-based Learning: Addressing Cognitive Challenges for Deaf and Hard of Hearing Learners

Si Chen (sic3@illinois.edu, ORCID 0000-0002-0640-6883), School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, Illinois, USA 61802
Haocong Cheng (haocong2@illinois.edu), School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, Illinois, USA
Suzy Su (xiaoyus4@illinois.edu), School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, Illinois, USA
Lu Ming, Gallaudet University, Washington, District of Columbia, USA
Sarah Masud, University of Illinois Urbana-Champaign, Champaign, Illinois, USA
Qi Wang (qi.wang@gallaudet.edu), Gallaudet University, Washington, District of Columbia, USA
Yun Huang (yunhuang@illinois.edu), School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, Illinois, USA
Abstract.

Deaf and Hard-of-Hearing (DHH) learners face unique challenges in video-based learning due to the complex interplay between visual and auditory information in videos. Traditional approaches to making video content accessible primarily focus on captioning, but these solutions often neglect the cognitive demands of processing both visual and textual information simultaneously. This paper introduces a set of Motion design guidelines aimed at mitigating these cognitive challenges and improving video learning experiences for DHH learners. Through a two-phase study, we identified five key challenges, including misaligned content and visual overload, and proposed five corresponding design principles. A user study with 16 DHH participants showed that improving visual-audio relevance and guiding visual attention significantly enhance the learning experience by reducing physical demand, alleviating temporal pressure, and improving learning satisfaction. Our findings highlight the potential of Motion design to transform educational content for DHH learners, and we discuss implications for inclusive video learning tools.

Keywords: Video, d/Deaf and hard of hearing, Visual Abilities, Motion
CCS Concepts: Human-centered computing → Accessibility design and evaluation methods; Social and professional topics → People with disabilities; Applied computing → E-learning

1. Introduction

Efforts to make video content accessible for d/Deaf and Hard of Hearing (DHH) learners often focus on closed captions and transcripts, e.g., (Bhavya et al., 2022). However, simply providing accurate audio-to-text features does not fully address accessibility challenges (Marschark et al., 2005). Captions alone are insufficient, as they do not account for the visual aspects of lecture videos, which are crucial for comprehension and learning outcomes (Bhavya et al., 2022; Chen et al., 2024a). DHH learners, on average, read more slowly than their hearing peers and must constantly switch attention between reading captions and watching the video, making it difficult to keep up (Wang and Williams, 2014; Kushalnagar et al., 2010, 2014). Previous work also highlights the value of sign language comments in video-based learning, showing how they support peer interaction and provide visual explanations that enhance understanding (Chen et al., 2024a). This underscores the need for further research on accessible video that goes beyond captions and centers on enhancing DHH learners’ visual experiences and preferences.

DHH individuals process visual information differently from their hearing counterparts, which is crucial for designing effective educational videos. Studies show that they respond faster to peripheral stimuli, focusing more attention on the periphery (Hong Lore and Song, 1991; Chen et al., 2006). Proksch and Bavelier (Proksch and Bavelier, 2002) found that while hearing individuals are more distracted by central stimuli, DHH individuals are more susceptible to peripheral distractions, suggesting different attentional resource allocation (Dye et al., 2008). Additionally, DHH individuals demonstrate greater activation in motion-selective visual areas and even the auditory cortex when viewing visual motion (Fine et al., 2005; Bavelier et al., 2001), responding faster and more accurately to motion direction changes (Hauthal et al., 2013). These visual processing differences should be considered in video design. DHH individuals’ preferences may contribute new insights into previous studies that found movement, when used strategically, can make information easier to process and interact with on websites (Petersen and Nielsen, 2002).

With DHH learners’ visual information processing needs and their video-based learning experience in focus, this paper addresses the question: How can video presentation designs be improved to better support DHH learners? We conducted a two-phase study. Phase 1 aimed to understand the specific challenges DHH learners face when engaging with mainstream video content. To answer RQ1: What challenges in video lecture delivery hinder the learning experience of d/Deaf and Hard of Hearing (DHH) learners?, we recruited DHH learners to identify video lecture delivery challenges when watching educational videos and to propose suggestions. Drawing on their unique perspectives and experiences, we pinpointed five key video lecture delivery challenges that hinder the learning process for DHH learners. These challenges ranged from the misalignment of visual and audio content, to the lack of visuals illustrating the lecture content, to the overwhelming presence of too much visual information at once. For each challenge, we proposed design suggestions centered on improving the motion aspect of the video representation.

Then, to answer RQ2: How do DHH learners perceive the value of the proposed motion design principles for video presentation?, Phase 2 focused on understanding how DHH users perceived the value of the proposed suggestions. Specifically, we asked participants to compare video clips edited according to the new design suggestions with the original versions. The results showed that two of the design suggestions, i.e., making visuals more relevant and guiding visual attention, were more effective and highly valued by the participants. The others received mixed feedback.

This research makes novel and significant contributions to the HCI community, especially for the accessible design of video-based interaction with educational content. First, we identified unique challenges in the video lecture delivery of multimedia content that create barriers for DHH learners: temporal misalignment of visuals and audio, overload of visual information, irrelevant visual content, lack of visual attention guidance, and text overload. We proposed a set of new design suggestions, which we term Motion design. We use the term “motion” to highlight DHH individuals’ unique visual abilities, including greater activation in motion-selective areas and the auditory cortex, and faster, more accurate responses to motion changes (Fine et al., 2005; Bavelier et al., 2001; Hauthal et al., 2013). We then provided empirical evidence of the perceived value of the different design suggestions. Second, this design is theoretically grounded in and supports the Multimedia Learning Theory (Mayer, 2002), specifically by addressing the cognitive demands of essential processing and generative processing (fostering connections between multimedia content). These design suggestions enrich the design toolkit by examining the relationship between text, audio, and video, as well as the relationships within the lecture content, which goes beyond traditional visual design guidelines.

2. Related Works

In this section, we review relevant literature on the visual information processing needs and unique challenges of d/Deaf and Hard of Hearing (DHH) learners when consuming learning videos and other multimedia content. Their distinctive visual abilities and challenges, together with the current state of captioning, call for improved learning video designs that not only enhance accessibility but also mitigate visual split-attention, reduce cognitive overload, and promote DHH student learning with multimedia content.

2.1. Accessible Videos for d/Deaf and Hard of Hearing (DHH) Learners

Efforts to make video learning content accessible for d/Deaf and Hard of Hearing (DHH) learners largely focus on providing closed captions and descriptive transcripts. However, it is a misguided assumption that merely offering audio-to-text features adequately addresses accessibility challenges. The most evident issue is that captions are often inaccurate due to relying on automatic speech recognition (ASR) algorithms (Bhavya et al., 2022). While ASR technology is improving, it is still prone to errors from extraneous noise and ambiguity in human speech (Kafle and Huenerfauth, 2016). In a study assessing TV captioning quality, even high-quality captions were perceived to have problems regarding errors, difficulty in following captions, and caption appearance (Arroyo Chavez et al., 2024). Because of this high error rate, auto-generated captions are deemed too inaccurate to meet DHH learners’ needs when they are exclusively used for video learning (Parton, 2016). Research has also suggested that for both DHH and hearing people, ASR-generated errors are more difficult to comprehend and follow than human-produced errors in collaborative captioning (Kushalnagar et al., 2014). However, when DHH learners are asked to consider future captioning tools, many are interested in improving non-speech elements such as speaker identity, speech rate, and volume to work towards accessible learning (McDonnell et al., 2021).

In addition to problems with accuracy, the appearance of captions can also pose accessibility issues. Unlike TV captioning, which is highly standardized and optimized for readability, there is no standard defined for online video captions. Most web captions adhere to outdated TV captioning standards that suffered from the technical constraints of 1970s technology, such as limited options for typeface, size, and color (Kushalnagar et al., 2013). Social media applications like TikTok also suffer from a lack of universal standards in their captions, and they could benefit from user-generated captioning standards that define capitalization, typeface, color, caption rate and placement, punctuation, and speaker identification (McDonnell et al., 2024). Another issue arises when audio and visual content is paired but shows separate information, for example, a verbal lecture accompanied by text on slides (Lasecki et al., 2014). Most learners read captions at a slower pace than they listen to verbal content (Jensema, 1998). However, DHH learners tend to have a significantly slower reading pace than their hearing peers (Tyler et al., 2009). The combined effect is that caption readers, i.e., DHH learners, fall behind in processing audio information when they focus on reading the slide, as they cannot read from both sources (the text on the slide and the caption on the screen) without missing information. In contrast, their hearing counterparts can study the slide while listening to the audio. It is a well-studied phenomenon that DHH learners suffer from visual split-attention in multimedia learning; thus, it is important to design accessible learning videos that also mitigate this unique problem for DHH learners (Mather and Clark, 2012).

2.2. Visual Abilities of DHH vs Hearing Learners

Another important consideration is that DHH learners often exhibit knowledge, conceptual organization, and cognitive strategies different from hearing learners, which may put them at an academic disadvantage in typical classrooms that are not designed to accommodate this variability (Marschark and Hauser, 2008). DHH and hearing individuals show slight differences in their visual abilities as well, suggesting that auditory deprivation leads to certain visual enhancements. Studies have shown that DHH participants respond faster to peripheral stimuli (Hong Lore and Song, 1991; Chen et al., 2006), indicating that they focus more of their attention on the periphery. This finding was reinforced in a study by Proksch and Bavelier (Proksch and Bavelier, 2002) where DHH and hearing participants attempted to complete a task with distractors in the central and peripheral fields. As expected, hearing individuals were more distracted by central distractors, and DHH individuals were more distracted by peripheral distractors. This enhanced peripheral processing has previously been interpreted as greater distractibility in DHH people, but it may be better explained by a difference in the allocation of visual attention between DHH and hearing people. As an adaptation to their environment, DHH individuals’ reallocation of attention to the visual periphery makes sense considering the lack of peripheral auditory cues available to them (Dye et al., 2008). Recognizing these differences in visual perception is helpful not only in practical contexts, such as designing classrooms and educational content, but also theoretically, as it potentially explains the gap in learning outcomes and academic achievement between DHH and hearing students (Hauser and Marschark, 2008).
Efforts to make visually oriented classrooms that aid DHH students’ learning have considered the physical layout of educational settings, such as (1) making classrooms square in shape rather than rectangular, (2) seating a limited number of students rather than many, and (3) arranging seating in a semicircle shape rather than several rows so that visual contact can be made with everyone in the room (Mather and Clark, 2012).

Studies have also found that DHH and hearing people differ in motion perception, such that DHH people show activations in the auditory cortex when presented with purely visual motion stimuli (Fine et al., 2005). Additionally, DHH individuals show greater recruitment of motion-selective visual areas than hearing individuals (Bavelier et al., 2001). Another study reported that DHH participants responded faster and more accurately to small differences in the direction of motion (Hauthal et al., 2013). When designing educational video content, these enhancements in peripheral processing and motion perception should be considered to ensure that important information is accessible without distractions. While hearing people may be able to comprehend rich visual backgrounds and motion-heavy content without much difficulty, it may be a more taxing process for DHH people. In particular, content in mobile contexts (e.g., while walking) needs to be carefully designed to be portable and adaptable to changing contexts, as there are more attentional demands and distractions compared to stationary contexts (e.g., lectures) (Jain et al., 2018).

2.3. Improving Multimedia Learning Experience for DHH Learners

Multimedia can be used to represent concepts in multiple ways through text, images, videos, sound, and animation. While multimedia learning content is largely beneficial, it must be carefully designed to avoid cognitive overload, which occurs when the processing demands of the learning task are greater than the processing capacity of the human information-processing system (Mayer and Moreno, 2003). Mayer’s cognitive theory of multimedia learning is based on three principles relevant to designing multimedia content: humans use dual channels to process visual and auditory/verbal information (i.e., dual-channel assumption), each channel has a limited capacity of information it can process (i.e., limited-capacity assumption), and active processing involves carrying out a series of cognitive processes during learning (i.e., active processing assumption) (Mayer, 2014).

The current cognitive theory of multimedia content focuses on the cognitive processes during learning (e.g., selecting, organizing, and integrating), while future directions involve integrating learning components that go beyond basic cognitive processes and show an increase and refinement in evidence-based design principles (Mayer, 2024). Some principles for designing effective instructional videos include signaling (highlighting key content), coherence (avoiding extraneous material), and segmenting (breaking complex content into progressively presented parts) (Mayer, 2021). Multimedia content should therefore aim to minimize the cognitive demands on each processing pathway to optimize learning. Three types of cognitive demands can be optimized, and our manuscript focuses on addressing them: (1) essential processing, i.e., cognitive processes that allow a mental representation to be held in working memory for a period of time; (2) extraneous processing, i.e., cognitive processes that are not required for making sense of the presented material but occur due to the design of the learning task; and (3) generative processing, i.e., cognitive processes that are required for making sense of the presented material (selecting, organizing, and integrating words and images).

Research has shown that the use of information and communication technology increases DHH learners’ learning and level of understanding, particularly when it is designed for their specific needs (Debevc and Peljhan, 2004). However, it is not clear how multimedia content should be designed to support DHH learners. While there is extensive literature addressing issues of cognitive overload in multimedia design, little research has been conducted with and for DHH learners (Hidayat et al., 2017), as designers must account for the lack of the auditory/verbal channel that theories of multimedia design are based on (Techaraungrong et al., 2017). A study by de Lacerda Pataca et al. (de Lacerda Pataca et al., 2024) found that pairing font color with font weight produces low cognitive load in affective captioning; however, DHH learners did not reach a consensus that this was easy to read and intuitive, and preferences varied from person to person. DHH learners also vary in their preferences for interpreter placement and font size, indicating that customizability and personalization of features are necessary for optimal accessibility (Boudreault et al., 2024).

Additionally, accessible multimedia that includes a visual representation of audio content (i.e., sign language interpreters or captions) is prone to cognitive overload and explains why DHH learners get less out of lectures than their hearing peers (Lang, 2002). While seeing an instructor’s onscreen image may engage more in-depth cognitive processing and subsequently improve learning for some, others may find the image to be distracting and therefore hinder learning (Kizilcec et al., 2015). A study by Gu et al. (Gu et al., 2024) found a benefit-cost tradeoff where multimedia learning is only favorable when the cognitive benefits of an onscreen instructor’s social presence outweigh the cognitive costs of attentional distractions that are unrelated to the content. This is in accordance with the seductive details principle, which asserts that people do not learn better when interesting but extraneous details are added (Mayer et al., 2020). In fact, another study reported that when viewing multimedia content with dynamic visuals and written text, DHH learners spent more time looking at the text and largely ignored visualizations (Schmidt-Weigand et al., 2010). The difference in learning outcomes between hearing and DHH learners is largely because hearing learners can simultaneously attend to verbal and visual content using separate modal senses, while DHH learners are unable to do so. They must instead switch their focus between the visual representation of the audio content and the instructor’s visual focus, which is usually the slides or whiteboard (Marschark et al., 2005).

3. Phase 1: Identify Video Lecture Delivery Challenges and Propose Motion Design Principles (RQ1)

In Phase 1, we sought to understand the video lecture delivery challenges DHH individuals face when consuming video created for a mainstream audience, treating video learning as a multimedia learning experience rather than limiting the focus to caption design.

In our manuscript, “mainstream video” refers to video created for a general audience. The term “mainstream” is employed analogously to its use in educational contexts, where it denotes the practice of integrating learners with special education needs into general education classrooms based on their abilities (Lindsay, 2007). The research team identified five key challenges associated with making mainstream videos accessible to DHH learners and proposed five corresponding motion design principles. In addition, the research team and most participants agreed that the term “video learning challenge” was inadequate, as it placed emphasis on the DHH learners’ process rather than the shortcomings of the video material. Therefore, we prefer the term “video lecture delivery challenges” to more accurately reflect the source of the difficulties.

3.1. Positionality for Phase 1

The research team consisted of researchers with different hearing abilities and backgrounds. The first author, a hearing researcher with beginner ASL ability, led this phase of the study. A Deaf co-author and three hearing co-authors participated in this phase of the study. The study was supervised by two hearing faculty members, including one with over 30 years of experience in college-level DHH education.

3.2. Study Details and Procedure

Six DHH college students (five Deaf and one Hard-of-Hearing) and three instructors who were experienced in DHH education (one Deaf and two hearing) participated in this phase. The student participants were majoring in design, business, education, and computer science. Video lecture delivery challenges and suggestions were collected asynchronously: participants filled out a spreadsheet provided to them. Participants took between two and four hours in total to complete the task. Each student participant was compensated $25 per hour for their time, whereas the three instructors participated voluntarily.

During the study, participants were asked to watch an AR video on Coursera created for mainstream learners, identify video lecture delivery challenges, and describe the corresponding changes they wished to make. This mainstream educational video was the first lesson in a series of AR classes. It covered the history of AR, widely known applications of AR, and major technical areas of AR. The video presentation style alternated between a talking head and full-screen visual examples, and sometimes included both simultaneously. (A talking-head video shows someone speaking directly to the camera, shot so that the viewer feels the speaker is talking to them face to face; it is a common video lecture production style (Guo et al., 2014).) This is a mainstream video developed for hearing learners, with only spoken English used by the instructors and no ASL interpreters present. The video was selected because it had the most reviews (average score 4.5 out of 5 from 3.7k learners) on Coursera on the AR topic and was marked beginner-level. In other words, it was perceived as helpful and well-designed as a mainstream video. The research team added error-free open captions to the video. We decided to include only one video to manage the overall duration of the study session given the availability of our participants, and we acknowledge that this single video may not fully capture the video lecture delivery challenges that exist in other educational videos.

3.2.1. Step 1: Identifying Video Lecture Delivery Challenges and Suggestions for DHH learners

First, participants were asked to identify video lecture delivery challenges and suggestions for the 15-minute mainstream video. The researcher explained to the participants that this AR video was an example of a mainstream video, and participants were encouraged to recall their previous experience watching mainstream videos in order to focus on challenges they had frequently encountered.

In the spreadsheet, each row was a video lecture delivery challenge-suggestion pair. Each pair had one video lecture delivery suggestion addressing one or more challenges from the video. Within each row, participants took screenshots from at least one segment of the AR video as an example of the challenge and used text or annotation to present their suggestion ideas. Participants were asked to provide as many challenge-suggestion pairs as they could. In total, we gathered 105 challenge-suggestion pairs from our nine participants.

3.2.2. Step 2: Explaining How Design Suggestions Address DHH Learners’ Needs

Next, participants were asked to explain each suggestion using open-ended comments and to indicate how it would improve understanding the video for DHH learners by selecting checkboxes on the spreadsheet. The checkboxes were mainly used to help the participants understand the purpose of our study: empowering DHH learners’ visual abilities and addressing multimedia learning cognitive demands.

There were seven checkboxes in total. The first three covered the following visual abilities, which have been studied in (Bell et al., 2019): Spatial Vision: Understanding where things are, how they are arranged, and how they look; Temporal Vision: The ability to perceive and process visual information over time; Motion Vision: Noticing and understanding movement, compiling both spatial and temporal information. The other four covered cognitive demands according to multimedia learning theory (Mayer, 2002): Reducing Irrelevant Information; Focusing on Essential Information; Fostering Connection between Text and Image; and “Others,” which could be selected if none of the three demands applied. Mayer’s multimedia learning theory describes three types of cognitive demands, as explained in Related Works: essential processing, extraneous processing, and generative processing. Our checkboxes are simplified versions of the three demands, revised based on the reading preferences of DHH learners in consultation with Deaf research team members.

3.2.3. Analysis

The research team conducted a thematic analysis of the video lecture delivery challenges and suggestions and, after reading through the data collected in Step 1, arrived at five agreed-upon challenge-suggestion pairs. While developing the synthesized challenges and suggestions through discussion, the research team segmented the original 15-minute video into 15 clips based on natural pauses in the video content. Each video clip lasted between 50 and 65 seconds (M = 55, SD = 9).

For video lecture delivery challenges, the lead author first analyzed the data collected in Step 1 and proposed an initial list of three themes: visuals and audio/captions not being temporally aligned, too much visual information at the same time, and visual content irrelevant to the audio/captions. Then, other members of the research team individually selected the most obvious challenge from the initial list. The initial agreement was 66.7% (10/15). The research team further discussed and refined the themes until full agreement was reached. A total of five challenges were identified.
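The agreement figure above is a simple proportion. As a minimal sketch, assuming agreement was counted per clip and that a clip counts as agreed only when every coder picked the same theme (the text does not spell out the exact procedure), it could be computed as follows. The coder labels and the `full_agreement_rate` helper are hypothetical and were chosen only to reproduce the 10/15 ratio.

```python
def full_agreement_rate(codes_per_clip):
    """Fraction of clips on which every coder chose the same theme."""
    if not codes_per_clip:
        raise ValueError("need at least one clip")
    agreed = sum(1 for codes in codes_per_clip if len(set(codes)) == 1)
    return agreed / len(codes_per_clip)


# Hypothetical codes from three coders for 15 clips (not the study's data).
example_codes = (
    [("sync", "sync", "sync")] * 4
    + [("overload", "overload", "overload")] * 3
    + [("relevance", "relevance", "relevance")] * 3
    + [("sync", "overload", "sync")] * 5  # coders disagree on these clips
)

rate = full_agreement_rate(example_codes)
print(f"{rate:.1%}")  # prints 66.7% (10 of 15 clips agreed)
```

Raw percent agreement does not correct for chance; it is used here only because it matches the figure reported in the text.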

To develop design principles for the identified challenges, the research team systematically mapped each design challenge to the 105 video lecture delivery challenge-suggestion pairs collected from the nine participants. These pairs served as qualitative data, and each could be linked to more than one of the five summarized design challenges. The team then discussed and refined the design principles to include as many relevant suggestions as possible for each challenge until full agreement was reached among the team members.

| Design Principle | Challenges in Video Lecture Delivery | Suggestions | Changes to Make (by video lecture creator) |
| --- | --- | --- | --- |
| D-Illustrate: Improve Visual-Audio Relevance | Lack relevant visual elements to explain the captions. | Replace such visuals with illustrative visuals for the captions, including a “talking-head” (Guo et al., 2014). The new content should not be more visually demanding than the old one. | Replace old visuals with new ones that illustrate the captions to reduce the gap between the semantic meaning of the visuals and the audio. No change to the audio. |
| D-Guide: Guide Visual Attention Switch | Lack focus on visual elements: important visual elements are displayed all at once, competing for attention (cause of visual split-attention) but the caption can only focus on one element at a time. | Overlay shades, shapes, colors, and animation effects to support users’ comprehension via effective visual attention guidance when the content is covered in the captions. | Identify the important referred object in the visual content and its sequence in the audio (depending on video content and auditory cues conveyed by the speaker), locate it on the visual space, and add the overlays sequentially. Could add pauses to the audio to match the visual. |
| D-Sync: Sync Visual to Audio | Lack synchronization between visuals and audio/caption, causing confusion. | Temporally align the visuals with audio/caption so visuals show up when they are mentioned in the audio. | Detect audio timeline and relocate visual content accordingly. No change to the audio. |
| D-Declutter: Centralize Essential Text | Lack visual support: screen filled with chunks of repeated text in captions, increasing unnecessary reading load and causing anxiety. | Visually center and summarize the text block to make the key information stand out. Remove/reduce lengthy irrelevant visual information accordingly, such as “talking head.” | Identify long and duplicated texts on the screen and in the captions, extract essential information/terms with visual meanings, re-design the visual typographic of captions to convey the meaning visually. No change to the audio. |
| D-Slowdown: Slow Visuals Down | Lack time to comprehend the video content: rapid visual movements hinder learners from reading captions and causing anxiety and distraction. | Slow the visual movements and extend the on-screen time so there is enough time to see/read. | Extract fast-moving visuals and extend their play time on screen. Add audio pauses if needed. Visual and audio may both be relocated, and video might be longer. |

Table 1. RQ1 findings: Motion Design Principles and Challenges.
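
To make the D-Sync row more concrete, the following is a minimal sketch of "detect audio timeline and relocate visual content accordingly." It is an illustration under stated assumptions, not the procedure the research team used to edit the clips: it assumes timed caption cues are available and that each visual has been manually annotated with the term it illustrates. The `Cue` and `Visual` types, the `keyword` annotation, the `sync_visuals` helper, and the sample timings are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cue:
    start: float  # seconds from the beginning of the clip
    end: float
    text: str


@dataclass(frozen=True)
class Visual:
    name: str
    keyword: str  # term the visual illustrates (hypothetical annotation)
    start: float  # current on-screen start time, in seconds
    duration: float


def sync_visuals(visuals, cues):
    """Move each visual so it appears when its keyword is first captioned."""
    synced = []
    for visual in visuals:
        mention = next(
            (cue for cue in cues if visual.keyword.lower() in cue.text.lower()),
            None,
        )
        # Leave the visual where it is if the captions never mention it.
        synced.append(replace(visual, start=mention.start) if mention else visual)
    return synced


# Hypothetical cues and visuals; the audio/caption timeline is left unchanged.
example_cues = [
    Cue(0.0, 4.0, "First, a short history of AR."),
    Cue(4.0, 9.0, "A head-mounted display overlays graphics on the scene."),
]
example_visuals = [Visual("hmd_photo", "head-mounted display", 1.0, 5.0)]

synced_visuals = sync_visuals(example_visuals, example_cues)
print(synced_visuals[0].start)  # the photo now starts with the second cue
```

Only the visual's start time moves, which mirrors the table's note that D-Sync involves no change to the audio. Simple substring matching is a stand-in; a real editing workflow would likely rely on human judgment about which visual belongs to which utterance.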

3.3. Finding: Five Video Lecture Delivery Challenges and Suggestions

The five video lecture delivery challenges are presented in Table 1, columns 1 and 2. For each of the five identified challenges, the corresponding Motion design principles are presented in Table 1, columns 3 and 4. The Motion design is for delivering multimedia content more efficiently along the timeline. No changes to audio or caption content were suggested by our participants, as they identified it as important lecture material. The design principles are ranked by how frequently the corresponding video lecture delivery challenge occurred in the 105 challenge-suggestion pairs collected from our participants. Note that this ranking does not reflect the importance or significance of the design principles themselves. Since a challenge-suggestion pair collected from participants may be associated with more than one of the five design principles, we do not report the exact number of pairs per principle, to avoid confusion.

Figure 1. Original vs Revised Video Pairs used in RQ2 Following Five Motion Design Principles Identified in RQ1. We did not make changes to the audio/caption content.

This figure describes the five video examples used in RQ2: original vs revised for D1 to D5. There are two rows in this image: design suggestion and sample video used in Study 2. The sample video row has two sub-rows, original videos and revised videos, and each pair has a short description. For D-Illustrate: improve visual-audio relevance, the two arrows marked “OR” extend from the triangle, suggesting alternative points for the visual content to align with the audio. In the original video, a person is shown speaking, and in the revised version, an animation relevant to the spoken content is added, helping to illustrate the discussed concepts. The description is: Show Image with no emphasis vs. Overlay the relevant part with flashing highlight. For D-Guide: guide visual attention switch, the original video uses an illustration of a person using a device, whereas the revised video employs flashing highlights on the device to draw attention to the specific parts being discussed, which can help viewers engage and understand. The description is: Show no animation vs. Show animation synced to audio / caption. For D-Sync: sync visual to audio, the rectangles overlap, indicating simultaneous audio and visual content. The original video shows all keyboard diagrams at once, while the revised version displays these components sequentially, making it easier for viewers to follow along with the narration. The description is: Show chunked text vs. Show centered text with typographic. For D-Declutter: centralize essential text, the original video shows a presenter with a text-heavy background. The revised format reorganizes this text with different colors and shadows and makes the text more digestible. The description is: Show chunked text vs. Show visual meanings with typography. For D-Slowdown: slow visuals down, the original video features a quick display of the video content, while the revised version extends the display time from 10 seconds to 12 seconds. There are arrows pointing from the beginning frame to the end frame of the clip, representing the progress of the video. The description is: Slow down the speed of visual animation to make it easier to understand.

3.4. Implication and Implementation for RQ2 Study

Following Table 1, the research team revised the same 15 clips from the 15-minute video accordingly. There are five video clip pairs for D-Illustrate, three video clip pairs for D-Guide, one video clip pair for D-Sync, three video clip pairs for D-Declutter, and three video clip pairs for D-Slowdown. Examples are shown in Fig. 1. The number of changes largely depends on the challenges present in the original video, so we were unable to distribute the video clips evenly across the design principles. Additionally, we do not claim that our design principles sufficiently summarize all challenges in Motion design or indicate whether one video lecture delivery challenge is more common than another. These are problems that require further research.

The revisions to the original video were made manually using professional video editing software. We explored the feasibility of using AI tools to make these changes but found them insufficient, leading to the decision to proceed with manual edits. More reflections on the AI tools we tried are presented in the discussion section.

For the study design purposes of RQ2, only one design principle was applied to each clip. Based on our findings, three of the 15 clips appeared to have more than one video lecture delivery challenge. Researchers chose to focus on the most significant challenge for these clips. To minimize the impact of secondary challenges, the researchers first addressed the secondary challenge to create a revised “original” version of each video. Then, they made further modifications to address the primary challenge for comparison in the study. The research team recreated these three video clips to closely resemble the original online videos, although some slight differences, such as font and color shading, remained. In summary, 12 out of 15 “original” videos used in the study were the exact versions available online, while 3 out of 15 included some manual edits for study design comparison purposes. When we refer to the “original” version in RQ2, we mean the video that contains only the one video lecture delivery challenge that the revised version was designed to improve.

4. Phase 2: Perceived Value of Five Design Principles (RQ2)

4.1. Method

In Phase 2 of the study, we conducted user studies (n=16) with DHH learners to review the perceived value of the five design principles identified in Phase 1. Participants are referred to as V1 to V16 in the remainder of this paper. Each study session took between 1.5 and 2 hours. We compensated participants at a rate of $25 per hour. This study was approved by the university’s Institutional Review Board (IRB).

Figure 2. Study Procedure for RQ2. During onboarding, participants were introduced to the five design principles described in Table 1. Then, they watched 15 video pairs. Each pair of videos consisted of an original (unedited) version and a revised (edited) version with one of the five design principles applied. For each pair of videos, participants watched both versions in a randomized order (either the original version first or the revised version first, as demonstrated by two examples in the figure), and then completed a four-question survey developed from the NASA Task Load Index for each version they watched (TLX questions). Then, they answered three additional survey questions on learning cognitive demand scores (LCD questions) comparing the two versions. The same survey questions were used for all 15 video pairs. Participants were given at least one 5-minute break during video rating. Then, participants took part in a brief interview to discuss their suggestions for the design principles they experienced.

This figure introduces the study process. It is built with two main lines. In the first line, there are three main grey boxes showing three steps of the study procedure. The first box is “Onboarding (10 minutes).” The top of the second box reads D-Sync, D-Slowdown, D-Illustrate, D-Declutter, and D-Guide Applied to Video Pair 1 to 15 (60 minutes = 15*4 minutes). There are two white boxes representing Video Pair 1 (D-Illustrate) and Video Pair 15 (D-Declutter) on two sides of this box, with … in between. The white box for Video Pair 1 is expanded in the second line with the following: screenshot of the original video, text “NASA Task Load (TLX) Questions,” screenshot of revised video, text “NASA Task Load (TLX) Questions,” text “Learning Cognitive Demand (LCD) Questions.” The white box for Video Pair 15 is expanded in the second line with the following: screenshot of the revised video, text “NASA Task Load (TLX) Questions,” screenshot of original video, text “NASA Task Load (TLX) Questions,” text “Learning Cognitive Demand (LCD) Questions.” The third box is “interview (30 minutes).”

4.1.1. Positionality for Phase 2

The same research team members from Phase 1 worked on Phase 2. The first author designed, conducted, and analyzed the study. A Deaf co-author conducted 12 study sessions, whereas the remaining sessions were led by the first author with ASL interpreters for communication with participants. All study sessions were supervised by the hearing faculty with experience in DHH education. The study data was mainly analyzed by the first author and two other hearing co-authors with the assistance of the Deaf co-author. The overall study was supervised by the other hearing faculty.

4.1.2. Participants

We recruited 16 DHH participants from a university in the U.S. that specializes in DHH education. None of the participants in Phase 2 were involved in previous phases of the study. Participants were aged between 20 and 49. Out of all participants, 14 self-identified as Deaf, whereas the other two self-identified as hard-of-hearing (HoH). Ten identified as male, while the other six identified as female. As for ethnicity, six identified as White, five as African American, three as Asian, one as Hispanic, and one did not disclose. Eleven participants were majoring in business or accounting, four in information technology-related majors, and one in art history. Nine participants reported sign language (e.g., ASL) as their first language for face-to-face communication, whereas the other seven reported spoken language (e.g., English) as their first language.

4.1.3. Study Procedure

We split the same 15-minute mainstream educational video on AR technology into 15 segments based on annotations in Phase 1. Each video clip was then revised based on one of the design principles in Table 1. All studies were conducted through Zoom. After completing the consent form, participants were first introduced to the five design principles.

Original vs. Revised Video Rating. For each video pair, we presented two versions, original (before edit) and revised (after edit), in a randomized order. However, the video pairs themselves were always presented in the order in which they appear in the original video, because the original video presents knowledge progressively and shuffling the clips could cause confusion. We described the five design principles as ways future technology could be used to improve a mainstream educational video. Participants were told which one of the five design principles was applied to the revised video clip, but they were not informed which version was the revised clip. After watching each clip, they were asked to complete the following survey questions:

  • NASA Task Load Index (TLX) Questions: For each version of the video clip, participants completed four questions on how that version made them feel in terms of Mental Demand, Physical Demand, Temporal Pressure, and Learning Satisfaction. These four questions were developed based on the NASA Task Load Index (TLX) questions, which have been used to understand DHH individuals’ interaction with technology, e.g., (Li et al., 2022; Dust et al., 2023; Chen et al., 2024a). Participants responded on a 10-point Likert scale, where 1 represents low demand / pressure / satisfaction, and 10 represents high demand / pressure / satisfaction. A total of eight questions were completed per video pair.

  • Learning Cognitive Demands (LCD) Questions: For each video pair, participants were asked to compare the original and revised versions and rate how the revised version helped them with three cognitive demands in multimedia learning theory (Mayer, 2002): Reducing Irrelevant Information, Focusing on Essential Information, and Fostering Connection between Text and Image. Participants could choose from Strongly Disagree, Disagree, Slightly Disagree, Neutral, Slightly Agree, Agree, Strongly Agree. A total of three questions were completed per video pair.
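The presentation scheme described above (fixed clip order, randomized version order within each pair) can be sketched as follows. This is only an illustration: the paper does not state how the randomization was implemented, and the function name `build_schedule` and the `seed` parameter are hypothetical.

```python
import random

def build_schedule(num_pairs=15, seed=None):
    """Return one participant's viewing schedule.

    Video pairs stay in their original lecture order (1..num_pairs);
    only the order of the two versions inside each pair is randomized.
    """
    rng = random.Random(seed)  # seed is a hypothetical knob for reproducibility
    schedule = []
    for pair_id in range(1, num_pairs + 1):
        versions = ["original", "revised"]
        rng.shuffle(versions)  # randomize which version is shown first
        schedule.append((pair_id, tuple(versions)))
    return schedule

if __name__ == "__main__":
    for pair_id, order in build_schedule(seed=1):
        print(pair_id, order)
```

Keeping the pair order fixed preserves the progressive structure of the lecture, while shuffling within a pair helps avoid participants always seeing the original version first.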

To avoid participants being fatigued with watching video clips, we included a 5-minute break after watching 11 video clips, when all five design principles were experienced by participants at least once and around 1 hour into the study session. After the break, the researcher had a brief interview with the participants to revisit the five design principles for suggestions and potential additional designs that should be introduced. Then, they continued to watch the remaining 6 video clips that represented four of five design principles. Participants were also allowed to take additional breaks if they wished to during this session. After they finished rating all video clips, they were asked to complete a demographics survey and move on to the interview.

Interview. During the interview, we focused on understanding participants’ feelings toward the design principles of the video, as well as their suggestions for further improving the video clips they watched. We also asked them to envision how future technology could automate the video-editing process. Sample questions include: How would you like the video to be further enhanced? Do you think you would directly apply the design principles to improve accessibility of a video? Do you think all DHH learners would be interested in using the design principles? Why? Participants completed the interview in either ASL or spoken English based on their preferences. The interview recordings were transcribed into de-identified transcripts in written English, and the recordings were destroyed immediately afterwards.

4.1.4. Data Analysis

For survey results, we analyzed the LCD questions and TLX questions separately. As mentioned in the previous section, only one design principle was applied to each video clip. For LCD questions, we converted the agreement scores (Strongly Disagree, Disagree, Slightly Disagree, Neutral, Slightly Agree, Agree, Strongly Agree) to 1 to 7, respectively. We primarily built linear mixed models to analyze the survey results, as they best fit the distribution of our survey data and accounted for the different numbers of video clips per design principle in our study. For all models, we included a random effect for Participant ID (“1/PID”) to account for individual differences that were not explained by the fixed effects in the model (Chang et al., 2023). All significant statistical results presented had a power of at least 0.80. Post-hoc analysis was conducted for all models. The detailed models we built are described in the Findings section below where relevant.
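As a minimal sketch of the score conversion described above, the seven agreement labels can be mapped to the integers 1 to 7 before modeling. The names `LCD_SCALE`, `LCD_SCORE`, and `to_scores` are hypothetical and are not taken from the authors’ analysis code.

```python
# Agreement labels in ascending order, as listed in the text.
LCD_SCALE = [
    "Strongly Disagree",
    "Disagree",
    "Slightly Disagree",
    "Neutral",
    "Slightly Agree",
    "Agree",
    "Strongly Agree",
]

# Map each label to its numeric score (1 = Strongly Disagree ... 7 = Strongly Agree).
LCD_SCORE = {label: score for score, label in enumerate(LCD_SCALE, start=1)}

def to_scores(responses):
    """Convert a list of agreement labels to numeric scores; unknown labels raise KeyError."""
    return [LCD_SCORE[label] for label in responses]

if __name__ == "__main__":
    print(to_scores(["Slightly Agree", "Neutral", "Strongly Agree"]))
```

Under this mapping, a mean score of 5 corresponds to Slightly Agree and 4 to Neutral.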

For interview transcripts, we conducted an inductive thematic analysis (Braun and Clarke, 2006). Two hearing co-authors independently open-coded two of the transcripts and then discussed their codes to create an initial codebook from scratch. Then, they analyzed the remaining transcripts independently. The hearing co-authors regularly discussed with the Deaf co-author and hearing faculty with DHH education experience during the data analysis process to ensure a comprehensive analysis.

4.2. Finding-Perceived Value of Design Principles

4.2.1. The Value of Motion Design for Multimedia Learning

First, to evaluate the perceived value of the proposed Motion design in addressing the three LCD Questions, a linear mixed model was built to compare the agreement score across three demands collected via survey: reducing irrelevant information, focusing on essential information, and fostering connections between text and images. For each cognitive demand, each participant provided one response after watching each video pair, resulting in a total of 15 responses per participant for 15 video pairs. The violin plots showing the distribution of survey results for each design principle are shown in Fig 3.

Figure 3. Survey results of using five design principles to address multimedia learning cognitive demands scales (LCD Questions), shown in a violin plot. Video clips for each design principle were aggregated. A post-hoc analysis on linear mixed models for each question suggested that for Focusing on Essential Information, there is a significant difference between D-Illustrate and D-Slowdown. For Fostering Connection between Text and Image, there is a significant difference between D-Illustrate and D-Declutter. The ** denotes p<.01.

This figure is a violin plot with the title “Survey Results of Using five Design Principles to Address Learning Cognitive Demands.” The vertical axis includes Strongly Agree, Neutral, and Strongly Disagree from top to bottom; the horizontal axis includes Reducing Irrelevant Information, Focusing on Necessary Information, and Fostering Connection Between Text and Image. For each label on the x-axis, there are five violin plots representing five design principles: D-Illustrate, D-Guide, D-Sync, D-Declutter, D-Slowdown. For Reducing Irrelevant Information, the distribution is similar across five design principles. For Focusing on Necessary Information, the distributions of D-Illustrate, D-Guide, D-Sync, and D-Declutter are skewed toward Strongly Agree, whereas D-Slowdown is evenly distributed. There is a line between D-Illustrate and D-Slowdown with “**” representing a significant difference between the two. For Fostering Connection Between Text and Image, the distributions of D-Illustrate, D-Guide, and D-Sync are skewed toward Strongly Agree, whereas D-Declutter and D-Slowdown are evenly distributed. There is a line between D-Illustrate and D-Declutter with “**” representing a significant difference between the two.

Overall, participants rated Slightly Agree on LCD questions across all video clips and scales (Mean = 5.0, SD = 1.42, N=720). Participants rated higher agreement for Focusing on Essential Information (Mean = 5.3, SD = 1.32, N=240) and Fostering Connection between Text and Image (Mean = 5.1, SD = 1.47, N=240), compared to Reducing Irrelevant Information (Mean = 4.6, SD = 1.40, N=240). Post-hoc analysis on a linear mixed model across three scales showed that there are significant differences between Focusing on Essential Information and Reducing Irrelevant Information (t(798) = 5.809, p<0.001), and between Fostering Connection between Text and Image and Reducing Irrelevant Information (t(798) = 4.225, p<0.001). The results remained significant after applying Bonferroni correction.
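The response counts above follow from the study design (16 participants, 15 video pairs, 3 LCD scales), and a Bonferroni correction multiplies each p-value by the number of comparisons, capped at 1. The sketch below illustrates both; the function names are hypothetical, and the example p-values are made up for illustration and are not the study’s values.

```python
PARTICIPANTS = 16
VIDEO_PAIRS = 15
LCD_SCALES = 3

def response_counts(participants=PARTICIPANTS, pairs=VIDEO_PAIRS, scales=LCD_SCALES):
    """Return (responses per scale, total responses across all scales)."""
    per_scale = participants * pairs
    return per_scale, per_scale * scales

def bonferroni(p_values):
    """Bonferroni-adjust p-values: multiply by the number of tests, cap at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

if __name__ == "__main__":
    print(response_counts())  # per-scale and total LCD response counts
    # Hypothetical p-values for three pairwise comparisons among the three scales.
    print(bonferroni([0.0004, 0.0009, 0.2]))
```

A comparison reported at p<0.001 stays below the conventional .05 threshold after being multiplied by three, which is consistent with the statement that the results remained significant.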

4.2.2. D-Illustrate: Addressing Learning Cognitive Demands

To compare the effects of the five design principles of Motion design, three linear mixed models—one for each of the three cognitive demands—were built to understand how each design principle addressed these demands. The number of responses varied depending on the design principles applied, ranging from one to five responses per participant. The results are presented in Fig 3.

D-Illustrate was rated as the most valuable in addressing potential learning processes. Post-hoc analysis on the three linear mixed models suggested that there were significant differences in agreement for Focusing on Essential Information between D-Illustrate (M = 5.5, SD = 1.15) and D-Slowdown (M = 4.8, SD = 1.26) (t(252)=3.34, p<.01). There were also significant differences in agreement for Fostering Connection between Text and Image between D-Illustrate (M = 5.5, SD = 1.38) and D-Declutter (M = 4.7, SD = 1.57) (t(252)=3.33, p<.01). There was no change in the significant results after applying Bonferroni correction.

4.2.3. D-Illustrate and D-Guide: Reducing Workload and Increasing Satisfaction

For TLX questions, the influence of each of the five design principles was modeled. Each participant made two responses, one for the original video and one for the revised video, for each video pair per question. The total number of responses per design principle varies depending on the number of video clips applied to each design principle. For example, in the case of the mental demand for design principle D-Slowdown (which was applied to three clips), each participant provided three ratings for the original video’s mental demand and three ratings for the revised video’s mental demand. A total of 16 linear mixed models were built for four design principles (D-Illustrate, D-Guide, D-Declutter, D-Slowdown) to compare the responses to the original video and the revised video for each pair of survey ratings across four TLX questions. For D-Sync, which only had one video clip applied and thus did not follow the data distribution for linear mixed models, four Wilcoxon signed-rank tests were performed between survey ratings on original and revised videos across four TLX questions. The results are presented in Table 2.
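For the single D-Sync clip, each participant contributes one original/revised rating pair per TLX question, which is the paired setting the Wilcoxon signed-rank test addresses. The sketch below computes only the signed-rank sums in plain Python, under the common convention of dropping zero differences and averaging tied ranks; it omits the p-value, which a statistics package such as SciPy (`scipy.stats.wilcoxon`) would provide. The function names and the example ratings are hypothetical and are not the study’s data.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def signed_rank_sums(original, revised):
    """Return (W_plus, W_minus) for paired ratings, dropping zero differences."""
    diffs = [r - o for o, r in zip(original, revised) if r != o]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(rank for d, rank in zip(diffs, ranks) if d > 0)
    w_minus = sum(rank for d, rank in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus

if __name__ == "__main__":
    # Hypothetical ratings from five participants (not the study's data).
    original = [3, 2, 5, 4, 6]
    revised = [2, 2, 3, 5, 3]
    print(signed_rank_sums(original, revised))
```

The test statistic is usually taken as the smaller of the two sums; a small value suggests that ratings shifted consistently in one direction between the two versions.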

D-Illustrate and D-Guide were the design principles with higher rating scores. D-Illustrate showed promise in all four questions: significantly reducing mental demand (t(15)=-2.97, p<.01), significantly reducing temporal pressure (t(15)=-2.68, p<.05), significantly reducing physical demand (t(15)=-2.50, p<.05), and significantly increasing learning satisfaction (t(15)=3.19, p<.01). D-Guide showed promise in three questions: significantly reducing temporal pressure (t(15)=-2.45, p<.05), significantly reducing physical demand (t(15)=-2.18, p<.05), and significantly increasing learning satisfaction (t(15)=2.77, p<.05). D-Declutter showed promise in significantly increasing learning satisfaction (t(15)=2.61, p<.05). There was no change in the significant results after applying Bonferroni correction.

| Design Principle | Mental Demand, Original Mean (SD) | Mental Demand, Revised Mean (SD) | Physical Demand, Original Mean (SD) | Physical Demand, Revised Mean (SD) | Temporal Pressure, Original Mean (SD) | Temporal Pressure, Revised Mean (SD) | Learning Satisfaction, Original Mean (SD) | Learning Satisfaction, Revised Mean (SD) |
| D-Illustrate | 3.09 (1.88) | 2.39 (1.72) ** | 3.01 (1.83) | 2.31 (1.92) * | 2.96 (1.81) | 2.28 (1.45) * | 7.75 (1.83) | 8.90 (1.96) ** |
| D-Guide | 3.08 (1.96) | 2.42 (1.72) | 3.02 (2.10) | 2.30 (1.70) * | 3.00 (1.80) | 2.22 (1.49) * | 7.92 (1.86) | 8.83 (1.77) * |
| D-Sync | 2.44 (1.69) | 2.44 (1.84) | 2.38 (1.90) | 2.94 (2.28) | 2.38 (1.32) | 2.88 (2.20) | 8.63 (2.09) | 8.38 (2.23) |
| D-Declutter | 3.04 (1.82) | 2.63 (1.77) | 3.27 (2.25) | 2.67 (2.05) | 2.98 (1.97) | 2.52 (1.70) | 7.69 (2.02) | 8.50 (1.81) * |
| D-Slowdown | 2.81 (1.51) | 2.63 (1.72) | 2.90 (1.96) | 2.71 (1.96) | 2.92 (1.45) | 2.81 (1.74) | 8.29 (1.77) | 8.23 (1.93) |
Table 2. Survey Results of mean and standard deviation for each TLX question item per design principle.

This table presents the means and standard deviations for the original and revised versions of each TLX question item for each design principle. For D-Illustrate, there are significant changes in reducing mental demand, reducing physical demand, reducing temporal pressure, and increasing learning satisfaction. For D-Guide, there are significant changes in reducing physical demand, reducing temporal pressure, and increasing learning satisfaction, as well as a reduction in ratings for mental demand. For D-Sync, the mean remains the same for mental demand, increases for physical demand, increases for temporal pressure, and decreases for learning satisfaction. For D-Declutter, there is a significant change in increasing learning satisfaction, as well as decreases in mental demand, physical demand, and temporal pressure. For D-Slowdown, there are slight decreases in mental demand, physical demand, temporal pressure, and a slight decrease in learning satisfaction.

4.2.4. Perceived Value of Each Design Principle

Below, we present the qualitative findings of interview results, where participants shared their perceived values for each design principle.

D-Illustrate: Supported Visual Understandings of Captions. The survey results presented in Section 4.2.3 suggested that D-Illustrate showed promise in all four dimensions: reducing mental demand, reducing physical demand, reducing temporal pressure, and increasing learning satisfaction. Additionally, D-Illustrate showed promise in addressing two of three learning cognitive demands: Focusing on Essential Information, and Fostering Connection between Text and Image.

During the interview, participants appreciated the attempt to replace irrelevant images with animated visual examples. For example, V8 commented, “Not much need to be say for [revised version]. It is just a good idea to make it visual and with what the speaker is discussing about.” Participants also called for D-Illustrate to be applied to more parts of the video, regardless of the design principle they were presented with. V12 said, “I think that even more can be added in different parts of the video because the beginning and end started to get dull and boring.” Similarly, V13 mentioned, “I think the video could be further enhanced with more application [of D-Illustrate].”

D-Guide: Reducing Temporal Pressure in Visual Attention Switch between Caption and Screen. D-Guide was also rated well in reducing physical demand, reducing temporal pressure, and increasing learning satisfaction, as mentioned in Section 4.2.3. During the interview, participants found D-Guide generally helpful for knowing when to look where. However, three participants mentioned challenges in simultaneously focusing on captions and visual demonstrations in certain situations, as explained in the examples below. V12 suggested that on-screen text should be reduced, aligning with D-Declutter, as it could distract attention when switching between caption reading and visual viewing, “Captions could be split into chunks and have less words on the screen at one time. With the graph and the words, it became a little cluttered.” V15 also said the added on-screen motion may distract some DHH learners from caption reading, “The [revised] video is alright; however, the changes may confused some people while watching the captions.” V15 further suggested that guiding visual attention might lead to missing visual information that is not mentioned in the audio or captions but is still important, “I would like to ensure all the information is complete and not deleted while keeping it simple.”

D-Declutter: Increasing Satisfaction, though Distracting from Caption Reading Pace. Survey results presented in Section 4.2.3 suggested that D-Declutter was rated as increasing learning satisfaction. Participants gave overall positive feedback for D-Declutter. On the one hand, participants appreciated the attempt to highlight text using color, as V13 said, “It was great enhance the user experience and make video content more engaging.” V17 also mentioned, “I love the color text because it is clear and point.” On the other hand, participants found colorized text less engaging compared to visual examples, e.g., V15 said, “I don’t think [revised version] should have that caption. The image will be instead.” The current design was also perceived as making the captions harder to read by some participants. V8 commented, “The red small word and bigger word are very distracting. It is difficult to tell what the speaker was referring to.” Two participants also raised accessibility concerns for visually diverse users, such as DeafBlind learners who use braille to access video captions. Braille may not fully convey color information. For example, V8 noted, “I wonder if that works for those who may have eye issues or can’t see color.” This highlights the need for inclusive design that considers a broader range of accessibility needs.

D-Slowdown: Participants Not Fully Satisfied despite Calling for Visuals to Slow Down. For D-Slowdown, there were no significant changes in the four dimensions of the TLX questions, as mentioned in Section 4.2.3. During interviews, participants explained the reasons for their ratings of D-Slowdown. V15 said, “I like [original] version better than [revised] because [revised] version is very slow to me that I lose track comparing to [original] Version.” Meanwhile, for video clips to which D-Slowdown was not applied, participants also mentioned a demand to make the video slower. For example, V11 suggested for a video clip with D-Illustrate applied to “have less pictures and video or making the video longer. I notice that some of the pictures and video were too fast and I could not see and read both at the same time.” V3 suggested an approach to manage the play speed of the video clip, “I believe the speed should correspond to the video’s content: use normal speed in typical situations and slow down when something is unclear.”

D-Sync: Beneficial for Learners with Residual Hearing. For D-Sync, there were no significant changes in the four dimensions of the TLX questions between the revised and the original video. In fact, as shown in Table 2, there is an increase in physical demand, an increase in temporal pressure, and a decrease in learning satisfaction. The statistical tests also showed that D-Sync was rated as less valuable in reducing physical demand and temporal pressure compared to D-Illustrate and D-Guide, and less valuable in increasing learning satisfaction compared to D-Illustrate. Table 2 also shows that the variance of users’ feedback for the revised version of D-Sync was relatively large, especially for physical demand and temporal pressure. During the interview, some participants shared positive thoughts about D-Sync. For example, V8 found that the revised version made them want to learn more, “I think B is very simple and sweet but it makes want to learn more. A is good but same time I feel like a little to too wordy, somewhat a little harder to understand.”

One possible explanation for the wide variance in responses is that hearing status affected the rated value of D-Sync. Around half of the Deaf participants were not able to tell the difference between the original and revised versions, whereas both HoH learners were able to tell the difference and perceived the revision as helpful. For example, V12, who is HoH and does not lipread, suggested that the revised video helped reduce the cluttered information by matching the narrator’s speaking pace. She explained that she was able to “follow” most of the instructor’s speech through residual hearing, though she couldn’t hear every word clearly, “When the diagram is made to show the parts of AR and VR, AI in the future might match all of the [visual] parts shown in the second[revised] video only introduce them when the person mentions them to reduce the amount of clutter on the screen that was shown in Version A [original].”

Residual hearing is the amount of hearing a person has left after experiencing hearing loss (e.g., through hearing aids or a cochlear implant). Following IRB requirements, we did not ask Deaf participants to report their medical residual hearing, as Deaf is more related to cultural identity and is not directly associated with residual hearing. HoH learners, in contrast, often have residual hearing. The other possible reason for the large variance is that only one video clip was studied for D-Sync, while there were three or more clips for each of the other four design principles.

5. Discussion

5.1. Motion Design Principles for Enhanced Accessibility of Video-Lecture Delivery

Our findings focus on empowering DHH learners by altering the visual representation of learning materials through participatory research. Although motion design may have established relevance in fields such as the film and cinematic industries, this work is the first to study motion design within the context of accessible video-based learning. Below, we discuss design insights for video-based platform designers and content creators to enhance video accessibility for DHH learners. In Fig 4, we summarize the common video lecture delivery challenges we found that create barriers for DHH learners, which may not be exhaustive but serve as a foundation for further research. Ideally, video presentations for DHH individuals should ensure that the content of visual and audio information is highly relevant and illustrative, minimize unnecessary visual and reading load, and support smooth visual attention switches.

5.1.1. Improving Illustrativeness of Video

Improving visual-audio relevance (D-Illustrate) was found to be the most effective in Focusing on Essential Information and Fostering Connection between Text and Image. It also supports learning by reducing mental demand, physical demand, and temporal pressure, and by increasing learning satisfaction. We also discovered that “talking head” segments in a video are less visually illustrative; other presentation styles may also lack visual illustration and should be studied further. While previous research has highlighted the importance of visual examples for DHH learners, our study found that presenting relevant visuals is particularly crucial in videos where learners depend heavily on captions and the reading load is high. This echoes previous research on the importance of visual versions of information. For example, Signmaku, a sign-language-based commenting mechanism that replaces text-based comments, proposed by Chen et al. (Chen et al., 2024a), was found to enhance DHH learners’ comprehension of concepts by offering a visual-modality-based alternative for easier understanding. Enhancing the illustrativeness of learning materials may also benefit other learners with diversabilities, such as those with lower reading literacy or younger generations who are growing up learning through digital media rather than traditional text-based books.

Figure 4. Summary of the five Motion Design Principles: the video lecture delivery challenges each one addresses and its rated usefulness for improving the learning experience of DHH learners. For the three learning cognitive demand dimensions (first three rows), an “x” is placed in the table when the median value of a rating is greater than 4.

This figure is a table that summarizes the findings of RQ1 and RQ2. At the top, there are five diagrams illustrating the video lecture delivery challenges addressed by each design principle. For D-Illustrate, a dotted rectangle labeled “Audio” appears first, with an arrow pointing from the rectangle to a smaller one labeled “Visual,” indicating that the visual information follows the audio. For D-Guide, two overlapping rectangles are shown: the bottom one is labeled “Audio,” while a taller, vertically oriented rectangle labeled “Visual” overlaps it on the left. For D-Sync, two rectangles are merely overlapping, with the left rectangle representing “Audio” and the right rectangle representing “Visual.” For D-Declutter, there is only one rectangle labeled “Audio.” For D-Slowdown, there are two rectangles: at the bottom is a rectangle labeled “Audio,” and above it is a rectangle with a faded, blurry “Visual” element. Below the diagrams are the video lecture delivery challenges in text. D-Illustrate: Lack relevant visual elements to explain the captions. D-Guide: Lack focus on visual elements: important visual elements are displayed all at once, competing for attention (cause of visual split-attention), but the caption can only focus on one element at a time. D-Sync: Lack synchronization between visuals and audio/caption, causing confusion. D-Declutter: Lack visual support: screen filled with chunks of repeated text in captions, increasing unnecessary reading load and causing anxiety. D-Slowdown: Lack time to comprehend the video content: rapid visual movements hinder learners from reading captions and cause anxiety and distraction. Below the challenges is the usefulness of each design principle, marked with an X.
For D-Illustrate, the X marks focus on essential information, foster connection between text and image, reduce mental demand, reduce physical demand, reduce temporal pressure, and improve learning satisfaction; for D-Guide, the X marks reduce irrelevant information, focus on essential information, foster connection between text and image, reduce physical demand, reduce temporal pressure, and improve learning satisfaction; for D-Sync, the X marks reduce irrelevant information, focus on essential information, and foster connection between text and image; for D-Declutter, the X marks focus on essential information, foster connection between text and image, and improve learning satisfaction; for D-Slowdown, the X marks foster connection between text and image.
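The marking rule stated in the Figure 4 caption (an “x” wherever the median rating exceeds 4) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `mark_table`, the placeholder principle names, and all rating values are hypothetical and are not data from the study.

```python
from statistics import median

def mark_table(ratings, threshold=4):
    """Return "x" for each principle whose median rating exceeds the threshold.

    `ratings` maps a principle name to a list of participant ratings.
    The default threshold of 4 follows the rule stated in the Figure 4 caption.
    """
    return {
        name: ("x" if median(values) > threshold else "")
        for name, values in ratings.items()
    }

# Hypothetical ratings for two placeholder principles (NOT study data).
example = {
    "principle_a": [5, 6, 4, 7],  # median is 5.5, so it is marked
    "principle_b": [4, 4, 3, 5],  # median is 4.0, so it is not marked
}
marks = mark_table(example)
```

Because the comparison is strict, a median of exactly 4 leaves the cell empty, which matches the “greater than 4” wording of the caption.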

5.1.2. Diversifying Cues to Guide DHH Learner’s Attention

To enhance video accessibility for DHH learners, our findings suggest that when visual components are crucial and closely tied to audio, overlaying visual attention cues can reduce temporal pressure and increase learning satisfaction (D-Guide). This is particularly important when complex visuals are presented. In STEM learning, complex concepts and data are often conveyed through diagrams, graphs, models, simulations, and visual demonstrations (Abdullah et al., 2014; Ge et al., 2024). For example, in the video about AR technology that we used in our study, visual demonstrations were used to explain how a VR headset connects to a computer (D-Guide in Fig 1). D-Guide could be especially beneficial for supporting STEM education for DHH learners.

The need for visual attention cues arises from the split between visual content and separately located captions. This problem, which has been shown to negatively impact learning experiences for DHH learners in classroom settings and workspaces (Chen et al., 2024b), is also present in video viewing with captions. In our study, we used flashing highlights to direct attention, but other animations, such as pulsing borders and zooming in, could be explored in future research. Additionally, audio often captures hearing learners’ attention and supports attention switches; for DHH individuals, whether caption typography that conveys the speaker’s emotion, explored in (de Lacerda Pataca et al., 2024), can serve as an effective visual attention cue in learning still requires further investigation.

5.1.3. Leveraging Varied Residual Hearing

According to the ratings on the TLX questions, D-Sync showed a wide variance, with most Deaf participants unable to discern differences between the original and revised videos. The interviews suggest that residual hearing might be linked to the pace at which a learner can follow a video. Some learners can follow the instructor’s speech pace and use captions as a reference, while other DHH learners may rely primarily on captions due to their limited residual hearing. Residual hearing can also vary with factors such as environmental settings, voice characteristics, and whether hearing devices are on or off (McDonnell et al., 2021; Wang and Piper, 2018). For DHH learners, the pace of reading captions often differs from the actual spoken audio pace, and our findings show that these differences negatively impacted the perception of synchronized audio and visuals. Therefore, designing for learners with varied levels of residual hearing could be a future direction. This includes personalizing synchronization in motion design based on how much caption reading an individual requires, their caption reading speed, and the difference between their caption reading pace and the instructor’s speech speed.
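As one way to make this future direction concrete, the following Python sketch shifts the onset of a visual element according to a learner’s caption reading speed and how much they rely on captions. It is a hypothetical illustration, not a method used in the study: the `CaptionSegment` type, the `caption_reliance` parameter (0 for a learner who follows the speech pace, 1 for a learner who relies fully on captions), the linear interpolation, and the example values are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class CaptionSegment:
    start_s: float  # time at which the speech and its caption begin (seconds)
    end_s: float    # time at which the speech for this caption ends (seconds)
    text: str       # caption text shown for this segment

def reading_time_s(text: str, reading_wpm: float) -> float:
    """Estimate the time needed to read a caption at a given words-per-minute rate."""
    return len(text.split()) * 60.0 / reading_wpm

def personalized_timing(seg: CaptionSegment, reading_wpm: float,
                        caption_reliance: float) -> tuple[float, float]:
    """Return (visual_onset_s, extra_hold_s) for one caption segment.

    visual_onset_s: when to introduce the related visual element. It moves from
        the speech start toward the estimated end of caption reading as
        caption_reliance grows (assumed linear interpolation).
    extra_hold_s: additional time to hold the frame when reading the caption
        takes longer than the speech itself, scaled by caption_reliance.
    """
    if not 0.0 <= caption_reliance <= 1.0:
        raise ValueError("caption_reliance must be between 0 and 1")
    read_s = reading_time_s(seg.text, reading_wpm)
    speech_s = seg.end_s - seg.start_s
    onset_s = seg.start_s + caption_reliance * read_s
    hold_s = caption_reliance * max(0.0, read_s - speech_s)
    return onset_s, hold_s

# Hypothetical example: a 10-word caption spoken over 4 seconds.
seg = CaptionSegment(10.0, 14.0,
                     "the headset connects to the computer through a single cable")
onset, hold = personalized_timing(seg, reading_wpm=120, caption_reliance=1.0)
```

In this sketch, a learner who follows the speech pace (`caption_reliance=0.0`) sees the visual at the moment the narrator starts, while a fully caption-reliant learner sees it only after the estimated reading time has passed. A real system would need to estimate reading speed and reliance empirically and per learner.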

5.1.4. Signing “Slow” Does Not Mean “S-L-O-W”

Following the participatory process in RQ1, we proposed slowing visuals down (D-Slowdown). However, this principle did not receive as much positive feedback as expected. Participants in Phase 2 only slightly agreed that it was helpful for learning and did not report significant reductions in cognitive load or increases in satisfaction. Interestingly, participants also continued to mention the need to “slow down” in video clips revised according to other design principles. This suggests that the research team may not have sufficiently explored the ideal approach to slowing the videos. Future video-based learning systems might consider giving users more control over how much the video is slowed down and explore additional options that allow learners to process the visual information before continuing, such as pausing the video in the middle and asking learners to proceed manually.

Another possibility is that researchers may have overlooked the deeper meaning behind the request to “slow down.” For some participants, our interpretation of “slow down” was based on the English transcription of the ASL sign for “SLOW.” In ASL, this sign is made by holding one arm straight out in front of you, palm down and fingers relaxed, then gently sliding the fingertips of your other hand up the forearm from the wrist toward the elbow. It’s possible that this English transcription was taken too literally or did not fully capture the intended meaning of the original ASL sign. Participants might have been expressing a need for something different, such as more thorough explanations or alternative ways of presenting the information.

5.1.5. Automating Video Editing Guided by Motion Design Principles using Generative AI

During the interviews, participants were asked whether generative AI tools like ChatGPT and DALL-E could be used to apply our proposed motion design principles to video generation. They were also invited to try out DALL-E to express their desired video edits. Participants responded positively and suggested additional features (for example, V5 used DALL-E to add speech balloons for emotional clarity in mainstream videos), but they encountered difficulties with the tool. None of the participants had prior experience with DALL-E, and most struggled to articulate their ideas through written prompts. This may be related to linguistic diversity, as many DHH individuals are bilingual in English and ASL and often switch between the two languages. Further research is needed to understand the specific challenges DHH individuals face when using generative AI and to develop more intuitive, less text-dependent interaction methods, along with improving learners’ prompt engineering skills. More recent work on visual language models, such as GPT-4o-like multimodal chatbots (Xue et al., 2024), offers the potential to apply Motion design principles and address more diverse learning needs across various learning materials.

5.2. Enriching Multimedia Learning Theory with Diversability

Our work is one of the few to study multimedia learning theory, which has previously focused on the hearing population, with DHH learners. In our study, we found that Motion design can effectively address two key cognitive demands from multimedia learning theory (Mayer, 2002), Essential Processing and Generative Processing, with the strongest benefit for Generative Processing. However, Extraneous Processing is not fully addressed, and further research is needed to understand its value and approaches. Generative Processing is particularly important and challenging for DHH learners, as they often need to switch between visual content and captions on-screen. This differs from the hearing population, who can simultaneously process words through hearing and visual information through sight.

Our study fills a gap in existing research, which has primarily focused on how captions and subtitles affect viewing behavior rather than the actual processing of verbal information contained in subtitles, as noted by Kruger et al. (Kruger et al., 2015). We suggest that multimedia learning theory could be applied to further study DHH learners’ experiences with caption design, not only in traditional 2D video but also in 3D VR/AR environments. Additionally, other learning theories, such as embodied learning, may be valuable in designing and evaluating technology for DHH learners. Embodied learning emphasizes the integration of the mind with the body’s sensorimotor systems (Stolz, 2015), suggesting that cognition is deeply rooted in perception and action, which makes it a promising approach for inclusive education design.

5.3. Limitations and Future Work

We acknowledge a few limitations in our study. First, our study only explored five Motion design principles identified from one educational video on AR technology. We do not claim that these are the only Motion design principles for educational video. In fact, the presentation style of the video we used may have limited the exploration of other design principles. For example, this video was made for desktop users and created by a technology company. The results might be different for videos designed for mobile platforms such as TikTok, and for educational videos created by instructors at college-level institutions. Future studies should explore more potential Motion design principles with more educational videos from other genres and with other presentation styles. Second, for study design purposes, we applied only one design principle to each video clip. In the real world, a video clip may have multiple delivery challenges and require multiple design principles to address all the challenges DHH learners may face. Future studies should explore how applying multiple design principles to a video may further enhance the learning experience for DHH learners. Third, all video clips explored in our study were edited by the research team rather than automatically by technology such as AI. Future studies may explore how AI could improve video lectures for DHH learners by applying the Motion design.

Regarding our sample of DHH learners, all our participants were from a Deaf-centric university in the US, and most self-identified as Deaf within the DHH community. The findings might differ for a sample from the DHH community with different educational backgrounds, hearing abilities, language preferences, cultural backgrounds, etc. In fact, our participants also mentioned the challenge of creating a standard for designing AI given the diversity in the DHH community, which includes d/Deaf, hard-of-hearing, and deaf-blind individuals. Future studies should explore our Motion design with a more diverse group of DHH individuals, such as DHH learners from a mainstream university. Additionally, as our work focused on DHH learners as a sample of diversability learners, future work should explore whether the Motion design could be effective for other diversability learners as well as for hearing learners.

6. Conclusion

The study followed a two-phase approach: first, DHH participants identified five key mainstream video content delivery challenges and design suggestions, and second, 16 DHH learners evaluated revised videos incorporating five design principles. The findings demonstrate that Motion Design principles, such as improving visual-audio relevance and guiding visual attention, significantly improve the learning experience and address cognitive demands. The results underscore the need for further research into motion design’s application across diverse educational contexts and content types. Additionally, the potential of generative AI to automate and optimize these motion design interventions offers a promising avenue for future exploration.

References

  • Abdullah et al. (2014) Nasarudin Abdullah, Lilia Halim, and Effandi Zakaria. 2014. VStops: A thinking strategy and visual representation approach in mathematical word problem solving toward enhancing STEM literacy. Eurasia Journal of Mathematics, Science and Technology Education 10, 3 (2014), 165–174.
  • Arroyo Chavez et al. (2024) Mariana Arroyo Chavez, Molly Feanny, Matthew Seita, Bernard Thompson, Keith Delk, Skyler Officer, Abraham Glasser, Raja Kushalnagar, and Christian Vogler. 2024. How Users Experience Closed Captions on Live Television: Quality Metrics Remain a Challenge. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–16.
  • Bavelier et al. (2001) Daphne Bavelier, Craig Brozinsky, Andrea Tomann, Teresa Mitchell, Helen Neville, and Guoying Liu. 2001. Impact of early deafness and early exposure to sign language on the cerebral organization for motion processing. Journal of Neuroscience 21, 22 (2001), 8931–8942.
  • Bell et al. (2019) Laura Bell, Lisa Wagels, Christiane Neuschaefer-Rube, Janina Fels, Raquel E Gur, and Kerstin Konrad. 2019. The cross-modal effects of sensory deprivation on spatial and temporal processes in vision and audition: A systematic review on behavioral and neuroimaging research since 2000. Neural plasticity 2019, 1 (2019), 9603469.
  • Bhavya et al. (2022) Bhavya Bhavya, Si Chen, Zhilin Zhang, Wenting Li, Chengxiang Zhai, Lawrence Angrave, and Yun Huang. 2022. Exploring collaborative caption editing to augment video-based learning. Educational technology research and development 70, 5 (2022), 1755–1779.
  • Boudreault et al. (2024) Patrick Boudreault, Muhammad Abubakar, Andrew Duran, Bridget Lam, Zehui Liu, Christian Vogler, and Raja Kushalnagar. 2024. Closed Sign Language Interpreting: A Usability Study. In International Conference on Computers Helping People with Special Needs. Springer, 42–49.
  • Braun and Clarke (2006) Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative research in psychology 3, 2 (2006), 77–101.
  • Chang et al. (2023) Joseph Chee Chang, Amy X Zhang, Jonathan Bragg, Andrew Head, Kyle Lo, Doug Downey, and Daniel S Weld. 2023. Citesee: Augmenting citations in scientific papers with persistent and personalized historical context. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–15.
  • Chen et al. (2006) Qi Chen, Ming Zhang, and Xiaolin Zhou. 2006. Effects of spatial distribution of attention during inhibition of return (IOR) on flanker interference in hearing and congenitally deaf people. Brain research 1109, 1 (2006), 117–127.
  • Chen et al. (2024a) Si Chen, Haocong Cheng, Jason Situ, Desirée Kirst, Suzy Su, Saumya Malhotra, Lawrence Angrave, Qi Wang, and Yun Huang. 2024a. Towards Inclusive Video Commenting: Introducing Signmaku for the Deaf and Hard-of-Hearing. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–18.
  • Chen et al. (2024b) Si Chen, James Waller, Matthew Seita, Christian Vogler, Raja Kushalnagar, and Qi Wang. 2024b. Towards Co-Creating Access and Inclusion: A Group Autoethnography on a Hearing Individual’s Journey Towards Effective Communication in Mixed-Hearing Ability Higher Education Settings. In Proceedings of the CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 55, 14 pages. https://doi.org/10.1145/3613904.3642017
  • de Lacerda Pataca et al. (2024) Caluã de Lacerda Pataca, Saad Hassan, Nathan Tinker, Roshan Lalintha Peiris, and Matt Huenerfauth. 2024. Caption Royale: Exploring the Design Space of Affective Captions from the Perspective of Deaf and Hard-of-Hearing Individuals. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–17.
  • Debevc and Peljhan (2004) Matjaž Debevc and Živa Peljhan. 2004. The role of video technology in on-line lectures for the deaf. Disability and rehabilitation 26, 17 (2004), 1048–1059.
  • Dust et al. (2023) A’di Dust, Carola Gonzalez-Lebron, Shannon Connell, Saurav Singh, Reynold Bailey, Cecilia Ovesdotter Alm, and Jamison Heard. 2023. Understanding Differences in Human-Robot Teaming Dynamics between Deaf/Hard of Hearing and Hearing Individuals. In Companion of the 2023 ACM/IEEE International Conference on Human-Robot Interaction. 552–556.
  • Dye et al. (2008) Matthew W Dye, Peter C Hauser, and Daphne Bavelier. 2008. Visual attention in deaf children and adults. Deaf cognition: Foundations and outcomes (2008), 250–263.
  • Fine et al. (2005) Ione Fine, Eva M Finney, Geoffrey M Boynton, and Karen R Dobkins. 2005. Comparing the effects of auditory deprivation and sign language within the auditory and visual cortex. Journal of cognitive neuroscience 17, 10 (2005), 1621–1637.
  • Ge et al. (2024) Lily W Ge, Maryam Hedayati, Yuan Cui, Yiren Ding, Karen Bonilla, Alark Joshi, Alvitta Ottley, Benjamin Bach, Bum Chul Kwon, David N Rapp, et al. 2024. Toward a More Comprehensive Understanding of Visualization Literacy. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–7.
  • Gu et al. (2024) Chanyuan Gu, Yingying Peng, Samuel A Nastase, Richard E Mayer, and Ping Li. 2024. Onscreen presence of instructors in video lectures affects learners’ neural synchrony and visual attention during multimedia learning. Proceedings of the National Academy of Sciences 121, 12 (2024), e2309054121.
  • Guo et al. (2014) Philip J Guo, Juho Kim, and Rob Rubin. 2014. How video production affects student engagement: An empirical study of MOOC videos. In Proceedings of the first ACM conference on Learning@ scale conference. 41–50.
  • Hauser and Marschark (2008) Peter C Hauser and Marc Marschark. 2008. What we know and what we don’t know about cognition and deaf learners. Deaf cognition: Foundations and outcomes (2008), 439–457.
  • Hauthal et al. (2013) Nadine Hauthal, Pascale Sandmann, Stefan Debener, and Jeremy D Thorne. 2013. Visual movement perception in deaf and hearing individuals. Advances in cognitive psychology 9, 2 (2013), 53.
  • Hidayat et al. (2017) Luqman Hidayat, G Gunarhadi, and Furqon Hidayatulloh. 2017. Multimedia based learning materials for deaf students. European Journal of Special Education Research (2017).
  • Hong Lore and Song (1991) Wing Hong Lore and Shareen Song. 1991. Central and peripheral visual processing in hearing and nonhearing individuals. Bulletin of the Psychonomic Society 29, 5 (1991), 437–440.
  • Jain et al. (2018) Dhruv Jain, Rachel Franz, Leah Findlater, Jackson Cannon, Raja Kushalnagar, and Jon Froehlich. 2018. Towards accessible conversations in a mobile context for people who are deaf and hard of hearing. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility. 81–92.
  • Jensema (1998) Carl Jensema. 1998. Viewer reaction to different television captioning speeds. American annals of the deaf (1998), 318–324.
  • Kafle and Huenerfauth (2016) Sushant Kafle and Matt Huenerfauth. 2016. Effect of speech recognition errors on text understandability for people who are deaf or hard of hearing.
  • Kizilcec et al. (2015) René F Kizilcec, Jeremy N Bailenson, and Charles J Gomez. 2015. The instructor’s face in video instruction: Evidence from two large-scale field studies. Journal of Educational Psychology 107, 3 (2015), 724.
  • Kruger et al. (2015) Jan-Louis Kruger, Agnieszka Szarkowska, and Izabela Krejtz. 2015. Subtitles on the moving image: An overview of eye tracking studies. Refractory: a journal of entertainment media 25 (2015), 1–14.
  • Kushalnagar et al. (2010) Raja S Kushalnagar, Anna C Cavender, and Jehan-François Pâris. 2010. Multiple view perspectives: improving inclusiveness and video compression in mainstream classroom recordings. In Proceedings of the 12th international ACM SIGACCESS conference on Computers and accessibility. 123–130.
  • Kushalnagar et al. (2013) Raja S Kushalnagar, Walter S Lasecki, and Jeffrey P Bigham. 2013. Captions versus transcripts for online video content. In Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility. 1–4.
  • Kushalnagar et al. (2014) Raja S Kushalnagar, Walter S Lasecki, and Jeffrey P Bigham. 2014. Accessibility evaluation of classroom captions. ACM Transactions on Accessible Computing (TACCESS) 5, 3 (2014), 1–24.
  • Lang (2002) Harry G Lang. 2002. Higher education for deaf students: Research priorities in the new millennium. Journal of deaf studies and deaf education 7, 4 (2002), 267–280.
  • Lasecki et al. (2014) Walter S Lasecki, Raja Kushalnagar, and Jeffrey P Bigham. 2014. Helping students keep up with real-time captions by pausing and highlighting. In Proceedings of the 11th Web for All Conference. 1–8.
  • Li et al. (2022) Ziming Li, Shannon Connell, Wendy Dannels, and Roshan Peiris. 2022. SoundVizVR: Sound Indicators for Accessible Sounds in Virtual Reality for Deaf or Hard-of-Hearing Users. In Conference on Computers and Accessibility (ASSETS’22).
  • Lindsay (2007) Geoff Lindsay. 2007. Educational psychology and the effectiveness of inclusive education/mainstreaming. British journal of educational psychology 77, 1 (2007), 1–24.
  • Marschark and Hauser (2008) Marc Marschark and Peter C Hauser. 2008. Cognitive underpinnings of learning by deaf and hard-of-hearing students. Deaf cognition: Foundations and outcomes 1973 (2008), 3–23.
  • Marschark et al. (2005) Marc Marschark, Patricia Sapere, Carol Convertino, and Rosemarie Seewagen. 2005. Access to postsecondary education through sign language interpreting. Journal of Deaf Studies and deaf education 10, 1 (2005), 38–50.
  • Mather and Clark (2012) Susan M Mather and M Diane Clark. 2012. An issue of learning: the effect of visual split attention in classes for deaf and hard of hearing students. Odyssey: New directions in deaf education 13 (2012), 20–24.
  • Mayer (2002) Richard E Mayer. 2002. Multimedia learning. In Psychology of learning and motivation. Vol. 41. Elsevier, 85–139.
  • Mayer (2014) Richard E Mayer. 2014. Cognitive Theory of Multimedia Learning. The Cambridge Handbook of Multimedia Learning (2014), 43.
  • Mayer (2021) Richard E Mayer. 2021. Evidence-based principles for how to design effective instructional videos. Journal of Applied Research in Memory and Cognition 10, 2 (2021), 229–240.
  • Mayer (2024) Richard E Mayer. 2024. The past, present, and future of the cognitive theory of multimedia learning. Educational Psychology Review 36, 1 (2024), 8.
  • Mayer et al. (2020) Richard E Mayer, Logan Fiorella, and Andrew Stull. 2020. Five ways to increase the effectiveness of instructional video. Educational Technology Research and Development 68, 3 (2020), 837–852.
  • Mayer and Moreno (2003) Richard E Mayer and Roxana Moreno. 2003. Nine ways to reduce cognitive load in multimedia learning. Educational psychologist 38, 1 (2003), 43–52.
  • McDonnell et al. (2024) Emma J McDonnell, Tessa Eagle, Pitch Sinlapanuntakul, Soo Hyun Moon, Kathryn E Ringland, Jon E Froehlich, and Leah Findlater. 2024. “Caption It in an Accessible Way That Is Also Enjoyable”: Characterizing User-Driven Captioning Practices on TikTok. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–16.
  • McDonnell et al. (2021) Emma J McDonnell, Ping Liu, Steven M Goodman, Raja Kushalnagar, Jon E Froehlich, and Leah Findlater. 2021. Social, environmental, and technical: Factors at play in the current use and future design of small-group captioning. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (2021), 1–25.
  • Parton (2016) Becky Parton. 2016. Video captions for online courses: Do youtube’s auto-generated captions meet deaf students’ needs? Journal of Open, Flexible, and Distance Learning 20, 1 (2016), 8–18.
  • Petersen and Nielsen (2002) Helle Petersen and Janni Nielsen. 2002. The eye of the user: the influence of movement on users’ visual attention. Digital Creativity 13, 2 (2002), 109–121.
  • Proksch and Bavelier (2002) Jason Proksch and Daphne Bavelier. 2002. Changes in the spatial distribution of visual attention after early deafness. Journal of cognitive neuroscience 14, 5 (2002), 687–701.
  • Schmidt-Weigand et al. (2010) Florian Schmidt-Weigand, Alfred Kohnert, and Ulrich Glowalla. 2010. A closer look at split visual attention in system-and self-paced instruction in multimedia learning. Learning and instruction 20, 2 (2010), 100–110.
  • Stolz (2015) Steven A Stolz. 2015. Embodied learning. Educational philosophy and theory 47, 5 (2015), 474–487.
  • Techaraungrong et al. (2017) Piyaporn Techaraungrong, Surachai Suksakulchai, Wacheerapan Kaewprapan, and Elizabeth Murphy. 2017. The design and testing of multimedia for teaching arithmetic to deaf learners. Education and Information Technologies 22 (2017), 215–237.
  • Tyler et al. (2009) Michael D Tyler, Caroline Jones, Leonid Grebennikov, Greg Leigh, William Noble, and Denis Burnham. 2009. Effect of caption rate on the comprehension of educational television programmes by deaf school students. Deafness & Education International 11, 3 (2009), 152–162.
  • Wang and Piper (2018) Emily Q Wang and Anne Marie Piper. 2018. Accessibility in action: Co-located collaboration among deaf and hearing professionals. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 1–25.
  • Wang and Williams (2014) Ye Wang and Cheri Williams. 2014. Are we hammering square pegs into round holes? An investigation of the meta-analyses of reading research with students who are d/Deaf or hard of hearing and students who are hearing. American Annals of the Deaf 159, 4 (2014), 323–345.
  • Xue et al. (2024) Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. 2024. LongVILA: Scaling Long-Context Visual Language Models for Long Videos. arXiv preprint arXiv:2408.10188 (2024).