
Video tutorials: an expanding audiovisual genre

Gianna Tarquini, University of Bologna
Richard E. McDorman, Language On


With video-based learning genres (e.g. MOOCs, the flipped classroom and tutorials) increasingly gaining momentum in formal and informal education, accessibility needs become a crucial challenge for ensuring inclusive learning among diverse audiences, both locally and internationally. This paper aims to shed light on unexplored linguistic and audiovisual features of video tutorials by looking at instruction as a multimodal instance of communication. Using the official PowerPoint tutorials as the basis of a pilot study, we propose a model for the conceptualisation of semiotic resources, along with translation analysis of multilingual versions, specifically an original U.S. English version, an Italian translation with interlingual subtitles (partial localisation) and a French adaptation (full localisation). Results describe the benefits of using intralingual subtitles and best practices for enhancing the accessibility and internationalisation of video tutorial content, with interlingual subtitles deemed a less effective translation method. Finally, adaptation techniques are described in their semiotic dimensions of re-creation and transcreation, suggesting possible ways in which audiovisual translation theories and methods can contribute to the design of video-based learning content for international learners.


Video-based learning, tutorials, multimedia learning, audiovisual translation, localisation, multimodality.

1. Introduction

As a unique form of readily available and scalable video-based learning tool, video tutorials are rapidly expanding across digital media to support users/learners in getting started with new tools or accomplishing specialised tasks. In particular, video tutorials are being increasingly exploited by technology companies as a complement to technical documentation (user’s guides, online help, machinery and equipment manuals, etc.) to impart procedural knowledge through hands-on methods and advanced multimedia techniques. While video tutorials have been investigated as a teaching aid in the field of education, they have received little attention in multimodal analysis and audiovisual translation research. What techniques are involved in the creation of video tutorials? What unique meaning-making resources do they combine? Are video tutorials captioned/translated and how can they be inclusive for diverse audiences?

This article aims to shed light on the fundamental linguistic and audiovisual aspects of video tutorials, setting out to investigate them not only as a teaching tool, but as a text in its own right. To this end, we will first introduce key concepts and terms along with the relevant academic literature devoted to video-based learning, which posits fundamental theories through which audiovisual/instructional practices should be framed. Second, we will propose a classification of three distinct video tutorial types: academic, corporate and user-generated tutorials, based on parameters of pragmatic-situational and audiovisual interest.
Building on these assumptions, we will then focus on the official PowerPoint tutorials as a reference for multimodal and translation investigation. Drawing from the toolbox of multimodal analysis, we will identify the meaning-making technologies and dimensions that make up video tutorials as a “semiotic artefact,” intended as:

a semiotic resource that has material form and incorporates selections from different semiotic modes (e.g. layout, texture, color, sound) and media (e.g. visual, aural, print, electronic), and is deployed for making meaning alongside other types of semiotic resources (i.e. modes, media and abstract semiotic principles such as genre and rhythm) (Zhao et al. 2014: 355).

We will thus introduce a model for the conceptualisation of the diverse semiotic layers that are involved in their composition. From this perspective, we will briefly consider the main video tutorial localisation options and levels, which involve three major strategies: no localisation (closed captioning), partial localisation (interlingual subtitling) and full localisation (adaptation of visual-verbal, visual non-verbal and audio resources). In so doing, we intend to call attention not only to the implications for multimedia learning, but also to new text-making practices that are emerging in the modern landscape of the “learning society”.

2. Key terminology

In addressing video tutorials as a linguistic and multimodal object, the first hurdle we encounter is terminological in nature. Upon closer inspection, this is a problem of delineating the field of inquiry. In this section, we will examine the definitions of the term ‘tutorial’ and list its most frequently associated compound forms, including ‘online tutorials,’ ‘interactive tutorials’ and ‘computer-based tutorials,’ among others. We will then focus on ‘video tutorials’ as a unique multimodal form of instructional videos and discuss its partial synonym ‘screencast.’

The term tutorial originally referred to “a period of tuition given by a university or college tutor to an individual or very small group” (Oxford Dictionaries 2016). The term is closely related etymologically to the word tutor, which meant “guardian, custodian” in Middle English, and eventually “senior boy appointed to help a junior in his studies”1 (Oxford Dictionaries 2016). This concept implies individual learning and a personal relationship between a tutor and a single tutee, or a small group of tutees. As computer-mediated communication and e-learning have revolutionised the way we communicate, teach and learn, the scope of the term tutorial has expanded to include instructional resources and tools that ‘tutor’ students individually, supporting or even replacing the human tutor. Furthermore, due to the rapid pace of technological change and the growing demand for training, tutorials have established a strong online presence, primarily for the benefit of informal learners seeking practical guidance on a variety of areas and activities. Oxford Dictionaries’ secondary definition incorporates this semantic shift as “an account or explanation of a subject, printed or on a computer screen, intended for private study” (Oxford Dictionaries 2016).

Nowadays, tutorials comprise a wide range of forms, technologies and sometimes fuzzy terms, including ‘online tutorials,’ ‘electronic tutorials,’ ‘interactive tutorials,’ ‘computer-based tutorials,’ ‘video tutorials,’ ‘multimedia tutorials’ and ‘intelligent tutorials,’ among others. We conducted a literature search of the term ‘(video) tutorial’ to investigate the state of the art of this topic. The bibliographic resources and digital libraries searched include Google Books, DART Europe and IEEE Xplore. The terms listed above were among the most relevant occurrences of the word ‘tutorial.’ As the search results were rarely associated with scholarly research in the field of applied linguistics, we also queried the electronic databases Linguistic Abstracts Online, the MLA International Bibliography and the Translation Studies Bibliography. Results generally point to a scant number of contributions dealing with (video) tutorials from the perspective of foreign language learning, discourse analysis or translation (see next section for the literature review).
The most sophisticated standalone tutorial systems are designed to provide instructional content, assess students and direct them to more challenging problems in an automatic fashion. ‘Online tutorials’, in particular, are the object of an extensive literature in the educational domain and have proven to be highly effective in university courses in technical areas such as the physical sciences and engineering. Their main benefits include self-pacing, appeal to different learning styles, accessibility, round-the-clock availability and cost-efficiency, although significant disadvantages exist as well. In fact, online tutorials are best suited for distance learning and conveying specific types of knowledge, namely declarative, conceptual and procedural knowledge, whereas problem-solving knowledge, cognitive strategies, attitude learning and psychomotor skills are less likely to be enhanced by online tutorials (Rempel and Slebodnik 2015: 7).

Video tutorials can be broadly defined as asynchronous instructional videos (often screencasts, but also animation or live-action footage) providing step-by-step guidance for specialised activities. Although they share many similarities with computer-based tutorials — and are sometimes featured inside them — video tutorials are not computer programs per se and do not offer extensive interactive features, except for the play/pause button and video menu options. Likewise, the terms ‘online tutorials’ and ‘multimedia tutorials’ are generally used to refer to educational resources that incorporate web-centric content, thought-provoking interactions, productive tests, automatic feedback and surveys, in addition to audiovisual features. Their creation requires specific competence in graphic design, interactive design and web development, while video tutorials instead call for advanced video and audio production skills.

Finally, a screencast is a digital recording of computer screen activity, often containing audio narration (Raftery 2010: 213). While a screenshot is a fixed screen capture, a screencast is a recording of a sequence of screen captures. As such, screencasts are particularly suited for illustrating how to use particular software. Offering a “look over my shoulder effect,” they are reminiscent of one-on-one instruction (Raftery 2010: 213). Adding audio narration enhances multimedia learning and motivates the viewer to follow an expert at work, thus offering a scaffold for undertaking the activity herself/himself. In the present discussion, we will use the term screencast mostly to indicate plain recordings of computer screen activity, contrasting them to edited video tutorials that contain narration and additional visual cues.

3. Video-based learning and beyond

Video-based learning is widely recognised as a powerful pedagogical tool in online teaching activities. The earliest video-based learning experiments were conducted during the Second World War for military training and were then applied over the years to educational television, VHS, and later to CDs and DVDs in the classroom, and more recently to Internet technologies, enabling the integration of video conferences and video applications in education (Yousef et al. 2014). MOOCs (massive open online courses, which provide a large number of learners the opportunity to attend free online courses from anywhere in the world) and the “flipped classroom” (an instructional approach whereby learners watch video lectures as homework, with the class serving as an active learning session in which the concepts presented in the video lecture are discussed and explored, Yousef et al. 2014: 130) are today regarded as the most advanced expressions of video-based learning and are exerting a profound impact on teaching and learning methodologies.

Presenting knowledge in an attractive and engaging way, videos tend to stimulate learners’ attention while motivating them to participate, ultimately improving learning outcomes (Zhang et al. 2006). Furthermore, videos can support different learning styles, and are appealing to both visual and auditory learners. The combination of visual and oral materials is known to enhance memory and cognition due to the activation of two separate but interconnected neurological systems (dual coding theory, Paivio 1971). These systems have limited processing capacities and providing too much information with written text, pictures, or sounds can cause cognitive overload (cognitive load theory, Sweller 1988). However, presenting visual and oral content simultaneously helps learners maximise intake and integrate new content with previous knowledge, as confirmed by the experiments on the cognitive theory of multimedia learning (Mayer and Moreno 1999; Tindall-Ford et al. 1997).

An influential model in learning theory is Dale’s cone of experience (1969), which illustrates how information is understood, processed, transferred and maintained as knowledge in the learning process (see Figure 1). The focus on experience is motivated by the assumption that abstract concepts are more easily understood and retained by students if they are grounded in concrete personal experiences. The most effective instructional methods stand at the bottom of the cone and involve creative and practical activities that compel learners to better remember what they see and hear (Yousef et al. 2014: 125). Bringing together video-based learning and learning-by-doing, video tutorials provide a unique tool for knowledge acquisition along with skills development. Audiovisual demonstrations invite learners to experience knowledge first hand and can be repeated any number of times, paused to perform the steps involved in the task and resumed for further learning.

Figure 1. Cone of experience (Yousef et al. 2014: 125, adapted from Dale 1969)

These assumptions have been substantially corroborated by research focusing on the use of video tutorials to enhance the laboratory learning of university students in engineering and programming (Mu et al. 2009; Wells et al. 2012). Results show that video tutorials improve student engagement and learning outcomes, especially when they are provided as a prerequisite for completing assignments. Their use facilitates not only distance learning but also on-campus sessions, as students can watch the video tutorial at home and save lab time for questions and troubleshooting. Results also confirm that video tutorials are more effective than text-based instructions and written materials:

static content struggles to pass on experience, technique, process, and an application of knowledge. This is especially relevant where applying that knowledge requires the use of unfamiliar tools and environments, whose pitfalls have the potential to frustrate and distract the learner (Wells et al. 2012: 458).

The main drawback of video tutorials is the time and resources needed to create them, since even a short video can take many hours of preparation.

Video tutorials are a strategic resource in large-scale educational projects, ensuring scalability, uniformity of delivery and cost-effectiveness. The Spoken Tutorial Project, developed by the Indian Institute of Technology and funded by the Indian Government, has launched a massive training programme to promote IT literacy through open-source software. The use of video tutorials during hands-on workshops allowed for the training of more than 1,000,000 students and teachers in 2,000 colleges across the country (Srivastava and Sharma 2013).

From an applied linguistics perspective, Gromik’s pioneering study (2007) is the first attempt to test the impact of video tutorials on foreign language learners. The experiment, held in an EFL course for Japanese students in engineering, required learners to first design two electronic presentations using Open Office Impress and then deliver a speech for each. The class was divided into an experimental group which was provided with video tutorials presenting the software functionalities and a control group. The integration of video tutorials into the curriculum of the experimental group proved to be beneficial for enhancing students’ understanding of how to use software, technical expertise and autonomous learning. However, the study mainly focused on foreign language listening and comprehension and failed to take into account improvements in lexical acquisition or oral skills.

On the basis of this review, and to the best of our knowledge, it is apparent that video tutorials have been explored mostly from an educational perspective and few attempts have been made to describe them as a multimodal instance of communication (Bezemer and Kress 2016). Furthermore, the theories of multimedia learning are fundamental for building an interdisciplinary framework for the study of the most effective audiovisual practices in the creation and translation of video tutorials, as will be shown in particular with regard to captioning (section 5.3.1).

4. Video tutorial types: a classification proposal

As a multimodal text, video tutorials unfold in different contexts of situation, intended as the total environment of the text. The context of situation, which encapsulates key extra-linguistic factors such as the setting, the participants and their relationship, the communicative purpose, the medium and the genre, determines to a great extent the vocabulary, the discursive patterns and the specific multimodal features of the text. For instance, many video tutorials commonly available on YouTube are amateur video productions in which (expert) users address other users in an informal learning context. These videos are not always carefully scripted and post-edited. As a result, the verbal texture tends to lack cohesion and coherence while the audio performance may be unsteady, featuring typical characteristics of spontaneous speech (fillers, hesitations, repetitions and anacolutha) in an effort to match onscreen actions and speech. Among the extra-linguistic factors that most significantly impact the textual patterns of video tutorials, the distribution infrastructure provides core technological affordances that enable a variety of multimodal (re-)configurations, for instance through the closed captioning menu (section 5.3.1) or video response options (Adami 2014). Furthermore, video tutorials can be accessed via a range of devices, including desktop computers, laptops, mobile phones, tablets and digital TVs — the technological trend being multi-device usability.

In order to set the groundwork for a more thorough linguistic inquiry into this new audiovisual genre, we will identify diverse contexts of production, distribution and use. In this section we will put forward a classification of three main tutorial types: ‘academic’, ‘corporate’ and ‘user-generated’ video tutorials, based on hybrid parameters of pragmatic-situational and audiovisual interest. Note that in this context, the label ‘corporate’ may refer not only to tutorials designed for customers, but also to tutorials used in-house for staff training purposes (the focus of this study is on corporate tutorials addressed to a large group of international users, thus foregrounding accessibility and localisation issues). This classification is not meant to be rigid or inclusive of all instances of video tutorials but rather seeks to identify key extra-linguistic factors that are likely to shape textual and multimodal features of each tutorial type.

[Table 1. Columns: Academic tutorials; Corporate tutorials; User-generated tutorials. Rows: Learning environment; Intended audience; Communicative purpose; Content domain; Level of specialisation. Surviving cell entries: “teaching staff”; “standalone/educational platform”, “Official website”, “Video-sharing platform”; “Interlingual subtitling/adaptation”.]

Table 1. Classification of the main video tutorial types

The level of formality of the learning environment is a crucial distinctive factor, for it situates the text in an institutionalised context of communication between teachers and learners, which can be understood as a form of scientific instruction. Academic video tutorials are designed as part of a broader curriculum and are the object of assessment activities and student feedback, whether on campus or through distance learning. Academic tutorials are created by teaching staff, including (lab) teachers and instructional designers, and are tailored to the learning goals and competencies of a more-or-less homogeneous class of students. One exception to this prototypical situation is video tutorials addressed to larger academic audiences, as in the case of video tutorials for library instruction or the aforementioned Spoken Tutorial Project in India. It should also be noted that students are often asked to create video tutorials as part of their coursework, in which case the addresser/addressee relationship is reversed. As far as the content domain is concerned, academic video tutorials generally incorporate highly specialised discourse, often blending multiple fields, such as medicine and engineering, or information technology and ecology. As the distribution and scalability of this tutorial type are restricted to the specific instructional needs of a small to medium group of people in a given institution (ranging from colleges and universities to vocational training schools and other types of postsecondary educational institutions), academic tutorials do not lend themselves well to translation. Captioning, however, is a recommended feature for ensuring content accessibility for foreign and hearing-impaired learners.

Unlike academic video tutorials, corporate (along with user-generated) video tutorials can be framed within an informal learning context of situation. This has implications on various levels. First, video tutorials are designed for rapid and effective consultation, just like user’s guides. Their communicative purpose is to ‘tutor’ the customer who is exploring a new product/software or to offer assistance at a point of need. Unlike scientific instruction, knowledge transfer takes place within a (unidirectional) flow of communication from specialist to non-specialist. Furthermore, no assessment activities, questionnaires or exams follow their viewing, although user feedback may be gathered via web analytics and short questions. Corporate video tutorials are usually published on the official website of a product/service vendor alongside relevant support documentation and promotional videos, as in the case of the Adobe Learn and Office Training Center infrastructures (section 5). This tutorial type is created by a team of skilled professionals, including product designers, instructional designers, technical writers, voice actors and multimedia specialists. As a result, the audiovisual quality of corporate tutorials tends to be generally high, not only relative to user-generated tutorials but also to academic tutorials. The target audience of corporate video tutorials tends to be quite large and heterogeneous in many respects, including prior knowledge, expertise and age. Therefore, corporate video tutorials blend instructional and popularising techniques that liberally employ semi-technical vocabulary, simplifications and scaffolded content. The specific content domain of this video tutorial type bears on language for specific purposes (most often ICT) but may also relate to other technical domains such as electronics and robotics.
Considering the wide distribution and high scalability of corporate tutorials, both accessibility needs and foreign languages and cultures must be taken into account. Not only is captioning a requirement, but translation (in the form of interlingual subtitling or adaptation) becomes a key strategy for their international viability.

With a ubiquitous online presence, user-generated video tutorials may show considerable sociolinguistic variation vis-à-vis the participants, their status and expertise, as well as the field of discourse and level of specialisation. Under this typology, we include in particular video tutorials that can be qualified as ‘vernacular’ productions. This label designates a broad category of amateur and informal videos including vlogs, mundane home movies and do-it-yourself tutorials, and can be contextualised within the wider framework of popular and participatory culture (Burgess and Green 2009; Omizo 2012). As opposed to mainstream media and professionally-made videos, vernacular productions are generally marked by non-professionalised skills, lower production quality, and a home-made mise en scène. In this context, expert users engage in the creation of audiovisual guides with the aim of tutoring the community of less expert users. This tutorial type covers a wide range of domains and activities: not only software and ICT, but also music, knitting, cooking, hairdressing and all sorts of crafts. The level of linguistic specialisation may therefore range from everyday language to language for specific purposes to jargon and argot, and from semi-technical to technical vocabulary. Moreover, specialised language may coexist with dialectal or sociolectal variation at the phonetic, phonological, morphological and syntactic levels. Accessibility and translation do not generally represent a major concern in vernacular video production. If we consider, for example, the Open Office video tutorials available on YouTube, we can find many user-generated tutorials not only in English, but also in French, Italian, German, Portuguese, Spanish and other languages. Each linguistic version is an original performance of authors who implicitly address their own language community.
Automatic captions are an option in some of these video tutorials, but they are rarely post-edited and checked for transcription accuracy.

5. Pilot study on Microsoft PowerPoint tutorials: multimodal and localisation insights
5.1. Preliminary remarks and methodology

After addressing fundamental terminological, theoretical and situational-pragmatic issues, we move on in this section to focus more particularly on the official PowerPoint tutorials, which we have selected as a reference sample of corporate tutorials involving the ICT domain.
The Office Training Center platform offers a variety of colourful and engaging instructional materials including video tutorials, webinars and quick start guides along with online support documents for the use of Office applications. These materials seem to highlight not only a manifest communicative shift from monodimensional to multidimensional texts (Díaz Cintas 2008b), but also an ‘instructional turn’ in the corporate management of support documents and services to users, moving from instructions to instruction (i.e. to full educational programmes relying on practical training and multimedia learning). This trend, as shown by the new Adobe Learn and YouTube Help documentation, reflects the findings of scholarly research highlighted in section 3 and significantly mirrors the broader “learning society” scenario, which places education, learning and training at the centre of its social and economic concerns, fostering a growing circulation of formal and informal learning content (Guerra (ed.) 2010; Holford et al. 1998). Delving into the profound pedagogical, sociological and economic implications of this paradigm is beyond the scope of the present discussion but it is within our remit to address specific multimodal and translation challenges that it raises.

The Office Training Center platform is available in 63 locales, including more than 30 different languages worldwide. This presupposes not only a marked accessibility of the instructional content in the original U.S. English version, but also a management model for the creation, internationalisation, localisation and translation of both web and video-based content3.

After a preliminary investigation of multilingual versions of the PowerPoint tutorials (mostly in English, French, German, Italian and Spanish, but other languages have also been taken into account for the sake of accuracy), we have identified three major localisation strategies:

  • No localisation: original U.S. English version with closed captions,
  • Partial localisation: original U.S. English version with interlingual subtitles,
  • Full localisation: adaptation of visual-verbal, visual non-verbal and audio resources.
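The three strategies listed above can be sketched as a small data model. This is only an illustrative sketch: the names `Strategy`, `LocalisedVersion` and `describe`, as well as the locale codes, are hypothetical and not part of the study. The mapping assumes that audio and onscreen resources stay in U.S. English unless the version is fully localised, and it leaves the text track of a fully localised version unspecified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strategy(Enum):
    NONE = "no localisation"
    PARTIAL = "partial localisation"
    FULL = "full localisation"


@dataclass(frozen=True)
class LocalisedVersion:
    strategy: Strategy
    audio_language: str                  # language of the narration
    onscreen_language: str               # language of visual-verbal resources
    text_track: Optional[str]            # kind of caption/subtitle track, if stated
    text_track_language: Optional[str]   # language of that track, if stated


def describe(strategy: Strategy, target: str, source: str = "en-US") -> LocalisedVersion:
    """Map a localisation strategy to the resources that change (assumed mapping)."""
    if strategy is Strategy.NONE:
        # Original version, made accessible through same-language closed captions.
        return LocalisedVersion(strategy, source, source, "closed captions", source)
    if strategy is Strategy.PARTIAL:
        # Original audio and visuals; only the subtitle track is translated.
        return LocalisedVersion(strategy, source, source, "interlingual subtitles", target)
    # Full localisation: audio and visual resources are adapted to the target locale.
    return LocalisedVersion(strategy, target, target, None, None)


italian = describe(Strategy.PARTIAL, "it-IT")
french = describe(Strategy.FULL, "fr-FR")
```

Under these assumptions, the Italian version keeps the English narration and adds only a translated subtitle track, whereas the French version replaces both the narration and the onscreen language.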

Next, we have selected the PowerPoint 2013 tutorial collection from among the 2010 and 2016 PowerPoint series due to its status as the most recent ‘freeze’ version. Video tutorials are subject to regular updates due to software modifications or the maintenance of website and instructional materials on both a local and international basis. In particular, we have singled out the tutorial Apply and change a theme in order to: (1) investigate the key semiotic dimensions and resources that it incorporates, along with their interrelationship, and (2) examine a partially localised version (Italian) and a fully localised version (French).
The methodology of analysis of new multimodal entities is known to pose major challenges, both from a descriptive and contrastive perspective, requiring ad hoc approaches and potentially new theories (Adami and Kress 2014; O’Sullivan 2013). In order to overcome these challenges, we have devised a multidisciplinary approach that combines the toolbox of multimodality (Baldry and Thibault 2006) and relevant paradigms in audiovisual translation (Díaz Cintas (ed.) 2008a; Mangiron et al. (eds) 2014), linking them with the principles of video-based learning. In the next section, we move on to contextualise PowerPoint tutorials in relation to the social semiotic approach to the study of PowerPoint (Djonov and van Leeuwen 2013; Zhao et al. 2014) before touching on localisation issues.

5.2. Multimodal insights

Nowadays there is a copious amount of literature on PowerPoint presentations. In this context, the term PowerPoint refers to (i) the software designed by Microsoft, (ii) a slideshow created with it, or (iii) a presentation featuring such a slideshow (Zhao et al. 2014: 351). Of particular interest is the social semiotic approach to PowerPoint, which provides a valuable basis for re-contextualising and describing the semiotic modes and resources that are deployed in PowerPoint tutorials. Based on this model, we can identify three interdependent dimensions of semiotic practice: software design, the multimodal composition of slideshows and slideshow-supported presentation (Zhao et al. 2014: 354-357). These dimensions are connected through two semiotic artefacts: the software and the slideshow. Specifically, the first dimension entails “software designers selecting from potentially unlimited meaning-making resources that are recognized in the culture (such as language, images, colours, layouts and texture, etc.), making the selected resources available to users through the software’s interface” (Zhao et al. 2014: 355). The design practice produces the software artefact, which is employed in the second dimension of PowerPoint (i.e. the multimodal composition of slideshows). At this level, users select from the resources available within the software (font types, texture, layout, animation, and sound effects, etc.) while simultaneously creating or incorporating new content (written language, images, video, etc.). Finally, the slideshow-supported presentation actualises and reconfigures the slideshow artefact in a complex and embodied semiotic event that intertwines verbal (speech) and non-verbal resources (e.g. gestures, head movements, time sequence) in accordance with the discursive practices of the target community.

Drawing on this approach, we can identify five major semiotic dimensions of practice in the creation of PowerPoint tutorials: the PowerPoint software design, the slide composition, video composition, audio production and captioning. These are interconnected through three artefacts: the video tutorial authoring software (or suite of tools), the plain screencast (embedding the PowerPoint software design, slide composition and initial video composition) and the final video tutorial (see Figure 2). A major difference in this particular use of PowerPoint is that the software interface and slide content elements become the stage for a fictional user’s performance in which the selection of menus and items, cursor movements, the time sequence and onscreen actions are captured to yield the (plain) screencast artefact. Thus, the screencast turns out to be the simulacrum — in the philological sense of simulare ‘to make like, imitate, copy, represent’ — of the PowerPoint software artefact and of the multimodal composition practices that it enables. The semiotic device of mise en abyme, actualised on the technological stage, generates an interesting form of screen-within-a-screen, a game of mirrors for the real PowerPoint user dealing with the composition of slides, thereby generating an immersive learning experience. Afterwards, the screencast artefact becomes the canvas for composite semiotic practices that for the most part transcend the PowerPoint software and its capabilities. The slideshow presentation event is replaced by a complex workflow encompassing video composition that incorporates additional static and dynamic visual resources (such as photos, illustrations, callouts and 3D animation) and editing (cuts, zooms, transitions, montage), audio production (narration, post-synchronisation, editing and mixing) and captioning.

Figure 2. Semiotic dimensions of practice in the creation of PowerPoint tutorials

This conceptualisation seeks to chart the core semiotic technologies and dimensions of practice that make up video tutorials as a composite and multi-faceted artefact. Moving beyond a surface analysis of the spatial arrangement and temporal sequence of multimodal elements that are observable in the final product, it specifically aims to scrutinise the diverse semiotic practices and components that construe each artefact stratum. Furthermore, this model will form the basis for a multimodal transcription and analysis scheme that takes into account the interplay of original and/or localised semiotic resources with perspectival depth (examples 1, 2 and 3).

5.3. Localisation insights

In the following sections we address fundamental issues that bear on the localisation strategies of video tutorials, using examples extracted from the tutorial Apply and change a theme in the original U.S. English version (no localisation), Italian version (partial localisation) and French version (full localisation). For analytical purposes, we have applied the model of multimodal transcription and analysis introduced above to gain insights into the rich combination of multimodal dimensions and resources in multiple linguistic versions, while making observations on particular audiovisual translation issues. With specific regard to the closed caption and interlingual subtitle dimensions, we have referred to established norms and conventions in subtitling research and practice (Baldo de Brébisson 2016; Díaz Cintas and Remael 2007; Ford Williams 2009), focusing on the key spatial and temporal features that are likely to have a major impact on multimedia learning. Although these standards mostly apply to film/TV genres and media, they can nonetheless serve as a useful reference in evaluating the quality of video tutorial captions and interlingual subtitles, keeping in mind that the overall quality of a multimedia product is best measured by its usability in terms of effectiveness, efficiency and user satisfaction (Barnum 2011).

5.3.1 No localisation

The first localisation level of the PowerPoint tutorials foregrounds key accessibility and internationalisation issues (Anastasiou and Schäler 2010; Oncins et al. 2013). The original U.S. English tutorials are designed to accommodate not only users with functional diversity, but also international users who have variable degrees of competence in English, thus serving as an international pivot version. On the one hand, support for captions, among other accessibility features, is a key functionality for ensuring inclusive learning on an international scale. On the other hand, it has been argued that adding captions to an already rich screen texture risks saturating the visual channel and increasing the cognitive load of learners, thus detrimentally affecting learning (Rempel and Slebodnik 2015: 83-84). Unlike film viewers, video tutorial viewers must simultaneously or sequentially process multiple layers of verbal information: the PowerPoint software design, slide content, video composition elements, speech and, additionally, captions (see Example 1). A good practice implemented in the Office PowerPoint tutorials is keeping captions (and interlingual subtitles) on a separate track and adding menu options that allow viewers to turn them on or off (closed captions/subtitles) or to customise them. In addition, captions consisting of scripted renditions of the audio text may be sufficient to accommodate a wide audience, as the transcription of paralinguistic information such as speaker identification, music and background noise is not applicable to this audiovisual genre.

Turning to relevant spatial features in the PowerPoint 2013 Apply and change a theme tutorial, we can first note that the caption region is positioned within the screen region, causing potential text overlaps as shown in Example 1, where the closed caption ‘Imagine you’ve created slides/for your presentation,’ obscures the PowerPoint caption ‘Apply and change a theme.’ Usually, video tutorials start with a full screen title composed of a static image or an animation in the form of an intro, followed by the screencast. In this case, however, the tutorial opens directly with the screencast and a PowerPoint caption at the bottom of the screen that functions as a title.

Example 1. Multimodal transcription and analysis of the PowerPoint 2013 tutorial Apply and change a theme, original U.S. English version

Nevertheless, the menu at the bottom right corner of the video window offers dedicated Closed Captioning Settings that enable users to customise the caption window size, font size and family, foreground and background colours as well as text edge style. This menu is a form of paratext that enables a reconfiguration of some of the video resources afforded by the website infrastructure. The menu allows not only for the activation of closed captions in the available languages but also for the selection of a variety of closed caption formats, video quality settings and audio tracks. We can consider this an optimal accessibility feature for users with impairments as well as a good internationalisation practice, aimed at enabling a product at a technical level for localisation (Anastasiou and Schäler 2010: 13), as this menu allows for the adjustment of captions/subtitles in target locales that may require different layouts, such as Chinese and Japanese.

Among other key spatial features, captions are displayed on one or two (rarely three) lines and do not exceed 45 characters per line, although caption segmentation does not follow syntactic rules and is presumably carried out automatically. Clearly, if the viewer modifies the default caption track via the dedicated menu, these standards no longer necessarily hold.

At the temporal level, the synchronisation of captions and speech is a key requirement for enhancing multimedia learning (Mayer and Moreno 1999; Tindall-Ford et al. 1997). Captions follow the conventions of spotting, generally appearing when the narrator starts to speak, and abiding by the six-second rule (Díaz Cintas and Remael 2007). The other semiotic dimensions must also be synchronised to speech, giving viewers enough time to read the user interface elements and follow onscreen actions. Unlike captions, these dimensions appear to some extent ahead of the speech that describes them.
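
The spatial and temporal norms discussed above can be expressed as a simple automated check. The following Python sketch is purely illustrative: the `Caption` class, the `check_caption` function and the two-line ceiling are hypothetical choices made for this example and do not belong to the tutorials or to any subtitling tool, while the 45-character limit and the six-second maximum are taken from the discussion above. The timings in the usage lines are invented for demonstration.

```python
from dataclasses import dataclass

MAX_CHARS_PER_LINE = 45  # per-line limit observed in the English captions
MAX_LINES = 2            # assumption: two lines taken as the usual ceiling
MAX_DURATION_S = 6.0     # six-second rule


@dataclass
class Caption:
    lines: list[str]     # caption text, one entry per displayed line
    time_in: float       # seconds from the start of the video
    time_out: float      # seconds from the start of the video


def check_caption(caption: Caption) -> list[str]:
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    if len(caption.lines) > MAX_LINES:
        problems.append(f"{len(caption.lines)} lines (max {MAX_LINES})")
    for line in caption.lines:
        if len(line) > MAX_CHARS_PER_LINE:
            problems.append(
                f"line of {len(line)} characters (max {MAX_CHARS_PER_LINE})"
            )
    duration = caption.time_out - caption.time_in
    if duration > MAX_DURATION_S:
        problems.append(
            f"on screen for {duration:.1f} s (max {MAX_DURATION_S:.0f} s)"
        )
    return problems


# Caption text from Example 1; the timings are hypothetical.
english = Caption(
    ["Imagine you’ve created slides", "for your presentation,"], 0.0, 3.0
)
print(check_caption(english))  # []
```

A check of this kind only covers the measurable conventions; it says nothing about whether line breaks respect syntactic units, which would require linguistic analysis.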

5.3.2. Partial localisation

Partial localisation entails preserving all the semiotic dimensions of the original tutorial (namely, software design, slide content, video composition and audio resources) while providing closed interlingual subtitles in the target language (see Example 2). This method has been applied to many linguistic versions of the PowerPoint 2013 tutorials, including Scandinavian languages, Italian and Polish. Analysis of the Italian tutorial Applicare e modificare un tema (Apply and change a theme) raises a number of theoretical issues. First, Italian users are acquainted with Italian localisations of Microsoft Office applications. Therefore, a tutorial, like all supporting documentation, is expected to be provided in the linguistic version of the software that customers use. Second, this localisation level is very likely to increase the cognitive load of target users, who must simultaneously process media-rich information in two languages, one of which they may know little. This atypical context is further complicated by the “anchoring function” of semiotic elements (Marleau 1982: 274), such as user interface items, which are coded at three different levels. As illustrated in Example 2, the target user must look at three salient elements: the “Themes gallery,” the “More arrow” and the “bottom arrow” (i.e. the cursor) while reading and matching the Italian translations (namely, raccolta Temi, freccia Altro and freccia in basso) and listening to the audio text within a few seconds. Although video tutorials can obviously be stopped and replayed, the saturation of multidimensional and multilingual information may be confusing to weaker learners, especially those who are less competent in English or have limited computer literacy. Finally, the quality of interlingual subtitles appears to be poorer than the quality of closed captions in the original version, from a spatial, temporal (i.e. synchronisation) as well as linguistic point of view.

Software Design/Slide Content

  • PowerPoint 2013 interface:

Design tab
Default view → Themes menu → Themes gallery
U.S. English locale

  • Title slide:

Theme: Office
Title: “Fabrikam”
Subtitle: “For fine living”
Alignment: centred
Colours: black characters on white background
Font: Calibri Light

  • Left menu:

5 slide miniatures

Video composition

  • Cursor:

Moving. From the Themes menu, selects the More arrow and opens the dropdown Themes gallery in the top left corner of the screen. The cursor then scrolls through the vertical bar of the Themes gallery to show all available options and thumbnails.

Interlingual subtitles

  • Caption region within the screencast (and video) region
  • “Per vedere la raccolta Temi, fai clic sulla/ freccia in basso, chiamata freccia Altro”.

Colours: white characters on black background
Font: Arial
Font size: 100%
(default settings)

Audio

  • Narration:

Gender: female voice
Accent: U.S. English

  • “To see the full Themes gallery, click this bottom arrow, called the More arrow”
  • Other audio elements:

No music, system sounds or special effects



Time in: 00:01:05,01 ≈
Time out: 00:01:13,03 ≈

Time in: 00:01:04,23 ≈
Time out: 00:01:09,06 ≈


Example 2. Multimodal transcription and analysis of the PowerPoint 2013 tutorial Applicare e modificare un tema, original English version with interlingual subtitles in Italian

These observations merit further exploration by combining usability testing protocols established in the domain of instructional design and reception/perception methodologies already applied in the field of audiovisual translation (Kruger and Steyn 2014; Perego (ed.) 2012). Such exploration should compare the impact of each localisation level on selected groups of foreign language users.

Usability tests already form part of the tutorial design process and involve three primary methods for gathering user feedback, starting from the early development stages: rapid prototyping, formative usability testing and beta testing (Rempel and Slebodnik 2015: 123-133). Tutorial usability is measured at four levels: reaction, learning, application and programmatic results (Kirkpatrick 2006). Despite the use of advanced techniques and technologies for collecting user feedback and input data, current approaches focus mainly on formative assessment and fail to specifically address two crucial factors relevant to the international dimension of video tutorials: the heterogeneity of the audience, namely foreign (language) users, and the presence of multilingual elements in the video, viz. interlingual subtitles.

In our view, eye-tracking methods that measure perceptual and attentional patterns of selected samples of learners combined with comprehension tests could provide further insights into usability testing approaches while contributing to the final assessment of learning outcomes. Nonetheless, based on our initial observations and the results of comparative tests involving the use of different subtitling modes for language learning purposes (see Note 4), our preliminary hypothesis is that this localisation level is more cost-effective than accessible.

5.3.3. Full localisation

Full localisation involves adapting (or transcreating) all the semiotic dimensions of the original tutorial. An examination of the fully localised French tutorial Appliquer et modifier un thème (Apply and Change a Theme) provides several insights into the complex process of adapting all the relevant semiotic dimensions:

1) Software design: the official French version of the PowerPoint 2013 software displays the same menus and options included in the original tutorial. Of course, the software interface has been localised by a dedicated team.

2) Slide composition: the multimodal content of slides, related to the fictional project named ‘Fabrikam for fine living’, is faithfully localised into French, including written text, colours, fonts, images, themes and layout.

3) Video composition: the video resources of the original tutorial are duplicated in the French version, with only the verbal components (viz. the PowerPoint caption text Appliquer et modifier un thème) translated. More generally, the original video composition project is “faithfully” rendered in the localised version, including onscreen actions, animation effects, montage, cuts, duration and rhythm.

4) Audio: the script, accurately translated into French, is read by a female voice actor who delivers a professional performance with the appropriate intonation, pitch and pauses. The audio track is properly synchronised with visual resources. As in the original version, no music or sound effects are present.

5) Closed captions: closed captions consist of scripted renditions of the audio text with the same translated script presumably used for the audio and caption dimensions. Synchronisation with utterances and other visual information is generally kept in good balance. However, the video composition elements tend to have the same duration as the original video, while the translated script is considerably longer. As a result, French captions may appear on four lines or fail to comply with the six-second rule (see Example 3). In the sequence under examination, the caption expansion (Vous réfléchissez à présent à l’apparence de celles-ci) and overall duration (7 seconds) are partly due to the anticipation of information contained in the second caption of the original version (“and now you’re thinking about slide design—”), while the time codes of video resources are broadly similar in both versions.

Software Design/Slide Content

  • PowerPoint 2013 interface:

“Accueil” (Home) tab
Default view
French/France locale

  • Title slide:

Theme: Office
Title: “Fabrikam”
Subtitle: “Découvrez un nouvel art de vivre” (for fine living)
Alignment: centred
Colours: black characters on white background
Font: Calibri Light

  • Left menu:

5 slide miniatures

Video composition

  • Title screen:

PowerPoint Caption

  • Caption text: “Appliquer et modifier un thème”

Alignment: left
Colours: white characters on red background
Font: plain,
Animation effects: only fade out, left entry. Written text appears and disappears with the background.

  • Cursor:

Still. Pointing to Title slide on left menu.

Closed captions

  • Caption region within the screencast (and video) region
  • Text: “Imaginez que vous avez créé des/ diapositives pour une présentation. Vous/ réfléchissez à présent à l’apparence de/ celles-ci.” (Imagine you’ve created slides for your presentation, and now you’re thinking about slide design.)

Colours: white characters on black background
Font: Arial
Font size: 100%
(default settings)

Audio

  • Narration:

Gender: female voice
Accent: Standard French (France)

  • “Imaginez que vous avez créé des diapositives pour une présentation. Vous réfléchissez à présent à l’apparence de celles-ci.”
  • Other audio elements:

No music, system sounds or special effects


Time in: 00:00:00,02 ≈
Time out: 00:00:04,10 ≈

Time in: 00:00:00,08 ≈
Time out: 00:00:07,15 ≈

Time in: 00:00:00,13 ≈
Time out: 00:00:07,05 ≈


Example 3. Multimodal transcription and analysis of the PowerPoint 2013 tutorial Appliquer et modifier un thème, fully localised French version
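
The durations implied by the time codes in Example 3 can be worked out mechanically. The Python sketch below is a rough illustration under two stated assumptions: that the final two digits of each time code count frames, and that the video runs at 25 frames per second (the frame rate is not given here, and the `FPS` constant and function names are hypothetical). It deliberately does not assign the three pairs to particular semiotic dimensions; it only reports which spans exceed six seconds.

```python
import re

FPS = 25  # assumption: the frame rate is not stated


def timecode_to_seconds(timecode: str, fps: int = FPS) -> float:
    """Convert 'HH:MM:SS,FF' to seconds, reading FF as a frame count."""
    match = re.fullmatch(r"(\d{2}):(\d{2}):(\d{2}),(\d{2})", timecode)
    if match is None:
        raise ValueError(f"unrecognised time code: {timecode!r}")
    hours, minutes, seconds, frames = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + frames / fps


def duration(time_in: str, time_out: str, fps: int = FPS) -> float:
    """Length in seconds of the span between two time codes."""
    return timecode_to_seconds(time_out, fps) - timecode_to_seconds(time_in, fps)


# The three time-in/time-out pairs listed in Example 3, in their original order.
pairs = [
    ("00:00:00,02", "00:00:04,10"),
    ("00:00:00,08", "00:00:07,15"),
    ("00:00:00,13", "00:00:07,05"),
]

for time_in, time_out in pairs:
    length = duration(time_in, time_out)
    flag = "exceeds six seconds" if length > 6.0 else "within six seconds"
    print(f"{time_in} to {time_out}: {length:.2f} s ({flag})")
```

Under these assumptions the second and third spans last roughly seven seconds, which is in line with the overall duration reported for the French sequence.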

From these general observations we can infer that the U.S. English source tutorial has been localised by adapting all relevant semiotic dimensions, which entails a complex, time-consuming and costly process. In our view, the concept of ‘transcreation’ does not fully apply in this case because, even though multiple semiotic resources were translated or recreated, the localisation project did not seek to remove the preconceived authority of the original or give way to a fresh avenue for the creation of a new entity (Mangiron and O’Hagan 2013: 199

6. Conclusions

Adopting a top-down approach, this paper has developed an interdisciplinary framework and a conceptualisation model for the semiotic investigation of video tutorials, thus laying the foundation for a more fine-tuned assessment of accessibility and localisation issues in this type of audiovisual product. After proposing a classification of video tutorial types, we focused our attention on ‘corporate tutorials’ using the PowerPoint tutorials as a benchmark for describing new audiovisual practices that are becoming increasingly relevant to both multimedia learning and technical communication.

We have barely scratched the surface of a video-based genre which, like MOOCs, is gaining an increasing foothold in formal and informal learning contexts. In our view, the theories and methods of audiovisual translation have a crucial contribution to make toward enhancing the multimedia learning and accessibility features of instructional content on a broader scale while providing a valuable framework for reception studies and usability tests. In particular, established standards and conventions in captioning and interlingual subtitling practices should be tested and adjusted for instructional purposes, contexts and settings. Indeed, access to, and the accessibility of, knowledge is crucial for removing the linguistic, social and economic barriers of the digital divide. Finally, we note that video tutorials may be used not only as a pedagogical tool in a variety of specialised domains, but also as a learning playground for the development of students’ technical, linguistic and multimodal skills and for the training of audiovisual and technical translators. Remarkably, the semiotic practices involved in the creation and localisation of video tutorials are linked to the new European framework of reference for audiovisual communicative skills (namely, Watch, AVlisten, AVread, AVspeak, AVwrite, AVproduce) and suggest that video tutorials can be integrated into course materials for a wide range of creative and practical activities.

References

  • Adami, Elisabetta (2014). “‘Why did dinosaurs evolve from water?’: (in)coherent relatedness in YouTube video interaction.” Text & Talk 34(3), 239-259.
  • Adami, Elisabetta and Gunther Kress (2014). “Introduction: multimodality, meaning making, and the issue of ‘text’”. Text & Talk 34(3), 231-237.
  • Anastasiou, Dimitra and Reinhard Schäler (2010). “Translating Vital Information: Localisation, Internationalisation, and Globalisation.” Syn-Thèses 3, 11-25.
  • Baldo de Brébisson, Sabrina (2016). “Formes, sens et pratiques du sous-titrage spécial.” Signata 7, 255-284.
  • Baldry, Anthony and Paul J. Thibault (2006). Multimodal Transcription and Text Analysis. London: Equinox.
  • Barnum, Carol M. (2011). Usability Testing Essentials: Ready, Set...Test!. San Francisco, CA: Morgan Kaufmann.
  • Bezemer, Jeff and Gunther Kress (2016). Multimodality, Learning and Communication: A Social Semiotic Frame. London and New York: Routledge.
  • Burgess, Jean and Joshua Green (2009). YouTube: Online Video and Participatory Culture. Cambridge: Polity Press.
  • Dale, Edgar (1969). Audiovisual Methods in Teaching. New York: Dryden Press.
  • Díaz Cintas, Jorge and Aline Remael (2007). Audiovisual Translation: Subtitling. Manchester: St. Jerome Publishing.
  • Díaz Cintas, Jorge (ed.) (2008a). The Didactics of Audiovisual Translation. Amsterdam/Philadelphia: John Benjamins.
  • - (2008b). “Teaching and learning to subtitle in an academic environment.” Díaz Cintas (2008a), 89-104.
  • Djonov, Emilia and Theo van Leeuwen (2013). “Between the grid and composition: Layout in PowerPoint’s design and use.” Semiotica 197, 1-34.
  • Ford Williams, Gareth (2009). Online Subtitling Editorial Guidelines V1.1. Subtitling Guidelines (consulted 10 April 2017).
  • Gambier, Yves (2015). “Thirty years of research in subtitles and language learning. The knowns and unknowns.” Beatrice Garzelli and Monica Baldo (eds) (2015). Subtitling and Intercultural Communication, European Languages and Beyond. Pisa: Edizioni ETS, 145-68.
  • Gromik, Nicolas (2007). “Video tutorials, Camtasia in the EFL classroom.” JALT CALL Journal 3(1-2), 132-140.
  • Guerra, Luigi (ed.) (2010). Tecnologie dell'educazione e innovazione didattica. Bergamo: Edizioni Junior.
  • Holford, John, Jarvis, Peter and Colin Griffin (eds) (1998). International Perspectives on Lifelong Learning. London: Kogan Page.
  • Kirkpatrick, Donald (2006). Evaluating Training Programs. San Francisco, CA: Berrett-Koehler Publishers.
  • Kruger, Jan-Louis and Faans Steyn (2014). “Subtitles and eye tracking: Reading and performance.” Reading Research Quarterly 49(1), 105–120.
  • Mangiron, Carme, Orero, Pilar and Minako O’Hagan (eds) (2014). Fun for All: Translation and Accessibility Practices in Video Games. Bern: Peter Lang.
  • Mangiron, Carme and Minako O’Hagan (2013). Game Localization, Translating for the global digital entertainment industry. Amsterdam/Philadelphia: John Benjamins.
  • Marleau, Lucien (1982). “Les sous-titres... un mal nécessaire.” Meta: Translators’ Journal 27(3), 271-85.
  • Mayer, Richard E. and Roxana Moreno (1999). “Cognitive Principles of Multimedia Learning: The Role of Modality and Contiguity.” Journal of Educational Psychology 91(2), 358-368.
  • Mu, Xiaoyan et al. (2009). “Work in progress - video-based lab tutorials in an undergraduate Electrical Circuit course.” 39th IEEE Frontiers in Education Conference, 1-2. (consulted 10 June 2018).
  • Omizo, Ryan Masaaki (2012). Facing vernacular video. PhD Thesis. The Ohio State University.
  • Oncins, Estella et al. (2013). “All Together Now: A multi-language and multi-system mobile application to make live performing arts accessible.” The Journal of Specialised Translation 20, 147-164.
  • Online Etymology Dictionary (2001-2016). (consulted 15 June 2017)
  • Oxford Dictionaries (online) (2016). Oxford University Press, 2016. (consulted 25 June 2017)
  • O’Sullivan, Carol (2013). “Introduction: Multimodality as challenge and resource for translation.” The Journal of Specialised Translation 20, 2-14.
  • Paivio, Allan (1971). Imagery and verbal processes. New York: Holt, Rinehart, and Winston.
  • Perego, Elisa (ed.) (2012). Eye-Tracking in audiovisual translation. Roma: Aracne editrice.
  • Raftery, Damien (2010). “Developing Educational Screencasts: A Practitioner’s Perspective.” Roisin Donnelly, Jen Harvey and Kevin O'Rourke (eds) (2010). Critical Design and Effective Tools for E-Learning in Higher Education: Theory into Practice. New York: Information Science Reference, 213-226.
  • Rempel, Hannah Gascho and Maribeth Slebodnik (2015). Creating online tutorials: a practical guide for librarians. Lanham, Maryland: Rowman & Littlefield.
  • Sweller, John (1988). “Cognitive load during problem solving: Effects on learning.” Cognitive Science 12(2), 257-285.
  • Srivastava, Madhukriti and Sharda Sharma (2013). “Spoken Tutorial Project — IIT Bombay: Building IT literate India.” IEEE International Conference in MOOC, Innovation and Technology in Education (MITE), 289-293. (consulted 10 June 2018).
  • Tindall-Ford, Sharon, Chandler, Paul and John Sweller (1997). “When Two Sensory Modes are Better than One.” Journal of Experimental Psychology: Applied 3(4), 257-287.
  • Wells, Jason, Barry, Robert Mathie and Aaron Spence (2012). “Using Video Tutorials as a Carrot-and-Stick Approach to Learning.” IEEE Transactions on Education 55(4), 453-458.
  • Yousef, Ahmed Mohamed Fahmy, Chatti, Mohamed Amine and Ulrik Schroeder (2014). “The State of Video-Based Learning: A Review and Future Perspectives.” International Journal on Advances in Life Sciences 6(3-4), 122-135.
  • Zhang, Dongsong et al. (2006). “Instructional video in e-learning: Assessing the impact of interactive video on learning effectiveness.” Information & Management 43(1), 15-27.
  • Zhao, Sumin, Djonov, Emilia and Theo van Leeuwen (2014). “Semiotic technology and practice: a multimodal social semiotic approach to PowerPoint.” Text & Talk 34(3), 349-375.

Gianna Tarquini has been a Postdoctoral Research Fellow in the Department of Interpreting and Translation at the University of Bologna, where she managed the FORLIXT project, a multimedia database for foreign language learning and translator training. She has developed a research project on videogame localisation for the Game Studies area at the Kwansei Gakuin University in Japan and for the GILT framework at the Localisation Research Centre of the University of Limerick. She is also a member of the teaching staff of the CAWEB master’s degree at the University of Strasbourg, a programme strongly focused on web and video creation in addition to foreign languages and localisation.


Richard E. McDorman is Chief Academic Officer at Language On, a private language training institution in Miami, Florida. He is also a Commissioner and the 2019 Chair of the Commission on English Language Program Accreditation (CEA), an accrediting agency recognised by the United States Department of Education for the accreditation of postsecondary English language programs and institutions. A freelance translator and author and editor of ESL/EFL textbooks, his prior research has focused on historical linguistics, phonetics and phonology, English dialectology and the semiotics of early writing systems.




Note 1:
From Old French tuteor ‘guardian, private teacher’ (Modern French tuteur), from Latin tutorem (nominative tutor) ‘guardian, watcher,’ from tutus, variant past participle of tueri ‘watch over,’ of uncertain origin (Online Etymology Dictionary).

Note 2:
Here, the term ‘accessibility’ is used in its broadest sense as encompassing not only users with functional diversity, but also those who do not speak the original language (Oncins et al. 2013). Closed captioning is a key feature for ensuring video tutorial accessibility, but there are many others, especially if we consider related usability issues. These include tutorial design features such as the use of clear graphics and text fonts with adequate contrast between elements, the provision of consistent navigational cues, the organisation of content according to a logical reading order and the use of arrows and text boxes to highlight important concepts. Flashing objects should be avoided as they tend to distract learners and may trigger seizures in individuals affected by seizure disorders (Rempel and Slebodnik 2015).

Note 3:
This model is known as GILT (Globalisation, Internationalisation, Localisation and Translation) and encompasses “the global product development cycle, where internationalisation includes the planning and preparation stages for a product and localisation the actual adaptation of the product for a specific market” (Anastasiou and Schäler 2010: 14). In this context, translation mainly refers to the transfer of written text while localisation involves linguistic, cultural and technical adaptation. Nevertheless, the notion of translation incorporates intralinguistic, interlinguistic and intersemiotic transfer.

Note 4:
Although not fully applicable to the sample under examination, comparative research on the use of different subtitle types for foreign language learning suggests that interlingual subtitles are the less effective input method while intralingual subtitles are suitable for both formal and informal learners (Gambier 2015: 147-162).