Exploring a city as vibrant as Berlin, an urban landscape punctuated by music from booming sound systems and pulsating with imagery projected on every wall and façade during the current Festival of Lights, sharpens the mind to the relevance of accessible media in all our lives. This dynamic geographical space, with its iconic audiovisual heritage, speaks not of a backwards glance at a challenging past, but of a desire for rapid advancement, a willingness to adapt to new realities, and a relentless optimism for the future.
What better location, then, to serve as a backdrop for Languages and the Media 2018, a meeting of almost 400 of the world’s leading audiovisual translation and media accessibility experts squaring up to the dual challenges of harnessing digitisation and adapting to the advances afforded by automation. Framed as ‘The Fourth Industrial Revolution: Re-Shaping Languages in the Media’, the conference brought together broadcasters, practitioners, researchers and technicians from around 40 countries to reflect on technological and methodological innovation in audiovisual media and the role language (subtitling, audio description, dubbing, re-speaking) plays in enhancing viewer access.
In our role as ambassadors for the MeMAD project, we set out not only to present our own research agenda, but also to give a flavour of the work being undertaken by our fellow MeMAD project partners. Consequently, our presentation, ‘From Slicing Bananas to Pluto the Dog’, began with an overview of the project aims before focusing on the work being undertaken across the Universities of Surrey, Aalto and Helsinki on the evaluation of machine- versus human-generated video description. We surveyed two frameworks for meaning-making, mental modelling and cognitive narratology, each representing a perspective on the kinds of viewing experience commonly encountered when humans engage with images in a multimodal context. For most of us, this also means seeking relevance in observed human communication, establishing coherence within narrative storytelling, and employing powers of inference to detect meaning where no visual or verbal exposition occurs.
But what, I hear you ask, of computer vision and machine-based storytelling? How might a computer, which ‘sees’ film material only as independent still images and produces descriptions of these visual artefacts based on neural networks and crowd-sourced captions, be trained to describe sophisticated storylines? The short answer: not easily.
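To make the point about independent still images concrete, the sketch below captions video frames one at a time with an off-the-shelf model. This is not the MeMAD pipeline; the Hugging Face transformers library and the model named here are simply assumptions for the purposes of illustration.

```python
# Illustrative only: frame-by-frame captioning with a publicly available model.
# Assumes the Hugging Face `transformers` and `Pillow` packages are installed;
# the model name is an example, not the system used within MeMAD.
from transformers import pipeline
from PIL import Image

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def describe_frames(frame_paths):
    """Caption each extracted video frame in isolation - no memory of the story."""
    descriptions = []
    for path in frame_paths:
        result = captioner(Image.open(path))   # e.g. [{'generated_text': 'a dog ...'}]
        descriptions.append(result[0]["generated_text"])
    return descriptions

# Each caption reflects only what is visible in that single frame; nothing
# connects one frame's description to the next, let alone to a storyline.
print(describe_frames(["frame_001.jpg", "frame_002.jpg"]))
```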
Preliminary comparisons between machine- and human-generated descriptions of moving images suggest a long road ahead. Life experience will continue to give human meaning-making a significant edge over the computer for the foreseeable future, and the training data used to supply the computer with a limited facsimile of this experience must expand exponentially if computers are to assume responsibility for a significant proportion of the human describer’s workload. For example, whereas the machine may currently be able to identify an image of a woman or man, and a room populated by furniture, any relationship between these ‘objects’ (e.g. man/woman, desk, bed, window), and the precise nature of that inter-relatedness, remains largely out of reach. Cognitively, semantically and syntactically, therefore, the computer describes with less finesse than the average three- or four-year-old child. For this reason, interpretation within a narrative context – for instance, identifying visual markers of emotion or irony – remains far beyond the computer’s grasp.
To illustrate the size of this knowledge gap, one image we used in our presentation (below) portrays a stuffed toy resembling Disney’s Pluto the Dog, a bloodhound with floppy ears and a long yellow snout, which the computer-generated video description rendered simply as ‘a banana’. Alas, so near and yet so very far!
So, for the moment, our analysis must start from the ground up, beginning with a basic comparison of lexical content between key elements captured in machine descriptions (character, action, object etc.) and those present in the corresponding human annotations (a much-simplified sketch of this kind of comparison appears below). However, we are still very much at the beginning of our human-versus-machine journey. As computer-generated video descriptions become more sophisticated, we anticipate that our agile, multi-layered, multi-dimensional annotation and analytical approach will allow us to continue comparing machine outputs with their human counterparts in a meaningful and quantifiable way. With that in mind, we look forward to returning to the L&M conference in the near future with a different story to tell. We expect our research will capture evidence of how the ‘human advantage’ plays out in circumstances where the computer fails. Perhaps we might then be a step closer to understanding how machine-generated descriptions could one day liberate the human audio describer to focus on what humans do best: supporting intuitive, creative and entertaining storytelling. Not simply sorting out bananas from bloodhounds!
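For the technically minded, here is a much-simplified sketch of what such a ground-up lexical comparison could look like. The stop-word list and the Jaccard overlap measure are assumptions made purely for illustration; the project’s actual annotation scheme is far richer and multi-layered.

```python
# Illustrative only: a crude lexical-overlap comparison between one
# machine-generated description and one human annotation. The real MeMAD
# analysis works with multi-layered annotations, not a bag of words.
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "with", "and", "at"}

def content_words(description):
    """Lower-case, tokenise and drop stop words, keeping likely content words."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def lexical_overlap(machine_desc, human_desc):
    """Shared content words plus a Jaccard score as a rough proxy for agreement."""
    machine, human = content_words(machine_desc), content_words(human_desc)
    shared = machine & human
    union = machine | human
    score = len(shared) / len(union) if union else 0.0
    return shared, score

machine = "a banana on a table"
human = "A stuffed toy dog with floppy ears lies on the table."
shared, score = lexical_overlap(machine, human)
print(shared, round(score, 2))   # {'table'} 0.12
```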