Human-driven AI Solutions to Access and Manage Audiovisual Content
A Research Perspective
Time: 16.00 CET
Duration: 60 minutes
We present the results of the MeMAD project for multimodal media description, and show how human work can be done effectively with state-of-the-art tools.
16.00 – 16.03 Mikko Kurimo, Welcome and presentation of speakers
16.03 – 16.13 Jorma Laaksonen, Recent advances in automatic description of audiovisual content
ABSTRACT: Automatic tools for media content analysis have traditionally aimed at indexing the content to facilitate fast and efficient search. Recent developments have made it possible to (semi-)automate the description of audiovisual media content, for example in situations where a textual gist of the content is needed, especially in archive retrieval, and – in the mid-term – for visually impaired audiences. In this presentation, we will demonstrate how various unimodal inputs, such as face recognition, speaker diarisation, language identification and audio background classification, can be combined in automatically generated captions for enriched media description.
16.13 – 16.23 Sabine Braun, Multimodal concept description
ABSTRACT: In the drive to make media archives more searchable (through content description) and audiovisual media more widely accessible (through audio description), computer vision experts strive to automate the description of audiovisual content in order to supplement human description activities. Advances in computer vision have led to increasingly accurate automatic image description, and – as demonstrated in the previous presentation – attempts at (semi-)automating the description of video scenes and audiovisual content have also begun to emerge. However, the complex, multimodal nature of audiovisual content remains a non-trivial challenge for automation. A key question for research – and one addressed in a research strand of the MeMAD project – is how machine-generated descriptions compare with their human-made counterparts, and how a better understanding of human approaches to describing audiovisual content can inform the (semi-)automation of this task. This presentation will focus on what makes human-derived descriptions of audiovisual content more accessible (in the broadest sense), appropriate or entertaining than those currently produced by AI-driven machine description, and on how insights into human processes of multimodal comprehension and description can inform efforts to develop viable (semi-)automated approaches.
16.23 – 16.33 Raphaël Troncy, Leveraging Knowledge Graphs and Human’s Inputs for Improving Media Understanding
ABSTRACT: Representing and modelling radio and TV programs, whether broadcast or archived, presents several challenges because of the variety of metadata that can be attached to them and the range of applications that use this metadata. To tackle these challenges, we have developed the MeMAD Knowledge Graph, which integrates audiovisual content from multiple distributors, producers, channels, genres and languages, and unifies access to the related metadata using the EBUCore ontology promoted by the European Broadcasting Union. Next, we present an Exploratory Search Engine that enables end users to search and explore collections of programs using this semantic knowledge graph. Finally, we present our approaches for predicting which shorter moments should be highlighted in those programs because they are judged to be either interesting or memorable, based on human input and external sources.
16.33 – 16.43 Jörg Tiedemann, Personalised solutions for translation and cross-lingual information access
ABSTRACT: Increasing volumes of multimodal content and global interest in productions worldwide create a growing demand for automatic translation and localisation of audiovisual content. Properly supporting translators in their demanding work of subtitle translation across highly diverse domains is still a scientific challenge. Personalising services and building adaptive machine translation for dynamic and context-aware applications is one of our focus areas. Using automatic translation for cross-lingual content access is another.
16.43 – 17.00 Conclusions and discussion