Unlocking copyrighted media archives for research

Marjamäki castle in Tallinn, venue of Baltic Audiovisual Archives Council conference 2018

MeMAD research work, especially the parts analysing video and audio, depends on data and media, because it uses data-intensive methods such as data mining. MeMAD research teams of course use standard open datasets such as TRECVID, but we have also worked to gain access and rights to use data and media from the Yle and INA media archive collections. With these datasets we can test how the studied methods work on the actual video and audio content that media professionals deal with every day.

Having shared our thoughts with Baltic colleagues at the BAAC conference in Tallinn, it is also good to sum up here what we have learned so far about sharing media archive collections with research projects such as MeMAD.

Based on the feedback from MeMAD project research teams, separate datasets are needed for different research topics. Parallel subtitles in multiple languages are useful for translation studies, while rich visual storytelling might serve the research on audio descriptions. Studio discussions might serve the needs of facial recognition and speaker diarization, while lifestyle programs with varying topics may be better for visual object detection in their rich and changing settings. It is possible to combine some of these requirements into the same dataset, but in many cases it is simpler to release multiple parallel datasets for different purposes.

Rights and licensing

Clearing the rights for these datasets has been time-consuming, as a large number of rights holders affect our ability to access and use the media archives. Broadcasting archives seldom hold all the rights to programs in their collections, and the available rights are defined by a combination of individual production and distribution contracts, framework agreements with copyright societies, and legislation at both EU and national levels.

For example, a media archive such as Yle’s may hold a copy of a program, but that does not automatically mean that the archive can copy and distribute this program freely. During the production or archiving of the program, certain rights have been granted to the archive, but typically these rights need to be extended if the program is needed for purposes that have not been agreed upon in advance. Research through data mining and machine learning was not foreseen in past decades, when the rights to the archive content were agreed upon.

As always, communication helps. After explaining to the copyright societies what we wish to do and how the research methods actually work, we were able to agree on a license to use the copyrighted TV and radio programs from our archives. It’s worth noting that the newer techniques studied in our project are also new to the rights holders and their societies, so going through our plans in enough detail took its time, but it seems to have been time well spent. Views and expectations on the project should be based on facts instead of fears or suspicions.

Aim for long-term licenses

That said, there are conflicting interests between the research and rights management communities. Copyright societies, and to some extent legal staff in general, aim to manage risks by, for example, limiting the scope and duration of licenses and agreements. The research community, in contrast, is trending towards open science, which aims at open-ended licenses where the intended purpose or duration of availability of the content is not unnecessarily limited in advance.

Currently we have to find a balance between these interests, but from the media archive point of view, longer-lasting licenses are more attractive than one-time licenses. Since work has been put into getting the datasets together, it would make sense to maximise their use instead of licensing them for a single project only.

As our work in the MeMAD project progresses, we may come across new needs for datasets, but at the moment it seems we have a good starting point with the data we have. We will also see whether we actually need new datasets, or whether we will expand the existing ones with annotations and new data produced by our project teams. In the future we will also explore possibilities to provide the R&D community with media datasets that remain available long enough to be used as reference points, with licenses that allow flexible use of emerging research methods and techniques.

***

And why all the fuss about rights and licenses?

Extra effort on rights management is needed because our project deals mostly with copyrighted content such as TV shows. Copyright is an exclusive right by nature, which means the rights holders have full control over the use of their works, except in cases where copyright has been narrowed, e.g. by legislation.

Typically this means that old enough works move into the public domain and are free to use, or that legislators have deemed some types of use important for the common good. Exceptions to copyright have been granted, e.g. for research use, in some nations’ legislation.

Copyright is also narrowed in those cases where the rights holders have given away some or all of their rights. Typically this happens when a rights holder labels content as CC0, or uploads content to an online service whose terms of use state that some rights are granted or given to the company providing the service.