Pop Up shares source code for public media speech-to-text software

Sharing software to make sound searchable

 

Cultural heritage institutions around the world house millions of hours of audiovisual content — but much of that sonic history is effectively unsearchable. Even when reels, tapes, and discs are digitized, the content they contain is opaque. Each digital file is like a black box, impossible to see within.

This week, we’re thrilled to announce a major open-source software release intended to help combat this problem. Over the course of this year, Pop Up Archive has trained speech-to-text models tailored to public media content, for use with Kaldi, the widely used open-source speech-to-text software.

The development of this software is part of our work with WGBH and the American Archive of Public Broadcasting. Our goal is to make the American Archive — which contains over 40,000 hours of the most significant public radio and television programs from the past 60 years — searchable and discoverable.

To train our speech-to-text models, we collected millions of words from pre-existing public media transcripts and other content, then compiled the text into a language model, which is the component of speech-to-text software that deals with the probabilities of sequences of words or phrases.
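To give a feel for what "probabilities of sequences of words" means, here is a toy sketch of a bigram language model in Python. It is an illustration only, not the code we used to build our models: the function names and the three sample sentences are invented for this example, and a production language model is trained on millions of words and smooths its estimates so that unseen word pairs do not get a probability of zero.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count context words and word pairs over whitespace-tokenized sentences."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        # Every token except the last one serves as a context for the next word.
        context_counts.update(tokens[:-1])
        pair_counts.update(zip(tokens[:-1], tokens[1:]))
    return context_counts, pair_counts


def bigram_probability(context_counts, pair_counts, previous, word):
    """Unsmoothed estimate of P(word | previous); 0.0 for an unseen context."""
    seen = context_counts[previous]
    if seen == 0:
        return 0.0
    return pair_counts[(previous, word)] / seen


# Hypothetical sample text standing in for real transcripts.
transcripts = [
    "this is public radio",
    "this is public television",
    "this is the news",
]
context_counts, pair_counts = train_bigram_model(transcripts)

p_public = bigram_probability(context_counts, pair_counts, "is", "public")
p_radio = bigram_probability(context_counts, pair_counts, "public", "radio")
print(f"P(public | is) = {p_public:.3f}")
print(f"P(radio | public) = {p_radio:.3f}")
```

In this tiny sample, "public" follows "is" in two of three sentences, so the model considers it the likeliest next word. A speech recognizer uses that kind of evidence to choose between candidate words that sound alike.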

If you’re curious about the details, take a peek at these slides prepared by Pop Up Archive computational linguist Tali Singer. You’ll also find our source code on GitHub.

This work was funded by an Institute of Museum and Library Services Research Grant for the “Improving Access to Time Based Media through Crowdsourcing and Machine Learning” project (see the full IMLS grant proposal or visit the American Archive site).

We’re very excited to make this contribution to the digital archiving and audiovisual communities. We’d love to hear from anyone interested in implementing the software at their own institution.