Analyzing audio with free speech-to-text models from Pop Up Archive

An emerging archival speech technology project is underway at the University of Texas at Austin, and its inspiration comes from a piece of software called ARLO that was developed to analyze an entirely different form of aural communication: bird song.

Called HiPSTAS (High Performance Sound Technologies for Access and Scholarship), the project is a collection of software tools and communication channels to help researchers access and analyze archival spoken word collections — for example, by analyzing audio spectrograms to identify particular traits such as pitch, tone, and speed. HiPSTAS was founded by Tanya Clement, an associate professor at the UT-Austin School of Information, along with collaborators at the University of Illinois Urbana-Champaign. The group has received several grants since its founding in 2012, including two from the National Endowment for the Humanities.

The ability to search archival spoken word recordings along multiple parameters has powerful implications for historical research and literary scholarship. While digital sound recordings are increasingly available to scholars, searching through the files for discernible patterns is time-intensive and cost-prohibitive.

Stephen McLaughlin is a research assistant and PhD student in Information Studies at UT-Austin and a contributor to HiPSTAS. Steve is currently working with WGBH and Pop Up Archive on the American Archive of Public Broadcasting (AAPB), part of which entails a massive effort to use machine learning to identify notable speakers’ voices (for example, Martin Luther King, Jr.) within AAPB’s 70,000 digitized audio and video recordings. As part of the AAPB project, Pop Up Archive transcribed the entire AAPB collection and released free models for the open source speech-to-text software Kaldi, so that other audio collections could be transcribed and made searchable.

While testing potentially useful tools to combine with HiPSTAS, Steve downloaded Pop Up Archive’s free models. “The language models Pop Up Archive has assembled are more accurate, and do a way better job of identifying proper nouns, than other tools I’ve seen,” said Steve. “The other exciting thing is that those models can be extended. I can take the model, transcribe a recording or correct its machine transcription, and use that output to make it more accurate for my needs. The flexibility of this system is really exciting.”

Kaldi is free software, released under the Apache License 2.0, that can be run on institutional servers, making it a natural choice for libraries, archives, and other institutions that take a long-term approach and have the technical resources to support their projects. Steve has seen firsthand why this matters: “I’ve seen this happen, where commercial API services are here and gone. So the ability to run Kaldi locally is really an important tool to have in your tool belt. Pop Up Archive is providing a really valuable service by contributing their resources to a project like the Kaldi models.”

As an exercise, Steve recently used Kaldi to run through 80 hours of recordings by the poet Robert Creeley from the PennSound archive, the topic of his undergraduate thesis. Creeley returned to certain words and phrases again and again, such as “company,” so Steve searched the Kaldi transcript for every instance of “company” and generated a supercut that strings the different tones and intonations into one continuous piece of audio.
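
For readers curious how a search like this might work in practice, here is a minimal sketch in Python. It is not Steve's code. It assumes you already have a word-level transcript with start and end times in seconds; the list-of-dictionaries layout, the sample data, and the function names find_word_clips and merge_overlaps are all hypothetical, chosen only for illustration. The sketch finds every occurrence of a word, pads each hit slightly, and merges overlapping hits into a list of time ranges.

```python
import string


def find_word_clips(words, target, padding=0.25):
    """Return a (start, end) pair, in seconds, for each occurrence of target.

    `words` is a list of dicts with "word", "start", and "end" keys.
    This layout is an assumption made for this sketch, not a format
    defined by Kaldi or Pop Up Archive.
    """
    target = target.lower()
    clips = []
    for w in words:
        # Ignore case and surrounding punctuation when comparing.
        token = w["word"].lower().strip(string.punctuation)
        if token == target:
            start = max(0.0, w["start"] - padding)
            end = w["end"] + padding
            clips.append((start, end))
    return clips


def merge_overlaps(clips):
    """Combine time ranges that overlap or touch, so no audio is repeated."""
    merged = []
    for start, end in sorted(clips):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Made-up sample data standing in for a real timed transcript.
transcript = [
    {"word": "the", "start": 0.0, "end": 0.25},
    {"word": "company", "start": 0.25, "end": 1.0},
    {"word": "of", "start": 1.0, "end": 1.25},
    {"word": "Company,", "start": 1.25, "end": 2.0},
    {"word": "friends", "start": 10.0, "end": 10.5},
    {"word": "company", "start": 12.0, "end": 12.75},
]

clips = merge_overlaps(find_word_clips(transcript, "company", padding=0.25))
for start, end in clips:
    print(f"{start:.2f}s to {end:.2f}s")
```

The resulting time ranges could then be handed to whatever audio tool you prefer to cut and join the matching segments. That step is left out here because it depends on the tool and the file format.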

HiPSTAS team members are eager to share what they’ve learned from their collective experience, and in the coming months Steve plans to publish code demonstrations for other researchers working with archival audio. He’d also like to write about his work with the American Archive of Public Broadcasting to identify speakers in large audio collections. Finally, he intends to run a workshop covering his work with Kaldi at one or more conferences later this year.

Anyone currently working on archival sound projects can access HiPSTAS’ preliminary release of tools, called the Audio Tagging Toolkit. You can also download Pop Up Archive’s Kaldi models on GitHub.

See you in the archive,
The Pop Up Archive team