*not a real word
by Shindo N. Strzelczyk, software engineer at Pop Up Archive
Podcast enthusiasts have a hard time talking about the thing they love. There is even disagreement about whether to use the Apple-centric term “podcast,” or the too-broad term “radio,” or the vague and generic-sounding “digital radio,” to describe, precisely, digital spoken audio on the internet. Another problem is that this medium is still relatively new and only recently gaining popularity, so the discourse surrounding it is unsettled. In spite of its newness, there are over 250,000 podcasts in the iTunes store, comprising more than 8,000,000 episodes, and almost all of that content is opaque and unsearchable. So, if you want to know what’s happening right now in the “podcastphere” (for lack of a better word), you are pretty much out of luck.
Our newest project, Audiosear.ch, is an attempt to index, analyze, and interpret all of the data about podcasts on the internet, and make that data publicly available through our API. Part of that process involves generating high-quality automatic transcripts and extracting entities related to the content. We have now identified over 10,000 people from those entities and begun to determine whether each is a host, producer, guest, or topic of conversation in shows and individual episodes.
Our methods are still exploratory and experimental, but now that we have amassed a wide selection of podcasts from various sources, totaling more than 190,000 minutes of audio across about 6,400 episodes, we can begin to analyze trends and patterns in the data.
Our first data visualization is a pair of charts of the most-mentioned people in our database over the last five months, grouped by week. The full interactive visualization allows you to change the time range and isolate the data for individual people.
The nature of this data presents several challenges when attempting to quantify it. For example:
- Varying timelines of publication. We decided to group podcasts by week since that gives a relatively good estimate of total mentions in a given time period. But podcasts are produced at various intervals ranging from daily to monthly, so mentions that land in the same week were not necessarily prompted by the same events.
- People with multiple aliases. You may also notice the conspicuous absence of Hillary Clinton in this list, which includes many politicians. That is because our people entities are named in reference to Wikipedia, and that site uses her full name: Hillary Rodham Clinton. So, when our software scans our transcripts, it’s looking for “Hillary Rodham Clinton,” even though that’s not how people usually refer to her. (We are currently working on identifying aliases of people.)
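To make both challenges concrete, here is a minimal Python sketch of how mentions might be counted per week while folding aliases into one canonical name. It is an illustration only, not our production pipeline: the `ALIASES` table, the function names, and the `(publish_date, transcript)` input shape are all assumptions made up for this example.

```python
import re
from collections import Counter, defaultdict
from datetime import date, timedelta

# Hypothetical alias table: canonical (Wikipedia-style) name -> variants
# that might appear in a transcript. Invented for this illustration.
ALIASES = {
    "Hillary Rodham Clinton": ["Hillary Rodham Clinton", "Hillary Clinton"],
}


def count_mentions(transcript, aliases):
    """Count mentions per canonical name, treating every alias as a hit."""
    counts = Counter()
    for canonical, variants in aliases.items():
        # Try longer variants first so the fullest form of a name wins.
        ordered = sorted(variants, key=len, reverse=True)
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b",
            re.IGNORECASE,
        )
        hits = len(pattern.findall(transcript))
        if hits:
            counts[canonical] += hits
    return counts


def week_start(day):
    """Return the Monday of the week that contains `day`."""
    return day - timedelta(days=day.weekday())


def mentions_by_week(episodes, aliases):
    """Group mention counts by the week each episode was published.

    `episodes` is an iterable of (publish_date, transcript) pairs.
    """
    weekly = defaultdict(Counter)
    for published, transcript in episodes:
        weekly[week_start(published)].update(count_mentions(transcript, aliases))
    return weekly
```

Calling `mentions_by_week(episodes, ALIASES)` returns one counter per week, keyed by that week's Monday. Note that the grouping uses the publication date, so a monthly show discussing old news still lands in the week it was released, which is exactly the timeline caveat described above.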
As we continue to grow our catalog of podcasts and improve the accuracy of our data, we hope others will follow our lead and utilize it to begin to crack open the secrets of the podcastphere. And if you can come up with a better word for it, please tweet it to us.