a close-up picture of a radio broadcast microphone overlooking a studio backdrop.

photo courtesy of Fringer Cat on Unsplash.

rewind, part 1: an idea. (several.)

Jan 1, 2023

you might be thinking that this is going to become a recap of the past year (which was 2022, to establish chronological context). in a way, you're both right and wrong.

more specifically, this is one of a series of blog posts talking through how I'm writing a (concept?) web service named Rewind. this service is meant to parse metadata for episodes of PRX Remix into a more user-friendly index of episodes.

inspiration.

I'm fairly certain my first listen to PRX Remix on my local NPR affiliate (specifically, the Ideas Network on WPR - thus inspiring the post title) was 2022-11-19. while listening, I kept wanting to learn more about everything that played in the 3 hour block, and I found it interesting that every "episode" of Remix is documented on PRX's websites. (yes, you read that right - PRX has the Exchange, their "classic" frontend, in addition to a beta site that appears less monolithic. keep this in mind, as it becomes somewhat important.)

I tried to look through both frontends to find the episode that played on the radio some time later, but this ended up being difficult because:

  • looking directly at the Exchange site for Remix unfortunately doesn't list the latest pieces like it should. sorting by "oldest" shows them all in a paginated view up until a certain point, but episodes past that point don't even render.
  • to work around this, you can look at the "group account" for Remix on the Exchange. now we can at least see episode numbers and the "times" they're supposed to air...
  • but clicking on one of these might not correctly resolve the URL right away. for whatever reason, classic Exchange won't resolve Remix episode URLs until after the "time" stamped in the title has passed. (if we visit too early, it'll tell us "The page you were looking for doesn't exist.")
  • even weirder - if we visit the beta site for the Remix series, we don't see any kind of timestamp indication in the episode titles (????), but... the episodes we're seeing ahead of time are actually streamable!
  • the beta site does not allow us to "page" through episodes like the Exchange, so (unfortunately) a lot of scrolling is involved to get to the result you need, and that might not even be correct if you're not cross-referencing both sites.
  • using the Exchange's search functionality is also hit or miss, even if you know the exact episode ID that you're expecting to see. sometimes the episodes show up as you expect them to, and sometimes (read: often) they don't show up at all.

note that you can usually switch from the beta site to the Exchange via the "View on PRX Exchange" link in the footer, and switching from the Exchange to the beta site is pretty simple (just replace "exchange.prx.org" with "beta.prx.org" in the URL). still, both workarounds are tedious for questions that should be easy to answer, even if the Exchange may not be geared towards regular listeners.

given all of the above, I started looking at ways to make the Remix episodes more searchable. there's got to be a way...

enter the CMS API.

there's an endpoint for that?

I poked through the developer tools in Firefox (my browser of choice!) and found that the beta site makes calls to API endpoints at cms.prx.org. a quick Google search found that its source code is available on GitHub (AGPL v3), and there's a friendly browser interface for exploring its endpoints. nice!

to avoid encouraging folks to overload the CMS servers, I won't go into detail (for now) as to how things work, but it's fairly trivial to get much more than 10 records in a consistent sorting order (ascending, by episode number). we get a lot of information, and the description (formatted in both Markdown and HTML) contains the list of pieces used in each episode:

{
  "id": 455033,
  "title": "Remix: Episode 109980",
  "shortDescription": "Episode 109980 of the Remix series.",
  "episodeNumber": 109980,
  "createdAt": "2023-01-01T16:34:37.000Z",
  "updatedAt": "2023-01-01T16:37:40.000Z",
  "publishedAt": "2023-01-01T16:37:39.000Z",
  "duration": 3540,
  "points": 0,
  "appVersion": "v3",
  "description": "<ul>\n\t<li><div>PRX Remix Be Good to Them</div><div>The Books and Roman Mars</div><div>00:00:22</div></li>\n\t<li><div>Let It Grow</div><div>AirSpace</div><div>00:18:32</div><div><a href='https://exchange.prx.org/pieces/418929'>PRX</a></div></li>\n\t<li><div>It's Musiiiiic!</div><div>The Books</div><div>00:00:29</div></li>\n\t<li><div>The Core</div><div>AirSpace</div><div>00:17:30</div><div><a href='https://exchange.prx.org/pieces/451180'>PRX</a></div></li>\n\t<li><div>PRX Remix DJs for Stories</div><div>Ray Pang & Fantasia Fantasia by Kevin MacLeod</div><div>00:00:39</div></li>\n\t<li><div>Closing the TaB</div><div>WFIU</div><div>00:11:50</div><div><a href='https://exchange.prx.org/pieces/443124'>PRX</a></div></li>\n\t<li><div>It's Complicated</div><div>Sandip Roy's Dispatches from Kolkata</div><div>00:06:00</div><div><a href='https://exchange.prx.org/pieces/438575'>PRX</a></div></li>\n\t<li><div>Guess this Sound!</div><div>Ibby Caputo</div><div>00:01:00</div><div><a href='https://exchange.prx.org/pieces/51062'>PRX</a></div></li>\n\t<li><div>What it's like to be a parent</div><div>Kathleen Polanco</div><div>00:02:00</div></li>\n\t<li><div>Impress Of</div><div>MelonFlex</div><div>00:00:16</div></li>\n\t<li><div>Solomon Jones</div><div>RJD2</div><div>00:00:34</div></li>\n\t<li><div>Remix Sonic ID</div><div>Remix</div><div>00:00:06</div></li>\n<ul>",
  "descriptionMd": "- \nPRX Remix Be Good to Them\n\nThe Books and Roman Mars\n\n00:00:22\n- \nLet It Grow\n\nAirSpace\n\n00:18:32\n\n[PRX](https://exchange.prx.org/pieces/418929)\n- \nIt's Musiiiiic!\n\nThe Books\n\n00:00:29\n- \nThe Core\n\nAirSpace\n\n00:17:30\n\n[PRX](https://exchange.prx.org/pieces/451180)\n- \nPRX Remix DJs for Stories\n\nRay Pang & Fantasia Fantasia by Kevin MacLeod\n\n00:00:39\n- \nClosing the TaB\n\nWFIU\n\n00:11:50\n\n[PRX](https://exchange.prx.org/pieces/443124)\n- \nIt's Complicated\n\nSandip Roy's Dispatches from Kolkata\n\n00:06:00\n\n[PRX](https://exchange.prx.org/pieces/438575)\n- \nGuess this Sound!\n\nIbby Caputo\n\n00:01:00\n\n[PRX](https://exchange.prx.org/pieces/51062)\n- \nWhat it's like to be a parent\n\nKathleen Polanco\n\n00:02:00\n- \nImpress Of\n\nMelonFlex\n\n00:00:16\n- \nSolomon Jones\n\nRJD2\n\n00:00:34\n- \nRemix Sonic ID\n\nRemix\n\n00:00:06",
  "tags": [],
  "license": {
    "streamable": true,
    "editable": false
  },
  ...
}

Markdown seems a lot easier to parse with regex, so we'll use it moving forward here. note that each piece is delimited by "- \n" in the description, and within each piece, "\n\n" separates each attribute. (also, we have additional "\n" characters to work around.)
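to make that layout concrete, here's a minimal sketch of how the splitting could work. this is not the actual json_episode_parser.py; the names Piece and parse_description_md, the field names, and the choice to skip chunks with fewer than three attributes are all assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional

# each piece starts with "- \n", either at the very start or after a newline
PIECE_DELIMITER = re.compile(r"(?:^|\n)- \n")
# the numeric part of an Exchange piece URL
EXCHANGE_ID = re.compile(r"/pieces/(\d+)")


class Piece(NamedTuple):
    # hypothetical record layout, not taken from the real script
    title: str
    collection: str
    duration: str
    exchange_id: Optional[int]


def parse_description_md(description_md: str) -> list[Piece]:
    """Split a descriptionMd string into one Piece per list item."""
    pieces = []
    for chunk in PIECE_DELIMITER.split(description_md):
        # attributes are separated by blank lines; strip stray newlines
        fields = [f.strip() for f in chunk.split("\n\n") if f.strip()]
        if len(fields) < 3:
            # assumption: anything without title/collection/duration is skipped
            continue
        title, collection, duration = fields[:3]
        exchange_id = None
        if len(fields) > 3:
            match = EXCHANGE_ID.search(fields[3])
            if match:
                exchange_id = int(match.group(1))
        pieces.append(Piece(title, collection, duration, exchange_id))
    return pieces


# the first two pieces from the episode shown above
SAMPLE = (
    "- \nPRX Remix Be Good to Them\n\nThe Books and Roman Mars\n\n00:00:22"
    "\n- \nLet It Grow\n\nAirSpace\n\n00:18:32\n\n"
    "[PRX](https://exchange.prx.org/pieces/418929)"
)

for piece in parse_description_md(SAMPLE):
    print(piece)
```

keeping the Exchange ID optional matters here, since plenty of pieces (station IDs, music beds) never get a link at all.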

speaking of attributes, we are given the following for each piece:

  • the piece title
  • the piece's collection name (which may be the artist/author/producer, the series the piece belongs to, or something else entirely)
  • the piece's duration (down to the second)
  • a link to the piece on the Exchange, if available (but even then, the link may be incorrect or nonfunctional due to technical issues - see below for an example)

because we can't rely on the Exchange ID (which is what I'll use to refer to the numeric part of series and piece URLs moving forward) to be given for each and every piece used, let alone correct, some inference magic will need to be used to refer to each individual piece.

let's take a look at some historical records to prove this. I've been developing a Python script to pull out the above data from each episode description:

chris@whatever:~/repos/python/rewind$ python3 json_episode_parser.py -f 'prx first 700 remix eps.json' | grep "Claire Schoen"
ID 85438 w/ runtime 00:20:18: "RISE, Part 2: Stemming the Tide" from Claire Schoen
ID 91633 w/ runtime 00:02:38: "Frogs" from Claire Schoen
ID 91633 w/ runtime 00:02:36: "Nature vs Noise" from Claire Schoen
ID 91633 w/ runtime 00:02:35: "Elephant Seals" from Claire Schoen
ID 91633 w/ runtime 00:02:36: "Wolves" from Claire Schoen
ID 91633 w/ runtime 00:02:38: "Frogs" from Claire Schoen
ID 91633 w/ runtime 00:02:36: "Wilderness Pollution" from Claire Schoen
ID 91633 w/ runtime 00:02:37: "Crickets" from Claire Schoen
ID 85438 w/ runtime 00:20:55: "RISE, Part 1: Swamped" from Claire Schoen
ID 91633 w/ runtime 00:02:36: "Bubbling Mudpots" from Claire Schoen
ID 91633 w/ runtime 00:02:38: "Frogs" from Claire Schoen
ID 91633 w/ runtime 00:02:36: "Bubbling Mudpots" from Claire Schoen
ID 91633 w/ runtime 00:02:37: "Crickets" from Claire Schoen
ID 91633 w/ runtime 00:02:38: "Frogs" from Claire Schoen
chris@whatever:~/repos/python/rewind$ 

here, we're seeing that multiple segments are filed under pieces 85438 and 91633. from an API perspective, this isn't a major issue: the endpoint for viewing a specific piece (or "story", if you follow the API terminology, but we'll call them pieces for consistency) will give us information on each segment if we query the additional endpoint it tells us about. it still means we can't expect the piece IDs given in the episode descriptions to point at exactly one piece, let alone be entirely correct.
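a quick way to spot this kind of overlap is to group titles by the ID they came with. the sketch below is an illustration only: the function name ids_shared_by_multiple_titles and the (exchange_id, title) tuple shape are assumptions, and the sample rows are a handful copied from the output above.

```python
from collections import defaultdict


def ids_shared_by_multiple_titles(pieces):
    """Return {exchange_id: sorted titles} for IDs attached to 2+ distinct titles.

    `pieces` is an iterable of (exchange_id, title) pairs; pairs whose
    exchange_id is None (no link in the description) are ignored.
    """
    titles_by_id = defaultdict(set)
    for exchange_id, title in pieces:
        if exchange_id is not None:
            titles_by_id[exchange_id].add(title)
    return {
        exchange_id: sorted(titles)
        for exchange_id, titles in titles_by_id.items()
        if len(titles) > 1
    }


# a few rows taken from the grep output above
ROWS = [
    (85438, "RISE, Part 2: Stemming the Tide"),
    (91633, "Frogs"),
    (91633, "Nature vs Noise"),
    (85438, "RISE, Part 1: Swamped"),
    (91633, "Frogs"),
]

for exchange_id, titles in ids_shared_by_multiple_titles(ROWS).items():
    print(exchange_id, titles)
```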

in some cases (as evidenced by, among other things, the search results for a specific piece from a PRX/Radiotopia podcast that apparently aired on Remix), not only is the piece in question not publicly available on the Exchange, but the URL that is provided links to... something else entirely.

all of this goes to show that we're probably better off deriving unique identifiers for pieces by:

  • hashing a combination of name and series to form a unique piece ID, and
  • hashing just the series name to create a unique identifier to correlate each piece to a common series, especially if neither is on the Exchange (or we're just being given incorrect information).
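a rough sketch of what those two identifiers could look like follows. nothing about the hash is settled in this post, so treat every choice here as an assumption: SHA-256, the 16-character truncation, the case/whitespace normalization, the separator character, and the function names series_key and piece_key.

```python
import hashlib

# assumption: ASCII "unit separator", so the series/title boundary stays
# unambiguous when the two strings are joined
_SEPARATOR = "\x1f"


def _normalize(text: str) -> str:
    # assumption: ignore case and collapse runs of whitespace so minor
    # formatting differences between episodes hash to the same key
    return " ".join(text.casefold().split())


def _digest(payload: str) -> str:
    # assumption: a truncated SHA-256 hex digest is long enough for this catalog
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def series_key(series: str) -> str:
    """Identifier shared by every piece in the same series/collection."""
    return _digest(_normalize(series))


def piece_key(title: str, series: str) -> str:
    """Identifier for one piece, derived from its series and title."""
    return _digest(_normalize(series) + _SEPARATOR + _normalize(title))


print(series_key("Claire Schoen"))
print(piece_key("Frogs", "Claire Schoen"))
print(piece_key("Wolves", "Claire Schoen"))
```

with this shape, "Frogs" and "Wolves" get different piece keys but share a series key, even though the episode descriptions hand both of them the same Exchange ID.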

I initially planned on using the Exchange ID as the primary key (and hashing to generate a "fake" ID where one didn't exist), but stumbling upon these caveats made that plan a nonstarter. doing things this way also gives us a few advantages in terms of flexibility:

  • if data parsed from the episode descriptions conflicts with what is returned from the Exchange/CMS API for a particular piece, we can implement an algorithm to manually (or automatically, where appropriate) adjudicate and seamlessly update this information for all instances of that piece across the board.
  • for pieces that do not exist publicly on the Exchange in any searchable manner, we can still keep track of them. (and probably do so more efficiently.)

I'd say that that's enough for part 1. I'm going to publish a follow-up at some point in the nebulous future... once I've gotten further along here. I may also need to obtain approval from station management[1].


  1. (bonus fun fact: Night Vale Presents, an independent podcast network that's home to podcasts like the one I just referenced, also partners with PRX.)

Chris

hi there. I'm the webadmin for this website and the applications hosted on it.