Key Takeaways

  • Scaling podcast transcription infrastructure means scaling against a factor you do not control: load is driven by the volume of podcasts released every day, not by your own customer growth.
  • Cost-effective GPU infrastructure for transcription is achievable by prioritizing smaller, more affordable GPUs and self-managing servers, rather than relying on expensive, high-end AI-focused hardware or cloud services.
  • The hard problems go beyond raw processing power: transcript data must be stored efficiently and searched reliably, and poor audio quality plus mis-transcribed names and brands need ongoing attention.

Segments

Building Transcription Infrastructure (00:33:52)
  • Key Takeaway: Early prototyping leveraged existing tools like Whisper.cpp on CPUs, but scaling required a shift to GPU-based cloud solutions.
  • Summary: The initial prototype ran Whisper.cpp on local CPUs; once the workload outgrew that, the focus shifted to GPU-backed infrastructure, first developed locally and then run on cloud providers (a rough sketch of the CPU prototype stage follows below).
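
  For context, a minimal sketch of what that early CPU prototype stage can look like, assuming the whisper.cpp CLI and ffmpeg are installed locally; the binary path, model file, and output layout are illustrative assumptions rather than details from the episode:

    import subprocess
    from pathlib import Path

    WHISPER_BIN = "./main"                  # whisper.cpp CLI binary (path is an assumption)
    MODEL_PATH = "models/ggml-base.en.bin"  # any ggml model file works here

    def transcribe_episode(audio_path: str, out_dir: str = "transcripts") -> Path:
        """Convert an episode to 16 kHz mono WAV, then run whisper.cpp on the CPU."""
        audio = Path(audio_path)
        wav = audio.with_suffix(".wav")
        out_base = Path(out_dir) / audio.stem
        out_base.parent.mkdir(parents=True, exist_ok=True)

        # whisper.cpp expects 16 kHz mono WAV input
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(audio), "-ar", "16000", "-ac", "1", str(wav)],
            check=True,
        )
        # -otxt writes a plain-text transcript at the -of output base name
        subprocess.run(
            [WHISPER_BIN, "-m", MODEL_PATH, "-f", str(wav), "-otxt", "-of", str(out_base)],
            check=True,
        )
        return out_base.with_suffix(".txt")

    if __name__ == "__main__":
        print(transcribe_episode("episode.mp3"))

  The same wrapper shape generally carries over once the work moves to GPU servers; only the binary build and the model change.
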
Optimizing GPU Costs (00:45:52)
  • Key Takeaway: Cost-effective transcription infrastructure is achieved by using smaller, cheaper GPUs (like A10s) at scale, rather than expensive, high-end GPUs.
  • Summary: The speaker walks through the search for affordable GPU capacity: starting with expensive AWS instances, moving to Lambda Labs, and eventually landing on Hetzner, whose rented GPU servers proved a cost-effective fit for transcription (rough cost math in the sketch below).
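
  The arithmetic behind that choice can be sketched in a few lines; every volume, throughput figure, and price below is an illustrative placeholder, not a number quoted in the episode:

    # Compare on-demand cloud GPUs against flat-rate rented GPU servers running
    # around the clock. All numbers are illustrative assumptions.

    AUDIO_HOURS_PER_DAY = 20_000      # new audio to transcribe daily (assumption)
    AUDIO_HOURS_PER_GPU_HOUR = 25     # throughput of a smaller GPU, e.g. A10-class (assumption)

    gpu_hours_per_day = AUDIO_HOURS_PER_DAY / AUDIO_HOURS_PER_GPU_HOUR
    gpus_needed = gpu_hours_per_day / 24  # GPUs kept busy 24/7

    options = {
        # name: effective cost per GPU-hour in USD (assumptions)
        "on-demand cloud GPU instance": 4.00,
        "rented GPU server at a flat monthly rate": 200.0 / (24 * 30),
    }

    print(f"GPUs needed around the clock: {gpus_needed:.0f}")
    for name, rate in options.items():
        monthly = gpu_hours_per_day * rate * 30
        print(f"{name}: ~${monthly:,.0f}/month")

  The point of the smaller-GPU strategy is throughput per dollar: transcription parallelizes per episode, so many cheap GPUs kept fully busy beat a few expensive ones.
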
Advanced Transcription Features (01:29:52)
  • Key Takeaway: Diarization is more resource-intensive than transcription itself, requiring careful prioritization to manage costs and efficiency.
  • Summary: The discussion turns to advanced features such as diarization and word-level timestamps; because diarization carries a much higher computational cost than transcription, work has to be prioritized deliberately to keep resource use in check (one possible scheme is sketched below).
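
  One possible prioritization scheme, sketched under the assumption that transcription always outranks diarization and that diarization is only worth the GPU time for sufficiently popular shows; the threshold and job fields are hypothetical:

    import heapq
    import itertools

    DIARIZE_POPULARITY_THRESHOLD = 10_000   # e.g. follower count (assumption)
    _seq = itertools.count()                # tie-breaker for stable ordering

    # (priority, sequence, job kind, episode id); a lower priority number runs first
    queue: list[tuple[int, int, str, str]] = []

    def enqueue_episode(episode_id: str, popularity: int) -> None:
        # Transcription is always scheduled, and always ahead of diarization.
        heapq.heappush(queue, (0, next(_seq), "transcribe", episode_id))
        # Diarization is only scheduled for shows popular enough to justify the cost.
        if popularity >= DIARIZE_POPULARITY_THRESHOLD:
            heapq.heappush(queue, (1, next(_seq), "diarize", episode_id))

    enqueue_episode("ep-001", popularity=50_000)
    enqueue_episode("ep-002", popularity=300)
    while queue:
        _, _, kind, episode_id = heapq.heappop(queue)
        print(kind, episode_id)   # both episodes transcribed first, ep-001 diarized last
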
Storage and Search Challenges (01:43:34)
  • Key Takeaway: Massive amounts of transcript data necessitate offloading older transcripts to S3 storage and using dedicated search clusters (like OpenSearch) to avoid overwhelming the primary database.
  • Summary: The speaker explains how to store and search this volume of transcript data: older transcript files are offloaded to S3, and full-text search runs on a dedicated OpenSearch cluster, because keeping everything in the primary database had become unmanageable (see the sketch below).
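
  A minimal sketch of that storage split, assuming boto3 for S3 and the opensearch-py client; the bucket, host, index, and field names are placeholders, not the actual setup described in the episode:

    import json

    import boto3
    from opensearchpy import OpenSearch

    s3 = boto3.client("s3")
    search = OpenSearch(hosts=[{"host": "search.example.internal", "port": 9200}])

    BUCKET = "podcast-transcripts"   # placeholder bucket name
    INDEX = "transcripts"            # placeholder index name

    def archive_and_index(episode_id: str, transcript: dict) -> None:
        # Offload the full transcript (segments, timestamps, metadata) to S3
        # instead of keeping it as a large row in the primary database.
        s3.put_object(
            Bucket=BUCKET,
            Key=f"transcripts/{episode_id}.json",
            Body=json.dumps(transcript).encode("utf-8"),
        )
        # Index only what full-text search actually needs in the OpenSearch cluster.
        search.index(
            index=INDEX,
            id=episode_id,
            body={"episode_id": episode_id, "text": transcript["text"]},
        )

    def search_transcripts(query: str) -> list[str]:
        # Search hits OpenSearch; matching transcript files are fetched from S3 on demand.
        result = search.search(index=INDEX, body={"query": {"match": {"text": query}}})
        return [hit["_id"] for hit in result["hits"]["hits"]]

  Keeping only the searchable text in the cluster keeps OpenSearch small and fast, while the bulky segment-level data sits cheaply in S3.
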
Quality Control and Cost Savings (02:14:39)
  • Key Takeaway: Self-managed transcription infrastructure can reduce costs from tens of thousands of dollars daily to a few thousand monthly, but quality control and name recognition remain ongoing challenges.
  • Summary: The conversation closes on the substantial cost savings of self-managed infrastructure over commercial transcription APIs, while acknowledging the persistent problems of poor audio quality and accurately transcribing names and brands (a rough comparison follows below).
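
  The scale of those savings is easiest to see with back-of-envelope arithmetic; all volumes and rates below are illustrative assumptions, not figures from the conversation:

    # Rough comparison of commercial transcription APIs vs. self-managed GPU servers.
    AUDIO_MINUTES_PER_DAY = 20_000 * 60   # daily audio volume (assumption)
    API_RATE_PER_MINUTE = 0.024           # per-minute API pricing (assumption)
    SELF_HOSTED_SERVERS = 33              # flat-rate GPU servers (assumption)
    SERVER_MONTHLY_RATE = 200.0           # monthly rent per server (assumption)

    api_daily = AUDIO_MINUTES_PER_DAY * API_RATE_PER_MINUTE
    self_hosted_monthly = SELF_HOSTED_SERVERS * SERVER_MONTHLY_RATE

    print(f"Commercial API:    ~${api_daily:,.0f}/day (~${api_daily * 30:,.0f}/month)")
    print(f"Self-managed GPUs: ~${self_hosted_monthly:,.0f}/month")
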