Ask HN: MP3 Crawler

jm4 · on July 14, 2009

They probably use a better crawler than the one you've put together. Reliable crawling is not the easiest problem to solve. There are a lot of crappy sites out there. When you're Google you can tell them to screw off. When you're small and you need to crawl the content you have to figure out a way to make things work.

To accurately collect links you've got to be able to follow redirects (this is really a no-brainer), interpret JavaScript, handle DOM events, support AJAX, possibly parse Flash files for content or links, etc. There are still plenty of sites out there that use Flash for navigation and don't provide a fallback. I recently saw a site that used the window.onload event to call a function that wrote out the HTML for the entire page using document.write.
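To illustrate the document.write case, here is a minimal sketch in Python of a static link extractor built on the standard library's html.parser. The names LinkCollector, extract_links, STATIC_PAGE and SCRIPTED_PAGE, and the example.com URLs, are made up for this example. It is not a real crawler: it never runs JavaScript, which is exactly why it finds nothing on the scripted page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags in static HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical pages: the same link, once in markup and once written by script.
STATIC_PAGE = '<html><body><a href="/music/song.mp3">song</a></body></html>'
SCRIPTED_PAGE = """<html><head><script>
window.onload = function () {
  document.write('<a href="/music/song.mp3">song</a>');
};
</script></head><body></body></html>"""

if __name__ == "__main__":
    print(extract_links(STATIC_PAGE, "http://example.com/"))
    print(extract_links(SCRIPTED_PAGE, "http://example.com/"))
```

The parser treats everything inside the script element as plain text, so the link in the second page only exists once a browser (or something that behaves like one) has executed the script.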

Depending on what your needs are you could end up with anything from a small script to a full-fledged browser. You could either develop something yourself, use an open source crawler, or script Mozilla or IE. With a couple of Perl modules you could have your own headless Mozilla.

Once you have a good crawler it's still going to be tricky to use. There are all sorts of spider traps out there: circular navigation, unique URLs that produce duplicate content, etc. Sometimes it's deliberate; most of the time it's not. People just don't usually design sites with web crawlers in mind. It may take a little prodding (site-specific configuration) to make it work.
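One common defense against those two traps can be sketched roughly like this, in Python: canonicalize URLs before remembering them, and hash page bodies so that different URLs serving the same content are only processed once. The IGNORED_PARAMS set is a guess at typical session and tracking parameters, and canonicalize and SeenFilter are names invented for this sketch. Real sites usually need their own rules, which is the site-specific configuration mentioned above.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters that change the URL but not the content.
IGNORED_PARAMS = {"sid", "sessionid", "utm_source"}


def canonicalize(url):
    """Lowercase scheme and host, drop the fragment and ignored
    parameters, and sort what is left of the query string."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # note: this also lowercases any user:password part
        parts.path or "/",
        urlencode(query),
        "",  # fragments never reach the server
    ))


class SeenFilter:
    """Remembers canonical URLs and body hashes seen so far."""

    def __init__(self):
        self.urls = set()
        self.bodies = set()

    def is_new_url(self, url):
        key = canonicalize(url)
        if key in self.urls:
            return False
        self.urls.add(key)
        return True

    def is_new_body(self, body):
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.bodies:
            return False
        self.bodies.add(digest)
        return True
```

An exact hash only catches byte-identical duplicates. Pages that differ by a timestamp or a rotating ad would slip through and need some fuzzier comparison.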

westside1506 · on July 14, 2009

Our service, 80legs, will let you easily do this. We let you specify seed links, how deep you want to crawl, and control many other aspects of the crawl. By default, we control the hard bits, like redirects and spider traps, but if you want to override our default functionality you can easily insert your own code to do it.
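For anyone unfamiliar with the terms, seed links plus a depth limit amount to a bounded breadth-first traversal. The Python sketch below is a generic illustration only, not 80legs code or its API. The crawl function, the fetch_links callback and the SITE dictionary are all hypothetical, and the "site" is an in-memory link graph so the example runs without a network.

```python
from collections import deque


def crawl(seeds, fetch_links, max_depth):
    """Visit pages breadth-first, starting from the seeds, and stop
    following links once max_depth hops from a seed is reached."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append((seed, 0))

    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # deep enough: record the page but do not expand it
        for link in fetch_links(url):
            if link not in seen:  # also breaks circular navigation
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for a real site.
SITE = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["e"], "e": []}

if __name__ == "__main__":
    print(crawl(["a"], lambda url: SITE.get(url, []), max_depth=2))
```

With a depth of 2, page "e" is never reached because it sits three hops from the seed.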

Our default functionality will let you identify mp3 files by regex or keyword, but if you need something more sophisticated you can override that too. I'm pretty sure, based on what you've said, that you could simply put in a few parameters and start running some jobs within a few minutes of getting started with 80legs that will do exactly what you want. If not, adding custom code to 80legs is pretty simple too.
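As a rough idea of what matching mp3 files by regex could look like in general (this is a plain Python sketch, not the 80legs implementation), the check below tests only the path part of the URL so that query strings do not get in the way. MP3_PATH and looks_like_mp3 are names made up for the example.

```python
import re
from urllib.parse import urlsplit

# Matches paths that end in ".mp3", in any letter case.
MP3_PATH = re.compile(r"\.mp3$", re.IGNORECASE)


def looks_like_mp3(url):
    """Return True if the URL path ends in .mp3."""
    return bool(MP3_PATH.search(urlsplit(url).path))


if __name__ == "__main__":
    for url in (
        "http://example.com/a/Track.MP3?dl=1",
        "http://example.com/mp3.html",
    ):
        print(url, looks_like_mp3(url))
```

A URL pattern is only a hint. A download script such as /get?id=42 can serve an mp3 without the extension, so checking the Content-Type header of the response (audio/mpeg for MP3) is a useful second test.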

Just send us your contact info on our website (http://www.80legs.com) and mention HN and I'll make sure you get a beta invite. BTW - we're still in private beta and the service is still free for right now.

anfractuosity · on July 14, 2009

You may be interested in Nutch (http://lucene.apache.org/nutch/), an open source crawler, which handles indexing etc. for you. It's also based on Hadoop, so it should scale nicely just by throwing more machines at the job.

ScottWhigham · on July 14, 2009

How is this different from doing a Google search for filetype:mp3? I don't remember...