Hacker News
Ask HN: MP3 Crawler
6 points by obaid on July 14, 2009 | 4 comments
I have been looking for a good way to implement this. I am working on a simple crawler that visits a specific set of websites and stores all the mp3 links it finds in a database.

I don't want to download the files, just collect the links, index them, and be able to search them. So far I have been successful with some of the sites, but others use URL redirects and similar tricks that confuse the crawler.

Any ideas? How does beemp3.com index all these links?

thanks



They probably use a better crawler than the one you've put together. Reliable crawling is not the easiest problem to solve. There are a lot of crappy sites out there. When you're Google you can tell them to screw off. When you're small and you need to crawl the content you have to figure out a way to make things work.

To accurately collect links you've got to be able to follow redirects (this is really a no-brainer), interpret JavaScript, handle DOM events, support AJAX, possibly parse Flash files for content or links, etc. There are still plenty of sites out there that use Flash for navigation and don't provide a fallback. I recently saw a site that used the window.onload event to call a function that wrote out the HTML for the entire page using document.write.
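For the redirect part alone, here is a minimal Python sketch of following HTTP redirects with HEAD requests, so the file itself is never downloaded. The names `head` and `resolve`, the hop limit of 10, and the example URLs are assumptions made for illustration. It does not cover JavaScript or meta-refresh redirects, and some servers reject HEAD, so treat it as a starting point rather than a complete solution.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes that signal an HTTP-level redirect.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def head(url, timeout=10):
    """Send one HEAD request and return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def resolve(url, fetch=head, max_hops=10):
    """Follow redirects and return the final URL.

    Returns None on a redirect loop or when max_hops is exceeded.
    `fetch` is injectable so the logic can be exercised without a network.
    """
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            return None  # redirect loop
        seen.add(url)
        status, location = fetch(url)
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return url
    return None  # too many hops
```

Calling `resolve("http://example.com/download?id=42")` (a made-up URL) would return the address the chain finally lands on, which is the one worth storing in the index.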

Depending on what your needs are you could end up with anything from a small script to a full-fledged browser. You could develop something yourself, use an open source crawler, or script Mozilla or IE. With a couple of Perl modules you could have your own headless Mozilla.

Once you have a good crawler it's still going to be tricky to use. There are all sorts of spider traps out there: circular navigation, unique URLs that produce duplicate content, etc. Sometimes it's deliberate; most of the time it's not. People just don't usually design sites with web crawlers in mind. It may take a little prodding (site-specific configuration) to make it work.
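One common defense against circular navigation and duplicate URLs is to canonicalize every URL before queueing it, remember what has been seen, and cap the crawl depth. The sketch below is one way that might look in Python. `IGNORED_PARAMS`, `canonicalize`, and `Frontier` are hypothetical names, and the list of ignored query parameters is a guess that would need tuning per site.

```python
from collections import deque
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed not to change the page content (site-specific guess).
IGNORED_PARAMS = {"sid", "sessionid", "phpsessid", "utm_source", "utm_medium"}


def canonicalize(url):
    """Reduce trivially different URLs to one form.

    Lowercases scheme and host, drops the fragment, removes ignored
    parameters, and sorts the remaining query parameters.
    """
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(query),
        "",  # fragment is never sent to the server
    ))


class Frontier:
    """Queue of URLs to visit, with duplicate and depth checks."""

    def __init__(self, max_depth=5):
        self.max_depth = max_depth
        self.seen = set()
        self.queue = deque()

    def add(self, url, depth=0):
        """Queue the URL unless it is too deep or already seen."""
        key = canonicalize(url)
        if depth > self.max_depth or key in self.seen:
            return False
        self.seen.add(key)
        self.queue.append((key, depth))
        return True
```

This only catches URLs that differ superficially. Pages that serve the same content under genuinely different paths would need something extra, such as hashing the page body.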


Our service, 80legs, will let you easily do this. We let you specify seed links, how deep you want to crawl, and control many other aspects of the crawl. By default, we control the hard bits, like redirects and spider traps, but if you want to override our default functionality you can easily insert your own code to do it.

Our default functionality will let you identify mp3 files by regex or keyword, but if you need something more sophisticated you can override that too. I'm pretty sure, based on what you've said, that you could simply put in a few parameters and start running some jobs within a few minutes of getting started with 80legs that will do exactly what you want. If not, adding custom code to 80legs is pretty simple too.
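Independent of any particular service, identifying mp3 links by regex can be sketched in a few lines of Python. This is an illustration only, not how 80legs works internally. `LinkCollector`, `mp3_links`, and the sample URLs are made up for the example, and it only looks at `href` attributes of `<a>` tags in static HTML.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Match on the URL path so query strings such as "?dl=1" do not hide the extension.
MP3_PATH = re.compile(r"\.mp3$", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def mp3_links(html, base_url):
    """Return the links in `html` whose path ends in .mp3, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url in collector.links if MP3_PATH.search(urlsplit(url).path)]
```

Links that hide the file behind a redirect or a download script will not match on extension alone, so in practice the final URL or the Content-Type header may also need checking.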

Just send us your contact info on our website (http://www.80legs.com) and mention HN and I'll make sure you get a beta invite. BTW - we're still in private beta and the service is still free for right now.


You may be interested in Nutch (http://lucene.apache.org/nutch/), an open source crawler, which handles indexing etc. for you. It's also based on Hadoop, so it should scale nicely just by throwing more machines at the job.


How is this different from doing a google search for filetype:mp3? I don't remember...



