I have been looking into a good way to implement this. I am working on a simple website crawler that will go through a specific set of websites and collect all the mp3 links into a database.
I don't want to download the files, just collect the links, index them and be able to search them. So far I have been successful with some of the sites, but others use URL redirects and similar tricks that confuse the crawler.
Any ideas? How does beemp3.com index all these links?
Thanks
To collect links accurately you have to be able to follow redirects (this one is a no-brainer), interpret JavaScript, handle DOM events, support AJAX, possibly parse Flash files for content or links, etc. There are still plenty of sites out there that use Flash for navigation and don't provide a fallback. I recently saw a site that used the window.onload event to call a function that wrote out the HTML for the entire page using document.write.
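For the simplest case in that list, plain HTTP redirects plus ordinary anchor tags, a minimal sketch using only the Python standard library might look like the following. The names `Mp3LinkParser`, `extract_mp3_links` and `fetch`, and the `example-crawler/0.1` user agent, are made up for illustration. It does not execute JavaScript, so it would miss links produced by document.write, AJAX or Flash, and it does not follow meta-refresh redirects.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class Mp3LinkParser(HTMLParser):
    """Collects absolute URLs of <a href> targets whose path ends in .mp3."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on.
        absolute = urljoin(self.base_url, href)
        # Check only the path, so "song.mp3?id=1" still matches.
        if urlparse(absolute).path.lower().endswith(".mp3"):
            self.links.append(absolute)


def extract_mp3_links(html, base_url):
    """Return all mp3 links found in an HTML string."""
    parser = Mp3LinkParser(base_url)
    parser.feed(html)
    return parser.links


def fetch(url, timeout=10):
    """Fetch a page, following HTTP redirects; return (final_url, html)."""
    # The User-Agent string is a placeholder, not a real crawler name.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        final_url = response.geturl()  # URL after any HTTP redirects
        charset = response.headers.get_content_charset() or "utf-8"
        return final_url, response.read().decode(charset, errors="replace")
```

Passing the final URL returned by `fetch` as `base_url` to `extract_mp3_links` keeps relative links correct when the page was reached through a redirect.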
Depending on what your needs are, you could end up with anything from a small script to a full-fledged browser. You could either develop something yourself, use an open source crawler, or script Mozilla or IE. With a couple of Perl modules you could have your own headless Mozilla.
Once you have a good crawler it's still going to be tricky to use. There are all sorts of spider traps out there: circular navigation, unique URLs that produce duplicate content, etc. Sometimes it's deliberate; most of the time it's not. People just don't usually design sites with web crawlers in mind. It may take a little prodding (site-specific configuration) to make it work.
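One common way to guard against circular navigation and duplicate URLs is to canonicalize each URL before checking it against a set of already-seen pages, and to cap crawl depth. The sketch below assumes that approach. The `IGNORED_PARAMS` list is a hypothetical example of the site-specific configuration mentioned above, and `get_links` stands in for whatever function fetches a page and returns its links.

```python
from collections import deque
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical per-site setting: query parameters that do not change content.
IGNORED_PARAMS = {"sid", "sessionid", "phpsessid"}


def canonicalize(url):
    """Reduce equivalent URLs to one form so duplicates can be detected."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    path = parts.path or "/"
    # Lowercase scheme and host, sort the query, and drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def crawl(start_url, get_links, max_depth=3, max_pages=1000):
    """Breadth-first crawl that visits each canonical URL at most once."""
    seen = {canonicalize(start_url)}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url):
            key = canonicalize(link)
            if key not in seen:
                seen.add(key)
                queue.append((link, depth + 1))
    return visited
```

The depth and page limits are a blunt safety net for traps that generate endlessly new URLs (calendars, for instance), which canonicalization alone would not catch.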