I've been reading through the documentation on the Celery GitHub page, but I haven't been able to figure out the appropriate task breakdown. That is, I'm trying to do some crawling and ingestion, and I'm wondering whether I should be pushing a dozen small tasks onto the queue every second, or pushing larger tasks (possibly with subtasks broken out, as the docs suggest) every minute or hour.
This sounds like a dumb question to my own ears, but I just don't have the familiarity to know the proper use case. I essentially want continuous crawling and ingestion, with the potential to spread the load across multiple servers one day.
(Presumably the ingestors would be populating local databases, with a query getting farmed out to each server+database, but I haven't figured that part out either... hmm, that sounds like a task I could put into the queue as well. Are these things really nails?)
I'd be grateful if anyone can point me to some examples or provide a bit of context.
There isn't a single good answer to this; it all depends on what storage you use, the work you need to do, etc. But in general I'd say you want tasks to be as granular as is sensible, so you can spread the work across many servers. The best thing you can do is try out, stress-test, and benchmark the different approaches. From your description I'm not even sure Celery is the right tool for the job, but you could join #celery on irc.freenode.net to get more information.