Looks interesting. But shouldn't a library like celery work outside the context of a web framework? I don't see a reason to call this a distributed task queue 'for Django' specifically, except for the dependencies on Django's ORM and settings definitions. Swapping out Django's ORM with SQLAlchemy (or DB-API) would make this project much more useful.
I might not want to swap out Django with CGI, but I might want to use RabbitMQ-based distributed tasks in Pylons. Seems like a waste to write all that code from scratch. For Django users (and I am one), Celery is clearly useful. But a distributed task processing library, even if it is a thin wrapper over RabbitMQ, should not depend on a web-framework.
Yes, you're right, it could certainly be useful. Separating it from Django just hasn't been the focus so far; tight deadlines and all that, you know. But it is definitely something I want to do, and you're more than welcome to help.
Meanwhile, I think there's nothing wrong with people distributing tools which integrate queuing solutions into specific libraries/frameworks; such things can often be quite useful and end up offering more natural interfaces for the task at hand.
I was simply replying to the seeming assertion that it's wrong to develop a queueing solution which integrates with Django. Writing queueing solutions in Python is easy, and integrating with popular tools should be fine.
Carrot (by the same author) might be more what you want: http://github.com/ask/carrot/ It started off as Django-specific and was later refactored not to be.
Having just hacked together an ugly threaded task queue for scraping and multi-stage data processing in Django, this looks like a breath of fresh air. I need to work my way out of the self-inflicted mess I've created.
Does anyone have experience with this library or anything similar?
Yeah, I also have experience hacking together a multi-threaded task queue with ugly results. Try getting messages from multiple daemon threads back to the web client (which is spawned off early in the process, only to be reunited later), and you'll see how much of a bear this is.
It's not too hard to handle just one of the use cases that celery covers, but all of them? I'm installing celery this weekend and will see how it goes. Maybe if it goes well, I can write a before-and-after blog post and submit it to HN.
beanstalkd (http://xph.us/software/beanstalkd/) also has similarities to this, and for non-Django / simpler needs, it may be better. It's basically memcached repurposed into a queue server.
A "task" would be equivalent to a script which only looks for jobs in a certain bucket (or "tube" as they're called). You can run as many clients on as many machines as you like. Obviously, since it is memory-based, you'll lose the queue in the event of a system crash.
That being said, as a rabid Django user, this is definitely going into my bookmarks!
The similarities are really more between Beanstalkd and RabbitMQ, the backend message-queuing system used by Celery. Most of the interesting bits of Celery, to me, are in the result handling, which lets you get the result of a deferred job after the fact without the usual busywork of correlating message IDs and managing result queues.
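For anyone who hasn't seen it, that convenience looks roughly like this (a sketch only; the task-definition API has varied between Celery versions, and the broker/backend URLs and the task name are placeholders):

    from celery import Celery

    app = Celery('tasks', broker='amqp://guest@localhost//', backend='rpc://')

    @app.task
    def add(x, y):
        return x + y

    # Fire off the job and keep only the lightweight handle.
    result = add.delay(2, 2)

    # ...later, possibly in another request, ask for the outcome.
    print(result.ready())           # has a worker finished it yet?
    print(result.get(timeout=10))   # block briefly for the answer: 4

No message-ID bookkeeping on your side; the result backend does the correlating for you.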
That being said, sometimes a more bare-bones solution like talking directly to Beanstalkd or an AMQP service makes sense. This can be especially true if you're dealing with a mixed-language environment where jobs may not share a class library or easy RPC.
I would suggest looking at redis (http://code.google.com/p/redis/). In my own experience, redis is really fast, and you have persistent storage of the tasks.
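A bare-bones redis queue is just a list plus a blocking pop, e.g. with redis-py (the queue name and payload format here are made up, and durability still depends on how you configure Redis snapshotting/AOF):

    import json
    import redis

    r = redis.Redis(host='localhost', port=6379)

    # Producer: push a task description onto a list acting as the queue.
    r.lpush('tasks', json.dumps({'action': 'import_feed',
                                 'url': 'http://example.com/rss'}))

    # Worker: block until a task arrives, then pop it off the other end.
    _queue, payload = r.brpop('tasks')
    task = json.loads(payload)
    print('got task: %r' % task)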
I've been reading through the documentation on the celery github page, but I haven't been able to figure out the appropriate task breakdown. That is, I'm trying to do some crawling and ingestion, and I'm wondering whether I should be pushing a dozen small tasks onto the queue every second, or pushing larger tasks (possibly with subtasks broken out, like the docs suggest) every minute or hour.
This sounds like a dumb question to my own ears, but I just don't have the familiarity to know the proper use case. I essentially want continuous crawling and ingestion with the potential to spread the load across multiple servers one day.
(Presumably the ingestors would be populating local databases, with queries getting farmed out to each server+database, but I haven't figured that part out either... hmm, sounds like a task I could put into the queue as well. Are these things really nails?)
I'd be grateful if anyone can point me to some examples or provide a bit of context.
There isn't a single good answer to this; it depends on what storage you use, the work you need to do, etc. But in general I'd think you want the tasks to be as granular as is sensible, so you can spread the work across many servers. The best thing you can do is to try out, stress-test and benchmark the different approaches. From your description, I'm not even sure celery is the right tool for the job, but you could join irc.freenode.net #celery to get more information.
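As a rough sketch of what "granular" could mean for crawling with Celery (everything here is hypothetical: the broker URL, the task names, and the placeholder body where real fetching/ingestion would go):

    from celery import Celery

    app = Celery('crawler', broker='amqp://guest@localhost//')

    @app.task
    def fetch_page(url):
        # Fine-grained unit of work: one page. Any worker on any machine
        # can pick this up, which is what lets you spread the load later.
        # (Real fetching and ingestion would go here.)
        return 'fetched %s' % url

    @app.task
    def crawl_site(start_urls):
        # Coarser task that just fans the small ones out onto the queue.
        for url in start_urls:
            fetch_page.delay(url)

    # Kicked off from anywhere, e.g. a cron job or a view:
    # crawl_site.delay(['http://example.com/a', 'http://example.com/b'])

Whether per-page tasks end up too chatty for your broker is exactly the kind of thing the benchmarking should tell you.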
I have often wondered why not use a MySQL table as a "queue" (or more tables if needed). Basically, you get great performance (MySQL is really fast), you get great language support (a LOT of languages can add tasks via simple SQL) and you get things like easy backups and replication.
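The pattern I have in mind is roughly a status column plus an atomic claim, something like this sketch with MySQLdb (the table layout, column names and connection details are all made up; a real setup would add indexes, retries and timeouts):

    import uuid
    import MySQLdb

    conn = MySQLdb.connect(host='localhost', user='app', passwd='secret', db='app')
    cur = conn.cursor()

    # One-time table setup, shown as a comment for context:
    #   CREATE TABLE task_queue (
    #       id INT AUTO_INCREMENT PRIMARY KEY,
    #       payload TEXT NOT NULL,
    #       status ENUM('pending','claimed','done') NOT NULL DEFAULT 'pending',
    #       worker CHAR(36) NULL
    #   ) ENGINE=InnoDB;

    # Producer: anything that speaks SQL can enqueue work.
    cur.execute("INSERT INTO task_queue (payload) VALUES (%s)",
                ("import_feed http://example.com/rss",))
    conn.commit()

    # Worker: claim one pending row with a unique token so two workers
    # never grab the same task, then read back what this worker claimed.
    token = str(uuid.uuid4())
    cur.execute("UPDATE task_queue SET status = 'claimed', worker = %s "
                "WHERE status = 'pending' ORDER BY id LIMIT 1", (token,))
    conn.commit()
    cur.execute("SELECT id, payload FROM task_queue "
                "WHERE worker = %s AND status = 'claimed'", (token,))
    for task_id, payload in cur.fetchall():
        # ... do the actual work with `payload` here ...
        cur.execute("UPDATE task_queue SET status = 'done' WHERE id = %s",
                    (task_id,))
        conn.commit()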
According to Mike Malone the official name for a queue built on top of a MySQL table is a "ghetto queue". I believe Flickr still do all of their queuing in that way.
See pp (http://www.parallelpython.com/) for something similar, without the django dependency. More parallel processing goodies at http://wiki.python.org/moin/ParallelProcessing.