Looks interesting. But shouldn't a library like celery work outside the context of a web framework? I don't see a reason to call this a distributed task queue 'for Django' specifically, except for the dependencies on Django's ORM and settings definitions. Swapping out Django's ORM with SQLAlchemy (or DB-API) would make this project much more useful.
I might not want to swap out Django with CGI, but I might want to use RabbitMQ-based distributed tasks in Pylons. Seems like a waste to write all that code from scratch. For Django users (and I am one), Celery is clearly useful. But a distributed task processing library, even if it is a thin wrapper over RabbitMQ, should not depend on a web-framework.
Yes, you're right, it could certainly be useful. Separating it from Django just hasn't been the focus so far; tight deadlines and all that, you know. But it is definitely something I want to do, and you're more than welcome to help.
Meanwhile, I think there's nothing wrong with people distributing tools which integrate queuing solutions into specific libraries/frameworks; such things can often be quite useful and end up offering more natural interfaces for the task at hand.
I was simply replying to the seeming assertion that it's wrong to develop a queueing solution which integrates with Django. Writing queueing solutions in Python is easy, and integrating with popular tools should be fine.
Carrot (by the same author) might be more what you want: http://github.com/ask/carrot/ It started off as Django-specific and was later refactored not to be.
Having just hacked together an ugly threaded task queue for scraping and multi-stage data processing in Django, this looks like a breath of fresh air. I need to work my way out of the self-inflicted mess I've created.
Does anyone have experience with this library or anything similar?
Yeah, I also have experience hacking together a multi-threaded task queue with ugly results. Try getting messages from multiple daemon threads back to the web client (which is spawned off early in the process, only to be reunited later), and you'll see how much of a bear this is.
It's not too hard to handle just one of the use cases that celery covers, but all of them? I'm installing celery this weekend and will see how it goes. Maybe if it goes well, I can write a before-and-after blog post and submit it to HN.
beanstalkd (http://xph.us/software/beanstalkd/) also has similarities to this, and for non-Django / simpler needs, it may be better. It's basically memcached repurposed into a queue server.
A "task" would be equivalent to a script which only looks for jobs in a certain bucket (or "tube" as they're called). You can run as many clients on as many machines as you like. Obviously, since it is memory-based, you'll lose the queue in the event of a system crash.
That being said, as a rabid Django user, this is definitely going into my bookmarks!
The similarities are really more between Beanstalkd and RabbitMQ, the backend message-queuing system used by Celery. Most of the interesting bits of Celery, to me, are in the result handling, which lets you get the result of a deferred job after the fact without the usual busywork of correlating message IDs and managing result queues.
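For anyone who hasn't seen it, that convenience looks roughly like this (a sketch only; the task-definition API has varied between Celery versions, and the broker/backend URLs and the task name are placeholders):

    from celery import Celery

    app = Celery('tasks', broker='amqp://guest@localhost//', backend='rpc://')

    @app.task
    def add(x, y):
        return x + y

    # Fire off the job and keep only the lightweight handle.
    result = add.delay(2, 2)

    # ...later, possibly in another request, ask for the outcome.
    print(result.ready())           # has a worker finished it yet?
    print(result.get(timeout=10))   # block briefly for the answer: 4

No message-ID bookkeeping on your side; the result backend does the correlating for you.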
That being said, sometimes a more bare-bones solution like talking directly to Beanstalkd or an AMQP service makes sense. This can be especially true if you're dealing with a mixed-language environment where jobs may not share a class library or easy RPC.
I would suggest looking at redis (http://code.google.com/p/redis/). In my own experience, redis is really fast, and you have persistent storage of the tasks.
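A bare-bones redis queue is just a list plus a blocking pop, e.g. with redis-py (the queue name and payload format here are made up, and durability still depends on how you configure Redis snapshotting/AOF):

    import json
    import redis

    r = redis.Redis(host='localhost', port=6379)

    # Producer: push a task description onto a list acting as the queue.
    r.lpush('tasks', json.dumps({'action': 'import_feed',
                                 'url': 'http://example.com/rss'}))

    # Worker: block until a task arrives, then pop it off the other end.
    _queue, payload = r.brpop('tasks')
    task = json.loads(payload)
    print('got task: %r' % task)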
I've been reading through the documentation on the celery github page, but I haven't been able to figure out the appropriate task breakdown. That is, I'm trying to do some crawling and ingestion, and I'm wondering whether I should be pushing a dozen small tasks onto the queue every second, or pushing larger tasks (possibly with subtasks broken out, like the docs suggest) every minute or hour.
This sounds like a dumb question to my own ears, but I just don't have the familiarity to know the proper use case. I essentially want continuous crawling and ingestion with the potential to spread the load across multiple servers one day.
(Presumably the ingestors would be populating local databases, with queries getting farmed out to each server+database, but I haven't figured that part out either... hmm, sounds like a task I could put into the queue as well. Are these things really nails?)
I'd be grateful if anyone can point me to some examples or provide a bit of context.
There isn't a single good answer to this; it depends on what storage you use, the work you need to do, etc. But in general I'd think you want the tasks to be as granular as is sensible, so you can spread the work across many servers. The best thing you can do is to try out, stress-test and benchmark the different approaches. From your description, I'm not even sure celery is the right tool for the job, but you could join irc.freenode.net #celery to get more information.
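As a rough sketch of what "granular" could mean for crawling with Celery (everything here is hypothetical: the broker URL, the task names, and the placeholder body where real fetching/ingestion would go):

    from celery import Celery

    app = Celery('crawler', broker='amqp://guest@localhost//')

    @app.task
    def fetch_page(url):
        # Fine-grained unit of work: one page. Any worker on any machine
        # can pick this up, which is what lets you spread the load later.
        # (Real fetching and ingestion would go here.)
        return 'fetched %s' % url

    @app.task
    def crawl_site(start_urls):
        # Coarser task that just fans the small ones out onto the queue.
        for url in start_urls:
            fetch_page.delay(url)

    # Kicked off from anywhere, e.g. a cron job or a view:
    # crawl_site.delay(['http://example.com/a', 'http://example.com/b'])

Whether per-page tasks end up too chatty for your broker is exactly the kind of thing the benchmarking should tell you.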
I have often wondered why not use a MySQL table as a "queue" (or more tables if needed). Basically, you get great performance (MySQL is really fast), you get great language support (a LOT of languages can add tasks via simple SQL) and you get things like easy backups and replication.
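The pattern I have in mind is roughly a status column plus an atomic claim, something like this sketch with MySQLdb (the table layout, column names and connection details are all made up; a real setup would add indexes, retries and timeouts):

    import uuid
    import MySQLdb

    conn = MySQLdb.connect(host='localhost', user='app', passwd='secret', db='app')
    cur = conn.cursor()

    # One-time table setup, shown as a comment for context:
    #   CREATE TABLE task_queue (
    #       id INT AUTO_INCREMENT PRIMARY KEY,
    #       payload TEXT NOT NULL,
    #       status ENUM('pending','claimed','done') NOT NULL DEFAULT 'pending',
    #       worker CHAR(36) NULL
    #   ) ENGINE=InnoDB;

    # Producer: anything that speaks SQL can enqueue work.
    cur.execute("INSERT INTO task_queue (payload) VALUES (%s)",
                ("import_feed http://example.com/rss",))
    conn.commit()

    # Worker: claim one pending row with a unique token so two workers
    # never grab the same task, then read back what this worker claimed.
    token = str(uuid.uuid4())
    cur.execute("UPDATE task_queue SET status = 'claimed', worker = %s "
                "WHERE status = 'pending' ORDER BY id LIMIT 1", (token,))
    conn.commit()
    cur.execute("SELECT id, payload FROM task_queue "
                "WHERE worker = %s AND status = 'claimed'", (token,))
    for task_id, payload in cur.fetchall():
        # ... do the actual work with `payload` here ...
        cur.execute("UPDATE task_queue SET status = 'done' WHERE id = %s",
                    (task_id,))
        conn.commit()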
According to Mike Malone the official name for a queue built on top of a MySQL table is a "ghetto queue". I believe Flickr still do all of their queuing in that way.
See pp (http://www.parallelpython.com/) for something similar, without the django dependency. More parallel processing goodies at http://wiki.python.org/moin/ParallelProcessing.