Queued Storage Backend for Django

Published on 4.01.2010 at 18:53

[UPDATE] I added this code to GitHub at http://github.com/seanbrant/django-queued-storage if anyone is interested.

Say you have a web application that lets users upload large images and then serves them back, something like Flickr. Now let’s say you want to use Amazon S3 as your image server. If you upload the image to S3 in the same request the user made to upload it to your site, uploads can start to feel slow. The image first has to be uploaded to your server’s filesystem, and then it has to be sent to S3 before the request can finish. Depending on the file size, this can make for a poor user experience.

So in trying to solve this I did what any good developer would do: I googled the best way to solve the problem. I came across several approaches, none of which seemed that elegant. One suggested adding two fields to your model plus a flag that tells you which field to use. Yuck. I went this route at first, and the boilerplate and messiness were not worth it.

What I really wanted was one interface to the file, just like a normal storage backend. I wanted to keep the logic for switching storages in one place as much as possible, and I wanted it to be no harder to use than a normal storage backend.

Something like this.

image = ImageField(storage=QueuedRemoteStorage(local='django.core.files.storage.FileSystemStorage',
                   remote='backends.s3boto.S3BotoStorage'), upload_to='uploads')

I came up with what I am calling QueuedRemoteStorage, for lack of a better name. It is basically a proxy for a local and a remote storage backend that decides which backend to use depending on what state the file is in. State is maintained using Django’s caching framework, and the queue service is Celery.

The only downside is that this requires an app to hold the Celery tasks.py file. So create a new app or add a tasks.py file to an app you already have. I won’t go into how to use Celery; have a look at their documentation.

In tasks.py you need to define a subclass of Task.

from django.core.cache import cache
from django.core.files.storage import get_storage_class

from celery.registry import tasks
from celery.task import Task

class SaveToRemoteTask(Task):
    def run(self, name, local, remote, cache_key):
        local_storage = get_storage_class(local)()
        remote_storage = get_storage_class(remote)()
        remote_storage.save(name, local_storage.open(name))
        cache.set(cache_key, True)
        return True

tasks.register(SaveToRemoteTask)

This code defines a Task that Celery will load automatically on startup, because the file is called tasks.py and sits in the root of the app. The task uses get_storage_class, a function Django provides that takes the dotted path of a storage backend as a string and returns the storage class. It then opens the file from the local storage, saves it to the remote storage, and sets the key in the cache to True (meaning the file is on the remote server).
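To look at those steps in isolation, here is a minimal sketch in plain Python with no Celery or Django involved. InMemoryStorage, save_to_remote, and the dict standing in for the cache are hypothetical names made up for this illustration; they only mimic the open/save calls and the cache flag. Unlike the real task, which receives dotted path strings and instantiates the backends itself, the sketch is handed storage instances directly.

```python
import io


class InMemoryStorage:
    """Hypothetical stand-in for a Django storage backend."""

    def __init__(self):
        self.files = {}

    def save(self, name, content):
        # Store the bytes read from a file-like object and return the name used.
        self.files[name] = content.read()
        return name

    def open(self, name):
        # Return a file-like object, as a real backend's open() would.
        return io.BytesIO(self.files[name])

    def exists(self, name):
        return name in self.files


def save_to_remote(name, local_storage, remote_storage, cache, cache_key):
    """Same steps as the task's run(): copy local to remote, then flag the cache."""
    remote_storage.save(name, local_storage.open(name))
    cache[cache_key] = True
    return True


local, remote, cache = InMemoryStorage(), InMemoryStorage(), {}
key = "queued_remote_storage_uploads/cat.jpg"

local.save("uploads/cat.jpg", io.BytesIO(b"image bytes"))
cache[key] = False  # the storage backend's save() sets this before queuing

result = save_to_remote("uploads/cat.jpg", local, remote, cache, key)
```

Once the function returns, the remote stand-in holds a copy of the file and the cache flag has flipped from False to True, which is the signal the storage backend relies on.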

This will hopefully make more sense when you see the storage backend.

import urllib

from django.core.cache import cache
from django.core.files.storage import get_storage_class, Storage

from yourapp.tasks import SaveToRemoteTask

QUEUED_REMOTE_STORAGE_CACHE_KEY_PREFIX = 'queued_remote_storage_'

class QueuedRemoteStorage(Storage):
    def __init__(self, local, remote, cache_prefix=QUEUED_REMOTE_STORAGE_CACHE_KEY_PREFIX):
        self.local_class = local
        self.local = get_storage_class(self.local_class)()
        self.remote_class = remote
        self.remote = get_storage_class(self.remote_class)()
        self.cache_prefix = cache_prefix

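    def get_storage(self, name):
        cache_result = cache.get(self.get_cache_key(name))
        if cache_result:
            # The task has finished copying the file to the remote backend.
            return self.remote
        elif cache_result is None:
            # Nothing cached for this file; ask the remote backend directly.
            if self.remote.exists(name):
                cache.set(self.get_cache_key(name), True)
                return self.remote
        # Transfer still pending, or the file is not on the remote backend.
        return self.local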

    def get_cache_key(self, name):
        return '%s%s' % (self.cache_prefix, urllib.quote(name))

    def using_local(self, name):
        return self.get_storage(name) is self.local

    def using_remote(self, name):
        return self.get_storage(name) is self.remote

    def open(self, name, **kwargs):
        return self.local.open(name, **kwargs)

    def save(self, name, content):
        cache.set(self.get_cache_key(name), False)
        name = self.local.save(name, content)
        SaveToRemoteTask.delay(name, self.local_class, self.remote_class, self.get_cache_key(name))
        return name

    def get_valid_name(self, name):
        return self.get_storage(name).get_valid_name(name)

    def get_available_name(self, name):
        return self.get_storage(name).get_available_name(name)

    def path(self, name):
        return self.get_storage(name).path(name)

    def delete(self, name):
        return self.get_storage(name).delete(name)

    def exists(self, name):
        return self.get_storage(name).exists(name)

    def listdir(self, name):
        return self.get_storage(name).listdir(name)

    def size(self, name):
        return self.get_storage(name).size(name)

    def url(self, name):
        return self.get_storage(name).url(name)

Most of this code just proxies the actual storage methods to whichever backend get_storage picks. The heart and soul of the class is in get_storage and save. get_storage looks up the file’s key in the cache. If the value is True, it can assume the file is on the remote server and returns the remote storage instance. If cache_result is None (the key is not in the cache), it checks the remote backend for the file; if the file is there, it updates the cache and returns the remote backend. If all else fails, it returns the local backend. The save method is responsible for queuing up the remote transfer. It first sets the cache key to False, then saves the file locally. Next it sends a job to the queue and returns the name of the file.
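The three branches of get_storage are easier to follow when traced by hand. The sketch below is an illustration only: pick_storage, the plain dict used as a cache, and the remote_names set are hypothetical stand-ins, not part of the class above. It is written for Python 3, where urllib.quote lives at urllib.parse.quote.

```python
from urllib.parse import quote  # Python 3 location of urllib.quote

CACHE_PREFIX = "queued_remote_storage_"


def get_cache_key(name, prefix=CACHE_PREFIX):
    # Quote the name so spaces and other odd characters are safe in a cache key.
    return "%s%s" % (prefix, quote(name))


def pick_storage(name, cache, remote_names):
    """Return "remote" or "local" using the same branches as get_storage.

    cache is a plain dict; remote_names is the set of names the remote holds.
    """
    key = get_cache_key(name)
    cached = cache.get(key)
    if cached:
        return "remote"      # flag is True: the transfer finished
    if cached is None and name in remote_names:
        cache[key] = True    # cold cache, but the file is already remote
        return "remote"
    return "local"           # flag is False, or the file is not remote yet


cache = {}
remote_names = set()
name = "uploads/my photo.jpg"

cache[get_cache_key(name)] = False   # what save() does first
during_transfer = pick_storage(name, cache, remote_names)

remote_names.add(name)               # the task finishes the copy...
cache[get_cache_key(name)] = True    # ...and flips the flag
after_transfer = pick_storage(name, cache, remote_names)

cache.clear()                        # simulate the cache entry being evicted
after_eviction = pick_storage(name, cache, remote_names)
```

The last step is the reason for the "is None" branch: if the cache entry disappears, the backend can still find its way to the remote copy and repopulates the flag along the way.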

Hopefully this is pretty straightforward. At this point it is nothing more than a proof of concept that has not been tested in a production setting. I hope it will at least give others some ideas.

1 Comment

Looks pretty awesome! Will definitely be looking in to messing around with this. Thanks!

Derek Reynolds 13.01.2010


About Steps and Numbers

Steps and Numbers is the personal blog of web developer/designer Sean Brant. I enjoy coding and designing wonderful user experiences for the web. I spend my spare time hanging out in the beautiful city of Chicago, coding random projects, most of which never see the light of day, and attempting to resurrect my failed career as a rock musician. You can contact me if you want and I’ll try to respond.
