Queued Storage Backend for Django
[UPDATE] I added this code to GitHub at http://github.com/seanbrant/django-queued-storage if anyone is interested.
Say you have a web application that allows users to store large images and then serves them back to the user. Think something like Flickr. Now let's say you want to use Amazon S3 as your image server. You might start running into slowness with uploads if you upload the image to S3 in the same request the user made to upload the image to your site. What’s happening is that the image first needs to get uploaded to your server's filesystem, and then it needs to get sent to S3. Depending on the file size, this could make for a poor user experience.
So in trying to solve this I did what any good developer would do: I googled the best way to solve this problem. I came across several approaches, none of which seemed that elegant. One suggested adding two fields to your model and a flag that would tell you which field to use, yuck. I first went this route, and the boilerplate and messiness were not worth it.
What I really wanted was one interface to the file, just like a normal storage backend. I wanted to keep the logic for switching storages in one place as much as possible, and I wanted it to be no harder to use than a normal storage backend.
Something like this.
image = ImageField(
    storage=QueuedRemoteStorage(local='django.core.files.storage.FileSystemStorage',
                                remote='backends.s3boto.S3BotoStorage'),
    upload_to='uploads')
I came up with what I am calling QueuedRemoteStorage, for lack of a better name. This is basically a proxy for a local and a remote storage backend that takes care of determining which backend to use depending on what state the file is in. State is maintained using Django’s caching framework, and the queue service is Celery.
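To make the idea concrete before getting to the real code, here is a toy sketch of the flow in plain Python. Everything in it is a stand-in I made up for illustration: the dicts play the role of the two storage backends and the cache, the deque plays the role of the Celery queue, and the function names are hypothetical. It is not the actual implementation, just the shape of it.

```python
from collections import deque

local_files = {}    # stands in for the local storage backend
remote_files = {}   # stands in for the remote storage backend (e.g. S3)
on_remote = {}      # stands in for the cache: name -> True/False
pending = deque()   # stands in for the Celery queue


def save(name, content):
    """Store locally right away and queue the slow remote transfer."""
    on_remote[name] = False
    local_files[name] = content
    pending.append(name)
    return name


def run_worker():
    """What a background worker would do, outside the request cycle."""
    while pending:
        name = pending.popleft()
        remote_files[name] = local_files[name]
        on_remote[name] = True


def backend_for(name):
    """Report which backend should serve the file right now."""
    return "remote" if on_remote.get(name) else "local"


save("uploads/cat.jpg", b"...image bytes...")
before = backend_for("uploads/cat.jpg")   # checked before the worker has run
run_worker()
after = backend_for("uploads/cat.jpg")    # checked after the transfer finished
print(before, after)
```

The point is that the user's request only pays for the local write. The remote copy happens later, and the flag flips once it is done.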
The only downside is that this requires an app to hold the Celery tasks.py file. So create a new app, or add a tasks.py file to an app you already have. I won’t go into how to use Celery; have a look at their documentation.
In tasks.py you need to define a subclass of Task.
from django.core.cache import cache
from django.core.files.storage import get_storage_class
from celery.registry import tasks
from celery.task import Task


class SaveToRemoteTask(Task):
    def run(self, name, local, remote, cache_key):
        local_storage = get_storage_class(local)()
        remote_storage = get_storage_class(remote)()
        remote_storage.save(name, local_storage.open(name))
        cache.set(cache_key, True)
        return True


tasks.register(SaveToRemoteTask)
What this code is doing is defining a Task that Celery will load automatically on start up, because you called the file tasks.py and put it in the root of the app. Next it uses get_storage_class, a function Django provides that takes the path of a storage backend as a string and returns the storage class, which we then instantiate. We open the file from the local storage, save it to the remote storage, and set the key in the cache to True (the file is on the remote server).
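If turning a string into a class sounds like magic, it is not much more than an import and a getattr. Here is a rough sketch of the idea without Django. load_class is a hypothetical helper I wrote for illustration, not Django's actual implementation, and collections.OrderedDict is only there so the example runs anywhere.

```python
from importlib import import_module


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object itself."""
    module_path, _, class_name = dotted_path.rpartition(".")
    module = import_module(module_path)
    return getattr(module, class_name)


# Standard-library class used purely for demonstration.
OrderedDict = load_class("collections.OrderedDict")
instance = OrderedDict()   # note the trailing (): the class gets instantiated
```

Passing the backends around as strings is presumably also what makes them easy to hand to a task, since a dotted path is trivial to serialize while a live storage instance may not be.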
This will hopefully make more sense when you see the storage backend.
import urllib

from django.core.cache import cache
from django.core.files.storage import get_storage_class, Storage

from yourapp.tasks import SaveToRemoteTask

QUEUED_REMOTE_STORAGE_CACHE_KEY_PREFIX = 'queued_remote_storage_'


class QueuedRemoteStorage(Storage):
    def __init__(self, local, remote, cache_prefix=QUEUED_REMOTE_STORAGE_CACHE_KEY_PREFIX):
        # Keep the dotted paths around so they can be passed to the task.
        self.local_class = local
        self.local = get_storage_class(self.local_class)()
        self.remote_class = remote
        self.remote = get_storage_class(self.remote_class)()
        self.cache_prefix = cache_prefix

    def get_storage(self, name):
        cache_result = cache.get(self.get_cache_key(name))
        if cache_result:
            return self.remote
        elif cache_result is None:
            # No cache entry: ask the remote backend directly.
            if self.remote.exists(name):
                cache.set(self.get_cache_key(name), True)
                return self.remote
        return self.local

    def get_cache_key(self, name):
        return '%s%s' % (self.cache_prefix, urllib.quote(name))

    def using_local(self, name):
        return self.get_storage(name) is self.local

    def using_remote(self, name):
        return self.get_storage(name) is self.remote

    def open(self, name, **kwargs):
        return self.local.open(name, **kwargs)

    def save(self, name, content):
        # Mark the file as "not on the remote yet", save it locally,
        # then queue the transfer.
        cache.set(self.get_cache_key(name), False)
        name = self.local.save(name, content)
        SaveToRemoteTask.delay(name, self.local_class, self.remote_class, self.get_cache_key(name))
        return name

    def get_valid_name(self, name):
        return self.get_storage(name).get_valid_name(name)

    def get_available_name(self, name):
        return self.get_storage(name).get_available_name(name)

    def path(self, name):
        return self.get_storage(name).path(name)

    def delete(self, name):
        return self.get_storage(name).delete(name)

    def exists(self, name):
        return self.get_storage(name).exists(name)

    def listdir(self, name):
        return self.get_storage(name).listdir(name)

    def size(self, name):
        return self.get_storage(name).size(name)

    def url(self, name):
        return self.get_storage(name).url(name)
Most of this code is just a proxy to the actual storage methods, as determined by get_storage. The heart and soul of this class is found in get_storage and save. get_storage looks up the file's key in the cache. If the value is True, it can assume that the file is on the remote server, and it returns the remote storage instance. If there is no cache entry at all (cache_result is None), it checks the remote backend for the existence of the file; if the file is found, it updates the cache and returns the remote backend. If all else fails, it returns the local backend. The save method is responsible for queuing up the remote transfer. It first sets the cache key to False, then saves the file locally. Next it sends a job to the queue and returns the name of the file.
Hopefully this is pretty straightforward. At this point this is nothing more than a proof of concept that has not been tested in a production setting. I hope that this will at least give others some ideas.
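The part that is easy to miss is that the cache value has three states, not two: True, False, and missing. Here is a small stand-alone sketch of that decision, with a plain dict as the cache and a made-up FakeStorage class in place of real backends. pick_storage and FakeStorage are hypothetical names for illustration only; the logic is meant to mirror get_storage, not replace it.

```python
class FakeStorage:
    """In-memory stand-in for a storage backend (hypothetical)."""

    def __init__(self, label, names=()):
        self.label = label
        self.names = set(names)

    def exists(self, name):
        return name in self.names


def pick_storage(cache, local, remote, name):
    """Mirror the three cache states that get_storage distinguishes."""
    state = cache.get(name)
    if state:                      # True: transfer known to be finished
        return remote
    if state is None:              # no entry: cache expired or was flushed
        if remote.exists(name):    # ask the remote backend directly
            cache[name] = True     # remember the answer for next time
            return remote
    return local                   # False, or not found on the remote


local = FakeStorage("local", {"a.jpg", "b.jpg", "c.jpg"})
remote = FakeStorage("remote", {"a.jpg", "b.jpg"})
cache = {"a.jpg": True, "c.jpg": False}   # "b.jpg" has no cache entry

chosen = {name: pick_storage(cache, local, remote, name).label
          for name in ("a.jpg", "b.jpg", "c.jpg")}
print(chosen)
```

The missing-entry case is what keeps things working if the cache gets flushed: the remote backend is asked once, and the answer is cached again. An explicit False skips that check, so a file that was just saved is served locally without a round trip to the remote.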






