2014-05-01 - Handling file upload is not always easy

Creating a form to upload a file is a trivial operation with a modern web framework. Just define a form object, like this:

class UploadForm(wtforms.Form):
    name = wtforms.TextField('name', validators=[wtforms.validators.DataRequired()])
    email = wtforms.TextField('email', validators=[wtforms.validators.Email(), wtforms.validators.DataRequired()])
    message = wtforms.TextAreaField('message', validators=[wtforms.validators.DataRequired()])
    media = wtforms.FileField(u'file', validators=[wtforms.validators.DataRequired()])

Then define a template with a form tag, and don't forget the enctype="multipart/form-data" attribute:

<form  action="{{ url_for('file_upload') }}" method="POST" enctype="multipart/form-data">
    {{ form }}
</form>

The browser uses multipart/form-data to send the request (with the POST verb) to the server in a specific format, where a boundary string delimits each posted field. For example:

POST /uploads HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: fr,en-us;q=0.7,en;q=0.3
Connection: keep-alive
Content-Type: multipart/form-data; boundary=---------------------------10116180066357935432050933376

-----------------------------10116180066357935432050933376
Content-Disposition: form-data; name="name"

test

-----------------------------10116180066357935432050933376
Content-Disposition: form-data; name="email"

thomas.rabaix@gmail.com

-----------------------------10116180066357935432050933376
Content-Disposition: form-data; name="message"

test
-----------------------------10116180066357935432050933376
Content-Disposition: form-data; name="media"; filename="2014-03-10 18.53.56.png"
Content-Type: image/png

Binary Content here ...
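
To make the wire format above concrete, here is a minimal Python sketch that builds such a body by hand and reads it back with the standard library's email parser. The helper names encode_multipart and decode_multipart and the sample values are made up for illustration; this is not what a browser or a framework actually runs.

```python
import uuid
from email.parser import BytesParser
from email.policy import HTTP


def encode_multipart(fields, files):
    """Build a multipart/form-data body; returns (content_type, body)."""
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    chunks = []
    for name, value in fields.items():
        chunks += [
            delimiter,
            ('Content-Disposition: form-data; name="%s"' % name).encode("utf-8"),
            b"",  # blank line separates the part headers from the value
            value.encode("utf-8"),
        ]
    for name, (filename, content_type, payload) in files.items():
        chunks += [
            delimiter,
            ('Content-Disposition: form-data; name="%s"; filename="%s"'
             % (name, filename)).encode("utf-8"),
            ("Content-Type: %s" % content_type).encode("ascii"),
            b"",
            payload,
        ]
    chunks += [delimiter + b"--", b""]  # the closing delimiter ends with "--"
    return "multipart/form-data; boundary=%s" % boundary, b"\r\n".join(chunks)


def decode_multipart(content_type, body):
    """Return {field name: (filename or None, raw bytes)}."""
    head = ("Content-Type: %s\r\n\r\n" % content_type).encode("ascii")
    message = BytesParser(policy=HTTP).parsebytes(head + body)
    parts = {}
    for part in message.iter_parts():
        name = part.get_param("name", header="content-disposition")
        parts[name] = (part.get_filename(), part.get_payload(decode=True))
    return parts
```

Plain fields come back with no file name, while the file part keeps the name the client declared.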

When the web server receives the request, it calls the configured handler to serve it. Depending on the language used, the backend receives either a stream of data containing all the information or a reference to a file already available on the filesystem.

Calling the backend works in many cases, but how does it work with big files? How can you handle the load when you receive too many files? Something important to keep in mind is that while it is receiving a file, a backend process cannot accept more connections, so file uploads might lock your website.

Now, why should the backend handle file uploads? Can't we let the web server handle this task and leave the backend to serve pages for users or run business workflows?

The Nginx web server has an optional module that accepts file uploads before calling the backend: the Nginx upload module, available in your favorite distribution by installing the nginx-extras package. This module lets you offload the upload process to the web server and configure sensible timeouts or the maximum body size for the uploaded content. The other advantage is that you don't need to define custom settings for the backend.

The module configuration is pretty simple:

location = /uploads {
    # configure the max size for the request
    client_max_body_size 128M;
    client_body_buffer_size 1024k;

    # configure the location of the uploaded files
    upload_store /var/uploads;
    upload_state_store /var/upload_state;

    # set permissions on the uploaded files
    upload_store_access user:rw group:rw all:r;
    upload_set_form_field $upload_field_name '{"filename":"$upload_file_name", "content_type": "$upload_content_type", "path": "$upload_tmp_path"}';

    # pass all other fields posted to the backend
    upload_pass_form_field "^(.*)$";

    # call the backend to complete the request, ie, get the file and store meta data into a dedicated database.
    upload_pass   @tornado_backend;
}

There is one important thing in this configuration: the upload_set_form_field directive. This directive transforms the posted file content into form fields so the backend can retrieve metadata about the file: upload_file_name, upload_content_type and upload_tmp_path. In our case, we create a single field corresponding to the uploaded file, with the metadata encoded as a JSON string, because most web frameworks expect a one-to-one relation between a form field and its configured form type.
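
To picture what the backend ends up receiving, the sketch below imitates in Python the variable substitution that nginx performs on the upload_set_form_field template, then decodes the result. The render_field helper and the sample values (file name, temporary path) are hypothetical; the real substitution happens inside nginx.

```python
import json

# Same template as the upload_set_form_field directive above.
TEMPLATE = ('{"filename":"$upload_file_name", '
            '"content_type": "$upload_content_type", '
            '"path": "$upload_tmp_path"}')


def render_field(template, variables):
    """Naively replace each $variable with its value (illustration only)."""
    for key, value in variables.items():
        template = template.replace("$" + key, value)
    return template


# Hypothetical values for one uploaded file.
raw = render_field(TEMPLATE, {
    "upload_file_name": "2014-03-10 18.53.56.png",
    "upload_content_type": "image/png",
    "upload_tmp_path": "/var/uploads/0000000001",
})
media = json.loads(raw)  # what the backend decodes from the "media" field
```

The backend therefore sees a single text field named media whose value is a JSON document, instead of the binary content of the file.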

So in the current example (Python with Tornado and WTForms), we create a custom field to handle the transformation:

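import json

import wtforms


class NginxUploadField(wtforms.Field):
    widget = wtforms.widgets.FileInput()

    def _value(self):
        # never echo the upload metadata back into the rendered form
        return None

    def process_formdata(self, valuelist):
        # nginx replaces the file with a JSON string describing it
        try:
            self.data = json.loads(valuelist[0])
        except (IndexError, TypeError, ValueError):
            self.data = None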

So now, the code to handle the file is pretty simple:

import json
import shutil
import uuid


class IndexView(NodeHandler):
    def execute(self, request_handler, **kwargs):
        # assumes UploadForm.media is declared as an NginxUploadField
        form = UploadForm(TornadoMultiDict(request_handler))

        if request_handler.request.method == 'POST' and form.validate():
            # form.data is a plain dict mapping field names to processed values
            data = form.data

            # store meta
            reference = uuid.uuid4()
            with open("%s/%s.json" % (self.path, reference), 'w') as fd:
                fd.write(json.dumps(data))

            # copy file to the final destination
            shutil.copyfile(data['media']['path'], "%s/%s.bin" % (self.path, reference))
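
The handler above copies whatever path the media field names. Since the nginx configuration passes every non-file field through to the backend, a client might be able to submit an ordinary text field named media with a forged path, so it may be worth checking that the path really points inside the upload store before touching it. A minimal sketch, assuming the store is /var/uploads as configured above; the is_inside_store helper is hypothetical:

```python
import os


def is_inside_store(path, store="/var/uploads"):
    """Return True only if path resolves to a location under store."""
    if not isinstance(path, str) or not path:
        return False
    store = os.path.realpath(store)
    target = os.path.realpath(path)
    # realpath resolves symlinks and ".." segments; commonpath then
    # compares whole path components rather than string prefixes
    return target != store and os.path.commonpath([store, target]) == store
```

The copy would then only run when is_inside_store returns True for the posted path.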

Of course, this is a simple example. But this setup lets you dedicate a set of front-end servers to accepting files, and those servers can be used as the main storage for your files. In fact, the copyfile call is optional, as nginx writes each uploaded file to a unique location on the filesystem.
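
As a sketch of the storage step in isolation, the function below moves the upload instead of copying it, then writes the metadata next to it. Moving avoids a second copy of the data when both directories are on the same filesystem. The store_upload name and its arguments are assumptions, not part of the original handler.

```python
import json
import os
import shutil
import uuid


def store_upload(fields, dest_dir):
    """Move the nginx temp file into dest_dir and save the fields as JSON."""
    reference = str(uuid.uuid4())
    # shutil.move renames in place when source and target share a filesystem
    shutil.move(fields["media"]["path"], os.path.join(dest_dir, reference + ".bin"))
    with open(os.path.join(dest_dir, reference + ".json"), "w") as fd:
        json.dump(fields, fd)
    return reference
```

The returned reference is the only value the rest of the application needs to keep in order to find both files later.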

The upload module is great for offloading most of the work to the web server. The same logic applies when you want to provide a download link to the user: your backend should not be responsible for sending the file contents to the browser. It should only check whether the user is allowed to retrieve the file.

The nginx configuration is:

location /protected/files {
    internal;
    alias   /var/shared/files/;
}

And the Python pseudocode is:

class DownloadView(NodeHandler):
    def execute(self, request_handler, reference, **kwargs):
        # the metadata keys mirror the JSON written by upload_set_form_field
        with open("%s/%s.json" % (self.path, reference), 'r') as fd:
            data = json.load(fd)
        data['reference'] = reference

        # quote the file name so that names containing spaces survive
        request_handler.set_header('Content-Disposition', 'attachment; filename="%s"' % data['media']['filename'])
        request_handler.set_header('Content-Type', data['media']['content_type'])
        request_handler.set_header('X-Accel-Redirect', '/protected/files/%s.bin' % data['reference'])

The backend adds a specific header, X-Accel-Redirect, which nginx uses (together with the internal directive) to send the file to the user. Note that the file does not need to be stored in the public path, and the backend does not need to know its location on the filesystem. The backend only needs to know the virtual location (/protected/files).
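
The header logic can be isolated in a small function, which also makes it easy to test without nginx. This is a sketch under assumptions: build_download_headers is a hypothetical helper, the metadata keys mirror the JSON produced by the upload configuration, and stripping quotes and control characters from the file name is a conservative choice rather than something the original code does.

```python
import re


def build_download_headers(media, reference, prefix="/protected/files"):
    """Return the response headers that hand the transfer over to nginx."""
    # Drop quotes, backslashes and control characters from the file name
    # so it cannot break out of the quoted header value.
    safe_name = re.sub(r'[\x00-\x1f\x7f"\\]', "", media["filename"]) or "download"
    return {
        "Content-Disposition": 'attachment; filename="%s"' % safe_name,
        "Content-Type": media["content_type"],
        # nginx consumes this header and serves the file from the internal location
        "X-Accel-Redirect": "%s/%s.bin" % (prefix, reference),
    }
```

Each pair can then be passed to request_handler.set_header(name, value).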

With a bit of work, and depending on the infrastructure, the backend will never have access to the uploaded files, so even if your code has security issues, it will be harder to retrieve the files. It will also be impossible to execute the uploaded files.
