123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365 |
- ============
- Unicode data
- ============
- Django natively supports Unicode data everywhere. Providing your database can
- somehow store the data, you can safely pass around strings to templates,
- models, and the database.
- This document tells you what you need to know if you're writing applications
- that use data or templates that are encoded in something other than ASCII.
- Creating the database
- =====================
- Make sure your database is configured to be able to store arbitrary string
- data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use
- a more restrictive encoding -- for example, latin1 (iso8859-1) -- you won't be
- able to store certain characters in the database, and information will be lost.
- * MySQL users, refer to the `MySQL manual`_ for details on how to set or alter
- the database character set encoding.
- * PostgreSQL users, refer to the `PostgreSQL manual`_ (section 22.3.2 in
- PostgreSQL 9) for details on creating databases with the correct encoding.
- * Oracle users, refer to the `Oracle manual`_ for details on how to set
- (`section 2`_) or alter (`section 11`_) the database character set encoding.
- * SQLite users, there is nothing you need to do. SQLite always uses UTF-8
- for internal encoding.
- .. _MySQL manual: https://dev.mysql.com/doc/refman/en/charset-database.html
- .. _PostgreSQL manual: https://www.postgresql.org/docs/current/static/multibyte.html
- .. _Oracle manual: https://docs.oracle.com/database/121/NLSPG/toc.htm
- .. _section 2: https://docs.oracle.com/database/121/NLSPG/ch2charset.htm#NLSPG002
- .. _section 11: https://docs.oracle.com/database/121/NLSPG/ch11charsetmig.htm#NLSPG011
- All of Django's database backends automatically convert strings into
- the appropriate encoding for talking to the database. They also automatically
- convert strings retrieved from the database into strings. You don't even need
- to tell Django what encoding your database uses: that is handled transparently.
- For more, see the section "The database API" below.
- General string handling
- =======================
- Whenever you use strings with Django -- e.g., in database lookups, template
- rendering or anywhere else -- you have two choices for encoding those strings.
- You can use normal strings or bytestrings (starting with a 'b').
- .. warning::
- A bytestring does not carry any information with it about its encoding.
- For that reason, we have to make an assumption, and Django assumes that all
- bytestrings are in UTF-8.
- If you pass a string to Django that has been encoded in some other format,
- things will go wrong in interesting ways. Usually, Django will raise a
- ``UnicodeDecodeError`` at some point.
- If your code only uses ASCII data, it's safe to use your normal strings,
- passing them around at will, because ASCII is a subset of UTF-8.
- Don't be fooled into thinking that if your :setting:`DEFAULT_CHARSET` setting is set
- to something other than ``'utf-8'`` you can use that other encoding in your
- bytestrings! :setting:`DEFAULT_CHARSET` only applies to the strings generated as
- the result of template rendering (and email). Django will always assume UTF-8
- encoding for internal bytestrings. The reason for this is that the
- :setting:`DEFAULT_CHARSET` setting is not actually under your control (if you are the
- application developer). It's under the control of the person installing and
- using your application -- and if that person chooses a different setting, your
- code must still continue to work. Ergo, it cannot rely on that setting.
- In most cases when Django is dealing with strings, it will convert them to
- strings before doing anything else. So, as a general rule, if you pass
- in a bytestring, be prepared to receive a string back in the result.
- Translated strings
- ------------------
- Aside from strings and bytestrings, there's a third type of string-like
- object you may encounter when using Django. The framework's
- internationalization features introduce the concept of a "lazy translation" --
- a string that has been marked as translated but whose actual translation result
- isn't determined until the object is used in a string. This feature is useful
- in cases where the translation locale is unknown until the string is used, even
- though the string might have originally been created when the code was first
- imported.
- Normally, you won't have to worry about lazy translations. Just be aware that
- if you examine an object and it claims to be a
- ``django.utils.functional.__proxy__`` object, it is a lazy translation.
- Calling ``str()`` with the lazy translation as the argument will generate a
- string in the current locale.
- For more details about lazy translation objects, refer to the
- :doc:`internationalization </topics/i18n/index>` documentation.
- Useful utility functions
- ------------------------
- Because some string operations come up again and again, Django ships with a few
- useful functions that should make working with string and bytestring objects
- a bit easier.
- Conversion functions
- ~~~~~~~~~~~~~~~~~~~~
- The ``django.utils.encoding`` module contains a few functions that are handy
- for converting back and forth between strings and bytestrings.
- * ``smart_text(s, encoding='utf-8', strings_only=False, errors='strict')``
- converts its input to a string. The ``encoding`` parameter
- specifies the input encoding. (For example, Django uses this internally
- when processing form input data, which might not be UTF-8 encoded.) The
- ``strings_only`` parameter, if set to True, will result in Python
- numbers, booleans and ``None`` not being converted to a string (they keep
- their original types). The ``errors`` parameter takes any of the values
- that are accepted by Python's ``str()`` function for its error
- handling.
- * ``force_text(s, encoding='utf-8', strings_only=False,
- errors='strict')`` is identical to ``smart_text()`` in almost all
- cases. The difference is when the first argument is a :ref:`lazy
- translation <lazy-translations>` instance. While ``smart_text()``
- preserves lazy translations, ``force_text()`` forces those objects to a
- string (causing the translation to occur). Normally, you'll want
- to use ``smart_text()``. However, ``force_text()`` is useful in
- template tags and filters that absolutely *must* have a string to work
- with, not just something that can be converted to a string.
- * ``smart_bytes(s, encoding='utf-8', strings_only=False, errors='strict')``
- is essentially the opposite of ``smart_text()``. It forces the first
- argument to a bytestring. The ``strings_only`` parameter has the same
- behavior as for ``smart_text()`` and ``force_text()``. This is
- slightly different semantics from Python's builtin ``str()`` function,
- but the difference is needed in a few places within Django's internals.
- Normally, you'll only need to use ``force_text()``. Call it as early as
- possible on any input data that might be either a string or a bytestring, and
- from then on, you can treat the result as always being a string.
- .. _uri-and-iri-handling:
- URI and IRI handling
- ~~~~~~~~~~~~~~~~~~~~
- Web frameworks have to deal with URLs (which are a type of IRI_). One
- requirement of URLs is that they are encoded using only ASCII characters.
- However, in an international environment, you might need to construct a
- URL from an IRI_ -- very loosely speaking, a URI_ that can contain Unicode
- characters. Use these functions for quoting and converting an IRI to a URI:
- * The :func:`django.utils.encoding.iri_to_uri()` function, which implements the
- conversion from IRI to URI as required by :rfc:`3987#section-3.1`.
- * The :func:`urllib.parse.quote` and :func:`urllib.parse.quote_plus`
- functions from Python's standard library.
- These two groups of functions have slightly different purposes, and it's
- important to keep them straight. Normally, you would use ``quote()`` on the
- individual portions of the IRI or URI path so that any reserved characters
- such as '&' or '%' are correctly encoded. Then, you apply ``iri_to_uri()`` to
- the full IRI and it converts any non-ASCII characters to the correct encoded
- values.
- .. note::
- Technically, it isn't correct to say that ``iri_to_uri()`` implements the
- full algorithm in the IRI specification. It doesn't (yet) perform the
- international domain name encoding portion of the algorithm.
- The ``iri_to_uri()`` function will not change ASCII characters that are
- otherwise permitted in a URL. So, for example, the character '%' is not
- further encoded when passed to ``iri_to_uri()``. This means you can pass a
- full URL to this function and it will not mess up the query string or anything
- like that.
- An example might clarify things here::
- >>> from urllib.parse import quote
- >>> from django.utils.encoding import iri_to_uri
- >>> quote('Paris & Orléans')
- 'Paris%20%26%20Orl%C3%A9ans'
- >>> iri_to_uri('/favorites/François/%s' % quote('Paris & Orléans'))
- '/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans'
- If you look carefully, you can see that the portion that was generated by
- ``quote()`` in the second example was not double-quoted when passed to
- ``iri_to_uri()``. This is a very important and useful feature. It means that
- you can construct your IRI without worrying about whether it contains
- non-ASCII characters and then, right at the end, call ``iri_to_uri()`` on the
- result.
- Similarly, Django provides :func:`django.utils.encoding.uri_to_iri()` which
- implements the conversion from URI to IRI as per :rfc:`3987#section-3.2`.
- An example to demonstrate::
- >>> from django.utils.encoding import uri_to_iri
- >>> uri_to_iri('/%E2%99%A5%E2%99%A5/?utf8=%E2%9C%93')
- '/♥♥/?utf8=✓'
- >>> uri_to_iri('%A9hello%3Fworld')
- '%A9hello%3Fworld'
- In the first example, the UTF-8 characters are unquoted. In the second, the
- percent-encodings remain unchanged because they lie outside the valid UTF-8
- range or represent a reserved character.
- Both ``iri_to_uri()`` and ``uri_to_iri()`` functions are idempotent, which means the
- following is always true::
- iri_to_uri(iri_to_uri(some_string)) == iri_to_uri(some_string)
- uri_to_iri(uri_to_iri(some_string)) == uri_to_iri(some_string)
- So you can safely call it multiple times on the same URI/IRI without risking
- double-quoting problems.
- .. _URI: https://www.ietf.org/rfc/rfc2396.txt
- .. _IRI: https://www.ietf.org/rfc/rfc3987.txt
- Models
- ======
- Because all strings are returned from the database as ``str`` objects, model
- fields that are character based (CharField, TextField, URLField, etc.) will
- contain Unicode values when Django retrieves data from the database. This
- is *always* the case, even if the data could fit into an ASCII bytestring.
- You can pass in bytestrings when creating a model or populating a field, and
- Django will convert it to strings when it needs to.
- Taking care in ``get_absolute_url()``
- -------------------------------------
- URLs can only contain ASCII characters. If you're constructing a URL from
- pieces of data that might be non-ASCII, be careful to encode the results in a
- way that is suitable for a URL. The :func:`~django.urls.reverse` function
- handles this for you automatically.
- If you're constructing a URL manually (i.e., *not* using the ``reverse()``
- function), you'll need to take care of the encoding yourself. In this case,
- use the ``iri_to_uri()`` and ``quote()`` functions that were documented
- above_. For example::
- from urllib.parse import quote
- from django.utils.encoding import iri_to_uri
- def get_absolute_url(self):
- url = '/person/%s/?x=0&y=0' % quote(self.location)
- return iri_to_uri(url)
- This function returns a correctly encoded URL even if ``self.location`` is
- something like "Jack visited Paris & Orléans". (In fact, the ``iri_to_uri()``
- call isn't strictly necessary in the above example, because all the
- non-ASCII characters would have been removed in quoting in the first line.)
- .. _above: `URI and IRI handling`_
- The database API
- ================
- You can pass either strings or UTF-8 bytestrings as arguments to
- ``filter()`` methods and the like in the database API. The following two
- querysets are identical::
- qs = People.objects.filter(name__contains='Å')
- qs = People.objects.filter(name__contains=b'\xc3\x85') # UTF-8 encoding of Å
- Templates
- =========
- Use strings when creating templates manually::
- from django.template import Template
- t2 = Template('This is a string template.')
- But the common case is to read templates from the filesystem, and this creates
- a slight complication: not all filesystems store their data encoded as UTF-8.
- If your template files are not stored with a UTF-8 encoding, set the :setting:`FILE_CHARSET`
- setting to the encoding of the files on disk. When Django reads in a template
- file, it will convert the data from this encoding to Unicode. (:setting:`FILE_CHARSET`
- is set to ``'utf-8'`` by default.)
- The :setting:`DEFAULT_CHARSET` setting controls the encoding of rendered templates.
- This is set to UTF-8 by default.
- Template tags and filters
- -------------------------
- A couple of tips to remember when writing your own template tags and filters:
- * Always return strings from a template tag's ``render()`` method
- and from template filters.
- * Use ``force_text()`` in preference to ``smart_text()`` in these
- places. Tag rendering and filter calls occur as the template is being
- rendered, so there is no advantage to postponing the conversion of lazy
- translation objects into strings. It's easier to work solely with
- strings at that point.
- .. _unicode-files:
- Files
- =====
- If you intend to allow users to upload files, you must ensure that the
- environment used to run Django is configured to work with non-ASCII file names.
- If your environment isn't configured correctly, you'll encounter
- ``UnicodeEncodeError`` exceptions when saving files with file names that
- contain non-ASCII characters.
- Filesystem support for UTF-8 file names varies and might depend on the
- environment. Check your current configuration in an interactive Python shell by
- running::
- import sys
- sys.getfilesystemencoding()
- This should output "UTF-8".
- The ``LANG`` environment variable is responsible for setting the expected
- encoding on Unix platforms. Consult the documentation for your operating system
- and application server for the appropriate syntax and location to set this
- variable.
- In your development environment, you might need to add a setting to your
- ``~.bashrc`` analogous to:::
- export LANG="en_US.UTF-8"
- Form submission
- ===============
- HTML form submission is a tricky area. There's no guarantee that the
- submission will include encoding information, which means the framework might
- have to guess at the encoding of submitted data.
- Django adopts a "lazy" approach to decoding form data. The data in an
- ``HttpRequest`` object is only decoded when you access it. In fact, most of
- the data is not decoded at all. Only the ``HttpRequest.GET`` and
- ``HttpRequest.POST`` data structures have any decoding applied to them. Those
- two fields will return their members as Unicode data. All other attributes and
- methods of ``HttpRequest`` return data exactly as it was submitted by the
- client.
- By default, the :setting:`DEFAULT_CHARSET` setting is used as the assumed encoding
- for form data. If you need to change this for a particular form, you can set
- the ``encoding`` attribute on an ``HttpRequest`` instance. For example::
- def some_view(request):
- # We know that the data must be encoded as KOI8-R (for some reason).
- request.encoding = 'koi8-r'
- ...
- You can even change the encoding after having accessed ``request.GET`` or
- ``request.POST``, and all subsequent accesses will use the new encoding.
- Most developers won't need to worry about changing form encoding, but this is
- a useful feature for applications that talk to legacy systems whose encoding
- you cannot control.
- Django does not decode the data of file uploads, because that data is normally
- treated as collections of bytes, rather than strings. Any automatic decoding
- there would alter the meaning of the stream of bytes.
|