============
Unicode data
============

Django natively supports Unicode data everywhere. Providing your database can
somehow store the data, you can safely pass around Unicode strings to
templates, models and the database.

This document tells you what you need to know if you're writing applications
that use data or templates that are encoded in something other than ASCII.

Creating the database
=====================

Make sure your database is configured to be able to store arbitrary string
data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use
a more restrictive encoding -- for example, latin1 (iso8859-1) -- you won't be
able to store certain characters in the database, and information will be lost.

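The information loss described above can be illustrated without a database.
The following is a minimal sketch that uses only the Python 3 standard
library; the helper name ``round_trips`` is hypothetical and is not part of
Django or of any database driver.

```python
def round_trips(text, encoding):
    """Return True if ``text`` survives an encode/decode cycle unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding has no representation for at least one character.
        return False


sample = "I ♥ Django"

print(round_trips(sample, "utf-8"))    # UTF-8 can represent every character
print(round_trips(sample, "latin-1"))  # latin1 has no code point for the heart

# Forcing the conversion substitutes a placeholder, so the character is lost.
lossy = sample.encode("latin-1", errors="replace")
print(lossy)
```

A database configured with a restrictive character set faces the same choice:
reject the value or store a substituted placeholder.
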
* MySQL users, refer to the `MySQL manual`_ for details on how to set or alter
  the database character set encoding.

* PostgreSQL users, refer to the `PostgreSQL manual`_ (section 22.3.2 in
  PostgreSQL 9) for details on creating databases with the correct encoding.

* Oracle users, refer to the `Oracle manual`_ for details on how to set
  (`section 2`_) or alter (`section 11`_) the database character set encoding.

* SQLite users, there is nothing you need to do. SQLite always uses UTF-8
  for internal encoding.

.. _MySQL manual: https://dev.mysql.com/doc/refman/5.6/en/charset-database.html
.. _PostgreSQL manual: http://www.postgresql.org/docs/current/static/multibyte.html
.. _Oracle manual: https://docs.oracle.com/cd/E11882_01/server.112/e10729/toc.htm
.. _section 2: https://docs.oracle.com/cd/E11882_01/server.112/e10729/ch2charset.htm#NLSPG002
.. _section 11: https://docs.oracle.com/cd/E11882_01/server.112/e10729/ch11charsetmig.htm#NLSPG011

All of Django's database backends automatically convert Unicode strings into
the appropriate encoding for talking to the database. They also automatically
convert strings retrieved from the database into Python Unicode strings. You
don't even need to tell Django what encoding your database uses: that is
handled transparently.

For more, see the section "The database API" below.

General string handling
=======================

Whenever you use strings with Django -- e.g., in database lookups, template
rendering or anywhere else -- you have two choices for encoding those strings.
You can use Unicode strings, or you can use normal strings (sometimes called
"bytestrings") that are encoded using UTF-8.

In Python 3, the logic is reversed: normal strings are Unicode, and
when you want to specifically create a bytestring, you have to prefix the
string with a 'b'. As we have done in Django's own code since version 1.5,
we recommend that you import ``unicode_literals`` from the ``__future__``
module in your code. Then, when you specifically want to create a bytestring
literal, prefix the string with 'b'.

Python 2 legacy::

    my_string = "This is a bytestring"
    my_unicode = u"This is a Unicode string"

Python 2 with unicode literals or Python 3::

    from __future__ import unicode_literals

    my_string = b"This is a bytestring"
    my_unicode = "This is a Unicode string"

See also :doc:`Python 3 compatibility </topics/python3>`.

.. warning::

    A bytestring does not carry any information with it about its encoding.
    For that reason, we have to make an assumption, and Django assumes that all
    bytestrings are in UTF-8.

    If you pass a string to Django that has been encoded in some other format,
    things will go wrong in interesting ways. Usually, Django will raise a
    ``UnicodeDecodeError`` at some point.

If your code only uses ASCII data, it's safe to use your normal strings,
passing them around at will, because ASCII is a subset of UTF-8.

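The failure mode in the warning can be reproduced in plain Python 3. This is
only a sketch of the underlying behavior, not of Django's internals; the
helper name ``decodes_as_utf8`` is hypothetical.

```python
def decodes_as_utf8(data):
    """Return True if the bytestring is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


ascii_bytes = b"Orleans"                     # pure ASCII
utf8_bytes = "Orléans".encode("utf-8")       # b'Orl\xc3\xa9ans'
latin1_bytes = "Orléans".encode("latin-1")   # b'Orl\xe9ans'

print(decodes_as_utf8(ascii_bytes))    # ASCII is a subset of UTF-8
print(decodes_as_utf8(utf8_bytes))     # well-formed UTF-8
print(decodes_as_utf8(latin1_bytes))   # a lone 0xE9 byte is not valid UTF-8
```
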
Don't be fooled into thinking that if your :setting:`DEFAULT_CHARSET` setting is set
to something other than ``'utf-8'`` you can use that other encoding in your
bytestrings! :setting:`DEFAULT_CHARSET` only applies to the strings generated as
the result of template rendering (and email). Django will always assume UTF-8
encoding for internal bytestrings. The reason for this is that the
:setting:`DEFAULT_CHARSET` setting is not actually under your control (if you are the
application developer). It's under the control of the person installing and
using your application -- and if that person chooses a different setting, your
code must still continue to work. Ergo, it cannot rely on that setting.

In most cases when Django is dealing with strings, it will convert them to
Unicode strings before doing anything else. So, as a general rule, if you pass
in a bytestring, be prepared to receive a Unicode string back in the result.

Translated strings
------------------

Aside from Unicode strings and bytestrings, there's a third type of string-like
object you may encounter when using Django. The framework's
internationalization features introduce the concept of a "lazy translation" --
a string that has been marked as translated but whose actual translation result
isn't determined until the object is used in a string. This feature is useful
in cases where the translation locale is unknown until the string is used, even
though the string might have originally been created when the code was first
imported.

Normally, you won't have to worry about lazy translations. Just be aware that
if you examine an object and it claims to be a
``django.utils.functional.__proxy__`` object, it is a lazy translation.
Calling ``unicode()`` with the lazy translation as the argument will generate a
Unicode string in the current locale.

For more details about lazy translation objects, refer to the
:doc:`internationalization </topics/i18n/index>` documentation.

Useful utility functions
------------------------

Because some string operations come up again and again, Django ships with a few
useful functions that should make working with Unicode and bytestring objects
a bit easier.

Conversion functions
~~~~~~~~~~~~~~~~~~~~

The ``django.utils.encoding`` module contains a few functions that are handy
for converting back and forth between Unicode and bytestrings.

* ``smart_text(s, encoding='utf-8', strings_only=False, errors='strict')``
  converts its input to a Unicode string. The ``encoding`` parameter
  specifies the input encoding. (For example, Django uses this internally
  when processing form input data, which might not be UTF-8 encoded.) The
  ``strings_only`` parameter, if set to True, will result in Python
  numbers, booleans and ``None`` not being converted to a string (they keep
  their original types). The ``errors`` parameter takes any of the values
  that are accepted by Python's ``unicode()`` function for its error
  handling.

  If you pass ``smart_text()`` an object that has a ``__unicode__``
  method, it will use that method to do the conversion.

* ``force_text(s, encoding='utf-8', strings_only=False,
  errors='strict')`` is identical to ``smart_text()`` in almost all
  cases. The difference is when the first argument is a :ref:`lazy
  translation <lazy-translations>` instance. While ``smart_text()``
  preserves lazy translations, ``force_text()`` forces those objects to a
  Unicode string (causing the translation to occur). Normally, you'll want
  to use ``smart_text()``. However, ``force_text()`` is useful in
  template tags and filters that absolutely *must* have a string to work
  with, not just something that can be converted to a string.

* ``smart_bytes(s, encoding='utf-8', strings_only=False, errors='strict')``
  is essentially the opposite of ``smart_text()``. It forces the first
  argument to a bytestring. The ``strings_only`` parameter has the same
  behavior as for ``smart_text()`` and ``force_text()``. This is
  slightly different semantics from Python's builtin ``str()`` function,
  but the difference is needed in a few places within Django's internals.

Normally, you'll only need to use ``force_text()``. Call it as early as
possible on any input data that might be either Unicode or a bytestring, and
from then on, you can treat the result as always being Unicode.

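The "convert as early as possible" advice can be sketched in plain Python 3.
The helper ``to_text`` below is a hypothetical, simplified stand-in written
for illustration only. It is not Django's implementation and ignores lazy
translations and the ``strings_only`` option.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Return ``value`` as a text (Unicode) string.

    Simplified illustration: text passes through unchanged, bytes are
    decoded, and anything else is converted with ``str()``.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding, errors)
    return str(value)


def greet(name):
    # Normalise the input once, at the boundary; the rest of the function
    # can then assume it is always working with text.
    name = to_text(name)
    return "Hello, " + name


print(greet("Åsa"))
print(greet("Åsa".encode("utf-8")))
```
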
.. _uri-and-iri-handling:

URI and IRI handling
~~~~~~~~~~~~~~~~~~~~

Web frameworks have to deal with URLs (which are a type of IRI_). One
requirement of URLs is that they are encoded using only ASCII characters.
However, in an international environment, you might need to construct a
URL from an IRI_ -- very loosely speaking, a URI_ that can contain Unicode
characters. Quoting and converting an IRI to URI can be a little tricky, so
Django provides some assistance.

* The function :func:`django.utils.encoding.iri_to_uri()` implements the
  conversion from IRI to URI as required by the specification (:rfc:`3987#section-3.1`).

* The functions :func:`django.utils.http.urlquote()` and
  :func:`django.utils.http.urlquote_plus()` are versions of Python's standard
  ``urllib.quote()`` and ``urllib.quote_plus()`` that work with non-ASCII
  characters. (The data is converted to UTF-8 prior to encoding.)

These two groups of functions have slightly different purposes, and it's
important to keep them straight. Normally, you would use ``urlquote()`` on the
individual portions of the IRI or URI path so that any reserved characters
such as '&' or '%' are correctly encoded. Then, you apply ``iri_to_uri()`` to
the full IRI and it converts any non-ASCII characters to the correct encoded
values.

.. note::

    Technically, it isn't correct to say that ``iri_to_uri()`` implements the
    full algorithm in the IRI specification. It doesn't (yet) perform the
    international domain name encoding portion of the algorithm.

The ``iri_to_uri()`` function will not change ASCII characters that are
otherwise permitted in a URL. So, for example, the character '%' is not
further encoded when passed to ``iri_to_uri()``. This means you can pass a
full URL to this function and it will not mess up the query string or anything
like that.

An example might clarify things here::

    >>> urlquote('Paris & Orléans')
    'Paris%20%26%20Orl%C3%A9ans'
    >>> iri_to_uri('/favorites/François/%s' % urlquote('Paris & Orléans'))
    '/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans'

If you look carefully, you can see that the portion that was generated by
``urlquote()`` in the second example was not double-quoted when passed to
``iri_to_uri()``. This is a very important and useful feature. It means that
you can construct your IRI without worrying about whether it contains
non-ASCII characters and then, right at the end, call ``iri_to_uri()`` on the
result.

Similarly, Django provides :func:`django.utils.encoding.uri_to_iri()` which
implements the conversion from URI to IRI as per :rfc:`3987#section-3.2`.
It decodes all percent-encodings except those that don't represent a valid
UTF-8 sequence.

An example to demonstrate::

    >>> uri_to_iri('/%E2%99%A5%E2%99%A5/?utf8=%E2%9C%93')
    '/♥♥/?utf8=✓'
    >>> uri_to_iri('%A9helloworld')
    '%A9helloworld'

In the first example, the UTF-8 characters and reserved characters are
unquoted. In the second, the percent-encoding remains unchanged because it
lies outside the valid UTF-8 range.

Both the ``iri_to_uri()`` and ``uri_to_iri()`` functions are idempotent, which
means the following is always true::

    iri_to_uri(iri_to_uri(some_string)) == iri_to_uri(some_string)
    uri_to_iri(uri_to_iri(some_string)) == uri_to_iri(some_string)

So you can safely call them multiple times on the same URI/IRI without risking
double-quoting problems.

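The two-step quoting pattern and the idempotence property can be approximated
with the Python 3 standard library alone. In the sketch below,
``quote_segment``, ``to_uri`` and the ``URI_SAFE`` character set are
assumptions made for illustration; they mimic the behavior described above
but are not Django's ``urlquote()`` or ``iri_to_uri()``.

```python
from urllib.parse import quote

# Assumed set of ASCII characters to leave untouched in a full URI.
# Crucially it includes '%', so existing percent-escapes are preserved.
URI_SAFE = "/#%[]=:;$&()+,!?*@'~"


def quote_segment(text):
    """Percent-encode one path segment, including reserved characters."""
    return quote(text, safe="/")


def to_uri(iri):
    """Percent-encode the non-ASCII characters of an already-assembled IRI."""
    return quote(iri, safe=URI_SAFE)


segment = quote_segment("Paris & Orléans")
uri = to_uri("/favorites/François/" + segment)

print(segment)
print(uri)
print(to_uri(uri) == uri)  # applying the second step again changes nothing
```

Because ``%`` is treated as safe in the second step, the escapes produced in
the first step are not quoted a second time.
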
.. _URI: https://www.ietf.org/rfc/rfc2396.txt
.. _IRI: https://www.ietf.org/rfc/rfc3987.txt

Models
======

Because all strings are returned from the database as Unicode strings, model
fields that are character based (CharField, TextField, URLField, etc.) will
contain Unicode values when Django retrieves data from the database. This
is *always* the case, even if the data could fit into an ASCII bytestring.

You can pass in bytestrings when creating a model or populating a field, and
Django will convert it to Unicode when it needs to.

Choosing between ``__str__()`` and ``__unicode__()``
----------------------------------------------------

.. note::

    If you are on Python 3, you can skip this section because you'll always
    create ``__str__()`` rather than ``__unicode__()``. If you'd like
    compatibility with Python 2, you can decorate your model class with
    :func:`~django.utils.encoding.python_2_unicode_compatible`.

One consequence of using Unicode by default is that you have to take some care
when printing data from the model.

In particular, rather than giving your model a ``__str__()`` method, we
recommend you implement a ``__unicode__()`` method. In the ``__unicode__()``
method, you can quite safely return the values of all your fields without
having to worry about whether they fit into a bytestring or not. (The way
Python works, the result of ``__str__()`` is *always* a bytestring, even if you
accidentally try to return a Unicode object).

You can still create a ``__str__()`` method on your models if you want, of
course, but you shouldn't need to do this unless you have a good reason.
Django's ``Model`` base class automatically provides a ``__str__()``
implementation that calls ``__unicode__()`` and encodes the result into UTF-8.
This means you'll normally only need to implement a ``__unicode__()`` method
and let Django handle the coercion to a bytestring when required.

Taking care in ``get_absolute_url()``
-------------------------------------

URLs can only contain ASCII characters. If you're constructing a URL from
pieces of data that might be non-ASCII, be careful to encode the results in a
way that is suitable for a URL. The :func:`~django.urls.reverse` function
handles this for you automatically.

If you're constructing a URL manually (i.e., *not* using the ``reverse()``
function), you'll need to take care of the encoding yourself. In this case,
use the ``iri_to_uri()`` and ``urlquote()`` functions that were documented
above_. For example::

    from django.utils.encoding import iri_to_uri
    from django.utils.http import urlquote

    def get_absolute_url(self):
        url = '/person/%s/?x=0&y=0' % urlquote(self.location)
        return iri_to_uri(url)

This function returns a correctly encoded URL even if ``self.location`` is
something like "Jack visited Paris & Orléans". (In fact, the ``iri_to_uri()``
call isn't strictly necessary in the above example, because all the
non-ASCII characters would have been removed in quoting in the first line.)

.. _above: `URI and IRI handling`_

The database API
================

You can pass either Unicode strings or UTF-8 bytestrings as arguments to
``filter()`` methods and the like in the database API. The following two
querysets are identical::

    from __future__ import unicode_literals

    qs = People.objects.filter(name__contains='Å')
    qs = People.objects.filter(name__contains=b'\xc3\x85') # UTF-8 encoding of Å

Templates
=========

You can use either Unicode or bytestrings when creating templates manually::

    from __future__ import unicode_literals
    from django.template import Template

    t1 = Template(b'This is a bytestring template.')
    t2 = Template('This is a Unicode template.')

But the common case is to read templates from the filesystem, and this creates
a slight complication: not all filesystems store their data encoded as UTF-8.
If your template files are not stored with a UTF-8 encoding, set the :setting:`FILE_CHARSET`
setting to the encoding of the files on disk. When Django reads in a template
file, it will convert the data from this encoding to Unicode. (:setting:`FILE_CHARSET`
is set to ``'utf-8'`` by default.)

The :setting:`DEFAULT_CHARSET` setting controls the encoding of rendered templates.
This is set to UTF-8 by default.

Template tags and filters
-------------------------

A couple of tips to remember when writing your own template tags and filters:

* Always return Unicode strings from a template tag's ``render()`` method
  and from template filters.

* Use ``force_text()`` in preference to ``smart_text()`` in these
  places. Tag rendering and filter calls occur as the template is being
  rendered, so there is no advantage to postponing the conversion of lazy
  translation objects into strings. It's easier to work solely with Unicode
  strings at that point.

.. _unicode-files:

Files
=====

If you intend to allow users to upload files, you must ensure that the
environment used to run Django is configured to work with non-ASCII file names.
If your environment isn't configured correctly, you'll encounter
``UnicodeEncodeError`` exceptions when saving files with file names that
contain non-ASCII characters.

Filesystem support for UTF-8 file names varies and might depend on the
environment. Check your current configuration in an interactive Python shell by
running::

    import sys
    sys.getfilesystemencoding()

This should output "UTF-8".

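To see why a non-UTF-8 environment leads to ``UnicodeEncodeError``, the check
above can be extended into a small test of a specific file name. This is a
sketch using only the Python 3 standard library; the helper name
``can_store_filename`` is hypothetical.

```python
import sys


def can_store_filename(name, encoding=None):
    """Return True if ``name`` can be encoded with the given encoding.

    When no encoding is passed, the interpreter's filesystem encoding is used.
    """
    encoding = encoding or sys.getfilesystemencoding()
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


filename = "Une pièce jointe.pdf"

print(can_store_filename(filename, "utf-8"))  # every character is encodable
print(can_store_filename(filename, "ascii"))  # 'è' has no ASCII representation
print(can_store_filename(filename))           # depends on your environment
```
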
The ``LANG`` environment variable is responsible for setting the expected
encoding on Unix platforms. Consult the documentation for your operating system
and application server for the appropriate syntax and location to set this
variable.

In your development environment, you might need to add a setting to your
``~/.bashrc`` analogous to::

    export LANG="en_US.UTF-8"

Email
=====

Django's email framework (in ``django.core.mail``) supports Unicode
transparently. You can use Unicode data in the message bodies and any headers.
However, you're still obligated to respect the requirements of the email
specifications, so, for example, email addresses should use only ASCII
characters.

The following code example demonstrates that everything except email addresses
can be non-ASCII::

    from __future__ import unicode_literals
    from django.core.mail import EmailMessage

    subject = 'My visit to Sør-Trøndelag'
    sender = 'Arnbjörg Ráðormsdóttir <arnbjorg@example.com>'
    recipients = ['Fred <fred@example.com>']
    body = '...'
    msg = EmailMessage(subject, body, sender, recipients)
    msg.attach("Une pièce jointe.pdf", "%PDF-1.4.%...", mimetype="application/pdf")
    msg.send()

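Email headers may only carry ASCII on the wire, so a non-ASCII subject has to
be wrapped in an encoded form before sending. The sketch below shows that
round trip with Python 3's standard ``email.header`` module. It is only an
illustration of the general mechanism and makes no claim about how
``django.core.mail`` performs the encoding internally.

```python
from email.header import Header, decode_header, make_header

subject = "My visit to Sør-Trøndelag"

# Encode the header value; the result contains only ASCII characters.
encoded = Header(subject, "utf-8").encode()

# A receiving mail client reverses the process to recover the original text.
decoded = str(make_header(decode_header(encoded)))

print(encoded)
print(decoded == subject)
```
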
Form submission
===============

HTML form submission is a tricky area. There's no guarantee that the
submission will include encoding information, which means the framework might
have to guess at the encoding of submitted data.

Django adopts a "lazy" approach to decoding form data. The data in an
``HttpRequest`` object is only decoded when you access it. In fact, most of
the data is not decoded at all. Only the ``HttpRequest.GET`` and
``HttpRequest.POST`` data structures have any decoding applied to them. Those
two fields will return their members as Unicode data. All other attributes and
methods of ``HttpRequest`` return data exactly as it was submitted by the
client.

By default, the :setting:`DEFAULT_CHARSET` setting is used as the assumed encoding
for form data. If you need to change this for a particular form, you can set
the ``encoding`` attribute on an ``HttpRequest`` instance. For example::

    def some_view(request):
        # We know that the data must be encoded as KOI8-R (for some reason).
        request.encoding = 'koi8-r'
        ...

You can even change the encoding after having accessed ``request.GET`` or
``request.POST``, and all subsequent accesses will use the new encoding.

Most developers won't need to worry about changing form encoding, but this is
a useful feature for applications that talk to legacy systems whose encoding
you cannot control.

Django does not decode the data of file uploads, because that data is normally
treated as collections of bytes, rather than strings. Any automatic decoding
there would alter the meaning of the stream of bytes.

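The effect of the assumed encoding on submitted data can be demonstrated with
the Python 3 standard library. In this sketch, ``decode_form`` is a
hypothetical helper built on ``urllib.parse.parse_qs``; it stands in for the
decoding step only and is not how ``HttpRequest`` is implemented.

```python
from urllib.parse import parse_qs, quote


def decode_form(raw_query, encoding="utf-8"):
    """Decode a URL-encoded query string using the given character encoding."""
    return parse_qs(raw_query, encoding=encoding, errors="replace")


# Simulate a legacy client that submits its form data encoded as KOI8-R.
raw = "q=" + quote("привет", encoding="koi8-r")

print(decode_form(raw, "koi8-r"))  # the correct encoding recovers the text
print(decode_form(raw))            # assuming UTF-8 yields replacement characters
```

The same raw bytes produce different text depending on the assumed encoding,
which is why the ``encoding`` attribute matters when talking to legacy
systems.
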