Encoding
========

You will notice that all lower-level functions in Dulwich take byte strings
rather than Unicode strings. This is intentional.

Although `C git`_ recommends the use of UTF-8 for encoding, this is not
strictly enforced and C git treats filenames as sequences of non-NUL bytes.
There are repositories in the wild that use non-UTF-8 encodings for filenames
and commit messages.

.. _C git: https://github.com/git/git/blob/master/Documentation/i18n.txt
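
As a small illustration of why such names cannot simply be decoded, the
following sketch uses only the Python standard library. The filename is a
made-up example and no Dulwich API is involved; it shows a Latin-1 encoded
name that is not valid UTF-8 but is preserved exactly when kept as bytes.

.. code-block:: python

    # Hypothetical filename, encoded as Latin-1 the way an older
    # repository might have stored it.
    name = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt'

    # Decoding as UTF-8 fails: 0xE9 starts a multi-byte sequence, but the
    # next byte ('.') is not a valid continuation byte.
    try:
        name.decode("utf-8")
        decodable = True
    except UnicodeDecodeError:
        decodable = False

    # Kept as bytes, the name is untouched and contains no NUL byte.
    assert b"\x00" not in name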

The library should be able to read *all* existing git repositories,
regardless of what encoding they use. This is the main reason why Dulwich
does not convert paths to Unicode strings.

A further consideration is that converting back and forth between bytes and
Unicode incurs a performance penalty. For example, if you are just iterating
over file contents, there is no need to decode any strings at all. Users of
the library may also be able to make specific assumptions about the encoding;
for example, they could decide that all their data is Latin-1, or that it
uses Python's default encoding.
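
A minimal sketch of that idea, assuming the caller knows their data is
Latin-1. The helper name ``decode_path`` and the sample path are illustrative
only and are not part of Dulwich.

.. code-block:: python

    def decode_path(path: bytes, encoding: str = "latin-1") -> str:
        """Decode a byte-string path using an encoding chosen by the caller."""
        return path.decode(encoding)

    # Hypothetical path as it might come back from a lower-level function.
    raw_path = b"docs/r\xe9sum\xe9.txt"
    text_path = decode_path(raw_path)

    # Latin-1 maps every byte to exactly one code point, so encoding the
    # result again yields the original bytes.
    assert text_path.encode("latin-1") == raw_path

Decoding only at the point where text is actually needed keeps the rest of
the code working on the original bytes.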

Higher-level functions, such as the porcelain in ``dulwich.porcelain``,
automatically convert Unicode strings to UTF-8 byte strings.
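
The conversion can be pictured with the following sketch. The helper
``to_utf8_bytes`` is hypothetical and written purely for illustration; it is
not Dulwich's actual implementation, only an assumption about the general
shape of such a conversion.

.. code-block:: python

    def to_utf8_bytes(value):
        """Encode str as UTF-8 and pass bytes through unchanged."""
        if isinstance(value, str):
            return value.encode("utf-8")
        return value

    # A Unicode commit message becomes a UTF-8 byte string.
    message = to_utf8_bytes("Add r\u00e9sum\u00e9")

    # Bytes are left alone, so callers can still supply names in any encoding.
    already_bytes = to_utf8_bytes(b"caf\xe9.txt")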