Encoding
========

You will notice that all lower-level functions in Dulwich take byte strings
rather than unicode strings. This is intentional.

Although `C git`_ recommends the use of UTF-8 for encoding, this is not
strictly enforced and C git treats filenames as sequences of non-NUL bytes.
There are repositories in the wild that use non-UTF-8 encoding for filenames
and commit messages.


.. _C git: https://github.com/git/git/blob/master/Documentation/i18n.txt

The library should be able to read *all* existing git repositories,
regardless of what encoding they use. This is the main reason why Dulwich
does not convert paths to unicode strings.

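
To illustrate why an unconditional conversion would be unsafe, here is a
minimal sketch using only the Python standard library. The file name and the
helper ``try_decode_utf8`` are hypothetical and are not part of Dulwich; they
only show that a path written by a latin-1 client is a perfectly valid byte
string that cannot be decoded as UTF-8.

```python
# A file name as it might be stored by a client that used latin-1.
# (Hypothetical example data, not taken from a real repository.)
raw_path = "café.txt".encode("latin-1")  # b'caf\xe9.txt'


def try_decode_utf8(data: bytes):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Decoding as UTF-8 fails, but the bytes themselves are untouched and can
# still be compared, hashed, or written back exactly as they were read.
utf8_result = try_decode_utf8(raw_path)      # None
latin1_result = raw_path.decode("latin-1")   # only works if you know the encoding
```

A library that insisted on UTF-8 text would have to either reject such a path
or alter it, whereas keeping the bytes avoids both problems.
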
A further consideration is that converting back and forth to unicode
incurs an extra performance penalty. For example, if you are just iterating
over file contents, there is no need to decode any strings at all. Users of
the library may also have specific assumptions they can make about the
encoding: for example, they could decide that all their data is latin-1, or
that it uses the default Python encoding.

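
As a rough sketch of what such a caller-side decision could look like, the
snippet below decodes a byte string with an encoding chosen by the caller.
The helper name ``decode_path`` and the sample data are assumptions made for
illustration; Dulwich does not provide this function. The second half uses
Python's ``surrogateescape`` error handler, which is one way to obtain a text
value that can be converted back to the original bytes without loss.

```python
def decode_path(raw: bytes, encoding: str = "utf-8", errors: str = "strict") -> str:
    """Decode raw bytes using an encoding the caller has decided on."""
    return raw.decode(encoding, errors)


# Hypothetical path bytes from a repository the caller knows to be latin-1.
raw = b"r\xe9sum\xe9.txt"
as_latin1 = decode_path(raw, "latin-1")

# If the encoding is unknown, surrogateescape keeps undecodable bytes as
# lone surrogates so that encoding again restores the exact input.
lossless = decode_path(raw, "utf-8", "surrogateescape")
round_tripped = lossless.encode("utf-8", "surrogateescape")
```

Because the decision lives in the calling code, the cost of decoding is only
paid by applications that actually need text.
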
Higher-level functions, such as the porcelain in ``dulwich.porcelain``, will
automatically convert unicode strings to UTF-8 byte strings.
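
The conversion described above can be pictured with a small helper like the
one below. This is only a sketch of the general idea: ``to_utf8_bytes`` is a
hypothetical name, and the code is not taken from ``dulwich.porcelain``, whose
actual implementation may differ.

```python
from typing import Union


def to_utf8_bytes(value: Union[str, bytes]) -> bytes:
    """Encode text as UTF-8; pass byte strings through unchanged."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value


encoded = to_utf8_bytes("naïve.txt")   # text is encoded as UTF-8
untouched = to_utf8_bytes(b"\xff")     # bytes are returned as given
```

Passing bytes through unchanged means callers who already hold raw paths are
not forced through a decode and re-encode cycle.
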