# line_ending.py -- Line ending conversion functions
# Copyright (C) 2018-2018 Boris Feld <boris.feld@comet.ml>
#
# Dulwich is dual-licensed under the Apache License, Version 2.0 and the GNU
# General Public License as published by the Free Software Foundation; version 2.0
# or (at your option) any later version. You can redistribute it and/or
# modify it under the terms of either of these two licenses.
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# You should have received a copy of the licenses; if not, see
# <http://www.gnu.org/licenses/> for a copy of the GNU General Public License
# and <http://www.apache.org/licenses/LICENSE-2.0> for a copy of the Apache
# License, Version 2.0.
#
"""All line-ending related functions, from conversions to config processing.

Line-ending normalization is a complex beast. Here are some notes and details
about how it seems to work.

The normalization is a two-fold process that happens at two moments:

- When reading a file from the index into the working directory, for example
  when doing a `git clone` or `git checkout` call. We call this process the
  read filter in this module.
- When writing a file to the index from the working directory, for example
  when doing a `git add` call. We call this process the write filter in this
  module.

Note that when checking status (getting unstaged changes), whether or not
normalization is done on write depends on whether or not the file in the
working dir has also been normalized on read:

- For autocrlf=true, all files are always normalized on both read and write.
- For autocrlf=input, files are only normalized on write if they are newly
  "added". Since files which are already committed are not normalized on
  checkout into the working tree, they are also left alone when staging
  modifications into the index.

One thing to know is that Git does line-ending normalization only on text
files. How does Git know that a file is text? We can either mark a file as a
text file or a binary file, or ask Git to decide automatically. Git has a
heuristic to detect if a file is a text file or a binary file. It seems based
on the percentage of non-printable characters in files.

The code for this heuristic is here:
https://git.kernel.org/pub/scm/git/git.git/tree/convert.c#n46

Dulwich has an implementation with a slightly different heuristic, the
`is_binary` function in `dulwich.patch`.

The binary detection heuristic implementation is close to the one in JGit:
https://github.com/eclipse/jgit/blob/f6873ffe522bbc3536969a3a3546bf9a819b92bf/org.eclipse.jgit/src/org/eclipse/jgit/diff/RawText.java#L300

There are multiple variables that impact the normalization.

First, a repository can contain a `.gitattributes` file (or more than one...)
that can further customize the operation on some file patterns, for example:

    *.txt text

Force all `.txt` files to be treated as text files and to have their line
endings normalized.

    *.jpg -text

Force all `.jpg` files to be treated as binary files and to not have their
line endings converted.

    *.vcproj text eol=crlf

Force all `.vcproj` files to be treated as text files and to have their line
endings converted into `CRLF` in the working directory no matter the native
EOL of the platform.

    *.sh text eol=lf

Force all `.sh` files to be treated as text files and to have their line
endings converted into `LF` in the working directory no matter the native EOL
of the platform.

If the `eol` attribute is not defined, Git uses the `core.eol` configuration
value described later.

    * text=auto

Force all files to be scanned by the text file heuristic detection and to have
their line endings normalized in case they are detected as text files.

Git also has an obsolete attribute named `crlf` that can be translated to the
corresponding text attribute value.

Then there are some configuration options (that can be defined at the
repository or user level):

- core.autocrlf
- core.eol

`core.autocrlf` is taken into account for all files that don't have a `text`
attribute defined in `.gitattributes`; it takes three possible values:

- `true`: This forces all files in the working directory to have CRLF
  line-endings and converts line-endings to LF when writing to the index.
  When autocrlf is set to true, the eol value is ignored.
- `input`: Quite similar to the `true` value but only forces the write
  filter, i.e. new files added to the index will get their line-endings
  converted to LF.
- `false` (default): No normalization is done.

`core.eol` is the top-level configuration to define the line-ending to use
when applying the read filter. It takes three possible values:

- `lf`: When normalization is done, force line-endings to be `LF` in the
  working directory.
- `crlf`: When normalization is done, force line-endings to be `CRLF` in
  the working directory.
- `native` (default): When normalization is done, force line-endings to be
  the platform's native line ending.

One thing to remember is that when line-ending normalization is done on a
file, Git always normalizes line-endings to `LF` when writing to the index.

There are sources that seem to indicate that Git won't do line-ending
normalization when a file contains mixed line-endings. I think this logic
might be in the text / binary detection heuristic but couldn't find it yet.

Sources:

- https://git-scm.com/docs/git-config#git-config-coreeol
- https://git-scm.com/docs/git-config#git-config-coreautocrlf
- https://git-scm.com/docs/gitattributes#_checking_out_and_checking_in
- https://adaptivepatchwork.com/2012/03/01/mind-the-end-of-your-line/
"""

from dulwich.objects import Blob
from dulwich.patch import is_binary

CRLF = b"\r\n"
LF = b"\n"


def convert_crlf_to_lf(text_hunk):
    """Convert CRLF in text hunk into LF

    Args:
      text_hunk: A bytes string representing a text hunk
    Returns: The text hunk with the same type, with CRLF replaced by LF
    """
    return text_hunk.replace(CRLF, LF)


def convert_lf_to_crlf(text_hunk):
    """Convert LF in text hunk into CRLF

    Args:
      text_hunk: A bytes string representing a text hunk
    Returns: The text hunk with the same type, with LF replaced by CRLF
    """
    # TODO find a more efficient way of doing it
    # Collapse existing CRLF into LF first, so that lines already ending in
    # CRLF are not expanded into CR CR LF by the second replacement.
    intermediary = text_hunk.replace(CRLF, LF)
    return intermediary.replace(LF, CRLF)


def get_checkout_filter(core_eol, core_autocrlf, git_attributes):
    """Returns the correct checkout filter based on the passed arguments"""
    # TODO this function should process the git_attributes for the path and if
    # the text attribute is not defined, fallback on the
    # get_checkout_filter_autocrlf function with the autocrlf value
    return get_checkout_filter_autocrlf(core_autocrlf)


def get_checkin_filter(core_eol, core_autocrlf, git_attributes):
    """Returns the correct checkin filter based on the passed arguments"""
    # TODO this function should process the git_attributes for the path and if
    # the text attribute is not defined, fallback on the
    # get_checkin_filter_autocrlf function with the autocrlf value
    return get_checkin_filter_autocrlf(core_autocrlf)


def get_checkout_filter_autocrlf(core_autocrlf):
    """Returns the correct checkout filter based on autocrlf value

    Args:
      core_autocrlf: The bytes configuration value of core.autocrlf.
        Valid values are: b'true', b'false' or b'input'.
    Returns: Either None if no filter has to be applied or a function
        accepting a single argument, a binary text hunk
    """
    if core_autocrlf == b"true":
        return convert_lf_to_crlf

    return None


def get_checkin_filter_autocrlf(core_autocrlf):
    """Returns the correct checkin filter based on autocrlf value

    Args:
      core_autocrlf: The bytes configuration value of core.autocrlf.
        Valid values are: b'true', b'false' or b'input'.
    Returns: Either None if no filter has to be applied or a function
        accepting a single argument, a binary text hunk
    """
    if core_autocrlf == b"true" or core_autocrlf == b"input":
        return convert_crlf_to_lf

    # The checkin filter should never be `convert_lf_to_crlf`
    return None


class BlobNormalizer(object):
    """An object that stores the result of computing which filter to apply,
    based on configuration, gitattributes, path and operation (checkin or
    checkout)
    """

    def __init__(self, config_stack, gitattributes):
        self.config_stack = config_stack
        self.gitattributes = gitattributes

        # Compute which filters we need based on parameters
        try:
            core_eol = config_stack.get("core", "eol")
        except KeyError:
            core_eol = "native"

        try:
            core_autocrlf = config_stack.get("core", "autocrlf").lower()
        except KeyError:
            # core.autocrlf is unset: no autocrlf filter will be selected
            core_autocrlf = False

        self.fallback_read_filter = get_checkout_filter(
            core_eol, core_autocrlf, self.gitattributes
        )
        self.fallback_write_filter = get_checkin_filter(
            core_eol, core_autocrlf, self.gitattributes
        )

    def checkin_normalize(self, blob, tree_path):
        """Normalize a blob during a checkin operation"""
        if self.fallback_write_filter is not None:
            return normalize_blob(
                blob, self.fallback_write_filter, binary_detection=True
            )

        return blob

    def checkout_normalize(self, blob, tree_path):
        """Normalize a blob during a checkout operation"""
        if self.fallback_read_filter is not None:
            return normalize_blob(
                blob, self.fallback_read_filter, binary_detection=True
            )

        return blob


def normalize_blob(blob, conversion, binary_detection):
    """Takes a blob as input and returns either the original blob, if
    binary_detection is True and the blob content looks like binary, or
    a new blob with converted data
    """
    # Read the original blob
    data = blob.data

    # If we need to detect if a file is binary and the file is detected as
    # binary, do not apply the conversion function and return the original
    # blob
    if binary_detection is True:
        if is_binary(data):
            return blob

    # Now apply the conversion
    converted_data = conversion(data)

    new_blob = Blob()
    new_blob.data = converted_data

    return new_blob


class TreeBlobNormalizer(BlobNormalizer):
    """A BlobNormalizer that also tracks which paths already exist in a tree"""

    def __init__(self, config_stack, git_attributes, object_store, tree=None):
        super().__init__(config_stack, git_attributes)
        if tree:
            self.existing_paths = {
                name
                for name, _, _ in object_store.iter_tree_contents(tree)
            }
        else:
            self.existing_paths = set()

    def checkin_normalize(self, blob, tree_path):
        # An existing file should only be normalized on checkin if it was
        # previously normalized on checkout
        if (
            self.fallback_read_filter is not None
            or tree_path not in self.existing_paths
        ):
            return super().checkin_normalize(blob, tree_path)
        return blob
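
The filter selection described in the module docstring can be sketched as follows. This is a minimal, standalone illustration and not part of the module: the two conversion functions are re-declared so the snippet runs without Dulwich installed, and `pick_filters` and `apply_filter` are hypothetical helper names introduced only for this example. It ignores `.gitattributes` and `core.eol`, as the current fallback implementation does.

```python
CRLF = b"\r\n"
LF = b"\n"


def convert_crlf_to_lf(text_hunk):
    # Same logic as the module function: every CRLF becomes LF.
    return text_hunk.replace(CRLF, LF)


def convert_lf_to_crlf(text_hunk):
    # Same logic as the module function: collapse CRLF first, then expand LF.
    return text_hunk.replace(CRLF, LF).replace(LF, CRLF)


def pick_filters(core_autocrlf):
    """Hypothetical helper: return (checkout_filter, checkin_filter).

    Mirrors get_checkout_filter_autocrlf and get_checkin_filter_autocrlf:
    only b"true" converts on checkout, while b"true" and b"input" both
    convert on checkin. Any other value selects no filter.
    """
    checkout = convert_lf_to_crlf if core_autocrlf == b"true" else None
    checkin = (
        convert_crlf_to_lf if core_autocrlf in (b"true", b"input") else None
    )
    return checkout, checkin


def apply_filter(conversion, data):
    """Hypothetical helper: apply a filter, or pass data through if None."""
    return data if conversion is None else conversion(data)


# A text hunk with mixed line endings.
mixed = b"one\r\ntwo\nthree\r\n"

for value in (b"true", b"input", b"false"):
    checkout, checkin = pick_filters(value)
    print(value, apply_filter(checkout, mixed), apply_filter(checkin, mixed))
```

With `b"true"` both directions convert, with `b"input"` only the checkin direction converts, and with `b"false"` the data passes through unchanged. Binary detection, which `normalize_blob` performs before converting, is left out of this sketch.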