
Add code for computing which line-ending to use based on configuration

The implementation is based on my interpretation of several sources of documentation:
- man gitconfig for core.eol and core.autocrlf
- man gitattributes for line-ending conversion
- https://adaptivepatchwork.com/2012/03/01/mind-the-end-of-your-line/

It doesn't support gitattributes overriding; it should be easy enough to add
the logic in the functions introduced in the next commit.
Boris Feld, 6 years ago
Parent
Current commit
ba3f97d1a5
2 files changed, 175 insertions and 3 deletions
  1. dulwich/line_ending.py (+137 -1)
  2. dulwich/tests/test_line_ending.py (+38 -2)

+ 137 - 1
dulwich/line_ending.py

@@ -17,8 +17,113 @@
 # and <http://www.apache.org/licenses/LICENSE-2.0> for a copy of the Apache
 # License, Version 2.0.
 #
-
 """ All line-ending related functions, from conversions to config processing
+
+Line-ending normalization is a complex beast. Here are some notes and details
+about how it seems to work.
+
+The normalization is a two-fold process that happens at two moments:
+
+- When reading a file from the index into the working directory. For example
+  when doing a `git clone` or `git checkout` call. We call this process the
+  read filter in this module.
+- When writing a file to the index from the working directory. For example
+  when doing a `git add` call. We call this process the write filter in this
+  module.
+
+One thing to know is that Git does line-ending normalization only on text
+files. How does Git know that a file is text? We can either mark a file as a
+text file, a binary file or ask Git to decide automatically. Git has a
+heuristic to detect if a file is a text file or a binary file. It seems to be
+based on the percentage of non-printable characters in files.
+
+The code for this heuristic is here:
+https://git.kernel.org/pub/scm/git/git.git/tree/convert.c#n46
+
+Dulwich has an implementation with a slightly different heuristic, the
+`is_binary` function in `dulwich.patch`.
+
+The binary detection heuristic implementation is close to the one in JGit:
+https://github.com/eclipse/jgit/blob/f6873ffe522bbc3536969a3a3546bf9a819b92bf/org.eclipse.jgit/src/org/eclipse/jgit/diff/RawText.java#L300
+
+There are multiple variables that impact the normalization.
+
+First, a repository can contain a `.gitattributes` file (or more than one...)
+that can further customize the operation on some file patterns, for example:
+
+    *.txt text
+
+Force all `.txt` files to be treated as text files and to have their line
+endings normalized.
+
+    *.jpg -text
+
+Force all `.jpg` files to be treated as binary files and to not have their
+line endings converted.
+
+    *.vcproj text eol=crlf
+
+Force all `.vcproj` files to be treated as text files and to have their line
+endings converted into `CRLF` in the working directory, no matter the native
+EOL of the platform.
+
+    *.sh text eol=lf
+
+Force all `.sh` files to be treated as text files and to have their line
+endings converted into `LF` in the working directory, no matter the native
+EOL of the platform.
+
+If the `eol` attribute is not defined, Git uses the `core.eol` configuration
+value described later.
+
+    * text=auto
+
+Force all files to be scanned by the text file heuristic detection and to have
+their line endings normalized in case they are detected as text files.
+
+Git also has an obsolete attribute named `crlf` that can be translated to the
+corresponding text attribute value.
+
+Then there are some configuration options (that can be defined at the
+repository or user level):
+
+- core.autocrlf
+- core.eol
+
+`core.autocrlf` is taken into account for all files that don't have a `text`
+attribute defined in `.gitattributes`; it takes three possible values:
+
+    - `true`: This forces all files to have CRLF line-endings in the working
+      directory and converts line-endings to LF when writing to the index.
+      When `core.autocrlf` is set to `true`, the `core.eol` value is
+      ignored.
+    - `input`: Quite similar to the `true` value but only forces the write
+      filter, i.e. files added to the index will get their line-endings
+      converted to LF.
+    - `false` (default): No normalization is done.
+
+`core.eol` is the top-level configuration to define the line-ending to use
+when applying the read filter. It takes three possible values:
+
+    - `lf`: When normalization is done, force line-endings to be `LF` in the
+      working directory.
+    - `crlf`: When normalization is done, force line-endings to be `CRLF` in
+      the working directory.
+    - `native` (default): When normalization is done, force line-endings to be
+      the platform's native line ending.
+
+One thing to remember is that when line-ending normalization is done on a
+file, Git always normalizes line-endings to `LF` when writing to the index.
+
+There are sources that seem to indicate that Git won't do line-ending
+normalization when a file contains mixed line-endings. I think this logic
+might be in the text / binary detection heuristic but couldn't find it yet.
+
+Sources:
+- https://git-scm.com/docs/git-config#git-config-coreeol
+- https://git-scm.com/docs/git-config#git-config-coreautocrlf
+- https://git-scm.com/docs/gitattributes#_checking_out_and_checking_in
+- https://adaptivepatchwork.com/2012/03/01/mind-the-end-of-your-line/
 """
 
 CRLF = b"\r\n"
@@ -43,3 +148,34 @@ def convert_lf_to_crlf(text_hunk):
     # TODO find a more efficient way of doing it
     intermediary = text_hunk.replace(CRLF, LF)
     return intermediary.replace(LF, CRLF)
+
+
+def get_checkout_filter_autocrlf(core_autocrlf):
+    """ Returns the correct checkout filter based on autocrlf value
+
+    :param core_autocrlf: The bytes configuration value of core.autocrlf.
+        Valid values are: b'true', b'false' or b'input'.
+    :return: Either None if no filter has to be applied or a function
+        accepting a single argument, a binary text hunk
+    """
+
+    if core_autocrlf == b"true":
+        return convert_lf_to_crlf
+
+    return None
+
+
+def get_checkin_filter_autocrlf(core_autocrlf):
+    """ Returns the correct checkin filter based on autocrlf value
+
+    :param core_autocrlf: The bytes configuration value of core.autocrlf.
+        Valid values are: b'true', b'false' or b'input'.
+    :return: Either None if no filter has to be applied or a function
+        accepting a single argument, a binary text hunk
+    """
+
+    if core_autocrlf == b"true" or core_autocrlf == b"input":
+        return convert_crlf_to_lf
+
+    # Checkin filter should never be `convert_lf_to_crlf`
+    return None
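
The `core.eol` values listed in the module docstring can be mapped to concrete byte sequences. The following is a minimal sketch and not code from this commit: `resolve_core_eol` is a hypothetical name, and it deliberately ignores `core.autocrlf` and `.gitattributes`, which the docstring says can take precedence.

```python
import os

CRLF = b"\r\n"
LF = b"\n"


def resolve_core_eol(core_eol=b"native"):
    """Hypothetical helper (not part of this commit).

    Maps a bytes `core.eol` value to the line ending that the read
    filter would write into the working directory.
    """
    if core_eol == b"crlf":
        return CRLF
    if core_eol == b"lf":
        return LF
    if core_eol == b"native":
        # Assumption: the platform's native ending is what Python
        # reports through os.linesep.
        return os.linesep.encode("ascii")
    raise ValueError("unknown core.eol value: %r" % (core_eol,))
```

Raising on unknown values is a design choice made for this sketch only; how the real configuration parser treats invalid values is not covered by the commit.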

+ 38 - 2
dulwich/tests/test_line_ending.py

@@ -22,8 +22,12 @@
 
 """Tests for the line ending conversion."""
 
-from dulwich.line_ending import convert_crlf_to_lf, convert_lf_to_crlf
-
+from dulwich.line_ending import (
+    convert_crlf_to_lf,
+    convert_lf_to_crlf,
+    get_checkin_filter_autocrlf,
+    get_checkout_filter_autocrlf,
+)
 from dulwich.tests import TestCase
 
 
@@ -55,3 +59,35 @@ class LineEndingConversion(TestCase):
         self.assertEqual(
             convert_lf_to_crlf(b"line1\r\n\nline2"), b"line1\r\n\r\nline2"
         )
+
+
+class GetLineEndingAutocrlfFilters(TestCase):
+    def test_get_checkin_filter_autocrlf_default(self):
+        checkin_filter = get_checkin_filter_autocrlf(b"false")
+
+        self.assertEqual(checkin_filter, None)
+
+    def test_get_checkin_filter_autocrlf_true(self):
+        checkin_filter = get_checkin_filter_autocrlf(b"true")
+
+        self.assertEqual(checkin_filter, convert_crlf_to_lf)
+
+    def test_get_checkin_filter_autocrlf_input(self):
+        checkin_filter = get_checkin_filter_autocrlf(b"input")
+
+        self.assertEqual(checkin_filter, convert_crlf_to_lf)
+
+    def test_get_checkout_filter_autocrlf_default(self):
+        checkout_filter = get_checkout_filter_autocrlf(b"false")
+
+        self.assertEqual(checkout_filter, None)
+
+    def test_get_checkout_filter_autocrlf_true(self):
+        checkout_filter = get_checkout_filter_autocrlf(b"true")
+
+        self.assertEqual(checkout_filter, convert_lf_to_crlf)
+
+    def test_get_checkout_filter_autocrlf_input(self):
+        checkout_filter = get_checkout_filter_autocrlf(b"input")
+
+        self.assertEqual(checkout_filter, None)
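
To see how the two filter getters fit together, the sketch below checks a blob out and back in for each `core.autocrlf` value. It is self-contained, so it redefines the functions instead of importing `dulwich.line_ending`. The body of `convert_crlf_to_lf` is an assumption (the diff does not show it), and `apply_filter` and `round_trip` are hypothetical helpers introduced only for illustration.

```python
CRLF = b"\r\n"
LF = b"\n"


def convert_crlf_to_lf(text_hunk):
    # Assumed body: this function is imported by the tests but its
    # implementation is not shown in the diff.
    return text_hunk.replace(CRLF, LF)


def convert_lf_to_crlf(text_hunk):
    # Same two-step approach as in the diff: normalize to LF first so
    # that existing CRLF sequences do not become CR CR LF.
    return text_hunk.replace(CRLF, LF).replace(LF, CRLF)


def get_checkout_filter_autocrlf(core_autocrlf):
    # Only `true` converts on checkout (index -> working directory).
    return convert_lf_to_crlf if core_autocrlf == b"true" else None


def get_checkin_filter_autocrlf(core_autocrlf):
    # Both `true` and `input` convert on checkin (working directory -> index).
    if core_autocrlf in (b"true", b"input"):
        return convert_crlf_to_lf
    return None


def apply_filter(line_filter, blob):
    """Hypothetical helper: apply a filter, treating None as a no-op."""
    return blob if line_filter is None else line_filter(blob)


def round_trip(core_autocrlf, index_blob):
    """Hypothetical helper: check out, then check in, a single blob."""
    checkout = get_checkout_filter_autocrlf(core_autocrlf)
    working = apply_filter(checkout, index_blob)
    checkin = get_checkin_filter_autocrlf(core_autocrlf)
    stored = apply_filter(checkin, working)
    return working, stored


if __name__ == "__main__":
    for value in (b"true", b"input", b"false"):
        print(value, round_trip(value, b"line1\nline2\n"))
```

With `b"true"`, the working copy gets CRLF endings and the stored blob returns to LF, which matches the rule in the docstring that the index always holds `LF` once normalization applies.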