123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181 |
- # line_ending.py -- Line ending conversion functions
- # Copyright (C) 2018-2018 Boris Feld <boris.feld@comet.ml>
- #
- # Dulwich is dual-licensed under the Apache License, Version 2.0 and the GNU
- # General Public License as public by the Free Software Foundation; version 2.0
- # or (at your option) any later version. You can redistribute it and/or
- # modify it under the terms of either of these two licenses.
- #
- # Unless required by applicable law or agreed to in writing, software
- # distributed under the License is distributed on an "AS IS" BASIS,
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- # See the License for the specific language governing permissions and
- # limitations under the License.
- #
- # You should have received a copy of the licenses; if not, see
- # <http://www.gnu.org/licenses/> for a copy of the GNU General Public License
- # and <http://www.apache.org/licenses/LICENSE-2.0> for a copy of the Apache
- # License, Version 2.0.
- #
- """ All line-ending related functions, from conversions to config processing
- Line-ending normalization is a complex beast. Here is some notes and details
- about how it seems to work.
- The normalization is a two-fold process that happens at two moments:
- - When reading a file from the index and to the working directory. For example
- when doing a `git clone` or `git checkout` call. We call this process the
- read filter in this module.
- - When writing a file to the index from the working directory. For example
- when doing a `git add` call. We call this process the write filter in this
- module.
- One thing to know is that Git does line-ending normalization only on text
- files. How does Git know that a file is text? We can either mark a file as a
- text file, a binary file or ask Git to automatically decides. Git has an
- heuristic to detect if a file is a text file or a binary file. It seems based
- on the percentage of non-printable characters in files.
- The code for this heuristic is here:
- https://git.kernel.org/pub/scm/git/git.git/tree/convert.c#n46
- Dulwich have an implementation with a slightly different heuristic, the
- `is_binary` function in `dulwich.patch`.
- The binary detection heuristic implementation is close to the one in JGit:
- https://github.com/eclipse/jgit/blob/f6873ffe522bbc3536969a3a3546bf9a819b92bf/org.eclipse.jgit/src/org/eclipse/jgit/diff/RawText.java#L300
- There is multiple variables that impact the normalization.
- First, a repository can contains a `.gitattributes` file (or more than one...)
- that can further customize the operation on some file patterns, for example:
- *.txt text
- Force all `.txt` files to be treated as text files and to have their lines
- endings normalized.
- *.jpg -text
- Force all `.jpg` files to be treated as binary files and to not have their
- lines endings converted.
- *.vcproj text eol=crlf
- Force all `.vcproj` files to be treated as text files and to have their lines
- endings converted into `CRLF` in working directory no matter the native EOL of
- the platform.
- *.sh text eol=lf
- Force all `.sh` files to be treated as text files and to have their lines
- endings converted into `LF` in working directory no matter the native EOL of
- the platform.
- If the `eol` attribute is not defined, Git uses the `core.eol` configuration
- value described later.
- * text=auto
- Force all files to be scanned by the text file heuristic detection and to have
- their line endings normalized in case they are detected as text files.
- Git also have a obsolete attribute named `crlf` that can be translated to the
- corresponding text attribute value.
- Then there are some configuration option (that can be defined at the
- repository or user level):
- - core.autocrlf
- - core.eol
- `core.autocrlf` is taken into account for all files that doesn't have a `text`
- attribute defined in `.gitattributes`; it takes three possible values:
- - `true`: This forces all files on the working directory to have CRLF
- line-endings in the working directory and convert line-endings to LF
- when writing to the index. When autocrlf is set to true, eol value is
- ignored.
- - `input`: Quite similar to the `true` value but only force the write
- filter, ie line-ending of new files added to the index will get their
- line-endings converted to LF.
- - `false` (default): No normalization is done.
- `core.eol` is the top-level configuration to define the line-ending to use
- when applying the read_filer. It takes three possible values:
- - `lf`: When normalization is done, force line-endings to be `LF` in the
- working directory.
- - `crlf`: When normalization is done, force line-endings to be `CRLF` in
- the working directory.
- - `native` (default): When normalization is done, force line-endings to be
- the platform's native line ending.
- One thing to remember is when line-ending normalization is done on a file, Git
- always normalize line-ending to `LF` when writing to the index.
- There are sources that seems to indicate that Git won't do line-ending
- normalization when a file contains mixed line-endings. I think this logic
- might be in text / binary detection heuristic but couldn't find it yet.
- Sources:
- - https://git-scm.com/docs/git-config#git-config-coreeol
- - https://git-scm.com/docs/git-config#git-config-coreautocrlf
- - https://git-scm.com/docs/gitattributes#_checking_out_and_checking_in
- - https://adaptivepatchwork.com/2012/03/01/mind-the-end-of-your-line/
- """
- CRLF = b"\r\n"
- LF = b"\n"
- def convert_crlf_to_lf(text_hunk):
- """Convert CRLF in text hunk into LF
- :param text_hunk: A bytes string representing a text hunk
- :return: The text hunk with the same type, with CRLF replaced into LF
- """
- return text_hunk.replace(CRLF, LF)
- def convert_lf_to_crlf(text_hunk):
- """Convert LF in text hunk into CRLF
- :param text_hunk: A bytes string representing a text hunk
- :return: The text hunk with the same type, with LF replaced into CRLF
- """
- # TODO find a more efficient way of doing it
- intermediary = text_hunk.replace(CRLF, LF)
- return intermediary.replace(LF, CRLF)
- def get_checkout_filter_autocrlf(core_autocrlf):
- """ Returns the correct checkout filter base on autocrlf value
- :param core_autocrlf: The bytes configuration value of core.autocrlf.
- Valid values are: b'true', b'false' or b'input'.
- :return: Either None if no filter has to be applied or a function
- accepting a single argument, a binary text hunk
- """
- if core_autocrlf == b"true":
- return convert_lf_to_crlf
- return None
- def get_checkin_filter_autocrlf(core_autocrlf):
- """ Returns the correct checkin filter base on autocrlf value
- :param core_autocrlf: The bytes configuration value of core.autocrlf.
- Valid values are: b'true', b'false' or b'input'.
- :return: Either None if no filter has to be applied or a function
- accepting a single argument, a binary text hunk
- """
- if core_autocrlf == b"true" or core_autocrlf == b"input":
- return convert_crlf_to_lf
- # Checking filter should never be `convert_lf_to_crlf`
- return None
|