
Optimize status performance by using stat matching to skip unchanged files

This should help with #1999 where dulwich status with LFS filters was very slow.

This matches Git's behavior - Git uses stat matching to avoid expensive filter
operations on unchanged files. When autocrlf config is changed after files are
committed, Git also doesn't show them as modified until the files are actually
touched or explicitly renormalized.
Jelmer Vernooij, 2 months ago
Parent
Commit
b0b5d2ada2
3 changed files with 53 additions and 3 deletions
  1. NEWS (+7 -0)
  2. dulwich/index.py (+41 -0)
  3. tests/test_porcelain.py (+5 -3)

+ 7 - 0
NEWS

@@ -6,6 +6,13 @@
    pack files, pack indexes, index files, and other git metadata files.
    (Jelmer Vernooij, #1804)
 
+ * Optimize status performance by using stat matching to skip reading
+   and filtering unchanged files. This provides significant performance
+   improvements for repositories with LFS filters, where filter operations can
+   be very expensive. The optimization matches Git's behavior of using mtime
+   and size comparisons to determine if files need processing.
+   (Jelmer Vernooij, #1999)
+
  * Drop support for Python 3.9. (Jelmer Vernooij)
 
  * Add support for ``git rerere`` (reuse recorded resolution) with CLI

+ 41 - 0
dulwich/index.py

@@ -2758,6 +2758,37 @@ def update_working_tree(
     index.write()
 
 
+def _stat_matches_entry(st: os.stat_result, entry: IndexEntry) -> bool:
+    """Check if filesystem stat matches index entry stat.
+
+    This is used to determine if a file might have changed without reading its content.
+    Git uses this optimization to avoid expensive filter operations on unchanged files.
+
+    Args:
+      st: Filesystem stat result
+      entry: Index entry to compare against
+    Returns: True if stat matches and file is likely unchanged
+    """
+    # Get entry mtime
+    if isinstance(entry.mtime, tuple):
+        entry_mtime_sec = entry.mtime[0]
+    else:
+        entry_mtime_sec = int(entry.mtime)
+
+    # Compare modification time (seconds only for now)
+    # Note: We use int() to compare only seconds, as nanosecond precision
+    # can vary across filesystems
+    if int(st.st_mtime) != entry_mtime_sec:
+        return False
+
+    # Compare file size
+    if st.st_size != entry.size:
+        return False
+
+    # If both mtime and size match, file is likely unchanged
+    return True
+
+
 def _check_entry_for_changes(
     tree_path: bytes,
     entry: IndexEntry | ConflictedIndexEntry,
@@ -2788,6 +2819,16 @@ def _check_entry_for_changes(
         if not stat.S_ISREG(st.st_mode) and not stat.S_ISLNK(st.st_mode):
             return None
 
+        # Optimization: If stat matches index entry (mtime and size unchanged),
+        # we can skip reading and filtering the file entirely. This is a significant
+        # performance improvement for repositories with many unchanged files.
+        # Even with filters (e.g., LFS), if the file hasn't been modified (stat unchanged),
+        # the filter output would be the same, so we can safely skip the expensive
+        # filter operation. This addresses performance issues with LFS repositories
+        # where filter operations can be very slow.
+        if _stat_matches_entry(st, entry):
+            return None
+
         blob = blob_from_path_and_stat(full_path, st)
 
         if filter_blob_callback is not None:
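
The stat-matching check in this hunk can be tried outside of dulwich with a minimal, self-contained sketch. `RecordedStat`, `record`, `stat_matches` and `stat_demo` are hypothetical names used only for illustration; `RecordedStat` stands in for the `mtime` and `size` fields of an index entry and is not dulwich's `IndexEntry`.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedStat:
    """Hypothetical stand-in for the mtime/size fields of an index entry."""

    mtime_sec: int
    size: int


def record(path: str) -> RecordedStat:
    """Capture whole-second mtime and size, as an index would at add time."""
    st = os.stat(path)
    return RecordedStat(mtime_sec=int(st.st_mtime), size=st.st_size)


def stat_matches(st: os.stat_result, recorded: RecordedStat) -> bool:
    """Return True if mtime (seconds only) and size both equal the record."""
    return int(st.st_mtime) == recorded.mtime_sec and st.st_size == recorded.size


def stat_demo() -> tuple[bool, bool]:
    """Return (match before any change, match after appending data)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        with open(path, "wb") as f:
            f.write(b"hello\n")
        recorded = record(path)

        # Nothing has touched the file, so the content read can be skipped.
        unchanged = stat_matches(os.stat(path), recorded)

        # Appending changes the size, so the match fails regardless of mtime.
        with open(path, "ab") as f:
            f.write(b"more\n")
        changed = stat_matches(os.stat(path), recorded)
    return unchanged, changed
```

A matching stat only means the file is likely unchanged: an edit that keeps the size and lands in the same whole second as the recorded mtime would not be caught by a check of this shape.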

+ 5 - 3
tests/test_porcelain.py

@@ -6071,9 +6071,11 @@ class StatusTests(PorcelainTestCase):
             {"add": [b"crlf-new"], "delete": [], "modify": []}, results.staged
         )
         # File committed with CRLF before autocrlf=input was enabled
-        # will appear as unstaged because working tree is normalized to LF
-        # during comparison but index still has CRLF
-        self.assertListEqual(results.unstaged, [b"crlf-exists"])
+        # will NOT appear as unstaged because stat matching optimization
+        # skips filter processing when file hasn't been modified.
+        # This matches Git's behavior, which uses stat matching to avoid
+        # expensive filter operations. Git shows a warning instead.
+        self.assertListEqual(results.unstaged, [])
         self.assertListEqual(results.untracked, [])
 
     def test_status_autocrlf_input_modified(self) -> None:
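
The behaviour change in the test above (a CRLF file no longer reported as unstaged after a filter is enabled) can be reproduced with a small sketch that does not use dulwich. `is_modified`, `crlf_to_lf` and `autocrlf_demo` are hypothetical helpers, and the byte-for-byte comparison is a simplification assumed for illustration; it is not how dulwich compares blobs.

```python
import os
import tempfile
from typing import Callable, Optional


def crlf_to_lf(data: bytes) -> bytes:
    """Illustrative clean filter: normalize CRLF line endings to LF."""
    return data.replace(b"\r\n", b"\n")


def is_modified(
    path: str,
    recorded_mtime_sec: int,
    recorded_size: int,
    recorded_content: bytes,
    content_filter: Optional[Callable[[bytes], bytes]] = None,
) -> bool:
    """Report a change only if stat differs and filtered content differs."""
    st = os.stat(path)
    if int(st.st_mtime) == recorded_mtime_sec and st.st_size == recorded_size:
        # Stat matches: skip reading and filtering entirely.
        return False
    with open(path, "rb") as f:
        data = f.read()
    if content_filter is not None:
        data = content_filter(data)
    return data != recorded_content


def autocrlf_demo() -> tuple[bool, bool]:
    """Return (modified before touch, modified after touch)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "crlf-exists")
        content = b"line1\r\nline2\r\n"
        with open(path, "wb") as f:
            f.write(content)
        st = os.stat(path)
        mtime_sec, size = int(st.st_mtime), st.st_size

        # Filter enabled after the file was recorded; stat still matches.
        before_touch = is_modified(path, mtime_sec, size, content, crlf_to_lf)

        # "Touch" the file: move mtime forward without changing content.
        os.utime(path, (st.st_atime, st.st_mtime + 10))
        after_touch = is_modified(path, mtime_sec, size, content, crlf_to_lf)
    return before_touch, after_touch
```

Under these assumptions the file is reported as unmodified until its mtime changes, after which the filter runs and the normalized content no longer equals the recorded CRLF content. This mirrors the commit message's statement that such files do not show as modified until they are touched or renormalized.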