
Optimize LFS performance by avoiding redundant disk writes (#1864)

When checking file status, the LFS filter's clean method was writing
every file to the LFS object store, even for unchanged files. This
caused severe performance degradation for large repositories.

The fix optimizes LFSStore.write_object to:
- First compute the SHA256 hash
- Check if the object already exists
- Only write to disk if the object doesn't exist

This avoids redundant disk I/O for unchanged files during status checks,
significantly improving performance in repositories with many
LFS-tracked files.
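
The three steps listed above can be sketched in isolation as follows. This is a minimal illustration, not the dulwich implementation: `write_if_missing` is a hypothetical stand-in for `LFSStore.write_object`, and the two-level directory layout is an assumption made only so the example is self-contained.

```python
import hashlib
import os
import tempfile


def write_if_missing(store_dir, chunks):
    """Hypothetical stand-in for LFSStore.write_object.

    Hashes the chunks, and writes them to disk only if no object with
    that SHA256 is already stored. Returns the hex digest.
    """
    # Step 1: compute the SHA256, buffering the chunks so they can be
    # written later (the iterable may only be consumable once).
    sha = hashlib.sha256()
    data_chunks = []
    for chunk in chunks:
        sha.update(chunk)
        data_chunks.append(chunk)
    sha_hex = sha.hexdigest()

    # Assumed layout for this sketch: <store>/<2 hex>/<2 hex>/<sha>.
    path = os.path.join(store_dir, sha_hex[:2], sha_hex[2:4], sha_hex)

    # Step 2: if the object already exists, return without touching disk.
    if os.path.exists(path):
        return sha_hex

    # Step 3: write to a temporary file, then move it into place.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with tempfile.NamedTemporaryFile(dir=store_dir, mode="wb", delete=False) as f:
        for chunk in data_chunks:
            f.write(chunk)
        tmppath = f.name
    os.replace(tmppath, path)
    return sha_hex
```

Calling the function twice with the same content returns the same digest, and the second call returns at the existence check without creating a temporary file. One trade-off of hashing first is that the chunks are held in memory until the check has been made.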

Fixes #1789
Jelmer Vernooij 4 months ago
parent
commit
4b801fd359
2 changed files with 25 additions and 6 deletions
  1. NEWS (+6 -0)
  2. dulwich/lfs.py (+19 -6)

+ 6 - 0
NEWS

@@ -34,6 +34,12 @@
    headers to the server when communicating over HTTP(S).
    (Jelmer Vernooij, #1769)
 
+ * Optimize LFS filter performance by avoiding redundant disk writes when
+   checking file status. The LFS store now checks if objects already exist
+   before writing them to disk, significantly improving ``git status``
+   performance in repositories with many LFS-tracked files.
+   (Jelmer Vernooij, #1789)
+
  * Add support for ``patiencediff`` algorithm in diff.
    (Jelmer Vernooij, #1795)
 

+ 19 - 6
dulwich/lfs.py

@@ -140,24 +140,37 @@ class LFSStore:
 
         Returns: object SHA
         """
+        # First pass: compute SHA256 and collect data
         sha = hashlib.sha256()
+        data_chunks = []
+        for chunk in chunks:
+            sha.update(chunk)
+            data_chunks.append(chunk)
+
+        sha_hex = sha.hexdigest()
+        path = self._sha_path(sha_hex)
+
+        # If object already exists, no need to write
+        if os.path.exists(path):
+            return sha_hex
+
+        # Object doesn't exist, write it
+        if not os.path.exists(os.path.dirname(path)):
+            os.makedirs(os.path.dirname(path))
+
         tmpdir = os.path.join(self.path, "tmp")
         with tempfile.NamedTemporaryFile(dir=tmpdir, mode="wb", delete=False) as f:
-            for chunk in chunks:
-                sha.update(chunk)
+            for chunk in data_chunks:
                 f.write(chunk)
             f.flush()
             tmppath = f.name
-        path = self._sha_path(sha.hexdigest())
-        if not os.path.exists(os.path.dirname(path)):
-            os.makedirs(os.path.dirname(path))
 
         # Handle concurrent writes - if file already exists, just remove temp file
         if os.path.exists(path):
             os.remove(tmppath)
         else:
             os.rename(tmppath, path)
-        return sha.hexdigest()
+        return sha_hex
 
 
 class LFSPointer: