Bazel digest calculation of big sparse files is slow
#25265 opened on Feb 12, 2025
Description
Description of the bug:
We use Bazel to build partition and disk images. The individual partitions are built across multiple build steps and assembled into a disk image at the end.
The images are stored as sparse files, and their on-disk size is 1-10% of their logical size. For example, an image can have a logical size of 100GB but occupy only 2GB on disk. When Bazel computes the digest of such an image, it does not take advantage of the file's sparseness (e.g. by using SEEK_HOLE and SEEK_DATA) and reads through the giant holes, which is very slow.
While DigestFunction in remote_execution.proto does list a few digest functions that could easily be implemented to work efficiently with sparse files, these are not available in the hashFunctionRegistry. Only BLAKE3, SHA1 and SHA256 are available.
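To illustrate what using SEEK_HOLE and SEEK_DATA would look like, here is a minimal Python sketch (not Bazel code) that lists only the data regions of a file. It assumes Linux and a filesystem that reports holes; the function name `data_extents` is made up for this example. On filesystems without hole support, the whole file is typically reported as a single data region.

```python
import errno
import os


def data_extents(path):
    """Yield (offset, length) for each data region of the file, skipping holes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Jump to the next byte that is backed by data.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:
                    # Only a hole remains between offset and end of file.
                    break
                raise
            # Find where this data region ends (next hole or end of file).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            yield start, end - start
            offset = end
    finally:
        os.close(fd)
```

For a 100GB image with 2GB of data, a loop over `data_extents(path)` would only visit the regions that actually need to be read from disk.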
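To show why a chunk-based digest function can be cheap on sparse files, here is a rough Python sketch. It is a made-up scheme for illustration only: it does not implement any of the functions in remote_execution.proto, and the name `sparse_chunked_digest` and the 1 MiB chunk size are assumptions. The idea is that the digest of an all-zero chunk is computed once and reused for every chunk that lies entirely inside a hole, so holes are neither read nor hashed.

```python
import errno
import hashlib
import os


def next_data_offset(fd, offset, size):
    """Return the offset of the next data byte at or after offset, or size if none."""
    try:
        return os.lseek(fd, offset, os.SEEK_DATA)
    except OSError as e:
        if e.errno == errno.ENXIO:
            return size
        raise


def sparse_chunked_digest(path, chunk_size=1 << 20):
    """Hash the file as SHA-256(size || SHA-256(chunk_0) || SHA-256(chunk_1) || ...)."""
    # Digest of one all-zero chunk, computed once and reused for hole chunks.
    zero_leaf = hashlib.sha256(bytes(chunk_size)).digest()
    root = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        root.update(size.to_bytes(8, "big"))
        offset = 0
        while offset < size:
            data_start = next_data_offset(fd, offset, size)
            # Whole chunks that lie entirely inside the hole before data_start.
            hole_chunks = (data_start - offset) // chunk_size
            if hole_chunks > 0:
                for _ in range(hole_chunks):
                    root.update(zero_leaf)
                offset += hole_chunks * chunk_size
                continue
            # This chunk contains data (or is the final partial chunk): read it.
            want = min(chunk_size, size - offset)
            buf = bytearray()
            while len(buf) < want:
                piece = os.pread(fd, want - len(buf), offset + len(buf))
                if not piece:
                    break
                buf += piece
            root.update(hashlib.sha256(buf).digest())
            offset += want
    finally:
        os.close(fd)
    return root.hexdigest()
```

The result is the same as hashing every chunk after reading the whole file, but the work is proportional to the allocated data plus one small update per hole chunk. A flat hash such as SHA256 cannot use this shortcut, because every zero byte still has to be fed through the hash state.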
Which category does this issue belong to?
Performance
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
genrule(
    name = "big_sparse_file",
    outs = ["big_sparse_file.dat"],
    cmd = "truncate $@ -s 10G",
)
INFO: Elapsed time: 23.773s, Critical Path: 14.13s
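To confirm that the output of the genrule is actually sparse, the logical size can be compared with the allocated size. This is a small Python sketch assuming Linux, where `st_blocks` is counted in 512-byte units; the helper name `sparseness` is made up for this example.

```python
import os


def sparseness(path):
    """Return (logical_bytes, allocated_bytes) for the file at path."""
    st = os.stat(path)
    logical = st.st_size
    # st_blocks counts 512-byte units, independent of the filesystem block size.
    allocated = st.st_blocks * 512
    return logical, allocated
```

Calling `sparseness()` on the generated big_sparse_file.dat (its location under the Bazel output tree depends on the package it is declared in) should report a logical size of 10 GiB and an allocated size close to zero on filesystems that support sparse files.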
Which operating system are you running Bazel on?
Ubuntu 20.04
What is the output of bazel info release?
release 7.4.1
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD ?
No response
If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response