bazelbuild/bazel

Bazel digest calculation of big sparse files is slow

Open

#25265 opened on Feb 12, 2025

Labels: P3, help wanted, team-Performance, type: feature request

Description

Description of the bug:

We use Bazel to build partition and disk images. The individual partitions are built across multiple build steps and assembled into a disk image at the end.

The images are stored as sparse files, and their on-disk size is 1-10% of their logical size, for example 100 GB logical but only 2 GB on disk. When Bazel computes the digest of such an image, it does not take advantage of the file's sparseness (e.g. by using SEEK_HOLE and SEEK_DATA) and instead reads through the giant holes, which is very slow.
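To illustrate what using SEEK_HOLE and SEEK_DATA could look like, here is a minimal Python sketch that enumerates only the data regions of a file. It is not Bazel code; the function name `data_extents` is hypothetical, and it assumes a platform (such as Linux) where `os.SEEK_DATA` and `os.SEEK_HOLE` are available. On filesystems without hole support the kernel typically reports the whole file as one data region.

```python
import errno
import os


def data_extents(path):
    """Yield (offset, length) for each data region of a file, skipping holes.

    Hypothetical helper for illustration only. Relies on os.SEEK_DATA and
    os.SEEK_HOLE, which exist on Linux and some other Unix systems.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Jump to the next byte that is backed by data.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:
                    # Only a hole remains between offset and end of file.
                    break
                raise
            # Find where this data region ends (EOF counts as a hole).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            yield start, end - start
            offset = end
    finally:
        os.close(fd)
```

A reader of the file could then touch only the yielded regions and treat everything else as zeros, rather than reading the holes from disk.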

While DigestFunction in remote_execution.proto does list a few digest functions that could easily be implemented to work efficiently with sparse files, these are not available in the hashFunctionRegistry. Only BLAKE3, SHA1 and SHA256 are available.

Which category does this issue belong to?

Performance

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

genrule(
    name = "big_sparse_file",
    outs = ["big_sparse_file.dat"],
    # Creates a 10G file that consists entirely of a hole.
    cmd = "truncate -s 10G $@",
)

INFO: Elapsed time: 23.773s, Critical Path: 14.13s
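To confirm that the generated output really is sparse, one can compare its logical size with the space allocated on disk. The following is a small sketch under the assumption of a POSIX system where `st_blocks` is counted in 512-byte units; the helper name `sparse_stats` is hypothetical.

```python
import os


def sparse_stats(path):
    """Return (logical_size, allocated_bytes) for a file.

    Hypothetical helper for illustration. st_blocks is reported in
    512-byte units on POSIX systems, regardless of the filesystem's
    actual block size.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

For a file created with truncate as above, the logical size is 10G while the allocated size is expected to be close to zero on filesystems that support sparse files.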

Which operating system are you running Bazel on?

Ubuntu 20.04

What is the output of bazel info release?

release 7.4.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response
