
Invalidation of remote repositories

Status: Implemented

Author: Damien Martin-Guillerez

State at commit 808a651

Remote repositories are fetched the first time a build that depends on a repository is launched. The next time the same build happens, the already fetched repositories are not refetched, saving on download times or other expensive operations.

This behavior is preserved across Bazel server restarts by serializing the repository rule from the workspace file. A file named @<repositoryName>.marker is created for each repository with a fingerprint of the serialized rule. On the next fetch, if that fingerprint has not changed, the rule is not refetched. This is not applied if the repository rule is marked as local, because fetching a local repository is assumed to be fast.
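
The marker check described above can be sketched in Python as follows. This is only an illustration of the idea, not Bazel's code: the helper names (fingerprint, needs_fetch, write_marker), the use of SHA-256, and the marker file containing nothing but the fingerprint are all assumptions.

```python
import hashlib
import os


def fingerprint(serialized_rule: str) -> str:
    # Assumption: SHA-256 of the serialized rule text stands in for the real fingerprint.
    return hashlib.sha256(serialized_rule.encode("utf-8")).hexdigest()


def marker_path(external_dir: str, repo_name: str) -> str:
    # The marker file is named @<repositoryName>.marker.
    return os.path.join(external_dir, "@" + repo_name + ".marker")


def needs_fetch(external_dir, repo_name, serialized_rule, is_local=False):
    # Local repositories skip the marker check: fetching them is assumed to be fast.
    if is_local:
        return True
    path = marker_path(external_dir, repo_name)
    if not os.path.exists(path):
        return True  # never fetched before
    with open(path, encoding="utf-8") as f:
        return f.read().strip() != fingerprint(serialized_rule)


def write_marker(external_dir, repo_name, serialized_rule):
    # Called after a successful fetch to record the fingerprint.
    with open(marker_path(external_dir, repo_name), "w", encoding="utf-8") as f:
        f.write(fingerprint(serialized_rule))
```

With this sketch, a second call to needs_fetch after write_marker returns False for an unchanged rule, and True again as soon as the serialized rule differs.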

Shortcomings

These considerations were well suited when the implementation of repository rules did not depend on Skylark files. With the introduction of Skylark repositories, several issues appeared:

Proposed solution

Invalidation on the environment

Right now rules are not invalidated on the environment:

  • Invalidating on accesses to repository_ctx.os.environ would trigger invalidation on environment variables that might be volatile (e.g. CC when you want to use one C++ compiler and you reset your environment) and might miss other environment variables because of computed variable names.
  • There is no way to represent environment variables that influence repository_ctx.execute.

This document proposes to add a way to declare a dependency on the value of an environment variable, so that a change to it triggers a refetch of the repository. An optional environ attribute would be added to the repository_rule function. It takes a list of strings, and a change to any of the listed environment variables would invalidate the repository. E.g.:

my_repo = repository_rule(implementation = _impl, environ = ["FOO", "BAR"])

my_repo would be refetched on any change to the environment variables FOO or BAR, but not if the environment variable BAZ changes.

To be consistent with the new environment specification mechanism, the environment available through repository_ctx.os.environ or transmitted to repository_ctx.execute will take values from the --action_env flag, when specified. I.e. if --action_env FOO=BAR --action_env BAR are specified, and the environment sets FOO=BAZ, BAR=FOO, BAZ=BAR, then the actual repository_ctx.os.environ map would contain {"FOO": "BAR", "BAR": "FOO", "BAZ": "BAR"}. This would ensure that the environment seen by repository rules is consistent with the one seen by actions (a repository rule sees more of the environment than an action does, leaving the rule writer the ability to filter the environment more finely).
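
The merging rule in the example above can be sketched in Python as follows. The function name repository_environ and the representation of the flag values as plain strings are assumptions made for illustration; this is not how Bazel parses the flag internally.

```python
def repository_environ(client_env, action_env_flags):
    """Return the environment a repository rule would see.

    client_env: dict of the client's environment variables.
    action_env_flags: list of --action_env values, either "NAME=VALUE" or "NAME".
    """
    # Start from the full client environment: a repository rule sees more
    # than an action does, so undeclared variables are kept.
    env = dict(client_env)
    for flag in action_env_flags:
        name, sep, value = flag.partition("=")
        if sep:
            # --action_env NAME=VALUE fixes the value, overriding the client.
            env[name] = value
        # A bare --action_env NAME keeps whatever the client environment has.
    return env


client = {"FOO": "BAZ", "BAR": "FOO", "BAZ": "BAR"}
merged = repository_environ(client, ["FOO=BAR", "BAR"])
# merged == {"FOO": "BAR", "BAR": "FOO", "BAZ": "BAR"}
```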

Both these changes should allow Bazel to do auto-configuration based on environment variables:

  • Setting some environment variables would actually retrigger auto-configuration, according to how the rule writer designed it (and not based on some assumption from Bazel).
  • The user can set specific environment variables through the --action_env flag, and fix this environment using bazel info client-env.

Serialization of Skyframe dependencies

A local rule will be invalidated when any of its Skyframe dependencies change. For a non-local rule, a marker file will be stored in the external directory with a summary of the dependencies of the rule. At each fetch operation, we check the existence of the marker file and verify each dependency. If one of them has changed, we refetch that repository.

To avoid unnecessary re-download of artifacts, a content-addressable cache has been developed for downloads (and is thus not discussed here).

The marker file will be a manifest containing the following items:

  • A fingerprint of the serialized rule and the rule specific data (e.g., maven server information for maven_jar).
  • The declared environment (list of name, value pairs) through the environ attribute of the repository rule.
  • The list of FileValue-s requested by getPathFromLabel and the corresponding file content digest.
  • The transitive hash of the Extension defining the repository rule. This transitive hash is computed from the hash of the current extension and the extensions loaded from it. This means that a repository function will get invalidated as soon as the extension file content changes, which is an over-invalidation. However, getting an optimal result would require correct serialization of Skylark extensions.
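
The four items above can be pictured as a flat key/value manifest. The following Python sketch is illustrative only: the function name build_marker_manifest, the key prefixes (RULE, EXTENSION, ENV:, FILE:) and the use of SHA-256 are assumptions, and the real marker file format is not specified in this document.

```python
import hashlib


def digest(content: bytes) -> str:
    # Assumption: SHA-256 stands in for the file content digest.
    return hashlib.sha256(content).hexdigest()


def build_marker_manifest(rule_fingerprint, declared_environ, env,
                          file_contents, extension_hash):
    """Collect everything whose change should trigger a refetch.

    rule_fingerprint: fingerprint of the serialized rule and rule-specific data.
    declared_environ: names listed in the rule's environ attribute.
    env: the environment seen by the repository rule.
    file_contents: dict mapping a label to the bytes of the requested file.
    extension_hash: transitive hash of the extension defining the rule.
    """
    manifest = {"RULE": rule_fingerprint, "EXTENSION": extension_hash}
    # Only declared variables are recorded, so undeclared ones never invalidate.
    for name in declared_environ:
        manifest["ENV:" + name] = env.get(name)  # None when the variable is unset
    for label, content in file_contents.items():
        manifest["FILE:" + label] = digest(content)
    return manifest
```

Two manifests built from environments that differ only in an undeclared variable compare equal, which matches the intent that only declared variables cause a refetch.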

Implementation plan

  1. Modify the SkylarkRepositoryFunction#getClientEnvironment method to get the values from the --action_env flag.
  2. Add a markerData map argument to RepositoryFunction#fetch so that SkylarkRepositoryFunction can include those changes. This argument should be mutable so that a repository can add more data to be stored in the marker file. Add a corresponding function for verification, verifyMarkerManifest, that takes a marker data map and returns a tri-state: true if the repository is up to date, false if it needs a refetch, and null if additional Skyframe dependencies need to be resolved before answering.
  3. Add the environ attribute to the repository_rule function and the dependency on the Skyframe values for the environment. Also create a SkyFunction for the environment as processed by the --action_env flag.
  4. Add the environ values to the marker file through the getMarkerManifest function.
  5. Add the FileValue-s to the marker file, adding all the files requested through the getPath method to a specific builder that will be passed to the SkylarkRepositoryContext.
  6. Add the extension to the marker file by passing the transitiveHashCode of the Skylark Environment to the marker manifest.
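
The tri-state result of verifyMarkerManifest in step 2 can be sketched in Python as follows. The actual implementation is planned in Java; here the PENDING sentinel, the lookup callback, and the choice to report a detected change even when other values are still pending are assumptions made for illustration.

```python
# Hypothetical sentinel: the current value is not yet available because a
# Skyframe dependency still has to be resolved.
PENDING = object()


def verify_marker_manifest(marker_data, lookup):
    """Compare recorded marker data against current values.

    marker_data: dict of key -> value recorded at the last fetch.
    lookup: callable returning the current value for a key, or PENDING.
    Returns True (up to date), False (refetch needed) or None (ask again
    once the missing dependencies have been resolved).
    """
    pending = False
    for key, recorded in marker_data.items():
        current = lookup(key)
        if current is PENDING:
            pending = True
        elif current != recorded:
            return False  # one changed dependency is enough to refetch
    return None if pending else True


marker = {"ENV:FOO": "1", "FILE:x": "abc"}
current = {"ENV:FOO": "1"}  # FILE:x has not been computed yet
result = verify_marker_manifest(marker, lambda k: current.get(k, PENDING))
# result is None: the caller has to retry after FILE:x is resolved
```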