A Guide to Skyframe StateMachines

Report an issue View source

Overview

A Skyframe StateMachine is a deconstructed function-object that resides on the heap. It supports flexible and evaluation without redundancy1 when required values are not immediately available but computed asynchronously. The StateMachine cannot tie up a thread resource while waiting, but instead has to be suspended and resumed. The deconstruction thus exposes explicit re-entry points so that prior computations can be skipped.

StateMachines can be used to express sequences, branching, structured logical concurrency and are tailored specifically for Skyframe interaction. StateMachines can be composed into larger StateMachines and share sub-StateMachines. Concurrency is always hierarchical by construction and purely logical. Every concurrent subtask runs in the single shared parent SkyFunction thread.

Introduction

This section briefly motivates and introduces StateMachines, found in the java.com.google.devtools.build.skyframe.state package.

A brief introduction to Skyframe restarts

Skyframe is a framework that performs parallel evaluation of dependency graphs. Each node in the graph corresponds with the evaluation of a SkyFunction with a SkyKey specifying its parameters and SkyValue specifying its result. The computational model is such that a SkyFunction may lookup SkyValues by SkyKey, triggering recursive, parallel evaluation of additional SkyFunctions. Instead of blocking, which would tie up a thread, when a requested SkyValue is not yet ready because some subgraph of computation is incomplete, the requesting SkyFunction observes a null getValue response and should return null instead of a SkyValue, signaling that it is incomplete due to missing inputs. Skyframe restarts the SkyFunctions when all previously requested SkyValues become available.

Before the introduction of SkyKeyComputeState, the traditional way of handling a restart was to fully rerun the computation. Although this has quadratic complexity, functions written this way eventually complete because each rerun, fewer lookups return null. With SkyKeyComputeState it is possible to associate hand-specified check-point data with a SkyFunction, saving significant recomputation.

StateMachines are objects that live inside SkyKeyComputeState and eliminate virtually all recomputation when a SkyFunction restarts (assuming that SkyKeyComputeState does not fall out of cache) by exposing suspend and resume execution hooks.

Stateful computations inside SkyKeyComputeState

From an object-oriented design standpoint, it makes sense to consider storing computational objects inside SkyKeyComputeState instead of pure data values. In Java, the bare minimum description of a behavior carrying object is a functional interface and it turns out to be sufficient. A StateMachine has the following, curiously recursive, definition2.

@FunctionalInterface
public interface StateMachine {
  StateMachine step(Tasks tasks) throws InterruptedException;
}

The Tasks interface is analogous to SkyFunction.Environment but it is designed for asynchrony and adds support for logically concurrent subtasks3.

The return value of step is another StateMachine, allowing the specification of a sequence of steps, inductively. step returns DONE when the StateMachine is done. For example:

class HelloWorld implements StateMachine {
  @Override
  public StateMachine step(Tasks tasks) {
    System.out.println("hello");
    return this::step2;  // The next step is HelloWorld.step2.
  }

  private StateMachine step2(Tasks tasks) {
     System.out.println("world");
     // DONE is special value defined in the `StateMachine` interface signaling
     // that the computation is done.
     return DONE;
  }
}

describes a StateMachine with the following output.

hello
world

Note that the method reference this::step2 is also a StateMachine due to step2 satisfying StateMachine's functional interface definition. Method references are the most common way to specify the next state in a StateMachine.

Suspending and resuming

Intuitively, breaking a computation down into StateMachine steps, instead of a monolithic function, provides the hooks needed to suspend and resume a computation. When StateMachine.step returns, there is an explicit suspension point. The continuation specified by the returned StateMachine value is an explicit resume point. Recomputation can thus be avoided because the computation can be picked up exactly where it left off.

Callbacks, continuations and asynchronous computation

In technical terms, a StateMachine serves as a continuation, determining the subsequent computation to be executed. Instead of blocking, a StateMachine can voluntarily suspend by returning from the step function, which transfers control back to a Driver instance. The Driver can then switch to a ready StateMachine or relinquish control back to Skyframe.

Traditionally, callbacks and continuations are conflated into one concept. However, StateMachines maintain a distinction between the two.

  • Callback - describes where to store the result of an asynchronous computation.
  • Continuation - specifies the next execution state.

Callbacks are required when invoking an asynchronous operation, which means that the actual operation doesn't occur immediately upon calling the method, as in the case of a SkyValue lookup. Callbacks should be kept as simple as possible.

Continuations are the StateMachine return values of StateMachines and encapsulate the complex execution that follows once all asynchronous computations resolve. This structured approach helps to keep the complexity of callbacks manageable.

Tasks

The Tasks interface provides StateMachines with an API to lookup SkyValues by SkyKey and to schedule concurrent subtasks.

interface Tasks {
  void enqueue(StateMachine subtask);

  void lookUp(SkyKey key, Consumer<SkyValue> sink);

  <E extends Exception>
  void lookUp(SkyKey key, Class<E> exceptionClass, ValueOrExceptionSink<E> sink);

  // lookUp overloads for 2 and 3 exception types exist, but are elided here.
}

SkyValue lookups

StateMachines use Tasks.lookUp overloads to look up SkyValues. They are analogous to SkyFunction.Environment.getValue and SkyFunction.Environment.getValueOrThrow and have similar exception handling semantics. The implementation does not immediately perform the lookup, but instead, batches4 as many lookups as possible before doing so. The value might not be immediately available, for example, requiring a Skyframe restart, so the caller specifies what to do with the resulting value using a callback.

The StateMachine processor (Drivers and bridging to SkyFrame) guarantees that the value is available before the next state begins. An example follows.

class DoesLookup implements StateMachine, Consumer<SkyValue> {
  private Value value;

  @Override
  public StateMachine step(Tasks tasks) {
    tasks.lookUp(new Key(), (Consumer<SkyValue>) this);
    return this::processValue;
  }

  // The `lookUp` call in `step` causes this to be called before `processValue`.
  @Override  // Implementation of Consumer<SkyValue>.
  public void accept(SkyValue value) {
    this.value = (Value)value;
  }

  private StateMachine processValue(Tasks tasks) {
    System.out.println(value);  // Prints the string representation of `value`.
    return DONE;
  }
}

In the above example, the first step does a lookup for new Key(), passing this as the consumer. That is possible because DoesLookup implements Consumer<SkyValue>.

By contract, before the next state DoesLookup.processValue begins, all the lookups of DoesLookup.step are complete. Therefore value is available when it is accessed in processValue.

Subtasks

Tasks.enqueue requests the execution of logically concurrent subtasks. Subtasks are also StateMachines and can do anything regular StateMachines can do, including recursively creating more subtasks or looking up SkyValues. Much like lookUp, the state machine driver ensures that all subtasks are complete before proceeding to the next step. An example follows.

class Subtasks implements StateMachine {
  private int i = 0;

  @Override
  public StateMachine step(Tasks tasks) {
    tasks.enqueue(new Subtask1());
    tasks.enqueue(new Subtask2());
    // The next step is Subtasks.processResults. It won't be called until both
    // Subtask1 and Subtask 2 are complete.
    return this::processResults;
  }

  private StateMachine processResults(Tasks tasks) {
    System.out.println(i);  // Prints "3".
    return DONE;  // Subtasks is done.
  }

  private class Subtask1 implements StateMachine {
    @Override
    public StateMachine step(Tasks tasks) {
      i += 1;
      return DONE;  // Subtask1 is done.
    }
  }

  private class Subtask2 implements StateMachine {
    @Override
    public StateMachine step(Tasks tasks) {
      i += 2;
      return DONE;  // Subtask2 is done.
    }
  }
}

Though Subtask1 and Subtask2 are logically concurrent, everything runs in a single thread so the "concurrent" update of i does not need any synchronization.

Structured concurrency

Since every lookUp and enqueue must resolve before advancing to the next state, it means that concurrency is naturally limited to tree-structures. It's possible to create hierarchical5 concurrency as shown in the following example.

Structured Concurrency

It's hard to tell from the UML that the concurrency structure forms a tree. There's an alternate view that better shows the tree structure.

Unstructured Concurrency

Structured concurrency is much easier to reason about.

Composition and control flow patterns

This section presents examples for how multiple StateMachines can be composed and solutions to certain control flow problems.

Sequential states

This is the most common and straightforward control flow pattern. An example of this is shown in Stateful computations inside SkyKeyComputeState.

Branching

Branching states in StateMachines can be achieved by returning different values using regular Java control flow, as shown in the following example.

class Branch implements StateMachine {
  @Override
  public StateMachine step(Tasks tasks) {
    // Returns different state machines, depending on condition.
    if (shouldUseA()) {
      return this::performA;
    }
    return this::performB;
  }
  …
}

It’s very common for certain branches to return DONE, for early completion.

Advanced sequential composition

Since the StateMachine control structure is memoryless, sharing StateMachine definitions as subtasks can sometimes be awkward. Let M1 and M2 be StateMachine instances that share a StateMachine, S, with M1 and M2 being the sequences <A, S, B> and <X, S, Y> respectively. The problem is that S doesn’t know whether to continue to B or Y after it completes and StateMachines don't quite keep a call stack. This section reviews some techniques for achieving this.

StateMachine as terminal sequence element

This doesn’t solve the initial problem posed. It only demonstrates sequential composition when the shared StateMachine is terminal in the sequence.

// S is the shared state machine.
class S implements StateMachine { … }

class M1 implements StateMachine {
  @Override
  public StateMachine step(Tasks tasks) {
    performA();
    return new S();
  }
}

class M2 implements StateMachine {
  @Override
  public StateMachine step(Tasks tasks) {
    performX();
    return new S();
  }
}

This works even if S is itself a complex state machine.

Subtask for sequential composition

Since enqueued subtasks are guaranteed to complete before the next state, it’s sometimes possible to slightly abuse6 the subtask mechanism.

class M1 implements StateMachine {
  @Override
  public StateMachine step(Tasks tasks) {
    performA();
    // S starts after `step` returns and by contract must complete before `doB`
    // begins. It is effectively sequential, inducing the sequence < A, S, B >.
    tasks.enqueue(new S());
    return this::doB;
  }

  private StateMachine doB(Tasks tasks) {
    performB();
    return DONE;
  }
}

class M2 implements StateMachine {
  @Override
  public StateMachine step(Tasks tasks) {
    performX();
    // Similarly, this induces the sequence < X, S, Y>.
    tasks.enqueue(new S());
    return this::doY;
  }

  private StateMachine doY(Tasks tasks) {
    performY();
    return DONE;
  }
}

runAfter injection

Sometimes, abusing Tasks.enqueue is impossible because there are other parallel subtasks or Tasks.lookUp calls that must be completed before S executes. In this case, injecting a runAfter parameter into S can be used to inform S of what to do next.

class S implements StateMachine {
  // Specifies what to run after S completes.
  private final StateMachine runAfter;

  @Override
  public StateMachine step(Tasks tasks) {
    … // Performs some computations.
    return this::processResults;
  }

  @Nullable
  private StateMachine processResults(Tasks tasks) {
    … // Does some additional processing.

    // Executes the state machine defined by `runAfter` after S completes.
    return runAfter;
  }
}

class M1 implements StateMachine {
  @Override
  public StateMachine step(Tasks tasks) {
    performA();
    // Passes `this::doB` as the `runAfter` parameter of S, resulting in the
    // sequence < A, S, B >.
    return new S(/* runAfter= */ this::doB);
  }

  private StateMachine doB(Tasks tasks) {
    performB();
    return DONE;
  }
}

class M2 implements StateMachine {
  @Override
  public StateMachine step(Tasks tasks) {
    performX();
    // Passes `this::doY` as the `runAfter` parameter of S, resulting in the
    // sequence < X, S, Y >.
    return new S(/* runAfter= */ this::doY);
  }

  private StateMachine doY(Tasks tasks) {
    performY();
    return DONE;
  }
}

This approach is cleaner than abusing subtasks. However, applying this too liberally, for example, by nesting multiple StateMachines with runAfter, is the road to Callback Hell. It’s better to break up sequential runAfters with ordinary sequential states instead.

  return new S(/* runAfter= */ new T(/* runAfter= */ this::nextStep))

can be replaced with the following.

  private StateMachine step1(Tasks tasks) {
     doStep1();
     return new S(/* runAfter= */ this::intermediateStep);
  }

  private StateMachine intermediateStep(Tasks tasks) {
    return new T(/* runAfter= */ this::nextStep);
  }

Forbidden alternative: runAfterUnlessError

In an earlier draft, we had considered a runAfterUnlessError that would abort early on errors. This was motivated by the fact that errors often end up getting checked twice, once by the StateMachine that has a runAfter reference and once by the runAfter machine itself.

After some deliberation, we decided that uniformity of the code is more important than deduplicating the error checking. It would be confusing if the runAfter mechanism did not work in a consistent manner with the tasks.enqueue mechanism, which always requires error checking.

Direct delegation

Each time there is a formal state transition, the main Driver loop advances. As per contract, advancing states means that all previously enqueued SkyValue lookups and subtasks resolve before the next state executes. Sometimes the logic of a delegate StateMachine makes a phase advance unnecessary or counterproductive. For example, if the first step of the delegate performs SkyKey lookups that could be parallelized with lookups of the delegating state then a phase advance would make them sequential. It could make more sense to perform direct delegation, as shown in the example below.

class Parent implements StateMachine {
  @Override
  public StateMachine step(Tasks tasks ) {
    tasks.lookUp(new Key1(), this);
    // Directly delegates to `Delegate`.
    //
    // The (valid) alternative:
    //   return new Delegate(this::afterDelegation);
    // would cause `Delegate.step` to execute after `step` completes which would
    // cause lookups of `Key1` and `Key2` to be sequential instead of parallel.
    return new Delegate(this::afterDelegation).step(tasks);
  }

  private StateMachine afterDelegation(Tasks tasks) {
    …
  }
}

class Delegate implements StateMachine {
  private final StateMachine runAfter;

  Delegate(StateMachine runAfter) {
    this.runAfter = runAfter;
  }

  @Override
  public StateMachine step(Tasks tasks) {
    tasks.lookUp(new Key2(), this);
    return …;
  }

  // Rest of implementation.
  …

  private StateMachine complete(Tasks tasks) {
    …
    return runAfter;
  }
}

Data flow

The focus of the previous discussion has been on managing control flow. This section describes the propagation of data values.

Implementing Tasks.lookUp callbacks

There’s an example of implementing a Tasks.lookUp callback in SkyValue lookups. This section provides rationale and suggests approaches for handling multiple SkyValues.

Tasks.lookUp callbacks

The Tasks.lookUp method takes a callback, sink, as a parameter.

  void lookUp(SkyKey key, Consumer<SkyValue> sink);

The idiomatic approach would be to use a Java lambda to implement this:

  tasks.lookUp(key, value -> myValue = (MyValueClass)value);

with myValue being a member variable of the StateMachine instance doing the lookup. However, the lambda requires an extra memory allocation compared to implementing the Consumer<SkyValue> interface in the StateMachine implementation. The lambda is still useful when there are multiple lookups that would be ambiguous.

There are also error handling overloads of Tasks.lookUp, that are analogous to SkyFunction.Environment.getValueOrThrow.

  <E extends Exception> void lookUp(
      SkyKey key, Class<E> exceptionClass, ValueOrExceptionSink<E> sink);

  interface ValueOrExceptionSink<E extends Exception> {
    void acceptValueOrException(@Nullable SkyValue value, @Nullable E exception);
  }

An example implementation is shown below.

class PerformLookupWithError extends StateMachine, ValueOrExceptionSink<MyException> {
  private MyValue value;
  private MyException error;

  @Override
  public StateMachine step(Tasks tasks) {
    tasks.lookUp(new MyKey(), MyException.class, ValueOrExceptionSink<MyException>) this);
    return this::processResult;
  }

  @Override
  public acceptValueOrException(@Nullable SkyValue value, @Nullable MyException exception) {
    if (value != null) {
      this.value = (MyValue)value;
      return;
    }
    if (exception != null) {
      this.error = exception;
      return;
    }
    throw new IllegalArgumentException("Both parameters were unexpectedly null.");
  }

  private StateMachine processResult(Tasks tasks) {
    if (exception != null) {
      // Handles the error.
      …
      return DONE;
    }
    // Processes `value`, which is non-null.
    …
  }
}

As with lookups without error handling, having the StateMachine class directly implement the callback saves a memory allocation for the lamba.

Error handling provides a bit more detail, but essentially, there's not much difference between the propagation of errors and normal values.

Consuming multiple SkyValues

Multiple SkyValue lookups are often required. An approach that works much of the time is to switch on the type of SkyValue. The following is an example that has been simplified from prototype production code.

  @Nullable
  private StateMachine fetchConfigurationAndPackage(Tasks tasks) {
    var configurationKey = configuredTarget.getConfigurationKey();
    if (configurationKey != null) {
      tasks.lookUp(configurationKey, (Consumer<SkyValue>) this);
    }

    var packageId = configuredTarget.getLabel().getPackageIdentifier();
    tasks.lookUp(PackageValue.key(packageId), (Consumer<SkyValue>) this);

    return this::constructResult;
  }

  @Override  // Implementation of `Consumer<SkyValue>`.
  public void accept(SkyValue value) {
    if (value instanceof BuildConfigurationValue) {
      this.configurationValue = (BuildConfigurationValue) value;
      return;
    }
    if (value instanceof PackageValue) {
      this.pkg = ((PackageValue) value).getPackage();
      return;
    }
    throw new IllegalArgumentException("unexpected value: " + value);
  }

The Consumer<SkyValue> callback implementation can be shared unambiguously because the value types are different. When that’s not the case, falling back to lambda-based implementations or full inner-class instances that implement the appropriate callbacks is viable.

Propagating values between StateMachines

So far, this document has only explained how to arrange work in a subtask, but subtasks also need to report a values back to the caller. Since subtasks are logically asynchronous, their results are communicated back to the caller using a callback. To make this work, the subtask defines a sink interface that is injected via its constructor.

class BarProducer implements StateMachine {
  // Callers of BarProducer implement the following interface to accept its
  // results. Exactly one of the two methods will be called by the time
  // BarProducer completes.
  interface ResultSink {
    void acceptBarValue(Bar value);
    void acceptBarError(BarException exception);
  }

  private final ResultSink sink;

  BarProducer(ResultSink sink) {
     this.sink = sink;
  }

  … // StateMachine steps that end with this::complete.

  private StateMachine complete(Tasks tasks) {
    if (hasError()) {
      sink.acceptBarError(getError());
      return DONE;
    }
    sink.acceptBarValue(getValue());
    return DONE;
  }
}

A caller StateMachine would then look like the following.

class Caller implements StateMachine, BarProducer.ResultSink {
  interface ResultSink {
    void acceptCallerValue(Bar value);
    void acceptCallerError(BarException error);
  }

  private final ResultSink sink;

  private Bar value;

  Caller(ResultSink sink) {
    this.sink = sink;
  }

  @Override
  @Nullable
  public StateMachine step(Tasks tasks) {
    tasks.enqueue(new BarProducer((BarProducer.ResultSink) this));
    return this::processResult;
  }

  @Override
  public void acceptBarValue(Bar value) {
    this.value = value;
  }

  @Override
  public void acceptBarError(BarException error) {
    sink.acceptCallerError(error);
  }

  private StateMachine processResult(Tasks tasks) {
    // Since all enqueued subtasks resolve before `processResult` starts, one of
    // the `BarResultSink` callbacks must have been called by this point.
    if (value == null) {
      return DONE;  // There was a previously reported error.
    }
    var finalResult = computeResult(value);
    sink.acceptCallerValue(finalResult);
    return DONE;
  }
}

The preceding example demonstrates a few things. Caller has to propagate its results back and defines its own Caller.ResultSink. Caller implements the BarProducer.ResultSink callbacks. Upon resumption, processResult checks if value is null to determine if an error occurred. This is a common behavior pattern after accepting output from either a subtask or SkyValue lookup.

Note that the implementation of acceptBarError eagerly forwards the result to the Caller.ResultSink, as required by Error bubbling.

Alternatives for top-level StateMachines are described in Drivers and bridging to SkyFunctions.

Error handling

There's a couple of examples of error handling already in Tasks.lookUp callbacks and Propagating values between StateMachines. Exceptions, other than InterruptedException are not thrown, but instead passed around through callbacks as values. Such callbacks often have exclusive-or semantics, with exactly one of a value or error being passed.

The next section describes a a subtle, but important interaction with Skyframe error handling.

Error bubbling (--nokeep_going)

During error bubbling, a SkyFunction may be restarted even if not all requested SkyValues are available. In such cases, the subsequent state will never be reached due to the Tasks API contract. However, the StateMachine should still propagate the exception.

Since propagation must occur regardless of whether the next state is reached, the error handling callback must perform this task. For an inner StateMachine, this is achieved by invoking the parent callback.

At the top-level StateMachine, which interfaces with the SkyFunction, this can be done by calling the setException method of ValueOrExceptionProducer. ValueOrExceptionProducer.tryProduceValue will then throw the exception, even if there are missing SkyValues.

If a Driver is being utilized directly, it is essential to check for propagated errors from the SkyFunction, even if the machine has not finished processing.

Event Handling

For SkyFunctions that need to emit events, a StoredEventHandler is injected into SkyKeyComputeState and further injected into StateMachines that require them. Historically, the StoredEventHandler was needed due to Skyframe dropping certain events unless they are replayed but this was subsequently fixed. StoredEventHandler injection is preserved because it simplifies the implementation of events emitted from error handling callbacks.

Drivers and bridging to SkyFunctions

A Driver is responsible for managing the execution of StateMachines, beginning with a specified root StateMachine. As StateMachines can recursively enqueue subtask StateMachines, a single Driver can manage numerous subtasks. These subtasks create a tree structure, a result of Structured concurrency. The Driver batches SkyValue lookups across subtasks for improved efficiency.

There are a number of classes built around the Driver, with the following API.

public final class Driver {
  public Driver(StateMachine root);
  public boolean drive(SkyFunction.Environment env) throws InterruptedException;
}

Driver takes a single root StateMachine as a parameter. Calling Driver.drive executes the StateMachine as far as it can go without a Skyframe restart. It returns true when the StateMachine completes and false otherwise, indicating that not all values were available.

Driver maintains the concurrent state of the StateMachine and it is well suited for embedding in SkyKeyComputeState.

Directly instantiating Driver

StateMachine implementations conventionally communicate their results via callbacks. It's possible to directly instantiate a Driver as shown in the following example.

The Driver is embedded in the SkyKeyComputeState implementation along with an implementation of the corresponding ResultSink to be defined a bit further down. At the top level, the State object is an appropriate receiver for the result of the computation as it is guaranteed to outlive Driver.

class State implements SkyKeyComputeState, ResultProducer.ResultSink {
  // The `Driver` instance, containing the full tree of all `StateMachine`
  // states. Responsible for calling `StateMachine.step` implementations when
  // asynchronous values are available and performing batched SkyFrame lookups.
  //
  // Non-null while `result` is being computed.
  private Driver resultProducer;

  // Variable for storing the result of the `StateMachine`
  //
  // Will be non-null after the computation completes.
  //
  private ResultType result;

  // Implements `ResultProducer.ResultSink`.
  //
  // `ResultProducer` propagates its final value through a callback that is
  // implemented here.
  @Override
  public void acceptResult(ResultType result) {
    this.result = result;
  }
}

The code below sketches the ResultProducer.

class ResultProducer implements StateMachine {
  interface ResultSink {
    void acceptResult(ResultType value);
  }

  private final Parameters parameters;
  private final ResultSink sink;

  … // Other internal state.

  ResultProducer(Parameters parameters, ResultSink sink) {
    this.parameters = parameters;
    this.sink = sink;
  }

  @Override
  public StateMachine step(Tasks tasks) {
    …  // Implementation.
    return this::complete;
  }

  private StateMachine complete(Tasks tasks) {
    sink.acceptResult(getResult());
    return DONE;
  }
}

Then the code for lazily computing the result could look like the following.

@Nullable
private Result computeResult(State state, Skyfunction.Environment env)
    throws InterruptedException {
  if (state.result != null) {
    return state.result;
  }
  if (state.resultProducer == null) {
    state.resultProducer = new Driver(new ResultProducer(
      new Parameters(), (ResultProducer.ResultSink)state));
  }
  if (state.resultProducer.drive(env)) {
    // Clears the `Driver` instance as it is no longer needed.
    state.resultProducer = null;
  }
  return state.result;
}

Embedding Driver

If the StateMachine produces a value and raises no exceptions, embedding Driver is another possible implementation, as shown in the following example.

class ResultProducer implements StateMachine {
  private final Parameters parameters;
  private final Driver driver;

  private ResultType result;

  ResultProducer(Parameters parameters) {
    this.parameters = parameters;
    this.driver = new Driver(this);
  }

  @Nullable  // Null when a Skyframe restart is needed.
  public ResultType tryProduceValue( SkyFunction.Environment env)
      throws InterruptedException {
    if (!driver.drive(env)) {
      return null;
    }
    return result;
  }

  @Override
  public StateMachine step(Tasks tasks) {
    …  // Implementation.
}

The SkyFunction may have code that looks like the following (where State is the function specific type of SkyKeyComputeState).

@Nullable  // Null when a Skyframe restart is needed.
Result computeResult(SkyFunction.Environment env, State state)
    throws InterruptedException {
  if (state.result != null) {
    return state.result;
  }
  if (state.resultProducer == null) {
    state.resultProducer = new ResultProducer(new Parameters());
  }
  var result = state.resultProducer.tryProduceValue(env);
  if (result == null) {
    return null;
  }
  state.resultProducer = null;
  return state.result = result;
}

Embedding Driver in the StateMachine implementation is a better fit for Skyframe's synchronous coding style.

StateMachines that may produce exceptions

Otherwise, there are SkyKeyComputeState-embeddable ValueOrExceptionProducer and ValueOrException2Producer classes that have synchronous APIs to match synchronous SkyFunction code.

The ValueOrExceptionProducer abstract class includes the following methods.

public abstract class ValueOrExceptionProducer<V, E extends Exception>
    implements StateMachine {
  @Nullable
  public final V tryProduceValue(Environment env)
      throws InterruptedException, E {
    …  // Implementation.
  }

  protected final void setValue(V value)  {  … // Implementation. }
  protected final void setException(E exception) {  … // Implementation. }
}

It includes an embedded Driver instance and closely resembles the ResultProducer class in Embedding driver and interfaces with the SkyFunction in a similar manner. Instead of defining a ResultSink, implementations call setValue or setException when either of those occur. When both occur, the exception takes priority. The tryProduceValue method bridges the asynchronous callback code to synchronous code and throws an exception when one is set.

As previously noted, during error bubbling, it's possible for an error to occur even if the machine is not yet done because not all inputs are available. To accommodate this, tryProduceValue throws any set exceptions, even before the machine is done.

Epilogue: Eventually removing callbacks

StateMachines are a highly efficient, but boilerplate intensive way to perform asynchronous computation. Continuations (particularly in the form of Runnables passed to ListenableFuture) are widespread in certain parts of Bazel code, but aren't prevalent in analysis SkyFunctions. Analysis is mostly CPU bound and there are no efficient asynchronous APIs for disk I/O. Eventually, it would be good to optimize away callbacks as they have a learning curve and impede readability.

One of the most promising alternatives is Java virtual threads. Instead of having to write callbacks, everything is replaced with synchronous, blocking calls. This is possible because tying up a virtual thread resource, unlike a platform thread, is supposed to be cheap. However, even with virtual threads, replacing simple synchronous operations with thread creation and synchronization primitives is too expensive. We performed a migration from StateMachines to Java virtual threads and they were orders of magnitude slower, leading to almost a 3x increase in end-to-end analysis latency. Since virtual threads are still a preview feature, it's possible that this migration can be performed at a later date when performance improves.

Another approach to consider is waiting for Loom coroutines, if they ever become available. The advantage here is that it might be possible to reduce synchronization overhead by using cooperative multitasking.

If all else fails, low-level bytecode rewriting could also be a viable alternative. With enough optimization, it might be possible to achieve performance that approaches hand-written callback code.

Appendix

Callback Hell

Callback hell is an infamous problem in asynchronous code that uses callbacks. It stems from the fact that the continuation for a subsequent step is nested within the previous step. If there are many steps, this nesting can be extremely deep. If coupled with control flow the code becomes unmanageable.

class CallbackHell implements StateMachine {
  @Override
  public StateMachine step(Tasks task) {
    doA();
    return (t, l) -> {
      doB();
      return (t1, l2) -> {
        doC();
        return DONE;
      };
    };
  }
}

One of the advantages of nested implementations is that the stack frame of the outer step can be preserved. In Java, captured lambda variables must be effectively final so using such variables can be cumbersome. Deep nesting is avoided by returning method references as continuations instead of lambdas as shown as follows.

class CallbackHellAvoided implements StateMachine {
  @Override
  public StateMachine step(Tasks task) {
    doA();
    return this::step2;
  }

  private StateMachine step2(Tasks tasks) {
    doB();
    return this::step3;
  }

  private StateMachine step3(Tasks tasks) {
    doC();
    return DONE;
  }
}

Callback hell may also occur if the runAfter injection pattern is used too densely, but this can be avoided by interspersing injections with sequential steps.

Example: Chained SkyValue lookups

It is often the case that the application logic requires dependent chains of SkyValue lookups, for example, if a second SkyKey depends on the first SkyValue. Thinking about this naively, this would result in a complex, deeply nested callback structure.

private ValueType1 value1;
private ValueType2 value2;

private StateMachine step1(...) {
  tasks.lookUp(key1, (Consumer<SkyValue>) this);  // key1 has type KeyType1.
  return this::step2;
}

@Override
public void accept(SkyValue value) {
  this.value1 = (ValueType1) value;
}

private StateMachine step2(...) {
  KeyType2 key2 = computeKey(value1);
  tasks.lookup(key2, this::acceptValueType2);
  return this::step3;
}

private void acceptValueType2(SkyValue value) {
  this.value2 = (ValueType2) value;
}

However, since continuations are specified as method references, the code looks procedural across state transitions: step2 follows step1. Note that here, a lambda is used to assign value2. This makes the ordering of the code match the ordering of the computation from top-to-bottom.

Miscellaneous Tips

Readability: Execution Ordering

To improve readability, strive to keep the StateMachine.step implementations in execution order and callback implementations immediately following where they are passed in the code. This isn't always possible where the control flow branches. Additional comments might be helpful in such cases.

In Example: Chained SkyValue lookups, an intermediate method reference is created to achieve this. This trades a small amount of performance for readability, which is likely worthwhile here.

Generational Hypothesis

Medium-lived Java objects break the generational hypothesis of the Java garbage collector, which is designed to handle objects that live for a very short time or objects that live forever. By definition, objects in SkyKeyComputeState violate this hypothesis. Such objects, containing the constructed tree of all still-running StateMachines, rooted at Driver have an intermediate lifespan as they suspend, waiting for asynchronous computations to complete.

It seems less bad in JDK19, but when using StateMachines, it's sometimes possible to observe an increase in GC time, even with dramatic decreases in actual garbage generated. Since StateMachines have an intermediate lifespan they could be promoted to old gen, causing it to fill up more quickly, thus necessitating more expensive major or full GCs to clean up.

The initial precaution is to minimize the use of StateMachine variables, but it is not always feasible, for example, if a value is needed across multiple states. Where it is possible, local stack step variables are young generation variables and efficiently GC'd.

For StateMachine variables, breaking things down into subtasks and following the recommended pattern for Propagating values between StateMachines is also helpful. Observe that when following the pattern, only child StateMachines have references to parent StateMachines and not vice versa. This means that as children complete and update the parents using result callbacks, the children naturally fall out of scope and become eligible for GC.

Finally, in some cases, a StateMachine variable is needed in earlier states but not in later states. It can be beneficial to null out references of large objects once it is known that they are no longer needed.

Naming states

When naming a method, it's usually possible to name a method for the behavior that happens within that method. It's less clear how to do this in StateMachines because there is no stack. For example, suppose method foo calls a sub-method bar. In a StateMachine, this could be translated into the state sequence foo, followed by bar. foo no longer includes the behavior bar. As a result, method names for states tend to be narrower in scope, potentially reflecting local behavior.

Concurrency tree diagram

The following is an alternative view of the diagram in Structured concurrency that better depicts the tree structure. The blocks form a small tree.

Structured Concurrency 3D


  1. In contrast to Skyframe's convention of restarting from the beginning when values are not available. 

  2. Note that step is permitted to throw InterruptedException, but the examples omit this. There are a few low methods in Bazel code that throw this exception and it propagates up to the Driver, to be described later, that runs the StateMachine. It's fine to not declare it to be thrown when unneeded. 

  3. Concurrent subtasks were motivated by the ConfiguredTargetFunction which performs independent work for each dependency. Instead of manipulating complex data structures that process all the dependencies at once, introducing inefficiencies, each dependency has its own independent StateMachine

  4. Multiple tasks.lookUp calls within a single step are batched together. Additional batching can be created by lookups occurring within concurrent subtasks. 

  5. This is conceptually similar to Java’s structured concurrency jeps/428

  6. Doing this is similar to spawning a thread and joining it to achieve sequential composition.