@dianfu (Contributor) commented Jan 9, 2026

What is the purpose of the change

This pull request introduces the rules and nodes needed to support async Python scalar functions.

Brief change log


Verifying this change

This change added tests and can be verified as follows:


Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@dianfu dianfu changed the title [FLINK-38882][table] Introduce rules for async python scalar function [FLINK-38882][table][python] Introduce rules for async python scalar function Jan 9, 2026
@flinkbot (Collaborator) commented Jan 9, 2026

CI report:

Bot commands: The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

bool profile_enabled = 4;
repeated JobParameter job_parameters = 5;
// Async execution configuration for async scalar functions
int32 async_buffer_capacity = 6; // Max number of concurrent async operations
Contributor:

I am curious how we came up with these numbers. async_max_attempts = 9 seems like a strange amount?

Contributor Author:

It just represents the index of this variable in the message.

bool async_retry_enabled = 8; // Whether retry is enabled
int32 async_max_attempts = 9; // Maximum number of retry attempts
int64 async_retry_delay_ms = 10; // Delay between retries in milliseconds
}
Contributor:

This is a bool, so I assume the value should not be 8.

Contributor Author:

It just represents the index of this variable in the message.

int64 async_timeout_ms = 7; // Timeout in milliseconds for async operations
// Async retry strategy configuration
bool async_retry_enabled = 8; // Whether retry is enabled
int32 async_max_attempts = 9; // Maximum number of retry attempts
Contributor:

I suggest including "retry" in the name to be consistent with the other variables, e.g. async_max_retry_attempts.
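To make the async fields concrete, here is a minimal Python sketch of how a worker might apply them to async UDF calls. It is an illustration under stated assumptions, not the PyFlink implementation: `AsyncOptions`, `call_with_retry` and `evaluate_all` are hypothetical names; `max_attempts` is assumed to count the first call (the naming discussion above shows this is ambiguous); the retry delay is assumed to be fixed; and `buffer_capacity` is modeled as an `asyncio.Semaphore` bounding in-flight calls.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, List


@dataclass
class AsyncOptions:
    """Hypothetical mirror of the async_* fields in UserDefinedFunctions."""
    buffer_capacity: int = 10   # max number of concurrent async operations
    timeout_ms: int = 1000      # timeout per attempt, in milliseconds
    retry_enabled: bool = False
    max_attempts: int = 3       # assumption: total attempts, first call included
    retry_delay_ms: int = 100   # assumption: fixed delay between attempts


async def call_with_retry(fn: Callable[[Any], Awaitable[Any]], arg: Any,
                          opts: AsyncOptions) -> Any:
    """Await fn(arg) with a per-attempt timeout, retrying on any exception."""
    attempts = max(1, opts.max_attempts) if opts.retry_enabled else 1
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(fn(arg), timeout=opts.timeout_ms / 1000)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(opts.retry_delay_ms / 1000)


async def evaluate_all(fn: Callable[[Any], Awaitable[Any]], args: List[Any],
                       opts: AsyncOptions) -> List[Any]:
    """Evaluate fn over args with at most buffer_capacity calls in flight."""
    slots = asyncio.Semaphore(opts.buffer_capacity)

    async def bounded(arg: Any) -> Any:
        async with slots:  # a slot stays held across retries and delays
            return await call_with_retry(fn, arg, opts)

    # gather returns results in input order, regardless of completion order.
    return await asyncio.gather(*(bounded(a) for a in args))


if __name__ == "__main__":
    calls = {}

    async def flaky_double(x: int) -> int:
        # Fails on the first call for each input, succeeds afterwards.
        calls[x] = calls.get(x, 0) + 1
        if calls[x] == 1:
            raise RuntimeError("transient failure")
        await asyncio.sleep(0.01)
        return 2 * x

    opts = AsyncOptions(buffer_capacity=2, retry_enabled=True, retry_delay_ms=10)
    print(asyncio.run(evaluate_all(flaky_double, [1, 2, 3], opts)))  # [2, 4, 6]
```

Returning results in input order via `asyncio.gather` resembles an ordered async operator; an unordered variant could instead emit results as they finish, for example with `asyncio.as_completed`.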

repeated OverWindow windows = 3;
bool profile_enabled = 4;
repeated JobParameter job_parameters = 5;
// Async execution configuration for async scalar functions
Contributor:

I suspect we need to include these values in the docs, along with a write-up of this capability.

Contributor Author:

6 is not a capacity value; it represents the index (field number) of this variable in the message UserDefinedFunctions, which the protobuf specification requires.
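The author's point can be checked against the protobuf wire format: the number after `=` is the field number, which is combined with the wire type into the key written before each encoded field, as `(field_number << 3) | wire_type`. In proto3 it never sets a value; an unset int32 or bool simply reads as 0 or false. The Python sketch below hand-encodes some of the fields from the snippets above using only the standard library. The helper names are illustrative and are not part of the protobuf runtime.

```python
# Protobuf wire types (from the protobuf encoding guide).
WIRE_VARINT = 0   # int32, int64, uint32, uint64, sint32, sint64, bool, enum
WIRE_I64 = 1      # fixed64, sfixed64, double
WIRE_LEN = 2      # string, bytes, embedded messages, packed repeated fields
WIRE_I32 = 5      # fixed32, sfixed32, float


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    if value < 0:
        raise ValueError("this sketch only encodes non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def field_key(field_number: int, wire_type: int) -> bytes:
    """Key written before each field: (field_number << 3) | wire_type, as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Key plus payload for a varint-typed scalar field (int32/int64/bool)."""
    return field_key(field_number, WIRE_VARINT) + encode_varint(value)


if __name__ == "__main__":
    # Field numbers copied from the UserDefinedFunctions snippets in this PR.
    fields = {
        "async_buffer_capacity": 6,
        "async_timeout_ms": 7,
        "async_retry_enabled": 8,
        "async_max_attempts": 9,
        "async_retry_delay_ms": 10,
    }
    for name, number in fields.items():
        print(f"{name:<22} = {number:>2}  -> key 0x{field_key(number, WIRE_VARINT).hex()}")
    # The value is encoded separately after the key.
    print(encode_varint_field(6, 100).hex())  # 3064: key 0x30, then 100 (0x64)
```

Running it prints keys 0x30 through 0x50 for fields 6 to 10. All of these fields are int32, int64 or bool, so they share wire type 0 (varint), and the `= 8` on the bool only decides that its key is 0x40.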

PythonCalcSplitRule.REWRITE_PROJECT,
PythonMapRenameRule.INSTANCE,
PythonMapMergeRule.INSTANCE,
// Splits Python calc which contains Python async scalar functions and other Python functions
Contributor:

For my understanding: why do we need Python-specific rules? What in the existing rules does not work for Python?

Contributor Author:

Good point. Removed.

@github-actions github-actions bot added the community-reviewed PR has been reviewed by the community. label Jan 12, 2026

Labels

community-reviewed PR has been reviewed by the community.

3 participants