
Extending benchmarking to allow pre-schedules #65

Open
fschlimb wants to merge 12 commits into llvm:main from fschlimb:mult_sched

Conversation

@fschlimb
Contributor

@fschlimb fschlimb commented Mar 6, 2026

Sharding modifies function signatures to make them operate on partitions, not the whole tensor.

The benchmark utility extracted the payload's signature from the unpartitioned IR and created the wrapper function based on that.
This of course breaks when it tries to call the partitioned function with global shapes.

This PR allows workloads to return more than one schedule. If so, the benchmark will apply the first, add the benchmark wrapper and finally apply all the remaining schedules.

The mlp-mpi example can now be properly benchmarked. To this end, the payload function was adjusted to accept a return argument instead of returning a tensor, and it provides two schedules. All other examples simply return a list with a single schedule. I also use a schedule for tiling/vectorizing matmuls provided by @adam-smnk.

Requires a fix in MLIR to make mlp-mpi.py pass.


Copilot AI left a comment


Pull request overview

Extends the Lighthouse workload/benchmarking interface to support applying multiple transform schedules, enabling “pre-schedules” to run before the benchmark wrapper is emitted (to accommodate signature changes from sharding/partitioning).

Changes:

  • Rename workload API from schedule_module() to schedule_modules() and apply schedules sequentially during lowering.
  • Update benchmarking flow to optionally apply the first schedule before emitting the benchmark wrapper, then apply remaining schedules.
  • Update examples (XeGPU and mlp-mpi) to return schedule lists; adjust mlp-mpi payload/signature and pipeline to support benchmarking.
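In rough pseudo-Python, the updated benchmark flow looks like this. This is a sketch, not the actual runner code: apply_schedule and emit_benchmark_wrapper are hypothetical stand-ins for the MLIR machinery, and the "IR" is just a list of tags so the ordering is visible.

```python
# Hypothetical stand-ins for the MLIR-based helpers; the "IR" here is a
# plain list of tags so the application order is easy to see.
def apply_schedule(ir, schedule):
    return ir + [f"applied:{schedule}"]

def emit_benchmark_wrapper(ir):
    return ir + ["benchmark-wrapper"]

def benchmark(ir, schedules):
    assert isinstance(schedules, list) and schedules
    # The first (pre-)schedule runs before the wrapper is emitted, so
    # signature changes (e.g. from sharding/partitioning) are visible
    # when the wrapper's call is constructed.
    ir = apply_schedule(ir, schedules[0])
    ir = emit_benchmark_wrapper(ir)
    # The remaining schedules finish the lowering afterwards.
    for sched in schedules[1:]:
        ir = apply_schedule(ir, sched)
    return ir
```

For example, `benchmark([], ["shard", "lower"])` yields `["applied:shard", "benchmark-wrapper", "applied:lower"]`: the wrapper is built against the already-partitioned signature.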

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

File — Description
lighthouse/workload/workload.py Updates the Workload interface to return multiple schedules and applies them in order during lowering.
lighthouse/workload/runner.py Updates benchmark() to support pre-schedules before emitting the benchmark wrapper; adds an execution print.
examples/xegpu/mlp.py Migrates to schedule_modules() returning a single schedule in a list.
examples/xegpu/matmul.py Migrates to schedule_modules() returning a single schedule in a list.
examples/workload/example.py Migrates to schedule_modules() and returns a list, but still has an early return path that returns a single module.
examples/mlp-mpi/mlp_weight_stationary.py Adjusts payload signature to take an explicit destination argument and uses bufferization materialization into destination.
examples/mlp-mpi/mlp-mpi.py Switches example driver to use benchmark() and splits schedule into pre/main schedules for signature-sensitive benchmarking.
Comments suppressed due to low confidence (1)

examples/workload/example.py:151

  • schedule_modules is now expected to return list[ir.Module], but this implementation still annotates -> ir.Module and (more importantly) returns a bare schedule_module when stop_at_stage == "bufferized" (line 151). This will violate the new interface and will fail at runtime due to the assert isinstance(schedule_modules, list) in Workload.lower_payload. Update the return annotation and make the early return return a list as well (or restructure to avoid returning from inside the insertion-point block).
    def schedule_modules(
        self, stop_at_stage: Optional[str] = None, parameters: Optional[dict] = None
    ) -> ir.Module:
        schedule_module = ir.Module.create()
        schedule_module.operation.attributes["transform.with_named_sequence"] = (
            ir.UnitAttr.get()
        )
        with ir.InsertionPoint(schedule_module.body):
            named_sequence = transform.named_sequence(
                "__transform_main",
                [transform.AnyOpType.get()],
                [],
                arg_attrs=[{"transform.readonly": ir.UnitAttr.get()}],
            )
            with ir.InsertionPoint(named_sequence.body):
                anytype = transform.AnyOpType.get()
                func = match(named_sequence.bodyTarget, ops={"func.func"})
                mod = transform.get_parent_op(
                    anytype,
                    func,
                    op_name="builtin.module",
                    deduplicate=True,
                )
                mod = apply_registered_pass(mod, "one-shot-bufferize")
                mod = apply_registered_pass(mod, "convert-linalg-to-loops")
                transform.apply_cse(mod)
                canonicalize(mod)

                if stop_at_stage == "bufferized":
                    transform.YieldOp()
                    return schedule_module


@tkarna
Contributor

tkarna commented Mar 9, 2026

The benchmark utility extracted the payload's signature from the unpartitioned IR and created the wrapper function based on that. This of course breaks when it tries to call the partitioned function with global shapes.

This PR allows workloads to return more than one schedule. If so, the benchmark will apply the first, add the benchmark wrapper and finally apply all the remaining schedules.

The proposed fix indeed requires that we're able to interrupt the lowering schedule in the middle, do something (i.e., emit the benchmark function with local memref shapes), and then do the rest of the lowering. The workload.get_schedules method now returns a list of schedule modules and it is assumed that we need to apply the first one to get to sharded IR code. I'm fine accepting this as a temporary solution but in the long run we'd need something more generic/less brittle.

One possible solution is that returned "schedules" is a dictionary with (tag, schedule) pairs. For uninterrupted lowering we just apply all the schedules in order (in python >=3.7 dictionaries preserve order). If we need to interrupt at sharded state, we check whether the schedules contain a tag "sharded" and if so we know where to stop.
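A minimal sketch of this tagged-schedule idea, assuming schedules are plain callables rather than transform modules (apply_schedules and the tag names are hypothetical, for illustration only):

```python
# Apply (tag, schedule) pairs in order, optionally stopping after a
# given tag. Dicts preserve insertion order in Python >= 3.7, so plain
# iteration applies the schedules in the order the workload listed them.
def apply_schedules(ir, schedules, stop_after=None):
    for tag, apply in schedules.items():
        ir = apply(ir)
        if tag == stop_after:  # e.g. stop_after="sharded" for benchmarking
            break
    return ir

# Hypothetical workload schedules, as simple list-appending callables.
schedules = {
    "sharded": lambda ir: ir + ["sharded"],
    "lowered": lambda ir: ir + ["lowered"],
}
```

Uninterrupted lowering is then `apply_schedules([], schedules)`; interrupting at the sharded state is `apply_schedules([], schedules, stop_after="sharded")`.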

Alternative solution

For the present benchmark func issue, another possible solution is to copy the payload func shard annotations (if any) to the benchmark function. Then we can just apply the whole pipeline in one go, and the benchmark func will get converted from global to local shapes correctly. We'd avoid splitting the schedule but would still have a sharding-specific change in the runner.

@fschlimb
Contributor Author

fschlimb commented Mar 9, 2026

The proposed fix indeed requires that we're able to interrupt the lowering schedule in the middle, do something (i.e., emit the benchmark function with local memref shapes), and then do the rest of the lowering. The workload.get_schedules method now returns a list of schedule modules and it is assumed that we need to apply the first one to get to sharded IR code. I'm fine accepting this as a temporary solution but in the long run we'd need something more generic/less brittle.

One possible solution is that returned "schedules" is a dictionary with (tag, schedule) pairs. For uninterrupted lowering we just apply all the schedules in order (in python >=3.7 dictionaries preserve order). If we need to interrupt at sharded state, we check whether the schedules contain a tag "sharded" and if so we know where to stop.

Yes, a nicer mechanism would be welcome.
However, to my mind the proposed alternative would be even worse because it is not a generic feature, but a very specific one. Replacing a context-free but "brittle" protocol with one that relies on something super generic (a dict) but then introduces a super-specific flag ("sharded") doesn't seem to improve the situation.

Alternative solution

For the present benchmark func issue, another possible solution is to copy the payload func shard annotations (if any) to the benchmark function. Then we can just apply the whole pipeline in one go, and the benchmark func will get converted from global to local shapes correctly. We'd avoid splitting the schedule but would still have a sharding-specific change in the runner.

How do you suggest to copy the shard annotations? A shard-specific transform-schedule? What's bad about allowing multiple schedules? Whether schedules or pipelines, being able to combine more than one to me looks like a useful feature no matter what.

@fschlimb fschlimb requested a review from rolfmorel March 9, 2026 17:28
@fschlimb fschlimb assigned rengolin and adam-smnk and unassigned rengolin and adam-smnk Mar 9, 2026
@fschlimb fschlimb requested review from adam-smnk and rengolin March 9, 2026 17:28
from mlir.dialects.transform import x86


def _cleanup(target):
Member


This already exists partially in the pipeline helper. Better to improve that one and reuse here.

# This allows for modifying function signature before extracting it for the benchmark function.
schedule_modules = workload.schedule_modules(parameters=schedule_parameters)
assert isinstance(schedule_modules, list)
if len(schedule_modules) > 1:
Member


I don't like the idea of separating the schedules numerically. If there are two distinct stages in which they run, I'd make them into separate lists, and apply all schedules in the first list here, and all in the second list down there.

Contributor Author

@fschlimb fschlimb Mar 9, 2026


I don't like the idea of a list of lists in this case. What is between first and second is a pure artefact of how the benchmarking is implemented. Other tools might require other separations. I don't think there is a clear way to make this generic enough without relying on something that identifies arbitrary (maybe contiguous) subsets (like with numeric offsets or dicts).

As discussed elsewhere, a better approach might be to put things which modify the IR (like inserting the benchmark wrapper) into separate schedules and leave it to the workload to put various schedules in the right order.

I can try to implement that with @rolfmorel's latest extensions to MLIR.

Member

@rengolin rengolin Mar 10, 2026


Not a list of lists, two separate lists. Or one list that is always applied in full, and the user does that twice.

Encoding the stage in which things run on the list index is asking for trouble, and it becomes literally a list of lists.

@fschlimb
Contributor Author

The latest commit (c99351f) keeps the list of schedules, but always applies all of them. This allows easy re-use of predefined schedules (bundles).

This can easily be extended to allow pipeline (managers) as well.

The benchmark wrapper is now generated by a schedule (thanks @rolfmorel for your help). Workloads which want to be applied to the benchmark function add this schedule to their list of schedules.
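Under this scheme the runner's job reduces to applying every schedule in order. A sketch under the same assumptions as above (schedules as plain callables; lower, bench_wrapper, and the tag names are hypothetical):

```python
# The runner always applies all schedules, in order; there is no
# special-cased split point anymore.
def lower(ir, schedules):
    for apply in schedules:
        ir = apply(ir)
    return ir

# A workload that wants benchmarking places the wrapper-emitting
# schedule at the right position in its own list.
bench_wrapper = lambda ir: ir + ["bench-wrapper"]
workload_schedules = [
    lambda ir: ir + ["shard"],   # pre-schedule that changes signatures
    bench_wrapper,               # emit wrapper against sharded signature
    lambda ir: ir + ["lower"],   # remaining lowering
]
```

`lower([], workload_schedules)` then produces `["shard", "bench-wrapper", "lower"]`, with the ordering fully owned by the workload.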

Contributor

@tkarna tkarna left a comment


Thanks, only minor comments. I think this use case shows that returning a list of schedules is an appropriate solution (at least for the time being).

Comment on lines +44 to +48
# get execution engine, rtclock requires mlir_c_runner
libs = workload.shared_libs()
c_runner_lib = "libmlir_c_runner_utils.so"
if c_runner_lib not in libs:
libs.append(c_runner_lib)
Contributor


nit: Strictly speaking we only need rtclock if we are using the benchmark routine which this code path does not do. Thus we could drop the bench wrapper schedule if we just want to execute the kernel.

Contributor


This would imply that now the schedule generation does not only depend on the payload and target but also on the intent.

Contributor Author


Yes, this is not ideal. I couldn't come up with a better tradeoff yet. The next best thing I could think of is to make get_schedules accept a flag indicating whether the schedule should include bench generation.
It seems difficult to find a mechanism that does not get specific about bench or shard.

Contributor


Yes, this is good enough for now. Maybe we can better handle this use case once we can construct the flow with some smaller abstractions.

scf.yield_(())

benchmark.func_op.attributes["llvm.emit_c_interface"] = ir.UnitAttr.get()
args = payload_arguments + [time_memref_t, index_t, index_t]
Contributor


Now that we pass the nruns and nwarmup as ints to the function, the user is responsible for passing in a time_memref that is big enough. Maybe we should assert this somewhere?
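One place such a check could live is a small host-side guard before invoking the wrapper. The names below (run_benchmark, invoke) are hypothetical, for illustration; the real call goes through the MLIR execution engine:

```python
def run_benchmark(invoke, times, nruns: int, nwarmup: int):
    # The wrapper writes one timing per measured run, so the caller's
    # time buffer must hold at least nruns entries; fail early here
    # rather than letting native code write out of bounds.
    assert len(times) >= nruns, (
        f"time buffer holds {len(times)} entries, need {nruns}"
    )
    invoke(times, nruns, nwarmup)
```

This keeps the check on the Python side, where a clear error message is cheap, instead of adding a runtime assertion to the generated IR.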

Contributor Author


Runtime assertion? Looks a bit over the top to me.

Comment on lines +12 to +17
def rewrite_pattern(patterns: dict, pname: str):
"""Return a rewrite pattern class that can be registered with MLIR.
The patterns dict should map op names to their corresponding match and rewrite functions."""

@ext.register_operation(PatternDialect, replace=True)
class RewritePattern(PatternDialect.Operation, name=pname):
Contributor


Suggested change
    -def rewrite_pattern(patterns: dict, pname: str):
    -    """Return a rewrite pattern class that can be registered with MLIR.
    -    The patterns dict should map op names to their corresponding match and rewrite functions."""
    -
    -    @ext.register_operation(PatternDialect, replace=True)
    -    class RewritePattern(PatternDialect.Operation, name=pname):
    +@ext.register_operation(PatternDialect)
    +class RewritePattern(PatternDialect.Operation, name=pname):
    +    pattern_name: ir.StringAttr
    +    priority: Optional[ir.IntegerAttr[32]]

How about generalizing this a bit and moving it to lighthouse.dialects.transform_ext like in #68?

That is, we can have a transform_ext.populate_pattern "PATTERN_NAME" { priority = PRIO } op which uses the "PATTERN_NAME" StringAttr to lookup a pattern and add it to patternset passed to .populate_patterns(op, patternset) at priority PRIO (or 1 in case the attr is not there). Inside .populate_patterns(op, patternset), the lookup of the pattern can then happen on the RewritePattern class (not instance), e.g. via RewritePattern.named_patterns : dict[str, Callable] where the Callable values have the rewrite pattern signature.

@fschlimb, what do you think? Happy to help do this.

Contributor Author


Sure why not. Do you want to do this in a follow-up PR?

Contributor


Sure. If we merge this, I can clean it up in #68 before that goes in.

@fschlimb
Contributor Author

Needs llvm/llvm-project#186158 to get green CI.
