
Conversation

@yzh119
Collaborator

@yzh119 yzh119 commented Nov 14, 2025

📌 Description

This PR implements the refactor mentioned in
https://github.com/flashinfer-ai/flashinfer/pull/1969/files#r2461856020

In our current design we rely on calling pip show flashinfer-python 2>/dev/null || uv pip show flashinfer-python 2>/dev/null to obtain the DeepGEMM JIT include directory, which is error-prone (e.g. it fails if the user does not have pip available in the environment). In this PR we pass the DeepGEMM JIT include directory through Python APIs instead.
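For illustration, the Python side of this change could look roughly like the sketch below; the import path and helper name are assumptions, while set_deepgemm_jit_include_dirs and the FLASHINFER_CSRC_DIR/nv_internal/tensorrt_llm path come from the changes in this PR.

  # Hypothetical sketch; only the exported function name and the include path
  # are taken from this PR, the surrounding wiring is illustrative.
  import os

  from flashinfer.jit import env as jit_env  # assumed import path


  def _set_deepgemm_include_dirs(module) -> None:
      """Pass the DeepGEMM JIT include directory to the loaded fused MoE module."""
      include_dir = os.path.join(
          str(jit_env.FLASHINFER_CSRC_DIR), "nv_internal", "tensorrt_llm"
      )
      module.set_deepgemm_jit_include_dirs([include_dir])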

🔍 Related Issues

#1969

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

cc @djmmoss @jiahanc @nvmbreughe

Summary by CodeRabbit

  • New Features

    • Fused MoE modules now set DeepGEMM JIT include directories at runtime; a runtime-exposed entry point lets the module provide include paths during initialization.
    • Module generation includes the DeepGEMM JIT setup so those paths are applied before runtime use.
  • Chores

    • JIT compiler API updated to centrally manage and accept external include directories.
    • Minor header/build adjustments to support the new initialization flow.


@coderabbitai
Contributor

coderabbitai bot commented Nov 14, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a TVM FFI function to set DeepGEMM JIT include directories, centralizes include-dir state in the DeepGEMM compiler via a static accessor/mutator, updates fused MoE build inputs and runtime initialization to call the new FFI function, and adds a small CUDA header include.

Changes

  • DeepGEMM compiler API & state (csrc/nv_internal/tensorrt_llm/deep_gemm/compiler.cuh): Replace the instance member includeDirs_ with a lazily-initialized static getJitIncludeDirs() returning a reference; add setJitIncludeDirs() and a public static Compiler::setIncludeDirs(); update the constructor and isValid(), and replace uses with the static accessor.
  • CUDA FFI bindings, new setup & binding swap (csrc/fused_moe/cutlass_backend/deepgemm_jit_setup.cu, csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu): Add a new exported function set_deepgemm_jit_include_dirs (a wrapper in the flashinfer namespace) that converts tvm::ffi::Array<tvm::ffi::String> to std::vector<std::filesystem::path> and calls deep_gemm::jit::Compiler::setIncludeDirs(); swap which binding source is used by the module (see the sketch after this list).
  • Python runtime init (flashinfer/fused_moe/core.py): After loading the Cutlass fused MoE backend, import the JIT env and call the exported set_deepgemm_jit_include_dirs() with the constructed DeepGEMM include path (FLASHINFER_CSRC_DIR/nv_internal/tensorrt_llm).
  • JIT build inputs (flashinfer/jit/fused_moe.py): Swap the binding kernel source to flashinfer_cutlass_fused_moe_binding.cu and add deepgemm_jit_setup.cu to the CUDA sources compiled into the fused MoE module.
  • Header include tweak (csrc/nv_internal/tensorrt_llm/deep_gemm/jit_utils.cuh): Add #include <cuda.h> at the top of the header.
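To make the new flow concrete, here is a minimal sketch of what the FFI setup source could look like. It follows the conversion and export described above; the header paths, the export macro name, and other details are assumptions rather than the actual file contents.

  // Hypothetical sketch of csrc/fused_moe/cutlass_backend/deepgemm_jit_setup.cu.
  // Includes are approximate; the real file may rely on flashinfer's
  // tvm_ffi_utils.h helper and a different export macro.
  #include <filesystem>
  #include <string>
  #include <vector>

  #include <tvm/ffi/container/array.h>
  #include <tvm/ffi/function.h>
  #include <tvm/ffi/string.h>

  #include "tensorrt_llm/deep_gemm/compiler.cuh"  // path is an assumption

  namespace flashinfer {

  // Convert the FFI string array into filesystem paths and hand them to the
  // DeepGEMM JIT compiler before any JIT compilation runs.
  void set_deepgemm_jit_include_dirs(tvm::ffi::Array<tvm::ffi::String> include_dirs) {
    std::vector<std::filesystem::path> dirs;
    dirs.reserve(include_dirs.size());
    for (const auto& dir : include_dirs) {
      dirs.emplace_back(std::string(dir));
    }
    deep_gemm::jit::Compiler::setIncludeDirs(dirs);
  }

  }  // namespace flashinfer

  // Export the function under the name the Python side calls.
  TVM_FFI_DLL_EXPORT_TYPED_FUNC(set_deepgemm_jit_include_dirs,
                                flashinfer::set_deepgemm_jit_include_dirs);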

Sequence Diagram(s)

sequenceDiagram
    participant Py as Python initializer
    participant TVM as TVM FFI layer
    participant Binding as deepgemm_jit_setup (CUDA binding)
    participant Compiler as deep_gemm::jit::Compiler

    Py->>TVM: call set_deepgemm_jit_include_dirs([path])
    TVM->>Binding: invoke exported function
    Binding->>Binding: convert tvm::ffi::Array<String> → vector<filesystem::path>
    Binding->>Compiler: Compiler::setIncludeDirs(paths)
    Compiler->>Compiler: set static JIT include dirs (getJitIncludeDirs ref)
    Compiler-->>Binding: return
    Binding-->>TVM: return
    TVM-->>Py: return

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review thread-safety and lifetime of the static vector returned by getJitIncludeDirs().
  • Verify the FFI string → path conversion, handling of empty/invalid entries, and that the exported symbol name matches Python usage.
  • Confirm the Python call happens before any JIT compilation paths run.

Possibly related PRs

Suggested reviewers

  • djmmoss
  • jiahanc
  • nvmbreughe
  • yongwww
  • wenscarl
  • aleozlx

Poem

🐰 I hop through code with tiny paws so fleet,

I stitch include paths where C++ and Python meet,
A bind, a call, the JIT awakes with cheer,
Static paths in place, the compiler purrs near,
Hooray — a rabbit's nibble, neat and sweet 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 25.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly summarizes the main change: refactoring how the DeepGEMM include directory is passed, using Python APIs instead of shell commands.
  • Description check: ✅ Passed. The PR description is well-structured, addresses all required template sections, explains the rationale for the change, references related issues, and includes pre-commit verification and testing notes.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90a0149 and bcd9bb7.

📒 Files selected for processing (2)
  • flashinfer/fused_moe/core.py (1 hunks)
  • flashinfer/jit/fused_moe.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • flashinfer/fused_moe/core.py
  • flashinfer/jit/fused_moe.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs


@gemini-code-assist
Contributor

Summary of Changes

Hello @yzh119, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness of DeepGEMM JIT compilation by eliminating a fragile dependency on shell commands for discovering include directories. Instead, it establishes a direct communication channel where Python APIs programmatically provide the necessary paths to the C++ DeepGEMM compiler, ensuring consistent and reliable behavior across diverse user environments.

Highlights

  • Refactored DeepGEMM Include Path Discovery: The method for obtaining DeepGEMM JIT include directories has been refactored from relying on error-prone shell commands (pip show or uv pip show) to using more robust Python APIs.
  • New C++ API for Include Paths: A new C++ function set_deepgemm_jit_include_dirs is now exposed via TVM_FFI, allowing Python to directly pass include directory paths to the DeepGEMM compiler.
  • DeepGEMM Compiler Update: The deep_gemm::jit::Compiler in C++ now includes a static setIncludeDirs method to receive the include paths programmatically, removing the need for internal shell command execution.
  • Python Integration: The Python codebase (flashinfer/fused_moe/core.py) has been updated to retrieve the necessary include path from the jit_env and pass it to the C++ backend using the new API.
  • File Renaming: The file flashinfer_cutlass_fused_moe_sm100_binding.cu has been renamed to flashinfer_cutlass_fused_moe_binding.cu.
  • Macro Renaming for Conflict Avoidance: The CHECK_CUDA macro in jit_utils.cuh and runtime.cuh has been renamed to CHECK_CUDA_DRIVER and guarded with #ifndef to prevent potential redefinition conflicts.
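The guarded rename described in the last bullet might look roughly like the following; the exact error handling in jit_utils.cuh and runtime.cuh is not reproduced here and is an assumption.

  // Hypothetical sketch of the renamed, guarded CUDA Driver API check macro.
  #include <cuda.h>

  #include <cstdio>
  #include <cstdlib>

  #ifndef CHECK_CUDA_DRIVER
  #define CHECK_CUDA_DRIVER(call)                                        \
    do {                                                                 \
      CUresult status_ = (call);                                         \
      if (status_ != CUDA_SUCCESS) {                                     \
        const char* msg_ = nullptr;                                      \
        cuGetErrorString(status_, &msg_);                                \
        std::fprintf(stderr, "CUDA driver error %d (%s) at %s:%d\n",     \
                     static_cast<int>(status_), msg_ ? msg_ : "unknown", \
                     __FILE__, __LINE__);                                \
        std::abort();                                                    \
      }                                                                  \
    } while (0)
  #endif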
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review - Performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary - Provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist - Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help - Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors how the deepgemm JIT include directory is located, moving from a shell-based discovery (pip show) to a more robust approach where the path is passed from Python to the C++ backend. This is a great improvement, making the build process less dependent on the environment. The changes are well-contained and logical. I've found a potential race condition in the Compiler singleton initialization that could lead to JIT compilation failures depending on call order, and I've suggested a fix. I also have a minor suggestion to improve performance in one of the new C++ functions. Overall, a solid refactoring.

Comment on lines 238 to 243
[[nodiscard]] bool isValid() const { return !includeDirs_.empty(); }

// Set include directories before the singleton is initialized
static void setIncludeDirs(std::vector<std::filesystem::path> const& dirs) {
  setJitIncludeDirs(dirs);
}
Contributor


high

There's a potential race condition here depending on the initialization order. The Compiler singleton's constructor copies the JIT include directories into the includeDirs_ member. If getInstance() is called before setIncludeDirs(), includeDirs_ will be initialized with an empty vector and will not be updated when setIncludeDirs() is called later. This would cause JIT compilation to fail because of missing include paths.

To make this more robust, I suggest removing the includeDirs_ member and directly using getJitIncludeDirs() in isValid() and build(). This ensures that the most up-to-date include directories are always used.

This would involve:

  1. Removing the includeDirs_ member variable.
  2. Updating the constructor to remove the initialization of includeDirs_.
  3. Updating isValid() as suggested below.
  4. Updating build() to loop over getJitIncludeDirs() instead of includeDirs_.
  [[nodiscard]] bool isValid() const { return !getJitIncludeDirs().empty(); }

  // Set include directories before the singleton is initialized
  static void setIncludeDirs(std::vector<std::filesystem::path> const& dirs) {
    setJitIncludeDirs(dirs);
  }

Comment on lines 1207 to 1210
std::vector<std::filesystem::path> dirs;
for (const auto& dir : include_dirs) {
  dirs.push_back(std::filesystem::path(std::string(dir)));
}
Contributor


medium

For a slight performance improvement and to follow modern C++ best practices, you can reserve memory for the vector to avoid potential reallocations and use emplace_back to construct the path in-place.

  std::vector<std::filesystem::path> dirs;
  dirs.reserve(include_dirs.size());
  for (const auto& dir : include_dirs) {
    dirs.emplace_back(std::string(dir));
  }

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a79b78 and b36e1fb.

📒 Files selected for processing (6)
  • csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (2 hunks)
  • csrc/nv_internal/tensorrt_llm/deep_gemm/compiler.cuh (2 hunks)
  • csrc/nv_internal/tensorrt_llm/deep_gemm/jit_utils.cuh (2 hunks)
  • csrc/nv_internal/tensorrt_llm/deep_gemm/runtime.cuh (2 hunks)
  • flashinfer/fused_moe/core.py (1 hunks)
  • flashinfer/jit/fused_moe.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.

Applied to files:

  • csrc/nv_internal/tensorrt_llm/deep_gemm/jit_utils.cuh
  • csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu
🧬 Code graph analysis (1)
flashinfer/fused_moe/core.py (1)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (2)
  • set_deepgemm_jit_include_dirs (1206-1212)
  • set_deepgemm_jit_include_dirs (1206-1206)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (5)
flashinfer/jit/fused_moe.py (1)

149-149: LGTM - Binding file path updated

The change aligns with the new unified binding file that includes the set_deepgemm_jit_include_dirs FFI export.

csrc/nv_internal/tensorrt_llm/deep_gemm/runtime.cuh (1)

68-68: LGTM - Consistent use of CUDA Driver API

The migration from CUDA Runtime API to CUDA Driver API is consistent throughout the file.

csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu (1)

1204-1217: LGTM - FFI binding correctly forwards include directories

The implementation properly converts TVM array types to C++ filesystem paths and forwards them to the DeepGEMM compiler.

csrc/nv_internal/tensorrt_llm/deep_gemm/jit_utils.cuh (1)

42-54: LGTM - Proper macro guard prevents conflicts

The #ifndef guard is a good practice to prevent macro redefinition errors, especially given the noted potential conflict with tvm_ffi_utils.h.

flashinfer/fused_moe/core.py (1)

327-334: Verify initialization order to prevent race condition

The DeepGEMM include directories are set after the module is loaded. The Compiler singleton in compiler.cuh copies the static include dirs vector to a member variable includeDirs_ during initialization (line 475). If the Compiler singleton is initialized anywhere during the module load process (before line 333 executes), the member variable won't be updated with the correct paths, causing compilation failures.

Run this script to verify that no code path during module loading triggers Compiler::getInstance():

Consider documenting the initialization order requirement or refactoring the Compiler to use the static include dirs directly instead of copying to a member variable.

@yzh119
Collaborator Author

yzh119 commented Nov 17, 2025

/bot run

@flashinfer-bot
Collaborator

GitLab MR !142 has been created, and the CI pipeline #38602487 is currently running. I'll report back once the pipeline job completes.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/deep_gemm/compiler.cuh (1)

124-132: Consider thread safety for concurrent access.

The static vector is now mutable through setJitIncludeDirs() without synchronization. If setIncludeDirs() is called concurrently with build() operations or from multiple threads, there could be data races when reading/writing the vector.

Given the typical usage pattern (set once during initialization, then read during compilation), this is likely acceptable, but consider adding a comment documenting that setIncludeDirs() should be called during initialization before any compilation occurs.
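One possible shape for the accessor pair, with the documentation comment the review asks for (a sketch only; the actual compiler.cuh code is not reproduced here):

  // Sketch of the static include-dir state; the names getJitIncludeDirs /
  // setJitIncludeDirs follow the walkthrough, the bodies are assumptions.
  #include <filesystem>
  #include <vector>

  namespace deep_gemm::jit {

  // Lazily-initialized storage shared by all Compiler consumers.
  inline std::vector<std::filesystem::path>& getJitIncludeDirs() {
    static std::vector<std::filesystem::path> include_dirs;
    return include_dirs;
  }

  // NOTE: not synchronized. Call during initialization, before any JIT
  // compilation (i.e. before Compiler::build() can run on any thread).
  inline void setJitIncludeDirs(std::vector<std::filesystem::path> const& dirs) {
    getJitIncludeDirs() = dirs;
  }

  }  // namespace deep_gemm::jit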

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b36e1fb and 482b3b8.

📒 Files selected for processing (3)
  • csrc/fused_moe/cutlass_backend/deepgemm_jit_setup.cu (1 hunks)
  • csrc/nv_internal/tensorrt_llm/deep_gemm/compiler.cuh (4 hunks)
  • flashinfer/jit/fused_moe.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (7)
csrc/nv_internal/tensorrt_llm/deep_gemm/compiler.cuh (4)

124-132: LGTM! Race condition from past reviews is now resolved.

The implementation correctly addresses the initialization race condition raised in previous reviews. By returning a reference to the static vector and providing a setter, the code ensures that all consumers see the same up-to-date include directories without any stale copies.


238-238: LGTM! Correct usage of the static vector.

The code now correctly calls getJitIncludeDirs() directly instead of using a potentially stale member copy. This ensures that isValid() and build() always see the current include directories.

Also applies to: 316-316


240-243: LGTM! Clean API for external configuration.

The new setIncludeDirs() public static method provides a clean interface for external code to configure include directories before compilation, which aligns well with the PR's objective of passing the directory through Python APIs.


473-473: LGTM! Constructor simplified correctly.

The constructor no longer copies the static include directories to a member variable, which was the source of the initialization race condition. The removal is correct and complete.

flashinfer/jit/fused_moe.py (1)

149-151: LGTM! Correct integration of the new DeepGEMM setup.

The change correctly integrates the new deepgemm_jit_setup.cu source file into the module generation, making the FFI function set_deepgemm_jit_include_dirs available for configuration. The binding file change from the SM100-specific version to the generic version aligns with the refactoring objective.

csrc/fused_moe/cutlass_backend/deepgemm_jit_setup.cu (2)

25-31: LGTM! Clean TVM FFI wrapper implementation.

The function correctly converts TVM array of strings to a std::vector<std::filesystem::path> and forwards to the Compiler API. The implementation is straightforward and properly integrates the Python-to-C++ configuration path.


35-36: LGTM! Correct TVM FFI export.

The macro correctly exports the function as set_deepgemm_jit_include_dirs, making it callable from Python/TVM.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #38602487: 1/17 passed

@yzh119
Collaborator Author

yzh119 commented Nov 17, 2025

/bot run

@flashinfer-bot
Collaborator

GitLab MR !142 has been updated with latest changes, and the CI pipeline #38631200 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELING] Pipeline #38631200: canceled

@yzh119
Collaborator Author

yzh119 commented Nov 17, 2025

/bot run

@flashinfer-bot
Collaborator

GitLab MR !142 has been updated with latest changes, and the CI pipeline #38637572 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #38637572: 10/18 passed

@yzh119
Collaborator Author

yzh119 commented Nov 18, 2025

Failed UTs are not relevant (will be fixed in #2097).

@yzh119 yzh119 enabled auto-merge (squash) November 18, 2025 07:54
Collaborator

@cyx-6 cyx-6 left a comment


LGTM

@yzh119 yzh119 requested a review from aleozlx as a code owner November 28, 2025 03:42
@yzh119
Collaborator Author

yzh119 commented Nov 28, 2025

/bot run

@flashinfer-bot
Collaborator

GitLab MR !142 has been updated with latest changes, and the CI pipeline #39288620 is currently running. I'll report back once the pipeline job completes.
