@lovedheart

Improve Qwen3Next inference speed

Fix Qwen3Next inference speed.

@pwilkin: I don't have a discrete GPU. Could you check whether this PR improves the speed?

main

============================================================
                    Performance Report
============================================================
Operation Name           Total(ms)   Calls     Avg(ms)        Pct(%)
------------------------------------------------------------
attn_calc                26.204      3408      0.008          22.78
ffn_calc                 23.581      3408      0.007          20.50
attn_calc_linear         19.204      2556      0.008          16.69
fwd_exp_upd_recur_sts    15.883      2556      0.006          13.81   <- Bottleneck
build_delta_net          11.294      2556      0.004          9.82
build_dn_recur           8.553       2484      0.003          7.43
attn_calc_full           6.771       852       0.008          5.89
build_dn_chunking        2.575       72        0.036          2.24
decay_mask               0.314       2484      0.000          0.27
broadcast                0.160       2484      0.000          0.14
ssm_conv                 0.144       2556      0.000          0.13
value_mul_mat            0.107       2484      0.000          0.09
solve_tri_recur          0.097       2484      0.000          0.08
cumsum_dn                0.074       2484      0.000          0.06
fwd_expand               0.049       71        0.001          0.04
ggml_exp                 0.028       2484      0.000          0.02
------------------------------------------------------------
Total                    115.038     100.00
============================================================

Performance Bottleneck Analysis:
- Most time-consuming operation: attn_calc (26.204 ms)   <- Bottleneck; this should be ffn_calc
- Longest average time: build_dn_chunking (0.036 ms)

PR

============================================================
                    Performance Report
============================================================
Operation Name           Total(ms)   Calls     Avg(ms)        Pct(%)
------------------------------------------------------------
ffn_calc                 56.509      4800      0.012          47.68
build_delta_net          15.740      3600      0.004          13.28
attn_calc                14.720      4800      0.003          12.42
build_dn_recur           13.227      3528      0.004          11.16
attn_calc_full           9.955       1200      0.008          8.40
attn_calc_linear         4.460       3600      0.001          3.76
build_dn_chunking        2.197       72        0.031          1.85
ssm_conv                 0.540       3600      0.000          0.46
decay_mask               0.312       3528      0.000          0.26
value_mul_mat            0.251       3528      0.000          0.21
broadcast                0.159       3528      0.000          0.13
fwd_exp_upd_recur_sts    0.150       3600      0.000          0.13
solve_tri_recur          0.149       3528      0.000          0.13
fwd_expand               0.082       100       0.001          0.07
ggml_exp                 0.041       3528      0.000          0.03
cumsum_dn                0.037       3528      0.000          0.03
------------------------------------------------------------
Total                    118.529     100.00
============================================================

Performance Bottleneck Analysis:
- Most time-consuming operation: ffn_calc (56.509 ms)
- Longest average time: build_dn_chunking (0.031 ms)

Added performance metrics tracking to the qwen3next model, including operation timing, reporting, and CSV export functionality.
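
For readers unfamiliar with this kind of tracker, here is a minimal Python sketch of how the report columns above could be derived: Avg(ms) is Total(ms) divided by Calls, and Pct(%) is each operation's share of the grand total. This is not the code from the PR; `PerfTracker`, `measure`, `rows`, and `to_csv` are hypothetical names chosen for illustration.

```python
import csv
import io
import time
from collections import defaultdict
from contextlib import contextmanager


class PerfTracker:
    """Hypothetical tracker: accumulates wall-clock time and call counts per operation."""

    def __init__(self):
        self.total_ms = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def measure(self, name):
        # Time the enclosed block and add it to the named operation.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.total_ms[name] += (time.perf_counter() - start) * 1000.0
            self.calls[name] += 1

    def rows(self):
        # One (name, total, calls, avg, pct) tuple per operation, slowest first.
        grand_total = sum(self.total_ms.values())
        out = []
        for name, total in sorted(self.total_ms.items(), key=lambda kv: -kv[1]):
            calls = self.calls[name]
            avg = total / calls
            pct = 100.0 * total / grand_total if grand_total > 0 else 0.0
            out.append((name, total, calls, avg, pct))
        return out

    def to_csv(self):
        # Serialize the report with the same columns as the text table.
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["operation", "total_ms", "calls", "avg_ms", "pct"])
        for name, total, calls, avg, pct in self.rows():
            writer.writerow([name, f"{total:.3f}", calls, f"{avg:.3f}", f"{pct:.2f}"])
        return buf.getvalue()
```

A call site would wrap each section of interest, for example `with tracker.measure("ffn_calc"): ...`. Note that such a timer only measures whatever runs inside the wrapped block.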
@pwilkin
Collaborator

pwilkin commented Nov 29, 2025

This performance-measuring function only measures the time spent building the graph, which is completely negligible compared to the time spent computing it. I don't think you're getting anything meaningful this way.
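
To illustrate the distinction, here is a toy Python sketch of deferred execution. It does not use the ggml API; `Node`, `build_graph`, `compute`, and `timed` are hypothetical names. Building the graph only records operations, while the arithmetic happens later during evaluation, so a timer wrapped around the build step says little about inference speed.

```python
import time


class Node:
    """A deferred operation: records what to compute, not the result."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs


def build_graph(n_layers, size):
    # Building only allocates Node objects; no arithmetic happens here.
    node = Node(lambda: list(range(size)))
    for _ in range(n_layers):
        node = Node(lambda xs: [x + 1 for x in xs], node)
    return node


def compute(node):
    # Evaluation walks the chain of inputs and performs the actual work.
    args = [compute(i) for i in node.inputs]
    return node.fn(*args)


def timed(fn, *args):
    # Return the function's result together with the elapsed time in ms.
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


graph, build_ms = timed(build_graph, 20, 100_000)
result, compute_ms = timed(compute, graph)
print(f"build: {build_ms:.3f} ms, compute: {compute_ms:.3f} ms")
```

In this toy setup the build step should take a tiny fraction of the compute step, which mirrors the concern: per-operation timers placed in graph-building code measure bookkeeping, not the kernels that run afterwards.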
