Currently, I suspect syncmul! is the problem.
using Octavian
M = K = N = 10_000;
A = rand(M, K); B = rand(K, N); C = Matrix{Float64}(undef, M, N);  # T was undefined; rand gives Float64
@time matmul!(C, A, B);
On a computer with 18 threads, running Julia with 18 threads while OBS Studio ran in the background gave roughly a 2x performance degradation compared with running Julia with 16 threads.
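To make the comparison easier to reproduce, here is a rough sketch of how I would time it. `time_matmul` is a hypothetical helper of mine, not part of Octavian, and the matrix size and repetition count are arbitrary assumptions. It does a warm-up call first so that compilation time is not counted, which a bare `@time` on the first call would include.

```julia
using Octavian

# Hypothetical helper (not part of Octavian): best-of-`reps` wall time for matmul!.
function time_matmul(n::Integer; reps::Integer = 3)
    A = rand(n, n)
    B = rand(n, n)
    C = Matrix{Float64}(undef, n, n)
    matmul!(C, A, B)  # warm-up call so JIT compilation is excluded from the timing
    return minimum(@elapsed(matmul!(C, A, B)) for _ in 1:reps)
end

t = time_matmul(2_000)  # size chosen arbitrarily; the numbers above used 10_000
println("Julia threads: ", Threads.nthreads(), ", best time: ", t, " s")
```

Saving this to a script and running it once with `julia -t 16` and once with `julia -t 18`, with the same background load each time, should show whether the slowdown comes from oversubscribing the cores.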