Introducing a WebGPU info and timing tool
Welcome! In the previous two posts I went looking for the WGSL limits (both
in the spec, and also
out of it).
In this post I want to introduce a simple WebGPU app where you can create WGSL
shaders by dragging some sliders to increase their complexity.
The idea is to measure compilation time as complexity grows, for both the
compute and render pipelines as well as for their different stages (vertex and
fragment).
This is important to me because I wanted to know how much it costs to increase
shader complexity, and at what point it might be worth considering moving some
of it into buffers/textures.
How long does it take to compile a 3 MB vertex shader?
Easy to know. Here is the single cold-timing result in Chrome
(140.0.7339.215) on my Intel MacBook Pro (2019):
How about a 30 MB vertex shader?
It errors. Ah!
A simple tool, along the lines of the awesome
WebGPU report, but not as exhaustive.
You can find it here:
WebGPU Diagnostics (and compilation benchmark) tool.
My intention was to compose it of 3 major parts:
- A simple sample (a rotating triangle).
- Shader complexity sliders and timings (compilation, adapter request and pipeline creation).
- Device and adapter info (features, and announced limits).
Rotating triangle
The rotating triangle is made from the 3 initial shaders. A compute shader
outputs vertices, then a vertex shader transforms them, and finally a fragment
shader paints it.
I think of this as a kind of “hello world” for WebGPU but also as a simple
program that performs the happy path of the fastest rendering possible (or close
to it).
The idea throughout the app is that this triangle never changes. It always
remains the same as the shader complexity increases.
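To make the structure concrete, here is a minimal sketch of how such a compute → render wiring can look each frame. It assumes a `device`, a canvas `context`, the two pipelines and a shared `bindGroup` already exist; the names are illustrative and not the app's actual code.

```javascript
function frame() {
  const encoder = device.createCommandEncoder();

  // 1. Compute pass: writes the rotated triangle's vertices into a storage buffer.
  const computePass = encoder.beginComputePass();
  computePass.setPipeline(computePipeline);
  computePass.setBindGroup(0, bindGroup);
  computePass.dispatchWorkgroups(1); // one invocation is enough for 3 vertices
  computePass.end();

  // 2. Render pass: the vertex stage transforms the vertices, the fragment stage paints them.
  const renderPass = encoder.beginRenderPass({
    colorAttachments: [{
      view: context.getCurrentTexture().createView(),
      loadOp: "clear",
      clearValue: { r: 0, g: 0, b: 0, a: 1 },
      storeOp: "store",
    }],
  });
  renderPass.setPipeline(renderPipeline);
  renderPass.setBindGroup(0, bindGroup);
  renderPass.draw(3); // always the same single triangle
  renderPass.end();

  device.queue.submit([encoder.finish()]);
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);
```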
Shader complexity sliders
Shader complexity can be increased at each pipeline stage. This includes
separating the vertex shader from the fragment shader into different modules, to
allow testing how much they differ (if they do at all).
The shader string is created before compilation starts. This is done in a
WebWorker thread to keep the UI responsive (as much as possible) as memory use
and string traversal time increase.
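As a rough sketch of that split (illustrative names, not the app's actual code): the worker only concatenates the WGSL string, and the main thread times `createShaderModule()`; awaiting `getCompilationInfo()` gives a point where the module's compilation messages are available.

```javascript
// worker.js — builds the (potentially huge) WGSL string off the main thread.
// buildShader is a placeholder for the slider-driven generator.
self.onmessage = ({ data }) => {
  self.postMessage(buildShader(data.functions, data.statements, data.depth));
};

// main thread
const worker = new Worker("worker.js");
worker.onmessage = async ({ data: source }) => {
  const t0 = performance.now();
  const module = device.createShaderModule({ code: source });
  const info = await module.getCompilationInfo(); // resolves once messages are available
  console.log(`shader module: ${(performance.now() - t0).toFixed(1)} ms`, info.messages);
};
worker.postMessage({ functions: 8, statements: 8, depth: 8 });
```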
I wanted to make these sliders biased towards complexity: a small change in them
should make the shader size increase by quite a bit. They are purposefully
unbalanced to ensure that device limits can be checked.
What WGSL code do the sliders generate?
They are split into three categories:
- Number of functions – dumps n functions, each with one f32 argument, that return the single variable declared in the body.
- Number of statements – for each function in 1., creates n statements that just do a simple assignment.
- Expression depth – for each statement in 2., expands it to have n sums.
So if we set them all to 8 (8 functions, each with 8 statements, each with 8
sums), it produces a 5 KB file.
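As an illustration of that mapping (a sketch, not the actual generator linked below), the three sliders can translate into nested string-building loops:

```javascript
// Sketch: build `numFunctions` WGSL functions, each with `numStatements`
// statements of `exprDepth` summed terms, matching the shape shown below.
function buildFunctions(numFunctions, numStatements, exprDepth) {
  const sum = Array.from({ length: exprDepth }, (_, i) => `${i}.0`).join("+");
  let code = "";
  for (let f = 0; f < numFunctions; f++) {
    code += `fn function${f}(a: f32) -> f32 {\n` +
            `  var tmp: f32 = 0.0;\n` +
            `  tmp += ${sum};\n`.repeat(numStatements) +
            `  return a + tmp;\n}\n`;
  }
  return code;
}

console.log(buildFunctions(8, 8, 8).length, "characters of generated WGSL");
```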
You can check the generator code
here.
Here is the output produced with the three sliders at value 8:
```wgsl
struct Vertex {
  position: vec3f,
  color: vec3f,
}

struct Params {
  angle: f32,
}

// Storage buffer that will hold our generated vertices
@group(0) @binding(0) var<storage, read_write> vertices: array<Vertex>;
@group(0) @binding(1) var<uniform> params: Params;

fn function0(a: f32) -> f32 {
  var tmp: f32 = 0.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  return a + tmp;
}
fn function1(a: f32) -> f32 {
  var tmp: f32 = 0.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  return a + tmp;
}

fn function2(a: f32) -> f32 {
  var tmp: f32 = 0.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  return a + tmp;
}

fn function3(a: f32) -> f32 {
  var tmp: f32 = 0.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  return a + tmp;
}

fn function4(a: f32) -> f32 {
  var tmp: f32 = 0.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  return a + tmp;
}

fn function5(a: f32) -> f32 {
  var tmp: f32 = 0.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  return a + tmp;
}

fn function6(a: f32) -> f32 {
  var tmp: f32 = 0.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  return a + tmp;
}

fn function7(a: f32) -> f32 {
  var tmp: f32 = 0.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  tmp += 0.0+1.0+2.0+3.0+4.0+5.0+6.0+7.0;
  return a + tmp;
}
// Compute shader that generates a triangle's vertices
@compute @workgroup_size(1)
fn compute_main(@builtin(global_invocation_id) global_id: vec3u) {
  var accum: f32 = 0.0;
  accum += function0(8.0);
  accum += function1(8.0);
  accum += function2(8.0);
  accum += function3(8.0);
  accum += function4(8.0);
  accum += function5(8.0);
  accum += function6(8.0);
  accum += function7(8.0);

  // Generate 3 vertices for a triangle
  // We only need one invocation for this simple example
  if (global_id.x == 0) {
    let angle = params.angle + accum * 0.0001;
    let cos_angle = cos(angle);
    let sin_angle = sin(angle);
    let base_positions = array<vec2f, 3>(
      vec2f(0.0, 0.5),
      vec2f(-0.5, -0.8),
      vec2f(0.5, -0.5)
    );
    let top = vec2f(
      base_positions[0].x * cos_angle - base_positions[0].y * sin_angle,
      base_positions[0].x * sin_angle + base_positions[0].y * cos_angle,
    );
    let left = vec2f(
      base_positions[1].x * cos_angle - base_positions[1].y * sin_angle,
      base_positions[1].x * sin_angle + base_positions[1].y * cos_angle,
    );
    let right = vec2f(
      base_positions[2].x * cos_angle - base_positions[2].y * sin_angle,
      base_positions[2].x * sin_angle + base_positions[2].y * cos_angle,
    );
    let depth = clamp(accum * 0.001, -0.5, 0.5);

    // Top vertex (red)
    vertices[0].position = vec3f(top, depth);
    vertices[0].color = vec3f(1.0, 0.0, 0.0);
    // Bottom-left vertex (green)
    vertices[1].position = vec3f(left, depth);
    vertices[1].color = vec3f(0.0, 1.0, 0.0);
    // Bottom-right vertex (blue)
    vertices[2].position = vec3f(right, depth);
    vertices[2].color = vec3f(0.0, 0.0, 1.0);
  }
}
```
Device and adapter info
The features and limits are a nice thing to have in this kind of tool. They are
not used, just reported. For a lot of them I don't know what they do, so I
included a link to the corresponding WebGPU spec chapter.
An important part here is that I wanted to separate the adapter limits from
the device limits. These can differ, and knowing the differences might be
important when we are aiming for the upper end of what the hardware offers.
In WebGPU the adapter limits can be seen as the announced values of what might
exist, while the device limits are what we actually have to work with.
These can be requested through
requiredLimits
when the device is created.
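For example (a minimal sketch, using a single limit as an illustration), the adapter announces a maximum and the device only gets what you explicitly ask for, capped by that announcement:

```javascript
const adapter = await navigator.gpu.requestAdapter();
console.log("adapter announces:", adapter.limits.maxStorageBufferBindingSize);

// Without requiredLimits the device falls back to the spec defaults,
// even if the adapter announces much higher values.
const device = await adapter.requestDevice({
  requiredLimits: {
    maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
  },
});
console.log("device actually has:", device.limits.maxStorageBufferBindingSize);
```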
Another important part of the motivation for doing this was to use
boreDOM, a minimal JS framework that I
created as a fun project. It tries to do what major JS frameworks do but with a
different approach, just because why not? (no bundling, no compilation, no
minification, and a small size).
It was fun and cool to see that it works without bundling/compilation while
handling WebGPU’s complexity.
Conclusion
Go ahead, play with it a bit here and let me know
how your device and browser react to massive shaders.
I found that there is a significant difference between cold and warm compilation
times. This cold vs warm distinction could be interesting to explore
further, and maybe it could even be leveraged/exploited to hide extremely
complex shaders by splitting them into many small pipelines (just a wild
hypothesis at this point, I guess).
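A rough way to see this yourself (a sketch, not the tool's actual code, assuming a module with `vertex_main`/`fragment_main` entry points and a generated source string) is to build the same pipeline twice and compare timings; the second run usually benefits from browser/driver caching:

```javascript
async function timePipeline(device, code, label) {
  const t0 = performance.now();
  const module = device.createShaderModule({ code });
  await device.createRenderPipelineAsync({
    layout: "auto",
    vertex: { module, entryPoint: "vertex_main" },
    fragment: { module, entryPoint: "fragment_main", targets: [{ format: "bgra8unorm" }] },
  });
  console.log(label, (performance.now() - t0).toFixed(1), "ms");
}

// hugeShaderSource: the slider-generated WGSL string.
await timePipeline(device, hugeShaderSource, "cold");
await timePipeline(device, hugeShaderSource, "warm"); // often much faster
```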
This post wraps up the WGSL limits endeavour. In the next ones I might introduce
a rendering engine I am working on that tries to bring to rendering the same
spirit that boreDOM brings to JS frameworks. Thanks for reading all of this
and stay tuned!




