grafana/k6

Problems and deficiencies around uploading big files with k6

Open

#2,311 opened on Dec 22, 2021

View on GitHub
 (4 comments) (0 reactions) (1 assignee)Go (30,564 stars) (1,537 forks)batch import
evaluation neededhelp wantednew-httpperformance

Description

This will discuss deficiencies around net/http, but in reality, some of those apply to other APIs as well.

The idea is mostly to:

  1. Discuss the problems with the current API
  2. Explore what is actually wrong(or not) with it
  3. Figure out which parts are important, and we would like "fixed"
  4. And then possibly discuss solutions, although unless they are simple I would advise having them in separate issues linking to this one.

But let's first get a fairly simple script

import http from 'k6/http';
import { sleep } from 'k6';

const binFile = open('/path/to/file.bin', 'b'); // line A

export default function () {
  const data = {
    field: Math.random(), // dynamic field
    file: http.file(binFile, 'test.bin'), // Line B
  };

  const res = http.post('https://example.com/upload', data); // Line C
  sleep(3);
}

Given this very simple example currently(v0.35.0) we have

  1. A copy of the file.bin created for each VU on line A. This can be addressed with 1931 and PR2303
  2. on Line B there is currently no copy, but as binFile is a mutatable ArrayBuffer it technically can lead to a possibility to edit the underlying data after data is created. And if http.post was asynchronous(and 3 is fixed) possibly even a data race. Even if it was copying and if there wasn't dynamic data this could've been done once in the init context.
  3. on line C data gets "marshaled" into multipart body effectively copying the body.
  4. on line C again res also has a field request which has a field body which is a string dump of the body that was created above.

So for this particular script, there are 1,3 and 4 make copies of potentially big amount of bytes.

Let's go with a similar but slightly different script

import http from 'k6/http';
import { check } from 'k6';
import { FormData } from 'https://jslib.k6.io/formdata/0.0.2/index.js';

const binFile = open('/path/to/file.bin', 'b'); // line A;

export default function () {
  const fd = new FormData();
  fd.append('someTextField', math.Random());
  fd.append('file', http.file(binFile, 'file.bin', 'pdf')); // Line B
  
  const res = http.post('https://httpbin.test.k6.io/post', fd.body(), { // Line C
    headers: { 'Content-Type': 'multipart/form-data; boundary=' + fd.boundary },
  });
  check(res, {
    'is status 200': (r) => r.status === 200,
  });
}

Here we have the same exact thing except that instead of 3. happening inside k6 it happens in FormData's .body() call which is in javascript. This also happens to be the currently advised way to do multipart requests due to a number of issues.

Arguably just 1931 can help with Line A and as long as the body doesn't need something dynamic that also lets us do New SharedArrayBuffer(name,func() {return arrayBUffer}) can be used to make the whole body in the init context. edit: turns out that I am wrong as at least in the case of FormData you also need the boundary which then requires SharedArrayBuffer to return somehow a buffer and a string (the boundary). While this is doable, I think this now means that we make it extremely specific, so arguably now we need to make it more general. I guess (haven't checked) we can have the FormData take in the boundary and for it to be a constant for each VU 🤷‍♂, but this now seems like we are just fixing a hack with a hack IMO.

This still means that on each request that uploads we make a string copy of the whole body. This body also is IME never used actually. In reality, the only case where it's even relevant is in the first example as it can be used to see what actually k6 sent.

This currently proposed solution is to just reference the original value in all cases but the one where we generate it. This though in practice again falls short by a lot in the case where the values are dynamic and this needs to be generated.

All of this also isn't really thought through from the perspective of having asynchronous calls as well. With asynchronous operations there is no problem to have:

http.postPromise(url, someBody).then(.. something ..)
someBody[23] = 12; // or any other change to the body
http.postPromise(aDifferentUrl, someBody).then(.. something else ..)

In this case there are multiple race conditions points where if someBody gets modified not only can it race with any of the post being done, but response.request.body will not actually be accurate.

I really can't figure out any way for response.request.body to be accurate in cases where we don't want to copy it and it's modifiable. Requesting all body arguments to http.* to be immutable will probably be way too big of an ask and a breaking change.

This combined with the race conditions around asynchronous programming and also that copying big binary blobs will be problematic forever I kind of feel this needs more ... drastic change to fix.

The thing I mostly think of is generators. If the body is given by a generator (or a generator-like object), the body can be read once from it and sent it. There are two problems(apart from no generator support in goja currently):

  1. generators are one-time and for various reasons (redirects and goaway in http2) k6 might need to start reading the body from the beginning so it needs to be something different from a pure JS generator :(.
  2. if it's a pure JS generator the event loop needs to be called for each read from it which might be problematic.

1 can be fixed by having the custom type BodyGenerator with both next() and a reset() methods 🤷‍♂ 2 can be fixed by having custom in-go types that wrap immutable types in ways that will be more optimal

Along the same vein (of bigger changes) we can require that the body is written with explicit write() calls which are blocking or combined with the above.

This likely needs more investigation to see how it works in other APIs such as in nodejs and browsers, but I expect that there this might be less of a problem as there won't be hundreds if not thousands of js VMs doing uploads at the same time. So for them, something like copying might be a lot more reasonable and only have a "streaming" variant for compliance.

Things I have not touched on:

  • k6 currently (by some amount of design) does not copy ArrayBuffer bytes, but this is very implementation dependent and might get broken by either goja or us at a future point, so this likely is worthy of some tests.
  • We definitely can just say that race conditions are a script bug and the user should fix them and hope they won't lead to a panic 🤷‍
  • I haven't talked about CoW structures predominantly as .... those are quite complex and I would expect will have very bad performance in the case where a modification is done
  • The body currently gets copied if we compress the body, which is by .... uh design (sorry) but is just one more place where http.* will mess up any outside fixes.
  • I don't even start discussing embedding something like the jslib FormData in k6, which likely will be a requirement for even more performant multipart requests.

In my opinion, while the changes proposed in PR 2303 will have some beneficial effect to the situation with uploading big files in k6, they will not address the underlying problem. Which is that the k6 HTTP module was not designed with uploads in mind (or so it seems to me).

edit: Due to the above problem where even for static FormData a pure SharedArrayBuffer will not be sufficient and that ... I would expect there are even more deficiencies with it, I feel even more strongly that just adding SharedArrayBuffer will have negligible positive effects but will warp any future change to try to work with it instead of actually coming with a better solution.

I would really prefer if instead of making some changes, which may or may not have to be advised against in the future (as we might move to something different entirely), a redesign of the HTTP module is being thought of so that uploading anything is possible in some way that is both safe and performant. Hopefully, also it will be fairly obvious how to use the API in a way that this will be true instead of requiring jumping through many hoops 🤷‍♂

Relevant open at the time issues:

Some of those are only tangibly about uploads(and there are more around the current in k6 multipart support for "object" bodies).

Contributor guide