4.1.1 Model-Level Latency.
First, we compare the average inference latency in Web browsers and native environments. We deployed the Web service on a dedicated real-world server that serves only our experiments and is located in the same LAN as the experiment devices. We downloaded the model after the page loading phase for two reasons: (1) the download time of Web pages may influence the results of the setup stage, and (2) the PLT, a critical indicator of QoE, may be compromised if models and the framework are downloaded during the page loading phase. We illustrate the results of ResNet50 in
Figure 2 and provide the complete results in
Tables 1–
4.
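For illustration, the following minimal sketch shows how the model download can be deferred until after the page load event so that it does not affect the PLT. It assumes a TF.js graph model served from a hypothetical URL on the LAN server; the exact harness used in our experiments may differ.

```typescript
import * as tf from '@tensorflow/tfjs';

// Hypothetical URL of the converted ResNet50 model on the LAN experiment server.
const MODEL_URL = 'http://lan-server.local/models/resnet50/model.json';

// Defer framework initialization and model download until after the page has
// finished loading so that the download does not distort the PLT.
window.addEventListener('load', async () => {
  const t0 = performance.now();
  const model = await tf.loadGraphModel(MODEL_URL); // setup: download + parse
  const setupMs = performance.now() - t0;
  console.log(`setup (download + load): ${setupMs.toFixed(1)} ms`);
  // `model` is now ready for the warmup and prediction stages.
});
```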
Average CPU Latency. We observe a latency gap between inference in the Wasm backend and native inference. We analyze the gap in prediction latency, warmup latency, and setup latency. In this part, both SIMD and multithreading are enabled by default and we set four threads.
As for
prediction latency, the in-browser prediction latency of TF.js is 3.7–18.4
\(\times\) higher than the latency of TF (
Table 2 vs.
Table 1); the gap is 2.1–36.4
\(\times\) between ORT.js and ORT (
Table 3 vs.
Table 1). The average gap of both frameworks is 16.9
\(\times\). For both frameworks, in-browser inference is consistently slower than native inference. The gap mainly comes from three aspects. First, advanced SIMD instruction sets, such as AVX2 and AVX512 on Intel CPUs, are unavailable in browsers [
22,
23]. The SIMD instruction length in Wasm is only 128-bit [
47], while the length is 256-bit in AVX2 and 512-bit in AVX512. This will introduce a latency gap of up to 4
\(\times\) between in-browser and native inference. Besides instruction length, some advanced instructions, such as
Fused-Multiply-Add (FMA) [
19], are not available in Wasm [
33]. The absence of FMA instructions in Wasm could introduce a performance gap of up to 2
\(\times\). Second, the Wasm VM introduces additional overhead because it needs to interpret and execute the Wasm binary code, whereas native inference frameworks run native binaries directly. In general, the Wasm VM itself can introduce up to 2.5
\(\times\) additional overhead [
81]. Third, the browser's inherent threads, such as the I/O thread and the compositor thread, introduce inevitable resource competition [
44].
Regarding the two in-browser inference frameworks, neither is consistently superior to the other. For instance, we find that ORT.js exhibits lower prediction latency for ResNet50 when employing both multithreading and SIMD, while TF.js exhibits lower latency when inferring VGG16 and MobileNetV2. The difference can be primarily attributed to how the two frameworks are implemented. We delved into their implementations and found that both use XNNPACK, a library that provides numerous kernel implementations, for the underlying kernel computation. However, they differ in how the computation graph is constructed at a higher level, as well as in the kernels themselves. For TF.js, graph parsing and construction, as well as kernel scheduling, are implemented on the JavaScript side, with subsequent calls to XNNPACK [
5] kernels for execution. In contrast, ORT.js implements these within Wasm. Furthermore, TF.js utilizes models directly converted from TF SavedModels [
3] for loading and graph construction, whereas ORT.js requires transforming the SavedModel to fit the specifications of ORT and ORT.js. These implementation differences lead to varied performance across models between the two frameworks. In addition, TF.js supports a wider range of models than ORT.js.
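The different model formats and loading paths are visible in how applications invoke the two frameworks. The sketch below is illustrative only; the model file paths are hypothetical placeholders for models converted as described above.

```typescript
import * as tf from '@tensorflow/tfjs';
import * as ort from 'onnxruntime-web';

async function loadModels() {
  // TF.js loads a model converted directly from a TF SavedModel
  // (model.json + weight shards); graph parsing and kernel scheduling
  // happen on the JavaScript side before XNNPACK kernels are invoked.
  const tfjsModel = await tf.loadGraphModel('/models/resnet50_tfjs/model.json');

  // ORT.js loads a model converted to the ONNX format; graph construction
  // and kernel scheduling are carried out inside the Wasm module.
  const ortSession = await ort.InferenceSession.create('/models/resnet50.onnx', {
    executionProviders: ['wasm'],
  });

  return { tfjsModel, ortSession };
}
```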
As for
warmup latency, the in-browser warmup latency of TF.js is 0.4–1.0
\(\times\) of the latency of TF (
Table 2 vs.
Table 1), while the gap is 1.0–20.5
\(\times\) for ORT.js and ORT (
Table 3 vs.
Table 1). TF suffers from higher warmup latency than TF.js across all models. This is because loading all components that TF needs for model inference, such as the compute graph engine, requires more than 980 MB of RAM, whereas the TF.js library is only 3 MB, so the memory it needs to load is correspondingly small. During the warmup stage, the primary tasks are allocating memory for inference and initializing execution. Since memory allocation is not influenced by the underlying system software and hardware instruction characteristics, the warmup latency gap between the browser and native environments is not as significant as the gap in prediction latency. When comparing the warmup latency with the prediction latency, we find that warmup latency is 1.2–2.3
\(\times\) of the average prediction latency for TF.js (
Table 2), while the gap is 1.0–1.8
\(\times\) for ORT.js (
Table 3). The additional warmup overhead primarily arises from memory allocation in Wasm. After the warmup stage, the frameworks retain the allocated memory for efficiency. The cold start issue is less severe in ORT.js than in TF.js, primarily because ORT.js optimizes the computation graph, reducing the frequency of memory allocations. Additionally, ORT.js leverages an Arena memory pool for further optimization [
30,
31].
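To make the distinction between the two stages concrete, the following sketch times the first inference (warmup) separately from the steady-state average (prediction), assuming a TF.js graph model and a hypothetical 1×224×224×3 input; the measurement harness used in our study is more elaborate.

```typescript
import * as tf from '@tensorflow/tfjs';

// Time the first inference (warmup) separately from the steady-state average
// (prediction). The 1x224x224x3 input shape is a hypothetical placeholder.
async function measureLatency(model: tf.GraphModel, runs = 50) {
  const input = tf.zeros([1, 224, 224, 3]);

  // Warmup: the first run pays for memory allocation and execution initialization.
  let t0 = performance.now();
  tf.dispose(await model.executeAsync(input));
  const warmupMs = performance.now() - t0;

  // Prediction: later runs reuse the memory allocated during warmup.
  t0 = performance.now();
  for (let i = 0; i < runs; i++) {
    tf.dispose(await model.executeAsync(input));
  }
  const predictionMs = (performance.now() - t0) / runs;

  input.dispose();
  return { warmupMs, predictionMs };
}
```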
As for
setup latency, the in-browser inference setup latency of TF.js is 0.1–10.4
\(\times\) of the latency of TF (
Table 1 vs.
Table 2), while it is 0.9–2.4
\(\times\) for ORT.js and ORT (
Table 1 vs.
Table 3). It is worth noting that the setup stage of native inference only involves loading the model and the framework. In-browser inference may outperform native inference w.r.t. setup latency. One source of setup overhead is loading the framework into memory, and the native inference framework itself is larger. For example, simply importing TF in Python (i.e., executing the Python command “import tensorflow,” which is performed in the setup stage) requires over 160 MB of RAM, while the required files are only about 3 MB for TF.js. On the other hand, native inference may also outperform in-browser inference because in-browser inference needs to download model files to the browser. The download latency depends on the network and the model size. For example, downloading VGG16 for TF.js requires transferring 528 MB of data, which introduces high network transfer latency. Among the selected models, the model file size varies by up to 31.1
\(\times\). The significant disparity in data transfer latency results in a wide range of setup latency.
To further explore the impact of network conditions during the setup stage, we deployed a monitor on the server to observe the network latency for each device as it downloaded the model and framework. To avoid competition for network bandwidth among devices, we ensured that only one device was downloading the model at any given time. We observed that the network bandwidth remained relatively stable, with the file-transfer bandwidth across devices ranging from 51.2 to 76.4 MB/s. The file transmission latency varied from 212 ms to 9,717 ms. Comparing this data transfer time with the setup stage duration, we found that data transmission latency constituted 88.5% of the setup latency on average, indicating that data transmission dominates this stage and is the main bottleneck.
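Our monitoring was performed on the server side; for completeness, the sketch below shows a browser-side alternative that estimates download time and bandwidth from the Resource Timing API. The URL filter is a hypothetical placeholder, and summing per-resource durations is only a rough estimate when weight shards are fetched in parallel.

```typescript
// Browser-side estimate of model-download time and bandwidth from the Resource
// Timing API; our measurements used a server-side monitor instead.
function modelDownloadStats(urlSubstring = '/models/') {
  const entries = performance
    .getEntriesByType('resource')
    .filter((e) => e.name.includes(urlSubstring)) as PerformanceResourceTiming[];

  const totalBytes = entries.reduce((sum, e) => sum + e.transferSize, 0);
  const totalMs = entries.reduce((sum, e) => sum + e.duration, 0);
  return {
    totalMB: totalBytes / (1024 * 1024),
    totalMs,
    bandwidthMBps: totalMs > 0 ? (totalBytes / (1024 * 1024)) / (totalMs / 1000) : 0,
  };
}
```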
In summary, by analyzing data from
Table 1 regarding CPU performance and Wasm-thread-SIMD data from
Tables 2 and
3, several key findings emerge. Firstly, for prediction latency, a significant gap persists between in-browser and native inference, primarily due to performance differences between the native and browser environments. The lack of advanced SIMD instructions in the Wasm environment, such as FMA, results in lower acceleration compared to native inference, with additional impact from resource competition among inherent browser threads. Secondly, regarding warmup latency, TF’s substantial library size leads to considerably higher warmup latency than that of TF.js. Nevertheless, warmup latency is higher than prediction latency in all settings, reaching up to 2.3
\(\times\) higher, primarily due to the need to load and initialize programs and allocate memory during the warmup stage. Lastly, setup latency is primarily influenced by network conditions. In-browser inference involves downloading models from servers, with model data transfer latency constituting a significant portion of setup latency, reaching 88.5% on average. Despite this, current frameworks support download and load pipelining, which allows parallel operations, enhancing the efficiency of the setup stage.
Breakdown Analysis for Wasm Backend. We explore the influence of multithreading and SIMD on latency. These two techniques are designed to accelerate computation and significantly impact both the warmup and prediction stages. We set four threads when enabling multithreading. The results of ResNet50 are illustrated in
Figure 2 and the full results are presented in
Tables 2 and
3. The results show that SIMD reduces the average prediction latency by 49.1% and 50.7% for TF.js and ORT.js, respectively, while multithreading reduces it by 10.7% and 31.2%, respectively. When both techniques are enabled, the average prediction latency is reduced by 64.7% and 63.4% for TF.js and ORT.js, respectively. For both frameworks, SIMD provides a greater speedup than multithreading. This is mainly because thread synchronization is implemented through message passing in Wasm, introducing additional overhead for thread management and synchronization [
21,
43]. Besides, inherent resource competition from other browser threads, such as the I/O thread and the compositor thread in the Chromium architecture, further slows down inference. For example, on a CPU with four cores, if four threads have been allocated for inference, the browser's remaining threads will compete with the inference threads for resources, thereby impacting the efficiency of the inference process. Both multithreading and SIMD can provide up to 4
\(\times\) acceleration (since 128-bit instruction length is 4
\(\times\) the length of int32/float32 data; single thread vs. four threads). However, the two technologies only achieve a maximum acceleration of 2.8
\(\times\) for inference. This is because the model contains many memory-intensive kernels (37.7% on average across all models), and both frameworks provide limited optimization for these kernels.
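For reference, the following sketch shows one way these settings can be applied in application code, assuming the `setThreadsCount` helper of the TF.js Wasm backend and the global `ort.env.wasm` flags of onnxruntime-web; exact flag names and defaults vary across framework versions, and recent builds detect SIMD support automatically.

```typescript
import * as tf from '@tensorflow/tfjs';
import { setThreadsCount } from '@tensorflow/tfjs-backend-wasm';
import * as ort from 'onnxruntime-web';

async function initWasmBackends() {
  // TF.js Wasm backend: request four worker threads; the SIMD-enabled Wasm
  // binary is picked up automatically when the browser supports SIMD.
  setThreadsCount(4);
  await tf.setBackend('wasm');
  await tf.ready();

  // ORT.js (onnxruntime-web): the corresponding knobs on the global env object.
  ort.env.wasm.numThreads = 4;
  ort.env.wasm.simd = true; // older builds expose this flag; newer ones auto-detect
}
```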
Regarding the warmup latency, it was found that SIMD and multithreading, respectively, reduce latency by 38.5% and 12.1% on TF.js, and by 54.5% and 41.5% on ORT.js. These reductions are lower than those observed during the prediction stage, primarily because the warmup stage also involves memory allocation operations, which are less affected by these technologies. However, since computation still dominates during the warmup stage, both technologies are still able to facilitate acceleration.
In summary, a close analysis of the data from
Tables 2 and
3 reveals that for in-browser inference, SIMD can provide up to 54% lower latency on average compared to multithreading. This improvement is mainly due to the efficiency of synchronization between threads and the impact of resource competition. However, the presence of memory-intensive operations limits the maximum acceleration achievable by both technologies to only 2.8
\(\times\), as both SIMD and multithreading offer limited speed enhancements for memory-intensive operations.
Average GPU Latency. We analyze the latency in WebGL and compare it with the latency in native and Wasm in the same three latency categories. Note that when comparing with native inference, we only consider discrete GPUs because the native DL framework does not support the integrated one.
As for
prediction latency, TF.js is 2.5–268.7
\(\times\) slower than TF, and the gap between ORT.js and ORT is 11.2–17.9
\(\times\) (
Table 1 vs.
Table 4). The average gap for both frameworks is 30.6
\(\times\). The gap mainly arises from inefficient libraries and GPU abstraction. Native inference frameworks can utilize high-performance GPU libraries, such as CUDA [
16], to fully exploit the GPU’s parallelism capacity. CUDA’s low-level access to GPU allows developers to interact directly with the hardware, enabling fine-grained control and optimized program execution. However, in-browser inference frameworks rely on WebGL. WebGL is primarily designed for rendering tasks and requires using graphics-related mechanisms like explicit textures and frame buffers. Developers cannot access GPU directly through WebGL, making it inefficient for parallel computing tasks.
When comparing the latency of the WebGL backend with the Wasm backend, we find that WebGL exhibits lower latency than Wasm. For TF.js, inference in WebGL on integrated and discrete GPUs reduces the average prediction latency by 1.4\(\times\) and 2.7\(\times\), respectively, compared to Wasm. For ORT.js, the reduction is 2.2\(\times\) and 4.9\(\times\), respectively. The speedup is attributed to the GPU’s capacity, despite the limitations of WebGL in fully utilizing the hardware features.
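Both comparisons above depend only on which backend or execution provider the application selects; a minimal sketch of that selection is shown below (the model path is a hypothetical placeholder).

```typescript
import * as tf from '@tensorflow/tfjs';
import * as ort from 'onnxruntime-web';

async function initGpuBackends() {
  // TF.js: route kernels to the WebGL backend (textures and shaders under the hood).
  await tf.setBackend('webgl');
  await tf.ready();

  // ORT.js: choose the WebGL execution provider instead of the default Wasm one.
  const session = await ort.InferenceSession.create('/models/resnet50.onnx', {
    executionProviders: ['webgl'],
  });
  return session;
}
```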
As for
warmup latency, we observed that the warmup latency for in-browser inference is significantly high. Specifically, when compared to native inference, TF.js is on average 5.8
\(\times\) slower than TF, and the average gap between ORT.js and ORT is 186.7
\(\times\) (
Table 1 vs.
Table 4). Additionally, when comparing warmup latency to prediction latency, we found that the warmup latency of TF.js on discrete GPUs is 64.6
\(\times\) longer than its prediction latency, while on integrated GPUs, the gap is 26.8
\(\times\). For ORT.js, these gaps are 48.1
\(\times\) and 25.1
\(\times\), respectively (as detailed in
Table 4). In-browser inference exhibits severe warmup latency in WebGL because shaders are compiled during this stage [
52]. Taking Chrome on the Windows platform as an example, when executing WebGL, Chrome uses ANGLE to compile WebGL shader code into an intermediate representation, such as Direct3D, which is then compiled by the GPU hardware into executable binaries [
1]. This compilation process is completed during the first execution of WebGL, and the compiled results can be re-used in subsequent executions. The compilation process is time-consuming. Therefore, in-browser inference using the WebGL backend experiences high warmup latency. In addition, memory allocation also introduces overhead in this stage. Integrated GPU outperforms discrete GPU because the inference process needs to invoke the GPU and synchronize the process state [
50,
51] during the initialization of WebGL execution, which is slow on discrete GPUs. We also find that ORT.js outperforms TF.js by 1.7
\(\times\) on integrated GPU and 1.9
\(\times\) on discrete GPU, because ORT.js optimizes the model via kernel fusion during model conversion, resulting in shorter compilation time.
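Because shader compilation happens only on the first execution of the graph, Web applications typically pay this cost up front by running a throwaway inference immediately after loading the model. A minimal sketch, with a hypothetical input shape:

```typescript
import * as tf from '@tensorflow/tfjs';

// Run one throwaway inference right after loading the model so that WebGL
// shader compilation (the dominant warmup cost) happens before real requests.
// The input shape is a hypothetical placeholder.
async function warmupModel(model: tf.GraphModel, inputShape: number[] = [1, 224, 224, 3]) {
  const dummy = tf.zeros(inputShape);
  const output = await model.executeAsync(dummy);
  tf.dispose([dummy, output]);
}
```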
As for
setup latency, TF.js is 0.1–3.3
\(\times\) of TF, and the gap is 0.2–15.4
\(\times\) for ORT.js and ORT (
Table 1 vs.
Table 4). The reason is similar to that of Wasm. We also find that for in-browser inference, integrated GPU outperforms discrete GPU consistently. Specifically, the gap is 1.1–3.5
\(\times\) for TF.js and 1.5–2.8
\(\times\) for ORT.js (
Table 4). This is mainly because of high invoking overhead on discrete GPU [
50,
51]. Unlike the invocation during the warmup stage, the invocation in the setup stage primarily involves loading the model parameters and the inference framework onto the GPU.
In summary, based on the analysis of data from the GPU part of
Tables 1 and
4, we find that using WebGL for inference is faster than using Wasm: WebGL achieves up to 2.7
\(\times\) faster prediction. However, compared to native inference using CUDA, WebGL exhibits a significantly larger performance gap, up to 268.7
\(\times\), due to its inherent inefficiencies. Different from Wasm, using WebGL involves in-browser compilation of WebGL shaders during the warmup stage, which results in a severe warmup latency, up to 64.6
\(\times\) higher than the prediction latency. Regarding the setup stage, invoking a discrete GPU incurs higher latency.
Latency Variance.
Figures 3 and
4 show the variance in prediction latency and warmup latency, respectively. We define the variance as the gap between the highest and the lowest average latency across devices. We exclude setup latency variance due to the impact of different network conditions.
The
prediction latency varies substantially across devices. Taking ResNet50 in the Wasm backend of TF.js as an example, the latency quartiles are 257.6 ms, 358.4 ms, and 680.9 ms. The prediction latency variance across all models in the Wasm backend reaches up to 28.4
\(\times\). Such variance mainly derives from hardware capacities. We observe that the benchmark score of the lowest-end CPU involved, Core i5 Skylake, is 4.1
\(\times\) lower than the highest-end CPU, Core i7 Rocket Lake (6,030 vs. 24,673) for one core [
15]. Moreover, the same kernel can also exhibit different performances on different hardware [
65,
112]. In the WebGL backend in TF.js, the prediction latency quartiles of ResNet50 are 77.5 ms, 124.3 ms, and 225.3 ms. The variance of all models reaches up to 19.4
\(\times\). Such variance also arises from the large device capacity gap. We observe that the score of the highest-end GPU in our dataset, Nvidia GTX 2060, is 207.4
\(\times\) better than the lowest-end Intel HD 530 according to the GPU benchmark [
38]. Since WebGL has limited capacity to harness discrete GPUs, the variance is less pronounced compared to the differences in GPU capacity.
Regarding the performance differences across different integrated GPUs, we compared Apple Silicon GPUs with the GPUs integrated into Intel chips. We observed that Apple Silicon GPUs generally outperform those in Intel chips. Specifically, for TF.js, prediction latency on Apple Silicon GPUs is on average 7.1\(\times\) lower than that on Intel chips; for ORT.js, the latency on Apple Silicon GPUs is on average 6.2\(\times\) lower. This enhanced performance is primarily due to the Apple GPU’s optimizations for parallel computing, which make it more adept at handling model inference tasks. In summary, for in-browser inference, the Wasm and WebGL standards are consistent across all devices, meaning that any device capable of running a Web browser can support in-browser inference. The differences in performance are instead related to the capabilities of the devices or hardware types, such as the capacity gap between integrated and discrete GPUs.
Regarding the performance differences in in-browser inference across different CPU architectures, we compared the inference capabilities of ARM CPUs and x86 CPUs. For ARM architecture, we selected Apple Silicon, with our dataset including both the M1 and M2 chips, while for x86 architecture, we used Intel CPUs. We found that the latency on ARM CPUs is on average higher than on x86 CPUs. Specifically, for TF.js, the latency on ARM CPUs was 2.2\(\times\) that on x86 CPUs, and for ORT.js, the gap is 2.9\(\times\). This disparity is significantly smaller than the previously analyzed cross-device variance of up to 28.6\(\times\). The primary reason for this smaller gap is the considerable variation in the performance of the Intel chips in our dataset, which range from Core-i5 Skylake models to Core-i7 Rocket Lake models. Lower-end Intel chips performed worse than Apple chips, whereas higher-end Intel models surpassed the Apple chips. In summary, there is greater variance in inference performance across different Intel CPUs, and the average performance gap between Intel and Apple CPUs is relatively small. The highest-end Intel CPUs demonstrated the best inference performance.
As for warmup latency, the variance across all models reaches up to 25.3\(\times\) in the Wasm backend and 14.4\(\times\) in the WebGL backend. The variance is not as significant as that of prediction latency. The reason is that warmup latency includes memory allocation overhead, and compared to computation, the impact of hardware capacity on memory allocation is negligible. Consequently, because memory allocation is involved, the variance of warmup latency is lower than the variance in prediction latency. However, warmup latency remains predominantly driven by computation, i.e., initializing the first model inference. Therefore, due to significant differences in hardware capacity, the variance in warmup latency can still reach up to 25.3\(\times\).
In summary, after analyzing the results of
Figures 3 and
4, we find that the variance in prediction latency and warmup latency across devices is quite significant, primarily attributable to variations in the hardware capacity of user devices, as both types of latency are dominated by computation. Compared to prediction latency, the variance in warmup latency across devices is relatively smaller, mainly because warmup latency also encompasses memory allocation, whose performance varies little across devices.
Model-Level Latency Difference across Different Models and Tasks. We explore the latency difference across different models within the same task. Taking prediction latency as an example, we found that MobileNetV2 exhibits the lowest latency in image classification tasks in all settings, while Bert-Small shows the lowest latency in grammar checking tasks in all settings. This is primarily because these two models are the lightest within their respective task model sets. Specifically, MobileNetV2 has a model file size of only 14 MB, and Bert-Small is only 17 MB. However, for object detection tasks, SSD-MobileNetV2 performs best in the Wasm backend of ORT.js, while Yolo5-Middle shows the best performance in the WebGL backend of TF.js. This is mainly due to the differences in the implementations of the two backends across the two frameworks. We also observed significant differences in prediction latency among different models. Taking image classification as an example, in our experiments, the file sizes for the MobileNetV2 and ResNet50 models were 14 MB and 98 MB, respectively. Additionally, the primary kernels included in these models differ: MobileNetV2 mainly incorporates depthwise separable convolutions and pointwise convolutions, whereas ResNet50 predominantly uses \(3\times 3\) convolutions. These structural differences between the two models lead to variations in setup latency, warmup latency, and prediction latency. As for models used in different tasks, take for example the image classification model MobileNetV2 and the object detection model SSD-MobileNetV2. Although SSD-MobileNetV2 uses MobileNetV2 as its feature extractor, it omits the last four bottleneck layers of MobileNetV2 and incorporates additional detection layers. Such structural differences between the two models affect their inference latency. In summary, the differences in inference latency between models are primarily due to variations in their model structure.
Results on Firefox Browser. We further investigated the impact of different browsers. For each application category, we selected a model and presented the results in
Tables 5–
7.
Starting with the CPU results, the first aspect is prediction latency. By comparing
Tables 5 and
6 with
Table 1, we observed that a gap still exists between in-browser and native inference, averaging 8.8
\(\times\). This is primarily due to the differences between the browser and native environments, including the absence of advanced instructions and inevitable resource competition. Additionally, we compared in-browser inference between Firefox and Chrome (
Tables 5 and
6 vs.
Tables 2 and
3). We found that for the same framework, the Chrome browser outperforms the Firefox browser consistently. Specifically, for TF.js, the average prediction latency on Firefox is 1.08
\(\times\) that of Chrome; for ORT.js, it is 1.11
\(\times\). The second is warmup latency. Compared with native inference latencies, the results are similar to those on Chrome, i.e., in-browser inference warmup latency is on average 0.37–1.14
\(\times\) of native inference for TF.js; the gap for ORT.js and ORT is 1.07–22.32
\(\times\). The difference between the browsers remains low, with Firefox’s results varying from Chrome’s by no more than 10.3% for both frameworks. The difference between the two browsers in warmup and prediction latency arises from their implementations, since the Wasm standard is the same in both. The third is setup latency. Compared to native inference, the setup latency for TF.js is 0.05–10.5
\(\times\) that of TF, and for ORT.js it is 0.84–2.6
\(\times\) that of ORT. Since in-browser inference setup latency is mainly influenced by library size and network conditions, there are significant differences between models as well as between native and browser environments. Among different browsers, the setup latency on Firefox is 1.03
\(\times\) that of Chrome, a gap slightly smaller than those for prediction and warmup latency. This is because the setup stage is predominantly determined by network conditions, as previously analyzed. In summary, the difference between the two browsers is minimal and can be considered negligible compared to the native vs. browser differences.
Furthermore, we analyzed the impact of different acceleration technologies, specifically multithreading and SIMD. As with previous experiments, we set four threads. We found that SIMD can reduce the prediction latency by 49.2% and 59.2% for TF.js and ORT.js, respectively, while multithreading can reduce it by 9.9% and 49.2%, respectively. When both technologies were enabled, the prediction latency was reduced by 65.2% and 73.8% for TF.js and ORT.js, respectively. Similar to Chrome, multithreading is less effective than SIMD due to unavoidable resource competition and the overhead of thread synchronization, which is determined by Wasm itself and is independent of the browser architecture.
Next, we analyzed the average latency on the GPU. Firstly, the setup latency for in-browser inference is 0.1–12.2\(\times\) that of native inference for TF.js, and the gap is 0.2–16.4\(\times\) for ORT.js and ORT. Compared to Chrome, Firefox’s setup latency is on average 1.06\(\times\) that of Chrome on discrete GPUs and 1.05\(\times\) on integrated GPUs, which is negligible compared with the gap from native inference. Secondly, as for warmup latency, the in-browser warmup latency is on average 6.1\(\times\) that of native inference for TF.js and the gap is 13.8\(\times\) for ORT.js, primarily due to time-consuming in-browser WebGL shader compilation; compared to Chrome, Firefox’s warmup latency averages 1.11\(\times\) on discrete GPUs and 1.08\(\times\) on integrated GPUs. Finally, the prediction latency for in-browser inference averages 40.2\(\times\) that of native inference for TF.js, and the average gap is 15.5\(\times\) for ORT.js; compared to Chrome, Firefox’s prediction latency averages 1.09\(\times\) higher for both frameworks. Similar to the results on the CPU, Firefox’s performance is somewhat inferior to Chrome’s. This is primarily due to differences in browser implementation, as the two browsers use different engines.
In summary, the performance differences of in-browser inference between browsers are minimal compared with the gap to native inference. Although Chrome’s in-browser inference performance is slightly better than Firefox’s, the gap is still much smaller than the difference between browser-based and native inference. This discrepancy is mainly due to variations in the internal implementations of the browsers. The differences between the two browsers remain slight also because both are mature, heavily engineered products, optimized to support general JavaScript execution and browsing tasks.
Results on Mobile Devices. In addition, we explored in-browser inference latency on mobile devices. We selected a specific model for each type of application and present the results in
Tables 8–
11. The experiments were conducted using
TensorFlow Lite (TFLite) and
ONNX Mobile Runtime (mORT).
Starting with the CPU results, the first aspect is prediction latency, for which we enabled SIMD and set four threads. By comparing
Table 8 with
Tables 9 and
10, we found that the gap between the native and browser environments on mobile devices remains as significant as on PCs. Specifically, the prediction latency of TF.js is 18.9
\(\times\) that of using TFLite, and for ORT.js and mORT, this difference is 14.5
\(\times\). For both frameworks, the average gap reaches 15.8
\(\times\). The gap between native and browser environments mainly stems from resource contention. This is primarily because mobile devices inherently have limited resources, and in-browser inference also requires browser support; running Chrome on mobile devices inevitably leads to resource competition. When performing native inference, we directly used the C++ implementation and executed model inference via “adb shell,” thus largely avoiding resource contention. In contrast, in-browser inference relies on the Chrome browser. Additionally, the absence of FMA instructions in Wasm also impacts performance. The second aspect is warmup latency. In the mobile environment, the latency for TF.js is 25.7
\(\times\) of TFLite, while the latency of ORT.js is 8.0
\(\times\) of mORT. For both frameworks, the average warmup latency gap between the native and browser environment is 10.2
\(\times\). Additionally, we observed that compared to TFLite, mORT also demonstrated lower warmup latency, a result that is similar to what we see on the PC side. Compared to prediction latency, the warmup latency for TF.js is 1.3
\(\times\) that of the prediction latency, while for ORT.js, this gap is 1.2
\(\times\). The difference in latency between the two stages is not as significant as that between the browser and the native environment. The variation between the two stages mainly arises from additional overheads such as memory allocation during the first model prediction. The final aspect is setup latency. We found that the latency for TF.js is 117.2
\(\times\) of TFLite; for ORT.js, its latency is 28.1
\(\times\) of mORT. For both frameworks, the average setup latency gap is 62.0
\(\times\). We discovered that this gap is much larger than that on PC devices. Compared to warmup and prediction, setup latency is mainly affected by model size, library size, and network conditions. The setup latency of in-browser inference is high because mobile devices can only connect to wireless networks. According to our measurement, the download bandwidth ranges from 5.4 MB/s to 7.8 MB/s, which is much slower than that on PC devices using high-speed wired networks. Different from the PC side, on mobile devices both libraries are lightweight. Specifically, the TFLite library is only 5.3 MB, and the mORT library is just 14 MB. This indicates that the latency for loading libraries on mobile devices is low. However, as the model changes, there are significant differences in the latency for loading models. For example, the setup latency for MobileNetV2 is only 22.0 ms, while for VGG16 it reaches 668.6 ms. This difference is primarily because the setup stage also includes loading the model. During this process, the framework loads the model to memory, involving parsing the model structure, and thus, as the model structure becomes more complex, the setup latency increases accordingly.
Furthermore, we analyzed the impact of different acceleration technologies, i.e., multithreading and SIMD. As with previous experiments, we set four threads when enabling multithreading. We found that SIMD could reduce the prediction latency for TF.js by 36.2% and for ORT.js by 37.8%, while multithreading could reduce the latency by 13.3% and 17.0% for TF.js and ORT.js, respectively. When both technologies were enabled, prediction latency decreased by 54.1% and 42.4% for TF.js and ORT.js, respectively. Similar to PC devices, SIMD also brings more acceleration than multithreading on mobile devices, and the effect of multithreading is not as pronounced as on PCs due to limited resources on mobile devices, leading to significant resource competition when four threads are enabled, thus lowering the multithreading performance.
Next, we analyzed data on the GPU. The first aspect is prediction latency. In the mobile environment, the latency of TF.js is 9.4\(\times\) that of TFLite; the latency of ORT.js is 5.6\(\times\) that of mORT. The average gap for both frameworks is 7.8\(\times\). This gap is much smaller than on PC devices, mainly because mobile GPUs do not possess the powerful parallel computing capabilities of Nvidia GPUs and lack advanced libraries like CUDA on PCs. In-browser inference is slower than native inference primarily because native inference kernels are implemented directly in OpenCL and can run on mobile devices directly after compilation, whereas in-browser inference must go through WebGL. The second aspect is warmup latency. In the mobile environment, the latency of TF.js is 242.2\(\times\) that of TFLite on average; the latency of ORT.js is 29.2\(\times\) that of mORT on average. For both frameworks, the average gap between the mobile native and browser environments is 162.3\(\times\). Such large gaps mainly arise from the poorer performance of mobile GPUs: in-browser WebGL shader compilation depends on GPU performance, hence the significant difference. The gap on mobile devices differs considerably from that on PC GPUs, primarily because the hardware is completely different. On PCs, due to limitations of the native inference framework, we use discrete GPUs and perform inference with CUDA, which involves considerable initialization latency for the discrete GPUs, whereas on mobile devices, the GPU is on the same chip as the CPU. Additionally, TFLite’s lack of support for some models due to its framework implementation also impacts the results. Nevertheless, there is still a large gap between the warmup latencies of native and in-browser inference on mobile devices. The last aspect is setup latency. In the mobile environment, the latency of TF.js is 21.7\(\times\) that of TFLite; for ORT.js, it is 315.8\(\times\) that of mORT. For both frameworks, the average setup latency gap is 81.7\(\times\). This gap is also larger than on PCs. The setup stage of in-browser inference is heavily influenced by network conditions, as analyzed in the CPU results. Mobile devices suffer from higher setup latency due to lower network transmission bandwidth, a problem not present in native inference, which merely requires loading from the local file system.
In summary, on mobile devices, the performance differences between the native and browser environments are still large, primarily due to the SIMD constraints on CPUs (e.g., the absence of FMA in Wasm) and inevitable resource competition. Compared to PC devices, in-browser inference latency on mobile devices is generally higher, largely due to limited resources. This scenario has both advantages and disadvantages. On the positive side, the uniformity of the browser environment across platforms simplifies the development and deployment of in-browser inference. However, the lack of powerful computational libraries and the resource constraints can lead to increased latency, presenting a significant challenge for real-time applications on mobile platforms.
4.1.2 Kernel-Level Latency.
We explore the kernel-level performance of in-browser inference. Due to the immature support for profiling in ORT.js, we use TF.js for this analysis. We provide prediction results in
Table 12 and warmup results in
Table 13. The setup stage is excluded because kernels are not executed in this stage.
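For reference, the sketch below shows one way such kernel-level numbers can be collected with TF.js’s built-in profiler; the aggregation by kernel name is our own illustration, and the exact profiling harness used in the study may differ.

```typescript
import * as tf from '@tensorflow/tfjs';

// Collect per-kernel timings for one prediction with TF.js's built-in profiler
// and aggregate them by kernel name (e.g., FusedConv2D, Einsum, GatherV2).
async function profileKernels(model: tf.GraphModel, input: tf.Tensor) {
  const info = await tf.profile(async () => {
    tf.dispose(await model.executeAsync(input));
  });

  const latencyByKernel = new Map<string, number>();
  for (const kernel of info.kernels) {
    const ms = await kernel.kernelTimeMs; // may be a number or an error object
    if (typeof ms === 'number') {
      latencyByKernel.set(kernel.name, (latencyByKernel.get(kernel.name) ?? 0) + ms);
    }
  }
  return latencyByKernel;
}
```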
First, we explore which kernels dominate the
prediction latency. Both multithreading and SIMD are enabled and we set four threads in the Wasm backend. Each percentage value in
Table 12 shows the latency proportion out of the total latency in a given setting. We find that computation-intensive kernels predominantly contribute to the overall latency in both backends, even though memory-intensive kernels make up 37.7% of all kernels. Specifically, Einsum and FusedConv2d contribute more than 70% of the total latency. We also find that the top five kernels differ between the two backends due to different computation and memory-access patterns. In the Wasm backend, memory-intensive kernels introduce more latency. For example, GatherV2 accounts for the third-largest share of prediction latency in Wasm (12.5%), but it is not among the top five kernels in WebGL. The reason is that interthread synchronization relies on message passing in Wasm [
43], which introduces additional overhead that becomes particularly pronounced for memory-intensive kernels. These results indicate that computation-intensive kernels remain a focal point for optimization in both backends.
We also explore the impact of SIMD and multithreading on kernels during the prediction stage. We set four threads when enabling multithreading. We compared the latency of each kernel under the various acceleration technologies against the latency in the vanilla Wasm setting, and represent these variations using different cell colors in
Table 12. Firstly, we found that enabling multithreading increased the prediction latency of memory-intensive kernels by up to 3.7% compared to the settings without multithreading. Furthermore, the proportion of latency contributed by these kernels within the overall inference process increased by up to 8.6% (as seen in the results for GatherV2 in the “Wasm-S” and “Wasm-S-T” settings). This is because these memory-intensive kernels have limited support for multithreading. Diving into the kernel implementations in the source code, we find that multithreading is only employed in some kernels, such as Pad. When enabling SIMD, we find that the latency proportion of computation-intensive kernels decreases by up to 6.8% while memory-intensive kernel latency changes negligibly (as indicated by the cell color). Additionally, we observed that the proportion of computation-intensive kernel latency within all kernels decreased by up to 12.8% (as seen in the results for Einsum in the “Wasm-T” and “Wasm-S-T” settings). This is because Wasm has limited support for data-movement SIMD instructions, hindering the potential acceleration of memory-intensive kernels through SIMD. When the latency of computation-intensive kernels decreases while that of memory-intensive kernels remains unchanged, the proportion of latency attributed to memory-intensive kernels increases; the decrease in the proportion attributed to computation-intensive kernels follows from the same reasoning. Nevertheless, each technique effectively accelerates computation-intensive kernels. For instance, compared with vanilla Wasm, SIMD and multithreading reduce the average prediction latency of FusedConv2D by 86.9% and 66.7%, respectively. When applying both techniques, the prediction latency decreases by 89.9%.
During the warmup stage, our observations remain consistent with previous findings. The latency in both backends is predominantly influenced by computation-intensive kernels. However, the Wasm backend exhibits a higher latency contribution from memory-intensive kernels compared to the WebGL backend. Upon enabling SIMD, we note a reduction in the latency proportion attributable to computation-intensive kernels by up to 2.0%. This reduction is smaller than that observed in the prediction stage, primarily because memory allocation incurs additional overhead that is unaffected by SIMD. Although the computation time for computation-intensive kernels decreased, extra overhead is added due to memory allocation, which remains constant. Moreover, the latency of memory-intensive kernels also does not significantly decrease; these result in a modest reduction in the proportion of computation-intensive kernels after SIMD is enabled. Conversely, when multithreading is enabled, the latency proportion of memory-intensive kernels experiences an increase of up to 3.2%. This trend mirrors the prediction stage, where multithreading offers negligible acceleration to memory-intensive kernels while introducing extra overhead.
In summary, by analyzing the data from
Tables 12 and
13, we find that computation-intensive kernels, such as Einsum and FusedConv2D, account for at least 76.0% of total latency. Within the Wasm backend, SIMD has almost no effect on memory-intensive kernels, while multithreading even increases their latency. In contrast, both technologies reduce the latency of computation-intensive kernels by more than 10%. Diving into the kernel implementations, we find that this is primarily due to the framework’s insufficient multithreading optimization for memory-intensive kernels. In the WebGL backend, computation-intensive kernels similarly dominate the latency. Therefore, for both backends, improving the latency of computation-intensive kernels has a greater impact on overall latency, and framework vendors should also focus on optimizing multithreading for memory-intensive kernels.
Kernel-Level Latency Difference across Different Models and Tasks. To explore the differences in latency across different kernels, we primarily examined the types of kernels included in various models and the relationship between the floating point operations (FLOPs) of these kernels and their corresponding latencies. We focused on three specific operators: Einsum, FusedDwConv2D, and FusedConv2D. FusedDwConv2D is mainly used in the MobileNetV2 and SSD-MobileNetV2 models, while FusedConv2D is commonly found in other convolutional neural networks. Einsum, primarily used for matrix multiplication, is utilized across all models. For instance, FusedConv2D accounts for 87.4% of the prediction latency of ResNet50 in the Wasm backend with both multithreading and SIMD enabled, while Einsum contributes a negligible proportion. In contrast, Einsum accounts for 92.1% of the prediction latency of MobileBert in the same setting, while FusedConv2d accounts for 0%. We found that models utilizing FusedDwConv2D, typically used in classification and detection tasks, exhibit both lower FLOPs and lower latency. Specifically, FusedDwConv2D accounts for 0.6% of the parameters and 1.9% of the prediction latency. Although this may sacrifice some inference accuracy, it significantly reduces latency. Running more lightweight models in browsers therefore holds potential, as these models are computationally efficient. In summary, all three kernels are computation-intensive kernels that dominate the total latency as well as the number of parameters. Specifically, the three kernels account for 99% of total parameters. These results indicate that optimizing these kernels can significantly improve in-browser inference performance, i.e., reduce latency.
Results on Firefox Browser. We also explored results on Firefox, presenting the results of prediction and warmup latencies in
Tables 14 and
15, respectively. Starting with the prediction stage, the results are similar to those in Chrome. Computation-intensive kernels, such as FusedConv2d and Einsum, still represent the largest latency proportion even after enabling multithreading and SIMD, accounting for 41.6% and 34.1%, respectively. This is primarily because the prediction stage of model inference is still dominated by computation. Upon separately enabling multithreading and SIMD, we observed a significant reduction in latency for these computation-intensive kernels compared to vanilla Wasm. Specifically, SIMD reduced the latency of computation-intensive kernels by 84.5%, and multithreading reduced it by 63.1%. However, due to the presence of memory-intensive kernels and the framework’s lack of multithreading support for these kernels, along with limited acceleration from SIMD, the absolute latency values of memory-intensive kernels remained essentially unchanged, while their latency proportion increased when both technologies were enabled. On the GPU side, we arrived at similar conclusions. Due to the absence of thread synchronization and message passing, the latency proportion of memory-intensive kernels is so low that they do not appear in the table.
Regarding the warmup stage, shown in
Table 15, we found that computation-intensive kernels continue to dominate the latency. However, memory-intensive kernels have a higher latency proportion in the Wasm backend. We also noted that compared to the vanilla Wasm, the proportion of computation-intensive kernels is lower during the warmup stage when acceleration techniques are enabled. For instance, under conditions enabling both SIMD and multithreading, the latency proportion of FusedConv2D and Einsum decreased by 2.2% and 2.7%, respectively. This stage includes memory allocation, which is not affected by hardware performance. Similar to the prediction stage, multithreading technology, due to the overhead of thread synchronization, actually increased the latency of memory-intensive kernels (as indicated by the gray cell color in the tables). In contrast, the WebGL backend does not face such issues, maintaining a consistent dominance of computation-intensive kernels in terms of warmup latency.
In summary, the data on Firefox closely mirror the results on Chrome, aligning with our previous analysis. The changes in kernel latency proportions and absolute values primarily depend on the model architecture and the implementation of the framework, independent of the browser. This is consistent with Wasm’s browser-independent mechanism and corroborates our model-level conclusions in the Chrome browser.
Results on Mobile Devices. We also explored the results on mobile devices, displaying the prediction and warmup latencies in
Tables 16 and
17, respectively. Starting with the prediction stage, shown in
Table 16, it was unsurprising that computation-intensive kernels continued to have the highest latency proportion, consistent with previous conclusions. After enabling SIMD, the proportions of FusedConv2d and Einsum decreased by 3.3% and 1.5%, respectively, mainly due to the significant acceleration SIMD brings to these computation-intensive kernels. However, SIMD provides limited acceleration for memory-intensive kernels. After enabling multithreading, we observed an increase in the proportions and the absolute latency values of these memory-intensive kernels by 2.0% and 4.1%, respectively. When both technologies were enabled, the computation-intensive kernels saw their largest proportion decrease, with FusedConv2d and Einsum decreasing by 8.0% and 5.2%, respectively. These results align with the conclusions drawn from both browsers on the PC.
Turning to the warmup stage results in
Table 17, the proportion of latency attributed to memory-intensive kernels is higher in the Wasm backend than in the WebGL backend, reaching 12.5% after enabling SIMD and multithreading. This is mainly due to the framework’s limited optimization of these kernels when both acceleration technologies are used. The cost of memory allocation reduces the proportion of computation-intensive kernels compared to the prediction stage, primarily because CPU capacity has minimal impact on memory allocation; memory allocation essentially adds a fixed amount to the computational latency, thus lowering the proportion of computation-intensive kernels during the warmup stage. In the WebGL backend, there is no additional synchronization overhead like that of Wasm multithreading; hence, the proportion of memory-intensive kernel latency remains so low that these kernels do not appear in the table.
In summary, the results on mobile devices are consistent with those observed on PCs using Chrome and Firefox. The analysis involves only the framework and kernel implementation and optimization in two backends; thus, conclusions and analyses applicable to PCs are equally relevant to mobile devices.