¹ Zhejiang Provincial Key Laboratory of Service Robot, Zhejiang University
{shaozirui, xinghd, zapeng, yuzhirenzhe, bjj}@zju.edu.cn
² Alibaba Group
feiyu.gfy@alibaba-inc.com, yongqi.zq@taobao.com, yaocong2010@gmail.com

WebRPG: Automatic Web Rendering Parameters
Generation for Visual Presentation

Zirui Shao¹* (ORCID 0000-0002-4210-070X), Feiyu Gao²* (ORCID 0009-0009-3206-5347), Hangdi Xing¹ (ORCID 0000-0002-1770-005X), Zepeng Zhu¹ (ORCID 0009-0000-1510-6455), Zhi Yu¹† (ORCID 0009-0001-8608-5628), Jiajun Bu¹ (ORCID 0000-0002-1097-2044), Qi Zheng² (ORCID 0009-0001-3822-2616), Cong Yao² (ORCID 0000-0001-6564-4796)
* Equal contribution. † Corresponding author.

Abstract

In the era of content creation revolution propelled by advancements in generative models, the field of web design remains unexplored despite its critical role in modern digital communication. The web design process is complex and often time-consuming, especially for those with limited expertise. In this paper, we introduce Web Rendering Parameters Generation (WebRPG), a new task that aims at automating the generation of the visual presentation of web pages based on their HTML code. WebRPG would contribute to a faster web development workflow. Since there is no existing benchmark available, we develop a new dataset for WebRPG through an automated pipeline. Moreover, we present baseline models, utilizing a VAE to manage numerous elements and rendering parameters, along with a custom HTML embedding for capturing essential semantic and hierarchical information from HTML. Extensive experiments, including customized quantitative evaluations for this specific task, are conducted to evaluate the quality of the generated results. The dataset and code can be accessed on GitHub (https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/WebRPG).

Keywords: Generative model · Visual Design Automation · Web Rendering Parameters

1 Introduction

We are currently witnessing a revolution in content creation, driven by rapid advancements in generative models across domains such as image [57, 55, 21, 48, 56], text [3, 62, 42], and audio [32, 6, 7]. Numerous studies aim to leverage these advancements to enhance efficiency in graphic design, including advertisement [40, 35] and magazine [19, 35, 73] design. Nevertheless, the automation of web design, an essential part of graphic design [64], remains underexplored. Web design plays a significant role in the visual communication of web pages [61], impacting not only user satisfaction [9] but also user behavior [14]. Yet, it is a complex, time-consuming task that is especially challenging for developers with limited design expertise, which often leads to substandard visual presentations [66]. Automating web design can simplify this process, enabling developers to create visually appealing web pages and bridging the gap between technical development and aesthetic excellence.

Web pages are formed by HTML (https://html.spec.whatwg.org/) and CSS (https://www.w3.org/Style/CSS/specs.en.html) code, where HTML defines the content and structure, and CSS controls the visual presentation. With the advent of large language models (LLMs) [62, 3, 52], automating HTML code generation has become feasible. However, efforts in automatic visual presentation design, the core aspect of web design, currently center on specific subtasks such as layout generation [28, 53, 49], font recommendation [71, 2], and colorization [27, 54, 17], rather than designing a holistic web visual presentation from scratch.

Figure 1: Overview of the WebRPG task. The input consists of plain HTML code and the output comprises rendering parameters for each element. With browser rendering, plain HTML produces a disorganized visual presentation, while incorporating the generated rendering parameters significantly enhances the visual presentation.

Intuitively, leveraging generative models to learn design knowledge from existing web pages is a practical strategy for automated web visual design. However, the complexity of CSS coding practices poses challenges for its automatic generation [26]. To address this, we propose standardizing CSS using Rendering Parameters (RPs), which are defined by CSS properties that control the visual appearance of each web element [16]. Consequently, we introduce a novel task called Web Rendering Parameters Generation (WebRPG for short), which requires the automatic generation of rendering parameters for each web element based on the HTML code, as depicted in Fig. 1. With the help of a WebRPG system, HTML is the only prerequisite for obtaining an effective web visual presentation, which has the potential to achieve a faster web development workflow. With the integration of LLMs, a WebRPG system can even enable the realization of a fully automated web development workflow. Moreover, it can facilitate new applications, such as efficient exploration of various design options and dynamic personalization of web page styles.

Since there is no existing benchmark available for WebRPG, we develop automatic data processing steps to transform raw web pages into formalized WebRPG samples and construct a new dataset utilizing the Klarna dataset [22]. From a theoretical perspective, the WebRPG task presents two primary challenges: 1) Web pages comprise hundreds of elements, each with numerous RPs. 2) The visual presentation of web elements should be associated with the semantic and hierarchical information provided by HTML code. To address the challenges, variational autoencoder (VAE) [30] is employed to handle the large volume of rendering parameters for web elements, and specially designed HTML embedding is introduced to encode semantic and hierarchical information from HTML code. Using these modules, two WebRPG baselines are established, which are based on autoregressive and diffusion models, respectively. To verify the effectiveness of WebRPG baselines, metrics are designed to evaluate the overall appearance, layout, and style of the generated results. Both quantitative and qualitative experiments are conducted to assess the baselines.

Our main contributions are as follows:

• We introduce a novel task, WebRPG, for automatic web design from HTML code and create a new dataset.
• We explore the WebRPG task by establishing two baselines and propose solutions for its challenges.
• We design metrics to quantitatively evaluate the quality of generated results, and conduct qualitative experiments to analyze the strengths and weaknesses of the baselines.

2 Related Work

Generative models have achieved notable success in image [57, 55, 4, 56, 50], text [3, 62, 42], and audio [32, 6, 7] generation. Image synthesis can create web visual presentations by generating screenshots but struggles with producing coherent text [55]. Moreover, image synthesis is limited to static images and cannot offer interactive, manipulable web pages.

Numerous efforts utilize generative models for graphic design, including advertising [40, 35], magazines [19, 73, 23], UI [24, 25, 8, 44], and posters [74, 37]. Yet, these designs restrict the element count to no more than 25. These methods primarily employ a one-dimensional sequence to represent designs, with each element defined by five tokens: four describe the bounding box, and one indicates the category (e.g., text, headline) [35]. However, relying on such a simplistic flat input for the WebRPG task, which involves managing hundreds of elements and various RPs, leads to a substantial increase in memory consumption and to performance degradation [12]. Moreover, the one-dimensional sequence neglects crucial hierarchical information in web pages.

Research focused on web pages has continuously emerged. In terms of understanding, efforts in web question answering [5, 72], web information extraction [34, 69], and web pre-trained language models [41, 10, 60] have made notable progress in comprehending the essential semantic content and hierarchical structure of web pages. For instance, MarkupLM [41] stands out with its unique architecture and pre-training tasks, effectively encoding HTML content, which offers insights for our research. Moreover, there are works aimed at web page design, such as optimizing the overall or specific block coloring of web pages [27, 54, 17], determining layouts based on given components like navigation bars [28, 53, 49], and recommending fonts for particular elements [71, 2]. However, these studies focus only on specific subtasks of the web page design workflow, leaving the comprehensive design of web pages from scratch as an unexplored area.

3 Preliminary

3.1 Task Definition

Web design is centered on visual presentation, i.e., the manipulation of CSS code. The complexity of CSS coding practices, including a wide range of selector options, makes the automatic generation of CSS code challenging [26]. To make web design easier for a model to learn, we standardize CSS by converting it into rendering parameters (RPs), which can be transformed back into CSS, with additional details in Sec. A.1. Consequently, the WebRPG task is defined as follows: given the HTML code, generate rendering parameters for each web element. Specifically, a web page $\mathcal{X}$, whose HTML code is $\mathcal{H}$, consists of a set of elements $\mathcal{X}=\{X_{1},X_{2},\ldots,X_{S}\}$, where $S$ is the number of elements in $\mathcal{X}$. The visual appearance of element $X_{i}$ is controlled by a set of RPs denoted as $P_{i}=\{p_{i}^{k}\mid k\in\mathcal{W}\}$, where $\mathcal{W}$ indicates the indices for all RPs, and the complete set of RPs for $\mathcal{X}$ is $\mathcal{P}=\{P_{1},P_{2},\ldots,P_{S}\}$. Therefore, the primary objective of the WebRPG task is to create a function $f$ that generates RPs based on HTML code, that is, $f:\mathcal{H}\mapsto\hat{\mathcal{P}}$, where $\hat{\mathcal{P}}$ represents the estimate of $\mathcal{P}$.

3.2 Web Rendering Parameter Definition

The term “Rendering Parameters (RPs)” is employed to collectively describe the parameters controlling the visual appearance of each web element on the browser, as defined by CSS properties. Layout and visual style are crucial in the design of web pages [67, 60], leading us to summarize 13 common CSS properties, divided into 3 categories as follows.

• Layout properties include left, top, width, and height.
• Text properties include font-style, font-weight, font-size, line-height, text-align, text-decoration, and text-transform.
• Color properties include color and background-color.

Various formats are available for web developers to define CSS properties. To standardize, we adopt the values computed by the browser [47] as the reference. Specifically, the values related to position and size are uniformly measured in integer pixels, and the values related to color correspond to 46 widely used colors. The vocabulary for all rendering parameters is available in Sec. A.1.
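To make the idea of per-element RPs concrete, the following is a minimal sketch of how the 13 properties listed above might be held for one element and serialized back into a CSS rule. The `rps_to_css` helper, the selector, and the example values are hypothetical and chosen only for illustration; the actual conversion used in this work is the one described in Sec. A.1.

```python
# The 13 CSS properties named in Sec. 3.2, grouped by category.
LAYOUT_PROPS = ("left", "top", "width", "height")
TEXT_PROPS = ("font-style", "font-weight", "font-size", "line-height",
              "text-align", "text-decoration", "text-transform")
COLOR_PROPS = ("color", "background-color")
ALL_PROPS = LAYOUT_PROPS + TEXT_PROPS + COLOR_PROPS


def rps_to_css(selector, rps):
    """Serialize one element's rendering parameters as a CSS rule.

    Hypothetical helper: layout values are integer pixels, all other
    values are passed through as already-computed CSS strings.
    """
    unknown = set(rps) - set(ALL_PROPS)
    if unknown:
        raise ValueError(f"not a rendering parameter: {sorted(unknown)}")
    lines = []
    for prop in ALL_PROPS:  # fixed order keeps the output deterministic
        if prop not in rps:
            continue
        value = rps[prop]
        if prop in LAYOUT_PROPS:
            value = f"{int(value)}px"
        lines.append(f"  {prop}: {value};")
    return selector + " {\n" + "\n".join(lines) + "\n}"


# Illustrative values only; they are not taken from the dataset.
example = {"left": 16, "top": 40, "width": 120, "height": 32,
           "font-weight": "700", "color": "rgb(255, 255, 255)",
           "background-color": "rgb(0, 0, 0)"}
css_rule = rps_to_css("#buy-button", example)
print(css_rule)
```

Note that in CSS, `left` and `top` only take effect on positioned elements, so a real conversion would also need to emit a suitable `position` declaration; that detail is left out of this sketch.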

4 Dataset Construction

4.1 Data Pre-processing

Raw web pages cannot provide straightforward supervision for RPs, so several pre-processing steps are conducted. Headless Chrome (https://developer.chrome.com/blog/headless-chrome/) is used to render web pages, and Selenium (https://www.selenium.dev/) is employed to store HTML with only visible elements and to record each element's selected CSS properties. Note that elements in this paper mean nodes in the DOM (https://www.w3.org/DOM/DOMTR) tree. The elements are stored following the DOM tree's pre-order traversal. Since many web pages contain thousands of elements, we treat elements with a certain number of children as sub-pages, preserving their semantic and hierarchical integrity. The sub-pages are further cleaned while keeping the visual appearance, which includes removing uncommon HTML tags and intricate components such as carousel images, as well as placing sub-pages at the top-left corner of the browser. Additionally, we only consider static components. Our models disregard the images on web pages, preserving only <img> tags. To guarantee data quality, a specific Visual Complexity (VC) metric is introduced to assist in filtering samples. The metric integrates three dimensions: color, size, and alignment, inspired by previous works [1, 15]. The definition of the VC metric is provided in Sec. A.2.

4.2 Dataset Details

To accommodate the requirement for offline rendering, the Klarna dataset [22] is utilized to build our WebRPG dataset. The Klarna dataset, initially used for web information extraction, comprises 20K English product pages from 3K e-commerce sites, ensuring domain-specific diversity. The dataset stores all pages in MHTML (https://en.wikipedia.org/wiki/MHTML) format, enabling offline rendering of the original pages in browsers with high fidelity.

The pre-processing in Sec. 4.1 is applied to the web pages with the browser canvas size set to 1920×1920 pixels, generating sub-pages containing between 32 and 128 child elements. The token length for each sample (sub-page) does not exceed 512. The size of the RP vocabulary is 1993. Samples with a VC below 0.1 are filtered out. After pre-processing, our dataset includes 88,418 samples, split into training and testing sets at an 8:2 ratio. Our dataset exceeds the size of established graphic design datasets such as CLAY [38] (50K samples) and RICO [44] (43K samples), ensuring it can meet our objectives. Screenshots of some samples are shown in Fig. 2. More details are provided in Sec. A.3.

5 Methodology

5.1 Overview

As indicated in Sec. 3.1, the WebRPG task is formulated as a function that generates rendering parameters (RPs) for each web element based on the HTML code. Inspired by classical generation methods [57, 50, 56], we employ a latent generation approach. In the approach, VAE is leveraged to compress all RPs of an element into latent space representation (Sec. 5.2), and a generative model (Sec. 5.4) generates the latent vector based on the given HTML embeddings (Sec. 5.3), which is then decoded back into RPs by the decoder of VAE. The key components of our method are shown in Fig. 3.

5.2 Rendering Parameters Compression

Assume a web page consists of $S$ elements, with the appearance of each element $X_{i}$ determined by $|\mathcal{W}|$ rendering parameters $P_{i}=\left\{p_{i}^{k}\mid k\in\mathcal{W}\right\}$. The WebRPG model must process $S\times|\mathcal{W}|$ values for both input and output. Expanding all $p_{i}^{k}$ of $X_{i}$ into a one-dimensional sequence, as per graphic design methods [24, 35], leads to excessively long input and output sequences. To mitigate this challenge, we utilize a VAE to compress the rendering parameters of each element into a latent vector, so that the input length of the generative model depends only on $S$.

More precisely, the RPs of an element are represented as $P_{i}\in\mathbb{R}^{|\mathcal{W}|\times V}$, where $V$ is the size of the RP vocabulary (Sec. 3.2), and the corresponding latent vector is $Z_{i}\in\mathbb{R}^{d}$. We denote the posterior (encoder) as $q_{\theta}(Z_{i}\mid P_{i})$ and the generative distribution (decoder) as $p_{\phi}(P_{i}\mid Z_{i})$. The learning objective of the VAE is expressed as:

$$L_{VAE}=\frac{1}{S}\sum_{i=1}^{S}\Big(-\mathbb{E}_{q_{\theta}(Z_{i}\mid P_{i})}\left[\log p_{\phi}(P_{i}\mid Z_{i})\right]+\lambda_{KL}\,\mathrm{KL}\left(q_{\theta}(Z_{i}\mid P_{i})\,\|\,p(Z_{i})\right)\Big),\tag{1}$$

where $\theta$ and $\phi$ are the encoder and decoder parameters, $\mathbb{E}$ indicates the expectation, $\mathrm{KL}$ is the Kullback-Leibler divergence, and $\lambda_{KL}$ is the hyperparameter to balance the two terms. The encoder and decoder of VAE both consist of a multilayer perceptron with five layers. To ensure that the latent space encompasses as many element appearances (i.e., combinations of RPs) as possible, the VAE is pre-trained using synthetic data.
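The structure of Eq. (1), a reconstruction term plus a weighted KL term averaged over the $S$ elements of a page, can be sketched numerically as follows. This is only an illustration under stated assumptions: the posterior is taken to be a diagonal Gaussian and the prior a standard normal (a common choice that the text does not spell out), the expectation is approximated with a single decoder output per element, and the toy numbers and the value of `lambda_kl` are invented.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, logvar))


def reconstruction_nll(probs_per_rp, target_ids):
    """Negative log-likelihood of the true RP values.

    probs_per_rp[k] is the decoder's predicted distribution over the
    vocabulary for the k-th rendering parameter of one element.
    """
    return -sum(math.log(probs[t])
                for probs, t in zip(probs_per_rp, target_ids))


def vae_loss(elements, lambda_kl):
    """Average of (reconstruction NLL + lambda_kl * KL) over S elements."""
    total = 0.0
    for mu, logvar, probs_per_rp, target_ids in elements:
        total += reconstruction_nll(probs_per_rp, target_ids)
        total += lambda_kl * kl_to_standard_normal(mu, logvar)
    return total / len(elements)


# Toy page with S = 2 elements, 2 RPs each, and a vocabulary of size 2.
page = [
    ([0.0, 0.0], [0.0, 0.0], [[0.5, 0.5], [0.5, 0.5]], [0, 1]),
    ([1.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0, 1]),
]
loss = vae_loss(page, lambda_kl=0.1)
print(loss)
```

In the toy page, the first element is reconstructed poorly but has a posterior equal to the prior (zero KL), while the second is reconstructed perfectly but pays a KL cost; $\lambda_{KL}$ controls how these two pressures are traded off.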

5.3 Encoding HTML

The visual presentation of a web page should be in harmony with the content and structure dictated by its HTML code. To this end, we design an HTML embedding that captures the essential information in the HTML code, establishing the input feature for the generative model (Sec. 5.4). HTML code essentially encompasses hierarchical information among elements and the textual content of each element [10]. The character count of each element is also crucial, as the size of an element generally exhibits a positive correlation with the length of characters. Therefore, our HTML embedding integrates three facets of information: semantics, hierarchy, and character count. Precisely, for an element $X_{i}$ , its HTML embedding $H_{i}\in\mathbb{R}^{d}$ is defined as:

$$H_{i}=\Lambda^{\text{Sem}}(H_{i}^{\text{Sem}})+\Lambda^{\text{Hier}}(H_{i}^{\text{Hier}})+\Lambda^{\text{CharC}}(H_{i}^{\text{CharC}}),\tag{2}$$

where $H_{i}^{\text{Sem}}$ , $H_{i}^{\text{Hier}}$ and $H_{i}^{\text{CharC}}$ denote the semantic, hierarchical and character count embedding respectively, and $\Lambda^{\circ}()$ is the linear projection layer.

Semantic embedding: The MarkupLM_large model [41], a language model explicitly pre-trained for web understanding, is employed as the semantic extractor. Specifically, given an element $X_{i}$ with HTML code tokens $X_{i}=\{x_{i}^{j}\mid 1\leq j\leq\mathcal{L}\}$, where $\mathcal{L}$ denotes the token length, we calculate the semantic embedding of $X_{i}$ as $H_{i}^{\text{Sem}}=\text{Pool}(\text{MarkupLM}(x_{i}^{1},x_{i}^{2},\ldots,x_{i}^{\mathcal{L}}))$, where $\text{Pool}(\cdot)$ denotes an average pooling operation.

Hierarchical embedding: The XPath embedding layer [41] is employed to model the hierarchical information of elements, taking their XPath expressions as input. XPath (https://www.w3.org/TR/xpath-31/) is a query language for selecting elements from a web page, which is based on the DOM tree and can be used to easily locate an element. Specifically, for an element $X_{i}$ with its corresponding XPath expression $xp_{i}$, we compute the hierarchical embedding directly as $H_{i}^{\text{Hier}}=\text{XPathEmb}(xp_{i})$.
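As an illustration of the input to the hierarchical embedding, the sketch below derives an XPath expression for every element of a small HTML fragment, visiting elements in DOM pre-order. It uses only Python's standard `html.parser`; the `XPathCollector` class is hypothetical, assumes well-formed HTML, and does not reproduce the XPath embedding layer itself, which maps such expressions to vectors.

```python
from html.parser import HTMLParser

# A few void elements that never have a closing tag.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class XPathCollector(HTMLParser):
    """Record an XPath for every element, in DOM pre-order."""

    def __init__(self):
        super().__init__()
        self.path = []      # XPath steps of the currently open elements
        self.counts = [{}]  # per open element: tag -> children seen so far
        self.xpaths = []

    def handle_starttag(self, tag, attrs):
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        step = f"{tag}[{siblings[tag]}]"  # 1-based index among same-tag siblings
        self.xpaths.append("/" + "/".join(self.path + [step]))
        if tag not in VOID_TAGS:
            self.path.append(step)
            self.counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.path and self.path[-1].startswith(tag + "["):
            self.path.pop()
            self.counts.pop()


html = ("<div><h1>Lamp</h1><img src='lamp.png'>"
        "<ul><li>Red</li><li>Blue</li></ul></div>")
collector = XPathCollector()
collector.feed(html)
xpaths = collector.xpaths
for xp in xpaths:
    print(xp)
```

The two `<li>` elements share a parent path and differ only in their sibling index, which is exactly the kind of structural regularity a hierarchy-aware embedding can exploit.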

Character count embedding: We establish a mapping mechanism that translates the raw count of characters into dense vector space. For an element $X_{i}$ with the content of $k$ characters, the character count embedding is calculated as $H_{i}^{\text{CharC}}=\text{EmbCharC}(k)$ .

5.4 Generative Models

Two generative models are implemented: autoregressive and diffusion model.

Autoregressive Model (AR): To enhance model stability during training, a masked latent vector $\mathcal{Z}_{mask}$ of real RPs is introduced, inspired by BART [36] and MaskGIT [4]. $\mathcal{Z}_{mask}$ is constructed in two steps. First, the real RPs are encoded into latent vectors with the VAE encoder, i.e., $\mathcal{Z}=\theta(\mathcal{P})$. Then a special $MASK$ vector and a binary mask $M=\{m_{i}\mid 1\leq i\leq S\}$ are utilized to partially substitute the real latent vectors with $MASK$, as $Z_{mask,i}=m_{i}\cdot MASK+(1-m_{i})\cdot\theta(P_{i})$.

Here $M$ is generated using a mask scheduling function $\gamma(r)\in(0,1]$ following MaskGIT [4], and the $MASK$ vector is a learnable parameter with the same shape as $Z_{i}$ . Additionally, it is important to highlight that during inference, all $Z_{i}$ are masked, i.e., $M=\{m_{i}=1|1\leq i\leq S\}$ .
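The construction of $\mathcal{Z}_{mask}$ can be sketched as follows. This is an assumption-laden illustration rather than the actual training code: the cosine form of $\gamma(r)$ is one schedule that satisfies $\gamma(r)\in(0,1]$, the `mask_latents` helper is hypothetical, and the $MASK$ vector is a fixed list here although it is a learnable parameter in the model.

```python
import math
import random


def cosine_schedule(r):
    """One possible mask ratio gamma(r) in (0, 1] for r in [0, 1)."""
    return math.cos(0.5 * math.pi * r)


def mask_latents(latents, mask_vector, ratio, rng):
    """Replace a random subset of latent vectors with the MASK vector.

    Implements Z_mask,i = m_i * MASK + (1 - m_i) * Z_i with binary m_i.
    """
    s = len(latents)
    n_masked = max(1, math.ceil(ratio * s))
    masked_ids = set(rng.sample(range(s), n_masked))
    m = [1 if i in masked_ids else 0 for i in range(s)]
    z_mask = [
        [m_i * mk + (1 - m_i) * z for mk, z in zip(mask_vector, z_i)]
        for m_i, z_i in zip(m, latents)
    ]
    return z_mask, m


rng = random.Random(0)
latents = [[float(i), float(i)] for i in range(1, 5)]  # S = 4, d = 2
mask_vector = [0.0, 0.0]                               # learnable in practice

# Training: mask a random fraction of the elements.
r = rng.random()
z_train, m_train = mask_latents(latents, mask_vector, cosine_schedule(r), rng)

# Inference: every element is masked (m_i = 1 for all i).
z_infer, m_infer = mask_latents(latents, mask_vector, 1.0, rng)
print(m_train, m_infer)
```

With a ratio of 1.0 the input carries no ground-truth latents at all, which matches the inference setting in which the model must rely on the HTML embedding alone.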

As depicted in Fig. 3, the model inputs the sum of $\mathcal{Z}_{mask}$ and $\mathcal{H}$ to generate $\hat{\mathcal{Z}}$ , which is then decoded by the VAE decoder as $\mathcal{\hat{P}}=\phi(\hat{\mathcal{Z}})$ . The VAE and generative models are trained jointly, thus the training loss is as follows:

$$L=-\log p_{\psi}(\mathcal{P}\mid\mathcal{H},\mathcal{Z}_{mask})+L_{VAE},\tag{3}$$

where $\psi$ denotes the parameters of the generative model.

Diffusion Model: Diffusion models [21, 48, 75] have recently emerged as a new class of generative models with high performance. These models are characterized by forward and reverse Markov processes of length $T$. In our rendering parameters compression (VAE) model, rendering parameters $\mathcal{P}$ are encoded into a latent space, i.e., $\mathcal{Z}=\theta(\mathcal{P})$. These latent vectors $\mathcal{Z}$, which align more closely with a Gaussian distribution, improve compatibility with the noise distribution in diffusion models. Following successful models [11, 57, 59], our diffusion model can be interpreted as an equally weighted sequence of denoising autoencoders $\boldsymbol{\epsilon}_{\psi}(\mathcal{Z}_{t},t,\mathcal{H})$, $t=1\ldots T$, which are trained to predict the noise $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ in $\mathcal{Z}_{t}$. The noisy latent $\mathcal{Z}_{t}$ is obtained from a forward process starting from $\mathcal{Z}_{0}$ (where $\mathcal{Z}_{0}=\mathcal{Z}$), defined as $\mathcal{Z}_{t}=\sqrt{\alpha_{t}}\mathcal{Z}_{t-1}+\sqrt{1-\alpha_{t}}\boldsymbol{\epsilon}$, with $\alpha_{t}$ being a predefined set of coefficients. As illustrated in Fig. 3, $\mathcal{Z}_{t}$ and $\mathcal{H}$ are added and input into the model. Our diffusion model employs the standard variational lower bound objective as its training loss, and we jointly optimize the VAE, leading to the overall loss function:

$$L=\mathbb{E}_{\mathcal{Z},\,\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),\,t}\Big[\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\psi}(\mathcal{Z}_{t},t,\mathcal{H})\|_{2}^{2}\Big]+L_{VAE}.\tag{4}$$

During inference, the predicted $\hat{\mathcal{Z}}$ is progressively obtained through a reverse process, expressed as $\mathcal{Z}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathcal{Z}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\alpha_{t}}}\boldsymbol{\epsilon}_{\psi}(\mathcal{Z}_{t},t,\mathcal{H})\right)$. Subsequently, $\hat{\mathcal{Z}}$ is decoded to $\hat{\mathcal{P}}$ via a single pass through the VAE decoder $\phi$. Additionally, $\mathcal{Z}_{T}$ is random Gaussian noise.

6 Experiment

6.1 Evaluation Metrics

Three metrics are utilized to assess the quality of the generated rendering parameters. Fréchet Inception Distance (FID), Element Intersection over Union (Ele. IoU), and newly introduced Style Consistency Score (SC Score) enable the evaluation of the overall appearance, layout, and style of generated web pages respectively. As indicated in Sec. 3.2, “layout” refers to layout properties, while “style” encompasses text properties and color properties.

6.1.1 Fréchet Inception Distance

FID [20], a metric initially proposed in the domain of image generation, measures the similarity of generated data to real ones in feature space. Inspired by Lee et al. [35], a binary classifier is trained to distinguish between real and noise-added RPs. This classifier is employed to generate representative features of RPs for calculating FID. We also introduce layout-specific and style-specific FID models. Further details are in Sec. A.4.

6.1.2 Element Intersection over Union

Ele. IoU is a metric for evaluating the similarity between generated layouts and real ones, adapted from the Maximum IoU [29]. As the elements of real and generated web pages correspond one-to-one, IoU is computed between the corresponding pairs. Denote the real layouts as $B=\{b_{i}\}_{i=1}^{N}$ and the generated ones as $\hat{B}=\{\hat{b}_{i}\}_{i=1}^{N}$, with $N$ being the element count, and $b_{i}$ and $\hat{b}_{i}$ as corresponding elements. The Ele. IoU can be calculated as follows:

$$\text{EleIoU}(B,\hat{B})=\frac{1}{N}\sum_{i=1}^{N}\mathrm{IoU}(b_{i},\hat{b}_{i}).\tag{5}$$
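Eq. (5) is straightforward to compute once each element's layout RPs are read as a box. The sketch below assumes boxes are given as `(left, top, width, height)` tuples in pixels, matching the layout properties of Sec. 3.2, and assigns an IoU of 0 when both boxes have zero area; that edge-case convention and the function names are assumptions made here for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (left, top, width, height) in pixels."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def ele_iou(real, generated):
    """Mean IoU over one-to-one corresponding element pairs."""
    if len(real) != len(generated):
        raise ValueError("real and generated pages must have the same elements")
    return sum(box_iou(b, g) for b, g in zip(real, generated)) / len(real)


real_boxes = [(0, 0, 100, 50), (0, 60, 100, 40)]
generated_boxes = [(0, 0, 100, 50), (50, 60, 100, 40)]
score = ele_iou(real_boxes, generated_boxes)
print(score)
```

In this toy page the first element is placed perfectly (IoU 1) and the second is shifted right by half its width (IoU 1/3), so the page-level score is their mean.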

6.1.3 Style Consistency Score

The “Principle of Similarity” of Gestalt theory suggests that people tend to perceive elements with similar style as a whole [65, 31], highlighting the importance of style consistency among elements. Hence, the SC Score assesses whether elements with the same style on a real web page retain that consistency on the generated page, beyond merely visual similarity. An example explanation is provided in Sec. C. Elements are deemed to have the same style only if all their style properties are identical [60]. Specifically, for a web page $W=\{e_{i}\mid i\in N\}$ with $N$ being the number of elements, the style consistency subset of the page is defined as $S\subseteq W,\forall e_{i},e_{j}\in S,style(e_{i})=style(e_{j}).$

Thus the real web page $W$ and its generated page $\hat{W}$ are divided into style consistency subsets $W=\{S_{j}\}_{j=1}^{M}$ and $\hat{W}=\{\hat{S}_{k}\}_{k=1}^{K}$, respectively. Given that the numbers of subsets $M$ and $K$ can differ, we apply a max operation for optimal matching. The SC Score is then calculated as:

$$\mathrm{SCScore}(W,\hat{W})=\sum_{j=1}^{M}w_{j}\cdot\max_{k}J(S_{j},\hat{S}_{k}),\tag{6}$$

where $J(A,B)$ is the Jaccard similarity coefficient. Additionally, under the assumption that style consistency subsets with more elements are more semantically valuable, we utilize a weight $w_{j}=\frac{|S_{j}|}{\sum_{l=1}^{M}|S_{l}|}$ .
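A small sketch may help show that the SC Score compares how elements are grouped rather than which style values they carry. Each page is given as a list with one hashable style descriptor per element (in practice a tuple of all text and color properties); the function names and the toy labels are assumptions for illustration, and pages are assumed to be non-empty.

```python
from collections import defaultdict


def style_subsets(styles):
    """Group element indices whose style descriptors are identical."""
    groups = defaultdict(set)
    for idx, style in enumerate(styles):
        groups[style].add(idx)
    return list(groups.values())


def jaccard(a, b):
    """Jaccard similarity coefficient of two non-empty index sets."""
    return len(a & b) / len(a | b)


def sc_score(real_styles, generated_styles):
    """Size-weighted best-match Jaccard over the real page's style subsets."""
    real_sets = style_subsets(real_styles)
    gen_sets = style_subsets(generated_styles)
    total = sum(len(s) for s in real_sets)
    return sum(len(s) / total * max(jaccard(s, g) for g in gen_sets)
               for s in real_sets)


real = ["title", "title", "body", "body", "body"]
generated = ["x", "x", "y", "y", "x"]      # element 4 joins the wrong group
regrouped = ["red", "red", "blue", "blue", "blue"]

score = sc_score(real, generated)
perfect = sc_score(real, regrouped)
print(score, perfect)
```

The `regrouped` page uses entirely different style values from the real page yet scores 1, because the same elements share a style; the `generated` page is penalized only for the one element assigned to the wrong group.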

6.2 Implementation

Two baselines are implemented: an autoregressive model (WebRPG-AR) and a diffusion model (WebRPG-DM). The VAE, hierarchical embedding, and character count embedding are jointly trained with the backbone, and the semantic embedding is produced by the frozen pre-trained MarkupLM_large [41]. The XPath embedding layer is initialized following Li et al. [41]. All baselines are based on Transformer [63], use a hidden dimension of 128, and have approximately 50M parameters to ensure a fair comparison. The dimension $d$ of the latent vector and the HTML embedding is 128. For optimization, AdamW [45] is used with a learning rate of 1.2e-4. All models are trained for 1M steps with a batch size of 300.

Additionally, LLMs have been gaining adoption in different domains. We assess GPT-4 [70, 51], StarCoder2-7b [46], DeepSeek-Coder-6.7b [18], CodeLlama-13B [58] on the WebRPG task. GPT-4 is one of the state-of-the-art LLMs, while the others are open-source models known for code generation. Due to limited resources, we randomly select 10% of test samples. The prompt template employs in-context learning [3], incorporating a task description, three demonstrations, and a test instance. Further details are available in Sec. A.5.

6.3 Quantitative and Qualitative Evaluation

We present quantitative results in Tab. 1 and qualitative results in Fig. 4. Regarding the results of real data, in addition to the normally rendered web page (Tab. 1, “Real Web Page”), we also report the web page rendered using only HTML code (Tab. 1, “Plain HTML”). Since the browser applies default CSS when custom CSS is absent, some models perform worse than the plain HTML because they generate unreasonable RPs. The FIDs for real data are calculated between the test set and other real web pages.

The experimental results show that WebRPG-AR consistently surpasses other baselines. Its sequential decoding mechanism allows for more refined control based on previously generated results [13, 68]. As shown in Fig. 4 a, b, e, WebRPG-AR demonstrates impressive visual quality in detail.

Table 1: Quantitative comparison of WebRPG baselines, with bold figures for best results. “*” marks results on the randomly selected 10% subset of the test set.

	Overall	Layout		Style
Model	FID $\downarrow$	FID_layout $\downarrow$	Ele. IoU $\uparrow$	FID_style $\downarrow$	SC Score $\uparrow$
WebRPG-AR	0.1281	0.1520	0.7082	0.2124	0.9474
WebRPG-DM	62.021	60.942	0.0357	106.95	0.3671
WebRPG-AR*	0.1324	0.2877	0.7069	0.1359	0.9485
WebRPG-DM*	61.135	60.870	0.0356	105.86	0.3649
GPT-4*	4.2141	47.732	0.0347	8.8898	0.5515
StarCoder2-7b*	11.899	51.432	0.0309	18.186	0.3639
DeepSeek-Coder-6.7b*	5.8219	55.744	0.0330	7.4542	0.3949
CodeLlama-13b*	9.2826	55.427	0.0278	11.625	0.3864
Real Web Page	0.0027	0.0015	1.0000	0.0074	1.0000
Plain HTML	8.5342	52.438	0.0354	8.4951	0.3668

Table 2: Ablation study based on WebRPG-AR. Best results in bold. “$\mathcal{Z}_{mask}$” is detailed in Sec. 5.4. “H.E.” stands for HTML embedding. “S.”, “H.”, and “C.” stand for semantic, hierarchical, and character count embeddings.

			H. E.			Overall	Layout		Style
#	VAE	$\mathcal{Z}_{mask}$	S.	H.	C.	FID $\downarrow$	FID_layout $\downarrow$	Ele. IoU $\uparrow$	FID_style $\downarrow$	SC Score $\uparrow$
1		✓	✓	✓	✓	0.9702	5.4668	0.5954	15.923	0.8053
2	✓		✓	✓	✓	0.1487	0.2055	0.6462	0.2944	0.9332
3	✓	✓		✓	✓	0.1797	0.2096	0.6620	0.3055	0.9323
4	✓	✓	✓		✓	0.3003	0.3770	0.6345	1.9048	0.8982
5	✓	✓	✓	✓		0.1575	0.3152	0.6769	0.3065	0.9434
6	✓	✓	✓	✓	✓	0.1281	0.1520	0.7082	0.2124	0.9474

The performance of WebRPG-DM is suboptimal across all metrics. It only tends to produce standard web visual presentations in simpler cases, as illustrated in Fig. 4 e, such as bolding prices, adding background color to buttons, and aligning a few elements. This implies that diffusion models may be inappropriate for this task. There are two plausible explanations: First, unlike images and videos in Euclidean space, web elements are non-Euclidean due to their hierarchical arrangement, while diffusion models are confined to Euclidean space [33]. Second, the WebRPG task demands meticulous adjustments and detailed control for realism, a limitation of diffusion models [43].

GPT-4’s performance on the WebRPG task surpasses that of WebRPG-DM but falls short of WebRPG-AR. Open-source LLMs underperform compared to GPT-4. As illustrated in Fig. 4 a, b, e, GPT-4 can effectively handle element styles, such as adding background colors to buttons and applying distinct colors for prices. However, the performance of GPT-4 in layout is limited. As demonstrated in Fig. 4 a-c, GPT-4 tends to generate simplistic vertical arrangements when faced with complex HTML structures. With regular HTML, as depicted in Fig. 4 e, GPT-4 achieves a layout that is similar to the real page. Therefore, we conclude that GPT-4 demonstrates basic capability in WebRPG tasks with regular HTML, but its performance with complex HTML is less effective. Additionally, we notice that LLMs do not generate RPs for all elements, causing many elements to fall back to the browser’s default CSS and resulting in performance similar to plain HTML.

It is worth noting that WebRPG-AR exhibits the ability to render diverse web pages. For example, Fig. 4 d shows WebRPG-AR’s creation of a page with a vertical layout (originally horizontal), preserving the pattern and order consistency across four groups. This finding suggests that the model successfully learns web design knowledge and applies it effectively to render web pages from HTML code. Further cases are available in Sec. B.1.

Furthermore, we calculate the FID on screenshots of rendered web pages, following conventional image generation practices [55, 57]. The results, shown in Sec. B.2, are consistent with Tab. 1. Additionally, we conduct a human evaluation, detailed in Sec. B.4, with results that also align with Tab. 1.

6.4 Ablation Study

We conduct a series of ablation experiments based on WebRPG-AR, as shown in Tab. 2. #1 uses a one-dimensional flat input instead of VAE. #2 removes $\mathcal{Z}_{mask}$ (in Sec. 5.4). #3 and #5 respectively remove the corresponding embedding layers, while #4 substitutes hierarchical embedding with one-dimensional positional embedding. Additionally, we visualize some cases from #3, #4, and #5 in Fig. 5. All models are trained to convergence following the settings in Sec. 6.2.

The results of #1 demonstrate the effectiveness of using VAE for rendering parameters compression. Although #2 is comparable to #6, the incorporation of $\mathcal{Z}_{mask}$ enhances the model stability during training. The results of #3, #4, and #5 reveal that all three embeddings play critical roles in web design. Hierarchical embedding helps layout arrangement significantly. The simplification to 1D positional embedding leads to a disorganized layout, as illustrated in Fig. 5 c. Semantic embedding gives the model the capacity to perceive semantic relationships. For example, as Fig. 5 b shows, the model struggles to horizontally align elements like “select a color” and “sunshine,” suggesting challenges in identifying key-value pairs without semantic information. Character count embedding helps to predict appropriate element sizes for full content display, as in Fig. 5 d, where a narrow “price” and “sunshine” width leads to incomplete text display.

6.5 Discussion on Failure Cases

To investigate the boundaries of the model’s capabilities, we analyze several failure cases generated by WebRPG-AR. The left side of Fig. 6 reveals that both layout (Ele. IoU) and style (SC Score) metrics decrease with an increase in the number of elements or the average depth of elements within the DOM tree. This trend may be attributed to two factors: the inherent complexity of a page increases with more elements or greater depth, and the training set lacks web pages with a large number of elements or significant depths (details in Sec. A.3). Regarding error types, layout issues mainly include misalignments and overlaps, as shown in Fig. 6 a and b. For style, the model struggles to recognize web page elements with identical semantic functions, such as the “Add to Cart” buttons illustrated in Fig. 6 c, which should appear identical. Moreover, we observe two primary error scenarios: elements positioned at the end of the HTML code tend to be more error-prone, as seen with the element in the bottom right corner of Fig. 6 b, likely due to the characteristics of the autoregressive model [12]; additionally, pages with large-scale images pose challenges, as shown in Fig. 6 a, since the model does not take the original images as input. The discussion above highlights the need for further research.

6.6 Discussion on the Integration of LLM and WebRPG Model

Recently, LLMs have made it possible to generate HTML code automatically [39]. Consequently, we hypothesize that integrating an LLM with a WebRPG model could facilitate a fully automated web development workflow. We employ GPT-4 [70, 51] to validate this hypothesis. As Fig. 7 illustrates, WebRPG-AR effectively creates visual presentations of web pages based on the generated HTML, demonstrating the potential of a fully automated web development workflow through the integration of LLM and WebRPG. Additional cases and the prompt for automatically generating HTML are provided in Sec. B.3.

7 Conclusion and Limitations

This paper presents WebRPG, a task that automates web design by generating rendering parameters for web elements from HTML. We introduce a new dataset, two baseline models, and evaluation metrics. Results show the autoregressive baseline most effectively generates web visual presentations.

Nevertheless, this study has limitations that warrant further investigation in future research. The proposed model can undergo fine-tuning to support design tasks such as partial web page design by masking specific elements. Additionally, it can be adapted to analyze raster images by replacing <img> tokens with image embeddings. The employment of established CSS frameworks like Tailwind (https://tailwindcss.com/) could standardize CSS, thereby potentially simplifying the WebRPG task. However, sourcing web pages based on these frameworks presents challenges. Furthermore, design options and control mechanisms of the results are worth exploring. Future research will address these aspects.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 62372408) and the National Key R&D Program of China (No. 2021YFB2701100).

References

[1] Alemerien, K., Magel, K.: Guievaluator: A metric-tool for evaluating the complexity of graphical user interfaces. In: SEKE. pp. 13–18 (2014)
[2] Azadi, S., Fisher, M., Kim, V.G., Wang, Z., Shechtman, E., Darrell, T.: Multi-content gan for few-shot font style transfer. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7564–7573 (2018)
[3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
[4] Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: Maskgit: Masked generative image transformer. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 11305–11315 (2022)
[5] Chen, L., Chen, X., Zhao, Z., Zhang, D., Ji, J., Luo, A., Xiong, Y., Yu, K.: Websrc: A dataset for web-based structural reading comprehension. In: Conference on Empirical Methods in Natural Language Processing (2021)
[6] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Chan, W.: Wavegrad: Estimating gradients for waveform generation. In: International Conference on Learning Representations (2021)
[7] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Dehak, N., Chan, W.: Wavegrad 2: Iterative refinement for text-to-speech synthesis. arXiv preprint arXiv:2106.09660 (2021)
[8] Cheng, C.Y., Huang, F., Li, G., Li, Y.: Play: parametrically conditioned layout generation using latent diffusion. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23, JMLR.org (2023)
[9] Cyr, D., Head, M., Larios, H.: Colour appeal in website design within and across cultures: A multi-method evaluation. International journal of human-computer studies 68(1-2), 1–21 (2010)
[10] Deng, X., Shiralkar, P., Lockard, C., Huang, B., Sun, H.: Dom-lm: Learning generalizable representations for html documents. arXiv preprint arXiv:2201.10608 (2022)
[11] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
[12] Dong, Z., Tang, T., Li, L., Zhao, W.X.: A survey on long text modeling with transformers. arXiv preprint arXiv:2302.14502 (2023)
[13] Du, Y., Chen, Z., Jia, C., Yin, X., Li, C., Du, Y., Jiang, Y.G.: Context perception parallel decoder for scene text recognition. arXiv preprint arXiv:2307.12270 (2023)
[14] Flavian, C., Gurrea, R., Orus, C.: Web design: a key factor for the website success. Journal of Systems and Information Technology 11(2), 168–184 (2009)
[15] Fu, F., Chiu, S.Y., Su, C.H.: Measuring the screen complexity of web pages. In: Human Interface and the Management of Information. Interacting in Information Environments: Symposium on Human Interface 2007, Held as Part of HCI International 2007, Beijing, China, July 22-27, 2007, Proceedings, Part II. pp. 720–729. Springer (2007)
[16] Furht, B. (ed.): Cascading Style Sheets, pp. 58–58. Springer US, Boston, MA (2008)
[17] Gu, Z., Lou, J.: Data driven webpage color design. Computer-Aided Design 77, 46–59 (2016)
[18] Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y., et al.: Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196 (2024)
[19] Herzig, R., Bar, A., Xu, H., Chechik, G., Darrell, T., Globerson, A.: Learning canonical representations for scene graph to image generation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 210–227. Springer International Publishing, Cham (2020)
[20] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)
[21] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
[22] Hotti, A., Risuleo, R.S., Magureanu, S., Moradi, A., Lagergren, J.: The klarna product page dataset: A realistic benchmark for web representation learning. arXiv preprint arXiv:2111.02168 (2021)
[23] Hui, M., Zhang, Z., Zhang, X., Xie, W., Wang, Y., Lu, Y.: Unifying layout generation with a decoupled diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1942–1951 (2023)
[24] Inoue, N., Kikuchi, K., Simo-Serra, E., Otani, M., Yamaguchi, K.: Layoutdm: Discrete diffusion model for controllable layout generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10167–10176 (2023)
[25] Jyothi, A.A., Durand, T., He, J., Sigal, L., Mori, G.: Layoutvae: Stochastic scene layout generation from a label set. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9895–9904 (2019)
[26] Kaluarachchi, T., Wickramasinghe, M.: A systematic literature review on automatic website generation. Journal of Computer Languages p. 101202 (2023)
[27] Kikuchi, K., Inoue, N., Otani, M., Simo-Serra, E., Yamaguchi, K.: Generative colorization of structured mobile web pages. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3650–3659 (2023)
[28] Kikuchi, K., Otani, M., Yamaguchi, K., Simo-Serra, E.: Modeling visual containment for web page layout optimization. In: Computer Graphics Forum. vol. 40, pp. 33–44. Wiley Online Library (2021)
[29] Kikuchi, K., Simo-Serra, E., Otani, M., Yamaguchi, K.: Constrained graphic layout generation via latent optimization. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 88–96 (2021)
[30] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
[31] Koffka, K.: Principles of gestalt psychology (1955)
[32] Kong, Z., Ping, W., Huang, J., Zhao, K., Catanzaro, B.: Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761 (2020)
[33] Koo, H.: A survey on generative diffusion models for structured data. arXiv preprint arXiv:2306.04139 (2023)
[34] Kumar, V., Dhar, M., Khattar, D., Lal, Y.K., Mishra, A., Shrivastava, M., Varma, V.: Swde: A sub-word and document embedding based engine for clickbait detection. arXiv preprint arXiv:1808.00957 (2018)
[35] Lee, H.Y., Jiang, L., Essa, I., Le, P.B., Gong, H., Yang, M.H., Yang, W.: Neural design network: Graphic layout generation with constraints. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. pp. 491–506. Springer (2020)
[36] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., rahman Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L.: Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Annual Meeting of the Association for Computational Linguistics (2019)
[37] Li, C., Zhang, P., Wang, C.: Harmonious textual layout generation over natural images via deep aesthetics learning. IEEE Transactions on Multimedia 24, 3416–3428 (2022)
[38] Li, G., et al.: Learning to denoise raw mobile ui layouts for improving datasets at scale. pp. 1–13 (2022)
[39] Li, J., Li, G., Li, Y., Jin, Z.: Enabling programming thinking in large language models toward code generation. arXiv preprint arXiv:2305.06599 (2023)
[40] Li, J., Yang, J., Zhang, J., Liu, C., Wang, C., Xu, T.: Attribute-conditioned layout gan for automatic graphic design. IEEE Transactions on Visualization and Computer Graphics 27(10), 4039–4048 (2020)
[41] Li, J., Xu, Y., Cui, L., Wei, F.: Markuplm: Pre-training of text and markup language for visually rich document understanding. In: Annual Meeting of the Association for Computational Linguistics (2021)
[42] Li, X., Thickstun, J., Gulrajani, I., Liang, P.S., Hashimoto, T.B.: Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems 35, 4328–4343 (2022)
[43] Li, X., Thickstun, J., Gulrajani, I., Liang, P.S., Hashimoto, T.B.: Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems 35, 4328–4343 (2022)
[44] Liu, T.F., et al.: Learning design semantics for mobile apps. In: UIST. pp. 569–579 (2018)
[45] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2017)
[46] Lozhkov, A., Li, R., Allal, L.B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., et al.: Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173 (2024)
[47] Network, M.D.: Computed value - css: Cascading style sheets (2023)
[48] Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning. pp. 8162–8171. PMLR (2021)
[49] O’Donovan, P., Agarwala, A., Hertzmann, A.: Designscape: Design with interactive layout suggestions. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems. pp. 1221–1224 (2015)
[50] van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017)
[51] OpenAI: Gpt-4 technical report (2023)
[52] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
[53] O’Donovan, P., Agarwala, A., Hertzmann, A.: Learning layouts for single-page graphic designs. IEEE transactions on visualization and computer graphics 20(8), 1200–1213 (2014)
[54] Qiu, Q., Otani, M., Iwazaki, Y.: An intelligent color recommendation tool for landing page design. In: 27th International Conference on Intelligent User Interfaces. pp. 26–29 (2022)
[55] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)
[56] Razavi, A., van den Oord, A., Vinyals, O.: Generating diverse high-fidelity images with vq-vae-2. In: Neural Information Processing Systems (2019)
[57] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
[58] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
[59] Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D.J., Norouzi, M.: Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(4), 4713–4726 (2022)
[60] Shao, Z., Gao, F., Qi, Z., Xing, H., Bu, J., Yu, Z., Zheng, Q., Liu, X.: Gem: Gestalt enhanced markup language model for web understanding via render tree. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 6132–6145 (2023)
[61] Thorlacius, L.: The role of aesthetics in web design. Nordicom Review 28 (05 2007)
[62] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
[63] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
[64] Wang, P.: The influence of artificial intelligence on visual elements of web page design under machine vision. Computational Intelligence and Neuroscience 2022 (2022)
[65] Wertheimer, M.: Gestalt theory. (1938)
[66] Williams, R.: The non-designer’s design book: Design and typographic principles for the visual novice. Pearson Education (2015)
[67] Xiang, P., Yang, X., Shi, Y.: Web page segmentation based on gestalt theory. In: 2007 IEEE International Conference on Multimedia and Expo. pp. 2253–2256 (2007)
[68] Xiao, Y., Wu, L., Guo, J., Li, J., Zhang, M., Qin, T., Liu, T.Y.: A survey on non-autoregressive generation for neural machine translation and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 11407–11427 (2022)
[69] Xie, C., Huang, W., Liang, J., Huang, C., Xiao, Y.: Webke: Knowledge extraction from semi-structured web with pre-trained markup language model. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (2021)
[70] Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.C., Liu, Z., Wang, L.: The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421 (2023)
[71] Zhao, N., Cao, Y., Lau, R.W.: Modeling fonts in context: Font prediction on web designs. In: Computer Graphics Forum. vol. 37, pp. 385–395. Wiley Online Library (2018)
[72] Zhao, Z., Chen, L., Cao, R., Xu, H., Chen, X., Yu, K.: Tie: Topological information enhanced structural reading comprehension on web pages. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1808–1821 (2022)
[73] Zheng, X., Qiao, X., Cao, Y., Lau, R.W.: Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG) 38(4), 1–15 (2019)
[74] Zhou, M., Xu, C., Ma, Y., Ge, T., Jiang, Y., Xu, W.: Composition-aware graphic layout GAN for visual-textual presentation designs. In: Raedt, L.D. (ed.) Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022. pp. 4995–5001. ijcai.org (2022)
[75] Zhu, Y., Wu, Y., Olszewski, K., Ren, J., Tulyakov, S., Yan, Y.: Discrete contrastive diffusion for cross-modal and conditional generation. arXiv preprint arXiv:2206.07771 (2022)

Supplementary Material

A Additional Details

A.1 Details of Rendering Parameters

As described in Sec. 3.1, we utilize rendering parameters to standardize CSS due to its code complexity. The examples in Fig. 8 demonstrate this complexity. As shown on the left side of Fig. 8, CSS can be utilized in different forms (https://www.w3schools.com/css/css_howto.asp): Inline Styles for direct HTML element styling via the “style” attribute; Internal Style Sheets using “<style>” tags within HTML documents; and External Style Sheets linking to CSS files externally. The middle of Fig. 8 showcases various CSS selectors (https://www.w3schools.com/css/css_selectors.asp), including simple tag, class, and ID selectors, as well as complex attribute and descendant selectors. Furthermore, CSS follows certain rules regarding inheritance and overrides (https://developer.mozilla.org/en-US/docs/Web/CSS/Inheritance). An example on the right side of Fig. 8 shows how the .highlight class’s red color is overridden by the more specific ID selector #main-content p, turning the color green.
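To make the override in this example concrete, the following minimal Python sketch computes a simplified CSS specificity triple (ID selectors, class and attribute selectors, type selectors) and compares the two selectors mentioned above. The function name `specificity` is hypothetical and not part of the WebRPG code; the sketch assumes only simple selectors joined by descendant combinators and ignores pseudo-classes, pseudo-elements, inline styles, and `!important`.

```python
import re
from typing import Tuple

def specificity(selector: str) -> Tuple[int, int, int]:
    """Return a simplified (ids, classes + attributes, types) specificity triple."""
    # Count attribute selectors first and strip them, so that characters
    # inside the brackets are not mistaken for IDs or classes.
    attrs = re.findall(r"\[[^\]]*\]", selector)
    stripped = re.sub(r"\[[^\]]*\]", " ", selector)
    ids = re.findall(r"#[\w-]+", stripped)
    classes = re.findall(r"\.[\w-]+", stripped)
    # Whatever names remain after removing IDs and classes are type selectors.
    rest = re.sub(r"#[\w-]+|\.[\w-]+", " ", stripped)
    types = re.findall(r"[A-Za-z][\w-]*", rest)
    return (len(ids), len(classes) + len(attrs), len(types))

# Tuples compare lexicographically, which matches how specificity is ranked.
print(specificity("#main-content p"))  # (1, 0, 1)
print(specificity(".highlight"))       # (0, 1, 0)
print(specificity("#main-content p") > specificity(".highlight"))  # True
```

Because a single ID selector outranks any number of class selectors, the rule under `#main-content p` wins regardless of source order, which is why the text ends up green.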

The complexity of CSS makes direct generation of CSS impractical. Even parsing CSS code to obtain WebRPG task labels is challenging. Since browsers compute the final applied CSS property values (i.e., rendering parameters) for each element based on HTML and CSS to render web pages, we propose extracting each element’s RPs directly from the browser, as described in Sec. 3.2. This approach bypasses the need to parse CSS code, achieving the standardization of CSS.

Table 3: The complete vocabulary of rendering parameters including all categories, their index ranges, and selected examples.

Category	Index Range	Examples
Integer Pixel	0-1920	1px, 1052px, 1920px
Color	1921-1966	RGBA(153, 204, 0, 1), RGBA(255, 255, 255, 1)
Font Style	1967-1969	italic, oblique
Font Weight	1970-1978	100, 500, 900
Line Height	1979	normal
Text Align	1980-1985	start, center, end
Text Decoration	1986-1987	none, underline
Text Transform	1988-1991	uppercase, capitalize
PAD	1992	PAD

Table 4: Index ranges for each rendering parameter in the vocabulary.

Rendering Parameter	Index Range
left	0-1920
top	0-1920
width	0-1920
height	0-1920
font-style	1967-1969
font-weight	1970-1978
font-size	0-32
line-height	0-50, 1979
text-align	1980-1985
text-decoration	1986-1987
text-transform	1988-1991
color	1921-1966
background-color	1921-1966

In practice, we follow the pre-order traversal order of the DOM tree to assign a unique ID to each element, achieved by modifying the class name, as shown on the left side of Fig. 9. We organize the rendering parameters using JSON, where the key is the element’s ID, as illustrated in the middle of Fig. 9. RPs can also be transformed into CSS, utilizing class selectors only, as demonstrated on the right side of Fig. 9.
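The conversion from JSON-organized RPs to class-selector-only CSS can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: the element IDs `e0` and `e1`, the function name `rps_to_css`, and the exact JSON layout are hypothetical stand-ins for what Fig. 9 shows.

```python
import json

def rps_to_css(rps: dict) -> str:
    """Serialize {element_id: {property: value}} into CSS using class selectors only."""
    rules = []
    for element_id, params in rps.items():
        declarations = "".join(
            f"  {prop}: {value};\n" for prop, value in params.items()
        )
        rules.append(f".{element_id} {{\n{declarations}}}")
    return "\n\n".join(rules)

# Hypothetical IDs, assumed to follow the pre-order traversal of the DOM tree.
example = json.loads(
    '{"e0": {"left": "0px", "width": "1920px"},'
    ' "e1": {"color": "rgba(153, 204, 0, 1)"}}'
)
print(rps_to_css(example))
```

Since every element carries a unique class, one flat rule per element suffices and no inheritance or override resolution is needed.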

Additionally, the complete vocabulary of all rendering parameters is detailed in Tab. 3, and index ranges of each rendering parameter are presented in Tab. 4.
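As a compact restatement of Tab. 4, the sketch below encodes the index ranges as a Python lookup and checks whether a vocabulary index is admissible for a given rendering parameter. The ranges are transcribed from Tab. 3 and Tab. 4; the names `ALLOWED`, `PAD_INDEX`, and `is_valid_index` are hypothetical, and whether the baselines apply such a check (for example, to restrict sampling) is an assumption not stated here.

```python
# Inclusive vocabulary index ranges per rendering parameter (from Tab. 4).
ALLOWED = {
    "left": [(0, 1920)],
    "top": [(0, 1920)],
    "width": [(0, 1920)],
    "height": [(0, 1920)],
    "font-style": [(1967, 1969)],
    "font-weight": [(1970, 1978)],
    "font-size": [(0, 32)],
    "line-height": [(0, 50), (1979, 1979)],  # pixel values or "normal"
    "text-align": [(1980, 1985)],
    "text-decoration": [(1986, 1987)],
    "text-transform": [(1988, 1991)],
    "color": [(1921, 1966)],
    "background-color": [(1921, 1966)],
}
PAD_INDEX = 1992  # the PAD token from Tab. 3

def is_valid_index(param: str, index: int) -> bool:
    """Return True if `index` lies in one of the ranges allowed for `param`."""
    return any(low <= index <= high for low, high in ALLOWED[param])

print(is_valid_index("line-height", 1979))  # True
print(is_valid_index("font-size", 33))      # False
```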

A.2 Details of Visual Complexity Metric

The Visual Complexity (VC) metric integrates three dimensions: color, size, and alignment. For any given web page, the three dimensions are defined as follows:

Color: The color metric measures the richness of colors and is defined as:

$$VC_{color}=\frac{1}{2N}\left(C_{c}+C_{bg}-2\right), \tag{7}$$

where $N$ is the number of elements, and $C_{c}$ and $C_{bg}$ are the counts of unique color and background-color attributes respectively.

Size: The size metric measures the diversity of sizes among web page elements. In particular, it calculates the size diversity for all $N^{\prime}$ parent elements and then computes the average. The formula is as follows:

$$VC_{size}=\frac{1}{N^{\prime}}\sum_{i=1}^{N^{\prime}}\left(\frac{DS_{i}-1}{NC_{i}}\right), \tag{8}$$

with $NC_{i}$ and $DS_{i}$ being the count of child elements and their distinct sizes for element $i$, respectively.

Alignment: The complexity of a web page inversely correlates with the number of pairwise alignments [15]. To simplify, this metric applies only to leaf nodes. The calculation formula is as follows:

$$VC_{alg}=1-\frac{1}{N_{leaf}(N_{leaf}-1)}\sum_{j=1}^{N_{leaf}}\sum_{i\neq j}^{N_{leaf}}ALG_{ij}, \tag{9}$$

where $N_{leaf}$ denotes the number of leaf node elements, and $ALG_{ij}$ is a binary indicator of alignment (1) or misalignment (0) between elements $i$ and $j$.

The overall VC is the sum of the three metrics: $VC=VC_{color}+VC_{alg}+VC_{size}$.
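The three components can be computed together as in the Python sketch below, which follows Eqs. (7) to (9) term by term. Several details are assumptions made for illustration and are not specified above: the element schema (`id`, `parent`, `color`, `background_color`, `left`, `top`, `width`, `height`), treating a "size" as a (width, height) pair, counting two leaves as aligned when they share a left or top coordinate, and returning 0 for the alignment term when fewer than two leaves exist.

```python
from collections import defaultdict

def visual_complexity(elements):
    """Compute the VC components for one page (hypothetical element schema)."""
    n = len(elements)

    # Eq. (7): unique text colors and background colors, normalized by 2N.
    c_c = len({e["color"] for e in elements})
    c_bg = len({e["background_color"] for e in elements})
    vc_color = (c_c + c_bg - 2) / (2 * n)

    # Eq. (8): size diversity of the children, averaged over parent elements.
    children = defaultdict(list)
    for e in elements:
        if e["parent"] is not None:
            children[e["parent"]].append(e)
    per_parent = [
        (len({(c["width"], c["height"]) for c in kids}) - 1) / len(kids)
        for kids in children.values()
    ]
    vc_size = sum(per_parent) / len(per_parent) if per_parent else 0.0

    # Eq. (9): one minus the fraction of ordered leaf pairs that are aligned.
    leaves = [e for e in elements if e["id"] not in children]
    m = len(leaves)
    if m < 2:
        vc_alg = 0.0  # assumption: Eq. (9) is undefined for fewer than two leaves
    else:
        aligned = sum(
            1
            for a in leaves
            for b in leaves
            if a is not b and (a["left"] == b["left"] or a["top"] == b["top"])
        )
        vc_alg = 1 - aligned / (m * (m - 1))

    return {"color": vc_color, "size": vc_size, "alg": vc_alg,
            "total": vc_color + vc_alg + vc_size}

page = [
    {"id": 0, "parent": None, "color": "black", "background_color": "white",
     "left": 0, "top": 0, "width": 300, "height": 60},
    {"id": 1, "parent": 0, "color": "black", "background_color": "white",
     "left": 0, "top": 0, "width": 100, "height": 20},
    {"id": 2, "parent": 0, "color": "red", "background_color": "white",
     "left": 50, "top": 30, "width": 200, "height": 20},
]
print(visual_complexity(page))
```

For this toy page the color term is 1/6 (two text colors, one background color, three elements), the size term is 1/2 (two distinct sizes among two children), and the alignment term is 1 (the two leaves share neither edge).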

A.3 Dataset Details

The distribution of Visual Complexity (Sec. A.2) values across all samples is illustrated in Fig. 10. In our dataset, samples with a VC value of less than 0.1 are filtered out, resulting in a remaining subset where the VC distribution is relatively concentrated and approximates a normal distribution, thereby helping to mitigate the impact of extreme samples on training. Additionally, to further investigate our dataset, we visualize two crucial statistical values, element count and the average depth of elements, in Fig. 11. This visualization indicates that the dataset lacks samples containing a large number of elements or considerable element depths.

A.4 Implementation Details of FID Model

As described in Sec. 6.1.1, the FID model is a binary classifier comprising the VAE described in Sec. 5.2, four transformer layers, and a classification head. A special CLS vector, representing all RPs, is used as the classification feature. The rest of the input is the same as for the model in Sec. 5.4. Three kinds of noise are designed to corrupt the real data: perturbing the original values with a fixed variance, randomly substituting elements with synthetic ones, and randomly swapping elements. The specific FID models for layout and style, namely FID_layout and FID_style, are trained by masking irrelevant inputs. Specifically, FID_layout processes only the layout, masking the style, and FID_style processes only the style, masking the layout. The FID models for overall, layout, and style achieve classification accuracies of 88.8%, 95.5%, and 92.4%, respectively.
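The three corruption strategies can be sketched in Python as below. This is only an illustration of the idea: the representation of a page as a list of numeric RP vectors, the Gaussian form of the perturbation, the uniform sampling of synthetic elements, and all function names and default values are assumptions, not the settings used to train the FID models.

```python
import random

def perturb(elements, std, rng):
    """Add zero-mean Gaussian noise with a fixed standard deviation to every value."""
    return [[v + rng.gauss(0.0, std) for v in e] for e in elements]

def substitute(elements, prob, rng, low=0.0, high=1920.0):
    """With probability `prob`, replace an element by a uniformly sampled synthetic one."""
    return [
        [rng.uniform(low, high) for _ in e] if rng.random() < prob else list(e)
        for e in elements
    ]

def swap(elements, rng):
    """Swap two randomly chosen elements, leaving the rest in place."""
    out = [list(e) for e in elements]
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

rng = random.Random(0)
real = [[0.0, 0.0, 100.0, 20.0], [0.0, 30.0, 100.0, 20.0]]  # toy (left, top, width, height)
fake = swap(substitute(perturb(real, 5.0, rng), 0.5, rng), rng)
print(len(fake), len(fake[0]))  # 2 4
```

Corrupted samples would be labeled as negatives and the untouched pages as positives when training the binary classifier.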

Table 5: The prompt template for GPT-4 experiment in Sec. 6.3.

Prompt	You are an exceptional web designer. Please create the corresponding CSS code based on the HTML code I have provided, so as to craft a well-designed visual presentation for the web page. You can only use the following CSS properties: "left", "top", "width", "height", "font-style", "font-weight", "font-size", "line-height", "color", "text-align", "text-decoration", "text-transform", "background-color". Please exercise caution in controlling the size of the image, as using the original image dimensions directly may result in excessive spatial occupation. Here are several demonstrations:{Demonstrates}. Below is the HTML code and do not reply with anything other than CSS code: {HTML_Code} .
Slots	Demonstrates	The HTML-CSS pairs for three selected web page segments.
	HTML_Code	HTML code of given web page.

A.5 Implementation Details of WebRPG Baselines

The backbone of WebRPG-AR consists of 6-layer transformers for both the encoder and the decoder, and WebRPG-DM is a 12-layer U-ViT. The mask scheduling function $\gamma(r)$ is a cosine function, the number of diffusion time steps $T$ follows [21] and is set to 1000, and $\lambda_{KL}$ is set to 1e-6. For optimization, AdamW [45] is used with a learning rate of 1.2e-4, $\beta_{1}$ of 0.9, and $\beta_{2}$ of 0.99.
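A cosine mask schedule can be sketched as follows, assuming the form $\gamma(r)=\cos(\pi r/2)$ used in MaskGIT [4]; the exact functional form used by the baselines is not stated here, so this choice and the helper names `gamma` and `num_masked` are assumptions.

```python
import math

def gamma(r: float) -> float:
    """Cosine mask ratio for r in [0, 1): 1 at r = 0, decreasing toward 0 as r approaches 1."""
    return math.cos(math.pi * r / 2)

def num_masked(r: float, n_tokens: int) -> int:
    """Number of tokens to mask at ratio r, rounded up as in MaskGIT."""
    return math.ceil(gamma(r) * n_tokens)

for r in (0.0, 0.25, 0.5, 0.75):
    print(r, round(gamma(r), 4), num_masked(r, 100))
```

The schedule is concave, so most tokens stay masked for small $r$ and the mask ratio drops quickly only as $r$ approaches 1.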

The prompt template for the LLM experiment in Sec. 6.3 is detailed in Tab. 5. Because the textual representation of each element’s RPs is very long, as shown on the right side of Fig. 9, we opt to have LLMs directly generate the CSS code. The specific steps for conducting the LLM experiment are:

1. Use the prompt to generate CSS code via LLMs.
2. Use a browser to render the web page with the given HTML and the CSS code generated by LLMs.
3. Extract the RPs for all elements, employing the method in Sec. 4.1.
4. Evaluate these RPs using the metrics in Sec. 6.1.

B Additional Results

B.1 Additional Cases of Baseline-Generated Results

We present additional results from the WebRPG baselines in Fig. 12. The performance of all baselines in these cases is comparable to that reported in Sec. 6.3. Additionally, Fig. 13 displays web page variants generated by WebRPG-AR from the same HTML, each produced by a separate inference run. The differences in layout and style among these variants indicate that WebRPG-AR can generate diverse web pages while maintaining semantic coherence.

Table 6: FID on rendered web page screenshots.

	WebRPG-AR	GPT-4	WebRPG-DM	Real Web Page
FID_Screenshot	3.2102	15.515	33.040	1.1156

B.2 The FID on Screenshots of Rendered Web Pages

The FID on screenshots of rendered web pages is shown in Tab. 6.
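For reference, the standard FID [20] is the Fréchet distance between Gaussian fits to feature embeddings of real and generated images, so lower values indicate closer distributions. In the expression below, $\mu_{r},\Sigma_{r}$ and $\mu_{g},\Sigma_{g}$ denote the feature mean and covariance of the real and generated screenshots; these symbols are generic, and the feature extractor used for the screenshots is not specified in this section.

$$\mathrm{FID}=\lVert\mu_{r}-\mu_{g}\rVert_{2}^{2}+\mathrm{Tr}\left(\Sigma_{r}+\Sigma_{g}-2\left(\Sigma_{r}\Sigma_{g}\right)^{1/2}\right)$$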

B.3 Further Cases of Integrating LLM with WebRPG Model

Fig. 14 showcases more cases of WebRPG-AR creating visual presentations of web pages based on HTML code generated by GPT-4. The prompt template for automatically generating HTML is in Tab. 7. The prompt encompasses human-authored descriptions of web design ideas, with an example shown in Tab. 8.

Table 7: The prompt template for automatically generating HTML.

Prompt	You are a web developer. Please generate the HTML code for a web page with a caption of {Design_Idea}.
Slot	Design_Idea	Human-authored descriptions of web design ideas, with an example shown in Tab. 8.

B.4 Human Evaluation

We conduct a human evaluation using pairwise comparisons. We randomly select 100 test samples and generate visual presentations using WebRPG-AR, WebRPG-DM, and GPT-4. Five human annotators evaluate each pair to determine the superior presentation or if there is a tie. The results, shown in Fig. 15, align with the objective evaluations in Tab. 1.

Table 8: An example of web design ideas described by humans.

This web page showcases the “Rumble Band for 38mm Apple Watch,” offered at $19.99. It’s identified as the X-Doria Rumble Band and is noted for its compatibility with the 38mm Apple Watch Series 1, 2, 3, and Nike Edition. Highlighted on the page are customer assurances including a lifetime warranty, complimentary shipping on all orders, and a 30-day hassle-free return policy. A conspicuous “Add to Cart” button is prominently displayed. The product’s image is designed to highlight its appearance and design features.

C An Example Explanation of SC Score

Fig. 16 provides an example to explain the SC Score further. The elements representing price (marked with a green box, hereafter termed as price elements) on the real web page $W$ and on generated web page 1 $\hat{W_{1}}$ have differing styles in terms of font color and size. However, these differences do not affect the perception of price elements, as their style remains consistent within each individual web page. In contrast, the generated web page 2 $\hat{W_{2}}$ changes just one price element, which leads to confusion when perceiving the price elements. Although $\hat{W_{2}}$ seems more visually similar to $W$ because of only one differing element, from a semantic perspective, $\hat{W_{1}}$ is more coherent. Therefore, the SC Score evaluates whether elements that share a style on the real web page maintain that consistency on the generated page, beyond just visual similarity. Additionally, Fig. 17 provides a visualization of the style consistency subset for a real web page.
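The intuition in this example can be captured with a simplified proxy, sketched below in Python. It is not the SC Score formula itself, which is defined in the main text; it only measures the fraction of style-consistency groups from the real page whose members still share one style on the generated page. The function name, the group representation, and the encoding of a style as a (color, font-size) tuple are all assumptions.

```python
def style_consistency(groups, generated_styles):
    """Fraction of groups whose elements share a single style on the generated page.

    groups: lists of element IDs that share one style on the real page.
    generated_styles: {element_id: hashable style} for the generated page.
    """
    if not groups:
        return 0.0
    consistent = sum(
        1 for group in groups
        if len({generated_styles[e] for e in group}) == 1
    )
    return consistent / len(groups)

price_group = [["p1", "p2", "p3"]]
# Generated page 1: all prices restyled, but identically.
w1 = {"p1": ("red", 14), "p2": ("red", 14), "p3": ("red", 14)}
# Generated page 2: only one price differs from the others.
w2 = {"p1": ("black", 12), "p2": ("black", 12), "p3": ("red", 14)}
print(style_consistency(price_group, w1))  # 1.0
print(style_consistency(price_group, w2))  # 0.0
```

Under this proxy, generated page 1 scores higher than generated page 2 even though page 2 differs from the real page in fewer elements, mirroring the argument above.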

D Further Discussion on the Performance of LLM in WebRPG Task

As described in Sec. 6.2, we employ GPT-4 as a representative for LLMs. Due to the complexity of CSS code practices and the noise in actual web pages, directly fine-tuning LLMs is not feasible. Consequently, we do not conduct fine-tuning experiments. Moreover, to further explore the performance of GPT-4 in the WebRPG task, we conduct two qualitative experiments. Tab. 9 details the prompt templates. The first experiment takes as input the HTML and the captions of the original web page screenshots. The second experiment takes the HTML, these captions, and the screenshots themselves. Notably, this additional data comprises visual information from the original web pages and thus essentially serves as a form of ground truth. The second experiment and the generation of web page screenshot captions both leverage the multimodal capabilities of GPT-4V (https://openai.com/research/gpt-4v-system-card). Fig. 18 presents visualizations of selected cases, showing that the additional data does not enhance GPT-4’s performance. Given that these two qualitative experiments involve ground truth inputs, we do not include them in the main text or conduct quantitative experiments.

Table 9: The prompts for Sec. D. “H.”, “C.”, and “S.” denote “HTML”, “caption” and “screenshot”, respectively.

Information	H.+C.	You are an exceptional web designer. Please create the corresponding CSS code based on the HTML code I have provided, so as to craft a well-designed visual presentation for the web page. Furthermore, for better comprehension of the original web page design, here is a detailed caption: {Caption}. You can only use the following CSS properties: "left", "top", "width", "height", "font-style", "font-weight", "font-size", "line-height", "color", "text-align", "text-decoration", "text-transform", "background-color". Please exercise caution in controlling the size of the image, as using the original image dimensions directly may result in excessive spatial occupation. Here are several demonstrations:{Demonstrates}. Below is the HTML code and do not reply with anything other than CSS code: {HTML_Code}.
	H.+C.+S.	You are an exceptional web designer. Please create the corresponding CSS code based on the HTML code and screenshot I have provided, so as to craft a well-designed visual presentation for the web page. Furthermore, for better comprehension of the original web page design, here is a detailed caption: {Caption}. You can only use the following CSS properties: "left", "top", "width", "height", "font-style", "font-weight", "font-size", "line-height", "color", "text-align", "text-decoration", "text-transform", "background-color". Please exercise caution in controlling the size of the image, as using the original image dimensions directly may result in excessive spatial occupation. Here are several demonstrations:{Demonstrates}. Below is the HTML code and do not reply with anything other than CSS code: {HTML_Code}.
Slots	Caption	Captions from the original web page screenshots.
	HTML_Code	HTML code of given web page.
	Demonstrates	The HTML-CSS pairs for three selected web page segments.
