WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation

Zirui Shao (1; equal contribution; ORCID 0000-0002-4210-070X), Feiyu Gao (2; ORCID 0009-0009-3206-5347), Hangdi Xing (1; ORCID 0000-0002-1770-005X), Zepeng Zhu (1; ORCID 0009-0000-1510-6455), Zhi Yu (1; corresponding author; ORCID 0009-0001-8608-5628), Jiajun Bu (1; ORCID 0000-0002-1097-2044), Qi Zheng (2; ORCID 0009-0001-3822-2616), Cong Yao (2; ORCID 0000-0001-6564-4796)

1 Zhejiang Provincial Key Laboratory of Service Robot, Zhejiang University. Email: {shaozirui, xinghd, zapeng, yuzhirenzhe, bjj}@zju.edu.cn
2 Alibaba Group. Email: feiyu.gfy@alibaba-inc.com, yongqi.zq@taobao.com, yaocong2010@gmail.com
Abstract

In the era of content creation revolution propelled by advancements in generative models, the field of web design remains unexplored despite its critical role in modern digital communication. The web design process is complex and often time-consuming, especially for those with limited expertise. In this paper, we introduce Web Rendering Parameters Generation (WebRPG), a new task that aims to automate the generation of the visual presentation of web pages based on their HTML code. WebRPG would contribute to a faster web development workflow. Since there is no existing benchmark available, we develop a new dataset for WebRPG through an automated pipeline. Moreover, we present baseline models, which use a variational autoencoder (VAE) to manage numerous elements and rendering parameters, along with a custom HTML embedding for capturing essential semantic and hierarchical information from HTML. Extensive experiments, including customized quantitative evaluations for this specific task, are conducted to evaluate the quality of the generated results. The dataset and code can be accessed on GitHub (https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/WebRPG).

Keywords: Generative model · Visual Design Automation · Web Rendering Parameters

1 Introduction

Recently, we are witnessing a revolution in content creation, driven by rapid advancements in generative models across domains such as image [57, 55, 21, 48, 56], text [3, 62, 42], and audio [32, 6, 7]. Numerous studies aim to leverage these advancements to enhance efficiency in graphic design, including advertisement [40, 35] and magazine [19, 35, 73] design. Nevertheless, the automation of web design, an essential part of graphic design [64], lacks exploration. Web design plays a significant role in the visual communication of web pages [61], impacting not only user satisfaction [9] but also user behavior [14]. Yet, it is a complex, time-consuming task, especially challenging for those developers with limited design expertise, leading to substandard visual presentations [66]. Automating web design can simplify this process, enabling developers to create visually appealing web pages, and bridging the gap between technical development and aesthetic excellence.

Web pages are formed by HTML (https://html.spec.whatwg.org/) and CSS (https://www.w3.org/Style/CSS/specs.en.html) code, where HTML defines the content and structure, and CSS controls the visual presentation. With the advent of large language models (LLMs) [62, 3, 52], automating HTML code generation has become feasible. However, efforts in automatic visual presentation design, the core aspect of web design, currently center on specific subtasks such as layout generation [28, 53, 49], font recommendation [71, 2], and colorization [27, 54, 17], rather than designing a holistic web visual presentation from scratch.

Figure 1: Overview of the WebRPG task. The input consists of plain HTML code and the output comprises rendering parameters for each element. With browser rendering, plain HTML produces a disorganized visual presentation, while incorporating the generated rendering parameters significantly enhances the visual presentation.

Intuitively, leveraging generative models to learn design knowledge from existing web pages is a practical strategy for automated web visual design. However, the complexity of CSS coding practices poses challenges for its automatic generation [26]. To address this, we propose standardizing CSS using Rendering Parameters (RPs), which are defined by CSS properties that control the visual appearance of each web element [16]. Consequently, we introduce a novel task called Web Rendering Parameters Generation (WebRPG for short), which requires the automatic generation of rendering parameters for each web element based on the HTML code, as depicted in Fig. 1. With the help of a WebRPG system, HTML is the only prerequisite for obtaining an effective web visual presentation, which has the potential to achieve a faster web development workflow. With the integration of LLMs, a WebRPG system can even enable the realization of a fully automated web development workflow. Moreover, it can facilitate new applications, such as efficient exploration of various design options and dynamic personalization of web page styles.

Since there is no existing benchmark available for WebRPG, we develop automatic data processing steps to transform raw web pages into formalized WebRPG samples and construct a new dataset utilizing the Klarna dataset [22]. From a theoretical perspective, the WebRPG task presents two primary challenges: 1) Web pages comprise hundreds of elements, each with numerous RPs. 2) The visual presentation of web elements should be associated with the semantic and hierarchical information provided by HTML code. To address the challenges, variational autoencoder (VAE) [30] is employed to handle the large volume of rendering parameters for web elements, and specially designed HTML embedding is introduced to encode semantic and hierarchical information from HTML code. Using these modules, two WebRPG baselines are established, which are based on autoregressive and diffusion models, respectively. To verify the effectiveness of WebRPG baselines, metrics are designed to evaluate the overall appearance, layout, and style of the generated results. Both quantitative and qualitative experiments are conducted to assess the baselines.

Our main contributions are as follows:

  • We introduce a novel task WebRPG for automatic web design from HTML code and create a new dataset.

  • We explore the WebRPG task by establishing two baselines and propose solutions for its challenges.

  • We design metrics to quantitatively evaluate the quality of generated results, and conduct qualitative experiments to analyze the strengths and weaknesses of the baselines.

2 Related Work

Generative models achieve notable success in image [57, 55, 4, 56, 50], text [3, 62, 42], and audio [32, 6, 7]. Image synthesis can create web visual presentations by generating screenshots but struggles with producing coherent text [55]. Moreover, image synthesis is limited to static images and cannot offer interactive, manipulable web pages.

Numerous efforts utilize generative models for graphic design, including advertising [40, 35], magazines [19, 73, 23], UI [24, 25, 8, 44], and posters [74, 37]. Yet, these designs restrict the element count to no more than 25. These methods primarily employ a one-dimensional sequence to represent designs, with each element defined by five tokens: four describe the bounding box, and one indicates the category (e.g., text, headline) [35]. However, relying on such a simplistic flat input for the WebRPG task, which involves managing hundreds of elements and various RPs, leads to a substantial increase in memory consumption and to performance degradation [12]. Moreover, the one-dimensional sequence neglects crucial hierarchical information in web pages.

Research focused on web pages has continuously emerged. In terms of understanding, efforts in web question answering [5, 72], web information extraction [34, 69], and web pre-trained language models [41, 10, 60] have made notable progress in comprehending the essential semantic content and hierarchical structure of web pages. For instance, MarkupLM [41] stands out with its unique architecture and pre-training tasks, effectively encoding HTML content, which offers insights for our research. Moreover, there are works aimed at web page design, such as optimizing the overall or specific block coloring of web pages [27, 54, 17], determining layouts based on given components like navigation bars [28, 53, 49], and recommending fonts for particular elements [71, 2]. However, these studies focus only on specific subtasks of the web page design workflow, leaving the comprehensive design of web pages from scratch as an unexplored area.

3 Preliminary

3.1 Task Definition

Web design is centered on visual presentation, i.e., the manipulation of CSS code. The complexity of CSS coding practices, including a wide range of selector options, makes the automatic generation of CSS code challenging [26]. To facilitate the model for learning web design, we standardize CSS by converting it into rendering parameters (RPs), which can be transformed back into CSS, with additional details in Sec. A.1. Consequently, the WebRPG task is defined as follows: given the HTML code, generate rendering parameters for each web element. Specifically, given a web page $\mathcal{X}$, whose HTML code is $\mathcal{H}$, it consists of a set of elements $\mathcal{X}=\{X_1, X_2, \ldots, X_S\}$, where $S$ is the number of elements in $\mathcal{X}$. The visual appearance of element $X_i$ is controlled by a set of RPs denoted as $P_i=\{p_i^k \mid k\in\mathcal{W}\}$, where $\mathcal{W}$ indicates the indices for all RPs, and the complete set of RPs for $\mathcal{X}$ is $\mathcal{P}=\{P_1, P_2, \ldots, P_S\}$.
Therefore, the primary objective of the WebRPG task is to create a function $f$ that generates RPs based on HTML code, that is, $f:(\mathcal{H})\mapsto\hat{\mathcal{P}}$, where $\hat{\mathcal{P}}$ represents the estimate of $\mathcal{P}$.

3.2 Web Rendering Parameter Definition

The term “Rendering Parameters (RPs)” is employed to collectively describe the parameters controlling the visual appearance of each web element on the browser, as defined by CSS properties. Layout and visual style are crucial in the design of web pages [67, 60], leading us to summarize 13 common CSS properties, divided into 3 categories as follows.

  • Layout properties include left, top, width, and height.

  • Text properties include font-style, font-weight, font-size, line-height, text-align, text-decoration, and text-transform.

  • Color properties include color and background-color.

Various formats are available for web developers to define CSS properties. To standardize, we adopt the values computed by the browser [47] as the reference. Specifically, the values related to position and size are uniformly measured in integer pixels, and the values related to color correspond to 46 widely used colors. The vocabulary for all rendering parameters is available in Sec. A.1.
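To make the 13 rendering parameters concrete, the following is a minimal Python sketch of one element's RPs and a conversion to CSS declarations. It is an illustration only: the class name `RenderingParams`, the helper `to_css`, the default values, and the choice to serialize `font-size` and `line-height` in pixels are assumptions made here, not the conversion described in Sec. A.1.

```python
from dataclasses import dataclass, asdict


@dataclass
class RenderingParams:
    """Hypothetical container for the 13 RPs of a single web element."""
    # Layout properties (integer pixels, as stated in the text)
    left: int
    top: int
    width: int
    height: int
    # Text properties (defaults are illustrative assumptions)
    font_style: str = "normal"
    font_weight: str = "400"
    font_size: int = 16
    line_height: int = 20
    text_align: str = "left"
    text_decoration: str = "none"
    text_transform: str = "none"
    # Color properties (the paper maps colors to a 46-color vocabulary)
    color: str = "black"
    background_color: str = "white"


# Fields serialized with a "px" suffix (an assumption of this sketch)
PIXEL_FIELDS = {"left", "top", "width", "height", "font_size", "line_height"}


def to_css(rp: RenderingParams) -> str:
    """Serialize the RPs of one element as a CSS declaration string."""
    declarations = []
    for name, value in asdict(rp).items():
        prop = name.replace("_", "-")  # font_size -> font-size
        if name in PIXEL_FIELDS:
            value = f"{value}px"
        declarations.append(f"{prop}: {value}")
    return "; ".join(declarations) + ";"


if __name__ == "__main__":
    print(to_css(RenderingParams(left=0, top=0, width=100, height=20)))
```

Note that in CSS, `left` and `top` only take effect on positioned elements; this sketch does not emit a `position` declaration, so a real conversion would have to handle positioning separately.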

4 Dataset Construction

4.1 Data Pre-processing

Raw web pages do not provide direct supervision for RPs, so several pre-processing steps are applied. Headless Chrome (https://developer.chrome.com/blog/headless-chrome/) is used to render web pages, and Selenium (https://www.selenium.dev/) is employed to store HTML with only visible elements and record each element's selected CSS properties. Note that elements in this paper mean nodes in the DOM (https://www.w3.org/DOM/DOMTR) tree. The elements are stored following the DOM tree's pre-order traversal. Since many web pages retain thousands of elements, we treat elements with a certain number of children as sub-pages with the semantic and hierarchical integrity preserved. The sub-pages are further cleaned while keeping the visual appearance, including removing uncommon HTML tags and intricate components like carousel images, as well as placing sub-pages at the top-left corner of the browser. Additionally, we only consider static components. Our models disregard the images on web pages, preserving only <img> tags. To guarantee data quality, a specific Visual Complexity (VC) metric is introduced to assist in filtering samples. The metric integrates three dimensions: color, size, and alignment, inspired by previous works [1, 15]. The definition of the VC metric is provided in Sec. A.2.
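The pre-order storage and sub-page selection described above can be sketched with the Python standard library. This is not the authors' pipeline (which renders pages in a browser): the names `Node`, `DomBuilder`, `preorder`, and `subpage_roots` are hypothetical, the void-tag list is abbreviated, and interpreting "number of children" as the number of descendants is an assumption.

```python
from html.parser import HTMLParser

# Tags that never have children (abbreviated list, for illustration)
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a simplified element tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this tag name, if any
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def preorder(node):
    """Yield a node, then its subtrees from left to right."""
    yield node
    for child in node.children:
        yield from preorder(child)


def count_descendants(node):
    return sum(1 for _ in preorder(node)) - 1


def subpage_roots(root, lo, hi):
    """Elements whose descendant count lies in [lo, hi], in pre-order."""
    return [n for n in preorder(root) if lo <= count_descendants(n) <= hi]


if __name__ == "__main__":
    builder = DomBuilder()
    builder.feed("<div><h1>Title</h1><p>Text <img src='a.png'></p></div>")
    print([n.tag for n in preorder(builder.root)])
```

With the thresholds reported in Sec. 4.2, the call would be `subpage_roots(builder.root, 32, 128)`. Nested candidates can overlap; how such overlaps are resolved is not specified in this section.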

Figure 2: Selected sub-page screenshots from our dataset. Notably, regions displayed are cropped due to space limitations.

4.2 Dataset Details

To accommodate the requirement for offline rendering, the Klarna dataset [22] is utilized to build our WebRPG dataset. The Klarna dataset, initially used for web information extraction, comprises 20K English product pages from 3K e-commerce sites, ensuring domain-specific diversity. The dataset stores all pages in MHTML (https://en.wikipedia.org/wiki/MHTML) format, enabling offline rendering of the original pages in browsers with high fidelity.

The pre-processing in Sec. 4.1 is applied to the web pages with the browser canvas size set to 1920×1920 pixels, generating sub-pages containing between 32 and 128 child elements. The token length for each sample (sub-page) does not exceed 512. The size of the RP vocabulary is 1993. Samples with a VC below 0.1 are filtered out. After pre-processing, our dataset includes 88,418 samples, split into training and testing sets at an 8:2 ratio. Our dataset exceeds the size of established graphic design datasets such as CLAY [38] (50K samples) and RICO [44] (43K samples), ensuring it can meet our objectives. Screenshots of some samples are shown in Fig. 2. More details are provided in Sec. A.3.

Figure 3: Key components of WebRPG models. In the upper left, VAE compresses the RPs of each element into latent vectors shown in blue. In the top right, "Semantic" (Sem), "Hierarchical" (Hier), and "Character Count" (CharC) embeddings combine into the HTML embedding in orange. Below, two generative models are illustrated.

5 Methodology

5.1 Overview

As indicated in Sec. 3.1, the WebRPG task is formulated as a function that generates rendering parameters (RPs) for each web element based on the HTML code. Inspired by classical generation methods [57, 50, 56], we employ a latent generation approach. In the approach, VAE is leveraged to compress all RPs of an element into latent space representation (Sec. 5.2), and a generative model (Sec. 5.4) generates the latent vector based on the given HTML embeddings (Sec. 5.3), which is then decoded back into RPs by the decoder of VAE. The key components of our method are shown in Fig. 3.

5.2 Rendering Parameters Compression

Assume a web page consists of $S$ elements, with the appearance of each element $X_i$ determined by $\mathcal{W}$ rendering parameters $P_i=\{p_i^k \mid k\in\mathcal{W}\}$. The WebRPG model necessitates the processing of $S\times\mathcal{W}$ values for both input and output. Expanding all $p_i^k$ of $X_i$ into a one-dimensional sequence, as per graphic design methods [24, 35], leads to excessively long input and output lengths. To mitigate this challenge, we utilize VAE to compress the rendering parameters into a latent space. This ensures that the input length for the generative model correlates solely with $S$.

More precisely, let the RPs of an element be $P_i\in\mathbb{R}^{\mathcal{W}*V}$, where $V$ is the size of the RP vocabulary (Sec. 3.2), and let the corresponding latent vector be $Z_i\in\mathbb{R}^{d}$. We denote the generative distribution as $p_{\theta}(P_i\mid Z_i)$ and the posterior as $q_{\phi}(Z_i\mid P_i)$, respectively. The learning objective of VAE is expressed as:

$$L_{VAE}=\frac{1}{S}\cdot\sum_{i=1}^{S}\left(-\mathbb{E}_{q_{\phi}(Z_i\mid P_i)}\left[\log p_{\theta}(P_i\mid Z_i)\right]+\lambda_{KL}\,\mathrm{KL}\left(q_{\phi}(Z_i\mid P_i)\,\|\,p(Z_i)\right)\right), \tag{1}$$

where $\theta$ and $\phi$ are the encoder and decoder parameters, $\mathbb{E}$ indicates the expectation, $\mathrm{KL}$ is the Kullback-Leibler divergence, and $\lambda_{KL}$ is the hyperparameter to balance the two terms. The encoder and decoder of VAE both consist of a multilayer perceptron with five layers. To ensure that the latent space encompasses as many element appearances (i.e., combinations of RPs) as possible, the VAE is pre-trained using synthetic data.
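For intuition, the per-element summand of Eq. (1) can be written out in a few lines of Python. The sketch below rests on assumptions that this section does not state: a diagonal Gaussian posterior, a standard normal prior $p(Z_i)$, a categorical likelihood over the RP vocabulary for each parameter, and a single-sample estimate of the expectation. All function names are hypothetical.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target under a softmax over logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def vae_loss(per_rp_logits, targets, mu, log_var, lambda_kl):
    """Reconstruction term plus weighted KL term for one element.

    per_rp_logits: one list of decoder logits per rendering parameter
    targets:       the ground-truth vocabulary index of each parameter
    """
    reconstruction = sum(
        cross_entropy(logits, t) for logits, t in zip(per_rp_logits, targets)
    )
    return reconstruction + lambda_kl * kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    # One parameter with a two-word vocabulary, one latent dimension
    print(vae_loss([[0.0, 0.0]], [1], mu=[1.0], log_var=[0.0], lambda_kl=0.1))
```

Averaging this quantity over the $S$ elements of a page gives the form of $L_{VAE}$ above.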

5.3 Encoding HTML

The visual presentation of a web page should be in harmony with the content and structure dictated by its HTML code. To this end, we design an HTML embedding that captures the essential information in the HTML code, establishing the input feature for the generative model (Sec. 5.4). HTML code essentially encompasses hierarchical information among elements and the textual content of each element [10]. The character count of each element is also crucial, as the size of an element generally exhibits a positive correlation with the length of characters. Therefore, our HTML embedding integrates three facets of information: semantics, hierarchy, and character count. Precisely, for an element $X_i$, its HTML embedding $H_i\in\mathbb{R}^{d}$ is defined as:

$$H_i=\Lambda^{\text{Sem}}(H_i^{\text{Sem}})+\Lambda^{\text{Hier}}(H_i^{\text{Hier}})+\Lambda^{\text{CharC}}(H_i^{\text{CharC}}), \tag{2}$$

where $H_i^{\text{Sem}}$, $H_i^{\text{Hier}}$ and $H_i^{\text{CharC}}$ denote the semantic, hierarchical and character count embedding respectively, and $\Lambda^{\circ}()$ is the linear projection layer.

Semantic embedding: The MarkupLM-large model [41], a language model explicitly pre-trained for web understanding, is employed as the semantic extractor. Specifically, given an element $X_i$ with HTML code tokens $X_i=\{x_i^j \mid j\in\mathcal{L}\}$, where $\mathcal{L}$ denotes the token length, we calculate the semantic embedding of $X_i$ as $H_i^{\text{Sem}}=\text{Pool}(\text{MarkupLM}(x_i^1, x_i^2, \ldots, x_i^{\mathcal{L}}))$, where $\text{Pool}(\cdot)$ denotes an average pooling operation.

Hierarchical embedding: The XPath embedding layer [41] is employed to model the hierarchical information of elements, taking their XPath expressions as input. XPath (https://www.w3.org/TR/xpath-31/) is a query language for selecting elements from a web page, which is based on the DOM tree and can be used to easily locate an element. Specifically, for an element $X_i$ with its corresponding XPath expression $xp_i$, we compute the hierarchical embedding directly as $H_i^{\text{Hier}}=\text{XPathEmb}(xp_i)$.

Character count embedding: We establish a mapping mechanism that translates the raw count of characters into dense vector space. For an element $X_i$ with the content of $k$ characters, the character count embedding is calculated as $H_i^{\text{CharC}}=\text{EmbCharC}(k)$.
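As an illustration of the input to the hierarchical embedding, an XPath expression can be split into a sequence of tag names and a sequence of sibling subscripts, which is the kind of representation an XPath embedding layer consumes. The sketch below is an assumption about that pre-processing, not the authors' code: the function `split_xpath` is hypothetical, it handles only simple absolute paths such as `/html/body/div[2]/span`, and using 0 for a missing subscript is a convention chosen here.

```python
import re

# One step of a simple absolute XPath: a tag name plus an optional [index]
_STEP = re.compile(r"^([A-Za-z][\w-]*)(?:\[(\d+)\])?$")


def split_xpath(xpath):
    """Split '/html/body/div[2]/span' into tag names and subscripts."""
    tags, subscripts = [], []
    for step in xpath.strip("/").split("/"):
        match = _STEP.match(step)
        if match is None:
            raise ValueError(f"unsupported XPath step: {step!r}")
        tags.append(match.group(1))
        # A step without an explicit index is given subscript 0 (assumption)
        subscripts.append(int(match.group(2)) if match.group(2) else 0)
    return tags, subscripts


if __name__ == "__main__":
    print(split_xpath("/html/body/div[2]/span"))
```

Each tag name and subscript would then be mapped to a learned vector and combined across depth levels to produce $H_i^{\text{Hier}}$; those details follow [41] and are not reproduced here.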

5.4 Generative Models

Two generative models are implemented: an autoregressive model and a diffusion model.

Autoregressive Model (AR): To enhance the model stability during training, a masked latent vector $\mathcal{Z}_{mask}$ of real RPs is introduced inspired by BART [36] and MaskGIT [4]. $\mathcal{Z}_{mask}$ is constructed in two steps. Firstly, the real RPs are encoded into the latent vectors with the VAE encoder, i.e., $\mathcal{Z}=\theta(\mathcal{P})$. Then a special $MASK$ vector and a binary mask $M=\{m_i \mid i\in S\}$ are utilized to partially substitute the real latent vectors with the $MASK$ as $Z_{mask,i}=m_i\cdot MASK+(1-m_i)\cdot\theta(P_i)$.

Here $M$ is generated using a mask scheduling function $\gamma(r)\in(0,1]$ following MaskGIT [4], and the $MASK$ vector is a learnable parameter with the same shape as $Z_i$. Additionally, it is important to highlight that during inference, all $Z_i$ are masked, i.e., $M=\{m_i=1 \mid 1\leq i\leq S\}$.
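The masking step can be sketched as follows. The text only says that the schedule follows MaskGIT; the cosine schedule used below is one of the schedules explored in that work and is an assumption here, as are the function names and the use of plain Python lists in place of tensors.

```python
import math
import random


def cosine_schedule(r):
    """Mask ratio gamma(r) for r in [0, 1); values lie in (0, 1]."""
    return math.cos(0.5 * math.pi * r)


def sample_mask(num_elements, rng):
    """Draw a binary mask M with ceil(gamma(r) * S) masked positions."""
    r = rng.random()  # uniform in [0, 1)
    num_masked = max(1, math.ceil(cosine_schedule(r) * num_elements))
    masked = set(rng.sample(range(num_elements), num_masked))
    return [1 if i in masked else 0 for i in range(num_elements)]


def apply_mask(latents, mask, mask_vector):
    """Z_mask,i = m_i * MASK + (1 - m_i) * Z_i for a binary mask m."""
    return [
        list(mask_vector) if m == 1 else list(z)
        for z, m in zip(latents, mask)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    latents = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    mask = sample_mask(len(latents), rng)
    print(mask, apply_mask(latents, mask, mask_vector=[9.0, 9.0]))
```

At inference time the mask is all ones, so `apply_mask` returns a sequence made only of copies of the $MASK$ vector.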

As depicted in Fig. 3, the model takes the sum of $\mathcal{Z}_{mask}$ and $\mathcal{H}$ as input to generate $\hat{\mathcal{Z}}$, which is then decoded by the VAE decoder as $\hat{\mathcal{P}}=\phi(\hat{\mathcal{Z}})$. The VAE and generative models are trained jointly, thus the training loss is as follows:

$$L=\log p_{\psi}(\mathcal{P}\mid\mathcal{H},\mathcal{Z}_{mask})+L_{VAE}, \tag{3}$$

where $\psi$ denotes the parameters of the generative model.

Diffusion Model: Diffusion models [21, 48, 75] have recently emerged as a new class of generative models with high performance. These models are characterized by forward and reverse Markov processes of length T𝑇Titalic_T. In our rendering parameters compression (VAE) model, rendering parameters 𝒫𝒫\mathcal{P}caligraphic_P are encoded into a latent space, i.e., 𝒵=θ(𝒫)𝒵𝜃𝒫\mathcal{Z}=\theta(\mathcal{P})caligraphic_Z = italic_θ ( caligraphic_P ). These latent vectors 𝒵𝒵\mathcal{Z}caligraphic_Z, which align more closely with a Gaussian distribution, improve compatibility with the noise distribution in diffusion models. Following successful models [11, 57, 59], our diffusion model can be interpreted as an equally weighted sequence of denoising autoencoders (𝒵t,t,);t=1Tsubscript𝒵𝑡𝑡𝑡1𝑇\mathcal{E}(\mathcal{Z}_{t},t,\mathcal{H});t=1\ldots Tcaligraphic_E ( caligraphic_Z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , caligraphic_H ) ; italic_t = 1 … italic_T, which are trained to predict the noise ϵ𝒩(𝟎,𝐈)similar-tobold-italic-ϵ𝒩0𝐈\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})bold_italic_ϵ ∼ caligraphic_N ( bold_0 , bold_I ) in 𝒵tsubscript𝒵𝑡\mathcal{Z}_{t}caligraphic_Z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. 
The 𝒵tsubscript𝒵𝑡\mathcal{Z}_{t}caligraphic_Z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is obtained from a forward process starting from 𝒵0subscript𝒵0\mathcal{Z}_{0}caligraphic_Z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT (where 𝒵0=𝒵subscript𝒵0𝒵\mathcal{Z}_{0}=\mathcal{Z}caligraphic_Z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = caligraphic_Z), defined as 𝒵t=αt𝒵t1+1αtϵsubscript𝒵𝑡subscript𝛼𝑡subscript𝒵𝑡11subscript𝛼𝑡bold-italic-ϵ\mathcal{Z}_{t}=\sqrt{\alpha_{t}}\mathcal{Z}_{t-1}+\sqrt{1-\alpha_{t}}% \boldsymbol{\epsilon}caligraphic_Z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = square-root start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG caligraphic_Z start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + square-root start_ARG 1 - italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG bold_italic_ϵ, with αtsubscript𝛼𝑡\alpha_{t}italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT being a predefined set of coefficients. As illustrated in Fig. 3, 𝒵tsubscript𝒵𝑡\mathcal{Z}_{t}caligraphic_Z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and \mathcal{H}caligraphic_H are added and input into the model. Our diffusion model employs the standard variational lower bound objective as its training loss, and we jointly optimize the VAE, leading to the overall loss function:

L=𝔼𝒵,ϵ𝒩(0,1),t[ϵϵψ(𝒵t,t,)22]+LVAE.𝐿subscript𝔼formulae-sequencesimilar-to𝒵italic-ϵ𝒩01𝑡delimited-[]superscriptsubscriptnormbold-italic-ϵsubscriptbold-italic-ϵ𝜓subscript𝒵𝑡𝑡22subscript𝐿𝑉𝐴𝐸L=\mathbb{E}_{\mathcal{Z},\epsilon\sim\mathcal{N}(0,1),t}\Big{[}\|\boldsymbol{% \epsilon}-\boldsymbol{\epsilon}_{\psi}(\mathcal{Z}_{t},t,\mathcal{H})\|_{2}^{2% }\Big{]}+L_{VAE}.italic_L = blackboard_E start_POSTSUBSCRIPT caligraphic_Z , italic_ϵ ∼ caligraphic_N ( 0 , 1 ) , italic_t end_POSTSUBSCRIPT [ ∥ bold_italic_ϵ - bold_italic_ϵ start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( caligraphic_Z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , caligraphic_H ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + italic_L start_POSTSUBSCRIPT italic_V italic_A italic_E end_POSTSUBSCRIPT . (4)

During inference, the predicted $\hat{\mathcal{Z}}$ is progressively obtained through a reverse process, expressed as $\mathcal{Z}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathcal{Z}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\alpha_{t}}}\boldsymbol{\epsilon}_{\psi}(\mathcal{Z}_{t},t,\mathcal{H})\right)$. Subsequently, $\hat{\mathcal{Z}}$ is decoded to $\hat{\mathcal{P}}$ via a single pass through the VAE decoder $\phi$. Additionally, $\mathcal{Z}_{T}$ is random Gaussian noise.
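The forward and reverse updates above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: latents are plain lists of floats, the names `forward_step` and `reverse_step` are hypothetical, the value of `alpha_t` is arbitrary, and the learned noise predictor $\boldsymbol{\epsilon}_{\psi}$ is replaced by the true noise so that the round trip can be checked. Note that the coefficient $\frac{1-\alpha_{t}}{\sqrt{1-\alpha_{t}}}$ as written in the text equals $\sqrt{1-\alpha_{t}}$, so with exact noise one reverse step undoes one forward step.

```python
import math
import random


def forward_step(z_prev, alpha_t, eps):
    """One forward step: Z_t = sqrt(alpha_t) * Z_{t-1} + sqrt(1 - alpha_t) * eps."""
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [a * z + b * e for z, e in zip(z_prev, eps)]


def reverse_step(z_t, alpha_t, eps_pred):
    """One reverse step, following the expression as written in the text."""
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_t)
    return [(z - coef * e) / math.sqrt(alpha_t) for z, e in zip(z_t, eps_pred)]


rng = random.Random(0)
z_prev = [rng.gauss(0.0, 1.0) for _ in range(4)]  # toy latent Z_{t-1}
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]     # Gaussian noise
alpha_t = 0.98                                    # assumed coefficient value

z_t = forward_step(z_prev, alpha_t, eps)
# Oracle noise stands in for the model prediction eps_psi(Z_t, t, H).
z_rec = reverse_step(z_t, alpha_t, eps)
max_err = max(abs(a - b) for a, b in zip(z_prev, z_rec))
```

In an actual sampler, `eps_pred` would come from the trained model conditioned on the HTML embedding, and the loop would run from $t=T$ down to $t=1$ starting from Gaussian noise.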

6 Experiment

6.1 Evaluation Metrics

Three metrics are utilized to assess the quality of the generated rendering parameters. Fréchet Inception Distance (FID), Element Intersection over Union (Ele. IoU), and newly introduced Style Consistency Score (SC Score) enable the evaluation of the overall appearance, layout, and style of generated web pages respectively. As indicated in Sec. 3.2, “layout” refers to layout properties, while “style” encompasses text properties and color properties.

6.1.1 Fréchet Inception Distance

FID [20], a metric initially proposed in the domain of image generation, measures the similarity of generated data to real ones in feature space. Inspired by Lee et al. [35], a binary classifier is trained to distinguish between real and noise-added RPs. This classifier is employed to generate representative features of RPs for calculating FID. We also introduce layout-specific and style-specific FID models. Further details are in Sec. A.4.
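For intuition, the Fréchet distance between two sets of classifier features can be sketched as follows. This is a simplified illustration, not the evaluation code used in the paper: it assumes diagonal covariances, in which case the distance reduces to the sum over dimensions of $(\mu_{1}-\mu_{2})^{2}+(\sigma_{1}-\sigma_{2})^{2}$, whereas the full metric uses complete covariance matrices and a matrix square root. The function name and the toy feature vectors are hypothetical.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(feats_a, feats_b):
    """Fréchet distance between two feature sets under a diagonal-covariance assumption."""
    total = 0.0
    for d in range(len(feats_a[0])):
        col_a = [f[d] for f in feats_a]
        col_b = [f[d] for f in feats_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        sd_a, sd_b = math.sqrt(pvariance(col_a)), math.sqrt(pvariance(col_b))
        total += (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2
    return total


real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]   # toy "real" features
shifted = [[x + 1.0, y] for x, y in real]      # same spread, mean shifted by 1 in dim 0

fid_same = frechet_distance_diag(real, real)
fid_shift = frechet_distance_diag(real, shifted)
```

Identical feature sets give a distance of zero, and a pure mean shift contributes its squared magnitude, which matches the "lower is better" reading of the FID columns.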

6.1.2 Elements Intersection over Union

Ele. IoU is a metric for evaluating the similarity between generated layouts and real ones, adapted from the Maximum IoU [29]. As the elements of real and generated web pages correspond one-to-one, IoU is computed between the corresponding pairs. Denote the real layouts as $B=\{b_{i}\}_{i=1}^{N}$ and the generated ones as $\hat{B}=\{\hat{b}_{i}\}_{i=1}^{N}$, with $N$ being the element count, and $b_{i}$ and $\hat{b}_{i}$ as corresponding elements. The Ele. IoU can be calculated as follows:

$$\text{EleIoU}(B,\hat{B})=\frac{1}{N}\sum_{i=1}^{N}IoU(b_{i},\hat{b}_{i}). \tag{5}$$
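Eq. (5) can be sketched directly in Python. The box format `(x1, y1, x2, y2)` and the function names are assumptions made for illustration; the paper does not specify them here.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ele_iou(real_boxes, gen_boxes):
    """Mean IoU over one-to-one corresponding element boxes, as in Eq. (5)."""
    if len(real_boxes) != len(gen_boxes):
        raise ValueError("real and generated pages must have the same elements")
    if not real_boxes:
        return 0.0
    pairs = zip(real_boxes, gen_boxes)
    return sum(box_iou(b, g) for b, g in pairs) / len(real_boxes)


real = [(0, 0, 10, 10), (10, 0, 20, 10)]
gen = [(0, 0, 10, 10), (15, 0, 25, 10)]  # second element shifted right by 5
score = ele_iou(real, gen)               # (1 + 1/3) / 2
```

Because elements are matched by index rather than by best overlap, no assignment step is needed, which is the main difference from the Maximum IoU formulation.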

6.1.3 Style Consistency Score

The “Principle of Similarity” of Gestalt theory suggests that people tend to perceive elements with similar style as a whole [65, 31], highlighting the importance of style consistency among elements. Hence, the SC Score assesses whether elements with the same style on a real web page retain that consistency on the generated page, beyond merely visual similarity. An example explanation is provided in Sec. C. Elements are deemed to have the same style only if all their style properties are identical [60]. Specifically, for a web page $W=\{e_{i}\mid i\in N\}$ with $N$ being the number of elements, the style consistency subset of the page is defined as $S\subseteq W,\ \forall e_{i},e_{j}\in S,\ style(e_{i})=style(e_{j})$.

Thus the real web page $W$ and its generated page $\hat{W}$ are divided into style consistency subsets $W=\{S_{j}\mid j\in M\}$ and $\hat{W}=\{\hat{S}_{k}\mid k\in N\}$, respectively. Given that $N$ and $M$ can differ, we apply a max operation for optimal matching. The SC Score is then calculated as:

$$SCScore(W,\hat{W})=\sum_{j=1}^{M}w_{j}\cdot\max_{k}J(S_{j},\hat{S}_{k}), \tag{6}$$

where $J(A,B)$ is the Jaccard similarity coefficient. Additionally, under the assumption that style consistency subsets with more elements are more semantically valuable, we utilize a weight $w_{j}=\frac{|S_{j}|}{\sum_{l=1}^{M}|S_{l}|}$.
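A minimal sketch of Eq. (6) follows. It assumes each element's style is available as a hashable tuple of its style properties and that elements of the real and generated pages correspond by index; the function names and the toy styles are hypothetical.

```python
from collections import defaultdict


def style_subsets(styles):
    """Partition element indices into subsets whose style tuples are identical."""
    groups = defaultdict(set)
    for idx, style in enumerate(styles):
        groups[style].add(idx)
    return list(groups.values())


def jaccard(a, b):
    """Jaccard similarity coefficient of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sc_score(real_styles, gen_styles):
    """Weighted best-match Jaccard between real and generated style subsets (Eq. 6)."""
    real_sets = style_subsets(real_styles)
    gen_sets = style_subsets(gen_styles)
    total = sum(len(s) for s in real_sets)
    if total == 0:
        return 0.0
    return sum(
        len(s) / total * max(jaccard(s, g) for g in gen_sets) for s in real_sets
    )


real = [("bold", "red"), ("bold", "red"), ("normal", "black"), ("normal", "black")]
gen = [("bold", "blue"), ("bold", "blue"), ("normal", "black"), ("bold", "blue")]
score = sc_score(real, gen)  # 0.5 * 2/3 + 0.5 * 1/2

# Different concrete styles but the same grouping still score 1.0.
recolored = [("x",), ("x",), ("y",), ("y",)]
score_recolored = sc_score(real, recolored)
```

The second case illustrates the point made in the text: the score rewards preserving which elements share a style, not reproducing the exact style values.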

6.2 Implementation

Two baselines are implemented: autoregressive (WebRPG-AR) and diffusion model (WebRPG-DM). The VAE, hierarchical embedding, and character count embedding are jointly trained with the backbone, and the semantic embedding is produced by the frozen pre-trained MarkupLM$_{large}$ [41]. The XPath embedding layer is initialized following Li et al. [41]. All baselines are based on Transformer [63] with a hidden dimension of 128 and have approximately 50M parameters to ensure a fair comparison. The dimension $d$ of the latent vectors and the HTML embedding is 128. For optimization, AdamW [45] is used with a learning rate of 1.2e-4. All models are trained for 1M steps with a batch size of 300.

Additionally, LLMs have been gaining adoption in different domains. We assess GPT-4 [70, 51], StarCoder2-7b [46], DeepSeek-Coder-6.7b [18], CodeLlama-13B [58] on the WebRPG task. GPT-4 is one of the state-of-the-art LLMs, while the others are open-source models known for code generation. Due to limited resources, we randomly select 10% of test samples. The prompt template employs in-context learning [3], incorporating a task description, three demonstrations, and a test instance. Further details are available in Sec. A.5.
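The in-context prompt structure described above (task description, three demonstrations, test instance) can be sketched as follows. The actual template is given in Sec. A.5; the section markers, the task wording, and the rendering-parameter strings below are hypothetical placeholders used only to show the assembly.

```python
def build_prompt(task_description, demonstrations, test_html):
    """Join a task description, (HTML, rendering parameters) demos, and a test instance."""
    parts = [task_description.strip(), ""]
    for i, (html, rps) in enumerate(demonstrations, start=1):
        parts += [
            f"### Demonstration {i}",
            "HTML:",
            html.strip(),
            "Rendering parameters:",
            rps.strip(),
            "",
        ]
    # The test instance ends with an open slot for the model to complete.
    parts += ["### Test instance", "HTML:", test_html.strip(), "Rendering parameters:"]
    return "\n".join(parts)


# Placeholder demonstrations; real ones would pair HTML with its RPs.
demos = [
    ("<button>Buy</button>", "button: background-color=#ff6600"),
    ("<span>$9.99</span>", "span: font-weight=bold"),
    ("<p>Details</p>", "p: color=#333333"),
]
prompt = build_prompt(
    "Generate rendering parameters for each element of the given HTML.",
    demos,
    "<div><span>$19.99</span><button>Add to Cart</button></div>",
)
```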

6.3 Quantitative and Qualitative Evaluation

We present quantitative results in Tab. 1 and qualitative results in Fig. 4. Regarding the results of real data, in addition to the normally rendered web page (Tab. 1, “Real Web Page”), we also report the web page rendered using only HTML code (Tab. 1, “Plain HTML”). Since the browser applies default CSS when custom CSS is absent, some models perform worse than the plain HTML due to unreasonable generated RPs. The FIDs for real data are calculated between the test set and other real web pages.

The experimental results show that WebRPG-AR consistently surpasses other baselines. Its sequential decoding mechanism allows for more refined control based on previously generated results [13, 68]. As shown in Fig. 4 a, b, e, WebRPG-AR demonstrates impressive visual quality in detail.

Table 1: WebRPG baselines quantitative comparison with bold figures for best results. "*" stands for the result in the randomly selected test set.

| Model | Overall: FID ↓ | Layout: FID$_{layout}$ ↓ | Layout: Ele. IoU ↑ | Style: FID$_{style}$ ↓ | Style: SC Score ↑ |
| --- | --- | --- | --- | --- | --- |
| WebRPG-AR | 0.1281 | 0.1520 | 0.7082 | 0.2124 | 0.9474 |
| WebRPG-DM | 62.021 | 60.942 | 0.0357 | 106.95 | 0.3671 |
| WebRPG-AR* | 0.1324 | 0.2877 | 0.7069 | 0.1359 | 0.9485 |
| WebRPG-DM* | 61.135 | 60.870 | 0.0356 | 105.86 | 0.3649 |
| GPT4* | 4.2141 | 47.732 | 0.0347 | 8.8898 | 0.5515 |
| StarCoder2-7b* | 11.899 | 51.432 | 0.0309 | 18.186 | 0.3639 |
| DeepSeek-Coder-6.7b* | 5.8219 | 55.744 | 0.0330 | 7.4542 | 0.3949 |
| CodeLlama-13b* | 9.2826 | 55.427 | 0.0278 | 11.625 | 0.3864 |
| Real Web Page | 0.0027 | 0.0015 | 1.0000 | 0.0074 | 1.0000 |
| Plain HTML | 8.5342 | 52.438 | 0.0354 | 8.4951 | 0.3668 |
Table 2: Ablation study based on WebRPG-AR. Best results in bold. “$\mathcal{Z}_{mask}$” is detailed in Sec. 5.4. “H.E.” stands for HTML embedding. “S.”, “H.”, and “C.” stand for semantic, hierarchical, and character count embeddings. Settings follow the descriptions in Sec. 6.4.

| # | Setting | Overall: FID ↓ | Layout: FID$_{layout}$ ↓ | Layout: Ele. IoU ↑ | Style: FID$_{style}$ ↓ | Style: SC Score ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | w/o VAE (one-dimensional flat input) | 0.9702 | 5.4668 | 0.5954 | 15.923 | 0.8053 |
| 2 | w/o $\mathcal{Z}_{mask}$ | 0.1487 | 0.2055 | 0.6462 | 0.2944 | 0.9332 |
| 3 | w/o S. | 0.1797 | 0.2096 | 0.6620 | 0.3055 | 0.9323 |
| 4 | H. replaced by 1D positional embedding | 0.3003 | 0.3770 | 0.6345 | 1.9048 | 0.8982 |
| 5 | w/o C. | 0.1575 | 0.3152 | 0.6769 | 0.3065 | 0.9434 |
| 6 | Full model (WebRPG-AR) | 0.1281 | 0.1520 | 0.7082 | 0.2124 | 0.9474 |

The performance of WebRPG-DM is suboptimal across all metrics. It only tends to produce standard web visual presentations in simpler cases, as illustrated in Fig. 4 e, such as bolding prices, adding background color to buttons, and aligning a few elements. This implies that diffusion models may be inappropriate for this task. There are two plausible explanations: First, unlike images and videos in Euclidean space, web elements are non-Euclidean due to their hierarchical arrangement, while diffusion models are confined to Euclidean space [33]. Second, the WebRPG task demands meticulous adjustments and detailed control for realism, a limitation of diffusion models [43].

GPT-4’s performance on the WebRPG task surpasses that of WebRPG-DM but falls short of WebRPG-AR. Open-source LLMs underperform compared to GPT-4. As illustrated in Fig. 4 a, b, e, GPT-4 can effectively handle element styles, such as adding background colors to buttons and applying distinct colors for prices. However, the performance of GPT-4 in layout is limited. As demonstrated in Fig. 4 a-c, GPT-4 tends to generate simplistic vertical arrangements when faced with complex HTML structures. With regular HTML, as depicted in Fig. 4 e, GPT-4 achieves a layout that is similar to the real page. Therefore, we conclude that GPT-4 demonstrates basic capability in WebRPG tasks with regular HTML, but its performance with complex HTML is less effective. Additionally, we notice that LLMs do not generate RPs for all elements, causing many to use the browser’s default CSS, resulting in performance similar to plain HTML.

It is worth noting that WebRPG-AR exhibits the ability to render diverse web pages. For example, Fig. 4 d shows WebRPG-AR’s creation of a page with a vertical layout (originally horizontal), preserving the pattern and order consistency across four groups. This finding suggests that the model successfully learns web design knowledge and applies it effectively to render web pages from HTML code. Further cases are available in Sec. B.1.

Furthermore, we calculate the FID on screenshots of rendered web pages, following conventional image generation practices [55, 57]. The results, shown in Sec. B.2, are consistent with Tab. 1. Additionally, we conduct a human evaluation, detailed in Sec. B.4, with results that also align with Tab. 1.

Figure 4: Qualitative comparison of WebRPG baselines.
Figure 5: Case visualization from the ablation study.

6.4 Ablation Study

We conduct a series of ablation experiments based on WebRPG-AR, as shown in Tab. 2. #1 uses a one-dimensional flat input instead of VAE. #2 removes $\mathcal{Z}_{mask}$ (in Sec. 5.4). #3 and #5 respectively remove the corresponding embedding layers, while #4 substitutes hierarchical embedding with one-dimensional positional embedding. Additionally, we visualize some cases from #3, #4, and #5 in Fig. 5. All models are trained to convergence following the settings in Sec. 6.2.

The results of #1 demonstrate the effectiveness of using VAE for rendering parameters compression. Although #2 is comparable to #6, the incorporation of $\mathcal{Z}_{mask}$ enhances the model stability during training. The results of #3, #4, and #5 reveal that all three embeddings play critical roles in web design. Hierarchical embedding helps layout arrangement significantly. The simplification to 1D positional embedding leads to a disorganized layout, as illustrated in Fig. 5 c. Semantic embedding gives the model the capacity to perceive semantic relationships. For example, as Fig. 5 b shows, without it the model struggles to horizontally align elements like “select a color” and “sunshine,” suggesting challenges in identifying key-value pairs without semantic information. Character count embedding helps to predict appropriate element sizes for full content display, as in Fig. 5 d, where a narrow “price” and “sunshine” width leads to incomplete text display.

Figure 6: Left: Trends in WebRPG-AR performance relative to the number of elements and average depth of elements within the DOM tree. Right: WebRPG-AR failure cases with real web pages on the left, generated results on the right, and highlights in green.

6.5 Discussion on Failure Cases

To investigate the boundaries of the model’s capabilities, we analyze several failure cases generated by WebRPG-AR. The left side of Fig. 6 reveals that both layout (Ele. IoU) and style (SC Score) metrics decrease with an increase in the number of elements or the average depth of elements within the DOM tree. This trend may be attributed to two factors: the inherent complexity of a page increases with more elements or greater depth, and the training set lacks web pages with a large number of elements or significant depths (details in Sec. A.3). Regarding error types, layout issues mainly include misalignments and overlaps, as shown in Fig. 6 a and b. For style, the model struggles to recognize web page elements with identical semantic functions, such as the “Add to Cart” buttons illustrated in Fig. 6 c, which should appear identical. Moreover, we observe two primary error scenarios: elements positioned at the end of the HTML code tend to be more error-prone, as seen with the element in the bottom right corner of Fig. 6 b, likely due to the characteristics of the autoregressive model [12]; additionally, pages with large-scale images pose challenges, as shown in Fig. 6 a, since the model does not take the original images as input. The discussion above highlights the need for further research.
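The two page-complexity measures discussed above, the number of elements and the average depth of elements within the DOM tree, can be computed with a short sketch based on Python's standard `html.parser`. The exact depth convention used for Fig. 6 is not stated here, so counting the root element as depth 1 is an assumption, as are the class and function names; the sketch also assumes well-formed HTML with explicit end tags for non-void elements.

```python
from html.parser import HTMLParser

# HTML void elements have no end tag and therefore never open a new level.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomStats(HTMLParser):
    """Records the depth of every element encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.depths = []

    def handle_starttag(self, tag, attrs):
        self.depths.append(self.depth + 1)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <img/>: count it, do not descend.
        self.depths.append(self.depth + 1)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def dom_stats(html):
    """Return (number of elements, average element depth) for an HTML string."""
    parser = DomStats()
    parser.feed(html)
    parser.close()
    n = len(parser.depths)
    return n, (sum(parser.depths) / n if n else 0.0)


# Depths: div=1, ul=2, li=3, li=3, img=2
count, avg_depth = dom_stats(
    "<div><ul><li>a</li><li>b</li></ul><img src='x.png'></div>"
)
```

Binning test pages by `count` or `avg_depth` and averaging Ele. IoU and SC Score per bin would reproduce the kind of trend plot described for the left side of Fig. 6.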

Figure 7: The HTML code generated by GPT-4 and the corresponding web page visual results generated by WebRPG-AR. Screenshots use green <img> placeholders.

6.6 Discussion on the Integration of LLM and WebRPG Model

Recently, LLMs have enabled the possibility of automatically generating HTML code [39]. Consequently, we hypothesize that integrating LLM into a WebRPG system could facilitate a fully automated web development workflow. We employ GPT-4 [70, 51] to validate this hypothesis. As Fig. 7 illustrates, WebRPG-AR effectively creates visual presentations of web pages based on generated HTML, demonstrating the potential of a fully automated web development workflow through the integration of LLM and WebRPG. Additional cases and the prompt for automatically generating HTML are provided in Sec. B.3.

7 Conclusion and Limitations

This paper presents WebRPG, a task that automates web design by generating rendering parameters for web elements from HTML. We introduce a new dataset, two baseline models, and evaluation metrics. Results show the autoregressive baseline most effectively generates web visual presentations.

Nevertheless, this study has limitations that warrant further investigation in future research. The proposed model can undergo fine-tuning to support design tasks such as partial web page design by masking specific elements. Additionally, it can be adapted to analyze raster images by replacing <img> tokens with image embeddings. The employment of established CSS frameworks like Tailwind (https://tailwindcss.com/) could standardize CSS, thereby potentially simplifying the WebRPG task. However, sourcing web pages based on these frameworks presents challenges. Furthermore, design options and control mechanisms of the results are worth exploring. Future research will address these aspects.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 62372408) and the National Key R&D Program of China (No. 2021YFB2701100).

References

  • [1] Alemerien, K., Magel, K.: Guievaluator: A metric-tool for evaluating the complexity of graphical user interfaces. In: SEKE. pp. 13–18 (2014)
  • [2] Azadi, S., Fisher, M., Kim, V.G., Wang, Z., Shechtman, E., Darrell, T.: Multi-content gan for few-shot font style transfer. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7564–7573 (2018)
  • [3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [4] Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: Maskgit: Masked generative image transformer. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 11305–11315 (2022)
  • [5] Chen, L., Chen, X., Zhao, Z., Zhang, D., Ji, J., Luo, A., Xiong, Y., Yu, K.: Websrc: A dataset for web-based structural reading comprehension. In: Conference on Empirical Methods in Natural Language Processing (2021)
  • [6] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Chan, W.: Wavegrad: Estimating gradients for waveform generation. In: International Conference on Learning Representations (2021)
  • [7] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Dehak, N., Chan, W.: Wavegrad 2: Iterative refinement for text-to-speech synthesis. arXiv preprint arXiv:2106.09660 (2021)
  • [8] Cheng, C.Y., Huang, F., Li, G., Li, Y.: Play: parametrically conditioned layout generation using latent diffusion. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23, JMLR.org (2023)
  • [9] Cyr, D., Head, M., Larios, H.: Colour appeal in website design within and across cultures: A multi-method evaluation. International journal of human-computer studies 68(1-2), 1–21 (2010)
  • [10] Deng, X., Shiralkar, P., Lockard, C., Huang, B., Sun, H.: Dom-lm: Learning generalizable representations for html documents. arXiv preprint arXiv:2201.10608 (2022)
  • [11] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
  • [12] Dong, Z., Tang, T., Li, L., Zhao, W.X.: A survey on long text modeling with transformers. arXiv preprint arXiv:2302.14502 (2023)
  • [13] Du, Y., Chen, Z., Jia, C., Yin, X., Li, C., Du, Y., Jiang, Y.G.: Context perception parallel decoder for scene text recognition. arXiv preprint arXiv:2307.12270 (2023)
  • [14] Flavian, C., Gurrea, R., Orus, C.: Web design: a key factor for the website success. Journal of Systems and Information Technology 11(2), 168–184 (2009)
  • [15] Fu, F., Chiu, S.Y., Su, C.H.: Measuring the screen complexity of web pages. In: Human Interface and the Management of Information. Interacting in Information Environments: Symposium on Human Interface 2007, Held as Part of HCI International 2007, Beijing, China, July 22-27, 2007, Proceedings, Part II. pp. 720–729. Springer (2007)
  • [16] Furht, B. (ed.): Cascading Style Sheets, pp. 58–58. Springer US, Boston, MA (2008)
  • [17] Gu, Z., Lou, J.: Data driven webpage color design. Computer-Aided Design 77, 46–59 (2016)
  • [18] Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y., et al.: Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196 (2024)
  • [19] Herzig, R., Bar, A., Xu, H., Chechik, G., Darrell, T., Globerson, A.: Learning canonical representations for scene graph to image generation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 210–227. Springer International Publishing, Cham (2020)
  • [20] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)
  • [21] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
  • [22] Hotti, A., Risuleo, R.S., Magureanu, S., Moradi, A., Lagergren, J.: The klarna product page dataset: A realistic benchmark for web representation learning. arXiv preprint arXiv:2111.02168 (2021)
  • [23] Hui, M., Zhang, Z., Zhang, X., Xie, W., Wang, Y., Lu, Y.: Unifying layout generation with a decoupled diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1942–1951 (2023)
  • [24] Inoue, N., Kikuchi, K., Simo-Serra, E., Otani, M., Yamaguchi, K.: Layoutdm: Discrete diffusion model for controllable layout generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10167–10176 (2023)
  • [25] Jyothi, A.A., Durand, T., He, J., Sigal, L., Mori, G.: Layoutvae: Stochastic scene layout generation from a label set. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9895–9904 (2019)
  • [26] Kaluarachchi, T., Wickramasinghe, M.: A systematic literature review on automatic website generation. Journal of Computer Languages p. 101202 (2023)
  • [27] Kikuchi, K., Inoue, N., Otani, M., Simo-Serra, E., Yamaguchi, K.: Generative colorization of structured mobile web pages. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3650–3659 (2023)
  • [28] Kikuchi, K., Otani, M., Yamaguchi, K., Simo-Serra, E.: Modeling visual containment for web page layout optimization. In: Computer Graphics Forum. vol. 40, pp. 33–44. Wiley Online Library (2021)
  • [29] Kikuchi, K., Simo-Serra, E., Otani, M., Yamaguchi, K.: Constrained graphic layout generation via latent optimization. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 88–96 (2021)
  • [30] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
  • [31] Koffka, K.: Principles of gestalt psychology (1955)
  • [32] Kong, Z., Ping, W., Huang, J., Zhao, K., Catanzaro, B.: Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761 (2020)
  • [33] Koo, H.: A survey on generative diffusion models for structured data. arXiv preprint arXiv:2306.04139 (2023)
  • [34] Kumar, V., Dhar, M., Khattar, D., Lal, Y.K., Mishra, A., Shrivastava, M., Varma, V.: Swde: A sub-word and document embedding based engine for clickbait detection. arXiv preprint arXiv:1808.00957 (2018)
  • [35] Lee, H.Y., Jiang, L., Essa, I., Le, P.B., Gong, H., Yang, M.H., Yang, W.: Neural design network: Graphic layout generation with constraints. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. pp. 491–506. Springer (2020)
  • [36] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., rahman Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L.: Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Annual Meeting of the Association for Computational Linguistics (2019)
  • [37] Li, C., Zhang, P., Wang, C.: Harmonious textual layout generation over natural images via deep aesthetics learning. IEEE Transactions on Multimedia 24, 3416–3428 (2022)
  • [38] Li, G.e.a.: Learning to denoise raw mobile ui layouts for improving datasets at scale. pp. 1–13 (2022)
  • [39] Li, J., Li, G., Li, Y., Jin, Z.: Enabling programming thinking in large language models toward code generation. arXiv preprint arXiv:2305.06599 (2023)
  • [40] Li, J., Yang, J., Zhang, J., Liu, C., Wang, C., Xu, T.: Attribute-conditioned layout gan for automatic graphic design. IEEE Transactions on Visualization and Computer Graphics 27(10), 4039–4048 (2020)
  • [41] Li, J., Xu, Y., Cui, L., Wei, F.: Markuplm: Pre-training of text and markup language for visually rich document understanding. In: Annual Meeting of the Association for Computational Linguistics (2021)
  • [42] Li, X., Thickstun, J., Gulrajani, I., Liang, P.S., Hashimoto, T.B.: Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems 35, 4328–4343 (2022)
  • [43] Li, X., Thickstun, J., Gulrajani, I., Liang, P.S., Hashimoto, T.B.: Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems 35, 4328–4343 (2022)
  • [44] Liu, T.F.e.a.: Learning design semantics for mobile apps. In: UIST. pp. 569–579 (2018)
  • [45] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2017)
  • [46] Lozhkov, A., Li, R., Allal, L.B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., et al.: Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173 (2024)
  • [47] Network, M.D.: Computed value - css: Cascading style sheets (2023)
  • [48] Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning. pp. 8162–8171. PMLR (2021)
  • [49] O’Donovan, P., Agarwala, A., Hertzmann, A.: Designscape: Design with interactive layout suggestions. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems. pp. 1221–1224 (2015)
  • [50] van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017)
  • [51] OpenAI: Gpt-4 technical report (2023)
  • [52] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
  • [53] O’Donovan, P., Agarwala, A., Hertzmann, A.: Learning layouts for single-page graphic designs. IEEE transactions on visualization and computer graphics 20(8), 1200–1213 (2014)
  • [54] Qiu, Q., Otani, M., Iwazaki, Y.: An intelligent color recommendation tool for landing page design. In: 27th International Conference on Intelligent User Interfaces. pp. 26–29 (2022)
  • [55] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2),  3 (2022)
  • [56] Razavi, A., van den Oord, A., Vinyals, O.: Generating diverse high-fidelity images with vq-vae-2. In: Neural Information Processing Systems (2019)
  • [57] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  • [58] Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.: Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)
  • [59] Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D.J., Norouzi, M.: Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(4), 4713–4726 (2022)
  • [60] Shao, Z., Gao, F., Qi, Z., Xing, H., Bu, J., Yu, Z., Zheng, Q., Liu, X.: Gem: Gestalt enhanced markup language model for web understanding via render tree. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 6132–6145 (2023)
  • [61] Thorlacius, L.: The role of aesthetics in web design. Nordicom Review 28 (05 2007)
  • [62] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  • [63] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • [64] Wang, P.: The influence of artificial intelligence on visual elements of web page design under machine vision. Computational Intelligence and Neuroscience 2022 (2022)
  • [65] Wertheimer, M.: Gestalt theory. (1938)
  • [66] Williams, R.: The non-designer’s design book: Design and typographic principles for the visual novice. Pearson Education (2015)
  • [67] Xiang, P., Yang, X., Shi, Y.: Web page segmentation based on gestalt theory. In: 2007 IEEE International Conference on Multimedia and Expo. pp. 2253–2256 (2007)
  • [68] Xiao, Y., Wu, L., Guo, J., Li, J., Zhang, M., Qin, T., Liu, T.Y.: A survey on non-autoregressive generation for neural machine translation and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 11407–11427 (2022)
  • [69] Xie, C., Huang, W., Liang, J., Huang, C., Xiao, Y.: Webke: Knowledge extraction from semi-structured web with pre-trained markup language model. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (2021)
  • [70] Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.C., Liu, Z., Wang, L.: The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421 (2023)
  • [71] Zhao, N., Cao, Y., Lau, R.W.: Modeling fonts in context: Font prediction on web designs. In: Computer Graphics Forum. vol. 37, pp. 385–395. Wiley Online Library (2018)
  • [72] Zhao, Z., Chen, L., Cao, R., Xu, H., Chen, X., Yu, K.: Tie: Topological information enhanced structural reading comprehension on web pages. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1808–1821 (2022)
  • [73] Zheng, X., Qiao, X., Cao, Y., Lau, R.W.: Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG) 38(4), 1–15 (2019)
  • [74] Zhou, M., Xu, C., Ma, Y., Ge, T., Jiang, Y., Xu, W.: Composition-aware graphic layout GAN for visual-textual presentation designs. In: Raedt, L.D. (ed.) Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022. pp. 4995–5001. ijcai.org (2022)
  • [75] Zhu, Y., Wu, Y., Olszewski, K., Ren, J., Tulyakov, S., Yan, Y.: Discrete contrastive diffusion for cross-modal and conditional generation. arXiv preprint arXiv:2206.07771 (2022)

Supplementary Material

A Additional Details

A.1 Details of Rendering Parameters

As described in Sec. 3.1, we use rendering parameters to standardize CSS because of its code complexity. The examples in Fig. 8 demonstrate this complexity. As shown on the left side of Fig. 8, CSS can be applied in different forms (see https://www.w3schools.com/css/css_howto.asp): inline styles, which style an HTML element directly via the “style” attribute; internal style sheets, which use “<style>” tags within HTML documents; and external style sheets, which link to separate CSS files. The middle of Fig. 8 showcases various CSS selectors (see https://www.w3schools.com/css/css_selectors.asp), including simple tag, class, and ID selectors, as well as complex attribute and descendant selectors. Furthermore, CSS follows certain rules regarding inheritance and overrides (see https://developer.mozilla.org/en-US/docs/Web/CSS/Inheritance). An example on the right side of Fig. 8 shows how the red color set by the .highlight class is overridden by the more specific selector #main-content p, which includes an ID selector, turning the color green.

Figure 8: Examples of CSS code complexity, showcasing various CSS forms (left), selector complexity (middle), and style inheritance and overrides (right).

The complexity of CSS makes direct generation of CSS impractical. Even parsing CSS code to obtain WebRPG task labels is challenging. Since browsers compute the final applied CSS property values (i.e., rendering parameters) for each element based on HTML and CSS to render web pages, we propose extracting each element’s RPs directly from the browser, as described in Sec. 3.2. This approach bypasses the need to parse CSS code, achieving the standardization of CSS.

Figure 9: An illustrative example of how rendering parameters are organized, including preprocessed HTML (left), JSON-stored rendering parameters (middle), and the CSS transformed from those RPs (right).
Table 3: The complete vocabulary of rendering parameters including all categories, their index ranges, and selected examples.
Category | Index Range | Examples
Integer Pixel | 0-1920 | 1px, 1052px, 1920px
Color | 1921-1966 | RGBA(153, 204, 0, 1), RGBA(255, 255, 255, 1)
Font Style | 1967-1969 | italic, oblique
Font Weight | 1970-1978 | 100, 500, 900
Line Height | 1979 | normal
Text Align | 1980-1985 | start, center, end
Text Decoration | 1986-1987 | none, underline
Text Transform | 1988-1991 | uppercase, capitalize
PAD | 1992 | PAD
Table 4: Index ranges for each rendering parameter in the vocabulary.
Rendering Parameter | Index Range
left | 0-1920
top | 0-1920
width | 0-1920
height | 0-1920
font-style | 1967-1969
font-weight | 1970-1978
font-size | 0-32
line-height | 0-50, 1979
text-align | 1980-1985
text-decoration | 1986-1987
text-transform | 1988-1991
color | 1921-1966
background-color | 1921-1966

In practice, we traverse the DOM tree in pre-order to assign a unique ID to each element, which is stored by modifying the element's class name, as shown on the left side of Fig. 9. We organize the rendering parameters using JSON, where the key is the element’s ID, as illustrated in the middle of Fig. 9. RPs can also be transformed into CSS that uses class selectors only, as demonstrated on the right side of Fig. 9.
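The organization described above can be sketched in a few lines of Python. This is a minimal illustration, not the released pipeline: the nested-dict tree, the `e<k>` ID scheme, and the helper names `assign_ids` and `rps_to_css` are all hypothetical, as are the example values.

```python
import json

def assign_ids(node, counter=None):
    """Assign IDs in pre-order: a parent first, then its children left to right.

    The "e<k>" naming is an assumption made for this sketch.
    """
    if counter is None:
        counter = [0]
    node["id"] = f"e{counter[0]}"
    counter[0] += 1
    for child in node.get("children", []):
        assign_ids(child, counter)
    return node

def rps_to_css(rps):
    """Turn {element_id: {property: value}} into CSS using class selectors only."""
    rules = []
    for element_id, params in rps.items():
        body = "\n".join(f"  {prop}: {value};" for prop, value in params.items())
        rules.append(f".{element_id} {{\n{body}\n}}")
    return "\n\n".join(rules)

# Hypothetical tree and rendering parameters, for illustration only.
tree = assign_ids({"tag": "div", "children": [
    {"tag": "p", "children": [{"tag": "span"}]},
    {"tag": "img"},
]})
rps = {
    "e0": {"left": "0px", "top": "0px", "width": "1920px", "height": "120px"},
    "e1": {"color": "rgba(153, 204, 0, 1)", "font-weight": "500"},
}
print(json.dumps(rps, indent=2))  # JSON keyed by element ID
print(rps_to_css(rps))            # equivalent class-selector CSS
```

Because every element receives its own class, the generated CSS needs no tag, ID, attribute, or descendant selectors, which sidesteps the specificity issues illustrated in Fig. 8.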

Additionally, the complete vocabulary of all rendering parameters is detailed in Tab. 3, and index ranges of each rendering parameter are presented in Tab. 4.
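As a reading aid, the index ranges in Tab. 4 can be written down as a lookup table with a validity check. The ranges below are copied from Tab. 4 and the PAD index from Tab. 3; the names `ALLOWED_RANGES`, `PAD_INDEX`, and `is_valid_index` are hypothetical, and the source does not state whether the baselines apply such a check.

```python
# Inclusive index ranges per rendering parameter, copied from Tab. 4.
ALLOWED_RANGES = {
    "left": [(0, 1920)],
    "top": [(0, 1920)],
    "width": [(0, 1920)],
    "height": [(0, 1920)],
    "font-style": [(1967, 1969)],
    "font-weight": [(1970, 1978)],
    "font-size": [(0, 32)],
    "line-height": [(0, 50), (1979, 1979)],  # pixel values or "normal"
    "text-align": [(1980, 1985)],
    "text-decoration": [(1986, 1987)],
    "text-transform": [(1988, 1991)],
    "color": [(1921, 1966)],
    "background-color": [(1921, 1966)],
}
PAD_INDEX = 1992  # PAD token index from Tab. 3

def is_valid_index(parameter, index):
    """Return True if `index` lies in one of the ranges allowed for `parameter`."""
    return any(low <= index <= high for low, high in ALLOWED_RANGES[parameter])

print(is_valid_index("line-height", 1979))  # True: the "normal" token
print(is_valid_index("font-size", 33))      # False: outside 0-32
```

One possible use, stated here only as an assumption, is to restrict a model's predictions for each parameter to its permitted part of the vocabulary.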

A.2 Details of Visual Complexity Metric

The Visual Complexity (VC) metric integrates three dimensions: color, size, and alignment. For any given web page, the three dimensions are defined as follows:

Color: The color metric measures the richness of colors and is defined as:

$$VC_{color}=\frac{1}{2N}\left(C_{c}+C_{bg}-2\right), \tag{7}$$

where $N$ is the number of elements, and $C_{c}$ and $C_{bg}$ are the counts of unique color and background-color attributes, respectively.
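A small worked example may help make the color term concrete. The sketch below assumes each element is a dict holding its `color` and `background-color` values; the function name `vc_color` and the choice to return 0 for an empty page are assumptions made for illustration.

```python
def vc_color(elements):
    """Color term of Visual Complexity: (C_c + C_bg - 2) / (2N)."""
    n = len(elements)
    if n == 0:
        return 0.0  # assumption: an empty page has no color complexity
    c_c = len({e["color"] for e in elements})              # unique text colors
    c_bg = len({e["background-color"] for e in elements})  # unique backgrounds
    return (c_c + c_bg - 2) / (2 * n)

# Hypothetical page with 4 elements, 2 text colors, and 3 background colors.
page = [
    {"color": "black", "background-color": "white"},
    {"color": "black", "background-color": "white"},
    {"color": "red", "background-color": "gray"},
    {"color": "red", "background-color": "blue"},
]
print(vc_color(page))  # (2 + 3 - 2) / (2 * 4) = 0.375
```

A page that uses a single text color and a single background color scores 0, which is why 2 is subtracted in the numerator.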

Size: The size metric measures the diversity of sizes among web page elements. In particular, it calculates the size diversity for all $N'$ parent elements and then computes the average. The formula is as follows:

$$VC_{size}=\frac{1}{N'}\sum_{i=1}^{N'}\left(\frac{DS_{i}-1}{NC_{i}}\right), \tag{8}$$

with $NC_{i}$ and $DS_{i}$ being the number of child elements of element $i$ and the number of distinct sizes among them, respectively.
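The size term can be sketched as follows. The sketch assumes that a "size" is a (width, height) pair and that the input lists the child sizes of each parent element; both the representation and the name `vc_size` are assumptions, since the text does not spell out how distinct sizes are compared.

```python
def vc_size(children_sizes_by_parent):
    """Size term of Visual Complexity: mean over parents of (DS_i - 1) / NC_i.

    `children_sizes_by_parent` holds one list of (width, height) tuples per
    parent element. Elements without children are not parents and are skipped.
    """
    parents = [sizes for sizes in children_sizes_by_parent if sizes]
    if not parents:
        return 0.0  # assumption: no parent elements means no size diversity
    total = sum((len(set(sizes)) - 1) / len(sizes) for sizes in parents)
    return total / len(parents)

# Hypothetical page with two parent elements.
sizes = [
    [(100, 50), (100, 50), (200, 50)],  # DS = 2, NC = 3 -> 1/3
    [(10, 10), (20, 20)],               # DS = 2, NC = 2 -> 1/2
]
print(vc_size(sizes))  # (1/3 + 1/2) / 2 = 5/12
```

A parent whose children all share one size contributes 0 to the average.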

Alignment: The complexity of a web page inversely correlates with the number of pairwise alignments [15]. To simplify, this metric applies only to leaf nodes. The calculation formula is as follows:

$$VC_{alg}=1-\frac{1}{N_{leaf}(N_{leaf}-1)}\sum_{j=1}^{N_{leaf}}\sum_{i\neq j}^{N_{leaf}}ALG_{ij}, \tag{9}$$

where $N_{leaf}$ denotes the number of leaf node elements, and $ALG_{ij}$ is a binary indicator of alignment (1) or misalignment (0) between elements $i$ and $j$.
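The alignment term sums over ordered pairs of leaf elements, which matches the $N_{leaf}(N_{leaf}-1)$ normalizer. The sketch below is illustrative only: the text does not define when two elements count as aligned, so the rule used here (two boxes share a left, right, top, or bottom edge) is an assumption, as are the (left, top, width, height) box format and the function names.

```python
def aligned(a, b):
    """Assumed alignment rule: the boxes share a left, top, right, or bottom edge."""
    a_left, a_top, a_width, a_height = a
    b_left, b_top, b_width, b_height = b
    return (
        a_left == b_left
        or a_top == b_top
        or a_left + a_width == b_left + b_width
        or a_top + a_height == b_top + b_height
    )

def vc_alg(leaves):
    """Alignment term: 1 minus the fraction of ordered leaf pairs that are aligned."""
    n = len(leaves)
    if n < 2:
        return 0.0  # assumption: fewer than two leaves gives no misalignment
    aligned_pairs = sum(
        aligned(leaves[i], leaves[j])
        for j in range(n)
        for i in range(n)
        if i != j
    )
    return 1 - aligned_pairs / (n * (n - 1))

# Hypothetical leaves: the first two share a left edge, the third aligns with none.
leaves = [(0, 0, 10, 10), (0, 20, 10, 10), (50, 55, 7, 3)]
print(vc_alg(leaves))  # 2 aligned ordered pairs out of 6 -> 1 - 2/6
```

A page in which every pair of leaves is aligned scores 0, and a page with no aligned pairs scores 1.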

The overall VC is the sum of the three metrics: $VC=VC_{color}+VC_{alg}+VC_{size}$.

A.3 Dataset Details

Figure 10: Histogram showing the distribution of Visual Complexity (Sec. A.2) values across all samples. Red indicates samples that are filtered out, while blue represents those retained in the dataset.
Figure 11: Histograms showing the distributions of element count and average element depth across all samples in the dataset.

The distribution of Visual Complexity (Sec. A.2) values across all samples is illustrated in Fig. 10. In our dataset, samples with a VC value of less than 0.1 are filtered out, resulting in a remaining subset where the VC distribution is relatively concentrated and approximates a normal distribution, thereby helping to mitigate the impact of extreme samples on training. Additionally, to further investigate our dataset, we visualize two crucial statistical values, element count and the average depth of elements, in Fig. 11. This visualization indicates that the dataset lacks samples containing a large number of elements or considerable element depths.

A.4 Implementation details of FID model

As described in Sec. 6.1.1, the FID model is a binary classifier that incorporates the VAE described in Sec. 5.2, four transformer layers, and a classification head. A special CLS vector, representing all RPs, is used as the classification feature. The rest of the input is the same as for the model in Sec. 5.4. Three kinds of noise are designed to corrupt the real data: perturbing the original values with a fixed variance, randomly substituting elements with synthetic ones, and randomly swapping elements. The specific FID models for layout and style, namely $FID_{layout}$ and $FID_{style}$, are trained by masking irrelevant inputs. Specifically, $FID_{layout}$ processes only the layout with the style masked, and $FID_{style}$ processes only the style with the layout masked. The FID models for overall, layout, and style achieve classification accuracies of 88.8%, 95.5%, and 92.4%, respectively.
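The three corruption strategies can be illustrated with a short sketch. It is not the authors' implementation: elements are represented here as plain lists of numbers, the perturbation is assumed to be zero-mean Gaussian noise, and the function names, probabilities, and the `random_box` generator are all hypothetical.

```python
import random

def perturb(elements, sigma, rng):
    """Add zero-mean Gaussian noise with a fixed standard deviation to every value."""
    return [[value + rng.gauss(0.0, sigma) for value in element] for element in elements]

def substitute(elements, prob, make_synthetic, rng):
    """Replace each element with a synthetic one with probability `prob`."""
    return [
        make_synthetic(rng) if rng.random() < prob else list(element)
        for element in elements
    ]

def swap(elements, num_swaps, rng):
    """Exchange the positions of randomly chosen pairs of elements."""
    out = [list(element) for element in elements]
    for _ in range(num_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out

def random_box(rng):
    """Hypothetical synthetic element: four values drawn uniformly from 0..1920."""
    return [rng.uniform(0, 1920) for _ in range(4)]

rng = random.Random(0)
real = [[0.0, 0.0, 100.0, 20.0], [0.0, 30.0, 100.0, 20.0], [0.0, 60.0, 50.0, 10.0]]
noisy = perturb(real, sigma=2.0, rng=rng)
replaced = substitute(real, prob=0.5, make_synthetic=random_box, rng=rng)
shuffled = swap(real, num_swaps=1, rng=rng)
```

In this setup the corrupted samples would serve as the negative class and the untouched samples as the positive class when training the binary classifier.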

Table 5: The prompt template for the GPT-4 experiment in Sec. 6.3.
Prompt | You are an exceptional web designer. Please create the corresponding CSS code based on the HTML code I have provided, so as to craft a well-designed visual presentation for the web page. You can only use the following CSS properties: "left", "top", "width", "height", "font-style", "font-weight", "font-size", "line-height", "color", "text-align", "text-decoration", "text-transform", "background-color". Please exercise caution in controlling the size of the image, as using the original image dimensions directly may result in excessive spatial occupation. Here are several demonstrations:{Demonstrates}. Below is the HTML code and do not reply with anything other than CSS code: {HTML_Code}.
Slot: Demonstrates | The HTML-CSS pairs for three selected web page segments.
Slot: HTML_Code | HTML code of the given web page.

A.5 Implementation details of WebRPG Baselines

The backbone of WebRPG-AR consists of 6-layer transformers for both the encoder and the decoder, and WebRPG-DM is a 12-layer U-ViT. The mask scheduling function $\gamma(r)$ is a cosine function, the number of diffusion time steps $T$ follows [21] and is set to 1000, and $\lambda_{KL}$ is set to 1e-6. For optimization, AdamW [45] is used with a learning rate of 1.2e-4, $\beta_{1}$ of 0.9, and $\beta_{2}$ of 0.99.

The prompt template for the LLMs experiment in Sec. 6.3 is detailed in Tab. 5. Due to the extensive length of textual representation for each element’s RPs, as shown on the right side of Fig. 9, we opt to have LLMs directly generate the CSS code. The specific steps for conducting the LLMs experiment are:

  1. Use the prompt to generate CSS code via LLMs.
  2. Use a browser to render the web page with the given HTML and the CSS code generated by LLMs.
  3. Extract the RPs for all elements, employing the method in Sec. 4.1.
  4. Evaluate these RPs using the metrics in Sec. 6.1.
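The four steps above can be read as a small pipeline. The sketch below only fixes the order of operations; the LLM call, the browser-based rendering and extraction, and the metrics are passed in as callables because their concrete implementations are not given here. All names (`run_llm_experiment`, `generate_css`, `render_and_extract`, `evaluate`) are hypothetical, and the stubs at the bottom are placeholders rather than real components.

```python
def run_llm_experiment(html, prompt_template, demonstrations,
                       generate_css, render_and_extract, evaluate):
    """Run the four steps in order and return the evaluation result."""
    # Step 1: fill the slots of the prompt template and ask the LLM for CSS.
    prompt = (prompt_template
              .replace("{Demonstrates}", demonstrations)
              .replace("{HTML_Code}", html))
    css = generate_css(prompt)
    # Steps 2 and 3: render HTML plus CSS in a browser, then read back the RPs.
    rps = render_and_extract(html, css)
    # Step 4: score the extracted RPs.
    return evaluate(rps)

# Placeholder stubs standing in for the LLM, the browser, and the metrics.
result = run_llm_experiment(
    html="<p class='e0'>hi</p>",
    prompt_template="Demonstrations: {Demonstrates}. HTML: {HTML_Code}",
    demonstrations="(omitted)",
    generate_css=lambda prompt: ".e0 { color: red; }",
    render_and_extract=lambda html, css: {"e0": {"color": "red"}},
    evaluate=lambda rps: {"num_elements": len(rps)},
)
print(result)
```

Keeping the three external components behind callables lets the same driver be reused for the prompt variants in Tab. 9 by swapping only the template.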

Figure 12: Additional visualization of baseline-generated results. The screenshots focus on areas with elements.

B Additional Results

B.1 Additional Cases of Baseline-Generated Results

We present additional results from WebRPG baselines in Fig. 12. The baselines perform comparably to the results reported in Sec. 6.3. Additionally, Fig. 13 displays web page variants generated by WebRPG-AR from the same HTML, each produced by a separate inference run. The differences in layout and style among these variants indicate that WebRPG-AR can generate diverse web pages while maintaining semantic coherence.

Figure 13: The web page variants generated by WebRPG-AR based on the same HTML.
Figure 14: The HTML code generated by GPT-4 and the corresponding web page visual results generated by WebRPG-AR. Screenshots use green <img> placeholders because GPT-4 generates fictitious source addresses.
Table 6: FID on rendered web page screenshots.
Metric | WebRPG-AR | GPT-4 | WebRPG-DM | Real Web Page
$FID_{Screenshot}$ | 3.2102 | 15.515 | 33.040 | 1.1156

B.2 The FID on Screenshots of Rendered Web Pages

The FID on screenshots of rendered web pages is shown in Tab. 6.

B.3 Further Cases of Integrating LLM with WebRPG Model

Fig. 14 showcases more cases of WebRPG-AR creating visual presentations of web pages based on HTML code generated by GPT-4. The prompt template for automatically generating HTML is in Tab. 7. The prompt encompasses human-authored descriptions of web design ideas, with an example shown in Tab. 8.

Figure 15: Human pairwise comparison evaluation results.
Table 7: The prompt template for automatically generating HTML.
Prompt | You are a web developer. Please generate the HTML code for a web page with a caption of {Deign_Idea}.
Slot: Deign_Idea | Human-authored descriptions of web design ideas, with an example shown in Tab. 8.
Figure 16: An example for visualizing style consistency. Notably, $\hat{W_{1}}$ and $\hat{W_{2}}$ are artificially created for demonstration purposes.
Figure 17: A visualization of the style consistency subset based on a real web page. The style consistency subset is defined in Sec. 6.1.3.

B.4 Human Evaluation

We conduct a human evaluation using pairwise comparisons. We randomly select 100 test samples and generate visual presentations using WebRPG-AR, WebRPG-DM, and GPT-4. Five human annotators evaluate each pair and either choose the superior presentation or declare a tie. The results, shown in Fig. 15, align with the objective evaluations in Tab. 1.

Table 8: An example of web design ideas described by humans.
This web page showcases the “Rumble Band for 38mm Apple Watch,” offered at $19.99. It’s identified as the X-Doria Rumble Band and is noted for its compatibility with the 38mm Apple Watch Series 1, 2, 3, and Nike Edition. Highlighted on the page are customer assurances including a lifetime warranty, complimentary shipping on all orders, and a 30-day hassle-free return policy. A conspicuous “Add to Cart” button is prominently displayed. The product’s image is designed to highlight its appearance and design features.

C An Example Explanation of SC Score

Fig. 16 provides an example to explain the SC Score further. The elements representing price (marked with a green box, hereafter termed price elements) on the real web page $W$ and on generated web page 1, $\hat{W_{1}}$, have differing styles in terms of font color and size. However, these differences do not affect the perception of price elements, as their style remains consistent within each individual web page. In contrast, generated web page 2, $\hat{W_{2}}$, changes just one price element, which leads to confusion when perceiving the price elements. Although $\hat{W_{2}}$ seems more visually similar to $W$ because only one element differs, from a semantic perspective, $\hat{W_{1}}$ is more coherent. Therefore, the SC Score evaluates whether elements that share a style on the real web page maintain that consistency on the generated page, beyond just visual similarity. Additionally, Fig. 17 provides a visualization of the style consistency subset for a real web page.

D Further Discussion on the Performance of LLM in WebRPG Task

As described in Sec. 6.2, we employ GPT-4 as a representative for LLMs. Due to the complexity of CSS code practices and the noise in actual web pages, directly fine-tuning LLMs is not feasible. Consequently, we do not conduct fine-tuning experiments. Moreover, to further explore the performance of GPT-4 in WebRPG tasks, we conduct two qualitative experiments. Tab. 9 details the prompt templates. The first experiment takes as input the HTML and captions of the original web page screenshots. The second experiment takes the HTML, these captions, and the screenshots themselves. Note that the additional inputs carry visual information from the original web pages and therefore serve essentially as a form of ground truth. The second experiment and the generation of web page screenshot captions both leverage the multimodal capabilities of GPT-4V (see https://openai.com/research/gpt-4v-system-card). Fig. 18 presents visualizations of selected cases, showing that the additional data does not enhance GPT-4’s performance. Given that these two qualitative experiments involve ground truth inputs, we do not include them in the main text or conduct quantitative experiments.

Figure 18: Further qualitative evaluation of GPT-4’s performance in WebRPG task. Notably, the “GPT-4 based on HTML” group is the experiment in Sec. 6.3.
Table 9: The prompts for Sec. D. “H.”, “C.”, and “S.” denote “HTML”, “caption” and “screenshot”, respectively.
Information | Prompt
H.+C. | You are an exceptional web designer. Please create the corresponding CSS code based on the HTML code I have provided, so as to craft a well-designed visual presentation for the web page. Furthermore, for better comprehension of the original web page design, here is a detailed caption: {Caption}. You can only use the following CSS properties: "left", "top", "width", "height", "font-style", "font-weight", "font-size", "line-height", "color", "text-align", "text-decoration", "text-transform", "background-color". Please exercise caution in controlling the size of the image, as using the original image dimensions directly may result in excessive spatial occupation. Here are several demonstrations:{Demonstrates}. Below is the HTML code and do not reply with anything other than CSS code: {HTML_Code}.
H.+C.+S. | You are an exceptional web designer. Please create the corresponding CSS code based on the HTML code and screenshot I have provided, so as to craft a well-designed visual presentation for the web page. Furthermore, for better comprehension of the original web page design, here is a detailed caption: {Caption}. You can only use the following CSS properties: "left", "top", "width", "height", "font-style", "font-weight", "font-size", "line-height", "color", "text-align", "text-decoration", "text-transform", "background-color". Please exercise caution in controlling the size of the image, as using the original image dimensions directly may result in excessive spatial occupation. Here are several demonstrations:{Demonstrates}. Below is the HTML code and do not reply with anything other than CSS code: {HTML_Code}.
Slot: Caption | Captions from the original web page screenshots.
Slot: HTML_Code | HTML code of the given web page.
Slot: Demonstrates | The HTML-CSS pairs for three selected web page segments.