A Novel Vision Transformer based Load Profile Analysis using Load Images as Inputs

Hyeonjin Kim, Yi Hu, Kai Ye, Ning Lu
North Carolina State University
Raleigh, NC 27606, USA
{hkim66, yhu28, kye3, nlu2}@ncsu.edu

Abstract

This paper introduces ViT4LPA, an innovative Vision Transformer (ViT) based approach for Load Profile Analysis (LPA). We transform time-series load profiles into load images. This allows us to leverage the ViT architecture, originally designed for image processing, as a pre-trained image encoder to uncover latent patterns within load data. ViT is pre-trained using an extensive load image dataset, comprising 1M load images derived from smart meter data collected over a two-year period from 2,000 residential users. The training methodology is self-supervised, masked image modeling, wherein masked load images are restored to reveal hidden relationships among image patches. The pre-trained ViT encoder is then applied to various downstream tasks, including the identification of electric vehicle (EV) charging loads and behind-the-meter solar photovoltaic (PV) systems and load disaggregation. Simulation results illustrate ViT4LPA’s superior performance compared to existing neural network models in downstream tasks. Additionally, we conduct an in-depth analysis of the attention weights within the ViT4LPA model to gain insights into its information flow mechanisms.

Index Terms:

Image processing, Load analysis, Pre-trained model, Smart meter data, Vision transformer

I Introduction

Pre-trained neural network models have been widely used in Natural Language Processing (NLP) [1] and Computer Vision (CV) [2] tasks. Compared to training a model from scratch, using pre-trained models reduces reliance on labeled data and saves considerable time and computational resources when performing downstream tasks [3]. In NLP, large pre-trained language models such as BERT and GPT have transformed language comprehension and generation. These models are trained on extensive text corpora, which allows them to capture intricate linguistic structures and contextual nuances. Consequently, NLP has progressed remarkably in recent years, with pre-trained models serving as the foundation for a diverse array of downstream applications, including sentiment analysis, machine translation, text summarization, and question answering.

The transformer model exhibits exceptional scalability, surpassing earlier neural network architectures such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs). This led to the creation of the Vision Transformer (ViT) [4], specifically designed for CV tasks. In contrast to conventional CNNs, ViT divides images into fixed-size patches, linearly projects them, and employs self-attention mechanisms to capture long-range dependencies among patches. ViT has since excelled in a wide variety of CV tasks, including image classification, object detection, and segmentation. The success of ViT in CV underscores the potential of applying transformer architectures across different domains.

In the field of power system analysis, load profile analysis (LPA) is becoming increasingly vital for performing tasks like load disaggregation, model parameterization, customer segmentation, load flexibility analysis, and identifying behind-the-meter resources. However, a distinctive challenge in this field is the scarcity of publicly accessible, non-sensitive power system datasets. The sensitivity and proprietary nature of power grid operations make it not only financially burdensome but often impractical for researchers and developers to amass the substantial volume of training data required for training resilient machine learning models.

Figure 1: An illustration of the ViT4LPA architecture. (a) Profile-to-image conversion, (b) ViT4LPA workflow, and (c) Pre-training process.

Hence, leveraging extensive data for pre-training a model applicable to subsequent LPA tasks, akin to the use of BERT in NLP and ViT in CV, can effectively address the challenge of restricted data accessibility. To date, the development of a versatile pre-trained model suitable for various downstream tasks in the field of LPA has not received adequate attention from researchers in the power system domain. Recent research endeavors have prominently revolved around transfer learning, a technique wherein neural networks, initially trained for specific supervised learning tasks, are repurposed to perform related tasks by leveraging the knowledge they have accrued. In [5], cross-domain transfer learning was introduced, aiming to apply latent features learned from one appliance to another, particularly for non-intrusive load monitoring tasks. In [6], the authors explored LPA through representation learning using convolutional autoencoders. In our recent work [7], we introduced a BERT-based load profile inpainting approach, serving as a foundational model to streamline the restoration of missing data tasks. Nonetheless, it is currently primarily focused on addressing a specific LPA task.

Thus, in this paper, we present a pre-trained model framework tailored for LPA. Our main contribution is the introduction of ViT4LPA, a novel approach based on Vision Transformers (ViT) specifically designed for LPA tasks. This approach marks a significant shift in the LPA domain by not only reducing the dependence on labeled datasets but also enabling its seamless application across a diverse spectrum of downstream tasks.

II Methodology

This section introduces ViT4LPA, an innovative Vision Transformer (ViT) based approach for LPA.

II-A Profile-Image Conversion

One of our key contributions is to base the LPA study on load images derived from time-series load profiles. Traditional LPA studies typically use time-series load profiles as inputs to capture temporal cyclic characteristics, including daily, weekly, or monthly variation patterns. More recently, a few authors have proposed converting load profiles into color-coded load images for autoencoder-based clustering, as in [6], and for BERT-based missing data recovery, as in [7]. Motivated by those results, we use load images, as opposed to load profiles, to pre-train a ViT-based foundational model that can be fine-tuned with a small amount of data when conducting downstream LPA tasks.

As illustrated in Fig. 1(a), we generate a load image comprising three channels. Channels 1 to 3 contain smart meter load, temperature, and irradiance profiles, respectively. This approach allows the encoder to capture not only hidden patterns within the load data but also their correlation with temperature and irradiance variations. The $x$-axis of the image corresponds to the number of data points ( $N_{T}$ ) within one day, and the $y$-axis represents the number of days ( $N_{D}$ ). To convert a data point into a color patch, we first linearly project the data point onto the $[0,1]$ range. For instance, consider the case where the minimum load consumption ( $p^{-}$ ) is $-4$ kW and the maximum load consumption ( $p^{+}$ ) is 24 kW. In this scenario, the load consumption at hour $t$ , $p_{t}$ , is normalized as $(p_{t}-p^{-})/(p^{+}-p^{-})$ .
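The profile-to-image conversion described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names, the synthetic profiles, and the temperature and irradiance bounds are assumptions introduced here; only the load bounds ($-4$ kW and 24 kW), the three-channel layout, and the $N_{D}\times N_{T}$ arrangement come from the text.

```python
import numpy as np

N_T, N_D = 24, 24  # data points per day, number of days


def normalize(x, lo, hi):
    """Linearly project x onto [0, 1] given the channel minimum lo and maximum hi."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)


def to_load_image(load, temp, irr, bounds):
    """Stack three normalized profiles into an (N_D, N_T, 3) load image.

    Each profile is a flat array of N_D * N_T hourly values. Row d of the
    image holds day d and column t holds hour t.
    """
    channels = [
        normalize(profile, *bounds[name]).reshape(N_D, N_T)
        for name, profile in (("load", load), ("temp", temp), ("irr", irr))
    ]
    return np.stack(channels, axis=-1)


# Load bounds follow the example in the text; the temperature and
# irradiance bounds are hypothetical values chosen for illustration.
bounds = {"load": (-4.0, 24.0), "temp": (-10.0, 45.0), "irr": (0.0, 1000.0)}

# Synthetic stand-ins for smart meter, temperature, and irradiance data.
rng = np.random.default_rng(0)
load = rng.uniform(-4.0, 24.0, N_D * N_T)
temp = rng.uniform(-10.0, 45.0, N_D * N_T)
irr = rng.uniform(0.0, 1000.0, N_D * N_T)

image = to_load_image(load, temp, irr, bounds)
print(image.shape)                  # (24, 24, 3)
print(normalize(10.0, -4.0, 24.0))  # 0.5
```

With the bounds from the text, a 10 kW reading maps to $(10-(-4))/(24-(-4))=0.5$, i.e., the middle of the color scale.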

Load images excel at condensing vast amounts of multi-modal data, encompassing variables like load, temperature, and solar irradiance, facilitating the streamlined consolidation of information. This proves especially advantageous when compressing monthly or yearly data into load images, aligning profiles based on their cyclical characteristics. Through the transformation of load profiles into load images, we can seamlessly apply the ViT model [4], initially designed for computer vision tasks, to effectively handle LPA tasks.

II-B Vision Transformer Encoder Inputs

In Fig. 1(b), we present the workflow of ViT4LPA. To directly apply the ViT model architecture, we initially partition a load image comprising $N_{D}\times N_{T}$ pixels into $\frac{{N_{D}}\times{N_{T}}}{{N_{P}}^{2}}$ image patches, ensuring each patch contains $N_{P}\times N_{P}$ pixels. Subsequently, these image patches are flattened into a sequence of patches, ready to be fed as inputs to the ViT model.

Note that the division of load patches is contingent on both the data resolution ( $N_{T}$ ) and data duration ( $N_{D}$ ). In this paper, due to the page limit, we will focus on introducing the ViT4LPA architecture and workflow. The detailed discussion regarding the selection of $N_{P}$ in relation to $N_{T}$ and $N_{D}$ will be presented in our follow-up journal paper. Thus, we fix ${N_{T}}=24$ hours, ${N_{D}}=24$ days, and $N_{P}=4$ . In this setup, a load image will contain 24 rows (representing the number of days) and 24 columns (representing the number of data points in a day). After partitioning into $6\times 6$ image patches, each patch contains $4\times 4$ color patches. We then flatten the 36 image patches and send them directly to the ViT encoder (the blue block in Fig. 1(b)) for processing. The descriptions of the ViT encoder model can be found in [4].
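As a rough sketch of the partitioning step, the snippet below splits a $24\times 24$ three-channel load image into 36 non-overlapping $4\times 4$ patches and flattens each one. The helper name `patchify` and the row-major patch ordering are assumptions made for illustration; the paper does not specify the ordering.

```python
import numpy as np


def patchify(image, n_p):
    """Split an (N_D, N_T, C) image into flattened n_p x n_p patches.

    Returns an array of shape (num_patches, n_p * n_p * C). Patches are
    ordered row by row across the patch grid (an assumed convention).
    """
    n_d, n_t, c = image.shape
    assert n_d % n_p == 0 and n_t % n_p == 0, "image size must be divisible by n_p"
    rows, cols = n_d // n_p, n_t // n_p
    # Separate the patch-grid axes from the within-patch axes.
    patches = image.reshape(rows, n_p, cols, n_p, c)
    # Reorder to (patch row, patch col, pixel row, pixel col, channel).
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, n_p * n_p * c)


image = np.arange(24 * 24 * 3, dtype=float).reshape(24, 24, 3)
patches = patchify(image, n_p=4)
print(patches.shape)  # (36, 48)
```

Each flattened patch holds $4\times 4\times 3=48$ values, one $4\times 4$ block from each of the load, temperature, and irradiance channels.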

The load embeddings produced by the pre-trained ViT encoder can be effectively used for various downstream LPA tasks. In this paper, we use the behind-the-meter load identification task and load disaggregation task as examples to showcase the efficacy of ViT4LPA. ViT4LPA will be employed to identify the following three types of distributed energy resources: electric vehicle (EV) charging load, Photovoltaic (PV) generation, and Heating, Ventilation, and Air-Conditioning (HVAC) load.

II-C Pre-training ViT4LPA using Masked Image Modeling Tasks

In Fig. 1(c), we describe the architecture of the masked autoencoder network used to pre-train the ViT4LPA encoder, following the masked autoencoder scheme introduced in [2]. The training task involves restoring the masked image patches using a reconstruction decoder. The input to this decoder is the load image embedding produced by the ViT encoder using unmasked image patches as inputs. Since the ViT-generated embeddings cover only the unmasked image patches, mask embeddings must be inserted so that the decoder input matches the dimensionality of the original load image.

As depicted in Fig. 1(c), 18 image patches are subjected to masking, leading to the ViT encoder receiving only the visible 18 image patches as input. Consequently, the output of the ViT encoder consists of only 18 embeddings, requiring the insertion of 18 mask embeddings to achieve the necessary 36 embeddings for the reconstruction decoder. The 36 restored image patches by the decoder will be rearranged back to a load image and compared with the original load image to calculate losses.
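The bookkeeping in this step (18 visible patches in, 18 mask embeddings inserted, 36 embeddings out) can be sketched as follows. This is an illustration under stated assumptions: the checkerboard layout is one possible grid-masking pattern and is not confirmed by the paper, the embedding width of 8 is arbitrary, random vectors stand in for the encoder outputs, and the mask embedding is set to zeros although it would normally be a learned vector.

```python
import numpy as np

GRID = 6       # 6 x 6 image patches per load image
EMBED_DIM = 8  # illustrative embedding width (hypothetical)


def checkerboard_mask(grid):
    """Boolean vector of length grid * grid; True marks a masked patch.

    Assumed pattern: every other patch is masked, as on a checkerboard.
    """
    r, c = np.indices((grid, grid))
    return ((r + c) % 2 == 1).ravel()


def assemble_decoder_input(visible_embeddings, mask, mask_embedding):
    """Put encoder outputs at visible positions and the shared mask embedding elsewhere."""
    full = np.tile(mask_embedding, (mask.size, 1))
    full[~mask] = visible_embeddings
    return full


mask = checkerboard_mask(GRID)
num_visible = int((~mask).sum())

rng = np.random.default_rng(0)
visible = rng.normal(size=(num_visible, EMBED_DIM))  # stand-in for encoder outputs
mask_embedding = np.zeros(EMBED_DIM)                 # learned in practice

decoder_input = assemble_decoder_input(visible, mask, mask_embedding)
print(int(mask.sum()), visible.shape, decoder_input.shape)  # 18 (18, 8) (36, 8)
```

Keeping the visible embeddings at their original grid positions lets the decoder output be rearranged directly into a full load image for the reconstruction loss.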

Various masking strategies exist, including grid masking, random masking, and span masking. Due to page limitations, this paper considers only grid masking. This choice aligns with the downstream LPA tasks discussed in this paper, i.e., behind-the-meter load identification and load disaggregation, for which grid masking has been found to train more efficiently than random masking. A more in-depth examination of how different masking strategies (e.g., random masking or variable grid masking) affect downstream task performance will be presented in our follow-up journal paper.

III Simulation Results

In this section, we showcase the efficacy of the proposed pre-trained ViT4LPA model in three downstream tasks (PV identification, EV identification, and HVAC load disaggregation), using smart meter data as inputs.

III-A Simulation Setup

To construct the training dataset for the ViT encoder, we generate 4,000 sets of 1-hour resolution yearly load profiles. The original datasets are collected by the Pecan Street project [8]. The load profiles are metered from 150 households in Austin, Texas, and include sub-metered PV, EV, and HVAC load consumption. The dataset spans 2 years, with a 1-minute data resolution. To streamline the load identification task, we downsample the 1-minute data to 1-hour resolution. The data augmentation strategy introduced in [9] is used to generate additional training examples and improve the model’s robustness. The dataset is partitioned by customer: the training set comprises load profiles from 100 users, while the testing set includes the remaining 50 customers, so that the models are evaluated on customers not seen during training.

III-B Performance on the Masked Image Modeling Tasks

The hyperparameters used for pre-training the ViT4LPA encoder model (see the network architecture depicted in Fig. 1(c)) are listed in Table I. As shown in Fig. 2(a), the model can reconstruct the load image from partial information (i.e., unmasked image patches) with satisfactory performance. Interestingly, the masked autoencoder is more effective at recovering missing patches during the morning-to-noon period (8:00-12:00), which aligns with the typical peak consumption hours for the specific customer. Fig. 2(b) shows the $n$MAE distribution of the masked image reconstruction. The error is concentrated in the 1-2% range, with a mean of 1.40% and a standard deviation of 0.515%.

TABLE I: Parameters Used in the ViT4LPA Pre-training Process

Network	Layers	Heads	Proj. Dim. ( $D$ )	Parameters
Encoder	3	4	128	1.0M
Decoder	2	2	32	2.0M
Hyperparam.	Batch size	Dropout	Optimizer	Epochs
Hyperparam.	64	0.1	Adam	50

Subsequently, the cosine similarity matrix is employed to assess the similarity of position embeddings among patches. In Fig. 3(a), we depict heatmaps showcasing 36 similarity matrices. Each matrix shows the cosine similarity between the position embedding of an image patch at a specific position in the load image and the position embeddings of all 36 image patches, including itself. To illustrate, in Fig. 3(b), the top-left similarity matrix represents the cosine similarity of position embeddings between the first image patch (consisting of 48 data points from days 1-4 and hours 1-4 across the three channels) and all 36 image patches.

Notably, when considering photo image patches, we observe significant similarities between a patch and its neighboring patches, regardless of their alignment in columns, diagonally, or rows, as discussed in [4]. In contrast, for load image patches, high similarity is primarily found among patches aligned in columns, indicating strong correlations among profiles at the same time of day across different days. For instance, data from hours 1-4 exhibits a strong correlation with data from hours 1-4 on all other days (i.e., in the column-wise direction). However, this correlation may be less prominent when comparing data from adjacent hours, such as between hours 1-4 and hours 5-8 (i.e., along the row-wise direction).
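The similarity analysis above can be reproduced in outline as follows. In this sketch, random vectors stand in for the learned position embeddings, and the helper `mean_similarity_by_alignment` is a hypothetical summary statistic introduced here (it is not used in the paper) to compare column-aligned and row-aligned patches numerically.

```python
import numpy as np


def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between the rows of emb."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T


def mean_similarity_by_alignment(sim, grid):
    """Mean similarity to other patches in the same column vs. the same row."""
    idx = np.arange(grid * grid)
    row, col = idx // grid, idx % grid
    not_self = ~np.eye(grid * grid, dtype=bool)
    same_col = (col[:, None] == col[None, :]) & not_self
    same_row = (row[:, None] == row[None, :]) & not_self
    return sim[same_col].mean(), sim[same_row].mean()


GRID, EMBED_DIM = 6, 128
rng = np.random.default_rng(0)
pos_emb = rng.normal(size=(GRID * GRID, EMBED_DIM))  # stand-in for learned embeddings

sim = cosine_similarity_matrix(pos_emb)          # shape (36, 36)
heatmaps = sim.reshape(GRID * GRID, GRID, GRID)  # one 6 x 6 heatmap per reference patch
col_mean, row_mean = mean_similarity_by_alignment(sim, GRID)
print(heatmaps.shape)  # (36, 6, 6)
```

For the trained ViT4LPA encoder, the observation in the text would correspond to `col_mean` being clearly larger than `row_mean`; with the random stand-in embeddings used here, both values are close to zero.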

In Figs. 4(a) to (c), the mean self-attention matrices, averaged across multiple attention heads within each layer, are depicted. Similar to Fig. 3(a), the 1-by-36 attention weights are reshaped into a 6-by-6 matrix to incorporate the positional information of patches. Analyzing the mean attention matrices provides valuable insights into the local and global information flow within each layer and the receptive field of the model. An important insight obtained from the study is that in the first layer, the attention weights are sharply focused on a small set of patches, indicating a localized focus on specific details. As the network deepens, particularly in layers 2 and 3, attention is dispersed across a larger number of patches compared to the initial layer. This shift implies the model’s transition from capturing localized features to embracing a more holistic perspective, encompassing the entire load image to extract comprehensive global information.
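A minimal sketch of this analysis is given below. The attention weights are synthetic (softmax over random logits), and the entropy score is an assumption introduced here as one possible way to quantify "focused" versus "dispersed" attention; the paper itself relies on visual inspection of the reshaped maps.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def mean_attention_map(attn, query, grid):
    """Average attention over heads and reshape one query's weights to grid x grid.

    attn has shape (heads, tokens, tokens), with each row summing to 1.
    """
    return attn.mean(axis=0)[query].reshape(grid, grid)


def attention_entropy(attn):
    """Mean Shannon entropy (nats) of head-averaged attention rows.

    Low values indicate focused attention, high values dispersed attention.
    """
    p = attn.mean(axis=0)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())


HEADS, GRID = 4, 6
rng = np.random.default_rng(0)
logits = rng.normal(size=(HEADS, GRID * GRID, GRID * GRID))
focused = softmax(5.0 * logits)    # sharp weights, loosely mimicking layer 1
dispersed = softmax(0.5 * logits)  # flatter weights, loosely mimicking layers 2 and 3

print(mean_attention_map(focused, query=0, grid=GRID).shape)      # (6, 6)
print(attention_entropy(focused) < attention_entropy(dispersed))  # True
```

The entropy of a uniform distribution over 36 patches, $\ln 36\approx 3.58$ nats, is the upper bound for this score.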

III-C Downstream Task 1: Load Identification

To demonstrate the effectiveness of using a pre-trained model on downstream tasks, we first compare identification performance with and without ViT4LPA on two load identification tasks: behind-the-meter PV and EV identification. The identification models have the same network structure in both scenarios. Note that the ViT4LPA encoder model can be further fine-tuned using a small amount of labeled training data.

As illustrated in Fig. 5, a significant performance boost is evident, particularly when the training dataset is limited, such as within the range of 10k to 150k examples, by incorporating the pre-trained ViT4LPA encoder. Even with the utilization of the complete dataset comprising 500k examples, a performance gain of 1-2% is attainable when employing ViT4LPA. This outcome highlights that harnessing a pre-trained model not only effectively reduces the reliance on extensive training datasets but also consistently yields robust performance enhancements through comprehensive pre-training.

In Table II, we compare the proposed ViT-based identification model with conventional CNN models. The CNN benchmarks are based on the Inception model [10], in two variants that use 2D and 1D convolution layers, respectively. The PV identification accuracy is similar across the models (0.984 for ViT versus 0.988 and 0.987 for the CNNs), whereas a notable gap emerges in EV identification accuracy (0.943 versus 0.912 and 0.911). For the PV identification task, the abundance of available data makes pre-training less impactful. For EV identification, however, labeled data is scarce, which increases the importance of the knowledge acquired through pre-training. Consequently, in the EV identification task, using ViT4LPA yields an accuracy improvement of about three percentage points over the two CNN models.

TABLE II: Performance Comparison for Conducting PV and EV Load Identification Tasks

Model	Accuracy (PV)	Accuracy (EV)	Overall
ViT (proposed)	0.984	0.943	0.964
Inception (2D)	0.988	0.912	0.950
Inception (1D)	0.987	0.911	0.949

III-D Downstream Task 2: HVAC Load Disaggregation

Next, we showcase the performance improvement when using the pre-trained ViT4LPA encoder on load disaggregation tasks, using HVAC load disaggregation as an example. In Table III, we compare ViT4LPA with two benchmark models: the Inception model and the bidirectional Long Short-Term Memory (BiLSTM) model. Note that BiLSTM is an RNN model widely employed for sequential data analysis. To evaluate the models, we use the normalized mean absolute error ( $n$MAE ) to assess point-to-point error and energy error (EE) to measure cumulative estimation deviation. For more details on the performance metrics used for load disaggregation, please refer to our prior work [9].
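To make the two metrics concrete, the sketch below implements one plausible reading of them. The exact definitions are given in [9] and are not reproduced in this paper, so both formulas are assumptions: $n$MAE is taken here as the mean absolute error divided by the peak of the ground-truth profile, and EE as the absolute difference between the estimated and actual energy over the evaluation window. The function names and sample values are hypothetical.

```python
import numpy as np


def nmae_percent(y_true, y_pred, scale=None):
    """Mean absolute error divided by a normalization scale, in percent.

    scale defaults to the peak of y_true (an assumption; see [9] for the
    definition used in the paper).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if scale is None:
        scale = np.max(np.abs(y_true))
    return 100.0 * np.mean(np.abs(y_pred - y_true)) / scale


def energy_error_kwh(y_true, y_pred, dt_hours=1.0):
    """Absolute difference between estimated and actual energy (kWh).

    Inputs are power values in kW sampled every dt_hours hours.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return abs(np.sum(y_pred) - np.sum(y_true)) * dt_hours


# Hypothetical hourly HVAC load (kW) and a model estimate.
y_true = np.array([0.0, 1.0, 3.0, 4.0, 2.0, 0.0, 0.0, 0.0])
y_pred = np.array([0.0, 1.5, 2.5, 4.0, 1.0, 0.0, 0.0, 0.0])

print(nmae_percent(y_true, y_pred))      # 6.25
print(energy_error_kwh(y_true, y_pred))  # 1.0
```

Under these definitions the two metrics are complementary: over- and under-estimates at different hours add up in $n$MAE but partially cancel in EE.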

TABLE III: Performance Comparison (Load Disaggregation Task)

Model	$n$MAE (%)	EE (kWh)	std($n$MAE)
ViT (proposed)	6.89	2.80	2.21
Inception	8.89	4.11	2.59
BiLSTM	8.01	3.53	2.26

From the results, we made the following observations:

• ViT4LPA exhibits the lowest point-to-point error, energy error, and error standard deviation.

• As shown in Fig. 6, the HVAC load curve generated by ViT4LPA closely aligns with the ground truth, while the estimates derived from the BiLSTM model display noticeable errors, especially during peak HVAC load periods.

• As shown in Fig. 7, ViT4LPA exhibits remarkable consistency, as evidenced by the narrow standard deviation of $n$MAE. It’s important to note that each error observation is derived from averaging the resultant errors across multiple days for each customer, demonstrating the model’s robustness across various customers. In the case of our proposed method, errors are densely concentrated within the 4-8% range, indicating a high level of precision. Conversely, the errors of the BiLSTM models exhibit a comparatively broader distribution with a median value of 7.6%.

IV Conclusion

In this study, we introduce ViT4LPA, a pre-trained Vision Transformer (ViT)-based load image encoder designed to generate load embeddings from load images. By converting load, temperature, and solar irradiance profiles into load images, we enable the direct adoption of ViT, a powerful image processing model, for Load Profile Analysis (LPA). Pre-trained on masked image restoration tasks, ViT4LPA captures correlations among load profile pixels to enhance performance in downstream tasks. To gauge the performance of ViT4LPA, we conducted tests on two popular downstream LPA tasks: load identification and load disaggregation. Our results demonstrate significant improvements in both tasks, with lower point-to-point errors, cumulative energy errors, and reduced standard deviations of errors. This underscores the potential of pre-trained models, particularly in machine learning and data analysis for power system applications. The capability of ViT4LPA to fine-tune with limited labeled data is especially valuable in applications constrained by data scarcity or concerns related to data privacy.

Moving forward, our research will delve into analyzing suitable masking strategies and patch sizes to further enhance the performance across multiple downstream tasks.

References

[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[2] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009.
[3] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
[4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[5] M. D’Incecco, S. Squartini, and M. Zhong, “Transfer learning for non-intrusive load monitoring,” IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1419–1429, 2019.
[6] S. Ryu, H. Choi, H. Lee, and H. Kim, “Convolutional autoencoder based feature extraction and clustering for customer load analysis,” IEEE Transactions on Power Systems, vol. 35, no. 2, pp. 1048–1060, 2019.
[7] Y. Hu, K. Ye, H. Kim, and N. Lu, “BERT-PIN: A BERT-based framework for recovering missing data segments in time-series load profiles,” arXiv preprint arXiv:2310.17742, 2023.
[8] Pecan Street project. [Online]. Available: https://www.pecanstreet.org
[9] K. Ye, H. Kim, Y. Hu, N. Lu, D. Wu, and P. Rehm, “A modified sequence-to-point HVAC load disaggregation algorithm,” in 2023 IEEE Power & Energy Society General Meeting (PESGM). IEEE, 2023, pp. 1–5.
[10] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.