Wake Vision: A Large-scale, High-Quality Dataset and Benchmark Suite for TinyML Applications

Colby Banbury*
Harvard University
Emil Njor*¹
Technical University of Denmark
Matthew Stewart
Harvard University
Pete Warden
Useful Sensors
Manjunath Kudlur
Useful Sensors
Nat Jeffries
Useful Sensors
Xenofon Fafoutis
Technical University of Denmark
Vijay Janapa Reddi
Harvard University

Abstract

Tiny machine learning (TinyML), which enables machine learning applications on extremely low-power devices, suffers from limited size and quality of relevant datasets. To address this issue, we introduce Wake Vision, a large-scale, diverse dataset tailored for person detection—the canonical task for TinyML visual sensing. Wake Vision comprises over 6 million images, representing a hundredfold increase compared to the previous standard, and has undergone thorough quality filtering. We provide two Wake Vision training sets: Wake Vision (Large) and Wake Vision (Quality), a smaller set with higher-quality labels. Our results demonstrate that using the Wake Vision (Quality) training set produces more accurate models than the Wake Vision (Large) training set, strongly suggesting that label quality is more important than quantity in our setting. We find use for the large training set for pre-training and knowledge distillation. To minimize label errors that can obscure true model performance, we manually label the validation and test sets, improving the test set error rate from 7.8% in the prior standard to only 2.2%. In addition to the dataset, we provide a collection of five detailed benchmark sets to facilitate the evaluation of model quality in challenging real-world scenarios that are often ignored when focusing solely on overall accuracy. These novel fine-grained benchmarks assess model performance on specific segments of the test data, such as varying lighting conditions, distances from the camera, and demographic characteristics of subjects. These benchmarks enable researchers and practitioners to better understand the strengths and limitations of their models in real-world scenarios. Our results demonstrate that using Wake Vision for training results in a 2.49% increase in accuracy compared to the established dataset. We also show the importance of dataset quality for low-capacity models and the value of dataset size for high-capacity models. 
The dataset, benchmark suite, code, and models are publicly available under the CC-BY 4.0 license at https://wakevision.ai/.

* These authors contributed equally to this work.
¹ Work done while a visiting researcher at Harvard University.

1 Introduction

To reduce the energy consumption and cost of Machine Learning (ML) deployments, researchers have pioneered ultra-low-power machine learning (TinyML) [3, 30, 1, 6]. TinyML co-locates ML models with the sensors by deploying the model onto always-on microcontrollers (MCUs) and TinyML accelerators, thereby reducing the energy cost of transmitting data or constantly running larger processors. Deploying models on these tiny devices imposes severe constraints, confining the model’s size to $\sim$ 100 kilobytes, nearly four orders of magnitude less memory than mobile phones [2].

TinyML requires large-scale datasets to facilitate high-quality research. However, the extreme constraints imposed by the hardware necessitate that TinyML models be compact and efficient, rendering them unsuitable for traditional, complex ML tasks. As a result, conventional ML datasets like ImageNet [7], designed for more resource-intensive models, are often ill-suited for TinyML research endeavors. While existing TinyML datasets [5, 29] have been valuable, they tend to be limited in scale, making it challenging to train production-grade TinyML models on these datasets.

To enable and mature the field of TinyML research, we present the Wake Vision dataset, a large and high-quality dataset for person detection, the quintessential vision use case for TinyML [1]. Person detection is a binary image classification task where a model detects whether a person is present. While simple, this task has many important use cases, such as occupancy detection [25], smart HVAC/lighting [31], or acting as an always-on ‘wake model’ in a larger ML system [11]. Wake Vision has around 6M images, ~100x more than the prior state-of-the-art person detection dataset, Visual Wake Words (VWW) [5]. Wake Vision is derived from the Open Images v7 dataset [16, 14] and is permissively licensed. Wake Vision demonstrates a 2.21% accuracy improvement over VWW.

We leverage the Wake Vision dataset for an insightful study on the trade-offs between data quality and quantity in TinyML. We connect the dataset’s unique characteristics with the limitations of TinyML models. Prior work suggests that large, overparameterized models, which have more learnable parameters than data samples, can identify and adapt to errors in their training sets [32, 9]. However, TinyML models are rarely overparameterized, leading us to hypothesize that the trade-off between data quality and quantity may be skewed towards quality for these much smaller ML models.

To explore this hypothesis and showcase the value of the Wake Vision dataset in the TinyML domain, we release two training sets as part of the Wake Vision visual corpus: Wake Vision (Large), which prioritizes dataset size, and Wake Vision (Quality), which prioritizes data quality. Using a naive training script, we find that Wake Vision (Quality) outperforms Wake Vision (Large). However, our best-performing Wake Vision model is created by using the two sets in unison, employing Wake Vision (Large) for pre-training and Wake Vision (Quality) for fine-tuning. This approach demonstrates the potential of the Wake Vision dataset to advance TinyML research by leveraging both data quantity and quality. In addition to presenting these findings, we conduct several experiments designed to investigate the trade-offs between dataset quality and quantity across a range of models. To facilitate further research into these trade-offs for TinyML, we release both training sets.

TinyML person detection systems are deployed and expected to function effectively in challenging settings, such as low lighting conditions. However, these challenging settings are often underrepresented in images sourced from the internet, which can lead to high-level metrics, like overall test set accuracy, overestimating a model’s real-world performance. To address this issue, we introduce a suite of five fine-grained benchmarks, each containing images of a specific demographic or scenario.

The benchmarks are designed to provide a more comprehensive evaluation of a person detection model’s performance across a range of real-world conditions. Our benchmarks cover several factors, including distance, which evaluates the model’s performance in detecting people at various distances from the camera; lighting, which assesses the model’s accuracy in detecting individuals in both well-lit and poorly-lit environments; depictions, which measures the model’s ability to handle different representations of people, such as drawings or sculptures; perceived gender, which investigates potential biases in the model’s performance when detecting people of different perceived genders; and age, which examines the model’s effectiveness in detecting individuals across various age groups.

By evaluating models using these fine-grained benchmarks contained within Wake Vision, we can identify potential biases, limitations, and areas for improvement that may not be apparent when relying solely on high-level metrics. This approach ensures that the person detection systems we develop are not only accurate on standard test sets but also robust and reliable across a diverse range of real-world settings and demographics. In summary, we present the following novel contributions:

• A new person detection dataset, called Wake Vision, which is an easily accessible, permissively licensed dataset that is two orders of magnitude larger than the prior standard.

• An improved, high-quality test set that reduces the estimated label error rate to only 2.2% from the prior standard of 7.8%.

• A suite of five fine-grained benchmark sets that evaluate a model’s fairness and robustness in challenging settings.

2 Related Work

The current de facto standard dataset for person detection is the VWW dataset [5, 6, 18, 2, 1]. To our knowledge, this is the only current open-source dataset designed explicitly for person detection. Unfortunately, the VWW dataset is small and cannot be directly downloaded. Instead, it must be regenerated from MS-COCO [19], thus limiting its accessibility and usefulness. Apart from dedicated person detection datasets, some general image classification datasets contain a person label, which makes it possible to create person detection models based on them. Examples include the CIFAR-100 dataset [15] and the PASCAL Visual Object Classes dataset [8].

However, using general image classification datasets can lead to poor perceived performance of TinyML models, as they do not adequately represent the open “no-person” class [5]. Table 1 shows that our Wake Vision dataset contains up to almost two orders of magnitude more images than any other public and permissively licensed dataset. Furthermore, Wake Vision is the only person detection dataset to provide a fine-grained benchmark suite and be officially distributed to popular dataset services such as TensorFlow Datasets (TFDS) and Hugging Face Datasets. The benchmark suite matters because recent research suggests that focusing only on top-line metrics can hide poor performance on critical subsets of a dataset, such as subsets relating to sensitive information and fairness [12, 13].

Table 1: A comparison of person detection and image classification datasets.

Dataset	Total Images	Person Images (Train)	Person Images (Validation)	Person Images (Test)	Fine-Grained Filtering	Suitable for TinyML
Wake Vision (Quality)	1,322,574	624,115	9,2091	27,881	✓	✓
Wake Vision (Large)	5,760,428	2,880,214	-	-	✗	✓
Visual Wake Words [5]	123,287	36,000	3,926	19,107	✗	✓
CIFAR-100 [15]	60,000	2,500	-	500	✗	✗
PASCAL VOC 2012 [8]	11,530	1,994	2,093	-	✗	✗

3 Wake Vision Dataset

We introduce the Wake Vision dataset, the largest dataset for TinyML person detection. The dataset’s size, quality, and detailed metadata enable new avenues of TinyML research. Wake Vision is two orders of magnitude larger than prior comparable datasets and can be adapted based on the target use case. We also demonstrate how its quality is significantly improved compared to existing datasets. Fig. 1 illustrates examples of images in the dataset. We discuss our methodologies for label generation, data filtering, and error correction, as well as the usability of the dataset and how it compares to the Visual Wake Words dataset.

3.1 Label Generation

While a large person detection dataset is indispensable for TinyML research, it can be challenging to obtain. Manually labelling millions of images would be prohibitively expensive for a nascent field like TinyML. Therefore, we use existing large-scale data efforts to generate Wake Vision automatically.

The base label in Wake Vision is a binary person/non-person label. We derive these labels from Open Images, which contains both image-level and bounding box labels. Image-level labels describe which objects are present in an image but do not localize them to a specific point in the image. Bounding box labels provide a set of four coordinates forming a box around the object. The bounding box label classes are hierarchically structured so that one class can be a subcategory or a part of another class. For example, a “Woman” is a subcategory of a “Person,” and a “Human Hand” is a part of “Person.”

We observe that the bounding box labels are generally more accurate and provide more information for data filtering but are less numerous. The Open Images training set has ~9 million images with image-level person labels, but only ~1.7 million images with a person bounding box. The Open Images validation and test sets are fully labeled with image-level and bounding box labels. Consequently, this presents a trade-off between more data (image-level labels) and higher-quality labels (bounding box labels). Because of this trade-off, we provide two Wake Vision training sets (Table 1): Wake Vision (Large), a larger set labeled via Open Images image-level labels, and Wake Vision (Quality), a smaller set labeled via Open Images bounding box labels.

Large Training Set from Image-Level Labels

Image-level labels in Open Images carry little information beyond the labels themselves. The only additional relevant information is a confidence property, which represents how certain it is that a label is correct. The confidence ranges from 0 to 1 for labels generated by machines, and is either 0 or 1 for labels negatively or positively verified by humans, respectively. See Section 3.2 for more information. We use this property to exclude images from Wake Vision if they contain only low-confidence person labels.
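The confidence-based rule described above can be sketched as follows. This is a minimal illustration rather than the released dataset creation script: the `ImageLabel` record, the `PERSON_CLASSES` set, and the 0.5 threshold are hypothetical, since the text only states that images with exclusively low-confidence person labels are excluded.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical values for illustration; the released script may differ.
PERSON_CLASSES = {"Person", "Woman", "Man"}
MIN_CONFIDENCE = 0.5


@dataclass
class ImageLabel:
    class_name: str
    confidence: float  # 0..1 for machine labels; exactly 0 or 1 when human-verified


def label_from_image_level(labels: List[ImageLabel]) -> Optional[int]:
    """Return 1 (person), 0 (no person), or None (exclude the image)."""
    person_conf = [l.confidence for l in labels if l.class_name in PERSON_CLASSES]
    if not person_conf or max(person_conf) == 0.0:
        # No person label at all, or every person label has confidence 0
        # (for example, negatively verified by a human).
        return 0
    if max(person_conf) >= MIN_CONFIDENCE:
        return 1
    # Only low-confidence person labels remain: drop the image.
    return None
```

Returning `None` for the ambiguous case keeps the exclusion decision separate from the binary label, so a caller can count how many images a given threshold removes.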

Quality Training Set From Bounding Box Labels

Bounding box labels in Open Images, in contrast to image-level labels, are all verified and localized by humans. This ensures that bounding box labels are less likely to be false positives. Bounding box labels can also be used to calculate an approximation of the area taken up by a person. We use this as a proxy for the distance of a person in the image, and exclude images where all person labels are far away, i.e., take up only a small portion of the image. An example of an image with a far-away person can be found in Fig. 2(a). Finally, bounding box attributes include a depiction flag, which indicates whether a label refers to a depiction. See an example of an image containing a person with the depiction flag set in Fig. 2(b). By default, we use this flag to make sure that we do not consider a depiction a person.

We allow users to adapt the Wake Vision dataset to meet the needs of specific use cases by adding several configurable options in the open-source dataset creation script. For example, a configuration option exists to change whether depictions are labelled as non-persons or excluded from the dataset. A comprehensive list of Open Images labels related to Wake Vision can be found in Sec. L.
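The bounding box rules above (area as a distance proxy, the depiction flag, and the configurable depiction handling) can be sketched as below. This is an illustration under stated assumptions, not the released script: the `Box` record, the `PERSON_CLASSES` set, the `MIN_AREA_FRACTION` value, and the `depictions_as_non_person` option name are all hypothetical, and coordinates are assumed to be normalized to [0, 1].

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical values for illustration; the released script may differ.
PERSON_CLASSES = {"Person", "Woman", "Man"}
MIN_AREA_FRACTION = 0.05


@dataclass
class Box:
    class_name: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    is_depiction: bool

    def area(self) -> float:
        """Fraction of the image covered, given normalized coordinates."""
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


def label_from_boxes(boxes: List[Box],
                     depictions_as_non_person: bool = True) -> Optional[int]:
    """Return 1 (person), 0 (no person), or None (exclude the image)."""
    persons = [b for b in boxes if b.class_name in PERSON_CLASSES]
    real = [b for b in persons if not b.is_depiction]
    if any(b.area() >= MIN_AREA_FRACTION for b in real):
        return 1
    if real:
        # Every real person is far away (small box): exclude the image.
        return None
    if persons:
        # Only depictions of people are present.
        return 0 if depictions_as_non_person else None
    return 0
```

Setting `depictions_as_non_person=False` mirrors the option of excluding depictions instead of labelling them as non-persons.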

Training Set Comparison

To compare the performance of the two Wake Vision training sets, we train a MobileNetV2-0.25 [26] model on each dataset for an equal number of steps and compare its performance on the common test set (Table 2). The filtered training set generated from the bounding box labels outperforms the larger training set generated from the image-level labels by 4.09% test accuracy. This indicates that label quality is more important than quantity in this setting. The gap between the two is reduced to just 0.17% when training exclusively on soft labels from a teacher model (MobileNetV2-1.0) trained on bounding box labels.
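Training on soft labels from a teacher can be illustrated with a per-sample loss. The sketch below assumes a binary cross-entropy between the student's predicted person probability and the teacher's predicted person probability; the exact distillation loss used in our experiments is not specified in this section, so the function name and form are illustrative only.

```python
import math


def soft_label_bce(student_prob: float, teacher_prob: float,
                   eps: float = 1e-7) -> float:
    """Binary cross-entropy of the student against a teacher's soft label.

    Both arguments are probabilities of the "person" class. The ground-truth
    label is not used, which corresponds to training exclusively on soft labels.
    """
    p = min(max(student_prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(teacher_prob * math.log(p) + (1.0 - teacher_prob) * math.log(1.0 - p))
```

Because the target comes from the teacher rather than from the dataset label, a noisy image-level label does not enter the loss, which is one way to read the narrowing of the gap between the two training sets.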

Combining the two training sets to train a model where Wake Vision (Large) acts as a pre-training set and Wake Vision (Quality) acts as a fine-tuning set achieves our best performance at a mean top-level test accuracy of 85.72%. As the images in Wake Vision (Quality) are a subset of the images in Wake Vision (Large), we do not train on more images than for Wake Vision (Large).

Table 2: Accuracy on the test set of models trained on image-level and bounding box labels, with and without knowledge distillation (KD). A MobileNetV2-1.0 model trained on the bounding box set is used as the teacher in KD.

Training set	Label Source	Training Set Size	Accuracy	Distilled Accuracy
Wake Vision (Large)	Image-Level	5,760,428	80.80 $\pm$ 0.18%	85.00 $\pm$ 0.20%
Wake Vision (Quality)	Bounding Box	1,248,230	84.89 $\pm$ 0.11%	85.17 $\pm$ 0.18%
Wake Vision (Combined)	Both	5,760,428	85.72 $\pm$ 0.04%	-

As our distillation results indicate, further data-centric filtering techniques may be needed to fully leverage the image-level training set. We experimented with tripling the model size and doubling the training time, but the difference in accuracy remained relatively consistent.

The two training sets provide a useful foundation for future work on the importance of data quality vs. quantity. Wake Vision can enable further research on data-centric TinyML [21, 22], specifically uncovering the relationship between model capacity and the sensitivity to training data quality.

3.2 Label Correction

Label errors are a challenge in computer vision [24, 4] that limits progress and obscures true performance. This is especially prevalent in large datasets that use complex labeling pipelines to scale while managing costs. Recognizing the importance of label quality, we take measures to estimate and improve the accuracy of Wake Vision’s labels, aligning with best practices in the field of computer vision.

Wake Vision’s labels are derived from the Open Images labels; therefore, the errors that occur in the Open Images labeling pipeline are inherited by Wake Vision. All labels in the Open Images dataset are originally machine-generated candidate labels. Many of these labels are later verified by human annotators, and a subset of them are given bounding boxes and additional annotations.

While the machine-generated label phase aims to identify objects efficiently, any instances missed during this initial step are unlikely to be captured in downstream phases. Consequently, the Open Images dataset may contain numerous false negative labels, referring to images where an object is present, but lacks the corresponding label. Therefore, additional measures are necessary to identify and correct such labeling omissions to ensure the dataset’s integrity.

Given Wake Vision’s size, manually correcting label errors across the entire dataset becomes an arduous undertaking. Therefore, we prioritize labeling the Wake Vision validation and test sets so that the dataset accurately represents the model’s performance. To do so, we make use of the Scale AI platform to crowd-source manual label corrections (see Appendix). With the ground truth established in the validation and test sets, Wake Vision becomes an ideal foundation for research on automated data cleaning techniques [23]. We perform an initial exploration of these methods in Appendix N. We report our estimated label error rate after label corrections versus VWW in Table 3. The estimated error rate of the Wake Vision validation and test sets is considerably lower than the top-level error rate of the VWW dataset.

Table 3: Estimated label error rate of VWW and Wake Vision, based on a 500 sample subset. The Wake Vision estimated error rate is broken down by set. The Wake Vision Train (Quality) and Train (Large) have different error rates due to the different sources of their labels. The Wake Vision Validation and Test sets have a lower error rate due to manual relabeling.

Dataset	Wake Vision Set	Label Error Rate
Visual Wake Words	Train, Val, & Test	7.8%
Wake Vision	Train (Large)	15.2%
Wake Vision	Train (Quality)	6.8%
Wake Vision	Val & Test	2.2%
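Because the rates in Table 3 are estimated from a sample, it helps to look at their sampling uncertainty. The sketch below computes a 95% Wilson score interval for a binomial proportion, assuming each rate was estimated from its own 500 independent samples (the caption does not say whether the 500 samples are per row or shared, so this is an assumption).

```python
import math
from typing import Tuple


def wilson_interval(error_rate: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion estimated from n samples."""
    z2_over_n = z * z / n
    centre = (error_rate + z2_over_n / 2.0) / (1.0 + z2_over_n)
    half_width = (z * math.sqrt(error_rate * (1.0 - error_rate) / n
                                + z * z / (4.0 * n * n))) / (1.0 + z2_over_n)
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    for name, rate in [("VWW", 0.078), ("Train (Large)", 0.152),
                       ("Train (Quality)", 0.068), ("Val & Test", 0.022)]:
        low, high = wilson_interval(rate, 500)
        print(f"{name}: {low:.3f} to {high:.3f}")
```

Under these assumptions, the intervals for the Wake Vision validation and test sets (2.2%) and for VWW (7.8%) do not overlap, whereas the intervals for Train (Quality) (6.8%) and VWW do.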

3.3 Usability

Wake Vision is available through TensorFlow Datasets [28] and HuggingFace Datasets [17] to enable easy access and use by the community. The images and labels are rehosted to ensure the dataset will not be changed over time due to dead links, making it a more stable benchmark. The rehosted labels are generated according to our default dataset configuration, further described in Sec. I.

3.4 Comparison to Visual Wake Words

Wake Vision serves as a drop-in improvement over VWW, thanks to its size and our extensive filtering. To compare the benefits of Wake Vision against VWW, we train two identical MobileNetV2 [26] models with a width modifier of 0.25 for 200,000 steps on 224x224x3 images using AdamW [20]. We use a learning rate of 0.002 with a cosine decay and a weight decay of 4E-6. One model uses VWW’s training set, and the other employs Wake Vision’s Quality training set. After training these identical models using the respective datasets’ training sets with the same recipe for an equal number of steps, we cross-evaluate their performance on the corresponding test sets.
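The learning rate schedule in this recipe can be written out explicitly. The sketch below is a plain cosine decay from 0.002 over 200,000 steps; the absence of warmup and the decay to a final learning rate of zero are assumptions, as the text only states the initial value and the schedule type.

```python
import math


def cosine_decay_lr(step: int, total_steps: int = 200_000,
                    base_lr: float = 0.002, final_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate at a given training step."""
    # Clamp so that steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

Halfway through training (step 100,000) this schedule yields half of the initial learning rate, and it reaches `final_lr` at the last step.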

Table 4 shows a 0.26% improvement on VWW’s own test set, which illustrates that Wake Vision is a direct improvement over VWW, not simply a domain shift. We additionally achieve a 1.1% improvement over VWW on Wake Vision’s test set; both models score lower on this set than on the VWW test set, indicating that the new test set is more challenging. Table 4 also shows the performance of the Wake Vision (Combined) model described in Section 3.1. This model achieves an average 0.75% improvement over the Wake Vision (Quality) trained model on the VWW test set and an average 0.83% improvement on the Wake Vision test set. The Wake Vision (Combined) model is trained for longer than the VWW and Wake Vision (Quality) models, but shows the value of using the training sets in unison even on the VWW test set.

Table 4: Accuracy on the Wake Vision and VWW test sets by models trained on the VWW, Wake Vision (Quality) and Wake Vision (Combined) training sets.

		Train
		VWW	Wake Vision (Quality)	Wake Vision (Combined)
Test	VWW	88.33 $\pm$ 0.29%	88.59 $\pm$ 0.17%	89.34 $\pm$ 0.02%
Test	Wake Vision	83.79 $\pm$ 0.23%	84.89 $\pm$ 0.11%	85.72 $\pm$ 0.04%

3.5 Dataset Quality Vs. Size

We investigate the impact of dataset quality and size on ResNet [10] style models of varying capacity. We measure dataset quality by the approximate rate of label errors. The Wake Vision (Quality) training set has an estimated error rate of around 7%. We simulate higher error rate datasets (15% and 30%) by flipping the binary label with a certain probability. Appendix D gives the derivation for the flip probability. We also sweep the dataset size by taking a slice of Wake Vision (Quality). Each model is trained for 50,000 steps on 224x224x3 images using AdamW [20], a learning rate of 0.001 with a cosine schedule and a weight decay of 0.004. Figure 3 illustrates the experiment’s results. We observe that smaller models (leftmost figure) are more sensitive to higher error rates than large models. When the error rate rises from the base Train (Quality) rate of around 7% to 15%, the smallest model loses 1.3% accuracy, compared to only 0.5% for the largest model. In contrast, large models benefit more from big datasets and start to overfit on the smaller slices of the training data.
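One way to obtain the flip probability is sketched below; it may differ from the derivation in Appendix D. Assume the existing labels have error rate e0 and each label is flipped independently with probability p. A label ends up wrong if it was wrong and not flipped, or right and flipped, so the resulting error rate is e = e0(1 - p) + (1 - e0)p, which gives p = (e - e0) / (1 - 2 e0). The function names are hypothetical.

```python
import random
from typing import List


def flip_probability(base_error: float, target_error: float) -> float:
    """Flip probability that raises the label error rate to target_error.

    Assumes flips are independent of which labels are already wrong.
    """
    if not 0.0 <= base_error < 0.5:
        raise ValueError("base_error must be in [0, 0.5)")
    if not base_error <= target_error <= 0.5:
        raise ValueError("target_error must be in [base_error, 0.5]")
    return (target_error - base_error) / (1.0 - 2.0 * base_error)


def flip_labels(labels: List[int], p: float, seed: int = 0) -> List[int]:
    """Flip each binary label independently with probability p."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < p else y for y in labels]
```

For example, with a base error rate of 0.07 and a target of 0.15, the formula gives p = 0.08 / 0.86, which is slightly more than the naive difference of 0.08 because some flips correct labels that were already wrong.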

4 Fine Grain Benchmark Suite

Test sets generated directly from images posted to the internet are not reflective of real-world use cases, as the distribution is biased towards images people deem worth sharing (e.g., well lit and framed). In contrast, person detection systems are deployed, and expected to work, in more challenging settings, such as low lighting. Due to this gap, a model that achieves high test accuracy may perform poorly in real-world scenarios after the design stage has ended. To address this issue, we present a benchmark suite for person detection that tests a model’s robustness in challenging settings, enabling better analysis at the design stage. Additionally, the suite measures the model’s performance across demographics to ensure it does not exhibit clear bias. The suite consists of five fine-grained benchmark sets. Each of the five fine-grained benchmarks has been chosen based on a combination of its relevance to TinyML use cases and the availability of requisite metadata to generate the sets. Each benchmark set is a subset of the respective validation or test set filtered based on the criteria under test. The benchmark determines whether a model is sufficiently accurate in the planned deployment setting. For example, a model designer may make different design choices for a use case where the subject is close to the camera and well-lit compared to the inverse setting. Appendix C contains a case study showing how the fine-grained evaluation sets can be used to design robust models. Table 5 reports the number of samples in each of the individual benchmark sets and the F1 scores of a Wake Vision and VWW model on each set.

Perceived Gender and Age

These fine-grained datasets are generated from the Open Images More Inclusive Annotations for People (MIAP) extended fairness labels [27]. Underrepresented sub-groups typically constitute a small portion of a generic test set; therefore, top-line metrics often obscure a bias in a model until it is deployed [12]. This benchmark aims to evaluate a model separately on demographics that are underrepresented in the underlying dataset distribution, identifying bias. These labels are based on a perceived gender and age representation and are not necessarily representative.

Table 5: Wake Vision Fine Grain Benchmark Suite. We report the samples in each set and the average F1 score across three Wake Vision models and three VWW models on the benchmark set.

		Set Size		F1-Score
	Benchmark	Val	Test	Wake Vision	VWW
Gender	Female	684	2,157	0.93	0.89
	Male	1,310	3,918	0.91	0.88
	Unknown	1,612	4,940	0.77	0.78
Age	Young	275	884	0.94	0.90
	Middle	2,133	6,595	0.91	0.88
	Older	90	276	0.94	0.89
	Unknown	1,299	3,837	0.71	0.74
Distance	Near	5,457	16,333	0.91	0.89
	Medium	2,213	6,876	0.85	0.84
	Far	398	1,140	0.59	0.67
Lighting	Dark	3,255	9,420	0.85	0.81
	Normal	14,315	43,010	0.82	0.82
	Bright	1,012	3,332	0.82	0.82
Depictions	Person	356	978	0.71	0.66
	Non-Person	352	1,101	0.86	0.82
	No Depiction	8,583	25,802	0.87	0.85
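Evaluating a model on one benchmark set amounts to filtering the evaluation samples and computing the F1 score on that slice. The sketch below illustrates this; the sample dictionary layout, the `predict` callable, and the `in_slice` predicate are hypothetical, and treating "person" (label 1) as the positive class is an assumption, since the text does not state which class is positive for each benchmark.

```python
from typing import Callable, Dict, List, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 score for one class; returns 0.0 when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_on_slice(samples: List[Dict], predict: Callable[[Dict], int],
                in_slice: Callable[[Dict], bool]) -> float:
    """F1 score restricted to the samples selected by in_slice."""
    chosen = [s for s in samples if in_slice(s)]
    y_true = [s["label"] for s in chosen]
    y_pred = [predict(s) for s in chosen]
    return f1_score(y_true, y_pred)
```

A slice predicate such as `lambda s: s["lighting"] == "dark"` would then select the dark-lighting set, assuming the samples carry such a metadata field.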

Distance

The distance datasets test how the distance of persons in images impacts the performance of ML systems. This benchmark is critical for correctly predicting a model’s deployed performance. If a person detection system is intended to recognize subjects at great distances, the system’s performance on the far-away dataset will be more informative than its performance on the top-level test set. We create three datasets based on the percentage of the image the subject bounding box covers. The three sets are near (>60%), at a medium distance (10-60%), and far away (<10%).
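The three distance sets can be expressed as a simple bucketing function on the fraction of the image covered by the subject's bounding box. In this sketch, the handling of the exact boundary values (10% and 60% both fall into the medium set) is an assumption, as is the function name.

```python
def distance_bucket(area_fraction: float) -> str:
    """Map the bounding box area fraction (0..1) to a distance set."""
    if not 0.0 <= area_fraction <= 1.0:
        raise ValueError("area_fraction must be in [0, 1]")
    if area_fraction > 0.60:
        return "near"
    if area_fraction >= 0.10:
        return "medium"
    return "far"
```

How an image with several people is assigned to a set is not covered here; a caller would need to choose which bounding box to pass in.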

Lighting

The lighting datasets aim to test the performance of ML systems across different lighting conditions. In scenarios like security cameras, outdoor robotics, or augmented reality applications, models must be robust to varying lighting conditions, including low-light environments. We create three fine-grained datasets of this type for dark, normal, and bright lighting conditions, respectively. We quantify lighting conditions by the average pixel values of images in greyscale, which is a simple but effective method for distinguishing lighting conditions [33]. We define dark lighting conditions as an average pixel value less than 85, normal as between 85 and 170, and bright as greater than 170.
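The lighting thresholds can be sketched as a function of the mean greyscale pixel value. The sketch assumes an 8-bit greyscale image (values 0 to 255) supplied as rows of integers, leaves the colour-to-greyscale conversion to the caller, and treats the boundary values 85 and 170 as normal; these details and the function name are assumptions.

```python
from typing import Sequence


def lighting_bucket(grey_pixels: Sequence[Sequence[int]]) -> str:
    """Classify an 8-bit greyscale image as dark, normal, or bright."""
    count = sum(len(row) for row in grey_pixels)
    if count == 0:
        raise ValueError("image has no pixels")
    mean = sum(sum(row) for row in grey_pixels) / count
    if mean < 85:
        return "dark"
    if mean <= 170:
        return "normal"
    return "bright"
```

The thresholds split the 0 to 255 range into three equal parts, so the rule needs no per-dataset calibration.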

Depictions

A particularly challenging task for a person detection model is to correctly reject depictions of people. In many use cases, a person detection model must not falsely trigger on a depiction. For example, a room occupancy detector should not identify a painting on the wall as a person. This benchmark measures the model’s accuracy on three related sets of non-person samples: depictions of people, depictions that are not of people, and images that do not contain a depiction of any kind. Depictions of people can range from photo-realistic to crude stick figures.

4.1 Benchmark Results

Table 5 compares the results of a model trained on the Wake Vision (Quality) training set to an identical model trained on VWW. The Wake Vision model matches or outperforms the VWW model on most of the challenging settings exercised by our benchmarking suite, further demonstrating the effectiveness of the dataset; the exceptions are the far-distance set (0.59 versus 0.67) and the sets with unknown perceived gender or age. For instance, on the “Depictions” benchmark, which evaluates performance on images containing depictions of persons, depictions of non-person objects, or no depictions, the Wake Vision model achieves an F1 score of 0.71 for person depictions, outperforming the VWW model’s 0.66. Similarly, the Wake Vision model scores 0.86 for non-person depictions compared to the VWW model’s 0.82. The Wake Vision model also showcases improved performance on the “Age” benchmark, with F1 scores of 0.94, 0.91, and 0.94 for young, middle, and older individuals, surpassing the VWW model’s scores of 0.90, 0.88, and 0.89, respectively. This highlights the dataset’s effectiveness in enhancing model robustness for detecting people across various age groups, particularly the elderly demographic. In addition, the Wake Vision model demonstrates resilience to challenging lighting conditions, achieving an F1 score of 0.85 in dark lighting scenarios, compared to the VWW model’s 0.81.

Collectively, these results underscore the significance of Wake Vision’s large-scale, diverse data and fine-grained benchmarks, enabling the development of more robust and reliable TinyML person detection models across a wide range of real-world scenarios.

5 Ethical Considerations

The Wake Vision dataset and benchmark suite aim to advance TinyML research by providing large-scale datasets and enabling fine-grained analysis of TinyML systems. Wake Vision can directly improve the quality of person detection systems that benefit society, such as energy-saving applications, while protecting user privacy through on-device inference. However, person detection models can also be used for malicious purposes, such as weapons targeting or large-scale surveillance.

The images in Wake Vision are sourced from Flickr through Open Images [16, 14] under the CC-BY 2.0 license. While we have made efforts to ensure the images are properly licensed, we cannot guarantee the license status of each image. It is possible that some images have been uploaded without the right to distribute under the CC-BY 2.0 license. Given sufficient resources, individuals in the dataset could potentially be identified based on biometric data. The dataset may also contain offensive content, as manually verifying millions of images is infeasible.

The benchmark suite aims to ensure the fairness and robustness of person detection models. However, the demographic benchmarks could potentially be misused to create systems that classify gender and age, breaching personal privacy or perpetuating discrimination. To mitigate this risk, we have not released training sets for the fine-grained benchmark datasets.

6 Conclusions

The lack of large, high-quality datasets has been a significant bottleneck in the field of TinyML research. To address this issue, we introduced Wake Vision, a large-scale dataset and benchmark suite for person detection that surpasses the current state-of-the-art dataset, VWW, in terms of size (~100x larger), ease of use, label accuracy, and real-world relevance. These improvements make Wake Vision a drop-in replacement for VWW. Recognizing the importance of fairness across demographics and robustness to challenging settings in person detection use cases, we also released a Wake Vision benchmark suite comprising five fine-grained benchmark sets to evaluate the fairness and robustness of person detection models. Furthermore, our investigation into the trade-off between dataset quality and quantity for TinyML systems suggests that dataset quality is likely more crucial for TinyML systems compared to conventional ML systems. The Wake Vision dataset and benchmark suite aim to facilitate the development of more accurate, fair, and robust person detection models in TinyML.

7 Acknowledgements

This work was partially supported by the Google TPU Research Cloud program. We would like to thank Andrew Howard of Google DeepMind for his input and support in this effort.

References

[1] C. Banbury, V. J. Reddi, P. Torelli, J. Holleman, N. Jeffries, C. Kiraly, P. Montino, D. Kanter, S. Ahmed, D. Pau, et al. MLPerf Tiny benchmark. arXiv preprint arXiv:2106.07597, 2021.
[2] C. Banbury, C. Zhou, I. Fedorov, R. Matas, U. Thakker, D. Gope, V. Janapa Reddi, M. Mattina, and P. Whatmough. Micronets: Neural network architectures for deploying tinyml applications on commodity microcontrollers. Proceedings of Machine Learning and Systems, 3:517–532, 2021.
[3] C. R. Banbury, V. J. Reddi, M. Lam, W. Fu, A. Fazel, J. Holleman, X. Huang, R. Hurtado, D. Kanter, A. Lokhmotov, et al. Benchmarking tinyml systems: Challenges and direction. arXiv preprint arXiv:2003.04821, 2020.
[4] L. Beyer, O. J. Hénaff, A. Kolesnikov, X. Zhai, and A. v. d. Oord. Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020.
[5] A. Chowdhery, P. Warden, J. Shlens, A. Howard, and R. Rhodes. Visual wake words dataset. arXiv preprint arXiv:1906.05721, 2019.
[6] R. David, J. Duke, A. Jain, V. Janapa Reddi, N. Jeffries, J. Li, N. Kreeger, I. Nappier, M. Natraj, T. Wang, et al. Tensorflow lite micro: Embedded machine learning for tinyml systems. Proceedings of Machine Learning and Systems, 3:800–811, 2021.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.
[9] V. Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[11] G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models: Combining models for holistic scene understanding. Advances in neural information processing systems, 21, 2008.
[12] S. Hooker, A. Courville, G. Clark, Y. Dauphin, and A. Frome. What do compressed deep neural networks forget? arXiv preprint arXiv:1911.05248, 2019.
[13] S. Hooker, N. Moorosi, G. Clark, S. Bengio, and E. Denton. Characterising bias in compressed models. arXiv preprint arXiv:2010.03058, 2020.
[14] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017.
[15] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Masters Thesis, 2009.
[16] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
[17] Q. Lhoest, A. V. del Moral, Y. Jernite, A. Thakur, P. von Platen, S. Patil, J. Chaumond, M. Drame, J. Plu, L. Tunstall, et al. Datasets: A community library for natural language processing. arXiv preprint arXiv:2109.02846, 2021.
[18] J. Lin, W.-M. Chen, Y. Lin, C. Gan, S. Han, et al. Mcunet: Tiny deep learning on iot devices. Advances in Neural Information Processing Systems, 33:11711–11722, 2020.
[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[20] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[21] M. Mazumder, C. Banbury, X. Yao, B. Karlaš, W. Gaviria Rojas, S. Diamos, G. Diamos, L. He, A. Parrish, H. R. Kirk, et al. Dataperf: Benchmarks for data-centric ai development. Advances in Neural Information Processing Systems, 36, 2024.
[22] E. Njor, J. Madsen, and X. Fafoutis. Data aware neural architecture search. arXiv preprint arXiv:2304.01821, 2023.
[23] C. Northcutt, L. Jiang, and I. Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021.
[24] C. G. Northcutt, A. Athalye, and J. Mueller. Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749, 2021.
[25] M. Piechocki, M. Kraft, T. Pajchrowski, P. Aszkowski, and D. Pieczynski. Efficient people counting in thermal images: The benchmark of resource-constrained hardware. IEEE Access, 10:124835–124847, 2022.
[26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
[27] C. Schumann, S. Ricco, U. Prabhu, V. Ferrari, and C. R. Pantofaru. A step toward more inclusive people annotations for fairness. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2021.
[28] TensorFlow Datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets, 2024.
[29] P. Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
[30] P. Warden and D. Situnayake. Tinyml: Machine learning with tensorflow lite on arduino and ultra-low-power microcontrollers. O’Reilly Media, 2019.
[31] A. Zacharia, D. Zacharia, A. Karras, C. Karras, I. Giannoukou, K. C. Giotopoulos, and S. Sioutas. An intelligent microprocessor integrating tinyml in smart hotels for rapid accident prevention. In 2022 7th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM), pages 1–7. IEEE, 2022.
[32] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
[33] W. Zhang, H. Li, and Z. Wang. Research on different illumination image classification method. In 2017 2nd International Conference on Automation, Mechanical Control and Computational Engineering (AMCCE 2017), pages 574–581. Atlantis Press, 2017.

Appendix A Code Repository

The code used to generate the Wake Vision dataset is available at the following GitHub repository: https://github.com/harvard-edge/Wake_Vision

This repo contains the code to generate Wake Vision and the benchmark suite, as well as the code to train and evaluate models. This code is sufficient to reproduce all results in the paper.

Appendix B Flowchart of Bounding Box Filtering Process

Figure 4 illustrates the filtering process for Wake Vision when using Open Images’ bounding box labels as the label source.

Appendix C Model Design Case Study

Basic test set performance can often misrepresent a model’s performance, given the typical domain shift between images scraped from the internet and real-world use cases. This issue is exacerbated when ML practitioners must trade off accuracy against runtime performance and model size, which is necessary for TinyML use cases. A design decision might have seemingly little impact on the test accuracy but may destroy real-world performance depending on the deployment environment. For example, a person detection system may only operate in dark lighting conditions while the test dataset has an insignificant number of dark samples; the test accuracy will then not reflect the real-world accuracy.

The benchmark suite enables more holistic analysis during the design phase. To demonstrate this, we perform a series of scaling experiments that employ typical TinyML compression techniques and identify the circumstances under which these techniques are appropriate. While these results can inform ML practitioners, our intention is to demonstrate the usefulness of the benchmark suite.

C.1 Image Size vs. Model Width Scaling

We train two series of MobileNetV2 models: one series that sweeps the input image size [64-256] and one that sweeps the width multiplier of a model [0.1-1.5]. We then benchmark these models on the Wake Vision test set as well as the far distance benchmark. We plot these results against the number of multiply accumulate (MAC) operations in the model as a proxy for on-device latency [2].
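To illustrate why both sweeps change the MAC count, the following minimal sketch counts MACs for a single depthwise-separable convolution, a building block that MobileNet-style models rely on. The layer shape (56x56 feature map, 64 input and 128 output channels) and the function names are hypothetical and are not taken from the models trained in this study.

```python
def separable_conv_macs(feature_size, in_channels, out_channels, kernel=3):
    """MACs for one depthwise-separable conv (stride 1, 'same' padding)."""
    positions = feature_size * feature_size
    depthwise = kernel * kernel * in_channels * positions
    pointwise = in_channels * out_channels * positions
    return depthwise + pointwise


def scaled_macs(feature_size, in_channels, out_channels,
                resolution_scale=1.0, width=1.0):
    """MACs after scaling the spatial resolution and the channel width."""
    return separable_conv_macs(
        int(feature_size * resolution_scale),
        int(in_channels * width),
        int(out_channels * width),
    )


# Hypothetical layer used only for illustration.
base = scaled_macs(56, 64, 128)
half_resolution = scaled_macs(56, 64, 128, resolution_scale=0.5)
half_width = scaled_macs(56, 64, 128, width=0.5)
```

In this sketch, halving the resolution divides the MAC count by four, while halving the width reduces it by slightly less than a factor of four because the depthwise term scales only linearly with width.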

The results in Figure 5 (left) show that when looking exclusively at the high-level metric (i.e., test accuracy), scaling the input image size has a similar impact as scaling the model size. However, as shown in Figure 5 (right), when we consider only samples where the person is far away from the camera (i.e., the distance benchmark), we observe a much more significant impact when scaling the image size. In the case of distant subjects, the image size becomes the bottleneck.

These findings suggest that for ML developers targeting use cases where the subject is likely to be far from the camera, prioritizing larger input image sizes over wider models may be more beneficial. However, this critical design consideration could be obscured when solely relying on high-level metrics like overall test accuracy. The distance benchmark in Wake Vision effectively unveils the disproportionate impact of image size on model performance for distant subjects, enabling more informed decision-making during model optimization.

C.2 Quantization

Quantization is a crucial technique for deploying efficient TinyML models, offering substantial benefits in terms of reduced latency, memory footprint, and model size. However, prior work has suggested that quantization can disproportionately impact the performance of models on underrepresented subsets of data [13]. To assess the implications of quantization in the context of person detection, we investigate the impacts of int8 quantization on a model’s benchmark results across Wake Vision’s fine-grained benchmarks.

Our findings show negligible degradation in performance across all benchmarks ( $\pm 0.004$ F1) when employing int8 quantization, even on outlier sets. This result contradicts the previously observed disproportionate impact of quantization on underrepresented subsets. We speculate that person detection may be a relatively simple task, potentially explaining why we do not observe this specific property of quantization in our experiments. Given the negligible performance degradation and the substantial latency, memory, and model size benefits of quantization, we conclude that quantization is a win for person detection.

C.3 Grayscale

Converting a model’s input image channels from RGB to grayscale is a commonly employed optimization in the TinyML field [2], as it can substantially reduce a model’s memory consumption. We observed, however, that the grayscale optimization disproportionately impacts images on the brighter end of the spectrum, as illustrated in Figure 6. This further demonstrates the importance of fine-grained analysis, as some real-world deployment environments might be far brighter than the average Wake Vision test sample.

Appendix D Inducing Label Errors

The goal is to make a single pass through a binary classification dataset and flip labels such that the expected label error rate is $d$. The dataset has an underlying label error rate $e$. If we flip one of these underlying errors, we correct it, thereby inadvertently decreasing the label error rate. After flipping each label with probability $p$, a label is correct either if we flipped it and it was originally an error, or if we did not flip it and it was originally correct: $1-d=p\cdot e+(1-p)(1-e)$. Solving for $p$ gives $p=(e-d)/(2e-1)$. A current flaw of this method is that the injected label errors are not consistent between epochs, which is likely less destructive to a model’s accuracy, since the same errors are not reinforced each epoch. This could also explain why large models in Fig. 3 don’t overfit on training data with higher label errors, as the inconsistent label noise has a regularizing effect.
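The flip probability and the single-pass flipping step can be sketched as follows. This is a minimal illustration of the formula above, not the code used for the experiments; the function names and the example error rates are hypothetical.

```python
import random


def flip_probability(e, d):
    """Per-label flip probability p so that the expected error rate becomes d.

    e: underlying label error rate, d: desired label error rate.
    Derived from 1 - d = p * e + (1 - p) * (1 - e).
    """
    if e == 0.5:
        raise ValueError("p is undefined when e == 0.5")
    p = (e - d) / (2 * e - 1)
    if not 0.0 <= p <= 1.0:
        raise ValueError("no valid flip probability for this (e, d) pair")
    return p


def induce_label_errors(labels, e, d, rng):
    """Flip each binary label (0 or 1) independently with probability p."""
    p = flip_probability(e, d)
    return [1 - y if rng.random() < p else y for y in labels]


# Hypothetical example: raise a 7% underlying error rate to 15%.
p = flip_probability(e=0.07, d=0.15)
```

Because the flips are drawn afresh on every call, calling `induce_label_errors` once per epoch reproduces the inconsistency between epochs noted above; applying it once and caching the result would keep the injected errors fixed.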

Appendix E Fine Grained Benchmark Images

Figure 7 gives example images for each of the benchmarks in the suite.

Figure 7: Images from each fine-grained benchmark dataset. Panels shown: Gender (Female, Male, Gender Unknown), Age (Young), Distance (Near), Lighting (Dark), and Depictions (Person, Non-Person, No Depiction).

Appendix F Open Image Label Distribution

Wake Vision is derived from Open Images V7; therefore, the diversity of subjects in the images should follow the distribution of labels in Open Images. The label distribution of Open Images can be found here: https://storage.googleapis.com/openimages/web/factsfigures_v7.html#statistics

Appendix G Manual Labeling

The validation and test sets were manually labeled through the crowd-sourced labeling platform Scale Rapid (https://scale.com/). Figure 8 shows a screenshot of the labeling menu. The labelers were instructed to label an image as a "person" if a person was present anywhere in the image and "no person" if no visible person was present. The labelers also indicated if the image contained a depiction of a person, which meant that the sample was labeled "no person" in Wake Vision, or a "picture of a picture/reflection", which we used for metadata.

The cost per image was $0.10 USD, set by Scale’s pricing structure. Each image was labeled by 3 different labelers to form a consensus. The total cost of labeling was $7089.8 USD. The authors do not know the hourly rate paid to each labeler, but Scale lists $18/hr on job postings at the time of writing. For context on the difficulty of the task, the authors were able to average around 500 images per hour at a reasonable pace when estimating error rates.

Appendix H Dataset Access and Organization

Wake Vision is available online at the Harvard Dataverse, via TFDS, or Hugging Face Datasets. More information can be found on the https://wakevision.ai webpage.

The dataset is organized as a set of compressed tar files containing images, and a series of label CSVs. The label CSVs are organized such that the file name of the image is the identifying index. CSVs for all dataset splits include the person and depiction label. The Validation and Test label CSVs also have flags that denote a sample’s inclusion into a fine-grain benchmark set (e.g. Distance-Near). This structure makes it easy to access just the required data without requiring a full download. It also ensures the dataset can be easily updated as new versions are introduced.
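As an illustration of this layout, the following sketch indexes a label CSV by image file name and selects the samples belonging to one fine-grained benchmark. The column names (`filename`, `person`, `depiction`, `distance_near`) and the sample rows are assumptions made for this example; the actual CSV headers may differ.

```python
import csv
import io

# Hypothetical CSV excerpt; column names and rows are illustrative only.
SAMPLE_CSV = """filename,person,depiction,distance_near
img_0001.jpg,1,0,1
img_0002.jpg,0,0,0
img_0003.jpg,1,0,0
img_0004.jpg,0,1,0
"""


def load_labels(csv_text):
    """Index label rows by image file name, converting flags to integers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["filename"]: {k: int(v) for k, v in row.items() if k != "filename"}
        for row in reader
    }


def benchmark_subset(labels, flag):
    """Return the file names whose benchmark flag is set."""
    return sorted(name for name, row in labels.items() if row[flag] == 1)


labels = load_labels(SAMPLE_CSV)
near_samples = benchmark_subset(labels, "distance_near")
```

Since the labels are stored separately from the image archives, a subset such as `near_samples` can be determined from the CSV alone before any images are fetched.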

Appendix I Label Generation Details

Person Labels. The most straightforward Open Images label classes to label as person in Wake Vision are the "person" label and its subcategories (listed in Appendix L). All of these are relabelled as persons in Wake Vision. These labels are present as both image-level labels and bounding box labels.

We furthermore inspect the image-level label classes for synonyms and umbrella terms for all the person-related labels. This search resulted in an additional six person-related label classes. These label classes are only used in the image-level label configuration.

Person Body Part Labels. Body parts are more challenging to relabel, as whether a body part should be considered a person depends on the use case.

For example, a camera that detects whether a person is inside a room to decide if the light should be switched on would want to consider body parts as a person, as this will keep the lights on even when the person is only partly in the camera frame. For waking up electronics, however, it may not make sense to consider body parts as persons. This could, e.g., mean that a computer would turn on when detecting a foot.

To cater to both use cases, we include a flag in our open-source dataset creation code to set whether body parts should be considered persons. By default we consider body parts as persons.

Depictions. Open Images bounding box labels contain metadata about whether an object is a depiction, e.g., a painting or a photograph of a person. This presents the challenge of how to handle depictions. While most use cases would not consider a depiction a person, it could make sense to either exclude them to make training easier, or include them as non-persons to make a model resistant to seeing depictions when deployed.

In line with how we handle body parts, we therefore include a flag in our open-source dataset creation code to set whether depictions should be excluded or considered non-persons. By default depictions are considered non-persons.

Bounding Box Size. The VWW dataset only considered a Common Objects in Context (COCO) person to be a person if the bounding box around the person took up at least 5% of the image [5]. If the person took up less than 5% of the image, the image was excluded from the dataset. To make our dataset work as a drop-in replacement for VWW, we adopt the same defaults in Wake Vision. For different requirements, users can change a configuration parameter in our open-source dataset creation code.
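The depiction and bounding box size rules described in this appendix can be sketched as a small labeling function. This is a simplified illustration under stated assumptions, not the dataset creation code: the `Box` type, the function and parameter names, and the order in which the rules are applied are hypothetical, body parts are fixed at their default (treated as persons), and the excluded body part classes of Appendix L are not modeled.

```python
from dataclasses import dataclass

# Person classes under the default configuration (body parts count as persons).
PERSON_CLASSES = {"Person", "Woman", "Man", "Girl", "Boy",
                  "Human body", "Human face", "Human head"}


@dataclass
class Box:
    label: str
    area_fraction: float  # bounding box area divided by image area
    is_depiction: bool = False


def label_image(boxes, min_area=0.05, exclude_depictions=False):
    """Return 'person', 'non-person', or 'excluded' for one image."""
    candidates = [b for b in boxes if b.label in PERSON_CLASSES]
    real = [b for b in candidates if not b.is_depiction]
    if any(b.area_fraction >= min_area for b in real):
        return "person"
    if real:
        # Only persons below the size threshold: drop the image.
        return "excluded"
    if candidates and exclude_depictions:
        # Only depictions of persons, and the flag asks to drop them.
        return "excluded"
    return "non-person"
```

With the defaults, `label_image([Box("Person", 0.01)])` yields `"excluded"`, while an image containing only a depicted person is labeled `"non-person"` unless `exclude_depictions` is set.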

Appendix J Standardized Evaluation

VWW has no standardized way to evaluate performance on the dataset. This makes it challenging to compare works based on the dataset, since performance differences can come down to choices outside the contribution of the work. For example, two works could both contribute model improvements but use different pre-processing pipelines that skew the results.

To allow for both data-centric and model-centric improvements, we provide a standard model for data-centric contributions, and a standard pre-processing pipeline for model-centric contributions. Both types of contributions are expected to use accuracy as the primary metric for the overall test and validation set, and F1-score as the primary metric for the fine-grained benchmarks introduced in Section 4.

Therefore, for data-centric contributions we propose to use a MobileNetV2 model with a width multiplier (also known as alpha) of 0.25 [26]. For model-centric contributions we propose the following pre-processing pipeline:

1. Cast image pixel value datatype from 8-bit integers into 32-bit floating points
2. Resize image such that the shortest side matches the model input size
3. Perform a center crop on the image such that the longest side matches the model input size
4. Normalize pixel values to between -1 and 1 sample-wise
5. Use image tensor as input features and person label as target feature
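The pipeline above can be sketched in plain Python as follows. This is a minimal illustration rather than the reference implementation: images are nested lists instead of tensors, the resize uses nearest-neighbour sampling, and the normalization uses the fixed mapping `v / 127.5 - 1`; the interpolation method and the exact normalization are assumptions, as are the function names.

```python
def preprocess(image, size):
    """Map an H x W x C image of 8-bit ints to a size x size x C float image."""
    # 1. Cast to floating point (Python floats are 64-bit; a tensor
    #    library would use float32 here).
    img = [[[float(v) for v in px] for px in row] for row in image]
    h, w = len(img), len(img[0])

    # 2. Resize so the shortest side equals `size` (nearest-neighbour,
    #    an assumption).
    scale = size / min(h, w)
    new_h = max(size, round(h * scale))
    new_w = max(size, round(w * scale))
    resized = [
        [img[min(h - 1, int(r * h / new_h))][min(w - 1, int(c * w / new_w))]
         for c in range(new_w)]
        for r in range(new_h)
    ]

    # 3. Center crop so the longest side also equals `size`.
    top, left = (new_h - size) // 2, (new_w - size) // 2
    cropped = [row[left:left + size] for row in resized[top:top + size]]

    # 4. Normalize each sample to [-1, 1] (fixed scaling, an assumption).
    return [[[v / 127.5 - 1.0 for v in px] for px in row] for row in cropped]


def make_example(image, person_label, size):
    """5. Pair the image tensor (features) with the person label (target)."""
    return preprocess(image, size), person_label
```

In practice an image or tensor library would perform these steps far more efficiently; the sketch only makes the order of operations explicit.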

Appendix K Open Images Download

The full Open Images v7 dataset is not hosted by the dataset authors. Rather, it is provided as a collection of Flickr image Uniform Resource Locators (URLs) and their associated labels. Unfortunately, this means the dataset is not static over time, as image owners can delete their images from the Flickr platform. Consequently, we were only able to download a subset of the original Open Images v7 dataset, as shown in Table 6.

Table 6: Number of images downloaded from Open Images v7. Download occurred between the 28th of November and the 5th of December.

	Train	Validation	Test
Downloaded	7,936,979	36,406	109,305
Errors	1,055,669	5,214	16,131

Appendix L Person Label Classes

We consider the following Open Images v7 labels to be a person for the Wake Vision dataset:

• Person
• Woman (Subcategory of Person)
• Man (Subcategory of Person)
• Girl (Subcategory of Person)
• Boy (Subcategory of Person)
• Human body (Part of Person)
• Human face (Part of Person)
• Human head (Part of Person)
• Human (Person synonym - Only in Image Level Label Configuration)
• Female person (Woman synonym - Only in Image Level Label Configuration)
• Male person (Man synonym - Only in Image Level Label Configuration)
• Child (Umbrella term for Girl & Boy - Only in Image Level Label Configuration)
• Adolescent (Umbrella term for Girl & Boy - Only in Image Level Label Configuration)
• Youth (Umbrella term for Girl & Boy - Only in Image Level Label Configuration)

Images containing the following Open Images v7 labels and no other person-related labels are excluded from the Wake Vision dataset:

• Human eye (Part of Person)
• Skull (Part of Person)
• Human mouth (Part of Person)
• Human ear (Part of Person)
• Human nose (Part of Person)
• Human hair (Part of Person)
• Human hand (Part of Person)
• Human foot (Part of Person)
• Human arm (Part of Person)
• Human leg (Part of Person)
• Beard (Part of Person)

Appendix M Wake Vision Dataset Size

Table 7: Number of images in the Wake Vision dataset

	Person Images	Non-Person Images	Excluded
Wake Vision (Large) Training Dataset	2,880,214	2,880,214	2,176,551
Wake Vision (Quality) Training Dataset	624,115	624,115	6,688,749
Validation Dataset	9,291	9,291	17,824
Test Dataset	27,881	27,881	53,543

Appendix N Automatic Label Correction

To address the challenge of correcting Wake Vision validation and test labels we initially employed the Confident Learning technique [23] to intelligently identify potential label errors. We selected Confident Learning as prior work has demonstrated its capability to find label errors in large datasets [24]. Confident Learning flagged suspected mislabeled instances, which we inspected and corrected through a manual verification process.
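The core idea of Confident Learning can be sketched as follows: each class receives a confidence threshold equal to the mean predicted probability of that class over the samples carrying that label, and a sample is flagged when a different class than its given label is predicted above that class's threshold. This is a simplified illustration of the technique in [23], not the pipeline used for Wake Vision; the function name and the toy predictions are hypothetical, and every class is assumed to appear at least once among the labels.

```python
def flag_suspected_errors(labels, pred_probs):
    """Return indices whose given label disagrees with a confident prediction.

    labels: list of integer class ids.
    pred_probs: list of per-class probability lists, ideally predicted
    out-of-sample (e.g., via cross-validation).
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean probability of class j over samples labeled j.
    thresholds = []
    for j in range(n_classes):
        probs_j = [p[j] for y, p in zip(labels, pred_probs) if y == j]
        thresholds.append(sum(probs_j) / len(probs_j))

    suspected = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if confident:
            guess = max(confident, key=lambda j: p[j])
            if guess != y:
                suspected.append(i)
    return suspected


# Toy example: sample 2 is labeled "person" (1) but predicted "no person" (0).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_probs = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1],
             [0.8, 0.2], [0.9, 0.1], [0.7, 0.3]]
suspected = flag_suspected_errors(toy_labels, toy_probs)
```

The flagged indices are only suggestions and still require manual inspection before a label is changed.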

Table 8: Number of label errors identified and corrected using Confident Learning.

Dataset Split	Total Size	Suggested Errors	Corrected Errors	Suggestion Accuracy
Validation	18,582	632	81	12.82%
Test	51,282	1,672	267	15.97%

As shown in Table 8, the Confident Learning process identified a large number of possible label errors in Wake Vision’s validation and test sets. However, only about 13% to 16% of these possible label errors were legitimate errors. We corrected these identified label errors in the Wake Vision validation and test sets.

Given this low acceptance rate of label issues identified by Confident Learning, we concluded that we could not automate label cleaning to a point where no human in the loop is needed. This made the strategy too human-intensive to be applied to the much larger training sets.

To further correct label errors in the final Wake Vision validation and test sets, we employed the Scale AI platform to crowd-source manual label corrections. These label corrections are described in Section 3.2 and supersede the automatic label corrections described in this section.