Data provenance, a record of the origin and transformation of data, explains how output data is derived from input data. This dissertation explores the connection between provenance and uncertainty in two main directions: (1) how a succinct representation of provenance can help infer uncertainty in the input or the output, and (2) how introducing uncertainty can facilitate publishing provenance information while hiding associated private information.

A significant fraction of the data found in practice is imprecise, unreliable, or incomplete, and therefore uncertain. The level of uncertainty in the data must be measured and recorded in order to estimate the confidence in the results and to find potential sources of error. In probabilistic databases, uncertainty in the input is recorded as a probability distribution, and the goal is to efficiently compute the induced probability distribution on the outputs. In general, this problem is computationally hard, and we seek to expand the class of inputs for which efficient evaluation is possible by exploiting provenance structure.

In some scenarios, the output data is directly examined for errors and is labeled accordingly. We need to trace the errors in the output back to the input so that the input can be refined for future processing. Because of incomplete labeling of the output and the complexity of the processes generating it, the sources of error may be uncertain. We formalize the problem of source refinement, and propose models and solutions using provenance that can handle incomplete labeling. We also evaluate our solutions empirically for an application of source refinement in information extraction.

Data provenance is extensively used to help understand and debug scientific experiments that often involve proprietary and sensitive information. In this dissertation, we consider the privacy of proprietary and commercial modules when they belong to a workflow and interact with other modules. We propose a model for module privacy that makes the exact functionality of the modules uncertain by selectively hiding provenance information. We also study the optimization problem of minimizing the information hidden while guaranteeing a desired level of privacy.
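
The probabilistic-database setting mentioned above can be illustrated with a small sketch. It is not the dissertation's algorithm: it assumes independent input tuples and a provenance (lineage) expression given in disjunctive normal form, and it computes the output probability by enumerating every possible world. The function name `output_probability`, the tuple identifiers, and the probabilities are all hypothetical. The enumeration is exponential in the number of input tuples, which is one way to see why efficient evaluation is possible only for restricted classes of inputs.

```python
from itertools import product


def output_probability(lineage, tuple_probs):
    """Probability that an output tuple exists, by brute-force enumeration.

    lineage     -- provenance in DNF: a list of clauses, each a set of input
                   tuple ids that must all be present (illustrative encoding).
    tuple_probs -- dict mapping each input tuple id to its probability;
                   tuples are assumed to be independent.
    """
    ids = sorted(tuple_probs)
    total = 0.0
    # Each possible world keeps or drops every input tuple: 2^n worlds.
    for world in product([False, True], repeat=len(ids)):
        present = {t for t, keep in zip(ids, world) if keep}
        weight = 1.0
        for t, keep in zip(ids, world):
            weight *= tuple_probs[t] if keep else 1.0 - tuple_probs[t]
        # The output exists in this world if some clause is fully present.
        if any(clause <= present for clause in lineage):
            total += weight
    return total


# Hypothetical example: the output is derived from (r1 AND s1) OR (r1 AND s2).
probs = {"r1": 0.5, "s1": 0.5, "s2": 0.5}
lineage = [{"r1", "s1"}, {"r1", "s2"}]
p = output_probability(lineage, probs)
print(p)
```

In this example the lineage factors as r1 AND (s1 OR s2), so the probability could also be computed directly as 0.5 × (1 − 0.5 × 0.5). Exploiting that kind of structure in the provenance, instead of enumerating worlds, is the sort of shortcut the abstract alludes to.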