Mardi 21 janvier 2014, 14h
Batiment Galera, salle 127
Quality and Price of Data
Ruiming Tang – National University of Singapore
In data marketplaces, people clean data, buy and sell data, and collect data. In this talk, we study quality and price of data. More specifically, we study three topics. The first topic is how to improve data quality by conditioning. The second topic is how to sell data according to a proposed price. The third topic is how people buy data, i.e., define price of a query and propose algorithms to compute the price of a query.
In order to improve data quality (accuracy) by adding constraint or information, we study the conditioning problem. We propose a framework for representing conditioned probabilistic relational data. Conditioning is the formalization of the process of adding knowledge to a database. Some worlds may be impossible given the constraints and the probabilities of possible worlds are accordingly re-defined. The new constraints can come from the observation of the existence or non-existence of a tuple, from the knowledge of a specific rule, such as the existence of an exclusive set of tuples, or from the knowledge of a general rule, such as a functional dependency. We are therefore interested in computing a concise representation of the possible worlds and their respective probabilities after the addition of new constraints, namely an equivalent probabilistic database instance without constraints after conditioning. We devise and present a general algorithm for this computation. Unfortunately, the general problem involves the simplification of general Boolean expressions and is NP-hard. We therefore identify specific practical families of constraints for which we devise and present efficient algorithms.
We study the relationship between quality and price of data. We proposed a theoretical and practical pricing framework for a data market in which data consumers can trade data quality for discounted prices. In most data markets, prices are prescribed and accuracy is determined by the data. Instead, we consider a model in which accuracy can be traded for discounted prices: “what you pay for is what you get”. The data market model consists of data consumers, data providers and data market owners. The data market owners are brokers between the data providers and data consumers. A data consumer proposes a price for the data that she requests. If the price is less than the price set by the data provider, then she gets an approximate value. The data market owners negotiate the pricing schemes with the data providers. They implement these schemes for the computation of the discounted approximate values. We propose a theoretical and practical pricing framework with its algorithms for the above mechanism. In this framework, the value published is randomly determined from a probability distribution. The distribution is computed such that its distance to the actual value is commensurate to the discount. The published value comes with a guarantee on the probability to be the exact value. The probability is also commensurate to the discount. We present and formalize the principles that a healthy data market should meet for such a transaction. We define two ancillary functions and describe the algorithms that compute the approximate value from the proposed price using these functions. We prove that the functions and the algorithm meet the required principles.
We study the price of queries for cases that data consumers request for data in forms of queries. We propose a generic data pricing model that is based on minimal provenance, i.e. minimal sets of tuples contributing to the result of a query. We show that the proposed model fulfills desirable properties such as contribution monotonicity, bounded-price and contribution arbitrage-freedom. We present a baseline algorithm to compute the exact price of a query based on our pricing model. We show that the problem is NP-hard. We therefore devise, present and compare several heuristics. We conduct a comprehensive experimental study to show their effectiveness and efficiency.