
Publications
Title: Efficient Data Sampling in Heterogeneous Peer-to-Peer Networks
Abstract: Performing data-mining tasks such as clustering, classification, and
prediction on large datasets is arduous and, given current hardware
limitations, often infeasible. The
distributed nature of peer-to-peer databases further complicates
this issue by introducing an access overhead cost in addition to the
cost of sending individual tuples over the network. We propose a
two-level sampling approach focusing on peer-to-peer databases for
maximizing sample quality given a user-defined communication budget.
Given that individual peers may have varying cardinality, we propose
an algorithm for determining the optimal sample rate (the percentage
of tuples to sample from a peer) for each peer. We do this by
analyzing the variance of individual peers, ultimately minimizing
the total variance of the entire sample. By performing local
optimization of individual peer sample rates, we maximize the
approximation accuracy of the samples. We also offer several
techniques for sampling in peer-to-peer databases, depending on how
much information about the network and its peers is known.
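
The abstract does not spell out the allocation algorithm, so the following is only a minimal sketch of the general idea of variance-driven sample rates. It swaps in classical Neyman allocation from stratified sampling, treating each peer as a stratum, and it assumes the communication budget is expressed simply as a total number of tuples (the per-peer access overhead mentioned above is ignored). The function name `neyman_sample_rates` and its parameters are hypothetical and do not come from the paper.

```python
def neyman_sample_rates(sizes, stddevs, budget):
    """Sketch: per-peer sample rates via Neyman allocation.

    sizes   -- number of tuples held by each peer (assumed known)
    stddevs -- standard deviation of the target attribute on each peer
               (assumed known or estimated from a pilot sample)
    budget  -- total number of tuples that may be sampled (assumption:
               the budget is counted in tuples, with no access overhead)

    Returns a list of sample rates in [0, 1], one per peer.
    """
    if len(sizes) != len(stddevs):
        raise ValueError("sizes and stddevs must have the same length")

    # Neyman allocation: tuples drawn from peer h are proportional to N_h * S_h.
    weights = [n * s for n, s in zip(sizes, stddevs)]
    total_weight = sum(weights)
    total_size = sum(sizes)

    if total_size == 0:
        return [0.0] * len(sizes)

    if total_weight == 0:
        # No variance anywhere: fall back to one uniform rate for all peers.
        return [min(1.0, budget / total_size)] * len(sizes)

    rates = []
    for n, w in zip(sizes, weights):
        if n == 0:
            rates.append(0.0)
            continue
        allocated = budget * w / total_weight
        # A peer cannot supply more tuples than it holds. Capping at 1.0
        # may leave part of the budget unused; this sketch does not
        # redistribute the remainder.
        rates.append(min(1.0, allocated / n))
    return rates


if __name__ == "__main__":
    # Two equally sized peers; the second has three times the spread,
    # so it receives three times the sample rate.
    print(neyman_sample_rates([1000, 1000], [1.0, 3.0], 400))
```

Under these assumptions, peers with higher variance receive proportionally higher sample rates, which is the intuition behind minimizing the total variance of the combined sample. A fuller treatment would also charge the fixed access cost of contacting each peer against the budget.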