arXiv Analytics

Sign in

arXiv:1711.02305 [cs.LG]AbstractReferencesReviewsResources

Finding Heavily-Weighted Features in Data Streams

Kai Sheng Tai, Vatsal Sharan, Peter Bailis, Gregory Valiant

Published 2017-11-07Version 1

We introduce a new sub-linear space data structure---the Weight-Median Sketch---that captures the most heavily weighted features in linear classifiers trained over data streams. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information. In contrast with related sketches that capture the most commonly occurring features (or items) in a data stream, the Weight-Median Sketch captures the features that are most discriminative of one stream (or class) compared to another. The Weight-Median sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We provide a theoretical analysis of this approach that establishes recovery guarantees in the online learning setting, and demonstrate substantial empirical improvements in accuracy-memory trade-offs over alternatives, including count-based sketches and feature hashing.

Related articles: Most relevant | Search more
arXiv:2411.04812 [cs.LG] (Published 2024-11-07)
Soft Hoeffding Tree: A Transparent and Differentiable Model on Data Streams
arXiv:2412.03611 [cs.LG] (Published 2024-12-04)
Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth
arXiv:1911.00847 [cs.LG] (Published 2019-11-03)
Weakly Supervised Deep Learning Approach in Streaming Environments