Description

Random algorithms and probabilistic data structures are algorithmically efficient and can provide shockingly good practical results. I will give a practical introduction, with live demos and bad jokes, to this fascinating algorithmic niche. I will conclude with some discussions of how our group has applied this to large sequencing data sets (although this will not be the focus of the talk).

Abstract:

I propose to start with Python implementations of most of the DS & A mentioned in this excellent blog post:

http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

and also discuss skip lists and any other random algorithms that catch my fancy. I'll put everything together in an IPython notebook and add visualizations as appropriate.

I'll finish with some discussion of how we've put these approaches to work in my lab's research, which focuses on compressive approaches to large data sets (and is regularly featured in my Python-ic blog, http://ivory.idyll.org/blog/).

Misc talk links

Slides

Github repo with IPython Notebooks in it

Video link

Overviews and linkfoo

Wikipedia's category page for Probabilistic Data Structures

The Highly Scalable Blog on Probabilistic Data Structures for Web Analytics and Data Mining

Specific References

SkipLists:

skiplist IPython Notebook

Wikipedia page on SkipLists

John Shipman's excellent writeup

William Pugh's original article

The HackerNews (oops!) reference for my reddit-attributed quote about putting a gun to someone's head and asking them to write a log-time algorithm for storing stuff: https://news.ycombinator.com/item?id=2670632