Joyce Whang - Scalable Data-driven PageRank and Non-exhaustive, Overlapping Co-clustering
Wednesday, August 23, 2017 2:00 PM - 3:00 PM EDT
For the first part of my talk, I will talk about a scalable data-driven PageRank. Large-scale network and graph analysis has received considerable attention recently. Graph mining techniques often involve an iterative algorithm, which can be implemented in a variety of ways. Using PageRank as a model problem, we look at three algorithm design axes: work activation, data access pattern, and scheduling. We investigate the impact of different algorithm design choices. Using these design axes, we design and test a variety of PageRank implementations finding that data-driven, push-based algorithms are able to achieve more than 28x the performance of standard PageRank implementations (e.g., those in GraphLab). The design choices affect both single-threaded performance as well as parallel scalability. The implementation lessons not only guide efficient implementations of many graph mining algorithms, but also provide a framework for designing new scalable algorithms.
For the second part of the talk, I will talk about non-exhaustive, overlapping co-clustering. The goal of co-clustering is to simultaneously identify a clustering of the rows as well as the columns of a two dimensional data matrix. Most existing co-clustering algorithms are designed to find pairwise disjoint and exhaustive co-clusters. However, many real-world datasets might contain not only a large overlap between co-clusters but also outliers which should not belong to any co-cluster. We formulate the problem of Non-Exhaustive, Overlapping Co-Clustering where both of the row and column clusters are allowed to overlap with each other and the outliers for each dimension of the data matrix are not assigned to any cluster. To solve this problem, we propose an intuitive objective function, and develop an efficient iterative algorithm which we call the NEO-CC algorithm. We theoretically show that the NEO-CC algorithm monotonically decreases the proposed objective function. Experimental results show that the NEO-CC algorithm is able to effectively capture the underlying co-clustering structure of real-world data, and thus outperforms state-of-the-art clustering and co-clustering methods.