It is frustrating: even with petabytes of data on a Hadoop cluster, you can still lack the key data needed for a wide variety of big data analytic use cases. You might have billions of clicks on your web site, but only a few users choose to rate a product. There might be millions of text documents on your cluster, but it is too expensive to have someone categorize more than a tiny fraction of them. In principle, this is where predictive modeling could help. For instance, one could learn a model that predicts user ratings, so product recommendations can be driven by those expected ratings. Or, one could build a model that automatically categorizes text documents, saving countless hours and dollars. The main problem is that the amount of training material (i.e., user ratings, categorized documents) is limited, which makes it hard to build good models.
As it turns out, recent research in machine learning has found a way to deal effectively with such situations: a family of techniques called semi-supervised learning. These techniques can often leverage the vast amount of related but unlabeled data to produce accurate models. In this talk, we will give an overview of the most common techniques, including co-training and regularization-based methods. We first explain the principles and underlying assumptions of semi-supervised learning, and then show how to implement such methods with Hadoop.
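To make the core idea concrete, here is a minimal sketch of self-training, one of the simplest semi-supervised techniques: a classifier trained on the small labeled set assigns pseudo-labels to unlabeled points it is confident about, and those points are folded back into the training set. The classifier below (a nearest-neighbor rule with a distance-margin confidence) and the `self_train` function, threshold, and toy data are all illustrative assumptions, not part of the talk's actual implementation.

```python
import numpy as np

def self_train(X_lab, y_lab, X_unl, threshold=0.85, max_iter=10):
    """Self-training sketch: iteratively pseudo-label unlabeled points.

    Confidence for a point is a margin between its distance to the nearest
    example of the best class and of the runner-up class. Points that never
    reach the threshold keep the label -1. All details here are illustrative.
    """
    X_lab = np.asarray(X_lab, dtype=float)
    y_lab = np.asarray(y_lab)
    X_unl = np.asarray(X_unl, dtype=float)
    classes = np.unique(y_lab)
    pseudo = np.full(len(X_unl), -1)  # -1 means "still unlabeled"

    for _ in range(max_iter):
        added = False
        for i, x in enumerate(X_unl):
            if pseudo[i] != -1:
                continue
            # Distance to the nearest example of each class, counting both
            # the original labeled data and previously pseudo-labeled points.
            dists = []
            for c in classes:
                pts = np.concatenate([X_lab[y_lab == c], X_unl[pseudo == c]])
                dists.append(np.min(np.linalg.norm(pts - x, axis=1)))
            order = np.argsort(dists)
            best, second = dists[order[0]], dists[order[1]]
            conf = 1.0 - best / (best + second)  # margin-style confidence
            if conf >= threshold:
                pseudo[i] = classes[order[0]]
                added = True
        if not added:  # no new confident points; stop early
            break
    return pseudo

# Toy 1-D example: two labeled points anchor two classes, and the
# unlabeled points between them are pseudo-labeled by proximity.
labels = self_train([[0.0], [10.0]], [0, 1], [[1.0], [2.0], [9.0], [8.0]])
```

On a Hadoop cluster the same loop would be expressed as iterated MapReduce passes: each pass scores the unlabeled records in parallel and emits the newly confident pseudo-labels for the next round.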