Apache Hadoop India Summit 2011 Keynote talk "Hadoop & the Future of Cloud Computing" by Todd Papaioannou
1. Hadoop & the future of Cloud Computing, by Todd Papaioannou, VP, Cloud Architecture
2. what's happening: more publicly available human-generated content; more interactions being tracked (e.g. clickstream data); more business processes being digitized; more history being kept = the Data Exhaust! Big Data is here! (Photo: Flickr, sub_lime79)
3. CUTTING THROUGH THE NOISE: Location, Social Relationships, Science, Understanding User Interests (word-cloud graphic: access, audience, blogs, communication, computer, internet, mass media, people, networking, technology; Photo: Flickr, Lomo-Cam)
4. turning data into insights: machine learning, time series, logistic regression, content clustering algorithms, factorization models; ad inventory modeling; user interest prediction (Photo: Flickr, NASA Goddard Photo and Video)
10. Adoption -> Investment: mainstream / enterprise adoption funds further development and enhancements
11. HADOOP IS GOING MAINSTREAM (adoption timeline, 2007-2010; source: The Datagraph Blog)
12. hadoop at yahoo! "Where Science meets Data". Products: Data Analytics, Content Optimization, Content Enrichment, Yahoo! Mail Anti-Spam, Advertising Products, Ad Optimization, Ad Selection, Big Data Processing & ETL. Data: dimensional data, content data, data pipelines. Hadoop clusters: tens of thousands of servers. Applied science: user interest prediction, ad inventory prediction, machine learning for search ranking, ad targeting, and spam filtering.
The web is changing. It's always evolving and changing. This evolution is about people-powered experiences and transient, unstructured data. My 16-year-old writes. He deletes. He retweets. In fact, a ton of the data on the web today is transient data. It exists for a moment and then it's gone. It's comments on Facebook, emails, content alerts, messenger updates, blogs, Twitter feeds. In fact, only 5% of the information created in the world today is "structured".
Yahoo!'s role has always been to cut through the noise and help people find what they want. We do that in many ways – primarily with deep science and insights, all relying on Hadoop. From curating people’s relationships to get more meaning out of them, to understanding their interests and their location, to adding a complex layer of science on top of all that – Hadoop’s right at the core of making all of that possible.
Turning data into insights isn't trivial. It's heavy lifting. It’s analysis and refinement of raw, unstructured information. It's also deep, best-in-class technology and science, and applying and improving this science is one of the things we do best at Yahoo! – using a variety of techniques as you see listed here.
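To make one of the techniques listed on that slide concrete, here is a minimal, self-contained sketch of logistic regression for click prediction, trained by plain gradient descent. The features, data, and learning rate are invented for illustration; this is not Yahoo!'s actual modeling code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(rows, labels, lr=0.5, epochs=200):
    """Stochastic gradient descent on log-loss; rows are feature vectors."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of log-loss w.r.t. the linear score
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Hypothetical features: [viewed_sports_story, viewed_finance_story]
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]  # did the user click a sports ad?
w, b = train_logreg(X, y)
p_sports = sigmoid(w[0] * 1 + w[1] * 0 + b)   # high predicted click probability
p_finance = sigmoid(w[0] * 0 + w[1] * 1 + b)  # low predicted click probability
```

In practice a model like this would be trained over Hadoop on billions of events rather than four toy rows, but the per-example update is the same.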
Yahoo! has made investments in Hadoop that have enabled us to add much more relevance to our data, enrich it, extract insights, and deliver relevant, personalized content and experiences to our consumers. These same investments help deliver the right audiences to our advertisers. As a result of delivering that highly relevant experience to 600 million users around the world, Yahoo!’s one of the most trusted brands on the Internet.
Hadoop delivers huge value to Yahoo! by enabling the important stuff we do with all of our big data. Without it, we simply couldn’t deliver the engaging consumer experiences and advertiser value the way we do today. With Hadoop, we get the disruptive ability to rapidly innovate by customizing, personalizing and fusing people’s individual worlds with the Web at large, in a way no other company can today.
With 600 million people visiting Yahoo!, 11 billion times a month, generating 98 billion page views, Yahoo! is a leader in many categories, and people trust us to give them a great experience and show them what’s most interesting and relevant to them. Behind every click, we’re using Hadoop to optimize what you see on Yahoo.com. We serve about 3 million different versions of the Today Module every 24 hours. Hadoop allows us to analyze story clicks by applying machine learning so we can figure out what you like and give you more of it. Every click a person makes on our homepage – that’s around half a billion clicks per day – results in multiple personalized rankings being computed, each completing in less than 1/100th of a second. Within ~7 minutes of a user clicking on a story, our entire ranking model is updated. Our Content Optimization Engine creates a real-time feedback loop for our editors. They can serve up popular stories and pull out unpopular stories, based on what the algorithm is telling them in real time. Our modeling techniques help us deeply understand the content and eliminate the guesswork, so we can actually predict a story’s relevance and popularity with our audience.
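The feedback loop described here can be sketched as a toy click-through-rate ranker: record every click and view, keep a smoothed CTR per story, and re-rank continuously. The class name, story IDs, and smoothing prior are hypothetical; the real Content Optimization Engine uses far richer models than a raw CTR estimate.

```python
from collections import defaultdict

class StoryRanker:
    """Toy real-time feedback loop: rank stories by smoothed CTR."""

    def __init__(self, prior_clicks=1.0, prior_views=10.0):
        # Beta-style smoothing so unseen stories start at a neutral prior CTR.
        self.prior_clicks = prior_clicks
        self.prior_views = prior_views
        self.clicks = defaultdict(int)
        self.views = defaultdict(int)

    def record(self, story, clicked):
        """Fold one impression (and optional click) into the model."""
        self.views[story] += 1
        if clicked:
            self.clicks[story] += 1

    def ctr(self, story):
        return ((self.clicks[story] + self.prior_clicks) /
                (self.views[story] + self.prior_views))

    def rank(self, stories):
        # Most promising story first; editors could demote the tail.
        return sorted(stories, key=self.ctr, reverse=True)

ranker = StoryRanker()
for _ in range(50):
    ranker.record("olympics", clicked=True)
for _ in range(50):
    ranker.record("budget", clicked=False)
top = ranker.rank(["budget", "olympics"])  # "olympics" ranks first
```

The key property is the same one the talk describes: every recorded click immediately shifts subsequent rankings, with no offline retraining step in the loop.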
Because of technologies like Hadoop and the rest of our Cloud platform, we’re learning and building faster and faster. It’s all about speed, innovation and real, substantial value to our business. At Yahoo, we’ve been using Hadoop across the company for the last five years, and I’ve shown you just a few examples. Based on our testing and experience, we believe Hadoop is now ready for mainstream enterprise use. We’ve deliberately chosen to invest in open source as the foundation of our cloud. Yahoo! is running the largest implementation of Hadoop in the world today.
An overview of the Hadoop ecosystem: Yahoo! employees, including Doug Cutting, initiated Apache Hadoop in 2005. Since then, the ecosystem has expanded.
Hadoop is at the center of our data ecosystem: every click, page view, and search. It is the foundation of our ad management & targeting systems, and of content enrichment (geolocation, category) used to customize content for users. Where science meets data: machine learning algorithm development for spam detection, ad targeting, and predicting user interest and ad inventory, plus research on ad effectiveness. It provides scale for Big Data: daily 120 TB, 3+ PB; 70+ PB of data in total -- and growing. Web data is growing at a CAGR of 60% -- 667 exabytes by 2013 (Cisco).
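Processing "every click, page view, search" is classic MapReduce work. Below is a minimal Hadoop Streaming-style mapper and reducer, simulated in-process so it runs without a cluster; the log format and page names are invented, and this is a sketch rather than Yahoo!'s production ETL.

```python
import itertools

def mapper(log_line):
    """Emit (page, 1) for each click record; format: 'user<TAB>page<TAB>action'."""
    user, page, action = log_line.split("\t")
    if action == "click":
        yield page, 1

def reducer(key, values):
    """Sum the counts for one page."""
    yield key, sum(values)

def run_job(log_lines):
    # Simulate the shuffle phase: sort mapper output by key, then group.
    pairs = sorted(kv for line in log_lines for kv in mapper(line))
    return dict(itertools.chain.from_iterable(
        reducer(k, (v for _, v in group))
        for k, group in itertools.groupby(pairs, key=lambda kv: kv[0])))

logs = [
    "u1\tnews\tclick",
    "u2\tnews\tclick",
    "u1\tsports\tview",
    "u3\tsports\tclick",
]
counts = run_job(logs)  # {'news': 2, 'sports': 1}
```

On a real cluster the same mapper and reducer functions would run as Hadoop Streaming tasks over HDFS input splits, with the framework performing the sort-and-group shuffle between them.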
We started developing Hadoop 5 years ago with a prototype 20-node cluster, and have had a dedicated team developing Hadoop ever since, focused on supporting Yahoo!'s needs, contributing Hadoop to Apache, and helping build the community. It started as research projects, progressed to applied science efforts supporting search and advertising products, then to production systems (ad targeting, content optimization). Now Hadoop usage has spread to all parts of our business. Hadoop is our Big Data infrastructure -- it provides agility with Big Data. In a recent study, 50% of enterprises said they were strongly considering Hadoop adoption, with agility cited as the number one reason.
People ask why we contribute to open source. Open source helps us avoid technological dead ends, lets us benefit from leveraging community contributions, and allows us to hire a workforce already trained in our technology. Open sourcing our Cloud components starts with Hadoop: Pig, the Yahoo! Distribution of Hadoop (with others being added), Yahoo! Traffic Server, and ZooKeeper. In addition, we benefit from external contributions: Hive, Apache Web Server, Xen.