1. Leveraging Open Source Big Data Stack
Prasanth M Sasidharan
Copyright © 2011 Flytxt B.V. All rights reserved 1/16/2012
2. What is data?
Data is information in raw or unorganized form, such as letters,
numbers, or symbols.
What is Big Data?
Big Data refers to large datasets that are difficult to store, manage, and
analyze.
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the
data in the world today has been created in the last two years alone.
5. Big Data & Distributed Computing
Multiple servers, each working on part of the job and each performing the same task.
Key Challenges:
• Work distribution and orchestration
• Error recovery
• Scalability and management
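The pattern on this slide can be sketched on a single machine. This is a minimal illustration, not how Hadoop itself is implemented: threads stand in for servers, and the names `split`, `count_words`, `run_with_retry`, and `run_job` are hypothetical helpers chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Work distribution: cut the input into roughly equal chunks."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def count_words(chunk):
    """The same task that every worker runs on its own chunk."""
    return sum(len(line.split()) for line in chunk)


def run_with_retry(task, chunk, attempts=3):
    """Error recovery: rerun a failed chunk a few times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return task(chunk)
        except Exception:
            if attempt == attempts:
                raise


def run_job(data, workers=4):
    """Orchestration: fan chunks out to workers, then combine partial results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: run_with_retry(count_words, c), chunks))
    return sum(partials)


if __name__ == "__main__":
    lines = ["big data on many servers", "each server does the same task"] * 50
    print(run_job(lines))
```

Scaling this sketch means raising `workers`; on a real cluster the chunks would also have to be shipped to other machines and monitored there, which is the management challenge listed above.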
6. FOSS in Aadhaar
Aadhaar is a 12-digit unique number that the Unique Identification Authority
of India (UIDAI) will issue to all residents of India.
The number will be stored in a centralized database and linked to the basic
demographic and biometric information (photograph, ten fingerprints, and iris)
of each individual.
It is unique and robust enough to eliminate the large number of duplicate and
fake identities in government and private databases.
7. Let's Meet a Stack!
Application Layer
Infrastructure Layer
8. Infrastructure for Big Data Analysis
What’s Virtualization?
Virtualization allows multiple operating system instances to
run concurrently on a single computer; it is a means of separating
hardware from a single operating system.
9. What's a Hypervisor?
◦ A hypervisor, also called a virtual machine manager (VMM), is one of many hardware
virtualization techniques that allow multiple operating systems, termed guests, to
run concurrently on a host computer
◦ Originally developed in the 1970s as part of the IBM S/360
Xen® hypervisor
10. Advantages of FOSS
Flexibility and Freedom
Reliability
Auditability
Fast Deployment
Cost
11. Cost For Reproducing YouTube
                                              Capital Expenditures ($M)        Annual Expenses, ex. HW Support ($M)
System                                        Hardware   Software   Total      Staff   Support   Total
Oracle Exadata                                $147.4     $442.0     $589.4     $1.6    $97.4     $99.0
Alternative: open source, commodity hardware  $104.2     $0.0       $104.2     $2.2    $12.9     $15.1
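The totals in this comparison can be checked with a few lines of arithmetic, and the same figures can be projected over several years. The sketch below copies the slide's numbers. The three-year horizon and the function names are assumptions made for illustration, as is the simplification that annual expenses stay flat.

```python
# Figures in $M, copied from the slide's table.
COSTS = {
    "Oracle Exadata": {
        "hardware": 147.4, "software": 442.0, "staff": 1.6, "support": 97.4,
    },
    "Open source, commodity hardware": {
        "hardware": 104.2, "software": 0.0, "staff": 2.2, "support": 12.9,
    },
}


def capex(cost):
    """One-time capital expenditure: hardware plus software."""
    return cost["hardware"] + cost["software"]


def annual(cost):
    """Recurring yearly expense: staff plus support."""
    return cost["staff"] + cost["support"]


def total_cost(cost, years):
    """Capital expenditure plus `years` of flat annual expenses (an assumption)."""
    return capex(cost) + years * annual(cost)


if __name__ == "__main__":
    YEARS = 3  # hypothetical planning horizon, not taken from the slide
    for system, cost in COSTS.items():
        print(f"{system}: capex {capex(cost):.1f}, "
              f"annual {annual(cost):.1f}, "
              f"{YEARS}-year total {total_cost(cost, YEARS):.1f}")
```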
12. Get Involved!
Find out about Apache projects (http://projects.apache.org/)
Join mailing lists
Pick up a bug
Suggest ideas or fixes
Check out the latest code / download releases
Change the source files to incorporate your change or addition
Provide appropriate source code documentation and follow the project's
coding conventions
Check whether the software still compiles and runs correctly
Run any unit or regression tests the software may have
Send the patch for review and committing
13. Notable Users of Hadoop
(Source: http://en.wikipedia.org/wiki/Hadoop)
• Adobe
• Amazon
• AOL
• eBay
• Facebook
• Fox Interactive Media
• IBM
• Last.fm
• LinkedIn
• Meebo
• The New York Times
• Rackspace
• StumbleUpon
• Twitter
• Yahoo
14. Open Source Initiatives @ FlyTXT
Customization Specific to our business lines
Mahout Enhancements for additional Machine Learning Algorithms
Hive Customization
Oozie Enhancements
Hadoop Enhancements
We won the IEEE cloud computing challenge
15. THANK YOU
16. Extra Slides
19. Quantity of Global Data
Year    Exabytes
2005    130
2012    2,720
2015*   7,910
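The growth these figures imply can be worked out directly. The sketch below computes the overall growth factor and the compound annual growth rate between the years on the slide. The function names are hypothetical, and the 2015 value is the slide's starred figure, used here as given.

```python
# Global data volume in exabytes, as shown on the slide.
EXABYTES_BY_YEAR = {2005: 130, 2012: 2720, 2015: 7910}


def growth_factor(start_year, end_year):
    """How many times larger the data volume is at end_year than at start_year."""
    return EXABYTES_BY_YEAR[end_year] / EXABYTES_BY_YEAR[start_year]


def cagr(start_year, end_year):
    """Compound annual growth rate between two years, as a fraction."""
    years = end_year - start_year
    return growth_factor(start_year, end_year) ** (1 / years) - 1


if __name__ == "__main__":
    for start, end in [(2005, 2012), (2012, 2015)]:
        print(f"{start} to {end}: x{growth_factor(start, end):.1f} overall, "
              f"{cagr(start, end):.0%} per year")
```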
20. Numbers behind the News!!
Twitter produces over 230 million tweets per day
Wal-Mart is logging one million transactions per hour
Facebook creates over 30 billion pieces of content, including
web links, news, blogs, and photos
India's mobile subscription base stands at 873.61 million users
India has a population of 1.21 billion
21. Let's Meet the Big Data Stack
• Oozie: open-source workflow/coordination service to manage data processing jobs for Apache Hadoop™. Developed at Yahoo!
• HBase: column-store database based on Google's BigTable. Holds extremely large data sets (petabytes).
• Hive: SQL-based data warehousing application with features for analyzing very large data sets. Developed at Facebook.
• ZooKeeper: distributed consensus engine providing leader election, service discovery, and distributed locking / mutual exclusion.
• Pig: platform for analyzing large data sets, built around a high-level language for expressing data analysis steps.
• Ganglia: scalable distributed monitoring system for high-performance computing systems such as clusters and grids.
• Apache Mahout: free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform.
Editor's Notes
An exabyte is 1 billion gigabytes. 7,910 exabytes is three times more bits of information in the digital universe than stars in the physical universe.
Indian telecom added 7.9 million new subscribers in September. The Indian population figure can be related to the Aadhaar project.
A mahout is a person who drives an elephant. Example: the "catching a taxi from the airport" algorithm.