This document introduces the Birmingham Big Data Science Group (BIDS) and discusses big data and related technologies. It provides an overview of big data, large-scale distributed systems, NoSQL databases, and intelligent algorithms. Examples of prominent NoSQL database users and the Hadoop-based SMAQ stack are discussed. The document also covers next-generation systems beyond MapReduce/Hadoop and concludes that big data is a challenging and promising area.
1st Birmingham Big Data Science Group meetup
1. Welcome to the Birmingham Big Data Science Group (BIDS) Faizan Javed 5/25/2011 Intermark Group Sponsor: Intermark Group
2. BIDS Stats Founded April 10, 2011. 9 members (and counting). Founder: Faizan Javed, Co-Founder: Qasim Ijaz. Online presence: Meetup.com for coordinating meetups: http://www.meetup.com/bham-bids Also on (for related articles and announcements): LinkedIn: http://www.linkedin.com/groups/Birmingham-Big-Data-Science-Group-3865219 Facebook: http://www.facebook.com/home.php?sk=group_202221519811444
3. Agenda What is Big Data? Quick overview of related technologies: large-scale distributed systems and platforms; NoSQL data stores; intelligent algorithms, web mining, and information retrieval techniques; highly scalable systems.
4. What is Big Data? More people are connected to the internet. Social media explosion (Web 2.0): Facebook, Twitter, etc. Huge volumes of data are being collected from sensors, mobile devices, machine-to-machine communications, social media, and retail sites' web logs (for browsing patterns). "Big" in Big Data is relative: today's "big" is certainly tomorrow's "medium" and next week's "small." "Big Data" is when the size of the data itself becomes part of the problem. Going from gigabytes to petabytes! http://radar.oreilly.com/2010/06/what-is-data-science.html
6. Big Data, Big Numbers McKinsey report, May 2011: http://www.mckinsey.com/mgi/publications/big_data/index.asp
7. Why care about big data? Deep analysis of data can be a competitive advantage. With more data, it is easier to find consistent patterns. More data usually beats better algorithms. Ex 1: Predict customer preferences and target ads on an e-commerce website. Ex 2: Improve search quality. Ex 3: Bank risk modeling (aggregate customer activity from different lines of business). http://blog.mikepearce.net/2010/08/18/10-hadoop-able-problems-a-summary/ http://www.ft.com/intl/cms/s/0/64095dba-7cd5-11e0-994d-00144feabdc0.html#axzz1NHn8icSC Key point: "many different sources" and "unstructured data".
8. Big Players on the Big Data Scene The Government http://us1.campaign-archive1.com/?u=4cb4c08d876d7481bbc4bc70f&id=6889126aef
9. The need for new techniques Traditional "relational" techniques break down at scale. Solutions: NoSQL databases: Cassandra, HBase, Riak, etc. Large-scale distributed computing techniques that scale out on "commodity" hardware: MapReduce/Hadoop, Percolator, etc. Analytics platforms: IBM BigInsights, EMC Greenplum.
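12. Hadoop-based SMAQ stack http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Nested inside the job's driver class. Sums the counts emitted for each key.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            // Emit the key with its total count.
            context.write(key, new IntWritable(sum));
        }
    }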
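The reducer on this slide is only one phase of a MapReduce job. As a rough illustration of how map, shuffle, and reduce fit together, here is a minimal single-process sketch in plain Java with no Hadoop dependency. The class and method names (WordCountSketch, map, shuffle, reduce, wordCount) are hypothetical and exist only for this example; a real Hadoop job spreads these phases across many machines and is configured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of map -> shuffle -> reduce (not the Hadoop API).
final class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every whitespace-separated token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values for one key, mirroring the Hadoop reducer.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Run all three phases over the input lines and return word -> count.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(word, ones)));
        return counts;
    }

    public static void main(String[] args) {
        // Prints {big=2, data=1, numbers=1}
        System.out.println(wordCount(List.of("big data", "big numbers")));
    }
}
```

The map and reduce functions are independent per line and per key, which is what lets a framework such as Hadoop run them in parallel on commodity machines.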
13. Hadoop-based SMAQ stack Hadoop comes with HDFS (Hadoop Distributed File System). It can be used alongside various NoSQL systems (HBase is the most common).
14. Hadoop-based SMAQ stack
Pig (Yahoo):
    input = LOAD 'input/sentences.txt' USING TextLoader();
    words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
    grouped = GROUP words BY $0;
    counts = FOREACH grouped GENERATE group, COUNT(words);
    ordered = ORDER counts BY $0;
    STORE ordered INTO 'output/wordCount' USING PigStorage();
Hive (Facebook):
    INSERT OVERWRITE TABLE xyz_com_page_views
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01'
      AND page_views.date <= '2008-03-31'
      AND page_views.referrer_url like '%xyz.com';
15. Next-generation systems: going beyond MapReduce/Hadoop http://www.nytimes.com/external/gigaom/2010/10/23/23gigaom-beyond-hadoop-next-generation-big-data-architectu-81730.html Mostly Google and Yahoo innovations. Percolator: "real-time" (incremental) alternative to batch MapReduce. Powers Google's Caffeine web-indexing system. In-house Google. Dremel: a superfast "Hive" for interactive queries over large datasets. In-house Google. Pregel: highly efficient graph computing for analyzing social graphs. In-house Google; open-source implementations are available. Megastore: scalable NoSQL-like system with ACID semantics within a partition but weaker consistency across partitions. In-house Google. Next-gen Hadoop at Yahoo: enhanced scalability (clusters of more than 4,000 nodes), support for multiple programming paradigms, enhanced cluster utilization.
16. Intelligent Web & machine learning Recommendation systems, data/web mining, natural language processing. Recommendation systems: an application of collaborative filtering and information retrieval techniques. They use user profiles, ratings, and browsing habits to recommend items the user has not yet considered. First made famous in the commercial arena by Amazon.com.
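To make the idea of recommending "items not yet considered" concrete, here is a minimal sketch of user-based collaborative filtering in plain Java. It is one common textbook approach (cosine similarity between users' rating vectors), not the method used by Amazon.com or any other site named in these slides. The class name RecommenderSketch, the method names, and the sample users and items are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of user-based collaborative filtering (illustration only).
final class RecommenderSketch {

    // Cosine similarity between two users' sparse rating vectors (item -> rating).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Rank items the target user has not rated by the similarity-weighted
    // ratings of other users; users with no overlap contribute nothing.
    static List<String> recommend(String user, Map<String, Map<String, Double>> ratings) {
        Map<String, Double> mine = ratings.getOrDefault(user, Map.of());
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> other : ratings.entrySet()) {
            if (other.getKey().equals(user)) {
                continue;
            }
            double sim = cosine(mine, other.getValue());
            if (sim <= 0) {
                continue;
            }
            for (Map.Entry<String, Double> r : other.getValue().entrySet()) {
                if (!mine.containsKey(r.getKey())) {
                    scores.merge(r.getKey(), sim * r.getValue(), Double::sum);
                }
            }
        }
        List<String> items = new ArrayList<>(scores.keySet());
        items.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return items;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> ratings = Map.of(
            "ann", Map.of("book", 5.0, "camera", 4.0),
            "bob", Map.of("book", 5.0, "camera", 4.0, "tripod", 5.0),
            "cat", Map.of("blender", 5.0, "toaster", 4.0));
        // Prints [tripod]: bob rates like ann, cat shares no items with her.
        System.out.println(recommend("ann", ratings));
    }
}
```

At production scale the pairwise similarity step is the expensive part, which is one reason recommendation workloads are often run on platforms such as Hadoop.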
18. Foursquare (3/2011) and Google Places (5/2011) http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/ http://places.blogspot.com/2011/05/discover-more-places-youll-like-based.html
21. Search innovations @ LinkedIn http://thenoisychannel.com/2010/01/31/linkedin-search-a-look-beneath-the-hood/ http://blog.linkedin.com/2009/12/14/linkedin-faceted-search/ Uses the open-source Lucene project for social graph search and for real-time indexing and searching. Dynamic filters are automatically generated based on your query results!
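Filters generated from query results are usually called facets. The sketch below shows the basic counting step in plain Java, under stated assumptions: it does not use the Lucene API and is not LinkedIn's implementation. The class FacetSketch, the method facetCounts, and the field names ("name", "company") are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of facet counting over search hits (illustration only).
final class FacetSketch {

    // Count how often each value of the given field occurs among the hits.
    // Hits that lack the field are skipped.
    static Map<String, Integer> facetCounts(List<Map<String, String>> hits, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> hit : hits) {
            String value = hit.get(field);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> hits = List.of(
            Map.of("name", "A", "company", "Acme"),
            Map.of("name", "B", "company", "Acme"),
            Map.of("name", "C", "company", "Initech"));
        // Prints {Acme=2, Initech=1}
        System.out.println(facetCounts(hits, "company"));
    }
}
```

Each count can then be shown as a clickable filter next to the results, so the available filters change with every query.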
22. Conclusion Big Data is a very challenging and promising area. It can be used to gain a competitive advantage. It usually brings about advances in computer science. It spans a vast range of topics: NoSQL systems, large-scale distributed computing systems, highly scalable web system designs, and machine learning techniques such as search engines and recommender systems.