1st Birmingham Big Data Science Group meetup

Welcome to the Birmingham Big Data Science Group (BIDS)
Faizan Javed
5/25/2011, Intermark Group
Sponsor: Intermark Group

BIDS Stats
Founded April 10, 2011
9 members (and counting)
Founder: Faizan Javed; Co-Founder: Qasim Ijaz
Online presence:
  Meetup.com, for coordinating meetups: http://www.meetup.com/bham-bids
  Also on (for related articles and announcements):
    LinkedIn: http://www.linkedin.com/groups/Birmingham-Big-Data-Science-Group-3865219
    Facebook: http://www.facebook.com/home.php?sk=group_202221519811444

Agenda
What is Big Data?
Quick overview of related technologies:
  Large-scale distributed systems and platforms
  NoSQL data stores
  Intelligent algorithms, web mining, and information retrieval techniques
  Highly scalable systems

What is Big Data?
More people are connected to the internet.
Social media explosion (Web 2.0): Facebook, Twitter, etc.
Huge volumes of data are being collected from sensors, mobile devices, machine-to-machine communications, social media, and retail sites (web logs that capture browsing patterns).
"Big" in Big Data is relative: today's "big" is certainly tomorrow's "medium" and next week's "small."
"Big Data" is when the size of the data itself becomes part of the problem.
Going from gigabytes to petabytes!
http://radar.oreilly.com/2010/06/what-is-data-science.html

Big Data, Big Numbers
McKinsey report, May 2011: http://www.mckinsey.com/mgi/publications/big_data/index.asp

Why care about big data?
Deep analysis of data can be a competitive advantage.
More data makes consistent patterns easier to find.
More data usually beats better algorithms.
  Ex 1: Predict customer preferences and target ads on an e-commerce website.
  Ex 2: Improve search quality.
  Ex 3: Bank risk modeling (aggregate customer activity from different lines of business).
http://blog.mikepearce.net/2010/08/18/10-hadoop-able-problems-a-summary/
http://www.ft.com/intl/cms/s/0/64095dba-7cd5-11e0-994d-00144feabdc0.html#axzz1NHn8icSC
Key point: "many different sources" and "unstructured data"

Big Players on the Big Data Scene
The Government
http://us1.campaign-archive1.com/?u=4cb4c08d876d7481bbc4bc70f&id=6889126aef

The need for new techniques
Traditional "relational" techniques break down at scale.
Solutions:
  NoSQL databases: Cassandra, HBase, Riak, etc.
  Large-scale distributed computing techniques that scale out on "commodity" hardware: MapReduce/Hadoop, Percolator, etc.
  Analytics platforms: IBM BigInsights, EMC Greenplum

The NoSQL revolution
http://www.infoq.com/news/2011/04/newsql

Prominent NoSQL database users
Cassandra: Facebook, Twitter, Rackspace, Reddit, Digg.com
Riak: Mozilla, Ask.com, Comcast
Voldemort: LinkedIn
MongoDB: Foursquare, Etsy, bit.ly, Intuit
HBase: StumbleUpon, Twitter, Infolinks, Adobe, Meetup.com

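Hadoop-based SMAQ stack
http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
Word-count reducer (MapReduce layer):
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Declared as a static nested class inside the enclosing job class.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Called once per word with every count the mappers emitted for it.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      // Emit the word together with its total count.
      context.write(key, new IntWritable(sum));
    }
  }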
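
The reducer on this slide only makes sense alongside the map and shuffle steps that feed it. The following is a minimal single-process Python sketch of all three phases for word count. It does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Sum the counts for one key, mirroring the Java reducer's loop."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big numbers"]))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the sketch only shows the data flow.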

Hadoop-based SMAQ stack
Hadoop comes with HDFS, the Hadoop Distributed File System.
It can be used alongside various NoSQL systems (HBase is the most common).

Hadoop-based SMAQ stack
Pig (Yahoo):
  input = LOAD 'input/sentences.txt' USING TextLoader();
  words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
  grouped = GROUP words BY $0;
  counts = FOREACH grouped GENERATE group, COUNT(words);
  ordered = ORDER counts BY $0;
  STORE ordered INTO 'output/wordCount' USING PigStorage();
Hive (Facebook):
  INSERT OVERWRITE TABLE xyz_com_page_views
  SELECT page_views.*
  FROM page_views
  WHERE page_views.date >= '2008-03-01'
    AND page_views.date <= '2008-03-31'
    AND page_views.referrer_url LIKE '%xyz.com';

Next-generation systems: going beyond MapReduce/Hadoop
http://www.nytimes.com/external/gigaom/2010/10/23/23gigaom-beyond-hadoop-next-generation-big-data-architectu-81730.html
Mostly Google and Yahoo innovations.
Percolator: "real-time" (incremental) MapReduce. Powers Google's Caffeine web indexing system.
Dremel: a superfast "Hive" for interactive queries over large datasets. In-house at Google.
Pregel: highly efficient graph computing for analyzing social graphs. In-house at Google; open-source implementations are available.
Megastore: a scalable NoSQL-like system with ACID semantics but weaker consistency across partitions. In-house at Google.
Next-gen Hadoop at Yahoo: enhanced scalability (going beyond 4,000 nodes per cluster), support for multiple programming paradigms, and better cluster utilization.

Intelligent Web & machine learning
Recommendation systems, data/web mining, natural language processing.
Recommendation systems:
  A type of collaborative filtering/information retrieval technique.
  Use user profiles, ratings, and browsing habits to recommend items the user has not yet considered.
  First made famous in the commercial arena by Amazon.com.
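
To make the collaborative filtering idea concrete, here is a minimal Python sketch of user-based filtering: it scores items the target user has not rated by the ratings of similar users, weighted by cosine similarity. This is one textbook variant under simplifying assumptions (small in-memory ratings, no normalization for rating bias), not the method used by Amazon.com or any other site named in this deck. The names `cosine` and `recommend` and the sample data are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Tuple

Ratings = Dict[str, Dict[str, float]]  # user -> item -> rating


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two users' rating vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings: Ratings, user: str, top_n: int = 3) -> List[Tuple[str, float]]:
    """Rank unseen items by the similarity-weighted average of other users' ratings."""
    seen = ratings[user]
    scores: Dict[str, float] = {}
    weights: Dict[str, float] = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap in rated items
        for item, rating in other_ratings.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = [(item, scores[item] / weights[item]) for item in scores]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    sample: Ratings = {
        "ann": {"a": 5, "b": 3},
        "bob": {"a": 5, "b": 3, "c": 4},
        "cat": {"d": 2},
    }
    print(recommend(sample, "ann"))
```

At real scale the pairwise similarity step is the expensive part, which is one reason recommendation workloads end up on the distributed platforms covered earlier in the talk.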

Amazon.com & Netflix recommendation systems

Foursquare (3/2011) and Google Places (5/2011)
http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
http://places.blogspot.com/2011/05/discover-more-places-youll-like-based.html

Hot area! Netflix and Overstock.com competitions

Search engines (Google, Bing, Wolfram, Lucene/Nutch, etc.)

Search innovations @ LinkedIn
http://thenoisychannel.com/2010/01/31/linkedin-search-a-look-beneath-the-hood/
http://blog.linkedin.com/2009/12/14/linkedin-faceted-search/
Uses the open-source Lucene project for social graph search and for real-time indexing and searching.
Dynamic filters are generated automatically from your query results!
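
The "dynamic filters" idea is faceted search: after a query runs, the engine counts the values of selected fields across the matching documents and offers those counts as filters. A minimal Python sketch of that counting step follows. It is an illustration only and does not reflect LinkedIn's implementation or the Lucene API; the `search` and `facet_counts` functions, the field names, and the sample profiles are all hypothetical.

```python
from collections import Counter
from typing import Dict, List

Document = Dict[str, str]


def search(docs: List[Document], query: str) -> List[Document]:
    """Naive substring match on the 'title' field; stands in for a real index."""
    q = query.lower()
    return [d for d in docs if q in d["title"].lower()]


def facet_counts(results: List[Document], fields: List[str]) -> Dict[str, Counter]:
    """Count each value of each facet field across the result set."""
    return {f: Counter(d[f] for d in results if f in d) for f in fields}


if __name__ == "__main__":
    profiles = [
        {"title": "Data Engineer", "company": "Acme", "location": "Birmingham"},
        {"title": "Data Scientist", "company": "Acme", "location": "Atlanta"},
        {"title": "Web Designer", "company": "Initech", "location": "Birmingham"},
    ]
    hits = search(profiles, "data")
    for field, counts in facet_counts(hits, ["company", "location"]).items():
        print(field, dict(counts))
```

Because the counts are computed from the hits rather than from the whole collection, the filters change with every query, which is what makes them dynamic.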

Conclusion
Big Data is a very challenging and promising area.
It can be used to gain a competitive advantage.
It usually brings about advances in computer science.
It covers a vast range of topics:
  NoSQL systems, large-scale distributed computing systems, highly scalable web system designs
  Machine learning techniques: search engines, recommender systems
