SlideShare a Scribd company logo
1 of 25
BigData - Introduction
  Walter Dal Mut – walter.dalmut@gmail.com
  @walterdalmut - @corleycloud - @upcloo
Whoami
• Walter Dal Mut                                                    •     Corley S.r.l.
   • Startupper                                                           • Load and Stress tests
       • Corley S.r.l.
            • http://www.corley.it/                                          • www.corley.it
       • UpCloo Ltd.                                                      • Scalable CMS
            • http://www.upcloo.com/
   • Electronic Engineer
                                                                             • www.xitecms.it
       • Polytechnic of Turin                                             • Consultancy
• Social                                                                     • PHP
   • @walterdalmut - @corleycloud - @upcloo                                  • AWS
• Websites                                                                   • Distributed Systems
   • walterdalmut.com – corley.it – upcloo.com                               • Scalable Systems

                                 Walter Dal Mut walter.dalmut@gmail.com @walterdalmut
                                                 @corleycloud @upcloo
Wait a moment… What is it?
• Big data is a collection of data sets
   • So large and complex
   • Difficult to process:
         • using on-hand database management tools
         • traditional data processing
• Big data challenges
   •   Capture
   •   Curation
   •   Storage
   •   Search
   •   Sharing
   •   Transfer
   •   Analysis
   •   Visualization
Common scenario
• Data arriving at fast rates
• Data are typically unstructured or partially structured
   • Log files
   • JSON/BSON structures
• Data are stored without any kind of aggregation
What we are looking for?
• Scalable Storage
   • In 2010 Facebook claimed that they had 21 PB of storage
       • 30PB in 2011
       • 100PB in 2012
       • 500PB in 2013
• Massive Parallel Processing
   • In 2012 we can elaborate exabytes (10^18 bytes) of data in a reasonable amount of
     time
   • The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000
     core Linux cluster and produces data that is used in every Yahoo! Web search query.
• Resonable Cost
   • The New York Times used 100 Amazon EC2 instances and a Hadoop application to
     process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the
     space of 24 hours at a computation cost of about $240 (not including bandwidth)
What is Apache Hadoop?
• Apache Hadoop is an open-source software framework that supports data-
  intensive distributed applications.
   • It was derived from Google’s MapReduce and Google File System (GFS) papers
   • Hadoop implements a computational paradigm named: MapReduce
• Hadoop architecture
   • Hadoop Kernel
   • MapReduce
   • Hadoop Distributed File System (HDFS)
• Other project created in top of Hadoop
   •   Hive, Pig
   •   Hbase
   •   ZooKeeper
   •   Etc.
What is MapReduce?
           SLICE 0

           SLICE 1     R0

           SLICE 2

           SLICE 3
 Input
           SLICE 4     R1     Output

           SLICE 5

           SLICE 6     R2

           SLICE 7

            MAP      REDUCE
Word Count with Map Reduce
The Hadoop Ecosystem
        Data Mining           Data Access           Client Access
         Mahout                 Sqoop                  Hive, Pig


                               Network


        Data Storage        Data Processing         Coordination
        HDFS, HBase          MapReduce               ZooKeeper


                       JVM (Java Virtual Machine)


              Operating System (RedHat, Ubuntu, etc)


                               Hardware
Getting Started with Apache Hadoop
• Hadoop is a framework
• We need to write down our software
  • A Map
  • A Reduce
  • Put all togheter

• WordCount MapReduce with Hadoop
  • http://wiki.apache.org/hadoop/WordCount
(MAP) WordCount MapReduce with
Hadoop
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();


    public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
(RED) WordCount MapReduce with
Hadoop
public static class Reduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values,
Context context)
      throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
WordCount MapReduce with Hadoop
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");


    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);


    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));


    job.waitForCompletion(true);
}
Scripting Languages on top of Hadoop
Apache HIVE – Data Warehouse
• Hive is a data warehouse system for Hadoop
   • Facilitates easy data summarization
   • Ad-hoc queries
   • Analysis of large datasets stored in Hadoop compatible file systems
• SQL-like language called HiveQL
• Custom plug-in MapReduce if needed
Apache Hive Features
• SELECTs and FILTERs
   • SELECT * FROM table_name;
   • SELECT * FROM table_name tn WHERE tn.weekday = 12;
• GROUP BY
   • SELECT * FROM table_name GROUP BY weekday;
• JOINs
   • SELECT a.* FROM a JOIN b ON (a.id = b.id)
• More (multitable inserts, streaming)

• Example:
  https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingS
  tarted-SimpleExampleUseCases
A real example with a CSV (tab-
separated)
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 't'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE
u_data;

SELECT COUNT(*) FROM u_data;
Apache PIG
• It is a platform for analyzing large data sets that consists of a high-
  level language for expressing data analysis programs

• Pig's infrastructure layer consists of a compiler that produces
  sequences of Map-Reduce programs
   • Ease of programming
   • Optimization opportunities
      • Auto-optimization of tasks, allowing the user to focus on semantics rather than
        efficiency.
   • Extensibility
Apache Pig – Log Flow analysis
• A common scenario is collect logs and analyse it as post processing
• We will cover Apache2 (web server) log analysis
   • Read AWS Elastic Map Reduce  Apache Pig example
      • http://aws.amazon.com/articles/2729
• Apache Log:
   • 122.161.184.193 - - [21/Jul/2009:13:14:17 -0700] "GET /rss.pl
     HTTP/1.1" 200 35942 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT
     6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.21022;
     InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618;
     OfficeLiveConnector.1.3; OfficeLivePatch.1.3; MSOffice 12)"
Log Analysis [Common Log Format
(CLF)]
• More information about CLF:
   • http://httpd.apache.org/docs/2.2/logs.html
• Log interesting fields:
   •   Remote Address (IP)
   •   Remote Log Name
   •   User
   •   Time
   •   Request
   •   Status
   •   Bytes
   •   Refererer
   •   Browser
Getting started with Apache Pig
• First of all we need to load all logs from a HDFS compabile source (for
  example AWS S3)
• Create a table starting from all log lines
• Create the Map Reduce program in order to analyse and store results
Load logs from S3 and create a table
RAW_LOGS = LOAD ‘s3://bucket/path’ USING TextLoader as (line:chararray);


LOGS_BASE = FOREACH RAW_LOGS GENERATE
   FLATTEN(
      EXTRACT(line, '^(S+) (S+) (S+) [([w:/]+s[+-]d{4})] "(.+?)" (S+) (S+)
"([^"]*)" "([^"]*)"')
   )
   as (
       remoteAddr:     chararray, remoteLogname: chararray,
       user:           chararray, time:         chararray,
       request:        chararray, status:       int,
       bytes_string:   chararray, referrer:     chararray,
       browser:        chararray
 );
LOGS_BASE is a Table
Remote      logNm   User   Time        Request   Status   Bytes   Refererer   Browser
72.14.194.1 -       -      21/Jul/20   GET        200     2969    -           FeedFetch
                           09:18:04:   /gwidgets                              er-
                           54 -0700    /alexa.xml                             Google;
                                       HTTP/1.1                               (+http://w
                                                                              ww.googl
                                                                              e.com/fee
                                                                              dfetcher.h
                                                                              tml)
Data Analysis
REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;

FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR
referrer matches '.*google.*';

SEARCH_TERMS = FOREACH FILTERED GENERATE
FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as
terms:chararray;

SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL;

SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0)
GENERATE $0, COUNT($1) as num;

SEARCH_TERMS_COUNT_SORTED = ORDER SEARCH_TERMS_COUNT BY num DESC;
Thanks for Listening
• Any questions?

More Related Content

What's hot

Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is FailingDataWorks Summit
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Guy Harrison
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaCloudera, Inc.
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaKnoldus Inc.
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleMapR Technologies
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorialawesomesos
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 

What's hot (20)

Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
HiveServer2
HiveServer2HiveServer2
HiveServer2
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Nov 2011 HUG: Blur - Lucene on Hadoop
Nov 2011 HUG: Blur - Lucene on HadoopNov 2011 HUG: Blur - Lucene on Hadoop
Nov 2011 HUG: Blur - Lucene on Hadoop
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scale
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Hadoop architecture by ajay
Hadoop architecture by ajayHadoop architecture by ajay
Hadoop architecture by ajay
 

Similar to Big data, just an introduction to Hadoop and Scripting Languages

Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWAREFernando Lopez Aguilar
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationFIWARE
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413gregchanan
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewMadhur Nawandar
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...rhatr
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Polyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the CloudPolyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the CloudAndrei Savu
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 

Similar to Big data, just an introduction to Hadoop and Scripting Languages (20)

SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARE
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE Implementation
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overview
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
מיכאל
מיכאלמיכאל
מיכאל
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Polyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the CloudPolyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the Cloud
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 

More from Corley S.r.l.

Aws rekognition - riconoscimento facciale
Aws rekognition  - riconoscimento faccialeAws rekognition  - riconoscimento facciale
Aws rekognition - riconoscimento faccialeCorley S.r.l.
 
AWSome day 2018 - scalability and cost optimization with container services
AWSome day 2018 - scalability and cost optimization with container servicesAWSome day 2018 - scalability and cost optimization with container services
AWSome day 2018 - scalability and cost optimization with container servicesCorley S.r.l.
 
AWSome day 2018 - API serverless with aws
AWSome day 2018  - API serverless with awsAWSome day 2018  - API serverless with aws
AWSome day 2018 - API serverless with awsCorley S.r.l.
 
AWSome day 2018 - database in cloud
AWSome day 2018 -  database in cloudAWSome day 2018 -  database in cloud
AWSome day 2018 - database in cloudCorley S.r.l.
 
Trace your micro-services oriented application with Zipkin and OpenTracing
Trace your micro-services oriented application with Zipkin and OpenTracing Trace your micro-services oriented application with Zipkin and OpenTracing
Trace your micro-services oriented application with Zipkin and OpenTracing Corley S.r.l.
 
Apiconf - The perfect REST solution
Apiconf - The perfect REST solutionApiconf - The perfect REST solution
Apiconf - The perfect REST solutionCorley S.r.l.
 
Apiconf - Doc Driven Development
Apiconf - Doc Driven DevelopmentApiconf - Doc Driven Development
Apiconf - Doc Driven DevelopmentCorley S.r.l.
 
Authentication and authorization in res tful infrastructures
Authentication and authorization in res tful infrastructuresAuthentication and authorization in res tful infrastructures
Authentication and authorization in res tful infrastructuresCorley S.r.l.
 
Flexibility and scalability of costs in serverless infrastructures
Flexibility and scalability of costs in serverless infrastructuresFlexibility and scalability of costs in serverless infrastructures
Flexibility and scalability of costs in serverless infrastructuresCorley S.r.l.
 
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented application
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented applicationCloudConf2017 - Deploy, Scale & Coordinate a microservice oriented application
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented applicationCorley S.r.l.
 
A single language for backend and frontend from AngularJS to cloud with Clau...
A single language for backend and frontend  from AngularJS to cloud with Clau...A single language for backend and frontend  from AngularJS to cloud with Clau...
A single language for backend and frontend from AngularJS to cloud with Clau...Corley S.r.l.
 
AngularJS: Service, factory & provider
AngularJS: Service, factory & providerAngularJS: Service, factory & provider
AngularJS: Service, factory & providerCorley S.r.l.
 
The advantage of developing with TypeScript
The advantage of developing with TypeScript The advantage of developing with TypeScript
The advantage of developing with TypeScript Corley S.r.l.
 
Angular coding: from project management to web and mobile deploy
Angular coding: from project management to web and mobile deployAngular coding: from project management to web and mobile deploy
Angular coding: from project management to web and mobile deployCorley S.r.l.
 
Corley cloud angular in cloud
Corley cloud   angular in cloudCorley cloud   angular in cloud
Corley cloud angular in cloudCorley S.r.l.
 
Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Corley S.r.l.
 
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS Lambda
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS LambdaRead Twitter Stream and Tweet back pictures with Raspberry Pi & AWS Lambda
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS LambdaCorley S.r.l.
 
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...Corley S.r.l.
 
Middleware PHP - A simple micro-framework
Middleware PHP - A simple micro-frameworkMiddleware PHP - A simple micro-framework
Middleware PHP - A simple micro-frameworkCorley S.r.l.
 

More from Corley S.r.l. (20)

Aws rekognition - riconoscimento facciale
Aws rekognition  - riconoscimento faccialeAws rekognition  - riconoscimento facciale
Aws rekognition - riconoscimento facciale
 
AWSome day 2018 - scalability and cost optimization with container services
AWSome day 2018 - scalability and cost optimization with container servicesAWSome day 2018 - scalability and cost optimization with container services
AWSome day 2018 - scalability and cost optimization with container services
 
AWSome day 2018 - API serverless with aws
AWSome day 2018  - API serverless with awsAWSome day 2018  - API serverless with aws
AWSome day 2018 - API serverless with aws
 
AWSome day 2018 - database in cloud
AWSome day 2018 -  database in cloudAWSome day 2018 -  database in cloud
AWSome day 2018 - database in cloud
 
Trace your micro-services oriented application with Zipkin and OpenTracing
Trace your micro-services oriented application with Zipkin and OpenTracing Trace your micro-services oriented application with Zipkin and OpenTracing
Trace your micro-services oriented application with Zipkin and OpenTracing
 
Apiconf - The perfect REST solution
Apiconf - The perfect REST solutionApiconf - The perfect REST solution
Apiconf - The perfect REST solution
 
Apiconf - Doc Driven Development
Apiconf - Doc Driven DevelopmentApiconf - Doc Driven Development
Apiconf - Doc Driven Development
 
Authentication and authorization in res tful infrastructures
Authentication and authorization in res tful infrastructuresAuthentication and authorization in res tful infrastructures
Authentication and authorization in res tful infrastructures
 
Flexibility and scalability of costs in serverless infrastructures
Flexibility and scalability of costs in serverless infrastructuresFlexibility and scalability of costs in serverless infrastructures
Flexibility and scalability of costs in serverless infrastructures
 
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented application
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented applicationCloudConf2017 - Deploy, Scale & Coordinate a microservice oriented application
CloudConf2017 - Deploy, Scale & Coordinate a microservice oriented application
 
React vs Angular2
React vs Angular2React vs Angular2
React vs Angular2
 
A single language for backend and frontend from AngularJS to cloud with Clau...
A single language for backend and frontend  from AngularJS to cloud with Clau...A single language for backend and frontend  from AngularJS to cloud with Clau...
A single language for backend and frontend from AngularJS to cloud with Clau...
 
AngularJS: Service, factory & provider
AngularJS: Service, factory & providerAngularJS: Service, factory & provider
AngularJS: Service, factory & provider
 
The advantage of developing with TypeScript
The advantage of developing with TypeScript The advantage of developing with TypeScript
The advantage of developing with TypeScript
 
Angular coding: from project management to web and mobile deploy
Angular coding: from project management to web and mobile deployAngular coding: from project management to web and mobile deploy
Angular coding: from project management to web and mobile deploy
 
Corley cloud angular in cloud
Corley cloud   angular in cloudCorley cloud   angular in cloud
Corley cloud angular in cloud
 
Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2
 
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS Lambda
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS LambdaRead Twitter Stream and Tweet back pictures with Raspberry Pi & AWS Lambda
Read Twitter Stream and Tweet back pictures with Raspberry Pi & AWS Lambda
 
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...
Deploy and Scale your PHP App with AWS ElasticBeanstalk and Docker- PHPTour L...
 
Middleware PHP - A simple micro-framework
Middleware PHP - A simple micro-frameworkMiddleware PHP - A simple micro-framework
Middleware PHP - A simple micro-framework
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

Big data, just an introduction to Hadoop and Scripting Languages

  • 1. BigData - Introduction Walter Dal Mut – walter.dalmut@gmail.com @walterdalmut - @corleycloud - @upcloo
  • 2. Whoami • Walter Dal Mut • Corley S.r.l. • Startupper • Load and Stress tests • Corley S.r.l. • http://www.corley.it/ • www.corley.it • UpCloo Ltd. • Scalable CMS • http://www.upcloo.com/ • Electronic Engineer • www.xitecms.it • Polytechnic of Turin • Consultancy • Social • PHP • @walterdalmut - @corleycloud - @upcloo • AWS • Websites • Distributed Systems • walterdalmut.com – corley.it – upcloo.com • Scalable Systems Walter Dal Mut walter.dalmut@gmail.com @walterdalmut @corleycloud @upcloo
  • 3. Wait a moment… What is it? • Big data is a collection of data sets • So large and complex • Difficult to process: • using on-hand database management tools • traditional data processing • Big data challenges • Capture • Curation • Storage • Search • Sharing • Transfer • Analysis • Visualization
  • 4. Common scenario • Data arriving at fast rates • Data are typically unstructured or partially structured • Log files • JSON/BSON structures • Data are stored without any kind of aggregation
  • 5. What we are looking for? • Scalable Storage • In 2010 Facebook claimed that they had 21 PB of storage • 30PB in 2011 • 100PB in 2012 • 500PB in 2013 • Massive Parallel Processing • In 2012 we can elaborate exabytes (10^18 bytes) of data in a reasonable amount of time • The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster and produces data that is used in every Yahoo! Web search query. • Resonable Cost • The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth)
  • 6. What is Apache Hadoop? • Apache Hadoop is an open-source software framework that supports data- intensive distributed applications. • It was derived from Google’s MapReduce and Google File System (GFS) papers • Hadoop implements a computational paradigm named: MapReduce • Hadoop architecture • Hadoop Kernel • MapReduce • Hadoop Distributed File System (HDFS) • Other project created in top of Hadoop • Hive, Pig • Hbase • ZooKeeper • Etc.
  • 7. What is MapReduce? SLICE 0 SLICE 1 R0 SLICE 2 SLICE 3 Input SLICE 4 R1 Output SLICE 5 SLICE 6 R2 SLICE 7 MAP REDUCE
  • 8. Word Count with Map Reduce
  • 9. The Hadoop Ecosystem Data Mining Data Access Client Access Mahout Sqoop Hive, Pig Network Data Storage Data Processing Coordination HDFS, HBase MapReduce ZooKeeper JVM (Java Virtual Machine) Operating System (RedHat, Ubuntu, etc) Hardware
  • 10. Getting Started with Apache Hadoop • Hadoop is a framework • We need to write down our software • A Map • A Reduce • Put all togheter • WordCount MapReduce with Hadoop • http://wiki.apache.org/hadoop/WordCount
  • 11. (MAP) WordCount MapReduce with Hadoop public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } }
  • 12. (RED) WordCount MapReduce with Hadoop public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } }
  • 13. WordCount MapReduce with Hadoop public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); }
  • 14. Scripting Languages on top of Hadoop
  • 15. Apache HIVE – Data Warehouse • Hive is a data warehouse system for Hadoop • Facilitates easy data summarization • Ad-hoc queries • Analysis of large datasets stored in Hadoop compatible file systems • SQL-like language called HiveQL • Custom plug-in MapReduce if needed
  • 16. Apache Hive Features • SELECTs and FILTERs • SELECT * FROM table_name; • SELECT * FROM table_name tn WHERE tn.weekday = 12; • GROUP BY • SELECT * FROM table_name GROUP BY weekday; • JOINs • SELECT a.* FROM a JOIN b ON (a.id = b.id) • More (multitable inserts, streaming) • Example: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingS tarted-SimpleExampleUseCases
  • 17. A real example with a CSV (tab- separated) CREATE TABLE u_data ( userid INT, movieid INT, rating INT, unixtime STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE; LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE u_data; SELECT COUNT(*) FROM u_data;
  • 18. Apache PIG • It is a platform for analyzing large data sets that consists of a high- level language for expressing data analysis programs • Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs • Ease of programming • Optimization opportunities • Auto-optimization of tasks, allowing the user to focus on semantics rather than efficiency. • Extensibility
  • 19. Apache Pig – Log Flow analysis • A common scenario is collect logs and analyse it as post processing • We will cover Apache2 (web server) log analysis • Read AWS Elastic Map Reduce  Apache Pig example • http://aws.amazon.com/articles/2729 • Apache Log: • 122.161.184.193 - - [21/Jul/2009:13:14:17 -0700] "GET /rss.pl HTTP/1.1" 200 35942 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.21022; InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618; OfficeLiveConnector.1.3; OfficeLivePatch.1.3; MSOffice 12)"
  • 20. Log Analysis [Common Log Format (CLF)] • More information about CLF: • http://httpd.apache.org/docs/2.2/logs.html • Log interesting fields: • Remote Address (IP) • Remote Log Name • User • Time • Request • Status • Bytes • Refererer • Browser
  • 21. Getting started with Apache Pig • First of all we need to load all logs from a HDFS compabile source (for example AWS S3) • Create a table starting from all log lines • Create the Map Reduce program in order to analyse and store results
  • 22. Load logs from S3 and create a table RAW_LOGS = LOAD ‘s3://bucket/path’ USING TextLoader as (line:chararray); LOGS_BASE = FOREACH RAW_LOGS GENERATE FLATTEN( EXTRACT(line, '^(S+) (S+) (S+) [([w:/]+s[+-]d{4})] "(.+?)" (S+) (S+) "([^"]*)" "([^"]*)"') ) as ( remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray, request: chararray, status: int, bytes_string: chararray, referrer: chararray, browser: chararray );
  • 23. LOGS_BASE is a Table Remote logNm User Time Request Status Bytes Refererer Browser 72.14.194.1 - - 21/Jul/20 GET 200 2969 - FeedFetch 09:18:04: /gwidgets er- 54 -0700 /alexa.xml Google; HTTP/1.1 (+http://w ww.googl e.com/fee dfetcher.h tml)
  • 24. Data Analysis REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer; FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*' OR referrer matches '.*google.*'; SEARCH_TERMS = FOREACH FILTERED GENERATE FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as terms:chararray; SEARCH_TERMS_FILTERED = FILTER SEARCH_TERMS BY NOT $0 IS NULL; SEARCH_TERMS_COUNT = FOREACH (GROUP SEARCH_TERMS_FILTERED BY $0) GENERATE $0, COUNT($1) as num; SEARCH_TERMS_COUNT_SORTED = ORDER SEARCH_TERMS_COUNT BY num DESC;
  • 25. Thanks for Listening • Any questions?

Editor's Notes

  1. In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis.