Large-Scale Log Analysis for Marketing. Kenji Hara / Yukio Uematsu, Innovative IP Architecture Center, NTT Communications Corporation. Copyright © 2011 NTT Communications Co., Ltd. All Rights Reserved.
Company Overview
NTT Group, NTT Communications Corporate Structure. Holding company: Nippon Telegraph and Telephone. Group companies by ownership share: 100%, US$ 12.9B revenue, global data, Internet access, voice, IT; 100%, US$ 24.4B revenue, local telecom; 100%, US$ 21.9B revenue, local telecom; 66.4%, US$ 52.8B revenue, mobile; 54.2%, US$ 14.5B revenue, system integration. Divisions: First Sales Division, Second Sales Division, Global Sales Division, ...; Video & Voice Division, Network Services Division, Cloud Services Division, Applications and Content Division, Solutions Division; Customer Services Division, Service Infrastructure Division, Systems Division; Corporate Planning Division, Finance Division, ...; Innovative IP Architecture Center. Functional groups: Staff, Operation, Product, R&D.
NTT Group, NTT Communications Corporate Structure (continued). Technical Support, SI Partnership.
BizCITY: Cloud Services provided by NTT Communications. High-speed backbone between datacenters; global network. Secure connectivity: Internet/IP phone, VPN service, ICT outsourcing, firewall. Guaranteed, burst, best effort; domestic and international. Services: BizHosting (virtual server hosting), BizMail (webmail, scheduler), SaaS (CRM/SFA), BizStorage (online storage), multi-layer analysis, BizMarketing (big data, user log). Ubiquitous office: mobile access, mobile thin client, remote access, IP phone. Big data analysis: BizStorage, BizMarketing.
Big Data in BizCITY. Private data analysis: natural language processing, statistics. BizStorage: secure and high-capacity online storage service for private data. Multi-layer analysis. BizMarketing: mining data for marketing from user logs (access log, CGM log, query log). Use Hadoop for "enormous" user log analysis. Application data: feature, next target.
We provide a “cloud” service for marketing!!! Hadoop in cloud!!!!
Hadoop in BizMarketing. Web Access Analysis, CGM Analysis: Hadoop!! Many join operations and increasing data create a requirement for scalability. [Chart: tweets per day, Jan 2009 to July 2011]
CGM Analysis in BizMarketing. "BuzzFinder" supports marketing activity using customers' feedback in social media. It crawls and collects tweets and blogs and makes them searchable. Users: marketer, advertiser, promoter, R&D. Uses: branding, ad results, company reputation, differences from other companies.
Data Flow in BuzzFinder. PostgreSQL, then Hadoop cluster (NLP and statistics by Map/Reduce), then PostgreSQL.
Map/Reduce in BuzzFinder. CGM data: the size per record is large, the number of records is small (x million per day), and Map is costly (mainly because of NLP). Map (NLP): linguistic and user data in, features out (keywords, customer keywords, sentiment, locations, topics, index data). Map (data extract): keyword, sentiment, location, topic, search index. Reduce (statistics): keyword count, topic count, sentiment count, location count.
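As a rough illustration of the Map (feature extraction) and Reduce (statistics) split on this slide, here is a minimal sketch in plain Python rather than Hadoop. The `extract_features` function is a hypothetical stand-in for the NLP step (it only matches words against small hard-coded lists), and the sample posts are invented for the example.

```python
from collections import Counter

# Hypothetical word lists standing in for real NLP models (assumption).
KEYWORDS = {"earthquake", "nuclear"}
NEGATIVE_WORDS = {"afraid", "worried"}


def extract_features(post):
    """Map step: turn one post into (feature_type, value) pairs."""
    words = post["text"].lower().split()
    features = [("keyword", w) for w in words if w in KEYWORDS]
    sentiment = "negative" if NEGATIVE_WORDS & set(words) else "positive"
    features.append(("sentiment", sentiment))
    features.append(("location", post["location"]))
    return features


def count_features(mapped):
    """Reduce step: count how often each (feature_type, value) pair occurs."""
    return Counter(pair for features in mapped for pair in features)


# Invented sample posts (assumption).
posts = [
    {"text": "Worried about the nuclear plant", "location": "Tokyo"},
    {"text": "Another earthquake this morning", "location": "Sendai"},
    {"text": "Earthquake drill at school", "location": "Tokyo"},
]
counts = count_features(extract_features(p) for p in posts)
```

The per-record work sits entirely in the map step, which matches the slide's remark that Map is the costly part; the reduce step is only a count.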
Output of BuzzFinder: Keyword Trend. Trends of specified keywords in Twitter, here "Nuclear Power Plant" and "Earthquake". [Chart: tweets per day for each keyword, axis marked at 50,000 and 100,000; labels: Earthquake, Nuclear Power Plant, 18565 tweets / day, 65642 tweets / day] Many tweets about "Earthquake" on the 11th of each month. Peak annotation: "Heavy white smoke from Fukushima No.1 nuclear power plant", 95,271 tweets.
Output of BuzzFinder: Topic Analysis. Popular topics about specified keywords in Twitter. Topics about "Nuclear Power Plant" in September: Tokyo Electric Power, Japan, Nuclear Accident, Fukushima, Noda.
Output of BuzzFinder: Location Analysis. Location analysis of "Nuclear Power Plant": many tweets come from the big city (Tokyo area) and the disaster area. [Map legend: Many to Few]
Output of BuzzFinder: Sentiment Analysis. Sentiment analysis of "Nuclear Power Plant": Apr 2011, 48.4% positive and 51.6% negative; Aug 2011, 47.5% positive and 52.5% negative. The sentiment toward "Nuclear Power Plant" became more negative from April (one month after the earthquake) to August, and it is more negative than the average sentiment (70% positive).
Hadoop in BizMarketing. Web Access Analysis, CGM Data Analysis: Hadoop!! Many join operations and increasing data create a requirement for scalability. [Chart: tweets per day, Jan 2009 to July 2011]
Web Access Analysis. Click-stream-based analysis to find out Internet users' behavior inside the site, e.g. why did users leave without converting?
Visualization of Internet Users' Behavior. Click-stream-based analysis, e.g. why did users leave without converting? Statistics; click stream analysis (OLAP).
Hadoop for PaaS Services. Want to reduce the cost! Map/Reduce speeding-up techniques for (1) summation and (2) OLAP (multi-join processing) let a high-speed Hadoop cluster run at the same speed as a normal Hadoop cluster with fewer servers.
Our Cluster, Normal Cluster. Elephant in Cloud runs FAST!!
Strategies for Cost Reduction.
Statistics (summation): Map Multi-Reduce *
    Record reduce: HashMap-based pre-combining before the combiner. Advantages: 1) efficient combining by HashMap, 2) fewer spill operations.
    Local reduce: combining mapper outputs on the same server. Advantage: less data to shuffle.
OLAP (join): Pjoin **
    Join with pre-partitioning and semi-join. Advantage: efficient for multi-table joins.
*, ** "Map Multi-Reduce" and "Pjoin" are developed in NTT labs; the source code is closed now.
Map Multi-Reduce / Record Reduce. Normal Map/Reduce map task: Input, Map, MapOutputBuffer, sort & spill, spill files, mergeParts, Output. Map/Reduce with record reduce: the same pipeline with a record reduce step added and a smaller output buffer. Record reduce is a pre-combining function that runs before the combiner: pre-combining in the map function reduces the number of spill operations. [Diagram legend: Map Task, Reduce Task, Server, Process, File]
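The record reduce idea on this slide (pre-combining in a hash map inside the map function, before the combiner) can be sketched in plain Python as follows. This is not the NTT implementation, whose source is closed; it only illustrates the general technique of combining inside the mapper. The `max_keys` flush threshold and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def map_plain(records):
    """Emit one (key, 1) pair per input record, as a plain mapper would."""
    return [(key, 1) for key in records]


def map_with_record_reduce(records, max_keys=1000):
    """Pre-combine counts in a hash map and emit only when it is flushed.

    max_keys is a hypothetical bound on distinct keys held in memory.
    """
    emitted = []
    table = defaultdict(int)
    for key in records:
        table[key] += 1
        if len(table) >= max_keys:
            emitted.extend(table.items())  # flush partial counts
            table.clear()
    emitted.extend(table.items())  # final flush at the end of the map task
    return emitted


# Invented sample input: one page path per click (assumption).
records = ["/top", "/blog", "/top", "/top", "/blog", "/search"]
plain = map_plain(records)
combined = map_with_record_reduce(records)
```

Here the plain mapper emits six pairs and the pre-combining mapper emits three, so fewer records reach the output buffer; that is the mechanism by which the number of spill operations can go down.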
Map Multi-Reduce / Local Reduce. Pre-reduce data in the same server before the combiner function. The user program forks a master and workers; the master assigns map, local reduce, and reduce tasks; map workers read the input splits (Split 0 to Split 4) and write locally; local reduce tasks combine the outputs on each server; reduce workers read remotely, sort, and write Output File 0 and Output File 1. Twice as fast as the normal cluster. [Diagram legend: Server, Process, File]
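The local reduce step on this slide (combining the outputs of map tasks that ran on the same server, so less data is shuffled) can be sketched in plain Python. Again this is only an illustration of the idea, not the closed-source NTT code; the server names and map outputs are invented.

```python
from collections import defaultdict


def local_reduce(map_outputs_by_server):
    """Merge the outputs of all map tasks on each server into one table."""
    merged = {}
    for server, task_outputs in map_outputs_by_server.items():
        table = defaultdict(int)
        for output in task_outputs:  # one list of pairs per map task
            for key, count in output:
                table[key] += count
        merged[server] = dict(table)
    return merged


# Hypothetical outputs of map tasks, grouped by the server they ran on.
map_outputs_by_server = {
    "server-a": [[("/top", 3), ("/blog", 1)], [("/top", 2)]],
    "server-b": [[("/blog", 4)], [("/blog", 1), ("/top", 1)]],
}
# Records that would be shuffled without and with the local reduce step.
before = sum(len(o) for outs in map_outputs_by_server.values() for o in outs)
merged = local_reduce(map_outputs_by_server)
after = sum(len(table) for table in merged.values())
```

In this toy input the number of records to shuffle drops from six to four; the saving grows with the number of map tasks per server that share keys.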
OLAP in Click Stream Based Analysis. click_stream is joined with page info, location info, user info, and click info. The number of unique keys is large, so a scalable join is required!
Join using Map/Reduce. Combine map-side join and reduce-side join to reduce shuffle cost and disk space while keeping scalability.
               Memory-backed join   Reduce-side join   Map-side join
Scalability    NG                   Good               Good
Shuffle cost   Low                  High               Low
Disk space     Good                 Good               Bad
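To make the trade-offs in this comparison concrete, here is a minimal plain-Python sketch of two of the strategies: a memory-backed join, where every mapper holds the small table in a hash map, and a reduce-side join, where all records are grouped by key first. The table contents and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sample tables (not from the talk).
pages = [(1, "/top"), (2, "/blog")]                  # (page_id, url)
clicks = [(101, 1), (102, 2), (103, 1), (104, 3)]    # (click_id, page_id)


def memory_backed_join(clicks, pages):
    """Map-only join: each mapper loads the small table into a hash map."""
    lookup = dict(pages)  # must fit in memory, which limits scalability
    return [(c, p, lookup[p]) for c, p in clicks if p in lookup]


def reduce_side_join(clicks, pages):
    """Tag records by source, group them by key (the shuffle), then join."""
    groups = defaultdict(lambda: {"page": [], "click": []})
    for page_id, url in pages:
        groups[page_id]["page"].append(url)
    for click_id, page_id in clicks:
        groups[page_id]["click"].append(click_id)  # every record is shuffled
    return [
        (click_id, page_id, url)
        for page_id, group in groups.items()
        for url in group["page"]
        for click_id in group["click"]
    ]


joined_in_memory = memory_backed_join(clicks, pages)
joined_in_reduce = reduce_side_join(clicks, pages)
```

Both functions return the same rows. The first avoids the shuffle but needs the whole lookup table in memory; the second scales with the key count but moves every record across the network.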
Pjoin / Join using Semi-Join View. Join on the map side using pre-partitioning, and do only the rest of the join on the reduce side. Pre-processing: redundant data for multiple joins is prepared from pageinfo (site description data) and click_strm, using the pageinfo primary key and foreign key (the click_strm primary key). Query execution: mappers (DFS read) do click_strm processing plus semi-join over click_strm 1 to n and pageinfo_click_strm 1 to n; after the shuffle, reducers do the join with pageinfo (pageinfo a to z). [Diagram labels: hash(x), hash(y)]
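Pjoin itself is closed source, so the following is only a generic plain-Python sketch of the two ingredients this slide names: pre-partitioning both tables by the join key so that matching rows land in the same bucket, and a semi-join that keeps only the keys the query actually needs. The table contents, the bucket count `N`, and all function names are assumptions for illustration.

```python
def partition(rows, key_index, n):
    """Pre-partitioning: split rows into n buckets by hash of the join key."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[key_index]) % n].append(row)
    return buckets


def semi_join_keys(pages, predicate):
    """Semi-join: keys of the pages that satisfy the query filter."""
    return {page_id for page_id, url in pages if predicate(url)}


def map_side_join(click_bucket, page_bucket, wanted_keys):
    """Join one pair of co-partitioned buckets, skipping unwanted keys."""
    lookup = {pid: url for pid, url in page_bucket if pid in wanted_keys}
    return [(cid, pid, lookup[pid]) for cid, pid in click_bucket if pid in lookup]


# Hypothetical sample tables (not from the talk).
pages = [(1, "/top"), (2, "/blog/a"), (3, "/blog/b"), (4, "/search")]
clicks = [(101, 1), (102, 2), (103, 3), (104, 2), (105, 4)]

N = 2  # number of partitions (assumption)
page_buckets = partition(pages, 0, N)
click_buckets = partition(clicks, 1, N)
wanted = semi_join_keys(pages, lambda url: url.startswith("/blog"))
result = [
    row
    for i in range(N)
    for row in map_side_join(click_buckets[i], page_buckets[i], wanted)
]
```

Because both tables are partitioned with the same function, each bucket pair can be joined independently without a shuffle, and only a small per-bucket lookup table has to be held in memory.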
Experimental Evaluation (Pjoin). A join over a 1 TB access log was processed with Pjoin to verify its effectiveness. [Chart: Pjoin vs Hive (reduce-side join), processing time (min) against number of servers; series: Pjoin (50 servers), Hive (50 servers), Pjoin (20 servers)] Result: 50 servers (normal Hadoop cluster) and 23 servers (Pjoin-applied cluster) run at the same speed!! HiveQL:
    insert overwrite table q1_result
    select count(distinct s_sessionseqid)
    from clckstrm c
    join page p
      on c.c_pageseqid = p.p_pageseqid and p.p_url like '%blog.goo.ne.jp%'
    join session_info s
      on s.s_clckstrmseqid = c.c_clckstrmseqid and s.s_referer like '%QUERY%';
Other Verification on Hadoop. Hadoop cluster (250 cores) with a Namenode, spread over Rack 1 (LOC1), Rack 2 (LOC1), and Rack 3 (LOC2). WAN (30 miles), 300Mb, LACP 4GB. [Chart: processing time by servers and WAN] No significant loss over WAN.
Conclusions
Contacts

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

Hadoop World 2011: Large Scale Log Data Analysis for Marketing in NTT Communications

  • 1. Large-Scale Log Analysis for Marketing. Kenji Hara / Yukio Uematsu, Innovative IP Architecture Center, NTT Communications Corporation. Copyright © 2011 NTT Communications Co., Ltd. All Rights Reserved.
  • 2.
  • 3. NTT Group, NTT Communications Corporate Structure. Nippon Telegraph and Telephone: 100%, US$ 12.9B revenue (global data, Internet access, voice, IT); 100%, US$ 24.4B revenue (local telecom); 100%, US$ 21.9B revenue (local telecom); 66.4%, US$ 52.8B revenue (mobile); 54.2%, US$ 14.5B revenue (system integration). Divisions: Second Sales Division, First Sales Division, Global Sales Division, ...; Video & Voice Division, Network Services Division, Cloud Services Division, Applications and Content Division, Solutions Division; Customer Services Division, Service Infrastructure Division, Systems Division; Corporate Planning Division, Finance Division, ...; Innovative IP Architecture Center. Staff, Operation, Product, R&D.
  • 4. NTT Group, NTT Communications Corporate Structure. Nippon Telegraph and Telephone: 100%, US$ 12.9B revenue (global data, Internet access, voice, IT); 100%, US$ 24.4B revenue (local telecom); 100%, US$ 21.9B revenue (local telecom); 66.4%, US$ 52.8B revenue (mobile); 54.2%, US$ 14.5B revenue (system integration). Technical support, SI partnership.
  • 5. BizCITY: Cloud Services provided by NTT Communications. High-speed backbone between datacenters; global NW; secure connectivity; Internet / IP phone; VPN service; ICT outsourcing; firewall; guaranteed / burst / best effort; domestic / international. Services: BizHosting (virtual server hosting); BizMail (webmail, scheduler); SaaS (CRM/SFA); BizStorage (online storage); multi-layer analysis; BizMarketing (big data, user log). Access: mobile access, mobile thin client, ubiquitous office, remote access, IP phone. Big data analysis: BizStorage, BizMarketing.
  • 6. Big Data in BizCITY. Private data analysis: natural language processing, statistics. Secure & high-capacity storage service: BizStorage (online storage). Mining data for marketing: BizMarketing. Multi-layer analysis. User logs (access log, CGM log, query log) and private data: use Hadoop for “enormous” user log analysis. B Application Data Feature. Next target: BizMarketing.
  • 7. We provide a “cloud” service for marketing!!! Hadoop in cloud!!!!
  • 8. Hadoop in BizMarketing. Web access analysis: many join operations. CGM analysis: increasing data!! [Chart: tweets per day, Jan 2009 to July 2011.] Requirement for scalability: Hadoop!!
  • 9. CGM Analysis in BizMarketing. “BuzzFinder” supports marketing activity using customers’ feedback in social media. BuzzFinder crawls and collects tweets and blogs and makes them searchable. Users: marketer, advertiser, promoter, R&D. Uses: branding, ad results, company reputation, differences from other companies.
  • 10. Data Flow in BuzzFinder. PostgreSQL; Hadoop cluster (NLP and statistics by Map/Reduce); PostgreSQL.
  • 11. Map/Reduce in BuzzFinder. CGM data: the size per record is large; the number of records is small (x mil/day); the map phase is costly (mainly because of NLP). Map (NLP), using linguistic and user data, produces features: keywords, sentiment, locations, topics, index data. Map (data extract), using customer keywords, produces: keyword, sentiment, location, topic, search index. Reduce (statistics) produces: keyword count, topic count, sentiment count, location count.
  • 12. Output of BuzzFinder: Keyword Trend. Trends of specified keywords in Twitter: “Nuclear Power Plant” and “Earthquake”. [Chart: tweets per day for the two keywords, axis marks at 50,000 and 100,000; annotations: 18565 tweets/day, 65642 tweets/day.] Many tweets about “Earthquake” on the 11th of each month. Heavy white smoke from Fukushima No.1 nuclear power plant: 95,271 tweets.
  • 13. Output of BuzzFinder: Topic Analysis. Popular topics about specified keywords in Twitter. Topics about “Nuclear Power Plant” in September: Tokyo Electric Power, Japan, Nuclear Accident, Fukushima, Noda.
  • 14. Output of BuzzFinder: Location Analysis. Location analysis of “Nuclear Power Plant”. [Map: tweet density from many to few; labels: disaster area, Tokyo area.] Many tweets come from big cities and the disaster area.
  • 15. Output of BuzzFinder: Sentiment Analysis. Sentiment analysis of “Nuclear Power Plant”. April 2011: 48.4% positive, 51.6% negative. August 2011: 47.5% positive, 52.5% negative. The sentiment of “Nuclear Power Plant” became more negative from April (1 month after the earthquake) to August. The sentiment is more negative than the average sentiment (70% positive).
  • 16. Hadoop in BizMarketing. Web access analysis: many join operations. CGM data analysis: increasing data!! [Chart: tweets per day, Jan 2009 to July 2011.] Requirement for scalability: Hadoop!!
  • 17. Web Access Analysis. Click-stream-based analysis to find out Internet users’ behavior inside the site, e.g., why did users leave without converting?
  • 18.
  • 19. Hadoop for PaaS Services. Want to reduce the cost! Map/Reduce speeding-up techniques: 1. Summation 2. OLAP (multi-join processing). From a normal Hadoop cluster to a high-speed Hadoop cluster: server reduction at the same speed.
  • 20. Our cluster vs. normal cluster: Elephant in Cloud runs FAST!!
  • 21. Strategies for Cost Reduction. Statistics (summation): Map Multi-Reduce*. Record reduce: HashMap-based pre-combining before the combiner; advantages: 1) efficient combining by HashMap, 2) fewer spill operations. Local reduce: combining mapper outputs on the same server; advantage: reduced amount of shuffle. OLAP (join): Pjoin**: join with pre-partitioning and semi-join; advantage: efficient for multi-table joins. *, ** “Map Multi-Reduce” and “Pjoin” were developed in NTT labs; the source code is currently closed.
  • 22. Map Multi-Reduce / Record Reduce. Normal map/reduce: input, map, MapOutputBuffer, sort & spill, spill files, mergeParts, output. Map/reduce with record reduce: input, map, MapOutputBuffer, sort & spill, spill files, mergeParts, output, plus record reduce, a pre-combining function before the combiner. Pre-combining in the map function reduces the number of spill operations. Smaller output buffer. [Diagram legend: map task, reduce task, server, process, file.]
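
The record-reduce idea on this slide (HashMap-based pre-combining inside the map function) can be illustrated with a minimal Python sketch. This is not NTT's closed-source implementation; it is a generic in-mapper combining pattern, and the names `map_with_record_reduce`, `reduce_sum`, and `buffer_limit` are hypothetical. It assumes the aggregation is a simple sum, which is what makes partial pre-combining safe.

```python
from collections import defaultdict


def map_with_record_reduce(records, buffer_limit=1000):
    """Yield (key, partial_sum) pairs, pre-combined in a dict before emission.

    buffer_limit is a hypothetical cap on distinct keys held in memory;
    when it is reached, the partial sums are flushed downstream.
    """
    buffer = defaultdict(int)
    for key, value in records:
        buffer[key] += value  # combine in the hash map instead of emitting
        if len(buffer) >= buffer_limit:
            yield from buffer.items()
            buffer.clear()
    yield from buffer.items()  # flush whatever is left at end of input


def reduce_sum(pairs):
    """Sum partial values per key (the normal reduce step)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


records = [("/index", 1), ("/blog", 1), ("/index", 1), ("/index", 1)]
emitted = list(map_with_record_reduce(records))
totals = reduce_sum(emitted)
```

With the default limit, the four input records leave the mapper as two pairs, so fewer records reach the output buffer and fewer spills are needed, while the final totals are unchanged.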
  • 23. Map Multi-Reduce / Local Reduce. [Diagram: the user program forks a master and workers; the master assigns map, local reduce, and reduce tasks; map workers read input splits 0 to 4 and write locally; reduce workers read remotely, sort, and write output files 0 and 1. Legend: server, process, file.] Local reduce tasks pre-reduce data on the same server before the combiner function. Twice as fast as the normal cluster.
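
As a rough illustration of local reduce, the sketch below merges the outputs of all map tasks that ran on the same server before anything is shuffled. It is an assumption-laden toy, not the NTT implementation: the server names, the `local_reduce` and `shuffle_size` helpers, and the sum aggregation are all hypothetical.

```python
from collections import defaultdict


def local_reduce(map_outputs_by_server):
    """Merge the outputs of all map tasks on each server into one dict per server."""
    combined = {}
    for server, task_outputs in map_outputs_by_server.items():
        merged = defaultdict(int)
        for output in task_outputs:
            for key, value in output:
                merged[key] += value
        combined[server] = dict(merged)
    return combined


def shuffle_size(per_server):
    """Count the (key, value) pairs that would cross the network."""
    return sum(len(pairs) for pairs in per_server.values())


# Two hypothetical servers, each running two map tasks.
map_outputs = {
    "server-a": [[("/index", 2), ("/blog", 1)], [("/index", 5)]],
    "server-b": [[("/blog", 3)], [("/blog", 4), ("/index", 1)]],
}
before = sum(len(out) for tasks in map_outputs.values() for out in tasks)
after = shuffle_size(local_reduce(map_outputs))
```

In this toy input, six pairs would be shuffled without local reduce and four with it; the saving grows with the number of map tasks per server that share keys.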
  • 24.
  • 25.
  • 26. Pjoin / Join using Semi-Join View. Pre-processing: from pageinfo (site description data) and click_strm, using the pageinfo primary key & foreign key (click_strm primary key), pre-process redundant data for multiple joins (pageinfo_click_strm 1 … pageinfo_click_strm n). Query execution: join on the map side using pre-partitioning, and do only the rest of the join on the reduce side. Mappers: click_strm processing + semi-join over click_strm 1 … n and pageinfo_click_strm 1 … n (DFS read). Reducers: joining with pageinfo a … pageinfo z (shuffle). [Diagram labels: hash(x), hash(y).]
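
Since the Pjoin source is closed, the following is only a generic sketch of the two ingredients the slide names: pre-partitioning both tables on the join key, and a semi-join that discards click-stream rows with no matching page before the join runs. The column names (`page_id`, `click_id`, `url`), the helper functions, and the sample rows are hypothetical and are not taken from the actual schema.

```python
from collections import defaultdict


def partition(rows, key, n):
    """Pre-partition rows by hash of the join key so matching rows co-locate."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def semi_join_keys(pageinfo, predicate):
    """Collect the join keys of pages that satisfy the query predicate."""
    return {p["page_id"] for p in pageinfo if predicate(p)}


def partitioned_join(pageinfo_parts, click_parts, keep_keys):
    """Join each click partition with the page partition that shares its id."""
    joined = []
    for part_id, clicks in click_parts.items():
        pages = {p["page_id"]: p for p in pageinfo_parts.get(part_id, [])}
        for click in clicks:
            if click["page_id"] in keep_keys:  # semi-join filter
                joined.append({**click, **pages[click["page_id"]]})
    return joined


pageinfo = [
    {"page_id": 1, "url": "http://blog.goo.ne.jp/a"},
    {"page_id": 2, "url": "http://example.com/"},
    {"page_id": 3, "url": "http://blog.goo.ne.jp/b"},
]
clicks = [
    {"click_id": 10, "page_id": 1},
    {"click_id": 11, "page_id": 2},
    {"click_id": 12, "page_id": 3},
    {"click_id": 13, "page_id": 1},
]
keep = semi_join_keys(pageinfo, lambda p: "blog.goo.ne.jp" in p["url"])
result = partitioned_join(
    partition(pageinfo, "page_id", 4), partition(clicks, "page_id", 4), keep
)
```

Because both tables are partitioned with the same hash on the same key, each partition can be joined without moving rows between partitions, and the semi-join key set keeps non-matching clicks from being carried any further.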
  • 27. Experimental Evaluation (Pjoin). 1 TB access log join processing using Pjoin to verify its effectiveness. Pjoin vs. Hive (reduce-side join). [Chart: processing time (min) by number of servers; series: Pjoin (50 servers), Hive (50 servers), Pjoin (20 servers).] 50 servers (normal Hadoop cluster) = 23 servers (Pjoin-applied cluster): same speed!! HiveQL: insert overwrite table q1_result select count(distinct s_sessionseqid) from clckstrm c join page p on c.c_pageseqid = p.p_pageseqid and p.p_url like '%blog.goo.ne.jp%' join session_info s on s.s_clckstrmseqid = c.c_clckstrmseqid and s.s_referer like '%QUERY%';
  • 28.
  • 29.
  • 30.