SlideShare ist ein Scribd-Unternehmen logo
1 von 23
Downloaden Sie, um offline zu lesen
Data Analysis @ Daum
Paul Kim (totworld@daumcorp.com)
DevOn 2012
Data Analysis @ Daum | Devon 2012
Data Analysis @ Daum | Devon 2012
Search Quality
Search Quality
          =
Satisfaction / Cost
How?
HTTP://WWW.GOOGLE.COM/ONCEUPONATIME/TECHNOLOGY/PIGEONRANK.HTML
Understanding Users
     with Logs
               BIG
              DATA!
Data Analysis Process
           with Hadoop

?                                                     !
          HADOOP            FEATURES   TOOLS
    2 QUAD-CORES                               SAS
       8GB RAM X 60 NODES                      WEKA
       4TB HDD                                 R
                                               ETC
    4 QUAD-CORES
      16GB RAM X 30 NODES
       4TB HDD
For example,
라면 맛있게 끓이는 비법
많이 본 글
Mission
  만족스러운 검색 경험들을 랭킹에 반영
Target Data
  Half Year Search Logs (about 40TB)
Features
                                            JOB
                                   ROU P-BY
  Query - Collection Relationship G
                                                  UP-B Y JOB
  Query - Document - Session Relationship  GRO
                                         JOB
  Session - Query Relationship GROU P-BY
                                       UP-B Y JOB
  Session - Document Relationship GRO
많이 본 글
 Modeling
   Linear Regression with SAS


 Batch Process




HADOOP           FEATURES        MODEL   ENGINE

                    LESS THAN 2 HOURS
바다 이야기
SEARCH SPAM INDEX
 Mission
   Spam이 검색 사용자에게 미치는 영향 파악
 Data
   Search Log : Text with Delimiter
   Post Filtered Documents : Json Format
   Operation Deleted Documents : Xml Format
 Task
   Query - Session - Doc. 1 - Doc. 2 - Doc. 3 - Doc. 4
        Click?                                TER JOIN
                                           OU
        Type? (Ham, Spam, OP Del.)
SEARCH SPAM INDEX
 Result Sample
BLOG CLASSIFICATION
BLOG CLASSIFICATION
Mission
  Unsupervised Learning을 통한 나쁜 Blog Clustering
Data
  30 Days Blog Documents
Task
  Blog - Document’s Feature Analysis with Fixed Interval
BLOG CLASSIFICATION
Modeling
  Kohonen’s SOM(Self Organizing Map) with R
WHAT ELSE?
Topic Analysis with PLSA

Query Chain Filtering

Reprocessing with Hadoop
In Conclusion,
ADVANTAGE OF HADOOP
ADVANTAGE
  Low analyze cost!
  No more sampling!
  Low operation cost!
  Programming Language Independent
  Various support tools

DISADVANTAGE
  Conceptual Change is Needed.
  Project under active development.
  Version upgrade is not supported.
THANK YOU!

Weitere ähnliche Inhalte

Was ist angesagt?

hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...Michael Stack
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
Analytical data processing
Analytical data processingAnalytical data processing
Analytical data processingPolad Saruxanov
 
Introduction of search engine
Introduction of search engineIntroduction of search engine
Introduction of search engineJinglun Li
 

Was ist angesagt? (6)

hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
 
Mongodb
MongodbMongodb
Mongodb
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business Insights
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
Analytical data processing
Analytical data processingAnalytical data processing
Analytical data processing
 
Introduction of search engine
Introduction of search engineIntroduction of search engine
Introduction of search engine
 

Ähnlich wie Data Analysis @ Daum | Devon 2012

Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataNicolas Poggi
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Improving relevance with log information
Improving relevance with log informationImproving relevance with log information
Improving relevance with log informationRichard Boulton
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascadingDataiku
 
Efficient Log Management using Oozie, Parquet and Hive
Efficient Log Management using Oozie, Parquet and HiveEfficient Log Management using Oozie, Parquet and Hive
Efficient Log Management using Oozie, Parquet and HiveGopi Krishnan Nambiar
 
Optimizing SSD Architecture for Client Workloads
Optimizing SSD Architecture for Client WorkloadsOptimizing SSD Architecture for Client Workloads
Optimizing SSD Architecture for Client WorkloadsJonathan Long
 
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...Amazon Web Services
 
DAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL PerformanceDAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL PerformanceAmazon Web Services
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingYahoo Developer Network
 
Data Engineering with Protobuf
Data Engineering with ProtobufData Engineering with Protobuf
Data Engineering with ProtobufThiago Baldim
 
Using Compass to Diagnose Performance Problems
Using Compass to Diagnose Performance Problems Using Compass to Diagnose Performance Problems
Using Compass to Diagnose Performance Problems MongoDB
 
Using Compass to Diagnose Performance Problems in Your Cluster
Using Compass to Diagnose Performance Problems in Your ClusterUsing Compass to Diagnose Performance Problems in Your Cluster
Using Compass to Diagnose Performance Problems in Your ClusterMongoDB
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Imply
 

Ähnlich wie Data Analysis @ Daum | Devon 2012 (20)

Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Improving relevance with log information
Improving relevance with log informationImproving relevance with log information
Improving relevance with log information
 
sigmod08
sigmod08sigmod08
sigmod08
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
What is apache pig
What is apache pigWhat is apache pig
What is apache pig
 
What is apache_pig
What is apache_pigWhat is apache_pig
What is apache_pig
 
What is apache_pig
What is apache_pigWhat is apache_pig
What is apache_pig
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascading
 
Efficient Log Management using Oozie, Parquet and Hive
Efficient Log Management using Oozie, Parquet and HiveEfficient Log Management using Oozie, Parquet and Hive
Efficient Log Management using Oozie, Parquet and Hive
 
Optimizing SSD Architecture for Client Workloads
Optimizing SSD Architecture for Client WorkloadsOptimizing SSD Architecture for Client Workloads
Optimizing SSD Architecture for Client Workloads
 
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...
 
DAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL PerformanceDAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL Performance
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data Processing
 
Data Engineering with Protobuf
Data Engineering with ProtobufData Engineering with Protobuf
Data Engineering with Protobuf
 
Using Compass to Diagnose Performance Problems
Using Compass to Diagnose Performance Problems Using Compass to Diagnose Performance Problems
Using Compass to Diagnose Performance Problems
 
Using Compass to Diagnose Performance Problems in Your Cluster
Using Compass to Diagnose Performance Problems in Your ClusterUsing Compass to Diagnose Performance Problems in Your Cluster
Using Compass to Diagnose Performance Problems in Your Cluster
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)
 

Mehr von Daum DNA

Daum의 개방형 기술 전략 및 자바 기술 로드맵(2007)
Daum의 개방형 기술 전략 및 자바 기술 로드맵(2007)Daum의 개방형 기술 전략 및 자바 기술 로드맵(2007)
Daum의 개방형 기술 전략 및 자바 기술 로드맵(2007)Daum DNA
 
Daum OAuth 2.0
Daum OAuth 2.0Daum OAuth 2.0
Daum OAuth 2.0Daum DNA
 
Daum 음성인식 API (김한샘)
Daum 음성인식 API (김한샘)Daum 음성인식 API (김한샘)
Daum 음성인식 API (김한샘)Daum DNA
 
Daum 검색/지도 API (이정주)
Daum 검색/지도 API (이정주)Daum 검색/지도 API (이정주)
Daum 검색/지도 API (이정주)Daum DNA
 
오픈 API 활용방법(Daum 사례 중심, 윤석찬)
오픈 API 활용방법(Daum 사례 중심, 윤석찬)오픈 API 활용방법(Daum 사례 중심, 윤석찬)
오픈 API 활용방법(Daum 사례 중심, 윤석찬)Daum DNA
 
Daum 티스토리 API (천정환)
Daum 티스토리 API (천정환)Daum 티스토리 API (천정환)
Daum 티스토리 API (천정환)Daum DNA
 
Daum 로그인 API (함태윤)
Daum 로그인 API (함태윤)Daum 로그인 API (함태윤)
Daum 로그인 API (함태윤)Daum DNA
 
FT직군의 현재와 미래 - 홍윤표
FT직군의 현재와 미래 - 홍윤표FT직군의 현재와 미래 - 홍윤표
FT직군의 현재와 미래 - 홍윤표Daum DNA
 
웹접근성과 장애인 차별 금지법 - 장성민
웹접근성과 장애인 차별 금지법 - 장성민웹접근성과 장애인 차별 금지법 - 장성민
웹접근성과 장애인 차별 금지법 - 장성민Daum DNA
 
반응형 웹 디자인은 만능인가? - 신현석
반응형 웹 디자인은 만능인가? - 신현석반응형 웹 디자인은 만능인가? - 신현석
반응형 웹 디자인은 만능인가? - 신현석Daum DNA
 
Daum devday 13 [bap]
Daum devday 13  [bap]Daum devday 13  [bap]
Daum devday 13 [bap]Daum DNA
 
Daum DevDay 13-힐링이 필요해
Daum DevDay 13-힐링이 필요해Daum DevDay 13-힐링이 필요해
Daum DevDay 13-힐링이 필요해Daum DNA
 
Daum DevDay 13 - 마음의 소리
Daum DevDay 13 - 마음의 소리Daum DevDay 13 - 마음의 소리
Daum DevDay 13 - 마음의 소리Daum DNA
 
Daum DevDay 13 - OpenBrace
Daum DevDay 13 - OpenBraceDaum DevDay 13 - OpenBrace
Daum DevDay 13 - OpenBraceDaum DNA
 
Daum DevDay 13 - Ogangjang
Daum DevDay 13 - OgangjangDaum DevDay 13 - Ogangjang
Daum DevDay 13 - OgangjangDaum DNA
 
Daum DevDay 13 - Mook
Daum DevDay 13 - MookDaum DevDay 13 - Mook
Daum DevDay 13 - MookDaum DNA
 
Daum DevDay 13 - Moonlight
Daum DevDay 13 - MoonlightDaum DevDay 13 - Moonlight
Daum DevDay 13 - MoonlightDaum DNA
 
Daum DevDay 13 - In-N-Out
Daum DevDay 13 - In-N-OutDaum DevDay 13 - In-N-Out
Daum DevDay 13 - In-N-OutDaum DNA
 
Daum DevDay 13 - i-DF
Daum DevDay 13 - i-DFDaum DevDay 13 - i-DF
Daum DevDay 13 - i-DFDaum DNA
 
Daum 키노트 | Devon 2012
Daum 키노트 | Devon 2012Daum 키노트 | Devon 2012
Daum 키노트 | Devon 2012Daum DNA
 

Mehr von Daum DNA (20)

Daum의 개방형 기술 전략 및 자바 기술 로드맵(2007)
Daum의 개방형 기술 전략 및 자바 기술 로드맵(2007)Daum의 개방형 기술 전략 및 자바 기술 로드맵(2007)
Daum의 개방형 기술 전략 및 자바 기술 로드맵(2007)
 
Daum OAuth 2.0
Daum OAuth 2.0Daum OAuth 2.0
Daum OAuth 2.0
 
Daum 음성인식 API (김한샘)
Daum 음성인식 API (김한샘)Daum 음성인식 API (김한샘)
Daum 음성인식 API (김한샘)
 
Daum 검색/지도 API (이정주)
Daum 검색/지도 API (이정주)Daum 검색/지도 API (이정주)
Daum 검색/지도 API (이정주)
 
오픈 API 활용방법(Daum 사례 중심, 윤석찬)
오픈 API 활용방법(Daum 사례 중심, 윤석찬)오픈 API 활용방법(Daum 사례 중심, 윤석찬)
오픈 API 활용방법(Daum 사례 중심, 윤석찬)
 
Daum 티스토리 API (천정환)
Daum 티스토리 API (천정환)Daum 티스토리 API (천정환)
Daum 티스토리 API (천정환)
 
Daum 로그인 API (함태윤)
Daum 로그인 API (함태윤)Daum 로그인 API (함태윤)
Daum 로그인 API (함태윤)
 
FT직군의 현재와 미래 - 홍윤표
FT직군의 현재와 미래 - 홍윤표FT직군의 현재와 미래 - 홍윤표
FT직군의 현재와 미래 - 홍윤표
 
웹접근성과 장애인 차별 금지법 - 장성민
웹접근성과 장애인 차별 금지법 - 장성민웹접근성과 장애인 차별 금지법 - 장성민
웹접근성과 장애인 차별 금지법 - 장성민
 
반응형 웹 디자인은 만능인가? - 신현석
반응형 웹 디자인은 만능인가? - 신현석반응형 웹 디자인은 만능인가? - 신현석
반응형 웹 디자인은 만능인가? - 신현석
 
Daum devday 13 [bap]
Daum devday 13  [bap]Daum devday 13  [bap]
Daum devday 13 [bap]
 
Daum DevDay 13-힐링이 필요해
Daum DevDay 13-힐링이 필요해Daum DevDay 13-힐링이 필요해
Daum DevDay 13-힐링이 필요해
 
Daum DevDay 13 - 마음의 소리
Daum DevDay 13 - 마음의 소리Daum DevDay 13 - 마음의 소리
Daum DevDay 13 - 마음의 소리
 
Daum DevDay 13 - OpenBrace
Daum DevDay 13 - OpenBraceDaum DevDay 13 - OpenBrace
Daum DevDay 13 - OpenBrace
 
Daum DevDay 13 - Ogangjang
Daum DevDay 13 - OgangjangDaum DevDay 13 - Ogangjang
Daum DevDay 13 - Ogangjang
 
Daum DevDay 13 - Mook
Daum DevDay 13 - MookDaum DevDay 13 - Mook
Daum DevDay 13 - Mook
 
Daum DevDay 13 - Moonlight
Daum DevDay 13 - MoonlightDaum DevDay 13 - Moonlight
Daum DevDay 13 - Moonlight
 
Daum DevDay 13 - In-N-Out
Daum DevDay 13 - In-N-OutDaum DevDay 13 - In-N-Out
Daum DevDay 13 - In-N-Out
 
Daum DevDay 13 - i-DF
Daum DevDay 13 - i-DFDaum DevDay 13 - i-DF
Daum DevDay 13 - i-DF
 
Daum 키노트 | Devon 2012
Daum 키노트 | Devon 2012Daum 키노트 | Devon 2012
Daum 키노트 | Devon 2012
 

Data Analysis @ Daum | Devon 2012