SlideShare ist ein Scribd-Unternehmen logo
1 von 18
Downloaden Sie, um offline zu lesen
Mining The Social Web
  Ch8 Blogs et al.: Natural Language
     Processing (and Beyond)      Ⅰ


               발표 : 김연기
     네이버 아키텍트를 꿈꾸는 사람들
     http://Cafe.naver.com/architect1
Natural Language
       Processing
• 마침표로 문장을 처리하자!
Natural Language
       Processing
• 마침표로 문장을 처리하자!
NLP Pipeline With NLTK
        문장의 끝 찾기


        단어 자르기


       구문별 짝짖기(?)


        단어 의미 부여


          추출
Natural Language
         Processing
• 문장의 끝 찾기(EOS Detection)
Natural Language
         Processing
• 문장의 끝 찾기(EOS Detection)
Natural Language
         Processing
• 구문별 짝짓기 (POS Tagging)
Natural Language
   Processing
Natural Language
           Processing
• 추출( Extraction)
Natural Language
   Processing
Natural Language
   Processing
Natural Language
               Processing
def cleanHtml(html):
return BeautifulStoneSoup(clean_html(html),
convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
fp = feedparser.parse(FEED_URL)
print "Fetched %s entries from '%s'" %
(len(fp.entries[0].title), fp.feed.title)
blog_posts = []
for e in fp.entries:
blog_posts.append({'title': e.title, 'content'
: cleanHtml(e.content[0].value), 'link': e.links[0].href})
Natural Language
               Processing
# Basic stats
num_words = sum([i[1] for i in fdist.items()])
num_unique_words = len(fdist.keys())
# Hapaxes are words that appear only once
num_hapaxes = len(fdist.hapaxes())
top_10_words_sans_stop_words = [w for w in fdist.items()
if w[0] not in stop_words][:10]
print post['title']
print 'tNum Sentences:'.ljust(25), len(sentences)
print 'tNum Words:'.ljust(25), num_words
print 'tNum Unique Words:'.ljust(25), num_unique_words
print 'tNum Hapaxes:'.ljust(25), num_hapaxes
print 'tTop 10 Most Frequent Words (sans stop words):ntt',
'ntt'.join(['%s (%s)‘
        % (w[0], w[1]) for w in top_10_words_sans_stop_words])
print
Natural Language
   Processing
Natural Language
               Processing
# Summaization Approach 1:
# Filter out non-significant sentences by using the average
score plus a
# fraction of the std dev as a filter

avg = numpy.mean([s[1] for s in scored_sentences])
std = numpy.std([s[1] for s in scored_sentences])
mean_scored = [(sent_idx, score) for (sent_idx, score) in
scored_sentences if score > avg + 0.5 * std]

# Summarization Approach 2:
# Another approach would be to return only the top N ranked
sentences

    top_n_scored = sorted(scored_sentences, key=lambda s:
s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
Natural Language
   Processing
Natural Language
         Processing
– Luhn’s Summarization Algorithm
  • Score = (문장에서 중요한 단어)^2)/(문장 총단어
    수)
Natural Language
         Processing
– Luhn’s Summarization Algorithm
  • Score =

Weitere ähnliche Inhalte

Andere mochten auch

Yapcasia 2012 skyarc
Yapcasia 2012 skyarcYapcasia 2012 skyarc
Yapcasia 2012 skyarc
onagatani
 
Featuring my trip to Yunnan
Featuring my trip to YunnanFeaturing my trip to Yunnan
Featuring my trip to Yunnan
jwolfie
 
Sachin tuli
Sachin tuliSachin tuli
Sachin tuli
sknsz
 
Spotkanie z krzysztofem śliwińskim w ramach wiosennej szkoły
Spotkanie z krzysztofem śliwińskim w ramach wiosennej szkołySpotkanie z krzysztofem śliwińskim w ramach wiosennej szkoły
Spotkanie z krzysztofem śliwińskim w ramach wiosennej szkoły
sknsz
 
Application Software
Application SoftwareApplication Software
Application Software
Beth
 
Continuing Pakistan Floods
Continuing Pakistan FloodsContinuing Pakistan Floods
Continuing Pakistan Floods
Carlos Felipe
 
Web ve
Web veWeb ve
Web ve
Anam
 
The romans 3
The romans 3The romans 3
The romans 3
FranJLte
 
Project in mapeh(bravo)
Project in mapeh(bravo)Project in mapeh(bravo)
Project in mapeh(bravo)
Joyjoy Pena
 
Swiatowyponchiny
SwiatowyponchinySwiatowyponchiny
Swiatowyponchiny
sknsz
 

Andere mochten auch (20)

Yapcasia 2012 skyarc
Yapcasia 2012 skyarcYapcasia 2012 skyarc
Yapcasia 2012 skyarc
 
Featuring my trip to Yunnan
Featuring my trip to YunnanFeaturing my trip to Yunnan
Featuring my trip to Yunnan
 
Sachin tuli
Sachin tuliSachin tuli
Sachin tuli
 
Grocery Shopping at Fry's
Grocery Shopping at Fry'sGrocery Shopping at Fry's
Grocery Shopping at Fry's
 
Spotkanie z krzysztofem śliwińskim w ramach wiosennej szkoły
Spotkanie z krzysztofem śliwińskim w ramach wiosennej szkołySpotkanie z krzysztofem śliwińskim w ramach wiosennej szkoły
Spotkanie z krzysztofem śliwińskim w ramach wiosennej szkoły
 
VodQA_Parallelizingcukes_AmanKing
VodQA_Parallelizingcukes_AmanKingVodQA_Parallelizingcukes_AmanKing
VodQA_Parallelizingcukes_AmanKing
 
Application Software
Application SoftwareApplication Software
Application Software
 
Continuing Pakistan Floods
Continuing Pakistan FloodsContinuing Pakistan Floods
Continuing Pakistan Floods
 
Suburbarian - presentation
Suburbarian - presentationSuburbarian - presentation
Suburbarian - presentation
 
Web ve
Web veWeb ve
Web ve
 
The romans 3
The romans 3The romans 3
The romans 3
 
10. perilaku tercela sm t2
10. perilaku tercela sm t210. perilaku tercela sm t2
10. perilaku tercela sm t2
 
Power point 1 media
Power point 1 mediaPower point 1 media
Power point 1 media
 
Testing the Mysterious Sphere
Testing the Mysterious SphereTesting the Mysterious Sphere
Testing the Mysterious Sphere
 
Forever Presentation
Forever PresentationForever Presentation
Forever Presentation
 
Google themes
Google themesGoogle themes
Google themes
 
Project in mapeh(bravo)
Project in mapeh(bravo)Project in mapeh(bravo)
Project in mapeh(bravo)
 
Swiatowyponchiny
SwiatowyponchinySwiatowyponchiny
Swiatowyponchiny
 
VodQA-TooMuchVerificationNotEnoughValidation_SrinivasChillara
VodQA-TooMuchVerificationNotEnoughValidation_SrinivasChillaraVodQA-TooMuchVerificationNotEnoughValidation_SrinivasChillara
VodQA-TooMuchVerificationNotEnoughValidation_SrinivasChillara
 
Percobaan osmosis dan mitosis
Percobaan osmosis dan mitosisPercobaan osmosis dan mitosis
Percobaan osmosis dan mitosis
 

Ähnlich wie Mining the social web ch8 - 1

Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easy
Gopi Krishnan Nambiar
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
Fasihul Kabir
 
Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
Software Guru
 

Ähnlich wie Mining the social web ch8 - 1 (20)

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Erlang/OTP for Rubyists
Erlang/OTP for RubyistsErlang/OTP for Rubyists
Erlang/OTP for Rubyists
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easy
 
NLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easyNLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easy
 
Nltk
NltkNltk
Nltk
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 
Pycon India 2018 Natural Language Processing Workshop
Pycon India 2018   Natural Language Processing WorkshopPycon India 2018   Natural Language Processing Workshop
Pycon India 2018 Natural Language Processing Workshop
 
CPPDS Slide.pdf
CPPDS Slide.pdfCPPDS Slide.pdf
CPPDS Slide.pdf
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Multilingual drupal 7
Multilingual drupal 7Multilingual drupal 7
Multilingual drupal 7
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
 
Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
 
Open nlp presentationss
Open nlp presentationssOpen nlp presentationss
Open nlp presentationss
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extraction
 
ppt
pptppt
ppt
 
ppt
pptppt
ppt
 
HackYale - Natural Language Processing (Week 0)
HackYale - Natural Language Processing (Week 0)HackYale - Natural Language Processing (Week 0)
HackYale - Natural Language Processing (Week 0)
 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLP
 

Mehr von scor7910

대규모 서비스를 지탱하는기술 Ch14
대규모 서비스를 지탱하는기술 Ch14대규모 서비스를 지탱하는기술 Ch14
대규모 서비스를 지탱하는기술 Ch14
scor7910
 
Head first statistics ch15
Head first statistics ch15Head first statistics ch15
Head first statistics ch15
scor7910
 
Head first statistics ch.11
Head first statistics ch.11Head first statistics ch.11
Head first statistics ch.11
scor7910
 
어플 개발자의 서버개발 삽질기
어플 개발자의 서버개발 삽질기어플 개발자의 서버개발 삽질기
어플 개발자의 서버개발 삽질기
scor7910
 
Mining the social web ch3
Mining the social web ch3Mining the social web ch3
Mining the social web ch3
scor7910
 
Software pattern
Software patternSoftware pattern
Software pattern
scor7910
 
Google app engine
Google app engineGoogle app engine
Google app engine
scor7910
 
Cpp 0x kimRyungee
Cpp 0x kimRyungeeCpp 0x kimRyungee
Cpp 0x kimRyungee
scor7910
 
Component configurator
Component configuratorComponent configurator
Component configurator
scor7910
 
Reflection
ReflectionReflection
Reflection
scor7910
 

Mehr von scor7910 (11)

대규모 서비스를 지탱하는기술 Ch14
대규모 서비스를 지탱하는기술 Ch14대규모 서비스를 지탱하는기술 Ch14
대규모 서비스를 지탱하는기술 Ch14
 
Head first statistics ch15
Head first statistics ch15Head first statistics ch15
Head first statistics ch15
 
Head first statistics ch.11
Head first statistics ch.11Head first statistics ch.11
Head first statistics ch.11
 
어플 개발자의 서버개발 삽질기
어플 개발자의 서버개발 삽질기어플 개발자의 서버개발 삽질기
어플 개발자의 서버개발 삽질기
 
Mining the social web ch3
Mining the social web ch3Mining the social web ch3
Mining the social web ch3
 
Software pattern
Software patternSoftware pattern
Software pattern
 
Google app engine
Google app engineGoogle app engine
Google app engine
 
Cpp 0x kimRyungee
Cpp 0x kimRyungeeCpp 0x kimRyungee
Cpp 0x kimRyungee
 
Component configurator
Component configuratorComponent configurator
Component configurator
 
Proxy pattern
Proxy patternProxy pattern
Proxy pattern
 
Reflection
ReflectionReflection
Reflection
 

Kürzlich hochgeladen

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
MateoGardella
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
SanaAli374401
 

Kürzlich hochgeladen (20)

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 

Mining the social web ch8 - 1

  • 1. Mining The Social Web Ch8 Blogs et al.: Natural Language Processing (and Beyond) Ⅰ 발표 : 김연기 네이버 아키텍트를 꿈꾸는 사람들 http://Cafe.naver.com/architect1
  • 2. Natural Language Processing • 마침표로 문장을 처리하자!
  • 3. Natural Language Processing • 마침표로 문장을 처리하자!
  • 4. NLP Pipeline With NLTK 문장의 끝 찾기 단어 자르기 구문별 짝짖기(?) 단어 의미 부여 추출
  • 5. Natural Language Processing • 문장의 끝 찾기(EOS Detection)
  • 6. Natural Language Processing • 문장의 끝 찾기(EOS Detection)
  • 7. Natural Language Processing • 구문별 짝짓기 (POS Tagging)
  • 8. Natural Language Processing
  • 9. Natural Language Processing • 추출( Extraction)
  • 10. Natural Language Processing
  • 11. Natural Language Processing
  • 12. Natural Language Processing def cleanHtml(html): return BeautifulStoneSoup(clean_html(html), convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0] fp = feedparser.parse(FEED_URL) print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title) blog_posts = [] for e in fp.entries: blog_posts.append({'title': e.title, 'content' : cleanHtml(e.content[0].value), 'link': e.links[0].href})
  • 13. Natural Language Processing # Basic stats num_words = sum([i[1] for i in fdist.items()]) num_unique_words = len(fdist.keys()) # Hapaxes are words that appear only once num_hapaxes = len(fdist.hapaxes()) top_10_words_sans_stop_words = [w for w in fdist.items() if w[0] not in stop_words][:10] print post['title'] print 'tNum Sentences:'.ljust(25), len(sentences) print 'tNum Words:'.ljust(25), num_words print 'tNum Unique Words:'.ljust(25), num_unique_words print 'tNum Hapaxes:'.ljust(25), num_hapaxes print 'tTop 10 Most Frequent Words (sans stop words):ntt', 'ntt'.join(['%s (%s)‘ % (w[0], w[1]) for w in top_10_words_sans_stop_words]) print
  • 14. Natural Language Processing
  • 15. Natural Language Processing # Summaization Approach 1: # Filter out non-significant sentences by using the average score plus a # fraction of the std dev as a filter avg = numpy.mean([s[1] for s in scored_sentences]) std = numpy.std([s[1] for s in scored_sentences]) mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences if score > avg + 0.5 * std] # Summarization Approach 2: # Another approach would be to return only the top N ranked sentences top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:] top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
  • 16. Natural Language Processing
  • 17. Natural Language Processing – Luhn’s Summarization Algorithm • Score = (문장에서 중요한 단어)^2)/(문장 총단어 수)
  • 18. Natural Language Processing – Luhn’s Summarization Algorithm • Score =