
A 29-Year Journey of Thai NLP

It has been 29 years since I became involved in NLP research, almost as long as NLP research itself has existed in Thailand, especially for Thai language processing. Following the timeline, the slides show the development of Thai NLP in terms of algorithms and language resources.

Published in: Technology

  1. A 29-Year Journey of Thai NLP: MT-ED-OSS-IR-DM-DT. Virach Sornlertlamvanich, Sirindhorn International Institute of Technology (SIIT), Thammasat University, virach@gmail.com. 28 August 2017, ISAI-NLP 2017, Hua Hin, Thailand
  2. [Timeline figure: research career by year, 1988 to 2015, in five phases ① to ⑤ (5-10-7-5-2 years). Affiliations shown: TITECH (PGLR); NEC/CICC; LINKS, NECTEC; RDI, NECTEC; TCL, NICT; IMA, NECTEC; TPA/SIIT. Topics shown: Machine Translation; MT, NLP; NLP, Speech, Image, e-Learning, OSS; NLP, AWN, IR, OSS; Mobile Application, Digitized Thailand; NLP, AI, Data Mining, Big Data, SNS, Deep Learning. Also marked: SNLP.]
  3. ① NEC/CICC, 1988-1992: When an engineer developed a grammar for the Thai language. Font, encoding, input method, POS, dictionary, verb pattern, grammar, MT.
  4. Thai Non-Logical Order
     • Consonant-vowel sequences are represented in non-logical order: a vowel written to the left of its consonant is stored in visual order, before the consonant in the string, even though it is pronounced after it (left-positioned vowel signs).
     • This complicates collation (sorting) and grapheme-to-phoneme conversion.
     Text: โปรแกรม
     Encoding: U+0E42 U+0E1B U+0E23 U+0E41 U+0E01 U+0E23 U+0E21
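The stored order of the example word can be checked directly. The following is a minimal Python sketch, not part of the original slides: `LEADING_VOWELS`, `code_points` and `logical_order` are hypothetical names chosen for illustration, and the swap is a deliberately naive picture of why collation needs reordering (it ignores consonant clusters such as ปร), not a real collation algorithm.

```python
# Left-positioned (pre-base) Thai vowel signs occupy U+0E40..U+0E44.
LEADING_VOWELS = {chr(cp) for cp in range(0x0E40, 0x0E45)}

def code_points(text):
    """Return the code points of text in stored order, formatted as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]

def logical_order(text):
    """Naively swap each leading vowel with the single character after it."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)

# The slide's example word, written with escapes so the stored order is explicit.
word = "\u0E42\u0E1B\u0E23\u0E41\u0E01\u0E23\u0E21"

print(code_points(word))                 # stored (visual) order: vowel first
print(code_points(logical_order(word)))  # consonant first after the naive swap
```

Sorting on the swapped form groups words by their initial consonant rather than by the leading vowel, which is the collation difficulty the slide refers to.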
  5. Zero-Width Character in Thai
     Example word: ที่อยู่
     Figure labels: base line, consonant, vowel sign (lower), vowel sign (upper), tone mark.
     Text: ท, ที, ท่, ที่
     Encoding: U+0E17 U+0E35 U+0E48 U+0E17 U+0E35 U+0E48 (figure labels: Store, Display)
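The gap between stored characters and displayed cells can be illustrated with Python's standard `unicodedata` module. This is a small sketch under one assumption: a "display cell" is approximated as any character that is not a nonspacing mark (Unicode category `Mn`). The helper name `display_cells` is hypothetical.

```python
import unicodedata

def display_cells(text):
    """Count characters that advance the cursor, i.e. everything except
    nonspacing marks (Unicode general category Mn)."""
    return sum(1 for ch in text if unicodedata.category(ch) != "Mn")

# The slide's example word, spelled out as stored code points.
word = "\u0E17\u0E35\u0E48\u0E2D\u0E22\u0E39\u0E48"

for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

print(len(word), "stored code points,", display_cells(word), "display cells")
```

The upper vowel, lower vowel and tone mark all report category `Mn`, so they take storage but no horizontal space, which is why cursor movement and text wrapping need cluster-aware handling.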
  6. X-TIS620
     Example: “อยู่”
     TIS: อ ย ยู ย่ → CD C2 D9 E8
     X-TIS: อ ย อู่ → CD B0 C2 EA, where EA = B0 (base) + 38 (อู) + 02 (อ่)
     [Bit-pattern table for the combining marks อ็ อ่ อ้ อ๊ อ๋ อ์ อํ อั อิ อี อึ อื อุ อู อฺ; the cell values are not recoverable from the extracted text.]
     Clustered output: “|อ|ยู่|”
     Advantages
     • More than 1,000 code points prepared for kerning and rendering
     • Internal encoding for terminal text wrapping
     • Cursor positioning
     • Base concept for TCC (Thai Character Cluster: the smallest unit of character cluster according to the spelling rules)
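The byte values on this slide can be reproduced in a few lines of Python. The first half uses Python's built-in `tis_620` codec. The second half only re-does the slide's hexadecimal sum; the constant names are hypothetical, and the offsets 0x38 and 0x02 are taken from the slide as printed, not from an X-TIS specification.

```python
# Standard TIS-620 bytes for the slide's example word.
word = "\u0E2D\u0E22\u0E39\u0E48"
tis_bytes = word.encode("tis_620")
print(tis_bytes.hex(" ").upper())

# X-TIS combined code as described on the slide:
# one code point for the pair (lower vowel SARA UU, tone mark MAI EK).
BASE = 0xB0            # "base" value printed on the slide
SARA_UU_OFFSET = 0x38  # offset the slide gives for the lower vowel
MAI_EK_OFFSET = 0x02   # offset the slide gives for the tone mark
combined = BASE + SARA_UU_OFFSET + MAI_EK_OFFSET
print(f"{combined:02X}")
```

Plain TIS-620 spends one byte on each mark, while the slide's X-TIS scheme folds the vowel and tone mark into a single pre-composed code.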
  7. TCC (Thai Character Cluster)
     • The smallest unit of character cluster according to the spelling rules.
     • Thai text is clustered into indivisible units. A character cluster is defined as the smallest recognizable unit; clustering the character string avoids processing invalid Thai character units.
     Examples of TCC
     • Pre-position: เ, แ, ไ, ใ, โ ⊕ C+
     • Post-position: C+ ⊕ ะ, า
     • Upper/Lower: ที่, มี, กุ, รู, …
     • Sound killer: ร์, ดิ์, ตร์, ทธิ์, ถุ์
     • Compound: เสร็จ, เหลือ, หน่วย
     • Leading char: หล่น, หนัง, หวะ, ไหล่
     • Diphthong: ครัว, อ้วน
     Segmentation levels
     • Character: เ - ป - อ้ - า - ห - ม - า - ย
     • Cluster (TCC): เป้า - หมา - ย
     • Word: เป้าหมาย or เป้า - หมาย
     Reference: Virach Sornlertlamvanich and Tanaka Hozumi. The Automatic Extraction of Open Compounds from Text Corpora. Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pp. 1143-1146, Aug 1996.
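To make the clustering idea concrete, here is a heavily simplified Python sketch. It is not the author's TCC rule set: the regular expression covers only four of the patterns listed above (pre-position vowel, leading ห before a sonorant consonant, upper/lower marks, and the post-position vowels ะ, า, ำ), and the names `TCC_PATTERN` and `tcc_segment` are hypothetical. Sound-killer, compound and diphthong clusters are not handled, and characters outside the pattern are skipped.

```python
import re

TCC_PATTERN = re.compile(
    "[\u0E40-\u0E44]?"                        # optional pre-position vowel
    "(?:\u0E2B(?=[\u0E07\u0E0D\u0E19\u0E21"   # leading HO HIP, only when a
    "\u0E22\u0E23\u0E25\u0E27]))?"            #   sonorant consonant follows
    "[\u0E01-\u0E2E]"                         # base consonant
    "[\u0E31\u0E34-\u0E3A\u0E47-\u0E4E]*"     # upper/lower vowels, tone marks
    "[\u0E30\u0E32\u0E33]?"                   # optional post-position vowel
)

def tcc_segment(text):
    """Split text into simplified character clusters (illustrative only)."""
    return TCC_PATTERN.findall(text)

# The slide's example word, written as code points in stored order.
word = "\u0E40\u0E1B\u0E49\u0E32\u0E2B\u0E21\u0E32\u0E22"
print(" - ".join(tcc_segment(word)))
```

On the slide's example the sketch yields the same three clusters as the "Cluster (TCC)" line. A word segmenter can then restrict candidate boundaries to cluster edges, so it never splits a vowel or tone mark from its consonant.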
  8. Implementation (1991-)
     • X-TIS 620 for tterm in UNIX
     • X bitmap fonts
     • X Consortium: Thai in X11R6
     • Thai in UNIX/Linux applications
       • Xfig
       • Mule/GNU Emacs: SWATH, LEXiTRON
       • Xemacs: X-TIS
       • Mozilla: LibInThai
       • LaTeX: Babel, Omega
       • National fonts: Kinnari, Garuda, Norasi
     Figure label: Free developers
  9. POS Tagset
     • 14 categories (N, PRON, V, AUX, DET, ADV, CLAS, CONJ, PREP, INT, PREF, END, NEG, PUNC) and 47 sub-categories
     • VACT, VSTA, VATT
     • Transitive, Intransitive
     • AUX
     • Word order
     • S vs NP
     • No diff in some cases
     No. | POS | Description | Example
     1 | NPRP | Proper noun | วินโดวส์ 95, โคโรน่า, โค้ก, พระอาทิตย์
     2 | NCNM | Cardinal number | หนึ่ง, สอง, สาม, 1, 2, 3
     3 | NONM | Ordinal number | ที่หนึ่ง, ที่สอง, ที่สาม, ที่1, ที่2, ที่3
     4 | NLBL | Label noun | 1, 2, 3, 4, ก, ข, a, b
     5 | NCMN | Common noun | หนังสือ, อาหาร, อาคาร, คน
     6 | NTTL | Title noun | ดร., พลเอก
     7 | PPRS | Personal pronoun | คุณ, เขา, ฉัน
     8 | PDMN | Demonstrative pronoun | นี่, นั่น, ที่นั่น, ที่นี่
     9 | PNTR | Interrogative pronoun | ใคร, อะไร, อย่างไร
     10 | PREL | Relative pronoun | ที่, ซึ่ง, อัน, ผู้
     11 | VACT | Active verb | ทำงาน, ร้องเพลง, กิน
     12 | VSTA | Stative verb | เห็น, รู้, คือ
     13 | VATT | Attributive verb | อ้วน, ดี, สวย
     14 | XVBM | Pre-verb auxiliary, before negator “ไม่” | เกิด, เกือบ, กำลัง
     15 | XVAM | Pre-verb auxiliary, after negator “ไม่” | ค่อย, น่า, ได้
     16 | XVMM | Pre-verb, before or after negator “ไม่” | ควร, เคย, ต้อง
     17 | XVBB | Pre-verb auxiliary, in imperative mood | กรุณา, จง, เชิญ, อย่า, ห้าม
     18 | XVAE | Post-verb auxiliary | ไป, มา, ขึ้น
     19 | DDAN | Definite determiner, after noun without classifier in between | นี่, นั่น, โน่น, ทั้งหมด
     20 | DDAC | Definite determiner, allowing classifier in between | นี้, นั้น, โน้น, นู้น
     21 | DDBQ | Definite determiner, between noun and classifier or preceding quantitative expression | ทั้ง, อีก, เพียง
     22 | DDAQ | Definite determiner, following quantitative expression | พอดี, ถ้วน
     23 | DIAC | Indefinite determiner, following noun; allowing classifier in between | ไหน, อื่น, ต่างๆ
     24 | DIBQ | Indefinite determiner, between noun and classifier or preceding quantitative expression | บาง, ประมาณ, เกือบ
     25 | DIAQ | Indefinite determiner, following quantitative expression | กว่า, เศษ
     26 | DCNM | Determiner, cardinal number expression | หนึ่งคน, เสือ 2 ตัว
     27 | DONM | Determiner, ordinal number expression | ที่หนึ่ง, ที่สอง, ที่สุดท้าย
     28 | ADVN | Adverb with normal form | เก่ง, เร็ว, ช้า, สม่ำเสมอ
     29 | ADVI | Adverb with iterative form | เร็วๆ, เสมอๆ, ช้าๆ
     30 | ADVP | Adverb with prefixed form | โดยเร็ว
     31 | ADVS | Sentential adverb | โดยปกติ, ธรรมดา
     32 | CNIT | Unit classifier | ตัว, คน, เล่ม
     33 | CLTV | Collective classifier | คู่, กลุ่ม, ฝูง, เชิง, ทาง, ด้าน, แบบ, รุ่น
     34 | CMTR | Measurement classifier | กิโลกรัม, แก้ว, ชั่วโมง
     35 | CFQC | Frequency classifier | ครั้ง, เที่ยว
     36 | CVBL | Verbal classifier | ม้วน, มัด
     37 | JCRG | Coordinating conjunction | และ, หรือ, แต่
     38 | JCMP | Comparative conjunction | กว่า, เหมือนกับ, เท่ากับ
     39 | JSBR | Subordinating conjunction | เพราะว่า, เนื่องจาก, ที่, แม้ว่า, ถ้า
     40 | RPRE | Preposition | จาก, ละ, ของ, ใต้, บน
     41 | INT | Interjection | โอ้ย, โอ้, เออ, เอ๋, อ๋อ
     42 | FIXN | Nominal prefix | การทำงาน, ความสนุกสนาน
     43 | FIXV | Adverbial prefix | อย่างเร็ว
     44 | EAFF | Ending for affirmative sentence | จ๊ะ, จ้ะ, ค่ะ, ครับ, นะ, น่า, เถอะ
     45 | EITT | Ending for interrogative sentence | หรือ, เหรอ, ไหม, มั้ย
     46 | NEG | Negator | ไม่, มิได้, ไม่ได้, มิ
     47 | PUNC | Punctuation | (, ), “, ,, ;
     Reference: Virach Sornlertlamvanich, Naoto Takahashi and Hitoshi Isahara. Building a Thai Part-Of-Speech Tagged Corpus (ORCHID). The Journal of the Acoustical Society of Japan (E), Vol.20, No.3, pp 189-140, May 1999.
  10. Multi-lingual Machine Translation Project (MMT) 1987-1992 (+2)
      • 6-year project (1987-1992)
      • Interlingual approach MMT for CIJMT
      • R&D: analysis, generation, dictionary, interlingua, integration system
      • Collaboration
        − Thailand (NECTEC, CU, KU, KMUTT, KMITL)
        − Japan (NEC, Fujitsu, Hitachi, OKI, Sharp, Mitsubishi, Toshiba)
        − China, Indonesia, Malaysia
      Timeline
      • 1969: Computerized Alphabetization of Thai
      • 1974: Thai Transliteration System
      • 1981: ARIANE Project (English-Thai MT; Ministry of University Affairs and Grenoble Univ.)
      • 1986: Establishment of NECTEC
      • 1986: TIS620-2529, Thai Standard Character Code for Computer by TISI
      • 1987-92 (+2): NECTEC-CICC MMT Project
      • 1992-present: Establishment of LINKS at NECTEC; AI R&D Center at KMITT; NAiST at KU; KIND at SIIT; RDI at NECTEC; SLS at CU, ….
  11. MMT Project
  12. Interlingua in MMT
  13. ② LINKS/RD-I, NECTEC, 1993-2003: NLP Applications and Services. LEXiTRON, Royal Thai Institute Dictionary, EZKey, ParSit, Sansarn.
  14. [Timeline figure for phase ②: B.E. 2537-2545 (1994-2002).]
  15. LEXiTRON
      • LEXiTRON version 1.1
      • Corpus-based dictionary
      • Dictionary for writing
      • Launched in 1995
      • CD-ROM for Windows 3.1 Thai Edition
      • Thai: 11,000 entries
      • English: 9,000 entries
      • 6 types of dictionaries
        − General word entry
        − Thai usage dictionary (sample sentence)
        − Synonym-Antonym
        − Thai-English (equivalent)
        − Word class
      Reference: Virach Sornlertlamvanich, Apichit Pittayaratsophon and Kriangchai Chansaenwilai. Thai Dictionary Data Base Manipulation using Multi-indexed Double Array Trie. The 5th Annual Conference, NECTEC, Bangkok. pp. 197-206, 1993. (in Thai)
  16. Thai Electronic Dictionary
  17. ORCHID POS Tagged Corpus
      Sample record:
        %TTitle: การประชุมทางวิชาการ ครั้งที่ 1
        %ETitle: [1st Annual Conference]
        %TAuthor:
        %EAuthor:
        %TInbook: การประชุมทางวิชาการ ครั้งที่ 1, โครงการวิจัยและพัฒนา อิเล็กทรอนิกส์และคอมพิวเตอร์, ปีงบประมาณ 2531, เล่ม 1
        %EInbook: The 1st Annual Conference, Electronics and Computer Research and Development Project, Fiscal Year 1988, Book 1
        %TPublisher: ศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์ แห่งชาติ, กระทรวงวิทยาศาสตร์ เทคโนโลยีและการพลังงาน
        %EPublisher: National Electronics and Computer Technology Center, Ministry of Science, Technology and Energy
        %Page:
        %Year: 1989
        %File:
        #P1
        #1
        การประชุมทางวิชาการ ครั้งที่ 1//
        การ/FIXN ประชุม/VACT ทาง/NCMN วิชาการ/NCMN <space>/PUNC ครั้ง/CFQC ที่ 1/DONM//
        #2
        โครงการวิจัยและพัฒนาอิเล็กทรอนิกส์และคอมพิวเตอร์//
        โครงการวิจัยและพัฒนา/NCMN อิเล็กทรอนิกส์/NCMN และ/JCRG คอมพิวเตอร์/NCMN//
        …
      • ORCHID Corpus (1997), supported by CRL Japan
      • Source: NECTEC Technical Report
      • Size: 160 documents; 5.75 MB; 400K words
      • Tag: XML tagged paragraph, sentence, word, part-of-speech
      • Availability: for research
      • Difficulties: hard to find consensus on the sentence boundary, word boundary, and POS tag
      Reference: Virach Sornlertlamvanich, Thatsanee Charoenporn and Hitoshi Isahara. ORCHID: Thai Part-Of-Speech Tagged Corpus. Technical Report Orchid TR-NECTEC-1997-001, NECTEC, Thailand, pp. 5-19, Dec 1997.
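The `word/TAG` lines in the sample can be read with a short regular expression. The Python sketch below assumes the flattened one-line layout shown on the slide (tokens separated by spaces, sentence closed by `//`); the distributed corpus files may be laid out differently, and `TOKEN` and `parse_tagged` are hypothetical names.

```python
import re

# A token is "surface/TAG"; the surface may itself contain a space
# (e.g. the ordinal token in sentence #1), so it is matched lazily up to "/TAG".
TOKEN = re.compile(r"\s*(.+?)/([A-Z]+)")

def parse_tagged(line):
    """Return a list of (word, tag) pairs from one tagged sentence line."""
    line = line.strip().rstrip("/")  # drop the trailing "//" sentence marker
    return TOKEN.findall(line)

# Tagged line of sentence #2 from the slide.
sample = "โครงการวิจัยและพัฒนา/NCMN อิเล็กทรอนิกส์/NCMN และ/JCRG คอมพิวเตอร์/NCMN//"

for word, tag in parse_tagged(sample):
    print(tag, word)
```

The resulting pairs are the usual starting point for counting tag frequencies or for training a tagger on the 47 sub-categories.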
  18. Interlingua English-Thai MT: Concept Composition and Decomposition
      [Interlingua graph, composed form. Nodes: c#amaze, c#news, c#i. Arc labels: object, implement, this.]
      [Interlingua graph, decomposed form. Nodes: c#cause, c#news, c#i, c#amazing. Arc labels: object, implement, this, a-object.]
      English: This news amazes me.
      Thai: ข่าวนี้ทำให้ฉันประหลาดใจ
  19. English-Thai Web Translation
      http://come.to/parsit
      http://www.suparsit.com/
      • 51,075 visits/month
      • 138,748 translation-pages/month
  20. Term Candidate Extraction for Dictionary-less Search Engine
      • Virach Sornlertlamvanich et al. (COLING 2000): Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm
        - A C4.5-trained decision tree determines potential word boundaries from MI, entropy and some linguistic information
        - Capable of discovering new words in a document without assistance from a static dictionary
      Reference: Virach Sornlertlamvanich, Tanapong Potipiti and Thatsanee Charoenporn. Automatic Corpus-based Thai Word Extraction with the C4.5 Learning Algorithm. Proceedings of the 18th International Conference on Computational Linguistics (COLING2000), Saarbrucken, Germany, pp 802-807, July-August 2000.
  21. Attributes (1): Left and Right Mutual Information
      $MI_L(xyz) = \frac{p(xyz)}{p(x)\,p(yz)}$, $MI_R(xyz) = \frac{p(xyz)}{p(xy)\,p(z)}$
      where x is the leftmost character of string xyz, y is the middle substring of xyz, z is the rightmost character of string xyz, and p( ) is the probability function.
      High mutual information implies that xyz co-occurs more than expected by chance. If xyz is a word, its MIL and MIR must be high.
      Example contexts: …efunction… and ...function...
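As a rough illustration of this attribute, the Python sketch below estimates left and right mutual information from raw substring counts. It assumes the ratio form MI_L = p(xyz) / (p(x) p(yz)) and MI_R = p(xyz) / (p(xy) p(z)), and it estimates p( ) as count divided by text length; both are assumptions made for this sketch, and `substring_counts` and `left_right_mi` are hypothetical names. The toy corpus is invented.

```python
from collections import Counter

def substring_counts(text, max_len):
    """Count every substring of length 1..max_len in text."""
    counts = Counter()
    for start in range(len(text)):
        for length in range(1, max_len + 1):
            if start + length <= len(text):
                counts[text[start:start + length]] += 1
    return counts

def left_right_mi(s, counts, total):
    """Return (MI_L, MI_R) for a candidate string s = xyz (len(s) >= 2)."""
    def p(t):
        return counts[t] / total
    mi_left = p(s) / (p(s[0]) * p(s[1:]))     # split as x | yz
    mi_right = p(s) / (p(s[:-1]) * p(s[-1]))  # split as xy | z
    return mi_left, mi_right

text = "xabcyabczabc"  # toy corpus, not from the slides
counts = substring_counts(text, max_len=3)
print(left_right_mi("abc", counts, total=len(text)))
```

Values well above 1 mean the parts occur together more often than independence would predict. A candidate containing a substring that never occurs in the corpus would divide by zero, so real use needs smoothing or a guard.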
  22. Attributes (2): Left and Right Entropy
      $H_L(y) = -\sum_{x} p(xy \mid y)\,\log_2 p(xy \mid y)$, $H_R(y) = -\sum_{z} p(yz \mid y)\,\log_2 p(yz \mid y)$
      where x is the leftmost character of string xyz, y is the middle substring of xyz, z is the rightmost character of string xyz, and p( ) is the probability function.
      Entropy shows the variety of characters before and after a word. If y is a word, its left and right entropy must be high.
      Example contexts: ...?function... and ...?unction...
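The entropy attribute can be sketched the same way. The Python code below assumes Shannon entropy (base 2) over the characters observed immediately before and after each occurrence of a candidate string y; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def entropy(counter):
    """Shannon entropy (bits) of the distribution given by a Counter."""
    n = sum(counter.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in counter.values())

def left_right_entropy(text, y):
    """Entropy of the characters adjacent to each occurrence of y in text."""
    left, right = Counter(), Counter()
    start = text.find(y)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(y)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(y, start + 1)
    return entropy(left), entropy(right)

text = "xabcyabczabc"  # toy corpus, not from the slides
print(left_right_entropy(text, "abc"))  # varied neighbours on both sides
print(left_right_entropy(text, "bc"))   # always preceded by the same character
```

A word-like string appears in many contexts, so the neighbouring characters vary and the entropy is high. A fragment such as "bc" here is always preceded by "a", so its left entropy is zero, which signals that the true boundary lies further left.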
  25. EZKey
      [Keyboard figure: keys %~, T/E, D, F, G and Shift, with Thai legends ฏ, ก, โ, ด, ฌ, เ.]
      Typed: .of]dp68 computer vtwidh’jkpwxs,f_
      Result: ในโลกยุค computer อะไรก็ง่ายไปหมด_
  26. The Names
      • LEXiTRON: Lexicon + Electron
      • ParSit: Parse it
      • ORCHID: Orchid = Ran (蘭)
      • Sansarn logo: Frog = Return of happiness (カエルは“福帰る”, 幸運が還ってくる)
      • LinuxTLE, OfficeTLE: TLE = Ta-Le (Sea series Linux distro), Thai Language Extension
      • Vaja: Speech
      • Smart-Q, EZKey
  27. ③ TCL, NICT, 2003-2008: Multi-lingualism. Language Observatory, Asian WordNet.
  28. Collaboration Project
      [Gantt chart: project by year, 2003 to 2010; bar positions are not recoverable from the extracted text.]
      • Asian E-Learning Network (AEN), CICC
      • Language Observatory Project (LOP), NUT
      • Intercultural Collaboration Experiments (ICE), KU
      • Asian Language Resource Network (ALRN), NUT
      • Asian Language Resources (ALR), NEDO
      • World Network on Linguistics Diversity (REDILI), UNESCO
      • Open Standards Promotion, NECTEC, UNDP-APDIP
      • Asian applied nlp for linguistics Diversity and language resource Development (ADD)
      • KuiSci: STKC Research Community for MOST
      • KuiPoll: Educational Community (BUU, NECTEC)
      • KuiHerb: Collective Herbal Information (SIL, PSU, NECTEC)
      • AsianWordNet: WordNet for Asian languages development and sharing
      • XPLOG: Experience Log for Local Wisdom Collection
      • NLP tools and corpora web services
  29. TCL’s Computational Lexicon: Representativity (constraint based)
      Attribute | Value description
      Logical constraints:
      Is-a (ISA) | a conceptual class of a given word X
      Equal (EQU) | a word having the same meaning as a given word X
      Not-equal (NEQ) | a word having the opposite meaning of a given word X
      Part-of (POF) | a conceptual class specifying a part of a given word X
      Whole-of (WOF) | a conceptual class referring to the whole of which a given word X is a part
      Semantic constraints:
      Agent (AGT) | an entity initiating the action
      Object (OBJ) | an entity affected by the action
      Instrument (INS) | an entity used in the action
      Location (LOC) | a position or place where an event occurs
      Time (TIM) | a point or period of time when an event occurs
  31. Synset Assignment Algorithm (CS=4)
      • Accept the synset that includes more than one English equivalent, with confidence score 4.
      [Diagram: L0 linked to English equivalents E0 and E1, which are linked by ∈ to synsets S0, S1, S2.]
      Example: L0: เป้าหมาย; E0: aim; E1: target; S0: purpose, intent, intention, aim, design; S1: aim, object, objective, target; S2: aim
  32. Synset Assignment Algorithm (CS=3)
      • Accept the synset that includes more than one English equivalent from the synonym of the target language, with confidence score 3.
      [Diagram: L0 and its synonym L1 linked to English equivalents E0 and E1, which are linked by ∈ to synsets.]
      Example: L0: จ้อง; L1: เพ่งมอง (synonym); E0: stare; E1: gaze; S0: stare; S1: gaze, stare
  33. Synset Assignment Algorithm (CS=2)
      • Accept the only synset that includes the English equivalent, with confidence score 2.
      [Diagram: L0 linked to E0, which is linked by ∈ to S0.]
      Example (technical term): L0: สูติแพทย์; E0: obstetrician; S0: obstetrician, accoucheur
  34. Synset Assignment Algorithm (CS=1)
      • Accept more than one synset that includes each of the English equivalents, with confidence score 1.
      [Diagram: L0 linked to E0 and E1; E0 linked by ∈ to S0 and S1, E1 linked by ∈ to S2.]
      Example (common term): L0: ช่อง; E0: hole; E1: canal; S0: hole, hollow; S1: hole, trap, cakehole, maw, yap, gap; S2: canal, duct, epithelial duct, channel
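The four confidence levels can be read as a cascade. The Python sketch below is one possible reading, not the published algorithm: it assumes the rules are tried from CS=4 down to CS=1 and that only the first matching tier is returned. The function name `assign_synsets` and its parameters are hypothetical; the test data are the examples from the four slides.

```python
def assign_synsets(equivalents, synsets, synonym_equivalents=()):
    """Map synset id -> confidence score for one source-language word.

    equivalents:         English equivalents of the word (E0, E1, ...)
    synsets:             dict of synset id -> list of English members
    synonym_equivalents: English equivalents of the word's synonyms
    """
    eq = set(equivalents)
    syn = set(synonym_equivalents) - eq
    members = {sid: set(words) for sid, words in synsets.items()}

    # CS=4: a synset contains two or more of the word's own equivalents.
    hits = [sid for sid, m in members.items() if len(m & eq) >= 2]
    if hits:
        return {sid: 4 for sid in hits}

    # CS=3: a synset contains an equivalent of the word and one of a synonym.
    hits = [sid for sid, m in members.items() if m & eq and m & syn]
    if hits:
        return {sid: 3 for sid in hits}

    # CS=2: exactly one synset contains an equivalent; CS=1: several do.
    candidates = [sid for sid, m in members.items() if m & eq]
    if len(candidates) == 1:
        return {candidates[0]: 2}
    return {sid: 1 for sid in candidates}

print(assign_synsets(
    ["aim", "target"],
    {"S0": ["purpose", "intent", "intention", "aim", "design"],
     "S1": ["aim", "object", "objective", "target"],
     "S2": ["aim"]},
))
```

Lower scores mark assignments that rest on weaker evidence, so they are the natural candidates for the manual correction and voting steps of the development process.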
  35. Asian WordNet Development Process
      [Process diagram. Labels: KUI (Correction, Voting, Lookup, Translation, Discussion, Addition); WN, GWN, AWN; X-English, Thai-English and Indonesian-English dictionaries; merged-WN; ML; Applications (Dictionary, Ontology, CL-Search, MT, Summarization, IE/IR, ….)]
  36. Asian WordNet
      http://www.asianwordnet.org/
      • Asian WordNet
      • Visualization of Asian WordNet
      • Function
        • Cross language visualization
        • 3 modes of visualization
      • Progress (May 3, 2010)
        • Burmese (19949 senses, 11006 u. words)
        • Indonesian (26175 senses, 24398 u. words)
        • Japanese (58447 senses, 64678 u. words)
        • Korean (42274 senses, 26009 u. words)
        • Lao (38890 senses, 44032 u. words)
        • Mongolian (1624 senses, 1574 u. words)
        • Nepali (41 senses, 42 u. words)
        • Sinhala (268 senses, 119 u. words)
        • Sudanese (69 senses, 52 u. words)
        • Thai (71139 senses, 69998 u. words)
      • Collaboration
        • TCL
        • ADD members
  37. ④ NECTEC, 2009-2013: Digitalization. Linked Open Data, Digitized Thailand, Thailand-1-Click.
  38. Semantic Link Generation
      • Semantic representation of the description
      • Keyword extraction: extract keywords in text documents and link them to appropriate articles
      • Semantic relation extraction: extract common syntactic patterns between two keywords and generalize them to a triple (ei, rij, ej)
      • Linked Data: a set of triples (ei, rij, ej)
      Reference: Virach Sornlertlamvanich and Canasai Kruengkrai. Effectiveness of Keyword and Semantic Relation Extraction for Knowledge Map Generation. Proceedings of The Second International Workshop on Worldwide Language Service Infrastructure (WLSI), Kyoto University, Kyoto, Japan, January 22-23, 2015.
  39. Types of Semantic Relation
      [Figure labels: title, tag, description.]
  40. Knowledge Map
      [Figure labels: Infobox, Knowledge map.]
      ISBUILTIN(พระเจดีย์กลางน้ำ, พ.ศ.2403)
      ISLOCATEDAT(พระเจดีย์กลางน้ำ, ตำบลปากน้ำ)
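The two relations above fit the (entity, relation, entity) triple form directly. The Python sketch below stores them as a set of triples and runs a simple lookup; the `Triple` type and the `objects_of` helper are hypothetical names for illustration, not part of the described system.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str   # e_i
    relation: str  # r_ij
    obj: str       # e_j

# The two facts shown on the knowledge-map slide.
linked_data = {
    Triple("พระเจดีย์กลางน้ำ", "ISBUILTIN", "พ.ศ.2403"),
    Triple("พระเจดีย์กลางน้ำ", "ISLOCATEDAT", "ตำบลปากน้ำ"),
}

def objects_of(triples, subject, relation):
    """Return all e_j such that (subject, relation, e_j) is in the set."""
    return sorted(t.obj for t in triples
                  if t.subject == subject and t.relation == relation)

print(objects_of(linked_data, "พระเจดีย์กลางน้ำ", "ISLOCATEDAT"))
```

Because every fact has the same three-part shape, records extracted from different documents can be merged by taking the union of their triple sets.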
  41. Semantically Enhanced Cultural Database [Place, Person, Artifact]
      [Figure labels: Infobox, Knowledge map, Creator, Making, Product, Shop, Knowledging.]
  42. Digital Content Technology
  43. Projects in Digitized Thailand, 2009
      • DT PaaS on the Cloud
      • Digitized Thailand (http://www.digitized-thailand.org/)
      • Digitized Lanna (http://www.digitized-lanna.com/)
      • Digitized Isan (http://www.digitized-isan.com/)
  44. Digitized Thailand: The Ultimate Goal
      • DT is a framework for collaboration in technology and content development
      • DT is a platform for digital content sharing
      • Toward a creative economy, DT PaaS will be established
  45. ⑤ SIIT, 2014-…: Data, Data, Data. NLP, Big Data, Deep Learning, Social Computing, IoT, AI.
  46. NLP Challenges
      • The Internet, big data, machine learning and deep learning have opened up new possibilities.
      • Facebook: adds 0.5 petabyte (1 petabyte = 10^15 bytes) of data every 24 hours
      • Twitter: adds 340 million tweets per day
      • YouTube: adds 100 hours of new video every minute
      Sources: Germin8, Social Intelligence; The Evolution of Communication
  47. NLP Challenges
      [Figure label: Data Community DC (DC2).]
      Reference: Steven Bird, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
  48. Data Data Data!!!
      • The number of users on social networks is increasing drastically
      • Keywords in the content express the concepts of the conversation
      • Social media texts are entered in time sequence
      • But social media texts are normally short, incomplete and diverse
  49. NLP, Big Data, Deep Learning, Social Computing, IoT, AI
