Design and construction of a text-mining-based platform
for classifying and rating posts in a blog network,
for Betazeta Networks S.A.

Camilo López A.
General Objective
The general objective of this work is to support the manual processing of
large volumes of posts in the Betazeta blog network by designing and
implementing a prototype that automatically categorizes this data using
text mining.
Specific Objectives
1. Understand in depth the company's problem and context, along with the
   required knowledge of text mining, models, and methodologies.

2. Select the historical data, the queries over it, and the models that
   enable successful categorization predictions.

3. Establish methods and metrics for evaluating the proposed solution.

4. Using the knowledge gained in the previous objectives, design the
   automatic post categorization process.
Specific Objectives (continued)
5. Design and implement a prototype that lets users submit information in
   a form suitable for analysis, and lets the company process, filter, and
   publish it according to business criteria.

6. Implement the evaluation methodology.
The Problem
betazeta
7.5 million monthly visits
User Generated Content
Volume
Google Categories
Spam
Content filter
Categorize
Theoretical Background
Data Mining
Text cleaning
Stemming
Stop Words
Dictionary
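The cleaning steps named on this slide can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the stop-word list, the accent stripping, and the function names `strip_accents`, `clean`, and `build_dictionary` are all assumptions made for the example. Stemming is left out because the deck lists it under future work.

```python
import re
import unicodedata

# Hypothetical, tiny stop-word list for illustration only;
# a real Spanish list would be much longer.
STOP_WORDS = {"el", "la", "los", "las", "de", "del", "en", "y", "que", "un", "una", "a"}


def strip_accents(text):
    """Remove diacritics so that 'fútbol' and 'futbol' become the same token."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def clean(text):
    """Lowercase, strip accents, keep runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", strip_accents(text.lower()))
    return [t for t in tokens if t not in STOP_WORDS]


def build_dictionary(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    dictionary = {}
    for doc in documents:
        for token in clean(doc):
            dictionary.setdefault(token, len(dictionary))
    return dictionary
```

The integer ids produced by `build_dictionary` are the kind of input a topic model such as LDA typically expects in place of raw strings.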
Latent Dirichlet Allocation (LDA)
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type
specimen book. It has survived not only five centuries, but also the leap into
electronic typesetting, remaining essentially unchanged. It was popularised in
the 1960s with the release of Letraset sheets containing Lorem Ipsum passages,
and more recently with desktop publishing software like Aldus PageMaker
including versions of Lorem Ipsum.

It is a long established fact that a reader will be distracted by the readable
content of a page when looking at its layout. The point of using Lorem Ipsum is
that it has a more-or-less normal distribution of letters, as opposed to using
'Content here, content here', making it look like readable English. Many desktop
publishing packages and web page editors now use Lorem Ipsum as their default
model text, and a search for 'lorem ipsum' will uncover many web sites still in
their infancy. Various versions have evolved over the years, sometimes by
accident, sometimes on purpose (injected humour and the like).

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots
in a piece of classical Latin literature from 45 BC, making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia,
looked up one of the more obscure Latin words, consectetur, from a Lorem
Ipsum passage, and going through the cites of the word in classical literature,
discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32
and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and
Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics,
very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem
ipsum dolor sit amet..", comes from a line in section 1.10.32.

The standard chunk of Lorem Ipsum used since the 1500s is reproduced below
for those interested. Sections 1.10.32 and 1.10.33 from "de Finibus Bonorum et
Malorum" by Cicero are also reproduced in their exact original form,
accompanied by English versions from the 1914 translation by H. Rackham.
There are many variations of passages of Lorem Ipsum available, but the
majority have suffered alteration in some form, by injected anything
embarrassing hidden in the middle of text. All the Lorem Ipsum generators on
the Internet tend to repeat predefined chunks as necessary, making this the first
true generator on the Internet. It uses a dictionary of over 200 Latin words,
combined with a handful of model sentence structures, to generate Lorem
Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always
free from repetition, injected humour, or non-characteristic words etc.
[Slide sequence: the sample text above is repeated while its passages are
progressively labeled A, then A and B, then A, B, and C. The closing slides
show the labels A, B, and C alongside LDA.]
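The idea illustrated here, that LDA assigns each word of a document to one of several topics, can be sketched with a small collapsed Gibbs sampler. The deck does not say which LDA implementation or inference method the platform used, so the sampler below, its hyperparameters (`alpha`, `beta`), and the function name `lda_gibbs` are assumptions chosen only to make the idea concrete.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (doc_topic, topic_word): per-document topic proportions and
    per-topic word probabilities, both as lists of lists.
    """
    rng = random.Random(seed)
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []

    # Give every word occurrence a random initial topic.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][w] += 1
            topic_totals[k] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][w] -= 1
                topic_totals[k] -= 1
                # Resample the topic with weight
                # (doc-topic count + alpha) * (topic-word count + beta)
                # / (topic total + vocab_size * beta).
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][w] + beta)
                    / (topic_totals[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][w] += 1
                topic_totals[k] += 1

    # Turn the final counts into smoothed probability estimates.
    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[t] + vocab_size * beta) for c in counts]
        for t, counts in enumerate(topic_word_counts)
    ]
    return doc_topic, topic_word
```

Each row of `doc_topic` is a document's mixture over topics, which corresponds to the A/B/C proportions of a post. Each row of `topic_word` ranks the vocabulary for one topic.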
Platform Description

Training

Historical Data   Cleaning                  Training
Scraping          Stop Words                LDA
                  Frequency 1               Storage
                  Cross-cutting Frequency   Filtering
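The slides do not define the "Frecuencia 1" and "Frecuencia Transversal" cleaning filters. The sketch below assumes the first drops tokens that occur only once in the whole corpus and the second drops tokens that occur in every blog, on the reasoning that neither helps tell topics apart. Both readings, and the name `filter_vocabulary`, are assumptions rather than the author's stated method.

```python
from collections import Counter


def filter_vocabulary(posts_by_blog, stop_words):
    """Return the set of tokens kept for training.

    posts_by_blog: dict mapping a blog name to a list of tokenized posts.
    Drops stop words, tokens seen only once in the whole corpus (assumed
    meaning of "Frecuencia 1"), and tokens that occur in every blog
    (assumed meaning of "Frecuencia Transversal").
    """
    corpus_counts = Counter()      # total occurrences of each token
    blogs_containing = Counter()   # number of blogs in which each token occurs
    for posts in posts_by_blog.values():
        blog_tokens = set()
        for tokens in posts:
            corpus_counts.update(tokens)
            blog_tokens.update(tokens)
        blogs_containing.update(blog_tokens)

    num_blogs = len(posts_by_blog)
    return {
        token
        for token, count in corpus_counts.items()
        if token not in stop_words
        and count > 1
        and blogs_containing[token] < num_blogs
    }
```

With a single blog every token would count as occurring in every blog, so this reading of the filter only makes sense for a multi-blog corpus such as the one described in the deck.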
Categorization

Plain Text   Cleaning     Classification
             Stop Words   LDA Model
                          Multinomial Naïve Bayes
                          Other methods
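One of the classifiers named on this slide, Multinomial Naïve Bayes, can be sketched in a few lines. This is a generic textbook version with add-one smoothing, written under the assumption that posts arrive as lists of already-cleaned tokens. The class name and the toy data in the usage note are hypothetical, and the deck does not show how the platform combines this classifier with the LDA model.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, tokens):
        """Unnormalized log posterior of each class for a tokenized post."""
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / self.total_docs)  # class prior
            for token in tokens:
                if token in self.vocabulary:  # skip words never seen in training
                    count = self.word_counts[label][token]
                    score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)
```

For example, after `MultinomialNB().fit([["gol", "copa"], ["telefono", "android"]], ["Ferplei", "FayerWayer"])`, calling `predict(["gol"])` returns `"Ferplei"` because "gol" was only seen in that class.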
Python
Django
Model View Controller
MySQL
Web Service
Results

Training

4,000 posts
1,000 FayerWayer
1,000 Wayerless
1,000 Belelú
1,000 Ferplei

5 topics detected

chile colo equipo copa partido barcelona
futbol universidad jugador partidos
seleccion ex jugadores alexis goles club
tecnico fecha sanchez chileno real torneo
madrid final america nacional catolica
gran primera argentino estadio jugar
mundial san ahora luego volante delantero
clausura campeon agosto nuevo gol
argentina primer sostuvo carlos mejor liga

93% Ferplei
Validation

350 posts

FayerWayer and Wayerless
             Sum   Frequency   TF-IDF   NB     NBM
Precision    88%   85%         83%      91%    92%
Recall       96%   97%         91%      94%    93%
F-Measure    92%   91%         87%      93%    93%

Belelú
             Sum   Frequency   TF-IDF   NB     NBM
Precision    91%   93%         79%      86%    84%
Recall       74%   67%         62%      85%    86%
F-Measure    82%   78%         70%      85%    85%

Ferplei
             Sum   Frequency   TF-IDF   NB     NBM
Precision    96%   94%         91%      100%   100%
Recall       96%   96%         96%      89%    94%
F-Measure    96%   95%         94%      94%    97%
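The three metrics in these tables can be computed as sketched below. The deck does not give its formula for F-Measure, so the sketch assumes the balanced F1 score (the harmonic mean of precision and recall). The function names are illustrative.

```python
def f_measure(precision, recall):
    """Balanced F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall, and F1 for one class treated as the positive label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)
```

Under this assumption, a precision of 88% and a recall of 96% give `f_measure(0.88, 0.96)` of about 0.918, which rounds to the 92% reported in the first column of the FayerWayer and Wayerless table.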
Training

14,400 posts

Validation

1,600 posts

Video Games
Football
Music, Parties, and Outings
Mobile Telephony
Automobiles
Relationships and Social Life
Global Environment
Science and Technology
Small-Scale Environment
Women and Sexuality
Family and Society
Space Research
Tech Gadgets
Services and Technology
Automobiles: Top Gear

Complexity
Future Work
Improve text cleaning
Multi-model system
Blog prediction
Model improvements over time
Incorporate stemming

Thank You

Thesis topic presentation v1

  • 1. Design and construction of a text-mining-based platform for classifying and rating posts in a blog network, for Betazeta Networks S.A. Camilo López A.
  • 2. General Objective The general objective of this work is to support the manual processing of large volumes of posts in Betazeta's blog network by designing and implementing a prototype that categorizes this data automatically using text mining.
  • 3. Specific Objectives 1. Understand in depth the problem and the company's context, along with the required knowledge of text mining, models, and methodologies. 2. Select the historical data, the queries over it, and the models that allow successful categorization predictions. 3. Establish methods and metrics for evaluating the proposed solution. 4. Using the knowledge gained in the previous objectives, design the automatic post categorization process.
  • 4. Specific Objectives 5. Design and implement a prototype that lets users submit information in a form suitable for analysis, and lets the company process, filter, and publish it according to business criteria. 6. Implement the evaluation methodology.
  • 7. 7.5 million monthly visits
  • 12. Spam
  • 26. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like). Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32. 
The standard chunk of Lorem Ipsum used since the 1500s is reproduced below for those interested. Sections 1.10.32 and 1.10.33 from "de Finibus Bonorum et Malorum" by Cicero are also reproduced in their exact original form, accompanied by English versions from the 1914 translation by H. Rackham. There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc.
  • 27. Same Lorem Ipsum placeholder text as slide 26, with the marker A before the second sentence ("A Lorem Ipsum has been the industry's standard dummy text...").
  • 28. Same text, with the marker A (as on slide 27) and the marker B before "It is a long established fact...".
  • 29. Same text, with the markers A and B (as on slide 28) and the marker C before "Richard McClintock, a Latin professor...".
  • 30. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum A has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
  • 31. A B C
  • 33. A LDA B C
  • 39. Platform description
  • 43. Training: Historical Data, Cleaning, Training
  • 44. Training: Historical Data, Cleaning, Training; Scraping
  • 45. Training: Historical Data, Cleaning, Training; Stop Words, Frequency 1, Cross-Cutting Frequency
  • 46. Training: Historical Data, Cleaning, Training; LDA, Storage, Filtering
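The cleaning stage named on the training slides (stop words, frequency 1, cross-cutting frequency) could be sketched roughly as follows. This is only an illustration under assumptions: the slides do not define the terms, so "frequency 1" is read here as words that occur once in the whole corpus and "cross-cutting frequency" as words that occur in almost every document. The stop-word list, the `clean_corpus` helper, the `max_doc_ratio` threshold, and the sample posts are all hypothetical.

```python
from collections import Counter

# Hypothetical, tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"el", "la", "de", "en", "y", "un", "una"}


def clean_corpus(docs, stop_words=STOP_WORDS, max_doc_ratio=0.9):
    """Tokenize docs, then drop stop words, words seen only once in the
    corpus, and words present in more than max_doc_ratio of the documents."""
    tokenized = [
        [w for w in doc.lower().split() if w not in stop_words]
        for doc in docs
    ]
    # Total occurrences of each word across the corpus.
    term_freq = Counter(w for doc in tokenized for w in doc)
    # Number of documents in which each word appears.
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    limit = max_doc_ratio * len(docs)
    keep = {
        w for w in term_freq
        if term_freq[w] > 1 and doc_freq[w] <= limit
    }
    return [[w for w in doc if w in keep] for doc in tokenized]


# Made-up sample posts, for illustration only.
docs = [
    "el equipo gano la copa hoy",
    "el equipo lanzo un telefono hoy",
    "la copa de futbol hoy",
    "un telefono nuevo hoy",
]
cleaned = clean_corpus(docs, max_doc_ratio=0.75)
print(cleaned)
```

In this toy corpus "hoy" appears in every post and is dropped as a cross-cutting word, while words such as "gano" or "nuevo" are dropped because they occur only once.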
  • 50. Categorization: Plain Text, Cleaning, Classification
  • 51. Categorization: Plain Text, Cleaning, Classification; Stop Words
  • 52. Categorization: Plain Text, Cleaning, Classification; Stop Words; LDA Model, Multinomial Naïve Bayes, Other methods
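As a rough illustration of the Multinomial Naïve Bayes option listed for the classification stage, the sketch below trains a minimal classifier on already-cleaned token lists. It is not the platform's implementation: the `MultinomialNB` class, the add-one smoothing, the toy training posts, and the use of blog names as labels are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for w in tokens:
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up, already-cleaned training posts (hypothetical data).
train_docs = [
    ["equipo", "copa", "partido"],
    ["gol", "equipo", "estadio"],
    ["telefono", "android", "pantalla"],
    ["telefono", "gadget", "android"],
]
train_labels = ["Ferplei", "Ferplei", "FayerWayer", "FayerWayer"]

model = MultinomialNB().fit(train_docs, train_labels)
prediction = model.predict(["partido", "equipo", "gol"])
print(prediction)
```

Working in log space avoids numeric underflow on long posts, and the add-one smoothing keeps a single unseen word from zeroing out a class.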
  • 55. Python, Django, Model-View-Controller
  • 56. Python, Django, Model-View-Controller, MySQL
  • 57. Python, Django, Model-View-Controller, MySQL, Web Service
  • 60. 1000 FayerWayer, 1000 WayerLess, 1000 Belelu, 1000 Ferplei
  • 61. 5 topics detected
  • 62. chile colo equipo copa partido barcelona futbol universidad jugador partidos seleccion ex jugadores alexis goles club tecnico fecha sanchez chileno real torneo madrid final america nacional catolica gran primera argentino estadio jugar mundial san ahora luego volante delantero clausura campeon agosto nuevo gol argentina primer sostuvo carlos mejor liga
  • 63. 93% Ferplei
  • 65. FayerWayer and Wayerless

                 Sum    Frequency   TF-IDF   NB     NBM
      Precision  88%    85%         83%      91%    92%
      Recall     96%    97%         91%      94%    93%
      F-Measure  92%    91%         87%      93%    93%

    Belelú

                 Sum    Frequency   TF-IDF   NB     NBM
      Precision  91%    93%         79%      86%    84%
      Recall     74%    67%         62%      85%    86%
      F-Measure  82%    78%         70%      85%    85%

  • 66. Ferplei

                 Sum    Frequency   TF-IDF   NB     NBM
      Precision  96%    94%         91%      100%   100%
      Recall     96%    96%         96%      89%    94%
      F-Measure  96%    95%         94%      94%    97%
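The slides do not say which F-measure is reported, but the values appear consistent with the balanced F1 score, the harmonic mean of precision and recall. A minimal sketch under that assumption (the `f_measure` helper is a name chosen for this example):

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for Ferplei with NBM.
score = f_measure(1.00, 0.94)
print(f"{score:.0%}")  # prints 97%
```

The same function gives roughly 92% for the FayerWayer and Wayerless "Sum" column (precision 88%, recall 96%), matching the reported figure.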
  • 69. Video Games; Football; Music, Parties and Events; Mobile Telephony; Automobiles; Relationships and Social Life; Global Environment; Science and Technology; Small-Scale Environment; Women and Sexuality; Family and Society; Space Research; Tech Gadgets; Services and Technology; Automobiles: Top Gear
  • 72. Improve text cleaning; Multi-model system; Blog prediction; Model improvements over time; Incorporate stemming