A Corpus Linguistics Based Approach for Estimating Arabic Online Content
[Slide figures, labels not recoverable: 5,340,000; 1,950,000; 0.5 %; 1 %; 1.4 %; 3 %]
Zipf's Law
Corpora Building
Dmoz corpus: 75,560 pages, 530.1 MB, 659,756 unique words
Wikipedia corpus: 95,140 pages, 213.3 MB, 760,690 unique words
CCA corpus: 377 pages, 82,878 unique words
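
The slides do not say how the page and unique-word counts were produced. As a rough illustration only, the sketch below shows one way such figures, along with per-word document counts and frequencies, could be computed from a collection of page texts. The function name `corpus_stats`, the regular expression used as a tokenizer, and the toy input are all assumptions made for this example, not the author's procedure.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of basic Arabic letters. The slides
# do not describe the tokenization that was actually used.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def corpus_stats(pages):
    """Return (page count, unique word count, term frequencies, document frequencies)."""
    term_freq = Counter()  # total occurrences of each word across all pages
    doc_freq = Counter()   # number of pages in which each word appears
    n_pages = 0
    for text in pages:
        n_pages += 1
        words = ARABIC_WORD.findall(text)
        term_freq.update(words)
        doc_freq.update(set(words))  # count each word once per page
    return n_pages, len(term_freq), term_freq, doc_freq


# Toy input, not taken from any of the corpora listed above.
pages = ["من هذا العالم", "في العالم من العالم"]
n_pages, n_unique, tf, df = corpus_stats(pages)
print(n_pages, n_unique, tf["العالم"], df["العالم"])  # prints: 2 4 3 2
```

Keeping document frequency separate from total frequency matters because a word repeated many times on a few pages would otherwise look as widespread as a word that appears once on every page.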
Common
Word        Documents   Frequency

في          60,218      1,077,288
من          61,949        860,052
على         56,846        498,694
إلى         48,599        278,315
أن          40,439        277,465
عن          50,637        241,824
التي        35,437        166,002
لا          40,122        153,887
مع          38,797        130,157
ما          33,637        129,304
هذا         31,363        109,125
الذي        32,474        108,844
أو          26,769        105,754
هذه         29,289         97,964
بين         32,662         84,535
الله        26,803         84,216
أخبار       30,010         81,894
كل          30,277         81,224
الرئيسية    41,000         80,161
بعد         32,370         78,317
الصفحة      27,837         66,944
لم          25,403         64,251
كان         23,613         63,318
العالم      23,287         60,468
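
Zipf's law says that the frequency of the r-th most frequent word is roughly proportional to 1/r^s, with s close to 1 for natural-language text. A quick way to see how the table relates to it is to fit a straight line to log(frequency) against log(rank). The sketch below does this with the frequency values from the table above, read as ordinary left-to-right numbers (for example 1,077,288 for the first word). The helper name `zipf_exponent` is made up for this example, and a fit over only the 24 most frequent words is a rough indication, not a test of the law over a whole corpus.

```python
import math

# Frequency column of the table above, highest first (24 words).
frequencies = [
    1077288, 860052, 498694, 278315, 277465, 241824, 166002, 153887,
    130157, 129304, 109125, 108844, 105754, 97964, 84535, 84216,
    81894, 81224, 80161, 78317, 66944, 64251, 63318, 60468,
]


def zipf_exponent(freqs):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


s = zipf_exponent(frequencies)
print(f"estimated Zipf exponent over the top {len(frequencies)} words: {s:.2f}")
```

An exponent near 1 would be consistent with Zipf-like behaviour among these words. A value far from 1 would more likely reflect the small sample than a property of the language.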
