Wikipedia Data Mining

A big data project that mines the Wikipedia data dump to uncover insights. The dashboard can be found here: https://public.tableau.com/profile/noor.mohammed.khasim.shaik#!/vizhome/WikiPulse-WhatshappeningonWikipedia_0/Dashboard1?publish=yes

Published in: Data & Analytics

  1. WikiPulse: What’s New on Wikipedia
     ETL operation on large-scale semi-structured data, using Wikimedia as an example
     Team: Myra Liu, Khasim Shaik, Jacky Yang, Kivi Zuo
  2. Project Overview and Motivation
     ETL operation on large-scale semi-structured data, using Wikimedia data as an example
     Dashboard Demo
     Motivation:
     ● Find the most trending and upcoming topics on Wikipedia at a given time
     ● Sniff out novel topics and trends across all pages in various fields
     Goal: Introduce the challenges and solutions in the ETL process on large-scale semi-structured data
     Data: Revision history on Wikipedia
     ● 103 pages, 1% of original data
     ● Timeframe: July 2007 - January 2019
     ● Attributes: page title, revision date, revision size, contributor
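One way to read "trending" from the attributes listed on this slide is to count revisions per page within a time window. The sketch below assumes that definition; the sample rows, the `top_pages` helper and the revision-count metric are illustrative assumptions, not the team's actual method.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows using the attributes named on the slide:
# (page title, revision date, revision size in bytes, contributor)
revisions = [
    ("Python", date(2019, 1, 3), 1200, "alice"),
    ("Python", date(2019, 1, 9), 1250, "bob"),
    ("Scala", date(2019, 1, 5), 800, "carol"),
    ("Python", date(2018, 12, 30), 1100, "alice"),
    ("Scala", date(2018, 11, 2), 790, "dave"),
]


def top_pages(rows, start, end, n=3):
    """Return the n pages with the most revisions dated within [start, end]."""
    counts = Counter(
        title for title, day, _size, _user in rows if start <= day <= end
    )
    return counts.most_common(n)


# Pages ranked by number of revisions in January 2019.
january = top_pages(revisions, date(2019, 1, 1), date(2019, 1, 31))
print(january)
```

Other signals, such as total revision size or the number of distinct contributors, could be ranked the same way by changing what the counter accumulates.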
  3. Major Procedures and Challenges
     ● Wikimedia Dump Service: a snapshot of Wikipedia’s entire database, provided as a collection of data files in XML format and created twice every month
     ● Decompressing: dealing with ‘dump’ data that has been compressed by 90%
     ● Transferring: moving large amounts of data (up to 500GB)
     ● Transforming semi-structured data: the XML data format is semi-structured and highly redundant
  4. ‘Dump’ Data with High Compression
     Challenge:
     ● Data size: The dumps contain various datasets, including articles, edit history, revision logs, metadata and page-to-page links. E.g. 450GB of uncompressed revision history log for all Wikipedia articles
     ● Compressed format: Wiki dump data is compressed in bz2 format; it is highly compressed and could not be unzipped on a PC using traditional tools
     Solution:
     ● Limited scope: Focus on revision data and download the 40GB compressed file
     ● Linux-based package lbzip2: Use parallel decompression on AWS EC2 to save time
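The key idea when decompressing a file this large is to stream it in chunks so the uncompressed data never has to fit in memory. The sketch below shows that with Python's standard `bz2` module; note that it is single-threaded, unlike the parallel lbzip2 tool named on the slide, and the `decompress_stream` helper and file names are hypothetical.

```python
import bz2
import os
import tempfile


def decompress_stream(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress a .bz2 file chunk by chunk; return the number of bytes written."""
    written = 0
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty read means end of stream
                break
            dst.write(chunk)
            written += len(chunk)
    return written


# Demo on a small synthetic file (a stand-in for a real dump file).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.xml.bz2")
    dst = os.path.join(tmp, "sample.xml")
    payload = b"<page><title>Example</title></page>\n" * 1000
    with bz2.open(src, "wb") as f:
        f.write(payload)
    n_bytes = decompress_stream(src, dst, chunk_size=4096)
    with open(dst, "rb") as f:
        restored = f.read()

print(n_bytes)
```

Because only one chunk is held in memory at a time, the same loop works regardless of how large the uncompressed output becomes; the trade-off against a parallel tool is wall-clock time.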
  5. Transferring Large Amounts of Data
     Challenge:
     ● Restriction: AWS S3 has an upload file size limit of 50GB
     Solution:
     ● Too complicated: Use the S3 Multipart Upload API, which involves partitioning files, creating hashmaps and recombining the parts for later use. Up to 10,000 parts per upload, up to 5GB per part
     ● Simple and efficient: Write a bash script that uses the ‘split’ command to create smaller chunks, plus other bash commands to upload these chunks separately
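The chunking step can be sketched in Python as a rough analogue of what the `split` command does. This is not the team's bash script: the `split_file` helper and the `.partNNNNN` naming are assumptions, and the upload step is omitted because it needs AWS credentials.

```python
import os
import tempfile


def split_file(path, chunk_bytes, out_dir):
    """Write consecutive chunk files of at most chunk_bytes each.

    Returns the chunk paths in order. Each chunk is read into memory in one
    go, which is fine for a sketch; multi-gigabyte chunks would need a
    smaller copy buffer.
    """
    parts = []
    base = os.path.basename(path)
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(chunk_bytes)
            if not data:
                break
            part_path = os.path.join(out_dir, f"{base}.part{index:05d}")
            with open(part_path, "wb") as part:
                part.write(data)
            parts.append(part_path)
            index += 1
    return parts


# Demo: split a 2500-byte file into chunks of at most 1000 bytes.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "dump.xml")
    payload = bytes(range(250)) * 10
    with open(original, "wb") as f:
        f.write(payload)
    parts = split_file(original, 1000, tmp)
    sizes = [os.path.getsize(p) for p in parts]
    # Concatenating the chunks in order restores the original bytes.
    rejoined = b"".join(open(p, "rb").read() for p in parts)

print(sizes)
```

The zero-padded index keeps the chunk names in lexical order, so a plain sorted listing is enough to reassemble them on the other side.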
  6. Working with XML File
     Challenge:
     ● XML is semi-structured. It has a well-defined schema, but not all records may follow the schema.
     Solution:
     ● XPath in Pig: XPath is great for reading XML, but Pig requires multiple operations to convert XML to a dataframe.
     ● Scala in Spark: Use Databricks’ spark-xml package.
     [Figure: Before and After panels]
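To illustrate the "not all records follow the schema" problem, the sketch below flattens revision records into rows and tolerates missing fields. It uses Python's standard `xml.etree.ElementTree` rather than the spark-xml package named on the slide. The sample document is a simplified, hypothetical stand-in for the dump format (real dumps declare an XML namespace, which is left out here), and `iter_revisions` is an illustrative name.

```python
import io
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; the second revision lacks a <text> element
# and records an IP address instead of a username.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision>
      <timestamp>2019-01-03T10:00:00Z</timestamp>
      <contributor><username>alice</username></contributor>
      <text bytes="1200" />
    </revision>
    <revision>
      <timestamp>2019-01-09T12:30:00Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
    </revision>
  </page>
</mediawiki>"""


def iter_revisions(source):
    """Yield one flat dict per <revision>, using None for missing fields."""
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "revision":
            text = elem.find("text")
            size = text.get("bytes") if text is not None else None
            yield {
                "title": title,
                "timestamp": elem.findtext("timestamp"),
                "contributor": elem.findtext("contributor/username")
                or elem.findtext("contributor/ip"),
                "size": int(size) if size else None,
            }
            elem.clear()  # free the parsed revision to keep memory flat
        elif elem.tag == "page":
            title = None
            elem.clear()


rows = list(iter_revisions(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

Emitting `None` for absent fields keeps every row the same shape, which is what makes the later conversion to a tabular format straightforward.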
  7. Procedures Using Spark-Scala
     Process map: S3 Bucket → Hadoop → Spark-Scala → S3 Bucket → Visualization Tool
     Spark-Scala steps:
     ● Read the XML and explode it to a struct dataframe
     ● Run manipulations and filter rows
     ● Write cleaned data to CSV with Hadoop’s copyMerge
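Spark writes its output as a directory of part files, so the final step is to concatenate them into a single CSV. The sketch below shows that merge on a local file system in plain Python. It only mirrors the idea behind the copyMerge step and is not the Hadoop API; the `merge_parts` helper and the sample part files are hypothetical.

```python
import glob
import os
import shutil
import tempfile


def merge_parts(parts_dir, out_path, pattern="part-*"):
    """Concatenate part files, in sorted name order, into one output file.

    Returns the number of part files merged.
    """
    part_paths = sorted(glob.glob(os.path.join(parts_dir, pattern)))
    with open(out_path, "wb") as dst:
        for part_path in part_paths:
            with open(part_path, "rb") as src:
                shutil.copyfileobj(src, dst)  # buffered copy, no full read
    return len(part_paths)


# Demo with two small part files and a marker file that must be ignored.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "part-00000"), "wb") as f:
        f.write(b"Example,2019-01-03,1200,alice\n")
    with open(os.path.join(tmp, "part-00001"), "wb") as f:
        f.write(b"Example,2019-01-09,1250,bob\n")
    open(os.path.join(tmp, "_SUCCESS"), "wb").close()
    merged_path = os.path.join(tmp, "merged.csv")
    merged_count = merge_parts(tmp, merged_path)
    with open(merged_path, "rb") as f:
        merged = f.read()

print(merged.decode())
```

This works cleanly only when the part files carry no header rows; with headers, every part after the first would need its first line skipped.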
  8. Use Case: How Our Learnings Can Be Used
     Transferable techniques for the following scenarios:
     ● Decompress and transfer huge datasets that exceed the S3 upload limit
     ● Mine data from other text-based databases such as .html or .xml, ingest the useful information and convert it to a usable structured data format for further analysis
     ● Reformat and analyze semi-structured data for companies with large amounts of legacy data
  9. Thank You! Any Questions?