This document summarizes an ETL project using Wikipedia data. The project aimed to analyze revision history data to identify trending topics over time. It describes challenges like decompressing large compressed data files, transferring hundreds of gigabytes of data, and transforming semi-structured XML data. The solutions involved using Linux tools to parallelize decompression, splitting files for efficient transfer, and using Spark and Scala to convert XML to a structured dataframe for analysis and visualization. Other applications of these techniques are discussed, like processing other large semi-structured datasets.
Wikipedia Data Mining
1. WikiPulse: What’s New on Wikipedia
ETL operation on large-scale semi-structured data,
using Wikimedia as an example
Team: Myra Liu, Khasim Shaik, Jacky Yang, Kivi Zuo
2. Project Overview and Motivation
ETL operation on large-scale semi-structured data, using Wikimedia data as an example
Dashboard Demo
Motivation: Find the most trending and up-and-coming topics on Wikipedia at a given time
Sniff out novel topics and trends across all pages in various fields
Goal: Introduce the challenges and solutions in the ETL process for large-scale semi-structured data
Data: Revision history on Wikipedia
103 pages, 1% of original data
Timeframe: July 2007 - January 2019
Attributes: page title, revision date, revision size, contributor
3. Major Procedures and Challenges
● Wikimedia Dump Service: A snapshot of Wikipedia’s entire database
A collection of data files in XML format
Created twice every month
● Decompressing: dealing with 'dump' data that has been compressed by roughly 90%
● Transferring: moving large amounts of data (up to 500GB)
● Transforming semi-structured data: the XML format is semi-structured and highly redundant
4. ‘Dump’ Data with High Compression
Challenge
● Data size: The dumps include various datasets (articles, edit history, revision logs, metadata, page-to-page links, etc.); the uncompressed revision history log for all Wikipedia articles alone is about 450GB.
● Compressed format: Wiki dump data is distributed in the highly compressed bz2 format and cannot be unzipped on a PC with traditional tools.
Solution
● Limited scope: Focus on the revision data and download only the 40GB compressed file.
● Linux package lbzip2: Decompress in parallel on AWS EC2 to save time (see the sketch below).
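As a minimal sketch of this step (written in Scala with sys.process so all code examples here stay in one language; the team's actual workflow was plain shell commands), parallel decompression with lbzip2 on an EC2 instance might look like the following. The file name and thread count are placeholders, not the team's actual values.

```scala
import scala.sys.process._

object DecompressDump {
  def main(args: Array[String]): Unit = {
    // Hypothetical dump file name; the real revision-history dump is a ~40GB .bz2 file.
    val dump = "enwiki-latest-stub-meta-history.xml.bz2"

    // lbzip2 decompresses bz2 in parallel: -d = decompress, -k = keep the
    // compressed input, -n = number of worker threads (here: all EC2 vCPUs).
    val threads = Runtime.getRuntime.availableProcessors()
    val exitCode = Seq("lbzip2", "-d", "-k", "-n", threads.toString, dump).!

    if (exitCode != 0) sys.error(s"lbzip2 failed with exit code $exitCode")
  }
}
```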
5. Transferring Large Amounts of Data
Challenge
● Restriction: AWS S3 caps the size of a single-request (PUT) upload at 5GB, so large files cannot be pushed to a bucket in one piece.
Solution
● Too complicated: the S3 multipart upload API, which involves partitioning files, tracking each part's ETag, and recombining the parts on completion (up to 10,000 parts per upload, each up to 5GB).
● Simple and efficient: write a bash script that uses the 'split' command to create smaller chunks and other bash commands to upload the chunks separately (see the sketch below).
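The deck does not include the actual bash script, so the following is a hedged Scala sys.process equivalent of the "simple and efficient" route; the chunk size, file and bucket names, and the use of the AWS CLI for the upload are all assumptions.

```scala
import java.io.File
import scala.sys.process._

object SplitAndUpload {
  def main(args: Array[String]): Unit = {
    // Hypothetical paths; the deck does not give the real file or bucket names.
    val bigFile = "revision-history.xml"
    val bucket  = "s3://my-etl-bucket/raw/"

    // `split -b 4G` cuts the file into 4GB chunks named chunk_aa, chunk_ab, ...
    // (4GB keeps each chunk under S3's 5GB single-PUT cap).
    require(Seq("split", "-b", "4G", bigFile, "chunk_").! == 0, "split failed")

    // Upload each chunk separately, assuming the AWS CLI is installed and configured.
    val chunks = new File(".").listFiles.filter(_.getName.startsWith("chunk_"))
    chunks.foreach { c =>
      require(Seq("aws", "s3", "cp", c.getPath, bucket + c.getName).! == 0,
        s"upload of ${c.getName} failed")
    }
  }
}
```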
6. Working with XML Files
Challenge
● XML is semi-structured: it has a well-defined schema, but not all records follow that schema.
Solution
● XPath in Pig: XPath is good for reading XML, but Pig requires multiple operations to convert the XML into a dataframe.
● Scala in Spark: Use Databricks' spark-xml package (see the sketch below).
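A minimal sketch of reading the dump with Databricks' spark-xml package. The rowTag follows the MediaWiki export schema, while the S3 path, Spark session setup, and package version are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ReadWikiXml {
  def main(args: Array[String]): Unit = {
    // Launch with the spark-xml package on the classpath, e.g.
    //   spark-submit --packages com.databricks:spark-xml_2.12:0.14.0 ...
    val spark = SparkSession.builder.appName("WikiPulse-ReadXml").getOrCreate()

    // Each <page> element in the dump becomes one row; nested <revision>
    // elements become an array-of-struct column that can be exploded later.
    val pages = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "page")
      .load("s3://my-etl-bucket/raw/revision-history.xml")  // hypothetical path

    pages.printSchema()
  }
}
```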
7. Procedures Using Spark-Scala
Process Map: S3 Bucket → Hadoop → Spark-Scala → S3 Bucket → Visualization Tool
In the Spark-Scala step (see the sketch below):
● Read the XML and explode it into a struct dataframe
● Run manipulations and filter rows
● Write the cleaned data to CSV with Hadoop's CopyMerge
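Putting the three Spark-Scala steps together, a hedged end-to-end sketch might look like the following. The S3 paths are hypothetical, the revision field names (timestamp, contributor.username, text._bytes) are assumptions about how spark-xml maps the MediaWiki export schema (and assume the revision column is inferred as an array), and FileUtil.copyMerge is the Hadoop 2.x API (it was removed in Hadoop 3).

```scala
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WikiPulseEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WikiPulse-ETL").getOrCreate()
    import spark.implicits._

    // 1) Read the XML and explode each page's revision array into one row per revision.
    val pages = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "page")
      .load("s3://my-etl-bucket/raw/revision-history.xml")   // hypothetical path

    val revisions = pages
      .select($"title", explode($"revision").as("rev"))
      .select(
        $"title",
        $"rev.timestamp".as("revision_date"),
        $"rev.text._bytes".as("revision_size"),              // assumed field mapping
        $"rev.contributor.username".as("contributor"))

    // 2) Run manipulations and filter rows, e.g. keep the project's timeframe.
    val cleaned = revisions.filter($"revision_date" >= "2007-07-01")

    // 3) Write the cleaned data as CSV part files, then merge them into a single
    //    file with Hadoop's copyMerge (available in Hadoop 2.x).
    val tmpDir = "s3://my-etl-bucket/tmp/revisions_csv"      // hypothetical path
    cleaned.write.option("header", "true").csv(tmpDir)

    val conf = spark.sparkContext.hadoopConfiguration
    val fs   = FileSystem.get(new java.net.URI(tmpDir), conf)
    FileUtil.copyMerge(fs, new Path(tmpDir), fs,
      new Path("s3://my-etl-bucket/clean/revisions.csv"), false, conf, null)
  }
}
```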
8. Use Case: How our learnings can be used
Transferable techniques for the following scenarios:
● Decompressing and transferring huge datasets that exceed S3's single-upload limit
● Mining data from other text-based sources such as .html or .xml files, ingesting the useful information, and converting it to a structured format for further analysis
● Reformatting and analyzing semi-structured data for companies with large amounts of legacy data