This document summarizes an ETL project using Wikipedia data. The project aimed to analyze revision history data to identify trending topics over time. It describes challenges like decompressing large compressed data files, transferring hundreds of gigabytes of data, and transforming semi-structured XML data. The solutions involved using Linux tools to parallelize decompression, splitting files for efficient transfer, and using Spark and Scala to convert XML to a structured dataframe for analysis and visualization. Other applications of these techniques are discussed, like processing other large semi-structured datasets.
Wikipedia Data Mining
1. WikiPulse: What’s New on Wikipedia
ETL operation on large-scale semi-structured data,
using Wikimedia as an example
Team: Myra Liu, Khasim Shaik, Jacky Yang, Kivi Zuo
2. Project Overview and Motivation
ETL operation on large-scale semi-structured data, using Wikimedia data as an example
Dashboard Demo
Motivation: Find the most trending and up-and-coming topics on Wikipedia at a given time
Sniff out novel topics and trends across all pages in various fields
Goal: Introduce the challenges and solutions in the ETL process for large-scale semi-structured data
Data: Revision history on Wikipedia
103 pages, 1% of original data
Timeframe: July 2007 - January 2019
Attributes: page title, revision date, revision size, contributor
3. Major Procedures and Challenges
● Wikimedia Dump Service: A snapshot of Wikipedia’s entire database
A collection of data files in XML format
Created twice every month
● Decompressing: dealing with 'dump' data that has been compressed by roughly 90%
● Transferring: moving large amounts of data (up to 500GB)
● Transforming semi-structured data: the XML format is semi-structured and highly redundant
4. ‘Dump’ Data with High Compression
Challenge
● Data size: The dumps include various datasets (articles, edit history, revision logs, metadata, page-to-page links, etc.); the uncompressed revision history log for all Wikipedia articles alone is about 450GB.
● Compressed format: Wiki dump data is distributed in the highly compressed bz2 format and cannot be unzipped on a PC with traditional tools.
Solution
● Limited scope: Focus on the revision data and download only the 40GB compressed file.
● Linux package lbzip2: Decompress in parallel on AWS EC2 to save time (see the sketch below).
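As a minimal sketch of this step (written in Scala with sys.process so all code examples here stay in one language; the team's actual workflow was plain shell commands), parallel decompression with lbzip2 on an EC2 instance might look like the following. The file name and thread count are placeholders, not the team's actual values.

```scala
import scala.sys.process._

object DecompressDump {
  def main(args: Array[String]): Unit = {
    // Hypothetical dump file name; the real revision-history dump is a ~40GB .bz2 file.
    val dump = "enwiki-latest-stub-meta-history.xml.bz2"

    // lbzip2 decompresses bz2 in parallel: -d = decompress, -k = keep the
    // compressed input, -n = number of worker threads (here: all EC2 vCPUs).
    val threads = Runtime.getRuntime.availableProcessors()
    val exitCode = Seq("lbzip2", "-d", "-k", "-n", threads.toString, dump).!

    if (exitCode != 0) sys.error(s"lbzip2 failed with exit code $exitCode")
  }
}
```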
5. Transferring Large Amounts of Data
Challenge
● Restriction: AWS S3 caps the size of a single-request (PUT) upload at 5GB, so large files cannot be pushed to a bucket in one piece.
Solution
● Too complicated: the S3 multipart upload API, which involves partitioning files, tracking each part's ETag, and recombining the parts on completion (up to 10,000 parts per upload, each up to 5GB).
● Simple and efficient: write a bash script that uses the 'split' command to create smaller chunks and other bash commands to upload the chunks separately (see the sketch below).
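The deck does not include the actual bash script, so the following is a hedged Scala sys.process equivalent of the "simple and efficient" route; the chunk size, file and bucket names, and the use of the AWS CLI for the upload are all assumptions.

```scala
import java.io.File
import scala.sys.process._

object SplitAndUpload {
  def main(args: Array[String]): Unit = {
    // Hypothetical paths; the deck does not give the real file or bucket names.
    val bigFile = "revision-history.xml"
    val bucket  = "s3://my-etl-bucket/raw/"

    // `split -b 4G` cuts the file into 4GB chunks named chunk_aa, chunk_ab, ...
    // (4GB keeps each chunk under S3's 5GB single-PUT cap).
    require(Seq("split", "-b", "4G", bigFile, "chunk_").! == 0, "split failed")

    // Upload each chunk separately, assuming the AWS CLI is installed and configured.
    val chunks = new File(".").listFiles.filter(_.getName.startsWith("chunk_"))
    chunks.foreach { c =>
      require(Seq("aws", "s3", "cp", c.getPath, bucket + c.getName).! == 0,
        s"upload of ${c.getName} failed")
    }
  }
}
```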
6. Working with XML Files
Challenge
● XML is semi-structured: it has a well-defined schema, but not all records follow that schema.
Solution
● XPath in Pig: XPath is good for reading XML, but Pig requires multiple operations to convert the XML into a dataframe.
● Scala in Spark: Use Databricks' spark-xml package (see the sketch below).
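A minimal sketch of reading the dump with Databricks' spark-xml package. The rowTag follows the MediaWiki export schema, while the S3 path, Spark session setup, and package version are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ReadWikiXml {
  def main(args: Array[String]): Unit = {
    // Launch with the spark-xml package on the classpath, e.g.
    //   spark-submit --packages com.databricks:spark-xml_2.12:0.14.0 ...
    val spark = SparkSession.builder.appName("WikiPulse-ReadXml").getOrCreate()

    // Each <page> element in the dump becomes one row; nested <revision>
    // elements become an array-of-struct column that can be exploded later.
    val pages = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "page")
      .load("s3://my-etl-bucket/raw/revision-history.xml")  // hypothetical path

    pages.printSchema()
  }
}
```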
7. Procedures Using Spark-Scala
Process Map: S3 Bucket → Hadoop → Spark-Scala → S3 Bucket → Visualization Tool
In the Spark-Scala step (see the sketch below):
● Read the XML and explode it into a struct dataframe
● Run manipulations and filter rows
● Write the cleaned data to CSV with Hadoop's CopyMerge
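Putting the three Spark-Scala steps together, a hedged end-to-end sketch might look like the following. The S3 paths are hypothetical, the revision field names (timestamp, contributor.username, text._bytes) are assumptions about how spark-xml maps the MediaWiki export schema (and assume the revision column is inferred as an array), and FileUtil.copyMerge is the Hadoop 2.x API (it was removed in Hadoop 3).

```scala
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WikiPulseEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WikiPulse-ETL").getOrCreate()
    import spark.implicits._

    // 1) Read the XML and explode each page's revision array into one row per revision.
    val pages = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "page")
      .load("s3://my-etl-bucket/raw/revision-history.xml")   // hypothetical path

    val revisions = pages
      .select($"title", explode($"revision").as("rev"))
      .select(
        $"title",
        $"rev.timestamp".as("revision_date"),
        $"rev.text._bytes".as("revision_size"),              // assumed field mapping
        $"rev.contributor.username".as("contributor"))

    // 2) Run manipulations and filter rows, e.g. keep the project's timeframe.
    val cleaned = revisions.filter($"revision_date" >= "2007-07-01")

    // 3) Write the cleaned data as CSV part files, then merge them into a single
    //    file with Hadoop's copyMerge (available in Hadoop 2.x).
    val tmpDir = "s3://my-etl-bucket/tmp/revisions_csv"      // hypothetical path
    cleaned.write.option("header", "true").csv(tmpDir)

    val conf = spark.sparkContext.hadoopConfiguration
    val fs   = FileSystem.get(new java.net.URI(tmpDir), conf)
    FileUtil.copyMerge(fs, new Path(tmpDir), fs,
      new Path("s3://my-etl-bucket/clean/revisions.csv"), false, conf, null)
  }
}
```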
8. Use Case: How our learnings can be used
Transferable techniques for the following scenarios:
● Decompressing and transferring huge datasets that exceed S3's single-upload limit
● Mining data from other text-based sources such as .html or .xml files, ingesting the useful information, and converting it to a structured format for further analysis
● Reformatting and analyzing semi-structured data for companies with large amounts of legacy data