Recent IT Development and Women: Big Data and The Power of Women in Goryeo

HiPIC

KWiSE Annual Meeting
Chapman University, CA
Oct 20th 2012

Jongwook Woo (PhD)
High-Performance Internet Computing Center (HiPIC)
Educational Partner with Cloudera and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles

Jongwook Woo
CSULA

HiPIC Contents
Part I. Big Data
Fundamentals of Big Data
Data-Intensive Computing: Hadoop
Big Data Supporters and Use Cases

Part II. The Power of Women in Goryeo Dynasty
North East Asia before the Mongol Empire
Korea and Mongol
The Empress Gi


HiPIC Part I
Big Data
Fundamentals of Big Data
NoSQL DB: HBase, MongoDB
Data-Intensive Computing: Hadoop
Big Data Supporters and Use Cases


HiPIC Experience in Big Data
 Grants
 Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
 Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)

 Partnership
 Received Academic Education Partnership with Cloudera since
June 2012

 Certificate
 Certificate of Achievement in the Big Data University Training
Course, “Hadoop Fundamentals I”, July 8 2012

 Cloud Computing Blog
 http://dal-cloudcomputing.blogspot.com/

HiPIC What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing


HiPIC Big Data

Too much data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of web
– Sensor Data, Bioinformatics, Social
Computing, smart phone, online game…

Cannot be handled with the legacy
approach
Too big
Un-/Semi-structured data


HiPIC Two Issues in Big Data

How to store Big Data
NoSQL DB

How to compute Big Data
Parallel computing with multiple cheap
computers
– No need for supercomputers


HiPIC Contents
Fundamentals of Big Data

Data-Intensive Computing: Hadoop

Big Data Supporters and Use Cases


HiPIC Data nowadays

• Data Issues
o Data grows to 10 TB, and then 100 TB.
o Unstructured data comes from sources like Facebook, Twitter, RFID readers, sensors, and so on.
o Need to derive information from both the relational data and the unstructured data as soon as possible.

• Solution to efficiently compute Big
Data
o Hadoop Map/Reduce

HiPIC Solutions in Big Data Computation
 Map/Reduce by Google
(Key, Value) parallel computing
 Apache Hadoop
 Big Data
Data Computation (MapReduce, Pig)

 Integrating MapReduce and RDB
 Oracle + Hadoop
 Sybase IQ
 Vertica + Hadoop
 Hadoop DB
 Greenplum
 Aster Data
 Integrating MapReduce and NoSQL DB
 MongoDB MapReduce
 HBase

HiPIC Apache Hadoop
 Motivated by Google Map/Reduce and GFS
 open source project of the Apache Foundation.
 framework written in Java
– originally developed by Doug Cutting
• who named it after his son's toy elephant.

 Two core Components
 Storage: HDFS
– High Bandwidth Clustered storage
 Processing: Map/Reduce
– Fault Tolerant Distributed Processing

 Hadoop scales linearly with
 data size
 Analysis complexity


HiPIC Hadoop issues
 Map/Reduce is not DB
 Algorithm in Restricted Parallel Computing

 HDFS and HBase
 Cannot compete with the functions in RDBMS

 But, useful for
 Semi-structured data model and high-level dataflow query
language on top of MapReduce
– Pig, Hive, Jaql, Cascading, Cloudbase
 Useful for huge (peta- or terabytes) but non-complicated data
– Web crawling
– log analysis
• Log file for web companies
– New York Times case


HiPIC MapReduce Pros & Cons Summary

Good when
Huge data for input, intermediate, output
Little synchronization required
Read once; batch oriented datasets (ETL)

Bad for
Fast response time
Large amount of shared data
Fine-grained synch needed
CPU-intensive not data-intensive
Continuous input stream


HiPIC MapReduce in Detail
Functions borrowed from functional
programming languages (e.g., Lisp)

Provides Restricted parallel programming
model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of
EVERYTHING else
– Parallelization
– Fault Tolerance
– Data Distribution
– Load Balancing
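
The division of labor above can be sketched in a few lines of Python. This is a single-process toy, not Hadoop's API: `run_mapreduce`, `word_mapper`, and `count_reducer` are hypothetical names chosen for illustration, and the parallelization, fault tolerance, data distribution, and load balancing that Hadoop provides are deliberately left out.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy, sequential stand-in for the framework's side of the contract."""
    groups = defaultdict(list)
    for record in records:
        # Each mapper call may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)
    # Each reducer call sees one key and all values collected for it.
    return {key: reducer(key, values) for key, values in groups.items()}

# User-supplied functions: the only parts a programmer would write.
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1

def count_reducer(word, counts):
    return sum(counts)

result = run_mapreduce(["big data", "Big Data on Hadoop"], word_mapper, count_reducer)
print(result)
```

Only the two small functions change from job to job; the driver stays the same, which is the point of the restricted programming model.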


HiPIC Map
Convert input data to (key, value) pairs
map() functions run in parallel,
 creating different intermediate (key, value)
pairs from different input data sets


HiPIC Reduce
reduce() combines those intermediate values
into one or more final values for that same
key
reduce() functions also run in parallel,
each working on a different output key
Bottleneck:
reduce phase can't start until map phase is
completely finished.


HiPIC Example: Sort URLs by hit count, largest first
Compute the most-hit URLs
 Stored in log files

Map()
 Input: <logFilename, file text>
 Output: Parses file and emits <url, hit counts> pairs
– e.g., <http://hello.com, 1>

Reduce()
 Input: <url, list of hit counts> from multiple map
nodes
 Output: Sums all values for the same key and emits
<url, TotalCount>
– e.g., <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
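
The Map() and Reduce() signatures above could look roughly like this in Python. It is a sketch under assumptions, not code from the talk: the function names `map_log` and `reduce_hits`, the file name `access.log`, and the log layout (URL as the first whitespace-separated field of each line) are all made up for illustration.

```python
from collections import Counter

def map_log(log_filename, file_text):
    """Map(): parse one log file and emit (url, hit count) pairs."""
    hits = Counter()
    for line in file_text.splitlines():
        fields = line.split()
        if fields:                # skip blank lines
            hits[fields[0]] += 1  # assumed format: URL is the first field
    return list(hits.items())

def reduce_hits(url, hit_counts):
    """Reduce(): sum all counts emitted for the same URL."""
    return url, sum(hit_counts)

sample_log = "http://hello.com 200\nhttp://hi.com 200\nhttp://hello.com 404\n"
pairs = map_log("access.log", sample_log)
total = reduce_hits("http://hello.com", [3, 5, 2, 7])
print(pairs)
print(total)
```

The second call reuses the slide's own numbers: the list (3, 5, 2, 7) reduces to 17.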

HiPIC Map/Reduce for URL visits

Input Log Data

Map1() Map2() … Mapm()
(http://hi.com, 1) (http://halo.com, 1)
(http://hello.com, 3) (http://hello.com, 5)
… …
Data Aggregation/Combine
(http://hi.com, <1, 1, …, 1>) (http://halo.com, <1, 5>)
(http://hello.com, <3, 5, 2, 7>)
Reduce1 () Reduce2() … Reducel()

(http://hi.com, 32) (http://halo.com, 6)
(http://hello.com, 17)
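
The flow in this diagram (map outputs, aggregation by key, reduce, then ordering by hits) can be traced with a small Python simulation. The hit values are the ones on the slide; how they are split across map tasks, and the names `shuffle` and `reduce_sum`, are assumptions made for the sketch. The final sort is done in plain Python here rather than by any Hadoop mechanism.

```python
from collections import defaultdict

# Intermediate (url, hits) pairs grouped by the map task that emitted them.
# The values come from the slide; their split across map tasks is assumed.
map_outputs = [
    [("http://hi.com", 1)] * 32,
    [("http://hello.com", 3), ("http://halo.com", 1), ("http://hello.com", 2)],
    [("http://hello.com", 5), ("http://halo.com", 5), ("http://hello.com", 7)],
]

def shuffle(outputs):
    """Data aggregation: collect every value emitted for the same key."""
    grouped = defaultdict(list)
    for task_output in outputs:
        for url, hits in task_output:
            grouped[url].append(hits)
    return grouped

def reduce_sum(url, hit_list):
    """Reduce: one call per key, summing that key's values."""
    return url, sum(hit_list)

grouped = shuffle(map_outputs)
totals = [reduce_sum(url, hits) for url, hits in grouped.items()]
# Final ordering by hit count, largest first.
ranked = sorted(totals, key=lambda pair: pair[1], reverse=True)
print(ranked)
```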

HiPIC Legacy Example

In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
11 million in all, dating back to 1851.
four-terabyte pile of images in TIFF format.
needed to translate that four-terabyte pile of TIFFs
into more web-friendly PDF files.
– not a particularly complicated but large computing chore,
• requiring a whole lot of computer processing time.


HiPIC Legacy Example (Cont’d)

In late 2007, the New York Times
wanted to make available over the web
its entire archive of articles,
a software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic
Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's
Simple Storage Service (S3)
• In less than 24 hours, 11 million PDFs, all stored
neatly in S3 and ready to be served up to visitors to the
Times site.
 The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
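
The cost figure can be checked with a few lines of Python. The three inputs are the ones quoted on the slide; the variable names are only for illustration, and the sum is kept in whole cents to avoid floating-point rounding.

```python
# Inputs quoted on the slide.
CENTS_PER_COMPUTER_HOUR = 10
COMPUTERS = 100
HOURS = 24

total_cents = CENTS_PER_COMPUTER_HOUR * COMPUTERS * HOURS
total_dollars = total_cents // 100
print(f"${total_dollars}")
```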


HiPIC Supporters of Big Data

 Apache Hadoop Supporters
 Cloudera
– Like Linux and Red Hat
– HiPIC is an Academic Partner
 Hortonworks
– Pig
– Consulting and training
 Facebook
– Hive
 IBM
– Jaql

 NoSQL DB supporters
 MongoDB
– HiPIC tries to collaborate
 HBase, CouchDB, Apache Cassandra (originally by FB) etc

HiPIC Similarities in Pig, Hive, and Jaql

• translate high-level languages into MapReduce jobs
o the programmer can work at a higher level
 than writing MapReduce jobs in Java or other
lower-level languages
• programs are much smaller than Java code.
• option to extend these languages,
o often by writing user-defined functions in Java.
• Interoperability
o programs written in these high-level languages can
be embedded inside other languages as well.


HiPIC Pig
• developed at Yahoo Research around 2006
o moved into the Apache Software Foundation in
2007.
• PigLatin,
o Pig's language
o a data flow language
o well suited to processing unstructured data
 Easy to write MapReduce code


HiPIC Hive
• developed at Facebook
o turns Hadoop into a data warehouse
o complete with a dialect of SQL for querying.
• HiveQL
o a declarative language (SQL dialect)
• Difference from PigLatin,
o you do not specify the data flow,
 but instead describe the result you want
 Hive figures out how to build a data flow to
achieve it.
o a schema is required
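
The declarative style can be illustrated in Python with the standard `sqlite3` module. This is an analogy only: SQLite's SQL stands in for HiveQL, nothing here runs on Hadoop, and the table name `hits` and its columns are hypothetical. The query states the wanted result (total hits per URL, largest first) and leaves the execution plan to the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A schema is declared up front, mirroring the schema requirement noted above.
conn.execute("CREATE TABLE hits (url TEXT, hit_count INTEGER)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [
        ("http://hello.com", 3),
        ("http://hello.com", 5),
        ("http://halo.com", 1),
        ("http://hello.com", 2),
        ("http://halo.com", 5),
        ("http://hello.com", 7),
    ],
)
# Describe the result; the engine decides how to group, sum, and sort.
rows = conn.execute(
    "SELECT url, SUM(hit_count) AS total FROM hits "
    "GROUP BY url ORDER BY total DESC"
).fetchall()
print(rows)
conn.close()
```

In a data flow language the grouping, summing, and ordering would instead be written out as explicit successive steps.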


HiPIC Jaql

• developed at IBM.
• a data flow language
o its native data structure format is JSON (JavaScript
Object Notation).


HiPIC Use Cases

Amazon AWS

Facebook

Twitter

Craigslist

HuffPost | AOL


HiPIC Amazon AWS

 amazon.com
 Consumer and seller business

 aws.amazon.com
 IT infrastructure business
– Focus on your business not IT management
 Pay as you go
– Pay for servers by the hour
– Pay for storage per gigabyte per month
– Pay for data transfer per gigabyte
 Services with many APIs
– S3: Simple Storage Service
– EC2: Elastic Compute Cloud
• Provide many virtual Linux servers
• Can run on multiple nodes
– Hadoop and HBase
– MongoDB

HiPIC Amazon AWS (Cont’d)

Customers on aws.amazon.com
Samsung
– Smart TV hub sites: TV applications are on AWS
Netflix
– ~25% of US internet traffic
– ~100% on AWS
NASA JPL
– Analyze more than 200,000 images
NASDAQ
– Using AWS S3


HiPIC Facebook [7]

Using Apache HBase
For Titan and Puma
HBase for FB
– Provide excellent write performance and good reads
– Nice features
• Scalable
• Fault Tolerance
• MapReduce


HiPIC Titan: Facebook

Message services in FB
Hundreds of millions of active users
15+ billion messages a month
50K instant messages a second

Challenges
High write throughput
– Every message, instant message, SMS, email
Massive Clusters
– Must be easily scalable

Solution
Clustered HBase

HiPIC Puma: Facebook

 ETL
 Extract, Transform, Load
– Integrating data from many data sources into a data warehouse
 Data analytics
– Domain owners' web analytics for ads and apps
• clicks, likes, shares, comments etc

 ETL before Puma
 8 – 24 hours
– Procedures: Scribe, HDFS, Hive, MySQL

 ETL after Puma
 Puma
– Real time MapReduce framework
 2 – 30 secs
– Procedures: Scribe, HDFS, Puma, HBase


HiPIC Twitter [8]

Three Challenges
Collecting Data
– Scribe, as at Facebook
Large Scale Storage and analysis
– Cassandra: ColumnFamily key-value store
– Hadoop
Rapid Learning over Big Data
– Pig
• 5% of Java code
• 5% of dev time
• Within 20% of running time


HiPIC Craigslist in MongoDB [9]

Craigslist
~700 cities, worldwide
~1 billion hits/day
~1.5 million posts/day
Servers
– ~500 servers
– ~100 MySQL servers

Migrate to MongoDB
Scalable, Fast, Proven, Friendly


HiPIC HuffPost | AOL [10]

Two Machine Learning Use Cases
Comment Moderation
– Evaluate All New HuffPost User Comments Every
Day
• Identify Abusive / Aggressive Comments
• Auto Delete / Publish ~25% Comments Every Day
Article Classification
– Tag Articles for Advertising
• E.g.: scary, salacious, …

build a flexible ML platform running on
Hadoop
Pig for Hadoop implementation.


HiPIC Conclusion
Era of Big Data

Need to store and compute Big Data

Storage: NoSQL DB

Computation: Hadoop MapReduce

Need to analyze Big Data in mobile
computing, SNS for Ad, User Behavior,
Patterns, Bioinformatics, Medical data …


HiPIC Part II
The power of Women in Goryeo Dynasty
North East Asia before the Mongol Empire
Korea and Mongol
The Empress Gi


HiPIC Three kingdoms (AD 907 - 1125)


HiPIC Before Mongol

Three kingdoms balanced power
Goryeo, Yo (Liao, Cathay, Khitan, 契丹),
Song
–Goryeo-Yo: 3 wars
• First invasion (AD 993): Seo Hui (서희)
• Second invasion with 400K (AD 1010): Gang Jo (강조)
• Third invasion with 100K (AD 1018): Gang Gam-chan (강감찬)
– Goryeo became famous after this victory


HiPIC Three kingdoms (AD 1115- 1234)


HiPIC Before Mongol

Three kingdoms balanced power
(AD 1115 - 1234)
Goryeo, Gum (Jin, Jurchen, Yojin, 金朝),
South Song
– Yun Gwan (윤관) invaded the Jurchen Wanyan (完顏) clan
(AD 1111), followed by many battles
– Jin defeated the Liao dynasty in AD 1121
– Jin wanted to keep peace with Goryeo
• From the emperor as big brother to the
king as little brother


HiPIC Part II. The power of Women in Goryeo Dynasty

Korea and Mongol
Wars since AD 1231 (고종 18, the 18th year of King Gojong)
Goryeo (Korea) dynasty
Military dictatorship of the Choe family ended in AD 1258
(고종 45, the 45th year of King Gojong)
Mongol
Had been conquering China (the South Song dynasty)
since AD 1257
– Möngke Khan
• Right battalion
– Kublai
• Left battalion


HiPIC Korea and Mongol (Cont’d)

Mongol Empire in 1227 at Genghis Khan's death
[http://en.wikipedia.org/wiki/Timeline_of_the_Mongol_Empire]


1236: Beginning of the invasion of Europe, by Hulagu
Ariq Böke controlled Mongol at Karakorum
1231: Beginning of the invasion of Korea
1236: Beginning of the invasion of South Asia, by Möngke Khan and Kublai

Mongol Empire after Genghis Khan's death (1227)
under Möngke Khan
[http://en.wikipedia.org/wiki/Timeline_of_the_Mongol_Empire]


World in AD1257 – 1260
1257: Mongols were attacking Vietnam
1258: Mongols occupied Baghdad
1259: Mongols were invading Syria
– The death of Möngke Khan
1260: The succession war had begun
– By Möngke's brothers: Kublai Khan and Ariq Böke.
– Kublai and the youngest brother Hulagu returned to
Karakorum: Capital of the Mongol empire
• Kara: north, Korum: Khori (Space, 골, 고을)



Again Goryeo and Mongol in 1259
Decided to have a peace treaty with Mongol
– Actually to surrender
April 21 1259 (고종 46, the 46th year of King Gojong): The Crown Prince left to
meet the Khan
May 17th 1259: The Crown Prince met the Mongol army
at Yoyang (Liaoyang), which was about to invade
Goryeo
– Stop the Mongol army
 June 30 1259: The king Go-Jong passed away
 July 30 1259: The Khan passed away
– The Mongol army stopped the prince to hide the khan's
death
The prince met Kublai at Gaebong (Kaifeng), close to the
Yellow river
– Dec 1259: Kublai was returning to Karakorum


Hulagu
Ariq Böke controlled Mongol at Karakorum
Goryeo's Crown Prince
Kublai

Mongol Empire after Möngke Khan's death (1259)
gol_Empire]

HiPIC Goryeo and Mongol in 1260-1264

The great meeting and the great Khan
Kublai welcomed the prince warmly
– Kublai was so happy and said
• “God is helping me. The Goryeo kingdom, which was never
defeated even by the Chinese emperor Dang Tae-Jong
(Emperor Taizong of Tang), has surrendered to me”
• He knew that Goryeo originated from Goguryeo
Kublai appointed the prince as the king of Goryeo
(Won-Jong)
– as Go-Jong had passed away
They came together to Beijing in Jan 1260.
April 1260: Won-Jong's enthronement ceremony in
Goryeo
 August 21 1264: Ariq Böke surrendered to Kublai
at Xanadu (Shangdu)


HiPIC The great meeting and the marriage

 Sept 1264: King Won-Jong went to Beijing and met
the Khan
 Another great welcome from the Khan

 1269: Kublai decided that his daughter would marry the
crown prince of Goryeo
 1269, Aug 1270: Won-Jong and the crown prince asked Kublai for the
marriage
 1271, 1272: the prince went to Beijing and returned
– Volunteered to lead the invasion of Japan
 April 1273: Defeated Sambyolcho at Jeju island

 May 1274: The crown prince of Goryeo and the
princess of the Mongol empire (Holdorogerimisil, 제국공주,
Princess Jeguk) married at the palace in the capital of the
Mongol empire

 Aug 1274: The prince became the king (충렬왕, King Chungnyeol)


Mongol Empire in 1300 -1405: this map is not
correct as Goryeo was an independent
kingdom
gol_Empire]


The Mongol Empire and the
Kingdom of Goryeo tied with
marriages

Mongol Empire in
[http://en.wikipedia.org/wiki/Kublai_Khan]


HiPIC The political position

 The king ranked 7th in the
Mongol empire
 This was the power of the princess
– A daughter of Kublai
 Note that Kublai Khan had 12 sons.
 Goryeo received many benefits from the empire
– “Only Goryeo in the world kept the king and kingdom”
– When the king went to the palace of the empire, all Mongol
officials wanted to give presents.
– The king asked the Khan to suppress Mongol generals in
Goryeo

 The king then ranked 4th in the
empire
 The next great Khan, Temur:
 The princess was his aunt
 The khan asked that the king be ranked 4th in the empire

HiPIC The Empress Gi (기황후, 奇皇后)

born to Gi Ja-o (奇子敖)
in Haengju (幸州), Goryeo
Became a concubine of
Toghun Temür Khan
– Became the first
empress in 1365
Her son Ayurshiridar was
designated Crown Prince
in 1353.
– Supported by Korean
eunuch Bak Bulhwa
(朴不花)
– became a Khan called
Biligtü Khan in 1370.



 Good for Goryeo
She prohibited the practice of sending Korean women to
the Mongol empire for marriage and slavery
 She ended any discussion of making the Goryeo
kingdom one of the provinces of the Mongol empire



 Her elder brother was Gi Cheol (奇轍,
Bayan Bukha).
 He came to threaten the position of the king of Goryeo
 King Gongmin exterminated the Gi family in 1356



 Ming China occupied the capital of the
empire, Dadu (大都, Beijing), in 1368
The empress was disappointed that Goryeo did not
send any reinforcements
She fled north to Shangdu (上都, Xanadu)


HiPIC Conclusion II
Women have the power to control their husbands:
King and Khan (Emperor)
 can raise their social positions
A woman can make her son a Khan

Women possess political power to
positively affect the motherland

We need to know history and educate kids


HiPIC Question?


HiPIC References Part I
1) Introduction to MongoDB, Nosh Petigara, Jan 11, 2011

2) Hadoop Fundamental I, Big Data University

3) “Large Scale Data Analysis with Map/Reduce”, Marin
Dimitrov, Feb 2010

4) “BFS & MapReduce”, Edward J Yoon,
http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html, Feb 26 2009

5) “Market Basket Analysis Algorithm with no-SQL DB HBase
and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang
Xu, Seon Ho Kim, The Third International Conference on
Emerging Databases (EDB 2011), Songdo Park Hotel,
Incheon, Korea, Aug. 25-27, 2011


HiPIC References
6) “Market Basket Analysis Algorithm with Map/Reduce of
Cloud Computing”, Jongwook Woo and Yuhang Xu, The 2011
International Conference on Parallel and Distributed
Processing Techniques and Applications (PDPTA 2011), Las
Vegas, July 18-21, 2011

7) Building Realtime Big Data Services at Facebook with
Hadoop and HBase, Jonathan Gray, Facebook, Nov 11, 2011,
Hadoop World NYC

8) Analyzing Big Data at Twitter, Kevin Weil, Web 2.0 Expo, NYC,
Sep 2010

9) Lessons Learned from Migrating 2+ Billion Documents at
Craigslist, Jeremy Zawodny, 2011

10) Machine Learning on Hadoop at Huffington Post | AOL, Thu
Kyaw and Sang Chul Song, Hadoop DC, Oct 4, 2011


HiPIC References
11) “MapReduce Debates and Schema-Free”, Woohyun Kim,
www.coordguru.com, http://blog.naver.com/wisereign, March
3 2010

12) “Large Scale Data Analysis with Map/Reduce”, Marin
Dimitrov, Feb 2010

13) “HBase Schema Design Case Studies”, Qingyan Liu, July 13
2009


HiPIC References Part II
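1) Daughters of Genghis Khan Who Married into Goryeo (고려에 시집온 징기스칸의 딸들), Lee Han-su (이한수), Gimm-Young Publishers (김영사), Nov 8 2006

2) Kublai Khan's Expedition to Japan and King Chungnyeol (쿠빌라이 칸의 일본원정과 충렬왕), Lee Seung-han (이승한), Pureun Yeoksa (푸른역사), 2009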

