12/12/2018 | Demetris Trihinas | trihinas.d@unic.ac.cy
Tutorial | TechCamp Cyprus
Department of Computer Science
Storytelling through Data
From Mining Raw Data to Story Visualization
Demetris Trihinas
Department of Computer Science
University of Nicosia
trihinas.d@unic.ac.cy
Cyprus
Full-Time Faculty Member
University of Nicosia
"Developing scalable and self-adaptive tools for data management, exploration and visualization"
@dtrihinas
http://dtrihinas.info
https://ailab.unic.ac.cy/
Today's Talk
Raw bits n' bytes -> Structured data -> Knowledge -> Story
State | Unemployment
------------------------------
NY | 1.72
CA | 2.43
DC | 3.54
…
Data Collection
⢠The worldâs data sources (e.g., social media, news
outlets) often permit ârestrictedâ access to their data.
⢠Web Crawling: methodically scrape website content
⢠Application Programmable Interfaces (APIs)
⢠âASK for permission and GET access to resource(s)â
⢠So⌠turn the âtapâ of a data source (coding task) and store
the data somewhere (data warehousing).
Data Collection via API
Data Collection
GET access to tweets
You can have 1% for free with this access token.
The tweet sink
Data Warehouse
GET tweets from @dtrihinas or with #data_mining
Also, ask for #cyprus and #cyprus
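The collect-and-store idea can be sketched without touching a real API. This is only an illustration: the sample tweets, the `matches` and `store` helpers and the table layout are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical sample of already-fetched tweets (not real API output).
tweets = [
    {"user": "dtrihinas", "text": "Slides are online"},
    {"user": "alice", "text": "Learning #data_mining today"},
    {"user": "bob", "text": "Lunch time"},
    {"user": "carol", "text": "Sunny day in #cyprus"},
]

WANTED_USERS = {"dtrihinas"}
WANTED_TAGS = {"#data_mining", "#cyprus"}


def matches(tweet):
    """True if the tweet is from a wanted user or carries a wanted hashtag."""
    words = set(tweet["text"].lower().split())
    return tweet["user"] in WANTED_USERS or bool(words & WANTED_TAGS)


def store(rows):
    """Write tweets into an in-memory SQLite table (the stand-in warehouse)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tweets (user TEXT, text TEXT)")
    db.executemany(
        "INSERT INTO tweets VALUES (?, ?)",
        [(t["user"], t["text"]) for t in rows],
    )
    db.commit()
    return db


warehouse = store([t for t in tweets if matches(t)])
stored_users = [
    row[0] for row in warehouse.execute("SELECT user FROM tweets ORDER BY rowid")
]
print(stored_users)
```

In a real setting the `tweets` list would be replaced by responses from the data source, fetched with the access token it grants.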
Data Overview
⢠Trawling through a couple of articles manually is easy.
⢠But⌠what about thousands of news articles from
multiple news outlets?
Humans are slow, Computers are fast!
⢠Get the data, store it and then mine it!
Batch Data
⢠Assumes that the data is available when and if we want it
(e.g., reading and parsing data from a file or database)
⢠The application knows the dataset in advance and controls the
input rate of the data.
Count events by color
fetch data
<red, 3>
<yellow, 1>
<blue, 2>
<green, 2>
Application
Database
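A minimal sketch of the batch case, assuming the whole dataset is already available as a list. The `events` list is made up so that the counts match the ones shown on the slide.

```python
from collections import Counter

# Hypothetical batch: the whole dataset is known before processing starts.
events = ["red", "blue", "red", "green", "yellow", "blue", "red", "green"]

# One pass over the complete dataset gives the final answer.
counts = Counter(events)
for color, n in counts.items():
    print(f"<{color}, {n}>")
```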
⢠Unbounded Data -> the volume of the data is overwhelming
⢠Conceptually infinite sequence of data items
⢠Push Model -> data arrives at high velocity and different rates
⢠Potentially multiple sources pushing data to the application at
different rates (data distribution changes over time)
Data Streams
Application
src1
src2
src3
0
2
4
input rate
t
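In contrast to the batch case, a streaming application updates its answer as each item arrives. The sketch below assumes a hypothetical `merged_stream` generator standing in for several sources; a real stream would not end.

```python
from collections import Counter


def merged_stream():
    """Hypothetical stand-in for several sources pushing items over time."""
    yield from [
        ("src1", "red"),
        ("src2", "blue"),
        ("src1", "red"),
        ("src3", "green"),
        ("src2", "red"),
    ]


def running_counts(stream):
    """Update counts one item at a time; the full dataset is never needed."""
    counts = Counter()
    for _source, color in stream:
        counts[color] += 1
        yield dict(counts)  # snapshot of the answer after each arrival


snapshots = list(running_counts(merged_stream()))
print(snapshots[-1])
```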
Data Warehousing
⢠Data warehousing provides data storage and
management capabilities.
⢠Memory and storage has
never been cheaper.
1MB today is 10 times
cheaper than 5 years
ago!
Marketing Mantra
⢠Collect whatever data you can, whenever and
wherever possible.
⢠The expectation is that collected data will have value
either for the purpose collected or for a purpose not
yet envisioned.
Data Mining
⢠Data is useless unless you can convert it to structured
information and ultimately into knowledge.
⢠So⌠data mining provides you with the intelligence to
convert data into knowledge.
What is NOT Data Mining
⢠Any question you can ask and get an âimmediate and
concreteâ answer from a database.
⢠How many sofas models does IKEA currently have in stock?
⢠How many sofas did IKEA sell in Sweden last month?
⢠Which IKEA customers bought a sofa worth more than 500
euros this year?
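Questions like these are plain database lookups. As an illustration only, with a hypothetical `sales` table and made-up rows, two of them can be answered directly with SQL:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (customer TEXT, item TEXT, country TEXT, price REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("anna", "sofa", "Sweden", 650.0),
        ("john", "sofa", "Sweden", 420.0),
        ("mat", "lamp", "Sweden", 30.0),
        ("eva", "sofa", "Cyprus", 800.0),
    ],
)

# "How many sofas were sold in Sweden?" is a simple count.
sofas_in_sweden = db.execute(
    "SELECT COUNT(*) FROM sales WHERE item = 'sofa' AND country = 'Sweden'"
).fetchone()[0]

# "Which customers bought a sofa worth more than 500 euros?" is a simple filter.
big_spenders = [
    row[0]
    for row in db.execute(
        "SELECT customer FROM sales WHERE item = 'sofa' AND price > 500 ORDER BY customer"
    )
]
print(sofas_in_sweden, big_spenders)
```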
Classification
⢠Develop models (or functions) that feature the ability
to distinguish and describe a collection of various
attribute into classes.
⢠âGive a label to your data!â
⢠Should the IKEA sofa model S be added to this monthâs
discount items (yes, no)?
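One simple way to "give a label" is nearest-neighbour classification, chosen here only for brevity. The training data, the two features and the `classify` helper are hypothetical.

```python
# Hypothetical labeled history: (price in euros, units sold last month) -> discounted?
training = [
    ((900, 5), "yes"),
    ((850, 8), "yes"),
    ((300, 60), "no"),
    ((350, 45), "no"),
]


def classify(item):
    """Label an item with the class of its nearest labeled neighbour (1-NN)."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _features, label = min(training, key=lambda pair: sq_dist(pair[0], item))
    return label


sofa_s = (800, 10)  # hypothetical attributes of sofa model S
print(classify(sofa_s))
```

In practice the features would be scaled first so that no single attribute dominates the distance.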
Clustering
⢠Develop models to group data together based on their
similarity or dissimilarity to data in other groups.
⢠Group IKEA customers based on how much disposable
income they have, or how often they tend to shop at a
particular IKEA branch.
⢠Similar to classification but with unknown classes.
Pattern Discovery
⢠One of the most basic techniques in data mining is learning
to recognize patterns in the data.
⢠This is usually a recognition of some aberration in your data
happening at regular intervals, or an ebb and flow of a
certain variable over time.
⢠Sales of a certain product seem to spike just before the
holidays, or notice that warmer weather drives more
people to your website.
Association
⢠Association is related to tracking patterns, but is more
specific to dependently linked attributes.
⢠Model developed to look for specific events or
attributes that are highly correlated with another event
or attribute.
⢠When your customers buy a specific item, they also
often buy a second, related item.
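The "bought together" idea is commonly measured with support and confidence. The sketch below uses hypothetical baskets and helper names.

```python
# Hypothetical shopping baskets.
baskets = [
    {"sofa", "cushion"},
    {"sofa", "cushion", "lamp"},
    {"sofa", "rug"},
    {"lamp"},
]


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= b for b in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets with `antecedent`, the fraction that also hold `consequent`."""
    return support(antecedent | consequent) / support(antecedent)


pair_support = support({"sofa", "cushion"})
rule_confidence = confidence({"sofa"}, {"cushion"})
print(pair_support, rule_confidence)
```

Here two of the three baskets containing a sofa also contain a cushion, so the rule "sofa -> cushion" has a confidence of about 0.67.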
Outlier Detection
⢠Particular data points do not comply with general
behavior (pattern) of the rest of the data.
⢠We call them outliers.
⢠Credit card fraud from
irregular buying patterns
⢠Patient health from
irregular symptoms
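One simple way to flag such points is a z-score test, shown here as a sketch. The purchase amounts and the threshold of 2 standard deviations are assumptions.

```python
from statistics import mean, stdev


def outliers(values, threshold=2.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card purchases in euros; one looks irregular.
purchases = [20, 25, 22, 19, 24, 21, 23, 500]
flagged = outliers(purchases)
print(flagged)
```

A single extreme value inflates the standard deviation itself, so robust alternatives (for example, median-based rules) are often preferred on small samples.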
Regression
⢠Used primarily as a form of modeling to identify the
likelihood of a certain variable, given the presence of
other variables.
⢠Project a certain price, based on other factors like
availability, consumer demand, and competition.
⢠How much should we sell the new IKEA sofa?
Correlation
⢠Correlation is a statistical technique that tells us how
strongly pairs of variables are related.
⢠But⌠correlation does not tell us the why and how
behind the relationship.
⢠So⌠correlation just says that a relationship exists.
Causation
⢠Causation denotes that any change in the value of one
variable will cause a change in the value of another
variable.
⢠This means that one variable makes other to happen.
Exercise and Calories
⢠When a person is exercising then the amount of
calories burned increases every minute.
⢠The former (exercise) is causing the latter (calories
burned) to happen.
Ice-Cream and Homicides in New York
⢠A study in the 90âs showed that ice-cream sales are the
cause of homicides in New York.
⢠As the sales of ice-cream rise and fall, so do the
number of homicides -> correlation.
⢠But⌠does the consumption of ice-cream actually
cause the death of people in NY?
https://www.nytimes.com/2009/06/19/nyregion/19murder.html
Correlation Does NOT Imply Causation
⢠No⌠the two things are correlated.
⢠But this does NOT mean one causes other.
Correlation is something which
we think, when we canât see
under the covers.
So the less the information we
have the more we are forced
to observe correlations.
Confidence Intervals
⢠How many football games do US citizens got to?
⢠To get an -exact- answer (100% correct), you must ask
everyone in the US (>350M people) -> Not practical!
⢠Use a random sample, meaning ask (much) less people
-> but we wonât be 100% correct.
Confidence Intervals
⢠What we try to achieve: Get an interval that we are
confident that the actual answer lies within.
âI am 95% confident that the number of football games
people in the U.S. go to lies between 10 and 12â
⢠So basically, CIs describe the level of uncertainty
associated with a sample estimation.
Random Sample Selection
⢠Random⌠means random!
⢠You cannot just select 1000 people from one city, the
sample wont represent the whole US.
⢠You cannot just send FB messages to 1000 random
people, you will get a representation of US FB users,
and of course not all of the US citizens use FB.
Random Sample Distribution
⢠Without going into a lot of statistics, a perfectly
random sample distribution should look like this:
Assuming that you
actually selected a
random sample
Confidence Intervals
⢠Random sample: 1000 US citizens
⢠Avg is 11 games and SD is 5 games.
⢠Letâs say we want a 95% confidence interval.
95%
11
With some statistics
we get an interval of
1 game for 95% CI.
We are 95% confident
that the average US
citizen watches between
10-12 games a year.
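The "some statistics" step can be sketched with the usual normal approximation, where the margin is z times SD divided by the square root of the sample size. This is one common formula, not necessarily the one behind the slide; `z = 1.96` is the standard value for 95%, and the function name is hypothetical.

```python
from math import sqrt


def mean_confidence_interval(mean, sd, n, z=1.96):
    """Normal-approximation CI for a sample mean; z=1.96 corresponds to 95%."""
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin


low, high = mean_confidence_interval(mean=11, sd=5, n=1000)
print(round(low, 2), round(high, 2))
```

Under these assumptions the interval is roughly 11 ± 0.31, which sits inside the 10-12 range quoted above. A smaller sample widens the interval: with n=100 the margin grows to about 1 game.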
Data Science Process
[Figure: pipeline with the stages Data Collection, Data Warehousing, Data Preprocessing, Data Mining and Data Visualization; intermediate products: Raw Data, Struct Info, Preprocessed Info, Insights, Story]
Data Preprocessing
⢠Data mining, especially on big data, is a -compute and
time- expensive process.
⢠Data Preprocessing can significantly increase
performance if performed before mining.
⢠Data Cleaning
⢠Data Reduction
⢠Data Transformation
Preprocessing can even take around
60% of your effort but totally worth it!
Data Cleaning
⢠You would assume that data stored in a database is
ready for analysis, but⌠âdirty dataâ.
⢠Removing duplicate, erroneous or NA data.
⢠Statistically imputing missing data.
id name age score
1000
1001
Anna
John
42
fifty
84.7
89.5
age MUST be a number
id name age score
1000
1001
1002
Anna
John
Mat
42
50
29
84.7
89.5
Mat was sick on test day but is C-
average student so lets assume he
would have scored a 72.0
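A minimal cleaning sketch on rows shaped like the tables above. It flags non-numeric ages and fills the missing score with the mean of the known scores, which is a different (purely statistical) imputation than the domain-knowledge guess of 72.0; the `clean` helper is hypothetical.

```python
rows = [
    {"id": 1000, "name": "Anna", "age": "42", "score": 84.7},
    {"id": 1001, "name": "John", "age": "fifty", "score": 89.5},
    {"id": 1002, "name": "Mat", "age": "29", "score": None},
]


def clean(rows):
    """Set non-numeric ages to None and impute missing scores with the mean score."""
    known = [r["score"] for r in rows if r["score"] is not None]
    mean_score = sum(known) / len(known)
    cleaned = []
    for r in rows:
        age = int(r["age"]) if r["age"].isdigit() else None  # "fifty" fails the check
        score = r["score"] if r["score"] is not None else round(mean_score, 1)
        cleaned.append({**r, "age": age, "score": score})
    return cleaned


cleaned = clean(rows)
for r in cleaned:
    print(r)
```

Rows with an age of `None` still need a decision: correct them by hand, impute them, or drop them.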
Data Transformation
⢠Reshape, sort and combine data to suitable format(s)
for analysis.
id name age score
1000
1001
1002
Anna
John
Mat
42
50
29
84.7
89.7
72.0
id name Eats Breakfast
1000
1001
1002
Anna
John
Mat
Yes
yes
no
id name age score
1001
1000
1002
John
Anna
Mat
50
42
29
90
85
72
Breakfast
1
1
0 Sort
by
score
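The steps shown in the tables (combine on id, encode yes/no as 1/0, round the score, sort by score) can be sketched in plain Python. The `transform` helper and the dictionary layout are assumptions.

```python
scores = [
    {"id": 1000, "name": "Anna", "age": 42, "score": 84.7},
    {"id": 1001, "name": "John", "age": 50, "score": 89.7},
    {"id": 1002, "name": "Mat", "age": 29, "score": 72.0},
]
breakfast = {1000: "Yes", 1001: "yes", 1002: "no"}


def transform(scores, breakfast):
    """Join on id, encode yes/no as 1/0, round scores, sort by score (highest first)."""
    merged = [
        {
            **row,
            "score": round(row["score"]),
            # Normalize case so "Yes" and "yes" map to the same value.
            "breakfast": 1 if breakfast[row["id"]].strip().lower() == "yes" else 0,
        }
        for row in scores
    ]
    return sorted(merged, key=lambda r: r["score"], reverse=True)


table = transform(scores, breakfast)
for r in table:
    print(r)
```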
Data Reduction
⢠Perform filtering on the data that is not needed for the
analysis to consume less resources and time.
⢠Analysis will be performed on US citizens so remove others.
⢠Use only a sample of the data to get an approximate, but
quick, answer
⢠Create random sample of 1K rows instead of 1M rows.
⢠Reduce the dimensionality of the problem
⢠The field age is not relevant to analysis.