BIG DATA & DATA MINING
LECTURE 3, 7.9.2015
INTRODUCTION TO COMPUTATIONAL SOCIAL SCIENCE (CSS01)
LAURI ELORANTA
• LECTURE 1: Introduction to Computational Social Science [DONE]
• Tuesday 01.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 2: Basics of Computation and Modeling [DONE]
• Wednesday 02.09. 16:00 – 18:00, U35, Seminar room 113
• LECTURE 3: Big Data and Information Extraction [TODAY]
• Monday 07.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 4: Network Analysis
• Monday 14.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 5: Complex Systems
• Tuesday 15.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 6: Simulation in Social Science
• Wednesday 16.09. 16:00 – 18:00, U35, Seminar room 113
• LECTURE 7: Ethical and Legal issues in CSS
• Monday 21.09. 16:00 – 18:00, U35, Seminar room 114
• LECTURE 8: Summary
• Tuesday 22.09. 17:00 – 19:00, U35, Seminar room 114
LECTURES SCHEDULE
• PART 1: BIG DATA DEFINED
• PART 2: DATA MINING PROCESS
• PART 3: WHERE TO GET DATA
• PART 4 : DATA VISUALIZATION
LECTURE 3 OVERVIEW
BIG DATA DEFINED
• The term big data is used quite loosely, with various definitions depending
on the context
• Big data is often misunderstood as referring only to big volumes of data
• One of the most used definitions in the field of IT is by Gartner:
“Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision
making.” (Gartner 2014.)
• Gartner analyst Doug Laney introduced the 3Vs concept in a 2001
META Group research publication, 3D Data Management: Controlling Data
Volume, Velocity and Variety.
BIG DATA DEFINED
(Gartner 2014.)
• Known as the three “V”s of big data
1. Volume refers to the big quantities of data
2. Velocity refers to the usually high speed at which data is generated
3. Variety refers to different kinds and types of data
• Other Vs suggested as well: Variability, Veracity
VOLUME, VELOCITY & VARIETY
(Gartner 2014.)
• “Big Data represents the Information assets
characterized by such a High Volume,
Velocity and Variety to require specific
Technology and Analytical Methods for its
transformation into Value”.
• (De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A
consensual definition and a review of key research topics. 4th
International Conference on Integrated Information, Madrid)
DE MAURO, GRECO & GRIMALDI 2014, DEFINITION
• Strong instrumental component in relation to how you get “value” out of
big data
• Answering research questions
• Answering business problems
• Rather than just one particular technology, big data also refers to a large
set of different technologies used in various ways
BIG DATA IS ABOUT USING BIG DATA
(Sicular 2013.)
• “Every day, we create 2.5 quintillion bytes of data — so
much that 90% of the data in the world today has been
created in the last two years alone. This data comes from
everywhere: sensors used to gather climate information,
posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals to
name a few. This data is big data.” (IBM 2014a.)
• Underlines the volume component of big data.
IBM’S DEFINITION
IBM’S FOUR VS
(IBM 2014b.)
• E.g. 7 views from Elliot 2013:
• Big Data as
1. Volume, Velocity & Variety (dictionary definition)
2. Set of technologies and tools
3. Set of different categories and types of data
4. Means of predicting the future (big data as signals)
5. New possibilities, that previously were impossible (value)
6. Metaphor for a global neural network (combining all data)
7. As a capitalist/neoliberal concept (critical view)
MANY VIEWPOINTS TO BIG DATA
(Elliot 2013)
• Lately, in the social sciences, big data has been defined either in quite
vague terms or by underlining only the volume component of big data
• “Big Data, that is, data that are too big for standard database software to
process, or the more future-proof, ‘capacity to search, aggregate, and
cross-reference large data sets.’” (Eynon 2013.)
• “Today, our more-than-ever digital lives leave significant footprints in
cyberspace. Large scale collections of these socially generated footprints,
often known as big data --” (Yasseri and Bright 2013.)
• "These emitted shadows of ‘big data’ can take a variety of forms, but most
are manifestations or byproducts of human/machine interactions in
code/spaces and coded spaces. We now see hundreds of millions of
connected people, billions of sensors, and trillions of communications,
information transfers, and transactions producing unfathomably large data
shadows --" (Graham 2013.)
TYPICALLY NOT A COMMON DEFINITION IN SOCIAL SCIENCE RESEARCH
DATA MINING PROCESS
• The data mining process aims at answering research questions based on
large sets of data (in other words, big data)
• New insights and information are “mined” from the data with automated
computation
• For a variety of research purposes with many different kinds of data
• Long traditions: quantitative content analysis and register-based
research, for example, could be seen as forms of data mining
• NOTE! To be specific, in computer science the term data mining only
refers to the pre-processing and analysis part of the whole process
DATA MINING PROCESS IN CSS
1. Formulating research questions
2. Selecting source raw data
3. Gathering source raw data
4. Preprocessing
5. Analysis
6. Communication
(Cioffi-Revilla 2014.)
• Everything starts with a research question
• Three main types of research questions in relation to data
• 1. Inductive = Data-driven. The data tells something new.
• 2. Deductive = Theory-driven. The data tells something about a theory.
E.g. data can be used to test hypotheses.
• 3. Abductive = Mixed model, in between inductive and deductive
research
RESEARCH QUESTIONS IN DATA MINING
(Cioffi-Revilla 2014.)
• Main guiding factor: the research question
• Not just text: many different forms of data
• Text / Numeric data
• Images
• Video
• Audio
• Sensor-data
• Register data
• Where to get the data?
• Data and its selection come with many problems: ethical, legal,
privacy, public vs. private. (These matters will have a lecture of their
own.)
SELECTING AND GATHERING RAW DATA
(Cioffi-Revilla 2014.)
• Data needs to be pre-processed before it can be analyzed: this typically
takes up a very large part of the data mining process
• Cioffi-Revilla 2014 mentions these (mainly from a textual content analysis
perspective):
• Scanning = generating machine readable files
• Cleaning = making the data set more concise (removing unnecessary
noise)
• Filtering = there may be a need to filter the data based on some rules
or categories even before the analysis
• Reformatting = changing the structure of the data, for example
dividing data into smaller parts
• Content proxy extraction = extracting the proxies in text that
denote latent entities
PREPROCESSING DATA
(Cioffi-Revilla 2014.)
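A minimal sketch of three of the preprocessing steps above (cleaning, filtering, reformatting) on text data. The sample documents, the keyword "budget" and the function names are made up for illustration; real preprocessing depends on the data and the research question.

```python
import re

# Hypothetical raw documents; HTML remnants and extra whitespace stand in for noise.
RAW_DOCS = [
    "<p>Parliament   votes on the BUDGET today!</p>",
    "Buy cheap watches now!!!",
    "<div>The budget debate continues. Opposition objects.</div>",
]

def clean(text):
    """Cleaning: strip HTML tags, lowercase, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

def keep(text, keyword="budget"):
    """Filtering: keep only documents that mention the keyword."""
    return keyword in text

def split_sentences(text):
    """Reformatting: divide a document into smaller parts (sentences)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

cleaned = [clean(d) for d in RAW_DOCS]
filtered = [d for d in cleaned if keep(d)]
sentences = [s for d in filtered for s in split_sentences(d)]
```

Here the advertisement is dropped by the filter, and the two remaining documents are split into three sentences ready for analysis.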
• This is the main automated information extraction part: data is “mined” to
reveal new information
• Many different analysis method classes, typically combining techniques
from statistics, machine learning, artificial intelligence and database
systems.
• Main types of analysis (according to Fayyad et al. 1996):
Classification, Clustering, Regression Analysis, Summarization,
Dependency Modeling, Anomaly Detection
• There are many others, which can be seen as combining and
mixing the main types given above
DATA ANALYSIS
(Fayyad et al. 1996)
• Classification maps (classifies) a data item into one or several predefined
classes
• Classification algorithms are learning algorithms in the sense that they
need a data set that defines how to categorize the data: thus, one needs
to teach the classification algorithm what classes to look for
• For example
• Classification of images into different categories
• Classification of news items into different categories
• Classification of email into spam and normal mail
CLASSIFICATION
(Fayyad et al. 1996)
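A minimal sketch of the spam example, assuming a naive Bayes word-count classifier (one possible classification algorithm, not one named on the slide). The tiny labeled training set is made up; it plays the role of the data set that teaches the algorithm which classes to look for.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training data: (text, class).
TRAINING = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule for monday", "normal"),
    ("lecture notes and schedule", "normal"),
]

def train(examples):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, class_counts = train(TRAINING)
```

With this training set, `classify("cheap money", word_counts, class_counts)` lands in the spam class and `classify("lecture schedule", word_counts, class_counts)` in the normal class.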
• Clustering groups a set of data objects in such a way that objects in the
same group (cluster) are more similar to each other than to those in
other groups (clusters).
• Not one specific algorithm, but a general task with many different
solutions and algorithms
• Connectivity based clustering (based on distance)
• Centroid based clustering (e.g. K-means clustering)
• Distribution based clustering (objects belonging most likely to the same
distribution)
• Density based clustering
CLUSTERING
(Fayyad et al. 1996)
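A minimal sketch of centroid based clustering with plain k-means. The points and the two starting centroids are made up for illustration; in practice the starting centroids are usually picked at random and the number of clusters has to be chosen by the researcher.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the assigned points (keep the old one if a cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical two-dimensional data with two visible groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(POINTS, [(0.0, 0.0), (10.0, 10.0)])
```

Note that no class labels are given: the groups come out of the distances between the points alone.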
• Helsingin Sanomat (the biggest news corporation in Finland) opened
their Finnish parliament election 2015 questionnaire data to the public
• The data contained questions and their answers from election
candidates for the Finnish parliament
• The data could be analyzed via clustering and factor analysis to find out
what different groups (clusters) of thought the candidates actually
represent (in comparison to their actual party).
• Try it out: http://users.aalto.fi/~leinona1/vaalit2015/
CLUSTERING EXAMPLE
• Does what it says on the tin! Finding compact descriptions of subsets of
data.
• For example, calculating means and standard deviations over different data
attributes (dimensions)
• Summarization techniques are often applied to interactive exploratory
data analysis and automated report generation.
SUMMARIZATION
(Fayyad et al. 1996)
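A minimal sketch of summarization: the mean and sample standard deviation of each attribute in a small data set. The records and attribute names are made up for illustration.

```python
from statistics import mean, stdev

# Hypothetical survey records: one dict per respondent.
RECORDS = [
    {"age": 20, "income": 1000},
    {"age": 30, "income": 2000},
    {"age": 40, "income": 3000},
]

def summarize(records):
    """Return {attribute: (mean, sample standard deviation)} for every attribute."""
    attributes = records[0].keys()
    return {
        a: (mean(r[a] for r in records), stdev(r[a] for r in records))
        for a in attributes
    }

summary = summarize(RECORDS)
```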
• Estimating the relationship among variables (with a regression function)
• It includes many techniques for modeling and analyzing
• Focuses on the relationship between a dependent variable and one or
more independent variables.
• The regression function is learned from the data
• Applications in prediction and
REGRESSION ANALYSIS
(Fayyad et al. 1996)
REGRESSION EXAMPLE
LINEAR REGRESSION
(Image is public domain, from Wikipedia 2015, Regression Analysis)
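A minimal sketch of simple linear regression with ordinary least squares: one dependent variable y and one independent variable x. The data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Use the fitted line to predict y for a new x."""
    return slope * x + intercept

# Hypothetical observations that lie roughly on the line y = 2x.
XS = [1, 2, 3, 4, 5]
YS = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(XS, YS)
```

The fitted line can then be used for prediction, e.g. `predict(6, slope, intercept)` for an x value that was not observed.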
• Finds significant dependencies between the data variables
• Two levels
• Structural level defining which variables are dependent (can be in
graphical form)
• Quantitative level defining the strength of the dependency in numeric
form
• E.g. Correlation analysis
• E.g. Probabilistic dependency networks
DEPENDENCY MODELING
(Fayyad et al. 1996)
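A minimal sketch of the quantitative level: the Pearson correlation coefficient as a numeric measure of the strength of a linear dependency between two variables. The example values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical variables: one rises with x, the other falls with x.
r_positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```

The coefficient ranges from -1 (perfect negative linear dependency) to 1 (perfect positive linear dependency); values near 0 indicate no linear dependency.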
CORRELATION DOES NOT IMPLY CAUSATION
(XKCD: Correlation, http://imgs.xkcd.com/comics/correlation.png)
• Change and deviation detection
• Has the data changed from some previously known stable state or from
some previously measured normative values (“normal range”)
• Time scales matter: a short-term anomaly may actually be normal in the
long term.
• Synchronic change (anomalies in stable processes) and diachronic
change (deeper change in generative structures of the process)
• Quite a dynamic category
ANOMALY DETECTION
(Fayyad et al. 1996)
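A minimal sketch of deviation detection against a previously measured "normal range". It assumes a simple rule (one of many possible): a value is anomalous if it lies more than three standard deviations from the baseline mean. The measurements are made up for illustration.

```python
from statistics import mean, stdev

def find_anomalies(baseline, observations, threshold=3.0):
    """Return observations further than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [x for x in observations if abs(x - mu) > threshold * sigma]

# Hypothetical stable baseline and new measurements.
BASELINE = [10, 11, 9, 10, 12, 10, 9, 11]
NEW_VALUES = [10, 13, 25, 9]
anomalies = find_anomalies(BASELINE, NEW_VALUES)
```

The choice of baseline period and threshold decides what counts as an anomaly, which is where the time scale question above comes in.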
• Cioffi-Revilla (2014) lists, for example, vocabulary analysis, correlation,
lexical analysis, spatial analysis, semantic analysis, sentiment analysis,
similarity analysis, clustering, network analysis, sequence analysis,
intensity analysis, anomaly detection, sonification analysis
• The most important thing is to understand the ins and outs of the analysis
model you are using: what it is for and how it behaves under the
hood
• The relationship of the model to your research question
AND MANY OTHERS…
• Basically means that a data analysis algorithm is able to “learn” and improve its
performance iteratively from the data
• 1. Supervised machine learning
• The algorithm is trained on some known labeled data (input/target pairs)
• e.g. Netflix is able to suggest you better movies based on how you use it: By
watching and rating films you are teaching the machine how to suggest better
movies to you
• 2. Semi-supervised machine learning
• The algorithm is trained with a small set of labeled data (input/target pairs) and
a set of unlabeled data
• 3. Unsupervised machine learning
• No labeled target data is given for the machine to learn from
• The algorithm is able to find patterns and structures from the data automatically
without any prior training
• 4. Reinforcement machine learning
• The algorithm has a certain goal and it interacts with a dynamic environment, which
gives it rewards based on its actions
MACHINE LEARNING
WHERE TO GET DATA
• Ready Data Sets = Many public data sets provided by different institutions
• Web APIs = Application programming interfaces that give you data in a
structured format. For example, Facebook and Twitter have APIs for getting
data
• Web Scraping = Gathering the information automatically from webpages,
when it is allowed.
• Databases = Querying databases directly with query languages (e.g. SQL)
• Custom data gathering process = the traditional research data gathering
(surveys, interviews…)
• Open Data and Open Science are growing trends: governments are
providing APIs and data sets for different kinds of public data (e.g. fiscal
information, expenses)
DATA SOURCES
MAIN TYPES
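A minimal sketch of querying a database directly with SQL, using Python's built-in sqlite3 module and an in-memory database. The table, its columns and the figures are made up for illustration; a real research database would be provided by the data owner.

```python
import sqlite3

# Hypothetical table of public expenses, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (ministry TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO expenses VALUES (?, ?, ?)",
    [
        ("Education", 2014, 100.0),
        ("Education", 2015, 120.0),
        ("Health", 2015, 200.0),
        ("Health", 2015, 50.0),
    ],
)

# Total expenses per ministry for one year; the year is passed as a parameter.
rows = conn.execute(
    "SELECT ministry, SUM(amount) FROM expenses "
    "WHERE year = ? GROUP BY ministry ORDER BY ministry",
    (2015,),
).fetchall()
conn.close()
```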
OLDIE BUT GOLDIE…
GOVERNMENTAL REGISTRIES
FINNISH SOCIAL SCIENCE DATA ARCHIVE
CSC.FI: ETSIN & AAVA
STATISTICS FINLAND
HELSINKI REGION INFOSHARE
GAPMINDER DATA
• The Internet is full of open datasets of different kinds!
Some examples:
• Economics
• American Economic Ass. (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
• Gapminder: http://www.gapminder.org/data/
• UMD: http://inforumweb.umd.edu/econdata/econdata.html
• World bank: http://data.worldbank.org/indicator
• Finance
• CBOE Futures Exchange: http://cfe.cboe.com/Data/
• Google Finance: https://www.google.com/finance (R)
• Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
• St Louis Fed: http://research.stlouisfed.org/fred2/ (R)
• NASDAQ: https://data.nasdaq.com/
• OANDA: http://www.oanda.com/ (R)
• Quandl: http://www.quandl.com/
• Yahoo Finance: http://finance.yahoo.com/ (R)
• Social Sciences
• General Social Survey: http://www3.norc.org/GSS+Website/
• ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp
• Pew Research: http://www.pewinternet.org/datasets/pages/2/
• SNAP: http://snap.stanford.edu/data/index.html
• UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
• UPJOHN INST: http://www.upjohn.org/erdc/erdc.html
• FROM: http://www.inside-r.org/howto/finding-data-internet
INTERNET IS FULL OF DATA
WEB SCRAPING, APIS & DATABASES
[Diagram: the data provider organisation holds a database, an API (application programming interface) and a public WWW page. Access via the Internet happens either through API calls to the API or through automated web scraping of the public WWW page. The database is typically accessed only from inside the organisation and not via the Internet.]
• Web services and applications (such as twitter, facebook,…) provide
Web APIs so that others are able to build their services using some
functionality or data based on the data provider’s Web API / Web service
• Using APIs is the structured and “right” way to get data from a web
service
• The use of APIs is controlled by the data provider: they are thus used
with the data provider’s permission
• Some APIs charge according to usage, some have other conditions for use
• Needs programming to connect
API (APPLICATION PROGRAMMING INTERFACE)
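A minimal sketch of what "programming to connect" can look like. The endpoint, the query parameters, the bearer-token header and the response layout are all hypothetical; every real API documents its own. The network call is defined but not executed here, and a sample response string stands in for the server's answer.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://api.example.org/v1/posts"  # hypothetical endpoint

def build_url(query, count=10):
    """Build the request URL with URL-encoded query parameters."""
    return BASE_URL + "?" + urlencode({"q": query, "count": count})

def fetch_json(url, token):
    """Call the API with an access token and decode the JSON answer (not run in this sketch)."""
    request = Request(url, headers={"Authorization": "Bearer " + token})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

def extract_texts(payload):
    """Pick the text field out of each returned item."""
    return [item["text"] for item in payload["results"]]

# Hypothetical structured response, as an API might return it.
SAMPLE_RESPONSE = '{"results": [{"id": 1, "text": "hello"}, {"id": 2, "text": "big data"}]}'
texts = extract_texts(json.loads(SAMPLE_RESPONSE))
```

Because the data arrives in a structured format such as JSON, no parsing of page layout is needed, which is the main practical difference from web scraping.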
TWITTER REST APIS
FACEBOOK GRAPH API
• Web scraping (web harvesting or web data extraction) is a computer
software technique of extracting information from websites. (Wikipedia
2015, Web Scraping)
• Transforms unstructured data in HTML format into a structured format
for further analysis
• Used when you do not have access to the original database or when
there are no APIs
• NOTE! Always make sure that scraping is allowed and legal! This is
not always the case, as some websites and services explicitly forbid web
scraping.
• Numerous tools varying from manual to semi-manual to fully automatic
• High-level scraping services
• Browser plugin tools
• Programming libraries
WEB SCRAPING
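A minimal sketch of scraping with Python's standard library only: first a robots.txt check, then turning an HTML table into rows of structured data. The robots.txt content, the page, the URLs and the figures in the table are made up for illustration. Note that robots.txt is only one signal: a site's terms of use and the law also decide whether scraping is allowed.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and page content, inlined so nothing is downloaded.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

PAGE_HTML = """
<html><body>
<table>
  <tr><td>Helsinki</td><td>620000</td></tr>
  <tr><td>Espoo</td><td>270000</td></tr>
</table>
</body></html>
"""

def is_allowed(robots_txt, url, agent="*"):
    """Check a URL against the rules in a robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE_HTML)
rows = scraper.rows
```

Dedicated scraping libraries wrap the same idea in a more convenient interface.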
SERVICES FOR WEB SCRAPING: IMPORT.IO
https://www.youtube.com/watch?v=ghvsVLkTKLk
SERVICES FOR WEB SCRAPING: KIMONOLABS.COM
SERVICES FOR WEB SCRAPING: WEBHOSE.IO
BROWSER PLUGINS FOR WEB SCRAPING: DATAMINER
• Python
• Scrapy: http://scrapy.org
• BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
• Scrapemark: http://arshaw.com/scrapemark/ (not maintained
anymore)
• R
• rvest: http://cran.r-project.org/web/packages/rvest/index.html
WEB SCRAPING LIBRARIES
• Watch “The Beauty of Data Visualization” by David McCandless:
http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization?language=en
VISUALIZING DATA
• Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining
to knowledge discovery in databases. AI magazine, 17(3), 37.
• De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A
consensual definition and a review of key research topics. 4th
International Conference on Integrated Information, Madrid
LECTURE 3 READING
• Cioffi-Revilla, C. 2014. Introduction to Computational Social Science. Springer-Verlag, London
• Elliot, T. 2013. 7 Definitions of Big Data You Should Know About. http://timoelliott.com/blog/2013/07/7-definitions-of-big-data-you-should-know-about.html
• Eynon, R. 2013. The rise of Big Data: what does it mean for education, technology, and media research? Learning, Media and Technology, 38:3,
237-240, DOI: 10.1080/17439884.2013.771783.
• Gartner, 2014. IT Glossary: Big Data. http://www.gartner.com/it-glossary/big-data/
• Graham, M. 2013. The Virtual Dimension. Global City Challenges: Debating a Concept, Improving the Practice, M. Acuto and W. Steele, 2013.
London: Palgrave. 117-139.
• De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International
Conference on Integrated Information, Madrid
• Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37.
• IBM, 2014a. What is big data? http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
• IBM, 2014b. The Four V’s of Big Data. http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
• Sicular, S. 2013. Gartner's Big Data Definition Consists of Three Parts, Not to Be Confused with Three "V"s. Forbes, 3/27/2013.
http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/
• Yasseri, T., Bright, J. 2013. Can electoral popularity be predicted using socially generated big data? Oxford Internet Institute, University of Oxford.
REFERENCES
Thank You!
Questions and comments?
twitter: @laurieloranta

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 

Kürzlich hochgeladen (20)

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

Big Data and Data Mining - Lecture 3 in Introduction to Computational Social Science

  • 1. BIG DATA & DATA MINING LECTURE 3, 7.9.2015 INTRODUCTION TO COMPUTATIONAL SOCIAL SCIENCE (CSS01) LAURI ELORANTA
  • 2. • LECTURE 1: Introduction to Computational Social Science [DONE] • Tuesday 01.09. 16:00 – 18:00, U35, Seminar room 114 • LECTURE 2: Basics of Computation and Modeling [DONE] • Wednesday 02.09. 16:00 – 18:00, U35, Seminar room 113 • LECTURE 3: Big Data and Information Extraction [TODAY] • Monday 07.09. 16:00 – 18:00, U35, Seminar room 114 • LECTURE 4: Network Analysis • Monday 14.09. 16:00 – 18:00, U35, Seminar room 114 • LECTURE 5: Complex Systems • Tuesday 15.09. 16:00 – 18:00, U35, Seminar room 114 • LECTURE 6: Simulation in Social Science • Wednesday 16.09. 16:00 – 18:00, U35, Seminar room 113 • LECTURE 7: Ethical and Legal issues in CSS • Monday 21.09. 16:00 – 18:00, U35, Seminar room 114 • LECTURE 8: Summary • Tuesday 22.09. 17:00 – 19:00, U35, Seminar room 114 LECTURES SCHEDULE
  • 3. • PART 1: BIG DATA DEFINED • PART 2: DATA MINING PROCESS • PART 3: WHERE TO GET DATA • PART 4: DATA VISUALIZATION LECTURE 3 OVERVIEW
  • 5. • The term big data is used quite loosely, with various definitions depending on the context • Big data is often misunderstood as referring only to big volumes of data • One of the most used definitions in the field of IT is by Gartner: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner 2014.) • Gartner analyst Doug Laney introduced the 3Vs concept in a 2001 META Group research publication, 3D Data Management: Controlling Data Volume, Velocity, and Variety. BIG DATA DEFINED (Gartner 2014.)
  • 6. • Known as the three “V”s of Big Data 1. Volume refers to the large quantities of data 2. Velocity refers to the usually high speed at which data is generated 3. Variety refers to the different kinds and types of data • Other Vs have been suggested as well: Variability, Veracity VOLUME, VELOCITY & VARIETY (Gartner 2014.)
  • 7. • “Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value”. • (De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid) DE MAURO, GRECO & GRIMALDI 2014, DEFINITION
  • 8. • Strong instrumental component in relation to how you get “value” out of big data • Answering research questions • Answering business problems • Instead of just one particular technology, big data also refers to a large set of different technologies used in various ways BIG DATA IS ABOUT USING BIG DATA (Sicular 2013.)
  • 9. • “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.” (IBM 2014a.) • Underlines the volume component of big data. IBM’S DEFINITION
  • 11. • E.g. 7 views from Elliot 2013: • Big Data as 1. Volume, Velocity & Variety (dictionary definition) 2. Set of technologies and tools 3. Set of different categories and types of data 4. Means of predicting the future (big data as signals) 5. New possibilities that were previously impossible (value) 6. Metaphor for a global neural network (combining all data) 7. A capitalist/neoliberal concept (critical view) MANY VIEWPOINTS TO BIG DATA (Elliot 2013)
  • 12. • Lately in the social sciences, big data has been defined either in quite vague terms or by underlining only the volume component of big data • “Big Data, that is, data that are too big for standard database software to process, or the more future-proof, ‘capacity to search, aggregate, and cross-reference large data sets’.” (Eynon 2013.) • “Today, our more-than-ever digital lives leave significant footprints in cyberspace. Large scale collections of these socially generated footprints, often known as big data --” (Yasseri and Bright 2013.) • “These emitted shadows of ‘big data’ can take a variety of forms, but most are manifestations or byproducts of human/machine interactions in code/spaces and coded spaces. We now see hundreds of millions of connected people, billions of sensors, and trillions of communications, information transfers, and transactions producing unfathomably large data shadows --” (Graham 2013.) TYPICALLY NOT A COMMON DEFINITION IN SOCIAL SCIENCE RESEARCH
  • 14. • The data mining process aims at answering research questions based on large sets of data (in other words, big data) • New insights and information are “mined” from the data with automated computation • For a variety of research purposes with many different kinds of data • Long traditions: Quantitative content analysis and register-based research, for example, could be seen as forms of data mining • NOTE! To be specific, in computer science the term data mining refers only to the pre-processing and analysis part of the whole process DATA MINING PROCESS IN CSS 1. Formulating research questions 2. Selecting source raw data 3. Gathering source raw data 4. Preprocessing 5. Analysis 6. Communication (Cioffi-Revilla 2014.)
  • 15. • Everything starts with a research question • Three main types of research questions in relation to data • 1. Inductive = Data-driven. The data tells something new. • 2. Deductive = Theory-driven. The data tells something about a theory. E.g. data can be used to test hypotheses. • 3. Abductive = Mixed model, in between inductive and deductive research RESEARCH QUESTIONS IN DATA MINING (Cioffi-Revilla 2014.)
  • 16. • Main guiding factor: the research question • Not just text: many different forms of data • Text / Numeric data • Images • Video • Audio • Sensor data • Register data • Where to get the data? • Data and its selection come with many problems: ethics, legal, privacy, public vs. private. (These matters will have a lecture of their own.) SELECTING AND GATHERING RAW DATA (Cioffi-Revilla 2014.)
  • 17. • Data needs to be pre-processed before it can be analyzed: typically this can take a very big part of the data mining process • Cioffi-Revilla 2014 mentions these (mainly from a textual content analysis perspective): • Scanning = generating machine-readable files • Cleaning = making the data set more concise (removing unnecessary noise) • Filtering = there may be a need to filter the data based on some rules or categories even before the analysis • Reformatting = changing the structure of the data, for example dividing data into smaller parts • Content proxy extraction = extracting the proxies in the text that denote latent entities PREPROCESSING DATA (Cioffi-Revilla 2014.)
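The cleaning, filtering and reformatting steps can be illustrated with a minimal Python sketch. The function name `preprocess`, the `min_words` threshold and the sample documents are made up for illustration; real preprocessing pipelines depend heavily on the data and the research question.

```python
import re

def preprocess(raw_docs, min_words=3):
    """Clean, reformat and filter raw text documents (illustrative only)."""
    processed = []
    for doc in raw_docs:
        # Cleaning: strip HTML-like tags, collapse whitespace, lowercase
        text = re.sub(r"<[^>]+>", " ", doc)
        text = re.sub(r"\s+", " ", text).strip().lower()
        # Reformatting: divide the document into smaller parts (word tokens)
        tokens = re.findall(r"[a-z]+", text)
        # Filtering: drop documents that are too short to analyze
        if len(tokens) >= min_words:
            processed.append(tokens)
    return processed

# Hypothetical raw documents
docs = [
    "<p>Big   data is NOT just volume!</p>",
    "<br>ok",
    "Velocity and variety matter too.",
]
tokens = preprocess(docs)
```

Here the second document is filtered out because it contains fewer than `min_words` tokens after cleaning.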
  • 18. • This is the main automated information extraction part: data is “mined” to reveal new information • Many different classes of analysis methods, typically combining techniques from statistics, machine learning, artificial intelligence and database systems. • Main types of analysis (according to Fayyad et al. 1996): Classification, Clustering, Regression Analysis, Summarization, Dependency Modeling, Anomaly Detection • There are many others, which can be seen as combining and mixing the main types given above DATA ANALYSIS (Fayyad et al. 1996)
  • 19. • Classification maps (classifies) a data item into one of several predefined classes • Classification algorithms are learning algorithms in the sense that they need a data set that defines how to categorize the data: thus, one needs to teach the classification algorithm what classes to look for • For example • Classification of images into different categories • Classification of news items into different categories • Classification of email into spam and normal mail CLASSIFICATION (Fayyad et al. 1996)
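As a rough illustration of the spam example, the sketch below trains a small naive Bayes classifier on labeled messages and then classifies new ones. Naive Bayes is only one of many possible classification algorithms and is chosen here as an assumption; the training messages and the function names are invented.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Learn word counts per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Return the class with the highest (log) posterior probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Prior: share of training documents in this class
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Likelihood with add-one (Laplace) smoothing for unseen words
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data: this is what "teaches" the classes
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule for monday", "normal"),
    ("lecture notes and schedule", "normal"),
]
model = train_naive_bayes(training)
label_a = classify("cheap money", *model)
label_b = classify("monday lecture", *model)
```

With this toy training set, `label_a` comes out as `"spam"` and `label_b` as `"normal"`.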
  • 20. • Clustering groups a set of data objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters). • Not one specific algorithm, but a general task with many different solutions and algorithms • Connectivity-based clustering (based on distance) • Centroid-based clustering (e.g. K-means clustering) • Distribution-based clustering (objects belonging most likely to the same distribution) • Density-based clustering CLUSTERING (Fayyad et al. 1996)
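Centroid-based clustering can be sketched with a bare-bones K-means loop in plain Python. The points and the starting centroids below are invented, and the fixed starting centroids are an assumption made to keep the example deterministic; library implementations usually pick them randomly.

```python
def kmeans(points, centroids, iterations=10):
    """Minimal K-means: alternate between assignment and update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2D data with two visibly separate groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

No class labels are given to the algorithm: the two groups emerge from the distances between the points alone.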
  • 21. • Helsingin Sanomat (the biggest news corporation in Finland) opened their Finnish parliament election 2015 questionnaire data to the public • The data contained questions and their answers from election candidates for the Finnish parliament • The data could be analyzed via clustering and factor analysis to find out what different groups (clusters) of thought the candidates actually represent (in comparison to their actual party). • Try it out: http://users.aalto.fi/~leinona1/vaalit2015/ CLUSTERING EXAMPLE
  • 23. • Does what it says on the tin! Finding compact descriptions of subsets of data. • For example, calculating means or standard deviations over different data attributes (dimensions) • Summarization techniques are often applied to interactive exploratory data analysis and automated report generation. SUMMARIZATION (Fayyad et al. 1996)
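A minimal sketch of summarization, using Python's standard `statistics` module to compute the mean and sample standard deviation of each attribute. The records and attribute names are made up.

```python
import statistics

def summarize(rows):
    """Compute mean and sample standard deviation for every attribute."""
    summary = {}
    for attr in rows[0]:
        values = [row[attr] for row in rows]
        summary[attr] = {
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),
        }
    return summary

# Hypothetical survey records
rows = [
    {"age": 30, "income": 2000},
    {"age": 40, "income": 3000},
    {"age": 50, "income": 4000},
]
summary = summarize(rows)
```

The resulting dictionary is a compact description of the data set: one mean and one standard deviation per attribute instead of the full list of records.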
  • 24. • Estimating the relationship among variables (with a regression function) • It includes many techniques for modeling and analyzing • Focuses on the relationship between a dependent variable and one or more independent variables. • Regression function is a learning function based on the data • Applications in prediction and REGRESSION ANALYSIS (Fayyad et al. 1996)
  • 25. REGRESSION EXAMPLE LINEAR REGRESSION (Image is public domain, from Wikipedia 2015, Regression Analysis)
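Simple linear regression with one independent variable can be sketched with the ordinary least squares formulas. The data points are invented, and the helper name `fit_line` is only illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: x is the independent, y the dependent variable
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(xs, ys)

# The fitted function can then be used for prediction
predicted_y = slope * 6 + intercept
```

For these numbers the fitted slope is about 1.97 and the intercept about 0.09, so the regression function "learned" from the data is close to y = 2x.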
  • 26. • Finds significant dependencies between the data variables • Two levels • Structural level defining which variables are dependent (can be in graphical form) • Quantitative level defining the strength of the dependency in numeric form • E.g. Correlation analysis • E.g. Probabilistic dependency networks DEPENDENCY MODELING (Fayyad et al. 1996)
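The two levels can be sketched with correlation analysis: the Pearson coefficient gives the quantitative level, and keeping only the pairs above a threshold gives a crude structural level. The variables, their values and the 0.8 threshold are all assumptions made for illustration.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical variables measured for five respondents
data = {
    "education": [1, 2, 3, 4, 5],
    "income": [2, 4, 6, 8, 10],
    "shoe_size": [3, 1, 4, 1, 5],
}

# Quantitative level: strength of dependency for every pair of variables
correlations = {
    (a, b): pearson(data[a], data[b]) for a, b in combinations(data, 2)
}
# Structural level: which pairs count as dependent at all
dependencies = {pair: r for pair, r in correlations.items() if abs(r) >= 0.8}
```

Only the education/income pair passes the threshold here. A strong correlation still says nothing about which variable, if either, causes the other.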
  • 27. CORRELATION DOES NOT IMPLY CAUSATION (XKCD: Correlation, http://imgs.xkcd.com/comics/correlation.png)
  • 28. • Change and deviation detection • Has the data changed from some previously known stable state or from some previously measured normative values (“normal range”)? • Time scales matter: a short-term anomaly may actually be normal in the long term. • Synchronic change (anomalies in stable processes) and diachronic change (deeper change in generative structures of the process) • Quite a dynamic category ANOMALY DETECTION (Fayyad et al. 1996)
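One simple way to check new values against a previously measured “normal range” is a z-score test, sketched below. The baseline measurements and the threshold of three standard deviations are assumptions; many other anomaly detection methods exist.

```python
import statistics

def find_anomalies(baseline, new_values, threshold=3.0):
    """Flag values that lie far outside the baseline's normal range."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    # A value is anomalous if it is more than `threshold` standard
    # deviations away from the baseline mean
    return [v for v in new_values if abs(v - mean) / stdev > threshold]

# Hypothetical measurements from a previously known stable state
baseline = [10, 11, 9, 10, 12, 10, 9, 11]
# New observations to compare against that state
new_values = [10, 25, 11, -4]
anomalies = find_anomalies(baseline, new_values)
```

The choice of baseline period matters: a longer baseline that already contains values like 25 would widen the normal range and could make the same observation look ordinary.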
  • 29. • Cioffi-Revilla (2014) lists, for example, vocabulary analysis, correlation, lexical analysis, spatial analysis, semantic analysis, sentiment analysis, similarity analysis, clustering, network analysis, sequence analysis, intensity analysis, anomaly detection, sonification analysis • The most important thing is to understand the ins and outs of the analysis model you are using: what it is for and how it behaves under the hood • The relationship of the model to your research question AND MANY OTHERS…
  • 30. • Basically means that a data analysis algorithm is able to “learn” and enhance its performance iteratively from the data • 1. Supervised machine learning • The algorithm is trained on some known labeled data (input/target pairs) • e.g. Netflix is able to suggest better movies to you based on how you use it: by watching and rating films you are teaching the machine how to suggest better movies to you • 2. Semi-supervised machine learning • The algorithm is trained with a small set of labeled data (input/target pairs) and a set of unlabeled data • 3. Unsupervised machine learning • No result-set data is given for the machine to learn from • The algorithm is able to find patterns and structures in the data automatically without any pre-learning • 4. Reinforcement machine learning • The algorithm has a certain goal and it interacts with a dynamic environment, which gives it rewards based on its actions MACHINE LEARNING
  • 32. • Ready Data Sets = Many public data sets provided by different institutions • Web APIs = Application programming interfaces that give you data in a structured format. For example, Facebook and Twitter have APIs for getting data • Web Scraping = Gathering the information automatically from web pages, when it is allowed. • Databases = Querying databases directly with query languages (e.g. SQL) • Custom data gathering process = the traditional research data gathering (surveys, interviews…) • Open Data and Open Science are growing trends: governments are opening up and providing APIs and data sets for different kinds of public data (e.g. fiscal information, expenses) DATA SOURCES: MAIN TYPES
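Querying a database directly with SQL can be sketched with Python's built-in `sqlite3` module and an in-memory database. The table name, columns and rows are invented; a real research database would already exist and would typically be reached with credentials from inside the organisation.

```python
import sqlite3

# An in-memory database stands in for a real research database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (respondent TEXT, party TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO responses VALUES (?, ?, ?)",
    [("a", "X", 4), ("b", "X", 2), ("c", "Y", 5)],
)

# The SQL query itself: average score per party
rows = conn.execute(
    "SELECT party, AVG(score) FROM responses GROUP BY party ORDER BY party"
).fetchall()
conn.close()
```

The query returns structured rows directly, so no scraping or parsing step is needed before analysis.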
  • 39. • The Internet is full of open datasets of different kinds! Some examples: • Economics • American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete • Gapminder: http://www.gapminder.org/data/ • UMD: http://inforumweb.umd.edu/econdata/econdata.html • World Bank: http://data.worldbank.org/indicator • Finance • CBOE Futures Exchange: http://cfe.cboe.com/Data/ • Google Finance: https://www.google.com/finance (R) • Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0 • St Louis Fed: http://research.stlouisfed.org/fred2/ (R) • NASDAQ: https://data.nasdaq.com/ • OANDA: http://www.oanda.com/ (R) • Quandl: http://www.quandl.com/ • Yahoo Finance: http://finance.yahoo.com/ (R) • Social Sciences • General Social Survey: http://www3.norc.org/GSS+Website/ • ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp • Pew Research: http://www.pewinternet.org/datasets/pages/2/ • SNAP: http://snap.stanford.edu/data/index.html • UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm • UPJOHN INST: http://www.upjohn.org/erdc/erdc.html • FROM: http://www.inside-r.org/howto/finding-data-internet INTERNET IS FULL OF DATA
  • 40. WEB SCRAPING, APIS & DATABASES [Diagram: a data provider organisation holds a database, an API (application programming interface) and a public WWW page. Access via the Internet happens through API calls or automated web scraping. The database is typically accessed only from inside the organisation and not via the Internet.]
  • 41. • Web services and applications (such as Twitter, Facebook,…) provide Web APIs so that others are able to build their services using some functionality or data based on the data provider’s Web API / Web service • Using APIs is the structured and “right” way to get data from a web service • The use of APIs is controlled by the data provider: they are thus used with the data provider’s permission • Some APIs charge according to usage, some have other conditions for use • Needs programming to connect API (APPLICATION PROGRAMMING INTERFACE)
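The programming needed to use a Web API usually comes down to building a request URL and parsing a structured response. The sketch below shows both steps without touching the network. The endpoint `https://api.example.org/v1/posts`, its parameters and the JSON layout are hypothetical; every real API documents its own, and most also require an access key.

```python
import json
from urllib.parse import urlencode

def build_request_url(base_url, **params):
    """Combine an endpoint and query parameters into a request URL."""
    return f"{base_url}?{urlencode(params)}"

def extract_texts(response_body):
    """Parse a JSON response and pull out the text of each result."""
    payload = json.loads(response_body)
    return [item["text"] for item in payload["results"]]

# Hypothetical endpoint and parameters
url = build_request_url("https://api.example.org/v1/posts", q="big data", limit=2)

# Hypothetical response body, as the API might return it
sample_response = (
    '{"results": [{"id": 1, "text": "first post"},'
    ' {"id": 2, "text": "second post"}]}'
)
texts = extract_texts(sample_response)
```

In a real script the response body would come from an HTTP call, for example with `urllib.request.urlopen(url)`, made within the provider's terms of use and rate limits.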
  • 44. • Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. (Wikipedia 2015, Web Scraping) • Transforms unstructured data in HTML format into some structured format for further analysis • Used when you do not have access to the original database or when there are no APIs • NOTE! Always make sure that scraping is allowed and legal! This is not always the case, as some websites and services explicitly forbid web scraping. • Numerous tools varying from manual to semi-manual to fully automatic • High-level scraping services • Browser plugin tools • Programming libraries WEB SCRAPING
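Turning HTML into a structured format can be sketched with Python's standard `html.parser` module. The HTML snippet and the choice of `<h2>` elements as the target are made up; dedicated scraping libraries offer far more convenient selectors, and the standard library is used here only to keep the example self-contained.

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is appended to the same headline
        if self.in_headline:
            self.headlines[-1] += data

# Hypothetical page content; a real scraper would first download the page
html = (
    "<html><body><h2>Election results</h2><p>text</p>"
    "<h2>Weather <b>today</b></h2></body></html>"
)
parser = HeadlineParser()
parser.feed(html)
parser.close()
headlines = [h.strip() for h in parser.headlines]
```

The output is a plain list of strings, which is the kind of structured format that later analysis steps can work with.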
  • 49. • Python • Scrapy: http://scrapy.org • BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/ • Scrapemark: http://arshaw.com/scrapemark/ (not maintained anymore) • R • rvest: http://cran.r-project.org/web/packages/rvest/index.html WEB SCRAPING LIBRARIES
  • 50. • Watch “The Beauty of Data Visualization” by David McCandless: http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization?language=en VISUALIZING DATA
  • 51. • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37. • De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid LECTURE 3 READING
  • 52. • Cioffi-Revilla, C. 2014. Introduction to Computational Social Science. Springer-Verlag, London • Elliot, T. 2013. 7 Definitions of Big Data You Should Know About. http://timoelliott.com/blog/2013/07/7-definitions-of-big-data-you-should-know-about.html • Eynon, R. 2013. The rise of Big Data: what does it mean for education, technology, and media research? Learning, Media and Technology, 38:3, 237-240, DOI: 10.1080/17439884.2013.771783. • Gartner, 2014. IT Glossary: Big Data. http://www.gartner.com/it-glossary/big-data/ • Graham, M. 2013. The Virtual Dimension. Global City Challenges: Debating a Concept, Improving the Practice, M. Acuto and W. Steele, 2013. London: Palgrave. 117-139. • De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37. • IBM, 2014a. What is big data? http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html • IBM, 2014b. The Four V’s of Big Data. http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg • Sicular, S. 2013. Gartner's Big Data Definition Consists of Three Parts, Not to Be Confused with Three "V"s. Forbes, 3/27/2013. http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/ • Yasseri, T.; Bright, J. 2013. Can electoral popularity be predicted using socially generated big data? Oxford Internet Institute, University of Oxford. 2013. REFERENCES
  • 53. Thank You! Questions and comments? twitter: @laurieloranta