Data Science:
Notes and Toolkits
Dr. Haralambos Marmanis
Waltham, MA
April, 2014
___________________________________
Web: http://www.marmanis.com
Email: h@marmanis.com
Copyright (c) 2014 H. Marmanis. All rights reserved.
What is Science?
• Science is the systematic, data-based pursuit of knowledge
through reason
• Science is not about what we believe; it is about how we arrived
at what we believe
• Science has always relied on data, e.g. Copernicus’ and Kepler’s
theories needed Brahe’s data to grow and prosper
• The word “Science”, for most people, points to specific subject
areas such as Physics, Chemistry, etc.
• However, the methodology is not a priori restricted to these
fields; nearly everything that is taught in a university is the
outcome of a scientific endeavor
What is Data Science?
The systematic, data-based pursuit of knowledge through reason
in non-traditional fields,
i.e. applying the same methodology that is applied in physics,
chemistry, biology, etc. to fields like e-commerce, social networking,
finance, energy, marketing, and so on.
Why should I care?
• Scientists rejoice! There has never been a better time to be a data
scientist, and business analysts say so too.
• If you are a scientist today, you can become
the next Newton,
the next Maxwell,
the next Einstein in your field!
• These slides will provide you with an overview of Notes and Tools
that are necessary, although not sufficient, for achieving your own
discoveries
• The content of the slides is taken from my (forthcoming) book:
“The Data Science Revolution:
An overview of the field and its applications”
• Benefits range from “pats on the back” to a salary increase or a
generous bonus, and from corporate recognition to international
fame! So your mileage may vary, but it’s all good!
Where do I start?
1. The first thing that you need to start is a problem
2. The second is an understanding of the problem. An
understanding implies the following:
• Clear description of the problem
• Clear objectives
• Measurable success criteria
3. The third is a set of data related to the problem
4. The fourth is a set of hypotheses
5. The fifth is a set of tools that will allow us to assess the
validity of our hypotheses based on the available data
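
As a minimal sketch of step 5, assume the hypothesis is "layout B yields a higher mean order value than layout A" and that the data are two small samples. The function name `permutation_test`, the sample names, and all numbers are hypothetical and chosen only for illustration. A permutation test is one simple tool among many: it estimates how often a difference at least as large as the observed one would appear if the labels were assigned by chance.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Estimate a one-sided p-value for mean(b) > mean(a) by shuffling labels."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = mean(b) - mean(a)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        # Treat the first len(a) shuffled values as "A" and the rest as "B".
        diff = mean(pooled[len(a):]) - mean(pooled[:len(a)])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical data: order values under two page layouts (illustrative only).
layout_a = [20.1, 19.8, 21.0, 20.5, 19.9, 20.3]
layout_b = [22.4, 23.1, 21.9, 22.8, 23.5, 22.0]

p_value = permutation_test(layout_a, layout_b)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests the data are hard to explain by chance alone; whether that counts as success depends on the measurable criteria fixed in step 2.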
Buzzword overview
• Big Data
• Data Analysis
• Intelligent Web
• Machine Learning
• Artificial Intelligence
• Statistical Analysis
What you really need …
• Domain Expertise
• Science
• Engineering
Domain expertise
• Each domain defines its own “universe” that, like our physical
universe, waits to be explored by scientific means
• You do not have to be a domain expert yourself but you
should be able to grasp all the fundamentals quickly and
accurately
• Examples (just a few – this is practically endless):
• Supply chain management
• Auctions for Ads
• Financial derivatives pricing
• Mortgage risk assessment
• Drug discovery
Science
• A firm background in mathematics is essential; not just statistics!
• Applied Mathematics
• A firm understanding of the scientific method
1. Aggregate the questions/problems to be answered/solved
2. Conceptualize the problem’s domain
3. Formulate hypotheses → build models
4. Describe the problems based on the models
5. Solve the problems
6. Validate the solutions
7. Repeat steps 3 through 6, as needed
• Scientific computing
• Numerical Methods
• Visualization
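
Steps 3 through 7 of the scientific method above can be sketched as a small loop: each hypothesis becomes a model, the model is fitted to part of the data, and it is validated on data it has not seen. Everything here is an assumption made for illustration: the two candidate hypotheses, the function names, the observations, and the `TOLERANCE` success criterion.

```python
from statistics import mean


def fit_constant(xs, ys):
    """Hypothesis 1: y does not depend on x."""
    c = mean(ys)
    return lambda x: c


def fit_linear(xs, ys):
    """Hypothesis 2: y depends linearly on x (ordinary least squares)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def validation_error(model, xs, ys):
    """Mean squared error on data the model has not seen."""
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


# Hypothetical observations (illustrative only), split into fit and validation sets.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
valid_x, valid_y = [7, 8, 9], [14.1, 15.9, 18.2]

TOLERANCE = 0.5  # assumed success criterion for the validation error
accepted = None
for name, fit in [("constant", fit_constant), ("linear", fit_linear)]:  # steps 3-4
    model = fit(train_x, train_y)                                        # step 5
    error = validation_error(model, valid_x, valid_y)                    # step 6
    print(f"{name}: validation MSE = {error:.3f}")
    if error <= TOLERANCE:
        accepted = name
        break
    # Otherwise step 7: go back and try the next hypothesis.
```

Validating on held-out data rather than on the fitting data is a design choice of this sketch; it keeps a model from being accepted merely because it memorized the observations.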
Engineering
• Engineering is the systematic application of knowledge for the
purpose of designing, implementing, and maintaining physical
or virtual constructs in a way that optimizes multiple
objectives (e.g. cost, functional effectiveness, operational
efficiency, etc.) while respecting all applicable constraints.
• In the context of Data Science, engineering skills are required
for effectively integrating the scientific solution into the
real-world system (e.g. an online retail store, a social networking
site, a financial tool)
• In particular, software engineering proficiency is crucial, since
all the “objects of observation” are effectively digital and
accessible only through some software system
Computational environments
Name     Language                        Purpose                License
MATLAB   C, C++, Java, MATLAB            General                Proprietary
Scilab   C, C++, Java, Fortran, Scilab   General                CeCILL (Open Source)
Octave   (not listed)                    General                GNU GPL
R        C, Fortran, R                   Statistical, Graphics  GNU GPL
Julia    C, C++, Scheme                  General                MIT License
ScaVis   Java                            General                Mixed
SciPy    C, Fortran, Python              General                BSD
Scientific Libraries
• Basic Linear Algebra Subprograms (BLAS) written in Fortran
• Linear Algebra Package (LAPACK) written in Fortran 90
• Numerical Algorithms Group (NAG) libraries
• GraphLab -- GraphLab API is written in C++
• MTJ -- Matrix Toolkit that integrates BLAS and LAPACK in Java
• EJML – linear algebra library written in Java
• Commons Math – Apache project that offers a lightweight,
self-contained, library for mathematics and statistics
• NumPy – support for matrices and high-level mathematical
functions for Python
• SciPy – efficient numerical routines for Python, including
integration and optimization
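
To make "numerical integration" concrete, here is a standard-library-only sketch of composite Simpson's rule. It is not how SciPy or any library listed above implements integration; the function name `simpson` and its parameters are chosen for this illustration, and a production project would normally call a tested library routine instead.

```python
import math


def simpson(f, a, b, n=100):
    """Approximate the integral of f over [a, b] with composite Simpson's rule."""
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even number of subintervals")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weight 4 (odd i) and weight 2 (even i).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


# The integral of sin(x) over [0, pi] is exactly 2, which gives a known reference.
approx = simpson(math.sin, 0.0, math.pi)
print(f"approximation: {approx:.8f}, error: {abs(approx - 2.0):.2e}")
```

Checking a numerical routine against an integral with a known closed form, as done here, is a cheap first validation before applying it to problems without one.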
Machine Learning libraries
• JGAP – Genetic algorithms library
• Encog – Neural networks library
• Opt4J – Evolutionary computation library
• Weka – Clustering and classification algorithms
• Yooreeka – Search, recommendations, clustering,
classification, and mathematical analysis
Big Data technologies
• Hadoop – open-source software for reliable, scalable, distributed
computing
• OpenCL – open royalty-free standard for cross-platform, parallel
programming of modern processors found in personal computers,
servers and handheld/embedded devices
• Cloudify – Provision, configure, orchestrate, and monitor large
distributed systems on the cloud
• Spring XD -- a unified, distributed, and extensible system for data
ingestion, real time analytics, batch processing, and data export
• ProActive Parallel Suite -- an open source solution that enables the
orchestration of applications and seamlessly integrates with the
management of high-performance clouds
• Ibis -- an efficient Java-based platform for distributed computing
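
Hadoop is best known for the MapReduce programming model. The sketch below imitates that model's three phases (map, shuffle, reduce) inside a single Python process, so it shows the idea rather than the Hadoop API. The function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input: on a real cluster each document could live on a different node.
documents = ["big data big ideas", "data science", "Big science"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls could run on separate machines, which is the property a distributed framework exploits.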
The Data Science Revolution:
An overview of the field and its applications
