Introduction to Data Science
Prepared by Mahir Mahtab Haque
What is Data Science?
• It is a set of methodologies for taking the many forms of data available to us today and using them to draw meaningful conclusions.
• Purpose of Data Science:
- Describe the current state of an organization or process
- Detect anomalous events
- Diagnose the causes of events and behaviors
- Predict future events
• Data Science Workflow:
- Collect data from various sources such as surveys, web traffic results, geo-tagged social media posts and financial transactions. Once the data has been collected, store it in a safe and accessible way.
- Prepare the raw data, also known as ‘cleaning the data’, which involves finding missing or duplicate values and converting the data into a more organized format.
- Explore and visualize the cleaned data by building dashboards to track how the data changes over time or by comparing two sets of data.
- Run experiments and predictions on the data, for example building a system that forecasts temperature changes or running a test to find which web page acquires more customers.
3 exciting areas of Data Science
• Machine Learning:
- Start with a well-defined question (What is the probability that this transaction is fraudulent?)
- Gather some data to analyze (Old transactions labeled as fraudulent/valid)
- Bring in new data to make predictions on (New credit card transactions)
• Internet of Things (IoT):
- Refers to gadgets that are not standard computers but can still transmit data.
- Includes smart watches, internet-connected home security systems, electronic toll collection systems, building energy management systems, etc.
- IoT is a great source of data for data science projects.
• Deep Learning:
- A sub-field of machine learning in which multiple layers of simple computational units called ‘neurons’ work together to draw complex conclusions.
- Deep learning needs much more ‘training data’ (the records used to build a model) than a traditional machine learning model, and it can learn relationships that traditional models cannot.
- Deep learning is used to solve data-intensive problems such as image classification or language understanding.
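As a rough illustration of layers of neurons working together, the sketch below passes one input through two small layers in plain Python. The weights, biases, layer sizes and the `dense_layer` helper are made-up values chosen for illustration; this is not a trained model.

```python
import math

def sigmoid(x):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of all inputs,
    adds its bias and applies the sigmoid function."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

# Hypothetical parameters: a hidden layer of 3 neurons and an output layer of 1 neuron.
hidden_weights = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.5, 0.2]]
output_biases = [0.05]

features = [1.0, 2.0]  # one made-up input record with two features
hidden = dense_layer(features, hidden_weights, hidden_biases)
prediction = dense_layer(hidden, output_weights, output_biases)[0]

print("hidden activations:", hidden)
print("output:", round(prediction, 3))
```

Training a real network means adjusting these weights and biases from training data, which is what deep learning libraries automate.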
Data Science Roles and Tools
• Data Engineer
- Responsibilities: They control the flow of data by building custom data pipelines and storage systems. They design infrastructure so that data is not only collected but also easily obtained and processed.
- Focus area: Data collection and storage
- Tools: SQL for storing and organizing data; Java, Scala or Python for processing data; Shell on the command line to automate and run tasks.
• Data Analyst
- Responsibilities: They describe the data by exploring it and creating visualizations and dashboards. To do this, they first need to clean the data.
- Focus area: Data preparation; Exploration and Visualization
- Tools: SQL for querying data (using existing databases to retrieve and aggregate relevant data); spreadsheets to perform simple analyses on small quantities of data; Tableau, Power BI or Looker to create dashboards and share analyses; Python/R can also be used for cleaning and analyzing data.
• Data Scientist
- Responsibilities: They find new insights from data and use traditional machine learning for prediction and forecasting.
- Focus area: Data preparation; Exploration and Visualization; Experimentation and Prediction
- Tools: SQL, Python or R proficiency; data science libraries that provide reusable code for common data science tasks.
• Machine Learning Scientist
- Responsibilities: Very similar to Data Scientists. They predict what’s likely to be true from what we already know. These scientists use training data to classify larger, unrulier data, whether that means classifying images that contain a car or creating a chatbot.
- Focus area: Data preparation; Exploration and Visualization; Experimentation and Prediction
- Tools: Python/R to create predictive models; popular machine learning libraries (TensorFlow) to run powerful deep learning algorithms.
Step 1: Data collection & storage
• Vast amounts of data are generated daily by activities ranging from surfing the internet to paying by card in a shop. The companies behind these services collect the data internally and use it to make data-driven decisions. There are also many free, open data sources, whose data can be freely used, shared and built on by anyone.
• Company data sources:
- Web events
- Customer data
- Survey data
- Logistics data
- Financial transactions
• Open data sources:
- Public data APIs (application programming interfaces): Twitter, Wikipedia, Yahoo! Finance, Google Maps
- Public records (international organizations such as the World Bank, UN and WTO; national statistical offices; government agencies)
Types of data
• Quantitative data: Data that can be counted, measured and expressed using numbers.
• Qualitative data: Data that is descriptive and conceptual, i.e., something that can be observed but not measured.
• Image data: An image is made up of pixels, which contain information about color and intensity. Typically, the pixels are stored in computer memory.
• Text data: Emails, documents, reviews, social media posts, etc. These data can be stored and analyzed to find relevant insights.
• Geospatial data: Data with location information, especially useful for navigation apps like Google Maps/Waze.
• Network data: Data consisting of people or things in a network and the relationships between them.
Data storage and retrieval
• When storing data, there are 3 important things to consider:
- Where to store the data
- What kind of data we are storing
- How we can retrieve the data from storage
• Location:
- On-premises cluster, i.e., data stored across many different computers
- Cloud storage (MS Azure, Amazon Web Services, Google Cloud), whose providers also offer services for data analytics, machine learning and deep learning.
• Types of data storage:
- Unstructured data (email, text, video & audio, web pages, social media messages) is stored in a document database
- Tabular data is stored in a relational database
• Data retrieval (each type of database has its own query language):
- Document databases mainly use NoSQL (Not only SQL)
- Relational databases use SQL (Structured Query Language)
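To make retrieval from a relational database concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The `transactions` table, its columns and the rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory relational database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5)],
)

# Retrieve and aggregate: total spend per customer, largest total first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM transactions GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```

The same `SELECT ... GROUP BY` pattern is what an analyst would typically run against an existing company database to retrieve and aggregate relevant data.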
Data Pipelines
• Pipelines move data through defined stages, e.g., from data ingestion through an API to loading the data into a database.
• A key feature is that pipelines automate this movement.
- Rather than manually running programs to collect and store data, a data engineer schedules tasks to run hourly or daily, or to be triggered by an event.
- Because of this automation, data pipelines need to be monitored. Alerts can be generated automatically, for example when 95% of storage capacity has been reached or when an API responds with an error.
- Data pipelines are important when working with lots of data from different sources.
• There is no set way to build a pipeline: pipelines are highly customized depending on your data, storage options and ultimate usage of the data.
• ETL (extract, transform and load) is a popular framework for data pipelines.
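The three ETL stages can be sketched as three small functions. This is only an illustration under stated assumptions: the raw CSV text, the `payments` table and the function names are hypothetical, and a real pipeline would read from an API or file and be run by a scheduler.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this might come from an API or a file.
RAW_CSV = """user_id,amount,currency
1,10.50,usd
2,,usd
3,7.25,USD
"""

def extract(raw_text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(records):
    """Transform: drop rows with a missing amount, convert types, normalize casing."""
    cleaned = []
    for record in records:
        if not record["amount"]:
            continue  # skip rows where the amount is missing
        cleaned.append(
            (int(record["user_id"]), float(record["amount"]), record["currency"].upper())
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT user_id, amount, currency FROM payments ORDER BY user_id"
).fetchall()
print(loaded)
```

Keeping each stage in its own function makes it easier to schedule, monitor and replace one stage without touching the others.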
Steps 2 & 3: Data preparation, Exploratory Data Analysis & Visualization
• Data preparation:
- Skipping this step may lead to errors down the line, such as incorrect results that throw off your algorithms.
- Tidy data is a way of presenting a matrix of data, with observations as rows and variables as columns.
• Exploratory Data Analysis (EDA):
- It is the process of exploring the data, formulating hypotheses about it and assessing its main characteristics, with a strong emphasis on visualization. EDA takes place after data preparation, but in practice the two steps often overlap.
• Visualization:
- Dashboards group all relevant information in one place to make it easier to gather insights and act on them.
- Business Intelligence tools let you clean, explore and visualize data and build dashboards without requiring any programming knowledge. Examples: Tableau, Looker, Power BI
- Note: Make your visualizations interactive and use filters.
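For the data preparation step, a minimal sketch of removing duplicate rows and rows with missing values might look like the following. The records and the helper names `drop_duplicates` and `drop_missing` are hypothetical; each dictionary is one observation and each key is one variable, which matches the tidy layout described above.

```python
# Hypothetical raw records: one exact duplicate and one missing measurement.
raw_rows = [
    {"city": "A", "year": 2020, "temp_c": 26.1},
    {"city": "A", "year": 2020, "temp_c": 26.1},  # duplicate of the first row
    {"city": "A", "year": 2021, "temp_c": None},  # missing value
    {"city": "B", "year": 2020, "temp_c": 7.0},
]

def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def drop_missing(rows):
    """Remove rows in which any variable has no value."""
    return [row for row in rows if all(value is not None for value in row.values())]

cleaned = drop_missing(drop_duplicates(raw_rows))
for row in cleaned:
    print(row)
```

Dropping incomplete rows is only one option; depending on the analysis, missing values might instead be filled in or flagged.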
Step 4: Running experiments and predictions
• A/B Testing (aka Champion/Challenger Testing) is used to choose between two options. These experiments help drive decisions and draw conclusions. Generally, they begin with a question and a hypothesis, followed by data collection, a statistical test and its interpretation.
• A/B Testing steps:
- Select a metric to track
- Calculate the sample size
- Run the experiment
- Check for significance (the result is likely not due to chance, given the statistical assumptions made)
• Case study: Which is the better title for a blog post?
- Form a question: Does the title of blog post A or blog post B result in more clicks?
- Form a hypothesis: The titles of blog posts A and B result in the same number of clicks.
- Collect data: 50% of users see blog title A and the other 50% see blog title B. Track the click-through rate until the sample size has been reached.
- Test the hypothesis with a statistical test (t-test, z-test, ANOVA, Chi-square test): Is the difference between the titles’ click-through rates significant?
- Interpret results: Choose a title, or ask more questions and design another experiment.
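The significance check in the case study can be sketched with a two-proportion z-test, one of the tests listed above. The click and view counts below are made up for illustration, and the 0.05 significance level is an assumed convention rather than something the slides specify.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for the difference between two click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Under the hypothesis of equal rates, pool both groups to estimate the rate.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_a - rate_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up results: title A got 120 clicks from 1000 views, title B got 90 from 1000.
z, p_value = two_proportion_z_test(clicks_a=120, views_a=1000, clicks_b=90, views_b=1000)
alpha = 0.05  # assumed significance level

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("significant" if p_value < alpha else "not significant")
```

A small p-value means the observed difference would be unlikely if both titles really had the same click-through rate.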
Time-series forecasting
• What is a statistical model?
- Represents a real-world process with statistics
- Describes mathematical relationships between variables, including random variables
- Is based on statistical assumptions and historical data
• Predictive modeling: A subcategory of modeling used for prediction.
- Process: A new input (e.g., a future date) is entered into a predictive model (e.g., a model of unemployment), which outputs a prediction (e.g., what the unemployment rate will be next month).
- Predictive models range from something as simple as a linear equation with an x and a y variable to a very complicated deep learning algorithm.
• Time-series data: A series of data points sequenced by time. Examples: daily stock prices, gas prices over the years
- It is often in the form of rates, such as monthly unemployment rates or a patient’s heart rate during surgery.
- Time-series data is usually plotted as a line graph.
- Seasonality occurs when there are repeating patterns related to time, such as months or weeks.
- Time-series data is used in predictive modeling to predict metrics at future dates, which is known as forecasting. We can build predictive models using time-series data from past years or decades to generate predictions. This uses a combination of statistical and machine learning methods.
- A confidence interval is the range in which the model is ‘X%’ sure that the true value will fall.
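As a minimal sketch of forecasting, the code below fits a straight-line trend to a short time series and predicts the next month. The unemployment figures are invented, the `fit_line` helper is hypothetical, and the interval is a rough approximation based only on the spread of the residuals; it ignores seasonality and other effects a real forecasting model would handle.

```python
import math

# Hypothetical monthly unemployment rates (%), oldest first.
rates = [5.0, 5.1, 5.3, 5.2, 5.4, 5.6, 5.5, 5.7]
months = list(range(len(rates)))

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(months, rates)
next_month = len(rates)
forecast = intercept + slope * next_month

# Rough ~95% band from the residual spread (an approximation, not an exact interval).
residuals = [y - (intercept + slope * x) for x, y in zip(months, rates)]
resid_std = math.sqrt(sum(r ** 2 for r in residuals) / (len(rates) - 2))
low, high = forecast - 1.96 * resid_std, forecast + 1.96 * resid_std

print(f"forecast for month {next_month}: {forecast:.2f}% (roughly {low:.2f}% to {high:.2f}%)")
```

The new input is the future month index, the predictive model is the fitted line, and the output is the predicted rate with its uncertainty band.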
Supervised machine learning
• Machine learning: A set of methods for making predictions based on existing data.
• Supervised machine learning: A subset of machine learning where the existing data has a specific structure, i.e., it has labels and features.
- Labels are what we want to predict.
- Features are data that might predict the label.
• Applications of supervised machine learning:
- Recommendation systems
- Diagnosing biomedical images
- Recognizing hand-written digits
- Predicting customer churn
• Case study: Customer churn prediction
- Customer: Will either stay subscribed or cancel the subscription (churn).
- Gather training data to build the model, i.e., historical customer data in which some customers maintained their subscriptions while others churned. We eventually want to predict the label for each customer (churned/subscribed), so we need features about each customer that might affect the label (age, gender, date of last purchase, household income). Machine learning can analyze many features simultaneously.
- We use these labels and features to train our model to make predictions on new data.
- It is good practice not to use all your historical data to train the model. The withheld data is called a test set, and it can be used to evaluate the efficacy of the model.
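The train/test idea from the churn case study can be sketched with a deliberately tiny model: a single threshold on one feature. Everything here is an assumption for illustration: the data is synthetic, the only feature is a hypothetical `days_since_last_purchase`, and real churn models would use many features and an established machine learning library.

```python
# Synthetic historical data: (days_since_last_purchase, label).
# The "days > 60" rule used to generate the labels is made up for this example.
customers = [
    (days, "churned" if days > 60 else "subscribed")
    for days in range(0, 120, 3)
]

# Withhold every 4th customer as a test set; real projects usually shuffle first.
test_set = customers[::4]
train_set = [c for i, c in enumerate(customers) if i % 4 != 0]

def predict(days, threshold):
    """Predict churn when the customer has been inactive longer than the threshold."""
    return "churned" if days > threshold else "subscribed"

def accuracy(data, threshold):
    """Share of customers whose predicted label matches the true label."""
    correct = sum(predict(days, threshold) == label for days, label in data)
    return correct / len(data)

# "Training": pick the threshold that best separates the labels in the training data.
candidates = sorted({days for days, _ in train_set})
threshold = max(candidates, key=lambda t: accuracy(train_set, t))

# Evaluation: measure accuracy on data the model has never seen.
test_accuracy = accuracy(test_set, threshold)
print(f"learned threshold: {threshold} days, test accuracy: {test_accuracy:.2f}")
```

Because the test customers were not used to choose the threshold, the test accuracy is a fairer estimate of how the model would behave on new customers than the training accuracy.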
Unsupervised learning
• Clustering: A set of machine learning algorithms that divide data into categories called clusters.
- Clusters help us see patterns in messy datasets.
- Machine learning scientists use clustering to divide customers into segments, images into categories or behaviors into typical and anomalous.
- Clustering is part of a broader category within machine learning called ‘unsupervised learning’. Unlike supervised learning, which uses data with features and labels, unsupervised learning uses data with only features. These features are basically measurements.
- Some clustering algorithms need us to define how many clusters we want to create. The number of clusters we ask for greatly affects how the algorithm segments our data, so it is usually chosen based on a hypothesis about the data.
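To show how the requested number of clusters changes the result, here is a small sketch of k-means, one common clustering algorithm that needs the number of clusters up front. The slides do not name a specific algorithm, so k-means is an assumption here, as are the 2-D points and the simple choice of using the first k points as starting centers.

```python
def k_means(points, k, iterations=10):
    """Very small k-means for 2-D points; the first k points seed the centers."""
    centers = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster (keep it if empty).
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Made-up measurements (features only, no labels).
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.5, 9.5), (1.0, 0.5), (9.0, 8.0)]

centers2, clusters2 = k_means(points, k=2)
centers3, clusters3 = k_means(points, k=3)

print("k=2 cluster sizes:", [len(c) for c in clusters2])
print("k=3 cluster sizes:", [len(c) for c in clusters3])
```

Running the same data with k=2 and k=3 gives different segmentations, which is why the number of clusters should reflect a hypothesis about the data.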
THANK YOU!

 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 

Introduction to data science

  • 2. What is Data Science?
     It is a set of methodologies for taking in thousands of forms of data that are available to us today and using them to draw meaningful conclusions.
     Purpose of Data Science:
      - Describe the current state of an organization or process
      - Detect anomalous events
      - Diagnose the causes of events and behaviors
      - Predict future events
     Data Science Workflow:
      - Collect data from various sources: surveys, web traffic results, geo-tagged social media posts, financial transactions, etc. Once data have been collected, we store that data in a safe and accessible way.
      - Prepare the raw data, also known as ‘cleaning the data’, which involves finding missing or duplicate values and converting data into a more organized format.
      - Explore and visualize the cleaned data by building dashboards to track how data changes over time or performing comparisons between two sets of data.
      - Run experiments and predictions on the data, for example building a system that forecasts temperature changes or performing a test to find which web page acquires more customers.
  • 3. 3 exciting areas of Data Science
     Machine Learning:
      - Starts with a well-defined question (What is the probability that this transaction is fraudulent?)
      - Gather some data to analyze (Old transactions labeled as fraudulent/valid)
      - Bring in new data to make predictions on (New credit card transactions)
     Internet of Things (IoT):
      - Refers to gadgets that are not standard computers but still have the ability to transmit data.
      - Includes smart watches, internet-connected home security systems, electronic toll collection systems, building energy management systems, etc.
      - IoT is a great source of data for data science projects.
     Deep Learning:
      - A sub-field of machine learning, where multiple layers of simple computational units called ‘neurons’ work together to draw complex conclusions.
      - Deep learning needs much more ‘training data’ (the records of data used to build an algorithm) than a traditional machine learning model, and it is also able to learn relationships that traditional models cannot.
      - Deep learning is used to solve data-intensive problems such as image classification or language understanding.
  • 4. Data Science Roles and Tools
     Data Engineer
      - Responsibilities: They control the flow of data by building custom data pipelines and storage systems. They design infrastructure so that data is not only collected but also easily obtained and processed.
      - Focus area: Data collection and storage
      - Tools: SQL for storing and organizing data; Java, Scala or Python for processing data; Shell on the command line to automate and run tasks.
     Data Analyst
      - Responsibilities: They describe the data by exploring it and creating visualizations and dashboards. To do this, they first need to clean the data.
      - Focus area: Data preparation & Exploration and Visualization
      - Tools: SQL for querying data (using existing databases to retrieve and aggregate relevant data); spreadsheets to perform simple analyses on small quantities of data; Tableau, Power BI or Looker to create dashboards and share analyses; Python/R can also be used for cleaning and analyzing data.
     Data Scientist
      - Responsibilities: They find new insights from data and use traditional machine learning for prediction and forecasting.
      - Focus area: Data preparation, Exploration and Visualization & Experimentation and Prediction
      - Tools: SQL, Python or R proficiency; data science libraries that provide reusable code for common data science tasks.
     Machine Learning Scientist
      - Responsibilities: Very similar to Data Scientists. They extrapolate what is likely to be true from what we already know. These scientists use training data to classify larger, unrulier data, whether to classify images that contain a car or to create a chatbot.
      - Focus area: Data preparation, Exploration and Visualization & Experimentation and Prediction
      - Tools: Python/R to create predictive models; popular machine learning libraries (TensorFlow) to run powerful deep learning algorithms.
  • 5. Step 1: Data collection & storage
     Vast amounts of data are generated daily, from surfing the internet to paying by card in a shop. The companies behind the services we use collect these data internally and use them to make data-driven decisions. There are also many free, open data sources available. Open data can be freely used, shared and built on by anyone.
     Company data sources:
      - Web events
      - Customer data
      - Survey data
      - Logistics data
      - Financial transactions
     Open data sources:
      - Public data APIs (Application Programming Interfaces): Twitter, Wikipedia, Yahoo! Finance, Google Maps
      - Public records (international organizations such as World Bank, UN, WTO; national statistical offices; government agencies)
  • 6. Types of data
     Quantitative data: Data that can be counted, measured and expressed using numbers.
     Qualitative data: Data that is descriptive and conceptual; something that can be observed but not measured.
     Image data: An image is made up of pixels. These pixels contain information about color and intensity. Typically, the pixels are stored in computer memory.
     Text data: Emails, documents, reviews, social media posts, etc. These data can be stored and analyzed to find relevant insights.
     Geospatial data: Data with location information, especially useful for navigation apps like Google Maps/Waze.
     Network data: Data consisting of people or things in a network and the relationships between them.
  • 7. Data storage and retrieval
     When storing data, there are 3 important things to consider:
      - Determining where to store the data
      - Knowing what kind of data we are storing
      - Knowing how we can retrieve the data from storage
     Location:
      - On-premises cluster, i.e., data stored across many different computers
      - Cloud storage (MS Azure, Amazon Web Services, Google Cloud), whose providers can also carry out data analytics, machine learning and deep learning.
     Types of data storage:
      - Unstructured data (email, text, video & audio, web pages, social media messages) is stored in a Document Database
      - Tabular data is stored in a Relational Database
     Data retrieval (each type of database has its own query language):
      - Document Databases mainly use NoSQL (Not only SQL)
      - Relational Databases use SQL (Structured Query Language)
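The difference between the two storage types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table name `transactions`, the sample records and the variable names are invented, an in-memory SQLite database stands in for a relational database, and a plain list of JSON strings stands in for a document database (real document databases have their own query APIs).

```python
import json
import sqlite3

# Relational (tabular) storage: fixed columns, queried with SQL.
conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 14.5)],
)
# Retrieve and aggregate: total spend per customer.
totals = dict(
    conn.execute(
        "SELECT customer, SUM(amount) FROM transactions GROUP BY customer"
    ).fetchall()
)
conn.close()

# Document-style storage: each record is a free-form JSON document,
# and documents do not have to share the same fields.
documents = [
    json.dumps({"type": "email", "from": "alice", "body": "Order received"}),
    json.dumps({"type": "post", "author": "bob", "text": "Great service", "likes": 3}),
]
# Retrieval filters on fields that may or may not exist in a given document.
emails = [doc for doc in map(json.loads, documents) if doc.get("type") == "email"]
```

The SQL query relies on every row having the same columns, while the document filter has to tolerate missing fields, which is why it uses `dict.get`.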
  • 8. Data Pipelines
     Data pipelines move data through defined stages, e.g., from data ingestion through an API to loading the data into a database.
     A key feature is that pipelines automate this movement.
      - Rather than manually running programs to collect and store data, a data engineer schedules tasks to run hourly or daily, or to be triggered by an event.
      - Because of this automation, data pipelines need to be monitored. Alerts can be generated automatically, for example if 95% of storage capacity has been reached or if an API is responding with an error.
      - Data pipelines are important when working with lots of data from different sources.
     There is no set way to make a pipeline. Pipelines are highly customized depending on your data, storage options and ultimate usage of the data.
     ETL (extract, transform and load) is a popular framework for data pipelines.
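A minimal sketch of an ETL pipeline with a monitoring alert might look like the following. Everything here is an assumption made for illustration: the function names, the hard-coded records that stand in for an API response, and the list used as storage. A real pipeline would be started by a scheduler (hourly, daily or on an event), which is not shown.

```python
def extract():
    # Stand-in for calling an API; a real pipeline would fetch these records.
    return [
        {"user": "alice", "amount": "20.0"},
        {"user": "bob", "amount": "not-a-number"},
        {"user": "carol", "amount": "5.5"},
    ]


def transform(records):
    # Convert amounts to floats and drop records that cannot be parsed.
    clean = []
    for record in records:
        try:
            clean.append({"user": record["user"], "amount": float(record["amount"])})
        except ValueError:
            continue
    return clean


def load(rows, storage, capacity):
    # Append rows to storage and return any monitoring alerts.
    storage.extend(rows)
    alerts = []
    if len(storage) >= 0.95 * capacity:
        alerts.append("storage at or above 95% capacity")
    return alerts


def run_pipeline(storage, capacity):
    # One full run: extract, then transform, then load.
    return load(transform(extract()), storage, capacity)


storage = []
alerts = run_pipeline(storage, capacity=2)
```

Keeping each stage as a separate function makes it possible to test and monitor the stages independently, which is one reason the ETL split is popular.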
  • 9. Step 2 & 3: Data preparation, Exploratory Data Analysis & Visualization
     Data preparation:
      - Skipping this step may lead to errors down the line, such as incorrect results that throw off your algorithms.
      - Tidy Data is a way of presenting a matrix of data, with observations as rows and variables as columns.
     Exploratory Data Analysis (EDA):
      - It is a process that consists of exploring the data, formulating hypotheses about it and assessing its main characteristics, with a strong emphasis on visualization. It takes place after data preparation, but the two steps often overlap.
     Visualization:
      - Dashboards are used to group all relevant information in one place to make it easier to gather insights and act on them.
      - Business Intelligence tools let you clean, explore and visualize data and build dashboards without requiring any programming knowledge. Examples: Tableau, Looker, Power BI
      - Note: Make your visualizations interactive and use filters
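As a small illustration of data preparation, the sketch below removes duplicate rows and rows with missing values from a tidy table (one observation per row, one variable per key). The city names, temperatures and the `clean` helper are made up for the example. In practice a library such as pandas would usually do this work.

```python
raw = [
    {"city": "Dhaka", "month": "Jan", "temp_c": 19.0},
    {"city": "Dhaka", "month": "Jan", "temp_c": 19.0},  # exact duplicate
    {"city": "Dhaka", "month": "Feb", "temp_c": None},  # missing value
    {"city": "Khulna", "month": "Jan", "temp_c": 20.5},
]


def clean(rows):
    # Keep the first copy of each row and skip rows with a missing temperature.
    seen = set()
    tidy = []
    for row in rows:
        key = (row["city"], row["month"], row["temp_c"])
        if row["temp_c"] is None or key in seen:
            continue
        seen.add(key)
        tidy.append(row)
    return tidy


tidy_rows = clean(raw)
# A first exploratory statistic computed on the cleaned data.
mean_temp = sum(row["temp_c"] for row in tidy_rows) / len(tidy_rows)
```

Computing the mean on `raw` instead would fail on the missing value and would count the duplicated row twice, which is the kind of error that skipping preparation causes.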
  • 10. Step 4: Running experiments and predictions
     A/B Testing (aka Champion/Challenger Testing)
     It is used to make a choice between two options. These experiments help drive decisions and draw conclusions. Generally, they begin with a question and a hypothesis, followed by data collection, a statistical test and its interpretation.
     A/B Testing steps:
      - Selecting a metric to track
      - Calculating the sample size
      - Running the experiment
      - Checking for significance (the result is likely not due to chance, given the statistical assumptions made)
     Case study: Which is the better title for the blog post?
      - Form a question: Does the title in blog post A or blog post B result in more clicks?
      - Form a hypothesis: The titles of blog posts A and B result in the same number of clicks.
      - Collect data:
        - 50% of users will see blog title A
        - 50% of users will see blog title B
        - Track the click-through rate until the sample size has been reached
      - Test the hypothesis with a statistical test (t-test, z-test, ANOVA, Chi-square test): Is the difference in the titles’ click-through rates significant?
      - Interpret results: Choose a title, or ask more questions and design another experiment.
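The significance check in the blog-title case study can be sketched with a two-proportion z-test, one of the tests listed above. The click and view counts are invented, the function name is hypothetical, and the 0.05 threshold is a common convention assumed here rather than something the slides specify.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    # Click-through rate of each title.
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the hypothesis that both titles perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up results: title A got 120 clicks and title B got 90, on 1000 views each.
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
significant = p_value < 0.05
```

If `significant` is true, the hypothesis that both titles perform the same is rejected at the chosen threshold. Otherwise the experiment gives no clear winner and another experiment may be needed.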
  • 11. Time-series forecasting
     What is a statistical model?
      - Represents a real-world process with statistics
      - Mathematical relationships between variables, including random variables
      - Based on statistical assumptions and historical data
     Predictive modeling: A subcategory of modeling used for prediction.
      - Process:
        - New input: Enter a future date into a model of unemployment
        - Predictive model: Model of unemployment
        - Output: Get a prediction of what the unemployment rate will be next month
      - Predictive models can range from a simple linear equation with an x and a y variable to a very complicated deep learning algorithm.
     Time-series data: A series of data points sequenced by time. Example: daily stock prices, gas prices over the years
      - Often it is in the form of rates, such as monthly unemployment rates or a patient’s heart rate during surgery.
      - Time-series data is usually plotted as a line graph.
      - Seasonality occurs when there are repeating patterns related to time, such as months or weeks.
      - Time-series data is used in predictive modeling to predict metrics at future dates, which is known as forecasting. We can build predictive models using time-series data from past years or decades to generate predictions. This uses a combination of statistical and machine learning methods.
      - A confidence interval says that the model is ‘X%’ sure that the true value will fall within a given range.
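The simplest predictive model mentioned above, a linear equation with an x and a y variable, can be fitted to time-series data with ordinary least squares. The monthly rates below are invented, and `fit_line` is a hypothetical helper. The sketch ignores seasonality and does not compute a confidence interval.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = intercept + slope * x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up monthly unemployment rates (%) for months 1 to 6.
months = [1, 2, 3, 4, 5, 6]
rates = [5.0, 5.2, 5.1, 5.4, 5.5, 5.6]

slope, intercept = fit_line(months, rates)
# New input: month 7. Output: the forecast rate for that month.
forecast = intercept + slope * 7
```

This mirrors the process described above: historical data builds the model, a future date goes in and a predicted rate comes out.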
  • 12. Supervised machine learning
     Machine learning: A set of methods for making predictions based on existing data.
     Supervised machine learning: A subset of machine learning where the existing data has a specific structure, i.e., it has labels and features.
      - Labels are what we want to predict.
      - Features are data that might predict the label.
     Applications of supervised machine learning:
      - Recommendation systems
      - Diagnosing biomedical images
      - Recognizing hand-written digits
      - Predicting customer churn
     Case study: Customer churn prediction
      - Customer: Will either stay subscribed or is likely to cancel the subscription (churn).
      - Gather training data to build the model, i.e., historical customer data in which some customers maintained their subscriptions while others churned. We eventually want to be able to predict the label for each customer (churned/subscribed), so we need features about each customer that might affect the label (age, gender, date of last purchase, household income). Machine learning can analyze many features simultaneously.
      - We use these labels and features to train our model to make predictions on new data.
      - It is good practice not to use all of your historical data for training the model. The withheld data is called a test set, and it is used to evaluate how well the model performs.
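The churn case study can be sketched with a deliberately tiny model: a single cut-off on one feature, learned from labeled training data and evaluated on a withheld test set. The records, the feature (days since last purchase) and all function names are assumptions for illustration. Real projects use many features, shuffle the data before splitting and rely on a machine learning library.

```python
# Each record is (days since last purchase, label). The data is made up.
data = [
    (3, "subscribed"), (40, "churned"), (7, "subscribed"), (55, "churned"),
    (12, "subscribed"), (33, "churned"), (5, "subscribed"), (60, "churned"),
    (9, "subscribed"), (48, "churned"),
]

# Withhold the last 30% of the records as a test set.
split = len(data) * 7 // 10
train, test = data[:split], data[split:]


def predict(days, threshold):
    # Customers who have not purchased for longer than the cut-off are predicted to churn.
    return "churned" if days > threshold else "subscribed"


def accuracy(rows, threshold):
    # Share of rows whose predicted label matches the true label.
    return sum(predict(days, threshold) == label for days, label in rows) / len(rows)


def fit_threshold(rows):
    # Try every observed feature value as a cut-off and keep the most accurate one.
    candidates = sorted(days for days, _ in rows)
    return max(candidates, key=lambda t: accuracy(rows, t))


threshold = fit_threshold(train)          # learned from training data only
test_accuracy = accuracy(test, threshold)  # evaluated on unseen data
```

Because the threshold is chosen without ever looking at `test`, the test accuracy is an honest estimate of how the model behaves on new customers.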
  • 13. Unsupervised learning
     Clustering: A set of machine learning algorithms that divide data into categories called clusters.
      - Clusters help us see patterns in messy datasets.
      - Machine learning scientists use clustering to divide customers into segments, images into categories, or behaviors into typical and anomalous.
      - Clustering is part of a broader category within machine learning called ‘Unsupervised learning’. Unlike supervised learning, which uses data with features and labels, unsupervised learning uses data with only features. These features are basically measurements.
      - Some clustering algorithms need us to define how many clusters we want to create. The number of clusters we ask for greatly affects how the algorithm will segment our data, so it should be chosen based on a hypothesis about the data.
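To show how the requested number of clusters changes the segmentation, here is a minimal k-means sketch on a single feature. K-means is one common clustering algorithm, chosen here as an assumption since the slides do not name one. The spend values are invented, and the deterministic choice of starting centers is a simplification (standard k-means usually starts from random points).

```python
def k_means_1d(values, k, iterations=10):
    # Deterministic start: spread the initial centers across the sorted values.
    ordered = sorted(values)
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Made-up monthly spend per customer. There are no labels, only a feature.
spend = [10, 12, 11, 95, 100, 98, 50, 52]

centers_2, clusters_2 = k_means_1d(spend, k=2)
centers_3, clusters_3 = k_means_1d(spend, k=3)
```

With `k=2` the mid-range customers are merged into the low-spend segment, while `k=3` gives them a segment of their own, so the hypothesis about how many customer groups exist directly shapes the result.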