Just finished a basic course on data science (highly recommend it if you wish to explore what data science is all about). Here are my takeaways from the course.
2. What is Data Science?
It is a set of methodologies for taking the thousands of forms of data available to us today and using
them to draw meaningful conclusions.
Purpose of Data Science:
- Describe the current state of an organization or process
- Detect anomalous events
- Diagnose the causes of events and behaviors
- Predict future events
Data Science Workflow:
- Collect data from various sources – surveys, web traffic results, geo-tagged social media posts, financial
transactions, etc. Once collected, the data is stored in a safe and accessible way.
- Prepare the raw data, also known as ‘cleaning the data’, which involves finding missing or duplicate values and
converting data into a more organized format.
- Explore and visualize the cleaned data by building dashboards to track how data changes over time or performing
comparisons between two sets of data.
- Run experiments and predictions on the data, for example building a system that forecasts temperature changes or
performing a test to find which web page acquires more customers.
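To make this workflow concrete, here’s a minimal pandas sketch of the prepare-and-explore steps (my own illustration; the file name and column names are made up):

```python
import pandas as pd

# Collect: load data that was gathered and stored earlier (hypothetical file).
df = pd.read_csv("survey_responses.csv")

# Prepare: remove duplicates and handle missing values.
df = df.drop_duplicates()
df = df.dropna(subset=["age", "score"])

# Explore: summary statistics and a comparison between two groups.
print(df.describe())
print(df.groupby("group")["score"].mean())
```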
3. Three exciting areas of Data Science
Machine Learning:
- Starts with a well-defined question (What is the probability that this transaction is fraudulent?)
- Gather some data to analyze (Old transactions labeled as fraudulent/valid)
- Bring in new additional data to make predictions (New credit card transactions)
Internet of Things (IoT):
- Refers to gadgets that are not standard computers but still have the ability to transmit data.
- Includes smart watches, internet-connected home security systems, electronic toll collection systems, building
energy management systems, etc.
- IoT is a great source for data science projects.
Deep Learning:
- A sub-field of machine learning, where multiple layers of algorithms called ‘Neurons’ work together to draw
complex conclusions.
- Deep learning takes much more ‘Training Data’, which are records of data used to build an algorithm, than a
traditional machine learning model and is also able to learn relationships that traditional models cannot.
- Deep learning is used to solve data-intensive problems such as image classification or language
understanding.
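As a toy sketch of ‘multiple layers of neurons’, here’s a tiny neural network built with scikit-learn (my choice of library; the course doesn’t prescribe one):

```python
from sklearn.neural_network import MLPClassifier

# Toy training data: two features per record and a binary label.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = [0, 1, 1, 0]

# Two hidden layers of 'neurons'; real deep learning models stack many
# more layers and need far more training data.
model = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
print(model.predict([[1, 0]]))
```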
4. Data Science Roles and Tools
Data Engineer:
- Responsibilities: They control the flow of data by building custom data pipelines and storage
systems. They design infrastructure so that data is not only collected, but is also easy to
obtain and process.
- Focus area: Data collection and storage
- Tools: SQL for storing and organizing data; Java, Scala or Python for processing data; Shell on
the command line to automate and run tasks.
Data Analyst:
- Responsibilities: They describe the data by exploring it and creating visualizations and
dashboards. To do this, they first need to clean the data.
- Focus area: Data preparation & Exploration and Visualization
- Tools: SQL for querying data (using existing databases to retrieve and aggregate relevant
data); spreadsheets for simple analyses on small quantities of data; Tableau, Power BI or Looker
to create dashboards and share analyses; Python/R for cleaning and analyzing data.
Data Scientist:
- Responsibilities: They find new insights from data and use traditional machine learning for
prediction and forecasting.
- Focus area: Data preparation, Exploration and Visualization & Experimentation and Prediction
- Tools: SQL, Python or R proficiency; data science libraries of reusable code for common data
science tasks.
Machine Learning Scientist:
- Responsibilities: Very similar to Data Scientists. They predict what’s likely to be true from
what we already know – these scientists use training data to classify larger, unrulier data,
whether it’s to classify images that contain a car or to create a chatbot.
- Focus area: Data preparation, Exploration and Visualization & Experimentation and Prediction
- Tools: Python/R to create predictive models; popular machine learning libraries (e.g.,
TensorFlow) to run powerful deep learning algorithms.
5. Step 1: Data collection & storage
Vast amounts of data are generated daily, from surfing the internet to paying by card in a
shop. The companies behind the services we use collect this data internally and use it to
make data-driven decisions. There are also many free, open data sources available, meaning the
data can be freely used, shared and built on by anyone.
Company data sources:
- Web events
- Customer data
- Survey data
- Logistics data
- Financial transactions
Open data sources:
- Public data APIs (Application Programming Interfaces) – Twitter, Wikipedia, Yahoo! Finance, Google
Maps
- Public records (international organizations such as World Bank, UN, WTO; national statistical offices;
government agencies)
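For example, pulling from a public data API takes only a few lines (a sketch using Wikipedia’s REST API):

```python
import json
import urllib.request

# Wikipedia's public REST API: fetch a page summary as JSON.
url = "https://en.wikipedia.org/api/rest_v1/page/summary/Data_science"
req = urllib.request.Request(url, headers={"User-Agent": "data-science-notes-demo"})

with urllib.request.urlopen(req) as resp:
    page = json.load(resp)

print(page["title"])
print(page["extract"][:200])  # first 200 characters of the summary
```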
6. Types of data
- Quantitative data: Data that can be counted, measured and expressed using numbers.
- Qualitative data: Data that is descriptive and conceptual – something that can be observed but
not measured.
- Image data: An image is made up of pixels. These pixels contain information about color and
intensity. Typically, the pixels are stored in computer memory.
- Text data: Emails, documents, reviews, social media posts, etc. – this data can be stored and
analyzed to find relevant insights.
- Geospatial data: Data with location information, especially useful for navigation apps like
Google Maps/Waze.
- Network data: Data consisting of people or things in a network and the relationships between
them.
7. Data storage and retrieval
When storing data, there are 3 important things to consider:
- Determining where to store the data
- Knowing what kind of data we are storing
- Knowing how we can retrieve the data from storage
Location:
- On-premises cluster, i.e., data stored across many different computers
- Cloud storage (MS Azure, Amazon Web Services, Google Cloud); these providers can also carry out data
analytics, machine learning and deep learning.
Types of data storage:
- Unstructured data (email, text, video & audio, web pages, social media messages) is stored in a
Document Database
- Tabular data is stored in a Relational Database
Data retrieval (each type of database has its own query language):
- Document Databases mainly use NoSQL (Not only SQL)
- Relational Databases use SQL (Structured Query Language)
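Here’s a minimal sketch of SQL retrieval from a relational database, using Python’s built-in sqlite3 module (the table and column names are made up):

```python
import sqlite3

# A throwaway in-memory relational database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 9.99, "US"), (2, 42.50, "DE"), (3, 5.00, "US")],
)

# SQL retrieval and aggregation: total spend per country.
for country, total in conn.execute(
    "SELECT country, SUM(amount) FROM transactions GROUP BY country"
):
    print(country, total)
```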
8. Data Pipelines
These move data through defined stages, for example from ingesting data through an API to
loading that data into a database.
A key feature is that pipelines automate this movement.
- Rather than manually running programs to collect and store data, a data engineer schedules
tasks, whether hourly, daily, or triggered by an event.
- Due to this automation, data pipelines need to be monitored. Alerts can be generated
automatically if 95% of storage capacity has been reached or if an API is responding
with an error.
- Data pipelines are important when working with lots of data from different sources.
There is no set way to make a pipeline – pipelines are highly customized depending on
your data, storage options and ultimate usage of the data.
ETL (extract, transform and load) is a popular framework for data pipelines.
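As a rough illustration of the ETL pattern (my own sketch; the API endpoint and table are hypothetical):

```python
import json
import urllib.request

# Hypothetical API endpoint; swap in a real data source.
SOURCE_URL = "https://api.example.com/events"

def extract():
    """Ingest raw records from an API (the 'E' in ETL)."""
    with urllib.request.urlopen(SOURCE_URL) as resp:
        return json.load(resp)

def transform(records):
    """Clean the raw records, e.g., drop rows with missing values (the 'T')."""
    return [r for r in records if r.get("user_id") is not None]

def load(records, conn):
    """Write the cleaned records into a database table (the 'L')."""
    conn.executemany(
        "INSERT INTO events (user_id, action) VALUES (?, ?)",
        [(r["user_id"], r["action"]) for r in records],
    )

# In practice a scheduler (e.g., cron or Airflow) would trigger these steps
# hourly, daily, or on an event, rather than a person running them by hand.
```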
9. Steps 2 & 3: Data preparation, Exploratory
Data Analysis & Visualization
Data preparation:
- Skipping this step may lead to errors down the road, such as incorrect results that throw off
your algorithms.
- Tidy Data is a way of presenting a matrix of data, with observations as rows and variables as
columns.
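A small pandas sketch of reshaping ‘wide’ data into tidy form (made-up column names):

```python
import pandas as pd

# Wide layout: one column per year (made-up data).
wide = pd.DataFrame({
    "country": ["US", "DE"],
    "2022": [10, 20],
    "2023": [15, 25],
})

# Tidy layout: each row is one observation, each column one variable.
tidy = wide.melt(id_vars="country", var_name="year", value_name="sales")
print(tidy)
```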
Exploratory Data Analysis (EDA):
- It is a process of exploring the data, formulating hypotheses about it, and assessing its main
characteristics, with a strong emphasis on visualization. It takes place after data preparation,
but the two can blend together.
Visualization:
- Dashboards are used to group all relevant information in one place to make it easier to gather
insights and act on them.
- Business Intelligence tools let you clean, explore, visualize data and build dashboards without
requiring any programming knowledge. Examples: Tableau, Looker, Power BI
- Note: Make your visualizations interactive and use filters
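If you do want code, an interactive chart takes only a couple of lines with a library like Plotly (my choice, not a tool from the course; the data is made up):

```python
import pandas as pd
import plotly.express as px

# Made-up daily metric to visualize.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90),
    "signups": range(90),
})

# Interactive line chart: hovering, zooming and panning come for free.
fig = px.line(df, x="date", y="signups", title="Daily signups")
fig.show()
```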
10. Step 4: Running experiments and predictions
A/B Testing (aka Champion/Challenger Testing)
It is used to make a choice between two options. These experiments help drive decisions and draw conclusions. Generally, they
begin with a question and a hypothesis, then data collection followed by a statistical test and its interpretation.
A/B Testing steps:
- Selecting a metric to track
- Calculating the sample size
- Running the experiment
- Checking for significance (result is likely not due to chance given the statistical assumptions made)
Case study: Which is the better title for a blog post?
- Form a question: Does the title in blog post A or blog post B result in more clicks?
- Form a hypothesis: The titles in blog posts A and B result in the same number of clicks (the null hypothesis).
- Collect data:
50% of users will see blog title A
50% of users will see blog title B
Track click-through rate until sample size has been reached
- Test the hypothesis with a statistical test (t-test, z-test, ANOVA, Chi-square test): Is the difference in titles’ click-through rates
significant?
- Interpret results: Choose a title or ask more questions and design another experiment.
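To make the significance check concrete, here’s a sketch of a two-proportion z-test on made-up click counts, using statsmodels (one reasonable choice of library):

```python
from statsmodels.stats.proportion import proportions_ztest

# Made-up results: clicks and impressions for titles A and B.
clicks = [310, 355]         # users who clicked each title
impressions = [5000, 5000]  # users who saw each title

# Null hypothesis: both titles have the same click-through rate.
z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# A small p-value (commonly < 0.05) suggests the difference is not due to chance.
```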
11. Time-series forecasting
What is a statistical model?
- Represents a real-world process with statistics
- Mathematical relationships between variables, including random variables
- Based on statistical assumptions and historical data
Predictive modeling: A subcategory of modeling used for prediction.
- Process:
New input: a future date entered into the model
Predictive model: a model of unemployment
Output: a prediction of what the unemployment rate will be next month
- Predictive models range from a simple linear equation with an x & y variable to a very complicated deep learning algorithm.
Time-series data: A series of data points sequenced by time. Examples: daily stock prices, gas prices over the years
- Often it is in the form of rates, such as monthly unemployment rates or patient’s heart rate during surgery.
- Time-series data is usually plotted as a line graph.
- Seasonality occurs when there are repeating patterns related to time such as months or weeks.
- Time-series data is used in predictive modeling to predict metrics at future dates, which is known as forecasting. We can build
predictive models using time-series data from past years or decades to generate predictions. This uses a combination of statistical
and machine learning methods.
- A confidence interval says that the model is X% sure the future value will fall within a given range.
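As a toy example (mine, not the course’s), a straight-line trend is about the simplest forecasting model there is:

```python
import numpy as np

# Made-up monthly unemployment rates (%), ordered by time.
rates = np.array([4.1, 4.0, 4.2, 4.3, 4.5, 4.4, 4.6, 4.7])
months = np.arange(len(rates))

# Fit a linear trend: rate ~ slope * month + intercept.
slope, intercept = np.polyfit(months, rates, deg=1)

# Forecast next month by extrapolating the trend.
next_month = len(rates)
print(f"Forecast: {slope * next_month + intercept:.2f}%")
```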
12. Supervised machine learning
Machine learning: A set of methods for making predictions based on existing data.
Supervised machine learning: A subset of machine learning where the existing data has a specific structure, i.e., it has labels and
features.
- Labels are what we want to predict.
- Features are data that might predict the label.
Abilities of supervised machine learning:
- Recommendation systems
- Diagnosing biomedical images
- Recognizing hand-written digits
- Predicting customer churn
Case study: Customer churn prediction
- Customer: will either stay subscribed or cancel the subscription (churn).
- Gather training data to build the model, i.e., historical customer data where some will have maintained subscriptions while others
will have churned. We eventually want to be able to predict the label for each customer (churned/subscribed), hence we will need
features about each customer that might affect our label (age, gender, date of last purchase, household income). Machine learning
can analyze many features simultaneously.
- We use these labels and features to train our model to make predictions on new data.
- It’s always good practice not to allocate all your historical data to training the model. The withheld data is
called a test set, and it can be used to evaluate the efficacy of the model.
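Here’s what the churn case study might look like as a scikit-learn sketch (made-up features and labels; a real model would use far more data):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up historical customer data: features plus a churn label.
df = pd.DataFrame({
    "age": [23, 45, 31, 52, 29, 40, 60, 35],
    "months_since_last_purchase": [1, 8, 2, 12, 1, 6, 10, 3],
    "churned": [0, 1, 0, 1, 0, 1, 1, 0],  # label: 1 = churned
})

X = df[["age", "months_since_last_purchase"]]  # features
y = df["churned"]                              # label

# Withhold a test set to evaluate the model on data it hasn't seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```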
13. Unsupervised learning
Clustering: A set of machine learning algorithms that divide data into categories
called clusters.
- Clusters help us see patterns in messy datasets.
- Machine learning scientists use clustering to divide customers into segments,
images into categories or behaviors into typical and anomalous.
- Clustering belongs to a broader category within machine learning called ‘Unsupervised
learning.’ Unlike Supervised learning, which uses data with features and labels, Unsupervised
learning uses data with only features. These features are basically measurements.
- Some clustering algorithms need us to define how many clusters we want to create. The number
of clusters we ask for greatly affects how the algorithm segments our data, so it is usually
chosen based on a hypothesis about the dataset.
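A minimal k-means sketch (my illustration; features only, no labels, and we choose the number of clusters up front):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D measurements: features only, no labels.
points = np.array([
    [1.0, 2.0], [1.2, 1.8], [0.8, 2.1],  # one apparent group
    [8.0, 9.0], [8.3, 8.7], [7.9, 9.2],  # another apparent group
])

# k-means requires choosing the number of clusters in advance.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster assignments:", kmeans.labels_)
```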