This document summarizes a talk on using big data driven solutions to combat COVID-19. It discusses how big data preparation involves ingesting, cleansing, and enriching data from various sources. It also describes common big data technologies used for storage, mining, analytics and visualization including Hadoop, Presto, Kafka and Tableau. Finally, it provides examples of research projects applying big data and AI to track COVID-19 cases, model disease spread, and optimize health resource utilization.
Big Data Driven Solutions to Combat Covid-19 Webinar
1. Talk on Big Data Driven Solutions to
Combat Covid-19
National Level Webinar at
Ethiraj College for Women (Autonomous),
Chennai.
Dr.S.Balakrishnan,
Professor and Head,
Department of Computer Science and Business Systems,
Sri Krishna College of Engineering and Technology,
Coimbatore, Tamilnadu.
2. OUTLINE
Introduction
Big Data Preparation
Types of Tools Used in Big-data
Top Big Data Technologies
Research Projects
3. INTRODUCTION
Big Data may well be the Next Big Thing in the IT world.
Big data burst upon the scene in the first decade of the
21st century.
The first organizations to embrace it were online and startup
firms. Firms like Google, eBay, LinkedIn, and Facebook
were built around big data from the beginning.
Like many new information technologies, big data can bring about:
dramatic cost reductions,
substantial improvements in the time required to perform a computing task, or
new product and service offerings.
4. WHAT IS BIG DATA?
‘Big Data’ is similar to ‘small data’, but bigger in size.
Because the data is bigger, it requires different approaches:
techniques, tools, and architecture,
with the aim of solving new problems, or old problems in a better way.
Big Data generates value from the storage and processing of
very large quantities of digital information that cannot be
analyzed with traditional computing techniques.
5. WHAT IS BIG DATA?
• Walmart handles more than 1 million customer transactions every hour.
• Facebook handles 40 billion photos from its user base.
• Decoding the human genome originally took 10 years; now it can be achieved in one week.
9. WHY BIG DATA
The growth of Big Data is driven by:
– Increases in storage capacity
– Increases in processing power
– Availability of data (of different types)
10. WHY BIG DATA
• Facebook generates 10 TB of data daily.
• Twitter generates 7 TB of data daily.
• IBM claims 90% of today’s stored data was generated in just the last two years.
11. HOW IS BIG DATA DIFFERENT?
1) Automatically generated by a machine (e.g. a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g. use of the internet)
3) Not designed to be friendly to analysis (e.g. text streams)
14. THE STRUCTURE OF BIG DATA
Structured
• Most traditional data sources
Semi-structured
• Many sources of big data
Unstructured
• Video data, audio data
16. BIG DATA PREPARATION
Perception vs. Reality
The perception is that you spend most of your time on
analytics.
But in reality, you will devote much more time and effort
on importing, profiling, cleansing, repairing,
standardizing, and enriching your data.
17. DATA CLEANSING
Data cleaning is the process of detecting and
correcting (or removing) corrupt or inaccurate records
from a record set, table, or database.
Data cleansing may be performed interactively with
data wrangling tools, or as batch processing through
scripting.
After cleansing, a data set will be consistent with other
similar data sets in the system.
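The batch-scripting route mentioned above can be sketched in a few lines of Python. The record layout, field names, and accepted date formats here are illustrative assumptions, not part of the talk.

```python
from datetime import datetime

def cleanse(records):
    """Drop corrupt records and normalize date formats (batch cleansing sketch).

    Assumes each record is a dict with 'name' and 'date' keys; both the
    field names and the accepted date formats are hypothetical examples.
    """
    known_formats = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")
    cleaned = []
    for rec in records:
        name = (rec.get("name") or "").strip()
        if not name:                # corrupt: required field missing
            continue
        for fmt in known_formats:
            try:
                date = datetime.strptime(rec.get("date", ""), fmt)
                break
            except ValueError:
                continue
        else:
            continue                # corrupt: unparseable date, remove record
        cleaned.append({"name": name, "date": date.strftime("%Y-%m-%d")})
    return cleaned

rows = [
    {"name": "Asha", "date": "21/04/2020"},
    {"name": "", "date": "2020-04-21"},       # missing name: dropped
    {"name": "Ravi", "date": "April 2020"},   # unknown date format: dropped
]
print(cleanse(rows))  # [{'name': 'Asha', 'date': '2020-04-21'}]
```

The same function could run interactively inside a data-wrangling tool or as a scheduled batch job over a whole record set.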
18. DATA CLEANSING VS VALIDATION
Cleansing
Data cleansing may involve removing typographical errors, or validating and correcting values against a known list of entities.
Validation
Validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
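The strict-versus-fuzzy distinction can be shown concretely. This is a minimal sketch; the postal codes, city list, and 0.8 similarity cutoff are made-up examples, and the fuzzy match uses Python's standard difflib rather than any particular tool's matcher.

```python
import difflib

VALID_POSTCODES = {"600008", "641008"}            # illustrative known list
KNOWN_CITIES = ["Chennai", "Coimbatore", "Madurai"]

def validate_strict(postcode):
    """Strict validation: reject outright anything not on the known list."""
    return postcode in VALID_POSTCODES

def validate_fuzzy(city):
    """Fuzzy validation: correct a value that partially matches a known record."""
    match = difflib.get_close_matches(city, KNOWN_CITIES, n=1, cutoff=0.8)
    return match[0] if match else None

print(validate_strict("600008"))   # True
print(validate_fuzzy("Chenai"))    # 'Chennai' (typo corrected by partial match)
```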
19. Data cleansing may also involve harmonization and
standardization of data.
For example, harmonization of short codes (st, rd,
etc.) to actual words (street, road, et cetera).
Standardization of data means changing a
reference data set to a new standard, e.g. use of
standard codes.
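Harmonization of short codes, as in the street/road example above, amounts to a lookup against a reference mapping. The mapping table below is a small illustrative sample, not a complete standard.

```python
SHORT_CODES = {"st": "street", "rd": "road", "ave": "avenue"}  # sample mapping

def harmonize(address):
    """Expand short codes to full words so all records use one consistent form."""
    words = [SHORT_CODES.get(w.lower().rstrip("."), w) for w in address.split()]
    return " ".join(words)

print(harmonize("12 Anna Salai Rd"))  # '12 Anna Salai road'
```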
20. WHY DATA PREPARATION IS NEEDED
It significantly reduces the amount of time needed to
ingest and prepare new data sets for multiple
downstream processes.
It also shapes and improves your business data, and
renders your ecosystem simple, scalable, and
automated.
21. DATA BEFORE PROCESSING
You work with a mishmash of data sources.
Your content will be inconsistent, incomplete and in a
variety of formats.
It takes you weeks to process your data and write
custom scripts to clean up the mess.
You need an efficient strategy to harvest and analyze data
from social media and sales transactions.
You will have only a vague idea of the categories of
information that your data might provide.
22. DATA AFTER PROCESSING
Provides you with a large set of data repair,
transformation, and enrichment options that require zero
coding or scripting.
Enables you to see data transformations and the result
of script automation in real time with a set of smart and
interactive tools and features.
24. INGEST
What are your data sources?
Are they office documents, social media, or clickstream
logs? If so, you need to ingest your data before you can
effectively analyze and enrich it.
To make sense of all the data you have, you must
define a structure and correlate the disparate data sets.
This important step involves both understanding and
standardizing your data.
25. WHAT YOU CAN DO TO INGEST AND MEND YOUR DATA
Statistical Profiling: Create standard statistical
analysis of numerical data, and frequency and term
analysis of text data.
Process: Handle multiple formats of data sources,
whether their content is structured, semi-structured, or
unstructured.
Cleanse: Remove nonessential characters and
standardize date formats.
Repair: Find and fix inconsistencies.
Detect Schema: Identify schema and metadata that is
explicitly defined in headers, fields, or tags.
Identify Duplicates: Find and flag duplicates in your
data so you can reduce the size of your data pool.
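The profiling and duplicate-flagging steps above can be sketched with the Python standard library. The record fields ("age", "notes") and the choice of key fields are hypothetical examples.

```python
import statistics
from collections import Counter

def profile(rows, numeric_field, text_field):
    """Standard statistical profile of a numeric field, plus term-frequency
    analysis of a text field (field names are illustrative)."""
    values = [r[numeric_field] for r in rows]
    terms = Counter(w for r in rows for w in r[text_field].lower().split())
    return {"count": len(values),
            "mean": statistics.mean(values),
            "top_terms": terms.most_common(2)}

def find_duplicates(rows, key_fields):
    """Flag records whose key fields have already been seen."""
    seen, dupes = set(), []
    for r in rows:
        key = tuple(r[f] for f in key_fields)
        if key in seen:
            dupes.append(r)
        else:
            seen.add(key)
    return dupes

records = [
    {"age": 30, "notes": "fever and cough"},
    {"age": 45, "notes": "cough"},
    {"age": 30, "notes": "fever and cough"},   # exact duplicate
]
print(profile(records, "age", "notes"))
print(find_duplicates(records, ["age", "notes"]))  # flags the third record
```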
26. ENRICH
After you’ve cleansed your data, you can leverage
patterns and knowledge-based classifications to understand
the domains found in your data sets.
Use the wide variety of known categories and the vast array of
reference data to analyze and recognize content without
relying on any metadata.
After classifying your data sets, enrich them with
related entities from the reference knowledge service, and
extract embedded entities found in your data. This
semantically enriches and correlates your data.
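At its simplest, this kind of enrichment is a join between classified records and a reference data set. The dictionary below stands in for a reference knowledge service; the lookup key and attached attributes are made-up examples.

```python
# Stand-in for a reference knowledge service (illustrative data only).
REFERENCE = {
    "Chennai": {"state": "Tamil Nadu", "country": "India"},
}

def enrich(record):
    """Attach related entities from reference data to a classified record.

    Records whose city is not in the reference data pass through unchanged.
    """
    extra = REFERENCE.get(record.get("city"), {})
    return {**record, **extra}

print(enrich({"name": "Asha", "city": "Chennai"}))
# {'name': 'Asha', 'city': 'Chennai', 'state': 'Tamil Nadu', 'country': 'India'}
```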
28. GOVERN
As you ingest, enrich, and publish your data, cloud service providers
offer an intuitive, UI-driven dashboard page to monitor
all transform activity on your data sets.
29. AUTOMATE
To automate the process:
First, you can use the scheduler to set your transformations to
run on a daily, weekly, or monthly basis against a pre-
determined data source.
Second, by using APIs you can automate the entire data
preparation process, from file movement to preparation to
publishing.
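The scheduler idea can be sketched with Python's standard sched module. This is not any cloud provider's scheduler API; `prepare_data` is a hypothetical placeholder for the ingest-cleanse-publish pipeline.

```python
import sched
import time

runs = []

def prepare_data():
    """Placeholder for the ingest -> cleanse -> publish pipeline (hypothetical)."""
    runs.append(time.time())

scheduler = sched.scheduler(time.time, time.sleep)
PERIOD = 24 * 60 * 60  # one day, in seconds

def job():
    prepare_data()
    scheduler.enter(PERIOD, 1, job)   # re-arm: next run in 24 hours

scheduler.enter(0, 1, job)            # first run immediately
scheduler.run(blocking=False)         # in production, scheduler.run() blocks
```

A real deployment would more likely use cron or the provider's own scheduling APIs; the re-arming pattern is the same.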
30. TYPES OF TOOLS USED IN BIG-DATA
Where is processing hosted?
Distributed servers / cloud (e.g. Amazon EC2)
Where is data stored?
Distributed storage (e.g. Amazon S3)
What is the programming model?
Distributed processing (e.g. MapReduce)
How is data stored and indexed?
High-performance schema-free databases (e.g. MongoDB)
What operations are performed on the data?
Analytic / semantic processing
31. TYPES OF BIG DATA TECHNOLOGIES
Big Data technology is mainly classified into
two types:
Operational Big Data Technologies:
all about the normal day-to-day data that we generate.
Analytical Big Data Technologies:
an advanced class of Big Data technologies, slightly more
complex than the operational kind.
32. A FEW EXAMPLES OF OPERATIONAL BIG DATA
TECHNOLOGIES ARE AS FOLLOWS:
33. FEW EXAMPLES OF ANALYTICAL BIG DATA
TECHNOLOGIES ARE AS FOLLOWS:
35. TOP BIG DATA TECHNOLOGIES
Top big data technologies are divided into 4 fields which
are classified as follows:
Data Storage
Data Mining
Data Analytics
Data Visualization
37. DATA STORAGE - HADOOP
The Hadoop framework was designed to store and process data
in a distributed data processing environment, on
commodity hardware, with a simple programming model.
It stores and analyses data held on different machines at
high speed and low cost.
Developed by: Apache Software Foundation (10 December 2011)
Written in: Java
Current stable version: Hadoop 3.1.1
39. DATA MINING - PRESTO
Presto is an open-source distributed SQL query engine for
running interactive analytic queries against data
sources of all sizes, ranging from gigabytes to
petabytes.
It allows querying data in Hive, Cassandra, relational
databases, and proprietary data stores.
Developed by: Facebook; open-sourced in 2013
Written in: Java
Current stable version: Presto 0.22
41. DATA ANALYTICS - KAFKA
Apache Kafka is a distributed streaming platform.
A streaming platform has three key capabilities:
Publish and subscribe to streams of records
Store streams of records durably
Process streams of records as they occur
In its publish/subscribe role it is similar to a message queue
or an enterprise messaging system.
Developed by: Apache Software Foundation in the year 2011
Written in: Scala, Java
Current stable version: Apache Kafka 2.2.0
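The publish/subscribe model can be demonstrated with a toy in-memory broker. This is only a teaching sketch of the concept, not Kafka's API: real Kafka partitions each topic's log, persists it to disk, and replicates it across brokers.

```python
from collections import defaultdict, deque

class MiniBroker:
    """Toy in-memory stand-in for a streaming platform's publish/subscribe
    model. Each topic is an append-only log; each consumer tracks its own
    read offset, so consumers read independently at their own pace."""

    def __init__(self):
        self.topics = defaultdict(deque)   # topic -> append-only log
        self.offsets = defaultdict(int)    # (topic, consumer) -> next offset

    def publish(self, topic, record):
        self.topics[topic].append(record)

    def consume(self, topic, consumer):
        log = self.topics[topic]
        offset = self.offsets[(topic, consumer)]
        records = list(log)[offset:]
        self.offsets[(topic, consumer)] = len(log)
        return records

broker = MiniBroker()
broker.publish("cases", {"city": "Chennai", "new_cases": 42})
print(broker.consume("cases", "dashboard"))  # [{'city': 'Chennai', 'new_cases': 42}]
print(broker.consume("cases", "dashboard"))  # [] (this consumer is caught up)
```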
43. DATA VISUALIZATION - TABLEAU
Tableau is a powerful and fast-growing data
visualization tool used in the business intelligence industry.
Data analysis is very fast with Tableau, and the
visualizations created take the form of dashboards and
worksheets.
Developed by: Tableau Software (17 May 2013)
Written in: Java, C++, Python, C
Current stable version: Tableau 8.2
45. EMERGING BIG DATA TECHNOLOGIES -
TENSORFLOW
TensorFlow has a comprehensive, flexible ecosystem of tools,
libraries, and community resources that lets researchers
push the state of the art in machine learning, and lets
developers easily build and deploy machine-learning-powered
applications.
Developed by: Google Brain team in the year 2015
Written in: Python, C++, CUDA
Current stable version: TensorFlow 2.0 beta
47. APPLICATION OF BIG DATA ANALYTICS
Homeland security
Smarter healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic control
Trading analytics
Search quality
48. RISKS OF BIG DATA
• You can easily be overwhelmed by the volume; you need the
right people, solving the right problems.
• Costs can escalate too fast; it isn’t necessary to capture
100% of the data.
• A major risk across the many sources of big data is privacy,
handled through:
• Self-regulation
• Legal regulation
49. HOW BIG DATA IMPACTS IT
Big data is a disruptive force, presenting
opportunities as well as challenges to IT organizations.
By 2015, big data was projected to create 4.4 million IT jobs,
1.9 million of them in the US alone.
In 2017, data scientist was the No. 1 job in
Harvard’s ranking.
50. BENEFITS OF BIG DATA
• Real-time big data isn’t just about storing
petabytes or exabytes of data in a data warehouse;
it’s about the ability to make better decisions and take
meaningful actions at the right time.
• Fast-forward to the present, and technologies like Hadoop
give you the scale and flexibility to store data before you
know how you are going to process it.
• Technologies such as MapReduce, Hive, and Impala enable
you to run queries without changing the data structures
underneath.
51. RESEARCH PROJECTS RELATED TO COVID’19
• Project 1: World-wide COVID-19 Outbreak Data Analysis and
Prediction
• Methods:
Real-time data query is done and visualized, then the
queried data is used for Susceptible-Exposed-Infectious-
Recovered (SEIR) predictive modelling.
SEIR modelling forecasts the COVID-19 outbreak within and
outside of China based on daily observations.
The queried news is also analyzed and classified into
negative and positive sentiments, to understand the influence
of the news on people’s behavior, both politically and
economically.
52. RESEARCH PROJECTS
• Project 2: Short-Term Applications of Artificial Intelligence and
Big Data: Tracking and Diagnosing COVID-19 Cases.
• Project 3: Short-Term Applications of Artificial Intelligence and
Big Data: A Quick and Effective Pandemic Alert.
• Project 4: Modeling of disease activity, potential growth, and areas
of spread.
• Project 5: Modeling of the utility of operating theaters and clinics
with manpower projections.