1 /
A Study Review of Common Big Data
Architecture for
Small-medium Enterprise
Ridwan Fadjar Septian, ridwanbejo@gmail.com, Master of information System, Faculty of Postgraduate,
Universitas Komputer Indonesia.
Fajri Abdillah, clasense04@gmail.com, Senior Software Engineer, Horangi Cyber Security.
Tajhul Faijin Aliyudin, tajhulfaijin@gmail.com, Senior Software Engineer, Tado.live.
MSCEIS 2019
2 /
1. Introduction
● Big data is a set of facilities that processes large-scale
datasets in complex ways, beyond the traditional infrastructure
that only handles small amounts of data [1].
● The growing number of users, mobile devices, and Internet of
Things devices is a factor that lets enterprises gain more
data from their users [1].
● Those large datasets could be analyzed and processed to
support their businesses with distributed processing and
large-scale storage [1].
3 /
1. Introduction (2)
Characteristics of Big Data [1]:
● Volume, the quantity of the dataset, from gigabytes to petabytes or more
● Variety, structured, semi-structured, and unstructured datasets
● Veracity, the quality of the dataset before and after data preprocessing
● Velocity, the frequency of collecting the dataset from a number of event
sources
● Value, useful information and patterns that might be converted into
knowledge
● Variability, datasets might be in a sparse format that needs preprocessing
first
4 /
1. Introduction (3)
Big Data methodology [2]:
1. Acquire the dataset
2. Organize the dataset
3. Analyze the dataset
4. Make decisions from the analysis results
5 /
1. Introduction (4)
Big Data methodology [4]:
1. Data acquisition
2. Data storing
3. Data management
4. Data analysis
5. Data visualization
6 /
1. Introduction (5)
Advanced data analysis in Big Data [2,3,5,7]:
1. Regression learning
2. Classification
3. Association rule mining
4. Clustering
5. Forecasting
6. Deep learning
7. Natural language processing
7 /
1. Introduction (6)
Big Data adoption (examples):
1. Education: analytics for academic, learning, information
technology, and institutional information [6]
2. Environmental technology: energy efficiency, sustainable
farming and agriculture, smart cities, national strategies that do
not harm the environment [5]
3. Healthcare industry: improved pharmaceuticals, personalized
patient care, medical device design, fraud
detection on medical claims, preventive action for certain
diseases based on genomic analysis [4]
8 /
1. Introduction (7)
Big Data risks (examples) [1,2,3,6,7]:
1. Security and privacy protection
2. Data ownership and transparency
3. Data quality
4. Cost of implementation
5. Infrastructure maintenance
6. Developer failure
7. Security systems and regulatory compliance
9 /
2. Methods
1. Research planning
– Studying common big data architectures
– Finding an architecture that could be implemented by small-medium enterprises
2. Scoping the research
– Studying the big data architecture and its components only
– This research does not cover cost calculations for implementing the big data
architecture
3. Data acquisition by reviewing a number of papers
– ~42 papers were obtained as references for this study review
4. Conclusion and recommendations
10 /
3. Result and Discussion
1. Enterprise Architecture for Big Data Project
2. Event Sources
3. Message Queue
4. Data Lake
5. Extract Transform Load (ETL)
6. Data Warehouse
7. Data Mining Methodology
8. Data Visualization
9. Security and Compliance
10. Small Medium Enterprise in Big Data Era
11 /
3.1 Enterprise Architecture for
Big Data Project
TOGAF phases [8]:
1. Preliminary Phase
2. Phase A: Architecture Vision
3. Phase B: Business Architecture
4. Phase C: Information Systems Architectures
5. Phase D: Technology Architecture
6. Phase E: Opportunities and Solutions
7. Phase F: Migration Planning
8. Phase G: Implementation Governance
9. Phase H: Architecture Change Management
12 /
3.1 Enterprise Architecture for
Big Data Project (1)
TOGAF for Big Data [9]:
1. Has the potential to stabilize the implementation of a big data architecture
2. Clear business goals
3. Alignment with business requirements
4. Clear planning
5. Better recognition of the project scope
6. Better communication between stakeholders
7. Better change management
8. Improved focus on business opportunities
TOGAF also has a Target Capability Model that could drive an oil and gas
enterprise to build big data capabilities and migrate from its former data
infrastructure to a big data infrastructure [10].
13 /
3.2 Event sources
Event sources for Big Data:
1. Web applications [10]: clickstream behaviour, liked posts, shared posts,
recommendations to friends
2. Internet of Things (IoT) [11, 12, 13]: parking lot occupancy detection and
heat regulation at a university (390 GB), 10,000 sensors around an industrial
area that send data every 15 minutes, collecting Return Air
Temperature (RAT) and Set Point Temperature (SPT) from sensors in the
manufacturing sector.
3. External data sources [14]: Transaction Processing Performance Council
dataset (TPC-DI)
4. Mobile applications [15, 16]: geolocation, geo-tagged tweets, tweet
analysis, time spent at tourism sites.
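Events from sources like these typically arrive as semi-structured records. A minimal sketch of a hypothetical clickstream event as a web application might emit it (all field names are illustrative, not taken from the cited papers):

```python
import json
from datetime import datetime, timezone

# A hypothetical clickstream event; field names are illustrative.
event = {
    "event_type": "click",
    "user_id": "u-12345",
    "page": "/products/42",
    "referrer": "/home",
    "timestamp": datetime(2019, 10, 12, 9, 30, tzinfo=timezone.utc).isoformat(),
}

# Serialize to JSON before handing it to a message queue or log collector.
payload = json.dumps(event)
```

A semi-structured format such as JSON keeps the schema flexible across web, IoT, and mobile sources.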
14 /
3.3 Message Queue
Some facts about message queues [17]:
● software that retains messages from producers for a certain period
until they are processed by consumers
● organized by topics, where each topic has a partition key
● consumers could run as more than one instance and perform reliable
processing of the messages
● sits between the web application and data storage to prevent data loss
while processing requests from clients
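The concepts above (topics, partition keys, multiple consumer instances) can be sketched with a tiny in-memory model; a real deployment would use software such as Apache Kafka or RabbitMQ, and the class below is purely illustrative:

```python
# Minimal in-memory sketch of message-queue concepts: topics,
# partition keys, and per-partition consumption. Illustrative only.
class TinyQueue:
    def __init__(self, num_partitions=3):
        # Each topic maps to a fixed number of ordered partitions.
        self.topics = {}
        self.num_partitions = num_partitions

    def send(self, topic, key, message):
        # The partition key decides which partition retains the message,
        # so messages with the same key keep their relative order.
        partitions = self.topics.setdefault(
            topic, [[] for _ in range(self.num_partitions)]
        )
        idx = hash(key) % self.num_partitions
        partitions[idx].append(message)

    def consume(self, topic, partition):
        # One consumer instance reads one partition; several instances
        # can process different partitions of the same topic in parallel.
        return list(self.topics.get(topic, [[]] * self.num_partitions)[partition])

q = TinyQueue()
q.send("clicks", key="user-1", message={"page": "/home"})
q.send("clicks", key="user-1", message={"page": "/pricing"})
```

Because both messages share the key `user-1`, they land in the same partition and are consumed in order.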
15 /
3.3 Message Queue (1)
16 /
3.4 Data Lake
What is a Data Lake [24]:
• A form of massive data storage that collects raw datasets before the
datasets get further processing
• A concept initially introduced by James Dixon, commonly built on top of
the Hadoop File System
• Could be implemented using open-source solutions such as the Hadoop
File System from Apache Hadoop
• Enterprises can build their data lake with the mentioned open-source
products in their own environment, more cost-effectively than processing
everything in a database
17 /
3.4 Data Lake (1)
Data lake storage formats [24]:
• The dataset could be stored in the data lake in various formats such as
CSV, Apache Avro, Apache ORC, text files, JSON, XML, etc.
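As a sketch of how such raw files might land in a data lake, a common convention is a date-partitioned directory layout so batch jobs read only the slices they need; the layout and file names below are illustrative assumptions, not mandated by any cited tool:

```python
import json
import os
import tempfile

# Sketch: write raw JSON events into a date-partitioned layout, e.g.
# <root>/events/year=2019/month=10/day=12/part-0000.json
root = tempfile.mkdtemp()
partition = os.path.join(root, "events", "year=2019", "month=10", "day=12")
os.makedirs(partition, exist_ok=True)

events = [{"user": "u-1", "action": "login"}, {"user": "u-2", "action": "view"}]
path = os.path.join(partition, "part-0000.json")
with open(path, "w") as f:
    for e in events:
        # Newline-delimited JSON: one raw event per line.
        f.write(json.dumps(e) + "\n")
```

On HDFS the same idea applies, only with distributed storage instead of a local directory.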
18 /
3.4 Data Lake (2)
Data lakes have several key features, such as [25]:
• large-scale batch processing, schema-on-read, the ability to store large
data volumes at low cost,
• access through SQL-like systems even when the datasets are not stored
in a SQL database,
• complex processing that can even apply machine learning operations,
• storage of raw datasets instead of compact formats such as SQL tables,
• low cost for distributed processing
19 /
3.4 Data Lake (3)
A data lake has core components such as [26]:
• on the backend side:
• catalog storage
• batch job performance and scheduling
• fault tolerance
• garbage collection of metadata
• on the frontend side:
• dataset profile pages
• dataset search
• team dashboards
20 /
3.5 Extract Transform Load (ETL)
What is ETL [14]:
• ETL (Extract, Transform, Load) is the part of big data that has the role of
converting raw datasets into cleansed datasets.
• ETL is a set of tools that could help the enterprise perform real-time
analytics and decisions.
• ETL could be divided into three approaches: micro-batch,
near real-time, and streaming.
• An ETL pipeline is typically a combination of a scheduler and ETL scripts.
21 /
3.5 Extract Transform Load (ETL)
(1)
What is ETL [14]:
• The scheduler could use software such as cron jobs, while Apache Kafka
could be used for the streaming approach.
• ETL scripts could be executed as distributed processes on a cluster of
workers, for example with Apache Spark, or on a single node
using plain SQL or a programming language.
• ETL could transform datasets from SQL format into other text formats or
vice versa.
• Data formats supported by ETL technologies include XML, CSV, JSON,
Apache Avro, Apache ORC, etc.
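The single-node approach described above can be sketched in plain Python: extract rows from a CSV source, transform them, and load them into a SQL table. The table name and sample data are illustrative:

```python
import csv
import io
import sqlite3

# Raw CSV as it might arrive from an event source (illustrative data).
raw_csv = "user_id,amount\nu-1,10.5\nu-2,3.25\nu-1,7.0\n"

# Extract: parse the CSV into dictionaries.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: cast the string fields to proper types.
cleaned = [(r["user_id"], float(r["amount"])) for r in rows]

# Load: insert the cleansed rows into a SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", cleaned)
total = conn.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]
```

A scheduler such as cron would run this script periodically; Apache Spark would run the same extract-transform-load shape across a cluster of workers.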
22 /
3.5 Extract Transform Load (ETL)
(2)
Data preprocessing is also part of the ETL phase and consists of [29]:
• imperfect data handling
• dimensionality reduction
• instance reduction
• discretization
• imbalanced data handling
• incomplete data handling
Data preprocessing plays a part in improving the results of machine
learning models.
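Two of the steps listed above can be sketched directly: handling imperfect (missing) data via mean imputation, and discretization of a continuous value into coarse bins. The function names and thresholds are illustrative assumptions:

```python
# Sketch of two preprocessing steps: mean imputation for missing
# values, and discretization into ordinal buckets. Thresholds are
# illustrative, not from the cited papers.

def impute_mean(values):
    # Replace None entries with the mean of the observed values.
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def discretize(value, low=10.0, high=50.0):
    # Map a continuous value onto three coarse buckets.
    if value < low:
        return "low"
    if value < high:
        return "medium"
    return "high"

ages = [20.0, None, 40.0]
filled = impute_mean(ages)
labels = [discretize(v) for v in filled]
```

In practice libraries such as scikit-learn or Spark MLlib provide these transformations at scale, but the underlying operations look like the above.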
23 /
3.6 Data Warehouse
A data warehouse is [30]:
• a denormalized database that holds generic information to answer
management-level questions
• a centralized data source that could supply data for developing the
enterprise's strategic plan and making better decisions based on
historical data
• a facility to perform online analytical processing (OLAP) over the data
source to answer business needs
• a facility that receives the cleansed dataset processed by the ETL part
of the big data pipeline
• a source for data mining tasks such as forecasting, classification,
pattern recognition, clustering, etc.
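An OLAP-style query over a denormalized table can be sketched as follows: the sales facts carry their dimensions (region, year) inline, and a GROUP BY rolls them up to answer a management-level question. Table and column names are illustrative:

```python
import sqlite3

# Sketch: a denormalized warehouse table with dimensions stored inline.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("west", 2018, 100.0), ("west", 2019, 150.0), ("east", 2019, 80.0)],
)

# Management-level question: "How much did each region sell in 2019?"
result = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM sales WHERE year = 2019 GROUP BY region"
    ).fetchall()
)
```

Dedicated warehouse engines run the same aggregation shape over far larger, column-oriented storage.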
24 /
3.6 Data Warehouse (1)
In some cases, such as in the educational sector, the data warehouse has
key roles [31]:
• feasibility assessment and data analysis of an educational system from
different viewpoints
• quick decisions to evaluate the educational system
• collecting huge amounts of data from several kinds of existing
databases and unifying those data sources into a single one
• decision support system (DSS) techniques over the data warehouse
25 /
3.7 Data Mining Methodology
Data mining is one of the methodologies applied over big data. There are
some known methodologies for performing better data mining processes,
such as [32]:
● KDD process
● CRISP-DM
● RAMSYS
● DMIE
● DMEE
● ASUM-DM
● AABA
● etc
26 /
3.7 Data Mining Methodology (1)
KDD has several steps that consist of [32]:
● selection
● preprocessing
● transformation
● data mining
● knowledge gain
27 /
3.7 Data Mining Methodology (2)
The popular CRISP-DM has several steps that perform a complete
operation, consisting of [32]:
● business understanding
● data understanding
● data preprocessing
● modeling
● evaluation
● deployment
28 /
3.7 Data Mining Methodology (3)
CRISP-DM example use cases [32, 35, 36]:
● In the retail sector, CRISP-DM is applied to perform a data mining
process that uses association rule mining to predict sales patterns
● Research on climatology has also been conducted by applying CRISP-DM
and KDD in combination to improve the data mining result
● Other research, on social media analysis, has shown that CRISP-DM
made the data mining process give better results for favorite TV series
classification using the Decision Tree algorithm
29 /
3.8 Data Visualization
Visualization could deliver more engagement to the management level or
other users so they can understand the insights gained from big data [37].
● Visualization could take the form of a web dashboard or a document
that contains an explanation and a story.
● Data could be visualized with various graphs such as bar charts, line
charts, pie charts, scatter charts, map charts, etc.
● Data visualization itself has several different types: linear, planar,
volumetric, temporal, multidimensional, and network data
visualization
30 /
3.8 Data Visualization (1)
There are also various open-source products that can help an enterprise
visualize insights from big data, such as:
● Pentaho
● Apache Zeppelin
● Apache Superset
● Metabase
● etc.
These products can connect to the Hadoop File System as well as
various database products commonly used in the market.
31 /
3.9 Security and Compliance
Some security approaches that might be applied to big data infrastructure
include [1]:
● encryption
● security as a service
● real-time monitoring
● privacy by design
● data protection and authorization
● log management
● authentication
● data anonymization
● secure communication line
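One of the approaches above, data anonymization, can be sketched as pseudonymization: replacing direct identifiers with a keyed hash so records can still be joined without storing raw identities. The key and function names are illustrative; in practice the key would live in a key-management system:

```python
import hashlib
import hmac

# Illustrative secret key; a real deployment would keep this in a
# key-management system and rotate it.
SECRET_KEY = b"rotate-me-regularly"

def pseudonymize(user_id: str) -> str:
    # Keyed HMAC-SHA256 resists simple dictionary reversal better
    # than a plain unsalted hash of the identifier.
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()

# The stored record keeps a stable pseudonym instead of the raw email.
record = {"user_id": pseudonymize("alice@example.com"), "purchase": 42.0}
```

The same input always maps to the same pseudonym, so analytics joins still work, while the raw identifier never enters the warehouse.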
32 /
3.9 Security and Compliance (1)
Enterprise security for big data [42]:
● NIST for Big Data
● ISO/IEC 20546:2019 for Big Data
● ISO/IEC 27001:2017
33 /
3.10 Small Medium Enterprise in
Big Data Era
● Based on the TOGAF standard, an enterprise could be: 1) a whole
corporation or a division of that corporation, 2) a government agency
or a single government department, 3) distributed organizations
linked together by common ownership but separated geographically, 4)
groups of countries or governments working together to create
common or shareable deliverables and partnerships, or 5) alliances of
businesses working together [8].
● A small-medium enterprise, meanwhile, is an enterprise defined by
annual work units, annual turnover, and annual balance sheet
according to the European Commission [43, 44]. The small-medium
enterprise is differentiated from the enterprise itself by the size of
these three indicators.
34 /
3.10 Small Medium Enterprise in
Big Data Era (1)
Based on prior research following the European Commission and
European Union standards [43, 44]:
● Annual work units >= 10 and < 50, with annual turnover and
annual balance sheet < 10 million euro -> small enterprise
● Annual work units >= 50 and < 250, with annual turnover and
annual balance sheet < 50 million euro -> medium enterprise
In that case, we could state that a small-medium enterprise has between
10 and 250 annual work units based on the European Commission and
European Union standards.
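The thresholds above can be written as a small classification helper; the function name and the "other" fallback label are illustrative, and the figures follow the slide's reading of the European Commission definition:

```python
# Sketch of the EU size thresholds as stated in the slide above.
def classify_enterprise(work_units: int, turnover_meur: float) -> str:
    # work_units: annual work units (roughly, headcount)
    # turnover_meur: annual turnover in millions of euro
    if 10 <= work_units < 50 and turnover_meur < 10:
        return "small"
    if 50 <= work_units < 250 and turnover_meur < 50:
        return "medium"
    return "other"
```

For example, a firm with 25 annual work units and 5 million euro turnover classifies as small, while one with 100 work units and 30 million euro turnover classifies as medium.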
35 /
3.10 Small Medium Enterprise in
Big Data Era (2)
Advantages of leveraging big data for small-medium enterprises [45]:
● learning patterns from past transactions
● combining internal data with external data to understand market
behaviour in order to gain competitive advantage and growth
● increasing product improvement and innovation against
competitors
36 /
4. Conclusion
Common big data architecture for small-medium enterprises could be
categorized into three components:
● design and architecture component: a small-medium enterprise could
leverage an enterprise architecture framework such as TOGAF, based
on this study review
● infrastructure component: it could consist of event sources, a message
queue, a data lake, extract-transform-load (ETL), a data warehouse,
and data visualization
● operational component: the study found that data mining
methodologies run on top of the infrastructure.
37 /
4. Conclusion (1)
● Using a methodology such as CRISP-DM, or another data mining
methodology, could produce better data mining results.
● Security and compliance could be included in the operational component,
since security and compliance form a life cycle for the sustainability of
the big data infrastructure and architecture against threats,
vulnerabilities, and risks.
● SMEs could utilize open-source products, at minimum cost, that could
be established as parts of the big data infrastructure, such as:
– Apache NiFi, Apache Hadoop, Apache Kafka, Apache Spark, Apache
Storm, Scribe, RabbitMQ, Apache Zeppelin, Metabase, PostgreSQL,
etc.
38 /
References
 
Mongodb intro-2-asbasdat-2018-v2
Mongodb intro-2-asbasdat-2018-v2Mongodb intro-2-asbasdat-2018-v2
Mongodb intro-2-asbasdat-2018-v2
 
Mongodb intro-2-asbasdat-2018
Mongodb intro-2-asbasdat-2018Mongodb intro-2-asbasdat-2018
Mongodb intro-2-asbasdat-2018
 
Mongodb intro-1-asbasdat-2018
Mongodb intro-1-asbasdat-2018Mongodb intro-1-asbasdat-2018
Mongodb intro-1-asbasdat-2018
 
Resftul API Web Development with Django Rest Framework & Celery
Resftul API Web Development with Django Rest Framework & CeleryResftul API Web Development with Django Rest Framework & Celery
Resftul API Web Development with Django Rest Framework & Celery
 
Memulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonMemulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan Python
 
Kisah Dua Sejoli: Arduino & Python
Kisah Dua Sejoli: Arduino & PythonKisah Dua Sejoli: Arduino & Python
Kisah Dua Sejoli: Arduino & Python
 
Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014
Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014
Mengenal Si Ular Berbisa - Kopi Darat Python Bandung Desember 2014
 
Modul pelatihan-django-dasar-possupi-v1
Modul pelatihan-django-dasar-possupi-v1Modul pelatihan-django-dasar-possupi-v1
Modul pelatihan-django-dasar-possupi-v1
 
Membuat game-shooting-dengan-pygame
Membuat game-shooting-dengan-pygameMembuat game-shooting-dengan-pygame
Membuat game-shooting-dengan-pygame
 

Recently uploaded

Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 

Recently uploaded (20)

Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

A Study Review of Common Big Data Architecture for Small-Medium Enterprise

  • 7. 1. Introduction (6)
    Big Data adoption (examples):
    1. Education: analytics for academic, learning, information technology and institutional information [6]
    2. Environmental technology: energy efficiency, sustainable farming and agriculture, smart cities, national strategies that do not harm the environment [5]
    3. Healthcare industry: improved pharmaceuticals, personalized patient care, medical device design, fraud detection on medical claims, preventive action for certain diseases based on genomic analysis [4]
  • 8. 1. Introduction (7)
    Big Data risks (examples) [1,2,3,6,7]:
    1. Security and privacy protection
    2. Data ownership and transparency
    3. Data quality
    4. Cost of implementation
    5. Infrastructure maintenance
    6. Developer failure
    7. Security systems and compliance with regulation
  • 9. 2. Methods
    1. Research planning
       – Study common big data architectures
       – Find an architecture that could be implemented by a small-medium enterprise
    2. Scoping the research
       – Study the big data architecture and its components only
       – This research does not cover the cost of implementing the big data architecture
    3. Data acquisition by reviewing a number of papers
       – ~42 papers were obtained as references for this study review
    4. Conclusion and recommendations
  • 10. 3. Result and Discussion
    1. Enterprise Architecture for Big Data Projects
    2. Event Sources
    3. Message Queue
    4. Data Lake
    5. Extract Transform Load (ETL)
    6. Data Warehouse
    7. Data Mining Methodology
    8. Data Visualization
    9. Security and Compliance
    10. Small-Medium Enterprise in the Big Data Era
  • 11. 3.1 Enterprise Architecture for Big Data Projects
    TOGAF phases [8]:
    1. Preliminary Phase
    2. Phase A: Architecture Vision
    3. Phase B: Business Architecture
    4. Phase C: Information Systems Architectures
    5. Phase D: Technology Architecture
    6. Phase E: Opportunities and Solutions
    7. Phase F: Migration Planning
    8. Phase G: Implementation Governance
    9. Phase H: Architecture Change Management
  • 12. 3.1 Enterprise Architecture for Big Data Projects (1)
    TOGAF for Big Data [9]:
    1. Has the potential to stabilize the implementation of a big data architecture
    2. Clear business goals
    3. Alignment with business requirements
    4. Clear planning
    5. Better recognition of project scope
    6. Better communication between stakeholders
    7. Better change management
    8. Improved focus on business opportunities
    TOGAF also has a Target Capability Model that could drive an oil and gas enterprise to build big data capabilities and migrate from its former data infrastructure to a big data infrastructure [10].
  • 13. 3.2 Event Sources
    Event sources for Big Data:
    1. Web applications [10]: clickstream behaviour, liked posts, shared posts, recommendations to friends
    2. Internet of Things (IoT) [11, 12, 13]: parking lot occupation detection and heat regulation at a university (390 GB), 10,000 sensors around an industrial area sending a dataset every 15 minutes, collecting Return Air Temperature (RAT) and Set Point Temperature (SPT) from sensors in the manufacturing sector
    3. External data sources [14]: Transaction Processing Performance Council dataset (TPC-DI)
    4. Mobile applications [15, 16]: geolocation, geo-tagged tweets, tweet analysis, time spent at tourism objects
  • 14. 3.3 Message Queue
    Some facts about message queues [17]:
    – Software that retains messages from producers for a certain period until they are processed by consumers
    – Organized by topic, and each topic has a partition key
    – There can be more than one consumer instance, allowing reliable processing of the messages
    – Acts as a buffer between the web application and data storage to prevent data loss while processing requests from clients
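The topic/partition/consumer ideas above can be sketched in miniature with Python's standard library. This is a toy stand-in for a real broker such as Apache Kafka, not its API; the class, topic name and message payloads are all illustrative:

```python
import queue

class MiniTopic:
    """Toy message topic: messages with the same key land in the same partition."""
    def __init__(self, name, partitions=3):
        self.name = name
        self.partitions = [queue.Queue() for _ in range(partitions)]

    def produce(self, key, value):
        # Partition by key hash, so one key's messages stay ordered together
        idx = hash(key) % len(self.partitions)
        self.partitions[idx].put((key, value))
        return idx

    def consume(self, partition_idx):
        # One consumer instance drains one partition
        out = []
        part = self.partitions[partition_idx]
        while not part.empty():
            out.append(part.get())
        return out

topic = MiniTopic("clickstream")
p = topic.produce("user-42", {"event": "page_view"})
topic.produce("user-42", {"event": "like_post"})
# Messages with the same key preserve their order within a partition
print(topic.consume(p))
```

A real broker adds durability, replication and consumer offsets on top of this basic shape.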
  • 15. 3.3 Message Queue (1)
    (diagram slide)
  • 16. 3.4 Data Lake
    What is a Data Lake [24]:
    – A form of massive data storage that collects raw datasets before they get further processing
    – Built on top of distributed file systems such as the Hadoop File System; the term was initially introduced by James Dixon
    – Could be implemented using open-source solutions such as the Hadoop File System from Apache Hadoop
    – An enterprise can build its data lake with the mentioned open-source products in its own environment, more cost-effectively than processing everything in a database
  • 17. 3.4 Data Lake (1)
    In addition [24]:
    – Datasets could be stored on the data lake in various formats such as CSV, Apache Avro, Apache ORC, text files, JSON, XML, etc.
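The "raw zone" of a data lake can be sketched with nothing but a filesystem. The local directory here stands in for HDFS or object storage, and the `dt=` partition layout, file names and event fields are illustrative assumptions:

```python
import json
from pathlib import Path

# Toy raw zone of a data lake, partitioned by date
lake = Path("lake/raw/clickstream/dt=2019-10-12")
lake.mkdir(parents=True, exist_ok=True)

# Land raw events exactly as received, one newline-delimited JSON file per batch
events = [{"user": "u1", "event": "page_view"},
          {"user": "u2", "event": "like"}]
(lake / "batch-0001.json").write_text(
    "\n".join(json.dumps(e) for e in events))

# Schema on read: structure is applied only when the data is read back
rows = [json.loads(line)
        for line in (lake / "batch-0001.json").read_text().splitlines()]
print(rows[0]["event"])
```

The point of the sketch is the key feature named on the next slide: the lake stores raw bytes cheaply, and any schema is the reader's problem.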
  • 18. 3.4 Data Lake (2)
    Data lakes have several key features, such as [25]:
    – Large-scale batch processing, schema on read, and the ability to store large data volumes at low cost
    – Can be accessed through SQL-like systems even when the dataset is not stored in a SQL format
    – Complex processing that can even apply machine learning operations
    – Stores raw datasets instead of a compact format such as SQL
    – Low cost for distributed processing
  • 19. 3.4 Data Lake (3)
    A data lake has core components such as [26]:
    – On the backend side:
      – catalog storage
      – batch job performance and scheduling
      – fault tolerance
      – garbage collection of metadata
    – On the frontend side:
      – dataset profile pages
      – dataset search
      – team dashboards
  • 20. 3.5 Extract Transform Load (ETL)
    What is ETL [14]:
    – ETL (Extract, Transform, Load) is the part of big data that converts raw datasets into cleansed datasets.
    – ETL is a set of tools that could help the enterprise perform real-time analytics and decisions.
    – ETL approaches can be divided into micro-batch, near real-time and streaming.
    – ETL is typically a combination of a scheduler and an ETL script.
  • 21. 3.5 Extract Transform Load (ETL) (1)
    What is ETL [14]:
    – The scheduler could be software such as cron jobs, while Apache Kafka could be used for the streaming approach.
    – The ETL script could be executed as distributed processes across a cluster of workers (e.g. using Apache Spark) or on a single node (e.g. using plain SQL or a programming language).
    – ETL could transform a dataset from SQL format into other text formats or vice versa.
    – Data formats supported by ETL technology include XML, CSV, JSON, Apache Avro, Apache ORC, etc.
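A minimal single-node ETL script of the kind described above might look as follows. The CSV content, column names and warehouse table are made up for illustration; in practice a scheduler such as cron would trigger this script against real files:

```python
import csv, io, sqlite3

raw_csv = "user,amount\nu1,10\nu2,not_a_number\nu1,5\n"

# Extract: read the raw CSV as delivered
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: cleanse by dropping records that fail validation
clean = []
for r in rows:
    try:
        clean.append((r["user"], int(r["amount"])))
    except ValueError:
        pass  # malformed amount, drop the record

# Load: write the cleansed dataset into a warehouse table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (user TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)", clean)
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 15
```

The same extract/transform/load shape scales out when the middle step runs on a cluster of Spark workers instead of a single loop.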
  • 22. 3.5 Extract Transform Load (ETL) (2)
    Data preprocessing is also part of the ETL phase and consists of [29]:
    – imperfect data handling
    – dimensionality reduction
    – instance reduction
    – discretization
    – imbalanced data handling
    – incomplete data handling
    Data preprocessing can play a part in improving the result of a machine learning model.
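Two of the preprocessing steps listed above, incomplete data handling and discretization, can be sketched in a few lines. The data values are invented for illustration, and mean imputation plus equal-width binning are just one simple choice for each step:

```python
values = [3.0, None, 7.0, 10.0, None, 4.0]

# Incomplete data handling: fill missing entries with the mean of observed values
observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)
imputed = [v if v is not None else mean for v in values]

# Discretization: map each value into one of 3 equal-width bins
lo, hi, bins = min(imputed), max(imputed), 3
width = (hi - lo) / bins
labels = [min(int((v - lo) / width), bins - 1) for v in imputed]
print(labels)  # [0, 1, 1, 2, 1, 0]
```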
  • 23. 3.6 Data Warehouse
    A data warehouse is [30]:
    – A denormalized database with generic information to answer management-level questions
    – A centralized data source that can supply data for developing the strategic plan of the enterprise, enabling better decisions based on the historical data it stores
    – A facility to perform online analytical processing (OLAP) over the data source to answer business needs
    – A facility that receives the cleansed datasets processed by the ETL part of the big data pipeline
    – A source for data mining tasks such as forecasting, classification, pattern recognition, clustering, etc.
  • 24. 3.6 Data Warehouse (1)
    In some cases, such as in the educational sector, the data warehouse has key roles [31]:
    – Feasibility assessment and data analysis of an educational system from different viewpoints
    – Making quick decisions to evaluate the educational system
    – Collecting a huge amount of data from several kinds of existing databases and unifying those sources into a single data source
    – Applying decision support system (DSS) techniques over the data warehouse
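The denormalized, OLAP-oriented shape described above can be illustrated with a toy star schema in SQLite. The table and column names are assumptions made for the example:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (date_id INTEGER, amount INTEGER);
INSERT INTO dim_date VALUES (1, 2018), (2, 2019);
INSERT INTO fact_sales VALUES (1, 100), (1, 50), (2, 200);
""")

# Roll up sales by year: the kind of aggregate a management dashboard asks for
rows = db.execute("""
    SELECT d.year, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.year ORDER BY d.year
""").fetchall()
print(rows)  # [(2018, 150), (2019, 200)]
```

Facts join to dimensions and get aggregated; the same query shape answers "by region", "by product" and so on as more dimension tables are added.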
  • 25. 3.7 Data Mining Methodology
    Data mining is one of the methodologies used on top of big data. Some known methodologies for performing better data mining processes are [32]:
    – KDD process
    – CRISP-DM
    – RAMSYS
    – DMIE
    – DMEE
    – ASUM-DM
    – AABA
    – etc.
  • 26. 3.7 Data Mining Methodology (1)
    KDD consists of several steps [32]:
    – selection
    – preprocessing
    – transformation
    – data mining
    – knowledge gain
  • 27. 3.7 Data Mining Methodology (2)
    The popular CRISP-DM has several steps that form a complete operation [32]:
    – business understanding
    – data understanding
    – data preprocessing
    – modeling
    – evaluation
    – deployment
  • 28. 3.7 Data Mining Methodology (3)
    CRISP-DM example use cases [32, 35, 36]:
    – In the retail sector, CRISP-DM is applied to a data mining process that uses association rule mining to predict sales patterns
    – Research on climatology has applied CRISP-DM and KDD in combination to improve the data mining result
    – Research on social media analysis has shown that CRISP-DM helped the data mining process give better results for favorite-TV-series classification using the Decision Tree algorithm
  • 29. 3.8 Data Visualization
    Visualization can deliver more engagement to the management level and other users, so they can understand the insights gained from big data [37].
    – A visualization could take the form of a web dashboard or a document containing an explanation and a story.
    – Data could be visualized with various graphs such as bar, line, pie, scatter and map charts.
    – Data visualization has several types: linear, planar, volumetric, temporal, multidimensional and network data visualization.
  • 30. 3.8 Data Visualization (1)
    There are also various open-source products that can help the enterprise visualize insights from big data, such as:
    – Pentaho
    – Apache Zeppelin
    – Apache Superset
    – Metabase
    – etc.
    These products can connect to the Hadoop File System as well as to various database products commonly used in the market.
  • 31. 3.9 Security and Compliance
    Some security approaches that might be applied to big data infrastructure [1]:
    – encryption
    – security as a service
    – real-time monitoring
    – privacy by design
    – data protection and authorization
    – log management
    – authentication
    – data anonymization
    – secure communication lines
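One of the approaches listed above, data anonymization, can be sketched as pseudonymization with a salted hash: the raw identifier never enters the lake, while the same user still groups together downstream. The salt value, field names and 16-character truncation are illustrative choices, not a vetted scheme:

```python
import hashlib

SALT = b"per-deployment-secret"  # assumption: kept outside the data lake

def pseudonymize(user_id: str) -> str:
    # Deterministic: the same input always yields the same pseudonym
    return hashlib.sha256(SALT + user_id.encode()).hexdigest()[:16]

event = {"user": "alice@example.com", "event": "page_view"}
event["user"] = pseudonymize(event["user"])
print(event["user"])  # a 16-character pseudonym, not the email address
```

Note that pseudonymization alone may not satisfy a given regulation; it is one building block alongside encryption, authorization and the other measures on the slide.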
  • 32. 3.9 Security and Compliance (1)
    Enterprise security standards for big data [42]:
    – NIST for Big Data
    – ISO/IEC 20546:2019 for Big Data
    – ISO/IEC 27001:2017
  • 33. 3.10 Small-Medium Enterprise in the Big Data Era
    – Based on the TOGAF standard, an enterprise can be considered as: 1) a whole corporation or a division of that corporation, 2) a government agency or a single government department, 3) distributed organizations linked together by common ownership and separated geographically, 4) groups of countries or governments working together to create common or shareable deliverables and partnerships, or 5) alliances of businesses working together [8].
    – A small-medium enterprise, according to the European Commission, is an enterprise defined by annual work units, annual turnover and annual balance sheet [43, 44]. A small-medium enterprise is thus a differentiation of the enterprise itself based on the size of these three indicators.
  • 34. 3.10 Small-Medium Enterprise in the Big Data Era (1)
    Based on prior research following the European Commission and European Union standards [43, 44]:
    – Annual work units per year >= 10 and < 50, annual turnover and annual balance sheet < 10 million euro -> small enterprise
    – Annual work units per year >= 50 and < 250, annual turnover and annual balance sheet < 50 million euro -> medium enterprise
    In that case, we could state that a small-medium enterprise has between 10 and 250 annual work units based on the European Commission and European Union standards.
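The thresholds above can be written down as a small classifier. This sketch encodes only the two bands quoted on the slide (turnover in millions of euro; the function name is ours):

```python
def classify_sme(work_units: int, turnover_meur: float) -> str:
    """Classify an enterprise using the EU SME bands quoted on the slide."""
    if 10 <= work_units < 50 and turnover_meur < 10:
        return "small"
    if 50 <= work_units < 250 and turnover_meur < 50:
        return "medium"
    return "outside the small-medium bands"

print(classify_sme(30, 5))    # small
print(classify_sme(120, 20))  # medium
```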
  • 35. 3.10 Small-Medium Enterprise in the Big Data Era (2)
    Advantages of leveraging big data for a small-medium enterprise [45]:
    – Learning patterns from past transactions
    – Combining them with external data to understand market behaviour, in order to gain competitive advantage and growth
    – Increasing product improvement and innovation against competitors
  • 36. 4. Conclusion
    A common big data architecture for small-medium enterprises can be categorized into three components:
    – Design and architecture component: a small-medium enterprise could leverage an enterprise architecture framework such as TOGAF, based on this study review
    – Infrastructure component: event sources, message queue, data lake, extract-transform-load (ETL), data warehouse and data visualization
    – Operational component: the study found that data mining methodology runs on top of the infrastructure
  • 37. 4. Conclusion (1)
    – Using a methodology such as CRISP-DM, or one of the other methodologies, can produce better data mining results.
    – Security and compliance can be included in the operational component, since security and compliance form a life cycle for the sustainability of the big data infrastructure and architecture against threats, vulnerabilities and risks.
    – An SME could utilize open-source products with minimal cost as parts of the big data infrastructure, such as:
      – Apache NiFi, Apache Hadoop, Apache Kafka, Apache Spark, Apache Storm, Scribe, RabbitMQ, Apache Zeppelin, Metabase, PostgreSQL, etc.