This document summarizes a study review of common big data architectures for small to medium enterprises. It finds that such architectures typically include three main components: 1) an enterprise design framework like TOGAF for planning and architecture, 2) core infrastructure including data sources, messaging queues, data lakes, ETL processes, data warehouses, and visualization tools, and 3) operational aspects like data mining and security/compliance practices running on top of the infrastructure. The study concludes that open source tools can help SMEs establish affordable big data solutions to gain competitive advantages from data-driven insights.
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
Ridwan Fadjar Septian, ridwanbejo@gmail.com, Master of Information Systems, Postgraduate Faculty, Universitas Komputer Indonesia.
Fajri Abdillah, clasense04@gmail.com, Senior Software Engineer, Horangi Cyber Security.
Tajhul Faijin Aliyudin, tajhulfaijin@gmail.com, Senior Software Engineer, Tado.live.
MSCEIS 2019
1. Introduction
● Big data is a set of facilities that process large-scale datasets in complex ways, beyond the traditional infrastructure that only works with small datasets [1].
● The growing number of users, mobile devices, and Internet of Things devices has become a factor that allows enterprises to gather more data from their users [1].
● Those large datasets could be analyzed and processed to support their businesses with distributed processing and large-scale storage [1].
1. Introduction (2)
Characteristics of Big Data [1]:
● Volume: the quantity of the dataset, from gigabytes to petabytes or more
● Variety: structured, semi-structured, and unstructured datasets
● Veracity: the quality of the dataset before and after data preprocessing
● Velocity: the frequency at which the dataset is collected from a number of event sources
● Value: useful information and patterns that might be converted into knowledge
● Variability: the dataset might be in a sparse format that needs preprocessing first
1. Introduction (3)
Big Data methodology [2]:
1. Acquisition of the dataset
2. Organization of the dataset
3. Analysis of the dataset
4. Decision making based on the analysis results
1. Introduction (4)
Big Data methodology [4]:
1. Data acquisition
2. Data storing
3. Data management
4. Data analysis
5. Data visualization
1. Introduction (5)
Advanced data analysis in Big Data [2,3,5,7]:
1. Regression learning
2. Classification
3. Association rule mining
4. Clustering
5. Forecasting
6. Deep learning
7. Natural language processing
1. Introduction (6)
Big Data adoption (examples):
1. Education: analytics for academic, learning, information technology, and institutional information [6]
2. Environmental technology: energy efficiency, sustainable farming and agriculture, smart cities, national strategy without harming the environment [5]
3. Healthcare industry: improved pharmaceuticals, personalized patient care, medical device design, fraud detection on medical claims, preventive action for certain diseases based on genomic analysis [4]
1. Introduction (7)
Big Data risks (examples) [1,2,3,6,7]:
1. Security and privacy protection
2. Data ownership and transparency
3. Data quality
4. Cost of implementation
5. Infrastructure maintenance
6. Developer failure
7. Security systems and compliance with regulations
2. Methods
1. Research planning
– Studying common architectures of big data
– Finding an architecture that could be implemented by small-medium enterprises
2. Scoping the research
– Studying the big data architecture and its components only
– This research does not cover the cost calculation for implementing the big data architecture
3. Data acquisition by reviewing a number of papers
– ~42 papers were obtained as references for this study review
4. Conclusion and recommendations
3. Results and Discussion
1. Enterprise Architecture for Big Data Project
2. Event Sources
3. Message Queue
4. Data Lake
5. Extract Transform Load (ETL)
6. Data Warehouse
7. Data Mining Methodology
8. Data Visualization
9. Security and Compliance
10. Small Medium Enterprise in Big Data Era
3.1 Enterprise Architecture for Big Data Project
TOGAF phases [8]:
1. Preliminary Phase
2. Phase A: Architecture Vision
3. Phase B: Business Architecture
4. Phase C: Information System Architectures
5. Phase D: Technology Architecture
6. Phase E: Opportunities and Solutions
7. Phase F: Migration Planning
8. Phase G: Implementation Governance
9. Phase H: Architecture Change Management
3.1 Enterprise Architecture for Big Data Project (1)
TOGAF for Big Data [9]:
1. Has the potential to stabilize the implementation of a big data architecture
2. Clear business goals
3. Alignment with business requirements
4. Clear planning
5. Better recognition of the project scope
6. Better communication between stakeholders
7. Better change management
8. Improved focus on business opportunities
TOGAF also has a Target Capability Model that could drive an oil and gas enterprise to build big data capabilities and to manage the migration from its former data infrastructure to a big data infrastructure [10].
3.2 Event Sources
Event sources for Big Data:
1. Web applications [10]: clickstream behaviour, liked posts, shared posts, recommendations to friends
2. Internet of Things (IoT) [11, 12, 13]: parking lot occupancy detection and heat regulation at a university (390 GB), 10,000 sensors around an industrial area that send a dataset every 15 minutes, and collecting Return Air Temperature (RAT) and Set Point Temperature (SPT) from sensors in the manufacturing sector
3. External data sources [14]: Transaction Processing Performance Council dataset (TPC-DI)
4. Mobile applications [15, 16]: geolocation, geo-tagged tweets, tweet analysis, time spent at tourism objects
3.3 Message Queue
Some facts about message queues [17]:
● software that retains messages from producers for a certain period until they are processed by consumers
● organized by topics, where each topic has a partition key
● consumers could run as more than one instance to process the messages reliably
● a buffer between the web application and data storage to prevent data loss while processing requests from clients
A minimal producer/consumer sketch is shown after this list.
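As an illustration of how a message queue decouples producers from consumers, here is a minimal sketch using the kafka-python client against a local Apache Kafka broker; the topic name, payload, and broker address are assumptions for the example.

```python
# Minimal Apache Kafka producer/consumer sketch (assumes kafka-python
# is installed and a broker is reachable at localhost:9092).
from kafka import KafkaProducer, KafkaConsumer

# Producer: the web application pushes events to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream-events", b'{"user_id": 42, "action": "liked_post"}')
producer.flush()

# Consumer: one of possibly many instances in a consumer group,
# which lets several workers process the topic reliably.
consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers="localhost:9092",
    group_id="etl-workers",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```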
3.4 Data Lake
What is a Data Lake [24]:
• a form of massive data storage that collects raw datasets before the datasets get further processing
• a concept on top of the Hadoop File System that was initially introduced by James Dixon
• could be built using open-source solutions such as the Hadoop File System from Apache Hadoop
• enterprises can build their data lake with the mentioned open-source products in their own environment, more cost-effectively than processing the data in a database
3.4 Data Lake (1)
Supported formats [24]:
• The dataset could be stored in the data lake in various formats such as CSV, Apache Avro, Apache ORC, plain text files, JSON, and XML. A short ingestion sketch follows.
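To make the landing step concrete, here is a minimal sketch using PySpark that stores a raw CSV dataset in the lake and keeps a columnar ORC copy; the HDFS paths and file names are assumptions for the example.

```python
# Minimal sketch: land a raw CSV file in the data lake and keep an
# ORC copy for cheaper scans (assumes PySpark and a reachable HDFS).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

# Read the raw dataset as-is (schema inferred on read).
raw = spark.read.option("header", True).csv("hdfs:///lake/raw/sales.csv")

# Store a columnar copy next to the raw zone; ORC is one of the
# formats mentioned above (Avro, JSON, etc. work similarly).
raw.write.mode("overwrite").orc("hdfs:///lake/curated/sales_orc")
```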
3.4 Data Lake (2)
Data lakes have several key features, such as [25]:
• large-scale batch processing, schema on read, and the ability to store large data volumes at low cost
• data could be accessed through SQL-like systems even when it is not stored in an SQL-format dataset (see the sketch below)
• complex processing that can even apply machine learning operations
• storing the raw dataset instead of a compact format such as an SQL format
• low cost for distributed processing
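The schema-on-read and SQL-like access features can be illustrated with Spark SQL, which lets a query run directly over files in the lake; the path, table, and column names here are assumptions.

```python
# Minimal schema-on-read sketch: run SQL over files in the lake
# without loading them into a database first (assumes PySpark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-sql").getOrCreate()

# Register the ORC files written earlier as a temporary view.
sales = spark.read.orc("hdfs:///lake/curated/sales_orc")
sales.createOrReplaceTempView("sales")

# Query with plain SQL even though the data never touched an RDBMS.
top = spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM sales
    GROUP BY product
    ORDER BY total DESC
    LIMIT 10
""")
top.show()
```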
3.4 Data Lake (3)
Data lakes have core components such as [26]:
• on the backend side:
– catalog storage
– batch job performance and scheduling
– fault tolerance
– garbage collection of metadata
• on the frontend side:
– dataset profile pages
– dataset search
– team dashboards
3.5 Extract Transform Load (ETL)
What is ETL [14]:
• ETL (Extract, Transform, Load) is the part of big data whose role is to convert a raw dataset into a cleansed dataset.
• ETL is a set of tools that could help the enterprise perform real-time analytics and make decisions.
• ETL could be divided into three approaches: micro-batch, near real-time, and streaming.
• ETL is a combination of a scheduler and an ETL script.
3.5 Extract Transform Load (ETL) (1)
What is ETL [14]:
• The scheduler could use software such as cron jobs, while Apache Kafka could be used for the streaming approach.
• The ETL script could be executed as distributed processes on a cluster of workers, such as with Apache Spark, or on a single node, such as with plain SQL or a programming-language approach.
• ETL could transform a dataset from an SQL format into another text format, or vice versa.
• Data formats supported by ETL technology include XML, CSV, JSON, Apache Avro, Apache ORC, etc. A minimal ETL script sketch follows.
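The scheduler-plus-script combination can be as small as a cron entry invoking a PySpark job. The following sketch, with assumed paths and column names, extracts a raw CSV, applies a cleansing transform, and loads the result into the warehouse staging zone.

```python
# Minimal ETL script sketch (assumes PySpark; run e.g. from cron:
#   0 2 * * * spark-submit /opt/etl/daily_sales.py
# paths and columns are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-etl").getOrCreate()

# Extract: read the raw dataset from the data lake.
raw = spark.read.option("header", True).csv("hdfs:///lake/raw/sales.csv")

# Transform: drop rows with missing keys, normalize types.
clean = (
    raw.dropna(subset=["order_id", "amount"])
       .withColumn("amount", F.col("amount").cast("double"))
       .withColumn("order_date", F.to_date("order_date"))
)

# Load: write the cleansed dataset to the warehouse staging zone.
clean.write.mode("overwrite").orc("hdfs:///warehouse/staging/sales")
```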
3.5 Extract Transform Load (ETL) (2)
Data preprocessing is also part of the ETL phase and consists of [29]:
• imperfect data handling
• dimensionality reduction
• instance reduction
• discretization
• imbalanced data handling
• incomplete data handling
Data preprocessing could play a part in improving the results of a machine learning model; a small sketch is shown below.
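As an illustration of two of these steps, incomplete data handling and discretization, here is a small pandas sketch on synthetic data; the column names and values are assumptions.

```python
# Minimal preprocessing sketch with pandas: impute missing values
# (incomplete data handling) and bin a numeric column (discretization).
import pandas as pd

df = pd.DataFrame({
    "age": [23, None, 41, 35, None, 52],
    "income": [1200, 800, None, 2400, 1500, 3100],
})

# Incomplete data handling: fill missing values with the column median.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Discretization: bucket income into three labeled ranges.
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])

print(df)
```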
3.6 Data Warehouse
A data warehouse is a kind of [30]:
• denormalized database that holds generic information to answer management-level questions
• centralized data source that could supply the data for developing the strategic plan of the enterprise, enabling better decisions based on the historical data it stores
• facility to perform online analytical processing (OLAP) over the data source to answer business needs (see the OLAP-style sketch below)
• facility that receives the cleansed dataset processed by the ETL part of the big data pipeline
• source for data mining tasks such as forecasting, classification, pattern recognition, clustering, etc.
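To give a flavor of an OLAP-style query over a denormalized warehouse table, here is a small pandas sketch that rolls sales up by region and quarter; the table and column names are assumptions.

```python
# Minimal OLAP-style rollup sketch with pandas: aggregate a
# denormalized fact table by two dimensions (region x quarter).
import pandas as pd

fact_sales = pd.DataFrame({
    "region":  ["West", "West", "East", "East", "West", "East"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "amount":  [100.0, 150.0, 80.0, 120.0, 60.0, 40.0],
})

# Pivot: one row per region, one column per quarter, summed amounts —
# the kind of cube slice a management-level question asks for.
cube = fact_sales.pivot_table(
    index="region", columns="quarter", values="amount", aggfunc="sum"
)
print(cube)
```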
3.6 Data Warehouse (1)
In some cases, such as in the educational sector, the data warehouse has key roles [31]:
• feasibility assessment and data analysis of an educational system from different viewpoints
• making quick decisions to evaluate the educational system
• collecting a huge amount of data from several kinds of existing databases and unifying those data sources into a single data source
• supporting decision support system (DSS) techniques over the data warehouse
3.7 Data Mining Methodology
Data mining is one of the methodologies applied over big data. There are several known methodologies for performing better data mining processes, such as [32]:
● KDD process
● CRISP-DM
● RAMSYS
● DMIE
● DMEE
● ASUM-DM
● AABA
● etc
3.7 Data Mining Methodology (1)
KDD has several steps that consist of [32]:
● selection
● preprocessing
● transformation
● data mining
● knowledge gain
3.7 Data Mining Methodology (2)
The popular CRISP-DM methodology has several steps that perform a complete operation [32]:
● business understanding
● data understanding
● data preprocessing
● modeling
● evaluation
● deployment
3.7 Data Mining Methodology (3)
CRISP-DM example use cases [32, 35, 36]:
● In the retail sector, CRISP-DM is applied to perform a data mining process that uses association rule mining to predict sales patterns.
● Research on climatology has also been conducted by applying CRISP-DM and KDD in combination to improve the data mining results.
● Other research on social media analysis has shown that CRISP-DM helped the data mining process produce better results for favorite TV series classification by applying the Decision Tree algorithm. A small modeling-step sketch follows.
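As a taste of the CRISP-DM modeling and evaluation steps, here is a minimal scikit-learn sketch training a Decision Tree classifier, the algorithm mentioned in the TV series study; the data is synthetic, not from the cited research.

```python
# Minimal CRISP-DM modeling/evaluation sketch (assumes scikit-learn).
# Synthetic features stand in for the prepared dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data preparation output: features X and labels y.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Modeling: fit a Decision Tree, as in the TV series classification example.
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Evaluation: check accuracy before any deployment decision.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```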
3.8 Data Visualization
Visualization could deliver more engagement to the management level or other users, so they can understand the insights gained from the big data [37]:
● Visualization could take the form of a web dashboard or a document that contains an explanation and a story.
● Data could be visualized with various charts such as bar charts, line charts, pie charts, scatter charts, map charts, etc. (a minimal example follows this list).
● Data visualization itself has several different types: linear, planar, volumetric, temporal, multidimensional, and network data visualization.
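For illustration, here is a minimal matplotlib sketch of a bar chart like one a dashboard might render; the regions, numbers, and unit are made up for the example.

```python
# Minimal bar chart sketch (assumes matplotlib); the numbers are
# illustrative placeholders for an insight from the warehouse.
import matplotlib.pyplot as plt

regions = ["West", "East", "North", "South"]
revenue = [310.0, 240.0, 180.0, 220.0]

plt.bar(regions, revenue)
plt.title("Revenue by region")
plt.xlabel("Region")
plt.ylabel("Revenue (million IDR)")
plt.tight_layout()
plt.savefig("revenue_by_region.png")  # or plt.show() in an interactive session
```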
3.8 Data Visualization (1)
There are also various open-source products that can help the enterprise visualize their insights from big data, such as:
● Pentaho
● Apache Zeppelin
● Apache Superset
● Metabase
● etc.
These products have capabilities to connect with the Hadoop File System as well as with various database products commonly used in the market.
3.9 Security and Compliance
Some security approaches that might be applied to a big data infrastructure include [1] (a small anonymization sketch follows the list):
● encryption
● security as a service
● real-time monitoring
● privacy by design
● data protection and authorization
● log management
● authentication
● data anonymization
● secure communication line
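One of the listed approaches, data anonymization, can be sketched with Python's standard library by pseudonymizing a direct identifier before the record enters the lake; the salt and field names are assumptions.

```python
# Minimal data anonymization sketch (standard library only):
# pseudonymize a direct identifier with a salted hash so records
# can still be joined without exposing the raw email address.
import hashlib

SALT = b"replace-with-a-secret-salt"  # assumption: kept outside the lake

def pseudonymize(value: str) -> str:
    """Return a stable, non-reversible token for a direct identifier."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

record = {"email": "user@example.com", "action": "liked_post"}
record["email"] = pseudonymize(record["email"])
print(record)  # the raw email never reaches the data lake
```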
3.9 Security and Compliance (1)
Enterprise security for big data [42]:
● NIST for Big Data
● ISO/IEC 20546:2019 for Big Data
● ISO/IEC 27001:2017
3.10 Small Medium Enterprise in Big Data Era
● Based on the TOGAF standard, an enterprise could be considered as: 1) a whole corporation or a division of that corporation, 2) a government agency or a single government department, 3) distributed organizations linked together by common ownership and separated geographically, 4) groups of countries or governments working together to create common or shareable deliverables and partnerships, or 5) alliances of businesses working together [8].
● A small-medium enterprise, meanwhile, is an enterprise defined by annual work units, annual turnover, and annual balance sheet, according to the European Commission [43, 44]. A small-medium enterprise is a differentiation of the enterprise itself based on the size of these three indicators.
3.10 Small Medium Enterprise in Big Data Era (1)
Based on prior research, according to the European Commission and European Union standards [43, 44]:
● Annual work units per year >= 10 and < 50, with annual turnover and annual balance sheet < 10 million euro -> small enterprise
● Annual work units per year >= 50 and < 250, with annual turnover and annual balance sheet < 50 million euro -> medium enterprise
In that case, we could state that a small-medium enterprise has between roughly 10 and 250 annual work units, based on the European Commission and European Union standards. A small classification sketch follows.
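The thresholds above can be encoded as a tiny classification helper; this is a simplified reading of the criteria cited in [43, 44], with the function name and boundary handling as assumptions.

```python
# Simplified SME size classifier based on the thresholds above
# (annual work units; turnover and balance sheet in million euro).
def classify_enterprise(work_units: int, turnover_meur: float,
                        balance_meur: float) -> str:
    """Return 'small', 'medium', or 'other' per the cited thresholds."""
    if 10 <= work_units < 50 and turnover_meur < 10 and balance_meur < 10:
        return "small"
    if 50 <= work_units < 250 and turnover_meur < 50 and balance_meur < 50:
        return "medium"
    return "other"  # micro and large enterprises fall outside this sketch

print(classify_enterprise(30, 5.0, 4.0))     # small
print(classify_enterprise(120, 30.0, 25.0))  # medium
```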
3.10 Small Medium Enterprise in Big Data Era (2)
Advantages of leveraging big data for a small-medium enterprise [45]:
● learning its patterns from past transactions
● combining internal data with external data to understand market behaviour in order to gain competitive advantage and growth
● increasing product improvement and innovation against competitors
4. Conclusion
A common big data architecture for small-medium enterprises could be categorized into three components:
● a design and architecture component: a small-medium enterprise could leverage an enterprise architecture framework such as TOGAF, based on this study review
● an infrastructure component: it could consist of event sources, a message queue, a data lake, extract-transform-load (ETL), a data warehouse, and data visualization
● an operational component: the study found that the data mining methodology runs on top of the infrastructure
4. Conclusion (1)
● Using a methodology such as CRISP-DM, or one of the other methodologies, could produce better data mining results.
● Security and compliance could be included in the operational component, since security and compliance form a life cycle that sustains the big data infrastructure and architecture against threats, vulnerabilities, and risks.
● SMEs could utilize open-source products, at minimum cost, that could be established as part of the big data infrastructure, such as:
– Apache Nifi, Apache Hadoop, Apache Kafka, Apache Spark, Apache Storm, Scribe, RabbitMQ, Apache Zeppelin, Metabase, PostgreSQL, etc.