An overview of datonixOne, a new, evolutionary and effective Data Preparation Platform.
We introduce a new disruptive technology into the Data Management space: the Data Scanner.
Using the Data Scanner, Data Science is more accurate and feasible.
datonixOne is a perfect Satellite of any Enterprise Data Hub.
Product overview 6.0 v.1.0
1. datonix™ and QueryObject™ are registered trademarks
April 2016
datonix
pragmatic self service data preparation
Product outlook
Datonix introduces an interesting technology
into the BigData space
It is a proprietary fractal-like algorithm that converts raw input data
into a smart, lean and fast file:
the QueryObject data set
In addition, the technology materializes in the QueryObject data set the
results of any custom-defined data processing
The datonix fractal technology is the core of datonixOne, a
product designed to offer disruptive data preparation features.
Executive summary
where life follows data
Why we made datonix
What datonix does
Where datonix is used
Annex
Summary of Features
Details of Data Preparation Steps
Index
Technology Scenario: the information firewall
Why datonix
The information firewall will fall where
the IT division evolves into separate
“I” and “T” organizations
[Figure: the data management (Tech) side and the data exploitation (Info) side, separated by the information firewall. Labels in the diagram: Data Warehouse, Legacy, DMP, SAP, Local DB, Hadoop BigData, CRM, OSS, Web-app; Data Science, Business Intelligence, Data Journalism; Data Unification, Dark Data, Data Integration.]
Data Preparation?
It is the next big thing in the data management space
It is the answer for Data Scientists who need to self-prepare “Ready Data” for their activities
It is the answer for end-users asking for performance and agility
Why datonix
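[Figure: chart with a low-to-high vertical axis and a low-to-high horizontal axis labelled Competence. Labels in the chart: Oriented to Perform, Oriented to Agility, Oriented to Comfort, Utility technology.]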
From Big to Complex
Assumption:
N algorithmic linear logic isn’t adequate for a complete data analysis
Perspective:
Nm real time logic is required to understand more and to act faster
Data Complex Issue:
How to blend many and large data components
Why datonix
[Figure: chart of Size (small to large) against Data Components (few to many), with areas labelled Simple, Big, Diversified and Complex.]
Bottom line
To accelerate the “Information Firewall Fall”, most companies are investing in MDM, Enterprise Data Hub or Data Lake implementations
In parallel, the Data Preparation market is going to be hot simply because
- there is a shortage of Data Scientists
- unique Data Preparation methodologies and tools are required for Business Analysts, Data Scientists and Data Journalists
datonix has been designed to support everyone from the “in pectore” (aka citizen) data scientist to the most expert data scientists
Thanks to its fractal engine, datonix can offer real self-service data preparation capabilities for the end-user
Why datonix
Mission
Datonix's primary product is datonixOne, a Self Service Data Preparation solution
Using datonixOne it is possible to prepare SMART Ready Data: the QueryObject data set
Once prepared, with a few clicks, the QueryObject can be used as
· a Web 2.0 data service;
· an analytic web-app;
· a first class narrative analytics & report;
· an ODBC set of tables.
What datonix does
DatonixOne is a Data Scanner
The Data Scanner Engine quickly converts raw data into a new file type named QueryObject
A QueryObject contains scanned raw data, end-user defined data processing outputs, and raw data projections
It is read only, binary portable, compressed, secure, and fast
In the QueryObject, raw data sets are linked with their projections in real time
The fractal nature of the QueryObject internals ensures scalable data blending and data unification
What datonix does
QueryObject data set build example
Data source
name   city    DateBirth
Aldo   Miami   11/1/90
Sara   NYC     12/2/89
Anna   Rome    1.1.68
Sara   NYC     31-1-61

Fractal conversion: Map
rowId  Nname  Ncity
1      1      1
2      2      2
3      3      3
4      2      2

Fractal conversion: Dictionary
Key    Value   NValue
Name   Aldo    1
Name   Sara    2
Name   Anna    3
City   Miami   1
…      …       …

Transform DateBirth: Luggage
DateBirth  UDateB    Age
11/1/90    1/11/90   26
12/2/89    2/12/89   26
1.1.68     1/1/68    48
31-1-61    1/31/61   56

Add Geo classification: hierarchy
Ncity  city    state
1      Miami   Fl
2      NYC     NY
3      Rome    Italy

Other labels in the diagram: Data complex, Storage group
What datonix does
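The Map and Dictionary tables of this example can be reproduced with plain dictionary encoding. The Python sketch below is only an illustration under that assumption: it is not the proprietary datonix fractal algorithm, and the function name `encode` and the row layout are hypothetical. Each distinct value of a field receives the next integer code the first time it is seen, and every source row is rewritten as a row of codes.

```python
# Illustrative dictionary encoding; NOT the proprietary datonix algorithm.
# The names below (encode, rows, map_rows) are hypothetical.
rows = [
    {"name": "Aldo", "city": "Miami"},
    {"name": "Sara", "city": "NYC"},
    {"name": "Anna", "city": "Rome"},
    {"name": "Sara", "city": "NYC"},
]


def encode(rows, fields):
    """Return (map_rows, dictionary) using first-seen integer codes per field."""
    dictionary = {field: {} for field in fields}  # field -> {value: code}
    map_rows = []
    for row_id, row in enumerate(rows, start=1):
        encoded = {"rowId": row_id}
        for field in fields:
            codes = dictionary[field]
            # A value gets the next free code the first time it appears.
            encoded["N" + field] = codes.setdefault(row[field], len(codes) + 1)
        map_rows.append(encoded)
    return map_rows, dictionary


map_rows, dictionary = encode(rows, ["name", "city"])
for r in map_rows:
    print(r["rowId"], r["Nname"], r["Ncity"])
```

With the four sample rows, the printed Map rows are 1 1 1, 2 2 2, 3 3 3 and 4 2 2, and repeated values such as Sara and NYC are stored only once in the dictionary.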
Connect, Build & Combine QueryObject data set
Datonix is an efficient Data Lake Satellite: it can connect to any source for any load at maximum speed, and execute custom data processing
What datonix does
[Figure: architecture components. QCL Server (QueryObject Communication Layer); Scanner Engine; DFS (Datonix File System); SRM (Stream Resource Manager); Fractal Engine; QO Joiner engine; lookup engine; Metadata Repository (source data descriptions; transformation, filter and cleansing snippets; external hierarchies).]
How to use QueryObject data set
Once a QueryObject has been registered for networking, in a few clicks it can be transformed into online web objects or into a materialized data set ready to be moved and used elsewhere
What datonix does
[Figure: publishing components. DFS (Datonix File System); Chart engine; Metadata Repository (external hierarchies, query commands); ODBC Connector; Download Center; QOhpIO; Rest Provider; Oracle Transparent Gateway; Services ini.]
The packaging and the business model
What datonix does
Today, datonixOne is delivered on premise in
four editions:
- Personal
- Server
- Enterprise
- OEM
The Server edition is also commercialized through our SaaS (Software as a Service) Plan.
With the SaaS Plan, customers can have datonixOne for a small deposit and an attractive monthly rate.
The general availability of the Cloud edition will be announced soon.
Standard product maintenance is included in the price; a three-year datonixCare extension plan for certain services can be subscribed.
Bottom Line
datonixOne performs sophisticated processing of the given data to create an object-based data set named the QueryObject data set. It contains cold detailed data and hot analytic information.
In general, the main strengths of datonixOne are as follows:
- It scales well with the number of input data rows.
- It supports all regular aggregates and the Distinct Count Measure (DCM).
- It handles dimensional hierarchies externally to the QueryObject.
- It supports SMP-based parallel data processing.
- It supports high-performance incremental or change data capture (CDC) updates.
- It uses several compression techniques to reduce the data footprint.
- It allows the creation of partial and/or optimized views of data.
- It is a disconnected active data store.
- Its ideal usage is to provide keyback from hot data selections to the related cold data.
- It is based on a Grid architecture, which supports federated queries and querying in a multi-user environment.
- It is a cost-efficient solution.
In many cases it will work efficiently because the requirements of many potential applications are met by datonix capabilities
Where the user’s pain is
If the analytic requirements are not clear upfront or not fully structured, preparation can be long and costly
Raw Data are not connected with Analytics, and in order to review or adjust the user’s data it is necessary to cycle back to Data Integration
Dark Data don’t go back to the DMP level
Narrative reporting is not handled
Typical data interfaces to R, Python, Advanced Analytics and BI tools are slow and tricky
Why to use
Datonix remedy
datonixOne connects to and scans any kind of data source.
Datonix automatically keeps raw data connected to analytics. Using datonix, end-users can develop their own data processing, resulting in “Schema on Read” structures that can be easily combined with dark data, external data and Cloud data.
The datonixOne Publisher is a cloud component designed to make data available for distribution over a network.
Once data is registered in Publisher, the following services are automatically made available:
• An Excel worksheet
• A high performance data or info graphic service
• A dynamic report in ppt, doc, xlsx or pdf format
• An active Big Data dashboard
• A DBMS like data table
Application of datonix
Thanks to its unique features and/or performance capabilities, datonix has been used in a variety of applications across several industries.
Cloud migrations, Spending Review, Revenue Assurance and OSS data collection, processing and movement are the most common.
The product has been commercialized directly or through partners to Telcos, Media, Public Administration, Manufacturing and R&D Institutes.
Where used
Agile Methodology
SPUR
START
• Goals definition
• Data Collection
• Selection of the Data Science
• Preparation Estimo
PREPARE
• QueryObject Set up
• Teaser Set up
USE
• Implement the Data Science
• Determine the KPIs
• Implement the communication
RUN
• Implement Pre and Post mining Agents
• Implement the automation
• Ongoing run
SPUR is the result of more than 6,000 preparation implementations executed directly by datonix or through our certified professionals.
Most of the processes are supported by specific GUIs, called Voyagers, and a wide collection of Microsoft Office add-ins.
Measurable Estimo
The datonix Estimo database is continuously reviewed and made available to the datonix certified professional network in order to ensure:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
SPUR
Customer case: CPM
Type of Business: Very large multinational enterprise with annual Supply Chain contracts in excess of 9B.
Customer Pain / Issue: Producing the monthly Corporate Supply Chain Tableau de Bord [TdB] and Spending Review was lengthy [more than 9 man/days of processing], prone to errors, and resulted in a huge report more than 180 pages long in order to provide the required drill-down views. Data inconsistencies arose from different data download procedures from the main data repositories. Spreadsheets and final reports sent to stakeholders were cumbersome.
Initial Conditions: Each month the supply chain auditing team fetched data from multiple SAP systems; the data were then processed locally in an Access DB, manually rectified, sent to each category manager for further control and then consolidated into Excel spreadsheets used to compile the TdB and Spending Review Report. The data cleaning instructions were stored each month in a lengthy Word file together with the original and adjusted data in Excel spreadsheets.
Solution Implemented: One datonix m200 for data collection and preparation and one datonix m200 platform for publishing.
Obtained Results: One single data collection and consolidation platform for all corporate divisions. The effort now required to assemble the TdB is less than 4 man/days. Lean reports in PDF format with richer data content and drill-through capabilities down to detail data. The resulting TdB is concise enough to be delivered to the CEO, yet offers plentiful in-depth capabilities to serve departmental analysis. Reports are provided to users as a web service and on multiple devices. Completely reshaped content and homogeneous report format. Automatic generation and archiving of the adjustments data report for reference.
Competitive Advantages: datonix external hierarchies hugely simplified data classification and easily accommodated consistent data comparison between different time frames [month over previous month or previous year]. Richer data content thanks to overcoming the Excel row limits. The possibility to quickly reflect organizational changes in the report data while still maintaining meaningful comparison with previous data.
Savings: Greatly reduced man/day effort required to collect, rectify and aggregate data. 50% of the human resources assigned to reporting were moved to other duties. One data collection system is no longer necessary. Indirect savings through better spending control.
Additional Benefits: Narrative reports are also accessible from mobile devices through a web app. Automatic creation of reports with different levels of detail for diverse needs. Efficient and lean data distribution to internal users, who can use Excel for local analysis while still maintaining overall data consistency. Phase-out of a system used by Finance to consolidate data coming from subsidiaries abroad. The same system is used for forecast and what-if analysis.
Where used
Customer case: Data Warehouse Optimization
Type of Business: Top Tier Telco
Customer Pain / Issue: Excessive storage needs and poor performance in an important Trend Analysis OSS system [network alarm data, network performance data, network inventory data, etc.], which was nevertheless not mission-critical enough to have access to vast human and financial resources.
Initial Conditions: The system collects data from several heterogeneous sources and systems to build data models that feed reports. The data warehouse required two daily restarts due to the critical volumes of data; this increased the effort required and degraded system performance while not satisfying business needs.
Solution Implemented: Two datonix scanner Enterprise edition, one for the collection of the alarm data coming from the different technologies.
Obtained Results: The original database has been freed from historical data storage, greatly improving its performance and removing the need for daily restarts. Data acquisition and data presentation have been decoupled, so the reporting services no longer become momentarily unavailable when the data acquisition system is down. Very fast and efficient reports. Faster root cause analysis. Closer control over network trunks, with beneficial impacts on HR as well.
Competitive Advantages: More effective data availability that translates into better predictive analysis. Improved performance and new functionality without replacing existing infrastructure, thus protecting CAPEX. Seamless integration between aggregate information and detail data, so that costly and time-consuming data “reverse engineering” is no longer necessary. The same system is used for high-level top management reporting and for operational analysis.
Savings: Greatly reduced man/day effort required to collect, rectify and aggregate data. No more downtime. No need to replace the data acquisition infrastructure. Automatic generation of reports instead of manual creation. Indirect savings through better control of the network.
Additional Benefits: Dynamic reports for management focusing on KPIs not previously attainable. Weekly Smart Network analysis report generated automatically [with the previous system it was not economically feasible to collect and process those data]. Efficient data distribution to internal users, who can use Excel for local analysis while still maintaining overall data consistency. Automatic creation of a Downtime Report that spans from highly aggregated information to finely granular data and previously required dozens of man/days of effort to prepare.
Where used
Customer case: SMART METERING
Type of Business: Large Utility
Customer Pain / Issue: Data coming from smart meters were only collected manually once a week, making it impossible to monitor the devices efficiently and actively correlate malfunctions with root causes. Fast and effective troubleshooting was key to positively affecting the company's bottom line in a heavily regulated market. KPIs were needed to monitor the quality of the metering process and to gauge the implementation of improvement actions, together with a finer granularity of the analysed data.
Initial Conditions: Meter data were collected in an Oracle database, from which they were extracted once a week and manually aggregated to produce simple and hasty reports with a 1.5 man/day effort. The data volume and the weekly cadence of the extraction made a deep analysis of the meter network performance impractical.
Solution Implemented: One datonix m100 virtual machine for the acquisition of the readings from their business customers (50k meters).
Obtained Results: Completely automated daily data collection with a historical depth of 13 months of data.
Competitive Advantages: The same system is used for high-level top management reporting and for operational analysis. The possibility to continue using Excel for operational reports, with increased data availability because datonix removes the row limit. Easy access to a holistic view of the whole network, with the added possibility to analyse one year of full network data locally at granular level. The datonix keyback feature natively connects aggregate data to the actual records for easier analysis of equipment malfunctions.
Savings: Greatly reduced man/day effort required to collect, rectify and aggregate data. Proactive maintenance, driven by better and more timely data, led to fewer outages and reduced ticket costs.
Additional Benefits: Highly scalable solution. During the project lifecycle the customer increased the frequency of the readings and doubled the number of meters with no impact on performance. The customer is widening the scope of the project to include the mass market as well. The ability to add further data models without impact on performance allowed the introduction of more KPIs than originally planned, thus improving the overall quality of the monitoring process.
Where used
Solution case: the Hadoop Data Scanner
Hadoop is perfect for supporting simple queries on very large data sets.
When more than simple full table scans are needed, it is recommended to complement Hadoop with an additional technology.
Since database technology is expensive and shows inherent limitations, and the Hadoop software stack is conceived to leave data where they are, i.e. in HDFS, modern solutions for running Hadoop data processes are emerging (e.g. Spark, Flink, etc.).
But what happens when data must be moved outside HDFS to run data processes remotely or, vice versa, when remote users need to move large data sets into HDFS?
Requirement
Issues and high costs of data delivery make untenable the traditional solution of transferring and integrating remote data resources by hauling the data, in bulk, back to an IT Enterprise Data Hub and processing it there. Worse still, the more distributed the organization, the more serious the impact of large-scale data movement will be. Transferring Hadoop's raw data to remote destinations, blending those data with Remote Dark Data, extracting business intelligence there, and sending back huge volumes of adjusted data should solve the problem. However, this strategy would require maintaining expensive data movements and an IT architecture that matches the distributed nature of the Remote Data.
Solution
With DumbOne, datonix™ offers new perspectives for Hadoop data movement and remote data processing.
DumbOne plugs into HDFS, reads data and natively shares memory segments with the datonix™ ultrafast data-scan processes. Once converted into the QueryObject™ format, data sets are ultra-compressed and ready to be exchanged with remote datonixOne servers. This way the existing IT-centric architecture can be complemented with an ideal “pure grid” satellite, datonixOne, so that it is as non-disruptive as possible, saves operating costs, and increases performance. Using datonix™ Voyager it is easy and fast to set up connections to data stored in HDFS, so that developers don’t lose their focus fighting Java memory errors, garbage collection, file fragmentation, cluster contention and tedious cluster performance tuning.
Where used
QOhpDG Summary of features
To access Data Sources, several components are available in 5.10.10; 6.0.1 components are listed in red:
File System: Driver to access local csv, Flat, XML, Json, RDF and binary files
http/sftp: Remote File Systems can be accessed using the http/sftp protocols
ODBC: Local ODBC sources are accessed using unixODBC; remote ODBC sources require installation of the source’s ODBC client
Oracle: The Oracle OCI driver is pre-installed
DumbOne: An NFS mount of a Cloudera HDFS and the Hive driver are pre-installed; python Pandas is pre-installed as well
Cloudy: DropBox and Gdrive drives can be mounted into the datonix File System
SRM: Using the Stream Resource Manager, http stream sources can be accessed
All QOhpDG drivers can benefit from automatic or static source partitioning for parallel load
Summary of Features
Fractal Engine Summary of features
Row-id: A primary key is automatically added to every row of the scan
User field: Processing on original data. Any operation on Date, String and Numeric fields is available. Special processing logic can be user-defined using the C language. A wide library of C transformations is available.
Filters: Include / exclude filters during any build process step.
Csv lookup: Left, Right, Inner and Outer Join at data set load time
Hierarchies: Fields can be organized in hierarchies. External files containing hierarchies can be dynamically added to QueryObject fields
MultiThread: Load, Normalization and Data Complex build processes can be parallelized
Load strategy: Loads can be processed in append mode and in change data capture mode
Compression: QueryObject Numbers and Strings can be compressed
Count Measures: Regular, distinct, exclusive distinct and intersection counts can be preprocessed
ECC: Command line Engine Control Command
WebECC: Rest based ECC wrapper
Security: QueryObject data sets can be password protected; content encryption can be user-defined
Summary of Features
Blend Engine Summary of features
Union: Two or more QueryObject data sets can be virtually unified in a union. The union can be inner (only the common columns are in the union) or outer (all the columns participate in the union, with Null in the columns that do not belong to a given QueryObject in the union). The order of the columns doesn’t matter to the union. Hierarchies can be applied or inherited.
Grid Union: Same as above but over the network. Each QueryObject remains where it is, but the Union is network registered.
Merge: Two or more QueryObject data sets can be consolidated, aka merged, in batch.
Join: Two or more QueryObject data sets can be joined at high speed.
Csv Join: One QueryObject can be joined with a csv file
Oracle TG: One or more QueryObjects can be linked to an Oracle database. Oracle will consider the QueryObject as an external ODBC connection.
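The difference between the inner and the outer Union can be sketched in a few lines of Python. This is an illustration of the column semantics only, under stated assumptions: data sets are modelled as lists of dicts, `None` stands in for Null, the function name `union` is hypothetical, and the result is materialized here whereas the product describes a virtual union.

```python
def union(datasets, how="inner"):
    """Union one or more data sets (lists of dict rows); column order is irrelevant."""
    # Collect the set of columns used by each data set.
    column_sets = [set().union(*(row.keys() for row in ds)) for ds in datasets]
    if how == "inner":
        columns = set.intersection(*column_sets)  # only the common columns
    elif how == "outer":
        columns = set.union(*column_sets)         # every column of every data set
    else:
        raise ValueError("how must be 'inner' or 'outer'")
    ordered = sorted(columns)
    # row.get() yields None (Null) for columns a data set does not have.
    return [{c: row.get(c) for c in ordered} for ds in datasets for row in ds]


# Hypothetical sample data, invented for illustration.
first = [{"name": "Aldo", "city": "Rome", "amount": 10}]
second = [{"city": "Miami", "name": "Sara", "state": "Fl"}]

inner_rows = union([first, second], how="inner")
outer_rows = union([first, second], how="outer")
print(inner_rows)
print(outer_rows)
```

In the inner result only `city` and `name` survive; in the outer result `amount` and `state` are kept and filled with `None` where a data set has no such column.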
Summary of Features
Datonix File System (DFS) Summary of features
DFS is the file access method of datonixOne. It contains the QueryObject data sets and the mounts to data sources.
DFS Manager: The DFS Web GUI, optimized for Mozilla, runs on every browser. Any operation (copy, archive, download, upload, etc.) on a DFS object can be executed in one click.
Space: datonix ensures a 14x compression factor on the archive space of complex data structures, so for example the m100 with 1 TB of physical space ensures 14 TB of space for data in QueryObject format.
HDFS: DFS can be linked in write mode to HDFS, thereby ensuring unlimited space
Cloudy: DFS can be synchronized with Gdrive and DropBox
Summary of Features
QOhp Interface Option Summary of features
Rest: Support of both NoSql and Sql query languages. Produces csv, json, Excel, xml and html grid output
Chart: Professional data visualization language. It supports both swf and html5 output. More than 300 base charts are available. A world map is available as well
ODBC: Full support of standard ODBC clients. Hive and MySql are preinstalled. Extension for Distinct Count and advanced second-order statistics
OCI: Full support of the Oracle OCI driver
Download Center: Support of QO downloads, links to Rest interfaces, custom web-app external links
Blocks: Full Wordpress & Joomla support
Summary of Features
Agile Methodology
SPUR
START
• Goals definition
• Data Collection
• Selection of the Data Science
• Preparation Estimo
PREPARE
• QueryObject Set up
• Teaser Set up
USE
• Implement the Data Science
• Determine the KPIs
• Implement the communication
RUN
• Implement Pre and Post mining Agents
• Implement the automation
• Ongoing run
SPUR is the result of more than 6,000 preparation implementations executed directly or through our certified professionals.
Most of the processes are supported by specific GUIs, called Voyagers, and a wide collection of Microsoft document macros.
Data Preparation steps
Using datonix Voyager the following steps are implemented in a recursive way
SPUR
Data cleaning: Fill in missing values, smooth noisy data, identify or remove outliers, and reconcile inconsistencies
Data transformation: User fields, User Measures, Normalization and aggregation
Data reduction: Obtains a representation that is reduced in volume but produces the same or similar analytical results. Discretization of numeric fields
How does Voyager handle Missing Data?
Data is not always available (e.g., many tuples have no recorded value for several attributes, such as customer income in sales data). Missing data may be due to
• equipment malfunction
• data that were inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being registered
Using datonix tools, data can be reviewed and fixed with versioning, resulting in a log of the actions below:
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious, and possibly infeasible
• Use a global constant to fill in the missing value: e.g., “unknown” (effectively a new class?)
• Use the attribute mean to fill in the missing value
• Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
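Three of the listed actions (ignore the tuple, fill with a global constant, fill with the attribute mean) can be sketched in plain Python. This is a minimal illustration, not the Voyager implementation: the function names are hypothetical, and a missing value is assumed to be represented as `None`.

```python
from statistics import mean

MISSING = None  # assumption: a missing value is represented as None


def ignore_tuples(rows, field):
    """Drop every row (dict) whose `field` is missing."""
    return [row for row in rows if row.get(field) is not MISSING]


def fill_constant(values, constant="unknown"):
    """Replace missing values with one global constant."""
    return [constant if v is MISSING else v for v in values]


def fill_mean(values):
    """Replace missing values with the mean of the recorded values."""
    known = [v for v in values if v is not MISSING]
    if not known:  # nothing recorded: leave the column untouched
        return list(values)
    attribute_mean = mean(known)
    return [attribute_mean if v is MISSING else v for v in values]


# Hypothetical customer incomes, invented for illustration.
income = [30, None, 50, None, 40]
filled_income = fill_mean(income)
filled_city = fill_constant(["NYC", None])
kept = ignore_tuples([{"label": "a"}, {"label": None}], "label")
print(filled_income, filled_city, kept)
```

Here the two missing incomes are replaced by 40, the mean of the recorded values 30, 50 and 40. Filling with the most probable value would need a fitted model and is left out of this sketch.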
Data Cleaning
Noisy Data
Noise is random error or variance in a measured variable that results in incorrect attribute values, due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions
Other data problems that require data cleaning are duplicate records, incomplete data and inconsistent data
Using datonix tools, noisy data can be treated with:
• Binning method: first sort the data and partition it into (equi-depth) bins, then smooth by bin means, by bin medians, by bin boundaries, etc.
• Clustering: detect and remove outliers
• Combined computer and human inspection: detect suspicious values and have a human check them
• Regression: smooth by fitting the data to regression functions
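As an illustration of the regression option, the sketch below smooths a series by replacing each value with the value predicted by a least-squares straight line. The choice of a simple linear fit and the function name are assumptions made for this example; the source does not say which regression functions the datonix tools use.

```python
def smooth_by_linear_regression(xs, ys):
    """Fit y = intercept + slope * x by least squares and return the fitted values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)  # must be non-zero (xs not all equal)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]


# Hypothetical noisy readings around the line y = 2x.
xs = [1, 2, 3, 4]
noisy = [2.1, 3.9, 6.2, 7.8]
smoothed = smooth_by_linear_regression(xs, noisy)
print(smoothed)
```

Values that lie far from the fitted line are also natural candidates for the combined computer and human inspection step.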
Data Cleaning
Binning & Smoothing
Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size: uniform grid
• If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B-A)/N, but outliers may dominate the presentation and skewed data is not handled well.
Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately the same number of samples. Good data scaling
• Managing categorical attributes is no longer tricky.
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
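The price example can be checked with a short Python sketch. The function names are hypothetical, bin means are rounded to whole dollars as in the example, and the equi-depth split assumes the number of values is divisible by the number of bins.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted


def equi_depth_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins holding the same number of samples."""
    size = len(sorted_values) // n_bins  # assumes an exact division
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value with the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's lowest and highest value."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]


def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of width W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins  # assumes the values are not all equal
    bins = [[] for _ in range(n_bins)]
    for v in values:
        index = min(int((v - a) / width), n_bins - 1)  # highest value goes in the last bin
        bins[index].append(v)
    return bins


depth_bins = equi_depth_bins(prices, 3)
print(depth_bins)
print(smooth_by_means(depth_bins))
print(smooth_by_boundaries(depth_bins))
print(equal_width_bins(prices, 3))
```

For comparison, equal-width partitioning of the same prices uses W = (34 - 4) / 3 = 10 and yields bins of 3, 3 and 6 values, which shows how uneven the bin sizes can become when the data is skewed.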
Data Cleaning
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
Data Transformation
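min-max normalization: $v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
z-score normalization: $v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$
normalization by decimal scaling: $v' = \frac{v}{10^{j}}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$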
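The three normalizations listed on this slide (min-max, z-score and decimal scaling) can be written directly in Python. This is a sketch under stated assumptions: the function names are hypothetical, the z-score uses the population standard deviation (the slide does not say which one is meant), and decimal scaling searches for j starting from 0.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    m, s = mean(values), pstdev(values)  # assumes s > 0
    return [(v - m) / s for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(min_max([4, 19, 34]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```

For the values -986 and 917, decimal scaling needs j = 3, because dividing by 100 would still leave 9.86, which is not below 1.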
Data Reduction Strategies
A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
• Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
• Data cube aggregation
• Dimensionality reduction
• Numerosity reduction
• Discretization and concept hierarchy generation
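Data cube aggregation, the first strategy in the list, can be illustrated with a small group-and-sum in Python. The sample rows, the field names and the function name are invented for this sketch; it only shows the idea that aggregated rows are fewer than the detail rows while totals stay the same.

```python
from collections import defaultdict


def aggregate(rows, keys, measure):
    """Sum `measure` over all rows that share the same values for `keys`."""
    totals = defaultdict(float)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += row[measure]
    return dict(totals)


# Hypothetical detail rows, invented for illustration only.
sales = [
    {"city": "Miami", "year": 2015, "amount": 10.0},
    {"city": "Miami", "year": 2015, "amount": 5.0},
    {"city": "NYC", "year": 2015, "amount": 7.0},
    {"city": "NYC", "year": 2016, "amount": 3.0},
]

by_city_year = aggregate(sales, ["city", "year"], "amount")
by_city = aggregate(sales, ["city"], "amount")
print(by_city_year)
print(by_city)
```

Four detail rows become three rows at the (city, year) level and two rows at the city level, and the grand total of 25.0 is preserved at every level.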
Data Reduction
Steps that follow preparation
Once the QueryObject data sets have been prepared, the following steps are performed:
- Data Exploration or Data Modeling
- Results interpretation and KPIs definition
- Where needed, ingest new data or implement simulations
- Implement the communication of results and implications to decision makers
Then the ongoing run will include
- The update of the QueryObjects and of the communication
- The knowledge base implementation
- Implementation and monitoring effectiveness
SPUR
Summary
Data preparation is a big issue
Data preparation includes
• Data cleaning
• Data reduction and feature selection
• Discretization
Many methods (190) have been developed in datonix, but this is still an active area of research
SPUR