datonix™ and QueryObject™ are registered trademarks
April 2016
datonix
pragmatic self service data preparation
Product outlook
Datonix introduces an interesting technology into the BigData space.
It is a proprietary fractal-like algorithm that converts raw input data into a smart, lean and fast file: the QueryObject data set.
In addition, the technology materializes the results of any custom-defined data processing in the QueryObject data set.
The datonix fractal technology is the core of datonixOne, a product designed to offer disruptive data preparation features.
Executive summary
where life follows data
Why we made datonix
What datonix does
Where datonix is used
Annex
Summary of Features
Details of Data Preparation Steps
Index
Why we made datonix
Technology Scenario: the information firewall
Why datonix
The information firewall will fall where the IT division evolves into separate "I" and "T" organizations
[Diagram] The data management (Tech) side: Data Warehouse, Legacy, DMP, SAP, Local DB, Hadoop BigData, CRM, OSS, Web-app.
The data exploitation (Info) side: Data Science, Business Intelligence, Data Journalism, Data Unification, Dark Data, Data Integration.
Data Preparation?
It is the next big thing in the data management space.
It is the answer for Data Scientists that need to self-prepare "Ready Data" for their activities.
It is the answer for end-users asking for performance and agility.
Why datonix
[Chart] Technology positioning by Competence (low to high) and Utility (low to high): oriented to Perform, oriented to Agility, oriented to Comfort.
From Big to Complex
Assumption: N algorithmic linear logic isn't adequate for a complete data analysis.
Perspective: Nm real-time logic is required to understand more and to act faster.
Data Complex issue: how to blend many and large data components.
Why datonix
[Chart] From Simple to Big (by Size, small to large) and from Diversified to Complex (by number of Data Components, few to many).
Bottom line
To accelerate the "Information Firewall Fall", most companies are investing in MDM, Enterprise Data Hub or Data Lake implementations
In parallel, the Data Preparation market is going to be hot simply because
- there is a shortage of Data Scientists
- unique Data Preparation methodologies and tools are required for Business Analysts, Data Scientists and Data Journalists
datonix has been designed to support everyone from the "in pectore", aka citizen, data scientist to the most expert data scientists.
Thanks to its fractal engine datonix can offer real self service data preparation capabilities to the end-user.
Why datonix
What datonix does
Mission
Datonix's primary product is datonixOne, a Self Service Data Preparation solution
Using datonixOne it is possible to prepare SMART Ready Data: the QueryObject data set
Once prepared, with a few clicks, the QueryObject can be used as
· a Web 2.0 data service;
· an analytic web-app;
· a first class narrative analytics & report;
· an ODBC set of tables.
What datonix does
DatonixOne is a Data Scanner
the Data Scanner Engine quickly converts raw data into a new file type named QueryObject
the QueryObject contains scanned raw data, end-user defined data processing outputs, and raw data projections
it is read only, binary portable, compressed, secure, and fast
in the QueryObject, raw data sets are linked with their projections in real time
the fractal nature of the QueryObject internals ensures scalable data blending and data unification
What datonix does
QueryObject data set build example
Data source (name | city | DateBirth):
Aldo | Miami | 11/1/90
Sara | NYC | 12/2/89
Anna | Rome | 1.1.68
Sara | NYC | 31-1-61

Fractal conversion → Map Dictionary (Key | Value | NValue):
Name | Aldo | 1
Name | Sara | 2
Name | Anna | 3
City | Miami | 1
… | … | …

Data complex / Storage group (rowId | Nname | Ncity):
1 | 1 | 1
2 | 2 | 2
3 | 3 | 3
4 | 2 | 2

Transform DateBirth (DateBirth | UDateB | Age):
11/1/90 | 1/11/90 | 26
12/2/89 | 2/12/89 | 26
1.1.68 | 1/1/68 | 48
31-1-61 | 1/31/61 | 56

Add Geo classification → Luggage hierarchy (Ncity | city | state):
1 | Miami | Fl
2 | NYC | NY
3 | Rome | Italy

What datonix does
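The mapping above is essentially a dictionary encoding of the raw values. A minimal Python sketch of that idea (only an illustration, not the actual datonix fractal algorithm) could look like this:

```python
# Minimal sketch: dictionary-encode a raw table into a value dictionary
# plus an integer "storage group", as in the example above.
rows = [
    {"name": "Aldo", "city": "Miami"},
    {"name": "Sara", "city": "NYC"},
    {"name": "Anna", "city": "Rome"},
    {"name": "Sara", "city": "NYC"},
]

dictionary = {}      # key -> {value: integer code}
storage_group = []   # one encoded row per raw row

for row_id, row in enumerate(rows, start=1):
    encoded = {"rowId": row_id}
    for key, value in row.items():
        codes = dictionary.setdefault(key, {})
        # assign the next code the first time a value is seen
        encoded["N" + key] = codes.setdefault(value, len(codes) + 1)
    storage_group.append(encoded)

print(dictionary)     # {'name': {'Aldo': 1, 'Sara': 2, 'Anna': 3}, 'city': {...}}
print(storage_group)  # [{'rowId': 1, 'Nname': 1, 'Ncity': 1}, ...]
```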
Product architecture
What datonix does
Connect, Build & Combine QueryObject data set
Datonix is an efficient Data Lake satellite: it can connect to any source, load at maximum speed, and execute custom data processing
What datonix does
[Diagram] Main components: QCL Server (QueryObject Communication Layer), Scanner Engine, DFS (Datonix File System), SRM (Stream Resource Manager), Fractal Engine, QO Joiner Engine, Lookup Engine, and a Metadata Repository holding source data descriptions, transformation/filter/cleansing snippets, and external hierarchies.
How to use QueryObject data set
Once a QueryObject has been registered for networking, in a few clicks it can be transformed into online web objects or into a materialized data set ready to be moved and used elsewhere
What datonix does
[Diagram] Publishing components: DFS (Datonix File System), Chart Engine, Metadata Repository (external hierarchies, query commands, services ini), ODBC Connector, Download Center, QOhpIO Rest Provider, Oracle Transparent Gateway.
The packaging and the business model
What datonix does
Today, datonixOne is delivered on premise in
four editions:
- Personal
- Server
- Enterprise
- OEM
The Server edition is also commercialized
using our SaaS (Software as a Service) Plan.
Using the SaaS Plan customers can have
datonixOne with a small deposit and an
attractive monthly rate.
The general availability of the Cloud edition will be announced soon.
Standard product maintenance is included in the price; a three-year datonixCare extension plan for certain services can be subscribed.
Bottom Line
datonixOne performs a sophisticated processing of the given data to create an object-based dataset named the QueryObject data set. It contains cold detailed data and hot analytic information.
In general, the main strengths of datonixOne are as follows:
- It scales well with the number of input data rows.
- It supports all regular aggregate and Distinct Count Measures (DCM)
- It handles dimensional hierarchies externally to the QueryObject
- It supports SMP-based parallel data processing
- It supports high-performance incremental or CDC updates
- It uses several compression techniques to reduce data footprint.
- It allows creating partial and/or optimized views of the data
- It is a disconnected active data store
- Its ideal usage is to provide keyback from hot data selections to the related cold data
- It is based on a Grid architecture which supports federated queries and querying in a multi-user environment.
- It is a cost-efficient solution.
In many cases it will work efficiently because many potential application requirements find answers in datonix capabilities
Why to use datonix
Where is the user's pain
If the analytic requirement is not clear upfront or not fully structured, preparation can be long and costs high
Raw Data are not connected with Analytics, and in order to review/adjust the user's data it is necessary to cycle back to Data Integration
Dark Data doesn't go back to DMP level
Narrative reporting is not handled
Typical data interfaces to R, Python, Advanced Analytics and BI tools are slow and tricky
Why to use
Datonix remedy
datonixOne connects and scans any kind of data source.
Datonix automatically keeps raw data connected to analytics. Using datonix, end-users can self-develop data processing, resulting in "Schema on Read" structures that can be easily combined with dark data, external data and Cloud data.
The datonixOne Publisher is a cloud component designed to make data available for distribution over a network. Following the registration of data in the Publisher, the following services are automatically made available:
• An Excel worksheet
• A high performance data or info graphic service
• A dynamic report in ppt, doc, xlsx or pdf format
• An active Big Data dashboard
• A DBMS like data table
Application of datonix
Thanks to its unique features and performance capabilities, datonix has been used in a variety of applications across several industries.
Cloud migrations, Spending Review, Revenue Assurance and OSS data collection, processing and movement are the most common.
The product has been commercialized directly or through partners to Telcos, Media, Public Administration, Manufacturing and R&D Institutes.
Where used
Agile Methodology
SPUR
START
• Goals definition
• Data Collection
• Selection of the Data Science
• Preparation Estimo
PREPARE
• QueryObject set up
• Teaser set up
USE
• Implement the Data Science
• Determine the KPIs
• Implement the communication
RUN
• Implement Pre and Post mining Agents
• Implement the automation
• Ongoing run
SPUR is the result of more than 6,000 preparation implementations executed directly by datonix or through our certified professionals.
Most of the processes are supported by specific GUIs, called Voyagers, and a wide collection of Microsoft Office add-ins.
Measurable Estimo
The Datonix Estimo database is continuously reviewed and made available to the datonix certified professional network in order to ensure:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
SPUR
Customer case: CPM
Type of Business Very large multinational enterprise with annual Supply Chain contracts in excess of 9B.
Customer Pain /Issue: The making of the monthly Corporate Supply Chain Tableau de Bord [TdB] and Spending Review was lengthy [more than 9 man/days of processing], prone to errors and produced a huge report more than 180 pages long in order to provide the required drill down views. Data inconsistencies arose from different data download procedures from the main data repositories. Spreadsheets and final reports sent to stakeholders were cumbersome.
Initial Conditions The supply chain auditing team fetched data monthly from multiple SAP systems; the data were then processed locally in an Access DB, manually rectified, sent to each category manager for further control and then consolidated into Excel spreadsheets used for the compilation of the TdB and Spending Review Report. The data cleaning instructions were stored each month in a lengthy Word file together with the original and adjusted data in Excel spreadsheets.
Solution Implemented One datonix m200 for data collection and preparation and one datonix m200 platform for publishing.
Obtained Results One single data collection and consolidation platform for all corporate divisions. Effort now required to assemble the TdB is less
than 4 man/days. Lean reports in PDF format with richer data content and drill through capabilities up to detail data. The obtained TdB has enough
synthesis to be delivered to the CEO, but at the same time plentiful in-depth capabilities to serve departmental analysis. Reports provided to users
as a web service and on multiple devices. Completely reshaped content and homogeneous reports format. Automatic generation and archiving of
adjustments data report for reference.
Competitive Advantages datonix external hierarchies hugely simplified data classification and easily accommodated for a consistent data
comparison between different time frames [month over previous month or previous year]. Richer data content due to the overcoming of the Excel
row limits. The possibility to quickly reflect organizational changes into the report data, still maintaining meaningful comparison with previous data.
Savings Greatly reduced man/day effort required to collect, rectify and aggregate data. Displaced 50% of human resources assigned to the
reporting to other duties. One data collection system no longer necessary. Indirect savings by better spending control.
Additional Benefits Narrative reports are accessible also from mobile devices through a web app. Automatic creation of reports with different detail
depths for diverse needs. Efficient and lean data distribution to internal users who can use Excel to do local analysis still maintaining overall data
consistency. Phase out a system used by Finance to consolidate data coming from subsidiaries abroad. Same system used for forecast and what-if
analysis.
Where used
Customer case: Data Warehouse Optimization
Type of Business Top Tier Telco
Customer Pain /Issue: Excessive storage needs and poor performance in an important Trend Analysis OSS system [network alarm data, network performance data, network inventory data, etc.], but not such a mission critical project as to have access to vast human and financial resources.
Initial Conditions The system collects data from several heterogeneous sources and systems to build data models that feed reports. The
datawarehouse required two daily restarts due to the critical volumes of data and this impacted effort required and system performances while not
satisfying business needs.
Solution Implemented Two datonix scanner Enterprise editions, one for the collection of the alarm data coming from the different technologies.
Obtained Results The original database has been freed from the historical data storage, thus greatly improving its performance and removing the need for daily restarts. Data acquisition and data presentation have been uncoupled, thus avoiding the momentary unavailability of the reporting services if the data acquisition system was down. Very fast and efficient reports. Faster root cause analysis. Closer control over network trunks with beneficial impacts on HR as well.
Competitive Advantages More effective data availability that converts into better predictive analysis. Improve performances and add new
functionalities without replacing existing infrastructure thus protecting CAPEX. Seamless integration between aggregate information and detail data
so that costly and time consuming data “reverse engineering” is no longer necessary. The same system is used for high-level top management
reporting and for operative analysis.
Savings Greatly reduced man/day effort required to collect, rectify and aggregate data. No more downtime. No need to replace the data acquisition infrastructure. Automatic generation of reports instead of manual creation. Indirect savings by better control of the network.
Additional Benefits Dynamic reports for the management focusing on KPIs not previously attainable. Weekly Smart Network analysis report
generated automatically [with the previous system it was not economically feasible to collect and process those data]. Efficient data distribution to
internal users who can use Excel to do local analysis still maintaining overall data consistency. The automatic creation of a Downtime Report that
spans from highly aggregated information to finely granular data that previously required dozens of man/days effort to prepare.
Where used
Customer case: SMART METERING
Type of Business Large Utility
Customer Pain /Issue: Data coming from smart meters were only collected manually once a week, making it impossible to efficiently monitor the devices and actively correlate malfunctioning with root causes. Fast and effective troubleshooting was key to positively affect the bottom line of the company in a heavily regulated market. KPIs were needed to monitor the quality of the metering process and to gauge the implementation of improvement actions in conjunction with a finer granularity of the analysed data.
Initial Conditions Meter data were collected in an Oracle database, from which they were extracted once a week and manually aggregated to produce simple and hasty reports with a 1.5 man/day effort. The data volume and weekly cadence of the extraction made a deep analysis of the meter network performance worthless.
Solution Implemented One datonix m100 virtual machine for the acquisition of the readings from their business customers (50k meters).
Obtained Results Completely automated daily data collection with an historical depth of 13 months of data.
Competitive Advantages The same system is used for high-level top management reporting and for operative analysis. The possibility to continue to use Excel for operational reports with increased data availability due to the removed row limit provided by datonix. Easy access to a holistic view of the whole network with the added possibility to analyse locally one year of full network data at granular level. The datonix keyback feature natively connects aggregate data to the actual records for easier analysis of equipment malfunctioning.
Savings Greatly reduced man/day effort required to collect, rectify and aggregate data. Proactive maintenance, sprung from better and more timely data, yielded fewer outages and reduced ticket costs.
Additional Benefits Highly scalable solution. During the project lifecycle the customer increased the frequency of the readings and doubled the
number of meters with no impact on performance. Customer is widening the scope of the project to include mass market as well. The capability of
adding further data models without impact on performance allowed the introduction of more KPIs than originally planned, thus improving the overall
quality of the monitoring process
Where used
Solution case: the Hadoop Data Scanner
Hadoop is perfect for supporting simple queries on very large data sets.
When more than simple full table scans are needed, it is recommended to complement Hadoop with an additional technology.
Since the use of a database technology is expensive and shows inherent limitations, and since the Hadoop software stack is conceived to leave data where they are, i.e. in HDFS, modern solutions to run Hadoop data processes (e.g. Spark, Flink, etc.) are emerging today.
But what happens when data must be moved outside HDFS to run data processes remotely or, vice versa, remote users need to move large data sets into HDFS?
Requirement
Issues and high costs of data delivery make untenable the traditional solution of transferring and integrating remote data resources by hauling the data, in bulk, back to an IT Enterprise Data Hub and processing it there. Worse still, the more distributed the organization, the more serious the impact of large scale data movement will be. Transferring Hadoop's raw data to remote destinations, blending those data with remote Dark Data, extracting business intelligence there, and sending back huge volumes of adjusted data would solve the problem. This strategy, however, would require maintaining expensive data movements, and an IT architecture that matches the distributed nature of the remote data.
Solution
With DumbOne, datonix™ offers new perspectives for Hadoop data movement and remote data processing.
DumbOne plugs into HDFS, reads data and natively shares memory segments with the datonix™ data-scan ultrafast processes. Once converted into the QueryObject™ format, data sets are ultra-compressed and ready to be exchanged with remote datonixOne servers. This way the existing IT-centric architecture can be complemented with an ideal "pure grid" satellite, the datonixOne, so that it's as non-disruptive as possible, saves operations costs, and increases performance. Using the datonix™ Voyager it's easy and fast to set up connections to data stored in HDFS, so that developers don't lose their focus fighting Java memory errors, garbage collection, file fragmentation, cluster contention and annoying cluster performance tuning.
Where used
Summary of features
Product architecture
Summary of Features
QOhpDG Summary of features
To access Data Sources, several components are available in 5.10.10 (6.0.1 components are listed in red):
File System: driver to access local csv, Flat, XML, Json, RDF, binary
http/sftp: remote File Systems can be accessed using the http/sftp protocol
ODBC: local ODBC sources are accessed using UnixODBC; remote ODBC sources require the source's ODBC client installation
Oracle: the Oracle OCI driver is pre-installed
DumbOne: NFS mount of a Cloudera HDFS and a Hive driver are pre-installed; python Pandas is pre-installed as well
Cloudy: DropBox and Gdrive drives can be mounted into the datonix File System
SRM: http stream sources can be accessed using the Stream Resource Manager
All QOhpDG drivers can benefit from automatic or static source partitioning for parallel load
Summary of Features
Fractal Engine Summary of features
Row-id: a primary key is automatically added to every row of the scan
User field: processing on original data. Any operation on Date, String, and Numeric fields is available. Special processing logic can be user defined using the C language. A wide library of C transformations is available.
Filters: include / exclude filters during any build process step.
Csv lookup: Left, Right, Inner and Outer Join at dataset load time
Hierarchies: fields can be organized in hierarchies; external files containing hierarchies can be dynamically added to QueryObject fields
MultiThread: Load, Normalization and Data Complex build processes can be parallelized
Load strategy: loads can be processed in append mode and change data capture mode
Compression: QueryObject Numbers and Strings can be compressed
Count Measures: regular, distinct, exclusive distinct and intersection counts can be preprocessed
ECC: command line Engine Control Command
WebECC: Rest based ECC wrapper
Security: the QueryObject data set can be password protected; content encryption can be user defined
Summary of Features
Blend Engine Summary of features
Union: two or more QueryObject data sets can be virtually unified in a union. The union can be inner (only the common columns will be in the union) or outer (all the columns will participate in the union, resulting in Null in the columns not belonging to a given QueryObject in the union). The order of the columns doesn't matter to the union. Hierarchies can be applied or inherited.
Grid Union: same as the above but over the network. Each QueryObject remains where it is but the union will be network registered.
Merge: two or more QueryObject data sets can be consolidated, aka merged, in batch.
Join: two or more QueryObject data sets can be joined at high speed.
Csv Join: one QueryObject can be joined with a csv file
Oracle TG: one or more QueryObjects can be linked to an Oracle database; Oracle will consider the QueryObject as an external ODBC connection.
Summary of Features
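The inner/outer union semantics described above can be illustrated with a small pandas sketch (pandas is used here only as a stand-in; this is not the datonix Blend Engine):

```python
import pandas as pd

# Two toy "QueryObject-like" tables with partially overlapping columns.
qo1 = pd.DataFrame({"name": ["Aldo", "Sara"], "city": ["Miami", "NYC"]})
qo2 = pd.DataFrame({"name": ["Anna"], "state": ["Italy"]})

# Inner union: only the common columns survive.
inner_union = pd.concat([qo1, qo2], join="inner", ignore_index=True)

# Outer union: all columns participate, with NaN where a column
# does not belong to one of the inputs.
outer_union = pd.concat([qo1, qo2], join="outer", ignore_index=True)

print(inner_union)  # only 'name'
print(outer_union)  # 'name', 'city', 'state'
```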
Datonix File System (DFS) Summary of features
DFS is the file access method of datonixOne. It contains the QueryObject data sets and mounts to data sources.
DFS Manager: the DFS Web GUI, optimized for Mozilla, runs on every browser. In one click any operation (copy, archive, download, upload, etc.) on a DFS object can be executed.
Space: datonix ensures a 14x compression factor on complex data structures archive space, so for example the m100 with 1 TB of physical space ensures 14 TB of space for data in QueryObject format.
HDFS: DFS can be write-linked to HDFS, this way ensuring unlimited space
Cloudy: DFS can be synchronized with Gdrive and DropBox
Summary of Features
QOhp Interface Option Summary of features
Rest Support of both NoSql and Sql query languages. Produces csv, json, Excel, xml, html grid output
Chart Professional data visualization language. It supports both swf and html5 output. More than 300 base charts are available. A world map is available as well
ODBC Full support of ODBC standard clients. Preinstalled Hive and MySql. Extensions for Distinct Count and advanced II order statistics
OCI Full support of the Oracle OCI driver
Download Center Support of QO downloads, links to Rest Interfaces, custom web-app external links
Blocks Full Wordpress & Joomla support
Summary of Features
Data Preparation Steps
Agile Methodology
SPUR
START
• Goals definition
• Data Collection
• Selection of the Data Science
• Preparation Estimo
PREPARE
• QueryObject set up
• Teaser set up
USE
• Implement the Data Science
• Determine the KPIs
• Implement the communication
RUN
• Implement Pre and Post mining Agents
• Implement the automation
• Ongoing run
SPUR is the result of more than 6,000 preparation implementations executed directly or through our certified professionals.
Most of the processes are supported by specific GUIs, called Voyagers, and a wide collection of Microsoft document macros.
Data Preparation steps
Using the datonix Voyager the following steps are implemented in a recursive way:
SPUR
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and reconcile inconsistencies
Data transformation: user fields, user measures, normalization and aggregation
Data reduction: obtains a reduced representation in volume but produces the same or similar analytical results; discretization of numeric fields
How does Voyager handle Missing Data?
Data is not always available (i.e. many tuples have no recorded value for several attributes, such as customer income in sales data). This may be due to
• equipment malfunction
• data inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being registered
Using the datonix tools, data can be reviewed and fixed with respect to versioning, resulting in a log of the actions below:
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification; not effective when the percentage of missing values per attribute varies considerably)
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
Data Cleaning
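As a rough illustration of three of the fill strategies listed above (drop the tuple, global constant, attribute mean), a small pandas sketch could look like this; the column names are invented for the example:

```python
import pandas as pd

# Toy table with missing values in a numeric and a categorical column.
df = pd.DataFrame({"income": [55000, None, 61000, None],
                   "segment": ["retail", "retail", None, "corporate"]})

dropped   = df.dropna()                                 # ignore the tuple
constant  = df.fillna({"segment": "unknown"})           # global constant
mean_fill = df.fillna({"income": df["income"].mean()})  # attribute mean

print(dropped, constant, mean_fill, sep="\n\n")
```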
Noisy Data
Noisy data are random errors or variance in a measured variable that result in incorrect attribute values, due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
Other data problems which require data cleaning are duplicate records, incomplete data and inconsistent data
Using datonix tools, Noisy data can be treated with:
• Binning method: first sort data and partition into (equi-depth) bins, then smooth by bin means, smooth by
bin median, smooth by bin boundaries, etc.
• Clustering: detect and remove outliers
• Combined computer and human inspection: detect suspicious values and check by human
• Regression: smooth by fitting the data into regression functions
Data Cleaning
Binning & Smoothing
Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size: a uniform grid
• If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N, but outliers may dominate the presentation and skewed data is not handled well.
Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately the same number of samples. Good data scaling.
• Managing categorical attributes is no longer tricky.
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
Data Cleaning
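The equi-depth binning and smoothing example above can be reproduced with a short Python sketch (a minimal illustration, not the Voyager implementation):

```python
# Equi-depth binning of the sorted prices, then smoothing by bin means
# and by bin boundaries, matching the example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(bins)           # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```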
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
Data Transformation
Min-max normalization: $v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A$
Z-score normalization: $v' = \frac{v - mean_A}{stand\_dev_A}$
Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
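A short numpy sketch of the three normalization formulas, applied to an invented toy vector (the new_min/new_max values are just example choices):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 917.0])

# Min-max normalization to [new_min, new_max].
new_min, new_max = 0.0, 1.0
min_max = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization.
z_score = (v - v.mean()) / v.std()

# Decimal scaling: smallest j such that max(|v'|) < 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal_scaled = v / 10 ** j

print(min_max, z_score, decimal_scaled, sep="\n")
```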
Data Reduction Strategies
Warehouse may store terabytes of data: Complex data analysis/mining may take a very long time to run
on the complete data set
Data reduction
• Obtains a reduced representation of the data set that is much smaller in volume but yet
produces the same (or almost the same) analytical results
Data reduction strategies
• Data cube aggregation
• Dimensionality reduction
• Numerosity reduction
• Discretization and concept hierarchy generation
Data Reduction
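Two of the reduction strategies listed above, numerosity reduction by sampling and discretization of a numeric field, can be sketched with pandas (the column names are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"region": ["N", "S", "N", "E", "S", "E"] * 1000,
                   "amount": range(6000)})

sample = df.sample(frac=0.01, random_state=0)    # numerosity reduction
cube   = df.groupby("region")["amount"].sum()    # data-cube style aggregation
df["amount_bin"] = pd.cut(df["amount"], bins=4)  # discretization into 4 intervals

print(sample.shape)
print(cube)
print(df["amount_bin"].value_counts())
```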
Steps that follow preparation
Once the QueryObject data set has been prepared, the following steps are performed:
- Data Exploration or Data Modeling
- Results interpretation and KPI definition
- Possibly ingest new data or implement simulations
- Implement the communication of results and implications to decision makers
Then the ongoing run will include
- The update of the QueryObjects and of the communication
- The knowledge base implementation
- Implementation and effectiveness monitoring
SPUR
Summary
Data preparation is a big issue
Data preparation includes
• Data cleaning
• Data reduction and feature selection
• Discretization
A lot of methods (190) have been developed in datonix, but this is still an active area of research
SPUR
©2016 datonix Spa
Weitere ähnliche Inhalte

Was ist angesagt?

The Role of the Logical Data Fabric in a Unified Platform for Modern Analytics
The Role of the Logical Data Fabric in a Unified Platform for Modern AnalyticsThe Role of the Logical Data Fabric in a Unified Platform for Modern Analytics
The Role of the Logical Data Fabric in a Unified Platform for Modern Analytics
Denodo
 

Was ist angesagt? (20)

Future of Data Strategy
Future of Data StrategyFuture of Data Strategy
Future of Data Strategy
 
Designing For Occasionally Connected Apps Slideshare
Designing For Occasionally Connected Apps SlideshareDesigning For Occasionally Connected Apps Slideshare
Designing For Occasionally Connected Apps Slideshare
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
 
Enabling Cloud Data Integration (EMEA)
Enabling Cloud Data Integration (EMEA)Enabling Cloud Data Integration (EMEA)
Enabling Cloud Data Integration (EMEA)
 
datonix product overview
datonix product overviewdatonix product overview
datonix product overview
 
Gartner Cool Vendor Report 2014
Gartner Cool Vendor Report 2014Gartner Cool Vendor Report 2014
Gartner Cool Vendor Report 2014
 
Data Virtualization: Introduction and Business Value (UK)
Data Virtualization: Introduction and Business Value (UK)Data Virtualization: Introduction and Business Value (UK)
Data Virtualization: Introduction and Business Value (UK)
 
2009.10.22 S308460 Cloud Data Services
2009.10.22 S308460  Cloud Data Services2009.10.22 S308460  Cloud Data Services
2009.10.22 S308460 Cloud Data Services
 
Data virtualization an introduction
Data virtualization an introductionData virtualization an introduction
Data virtualization an introduction
 
Why advanced monitoring is key for healthy
Why advanced monitoring is key for healthyWhy advanced monitoring is key for healthy
Why advanced monitoring is key for healthy
 
Microservices Patterns with GoldenGate
Microservices Patterns with GoldenGateMicroservices Patterns with GoldenGate
Microservices Patterns with GoldenGate
 
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
 
Data Federation
Data FederationData Federation
Data Federation
 
Consumption based analytics enabled by Data Virtualization
Consumption based analytics enabled by Data VirtualizationConsumption based analytics enabled by Data Virtualization
Consumption based analytics enabled by Data Virtualization
 
The Role of the Logical Data Fabric in a Unified Platform for Modern Analytics
The Role of the Logical Data Fabric in a Unified Platform for Modern AnalyticsThe Role of the Logical Data Fabric in a Unified Platform for Modern Analytics
The Role of the Logical Data Fabric in a Unified Platform for Modern Analytics
 
Why Data Vault?
Why Data Vault? Why Data Vault?
Why Data Vault?
 
Accelerate Return on Data
Accelerate Return on DataAccelerate Return on Data
Accelerate Return on Data
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
 
ETL tool evaluation criteria
ETL tool evaluation criteriaETL tool evaluation criteria
ETL tool evaluation criteria
 

Ähnlich wie Product overview 6.0 v.1.0

Ähnlich wie Product overview 6.0 v.1.0 (20)

A Logical Architecture is Always a Flexible Architecture (ASEAN)
A Logical Architecture is Always a Flexible Architecture (ASEAN)A Logical Architecture is Always a Flexible Architecture (ASEAN)
A Logical Architecture is Always a Flexible Architecture (ASEAN)
 
Why Data Virtualization? An Introduction
Why Data Virtualization? An IntroductionWhy Data Virtualization? An Introduction
Why Data Virtualization? An Introduction
 
Modern Data Management for Federal Modernization
Modern Data Management for Federal ModernizationModern Data Management for Federal Modernization
Modern Data Management for Federal Modernization
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
 
Bridging the Last Mile: Getting Data to the People Who Need It
Bridging the Last Mile: Getting Data to the People Who Need ItBridging the Last Mile: Getting Data to the People Who Need It
Bridging the Last Mile: Getting Data to the People Who Need It
 
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
 
Denodo DataFest 2017: Conquering the Edge with Data Virtualization
Denodo DataFest 2017: Conquering the Edge with Data VirtualizationDenodo DataFest 2017: Conquering the Edge with Data Virtualization
Denodo DataFest 2017: Conquering the Edge with Data Virtualization
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
Flash session -streaming--ses1243-lon
Flash session -streaming--ses1243-lonFlash session -streaming--ses1243-lon
Flash session -streaming--ses1243-lon
 
Evolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motion
 
Accelerate Migration to the Cloud using Data Virtualization (APAC)
Accelerate Migration to the Cloud using Data Virtualization (APAC)Accelerate Migration to the Cloud using Data Virtualization (APAC)
Accelerate Migration to the Cloud using Data Virtualization (APAC)
 
Introduction to Modern Data Virtualization 2021 (APAC)
Introduction to Modern Data Virtualization 2021 (APAC)Introduction to Modern Data Virtualization 2021 (APAC)
Introduction to Modern Data Virtualization 2021 (APAC)
 
A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
 
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
 
Datumize Deck 2019
Datumize Deck 2019 Datumize Deck 2019
Datumize Deck 2019
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
 
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
Quicker Insights and Sustainable Business Agility Powered By Data Virtualizat...
 
Data Fabric - Why Should Organizations Implement a Logical and Not a Physical...
Data Fabric - Why Should Organizations Implement a Logical and Not a Physical...Data Fabric - Why Should Organizations Implement a Logical and Not a Physical...
Data Fabric - Why Should Organizations Implement a Logical and Not a Physical...
 

Kürzlich hochgeladen

%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 

Kürzlich hochgeladen (20)

Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 

Product overview 6.0 v.1.0

  • 1. datonix™ and QueryObject™ are registered trademarks April 2016 datonix pragmatic self service data preparation Product outlook
  • 2. datonix™ and QueryObject™ are registered trademarks Datonix introduces an interesting technology into the BigData space It is a proprietary fractal like algorithm that converts an input raw data into a smart, lean and fast file the QueryObject data set In addition, technology materializes in the QueryObject data set the results of any custom defined data processing The datonix fractal technology is the core of the datonixOne, a product designed to offer disruptive data preparation features. Executive summary
  • 3. datonix™ and QueryObject™ are registered trademarkswhere life follows data Why we made datonix What datonix does Where datonix is used Annex Summary of Features Details of Data Preparation Steps Index
  • 4. Why we made datonix
  • 5. datonix™ and QueryObject™ are registered trademarksTechnology Scenario: the information firewall Why datonix The information firewall will fall where IT division will evolve into separate “I” and “T” orgs Data Ware house Legacy DMP SAP Local DB Hadoop BigData CRM OSS Web-app Data Science Business Intelligence Data Journalism Data Unification Dark Data Data Integration The data management (Tech) side The data exploitation (Info) side
  • 6. datonix™ and QueryObject™ are registered trademarksData Preparation? It is the next big thing in the data management space It is the answer for Data Scientists that need to self prepare “Ready Data” for their actvities It is the answer for end-users asking for performance and agility Why datonix high Oriented to Perform Oriented to Agility Oriented to Comfort Utility technology low lowhigh Competence
  • 7. datonix™ and QueryObject™ are registered trademarksFrom Big to Complex Assumption: N algorithmic linear logic isn’t adequate for a complete data analysis Perspective: Nm real time logic is required to understand more and to act faster Data Complex Issue: How to blend many & large data component Why datonix Complex Diversified Big Simple Size few smalllarge manyData Components
  • 8. datonix™ and QueryObject™ are registered trademarksBottom line To accelerate the “Information Firewall Fall”, most companies are investing in MDM or Enterprise data hub or data Lake implementations In parallel, Data Preparation Market is going to be hot simply because - there is a shortage of Data Scientists - they are required unique Data Preparation methodologies and tools for the Business Analysts, the Data Scientists and the Data Journalist datonix has been designed to support from the “in pectore”, aka citizen, to the most expert data scientists” Thanks to its fractal engine datonix can offer real self service data preparation capabilities for the end-user, Why datonix
  • 10. datonix™ and QueryObject™ are registered trademarksMission Datonix primary product is datonixOne a Self Service Data Preparation solution Using datonixOne it is possible to prepare SMART Ready Data, the QueryObject data set Once prepared, with a few of clicks, the QueryObject can be used as · a Web 2.0 data services; · an analytic web-app; · a first class narrative analytics & report; · an ODBC set of tables. What datonix does
  • 11. datonix™ and QueryObject™ are registered trademarksDatonixOne is a Data Scanner the Data Scanner Engine quickly convert raw data in a new file type named QueryObject QueryObject contains scanned raw data, end- user defined data processing outputs, and raw data projections it is read only, binary portable, compressed, secure, and fast in the QueryObject, raw data set are linked with their projections in real time the fractal nature of the QueryObject internals ensure scalable data blending and data unification What datonix does
  • 12. datonix™ and QueryObject™ are registered trademarksQueryObject data set build example rowId Nname Ncity 1 1 1 2 2 2 3 3 3 4 2 2 What datonix does Key Value NValue Name Aldo 1 Name Sara 2 Name Anna 3 City Miami 1 … … …name city DateBirth Aldo Miami 11/1/90 Sara NYC 12/2/89 Anna Rome 1.1.68 Sara NYC 31-1-61 DateBirth UDateB Age 11/1/90 1/11/90 26 12/2/89 2/12/89 26 1.1.68 1/1/68 48 31-1-61 1/31/61 56 Ncity city state 1 Miami Fl 2 NYC NY 3 Rome Italy Map Dictionary Luggage hierarchy Data complex Storage group Data source Fractal conversion Transform DateBirth Add Geo classification
  • 13. datonix™ and QueryObject™ are registered trademarksProduct architecture What datonix does
  • 14. datonix™ and QueryObject™ are registered trademarksConnect, Build & Combine QueryObject data set Datonix is an efficient Data Lake Satellite, it can connect to any source for any load at the maximum speed, and executes custom data processing What datonix does QCL Server QueryObject Communication Layer Scanner Engine DFS Datonix File System SRM Stream Resource Manager Fractal Engine QO Joiner ENGINE lookup ENGINE Metadata Repository Source Data Descriptions Transformation, Filter Cleansing snippets External hierarchies
  • 15. datonix™ and QueryObject™ are registered trademarksHow to use QueryObject data set Once QueryObject has been registered for networking, in a few of clicks it can be transformed in online web objects or in a materialized data set ready to be moved and used elsewhere What datonix does DFS Datonix File System Chart ENGINE Metadata Repository External Hierarchies Query Commands ODBC Connector Download Center QOhpIO Rest Provider Oracle Transparent Gateway Services ini
  • 16. datonix™ and QueryObject™ are registered trademarksThe packaging and the business model What datonix does Today, datonixOne is delivered on premise in four editions: - Personal - Server - Enterprise - OEM The Server edition is also commercialized using our SaaS (Software as a Service) Plan. Using the SaaS Plan customers can have datonixOne with a small deposit and an attractive monthly rate. Soon it will be announced the general availability of the Cloud edition. Standard product maintenance is included in the price, a three years datonixCare extension plan for certain services can be subscribed.
  • 17. datonix™ and QueryObject™ are registered trademarksBottom Line datonixOne performs a sophisticated processing of the given data to create an Object based dataset named QueryObject data set. It contains cold detailed data and hot analytic information. In general, the main strengths of datonixOne are as follows: - It is well scalable in the number of input data rows. - It supports all regular aggregate and Distinct Count Measure (DCM) - It handles dimensional hierarchies externally to the QueryObject - It supports a SMP-based parallel data processing - It supports high performances incremental or cdc update - It uses several compression techniques to reduce data footprint. - It allows to create partial and or optimized view of data - It is a disconnected active data store - Its ideal usage is to provide keyback from hot data selections to the related cold data - Based on Grid architecture, which supports federated queries, supports querying in multi-user environment. - It is a cost-efficient solution. In many cases it will work efficiently because many potential applications requirements find answers in datonix possibilities
  • 18. Why to use datonix
  • 19. datonix™ and QueryObject™ are registered trademarks Where is user’s pain If analytic requisite is not clear upfront or not fully structured, preparation could be long and costs high Raw Data are not connected with Analytics, and in order to review/adjust User’s data it is necessary to cycle back to Data Integration Dark Data don’t go back to DMP level Narrative reporting is not handled Typical data interface to R, Python, Advanced Analytics and BI tools are slow and tricky Why to use Datonix remedy datonixOne connects and scans any kind of data sources. Datonix automatically maintains raw data connected to analytics. Using datonix, end-user can self develop data processing resulting “Schema on Read Structures” that can be easily combined with dark data, external data and Cloud data. The datonixOne Publisher is a cloud component designed to make data available to distribution over a network. Following the registration of data in Publisher the following services are automatically made available: • An Excel worksheet • A high performance data or info graphic service • A dynamic report in ppt, doc, xlsx or pdf format • An active Big Data dashboard • A DBMS like data table
  • 20. datonix™ and QueryObject™ are registered trademarksApplication of datonix Thanks to its unique features and or performance capabilities, datonix has been used in a variety of applications over several industry. Cloud migrations, Spending Review, Revenue Assurance and OSS data collection, processing and movement are the most common. Product has been commercialized directly or through partners to Telcos, Media, Public Administration, Manufacturing, R&D Institutes. Where used
  • 21. datonix™ and QueryObject™ are registered trademarksAgile Methodology SPUR START • Goals definition • Data Collection • Selection of the Data Science • Preparation Estimo PREPARE • QueryObject Set up • Teaser Set up USE • Implement the Data Science • Determine the KPIs • Implement the communication RUN • Implement Pre and Post mining Agents • Implement the automation • Ongoing run SPUR is the result of more then 6,000 preparation implementations executed directly by datonix or through our certified professionals. Most of the processes are supported by specific GUIs, called Voyagers, and a wide collection of Microsoft office Addin.
  • 22. datonix™ and QueryObject™ are registered trademarksMeasurable Estimo Datonix Estimo database is continuously reviewed and available to datonix certified professional network in order to ensure: • Accuracy • Completeness • Consistency • Timeliness • Believability • Value added • Interpretability • Accessibility SPUR
  • 23. datonix™ and QueryObject™ are registered trademarksCustomer case: CPM Type of Business Very large multinational enterprise with annual Supply Chain contracts in excess of 9B. Customer Pain /Issue: The making of the monthly Corporate Supply Chain Tableau de Bord [TdB] and Spending Review was lengthy [more than 9 man/days of processing], prone to errors and produced a huge report more of 180 pages long in order to provide for required drill down views. Data inconsistencies arising from different data download procedures from main data repositories. Spreadsheets and final reports sent to stakeholders were cumbersome. Initial Conditions The supply chain auditing team monthly fetched data from multiple SAP that were then processed locally in an Access DB, manually rectified, sent to each category manager for further control and then consolidated into Excel spreadsheets used for the compilation of the TdB and Spending Review Report. The data cleaning instructions were stored each month into a lengthy Word file together with original and adjusted data in Excel spreadsheets Solution Implemented One datonix m200 for data collection and preparation and one datonix m200 platform for publishing. Obtained Results One single data collection and consolidation platform for all corporate divisions. Effort now required to assemble the TdB is less than 4 man/days. Lean reports in PDF format with richer data content and drill through capabilities up to detail data. The obtained TdB has enough synthesis to be delivered to the CEO, but at the same time plentiful in-depth capabilities to serve departmental analysis. Reports provided to users as a web service and on multiple devices. Completely reshaped content and homogeneous reports format. Automatic generation and archiving of adjustments data report for reference. Competitive Advantages datonix external hierarchies hugely simplified data classification and easily accommodated for a consistent data comparison between different time frames [month over previous month or previous year]. Richer data content due to the overcoming of the Excel row limits. The possibility to quickly reflect organizational changes into the report data, still maintaining meaningful comparison with previous data. Savings Greatly reduced man/day effort required to collect, rectify and aggregate data. Displaced 50% of human resources assigned to the reporting to other duties. One data collection system no longer necessary. Indirect savings by better spending control. Additional Benefits Narrative reports are accessible also from mobile devices through a web app. Automatic creation of reports with different detail depths for diverse needs. Efficient and lean data distribution to internal users who can use Excel to do local analysis still maintaining overall data consistency. Phase out a system used by Finance to consolidate data coming from subsidiaries abroad. Same system used for forecast and what-if analysis. Where used
  • 24. datonix™ and QueryObject™ are registered trademarks Customer case: Data Warehouse Optimization Type of Business Top Tier Telco Customer Pain / Issue: Excessive storage needs and poor performance in an important Trend Analysis OSS system [network alarm data, network performance data, network inventory data, etc.], but not a mission-critical enough project to have access to vast human and financial resources. Initial Conditions The system collects data from several heterogeneous sources and systems to build data models that feed reports. The data warehouse required two daily restarts due to the critical volumes of data, and this affected the effort required and the system performance while not satisfying business needs. Solution Implemented Two datonix Scanner Enterprise edition systems, one for the collection of the alarm data coming from the different technologies. Obtained Results The original database has been freed from historical data storage, thus greatly improving its performance and removing the need for daily restarts. Data acquisition and data presentation have been decoupled, thus avoiding the momentary unavailability of the reporting services when the data acquisition system is down. Very fast and efficient reports. Faster root cause analysis. Closer control over network trunks with beneficial impacts on HR as well. Competitive Advantages More effective data availability that converts into better predictive analysis. Improved performance and new functionality added without replacing the existing infrastructure, thus protecting CAPEX. Seamless integration between aggregate information and detail data, so that costly and time-consuming data “reverse engineering” is no longer necessary. The same system is used for high-level top management reporting and for operational analysis. Savings Greatly reduced man/day effort required to collect, rectify and aggregate data. No more downtime. No need to replace the data acquisition infrastructure. Automatic generation of reports versus manual creation. Indirect savings through better control of the network. Additional Benefits Dynamic reports for the management focusing on KPIs not previously attainable. A weekly Smart Network analysis report generated automatically [with the previous system it was not economically feasible to collect and process those data]. Efficient data distribution to internal users, who can use Excel to do local analysis while still maintaining overall data consistency. The automatic creation of a Downtime Report that spans from highly aggregated information to finely granular data and that previously required dozens of man/days of effort to prepare. Where used
  • 25. datonix™ and QueryObject™ are registered trademarks Customer case: SMART METERING Type of Business Large Utility Customer Pain / Issue: Data coming from smart meters were collected manually only once a week, making it impossible to efficiently monitor the devices and actively correlate malfunctions with root causes. Fast and effective troubleshooting was key to positively affecting the bottom line of the company in a heavily regulated market. KPIs were needed to monitor the quality of the metering process and to gauge the implementation of improvement actions, in conjunction with a finer granularity of the analysed data. Initial Conditions Meter data were collected in an Oracle database from which they were extracted once a week and manually aggregated to produce simple and hasty reports with a 1.5 man/day effort. The data volume and the weekly cadence of the extraction made a deep analysis of the meter network performance worthless. Solution Implemented One datonix m100 virtual machine for the acquisition of the readings from their business customers (50k meters). Obtained Results Completely automated daily data collection with a historical depth of 13 months of data. Competitive Advantages The same system is used for high-level top management reporting and for operational analysis. The possibility to continue to use Excel for operational reports with increased data availability, thanks to the row limit removal provided by datonix. Easy access to a holistic view of the whole network, with the added possibility to analyse locally one year of full network data at granular level. The datonix keyback feature natively connects aggregate data to the actual records for easier analysis of equipment malfunctions. Savings Greatly reduced man/day effort required to collect, rectify and aggregate data. Proactive maintenance, enabled by better and timelier data, yielded fewer outages and reduced ticket costs. Additional Benefits Highly scalable solution: during the project lifecycle the customer increased the frequency of the readings and doubled the number of meters with no impact on performance. The customer is widening the scope of the project to include the mass market as well. The capability of adding further data models without impact on performance allowed the introduction of more KPIs than originally planned, thus improving the overall quality of the monitoring process. Where used
  • 26. datonix™ and QueryObject™ are registered trademarks Solution case: the Hadoop Data Scanner Hadoop is perfect for supporting simple queries on very large data sets. When more than a simple full table scan is needed, it is recommended to complement Hadoop with an additional technology. Since database technology is expensive and shows inherent limitations, and since the Hadoop software stack is conceived to leave data where they are, i.e. in HDFS, modern solutions to run Hadoop data processes (e.g. Spark, Flink) are emerging today. But what happens when data must be moved outside HDFS to run processes remotely or, vice versa, when remote users need to move large data sets into HDFS? Requirement The issues and high costs of data delivery make the traditional solution of transferring and integrating remote data resources by hauling the data, in bulk, back to an IT Enterprise Data Hub and processing it there untenable. Worse still, the more distributed the organization, the more serious the impact of large-scale data movement will be. Transferring Hadoop's raw data to remote destinations, blending those data with remote Dark Data, extracting business intelligence there, and sending back huge volumes of adjusted data could solve the problem, but this strategy would require maintaining expensive data movements and an IT architecture that matches the distributed nature of the remote data. Solution With DumbOne, datonix™ offers new perspectives for Hadoop data movement and remote data processing. DumbOne plugs into HDFS, reads data and natively shares memory segments with the datonix™ data-scan ultrafast processes. Once converted into the QueryObject™ format, data sets are ultra-compressed and ready to be exchanged with remote datonixOne servers. This way the existing IT-centric architecture can be complemented with an ideal “pure grid” satellite, the datonixOne, so that the solution is as non-disruptive as possible, saves operating costs, and increases performance. Using datonix™ Voyager it is easy and fast to set up connections to data stored in HDFS, so that developers do not lose their focus fighting Java memory errors, garbage collection, file fragmentation, cluster contention and annoying cluster performance tuning. Where used
  • 28. datonix™ and QueryObject™ are registered trademarks Product architecture Summary of Features
  • 29. datonix™ and QueryObject™ are registered trademarks QOhpDG Summary of features To access Data Sources, several components are available in 5.10.10; the 6.0.1 components are listed in red: File System: driver to access local CSV, Flat, XML, JSON, RDF and binary files http/sftp: remote file systems can be accessed using the http/sftp protocols ODBC: local ODBC sources are accessed using UnixODBC; remote ODBC sources require the installation of the source's ODBC client Oracle: the Oracle OCI driver is pre-installed DumbOne: NFS mount of a Cloudera HDFS and the Hive driver are pre-installed; Python Pandas is pre-installed as well Cloudy: DropBox and Gdrive drives can be mounted into the datonix File System SRM: using the Stream Resource Manager, http stream sources can be accessed All QOhpDG drivers can benefit from automatic or static source partitioning for parallel load Summary of Features
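For orientation only, the following is a minimal, generic Python sketch of reading from the kinds of sources listed above (local files and an ODBC source). It uses standard open-source libraries, not the datonix QOhpDG drivers; the paths, the DSN and the table name are hypothetical.

# Generic sketch: standard Python libraries standing in for the source types
# listed above, not the datonix QOhpDG drivers. Paths, DSN and table name are
# hypothetical.
import pandas as pd
import pyodbc

# Local file system sources (CSV and line-delimited JSON)
sales = pd.read_csv("/data/raw/sales.csv")
events = pd.read_json("/data/raw/events.json", lines=True)

# ODBC source: requires a configured DSN (here "warehouse_dsn" is assumed)
conn = pyodbc.connect("DSN=warehouse_dsn;UID=reader;PWD=secret")
orders = pd.read_sql("SELECT order_id, customer_id, amount FROM orders", conn)
conn.close()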
  • 30. datonix™ and QueryObject™ are registered trademarks Fractal Engine Summary of features Row-id: a primary key is automatically added to every row of the scan User field: processing on the original data; any operation on Date, String and Numeric fields is available, and special processing logic can be user defined in the C language (a wide library of C transformations is available) Filters: include / exclude filters during any build process step Csv lookup: Left, Right, Inner and Outer Join at data set load time Hierarchies: fields can be organized in hierarchies; external files containing hierarchies can be dynamically added to QueryObject fields MultiThread: Load, Normalization and Data Complex build processes can be parallelized Load strategy: loads can be processed in append mode and in change data capture mode Compression: QueryObject Numbers and Strings can be compressed Count Measures: regular, distinct, exclusive distinct and intersection counts can be preprocessed ECC: command line Engine Control Command WebECC: REST-based ECC wrapper Security: QueryObject data sets can be password protected; content encryption can be user defined Summary of Features
  • 31. datonix™ and QueryObject™ are registered trademarks Blend Engine Summary of features Union: two or more QueryObject data sets can be virtually unified in a union. A union can be inner (only the common columns take part in the union) or outer (all columns take part in the union, yielding Null in the columns that do not belong to a given QueryObject). The order of the columns does not matter to the union. Hierarchies can be applied or inherited. Grid Union: same as above, but on the network; each QueryObject remains where it is and the union is registered on the network. Merge: two or more QueryObject data sets can be consolidated, aka merged, in batch. Join: two or more QueryObject data sets can be joined at high speed. Csv Join: one QueryObject can be joined with a CSV file. Oracle TG: one or more QueryObjects can be linked to an Oracle database; Oracle treats the QueryObject as an external ODBC connection. Summary of Features
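To make the inner/outer union and join semantics above concrete, here is a conceptual pandas illustration; it is not the datonix Blend Engine API, and the column names and values are made up.

# Conceptual illustration of inner/outer union and join semantics with pandas;
# not the Blend Engine itself. Data and column names are hypothetical.
import pandas as pd

q1 = pd.DataFrame({"customer": ["A", "B"], "amount": [10, 20]})
q2 = pd.DataFrame({"customer": ["C"], "amount": [30], "region": ["EU"]})

# Outer union: all columns participate; missing columns become Null (NaN)
outer_union = pd.concat([q1, q2], join="outer", ignore_index=True)

# Inner union: only the columns common to both data sets are kept
inner_union = pd.concat([q1, q2], join="inner", ignore_index=True)

# Join: combine two data sets on a shared key column
regions = pd.DataFrame({"customer": ["A", "B", "C"], "region": ["EU", "US", "EU"]})
joined = q1.merge(regions, on="customer", how="left")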
  • 32. datonix™ and QueryObject™ are registered trademarks Datonix File System (DFS) Summary of features DFS is the file access method of datonixOne. It contains the QueryObject data sets and the mounts to data sources. DFS Manager: the DFS Web GUI, optimized for Mozilla, runs on every browser; any operation (copy, archive, download, upload, etc.) on a DFS object can be executed in one click. Space: datonix ensures a 14x compression factor on complex data structures, so, for example, the m100 with 1 TB of physical space provides 14 TB of space for data in QueryObject format. HDFS: DFS can be linked in write mode to HDFS, providing virtually unlimited space. Cloudy: DFS can be synchronized with Gdrive and DropBox. Summary of Features
  • 33. datonix™ and QueryObject™ are registered trademarks QOhp Interface Option Summary of features Rest: support of both NoSQL and SQL query languages; produces CSV, JSON, Excel, XML and HTML grid output Chart: professional data visualization language; it supports both SWF and HTML5 output, with more than 300 base charts available, including a world map ODBC: full support of standard ODBC clients; Hive and MySQL preinstalled; extensions for Distinct Count and advanced second-order statistics OCI: full support of the Oracle OCI driver Download Center: support of QO downloads, links to Rest interfaces, and external links to custom web-apps Blocks: full WordPress & Joomla support Summary of Features
  • 35. datonix™ and QueryObject™ are registered trademarks Agile Methodology SPUR START • Goals definition • Data Collection • Selection of the Data Science • Preparation Estimo PREPARE • QueryObject Set up • Teaser Set up USE • Implement the Data Science • Determine the KPIs • Implement the communication RUN • Implement Pre and Post mining Agents • Implement the automation • Ongoing run SPUR is the result of more than 6,000 preparation implementations executed directly or through our certified professionals. Most of the processes are supported by specific GUIs, called Voyagers, and a wide collection of Microsoft document macros.
  • 36. datonix™ and QueryObject™ are registered trademarks Data Preparation steps Using datonix Voyager, the following steps are implemented in a recursive way SPUR Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and reconcile inconsistencies Data transformation: user fields, user measures, normalization and aggregation Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results; includes discretization of numeric fields
  • 37. datonix™ and QueryObject™ are registered trademarks How does Voyager handle Missing Data? Data is not always available (i.e. many tuples have no recorded value for several attributes, such as customer income in sales data). This may be due to • equipment malfunction • data deleted because inconsistent with other recorded data • data not entered due to misunderstanding • certain data not considered important at the time of entry • history or changes of the data not registered Using datonix tools, data can be reviewed and fixed with respect to versioning, resulting in a log of the actions below: • Ignore the tuple: usually done when the class label is missing (assuming the task is classification; not effective when the percentage of missing values per attribute varies considerably) • Fill in the missing value manually: tedious and often infeasible • Use a global constant to fill in the missing value: e.g. “unknown”, or a new class • Use the attribute mean to fill in the missing value • Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree Data Cleaning
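As a minimal sketch of the missing-value strategies listed above, the following pandas snippet shows the "ignore the tuple", "global constant", "attribute mean" and "most probable value" options on a hypothetical table; it does not use Voyager itself.

# Minimal sketch of the missing-value strategies above using pandas;
# column names and data are hypothetical, not Voyager code.
import pandas as pd

df = pd.DataFrame({
    "income":  [52000, None, 61000, None, 48000],
    "segment": ["retail", "retail", None, "corporate", "retail"],
})

dropped   = df.dropna()                                        # ignore the tuple
constant  = df.fillna({"segment": "unknown"})                  # global constant
mean_fill = df.fillna({"income": df["income"].mean()})         # attribute mean
mode_fill = df.fillna({"segment": df["segment"].mode()[0]})    # most probable value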
  • 38. datonix™ and QueryObject™ are registered trademarks Noisy Data Noisy data are random errors or variance in a measured variable that result in incorrect attribute values, due to • faulty data collection instruments • data entry problems • data transmission problems • technology limitations • inconsistency in naming conventions Other data problems which require data cleaning are duplicate records, incomplete data and inconsistent data. Using datonix tools, noisy data can be treated with: • Binning method: first sort data and partition into (equi-depth) bins, then smooth by bin means, bin medians, bin boundaries, etc. • Clustering: detect and remove outliers • Combined computer and human inspection: detect suspicious values and have a human check them • Regression: smooth by fitting the data to regression functions Data Cleaning
  • 39. datonix™ and QueryObject™ are registered trademarks Binning & Smoothing Equal-width (distance) partitioning: • it divides the range into N intervals of equal size (uniform grid) • if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N; however, outliers may dominate the presentation and skewed data is not handled well. Equal-depth (frequency) partitioning: • it divides the range into N intervals, each containing approximately the same number of samples; good data scaling • managing categorical attributes is no longer tricky. Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34 Data Cleaning
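The price example above can be reproduced with a few lines of generic NumPy; this is only an illustration of equal-depth binning and smoothing, not Voyager code.

# Sketch of equal-depth binning with smoothing, reproducing the price example.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
bins = np.array_split(np.sort(prices), n_bins)   # equal-depth (frequency) bins

# Smoothing by bin means -> [9 9 9 9], [23 23 23 23], [29 29 29 29]
smoothed_by_mean = [np.full(len(b), round(b.mean())) for b in bins]

# Smoothing by bin boundaries: each value replaced by the closest bin edge
# -> [4 4 4 15], [21 21 25 25], [26 26 26 34]
smoothed_by_boundary = [
    np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins
]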
  • 40. datonix™ and QueryObject™ are registered trademarks Data Transformation Smoothing: remove noise from data Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing Normalization: values are scaled to fall within a small, specified range – min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A – z-score normalization: v' = (v - mean_A) / stand_dev_A – normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1 Data Transformation
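A short NumPy sketch of the three normalization formulas above, on a hypothetical value vector; it is a generic illustration, not the datonix transformation engine.

# Sketch of min-max, z-score and decimal-scaling normalization with NumPy.
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the new range [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10**j, with j the smallest integer such that
# max(|v'|) < 1 (here j = 4, so 1000 -> 0.1)
j = 0
while (np.abs(v) / 10 ** j).max() >= 1:
    j += 1
v_decimal = v / 10 ** j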
  • 41. datonix™ and QueryObject™ are registered trademarks Data Reduction Strategies A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set Data reduction • obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results Data reduction strategies • Data cube aggregation • Dimensionality reduction • Numerosity reduction • Discretization and concept hierarchy generation Data Reduction
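A brief pandas sketch of three of the strategies listed above (data cube aggregation, numerosity reduction by sampling, and discretization) on a hypothetical sales table; again a generic illustration, not the datonix engine.

# Sketch of data cube aggregation, sampling and discretization with pandas.
import pandas as pd

sales = pd.DataFrame({
    "year":   [2015, 2015, 2016, 2016, 2016],
    "region": ["EU", "US", "EU", "EU", "US"],
    "amount": [100, 150, 120, 80, 200],
})

# Data cube aggregation: pre-aggregate detail rows along the dimensions
cube = sales.groupby(["year", "region"], as_index=False)["amount"].sum()

# Numerosity reduction: keep a representative random sample of the rows
sample = sales.sample(frac=0.4, random_state=42)

# Discretization: bucket a numeric field into labelled intervals
sales["amount_band"] = pd.cut(sales["amount"], bins=3, labels=["low", "mid", "high"])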
  • 42. datonix™ and QueryObject™ are registered trademarks Steps that follow preparation Once the QueryObject data sets have been prepared, the following steps are performed: - Data Exploration or Data Modeling - Results interpretation and KPI definition - If needed, ingestion of new data or implementation of simulations - Implementation of the communication of results and implications to decision makers Then the ongoing run will include - The update of the QueryObjects and of the communication - The knowledge base implementation - Implementation and monitoring of effectiveness SPUR
  • 43. datonix™ and QueryObject™ are registered trademarks Summary Data preparation is a big issue Data preparation includes • Data cleaning • Data reduction and feature selection • Discretization A lot of methods (190) have been developed in datonix, but this is still an active area of research SPUR