Incorporating the Real Time Component into Analytics and Machine Learning: Many industries and organizations today want to harness big data analytics and machine learning to improve margins, enhance discoveries, gain insight into the business, and make fast, data-driven decisions. The challenges include difficulty using available systems, not knowing where to start or which tools make sense for a particular problem, and dealing with data sets that are too big, too fast, or too complicated to handle with traditional systems.
RTDS Inc. has developed SymetryML™, a set of technologies for zero-latency machine learning and analytics/exploration of very large datasets in real time, with a focus on speed, accuracy and simplicity. Our goals have been to cut the memory footprint required to learn large data sets, to provide “reducer” functionality that automatically selects the best attributes for model creation, and to build models on the fly. SymetryML™ is also designed for easy integration into existing business processes via either an easy-to-use Web UI or RESTful APIs.
This talk will explore some of the functionality of these systems, including real time exploration of data, fast multivariate model prototyping, and our use of GPUs and parallelization. An example using brain-related data and the complexities of its analytics will be discussed, as well as a brief overview of other verticals we are exploring. Our work is geared towards making big data make sense in real time and enabling users to gain insights faster than with traditional methods.
Real Time Analytics and Machine Learning for Bioinformatics and Financial Data
1. Shiva Amiri, PhD
Chief Product Officer
MLConf Seattle - May 1st 2015
Incorporating the Real Time Component into
Analytics and Machine Learning
2. The Challenge
One or more structural limitations have significantly constrained
successful data mining applications and initiatives
Frequently, these problems are associated with the amount of data,
the rate of data generation and the number of attributes (variables)
to be processed –
1000’s of data variables from which to model (dimensionality)
100’s of billions of records to model
Continuously evolving data elements and changing sets of data
The need to execute and adapt in Real Time
Increasingly, this “big data” environment expands beyond the
capabilities of conventional data mining methods and technology
4. The Market Opportunity
IDC Reports Big Data Analytics market at $125 billion in 2015
Gartner reports the Internet of Things (IoT) will have 25 billion devices with
sensors connected by 2020 producing exabytes of data
IoT/E Market size by 2020 will exceed $14 trillion
Bioinformatics market is $7.5 billion according to Gartner
Streaming data, Real Time analytics and machine learning remain a
significant challenge for multiple sectors
5. Which verticals are we looking at?
Bioinformatics, Computational Biology – genetics, proteomics, EEG data,
fMRI, Molecular Dynamics data, etc.
Financials – behaviour, signals, patterns
Internet of Everything
Other sources of fast and massive data are also of interest to us
7. What kinds of questions do we want to ask?
How do the genes and proteins in disorders relate
to each other – clustering, regression,
classification, etc.
What are the other factors involved in disease
onset and progression?
What about environment data? Quality of Life?
Education? Socioeconomic status? - natural
language processing (NLP), classification,
predictive modeling, etc.
How can we handle massive amounts of brain
sensing and imaging data (EEG, fMRI) and link
them to other data (genes and proteins)?
Integrative analytics
And questions we don’t know we have
9. RTDS’ SymetryML™: What have we built?
SymetryML™ is a distributed, GPU-implemented predictive analysis and modeling
technology for our Massive Data universe…
V3.5 released – real time analytics of large-scale
data
Exploration (statistics) and model building,
assessment and prediction in real time
Robust security and privacy features
V4.0 being developed – distributed computing
capability
10. How is SymetryML™ addressing these
challenges?
The V’s of Big Data
SymetryML™ can handle heavy volumes of data (Volume)
SymetryML™ can handle streaming data (Velocity)
Accelerated hardware with GPUs and distributed computing
REST API – flexibility and modular design, seamless integration into
existing systems or development of custom systems
Simplicity of the design
Real Time analytics – exploration and model generation/prediction,
handling massive data with unprecedented speed in real time
Privacy and security
Service Oriented Architecture – XaaS
11. Faster: In minutes, SymetryML™ can utilize tens of thousands of variables by constructing 1000’s of model
combinations and ultimately reducing the variables to a single model; it builds models in real time as
it learns
Smarter with Scale: Linear scalability, with no limit on the length of data sets or the depth of
categorical data, allows for unlimited learning from data
More Agile on-the-fly: Continuous learning, both distributed and parallel
Simply Deployed: SymetryML™ models can be deployed in real time or in the form of scripts
(SQL, Java, etc.)
[Diagram labels: Data, Learner, Proprietary Statistical Representation, Modeler, Predictor, Explorer]
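The slides do not describe how the proprietary statistical representation works, so the following is only a generic illustration of how a learner can scale linearly and learn in parallel: each worker keeps a small, fixed-size summary of the data it has seen (count, mean, sum of squared deviations), and summaries from different workers can be merged exactly. The class `RunningStats` and its methods are hypothetical names chosen for this sketch; they are not part of any RTDS API.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Fixed-size summary of one numeric attribute (hypothetical name)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def learn(self, x: float) -> None:
        # Welford's one-pass update: memory use does not grow with the data.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Combine two partial summaries, e.g. from two workers.
        if self.n == 0:
            return RunningStats(other.n, other.mean, other.m2)
        if other.n == 0:
            return RunningStats(self.n, self.mean, self.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 when fewer than two values were seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# Two "workers" each see part of the stream, then their summaries are merged.
worker_a, worker_b = RunningStats(), RunningStats()
for value in (1.0, 2.0, 3.0):
    worker_a.learn(value)
for value in (4.0, 5.0, 6.0):
    worker_b.learn(value)
combined = worker_a.merge(worker_b)
print(combined.n, combined.mean, combined.variance)
```

Because the summary has a constant size, the cost of learning grows only with the number of records, which is one plausible reading of "linearly scalable" here.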
19. RTDS Inc. – Headlines
Team of 6 engineers and Data Scientists in Toronto, Board in NY
Focus on Technology Differentiation
Technology timeline
March ’13 – Launched .NET Based Desktop Version
July ’13 – Launched SymetryML™ Server with REST API.
December ’13 – Successfully deployed first GPU-based system
June ’14 – Algorithmic Support Expanded
’15 Roadmap: Aggressive, Attainable and Defensible
Proven technology with successful deployment in advertising
Current Financing
Mogility Capital
20. Next steps
We’ve been successful with this technology in the mobile advertising
space…now we want to use the power of this technology in other strategic
sectors
We are looking for partners as beta users - with unique datasets and use cases
- what kinds of questions can we help answer with your data?
We are looking for integration partners where we can both enhance our
offering
Develop the next version (v4.0) of SymetryML™ – fully parallel with
Apache Spark
23. SymetryML™ and GPUs
• A native library that uses NVIDIA GPUs is available for:
• Linux 64 bit (CentOS 5.x and Amazon Linux)
• Use of GPUs for core operations:
• Learning / Forgetting data
• Model Building
• Model Selection
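As a rough illustration of what learning and forgetting data can mean for a model that is built on the fly, the sketch below keeps additive sums for a one-variable least-squares fit. Learning a record adds its contribution to the sums, forgetting subtracts it, and the model can be rebuilt from the sums at any time without revisiting the raw data. This is a generic CPU-only technique shown under that assumption; it is not the SymetryML™ implementation, and the class `StreamingLinearFit` is a hypothetical name.

```python
class StreamingLinearFit:
    """Least-squares line y = slope * x + intercept from additive sums."""

    def __init__(self) -> None:
        self.n = 0
        self.sx = 0.0   # sum of x
        self.sy = 0.0   # sum of y
        self.sxx = 0.0  # sum of x * x
        self.sxy = 0.0  # sum of x * y

    def _update(self, x: float, y: float, sign: int) -> None:
        self.n += sign
        self.sx += sign * x
        self.sy += sign * y
        self.sxx += sign * x * x
        self.sxy += sign * x * y

    def learn(self, x: float, y: float) -> None:
        self._update(x, y, +1)

    def forget(self, x: float, y: float) -> None:
        # Only meaningful for a record that was previously learned.
        self._update(x, y, -1)

    def fit(self) -> tuple:
        # Closed-form solution of the normal equations for one predictor.
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two records with distinct x")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept


model = StreamingLinearFit()
for x, y in [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]:
    model.learn(x, y)
model.learn(4.0, 100.0)   # a bad record arrives...
model.forget(4.0, 100.0)  # ...and is later forgotten
slope, intercept = model.fit()
print(slope, intercept)
```

The updates are simple sums over records, which is the kind of operation that parallelizes well; whether and how the product maps such updates onto GPUs is not stated in the slides.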
24. SymetryML™-WEB
• Interactive HTML 5 application
• Direct connection to SYM-REST
• De facto a lightweight front end to SYM-REST
• Based on Sencha Ext-JS 4.x
25. SYM-REST
• Provides a RESTful API to sym-core.
• Supported Data Sources:
• Amazon S3
• SFTP
• HTTP/HTTPS
• Redshift
• Upcoming Data Sources:
• HDFS
• ODBC/JDBC
26. SYM-REST Security
• Users of the REST API need an access key
• We generate these keys
• Each key is a 128-bit AES key.
• Every REST request is authenticated with an HMAC
(SHA1) code based on part of the request
• If data encryption is needed, HTTPS can be used
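The request authentication described on this slide can be sketched with Python's standard library. The slide says only that the HMAC (SHA1) code is based on "part of the request", so the choice of signed fields below (HTTP method, path and a timestamp) is an assumption made for illustration, as are the function names and the example path; none of this is the actual SYM-REST wire format.

```python
import hashlib
import hmac
import secrets


def new_access_key() -> bytes:
    # 128-bit random key, matching the key size stated on the slide.
    return secrets.token_bytes(16)


def sign_request(key: bytes, method: str, path: str, timestamp: str) -> str:
    # ASSUMPTION: the signed "part of the request" is method, path and timestamp.
    message = "\n".join([method.upper(), path, timestamp]).encode("utf-8")
    return hmac.new(key, message, hashlib.sha1).hexdigest()


def verify_request(key: bytes, method: str, path: str,
                   timestamp: str, signature: str) -> bool:
    expected = sign_request(key, method, path, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)


key = new_access_key()
# "/example/path" is a placeholder, not a real SYM-REST endpoint.
signature = sign_request(key, "GET", "/example/path", "2015-05-01T00:00:00Z")
print(verify_request(key, "GET", "/example/path", "2015-05-01T00:00:00Z", signature))
```

An HMAC authenticates the request but does not hide its contents, which is why the slide points to HTTPS when the data itself needs to be encrypted.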
27. Finance data example
• NASDAQ TotalView-ITCH Intraday Data Modeling
175Gb - one month of raw data
55Gb of transactions for NASDAQ100 constituents
12M rows/400 attributes
Univariate analysis across securities
Covariance and Hypothesis Testing
Model Building: Classification/Regression
Prediction of Price Movement
Full Order Book Analysis
Editor's Notes
What is the problem in data mining? How does this solve the problem? The ability to model 2000 variable combinations faster than anyone. The ability to update models in real time. The ability to introduce new variables into models without barriers.