Raptor combines Hadoop and HBase with machine learning models for adaptive data segmentation, partitioning, bucketing, and filtering to enable ad-hoc queries and real-time analytics. Raptor has intelligent optimization algorithms that switch query execution between HBase and MapReduce, and it can create per-block dynamic bloom filters for adaptive filtering. A policy manager allows optimized indexing and auto-sharding. This session addresses how Raptor has been used in prototype systems for predictive trading, time-series analytics, smart customer care solutions, and a generalized analytics solution that can be hosted on the cloud.
Hadoop World 2011: Raptor: Real-time Analytics on Hadoop - Soundararajan Velu - Sungard
1. Raptor: Real-time Analytics on Hadoop
Soundar Velu
Product Architect, Advanced Technology, SunGard
Anil Kumar Batchu
Research Engineer, Applied Research, SunGard
Proprietary and Confidential. Not to be distributed or reproduced without permission www.sungard.com/globalservices
2. Who We Are, Who Am I, What We Do?
Who We Are & What We Do?
Fortune 500 Company
Financial Services Firm
Provide software & consulting services across the industry
Exploring the impact of the Big Data approach for the last 2+ years
Who Am I?
Part of the Applied Research & Consulting group based out of Bangalore, India
We focus on the latest technology trends
3. Agenda
Financial Services and Data Problems
Raptor Architecture Overview
Raptor Components
Benchmarks
Future Enhancements
4. Financial Services & Typical “data intensive” Problems
Wholesale Markets
Portfolio Construction
Price Forecasting
Risk Calculation
Compliance Surveillance
Batch Processing
Retail Markets
Fraud Detection
Targeted Marketing
Demand Forecasting
5. Implementation Constraints in Financial Services
RDBMS Centricity: constrained semantics, rigidity and specificity
Legacy Burden: architectures, languages, tools, systems, skills
Data Silos & Governance: "You don't own that data, I do"
Cost of Change: evolution is the only answer
6. What Did We Need?
A generic, reliable, and cost-effective analytics solution with a wide range of application areas
Query execution and analytics at soft real-time windows (acceptable and consistent latencies)
Minimal customization, seamless integration, and ease of use
Policy around data storage and processing
Adaptive segmentation algorithms for optimized data search (indexing, partitioning, and filtering)
7. Some Limitations with Traditional Hadoop
Jobs are executed in a brute-force fashion, causing complete scans of files for every single query
Long warm-up time for jobs; performs poorly on relatively small data sets
Scheduling imbalance between HDFS operations and job execution
Limitations on the kinds of jobs that can be executed with MapReduce (non-equality joins are not supported)
Open bugs and missing features; memory-management bottlenecks
Batch-mode applications only; does not fit real-time analytics scenarios
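The non-equality-join limitation above follows from how MapReduce distributes work: an equi-join can hash-partition both sides on the join key, while a non-equality (theta) join has no single key to partition on. A minimal illustrative sketch (plain Python, not Hadoop code) of the difference:

```python
from collections import defaultdict

def equi_join(left, right, key):
    # Equi-join: both sides hash-partition on the join key, so each
    # reducer sees only the matching groups -- the MapReduce-friendly case.
    buckets = defaultdict(list)
    for l in left:
        buckets[l[key]].append(l)
    return [(l, r) for r in right for l in buckets.get(r[key], [])]

def theta_join(left, right, pred):
    # Non-equality predicate: there is no key to partition on, so every
    # left/right pair must be compared -- effectively a cross product,
    # which plain MapReduce expresses poorly.
    return [(l, r) for l in left for r in right if pred(l, r)]
```

The equi-join touches each right-side row once against a small bucket; the theta join is inherently all-pairs, which is why frameworks layered on MapReduce either rewrite it or fall back to a replicated cross product.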
8. Raptor Application Stack
ZooKeeper (coordination, spanning all layers)
Flume | Scribe | Sqoop
Hue | Raptor | Oozie
Hive / Pig
HBase | MapReduce | Quantum Processor
HDFS
10. Raptor Digester - Level 1 Segmentation
Data sources (Database, Data Stream, Logs/Files) feed the data load adaptors, which pass records through the Data Cleanser into the Digester (with callbacks to the Raptor Client), then through the Data Transformer into storage buckets.
The table-specific data transformer and partitioner partitions based on demographics, customers, IP address, or composite columns, and stores the partitioned data into the respective file buckets in HDFS.
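The bucketed storage step of level-1 segmentation can be sketched as follows. This is a minimal illustration, assuming a simple stable-hash bucketing scheme; the hash function, bucket count, and function names are not Raptor's actual implementation:

```python
import hashlib

def bucket_for(partition_key, n_buckets):
    # Stable hash so a given key always lands in the same storage bucket
    # (Python's built-in hash() is salted per process, hence md5 here).
    digest = hashlib.md5(str(partition_key).encode()).hexdigest()
    return int(digest, 16) % n_buckets

def digest_records(records, partition_col, n_buckets=4):
    # Level-1 segmentation sketch: route each cleansed record to a
    # storage bucket keyed by a partition column (customer, IP, etc.).
    buckets = {i: [] for i in range(n_buckets)}
    for rec in records:
        buckets[bucket_for(rec[partition_col], n_buckets)].append(rec)
    return buckets
```

Because the hash is stable, later queries that filter on the partition column only need to open the one bucket that key hashes to, which is the point of this segmentation level.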
11. Raptor Block Analyzer – Level 2 Segmentation
On a block write, the Raptor client hands the block to the DataXceiver/BlockReceiver, which offers the BlockID for asynchronous indexing. The IndexFactory supplies the indexer for the block, and the Block Analyzer pulls from the index queue to asynchronously analyze and index the block (DataBlock Indexer, DataBlock BloomFilter).
Every block is asynchronously analyzed, and the resulting column-level bloom filters, index tree, and column-level meta information are stored in ThorCache (the Thor Block Index Cache). All of this happens within the DataNode context.
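The per-block, column-level bloom filters described above can be sketched as below. This is a minimal generic bloom filter, assuming k md5-derived probes into an m-bit array; it illustrates the data structure only, not Raptor's actual classes or parameters:

```python
import hashlib

class BlockBloom:
    # Minimal per-block, per-column bloom filter sketch: k hash probes
    # into an m-bit array. False positives are possible, false negatives
    # are not -- so a negative answer lets a searcher skip the block.
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k, self.bits = m_bits, k, 0

    def _probes(self, value):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{value}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, value):
        for p in self._probes(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        # True means "maybe in this block"; False means "definitely not".
        return all((self.bits >> p) & 1 for p in self._probes(value))
```

Building one of these per block per column at write time is cheap (a few hash probes per value) and is what lets a later search discard whole blocks without reading them.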
12. Raptor Client Framework
Raptor Client is similar to DFSClient or JobClient; it stands as the primary interface into Raptor. Its interactions, per component:
Block Analyzer: get indexer for block; mirror indexer to replica nodes; re-index block; re-balance blocks
NameNode: get block info for file; get file stats
DataNode: execute internal task; clean up task; get result segments; invoke Quantum Processor services; manage indexes; execute jobs; get block analyzers; deploy generated infrastructure patch
Code Generator / Raptor Executor: deploy generated code
Digester: get digester for table
Raptor Optimizer: get analyzer for table; get indexer for table; get adaptive metrics
Adaptive Engine: get indexer for block/table; get adaptive metrics for table
Policy Manager: auto-shard buckets; drop indexes for buckets; archive buckets
Raptor-Hive: get Hive metadata for table; get Raptor metadata for table; get digester for table; deploy generated infrastructure patch
13. Getting Results Out – End to End Flow
Client applications create a Hive JDBC connection to the Raptor-HiveServer (over the Thrift client / Thrift Hive Server) and fire HQL queries over the connection. The query is executed by Raptor and a Hive ResultSet is returned as the result. This is the interface Raptor-Hive clients use to access data in HDFS.
e.g. "Select Name, CNO, TxID, floor(TxAmt) from Credit_Tx order by CNO"
During task execution the Quantum Processor writes results to HDFS, and the fetch task reads them back for the client. Non-HQL queries can be executed directly with the Raptor Client.
14. Raptor-Hive Processing Engine
A JDBC client submits a query to the HiveServer, which compiles and optimizes it (Raptor Optimizer) against the Hive MetaStore and Raptor MetaStore to produce an execution plan. The Raptor Executor runs the plan through the Raptor Client, fetching table block analyzers from the DataNodes and dispatching work to MapReduce, Raptor (Quantum Processor), or HBase. Execution metrics are published to the Raptor Adaptive Engine.
15. Quantum Processor Framework
QuantumNode is a service similar to the DataXceiver on the DataNode. All Raptor operations are executed by the Quantum Processor framework; the Hive Executor reaches it through RaptorClient and RaptorProtocolHelper.
QuantumProcessor operations: GetNodeStats, SearchBlock, ReadResultSegment, ManageIndexes, ReadBoundaryRecord, ExecuteInternalTask, GetFileAnalyzers, ExecuteExternalTask
16. Raptor Block Searcher/Processor
The Block Searcher analyzes the block indexes and bloom filters and extracts the resultant indexed records via an index-based offset reader; if no indexes are available it does a complete block scan.
The Hive Executor asks the QuantumProcessor (via RaptorClient) to process/search a block. The BlockSearcher consults the Thor Block Index Cache (block index map, column-level blooms, column meta data). If a block index or bloom is available, it searches the index tree and reads the matching records with the Index Block Reader; otherwise a sequential block reader processes all records. The resulting records pass through the Quantum Operator, and the Result Writer writes the processed result records to HDFS.
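The Block Searcher's decision path (bloom filter first, then index-based offset read, full block scan as the last resort) can be sketched as below. The function shape and the dict-based index are illustrative assumptions, not Raptor's actual interfaces:

```python
def search_block(block_records, column, value, bloom=None, index=None):
    # Sketch of the Block Searcher decision path described above.
    if bloom is not None and not bloom.might_contain(value):
        return []  # bloom says definitely absent: skip the block entirely
    if index is not None and column in index:
        offsets = index[column].get(value, [])
        return [block_records[o] for o in offsets]  # index-based offset read
    return [r for r in block_records if r.get(column) == value]  # full scan
```

The ordering matters: the bloom check costs a few hash probes, the index read touches only matching offsets, and only the fallback pays for reading every record in the block.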
17. Quantum Operator
The Quantum Operator takes block records as input, performs the required field inspections, and applies UDFs and aggregation. The resultant records are written either to local segment files or directly to HDFS, depending on the operator type.
The Block Searcher offers result rows to the Quantum Operator, which inspects rows, applies joins/merges, and applies UDFs. If it is a reduce operator, it reads the k remote segment files; if its child is a reduce operator, it writes results to k-segmented local files; otherwise it writes results to HDFS.
Quantum Operators: SelectOperator, JoinOperator, SortByOperator, GroupByOperator, GPUOperator, etc.
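The UDF-then-aggregate path through an operator chain can be sketched as follows, reusing the example query's floor(TxAmt). These are minimal stand-ins for a UDF step and a GroupByOperator, not Raptor's operator classes:

```python
import math
from collections import defaultdict

def apply_udf(rows, col, fn=math.floor):
    # Row-level UDF application, e.g. floor(TxAmt) from the example query.
    return [{**r, col: fn(r[col])} for r in rows]

def group_by_sum(rows, key_col, agg_col):
    # Sketch of a GroupByOperator: aggregate per group key and emit result
    # rows (the real operator writes them to local segment files or HDFS
    # depending on its child operator).
    groups = defaultdict(float)
    for row in rows:
        groups[row[key_col]] += row[agg_col]
    return [{key_col: k, "sum_" + agg_col: v} for k, v in sorted(groups.items())]
```

Chaining the two mirrors the slide's flow: inspect rows, apply the UDF, then hand the transformed rows to the aggregating child.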
18. Adaptive Indexing
The Raptor Adaptive Engine collects metrics from executed queries, including the columns and tables involved and the types of aggregation executed on columns. The Adaptive Index Scheduler uses these usage metrics to schedule asynchronous re-analysis/re-indexing of blocks across the DataNodes, updating the ThorIndex Cache; adaptive metadata is kept in the Raptor metadata store.
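The metric-collection half of this loop can be sketched as below: count which columns queries touch and nominate the hottest ones for (re-)indexing, with everything else an eviction candidate. The class, its capacity, and the tie-breaking are illustrative assumptions, not the actual scheduler:

```python
from collections import Counter

class AdaptiveIndexScheduler:
    # Sketch of usage-driven re-indexing: record per-query column usage,
    # then propose the top-k most-queried columns as indexing candidates.
    def __init__(self, max_indexed=2):
        self.usage = Counter()
        self.max_indexed = max_indexed

    def record_query(self, columns):
        self.usage.update(columns)

    def columns_to_index(self):
        return [c for c, _ in self.usage.most_common(self.max_indexed)]
```

A real scheduler would also weight aggregation type and recency, but the core idea, indexing driven by observed queries rather than a static schema, is the same.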
19. Code Generation and Data Definition Framework
The user supplies a table schema in XML format. Based on the XML table schema and hints, the Raptor code generator creates Hive table definitions in the Hive metadata store (via the HiveServer and HBase) and Raptor table metadata in the Raptor metadata store (via the Raptor Adaptive Engine). It also generates table-specific indexers, bloom filter classes, column analyzers, and the digester, and deploys the generated code to the Raptor-HiveServer and DataNodes through the Raptor Client.
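The first step of such a pipeline, reading the user-supplied XML schema and deciding which columns need generated indexers and blooms, can be sketched as below. The element and attribute names in the sample schema are assumptions for illustration; the deck does not show Raptor's actual XML layout:

```python
import xml.etree.ElementTree as ET

# Illustrative schema format -- not Raptor's actual XML layout.
SCHEMA = """
<table name="Credit_Tx">
  <column name="CNO"   type="bigint" index="true"  bloom="true"/>
  <column name="TxAmt" type="double" index="false" bloom="false"/>
</table>
"""

def parse_table_schema(xml_text):
    # Parse the schema and derive per-column code-generation hints.
    root = ET.fromstring(xml_text)
    cols = [{"name": c.get("name"), "type": c.get("type"),
             "index": c.get("index") == "true",
             "bloom": c.get("bloom") == "true"}
            for c in root.findall("column")]
    return {"table": root.get("name"), "columns": cols}
```

From a parsed structure like this, a generator can emit the table-specific indexer and bloom classes and register the same schema in the Hive and Raptor metadata stores, keeping all three consistent from one source of truth.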
20. Raptor Policy Manager – Time Based Shard
Time-based auto-sharding, per the policy defined by the user: table buckets are moved from highly segmented time shards to a sparsely segmented archive store over time. The Raptor Policy Manager tracks shard state in the Raptor metadata store.
Primary Time-Shard: completely segmented buckets (Bucket-1, Bucket-2, Bucket-3)
Secondary Time-Shard(s): partially segmented buckets
Archive Store: all segments merged and indexes dropped
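The tiering decision itself reduces to a small policy function over bucket age. A minimal sketch, where the window sizes are illustrative policy values rather than Raptor defaults:

```python
def shard_for(bucket_age_days, primary_days=7, archive_days=30):
    # Time-based policy sketch: fresh buckets stay fully segmented in the
    # primary time shard, older ones move to partially segmented secondary
    # shards, and the oldest merge into the archive store with indexes
    # dropped. Window sizes here are illustrative, user-defined policy.
    if bucket_age_days <= primary_days:
        return "primary"      # completely segmented
    if bucket_age_days <= archive_days:
        return "secondary"    # partially segmented
    return "archive"          # segments merged, indexes dropped
```

The payoff is resource efficiency: index and cache budget is spent only on the recent data that queries actually hit, while old buckets cost little more than raw storage.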
21. Infrastructure Enhancements
Adaptive compression based on network congestion statistics
Scheduling of jobs via the Raptor client (via Hive) in conjunction with an enhanced NameNode block policy
Computation-intensive jobs scheduled on GPU-enabled nodes
Object pools across the Hadoop-Raptor ecosystem
Hand-shake mechanism between clients and DataNodes to avoid imminent operation failures
Interactive user console for managing the cluster, tasks, data policy, filters, etc.
22. Benchmark Raptor
Benchmark carried out on a 5-node cluster with commodity hardware: Core2 Duo, 4GB RAM, 1TB storage, 100Mbps NIC, 64MB block size, replication factor 3, network compression enabled.
Table with 13 columns of mixed types; sample data generated with node.js with moderate entropy (table with 1 group index {3 columns}, 0 column bloom filters, 1 bucket).

Operation \ Table Size           | Raptor 256MB | 1GB  | 10GB | M/R 256MB | 1GB   | 10GB
Load Time                        | 4.8s         | 22s  | 4.2m | 54.6s     | 3.5m  | 36.0m
Simple select without predicate  | 6.9s         | 22s  | 3.2m | 21.6s     | 50.5s | 9.5m
Select with complex predicate    | 0.8s         | 1.7s | 0.9m | 9.2s      | 32.2s | 4.1m
Select with Order By             | 1.6s         | 6.5s | 1.1m | 41.0s     | 3.2m  | 5.6m
Select with Group By             | 0.6s         | 1.2s | 0.8m | 61.3s     | 1.5m  | 4.0m
Simple Join                      | 2.9s         | 5.7s | 1.7m | 1.2m      | 3.4m  | 9.2m
User-defined row-level function  | 0.6s         | 1.1s | 0.9m | 13.2s     | 49s   | 7.3m
23. Case Study: Credit Card Fraud Detection
Scribe clients stream transactions to the Scribe server; the Raptor Digester caches and segments the data into subsets/buckets/partitions. On Raptor-Hadoop (MapReduce + Quantum Processor), data mining techniques generate classifiers in parallel, and the resultant base models are combined into a Meta-Classifier. The Raptor Policy Manager prunes redundant classifiers from the ensemble. The Flagging Cluster uses the Meta-Classifiers to flag transactions, and the Notifications Cluster sends out notifications (backed by PG SQL in the Credit Card Fraud Application). Raptor-Hive serves periodic fraud report generation and analytics results & reports.
24. Other Raptor Use Cases
Smart customer care solution
Financial fraud analytics
Media usage log analytics
Computation intensive jobs using GPUs
Predictive trading
25. Roadmap
Merge Raptor into Next Generation MapReduce
Dimensions and Cubes (MRCubes)
Cloud ready Raptor
Roles and security
Job failover management
Rules and triggers
Planning to open source
Starter Kits/Examples of various Use Cases
26. Benefits
Query responses at soft real-time windows
Code generation framework for table-specific Raptor infrastructure code
Zero downtime, with hot deployment of generated infrastructure code
Seamless integration into existing infrastructure with multiple ingress options
Automatic time-based sharding and data archival
Distribute & execute non-MR jobs on the Hadoop cluster
27. Contacts
Soundar Velu Anil Batchu
Product Architect Research Engineer
Soundararajan.velu@sungard.com AnilKumar.B@sungard.com
Twitter: @greyquest
28. References
Optimizing Distributed Joins with Bloom Filters
www.l3s.de/web/upload/documents/1/analysis.pdf
Apache Hadoop Goes Realtime at Facebook
borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf
Data Mining with MapReduce: Graph and Tensor Algorithms with MR
www.ml.cmu.edu/research/dap-papers/tsourakakisdap.pdf
Distributed Cube Materialization on Holistic Measures
www.eecs.umich.edu/~congy/work/icde11a.pdf
HadoopDB in Action: Building Real World Applications
www.cs.yale.edu/homes/dna/papers/hadoopdb-demo.pdf
TeraByte Sort on Apache Hadoop
sortbenchmark.org/YahooHadoop.pdf
Mahout in Action
www.amazon.com/Mahout-Action-Sean-Owen/dp/1935182684
Web-Scale K-Means Clustering
www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
Hive – A Petabyte Scale Data Warehouse Using Hadoop
infolab.stanford.edu/~ragho/hive-icde2010.pdf
Cloud Computing, Distributed Computing, Big Data, Low Latency, Mobile Computing, GPU/FPGA, NG UI Frameworks, Functional Languages, DSLs, etc…
Financial services firms are investigating next-generation technologies for data analytics. Data growth, particularly of unstructured data, poses a special challenge as the volume and diversity of data types outstrip the capabilities of older technologies such as relational databases. Predictive analysis of both internal and external data results in better, proactive management of a wide range of issues: credit and operational risk (e.g. fraud and reputational risk), customer loyalty and profitability, etc. Banks, mutual funds, and insurance companies need to transform their data into actionable information to comply with government and industry regulations, manage risk, increase revenue in a competitive economy, identify fraud, and improve efficiency.
Financial services firms worldwide are struggling with the challenges of Big Data analytics management: analyzing massive volumes of complex data from many sources in order to better serve and retain customers, find and effectively recruit new prospects, detect and prevent fraud, manage assets, and streamline operations. Financial services companies are taking advantage of Big Data analytics to increase profit margins, reduce risk, and solidify relationships with profitable customers. Now more than ever, a powerful, scalable, affordable, and simple Business Intelligence (BI) solution for business users is key to making the right decisions that drive overall success.
Data in Financial Services: use it or lose it! Financial services is a domain where what you do with data, and what you don't do with data, can have a dramatic impact on your success. The Big Data movement is attracting both investment and press, horizontally as an analytics infrastructure and vertically with domain-specific tools. In financial services, distributed computing and parallel processing have been employed for many years, so what is the utility of Big Data for these well-known problems? Is the size of the data being analyzed the real challenge, or is it the ability to evolve insight and analytic capabilities on what, in relative terms, may really be SmallData? This presentation will discuss how time, cost, and information asymmetry are critical competitive advantages in financial services and how the current state of the art is in need of some help from the Big Data community. We will also discuss the hardware and software used for typical computation and data processing workloads in financial services and why distributed computing is still reserved for only relatively few niche applications.
We will conclude with a discussion on the barriers to adoption of new technology in financial services, why they exist, and how they might be overcome through greater academic and industry collaboration.
Outline:
BigData: why the allure? Why the marketing? How is innovation marketed? What is reality, what is hype? Thesis: can BigData solve our challenges in FS?
Financial Services as a sector. Wholesale problems: risk, forecasting, processing, compliance. Retail problems: forecasting, processing, compliance.
FS data-intensive computing for: processing, alpha.
FS data-intensive computing in two flavors: time, size.
FS data-intensive: inter-machine (grids, BigData?); intra-machine (GPUs, FPGAs).
FS data-intensive algos: graphs, linear time series & events, calculus.
A 2nd look at data-intensive computing: cloud economics, consumer successes, social media, digital exhaust, competition.
Barriers to be overcome, barriers created by the burden of data-constrained computing: legacy burden (see till slide), data silos and governance, jailed semantics with relational DBs.
One of the most promising technologies for dealing with this "big data" problem is the Apache Hadoop and MapReduce framework. However, current implementations of Hadoop lack the enterprise robustness demanded by financial services firms for wide-scale deployment of MapReduce applications.
Hadoop is a perfect fit for processing petabytes of data distributed over a large cluster of nodes. It was developed with a fault-tolerant design from the ground up, which makes it very resilient to hardware and software failures on nodes. The allocation of work to TaskTrackers is very simple: every TaskTracker has a number of available slots (such as "4 slots"), and every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability. If one TaskTracker is very slow, it can delay the entire MapReduce job, especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.
Raptor within the Hadoop ecosystem: Raptor is a specialized analytics platform, where Hive is a generalized one; it is a transparent addition to the Hive system.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
Scribe is a server for aggregating log data streamed in real time from a large number of servers. It is designed to be scalable, extensible without client-side modification, and robust to failure of the network or any specific machine.
Oozie is an open-source workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable, and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig, and MapReduce). Oozie is a lot of things, but a workflow solution for off-Hadoop processing, or another query-processing API a la Cascading, is not one of them.
Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities: imports individual tables or entire databases to files in HDFS; generates Java classes to allow you to interact with your imported data; provides the ability to import from SQL databases straight into your Hive data warehouse.
Hue is a browser-based desktop interface for interacting with Hadoop. It supports a file browser, job tracker interface, cluster health monitor, and more.
Raptor is an ensemble of frameworks and instruments that segment data to a fine level, allowing spontaneous search on massive data sets. This is achieved with segmentation at two levels and an intelligent data policy manager. Adaptive data segmentation, partitioning, column-level hierarchical indexes, and bloom filters allow ad-hoc spontaneous querying and searching with real-time and near-real-time analytics capabilities. Intelligent optimization algorithms switch execution of queries and searches between Raptor, HBase, and MapReduce. The policy manager allows optimized indexing and automatic sharding, based on time and bucket sizes.
Level 1 segmentation: the Raptor digester framework, segmentation based on high-level parameters.
Level 2 segmentation: the Raptor block analyzer framework, fine-level segmentation.
ThorCache: an LRU/MRU-based spill-over cache system that facilitates the segmentation frameworks.
Level 1 segmentation is based on high-level parameters: partition columns, row hashes, business-specific data segmentation (by enhancing the generated digester code to meet business needs), dimensioning/cubes, and adaptive data transformation.
Data is loaded into Raptor in the following ways: direct load from Hive, using the implicit load function in HQL; via the Digester, an auto-generated code framework specific to the table; or via data load adaptors, custom data plug-ins that facilitate capturing data from various sources like databases, internet data streams, log files, etc.
Table Partitions: stores data based on columns, segmenting data by demographics, groups, domains, industries, etc.
Digester: crunches down incoming data/streams and feeds the processed records into consistently hashed storage buckets.
Storage Buckets: appendable storage files in HDFS; each bucket corresponds to a specific hash key.
Adaptive aggregation & transformation: calculates data aggregates and vital stats per column as data is loaded into Raptor, and transforms data into an easier-to-use format. This is used for query optimization during query execution.
Data Cubes & Dimensioning: based on user hints, specific dimensions and cubes are generated for every table.
Level 2 segmentation applies fine-tuned segmentation techniques: single/hierarchical indexing, column bloom filters, adaptive column indexing, and meta column information.
Indexing Engine: indexes the records of every data block into a B+ tree/Lucene index map, which is in turn cached in the LRU-based ViteCache (spill-over cache). It does adaptive indexing: unused column indexes are evicted and the most frequently accessed columns get indexed.
Bloom Filter: a space-efficient probabilistic data structure used to test whether an element is a member of a set. In our case we use it to test whether the block/index map contains the partial search keys.
ThorCache: an LRU/MRU-based low-latency index store that holds the per-block B+ tree/Lucene indices (allows to-disk spill-over for effective memory management).
Raptor Meta Store: an RDBMS-based database that stores all metadata required by the various components in the system.
Quantum Processor: a very lightweight processing framework written within the Hadoop layer, providing results in a fraction of the time compared with traditional MapReduce. It is used for real-time queries, when the level 1 segmentation search returns positive/negative and the level 2 segmentation search returns positive.
Business-specific Code Generation Framework: generates all the infrastructure components like the indexer, digester, table schema, dimensions, etc., based on a user-defined XML schema.
Policy Manager: organizes data in a time-sorted way, automatically moving older bucketized data to sparsely indexed/managed buckets while retaining newer data in highly organized/indexed buckets, based on the defined policy. This allows for efficient resource utilization.
Intelligent packet compression: based on network congestion, Raptor switches to a resilient compression mode where packets are compressed using a CPU-efficient algorithm.
Scheduling of jobs is done using NameNode operation maps, where the NameNode maintains information on all the operations (reads, writes, jobs) currently being performed on the cluster. This ensures an even distribution of operations across the cluster.
All computation-intensive operations are scheduled on nodes that have the capability to execute such jobs on GPUs.
Adaptive rebalancing increases the replication factor of frequently used data, ensures even availability of data, and moves data closer to frequently accessing client sub-cluster nodes.
The interactive user console provides an interface to manage the cluster, tasks, indexes, shards, filters, etc. for each table, set up policies around data, and archive obsolete data.
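The congestion-driven compression trade-off can be sketched as follows. The thresholds and zlib levels are illustrative assumptions, not Raptor's actual codec choices; the idea is simply to spend little CPU when the network is idle and compress harder as it congests:

```python
import zlib

def compress_packet(payload, congestion):
    # Congestion-aware compression sketch: congestion is a 0..1 estimate
    # of network load; higher congestion buys heavier compression so
    # fewer bytes go over the wire. Thresholds/levels are illustrative.
    if congestion < 0.3:
        level = 1   # CPU-cheap, light compression
    elif congestion < 0.7:
        level = 5
    else:
        level = 9   # maximum reduction when the network is the bottleneck
    return zlib.compress(payload, level)
```

The receiver needs no mode negotiation since zlib streams are self-describing, which keeps the adaptive switch transparent to the rest of the transfer path.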
Architecture: Raptor over Hadoop/HBase. Raptor partitions incoming transactions and executes classifiers over GPU for each partition. The Flagging Cluster uses Meta-Classifiers to flag transactions; the Notification Cluster sends out notifications. Raptor also implements phasing out of data and classifier pruning. The Hive interface is not used except for periodic fraud report generation.
How does it work? Divide data into subsets/buckets/partitions (where Raptor excels). Apply data mining techniques to generate classifiers in parallel. Combine the resultant base models to generate a Meta-Classifier. The base classifiers execute in parallel while the Meta-Classifier combines their results. Pruning removes redundant classifiers from the ensemble (very old data).
Large-scale data mining technique using a scalable black-box approach; a cost-reduction-based model is used. Learning algorithms used: C4.5, CART, Ripper, Bayes, ID3.
Smart customer care solution: allows CSR representatives to query terabytes of real-time business data and handle customer queries on complaints and events.
Financial fraud analytics: retrospective analysis of incoming financial logs; ranks entities based on suspicious behavior; generates events on suspected entry triggers.
Media usage log analytics ("targeted marketing"): the analytics system adds short-term and long-term scores to marketing content based on media content usage statistics; this information helps target end users with the right marketing content.
Predictive trading: a distributed GPU-based event processing platform to significantly speed up algorithmic trading computations, namely market event parsing and market event matching against strategies. Strategies are distributed in disjoint clusters to enable highly parallelizable event matching; in each cluster, strategies are stored as contiguous blocks of memory to enable fast sequential access and improve memory locality.
Computation-intensive jobs using GPUs: just as in traditional parallel computing, distributed GPU processing scales better than processing on shared systems because it does not overwhelm shared system resources.