This is the deck I presented at the Big Data eXposed event, September 30, at the David Intercontinental, Israel.
In this session I'll take the audience on a short trip through eXelate's cloud and present three big data related challenges and how we faced them.
The digital marketing industry includes a huge ecosystem: publishers, data providers, data management platforms, ad networks, marketing agencies and marketers. Data is the fuel driving this industry. eXelate takes the raw data coming online from publishers' sites, enhances it with data coming from various offline providers, runs a set of deterministic rules and analytical models to mark the users browsing the internet with specific attributes, and sells it online to the marketers (via agencies and DMPs, data management platforms).
The basic business entity we are selling is a segment. A segment is a group of individuals (part of a larger population) who share similar attributes. Different segments have different attributes and may be defined as a target audience that is interesting for marketers. Marketers are always seeking the most relevant target audience in order to lift their sales.

We can identify three major categories of segments. The first is demographic segments, which include demographic characteristics such as age, gender, income, education, employment level etc. These are quite static characteristics that barely change. The second category includes behavioral aspects, like domains of interest (sport, wines, travel, shopping etc.). The third category is intent. As the name suggests, it implies that the person browsing the internet has an immediate intent (like purchasing a specific item). These segments are the most relevant for marketers, since an advertisement related to the user's intent will most likely be the most efficient one from the marketer's point of view and the most relevant for the user.

eXelate "marks" the user browsing the publishers' sites with segments and sells them online to the marketers / DMPs so they can direct their ads to the targeted audience.
Our journey starts in the browser. A user browses to one of our partners' (publishers') sites, like HomeAway, Kayak and Pronto. The site includes a tag (pixel) with a reference to eXelate's URL. In most cases, the publisher adds details to the URL (e.g. publisher id, some details he knows about the user, tags representing user activities on the site) and the data is processed by eXelate.

This is the place to say that eXelate is very sensitive to and aware of privacy issues. eXelate does not keep any attribute identifying the user. The user is represented as an anonymous entity belonging to some groups (segments). We are not interested in the specific user's details but only in the fact that he is part of a large group of individuals with similar attributes.
What is now happening inside the cloud? The request is processed within 200 ms to generate a response: a redirect to up to 5 buyers' URLs. How do we identify those buyers? First we process the individual event information by extracting parameters from the URL and the user's browser agent. From the cookie we can tell whether we have previous information on the user or should generate a new entity representing him. We add historical data and apply a set of rules and analytical models to the data to generate the segments. After the user is "marked" with segments, we run a buyer matching process to select the 5 most relevant buyers and include their URLs in the redirect response.

We have 5 billion events generating 27 GB of data every day. We have a total of 850 million unique users at any given moment, and their data spans 14 TB of storage. We have more than 500,000 rules and 20,000 segments that we generate and sell to more than 100 media platforms.
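The per-event flow above can be sketched as follows. This is a minimal Python sketch for illustration only: the names (`handle_event`, the dict-backed `USER_STORE`) are my assumptions, not eXelate's actual code, and the real pipeline applies rules and models far richer than the tag copy shown here.

```python
# Illustrative sketch of the per-event serving flow (not production code).
from dataclasses import dataclass, field

@dataclass
class User:
    uid: str
    segments: set = field(default_factory=set)
    history: list = field(default_factory=list)

USER_STORE = {}  # stand-in for the real user storage

def load_or_create_user(cookie_uid):
    # The cookie tells us whether this is a known user
    # or a new anonymous entity we should create.
    if cookie_uid not in USER_STORE:
        USER_STORE[cookie_uid] = User(uid=cookie_uid)
    return USER_STORE[cookie_uid]

def handle_event(cookie_uid, url_params, buyers, max_buyers=5):
    user = load_or_create_user(cookie_uid)
    user.history.append(url_params)
    # Rules and models would mark the user with segments here;
    # copying the publisher's tags stands in for that step.
    user.segments.update(url_params.get("tags", []))
    # Buyer matching: pick up to 5 buyers interested in these segments.
    matched = [b for b in buyers if b["wants"] & user.segments][:max_buyers]
    return [b["url"] for b in matched]  # redirect targets
```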
So, we do all the above within a short time of 200 ms. We have 3 challenges we would like to share:
Relevancy: how do we identify the data most relevant to the buyer?
Access time: how do we access a single user's info in the shortest time?
On-demand analytics: how do we process huge data sets on demand to provide meaningful insights to the marketer?
We process 5 billion events per day, which produce a lot of data, but a lot of data means a lot of noise. Smart data, the actionable data, is the signal we are looking for. In the previous slide, intent combined with user characteristics (segments) is a good example of smart data: a well targeted ad will be the most relevant for this user. More than that, it will be most relevant now, not tomorrow and not even in the next hour.
The problem is that our data set is not static: millions of users are browsing the web, executing a lot of actions every hour, and we can learn new things about them. The goal is to mark as many users as we can in segments, to cover most of the target audience. In the classical approach, we would just take a snapshot of the daily data once a day or every few hours, run the rules and analytical models to score the users, and generate the groups of segments. The problem is that we will not meet most of the users again: a user may be active for the next minutes or hours, maybe again after a week. After a month it is most likely that we will not see him again (actually we will see him, identified as another user). The relevancy of the data drops rapidly. We need to perform scoring and segmentation in real time, since the actionable data should be available for the advertiser in real time.
The first step is to generate the analytical models. Our data scientists extract the data from the database; we use Netezza as our data warehouse. This data is event-centric and represents all the events generated by the users over the last 90 days. The data scientists build the models using R and run them on the Amazon cloud. After validating the models, they implement them as Java packages, which we deploy to our eXtream cluster to perform scoring and segmentation in real time.
Inside the cluster, we run the following sequence on every event (URL call):
We run basic rules (defined in an XML document) to generate demographic segments.
Having the event info and the user history, we run association rules: deterministic rules that can be implemented by pattern matching. We use JBoss Drools, which implements the RETE algorithm, the most efficient one for this purpose.
Finally, we apply the analytical models that do the scoring and segmentation based on advanced algorithms like linear regression and collaborative filtering.
In the near future, we will add real-time learning to generate the analytical models on the fly.
We have over 500,000 rules and quite complex algorithms, and still growing. Can we do all that within 200 ms?
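The three-stage sequence might look like the following toy sketch. It is Python for brevity only; in production the basic rules live in XML, the association rules run in Drools, and the models are Java packages. Every rule, field name and threshold below is invented for illustration.

```python
# Illustrative three-stage segmentation sequence (all rules are made up).

def basic_rules(event):
    # Stage 1: demographic segments from simple attribute checks.
    segs = set()
    if event.get("age", 0) >= 18:
        segs.add("adult")
    return segs

def association_rules(event, history):
    # Stage 2: deterministic pattern matching over event + history
    # (Drools/RETE territory in the real system).
    segs = set()
    pages = {e["page"] for e in history} | {event["page"]}
    if {"flights", "hotels"} <= pages:
        segs.add("travel_intent")
    return segs

def analytical_models(event, history):
    # Stage 3: placeholder for scoring models
    # (e.g. linear regression, collaborative filtering).
    return {"frequent_visitor"} if len(history) > 10 else set()

def segment(event, history):
    # Union of all three stages marks the user.
    return (basic_rules(event)
            | association_rules(event, history)
            | analytical_models(event, history))
```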
Our solution is Continuous Incremental Segmentation. We separated the serving layer from the rules computation (segmentation) layer. Each layer contains dedicated hardware most suitable for its role. Communication between the layers is implemented using 0MQ, a blazing fast messaging infrastructure. A single message, a request for segmentation, is sent from the serving cluster to the segmentation cluster. Models and rules run in parallel to generate segments, and results are sent asynchronously to the calling process. Each rule and model sends its results independently of the other models. The serving cluster collects all the results within a specific time frame and builds the response. Results that do not arrive within this time frame are not included in the response. The segments and intermediate results are stored in the user storage.

Why is it continuous? Because the same process will be performed again for the same user on his next action on the page (or even on another page), which will result in a call to eXelate serving. Why is it incremental? Because the next run will include only those models and rules that were not included in the previous response. Eventually, the process spans several iterations (calls) to generate as many segments as possible.
We saw earlier that we process over 5 billion events per day, and we need fast access (both read and write) to the storage hosting the users' information. Moreover, this data must be available to any machine in the cluster, and even to machines in another data center.
What are the requirements for the user storage? For each user we save the segments, delivery information (to whom and when we delivered this user's info as part of a segment), and intermediate results. A user object's size may vary between a few dozen KB and a few hundred KB. We hold 850 million unique users at any given moment. We need fast access time (read and write), within a few milliseconds. We need the user object to be available to every machine in the cluster and across data centers. We examined a few NoSQL databases: Couchbase and MongoDB. Eventually we selected Aerospike.
Aerospike is a key-value NoSQL DB:
Supports billions of objects, works well in a cluster
Above 500K TPS; we gain 1-2 ms access time
Smart data eviction policy
Optimized for SSD, indexed in RAM
Object partitioning by namespaces
Cross data center replication
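The per-user record described above can be pictured with the following stand-in. A plain dict replaces the Aerospike cluster here, and the field names are illustrative assumptions, not the actual schema.

```python
# Toy model of the key-value user record: key = (namespace, user id),
# value = the three kinds of data we keep per user.
store = {}

def put_user(namespace, user_id, segments, delivery, intermediate):
    store[(namespace, user_id)] = {
        "segments": segments,          # segments the user belongs to
        "delivery": delivery,          # to whom / when this user was delivered
        "intermediate": intermediate,  # partial rule/model results
    }

def get_user(namespace, user_id):
    # Returns the user record, or None for an unknown user.
    return store.get((namespace, user_id))
```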
We have a 9-node Aerospike cluster in each of our 4 serving data centers. Aerospike replicates the data across data centers within minutes. Some of the Aerospike challenges we faced:
It was (and still is) a cutting-edge technology. We encountered some instabilities and bugs, but we were actually their beta site and they provided very good support. Version 3 will be released soon, and it looks like they are becoming a more stable and mature product.
Very basic management and monitoring capabilities compared to other products.
A small install base and ecosystem, resulting in a small knowledge base.
Requires specific hardware (SSD), which is not yet commodity hardware.
The background processes generate a lot of valuable data that can provide meaningful insights to another kind of client. This would be a marketing manager, short on time, working on several campaigns and needing to take marketing decisions quickly. For these customers we provide optiX.
OptiX is an interactive application that helps the marketer understand the market and get insights regarding the audience he is targeting. OptiX provides this information based on the huge data cloud in our data warehouse. In this sample we can see a screen providing on-demand calculation of the sizes of the segments selected by the user. These are not simple counts; they include aggregation and de-duplication on the fly.
Here is a more complicated screen with even more numbers to calculate on the fly. The numbers are calculated on the fly; we can't do it ahead of time, since the calculation is based on the user's parameter selection. The challenge is to calculate it fast on big data sets. This could be a problem easily solved by a well-indexed relational database; the problem is that an RDBMS is not capable of processing this amount of data quickly and efficiently.
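The aggregation-plus-de-duplication part is the crux: a user who belongs to several of the selected segments must be counted once, so the per-segment counts cannot simply be summed. A minimal sketch (the in-memory sets are illustrative; the real data lives in the data warehouse):

```python
# Unique audience size across selected segments: union, then count.
def unique_audience(segment_members, selected):
    """segment_members: {segment_name: set of user ids}."""
    audience = set()
    for seg in selected:
        audience |= segment_members.get(seg, set())  # de-duplicates overlaps
    return len(audience)
```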
We selected a solution based on a search engine. A search engine is optimized to count word instances over a large set of documents. This is very similar to our use case: in the on-demand queries, we are not interested in the data itself but in the number of instances (users) who share the same segments. We selected Elasticsearch for this task. Elasticsearch is an open source, fast search engine based on Apache Lucene. We built a 30-node Elasticsearch cluster on the Amazon cloud. The aggregator collects the data from the Netezza data warehouse and reorganizes it in a structure optimized for our search (the data in Netezza is event-centric, while we need it to be user-centric in Elasticsearch). The aggregator generates data files and loads them to Amazon S3. Another process, the loader, loads them into Elasticsearch, where the data is indexed. When a query is issued from the OptiX application server, it is translated by the reporter machine into a set of queries running in the ES cluster. ES runs the queries in parallel and aggregates the results (a la Hadoop style), and the result is returned to the application server within 1 second.
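An on-demand segment count of this kind maps naturally onto the Elasticsearch query DSL. The sketch below builds a request body using a `terms` filter plus a `cardinality` aggregation to de-duplicate users on the ES side; the field names (`segments`, `user_id`) are my assumptions about the user-centric index, not the actual mapping.

```python
# Hypothetical ES request body: count distinct users in selected segments.
def segment_count_query(selected_segments):
    return {
        "size": 0,  # we only need counts, not the matching documents
        "query": {"terms": {"segments": selected_segments}},
        "aggs": {
            # cardinality approximates the number of distinct users
            "unique_users": {"cardinality": {"field": "user_id"}}
        },
    }
```

The body would be posted to the index's `_search` endpoint; the count comes back under `aggregations.unique_users.value`.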
Data relevancy:
Real-time scoring
Parallel processing
Split processing over time

Big data access time:
Front-end, replicated Aerospike cluster

On-demand analytics:
Change your schema to optimize query time
Move processing from the querying phase to the loading phase
Trade-off: Space + Processing -> Performance