47. Acquire Data (or so you think)
WUT!? Invalid UTF8?
Fix the encoding issue…
Yell at the engineers
Some columns are missing!?
Run the script…DIVISION BY ZERO!!!
64.
<source>
  type tail
  path /var/log/apache/access.log
  tag web.access
  format apache2
</source>
<match web.access>
  type mongo
  user kiyoto
  password heartbleed
  database web
  collection access
  … # host, port, etc.
</match>
Apache log → Fluentd → MongoDB
65.
<match web.access>
  type copy
  <store>
    type mongo
    user kiyoto
    password heartbleed
    database web
    collection access
    … # host, port, etc.
  </store>
  <store>
    type s3
    … # aws secret, bucket, etc.
  </store>
</match>
Apache log → Fluentd → MongoDB and S3
Hey everyone!
Today I’m going to talk about how we use fluentd and MongoDB to power analytics @ Wish.
After I’m done, Kiyoto, developer evangelist at Treasure Data and a maintainer of fluentd, is going to talk about how easy it is to fit fluentd into your architecture.
So, I’m Adam. I run infrastructure & operations at Wish and I’ve been responsible for our MongoDB deployment since day 1.
Like most of you, my background is in development. Back in our cool, pre-launch startup days, I was a backend developer. Then we launched. Suddenly someone had to run production. I volunteered. Now we’re 3 years & 30M users later and I know way too much about MongoDB. So that’s my story.
Let’s talk about Wish.
MongoDB has been our primary database since day 1 back in 2011. And I’m here, 30 million users later. So, apparently it’s going pretty well so far.
A little bit about our infrastructure: we run 67 mongods, and we recently moved the DB from AWS to an AWS/bare-metal hybrid, mostly because SSDs rock.
Wish is a mobile eCommerce platform. We use personalization technology to give users a feed of relevant products and help them discover cool products at great prices.
We have a top 10 app on both iOS and Android, with over 2 million products for sale. [ITERATE, mention 100s of M GMV]
So, how did we get there?
Just about every product change at Wish starts life as an experiment. This does 2 things:
It lets us understand the impact of our decisions more rigorously
It helps us build better intuition about what to do next.
Let’s take an example…
Here’s our billing info page in Android. Pretty standard. Give us your CC# and billing zip.
Well, we looked at the data and international users were dropping off here at a strangely higher rate than American ones. How can we improve that?
We had a hypothesis: maybe “billing zip” is confusing to our non-American users.
So, here’s that page again.
We made this change as an A/B test. 50% of Android users saw this version with 1 sentence explaining billing zip. Trivial change.
Now – how do we know if our clever hypothesis was actually true?
Well, we need data. Specifically, checkout conversions for international, Android users in that experiment. And we need a way to get it easily.
Let’s see what our systems come up with.
So, this is a report from one of the tools I’ll talk about later. It shows the impact of this A/B test over core actions. Most changes are tradeoffs, so it’s important to understand those dynamics.
Mostly green, a little red. Number of users buying went up. Profile views went down a bit. I guess users were so busy buying things they forgot to look at profiles?
Thanks to this data, we know we got a 7% boost in sales. For 15 minutes of work.
That’s pretty awesome. And doing this hundreds of times is how we grew.
Wish is a data-driven company; virtually all decisions come from data.
A lot of companies say they’re data-driven, but to do this rigorously is hard. Often, analytics are clunky, one-off, inflexible, annoying. Show of hands: how many of you have seen this?
Analytics that are clunky won’t get used. They have to be frictionless.
So: now we know what we want. Let’s shift gears to how.
At a high level, it’s a pretty standard setup.
Take application logs.
Aggregate them and send them to Hadoop.
Use MapReduce to analyze them.
And store the results in MongoDB to serve user-facing apps.
So, first: logging.
A key idea of our analytics system is that our request logs are the main source of truth. If you log all the details of every HTTP request, you can pretty much reconstruct everything that’s happened in your app. We think of them sorta like the oplog.
As proof of this: back in early 2012, a bad migration destroyed a bunch of data. We had to restore from backup, but those logs & some cobbled together tools let us replay basically everything we lost. Really powerful.
That power lets us answer any question over any time range, even without knowing the question in advance. So, as our business and product evolve, we’re confident that we have all the data we need.
Let’s take a look at these logs…
This is the log of a request to get a product feed
Let’s zoom in…
Here we have properties specific to that page. What products were shown? What category, what sort, what filters? What was count/offset in the feed?
Now we have general app-level properties. User ID, web or mobile, country code. ID of the request, ID of their previous request. What experiment buckets they’re in.
These properties let us really drill down into our analytics and get different views of data.
Last, we have the HTTP things you’d expect in a request log. URI, referrer, arguments, method, locale, and response code.
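To make those three groups of properties concrete, here’s a rough sketch in Python of what one request log entry could look like. Every field name and value below is made up for illustration; the talk describes the kinds of properties we log, not the actual schema.

```python
import json

# Hypothetical request log entry. All field names and values are illustrative.
request_log = {
    # Properties specific to the page (here, a product feed)
    "page": {
        "product_ids": ["p1", "p2", "p3"],
        "category": "fashion",
        "sort": "recommended",
        "filters": [],
        "count": 3,
        "offset": 0,
    },
    # General app-level properties
    "app": {
        "user_id": "u42",
        "client": "android",
        "country": "CA",
        "request_id": "r1001",
        "previous_request_id": "r1000",
        "experiments": {"billing_zip_help": "show"},
    },
    # The usual HTTP properties
    "http": {
        "uri": "/api/feed",
        "referrer": None,
        "args": {"count": "3", "offset": "0"},
        "method": "GET",
        "locale": "en_CA",
        "status": 200,
    },
}


def serialize(entry):
    """Serialize one log entry as a single JSON line."""
    return json.dumps(entry, sort_keys=True)


line = serialize(request_log)
```

One JSON object per line keeps each request self-describing, which is what makes replaying or re-analyzing the log later practical.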
Request logs are great. They’re an easy guarantee that you’re not missing anything. But, one problem…
We have around 500 billion entries in our request log. Compressed, it’s almost 20 TB. With Hadoop, we can get through it, but it’s needlessly expensive for common queries.
Here, an important principle of schema design in MongoDB applies. Denormalize for performance.
If we have something we know we’ll have to search a lot, we log it separately. Let’s take an example…
Since we’re in eCommerce, transactions are pretty important and we do a lot of analytics on them. So, every transaction gets logged separately.
This is the abridged version of a transaction log. Transaction ID, user ID. Total price, shipping cost. And a list of items the user bought.
With this information, we can write cleaner, faster queries instead of trying to parse the information out of the HTTP request that actually completed the purchase.
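As a small sketch of why the separate log helps, here’s a Python example that totals spend per user straight from transaction entries, with no HTTP parsing involved. The field names and amounts are assumptions for illustration, not our real schema or data.

```python
# Hypothetical abridged transaction log entries. Field names are illustrative.
transactions = [
    {
        "transaction_id": "t1",
        "user_id": "u42",
        "total": 25.0,
        "shipping": 3.0,
        "items": [
            {"product_id": "p1", "quantity": 1, "price": 10.0},
            {"product_id": "p2", "quantity": 1, "price": 12.0},
        ],
    },
    {
        "transaction_id": "t2",
        "user_id": "u42",
        "total": 15.0,
        "shipping": 2.0,
        "items": [{"product_id": "p3", "quantity": 1, "price": 13.0}],
    },
    {
        "transaction_id": "t3",
        "user_id": "u7",
        "total": 8.5,
        "shipping": 1.5,
        "items": [{"product_id": "p1", "quantity": 1, "price": 7.0}],
    },
]


def revenue_by_user(entries):
    """Sum the total price of all transactions for each user."""
    totals = {}
    for entry in entries:
        user = entry["user_id"]
        totals[user] = totals.get(user, 0.0) + entry["total"]
    return totals


totals = revenue_by_user(transactions)
```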
Now that we have thorough logs all over the place, let’s talk about how we get them to Hadoop for analysis.
What are our options here?
We could send them synchronously over the network to Hadoop. But, even if everything is fine, that costs RTT. And, what if the destination is down or slow or erroring out? Then we’d start impacting users and possibly dropping logs. Logging shouldn’t cause user impact.
Next option: We could fire & forget with UDP. It’s fast, but unreliable and you lose logs if Hadoop goes down. Can’t drive business decisions on fundamentally unreliable data.
Thankfully: fluentd solves both of these problems. It’s a fast, reliable buffer with flexible inputs & outputs. It scales linearly and is dead-simple to run. We’ve been using it since day 1 and have had no serious issues.
What does this look like?
We start in the app which generates the logs.
The app synchronously logs to a fluentd running on the same host. There’s no network latency and the load on each local fluentd is trivial, so we’ve never had problems with these getting slow or crashing.
The local fluentd buffers logs on disk for reliability.
Periodically, it flushes those buffers to a host in our fluentd aggregation tier.
These run active/active, so scaling them is a breeze. It gives us an easy-to-monitor, easy-to-manage conduit for our logs to flow through without imposing costs on the app.
The aggregation servers periodically flush into Hadoop. As an added bonus, they also flush into S3 for backup.
At every step we have a reliable buffer, so temporary problems at one stage won’t ripple up through the system.
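The buffer-then-flush idea can be sketched in a few lines of Python. This is a toy, single-threaded stand-in, not how fluentd is implemented: the class name, the file layout, and the `send` callback are all assumptions made for illustration.

```python
import json
import os
import tempfile


class DiskBufferedForwarder:
    """Toy local agent: append records to disk, ship them in batches later."""

    def __init__(self, buffer_path, send):
        self.buffer_path = buffer_path
        self.send = send  # callable taking a list of records; raises OSError on failure

    def log(self, record):
        # Appending to a local file does not depend on the network at all.
        with open(self.buffer_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def flush(self):
        """Try to ship buffered records; keep them on disk if the send fails."""
        if not os.path.exists(self.buffer_path):
            return 0
        with open(self.buffer_path, encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
        if not records:
            return 0
        try:
            self.send(records)
        except OSError:
            return 0  # downstream unavailable; try again on the next flush
        os.remove(self.buffer_path)
        return len(records)


# Usage: the downstream starts out unavailable, then recovers.
received = []
state = {"up": False}


def send(records):
    if not state["up"]:
        raise ConnectionError("aggregator unreachable")
    received.extend(records)


forwarder = DiskBufferedForwarder(
    os.path.join(tempfile.mkdtemp(), "buffer.log"), send
)
forwarder.log({"uri": "/feed", "status": 200})
forwarder.log({"uri": "/cart", "status": 200})

shipped_while_down = forwarder.flush()  # send fails, records stay on disk
state["up"] = True
shipped_after_recovery = forwarder.flush()  # buffered records are shipped
```

The app only ever pays for a local append. A downstream outage just means the buffer grows until the next successful flush.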
So now that the logs are in Hadoop, let’s talk about what we do with them.
But first - quick show of hands – how many people here have heard of Hadoop?
Ok, and how many people have heard of Hive?
In a nutshell, Hive is a layer on top of Hadoop that lets you write SQL queries instead of MapReduce jobs to analyze data. Makes writing & maintaining complex jobs much easier.
Why do we use Hive for analysis? Hive means not needing to know the questions in advance. We don’t need to worry about optimizing schemas for specific queries, so we can just log everything and figure out the questions later.
This is really important, especially for startups. Our product and business have changed a lot, meaning we need to ask different questions. 18 months ago, we didn’t even do eCommerce. Shifting from product discovery to commerce didn’t take a rewrite of our analytics platform… just incremental changes to add new metrics.
But, one big downside of this setup is that Hadoop is notoriously hard to manage. It needs constant attention to keep things scaling nicely. We really didn’t want to deal with that.
Our friends at Treasure Data offer a scalable Hive as a service that we’ve been using since launch and it’s been great. It lets me focus on being a MongoDB expert without worrying about Hadoop.
With all the logging and analysis done, now we need to make those results easily available to everyone in the company. This is where MongoDB comes in…
Since we want to serve in real-time, we store the results of our analysis in MongoDB. I’ll show you the schemas in a minute, but, at a high level, we store 1 document for every segmentation we care about.
For example, we store a count of daily impressions for each page and every combination of gender, country, iOS vs Android vs web, and experiment buckets. We have similar collections for clicks, sessions, transactions, and normalized metrics like clicks per session.
This lets us read the results we need really quickly, at the cost of lots of writes & storage. It’s a bit unsexy, but it works really well. In total, it uses about 2 TB of storage and takes a few hours to import overnight, which is easily manageable.
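Here’s a minimal Python sketch of that pre-aggregation: count events for every combination of segment values and emit one document per combination. The events, field names, and the `SEGMENT_KEYS` tuple are assumptions for illustration. In our setup the counting happens in Hive, not in application code like this.

```python
from collections import Counter

# Hypothetical impression events. Field names and values are illustrative.
events = [
    {"day": "2014-06-01", "page_id": 1000, "gender": "male",
     "country": "CA", "client": "android", "bucket": "show"},
    {"day": "2014-06-01", "page_id": 1000, "gender": "male",
     "country": "CA", "client": "android", "bucket": "show"},
    {"day": "2014-06-01", "page_id": 1000, "gender": "female",
     "country": "CA", "client": "ios", "bucket": "show"},
]

# The dimensions we segment by; one output document per distinct combination.
SEGMENT_KEYS = ("day", "page_id", "gender", "country", "client", "bucket")


def aggregate(events):
    """Return one count document per combination of segment values."""
    counts = Counter(tuple(e[k] for k in SEGMENT_KEYS) for e in events)
    return [dict(zip(SEGMENT_KEYS, key), count=n) for key, n in counts.items()]


docs = aggregate(events)
```

Reads become a lookup of documents that already hold the answer. The price is one write per combination, which is where the storage goes.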
So this is the first half of the schema for a document that describes number of clicks on something from a certain page.
In this case, click type is ID 2 and the page ID is 1000. Apparently that happened about 20,000 times on whatever timestamp that is.
And in the other half of the schema, we have the segmentations.
So, those 20,000 clicks were from male users on an Android in Canada that saw the billing zip help text experiment we talked about.
There’d be another document for Female, Canadian iPhone users that saw the help text. And Female, Canadian Android users, etc, etc.
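Roughly, one such document and its siblings could be sketched like this in Python. The field names, the placeholder timestamp, and the segment values are assumptions for illustration; only the click type, page ID, and approximate count come from the example above.

```python
from itertools import product

# Hypothetical click-count document. Field names are illustrative.
doc = {
    # First half: what was counted
    "click_type": 2,
    "page_id": 1000,
    "timestamp": 1401580800,  # placeholder value
    "count": 20000,           # "about 20,000" in the example
    # Second half: the segmentation this count belongs to
    "gender": "male",
    "client": "android",
    "country": "CA",
    "experiments": {"billing_zip_help": "show"},
}


def sibling_segments(genders, clients, countries, buckets):
    """Enumerate every segment combination that would get its own document."""
    return [
        {"gender": g, "client": c, "country": co, "bucket": b}
        for g, c, co, b in product(genders, clients, countries, buckets)
    ]


segments = sibling_segments(
    genders=["male", "female"],
    clients=["android", "ios", "web"],
    countries=["CA"],
    buckets=["show", "control"],
)
```

The number of documents per metric, page, and day is the product of the segment cardinalities, so every new dimension multiplies the write volume.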
With all that work out of the way, we have all the data we need available in real-time for our users.
On top of the collections we just talked about, there’s an API in Python that returns time series data for whatever metric & segmentation you need. With that, developers can run wild building whatever tools they need.
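The shape of that API can be sketched as a plain Python function over pre-aggregated documents. This is not our actual API: the function name, parameters, and document fields are hypothetical, and it filters an in-memory list instead of querying MongoDB.

```python
# Hypothetical pre-aggregated documents. Field names are illustrative.
docs = [
    {"metric": "clicks", "day": "2014-06-01", "client": "android", "country": "CA", "count": 5},
    {"metric": "clicks", "day": "2014-06-01", "client": "ios", "country": "CA", "count": 3},
    {"metric": "clicks", "day": "2014-06-02", "client": "android", "country": "CA", "count": 7},
    {"metric": "sessions", "day": "2014-06-01", "client": "android", "country": "CA", "count": 2},
]


def time_series(docs, metric, **segment):
    """Return sorted (day, total) pairs for one metric, filtered by segment values."""
    totals = {}
    for d in docs:
        if d["metric"] != metric:
            continue
        # Skip documents that disagree with any requested segment value.
        if any(d.get(key) != value for key, value in segment.items()):
            continue
        totals[d["day"]] = totals.get(d["day"], 0) + d["count"]
    return sorted(totals.items())


all_clicks = time_series(docs, "clicks")
android_clicks = time_series(docs, "clicks", client="android")
```

Leaving a segment out of the call sums across it, so the same function serves both the overall trend and any drill-down.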
Let’s take a look at 2 of the most powerful ones.
The first big tool we built is called Dashy. It’s a graph dashboard that lets you drill down into metrics over time.
The graph you can see there shows the number of a certain type of click per logged-in user, broken down by what device the user was on.
The other tool I want to share today is called Perimeter.
It shows us the impact of A/B tests across many metrics. Most experiments are trade-offs: they move some metrics up & some down. Making these trade-offs clear helps us make better product & business decisions.
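The core calculation behind a report like that can be sketched as a per-metric relative lift between the control and treatment buckets. The function name and the numbers below are invented for illustration; they are not Perimeter’s code or our real data.

```python
def relative_lift(control, treatment):
    """Relative change of each metric in treatment versus control, as a fraction."""
    return {
        metric: (treatment[metric] - control[metric]) / control[metric]
        for metric in control
        if control[metric]  # skip metrics with a zero baseline
    }


# Made-up per-bucket totals for two core actions.
control = {"buyers": 1000, "profile_views": 5000}
treatment = {"buyers": 1070, "profile_views": 4900}

lift = relative_lift(control, treatment)

# Metrics that went up would render green, metrics that went down red.
colors = {m: ("green" if v > 0 else "red") for m, v in lift.items()}
```

This assumes both buckets hold the same number of users. With uneven buckets you’d compare per-user rates instead of raw totals.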
To cap it all off, analytics empowers faster iteration.
Iteration drives growth, engagement, and revenue.
Without these tools, supported by all the infrastructure we just talked about, we wouldn’t be where we are today.
Thanks for listening. If you have any questions, we’ll do Q&A at the end or feel free to shoot me an e-mail.
Now I’m gonna hand the mic off to Kiyoto from Treasure Data to talk about how fluentd works and how easy it is to use.