
Sunday, July 24, 2011
ajackson
                              @
                   skylineinnovations.com


Sunday, July 24, 2011
a tale of rapid
                          prototyping, data
                          warehousing, solar
                        power, an architecture
                          designed for data
                          analysis at “scale”
                           ...and arduinos!
Sunday, July 24, 2011

So here's what I'd like to talk about: who we are, how we got started, and most importantly,
how we've been able to use MongoDB to help us. We're not a traditional startup -- and while
I know this is not a "startups" talk but a Mongo one, I'd like to show how Mongo's flexible
nature really helped us as a business, and how Mongo specifically has been a good choice
for us as we build some of our tools. Here are some themes:
Scaling



Sunday, July 24, 2011

Mongo has come to have a pretty strong association with the word “scaling.”

Scaling is a word we throw around a lot, and it almost always means “software performance,
as inputs grow by orders of magnitude.”

But scaling also means performance as the variety of inputs increases. I’d argue that it’s
scaling to go from 10 users to 10,000, and it’s also scaling to go from ten ‘kinds’ of input to
a hundred.

There’s another word for this.
Scaling
                                Flexibility


Sunday, July 24, 2011

Particularly when you scale in the real world, you start to find that it's complicated and messy
and entropic in ways that software isn't always equipped to handle. So for us, when we say
"Mongo helps us scale", we don't necessarily mean scaling to petabytes of data. We'll come
back to that kind of scaling as well.
Business-first
                        development


Sunday, July 24, 2011

This generally means flexible, lightweight processes. Things that become fixed &
unchangeable quickly become obsolete and sad :'(
When Does
                “Context”
               become “Yak
                 Shaving”?


Sunday, July 24, 2011

When I read new things or hear about new stuff, I'm always trying to put it in context. So
sometimes I put too much context in my talks :( To avoid that, I sometimes go a little too fast
over the context that *is* important. So please stop me to ask questions! Also, the problem
domain here is a little different from what we might be used to, so bear with me as we go into
plumbing & construction.
Preliminaries



Sunday, July 24, 2011
Est. 8/2009
Sunday, July 24, 2011
Project Development
                                 +
                             Technology


Sunday, July 24, 2011
“Project Development”
Sunday, July 24, 2011
finance, develop, and operate
                 renewable energy and efficiency
                   installations, for measurable,
                        guaranteed savings.



Sunday, July 24, 2011
finance, develop, and
                    operate renewable energy
                   and efficiency installations, for
                   measurable, guaranteed savings.



Sunday, July 24, 2011

We’ll pay to put stuff on your roof, and we’ll keep it at its maximally awesome.
finance, develop, and operate
                    renewable energy and
                  efficiency installations, for
                  measurable, guaranteed savings.



Sunday, July 24, 2011

Right now, this means solar thermal, more efficient lighting retrofits, and maybe HVAC.
finance, develop, and operate
                  renewable energy and efficiency
                  installations, for measurable,
                      guaranteed savings.



Sunday, July 24, 2011

So, here's the interesting part. Since we put stuff on your roof for free, we need to get that
money back. What we do is charge you for the energy it saved you -- but here's the twist.
Other companies have done similar things, where they say "we'll pay for a system/
retrofit/whatever, you'll agree to pay us an arbitrary number, and we promise you'll get
savings, but you won't actually be able to tell." That always seemed sketchy to us. So we
actually measure the performance of this stuff, collect the data, and guarantee that you
save money.
(not webapps)



Sunday, July 24, 2011
Topics not covered:



Sunday, July 24, 2011
• Why solar thermal?
                        • Why hasn’t anyone else done this before?
                        • Pivots? Iterations?
                        • What’s the market size?
                        • Funding? Capital structures?
                        • Wait, how do you guys make money?

Sunday, July 24, 2011

Oh, right, this isn’t a startup talk. But feel free to ask me these later!
Solar Thermal in Five
                               Minutes
                            ( mongo next, i promise! )




Sunday, July 24, 2011
Municipal
                           =>
                          Roof
                           =>
                          Tank
                           =>
                        Customer
Sunday, July 24, 2011
Relevant Data to Track



Sunday, July 24, 2011
Temperatures
                        (about a dozen)


Sunday, July 24, 2011
Flow Rates
                        (at least two)


Sunday, July 24, 2011
Parallel data streams
                          (hopefully many)


Sunday, July 24, 2011

e.g., weather data, insolation data. It’d be nice if we didn’t have to collect it all ourselves.
how much data?
                        20 data points @ 4 bytes
                        1 minute intervals
                        at 1000 projects (I wish!)
                        for 10 years
                        80 * 60 * 24 * 365 * 10 * 1000 = 400 GB?
                        ...not much, really, “in the raw”


Sunday, July 24, 2011

unfortunately, we can’t really store it with maximal efficiency, because of things like
timestamps, metadata, etc., but still.
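
For the skeptical, a quick back-of-envelope check of that estimate in Python (raw samples only;
this ignores timestamps, metadata, and storage overhead):

    bytes_per_sample = 20 * 4                    # 20 data points @ 4 bytes each
    samples_per_year = 60 * 24 * 365             # 1-minute intervals
    raw_bytes = bytes_per_sample * samples_per_year * 10 * 1000   # 10 years, 1000 projects
    print(raw_bytes / 1e9)                       # ~420 GB -- the "400 GB or so" on the slide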
Sunday, July 24, 2011

I hope this provides enough context on the business problems we're trying to solve. It looks
like we'll need a data pipeline, and we'll need one fast.

We've got data that we'll need to use to build, monitor, and monetize these energy
technologies. Having worked at other smart-grid companies before, I've seen some good
data pipelines and some bad data pipelines. I'd like to build a good one. The less stuff I
have to build, the better.
Sunday, July 24, 2011

As i do some research, i find that a lot of these data pipelines have a few well-defined areas
of responsibility.
Acquisition,
                         Storage,
                          Search,
                         Retrieval,
                         Analytics.



Sunday, July 24, 2011

These should be self-explanatory. What's interesting is that most of the end users of these
systems are analysts who want to analyze, yet most systems seem to be designed around the
other functions. More importantly, the pieces aren't very well decoupled: by the time the
analysts get to start building tools, the design decisions made at the beginning are
inextricable from the systems that came before.
Acquisition,
                         Storage,
                          Search,
                         Retrieval,
                                                }       Designed for these



                         Analytics.            <=           Users are here




Sunday, July 24, 2011

Acquisition,
                         Storage,
                          Search,
                         Retrieval,
                         Analytics.



Sunday, July 24, 2011

These should be self-explanatory. What's interesting is that most of the end users of these
systems are analysts who want to analyze, yet most systems seem to be designed around the
other functions. More importantly, the pieces aren't very well decoupled: by the time the
analysts get to start building tools, the design decisions made at the beginning are
inextricable from the systems that came before.

It's important to remember that, while you can't get good analytics without the other stuff,
the analytics is where almost all of the value is! Search and retrieval are approaching "solved"
problems.
Acquisition,
                         Storage,
                          Search,
                         Retrieval,
                                                }       Designed for these



                         Analytics.             <=     Users are here
                                                Business value is here!




Sunday, July 24, 2011

Sunday, July 24, 2011

So, here's how I started thinking about things. This is a design diagram from the early days
of the company.
Sunday, July 24, 2011

Easy -- Python, no problem. There are some interesting topics here, but they're not MongoDB
related. I was pretty sure I knew how to build this part, and I was pretty sure I knew what the
data would look like.
Sunday, July 24, 2011

This part was also easy -- e-mail reports, csvs, maybe some fancy graphs, possibly some
light webapps for internal use. These would be dictated by business goals first, but the
technological questions were straightforward.
Sunday, July 24, 2011

Here was the real question.

What would an analyst having a good experience look like? What would they expect the
tools to do?
Now we can think
                        about what the data
                             looks like


Sunday, July 24, 2011

So, let's think about what this data looks like, how it's structured, and what it is. Then we
can look at the best ways to organize it for future usefulness.
Time series?
Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return
temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count
Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458
Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468
Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471
Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477
Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581
Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614




Sunday, July 24, 2011
TIME SERIES
                           DATA


Sunday, July 24, 2011

So what is time series data?
Features, Over Time




Sunday, July 24, 2011

multi-dimensional features. What’s fun in a business like this is that we’re not really sure
what the features we study will be. -- Flexibility callout
Features, Over Time

               [chart: a feature vector v ("Thing") plotted against time t]


Sunday, July 24, 2011

Features, Over Time

               [chart: a feature vector v ("Thing") plotted against time t]


Sunday, July 24, 2011

Sunday, July 24, 2011

A couple of ideas:
sampling rates. “regularity”. “completeness”
analog vs. digital
instantaneous vs. cumulative (tradeoffs)
[chart: a time range from t_n to t_(n+1)]


Sunday, July 24, 2011

Finding known interesting ranges (definitely the most common)
[chart: a time range from t_n to t_(n+1)]


Sunday, July 24, 2011

[chart: feature values over times t, t', ...]
Sunday, July 24, 2011

Using features to find interesting ranges.

These two ways to look for things should inform our design decisions.
[chart: a feature y plotted over times t, t', ...]
Sunday, July 24, 2011

[chart: feature y with a threshold y' marked, over times t, t', ...]
Sunday, July 24, 2011

[chart: feature y with a threshold y' marked, over times t, t', ...]
Sunday, July 24, 2011

(more complicated stuff
                   can be thought of as
                    transformations...)


Sunday, July 24, 2011

e.g., frequency analysis, wavelets, whatever.
Sunday, July 24, 2011

At this point, I go off and do a bunch of research on existing technologies. I really hate
reinventing the wheel, and we really don’t have the manpower.
Time series specific tools



                        Scientific tools & libraries



                        Traditional data-warehousing approaches



Sunday, July 24, 2011

So, these were some of the options I looked at. I want to quickly point out why I eliminated
the first two classes of tools.
Time series specific tools

                           RRDtool -- Round Robin Database




Sunday, July 24, 2011

There are surprisingly few of these. One of the best is RRDtool. It's pretty sweet, and
I highly recommend it. Unfortunately, it's really designed for applications that are highly
regular and already pretty digital -- for instance, sampling latencies, or temperatures
in a datacenter. It's not really good for unreliable sensors, nor is it really designed for long-
term persistence. It also has really high lock-in, with legacy data formats, etc. Don't get
me wrong, it's totally rad, but I didn't think it was for us.
Scientific tools & libraries

                           e.g., PyTables




Sunday, July 24, 2011

Pretty cool, but not many of these were mature & ready for primetime. Some that were, like
PyTables, didn’t really match our business use-case.
Traditional data-warehousing approaches



Sunday, July 24, 2011

Having eliminated the first two classes of tools, that leaves us with the traditional
approaches. This is a pretty well-established field, but very few of the tools are free,
lightweight, and mature.
Enterprise buzzwords
                           (Just google for OLAP)




Sunday, July 24, 2011



But the biggest idea I learned is that most data warehousing revolves around the idea of a
"fact table". They call it a "multidimensional OLAP cube", but basically it exists as a totally
denormalized SQL table.
“Measures”
                          and their
                        “Dimensions”


Sunday, July 24, 2011

(or facts)
pretty neat!
Sunday, July 24, 2011
“how elegant!”

Sunday, July 24, 2011
in practice...



Sunday, July 24, 2011
Sunday, July 24, 2011
(from “How to Build OLAP Application Using Mondrian
                                + XMLA + SpagoBI”)
Sunday, July 24, 2011

to which the only acceptable response is:
Sunday, July 24, 2011

ha! Yeah right.
Time series are not relational!
Sunday, July 24, 2011

even extracted features are not inherently relational!

Also: you don’t know what you’re looking for, you don’t know when you’ll find it, you won’t
know when you’ll have to start looking for something different.
Why would you lock yourself into a schema?
We don’t know what
                        we’ll want to know.


Sunday, July 24, 2011

We won’t know what we want to know. Not only are we warehousing time-series of
multidimensional feature vectors, we don’t even know the dimensions we’ll be interested in
yet!
natural fit for
                          documents


Sunday, July 24, 2011

This makes a schema-less database a natural fit for these sorts of things. Think about all the
ALTER TABLE calls I've avoided...
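
As a small, hypothetical illustration of that point (modern pymongo; the database name and
the derived "Generation Per Panel" field are made up), adding a brand-new measure to
existing fact documents is just an update, not a schema migration:

    from pymongo import MongoClient

    facts = MongoClient()["warehouse"]["facts_daily"]   # assumed names

    # back-fill a new derived measure onto documents that already exist
    for doc in facts.find({"measures.Generation": {"$exists": True}}):
        per_panel = doc["measures"]["Generation"] / doc["install"]["panels"]
        facts.update_one({"_id": doc["_id"]},
                         {"$set": {"measures.Generation Per Panel": per_panel}})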
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011

isn’t this better?
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,      “measures”
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                                                                         “dimensions”
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                                                                                         ...right?
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011

Measures & dimensions. This would be a nice, clean division, except that it isn't. Frequently
we'll look for measures by other measures -- i.e., each measure serves as a dimension.
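
In practice that means any measure gets queried like a dimension. A quick pymongo sketch
(the threshold and the database name are made up): the days where we sold more than some
amount of energy:

    from pymongo import MongoClient

    facts = MongoClient()["warehouse"]["facts_daily"]   # assumed names
    big_days = facts.find({"measures.Energy Sold": {"$gt": 400000}},
                          {"measures.Energy Sold": 1}).sort("_id", 1)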
...actually, not a good
                                model.


Sunday, July 24, 2011

The line gets pretty blurry, in practice. Multi-dimensional vectors mean every measure
provides another dimension.
Anyway!
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011

How do we build these quickly & efficiently?
the goal: good numbers!



Sunday, July 24, 2011

Remember, the goal here is to make it easy for analysts to get comparable numbers, so when
I ask for the delivered energy for one system, compared to the delivered energy from
another, I can just get the time-series data, without having to worry about whether sensors
changed, when the network was out, when a logger was replaced with another one, etc.
Sunday, July 24, 2011

So, the OLTP layer serving as our inputs essentially serves up timestamped data as CSV
series. It doesn't really provide a lot of intelligence; it's basically just the raw numbers.
from rows
                             to columns


Sunday, July 24, 2011

So, most of what our pipeline does is turn things from rows to columns, in a flexible, useful
way. I’m gonna walk through that process, quickly.
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {


                                                                       Let’s just look at one
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011
row-major data
Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return
temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count
Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458
Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468
Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471
Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477
Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581
Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614




Sunday, July 24, 2011
“Functional”
                        import functools

                        class Mass(BasicMeasure):
                            def __init__(self, density, volume):
                                ...
                                self._result_func = functools.partial(
                                    lambda data, density, volume: density * volume(data),
                                    density=density, volume=volume)

                            def __call__(self, data):
                                return self._result_func(data)




Sunday, July 24, 2011

quasi-functional classes that describe how to calculate a value from data.
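
To make the pattern concrete, here's a self-contained, hypothetical sketch of the same idea:
measures are callables built from other callables, so they compose. (BasicMeasure is stubbed
out and the column name is made up; this is not the real code.)

    import functools

    class BasicMeasure:
        """Stand-in for the real base class."""

    class Volume(BasicMeasure):
        """Sums a flow-meter column over a chunk of rows."""
        def __init__(self, column):
            self._result_func = lambda data: sum(row[column] for row in data)

        def __call__(self, data):
            return self._result_func(data)

    class Mass(BasicMeasure):
        """density * volume(data), as on the slide above."""
        def __init__(self, density, volume):
            self._result_func = functools.partial(
                lambda data, density, volume: density * volume(data),
                density=density, volume=volume)

        def __call__(self, data):
            return self._result_func(data)

    chunk = [{"flow": 2.5}, {"flow": 3.1}]            # fake rows
    mass = Mass(density=1.0, volume=Volume("flow"))   # ~1 kg per litre of water
    print(mass(chunk))                                # -> 5.6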
"_id" : {
                                        "install.name" : "agni-3501",
                                        "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                        "frequency" : "daily" },
                                "measures" : {
                                        "total-delta" : -85.78773442284201,
                                        "Energy Sold" : 450087.1186574721,
                                        "Generation" : 57273.159890170136,
                                        "consumed-delta" : 12.569841951556597,




                                                        A formula:

                                                      E = ∆t × F
                        #pseudocode
                        class LoopEnergy(BasicMeasure):
                            def __init__(self, heat_cap, delta, mass):
                                ...
                                def result_func(data):
                                    return self.delta(data) * self.mass(data) * self.heat_cap
                                self._result_func = result_func

                            def __call__(self, data):
                                return self._result_func(data)




Sunday, July 24, 2011
Creating a Cube
                        For each install, for each chunk of data:

                            apply all known formulas to get values

                            make some convenience keys (e.g., day_of_year)

                            stuff it in mongo

                         Then, map/reduce to whatever dimensionalities you’re
                         interested in: e.g., downsampling.




Sunday, July 24, 2011

Here’s some pseudocode for how to make a cube of multidimensional data.
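
A minimal sketch of that loop in modern pymongo -- not the actual pipeline; `installs`,
`iter_day_chunks`, and `MEASURES` stand in for the real acquisition layer and the
BasicMeasure instances from earlier:

    from pymongo import MongoClient

    db = MongoClient()["warehouse"]    # assumed database name

    def build_daily_facts(installs, iter_day_chunks, MEASURES):
        for install in installs:                          # for each install...
            for day, data in iter_day_chunks(install):    # ...for each chunk of raw rows
                doc = {
                    # dotted key mirrors the deck's documents (needs a recent server)
                    "_id": {"install.name": install["name"],
                            "timestamp": day,
                            "frequency": "daily"},
                    # apply all known formulas to get values
                    "measures": {name: m(data) for name, m in MEASURES.items()},
                    # convenience keys
                    "day_of_year": day.timetuple().tm_yday,
                    "day_of_week": day.weekday(),
                    "week_of_year": day.isocalendar()[1],
                    "month": day.month,
                    "year": day.year,
                    "day": day.day,
                    "install": install,
                }
                # stuff it in mongo; the upsert keeps the job re-runnable
                db.facts_daily.replace_one({"_id": doc["_id"]}, doc, upsert=True)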
So, what’s the payoff?
How much water did
                         [x] use, monthly?
                > db.facts_monthly.find({"install.name": [foo]}, {"measures.Gallons Sold":
                1}).sort({"_id": 1})




Sunday, July 24, 2011

Complicated analytical queries can be boiled down to nearly single line mongo-queries.
Here’s some examples:
What were our highest
                    production days?
                > db.facts_daily.find({}, {"measures.Energy Sold": 1}).sort({"measures.Energy
                Sold": -1})




Sunday, July 24, 2011

Complicated analytical queries can be boiled down to nearly single line mongo-queries.
Here’s some examples:
How does the distribution of [x]
                 on the weekend compare to its
                  distribution on the weekdays?
                > weekends = db.facts_daily.find({"day_of_week": {$in: [5,6]}})
                > weekdays = db.facts_daily.find({"day_of_week": {$nin: [5,6]}})
                > do stuff




Sunday, July 24, 2011

Complicated analytical queries can be boiled down to nearly single line mongo-queries.
Here’s some examples:
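
Filling in the "do stuff", a rough pymongo sketch of that weekend/weekday comparison
(database, collection, and measure names are assumptions; the "stuff" here is just a mean):

    from pymongo import MongoClient

    facts = MongoClient()["warehouse"]["facts_daily"]

    def measure_values(cursor, key="Gallons Sold"):
        return [d["measures"][key] for d in cursor if key in d.get("measures", {})]

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    weekends = measure_values(facts.find({"day_of_week": {"$in": [5, 6]}}))
    weekdays = measure_values(facts.find({"day_of_week": {"$nin": [5, 6]}}))
    print(mean(weekends), mean(weekdays))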
What’s the production of installs north of a certain
                        latitude, with a certain class of panel, on Tuesdays?

                        For hours where the average delivered temperature
                        delta was above [x], what was our generation
                        efficiency?

                        Normalize by number of panels? (map/reduce)

                        Normalize by distance from equinox? (map/reduce)

                        ...etc.



Sunday, July 24, 2011
• Building a cube can be done in parallel
                        • Map/reduce is an easy way to think about
                          transforms.

                        • Not maximally efficient, but parallelizes on
                          commodity hardware.




Sunday, July 24, 2011

Some advantages. Regarding the third point: so what? It's not a webapp, so maximal
efficiency isn't critical.
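
One way to read "map/reduce is an easy way to think about transforms": downsampling daily
facts into monthly facts is a map (daily document -> monthly key plus partial measures)
followed by a reduce (merge the partials). Here's a client-side Python sketch of that shape --
not the server-side map/reduce itself, names are assumptions, and blindly summing is only
right for additive measures:

    from collections import defaultdict
    from pymongo import MongoClient

    db = MongoClient()["warehouse"]

    def map_daily(doc):
        # daily doc -> (monthly key, partial measures)
        key = {"install.name": doc["_id"]["install.name"],
               "timestamp": doc["_id"]["timestamp"].replace(day=1),
               "frequency": "monthly"}
        return key, doc["measures"]

    def reduce_measures(partials):
        # merge partials by summing each measure
        totals = defaultdict(float)
        for measures in partials:
            for name, value in measures.items():
                totals[name] += value
        return dict(totals)

    buckets = defaultdict(list)
    for doc in db.facts_daily.find():
        key, measures = map_daily(doc)
        buckets[tuple(sorted(key.items()))].append(measures)

    for key_items, partials in buckets.items():
        key = dict(key_items)
        db.facts_monthly.replace_one({"_id": key},
                                     {"_id": key, "measures": reduce_measures(partials)},
                                     upsert=True)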
mongoDB:
                        The future of enterprise
                         business intelligence.
                           (they just don’t know it yet)




Sunday, July 24, 2011

So, here's my thesis:
Document databases are far superior to relational databases for business-intelligence use
cases. Not only that, but MongoDB and some common sense let you replace multimillion-dollar
IBM-level enterprise solutions with open-source awesomeness. All this in a rapid, agile way.
Lastly...



Sunday, July 24, 2011
Mongo expands in an
                           organization.


Sunday, July 24, 2011

It's cool -- don't fight it. Once we started using it for our analytics, we realized there was a lot
of other schema-loose data that we could use it for -- like the definitions of the measures
themselves, or the details about an install, etc.
Final Thoughts



Sunday, July 24, 2011

OK, I want to close with a few jumping-off points.
“Business Intelligence”
                          no longer requires
                              megabucks


Sunday, July 24, 2011
Flexible tools means
                 business responsiveness
                      should be easy


Sunday, July 24, 2011
“Scaling” doesn’t just
                          mean depth-first.


Sunday, July 24, 2011

Businesses grow deep, in the sense of adding more users, but they also grow broad.
Questions?



Sunday, July 24, 2011
Epilogue
                        Quest for Logging Hardware




Sunday, July 24, 2011
This’ll be easy!
        This is such an obvious and well-
          explored problem space, I'm
           sure we'll be able to find a
        solution that matches our needs
           without breaking the bank!




Sunday, July 24, 2011
Shopping List!
           16 temperature sensors
                4 flow sensors
        maybe some miscellaneous ones
              internet backhaul
           no software/data lock in.




Sunday, July 24, 2011
Conventions
                  FTW!
        And since we've walked a couple
         convention floors and product
         catalogs from major industrial
         supply vendors, I'm sure it's in
               here somewhere!




Sunday, July 24, 2011
derp derp
                    “internet”?
        I’m sure there’s a reason why all
        of these loggers have to connect
                    via USB...
                         Pace Scientific XR5:
                              8 analog
                               3 pulse
                              ONE MB
                            no internet?
                               $500?!?



Sunday, July 24, 2011
yay windows?
            ...and require proprietary
              (windows!) software or
         subscription plans that route my
            data through their servers

                        (basically all of them!)



Sunday, July 24, 2011
Maybe the gov’t
          can help!
           Perhaps there’s some kind of
          standard that the governments
              require for solar thermal
             monitoring systems to be
            eligible for incentives or tax
                        credits.



Sunday, July 24, 2011
Vive la France!
              An obscure standard by the
                   Organisation
                Internationale de
                Métrologie Légale
                   appears! Neat!




Sunday, July 24, 2011
A “Certified”
                  Logger
                 two temperature sensors
                         one pulse
                  no increase in accuracy
                  no data backhaul -- at all
                             ...
                     what’s the price?



Sunday, July 24, 2011
$1,000




Sunday, July 24, 2011
$1,000




Sunday, July 24, 2011
Hmm...
            I can solder, and arduinos are
                     pretty cheap




Sunday, July 24, 2011
It’s on!




Sunday, July 24, 2011
arduino + netbook!
Sunday, July 24, 2011
TL;DR:
                        Existing loggers
                          are terrible.


Sunday, July 24, 2011

Also, existing industries aren’t really ready for rapid prototyping and its destructive effects.
•   http://www.flickr.com/photos/rknight/4358119571/

                        •   http://4.bp.blogspot.com/_8vNzwxlohg0/
                            TJoUWqsF4LI/AAAAAAAABMg/QaUiKwCEZn8/
                            s320/turtles-all-the-way-down.jpg

                        •   http://www.flickr.com/photos/rhk313/3801302914/

                        •   http://www.flickr.com/photos/benny_lin/481411728/

                        •   http://spagobi.blogspot.com/
                            2010_08_01_archive.html

                        •   http://community.qlikview.com/forums/t/37106.aspx


Sunday, July 24, 2011


  • 2. ajackson @ skylineinnovations.com Sunday, July 24, 2011
  • 3. a tale of rapid prototyping, data warehousing, solar power, an architecture designed for data analysis at “scale” ...and arduinos! Sunday, July 24, 2011 So here’s what i’d like to talk about: Who we are, how we got started, and most importantly, how we’ve been able to use MongoDB to help us. We’re not a traditional startup -- and while i know that this is not a “startups” talk, but a Mongo one, i’d like to show how Mongo’s flexible nature really helped us as a business, and how Mongo specifically has been a good choice for us as we build some of our tools. Here are some themes:
  • 4. Scaling Sunday, July 24, 2011 Mongo has come to have a pretty strong association with the word “scaling.” Scaling is a word we throw around a lot, and it almost always means “software performance, as inputs grow by orders of magnitude.” But scaling also means performance as the variety of inputs increases. I’d argue that it’s scaling to go from 10 users to 10,000, and it’s also scaling to go from ten ‘kinds’ of input to a hundred. There’s another word for this.
  • 5. Scaling Flexibility Sunday, July 24, 2011 Particularly when you scale in the real world, you start to find that it’s complicated and messy and entropic in ways that software isn’t always equipped to handle. So for us, when we say “mongo helps us scale”, we don’t necessarily mean scaling to petabytes of data. We’ll come back to them as well.
  • 6. Business-first development Sunday, July 24, 2011 This generally means flexibile, lightweight processes. Things that become fixed & unchangable quickly become obsolete and sad :’(
  • 7. When Does “Context” become “Yak Shaving”? Sunday, July 24, 2011 When i read new things or hear about new stuff, I’m always trying to put it in context. So, sometimes i put too much context in my talks :( To avoid it, I sometimes go a little too fast over the context that *is* important. So please stop me to ask questions! Also, the problem domain here is a little different than what we might be used to, so bear with me as we go into plumbing & construction.
  • 10. Project Development + Technology Sunday, July 24, 2011
  • 12. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011
  • 13. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011 We’ll pay to put stuff on your roof, and we’ll keep it at its maximally awesome.
  • 14. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011 Right now, this means solar thermal, more efficient lighting retrofits, and maybe HVAC.
  • 15. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011 So, here’s the interesting part. Since we put stuff on your roof for free, we need to get that money back. What we do is, we’ll charge you for the energy that it saved you, but, here’s the twist. Other companies have done similar things, where they say “we’ll pay for a system/ retrofit/whatever, and you’ll agree to pay us an arbitrary number, and we say you’ll get savings, but you won’t actually be able to tell, really.” That always seemed sketchy to us. So, we actually measure the performance of this stuff, collect the data, and guarantee that you save money.
  • 18. • Why solar thermal? • Why hasn’t anyone else done this before? • Pivots? Iterations? • What’s the market size? • Funding? Capital structures? • Wait, how do you guys make money? Sunday, July 24, 2011 Oh, right, this isn’t a startup talk. But feel free to ask me these later!
  • 19. Solar Thermal in Five Minutes ( mongo next, i promise! ) Sunday, July 24, 2011
  • 20. Municipal => Roof => Tank => Customer Sunday, July 24, 2011
  • 21. Relevant Data to Track Sunday, July 24, 2011
  • 22. Temperatures (about a dozen) Sunday, July 24, 2011
  • 23. Flow Rates (at least two) Sunday, July 24, 2011
  • 24. Parallel data streams (hopefully many) Sunday, July 24, 2011 e.g., weather data, insolation data. It’d be nice if we didn’t have to collect it all ourselves.
  • 25. how much data? 20 data points @ 4 bytes 1 minute intervals at 1000 projects (I wish!) for 10 years 80 * 60 * 24 * 365 * 10 * 1000 = 400 GB? ...not much, really, “in the raw” Sunday, July 24, 2011 unfortunately, we can’t really store it with maximal efficiency, because of things like timestamps, metadata, etc., but still.
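A quick back-of-the-envelope check of that figure, in plain Python. Nothing here is project-specific; the 20-points-at-4-bytes and one-minute-interval numbers come straight from the slide above.

    bytes_per_sample = 20 * 4            # 20 data points @ 4 bytes each = 80 bytes
    samples = 60 * 24 * 365 * 10         # 1-minute intervals for 10 years
    projects = 1000

    raw_bytes = bytes_per_sample * samples * projects
    print(raw_bytes / 1e9)               # ~420 GB of raw samples, before timestamps,
                                         # metadata, and BSON overhead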
  • 26. Sunday, July 24, 2011 I hope this provides enough context on the business problems we’re trying to solve. It looks like we’ll need a data pipeline, and we’ll need one fast. We’ve got data that we’ll need to use to build, monitor, and monetize these energy technologies. Having worked at other smart grid companies before, I’ve seen some good data pipelines and some bad data pipelines. I’d like to build a good one. The less stuff i have to build, the better.
  • 27. Sunday, July 24, 2011 As i do some research, i find that a lot of these data pipelines have a few well-defined areas of responsibility.
  • 28. Acquisition, Storage, Search, Retrieval, Analytics. Sunday, July 24, 2011 These should be self-explanatory. What’s interesting is that not only are most of the end-users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before.
  • 29. Acquisition, Storage, Search, Retrieval (designed for these); Analytics (<= users are here). Sunday, July 24, 2011 These should be self-explanatory. What’s interesting is that not only are most of the end-users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before.
  • 30. Acquisition, Storage, Search, Retrieval, Analytics. Sunday, July 24, 2011 These should be self-explanatory. What’s interesting is that not only are most of the end-users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before. It’s important to remember that, while you can’t get good analytics without the other stuff, the analytics is where almost all of the value is! Search & retrieval are approaching “solved.”
  • 31. Acquisition, Storage, Search, Retrieval (designed for these); Analytics (<= users are here; business value is here!). Sunday, July 24, 2011 These should be self-explanatory. What’s interesting is that not only are most of the end-users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before. It’s important to remember that, while you can’t get good analytics without the other stuff, the analytics is where almost all of the value is! Search & retrieval are approaching “solved.”
  • 32. Sunday, July 24, 2011 so, here’s how i started thinking about things. This is a design diagram from the early days of the company.
  • 33. Sunday, July 24, 2011 easy, python, no problem. There are some interesting topics here, but they’re not mongoDB related. I was pretty sure i knew how to build this part, and i was pretty sure i knew what the data would look like.
  • 34. Sunday, July 24, 2011 This part was also easy -- e-mail reports, csvs, maybe some fancy graphs, possibly some light webapps for internal use. These would be dictated by business goals first, but the technological questions were straightforward.
  • 35. Sunday, July 24, 2011 Here was the real question. What would be some use cases of an analyst having a good experience look like? What would they expect the tools to do?
  • 36. Now we can think about what the data looks like Sunday, July 24, 2011 So, let’s think about what this data looks like, how it’s structured and what it is. Then, after that, we can look at what the best ways to organize it for future usefulness.
  • 37. Time series? Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458 Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462 Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462 Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468 Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471 Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472 Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472 Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477 Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581 Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614 Sunday, July 24, 2011
  • 38. TIME SERIES DATA Sunday, July 24, 2011 So what is time series data?
  • 39. Features, Over Time Sunday, July 24, 2011 multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout
  • 40. Features, Over Time Thing (Feature vector, v) Time (t) Sunday, July 24, 2011 multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout
  • 41. Features, Over Time Thing (Feature vector, v) Time (t) Sunday, July 24, 2011 multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout
  • 42. Sunday, July 24, 2011 A couple of ideas: sampling rates; “regularity”; “completeness”; analog vs. digital; instantaneous vs. cumulative (tradeoffs)
  • 43. (chart: picking out a known time range, tn to tn+1) Sunday, July 24, 2011 Finding known interesting ranges (definitely the most common)
  • 44. (chart: picking out a known time range, tn to tn+1) Sunday, July 24, 2011 Finding known interesting ranges (definitely the most common)
  • 45. (chart: a feature over time, with interesting ranges t, t’, etc.) Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
  • 46. (chart: feature y over time, with interesting ranges t, t’, etc.) Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
  • 47. (chart: thresholds y, y’ on a feature, picking out ranges t, t’, etc.) Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
  • 48. (chart: thresholds y, y’ on a feature, picking out ranges t, t’, etc.) Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
  • 49. (more complicated stuff can be thought of as transformations...) Sunday, July 24, 2011 e.g., frequency analysis, wavelets, whatever.
  • 50. Sunday, July 24, 2011 At this point, I go off and do a bunch of research on existing technologies. I really hate reinventing the wheel, and we really don’t have the manpower.
  • 51. Time series specific tools Scientific tools & libraries Traditional data-warehousing approaches Sunday, July 24, 2011 So, these were some of the options i looked at. I want to quickly point out why i eliminated the first two classes of tools.
  • 52. Time series specific tools RRDtool -- Round Robin Database Sunday, July 24, 2011 There’s really surprisingly few of these. One of the best is the RRDtool. It’s pretty sweet, and i highly recommend it. Unfortunately, it’s really designed for applications that are highly regular, and that are already pretty digital, for instance, sampling latencies, or temperatures in a datacenter. It’s not really good for unreliable sensors, nor is it really designed for long term persistence. It also has a really high lock-in, with legacy data formats, etc. Don’t get me wrong, it’s totally rad, but i didn’t think it was for us.
  • 53. Scientific tools & libraries e.g., PyTables Sunday, July 24, 2011 Pretty cool, but not many of these were mature & ready for primetime. Some that were, like PyTables, didn’t really match our business use-case.
  • 54. Traditional data-warehousing approaches Sunday, July 24, 2011 So, these were some of the options i looked at. I want to quickly point out why i eliminated the first two classes of tools. [...]. That leaves us with the traditional approaches. This represents a pretty well established field, but very few of the tools are free, lightweight, and mature.
  • 55. Enterprise buzzwords (Just google for OLAP) Sunday, July 24, 2011 But the biggest idea i learned is that most data warehousing revolves around the idea of a “fact table”. They call it a “multidimensional OLAP cube”, but basically it exists as a totally denormalized SQL table.
  • 56. “Measures” and their “Dimensions” Sunday, July 24, 2011 (or facts)
  • 61. (from “How to Build OLAP Application Using Mondrian + XMLA + SpagoBI”) Sunday, July 24, 2011 to which the only acceptable response is:
  • 62. Sunday, July 24, 2011 ha! Yeah right.
  • 63. Time series are not relational! Sunday, July 24, 2011 even extracted features are not inherently relational! Also: you don’t know what you’re looking for, you don’t know when you’ll find it, you won’t know when you’ll have to start looking for something different. Why would you lock yourself into a schema?
  • 64. We don’t know what we’ll want to know. Sunday, July 24, 2011 We won’t know what we want to know. Not only are we warehousing time-series of multidimensional feature vectors, we don’t even know the dimensions we’ll be interested in yet!
  • 65. natural fit for documents Sunday, July 24, 2011 This makes a schema-less database a natural fit for these sorts of things. Think about all the alter-table calls i’ve avoided...
  • 66. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, "day_of_year" : 218, "day_of_week" : 4, "month" : 8, "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, "year" : 2010, "day" : 6 Sunday, July 24, 2011 isn’t this better?
  • 67. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, “measures” "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, "day_of_year" : 218, "day_of_week" : 4, "month" : 8, “dimensions” "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, ...right? "year" : 2010, "day" : 6 Sunday, July 24, 2011 measures & dimensions. This would be a nice, clean division, except that it isn’t. Frequently we’ll look for measures by other measures -- i.e., each measure serves as a dimension.
  • 68. ...actually, not a good model. Sunday, July 24, 2011 The line gets pretty blurry, in practice. Multi-dimensional vectors mean every measure provides another dimension. Anyway!
  • 69. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, "day_of_year" : 218, "day_of_week" : 4, "month" : 8, "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, "year" : 2010, "day" : 6 Sunday, July 24, 2011 How do we build these quickly & efficiently?
  • 70. the goal: good numbers! Sunday, July 24, 2011 Remember, the goal here is to make it easy for analysts to get comparable numbers, so when i ask for the delivered energy for one system, compared to the delivered energy from another, i can just get the time-series data, without having to worry about if sensors changed, when the network was out, when a logger was replaced with another one, etc.
  • 71. Sunday, July 24, 2011 So, the OLTP layer serving as our inputs essentially serves up timestamped data as CSV series. It doesn’t really provide a lot of intelligence, and is basically the raw numbers
  • 72. from rows to columns Sunday, July 24, 2011 So, most of what our pipeline does is turn things from rows to columns, in a flexible, useful way. I’m gonna walk through that process, quickly.
  • 73. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { Let’s just look at one "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, "day_of_year" : 218, "day_of_week" : 4, "month" : 8, "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, "year" : 2010, "day" : 6 Sunday, July 24, 2011
  • 74. row-major data Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458 Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462 Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462 Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468 Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471 Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472 Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472 Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477 Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581 Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614 Sunday, July 24, 2011
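As a rough illustration of the rows-to-columns step, here is a minimal Python sketch (not the actual pipeline code) that turns a CSV laid out like the one above into per-sensor series. The file name is hypothetical, and a real parser would also have to handle duplicate headers (note the two “array in/out” columns), gaps, and bad samples.

    import csv
    from collections import defaultdict
    from datetime import datetime

    def csv_to_columns(path):
        """Turn row-major logger output into ({sensor name: [samples]}, [timestamps])."""
        timestamps = []
        columns = defaultdict(list)
        with open(path) as f:
            for row in csv.DictReader(f):
                timestamps.append(datetime.strptime(row.pop("Time"), "%a %b %d %H:%M:%S %Y"))
                for name, value in row.items():
                    columns[name].append(float(value))
        return dict(columns), timestamps

    # columns, timestamps = csv_to_columns("agni-3501-2010-03-09.csv")   # hypothetical file
    # columns["array supply temperature"] is now a single series, ready to feed a measure.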
  • 75. “Functional”

    class Mass(BasicMeasure):
        def __init__(self, density, volume):
            ...
            self._result_func = functools.partial(
                lambda data, density, volume: density * volume(data),
                density=density, volume=volume)

        def __call__(self, data):
            return self._result_func(data)

Sunday, July 24, 2011 quasi-functional classes that describe how to calculate a value from data.
  • 76. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, A formula: E = ∆t × F #pseudocode class LoopEnergy(BasicMeasure): def __init__(self, heat_cap, delta, mass): ... def result_func(data): return self.delta(data) * self.mass(data) * self.heat_cap self._result_func = result_func def __call__(self, data): return self._result_func(data) Sunday, July 24, 2011
  • 77. Creating a Cube
    For each install, for each chunk of data:
        apply all known formulas to get values
        make some convenience keys (e.g., day_of_year)
        stuff it in mongo
    Then, map/reduce to whatever dimensionalities you’re interested in: e.g., downsampling.
Sunday, July 24, 2011 Here’s some pseudocode for how to make a cube of multidimensional data. So, what’s the payoff?
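One possible rendering of that pseudocode with pymongo, written against a present-day driver (in 2011 this would have been Connection() and collection.save()). The iter_chunks() helper, the MEASURES registry, and the database name are assumptions; the _id shape and convenience keys follow the example documents above. One caveat: the deck keys documents on a dotted field name, "install.name", which older servers and drivers reject on insert, so a real deployment might flatten that key.

    from datetime import datetime
    from pymongo import MongoClient

    db = MongoClient()["warehouse"]                      # database name is an assumption

    def build_daily_facts(iter_chunks, MEASURES):
        """iter_chunks() yields (install_doc, date, rows); MEASURES maps name -> callable."""
        for install, day, rows in iter_chunks():
            ts = datetime(day.year, day.month, day.day)
            doc = {
                "_id": {"install.name": install["name"],
                        "timestamp": ts,
                        "frequency": "daily"},
                # apply all known formulas to this chunk of data
                "measures": {name: fn(rows) for name, fn in MEASURES.items()},
                # convenience keys, so analysts can slice without doing date math
                "day_of_year": ts.timetuple().tm_yday,
                "day_of_week": ts.weekday(),             # Monday=0 ... Sunday=6
                "week_of_year": ts.isocalendar()[1],
                "month": ts.month, "year": ts.year, "day": ts.day,
                "install": install,
            }
            # keyed on the compound _id, so re-processing a chunk is idempotent
            db.facts_daily.replace_one({"_id": doc["_id"]}, doc, upsert=True)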
  • 78. How much water did [x] use, monthly? > db.facts_monthly.find({"install.name": [foo]}, {"measures.Gallons Sold": 1}).sort({"_id": 1}) Sunday, July 24, 2011 Complicated analytical queries can be boiled down to nearly single line mongo-queries. Here’s some examples:
  • 79. What were our highest production days? > db.facts_daily.find({}, {"measures.Energy Sold": 1}).sort({"measures.Energy Sold": -1}) Sunday, July 24, 2011 Complicated analytical queries can be boiled down to nearly single line mongo-queries. Here’s some examples:
  • 80. How does the distribution of [x] on the weekend compare to its distribution on the weekdays? > weekends = db.facts_daily.find({"day_of_week": {$in: [5,6]}}) > weekdays = db.facts_daily.find({"day_of_week": {$nin: [5,6]}}) > do stuff Sunday, July 24, 2011 Complicated analytical queries can be boiled down to nearly single line mongo-queries. Here’s some examples:
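A hedged sketch of the “do stuff” step from Python: pull both sets and compare one measure’s distribution. The database name and the choice of “Gallons Sold” are assumptions; a real analysis would likely hand the two lists to scipy or a plotting library rather than just comparing means.

    from pymongo import MongoClient

    db = MongoClient()["warehouse"]                      # database name is an assumption
    projection = {"measures.Gallons Sold": 1}

    def gallons(cursor):
        return [d["measures"]["Gallons Sold"] for d in cursor
                if "Gallons Sold" in d.get("measures", {})]

    weekends = gallons(db.facts_daily.find({"day_of_week": {"$in": [5, 6]}}, projection))
    weekdays = gallons(db.facts_daily.find({"day_of_week": {"$nin": [5, 6]}}, projection))

    def mean(xs):
        return sum(xs) / float(len(xs)) if xs else 0.0

    print("weekend mean gallons sold:", mean(weekends))
    print("weekday mean gallons sold:", mean(weekdays))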
  • 81. What’s the production of installs north of a certain latitude, with a certain class of panel, on Tuesdays? For hours where the average delivered temperature delta was above [x], what was our generation efficiency? Normalize by number of panels? (map/reduce) Normalize by distance from equinox? (map/reduce) ...etc. Sunday, July 24, 2011
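As one worked example from that list, here is a hedged sketch of “normalize by number of panels”: average energy sold per panel per day, grouped by install. It uses Collection.map_reduce, which the pymongo of that era provided (it was removed in pymongo 4, where an aggregation pipeline with $group would do the same job); field names follow the example documents above.

    from bson.code import Code
    from pymongo import MongoClient

    db = MongoClient()["warehouse"]                      # database name is an assumption

    mapper = Code("""
    function () {
        if (this.measures["Energy Sold"] != null && this.install.panels > 0) {
            emit(this.install.name,
                 { total: this.measures["Energy Sold"] / this.install.panels, days: 1 });
        }
    }""")

    reducer = Code("""
    function (key, values) {
        var out = { total: 0, days: 0 };
        values.forEach(function (v) { out.total += v.total; out.days += v.days; });
        return out;
    }""")

    per_panel = db.facts_daily.map_reduce(mapper, reducer, "per_panel_production")
    for doc in per_panel.find():
        print(doc["_id"], doc["value"]["total"] / doc["value"]["days"])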
  • 82. • Building a cube can be done in parallel • Map/reduce is an easy way to think about transforms. • Not maximally efficient, but parallelizes on commodity hardware. Sunday, July 24, 2011 Some advantages. re #3 -- so what? It’s not a webapp.
  • 83. mongoDB: The future of enterprise business intelligence. (they just don’t know it yet) Sunday, July 24, 2011 So, here’s my thesis: document-databases are far superior to relational databases for business intelligence cases. Not only that, but mongoDB and some common sense lets you replace multimillion dollar IBM-level enterprise solutions with open-source awesomeness. All this in a rapid, agile way.
  • 85. Mongo expands in an organization. Sunday, July 24, 2011 it’s cool, don’t fight it. Once we started using it for our analytics, we realized there was a lot of other schema-loose data that we could use it for -- like the definitions of the measures themselves, or the details about an install, etc., etc.
  • 86. Final Thoughts Sunday, July 24, 2011 Ok, i want to close up with a few jumping-off points.
  • 87. “Business Intelligence” no longer requires megabucks Sunday, July 24, 2011
  • 88. Flexible tools means business responsiveness should be easy Sunday, July 24, 2011
  • 89. “Scaling” doesn’t just mean depth-first. Sunday, July 24, 2011 businesses grow deep, in the sense of adding more users, but they also grow broad.
  • 91. Epilogue Quest for Logging Hardware Sunday, July 24, 2011
  • 92. This’ll be easy! This is such an obvious and well explored problem space, i’m sure we’ll be able to find a solution that matches our needs without breaking the bank! Sunday, July 24, 2011
  • 93. Shopping List! 16 temperature sensors 4 flow sensors maybe some miscellaneous ones internet backhaul no software/data lock in. Sunday, July 24, 2011
  • 94. Conventions FTW! And since we’ve walked a couple convention floors and product catalogs from major industrial supply vendors, i’m sure it’s in here somewhere! Sunday, July 24, 2011
  • 95. derp derp “internet”? I’m sure there’s a reason why all of these loggers have to connect via USB... Pace Scientific XR5: 8 analog 3 pulse ONE MB no internet? $500?!? Sunday, July 24, 2011
  • 96. yay windows? ...and require proprietary (windows!) software or subscription plans that route my data through their servers (basically all of them!) Sunday, July 24, 2011
  • 97. Maybe the gov’t can help! Perhaps there’s some kind of standard that the governments require for solar thermal monitoring systems to be eligible for incentives or tax credits. Sunday, July 24, 2011
  • 98. Vive la France! An obscure standard by the Organisation Internationale de Métrologie Légale appears! Neat! Sunday, July 24, 2011
  • 99. A “Certified” Logger two temperature sensors one pulse no increase in accuracy no data backhaul -- at all ... what’s the price? Sunday, July 24, 2011
  • 102. Hmm... I can solder, and arduinos are pretty cheap Sunday, July 24, 2011
  • 104. arduino + netbook! Sunday, July 24, 2011
  • 105. TL; DR: Existing loggers are terrible. Sunday, July 24, 2011 Also, existing industries aren’t really ready for rapid prototyping and its destructive effects.
  • 106. http://www.flickr.com/photos/rknight/4358119571/ • http://4.bp.blogspot.com/_8vNzwxlohg0/TJoUWqsF4LI/AAAAAAAABMg/QaUiKwCEZn8/s320/turtles-all-the-way-down.jpg • http://www.flickr.com/photos/rhk313/3801302914/ • http://www.flickr.com/photos/benny_lin/481411728/ • http://spagobi.blogspot.com/2010_08_01_archive.html • http://community.qlikview.com/forums/t/37106.aspx Sunday, July 24, 2011