Data Tactics Analytics Brown Bag (Aug 22, 2013)
1. DT Brown Bag: A Primer in Analytics
WELCOME!
R² = 500; p < Marty's 1-mile time
asymptotically approaching perfect
Thursday, August 22, 13
2. Outline
• EAT, Guten Appetit, Bon appétit, Buen apetito, Buon appetito!
• Words from the VP
• Why this brown-bag?
• Analytics Services:
• Team Introduction; About YOU!
• Why Analytics!?
• Philosophy...
• Case Studies:
• Case Study (Nathan D.)
• Localview (Marty A.)
• Case Study (me)
• Core Values: Analytical Insights
• On the horizon...
3. Why this brown bag??
Learning [close] at a pace similar to the pace at which we learn.
Learning and Educating from/to PMs, SWE, and OPs.
PM: Provide insights from RFIs/RFPs.
PM: Atmospherics from our customers.
SWE: Accessing data spaces.
SWE: Integrating algorithms.
OP: How do you best consume the outputs of models?
OP: What models are best to present to OPs?
PM: Program Managers, SWE: Software Engineers, OP: Operators
5. Data Tactics Analytics Practice
The Team:
(Nathan D., Shrayes R., David P., Adam VE., Andrew T., Geoffrey B., Rich H.)
Graduates from top universities...
Degrees include:
mathematics, computer science, aeronautical engineering,
astrophysics, electrical engineering, mechanical engineering, statistics,
social science(s).
Base competencies (horizontals): Clustering, Association Rules,
Regression, Naive Bayesian Classifier, Decision Trees, Time-Series,
Text Analysis.
Going beyond the base (verticals)...
6. Data Tactics Analytics Practice
ABOUT YOU:
28 confirmed, 18 WebEx, 14 tentative (n = 60, representing more than 25% of the company)
21 confirmed within the first 60 minutes...
Monsee Wood & Steve Moccio 1st
Charles Fuller & Lenesto Page Last
Chris Zilligen: 3,120 (Longest resume)
Catherine Schymanski: 284 (shortest resume)
Linguistic Standard:
Jack Gustafson (FK: -126)
Shrayes Ramesh (FK: -38)
...analytics team below the company average!! :)
7. Horizontals & Verticals
Clustering || Regression || Decision Trees || Text Analysis
Association Rules || Naive Bayesian Classifier || Time Series Analysis
Verticals: econometrics, spatial econometrics, graph theory algorithms, astrophysical time-series analysis, path planning algorithms, Bayesian statistics, constrained optimizations, numerical integration techniques, PCA, GLM, hierarchical models, IRT, DLISA, latent class analysis, structural equation modeling, mixture models, SVM, maxent, CART, naive Bayes classifier, ICA
8. Data Tactics Analytics Practice
[Figure: data science Venn diagram, after [2]. Circles: Programming & Scripting Skills; Mathematics & Statistics; Domain Expertise. Regions: ML; Traditional Research; Danger Zone! (~statisticulation [1]); DT Analytics.]
[1] "Statisticulation": Darrell Huff, How to Lie with Statistics
[2] http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
[3] https://portal.data-tactics-corp.com/sites/analytics/Wiki/AnalyticsFAQ.aspx
9. Why Analytics [Business]???
Why are analytics important?
(Business, Analytics, Practical)
"We need to stop reinventing the cloud
and start using it!"
(Dave Boyd)
10. Why are analytics important?
(Business, Analytics, Practical)
Analytics:
No Free Lunch (NFL) theorems: no algorithm performs better
than any other when their performance is averaged uniformly
over all possible problems of a particular type. Algorithms must
therefore be designed for a particular domain or style of problem;
there is no such thing as a general-purpose algorithm.
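A toy illustration of the NFL statement, as a sketch rather than a proof. It assumes a tiny problem class (every possible 0/1 labeling of eight inputs) and two made-up learners; all names are hypothetical. Averaged uniformly over all labelings, both learners score the same on inputs they did not see.

```python
from itertools import product

# Toy problem class: every possible 0/1 labeling of the 8 three-bit inputs.
INPUTS = list(product([0, 1], repeat=3))
TRAIN, TEST = INPUTS[:5], INPUTS[5:]  # the learner sees 5 inputs, is scored on 3

def majority_learner(train_labels):
    """Predict the most common training label everywhere (ties go to 1)."""
    guess = int(sum(train_labels) * 2 >= len(train_labels))
    return lambda x: guess

def parity_learner(train_labels):
    """Ignore the training data and predict the parity of the input bits."""
    return lambda x: sum(x) % 2

def average_offsample_accuracy(learner):
    """Average accuracy on unseen inputs over all 256 possible target functions."""
    labelings = list(product([0, 1], repeat=len(INPUTS)))
    total = 0.0
    for labels in labelings:
        target = dict(zip(INPUTS, labels))
        predict = learner([target[x] for x in TRAIN])
        total += sum(predict(x) == target[x] for x in TEST) / len(TEST)
    return total / len(labelings)

acc_majority = average_offsample_accuracy(majority_learner)
acc_parity = average_offsample_accuracy(parity_learner)
print(acc_majority, acc_parity)
```

Neither learner wins on average; a learner only pulls ahead once the set of problems is narrowed to a domain whose structure it matches.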
Why Analytics [Analytics]???
11. Marty doesn't scale - none of us do.
Data Scales
Web Scales
Academic Publications Scale
IC Scales
[Figure: N plotted against t]
Why Analytics [Practical]???
12. Why Analytics [Practical]???
Why are analytics important?
(Business, Analytics, Practical)
"...the alternative to good statistics is not 'no
statistics,' it's bad statistics. People who argue
against statistical reasoning often end up backing up
their arguments with whatever numbers they have at
their command, over- or under-adjusting in their
eagerness to avoid anything systematic." (Bill James)
13. "companies that have massive amounts of data
without massive amounts of clue are going to be
displaced by startups that have less data but more
clue" (Tim O'Reilly)
Philosophy:
14. Philosophy:
We are NOT "Data Agnostic"
...this should represent an early warning
system about our culture. The IT notion
of data is dead.
16. "Analytics in Perspective" reflects how people arrive at
decisions.
GOOD: Induction, Abduction, Circumscription, Counterfactuals.
BAD: Deduction, Speculation, Justification, Groupthink
Analytics in Perspective...
18. Background: The Strait of Hormuz
Importance:
• Oil
• Embargo
• Smuggling
19. How to Catch Smugglers
In order to stop smugglers, we must identify:
1. Which boats are undertaking illicit activities
2. Where illicit activities are taking place
3. Points of departure/arrival of suspicious ships
20. A Difficult Task: Too Much Data
AIS (transponder) provides ship-level data:
• Ship location (lat-long)
• Ship speed
• Ship bearing
• Ship "purpose"
• Time stamp
About 0.5M pings from 1,300 boats between
March 2012 and January 2013.
22. A Difficult Task: Too Little Data
Individual pings or tracks are not useful: there is no point of
comparison.
Similarly, short-duration plots are too thin to provide
analytic leverage.
23. A Difficult Task: Too Little Data
A single boat:
24. A Difficult Task: Too Little Data
A single day:
26. Solution: Analytics
Use a statistical model to discover patterns in
the data...
...then identify observations (boat-times) that do
not fit those patterns.
Goal: Identify boats, places, and times that exhibit
or host discrepant behavior.
27. Characteristics of a Good Model
A good model for this data should:
• Leverage all of the available data
• Take advantage of local information (not global patterns)
• Be able to accommodate a variety of patterns (shipping, fishing, etc.)
• Be able to identify ships that are only occasionally deviant
• Identify place-times where deviant activity occurs
• Be estimable with reasonable computational resources
28. The Model
A local, unsupervised-as-supervised learning,
bagged, probability model.
A LUBaP model?
29. The Model
A local, unsupervised-as-supervised learning,
bagged, probability model.
We want to compare apples to apples; that is,
treat boats that are nearby in space and time the same, and
don't compare them to far-flung ones.
Assign each observation to a geographically
constrained grid square.
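A minimal sketch of the grid assignment step, assuming square cells measured in degrees. The cell size, the origin, and the function name are illustrative assumptions; the deck does not state them.

```python
import math

def grid_square(lat, lon, cell_deg=0.25, origin=(0.0, 0.0)):
    """Return the (row, col) index of the grid square that contains a ping.

    cell_deg and origin are hypothetical parameters chosen for illustration.
    """
    row = math.floor((lat - origin[0]) / cell_deg)
    col = math.floor((lon - origin[1]) / cell_deg)
    return row, col

# Three example positions (lat, lon) in the region of interest.
pings = [(24.546, 55.005), (24.716, 55.108), (25.128, 54.807)]
squares = [grid_square(lat, lon) for lat, lon in pings]
print(squares)
```

The first two pings fall in the same square and would be modeled together; the third lands in a different square and is compared only with its own neighbors.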
30. The Model
A local, unsupervised-as-supervised learning,
bagged, probability model.
31. The Model
A local, unsupervised-as-supervised learning,
bagged, probability model.
Let m denote the number of observations in a particular grid
square. Then, in each square, add m additional observations
with the following characteristics:
• position, drawn from a bivariate uniform distribution
• speed, drawn with replacement from the empirical distribution
• time of observation, drawn from a uniform distribution
Now the task is no longer unsupervised, but supervised.
-> Model the probability of a boat being a "real" boat.
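The recipe above can be sketched as follows for a single grid square. The record layout (dicts with lat, lon, speed, time), the bounds, and the function name are assumptions made for illustration.

```python
import random

def add_synthetic(real_obs, bounds, time_range, rng):
    """Label real observations 1 and append an equal number of synthetic ones (0).

    bounds is ((lat_min, lat_max), (lon_min, lon_max)) for the square.
    """
    (lat_min, lat_max), (lon_min, lon_max) = bounds
    speeds = [o["speed"] for o in real_obs]
    labeled = [dict(o, real=1) for o in real_obs]
    for _ in range(len(real_obs)):
        labeled.append({
            "lat": rng.uniform(lat_min, lat_max),   # bivariate uniform position
            "lon": rng.uniform(lon_min, lon_max),
            "speed": rng.choice(speeds),            # resampled with replacement
            "time": rng.uniform(*time_range),       # uniform time of observation
            "real": 0,
        })
    return labeled

real = [
    {"lat": 24.55, "lon": 55.01, "speed": 9.8, "time": 10.0},
    {"lat": 24.60, "lon": 55.10, "speed": 12.4, "time": 55.0},
    {"lat": 24.70, "lon": 55.20, "speed": 4.2, "time": 80.0},
]
labeled = add_synthetic(real, ((24.5, 24.75), (55.0, 55.25)), (0.0, 100.0),
                        random.Random(42))
print(len(labeled), sum(o["real"] for o in labeled))
```

With the 0/1 label attached, any binary classifier can be fit to the square.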
34. The Model
A local, unsupervised-as-supervised learning,
bagged, probability model.
• Turned outlier detection, a poorly structured problem, into
modeling a binary target, a very well-understood problem
• Now, simply model the probability that each boat is "real"
• Apply logistic regression to each grid square
• Allow the flexibility (order) of the model fit (splines,
interactions) to depend on the data density in each square
(more data, richer model).
• logit(Pr("real")) = f(speed, location, time)
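A minimal sketch of one local model, under stated assumptions: a single within-square position coordinate is the only predictor, a squared term stands in for the "splines, interactions" flexibility, and the fit uses plain gradient ascent instead of a statistics library. The data values and all names are made up for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def features(x):
    """Rescaled position in the square plus a squared term (the richer model)."""
    u = (x - 0.5) * 4.0
    return [u, u * u]

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights (intercept first) by batch gradient ascent on the log-likelihood."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

def predict_real(w, x):
    """Predicted probability that an observation with features x is 'real'."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Real boats hug a lane near x = 0.5; synthetic ones are spread across the square.
real_x = [0.45, 0.50, 0.55, 0.48, 0.52, 0.50]
fake_x = [0.05, 0.25, 0.40, 0.60, 0.80, 0.95]
X = [features(x) for x in real_x + fake_x]
y = [1] * len(real_x) + [0] * len(fake_x)
w = fit_logistic(X, y)
p_lane = predict_real(w, features(0.50))
p_edge = predict_real(w, features(0.95))
print(round(p_lane, 2), round(p_edge, 2))
```

An observation in the lane gets a higher probability of being "real" than one at the edge of the square, which is the signal used to flag discrepant behavior.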
35. The Model
A local, unsupervised-as-supervised learning,
bagged, probability model.
Problem: Predictions may be arbitrary due to
random assignment and grid coarseness.
36. The Model
A local, unsupervised-as-supervised learning,
bagged, probability model.
Problem: Predictions may be arbitrary due to
random assignment and grid coarseness.
Solution:
1. Create multiple grids with different positions.
2. Re-run the local model in each square, for
each different grid.
3. Aggregate the predicted probabilities for each
observation across grids by averaging.
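The three steps above can be sketched as below. The evenly spaced diagonal offsets and the example probabilities are assumptions for illustration; the deck does not say how the grids were shifted.

```python
def shifted_origins(cell_deg, n_grids):
    """Hypothetical choice: shift each grid diagonally by a fraction of a cell."""
    step = cell_deg / n_grids
    return [(i * step, i * step) for i in range(n_grids)]

def average_over_grids(per_grid_probs):
    """Average each observation's predicted probability across all grids.

    per_grid_probs holds one {observation_id: probability} dict per grid.
    """
    totals, counts = {}, {}
    for probs in per_grid_probs:
        for obs_id, p in probs.items():
            totals[obs_id] = totals.get(obs_id, 0.0) + p
            counts[obs_id] = counts.get(obs_id, 0) + 1
    return {obs_id: totals[obs_id] / counts[obs_id] for obs_id in totals}

origins = shifted_origins(0.25, 6)
# Made-up predictions for two observations from three of the grids.
bagged = average_over_grids([
    {"a": 0.2, "b": 0.9},
    {"a": 0.4, "b": 0.7},
    {"a": 0.6, "b": 0.8},
])
print(origins[:2], bagged)
```

Averaging smooths out predictions that depend on where a grid line happened to fall.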
37. Computational Efficiency
Estimating a flexible model in each of ~300 grid squares, for
each of 6 grids, means estimating ~1,800 logistic models!
Not a problem, because:
• each one has a limited amount of data (run time for most
algorithms grows faster than linearly with data size)
• each local model is separate, allowing for parallel
processing
Computation on my laptop takes ~4 minutes after simple
parallelization across cores.
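A back-of-the-envelope sketch of why many small models can be cheaper than one large one. It assumes fitting cost grows as n squared and that observations spread evenly over squares; both are illustrative assumptions, not measurements from the deck.

```python
def global_cost(n_obs, exponent=2.0):
    """Relative cost of one model fit to all observations."""
    return n_obs ** exponent

def local_cost(n_obs, n_squares, n_grids, exponent=2.0):
    """Relative cost of fitting every square of every grid (even spread assumed)."""
    per_square = n_obs / n_squares
    return n_grids * n_squares * per_square ** exponent

n_obs = 1_000_000  # roughly 0.5M real pings plus as many synthetic ones
speedup = global_cost(n_obs) / local_cost(n_obs, n_squares=300, n_grids=6)
print(round(speedup, 1))
```

Under these assumptions the 1,800 local fits cost about one fiftieth of a single global fit, before any parallelization.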
38. What is the Output from this Model?
• Predicted probability of each boat-time (i.e. observation)
being a real boat.
• High probabilities indicate observations doing something
"normal" or "predictable."
• Low probabilities indicate observations doing something
"discrepant."
Ship ID   Lat      Long     Speed   Timestamp    Pr
623432    24.546   55.005    9.8    1203221230   0.78
874627    24.716   55.108   12.4    1209242230   0.08
523881    25.128   54.807    4.2    1206120947   0.64
41. Value III: Prioritized List of Suspect Boats
• Model generates probabilities on an interval scale
• Facilitates efficient use of scarce enforcement resources
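A minimal sketch of turning the model output into a prioritized list, using the three example rows from the output slide. Reading the timestamp as YYMMDDHHMM is an assumption; it happens to place all three rows inside the March 2012 to January 2013 window.

```python
from datetime import datetime

# (ship_id, lat, lon, speed, timestamp, probability of being "real")
rows = [
    ("623432", 24.546, 55.005, 9.8, "1203221230", 0.78),
    ("874627", 24.716, 55.108, 12.4, "1209242230", 0.08),
    ("523881", 25.128, 54.807, 4.2, "1206120947", 0.64),
]

# Lowest probability first: the most discrepant boat-times top the list.
ranked = sorted(rows, key=lambda r: r[5])

for ship_id, lat, lon, speed, stamp, pr in ranked:
    when = datetime.strptime(stamp, "%y%m%d%H%M")  # assumed timestamp format
    print(ship_id, when.isoformat(), pr)

priority_ids = [r[0] for r in ranked]
```

Enforcement effort can then be allocated from the top of the list downward.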
42. Lessons Learned
Analytics is a powerful tool for identifying patterns in big data.
Identifying outliers is predicated on identifying patterns.
LUBaP models are a powerful tool for outlier detection.
This model uses no subject-matter expertise and only a simple
probability model (implications: portable across domains; fast)
43. What's the Next Hot Thing?
Unsupervised Scaling of Text Data
44. Analyzing Text is Important
The preponderance of data created today is free text, not
structured numerical data.
One thing people want to do with text is "scale" it; that is, rank
order it according to an underlying continuum.
Examples:
- put a numerical value on what each product reviewer thinks of
a particular product
- generate a measure of the extremism of Iranian clerics based
on their writings
45. Analyzing Text is Difficult
Text data is unstructured and messy.
"I thought I would love the iPhone, but it's actually not that
great."
Standard approaches:
1. Dictionary: Create a numeric value for many content-laden
words; compare texts to the dictionary.
2. Estimation: Hand-score many texts; use the scores as a
basis for training a statistical model for other texts.
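A small sketch of the dictionary approach applied to the example sentence above, to show why it struggles. The lexicon and its weights are invented for illustration.

```python
import re

# Hypothetical sentiment lexicon: word -> numeric value.
LEXICON = {"love": 2.0, "great": 1.5, "terrible": -2.0, "hate": -2.0}

def dictionary_score(text):
    """Sum the lexicon values of every word in the text; unknown words count 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(w, 0.0) for w in words)

review = "I thought I would love the iPhone, but it's actually not that great."
score = dictionary_score(review)
print(score)
```

The review is lukewarm at best, yet it scores strongly positive because "love" and "great" are counted without their context ("thought I would", "not that").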
46. A New Approach
Each author's use of a word implies that they "support" that
word, as opposed to words they don't use. The
model, developed for scaling the ideological positions of
legislators from their votes, can be applied to word use.
Benefits:
1: No dictionary!
2: Language invariant!
https://github.com/DataTacticsCorp/text-analysis
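A sketch of the data preparation this idea implies: treating each author as a "legislator" and each word as a "bill", with a 1 where the author used the word. This is only the vote matrix; the scaling model itself is not shown, and the function name and example texts are assumptions.

```python
def vote_matrix(docs_by_author):
    """Return (authors, vocab, matrix) where matrix[i][j] is 1 if author i used word j."""
    authors = sorted(docs_by_author)
    used = {a: set(docs_by_author[a].lower().split()) for a in authors}
    vocab = sorted(set().union(*used.values()))
    matrix = [[int(w in used[a]) for w in vocab] for a in authors]
    return authors, vocab, matrix

authors, vocab, matrix = vote_matrix({
    "a": "life begins early",
    "b": "choice matters",
    "c": "life matters",
})
print(vocab)
for name, row in zip(authors, matrix):
    print(name, row)
```

Because the matrix records only which words were used, nothing in it depends on a dictionary or on the language of the text.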
47. Preliminary Example
Pulled down 2,000 tweets, 1,000 each with the hashtags #prolife
and #prochoice.
Drop the hashtags (no cheating!), pre-process the text data, and
run the model.
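A minimal sketch of the "drop the hashtags, pre-process" step. Removing links and @mentions and lowercasing are assumed pre-processing choices, not steps the deck spells out; the example tweets are invented.

```python
import re

def preprocess(tweet):
    """Strip links, hashtags, and mentions, then return lowercase word tokens."""
    text = re.sub(r"https?://\S+", " ", tweet)  # links first: they may contain '#'
    text = re.sub(r"#\w+", " ", text)           # hashtags would give the label away
    text = re.sub(r"@\w+", " ", text)           # mentions
    return re.findall(r"[a-z']+", text.lower())

tokens_a = preprocess("Every life counts #prolife http://t.co/abc")
tokens_b = preprocess("@user My body, my choice! #prochoice")
print(tokens_a, tokens_b)
```

The resulting token lists are what a word-use matrix would be built from.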
52. Localview
Localview, also known as "Lv", is a proprietary cloud/web-based
dashboard with an advanced analytics framework. The desired
end state is integrated data mining, knowledge discovery, and
pattern recognition of social and spatial patterning. Lv will provide
end users with globally and locally available historical information
as well as globally and locally available real-time social media data
feeds. The service includes news; on-the-spot statistics using a
proprietary Data Tactics tool called ZoomStat©; historical facts; and
data on social media, economics, security, military, infrastructure,
health, aid, natural disasters, war, entertainment, weather,
transportation, and travel. All results will be ingested, normalized,
analyzed, and then plotted on a dynamic and interactive global map.
53. ...by the numbers
✓ 7 volunteer, part-time team members (NO OVERHEAD)
✓ first DEMO delivered in 86 days
✓ 832 hours of research & development time
54. The Team:
The Team
backend development frontend development data analysis development
Marty A
Joe A
Joon K
Annie W Dave P
Rich H
Shenoa H
60. Directional Space Time Analytics
Data Tactics has been working on a set of problems that
require considered solutions. The following method
compares distributions at two points in time, with a
particular focus on changes in the overall morphology of the
distribution and on the mobility of individual observations
within the distribution over the same period, while
accounting for neighborhood effects. These
dynamics are illuminating: they communicate change over time and
explicitly account for the underlying spatial dimension (Wy).
By integrating a dynamic local space-time indicator
with directional statistics, these methods provide
insights on the role of spatial dependence and uncontrolled
variance over time and space.
61. Directional Space Time Analytics
This analysis demonstrates the utility of directional space time analytics
on regional stability distribution dynamics. Drawing on recent advances
in geovisualization [1], we suggest a spatially explicit view of mobility.
Based on the integration of a dynamic local indicator of spatial
association together with directional statistics and mapped data points
to each observation, this framework provides new insights on the role of
spatial dependence in regional stability and change.
These approaches have been illustrated with state-level incomes in the
U.S. (1969-2008), Gross Domestic Product (1960-2011), the Failed States
Index (2010-2012), and GMTI data (t0, t1).
[1] Murray, A. T., Liu, Y., Rey, S. J., and Anselin, L. (2010). Exploring movement object patterns.
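A sketch of one common way to build the directional view described above: standardize each observation and its spatial lag (Wy) at two times, then record the direction of movement in that plane. Whether the deck's DLISA uses exactly this formulation is an assumption, and the four-region example is invented.

```python
import math

def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def spatial_lag(z, neighbors):
    """Row-standardized spatial lag: the mean of each observation's neighbors."""
    return [sum(z[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]

def movement_angles(y0, y1, neighbors):
    """Angle (degrees) of each move in the (own value, neighbor value) plane."""
    z0, z1 = zscores(y0), zscores(y1)
    lag0, lag1 = spatial_lag(z0, neighbors), spatial_lag(z1, neighbors)
    return [math.degrees(math.atan2(l1 - l0, a1 - a0)) % 360
            for a0, a1, l0, l1 in zip(z0, z1, lag0, lag1)]

# Four regions in a row; each region's neighbors are the adjacent ones.
neighbors = [[1], [0, 2], [1, 3], [2]]
angles = movement_angles([1, 2, 3, 4], [4, 3, 2, 1], neighbors)
print([round(a, 1) for a in angles])
```

An angle between 0 and 90 degrees means a region and its neighbors both improved relative to the rest; an angle between 180 and 270 means both declined.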
62. Per Capita Gross Domestic Product
A measure of the total output of a country that takes the gross domestic product (GDP)
and divides it by the number of people in the country. The per capita GDP is especially
useful when comparing one country to another because it shows the relative
performance of the countries. A rise in per capita GDP signals growth in the economy
and tends to translate as an increase in productivity.
GDP is widely used by economists to gauge economic recession and recovery and an
economy's general monetary ability to address externalities. It is not meant to measure
externalities. It serves as a general metric for a nominal monetary standard of living and
is not adjusted for costs of living within a region.
Gross Domestic Product
GDP = private consumption + gross investment + government spending + (exports - imports), or
GDP = C + I + G + (X - M)
63. GDP per Capita
Time Span: 1960 to 2011 (52 temporal bin(s), 1 year intervals); 2000 to 2011 (12 temporal
bin(s), 1 year intervals);
Spatial Area: Global;
Original Sample: 202 obs;
Data processing: imputation;
Pruned Sample: 145 observations;
Method: Directional Local Indicator of Spatial Autocorrelation (Moran's I) with space-time
classifications of High-High (HH), Low-Low (LL), High-Low (HL), Low-High (LH);
Spatial Weights: knn4;
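A minimal sketch of what "knn4" spatial weights mean: every observation gets its four nearest observations as neighbors. Planar Euclidean distance is used here as a simplifying assumption; country locations would normally call for great-circle distance. The points are invented.

```python
import math

def knn_weights(points, k=4):
    """Map each point index to the indices of its k nearest other points."""
    weights = {}
    for i, (xi, yi) in enumerate(points):
        dists = sorted((math.hypot(xi - xj, yi - yj), j)
                       for j, (xj, yj) in enumerate(points) if j != i)
        weights[i] = [j for _, j in dists[:k]]
    return weights

points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
w = knn_weights(points, k=4)
print(w)
```

Unlike contiguity weights, every observation has exactly k neighbors, so isolated units such as islands still get a spatial lag.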
64. > describe(dlisa$yr2000)
> describe(dlisa$yr2011)
Variable   n     mean    sd      median   mad    min   max      range    skew   kurtosis
yr2000     145   5759    9534    1491     1831   87    46453    46366    2.12   3.72
yr2011     145   13292   20621   4666     5841   231   114232   114001   2.46   6.54
Directional Space Time Analytics
66. Directional Space Time Analytics
2000:2011 (12 temporal bin(s), 1 year intervals);
67. Directional Space Time Analytics
What is wrong with Vermont[1]?
- Seemingly nothing!
- Lies within head of approximately normal distribution
- Not an outlier in a classical statistical sense
- Vermont remains below the US average but is
closing the gap.
[1] State Median Income
68. State Median Income
Time Span: 1969 to 2008 (40 temporal bin(s), 1 year intervals)
Spatial Area: Contiguous United States;
Original Sample: 48 obs;
Method: Directional Local Indicator of Spatial Autocorrelation (Moran's I) with space-time
classifications of High-High (HH), Low-Low (LL), High-Low (HL), Low-High (LH);
Spatial Weights: Rook Contiguity;
69. Directional Space Time Analytics
1969:2008 (40 temporal bin(s), 1 year intervals)
70. Directional Space Time Analytics
1969:2008 (40 temporal bin(s), 1 year intervals)
71. Directional Space Time Analytics
1969:2008 (40 temporal bin(s), 1 year intervals)
73. Core Values:
Localview as an ecosystem:
Most existing big data analyses of social media are confined to a
single platform. However, most of the topics of interest to such
studies, such as influence or information flow, can rarely be confined
to the Internet, let alone to a single platform. The understandable
difficulty of obtaining high-quality multi-platform data does not mean
that we can treat a single platform as a closed and insular system,
as if human information flows were all gases in a chamber.
"Shapes of stories into computers..." Kurt Vonnegut
Nate Silver - Cognition2; Small Multiples; Tukey vs. Tufte
http://kottke.org/11/09/kurt-vonnegut-explains-the-shapes-of-stories
74. Core Values:
Open-source software where possible.
- Bigger data means bigger cost.
- Scientific Python and the R computing language reached maturity years ago.
Data = Rough + Smooth Qualities
Rough = impulsive, spiky signal: outliers; Smooth = pervasive
Leverage analytics to help understand patterns in data as well as outliers, the so-called smooth
and rough elements of data. Both the "smooth" and the "rough" patterns in data are
informative, depending on the specific questions customers have.
Local, as opposed to global or whole-map statistics:
We believe that micro-level, local patterns are often of key interest, and can be
obscured or distorted by attempts to fit global models to local data.
Analytical Pluralism:
Multi-method approaches dominate single-method approaches. Rather than craft a single
statistical model to answer a customer question, we attack problems from several angles
simultaneously, deriving insights from areas of overlap and divergence in the pattern of findings.
Methodological pathways:
Blend nomothetic and idiographic approaches.
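The "Data = Rough + Smooth" identity from this slide can be sketched with a running median as the smoother. The window of 3 and the example series are illustrative assumptions; any smoother could play the same role.

```python
from statistics import median

def smooth_rough(series, window=3):
    """Split a series into a running-median smooth and the leftover rough."""
    half = window // 2
    smooth = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        smooth.append(median(series[lo:hi]))
    rough = [x - s for x, s in zip(series, smooth)]
    return smooth, rough

data = [10, 11, 10, 50, 11, 12, 11]
smooth, rough = smooth_rough(data)
print(smooth)
print(rough)
```

The smooth tracks the pervasive level near 11, while the rough isolates the spike at the fourth point; which part matters depends on the customer's question.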
79. ...on the horizon.
...On the Horizon:
DT & USMA Department of Systems Engineering partner together and leverage
the Advanced Individual Academic Development Program.
Rstudio: analytics.data-tactics-corp.com; PostgreSQL: analytics.data-tactics-corp.com Port: 5432
https://github.com/rheimann/kiva-master
80. Data Tactics & US Military Academy:
A Primer in Microfinance using KIVA
RStudio: analytics.data-tactics-corp.com; PostgreSQL: analytics.data-tactics-corp.com Port: 5432
Understanding the complex nature of microfinance more completely:
The US military is directly involved in microfinance (Iraq & Afghanistan), working primarily
through Provincial Reconstruction Teams (PRTs), which are funded by the DoD and DoS. The
operational requirements of these agencies create a need to demonstrate quick impact on
economic recovery, and therefore the goal is to report high numbers of loans.
Technical complexities separate this data from other datasets:
Heterogeneous forms: structured/unstructured, nominal/ordinal/quantitative, temporal,
geographic, multi-lingual, multiple relationships (lenders to recipients), multiple sectors,
missing data. Data cleansing is hard!
Big Data(ish): $420M (USD), 1.1 million lenders, 580,000 loans, 250 partners, 4.1M
transactions, 3 WHOLE GBs. (https://vimeo.com/28413747)
Broad appeal:
...government to defense to finance to banking to non-profit organizations to THE POOR.
https://github.com/rheimann/kiva-master
81. ...on the horizon.
...On the Horizon:
DT & The Institute for the Study of War will collaborate in a balanced but largely
quantitative approach to analyzing revolutions and the role social media plays with
particular focus on the Iraq Spring.
82. ...on the horizon.
...on the Horizon:
Data Science for Program Managers (late September / early October)
Analytics Brown Bag Volume II (October / Early November)