NASA and Apache Spark

Chris Mattmann
Chief Architect, Instrument and Science Data Systems Section, NASA JPL
Adjunct Associate Professor, USC
Director, Apache Software Foundation

Spark Summit 2015
  
And you are?
•  Apache Board of Directors involved in
–  OODT (VP, PMC), Tika (PMC), Nutch (PMC), Incubator (PMC), SIS (PMC), Gora (PMC), Airavata (PMC)
•  Chief Architect, Instrument and Science Data Systems Section at NASA JPL in Pasadena, CA USA
•  Software Architecture/Engineering Prof at Univ. of Southern California
  	
  
15-Jun-15 SparkSummit
  
  
I work here
Instrument & Ground Data Systems (Section 398)
•  Largest Section on Lab
•  250+ people
•  Data Science, Machine Learning, Visualization, Operations groups
•  OCO-2, NPP Sounder PEATE, SMAP, MER, MSL, Mars 2020, Image Processing
  
  
Instrument and Ground Systems: Earth Monitoring
  
  
[Figure: end-to-end science data flow. Sources: Space-based Satellite; Airborne Instrument and Platform; In-Situ Sensor; Deep Space Network (DSN). Processing: Instrument Ground Systems (Algorithms, e.g., triage; scalable archives; File & Metadata Management; Workflow Management; Resource Management) and Scalable Data Analytics (Graph-based, Array-based, Text-based). Consumers: Science Community; Researchers; Decision Makers. Flow labels: data (GBs - Earth; MBs - Planetary); data (10s of TB); data (10s-100s of GB); data (reduced); raw & processed data; processed & reduced data.]
Some “Big Data” Grand Challenges I’m interested in
•  How do we handle 700 TB/sec of data coming off the wire when we actually have to keep it around?
–  Required by the Square Kilometre Array
•  Joe scientist says I’ve got an IDL or Matlab algorithm that I will not change and I need to run it on 10 years of data from the Colorado River Basin and store and disseminate the output products
–  Required by the Western Snow Hydrology project
•  How do we compare petabytes of climate model output data in a variety of formats (HDF, NetCDF, Grib, etc.) with petabytes of remote sensing data to improve climate models for the next IPCC assessment?
–  Required by the 5th IPCC assessment and the Earth System Grid and NASA
•  How do we catalog all of NASA’s current planetary science data?
–  Required by the NASA Planetary Data System
Image Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295
Copyright 2012. Jet Propulsion Laboratory, California Institute of Technology. US Government Sponsorship Acknowledged.
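To get a feel for the first challenge, the 700 TB/sec figure can be turned into storage volumes with simple arithmetic. This is only a back-of-the-envelope sketch: it assumes a sustained rate and decimal units (1 PB = 1,000 TB, 1 EB = 1,000,000 TB), neither of which the slide states.

```python
# Back-of-the-envelope storage arithmetic for a sustained 700 TB/sec stream.
# Assumptions (not from the slides): the rate is sustained, and units are
# decimal, i.e. 1 PB = 1,000 TB and 1 EB = 1,000,000 TB.
RATE_TB_PER_SEC = 700
SECONDS_PER_MINUTE = 60
SECONDS_PER_DAY = 24 * 60 * 60


def volume_tb(rate_tb_per_sec: float, seconds: float) -> float:
    """Total volume in TB produced at a constant rate over a time span."""
    return rate_tb_per_sec * seconds


per_minute_pb = volume_tb(RATE_TB_PER_SEC, SECONDS_PER_MINUTE) / 1_000
per_day_eb = volume_tb(RATE_TB_PER_SEC, SECONDS_PER_DAY) / 1_000_000

print(f"per minute: {per_minute_pb:.0f} PB")
print(f"per day:    {per_day_eb:.2f} EB")
```

Under those assumptions, keeping even one minute of the stream means storing tens of petabytes, which is why the question is about retention and not only throughput.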
  
  
Big Data Strategic Initiative

Task | Title | PI | Section
1 | Power Minimization in Signal Processing for Data-Intensive Science | Larry D’Addario | 335
2 | Machine Learning for Smart Triage of Big Data | Kiri Wagstaff | 388
3 | Archiving, Processing and Dissemination for the Big Data Era | Chris Mattmann | 388
4 | Knowledge driven Automated Movie Production Environment distribution and Display (AMPED) Pipeline | Eric De Jong | 3223

Initiative Long Term Objectives
•  Apply power-efficient digital architectures to future JPL flight instrument developments and proposals.
•  Expand and promote JPL expertise with machine learning algorithm development for real-time triage.
•  Utilize intelligent anomaly classification algorithms in other fields, including data-intensive industry.
•  Build on JPL investments in large data archive systems to capture role in future science facilities.
•  Enhance the efficiency and impact of JPL’s data visualization and knowledge extraction programs.

Initiative Leader: Chris Mattmann
Steering Committee Leader: Joseph Lazio

Initial Major Milestones for FY13 | Date
Report on end-to-end power optimization of instruments | Jun 2013
Hierarchical classification method for VAST and ChemCam | Jan 2013
Demonstrate smart compression for Hyperion and CRISM | Mar 2013
Cloud computing research and scalability experiments | Feb 2013
Data formats and text, metadata extraction in big data sys. | Aug 2013
Develop AMPED pipeline and install in VIP Center | Dec 2012

Future Opportunities: Mission and instrument competitions, data-intensive industries, LSST, future radio observatories.
JPL Concept: Big data technology for data triage, archiving, etc.
Key Challenges this work enables: Broaden JPL business base (relevant to 1X, 3X, 4X, 7X, 8X, 9X Directorates)
  
Credit: Dayton Jones
The Square Kilometre Array
  
  
Credit: Andrew Hart
How did I get involved with Spark?
  
DARPA XDATA
•  Grab the principals behind the leading infrastructure/viz technologies
-  Shove them in a tight space
-  Provide beer, coffee and snacks
-  Provide awesome data and challenges
-  Provide infrastructure and connectivity
•  Check in every day and 1x a week
•  Wall of Shame/Fame
•  New Challenges Each Week
•  Midterm Presentations
•  Peanut Gallery
•  Make people talk/socialize
•  Put that all together
  
DARPA XDATA: Analytics + Viz
  
  
[Figure: Akamai IP Connectivity using AR. Axes: Date (2006 to 2012) vs. Volume (0 to 12000). Source label: GoFFish: XDATA Summer Workshop Mid-Term, 2013-07-10.]
Met these fine people
•  Ion Stoica, CEO, Databricks; Co-Director, AMP Lab
•  Matt Massie, Dev Manager, AMP Lab
•  Dr. Chris White, DARPA XDATA PM
	
  
	
  
  
The Apache Software Foundation
•  Largest open source software development entity in the world
–  Over 2600+ committers
–  Over 4200+ contributors
–  Over 400+ members
•  100+ Top Level Projects
–  57 Incubating
–  32 Lab Projects
•  12 retired projects in the “Attic”
•  Over 1.2 million revisions
•  501(c)3 non-profit organization incorporated in Delaware
-  Over 10M successful requests served a day across the world
-  HTTPD web server used on 100+ million web sites (52+% of the market)
  
Apache Maturity Model
•  Start out with Incubation
•  Grow community
•  Make releases
•  Gain interest
•  Diversify
•  When the project is ready, graduate into
–  Top-Level Project (TLP)
–  Sub-project of TLP
•  Increasingly, Sub-projects are discouraged compared to TLPs
  
  
Apache is a well-recognized brand
  
  
Why Spark and NASA?

Where does Spark fit into science?
U.S. National Climate Assessment (pic credit: Dr. Tom Painter)
SKA South Africa: Square Kilometre Array (pic credit: Dr. Jasper Horrell, Simon Ratcliffe)
  
  
NASA Science & Architecture
  
  
Science Data File Formats
•  Hierarchical Data Format (HDF)
–  http://www.hdfgroup.org
–  Versions 4 and 5
–  Lots of NASA data is in 4, newer NASA data in 5
–  Encapsulates
•  Observation (Scalars, Vectors, Matrices, NxMxZ…)
•  Metadata (Summary info, date/time ranges, spatial
ranges)
–  Custom readers/writers/APIs in many languages
•  C/C++, Python, Java
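The split between observations and metadata is what lets a data system pick the right files without opening the arrays. The sketch below illustrates that idea in plain Python only. The `Granule` class, its field names, and the file names are hypothetical stand-ins and are not part of any HDF API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class Granule:
    """Hypothetical summary metadata for one science data file."""
    name: str
    start: date
    end: date
    lat_range: Tuple[float, float]  # (south, north) in degrees
    lon_range: Tuple[float, float]  # (west, east) in degrees


def overlaps(a: Tuple, b: Tuple) -> bool:
    """True when closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_granules(granules: List[Granule], start: date, end: date,
                  lat_range: Tuple[float, float],
                  lon_range: Tuple[float, float]) -> List[str]:
    """Select files by metadata alone, without reading any observations."""
    return [
        g.name for g in granules
        if overlaps((g.start, g.end), (start, end))
        and overlaps(g.lat_range, lat_range)
        and overlaps(g.lon_range, lon_range)
    ]


# Made-up catalog entries for illustration.
catalog = [
    Granule("granule_a.h5", date(2014, 1, 1), date(2014, 1, 31), (30.0, 45.0), (-120.0, -100.0)),
    Granule("granule_b.h5", date(2014, 2, 1), date(2014, 2, 28), (30.0, 45.0), (-120.0, -100.0)),
    Granule("granule_c.h5", date(2014, 1, 1), date(2014, 1, 31), (-40.0, -20.0), (10.0, 40.0)),
]

matches = find_granules(catalog, date(2014, 1, 15), date(2014, 1, 20),
                        (35.0, 40.0), (-115.0, -105.0))
print(matches)
```

Only the granules whose time span and bounding box both overlap the query are returned. The large observation arrays are never touched.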
  
Science Data File Formats
•  network Common Data Form (netCDF)
–  www.unidata.ucar.edu/software/netcdf/
–  Versions 3 and 4
–  Heavily used in DOE, NOAA, etc.
–  Encapsulates
•  Observation (Scalars, Vectors, Matrices, NxMxZ…)
•  Metadata (Summary info, date/time ranges, spatial
ranges)
–  Custom readers/writers/APIs in many languages
•  C/C++, Python, Java
–  Classic data model is not hierarchical: all flat (netCDF-4 adds groups)
  
OCO-1 Workflow
  
  
Credit: B. Chafin
NPP Sounder PEATE
  
  
[Figure: NPP Sounder PEATE workflow. Tasks: Orbit; IASI GPolygon; AMSU-A GPolygon; MHS GPolygon; IASI Map; AMSU-A Map; MHS Map. Products: MetOpA IASI L1C; MetOpA AMSU-A L1B; MetOpA MHS L1B; MetOpA Orbit Boundary File; MetOpA IASI L1C GPolygon File; MetOpA AMSU-A L1B GPolygon File; MetOpA MHS L1B GPolygon File; MetOpA IASI L1C Granule Map File; MetOpA AMSU-A L1B Granule Map File; MetOpA MHS L1B Granule Map File.]
Credit: B. Foster
  
Data-Reuse Between Stages
•  All of these science data pipelines
–  Read/Write NetCDF, HDF files
–  Write to distributed file systems (only recently HDFS, GlusterFS, etc.)
•  Have timing constraints
•  Include jobs with varying timing
–  Some early completing jobs (<1ms)
–  Some long running jobs
•  What does this sound like? SPARK
  
Apache OODT
•  Entered “incubation” at the Apache
Software Foundation in 2010
•  Selected as a top level Apache Software
Foundation project in January 2011
•  Developed by a community of participants
from many companies, universities, and
organizations
•  Used for a diverse set of science data
system activities in planetary science,
earth science, radio astronomy,
biomedicine, astrophysics, and more
OODT Development & user community includes:
http://oodt.apache.org
  
SciSpark	
  
  
Motivation for SciSpark
•  Experiment with in memory and frequent data reuse operations
–  Regridding, Interactive Analytics such as MCC search, and variable clustering (min/max) over decadal datasets could benefit from in-memory testing (rather than frequent disk I/O)
–  Data Ingestion (preparation, formatting)
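Regridding is a good example of an operation that touches the same array repeatedly. The following is a minimal sketch of one common approach, bilinear interpolation onto a finer regular grid, written in plain Python. The function names are illustrative and this is not SciSpark code. It assumes the input grid is at least 2x2 and the output has at least two rows and two columns.

```python
from typing import List

Grid = List[List[float]]


def bilinear_sample(grid: Grid, row: float, col: float) -> float:
    """Interpolate grid at fractional (row, col) index coordinates."""
    max_r, max_c = len(grid) - 1, len(grid[0]) - 1
    # Clamp the cell origin so the last row/column reuses the final cell.
    r0 = min(int(row), max_r - 1)
    c0 = min(int(col), max_c - 1)
    fr, fc = row - r0, col - c0
    top = grid[r0][c0] * (1 - fc) + grid[r0][c0 + 1] * fc
    bottom = grid[r0 + 1][c0] * (1 - fc) + grid[r0 + 1][c0 + 1] * fc
    return top * (1 - fr) + bottom * fr


def regrid(grid: Grid, out_rows: int, out_cols: int) -> Grid:
    """Resample grid onto an out_rows x out_cols grid spanning the same extent."""
    max_r, max_c = len(grid) - 1, len(grid[0]) - 1
    return [
        [
            bilinear_sample(grid, i * max_r / (out_rows - 1), j * max_c / (out_cols - 1))
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


coarse = [[0.0, 2.0],
          [4.0, 6.0]]
fine = regrid(coarse, 3, 3)
for row in fine:
    print(row)
```

Each output cell reads up to four input cells, so regridding a decade of model output re-reads neighbouring values many times over. Holding the source grid in memory avoids paying for that in disk I/O.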
  
  
Architecture of SciSpark
  
  
[Figure: SciSpark architecture. Labels: Scientific RDD Creation; SciSpark; HDFS / Shark; Scala; Map (split); Save / Cache; Split by Time; Split by Region; Load HDF; Load NetCDF; Regrid; Metrics; D3.js. Users: Data Scientists / Expert Users; Scientists / Decision Makers / Educators / Students. Data Centers/Systems: RCMES, obs4MIPs, ESGF, ExArch, DAACs.]
[Figure: page images from J. Kim et al., running head "Systematic model errors" (supplemental figures at http://rcmes.jpl.nasa.gov/publications/figures/Kim-Climate_Dynamics-2012), evaluating regional climate model (RCM) precipitation over Africa against CRU data. Fig. 2: The biases in the simulated annual-mean precipitation (mm/day) against the CRU data for the individual models (a–j). The overland-mean precipitation (k) and the spatial pattern correlations and standardized deviations (l) with respect to the CRU data over the land surface. The red square in (l) indicates the multi-RCM ensemble. Fig. 3: The standardized deviations and spatial pattern correlations between the CRU data and the individual model results for the boreal summer (June–July–August; blue) and winter (December–January–February; red) over the land surface: (a) precipitation and (b) temperature.]
Extract, Transform and Load (ETL)
Sci Spark User Interface
SciSpark – Visualization (D3 and friends)
  
  
[Figure: architecture diagram. Labels: Query; Storage; Raster; Vector; Text; Filtering; Gora (storage framework); Solr (indexing framework); web pages; index; HBase/Hadoop; Tika (parsing framework); Protocol Layer.]
•  Tika, Nutch, Blaze, Bokeh, Solr, Tangelo
  
Use Cases
•  (A) Multi-stage generation to generate time-split data
•  (B) Multi-stage operation to select data from Shark, and cluster by deviation from mean
  
  
Climate Metrics on SciSpark
  
  
[Figure: two sRDD pipelines.
A) OODT-based ETL into HDFS -> sRDD -> Regrid (bilinear and sRDD.cache) -> sRDD -> Split by Time (Jan 1998 - 2012) -> sRDD
B) Select lat, lng, value from Shark over NA -> sRDD -> Cluster values on mean, output clusters -> sRDD -> Top 10 sorted clusters by mean
Storage tiers shown for the stages: in-memory (node local or replicated); HDFS/replicated spinning disk or SSD.]
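The two pipelines can be approximated on a toy dataset to show the shape of each step. Everything in this sketch is an assumption made for illustration: the `Obs` record, the bounding box used for "NA", the fixed-width binning used as the clustering rule, and the sample values. It uses plain Python collections and no Spark or Shark calls.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Obs(NamedTuple):
    year: int
    lat: float
    lng: float
    value: float


def split_by_time(observations: Iterable[Obs]) -> Dict[int, List[Obs]]:
    """Pipeline A, last step: group observations by year."""
    by_year: Dict[int, List[Obs]] = defaultdict(list)
    for o in observations:
        by_year[o.year].append(o)
    return dict(by_year)


def select_region(observations: Iterable[Obs], lat_range: Tuple[float, float],
                  lng_range: Tuple[float, float]) -> List[Obs]:
    """Pipeline B, first step: keep observations inside a bounding box."""
    return [o for o in observations
            if lat_range[0] <= o.lat <= lat_range[1]
            and lng_range[0] <= o.lng <= lng_range[1]]


def cluster_by_deviation(observations: List[Obs], bin_width: float) -> Dict[int, List[float]]:
    """Group values into fixed-width bins of their deviation from the overall mean."""
    overall = mean(o.value for o in observations)
    clusters: Dict[int, List[float]] = defaultdict(list)
    for o in observations:
        clusters[int((o.value - overall) // bin_width)].append(o.value)
    return dict(clusters)


def top_clusters(clusters: Dict[int, List[float]], n: int = 10) -> List[Tuple[int, float]]:
    """Rank clusters by their mean value, highest first, and keep the top n."""
    ranked = sorted(((k, mean(v)) for k, v in clusters.items()),
                    key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


data = [
    Obs(1998, 40.0, -100.0, 10.0),
    Obs(1998, 45.0, -90.0, 12.0),
    Obs(1999, 40.0, -100.0, 20.0),
    Obs(1999, 45.0, -90.0, 22.0),
    Obs(1999, -10.0, 30.0, 99.0),  # outside the assumed box
]

by_year = split_by_time(data)
na = select_region(data, lat_range=(15.0, 75.0), lng_range=(-170.0, -50.0))
top = top_clusters(cluster_by_deviation(na, bin_width=5.0), n=2)
print(sorted(by_year), len(na), top)
```

In the diagram each arrow produces a new sRDD that can stay in memory. Here the intermediate lists play that role, so later steps reuse them without reloading the source data.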
SciSpark – just getting started
•  Funded NASA AIST14 award to construct
•  SciSpark climate scenarios
–  Climate extremes / impact analysis and clustering
–  Mesoscale Convective Complex Search
  
  
SKA/Astronomy	
  
  
Go Where the Science is Best!
Deploy Where there is No Infrastructure
Reconfigure as Needed to Optimize Performance
Simplify by Using Raw Voltage Capture
DARPA Memex
•  Domain Specific search of audio/video/media
•  Focused crawling; interactive crawling
  
  
Conclusions
•  Lots of places in science and NASA for Spark
•  Great connections already
•  Existing Apache projects to integrate upstream
•  Downstream use cases
•  Come chat with me today!
  
  
Thank you!
chris.a.mattmann@nasa.gov
@chrismattmann/Twitter
http://sunset.usc.edu/~mattmann/
Credit: Vala Afshar, Extreme Networks
  

Weitere ähnliche Inhalte

Was ist angesagt?

TechEvent Cloud Governance
TechEvent Cloud GovernanceTechEvent Cloud Governance
TechEvent Cloud GovernanceTrivadis
 
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Spark Summit
 
How Spark is Making an Impact at Goldman Sachs by Vincent Saulys
How Spark is Making an Impact at Goldman Sachs by Vincent SaulysHow Spark is Making an Impact at Goldman Sachs by Vincent Saulys
How Spark is Making an Impact at Goldman Sachs by Vincent SaulysSpark Summit
 
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastCon
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastConThe Knowledge Graph. What is it? How does it work? How do you get in? @YoastCon
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastConJason Barnard
 
GDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of GraphsGDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of GraphsNeo4j
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn confluent
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com confluent
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use CasesMax De Marzi
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseDataWorks Summit
 
Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1AjayRawat971036
 
Stream de dados e Data Lake com Debezium, Delta Lake e EMR
Stream de dados e Data Lake com Debezium, Delta Lake e EMRStream de dados e Data Lake com Debezium, Delta Lake e EMR
Stream de dados e Data Lake com Debezium, Delta Lake e EMRCicero Joasyo Mateus de Moura
 
Best Practices For Workflow
Best Practices For WorkflowBest Practices For Workflow
Best Practices For WorkflowTimothy Spann
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukErwin de Kreuk
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineAmazon Web Services
 
Dmm302 - Sap Hana Data Warehousing: Models for Sap Bw and SQL DW on SAP HANA
Dmm302 - Sap Hana Data Warehousing: Models for Sap Bw and SQL DW on SAP HANA Dmm302 - Sap Hana Data Warehousing: Models for Sap Bw and SQL DW on SAP HANA
Dmm302 - Sap Hana Data Warehousing: Models for Sap Bw and SQL DW on SAP HANA Luc Vanrobays
 

Was ist angesagt? (20)

TechEvent Cloud Governance
TechEvent Cloud GovernanceTechEvent Cloud Governance
TechEvent Cloud Governance
 
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
 
Screw DevOps, Let's Talk DataOps
Screw DevOps, Let's Talk DataOpsScrew DevOps, Let's Talk DataOps
Screw DevOps, Let's Talk DataOps
 
DataOps with Project Amaterasu
DataOps with Project AmaterasuDataOps with Project Amaterasu
DataOps with Project Amaterasu
 
How Spark is Making an Impact at Goldman Sachs by Vincent Saulys
How Spark is Making an Impact at Goldman Sachs by Vincent SaulysHow Spark is Making an Impact at Goldman Sachs by Vincent Saulys
How Spark is Making an Impact at Goldman Sachs by Vincent Saulys
 
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastCon
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastConThe Knowledge Graph. What is it? How does it work? How do you get in? @YoastCon
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastCon
 
GDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of GraphsGDPR: Leverage the Power of Graphs
GDPR: Leverage the Power of Graphs
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1Apache spark-the-definitive-guide-excerpts-r1
Apache spark-the-definitive-guide-excerpts-r1
 
Stream de dados e Data Lake com Debezium, Delta Lake e EMR
Stream de dados e Data Lake com Debezium, Delta Lake e EMRStream de dados e Data Lake com Debezium, Delta Lake e EMR
Stream de dados e Data Lake com Debezium, Delta Lake e EMR
 
Best Practices For Workflow
Best Practices For WorkflowBest Practices For Workflow
Best Practices For Workflow
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de Kreuk
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
 
Dmm302 - Sap Hana Data Warehousing: Models for Sap Bw and SQL DW on SAP HANA
Dmm302 - Sap Hana Data Warehousing: Models for Sap Bw and SQL DW on SAP HANA Dmm302 - Sap Hana Data Warehousing: Models for Sap Bw and SQL DW on SAP HANA
Dmm302 - Sap Hana Data Warehousing: Models for Sap Bw and SQL DW on SAP HANA
 

Andere mochten auch

Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark and Hadoop at Production Scale-(Anil Gadre, MapR)
Spark and Hadoop at Production Scale-(Anil Gadre, MapR)Spark Summit
 
Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher B...
Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher B...Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher B...
Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher B...Spark Summit
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark Summit
 
Secure Communication with Privacy Preservation in VANET
Secure Communication with Privacy Preservation in VANETSecure Communication with Privacy Preservation in VANET
Secure Communication with Privacy Preservation in VANETAnkit Gupta
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Haoyuan Li
 
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...Spark Summit
 
Linux Filesystems, RAID, and more
Linux Filesystems, RAID, and moreLinux Filesystems, RAID, and more
Linux Filesystems, RAID, and moreMark Wong
 
The Hot Rod Protocol in Infinispan
The Hot Rod Protocol in InfinispanThe Hot Rod Protocol in Infinispan
The Hot Rod Protocol in InfinispanGalder Zamarreño
 
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDSAccelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDSCeph Community
 
Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift
Advanced Data Retrieval and Analytics with Apache Spark and Openstack SwiftAdvanced Data Retrieval and Analytics with Apache Spark and Openstack Swift
Advanced Data Retrieval and Analytics with Apache Spark and Openstack SwiftDaniel Krook
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 
ELC-E 2010: The Right Approach to Minimal Boot Times
ELC-E 2010: The Right Approach to Minimal Boot TimesELC-E 2010: The Right Approach to Minimal Boot Times
ELC-E 2010: The Right Approach to Minimal Boot Timesandrewmurraympc
 
VANET Simulation - Jamal Toutouh
VANET Simulation - Jamal  ToutouhVANET Simulation - Jamal  Toutouh
VANET Simulation - Jamal ToutouhJamal Toutouh, PhD
 
Velox: Models in Action
Velox: Models in ActionVelox: Models in Action
Velox: Models in ActionDan Crankshaw
 
Naïveté vs. Experience
Naïveté vs. ExperienceNaïveté vs. Experience
Naïveté vs. ExperienceMike Fogus
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scalejeykottalam
 
SampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS StackSampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS Stackjeykottalam
 

NASA's Use of Apache Spark for Big Data Analytics

  • 1. NASA and Apache Spark. Chris Mattmann: Chief Architect, Instrument and Science Data Systems Section, NASA JPL; Adjunct Associate Professor, USC; Director, Apache Software Foundation. Spark Summit 2015
  • 2. And you are? • Apache Board of Directors, involved in – OODT (VP, PMC), Tika (PMC), Nutch (PMC), Incubator (PMC), SIS (PMC), Gora (PMC), Airavata (PMC) • Chief Architect, Instrument and Science Data Systems Section at NASA JPL in Pasadena, CA, USA • Software Architecture/Engineering Prof at Univ. of Southern California. 15-Jun-15, SparkSummit
  • 3. I work here
  • 4. Instrument & Ground Data Systems (Section 398) • Largest Section on Lab • 250+ people • Data Science, Machine Learning, Visualization, Operations groups • OCO-2, NPP Sounder PEATE, SMAP, MER, MSL, Mars 2020, Image Processing
  • 5. Instrument and Ground Systems: Earth Monitoring. [Diagram labels: data sources (Space-based Satellite; Airborne Instrument and Platform; In-Situ Sensor; Deep Space Network (DSN)); Instrument Ground Systems (Algorithms, e.g., triage; scalable archives; File & Metadata Management; Workflow Management; Resource Management); Scalable Data Analytics (Graph-based, Array-based, Text-based); consumers (Science Community, Researchers, Decision Makers); data flows labeled raw & processed data, processed data, processed & reduced data, data (reduced); volumes labeled GBs (Earth) and MBs (Planetary), 10s of TB, and 10s-100s of GB.]
  • 6. Some “Big Data” Grand Challenges I’m interested in • How do we handle 700 TB/sec of data coming off the wire when we actually have to keep it around? – Required by the Square Kilometre Array • Joe scientist says I’ve got an IDL or Matlab algorithm that I will not change and I need to run it on 10 years of data from the Colorado River Basin and store and disseminate the output products – Required by the Western Snow Hydrology project • How do we compare petabytes of climate model output data in a variety of formats (HDF, NetCDF, Grib, etc.) with petabytes of remote sensing data to improve climate models for the next IPCC assessment? – Required by the 5th IPCC assessment and the Earth System Grid and NASA • How do we catalog all of NASA’s current planetary science data? – Required by the NASA Planetary Data System. Image Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295 Copyright 2012. Jet Propulsion Laboratory, California Institute of Technology. US Government Sponsorship Acknowledged.
  • 7. Big Data Strategic Initiative. Initiative Leader: Chris Mattmann. Steering Committee Leader: Joseph Lazio. JPL Concept: Big data technology for data triage, archiving, etc. Tasks (title, PI, section): 1. Power Minimization in Signal Processing for Data-Intensive Science (Larry D’Addario, 335); 2. Machine Learning for Smart Triage of Big Data (Kiri Wagstaff, 388); 3. Archiving, Processing and Dissemination for the Big Data Era (Chris Mattmann, 388); 4. Knowledge driven Automated Movie Production Environment distribution and Display (AMPED) Pipeline (Eric De Jong, 3223). Initiative Long Term Objectives: • Apply power-efficient digital architectures to future JPL flight instrument developments and proposals. • Expand and promote JPL expertise with machine learning algorithm development for real-time triage. • Utilize intelligent anomaly classification algorithms in other fields, including data-intensive industry. • Build on JPL investments in large data archive systems to capture role in future science facilities. • Enhance the efficiency and impact of JPL’s data visualization and knowledge extraction programs. Initial Major Milestones for FY13 (milestone, date): Report on end-to-end power optimization of instruments (Jun 2013); Hierarchical classification method for VAST and ChemCam (Jan 2013); Demonstrate smart compression for Hyperion and CRISM (Mar 2013); Cloud computing research and scalability experiments (Feb 2013); Data formats and text, metadata extraction in big data sys. (Aug 2013); Develop AMPED pipeline and install in VIP Center (Dec 2012). Key Challenges this work enables: Broaden JPL business base (relevant to 1X, 3X, 4X, 7X, 8X, 9X Directorates). Future Opportunities: Mission and instrument competitions, data-intensive industries, LSST, future radio observatories. Credit: Dayton Jones
  • 8. The Square Kilometre Array. Credit: Andrew Hart
  • 9. How did I get involved with Spark?
  • 10. DARPA XDATA • Grab the principals behind the leading infrastructure/viz technologies – Shove them in a tight space – Provide beer, coffee and snacks – Provide awesome data and challenges – Provide infrastructure and connectivity • Check in every day and 1x a week • Wall of Shame/Fame • New Challenges Each Week • Midterm Presentations • Peanut Gallery • Make people talk/socialize • Put that all together
  • 11. DARPA XDATA: Analytics + Viz. [Figures: chart of Volume against Date (2006 to 2012); slide from “GoFFish: XDATA Summer Workshop Mid-Term”, 2013-07-10; “Akamai IP Connectivity using AR”.]
  • 12. Met these fine people • Ion Stoica: CEO, Databricks; Co-Director, AMP Lab • Matt Massie: Dev Manager, AMP Lab • Dr. Chris White: DARPA XDATA PM
  • 13. The Apache Software Foundation • Largest open source software development entity in the world – 2600+ committers – 4200+ contributors – 400+ members • 100+ Top Level Projects – 57 Incubating – 32 Lab Projects • 12 retired projects in the “Attic” • Over 1.2 million revisions • 501(c)3 non-profit organization incorporated in Delaware – Over 10M successful requests served a day across the world – HTTPD web server used on 100+ million web sites (52+% of the market)
  • 14. Apache Maturity Model • Start out with Incubation • Grow community • Make releases • Gain interest • Diversify • When the project is ready, graduate into – Top-Level Project (TLP) – Sub-project of TLP • Increasingly, Sub-projects are discouraged compared to TLPs
  • 15. Apache is a well recognized brand
  • 16. Why Spark and NASA?
  • 17. Where does Spark fit into science? U.S. National Climate Assessment (pic credit: Dr. Tom Painter). SKA South Africa: Square Kilometre Array (pic credit: Dr. Jasper Horrell, Simon Ratcliffe)
  • 18. NASA Science & Architecture
  • 20. Science Data File Formats • Hierarchical Data Format (HDF) – http://www.hdfgroup.org – Versions 4 and 5 – Lots of NASA data is in 4, newer NASA data in 5 – Encapsulates • Observation (Scalars, Vectors, Matrices, NxMxZ…) • Metadata (Summary info, date/time ranges, spatial ranges) – Custom readers/writers/APIs in many languages • C/C++, Python, Java
  • 21. Science Data File Formats • network Common Data Form (netCDF) – www.unidata.ucar.edu/software/netcdf/ – Versions 3 and 4 – Heavily used in DOE, NOAA, etc. – Encapsulates • Observation (Scalars, Vectors, Matrices, NxMxZ…) • Metadata (Summary info, date/time ranges, spatial ranges) – Custom readers/writers/APIs in many languages • C/C++, Python, Java – Not a hierarchical representation: all flat
  • 22. OCO-1 Workflow. Credit: B. Chafin
  • 23. NPP Sounder PEATE. [Workflow diagram labels: tasks Orbit, IASI GPolygon, AMSU-A GPolygon, MHS GPolygon, IASI Map, AMSU-A Map, MHS Map; files MetOpA IASI L1C, MetOpA AMSU-A L1B, MetOpA MHS L1B, MetOpA Orbit Boundary File, a GPolygon File for each of the three instruments, and a Granule Map File for each of the three instruments.] Credit: B. Foster
  • 24. Data-Reuse Between Stages • All of these science data pipelines – Read/Write NetCDF, HDF files – Write to distributed file systems (only recently HDFS, GlusterFS, etc.) • Have timing constraints • Include jobs with varying timing – Some early completing jobs (<1ms) – Some long running jobs • What does this sound like? SPARK
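The data-reuse point on slide 24 can be illustrated with a small pure-Python analogy. This is only a sketch: it uses `functools.lru_cache` as a stand-in for keeping a parsed granule in memory and does not use Spark's API. The function names, the granule name, and the values are hypothetical.

```python
from functools import lru_cache

LOAD_COUNT = 0  # counts how often the expensive "read from disk" step runs


@lru_cache(maxsize=None)
def load_granule(name):
    """Stand-in for parsing a NetCDF/HDF file; the values are made up."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return tuple(float(i) for i in range(1, 6))


def stage_mean(name):
    """First pipeline stage: mean of the granule's values."""
    values = load_granule(name)
    return sum(values) / len(values)


def stage_max(name):
    """Second pipeline stage: reuses the same granule without reloading it."""
    return max(load_granule(name))


mean_value = stage_mean("granule-001")
max_value = stage_max("granule-001")
print(mean_value, max_value, LOAD_COUNT)
```

Both stages need the same input, but the load step runs once. Without the cache, each stage would repeat the read, which is the disk I/O pattern the slide contrasts with in-memory reuse.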
  • 25. Apache OODT • Entered “incubation” at the Apache Software Foundation in 2010 • Selected as a top level Apache Software Foundation project in January 2011 • Developed by a community of participants from many companies, universities, and organizations • Used for a diverse set of science data system activities in planetary science, earth science, radio astronomy, biomedicine, astrophysics, and more. OODT Development & user community includes: http://oodt.apache.org
  • 28. Motivation for SciSpark • Experiment with in memory and frequent data reuse operations – Regridding, Interactive Analytics such as MCC search, and variable clustering (min/max) over decadal datasets could benefit from in-memory testing (rather than frequent disk I/O) – Data Ingestion (preparation, formatting)
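Regridding, named on slide 28 as a candidate for in-memory work, can be sketched with plain bilinear interpolation. The code below is a minimal illustration in pure Python, not SciSpark's implementation. It assumes a regular grid stored as a list of rows, with at least two rows and two columns in both the input and the output; the function names are hypothetical.

```python
def bilinear(grid, y, x):
    """Interpolate grid (a list of rows) at fractional row y and column x."""
    rows, cols = len(grid), len(grid[0])
    # Clamp the cell origin so the last row/column reuses the final cell.
    y0 = min(int(y), rows - 2)
    x0 = min(int(x), cols - 2)
    dy, dx = y - y0, x - x0
    top = grid[y0][x0] * (1 - dx) + grid[y0][x0 + 1] * dx
    bottom = grid[y0 + 1][x0] * (1 - dx) + grid[y0 + 1][x0 + 1] * dx
    return top * (1 - dy) + bottom * dy


def regrid(grid, new_rows, new_cols):
    """Resample grid onto a new_rows x new_cols grid covering the same extent."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(new_rows):
        y = r * (rows - 1) / (new_rows - 1)
        out.append(
            [bilinear(grid, y, c * (cols - 1) / (new_cols - 1))
             for c in range(new_cols)]
        )
    return out


coarse = [[0.0, 2.0], [4.0, 6.0]]
fine = regrid(coarse, 3, 3)
for row in fine:
    print(row)
```

Every output cell reads four neighbouring input cells, so a regrid over a decadal dataset touches the same arrays many times. That access pattern is why keeping the source grid in memory helps.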
  • 29. Architecture of SciSpark. [Diagram labels: Data Centers/Systems (RCMES, obs4MIPs, ESGF, ExArch, DAACs); Extract, Transform and Load (ETL); Scientific RDD Creation; SciSpark; HDFS / Shark; Scala; (split); Map; Save / Cache; Split by Time; Split by Region; Load HDF; Load NetCDF; Regrid; Metrics; D3.js; SciSpark User Interface; Data Scientists / Expert Users; Scientists / Decision Makers / Educators / Students. The diagram also embeds page images from J. Kim et al., Climate Dynamics (2012), an evaluation of regional climate model precipitation and temperature over Africa against CRU data; figures at http://rcmes.jpl.nasa.gov/publications/figures/Kim-Climate_Dynamics-2012.]
  • 30. SciSpark – Visualization (D3 and friends). [Diagram labels: Query; Storage; Raster; Vector; Text; Filtering; Gora (storage framework); Solr (indexing framework); web pages; index; HBase/Hadoop; Tika (parsing framework); Protocol Layer.] • Tika, Nutch, Blaze, Bokeh, Solr, Tangelo
  • 31. Use Cases • (A) Multi-stage generation to generate time-split data • (B) Multi-stage operation to select data from Shark, and cluster by deviation from mean
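Use case (A) ends with data split by time. A minimal sketch of that last step in pure Python follows. It assumes each record is a `(date, value)` pair and uses one partition per calendar month; the record layout, the monthly granularity, and the function name are assumptions, not something the slides specify.

```python
from collections import defaultdict
from datetime import date


def split_by_time(records):
    """Group (date, value) records into one partition per (year, month)."""
    partitions = defaultdict(list)
    for day, value in records:
        partitions[(day.year, day.month)].append(value)
    return dict(partitions)


# Hypothetical observations; the values are made up.
records = [
    (date(1998, 1, 3), 0.4),
    (date(1998, 1, 17), 0.9),
    (date(1998, 2, 1), 1.2),
]
partitions = split_by_time(records)
print(partitions)
```

Each partition can then be processed independently, which is what makes a time split a natural unit of parallel work.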
  • 32. Climate Metrics on SciSpark. A) sRDD: OODT-based ETL into HDFS; sRDD: Regrid (bilinear and sRDD.cache); sRDD: Split by Time (Jan 1998 - 2012). B) sRDD: Select lat, lng, value from Shark over NA; sRDD: Cluster values on mean, output clusters; sRDD: Top 10 sorted clusters by mean. [Storage annotations on the stages: in-memory (node local or replicated); HDFS/replicated spinning disk or SSD.]
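Pipeline B on slide 32 (select lat, lng, value; cluster values on mean; output the top 10 clusters sorted by mean) can be sketched in pure Python. The slides do not define the clustering rule, so this sketch assumes one: each value is bucketed by its rounded number of standard deviations from the overall mean. The function name and sample records are hypothetical, and no Shark or Spark API is used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def cluster_by_deviation(records, top_n=10):
    """Bucket (lat, lng, value) records by deviation from the overall mean.

    Returns up to top_n tuples (cluster_mean, bucket, size), largest mean first.
    """
    values = [value for _, _, value in records]
    mu, sigma = mean(values), pstdev(values)
    clusters = defaultdict(list)
    for _lat, _lng, value in records:
        # Assumed rule: whole number of standard deviations from the mean.
        bucket = round((value - mu) / sigma) if sigma else 0
        clusters[bucket].append(value)
    ranked = sorted(
        ((mean(vals), bucket, len(vals)) for bucket, vals in clusters.items()),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical records over North America; the values are made up.
records = [
    (34.0, -118.0, 1.0),
    (35.0, -117.0, 1.0),
    (36.0, -116.0, 1.0),
    (40.0, -105.0, 9.0),
]
top = cluster_by_deviation(records)
print(top)
```

The overall mean and standard deviation require a full pass over the data before any record can be bucketed, so the selected values are read at least twice. That is the kind of reuse the in-memory annotations on the slide point at.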
  • 33. SciSpark – just getting started • Funded NASA AIST14 award to construct • SciSpark climate scenarios – Climate extremes / impact analysis and clustering – Mesoscale Convective Complex Search
  • 34. SKA/Astronomy • Go Where the Science is Best! • Deploy Where there is No Infrastructure • Reconfigure as Needed to Optimize Performance • Simplify by Using Raw Voltage Capture
  • 35. DARPA Memex • Domain Specific search of audio/video/media • Focused crawling; interactive crawling
  • 36. Conclusions • Lots of places in science and NASA for Spark • Great connections already • Existing Apache projects to integrate upstream • Downstream use cases • Come chat with me today!
  • 37. Thank you! chris.a.mattmann@nasa.gov, @chrismattmann/Twitter, http://sunset.usc.edu/~mattmann/ Credit: Vala Afshar, Extreme Networks