This is my presentation from Tableau Conference #Data14, given as part of the Cloudera Customer Showcase: How Concur Uses Big Data to Get You to Tableau Conference On Time. We discuss Hadoop, Hive, Impala, and Spark within the context of Consolidation, Visualization, Insight, and Recommendation.
How Concur uses Big Data to get you to Tableau Conference On Time
1.
2. How Concur Uses Big Data to
Get You to TC On Time
Denny Lee
Senior Director, Data Sciences Engineering
3. About Concur
What do we do?
• Leading provider of spend management solutions and
services (Travel, Invoice, TripIt, etc.) in the world
• Global customer base of 20,000 clients and 25 million
users
• Processing more than $50 Billion in Travel & Expense
(T&E) spend each year
4. About the Speaker
Who Am I?
• Long time SQL Server BI
guy (24TB Yahoo! Cube)
• Project Isotope (Hadoop
on Windows and Azure)
• At Concur, helping with
Big Data and Data
Sciences
5. Is Big Data ….
The most overused buzzword today?
An actually useful framework?
Yes!
9. A long time ago…
• We started using Hadoop because
• It was free
• i.e. Didn’t want to pay for a big data warehouse
• Could slowly extract from hundreds of relational data
sources, consolidate the data, and query it
• We were not thinking about advanced analytics
• We were thinking …. “cheaper reporting”
• We have some hardware lying around … let’s cobble it
together and now we have reports
10. But why Hadoop?
• Even with primarily relational systems, we dealt with
hundreds of sources
• Getting Tableau or any BI tool to connect to so many
sources is … not fun
• More often than not, we needed to understand a subset
or aggregate of this data - not all of the data!
• Can use Pig to process, extract, and filter the data
• Can use Hive - a SQL-like query language - to query my
data
20. Instead, choose Extract, which brings the data across from Hive so that queries
run against the local extract in Tableau rather than live against Hive. Note, the extraction will take a long time too!
21. Now that the data is in Tableau, I can pivot, slice, and filter at the speed of thought!
22. Can quickly switch to map mode and determine where most itineraries are from in 2013
24. Evolution of Hive
• Hive, built originally by Facebook, placed
a SQL-like query language in front of
Hadoop Map-Reduce.
• It has Map-Reduce's flexibility, but also
its overhead and complexity
• The Apache community is working on the
Hive Stinger project to advance Hive,
including a DAG scheduler, an optimized
columnar format, and improved engine
semantics
28. But notice the query runs significantly faster in Impala!
29. Not just limit-10-style queries, but ones that involve more complicated
where clauses
30. And quickly chart out the results - e.g. highest airport in Taiwan is
Sun Moon Lake
31. Or even quickly map out the airport locations on a map to see that Sun Moon
Lake Airport is in the center of Taiwan
32. And using Impala is not just for Hue
- it's even better in Tableau
33. Now I can connect to my data live and have fast queries returned to Tableau
34. After quickly modifying the data within Tableau, we can see the number of flight
delays to Seattle, and note that San Jose has the fewest delays
35. Why Impala?
• Focus is to speed up BI queries
• Analogous to relational BI tools except
now I can do this against a distributed
cluster
• Like special-purpose relational BI
engines, it can apply many
optimizations to improve speed
• But note this demo ran against the
same Hive table, with the data stored in
Hadoop
37. Using AtScale to build up a dimensional model based on the data that is
stored within Impala / Hive
38. Slice and filter the Impala model using Tableau
For more info, check out: http://atscale.com/
39. Data Extraction
How to query multiple endpoints or multiple data sources?
Set up a whole bunch of VMs and have someone connect to
each one and execute GET commands?
40. Optimizing Data Extraction
Use Hadoop streaming to execute a Python script that performs the GET
Hadoop will generate a task for each API GET call and then execute
them across all the nodes in the cluster in parallel
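A minimal sketch of what such a streaming mapper could look like. The endpoint format, the `fetch_endpoint` and `to_kv` helpers, and the key/value layout are all illustrative assumptions, not Concur's actual code:

```python
# Hypothetical Hadoop streaming mapper: each input line is an API endpoint
# URL; the mapper performs one GET per line and emits "key<TAB>value"
# records on stdout, the format Hadoop streaming expects.
import json
import sys
from urllib.request import urlopen  # assumes simple unauthenticated GETs

def to_kv(url, payload):
    # One tab-separated output record per fetched endpoint.
    return "%s\t%s" % (url, json.dumps(payload, sort_keys=True))

def fetch_endpoint(url):
    # Illustrative fetch; a real job would add auth, retries, and timeouts.
    with urlopen(url) as resp:
        return json.loads(resp.read())

def run(lines, fetch=fetch_endpoint, out=sys.stdout):
    # Hadoop launches many copies of this mapper, each fed a slice of the
    # endpoint list on stdin, so the GETs execute in parallel across nodes.
    for line in lines:
        url = line.strip()
        if url:
            out.write(to_kv(url, fetch(url)) + "\n")
```

The real script would end with a call to `run(sys.stdin)` and be submitted to the cluster via the Hadoop streaming jar with the endpoint list as input.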
43. What is Apache Spark?
Fast and general cluster computing system
interoperable with Hadoop
Improves efficiency through:
»In-memory computing primitives
»General computation graphs
Improves usability through:
»Rich APIs in Scala, Java, Python
»Interactive shell
Up to 100× faster
(2-10× on disk)
2-5× less code
44. Project History
Started in 2009, open sourced 2010
30+ companies now contributing code
»Databricks, Yahoo!, Intel, Adobe, Cloudera, Bizo,
…
One of the largest communities in big data
45. A General Stack
The Spark core underpins several higher-level libraries:
• Spark Streaming - real-time
• Shark - SQL
• GraphX - graph
• MLlib - machine learning
• …
47. Starbucks Store #3313
601 108th Ave NE
Bellevue, WA (425) 646-9602
-------------------------------
Chk 713452
05/14/2014 11:04 AM
1961558 Drawer: 1 Reg: 1
-------------------------------
Bacon Art Brkfst 3.45
Warmed
T1 Latte 2.70
Triple 1.50
Soy 0.60
Gr Vanilla Mac 4.15
Reload Card 50.00
AMEX $50.00
XXXXXXXXXXXXXXXXXX1004
SBUX Card $13.56
SUBTOTAL $62.40
New Caffe Espresso
Frappuccino(R) Blended beverage
Our Signature
Frappuccino(R) roast coffee and
fresh milk, blended with ice.
Topped with our new espresso
whipped cream and new
Italian roast drizzle
Expense Categorization
One of my receipts that I had OCRed
One of the issues we’re trying to solve
is to auto-categorize this, so how
can we do this?
Below is a simplistic solution using
WordCount
Note, a real solution should involve
machine learning algorithms
48. Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
2014-09-07 22:31:21.064 java[1871:15527] Unable to load realm info from
SCDynamicStore
14/09/07 22:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
Spark context available as sc.
scala> val receipt =
sc.textFile("/usr/local/Cellar/workspace/data/receipt/receipt.txt")
receipt: org.apache.spark.rdd.RDD[String] =
/usr/local/Cellar/workspace/data/receipt/receipt.txt MappedRDD[1] at textFile
at <console>:12
scala> receipt.count
res0: Long = 30
49. scala> val words = receipt.flatMap(_.split(" "))
words: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[2] at flatMap at
<console>:14
scala> words.count
res1: Long = 161
scala> words.distinct.count
res2: Long = 72
scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ +
_).map{case(x,y) => (y,x)}.sortByKey(false).map{case(i,j) => (j, i)}
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[12] at map at
<console>:16
scala> wordCounts.take(12)
res5: Array[(String, Int)] = Array(("",82), (with,2),
(Card,2), (new,2), (-------------------------------
,2), (Frappuccino(R),2), (roast,2), (1,2), (and,2),
(New,1), (Topped,1), (Starbucks,1))
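Those word counts could feed a very naive categorizer. Here is a minimal plain-Python sketch of the idea; the category keyword lists are invented purely for illustration, and as noted above a real solution would use machine learning:

```python
from collections import Counter

# Invented keyword lists for illustration only; a real system would learn
# these associations rather than hard-code them.
CATEGORY_KEYWORDS = {
    "Meals":  {"latte", "espresso", "frappuccino", "brkfst", "milk"},
    "Travel": {"flight", "airline", "hotel", "taxi"},
}

def categorize(receipt_text):
    # Same word-count idea as the Spark shell session above, in plain
    # Python: tokenize, count, then score each category by keyword hits.
    counts = Counter(receipt_text.lower().split())
    scores = {cat: sum(counts[w] for w in kws)
              for cat, kws in CATEGORY_KEYWORDS.items()}
    return max(scores, key=scores.get)
```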
50. Still beta, but we can connect from Tableau to SparkSQL using the Shark driver
51. Can / will be able to connect to this SparkSQL live
53. SparkSQL - What’s Next?
• Currently makes use of the Hive code-base
• Major focus for 1.2
• Pluggable external data sources
• Easier access through a pure SQL
interface
• Access things like JSON tables
through SQL?
55. Invite
• Pacific Northwest Cloudera User Group
• http://bit.ly/1uFD6vJ
• Doug Cutting, Hadoop Co-Creator, will be speaking at
Disney on 9/24
• Seattle Spark Meetup
• http://bit.ly/1q4Z0Ke
• Next sessions:
• Deep Dive into Spark and Mesos Internals
• Unlocking your Hadoop data with Apache Spark
and CDH5