Big Data Analytics Projects - Real World with Pentaho
1. Big Data Analytics Projects in the Real World
Mark Kromer
Pentaho Big Data Analytics Product Manager
@mssqldude
@kromerbigdata
http://www.kromerbigdata.com
2. 1. The Big Data Technology Landscape
2. Big Data Analytics
3. Big Data Analytics Scenarios:
❯ Digital Marketing Analytics
• Hadoop, Aster Data, SQL Server
❯ Sentiment Analysis
• MongoDB, SQL Server
❯ Data Refinery
• Hadoop, MPP, SQL Server, Pentaho
4. SQL Server in the Big Data world (Quasi-Real World)
What we’ll (try to) cover today
3. Big Data 101
3 V’s
❯ Volume – Terabyte records, transactions, tables, files
❯ Velocity – Batch, near-time, real-time (analytics), streams.
❯ Variety – Structured, unstructured, semi-structured, and all of the above in a mix
Text Processing
❯ Techniques for processing and analyzing unstructured (and structured) LARGE files
Analytics & Insights
Distributed File System & Programming
4. Big Data ≠ NoSQL
❯ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing
❯ Facebook, for example, uses HBase from the Hadoop stack
❯ NoSQL does not have to be Big Data
Big Data ≠ Real Time
❯ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value
❯ Use in-memory analytics for real time insights
Big Data ≠ Data Warehouse
❯ I still refer to large multi-TB DWs as “VLDB”
❯ Big Data is about crunching stats in text files for discovery of new patterns and insights
❯ Use the DW to aggregate and store the summaries of those calculations for reporting
Mark’s Big Data Myths
5. Hadoop 1.x
• Batch Processing
• Commodity Hardware
• Data Locality, no shared storage
• Scales linearly
• Great for large text file processing, not so great on small files
• Distributed programming paradigm
11. Apache Spark
High-Speed In-Memory Analytics over Hadoop
● Open Source
● Alternative to MapReduce for certain applications
● A low latency cluster computing system
● For very large data sets
● May be 100 times faster than MapReduce for
– Iterative algorithms
– Interactive data mining
● Used with Hadoop / HDFS
● Released under the Apache License 2.0 (early, pre-Apache releases used a BSD license)
16. Sentiment Analysis Reference Architecture 2
[Diagram. Components: Social Media Sources; Data Orchestration; Big Data Platforms (Hadoop, PDW, MongoDB); Data Models; Analytical Models; OLAP Cubes; Data Mining; OLAP Analytics Tools; Reporting Tools; Dashboards]
18. Big Data Analytics Core Tenets
• Distributed Data (Data Locality)
❯ HDFS / MapReduce
❯ YARN / Tez
❯ Replicated / Sharded Data
• MPP Databases
❯ Vertica, Aster, PDW, Greenplum … In-database analytics that can scale out with distributed processing across nodes
• Distributed Analytics
❯ SAS: “Quickly solve complex problems using big data and sophisticated analytics in a distributed, in-memory and parallel environment.”
http://www.sas.com/resources/whitepaper/wp_46345.pdf
• In-memory Analytics
❯ Microsoft PowerPivot (Tabular models)
❯ SAP HANA
❯ Tableau
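19. MapReduce Framework (Map)
using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;
public class TotalHitsForPageMap : MapperBase
{
    // Note: expected (field count), pagePos (index of the page field) and hit
    // (the value emitted per record) are not defined on the slide; they must
    // be declared as members for this class to compile.
    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        // Split the log line on runs of whitespace
        var parts = Regex.Split(inputLine, @"\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        // Emit one hit keyed by the requested page
        context.EmitKeyValue(parts[pagePos], hit);
    }
}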
20. MapReduce Framework (Reduce & Job)
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;
public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    // Sum the hit values emitted by the mapper for each page
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}
public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        // Input and output locations are read from environment variables
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true;
        return retVal;
    }
}
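The map, shuffle and reduce phases used by the job above can be sketched in a few lines of plain Python, with no Hadoop involved. This is only an illustration of the data flow: the three-field log layout, the `EXPECTED_FIELDS` and `PAGE_POS` constants and the sample lines are assumptions made for the example, not values from the original job.

```python
import re
from collections import defaultdict

# Assumed log layout for illustration: date, page, status (whitespace separated).
EXPECTED_FIELDS = 3
PAGE_POS = 1


def map_line(line):
    """Map step: emit (page, 1) for each well-formed log line."""
    parts = re.split(r"\s+", line.strip())
    if len(parts) != EXPECTED_FIELDS:
        return []  # skip records that do not have all values
    return [(parts[PAGE_POS], 1)]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_hits(key, values):
    """Reduce step: sum the hit counts for one page."""
    return key, sum(values)


def total_hits(lines):
    """Run map, shuffle and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_hits(k, v) for k, v in shuffle(pairs).items())


sample_log = [
    "2013-05-07 /index.html 200",
    "2013-05-07 /about.html 200",
    "2013-05-07 /index.html 304",
    "malformed line",
]
print(total_hits(sample_log))
```

On a real cluster the shuffle is done by the framework and the map and reduce calls run on different nodes; here everything runs in one process so the flow is easy to follow.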
21. Linux shell commands to access data in HDFS
Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv
List files in HDFS:
c:\Hadoop>hadoop fs -ls /import
Found 1 items
-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv
View file in HDFS:
c:\Hadoop>hadoop fs -cat /import/sales.csv
Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
Now, we can work on the data with MapReduce, Hive, Pig, etc.
Get Data into Hadoop
22. Use Hive for Data Schema and Analysis
create external table ext_sales
(
  lastname string,
  productid int,
  quantity int,
  sales_amount float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/makromer/hiveext/input';
LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
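Once the schema is in place, the analysis step is ordinary SQL. The sketch below uses Python's built-in SQLite as a stand-in for Hive (an assumption made so the example runs anywhere) and loads the seven rows from the sales.csv listing shown earlier. The GROUP BY query is an illustrative example of the kind of analysis meant, not a query taken from the deck.

```python
import sqlite3

# Rows copied from the sales.csv listing: lastname, productid, quantity, sales_amount.
ROWS = [
    ("Kromer", 123, 5, 55.0),
    ("Smith", 567, 1, 25.0),
    ("Jones", 123, 9, 99.0),
    ("James", 11, 12, 1.0),
    ("Johnson", 456, 2, 2.5),
    ("Singh", 456, 1, 3.25),
    ("Yu", 123, 1, 11.0),
]


def sales_by_product(rows):
    """Total sales_amount per productid, using SQLite in place of Hive."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ext_sales "
        "(lastname TEXT, productid INTEGER, quantity INTEGER, sales_amount REAL)"
    )
    conn.executemany("INSERT INTO ext_sales VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT productid, SUM(sales_amount) FROM ext_sales "
        "GROUP BY productid ORDER BY productid"
    ).fetchall()
    conn.close()
    return result


print(sales_by_product(ROWS))
```

The same SELECT statement should be valid HiveQL against ext_sales, where Hive would compile it into distributed jobs rather than run it in-process.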
23. Sqoop: Data transfer to & from Hadoop & SQL Server
sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1
> hadoop fs -cat /user/mark/customers/part-m-00000
> 5,Bob Smith
sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
24. Role of NoSQL in a Big Data Analytics Solution
‣ Use NoSQL to store data quickly without the overhead of RDBMS
‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few
‣ Why NoSQL?
‣ In the world of “Big Data”
‣ “Schema later”
‣ Ignore ACID properties
‣ Drop data into key-value store quick & dirty
‣ Worry about query & read later
‣ Why NOT NoSQL?
‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface
‣ SQL Server and NoSQL
‣ Not a natural fit
‣ Use HDFS or your favorite NoSQL database
‣ Consider turning off SQL Server locking mechanisms
‣ Focus on writes, not reads (read uncommitted)
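The "schema later" idea can be sketched as follows: the write path stores whatever arrives, and a schema is applied only when the data is read. This is a minimal in-memory illustration; `raw_store`, `write_event`, `read_events` and the sample documents are hypothetical names and data invented for the example, not part of any NoSQL product's API.

```python
import json

# Hypothetical raw store: documents are kept as-is, with no schema enforced on write.
raw_store = []


def write_event(raw_json):
    """Write path: append the raw document without validating its shape."""
    raw_store.append(raw_json)


def read_events(fields):
    """Read path: apply the schema late, projecting only the requested fields."""
    for raw in raw_store:
        doc = json.loads(raw)
        # Fields missing from a document come back as None instead of failing
        yield {name: doc.get(name) for name in fields}


write_event('{"user": "a", "text": "great product", "lang": "en"}')
write_event('{"user": "b", "text": "not happy"}')
write_event('{"user": "c", "rating": 4}')

rows = list(read_events(["user", "text"]))
print(rows)
```

The trade-off is visible in the third document: writes never fail, but every reader has to cope with missing or unexpected fields.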
25. MongoDB and Enterprise IT Stack
[Diagram: enterprise IT stack. Layers: Applications (CRM, ERP, Collaboration, Mobile, BI); Data Management (Online Data, Offline Data; RDBMS, EDW, Hadoop); Infrastructure (OS & Virtualization, Compute, Storage, Network). Also shown: Management & Monitoring, Security & Auditing]
27. Text Search Example
(e.g. address typo so do fuzzy match)
// Text search for address filtered by first name and NY
> db.ticks.runCommand(
    "text",
    { search: "vanderbilt ave. vander bilt",
      filter: { name: "Smith",
                city: "New York" } })
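For comparison, approximate matching on a mistyped address can be sketched with Python's standard difflib module. This is a different technique from the MongoDB text command above (it scores character-level similarity rather than searching indexed terms), and the address list and `fuzzy_lookup` helper are made up for the illustration.

```python
import difflib

# Hypothetical list of known street names.
addresses = ["Vanderbilt Ave", "Madison Ave", "Park Ave", "Lexington Ave"]


def fuzzy_lookup(query, candidates, cutoff=0.6):
    """Return candidates whose similarity to the query is at least `cutoff`."""
    # Compare case-insensitively, but return the original spelling
    lowered = {c.lower(): c for c in candidates}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=3, cutoff=cutoff
    )
    return [lowered[m] for m in matches]


# A typo with a stray space should still find the intended street first
print(fuzzy_lookup("vander bilt ave", addresses))
```

Raising `cutoff` makes the match stricter; lowering it admits more distant candidates.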
28. Aggregate: Total Value of Accounts
// Find total value of each customer's accounts for a given RM (or Agent) sorted by value
db.accts.aggregate(
  { $match: { relationshipManager: "Smith" } },
  { $group:
    { _id: "$ssn",
      totalValue: { $sum: "$value" } } },
  { $sort: { totalValue: -1 } } )
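What the three pipeline stages compute can be spelled out in plain Python over a list of dictionaries. The account documents below are invented sample data; only the field names (`ssn`, `relationshipManager`, `value`) come from the query.

```python
from collections import defaultdict

# Hypothetical account documents; field names mirror the aggregate query.
accts = [
    {"ssn": "111", "relationshipManager": "Smith", "value": 100.0},
    {"ssn": "222", "relationshipManager": "Smith", "value": 250.0},
    {"ssn": "111", "relationshipManager": "Smith", "value": 50.0},
    {"ssn": "333", "relationshipManager": "Jones", "value": 900.0},
]


def total_value_by_customer(docs, manager):
    """Mimic $match, $group with $sum, and a descending $sort."""
    totals = defaultdict(float)
    for doc in docs:
        if doc["relationshipManager"] == manager:  # $match
            totals[doc["ssn"]] += doc["value"]     # $group with $sum
    grouped = [{"_id": ssn, "totalValue": total} for ssn, total in totals.items()]
    # $sort: { totalValue: -1 } means largest total first
    return sorted(grouped, key=lambda d: d["totalValue"], reverse=True)


print(total_value_by_customer(accts, "Smith"))
```

In MongoDB the same work happens inside the server, so only the grouped totals travel back to the client.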
29. SQL Server Big Data – Data Loading
Amazon HDFS & EMR Data Loading
Amazon S3 Bucket
30. SQL Server Database
❯ SQL 2012 Enterprise Edition
❯ Page Compression
❯ 2012 Columnar Compression on Fact Tables
❯ Clustered Index on all tables
❯ Auto-update statistics asynchronously
❯ Partition Fact Tables by month and archive data with sliding window technique
❯ Drop all indexes before nightly ETL load jobs
❯ Rebuild all indexes when ETL completes
SQL Server Analysis Services
❯ SSAS 2012 Enterprise Edition
❯ 2008 R2 OLAP cubes partition-aligned with DW
❯ 2012 cubes as in-memory tabular models
❯ All access through MSMDPUMP or SharePoint
SQL Server Big Data Environment
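The sliding window technique mentioned for the fact tables can be pictured as a fixed-length queue of monthly partitions: each period a new month is added at the front of the window and the oldest month is moved out to the archive. The sketch below models only that bookkeeping in Python; the `SlidingWindow` class and the three-month window are assumptions for illustration, and in SQL Server the actual work would be done with partition management statements rather than application code.

```python
from collections import deque
from datetime import date


def next_month(d):
    """First day of the month after d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


class SlidingWindow:
    """Keeps a fixed number of monthly partitions online; older ones are archived."""

    def __init__(self, first_month, months_online):
        self.online = deque()
        self.archived = []
        month = first_month
        for _ in range(months_online):
            self.online.append(month)
            month = next_month(month)

    def slide(self):
        """Add a partition for the next month and archive the oldest one."""
        new_month = next_month(self.online[-1])
        self.online.append(new_month)
        oldest = self.online.popleft()
        self.archived.append(oldest)
        return oldest, new_month


window = SlidingWindow(date(2013, 1, 1), months_online=3)
print(window.slide())
print(list(window.online))
```

Because the number of online partitions stays constant, nightly loads and index rebuilds touch a bounded amount of data no matter how much history has accumulated.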
32. Pentaho Big Data Analytics
[Diagram. Users: DBA; ETL/BI Developer; Business Users & Executives; Analysts & Data Scientists. Pentaho Business Analytics: Enterprise & Interactive Reporting; Interactive Analysis; Dashboards; Predictive Analytics. Data Integration: Instaview | Visual MapReduce. Direct access to: Operational Data; Big Data; Data Stream; Public/Private Clouds]
33. Pentaho Big Data Analytics
Accelerate the time to big data value
• Full continuity from data access to decisions – complete data integration & analytics for any big data store
• Faster development, faster runtime – visual development, distributed execution
• Instant and interactive analysis – no coding and no ETL required
34. Product Components
Pentaho Data Integration
• Visual development for big data
• Broad connectivity
• Data quality & enrichment
• Integrated scheduling
• Security integration
Pentaho Analyzer
• Visual data exploration
• Ad hoc analysis
• Interactive charts & visualizations
Pentaho Dashboards
• Self-service dashboard builder
• Content linking & drill through
• Highly customized mash-ups
Pentaho Data Mining & Predictive Analytics
• Model construction & evaluation
• Learning schemes
• Integration with 3rd-party models using PMML
Pentaho Enterprise & Interactive Reports
• Both ad hoc & distributed reporting
• Drag & drop interactive reporting
• Pixel-perfect enterprise reports
Pentaho for Big Data MapReduce & Instaview
• Visual interface for developing MR
• Self-service big data discovery
• Big data access to Data Analysts
35. Pentaho Interactive Analysis & Data Discovery
Highly Flexible Advanced Visualizations
❯ Simple, easy-to-use visual data exploration
❯ Web-based thin client; in-memory caching
❯ Rich library of interactive visualizations
• Geo-mapping, heat grids, scatter plots, bubble charts, line over bar and more
• Pluggable visualizations
❯ Java ROLAP engine to analyze structured and unstructured data, with SQL dialects for querying data from RDBMSs
❯ Pluggable cache integrating with leading caching architectures: Infinispan (JBoss Data Grid) & Memcached