As customer data grows, you need tools that can process it to answer the questions that matter to the success of your business. Traditional data processing tools have served well in the past, but they don't scale to the volume, velocity, and variety of data available to drive decisions today. In addition, these tools required Salesforce customers to move data off-platform for processing. Salesforce provides a new tool, Data Pipeline, to help you process trillions of customer interactions on our trusted platform. Join us as we deep-dive into and demo the Data Pipeline solution and cover interesting customer use cases in big data processing.
Processing Big Data At-Scale in the App Cloud
1. Processing Big Data At Scale
Naren Chawla
Senior Director, Product Management (nchawla@salesforce.com)
Prashant Kommireddi @prashant1784
Leverage platform-native Data Pipelines for ETL
2. Safe Harbor
Safe harbor statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include, but are not limited to, risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent fiscal quarter. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
3. Topics
Big Data Processing Problem and Proposed Solution
Data Pipeline Deep-dive
Demo
Key Use-cases
Customer Stories
Summary
Q&A
4. Problem
(Diagram: data from ERP, HCM, SCM, and log sources flows outside the firewall into an external Data Lake / EDW, then back to the Customer Success Platform.)
1. Acquire & Store Data (Data Lake / EDW)
2. Prepare Data (Cleanse, Augment, Transform, Join)
3. Analyze (Wave)
4. Take Action (Customer Success Platform)
Pain points:
• Cost and complexity of managing external data platforms
• Slow time-to-value, poor support for ad-hoc analysis
• Inability to deliver high-value packaged analytic solutions
5. Solution
(Diagram: data from ERP, HCM, SCM, and logs/machine data flows through the firewall directly onto the platform.)
1. Acquire & Store Data (BigObjects)
2. Prepare Data (Data Pipelines / Async Query)
3. Analyze (Wave)
4. Take Action (Salesforce Apps)
Benefits:
• Greater ease-of-use, consistent end-to-end experience
• Greater flexibility and faster time-to-value
• Packaged analytic solutions
6. Data Pipelines Overview
Currently in Pilot
Data Pipelines: a programmatic language based on Apache Pig, plus whitelisted UDF libraries (Piggybank, DataFu). Pipelines generate MapReduce jobs that run on Hadoop for big data processing, with multi-tenant resource management, scheduling, and job monitoring and management.
Data Sources and Data Targets (the same object types can serve as either):
• SObjects
• BigObjects
• Wave Data Sets
• External Objects
• Files
• Archive Objects
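Since the overview above says pipelines are authored in a language based on Apache Pig, a hypothetical sketch of such a script may help. The object names, field names, and the ForceStorage load/store function below are illustrative assumptions drawn from the pilot era, not a confirmed API; consult the implementation guide linked later in the deck for the actual syntax.

```
-- Hypothetical Data Pipeline script (names are illustrative, not real API).
-- Read raw events from a BigObject (__b suffix is the BigObject convention).
events = LOAD 'force://GameEvent__b'
         USING gridforce.hadoop.pig.loadstore.func.ForceStorage()
         AS (UserId__c:chararray, Points__c:int);

-- The "Prepare Data" step: aggregate points per user.
by_user = GROUP events BY UserId__c;
totals  = FOREACH by_user GENERATE
              group AS UserId__c,
              SUM(events.Points__c) AS TotalPoints__c;

-- Write results back to a custom sObject for use in the app.
STORE totals INTO 'force://UserScore__c'
      USING gridforce.hadoop.pig.loadstore.func.ForceStorage();
```

This mirrors the gamification use case described later: trawl a large volume of event records and update user-facing objects, entirely on-platform.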
9. BigObjects vs. SObjects
SObjects | BigObjects
Use cases: CRM transactional data | Read-only immutable data
Data volumes: <50M rows | Billions of rows
Field types: All types | Strings, numbers, dates, JSON
Query: Real-time query response | Blend of real-time and asynchronous query response, determined by size of result set
Transactions: ACID transactions | Record-level consistency
Access management: Full sharing | User permissions and field-level security
APIs: Full support | SOQL, Async Query, Data Pipelines
Triggers: Full support | None
Reports: Full support | Limited CRTs
Search: Full support | None
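The APIs row above lists SOQL support for BigObjects. As a minimal sketch, here is how a client could build a standard REST API query request for a BigObject; the instance URL, object name, and fields are invented for illustration, while the `/services/data/<version>/query` resource is the standard REST query endpoint.

```python
from urllib.parse import urlencode

API_VERSION = "v34.0"  # Summer '15-era API version

def build_query_url(instance_url: str, soql: str) -> str:
    """Build a REST API query URL for a SOQL statement.

    The same endpoint serves SObjects and BigObjects; per the table
    above, large BigObject result sets may get an asynchronous
    response rather than a real-time one.
    """
    return f"{instance_url}/services/data/{API_VERSION}/query?{urlencode({'q': soql})}"

# Hypothetical BigObject (the __b suffix is the custom-BigObject convention).
url = build_query_url(
    "https://na1.salesforce.com",
    "SELECT UserId__c, Points__c FROM GameEvent__b WHERE Points__c > 100",
)
```

Sending the request with an OAuth bearer token is left out here, since authentication is unchanged from any other REST API call.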
12. Key Use Cases
• Native Big Data Processing: read from Big Objects, External Objects, or Files; write results to sObjects.
• Data Prep for Descriptive Analytics: prepare data from Big Objects, External Objects, or Files and load it into Wave.
• Data Enrichment to turn "Insight into Actions": write enriched results back to sObjects.
• Handling Semi-structured Data: JSON, HTML, XML, and other complex semi-structured data.
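The semi-structured use case above usually means flattening nested JSON records into flat fields before loading them into an sObject or a Wave dataset. A minimal, platform-independent sketch of that flattening step (the record shape is invented for illustration):

```python
import json

def flatten(record: dict, prefix: str = "") -> dict:
    """Recursively flatten nested dicts into dotted scalar fields."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

# A clickstream-style event payload (shape invented for illustration).
raw = '{"user": {"id": "005xx", "tier": "gold"}, "event": "click", "ts": 1438387200}'
row = flatten(json.loads(raw))
# row == {"user.id": "005xx", "user.tier": "gold", "event": "click", "ts": 1438387200}
```

In a real pipeline this step would run inside the Pig job via a whitelisted UDF library rather than in client-side Python; the logic is the same.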
13. Customer Stories
• Gamification: based on experience points, update user levels. Large-volume data processing (250M+ records); trawl the rewards and update user objects; plans to use analytics later.
• Computing Partner Scorecards: the scorecard determines status, which in turn determines pricing and the resources partners have access to assist in sales. Calculated multiple times every week for partner accounts (70h+).
• Asset Management: account assignment at the account/office/contact levels; would like to run daily.
• Analytics: correlate game-play data with customer interactions to improve customer retention, loyalty, etc.
• Analytics: multi-org consolidation; white-space analysis.
18. Summary & Next-Steps
Why Data Pipeline?
● Massive Parallelism (10-40X performance improvement)
● Overcome governor limits
● Work towards Data Lake Architecture
● Reduce complexity/cost - 100% Platform-Native
Resources
● Implementation Guide - http://docs.releasenotes.salesforce.com/en-us/summer15/release-notes/rn_forcecom_data_pipelines.htm
Join the Pilot Program
Any questions: nchawla@salesforce.com
20. And make any adjustments needed before loading.
FUTURE
21. BigObjects
• New object type optimized for extremely large row counts
• Use cases: read-only data from external systems, point-of-sale data, connected product event data, clickstream data, etc.
• Backed by HBase as a system of record
• Integrated into the platform via the External sObject framework, Phoenix, and Pliny
(Diagram: the platform reaches HBase through Phoenix (SQL) and Pliny (SOQL); BigObjects surface as External SObjects.)
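The diagram above shows Phoenix providing the SQL layer over HBase. As a hedged illustration only, Phoenix-level SQL over such a table might look like the sketch below; the table and column names are invented, and platform users would reach this data through SOQL rather than querying Phoenix directly.

```sql
-- Illustrative Apache Phoenix SQL over an HBase-backed table
-- (names invented; customers use SOQL, not Phoenix, on the platform).
SELECT user_id, COUNT(*) AS clicks
FROM clickstream_events
WHERE event_ts > TO_DATE('2015-06-01')
GROUP BY user_id;
```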
22. Data Pipelines Overview
Data Pipelines: a programmatic language based on Apache Pig, plus whitelisted UDF libraries (Piggybank, DataFu). Declarative tooling for admins and analysts (Wave, Dev Console, Setup) generates Data Pipelines, which in turn generate MapReduce jobs for data processing on multi-tenant Hadoop, with resource management, scheduling, and job monitoring and management.
Data Sources and Data Targets (the same object types can serve as either):
• SObjects
• BigObjects
• Wave Data Sets
• External Objects
• Files
• Archive Objects
Data Set Objects: snapshot for provenance tracking.
(Draft notes on this slide: remove the Data Sets Object element; bring declarative tooling later.)
23. Customer Stories
CloudApps: CloudApps increases organisational performance by enabling, encouraging, enhancing, and measuring behavioural change using gamification. Use cases: large-volume data processing (250M+ records); trawl the rewards and update user objects; plans to use analytics later.
EMC (Computing, Partner Scorecards): Business Partner scorecards help partners track whether they qualify for a particular Partner Tier status (Gold, Silver, Platinum). Tier status determines pricing and the resources partners have access to assist in sales. Scorecards are calculated multiple times every week for Partner Accounts; this takes 70h+ to calculate. While being processed, scorecards are zeroed out and a partner cannot see the details of why they are in a certain status. To process them in a shorter window (~10h), they've reduced the total number of Partner Accounts that qualify for the Business Partner program from 22K to 780.
Legg Mason (Asset Management): Legg Mason has built an internal process to update account assignments at the account/office/contact levels. They would like to run it more frequently, but the async Batch Apex process causes them to hit several limits, preventing them from running it daily.
Activision (Video Game Developer): Activision wants to correlate game-play data with customer interactions to improve customer retention, loyalty, etc. They currently load game-play data every two weeks and would like to do so daily, plus use Pipelines to join game-play data with Case records and use Analytics to drive insight (for example, the impact of a service issue on gaming behaviour).
FinancialForce (ERP on Platform): FinancialForce receives files in emails and has to do manual downstream processing to generate invoices, etc., based on these incoming files. They want to leverage Pipelines to scale and automate some of those steps.
USPS (Business Transformation): USPS wants to combine CRM data with external data (from Equifax) to marry a user's physical address with their digital identity. They expect 500 million external records, and they will build transformational applications based on this data (for example, Twitter handles on envelopes, Uber for
24. Data Pipelines Roadmap (WORK ON THIS SLIDE)
Summer '15 (Pilot II):
- Metadata API
- Simple monitoring
- Dev Console integration
- Logging improvements
- Deployment to HBase servers
Winter '16 / DF15 (Pilot III):
- Spark for internal customers
- Wave connectors
- Better error handling
- Monitoring improvements
- Basic limits
Spring '16 (GA, stretch goal):
- Resource management
- Scheduler
- Performance / optimization
- Hardening
25. BigObjects (Salesforce.com Confidential)
• New object type optimized for extremely large row counts
• Targeted functionality
• Use cases: read-only data from external systems, point-of-sale data, connected product event data, clickstream data, etc.
• Backed by HBase as a system of record
• Integrated into the platform via the External sObject framework, Phoenix, and Pliny
(Diagram: the platform reaches HBase through Phoenix (SQL) and Pliny (SOQL); BigObjects surface as External SObjects.)
26. BigObjects vs. SObjects (Salesforce.com Confidential)
SObjects | BigObjects
Use cases: CRM transactional data | Write-once / read-only data from external systems, point-of-sale data, connected product event data, clickstream data, etc.
Data volumes: <50M rows | Billions of rows
Field types: All types | Strings, numbers, dates
Query: Real-time query response | Blend of real-time and asynchronous query response, determined by size of result set
Transactions: ACID transactions | Eventually consistent
Access management: Full sharing | Object-permission based; sharing descriptors in future
APIs: Full support | REST, SOQL, Bulk
Triggers: Full support | None
Reports: Full support | Limited CRTs
Search: Full support | None