Have you ever been involved in developing a strategy for loading, extracting, and managing large amounts of data in salesforce.com? Join us to learn multiple solutions you can put in place to help alleviate large data volume concerns. Our architects will walk you through scenarios, solutions, and patterns you can implement to address large data volume issues.
Large Data Management Strategies
1. Managing Large Data Volumes
Suchin Rengan, Director, Salesforce Services
@Sacrengan
Mahanthi Gangadhar, Senior Solutions Technical Architect, Salesforce Services
2. Safe harbor
Safe harbor statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties
materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results
expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be
deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other
financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any
statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new
functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our
operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any
litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our
relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our
service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to
larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is
included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent
fiscal quarter. These documents and others containing important disclosures are available on the SEC Filings section of the Investor
Information section of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently
available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions
based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these
forward-looking statements.
3. We do a lot of things with data…
Data Creation (Loads / Manual)
Data Integration (Out and In)
Data Extracts
Data Searches / Reporting / List Views
Data Archival
Am I using the platform's features optimally?
How do we ensure we keep up with performance?
What factors do we need to consider across each topic?
How can I ensure I have a scalable process?
4. SFDC Cloud Computing for the Enterprise
Infrastructure Services: Network Storage, Operating System, Database, App Server, Web Server, Data Center
Application Services: Security & Sharing, Integration, Customization, Web Services API, Multi-Language
Operations Services: Authentication, Availability, Monitoring, Patch Mgmt, Upgrades, Backup / NOC
Innovation / Development: Data Model, Business Logic, User Interface
Customer avoids:
Tuning OS, capacity management
Tuning web servers, certificates, log file mgmt, etc.
Tuning app servers, threads, Java stack, memory/log mgmt
Tuning DB servers, memory mgmt, disk distribution
Network management, bandwidth
5. Let's understand the underlying platform
[Diagram: your system connects through the API; inside the platform, Apex, users, and objects (Account, custom objects, …) interact.]
7. Considerations at every layer and level
Logic / Application Layer: Apex, Triggers, VF, Workflow Rules, Validation Rules, API, Sandbox
Storage Layer: File Storage, Data Storage, Data Objects, Sharing Tables, Indexes, Skinny Tables
8. SOAP/REST API
SOAP API
Real time
Relatively slow for large volumes
Loads of up to 250K records
Batch size
Time out / failure risk
Bulk API
Batches-per-day limit
Parallel mode
Larger than 250K records
Rolling 24-hour clock for available batches
Time to download results
9. Bulk API – Asynchronous Process
Client sends all data; it is streamed to temporary storage and split into data batches.
Job status is updated in Admin Setup.
Processing servers dequeue batches and the dataset is processed in parallel (one thread per batch): insert/update, then save results.
Client checks status and retrieves results. (A sketch of this flow follows.)
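To make the flow above concrete, here is a minimal client sketch against the Bulk API 1.0 REST endpoints (create job, submit a CSV batch, close, poll, retrieve results). The instance URL, session id, API version, and sample CSV are placeholders, not values from the deck; a real client would obtain the session via OAuth or a SOAP login() call and handle Failed batches and retries properly.

# Minimal Bulk API 1.0 load flow: create job -> submit batch -> close -> poll -> results.
import time
import requests

INSTANCE = "https://yourInstance.salesforce.com"   # placeholder
API = f"{INSTANCE}/services/async/47.0"
HEADERS = {"X-SFDC-Session": "<session id>"}       # placeholder

def first_id(xml_text):
    # crude XML parse, kept short for the sketch
    return xml_text.split("<id>")[1].split("</id>")[0]

job_xml = ('<?xml version="1.0" encoding="UTF-8"?>'
           '<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">'
           '<operation>insert</operation><object>Account</object>'
           '<contentType>CSV</contentType></jobInfo>')
job_id = first_id(requests.post(f"{API}/job", data=job_xml,
    headers={**HEADERS, "Content-Type": "application/xml; charset=UTF-8"}).text)

# Each batch is one CSV payload (up to 10,000 records).
csv_batch = "Name\nAcme 1\nAcme 2\n"               # placeholder data
batch_id = first_id(requests.post(f"{API}/job/{job_id}/batch", data=csv_batch,
    headers={**HEADERS, "Content-Type": "text/csv; charset=UTF-8"}).text)

# Close the job so Salesforce knows no more batches are coming.
close_xml = ('<?xml version="1.0" encoding="UTF-8"?>'
             '<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">'
             '<state>Closed</state></jobInfo>')
requests.post(f"{API}/job/{job_id}", data=close_xml,
    headers={**HEADERS, "Content-Type": "application/xml; charset=UTF-8"})

# Poll the batch state, then retrieve the per-record results (Id,Success,Created,Error).
while True:
    state = requests.get(f"{API}/job/{job_id}/batch/{batch_id}", headers=HEADERS).text
    if "<state>Completed</state>" in state or "<state>Failed</state>" in state:
        break
    time.sleep(10)
print(requests.get(f"{API}/job/{job_id}/batch/{batch_id}/result", headers=HEADERS).text)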
10. Bulk API
The "go-to" option for tens of thousands of records and up
Up to 10,000 records in a batch file
Asynchronous loading, tracked in Setup's "Monitor Bulk Data Load Jobs" section
Walkthrough time!
Example: American Insurance Co.: 230 million records processed in 33 hours, 14 hours ahead of schedule
11. Some tips for loading
• Difficult to extrapolate performance in a Sandbox: production is at par or better
• Sharing calculations
• Indexing
• File storage
• Triggers – act judiciously
• Upserts – avoid them!
• Parent references
12. Sequence of Events/Logic
Database Layer: Begin Transaction
System Validation
APEX Layer: Before Trigger
Custom Validation
APEX Layer: After Trigger
Assignment Rules
Auto-response Rules
Workflow Rules
Escalation Rules
Parent Rollup Summary
Database Layer: End Transaction
13. Initial Load – Incremental or Big Bang
Option 1 (incremental): load Obj 1, Obj 2, Obj 3 one at a time
Option 2 (big bang): load all objects together
Phases in either option: Pre-Implementation Activities → Initial Load → Validation → Catch-up and Ongoing Sync → User Activation
16. Data Extraction – Bulk Query Current Limitations
Bulk query works exactly like data loads:
Create a job (Job Id)
Each query is a batch (Batch Id)
Close the job and fetch results when the job is complete
Limitations
Query optimizer has 100 minutes of processing time (timeout issue)
Informatica currently does not support bulk query
Other tools like Data Loader can submit only one query
Currently requires a custom client for submitting multiple queries for a given job (see the sketch below)
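A sketch of such a custom client, submitting several SOQL queries as batches of a single Bulk API 1.0 query job. For query jobs the batch body is the SOQL text itself; the credentials, object, and `Chunk__c` filter field are placeholders.

# Hypothetical custom client: several query batches under one bulk query job.
import requests

API = "https://yourInstance.salesforce.com/services/async/47.0"   # placeholder
HEADERS = {"X-SFDC-Session": "<session id>"}                      # placeholder

job_xml = ('<?xml version="1.0" encoding="UTF-8"?>'
           '<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">'
           '<operation>query</operation><object>Account</object>'
           '<contentType>CSV</contentType></jobInfo>')
job_id = requests.post(f"{API}/job", data=job_xml,
    headers={**HEADERS, "Content-Type": "application/xml; charset=UTF-8"}
    ).text.split("<id>")[1].split("</id>")[0]

# One batch per query chunk; the batch payload is just the SOQL string.
queries = [
    "SELECT Id, Name FROM Account WHERE Chunk__c = 1",   # hypothetical chunk field
    "SELECT Id, Name FROM Account WHERE Chunk__c = 2",
]
batch_ids = []
for q in queries:
    r = requests.post(f"{API}/job/{job_id}/batch", data=q,
                      headers={**HEADERS, "Content-Type": "text/csv; charset=UTF-8"})
    batch_ids.append(r.text.split("<id>")[1].split("</id>")[0])
# Then close the job, poll each batch, and download its result files (slide 18).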
17. Data Extraction - Chunking
Auto Number Chunking
Query smaller chunks of the entire data set
Use an auto number plus a formula field for internal indexing
Find chunk boundaries (25K) and issue queries for each chunk (see the sketch below)
PK Chunking
Use the primary key (Id) to chunk
Usually better performance when the entire object is extracted
Find chunk boundaries (250K) and issue queries for each chunk
Auto Chunking ("safe harbor")
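A client-side auto-number chunking sketch at the slide's 25K boundary size. It assumes a numeric formula field `Chunk_Index__c` exposing each record's auto number (auto numbers are text, so the formula converts them); the REST query endpoint is standard, but the field names and credentials are illustrative.

# Find min/max of the indexed numeric field, then issue one selective query per chunk.
import requests

INSTANCE = "https://yourInstance.salesforce.com"    # placeholder
QUERY_URL = f"{INSTANCE}/services/data/v47.0/query"
HEADERS = {"Authorization": "Bearer <session id>"}  # placeholder

def soql(q):
    r = requests.get(QUERY_URL, params={"q": q}, headers=HEADERS)
    r.raise_for_status()
    return r.json()

lo = int(soql("SELECT MIN(Chunk_Index__c) n FROM Account")["records"][0]["n"])
hi = int(soql("SELECT MAX(Chunk_Index__c) n FROM Account")["records"][0]["n"])

CHUNK = 25_000
for start in range(lo, hi + 1, CHUNK):
    stop = start + CHUNK
    q = ("SELECT Id, Name FROM Account "
         f"WHERE Chunk_Index__c >= {start} AND Chunk_Index__c < {stop}")
    # submit q as one batch of a bulk query job, or run it synchronously here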
18. Data Extraction - Chunking
Flow: Find chunk boundaries → Create Job → Submit each batch/query → Get results → Close Job
Job Id (1234)
Q1 -> Batch Id (123)
Q2 -> Batch Id (234)
…
Qn -> Batch Id (789)
Close Job Id (1234)
Example scale: 150M records extracted via chunked queries Q1…Qn.
19. Data Cleanup - General Considerations
Deletion is one of the most resource-intensive data operations in Salesforce and can perform even slower than a data load in some cases (objects with complex sharing, with Master/Detail relationships, with Rollup Summary fields, etc.), even in bulk hard delete mode.
Custom objects can be cleaned up by truncation. Note that truncation cannot be performed on OOB standard objects.
Data in standard objects can be deleted only using the Delete API; a faster-performing Bulk Hard Delete option is available.
Records in the User object cannot be deleted, only deactivated.
20. Data Cleanup - Truncation and Hard Delete
When is truncation not possible? Objects that:
Are referenced by another non-empty object through a lookup field
Are referenced in an analytic snapshot
Contain a geolocation custom field
Have a custom index or an external ID field
Hard Delete
The Hard Delete option is disabled by default and must be enabled by an administrator (see the sketch below)
Observed about 4.5M/hr on a Task object delete, versus 18M/hr for load time, indicative of how slow the process is…
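A bulk hard delete sketch: the same Bulk API 1.0 job flow as a load, but with operation=hardDelete, which bypasses the recycle bin. The credentials and record Ids are placeholders, and the org-level "Bulk API Hard Delete" permission must already be enabled.

# Bulk hard delete: job with operation=hardDelete, batches of record Ids as CSV.
import requests

API = "https://yourInstance.salesforce.com/services/async/47.0"   # placeholder
HEADERS = {"X-SFDC-Session": "<session id>"}                      # placeholder

job_xml = ('<?xml version="1.0" encoding="UTF-8"?>'
           '<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">'
           '<operation>hardDelete</operation><object>Task</object>'
           '<contentType>CSV</contentType></jobInfo>')
job_id = requests.post(f"{API}/job", data=job_xml,
    headers={**HEADERS, "Content-Type": "application/xml; charset=UTF-8"}
    ).text.split("<id>")[1].split("</id>")[0]

# Each batch is just a CSV of record Ids to remove.
ids_csv = "Id\n00TB0000001Sv0AMAS\n00TB0000001Sv0BMAS\n"          # placeholder Ids
requests.post(f"{API}/job/{job_id}/batch", data=ids_csv,
              headers={**HEADERS, "Content-Type": "text/csv; charset=UTF-8"})
# Close the job and poll batches as in the load sketch after slide 9.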
21. Data Cleanup - General Recommendations
When removing large amounts of data from a sandbox org, consider the Refresh option first. This is the fastest way to clean up an org; an additional benefit is that data in the User object is removed as well.
To remove data from a custom object, use the truncate function. Truncation deletes an object entirely and places it into the recycle bin. Use the Erase function to physically remove data from the system and free up storage space. Note it might take several hours to erase data from a large custom object.
To remove data from a standard object, use the bulk hard delete option. Note that the performance of bulk hard delete is quite low, so plan sufficient time to remove large amounts of data this way. A recent test on the Task object in a shadow production org demonstrated performance of ~2M records deleted per hour. On the Account object the rate might be even lower.
When planning large-scale tests in a production environment, consider the possibility of getting two orgs that can be used in "round robin" mode: while a test is performed in one org, cleanup is performed in the other.
22. General Guidelines – Tips / Partner Tools
Tips
Test early and often, with data sets as large as possible
Split the initial load into smaller subsets (we used 10MM records). This allows for greater flexibility in load monitoring and control.
Queries: for aggregate data validation queries (since the bulk query option is not available for them), consider using Workbench with its asynchronous background processing configuration, which prevents early timeouts on the client side (more info: http://wiki.developerforce.com/page/Workbench). The publicly available Workbench app at https://workbench.developerforce.com/login.php can be utilized.
ETL Tool considerations
Timeout settings
Bulk query support
Handling of success and error files
Monitoring of job status
25. General Guidelines for other areas
Searching, Reporting, and List Views
• Filtered queries
• Skinny tables
• Data scope
• Roles and visibility
• Indexes
• Report folders
Data Governance
• Data management and administration
• Data model
• Security and access controls
Data Archival and Retention
• Storage
28. Initial Load - Parent References
General LDV recommendations
Avoid referencing the parent record via External Id if possible; use the SFDC Id instead.
XLDV considerations
Consider preparing source data sets with the native SFDC Id for parent references. This can be achieved by querying key mapping data from parent objects and performing joins on the client side to retrieve the SFDC Id for the parent reference, instead of using an External Id reference (see the sketch below). An additional benefit: this is a good data validation step.
Particularly important on large objects with multiple parent references.
Note that querying data from extra large objects can be a challenge in itself and often takes hours. Consider alternative approaches, for example collecting mapping keys from initial upload log files.
Referencing the parent via the native SFDC GUID will also allow use of the FAST LOAD option when available (next release, "safe harbor").
29. Initial Load - Data Validation
General LDV recommendations
The following post-load data validation checks can generally be considered:
Aggregate values/subtotals comparison (numeric and currency values)
Total counts comparison
Attribute-to-attribute comparison
Spot checking
Negative validations
XLDV considerations
Data validation on XLDV is a challenging task. Queries on tables with hundreds of millions of records are slow and often time out (even bulk queries). Consider options for splitting (chunking) your data validation queries by filtering on indexed attributes.
Special chunking techniques: auto number chunking, PK chunking (a sketch follows).
Do not underestimate, and plan enough time for, post-load data validation on XLDV.
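A post-load validation sketch comparing source-side totals with SOQL aggregates, chunked by an indexed attribute to keep each query selective. The REST query endpoint is standard; the object, field names, and source-system figures are placeholders.

# Total-count and aggregate-value comparison against the source system.
import requests

INSTANCE = "https://yourInstance.salesforce.com"    # placeholder
HEADERS = {"Authorization": "Bearer <session id>"}  # placeholder

def soql_one(q):
    r = requests.get(f"{INSTANCE}/services/data/v47.0/query",
                     params={"q": q}, headers=HEADERS)
    r.raise_for_status()
    return r.json()["records"][0]

# 1. Total counts comparison
sf_count = soql_one("SELECT COUNT(Id) n FROM Opportunity")["n"]

# 2. Aggregate values/subtotals comparison, chunked on an indexed date attribute
sf_amount_2012 = soql_one(
    "SELECT SUM(Amount) s FROM Opportunity "
    "WHERE CloseDate >= 2012-01-01 AND CloseDate < 2013-01-01")["s"]

source_count = 104_500_000        # placeholder: from the source system
source_amount_2012 = 9.9e9        # placeholder: from the source system
assert sf_count == source_count
assert abs(sf_amount_2012 - source_amount_2012) < 0.01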
30. Incremental synch’s
General LDV recommendations
For SOAP API’s use the largest batch size (200).
During incremental syncs on objects with large amount of
processing the biggest batches can fail due to time out (10
min per batch). In this case batch size might need to be
reduced.
It is possible to “batch” standard SOAP API calls by
submitting several jobs in parallel. When loading data into the
same object using several SOAP API batches in parallel
consider to group child records by the parent in the same
batch to minimize DB locking conflicts.
XLDV considerations
For incremental synchs of larger data sets (over 50K-100K)
use bulk API’s.
Consider configuring client application to programmatically
set load mode (SOAP vs. Bulk) based on the size of
incremental data set in each load session.
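A sketch of that dispatcher. The threshold value and the two loader callables are assumptions, not part of the deck; the 50K-100K band and the 200-record SOAP batch size come from the slides.

# Pick SOAP vs Bulk per load session based on incremental volume.
BULK_THRESHOLD = 75_000   # somewhere in the 50K-100K band from the slide

def load_incremental(records, soap_loader, bulk_loader):
    """soap_loader: sends one batch of up to 200 records via the standard SOAP
    API (shrink the batch if heavy triggers cause per-batch timeouts).
    bulk_loader: runs one Bulk API job over the whole set."""
    if len(records) < BULK_THRESHOLD:
        # SOAP path also lowers DB-contention risk on child-object uploads
        for i in range(0, len(records), 200):
            soap_loader(records[i:i + 200])
    else:
        bulk_loader(records)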
31. Data Loads - Current Volumes
• Initial volume close to 3.5B rows
• Worked with the customer to reduce volumes
• Split the implementation into two phases: approx. 845,413,000 rows in phase 1A and approx. 730,880,000 rows in phase 1B
Objects in scope: User Role, User, Account (+Contact), Agent Role Relationship, Household, Household Member, Opportunity, Task, Remark, Note, Attachment, Life Event, External Agreement, Agent Marketing Inf., Policy, Policy Role, Agent Agreement Role, Billing Account, Billing Account Role
32. Data Loads - Org/Environment preparation and tuning
For XLDV it is recommended to perform test loads in production-like environments:
To get a true representation of performance in production
To test loads in a real production environment for environment-specific settings
Allows for more accurate planning of the data load in terms of timing and dependencies
Concerns:
Large-scale deletion / environment cleanup after tests. Deletion is a resource-intensive data operation and the slowest (even using bulk hard delete).
Cannot predict in advance all possible issues and consequences related to multiple mass data deletions in an actual production environment.
Users cannot be deleted from the production environment, only deactivated. Multiple User load tests can produce a "garbage pile" of inactive User records that clutter the production environment.
33. Data Loads - Org/Environment preparation and tuning (Contd.)
General best practice recommendations:
Request a dedicated production test environment to avoid testing in the actual production org. Work with SFDC Operations on plans for re-provisioning the test production environment, because erasing XLDV data from an org between tests might not be feasible.
Coordinate with SFDC Operations on any large-scale test activities in a production environment (systems can be shut down by SFDC Operations as a suspected DDOS attack).
Plan accordingly and factor in the time required for setting up and configuring test production environments when multiple production test rounds are required (a production org cannot be refreshed like a sandbox org, so it needs to be re-provisioned and configured from scratch).
34. Data Loads – Org/Environment preparation and tuning
General LDV recommendations
For large-volume data loads it is possible to request additional temporary changes on the SFDC side to expedite load performance.
XLDV considerations
Request an increase of the Bulk API batch limit
Request an increase of file storage
Notify SFDC Operations about upcoming loads, with approximate data volumes
Request turning off Oracle Flashback
Use the FASTLOAD option (future option, "safe harbor")
Note that some settings/perms can technically be tweaked only in production environments, not in sandbox environments.
35. Initial Load - Use Bulk APIs for massive Data Loads
General LDV recommendations
The SFDC Bulk API is specifically designed to load large volumes of data. It is recommended to use the Bulk API for LDV initial data loads.
Use the Salesforce Bulk API in parallel mode when loading more than a few hundred thousand records. The main performance gain of the Bulk API comes from executing batches in parallel.
Note that the Bulk API (without batching, in sequential mode) can perform slower than the standard SOAP API, so below a certain record-count threshold (which depends on a number of factors: the object being loaded, processing on the SFDC side, the number of attributes loaded, etc.) using the Bulk API can become counterproductive.
XLDV considerations
Use the Bulk API in parallel mode with the maximum allowed batch size whenever possible, to maximize the number of records that can be loaded within a 24-hour period (based on the Bulk API batch limit; the standard limit is 2K batches per 24 hours). See the batch-splitting sketch below.
Use standard SOAP APIs for incremental ongoing data syncs (data sets under 50-100K) and Bulk APIs for larger incremental data sets (over 50K-100K rows).
Using standard SOAP APIs for incremental syncs has the benefit of reducing the risk of database contention during uploads of child objects.
Allow enough time for collecting request and result log files (it takes ~15 min for INFA to extract load logs for 10MM rows loaded).
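A batch-splitting helper sketch: carving an initial-load CSV into maximum-size Bulk API batches (10,000 records each) so the daily batch allowance goes as far as possible. File layout is an assumption; the 10K and 2K figures come from the slides.

# Split a source CSV into maximum-size Bulk API batch payloads.
import csv
import io

def split_into_batches(source_csv, batch_size=10_000):
    """Yield CSV strings of at most batch_size data rows, header included."""
    with open(source_csv, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                yield _to_csv(header, batch)
                batch = []
        if batch:
            yield _to_csv(header, batch)

def _to_csv(header, rows):
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(header)
    w.writerows(rows)
    return buf.getvalue()

# Ceiling check: 2K batches/day x 10K records = ~20M records/day at the standard limit.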
36. Initial Load - Planning Bulk Jobs (Parallel vs Single Object)
General LDV recommendations
When planning bulk jobs, consider the following:
When loading a large volume of data into a single object, it is recommended to submit all the batches in one job. Splitting (partitioning) data across several parallel jobs does not improve performance when each individual job is big enough. Note that batches of all jobs are placed in the same batch pool and queued for processing independently of the jobs they belong to.
When loading data into several objects, running several jobs for different objects in parallel is recommended when the jobs are small enough not to consume all the available system resources.
XLDV considerations
For loading extra large volumes of data into a single object in Salesforce, it is recommended to run one object load at a time. The rationale is that bulk jobs with hundreds or thousands of batches consume all available system resources, and running other jobs in parallel just makes jobs compete for the limited resources. The other consideration is the DB cache: when several objects are loaded in parallel, sharing the DB cache causes more frequent swapping, which slows down overall performance at the DB layer.
37. Initial Load - Defer Sharing Calculations
General LDV recommendations
Reduce sharing recalculation processing during the initial data upload.
Disable sharing rules and/or use the defer sharing calculation perm for custom objects while loading data. Note that sharing calculations will still need to be performed after the data load is complete, and they can take a significant amount of time, but deferring can reduce loading time by allowing sharing to be calculated in "off" hours.
XLDV considerations
Based on the XLDV performance tests, the general recommendation to defer sharing calculations might not be applicable for extra large volumes. Post-load sharing calculation can take a very long time on bigger volumes (it is not parallel currently; parallelism is on the roadmap for the nearest release).
Load data in the following sequence to force the main sharing calculations to be performed during data upload:
Create main user groups and group members
Create sharing rules
Create the User Role Hierarchy and upload users with the correct roles assigned
Set OWD on objects to Private where applicable
Upload data with the correct owner specified
Sharing tables can alternatively be uploaded directly, which avoids sharing recalculation on the initial load altogether. Consider uploading sharing tables for some objects directly to avoid sharing calculations during upload and post-load processing.
38. Initial Load - Triggers, WF Rules, Data Validation Rules
General LDV recommendations
Disable triggers, WF rules, and data validations on objects being uploaded when possible.
To avoid lengthy post-load catch-up processing, consider performing data transformations/validations on the source data set prior to data upload, or consider batch APEX post-load processing for data transformations that can't be performed prior to upload.
For each trigger, WF rule, and validation rule, individual analysis should be performed to define the best strategy.
XLDV considerations
General best practice rules apply.
39. Initial Load - Defer Indexing
General LDV recommendations
Reduce additional processing on SFDC during the initial data upload: turn off search indexes and consider creating custom indexes after the data load.
XLDV considerations
Based on initial performance results, index recalculations on extra large volumes of data take a long time post-load (not parallelized currently). It might be a better approach to configure all required indexes prior to an XLDV load.
40. Initial Load - Use Fastest Operation Possible
General LDV recommendations
Use the fastest operation possible: insert is fastest, then update, then upsert.
XLDV considerations
If possible, load the full set of initial data in one go using insert, with only small incremental upserts when needed (failed-record reprocessing, for example).
To avoid loading big volumes of data in upsert mode (for example, when the main load job fails in the middle and the remaining data should be added on top of existing data in an object), consider configuring the client load job so that it joins existing records to remaining records (on External Id) outside of SFDC and then loads the resulting set in insert mode, as sketched below.
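A restart-as-insert sketch of that idea: subtract the already-loaded keys from the source file and insert only the remainder. Bulk API 1.0 result files (Id, Success, Created, Error) align row-for-row with their request files, which is how the keys are recovered; the `External_Id__c` column name is an assumption.

# Turn a mid-failure restart into a plain insert instead of an upsert.
import csv

def remaining_rows(source_csv, request_result_pairs):
    """request_result_pairs: (request_file, result_file) per submitted batch.
    Zip them row-for-row to learn which external Ids actually loaded."""
    loaded = set()
    for req_path, res_path in request_result_pairs:
        with open(req_path, newline="") as req, open(res_path, newline="") as res:
            for req_row, res_row in zip(csv.DictReader(req), csv.DictReader(res)):
                if res_row["Success"].lower() == "true":
                    loaded.add(req_row["External_Id__c"])
    with open(source_csv, newline="") as f:
        return [r for r in csv.DictReader(f) if r["External_Id__c"] not in loaded]

# Load the returned rows with a plain insert job rather than upsert.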
41. Initial Load - Use Clean Data Set
General LDV recommendations
Use as clean a data set as possible.
Note that errors in a batch cause single-row processing on the remainder of that batch, which slows down load performance significantly.
XLDV considerations
Perform rigorous data validation prior to upload (see the sketch below). This will allow you to:
1. Load most of the records with the fastest operation, insert, and avoid slowdowns due to preventable errors
2. Avoid the slowdown when processing goes record by record within a 200-record transactional batch once the number of failed records reaches a threshold
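A pre-load scrubbing sketch: reject rows that would fail server-side anyway, so batches keep bulk row processing instead of degrading to record-by-record retries. The required-field rule below is illustrative; mirror your org's actual validations.

# Split a source file into clean rows (to load) and rejects (to fix upstream).
import csv

REQUIRED = ["Name", "External_Id__c"]   # assumed required fields

def scrub(source_csv, clean_csv, reject_csv):
    with open(source_csv, newline="") as fin, \
         open(clean_csv, "w", newline="") as fok, \
         open(reject_csv, "w", newline="") as fbad:
        reader = csv.DictReader(fin)
        ok = csv.DictWriter(fok, fieldnames=reader.fieldnames)
        bad = csv.DictWriter(fbad, fieldnames=reader.fieldnames)
        ok.writeheader()
        bad.writeheader()
        for row in reader:
            valid = all(row.get(f, "").strip() for f in REQUIRED)
            (ok if valid else bad).writerow(row)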
46. Data Load Considerations
Topic | Consideration | What do I need to do?
Data Model | Normalized / de-normalized |
Sharing Model | Sharing calculations |
Target Data | Does it really need to exist in SFDC? BP etc. |
Test Environments | Test them before you deploy! |
Timelines | Adequate timeline for testing |
User data | Cannot be deleted |
API | SOAP versus Bulk |
Batches | Limit of 2K batches for Bulk API a day | SF Support to increase batches
Deletes | Oh no.. |
Extrapolation | Not really possible | It is at par or better; if Prod can be used, even better
Editor's notes
Traditional DB performance tuning techniques used to improve data load performance are not applicable to the Salesforce multitenant cloud architecture. Standard objects and custom objects have different underlying DB tables and storage mechanisms, and can behave differently. Along with general LDV best practices, XLDV-specific recommendations are included based on actual data design, volumes, and the results of performance testing conducted by the SFDC Performance Test Lab team on extra large data sets of several hundred million rows per object and several billion rows per org.