This workshop looks at ways to create synthetic data from Lending Club loan record datasets and compares the characteristics and statistical properties of the real and synthetic datasets. It also covers building machine learning models that predict interest rates from real and synthetic datasets, evaluating their performance, and weighing the advantages and disadvantages of using synthetic datasets as a proxy for real datasets.
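As a rough illustration of the real-versus-synthetic comparison described above, the sketch below fits a normal distribution to one numeric column, samples a synthetic column from it, and prints summary statistics side by side. The Gaussian synthesizer, the function names, and the interest-rate values are assumptions made for this example; they are not the workshop's method or Lending Club data.

```python
import random
import statistics

def fit_gaussian(values):
    """Return (mean, stdev) of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)

def sample_synthetic(values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to the real column."""
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

def compare(real, synthetic):
    """Pair up summary statistics as (real, synthetic) tuples."""
    return {
        "mean": (statistics.mean(real), statistics.mean(synthetic)),
        "stdev": (statistics.stdev(real), statistics.stdev(synthetic)),
        "median": (statistics.median(real), statistics.median(synthetic)),
    }

# Made-up interest rates in percent, for illustration only.
real_rates = [7.9, 10.5, 11.2, 12.7, 13.3, 14.1, 15.6, 16.0, 18.2, 21.5]
synthetic_rates = sample_synthetic(real_rates, n=1000)
report = compare(real_rates, synthetic_rates)
for name, (real_value, synthetic_value) in report.items():
    print(f"{name:>6}: real={real_value:.2f} synthetic={synthetic_value:.2f}")
```

A single fitted distribution ignores relationships between columns, so matching marginal statistics is only a first check; the tools listed in the agenda aim to model joint structure as well.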
Synthetic data in finance
1. Synthetic Data Generation for Machine Learning in Finance
2020 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
QuantUniversity
9/25/2020
2. Speaker
• Quant, Data Science & ML practitioner
• Prior experience at MathWorks, Citigroup
and Endeca, and with 25+ financial services
and energy customers
• Columnist for the Wilmott Magazine
• Author of forthcoming book
“Pragmatic Machine Learning in Finance”
• Teaches Data Science/AI at Northeastern
University, Boston
• Reviewer: Journal of Asset Management
Sri Krishnamurthy
Founder and CEO
QuantUniversity
3. About QuantUniversity
• Boston-based Data Science, Quant
Finance and Machine Learning
training and consulting advisory
• Trained more than 1000 students in
Quantitative methods, Data Science,
ML and Big Data Technologies
• Building a platform for
operationalizing AI and Machine
Learning in the Enterprise
4. Agenda
1. Challenges with Real Datasets
2. Synthetic Dataset Generation Tools
▫ Proprietary
▫ Open Source
– Faker
– DataSynthesizer
– SDV
– Synthpop
– GANs
3. Demos
▫ VIX Data Generator
7. Challenges with real datasets
Coverage
It may not be feasible to get samples for all categories
• Lighting conditions
• Modifications (Glasses/No glasses, Moustache/No moustache, etc.)
• Positions
8. Challenges with real datasets
Not all scenarios have played out
• Stress scenarios
• What-if scenarios
Figure ref: http://www.actuaries.org/CTTEES_SOLV/Documents/StressTestingPaper.pdf
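One way to produce scenarios that have not played out is to resample history and exaggerate it. The sketch below is an assumption-laden illustration, not the method from the slides: it bootstraps made-up daily returns, stretches each sampled return away from the historical mean by a hypothetical `vol_multiplier`, and compounds the result into synthetic paths.

```python
import random

def stressed_paths(returns, n_paths, horizon, vol_multiplier=2.0, seed=0):
    """Bootstrap price paths whose return dispersion is scaled up.

    Each step resamples one historical return, stretches its distance
    from the historical mean by vol_multiplier, and compounds it.
    """
    rng = random.Random(seed)
    mean = sum(returns) / len(returns)
    paths = []
    for _ in range(n_paths):
        level = 1.0  # every path starts from a normalized level of 1.0
        path = [level]
        for _ in range(horizon):
            sampled = rng.choice(returns)
            shocked = mean + vol_multiplier * (sampled - mean)
            level *= 1.0 + shocked
            path.append(level)
        paths.append(path)
    return paths

# Made-up daily returns, for illustration only.
history = [0.004, -0.012, 0.007, -0.003, 0.015, -0.021, 0.002, 0.009]
scenarios = stressed_paths(history, n_paths=5, horizon=20)
worst = min(path[-1] for path in scenarios)
print(f"worst terminal level across scenarios: {worst:.3f}")
```

Independent resampling discards any serial dependence in the history, so this is a starting point for what-if analysis and not a calibrated stress test.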
9. Challenges with real datasets
Missing values
• Missing at random
• Missing sequences
• Need data to fill frames
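The simplest form of generating data to fill a gap is interpolation. The sketch below assumes a numeric sequence with gaps marked as `None` and fills interior gaps linearly, copying the nearest observed value at the edges. The function name and sample values are hypothetical; the slides do not prescribe a fill method.

```python
def fill_gaps(series):
    """Linearly interpolate None gaps; edge gaps copy the nearest observed value."""
    known = [i for i, value in enumerate(series) if value is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    # Leading and trailing gaps have only one neighbor, so reuse it.
    for i in range(known[0]):
        filled[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):
        filled[i] = series[known[-1]]
    # Interior gaps: walk each pair of consecutive observed points.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

observed = [10.0, None, None, 16.0, None, 20.0, None]
print(fill_gaps(observed))  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 20.0]
```

Linear fills understate variability inside long gaps, which is one reason model-based synthetic generation is attractive for missing sequences.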
10. Challenges with real datasets
• Access
▫ Hard to find
▫ Rare class problems
▫ Privacy concerns making it difficult to share
11. Challenges with real datasets
Imbalanced
• Need more samples of the rare class
• Need proxies for data points that were not observed or recorded
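A common way to get more rare-class samples is to interpolate between existing ones. The sketch below is a simplified SMOTE-style oversampler, swapped in here as an illustration since the slides do not name a technique: it uses only the single nearest neighbor, where SMOTE proper picks among k neighbors. The function name and the sample points are hypothetical.

```python
import math
import random

def oversample_rare(points, n_new, seed=0):
    """Create n_new synthetic points between rare-class points and their nearest neighbor."""
    if len(points) < 2:
        raise ValueError("need at least two rare-class points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # Nearest other rare-class point by Euclidean distance.
        neighbor = min(
            (p for p in points if p is not base),
            key=lambda p: math.dist(base, p),
        )
        t = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + t * (q - b) for b, q in zip(base, neighbor)))
    return synthetic

# Made-up two-feature rare-class samples, for illustration only.
rare = [(1.0, 2.0), (1.5, 2.5), (3.0, 1.0), (2.8, 1.4)]
new_points = oversample_rare(rare, n_new=6)
for point in new_points:
    print(tuple(round(coord, 3) for coord in point))
```

Because every new point lies on a segment between two observed points, this approach cannot produce samples outside the region the rare class already covers.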
14. Proprietary Tools
Company: Core Technology
• Tonic.ai: All-in-one platform for data anonymization, subsetting, and synthesis, integrated with databases (Hadoop, Oracle, MySQL, MS SQL Server, MongoDB, Amazon Aurora/Redshift, and Google BigQuery)
▫ Uses Condenser and Masquerade
• Mostly.ai: Tabular data using generative deep neural networks (no image data)
• CVEDIA:
▫ Sensor modeling and algorithm training
▫ Handles images using SynCity as a custom pocket laboratory to generate highly entropic scenes, conditions, and metadata. Enables real-time Hardware-In-the-Loop (HWIL), Human-In-the-Loop (HITL) or Software-In-the-Loop (SIL) simulations even with complex sensor configurations
• Deep Vision Data: Image creation; synthetic training data
• Synthesis.ai: The data generation platform for computer vision
15. López de Prado, Marcos, Machine Learning for Asset Managers, Cambridge University Press, 2020
32. If you want to be a part of the QuSandbox private beta
Contact us:
info@qusandbox
33. Sri Krishnamurthy, CFA, CAP
Founder and Chief Data Scientist
sri@quantuniversity.com
srikrishnamurthy
www.qu.academy
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.