A comprehensive introduction to machine learning and deep learning, with an application in finance (an example of predicting bank failure). The differences between ML in tech and ML in finance are then outlined. The last section is excluded from the file.
1. Fundamentals of Machine Learning
2. ML in Tech vs ML in Finance
3. Example: Bank Rating Prediction
4. Deep Learning and Neural Networks
5. Example: Neural Net Copula in Markowitz Problem
1. Fundamentals of Machine Learning
Major AI Approaches
• Logic and Rules-Based Approach
• Hard-code knowledge about the world in formal languages
• Top-down rules are created for computers
• Computers reason about these rules automatically.
Example: Project Cyc (Lenat and Guha, 1989)
• Logic and Rules-Based Approach
Example within law – Expert Systems
• TurboTax
• Personal income tax laws
• Represented as logical computer rules
• Software computes tax liability
Major AI Approaches
• Machine Learning (Pattern-Based Approach)
Learning: the process of converting experience into expertise or knowledge
We wish to program “agents” so that they can “learn” from input data
ML is what computers use to learn about the outside world, much like humans use math and physics for the same purpose.
Agent = Architecture + Algorithm
AI systems need the ability to acquire their own knowledge by extracting patterns from raw data.
Formal Definition
Arthur Samuel (1959): Field of study that gives computers the ability to learn without being explicitly programmed
Tom Mitchell (1998): Well-posed learning problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Example: Chess
T: playing chess, E: agent playing against itself, P: number of wins / number of games
Artificial Intelligence
Studies “intelligent agents” that perceive their environment and perform different actions to solve tasks that involve mimicking cognitive functions of the human brain (Russell, Norvig)
Goals of AI
• Knowledge Representation. Ontology: the set of objects, relations, concepts
• Taking Actions, Planning. Acting while envisioning the future to achieve goals
• Perception and Learning (ML). Perception from sensors, learning from experience
• Natural Language Processing. Ability to read and understand human language
• Automated Reasoning. Mimicking human reasoning for logical deductions
Applied AI (present)
• Perception (learning), actions
• Communication (NLP)
• Knowledge/Ontologies
• Reasoning, planning
Artificial General Intelligence (AGI) (future)
• Learns and acts autonomously
• Uses sub-symbolic information
• Algorithmic theory of cognitive acts
• Solves any intellectual task
Agent and Environment: the agent receives perceptions from the environment and performs actions on it
Perception Tasks: there is a fixed action
(Perception comes via the physical world (through sensors) or via digital data (read from a disk))
Action Tasks: there are multiple possible actions
• involve planning and forecasting the future
• involve sub-tasks of learning, for sequential (multi-step) problems
(Actions can be fixed or can vary, and may or may not change the environment)
When do we need ML (instead of programming directly)?
• Complexity
1. Tasks performed by Animals/Humans: Can’t extract a well
defined program. (Driving, speech recognition, image understanding)
2. Tasks beyond Human Capabilities: Analysis of very large and
complex datasets (Astronomical and genomic data, turning medical
archives to medical knowledge, weather prediction)
• Adaptivity
Programs must adapt to changes in the environment they interact with.
(handwritten text recognition, spam detection, speech recognition)
Types of learning
• Supervised: the environment (teacher) “supervises” the learner by providing extra information (“labels”). We have train (seen) and test (unseen) data. The learner models p(y|x).
• Unsupervised: come up with a summary or compressed version of the data, learn a probability distribution, clustering (denoising, synthesis)
• Reinforcement: intermediate between the two. There is a teacher, but with only partial feedback (reward) over a sequence of actions (e.g., valuing positions on a chess board, self-driving)
Linear Regression Example: satisfaction rate of company employees
Training data: company employees have rated their satisfaction on a scale of 1 to 100
Predictor:
Let’s start with
Cost Function:
As we minimize J (using gradient descent), the fitted line gets better and better.
Best Line:
Minimization Algorithm: Gradient Descent
[Equation: gradient with respect to the parameters, $\partial g / \partial \theta$]
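The gradient descent loop for this kind of one-feature linear regression can be sketched as follows. This is only an illustration: the predictor form h(x) = theta0 + theta1 * x, the mean-squared-error cost J, the learning rate, and the toy data are all assumptions, since the slide's own formulas and data are not reproduced in this transcript.

```python
def predict(theta0, theta1, x):
    """Assumed linear predictor h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Assumed cost J = (1 / 2m) * sum of squared prediction errors."""
    m = len(xs)
    return sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Repeatedly step both parameters against the gradient of J."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [predict(theta0, theta1, x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # dJ/dtheta1
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1


# Hypothetical toy data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
final_cost = cost(theta0, theta1, xs, ys)
```

Because this J is convex in the two parameters, a small enough learning rate moves the parameters toward the single global minimum, which is the behavior the slides illustrate with the fitted line improving over iterations.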
Plot of J
In this case, J is convex, so there are no local minima other than the global minimum!
J contours
Iterations
For more visualizations:
https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220
Machine Learning Landscape: Perception Tasks
Supervised Learning
• Regression: learn a regression function. Given: input/output pairs
• Classification: learn a class function. Given: input/output pairs
Unsupervised Learning
• Clustering (k: the number of clusters). Given: inputs only
• Representation Learning: learn a representer function. Given: inputs only
Machine Learning Landscape: Action Tasks
Reinforcement Learning
• RL: optimization of strategy for a task
• IRL: learn objectives from behavior
Machine Learning in Finance: Action Tasks
Reinforcement Learning
• RL (optimization of strategy for a task): trading strategies, asset management
• IRL (learn objectives from behavior): reverse engineering of consumer behavior, trading strategies, …
ML by Financial Application Areas: Perception Tasks
Banking: Retail, P2P Lending
• Customer segmentation
• Loan defaults
• Credit card defaults
• Fraud detection
• Anti-money laundering
Banking: Commercial and Investment
• Rating prediction
• Default modeling
• Client data mining
• Recommender systems
Asset Management
• Portfolio optimization
• Representation learning
• Factor modeling
• De-noising
• Regime change detection
• Stock segmentation
• Multi-period portfolio optimization
• Derivatives trading
ML by Financial Application Areas: Action Tasks
Quantitative Trading
• Profit-maximizing trading execution
• Optimal trade execution
• Quantitative trading strategies
• Earning prediction
• Algorithmic trading
• Optimal market making
2. ML in Tech vs ML in Finance
ML in Tech
• Perception (image recognition, NLP tasks, etc.). Methods: SL/UL
• Action (computational advertising, robotics, self-driving cars, etc.). Methods: SL/UL/RL
[Diagram: ML in Tech (image recognition, NLP tasks, computational advertising, robotics) vs ML in Finance (forecasting tasks, valuation tasks)]
ML in Finance
Perception: Forecasting tasks
• Security price predictions (stocks, bonds, commodities, etc.). Methods: SL/UL
• Corporate actors' action prediction (dividends, mergers, defaults, etc.). Methods: SL/UL/RL
• Individual actors' action prediction (loan defaults, fraud, AML, etc.). Methods: SL/UL/RL
ML in Finance
Perception: Valuation tasks
• Asset valuation (stocks, futures, commodities, bonds, etc.). Related to forecasting. Methods: SL/UL
• Derivatives valuation. Methods: SL/UL/RL
Tasks | ML in Tech | ML for Finance
Big Data? | typically yes | typically no
Stationary data? | typically yes | typically no
Noise-to-signal ratio | typically low | typically high
Interpretability of results | typically not important, or not the main focus | typically either desired or required
Action (RL) tasks | low-dimensional state-action space, low uncertainty | high-dimensional state-action space, high uncertainty
Big Data: data for ML in tech are huge. Most data for ML in finance are medium-sized, except in HFT.
Stationary data: because most financial data are non-stationary, collecting more data, even when possible, is not always helpful.
Noise-to-signal ratio: financial data are typically quite noisy, and “true” signals are unobservable!
Interpretability of results is:
• Desired for trading
• Required for regulation (General Data Protection Regulation, 2018)
Action (RL) tasks:
• ML in Tech: the dimensionality of the state-action space is usually in the hundreds. The action space is often discrete (except in robotics). Uncertainty is low to moderate (think self-driving cars!)
• ML in Finance: the dimensionality of the state-action space is often in the thousands. The action space is usually continuous. Uncertainty is low to high (think Brexit!)
1. Fundamentals of Machine Learning
A Gentle Model (Statistical Learning Framework)
The learner’s input:
• Domain set $\mathcal{X}$: features
• Label set $\mathcal{Y}$ (discrete or continuous)
• Training data $S = ((x_1, y_1), \dots, (x_m, y_m))$: also called the training set (seen)
The learner’s output:
• Prediction function (hypothesis) $h: \mathcal{X} \to \mathcal{Y}$
Data-generation model: probability distribution $D$ over $\mathcal{X}$
Measure of success: error of the predictor (loss function)
Types of Error
• The ability to perform well on previously unobserved inputs is called generalization
• What separates machine learning from optimization is that we want the generalization
error to be low as well
• Estimate generalization error by a test set of examples that were collected separately
from the training set
Training error: the error measured on the training set
Generalization error (test error):
$L_{D,f}(h) \stackrel{\text{def}}{=} \mathbb{P}_{x \sim D}[h(x) \neq y]$
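For comparison, the training error can be written in the same style. The formula below is the standard empirical-risk definition and is an assumption here, since the slide's own expression is not reproduced in this transcript; $S$ denotes a training set of $m$ labeled examples $(x_i, y_i)$.

```latex
L_S(h) \stackrel{\text{def}}{=} \frac{\left|\{\, i \in \{1, \dots, m\} : h(x_i) \neq y_i \,\}\right|}{m}
```

The generalization error is a probability under the unknown distribution, whereas the training error is a plain fraction of misclassified training examples, which is why only the latter can be computed directly.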
• We sample the training set, then use it to choose the parameters that reduce the training-set error. Under this process, the expected test error is greater than or equal to the expected training error.
• The factors determining how well a machine learning algorithm will perform are its ability to
1. Make the training error small (failing to do so is underfitting)
2. Make the gap between training and test error small (failing to do so is overfitting)
Papayas Example
• No matter what the sample is, $L_S(h_S) = 0$
• $h_S$ predicts label 1 on only a finite number of instances, so $L_D(h_S) = 1/2$
• We have found a predictor whose performance on the training set is excellent, yet its
performance on the true “world” is very poor
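The papaya example can be simulated with a predictor that simply memorizes the training set. This is a sketch under the setup described in the speaker notes (instances uniform in a gray square of area 2, label 1 inside an inner blue square of area 1); the centering of the inner square, the sample sizes, and all function names are assumptions made for illustration.

```python
import random

SIDE_OUTER = 2 ** 0.5   # gray square with area 2
SIDE_INNER = 1.0        # blue square with area 1, assumed centered in the gray one


def true_label(point):
    """Labeling function f: 1 inside the inner square, 0 otherwise."""
    x, y = point
    lo = (SIDE_OUTER - SIDE_INNER) / 2
    hi = lo + SIDE_INNER
    return 1 if (lo <= x <= hi and lo <= y <= hi) else 0


def sample(n, rng):
    """Draw n labeled points uniformly from the gray square."""
    pts = [(rng.uniform(0, SIDE_OUTER), rng.uniform(0, SIDE_OUTER)) for _ in range(n)]
    return [(p, true_label(p)) for p in pts]


def memorizer(train):
    """h_S: return the stored label for seen points, 0 for everything else."""
    table = dict(train)
    return lambda p: table.get(p, 0)


def error(h, data):
    """Fraction of examples on which h disagrees with the label."""
    return sum(h(p) != y for p, y in data) / len(data)


rng = random.Random(0)
train = sample(200, rng)
h_S = memorizer(train)
train_error = error(h_S, train)                 # error on seen data
test_error = error(h_S, sample(20000, rng))     # estimate of the true error
```

On fresh points the memorizer almost always answers 0, so it is wrong on roughly the half of the probability mass that lies inside the inner square, while its training error is zero.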
Altering Capacity
• Overfitting occurs when our hypothesis fits the training data “too well” (perhaps like the everyday experience that a person who provides a perfectly detailed explanation for each of his actions may raise suspicion).
• A model’s capacity is its ability to fit a wide variety of functions.
• Capacity is controlled by restricting the hypothesis class (size or complexity): VC dimension, techniques, program bits, …
• Restricting the hypothesis class to axis-aligned rectangles guards against overfitting
• If H is a finite class, then ERM_H will not overfit, given a sufficiently large training sample
Bias – Complexity Tradeoff
Error Decomposition
Approximation Error
• Due to underfitting
• the minimum risk achievable by a predictor in the hypothesis class.
• how much risk we have because we restrict ourselves to a specific class (bias)
• depends on the chosen hypothesis class
• Reflects the quality of prior knowledge
Estimation Error
• Due to overfitting
• the difference between the error achieved by the learned predictor and the approximation error
• It exists because the training error is only an estimate of the generalization error
• depends on the training set size and on the size or complexity of the hypothesis class
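The decomposition described in these bullets can be written compactly. The notation below is an assumption borrowed from the first reference (Understanding Machine Learning): $h_S$ is the predictor learned from the sample $S$ and $\mathcal{H}$ is the hypothesis class.

```latex
L_D(h_S) = \epsilon_{\text{app}} + \epsilon_{\text{est}}, \qquad
\epsilon_{\text{app}} = \min_{h \in \mathcal{H}} L_D(h), \qquad
\epsilon_{\text{est}} = L_D(h_S) - \epsilon_{\text{app}}
```

Enlarging $\mathcal{H}$ can only lower $\epsilon_{\text{app}}$ but tends to raise $\epsilon_{\text{est}}$, which is the tradeoff named in the slide title.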
[Figure: Bias – Complexity Tradeoff, relating model capacity and data complexity]
Generalization
Design Matrix
• A model is trained using only a training set
• A test set is used to estimate algorithm’s ability to generalize, i.e. perform well on
unseen data.
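A minimal sketch of holding out a test set is shown below. The function name, the 80/20 split, and the toy data are illustrative assumptions, not part of the slides.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled examples (x, y)
data = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(data)
```

The model is fit on `train` only; `test` is touched once at the end to estimate how well the fitted model generalizes.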
Prior Knowledge
• To generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn.
• The stronger the prior knowledge (or prior assumptions) that one starts the learning process with, the easier it is to learn from further examples. However, the stronger these prior assumptions are, the less flexible the learning is (it is bound, a priori, by the commitment to these assumptions).
Examples
• Restricting our hypothesis class (finiteness, VC dimension)
• Assumptions on the distribution
Prior Knowledge: Bait Shyness
The rats seem to have some “built in” prior knowledge telling them that, while temporal
correlation between food and nausea can be causal, it is unlikely that there would be a
causal relationship between food consumption and electrical shocks or between sounds
and nausea.
3. Bank Failures Example
FDIC
• US-based commercial banks are regulated by the FDIC
• FDIC provides insurance for commercial banks, and charges them insurance premium
according to an internal (and non-public) rating based on the CAMELS supervisory
system
CAMELS
• Rating 1: best; rating 5: worst
• A bank rated 4 or 5 is likely to be closed soon
Capital inadequacy is the most common cause of a bank closure (other reasons: violation of financial rules, management failures)
If the FDIC decides to close a bank, it takes over both its assets and its liabilities and then tries to sell the assets at the best possible price to pay off the liabilities.
• CAMELS ratings are not publicly known; however, Call Reports are available.
• In addition, the FDIC provides historical data for failed banks:
(https://www.fdic.gov/bank/individual/failed/)
Call Report
• 28 schedules in total
• Form FFIEC 031: for banks with both domestic (US) and foreign offices
• Form FFIEC 041: for banks with domestic (US) offices only
Correlation Matrix of features
In this problem we want to classify banks as failed (defaulted) or non-failed.
NI: net income
log_TA: logarithm of total assets
TL: total loans
NPL: non-performing loans
Assessment Base: average consolidated assets minus tangible equity
…
4. Deep Learning and Neural Networks
Representation
The performance of simple machine learning algorithms depends heavily on the representation of the data they are given.
Goal: separate the factors of variation
Problem: many factors of variation influence every single piece of data we are able to observe (e.g., a car image at night).
Most applications require us to disentangle the factors of variation and discard the ones that we do not care about.
Representation Learning: use ML to discover not only the mapping from representation to output but also the representation itself.
Quintessential example: the autoencoder, the combination of an encoder function, which converts the input data into a different representation, and a decoder function, which converts the new representation back into the original format.
Deep learning solves this problem by introducing representations that are
expressed in terms of other, simpler representations.
(build complex concepts out of simpler concepts. )
Depth
Depth enables the computer to learn a multistep computer program
Layer: state of the computer’s memory after executing another set of instructions in
parallel
Networks with greater depth can execute more instructions in sequence (later instructions can refer back to the results of earlier instructions).
Measuring Depth
1. Depth of computational graph: number of sequential instructions (length of the
longest path through a flow chart)
2. Depth of the concepts graph: describing how concepts are related to each other.
• Depth of the flowchart of the computations needed to compute the representation of
each concept may be much deeper than the graph of the concepts themselves
History of DL
• Dates back to the 1940s (it only appears to be new)
• Different names:
1. 1940s – 1960s: Cybernetics
2. 1980s – 1990s: Connectionism
3. Beginning in 2006: Deep Learning
• Artificial Neural Networks: learning algorithms intended as models of biological learning (models of how learning happens or could happen in the brain)
Neural Perspective on DL
1. The brain provides a proof that intelligent behavior is possible
2. Reverse engineer the computational principles behind the brain
• Today, neuroscience is regarded as an important source of inspiration for DL researchers, but it is no longer the predominant guide for the field, because to obtain a deep understanding of the actual algorithms used by the brain we would need to be able to monitor the activity of (at the very least) thousands of interconnected neurons simultaneously.
• The basic idea of having many computational units that become intelligent only via their interactions with each other is inspired by the brain
• Algorithms from the 1980s work quite well, but this was not apparent circa 2006 because they were too computationally costly.
• Increasing Dataset Sizes: some skill is required to get good performance from a DL algorithm. Fortunately, the amount of skill required decreases as the amount of training data increases.
The age of “Big Data” has made ML much easier because the key burden of statistical estimation (generalizing to new data after observing only a small amount) has been considerably lightened.
• Increasing Model Sizes: animals become intelligent when many of their neurons work together. Larger networks are able to achieve higher accuracy on more complex tasks.
Challenges motivating DL
• Curse of Dimensionality
A statistical challenge arises because the number of possible configurations of x is much larger than the number of training examples.
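The mismatch between configurations and examples can be made concrete with a small count. This sketch assumes each input dimension is split into 10 bins and that 10,000 training examples are available; both numbers are hypothetical.

```python
def num_regions(bins_per_axis, dims):
    """Number of grid cells when every axis is split into the same number of bins."""
    return bins_per_axis ** dims


n_examples = 10_000  # hypothetical training-set size

# Upper bound on the fraction of regions that can contain at least one example
coverage = {
    d: min(1.0, n_examples / num_regions(10, d))
    for d in (1, 2, 3, 10)
}
```

With one, two, or three dimensions every region could in principle be observed, but with ten dimensions at most one region in a million contains any training example, so most predictions must be made for configurations never seen in training.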
www.playground.tensorflow.org
• Local Constancy and Smoothness
Among the most widely used of these implicit “priors” is the smoothness
prior, or local constancy prior.
It states that the function we learn should not change very much within a small region.
Much of the modern motivation for deep learning is derived from studying the limitations of
local template matching and how deep models are able to succeed in cases where local
template matching fails (Bengio et al., 2006b).
Neural Networks
Feedforward Neural Network (MLP)
Goal: approximate some function $f^*$ with some $f(x; \theta)$
Feedforward: information flows through the function with no feedback connections
Neural: loosely inspired by neuroscience
Network: composing together many different functions, e.g. $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$
($f^{(i)}$ is the $i$-th layer, and the final layer is the output layer)
Depth: overall length of the chain
Width: dimensionality of the hidden layers
Hidden Layer: the training data does not show the desired output for each of these layers
• During NN training, we drive $f(x)$ to match $f^*(x)$
• Each hidden layer is vector valued
[Figure: feedforward neural network (MLP) with layers $f^{(1)}$, $f^{(2)}$, $f^{(3)}$, annotated with depth and width]
Feedforward Neural Network (MLP)
MLP as a kernel technique
extend linear models to represent nonlinear functions of by applying the linear model not to
, but to a transformed input
How to choose
1. Generic: infinite-dimensional(based on RBF kernel).
Enough capacity but poor generalization
2. Manually Engineer : Requires decades of human effort for each separate task
3. Learn :
This is an example of a deep feedforward network, with defining a hidden layer
• The advantage of 3’rd approach is that the human designer only needs to find the right
general function family rather than finding precisely the right function.
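Example: Learning XOR
• After solving the linear model $f(x; w, b) = x^\top w + b$: $w = 0$ and $b = \frac{1}{2}$, so it outputs 0.5 everywhere and cannot represent XOR
• Add a hidden layer: $f(x; W, c, w, b) = f^{(2)}(f^{(1)}(x))$, where $h = f^{(1)}(x; W, c)$ and $y = f^{(2)}(h; w, b)$
• Most neural networks establish a nonlinear function by using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function:
$h = g(W^\top x + c)$ or $h_i = g(x^\top W_{:,i} + c_i)$, where $g(z) = \max\{0, z\}$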
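A one-hidden-layer ReLU network that computes XOR exactly can be sketched as follows. The weights are the hand-picked solution given in the Deep Learning textbook listed in the references, not the result of training, and the function names are illustrative.

```python
def relu(z):
    """Element-wise rectified linear unit, g(z) = max(0, z)."""
    return [max(0.0, v) for v in z]


# Hand-picked parameters (textbook solution, not learned here)
W = [[1.0, 1.0],
     [1.0, 1.0]]      # input-to-hidden weights
c = [0.0, -1.0]       # hidden biases
w = [1.0, -2.0]       # hidden-to-output weights
b = 0.0               # output bias


def xor_net(x):
    """Compute w^T relu(W^T x + c) + b for a two-dimensional input x."""
    pre = [sum(W[i][j] * x[i] for i in range(2)) + c[j] for j in range(2)]
    h = relu(pre)
    return sum(w[j] * h[j] for j in range(2)) + b


inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
outputs = [xor_net(x) for x in inputs]
```

The hidden layer maps both [0, 1] and [1, 0] to the same hidden vector, which makes the four points linearly separable for the output layer.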
When $x_1 = 0$, the model’s output must increase as $x_2$ increases. When $x_1 = 1$, the model’s output must decrease as $x_2$ increases.
Recurrent Neural Network (RNN)
• For processing a sequence of values $x^{(1)}, \dots, x^{(\tau)}$ (the length $\tau$ can be variable)
• Parameter sharing: using the same parameter for more than one function in a
model (tied weights).
If we had separate parameters for each value of the time index, we could
not generalize to sequence lengths not seen during training, nor share
statistical strength across different sequence lengths and across different
positions in time. Such sharing is particularly important when a specific piece
of information can occur at multiple positions within the sequence. (“I went
to Nepal in 2009” and “In 2009, I went to Nepal”)
• Each member of the output is a function of the previous members of the output. Each
member of the output is produced using the same update rule applied to the previous
outputs.
• Include cycles that represent the influence of the present value of a variable on its own
value at a future time step.
• Any function involving recurrence can be considered a recurrent neural network.
[Figure: parameter sharing in a recurrent neural network (RNN)]
Unfolding Computational Graphs
The unfolding process thus introduces two major advantages:
1. Regardless of the sequence length, the learned model always has the same
input size, because it is specified in terms of transition from one state to
another state, rather than specified in terms of a variable-length history of
states.
2. It is possible to use the same transition function f with the same parameters
at every time step.
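Both advantages can be seen in a tiny scalar recurrence. This is only a sketch: the tanh transition and the fixed weights `w_h` and `w_x` are assumptions chosen for illustration, not a model from the slides.

```python
import math


def transition(h_prev, x, w_h=0.5, w_x=1.0):
    """Shared transition function f: the next state from the previous state and one input."""
    return math.tanh(w_h * h_prev + w_x * x)


def run_rnn(xs, h0=0.0):
    """Unfold the recurrence over a sequence of any length using the same f and parameters."""
    h = h0
    states = []
    for x in xs:
        h = transition(h, x)   # input size is fixed: (previous state, current input)
        states.append(h)
    return states


short_states = run_rnn([1.0, 0.0])
long_states = run_rnn([1.0, 0.0, -1.0, 0.5, 2.0])
```

The same two weights handle a length-2 and a length-5 sequence, because the model is specified as a state-to-state transition rather than as a function of the whole history.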
Some types of RNNs
I. Produce an output at each time step and have recurrent connections between hidden
units
II. Produce an output at each time step and have recurrent connections only from the
output at one time step to the hidden units at the next time step.
III. With recurrent connections between hidden units, that read an entire sequence and
then produce a single output
• The network with recurrent connections only from the output at one time step to
the hidden units at the next time step is strictly less powerful because it lacks hidden-to-
hidden recurrent connections. For example, it cannot simulate a universal Turing
machine. It requires that the output units capture all the information about the past that
the network will use to predict the future.
Teacher Forcing
A procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output $y^{(t)}$ as input at time $t+1$.
$\log p(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)}) = \log p(y^{(2)} \mid y^{(1)}, x^{(1)}, x^{(2)}) + \log p(y^{(1)} \mid x^{(1)}, x^{(2)})$
• Teacher forcing avoids back-propagation through time in models that lack hidden-to-hidden connections. It may still be applied to models that have hidden-to-hidden connections as long as they have connections from the output at one time step to values computed in the next time step.
• As soon as the hidden units become a function of earlier time steps, however, the BPTT
algorithm is necessary.
• Some models may thus be trained with both teacher forcing and BPTT.
Last Note
Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn. Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation. Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step’s output. These intermediate outputs are not necessarily factors of variation but can instead be analogous to counters or pointers that the network uses to organize its internal processing. Empirically, greater depth does seem to result in better generalization.
References
1. Understanding Machine Learning: From Theory to Algorithms (Shai Shalev-Shwartz and Shai Ben-David)
2. Deep Learning (Ian Goodfellow, Yoshua Bengio, and Aaron Courville)
3. “Machine Learning in Finance” course (www.coursera.org)
4. Advances in Financial Machine Learning (Marcos López de Prado)
Other examples: anomaly detection (fraud), suggestions on social media, Google News, learning someone’s taste
Another fancy example: speech synch
Knowledge Representation: representing information about the world in a form that a computer system can use to solve complex tasks, such as diagnosing a medical condition or having a dialog in a natural language. This field incorporates findings from psychology about how humans solve problems and represent knowledge, in order to design formalisms that make complex systems easier to design and build. It also incorporates findings from logic to automate various kinds of reasoning, such as the application of rules or the relations of sets and subsets.
(Knowledge-Based approach)
Automated Reasoning: the study of automated reasoning helps produce computer programs that allow computers to reason completely, or nearly completely, automatically. Although automated reasoning is considered a sub-field of artificial intelligence, it also has connections with theoretical computer science and even philosophy.
NLP: Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
You can't say to an "Applied AI" agent: go out and find out what you do on your own.
Example of sub-symbolic information:
Yann LeCun: The phrase "He took his bag and left the room", implies in particular that the person walked out of the room rather than, for instance jumped out of the window or teleported to another planet
Other, even more remote tasks include algorithmic theories of creativity, curiosity, and surprise, as pursued by Juergen Schmidhuber. One expects that AI will eventually be able to solve arbitrary intellectual tasks, which Ray Kurzweil, a well-known entrepreneur and futurologist, expects to happen around 2045.
Types are based on agent’s interaction with environment
Program synthesis is the task of automatically constructing a program that satisfies a given high-level specification. In contrast to other automatic programming techniques, the specifications are usually non-algorithmic statements of an appropriate logical calculus. Often, program synthesis employs techniques from formal verification.
Reinforcement learning example: one may try to learn a value function that describes, for each setting of a chess board, the degree by which White’s position is better than Black’s. Yet the only information available to the learner at training time is positions that occurred throughout actual chess games, labeled by who eventually won that game.
In the IRL setting, everything is the same as in direct reinforcement learning, but there is no information on the rewards received by the agent upon taking actions. Instead, we are simply given a sequence of states of the environment and actions by the agent, and we are asked what objective the agent pursued when performing these actions.
Demand Forecast: understand and predict customer demand to optimize supply decisions by corporate supply chain and business management.
Machine Translation example: Google Translate
As you can see, NNs are present in all types. By the Universal Approximation Theorem, a sufficiently large NN can approximate any continuous function on a compact domain to arbitrary accuracy.
Regression is the most commonly used algorithm in Finance
Asset management refers to systematic approach to the governance and realization of value from the things that a group or entity is responsible for, over their whole life cycles. It may apply both to tangible assets (physical objects such as buildings or equipment) and to intangible assets (such as human capital, intellectual property, goodwill and/or financial assets).
Quantitative Trading (Algorithmic Trading): Algorithmic trading is a method of executing a large order (too large to fill all at once) using automated pre-programmed trading instructions accounting for variables such as time, price, and volume
Most common uses
The reason that reinforcement learning has applications in perception tasks in finance:
In finance, expectations regarding the future are sometimes embedded in perception of today’s environment. If this future is influenced by actions of rational agents, RL might be an appropriate framework
(The current perception affects the future price)
Rational financial AI agents:
These agents learn to perceive the environment; that is to digest financial and sometimes non-financial data and perform certain actions to maximize some measures of performance
Interpretability is also important in sensitive (life-critical) or moral problems. For more information, see this:
Each pair in the training data S is generated by first sampling a point xi according to D and then labeling it by f.
Domain set is the set of objects that we may wish to label. For example, the set of all papayas.
It is important to note that we do not assume that the learner knows anything about the distribution D. We assume that there is some “correct” labeling function, f : X -> Y, and that yi = f(xi) for all i. This assumption can be relaxed.
= {1,…,m}
The area of the gray square in the picture is 2 and the area of the blue square is 1. Assume that the probability distribution D is such that instances are distributed uniformly within the gray square, and that the labeling function, f, determines the label to be 1 if the instance is within the inner blue square and 0 otherwise.
The first component reflects the quality of our prior knowledge
Choosing H to be a very rich class decreases the approximation error but at the same time might increase the estimation error, as a rich H might lead to overfitting. On the other hand, choosing H to be a very small set reduces the estimation error but might increase the approximation error or, in other words, might lead to underfitting.
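The two components discussed here can be written as a decomposition of the true error. The notation below is an assumption for illustration: L_D(h) denotes the true error of a predictor h under D, and h_S denotes the predictor returned by the learner on the sample S.

```latex
L_D(h_S) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}},
\qquad
\epsilon_{\mathrm{app}} = \min_{h \in H} L_D(h),
\qquad
\epsilon_{\mathrm{est}} = L_D(h_S) - \epsilon_{\mathrm{app}}
```

The approximation error depends only on the choice of H (our prior knowledge), not on the sample; the estimation error measures how far the learned predictor is from the best predictor in H.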
Bayesian probability is a special kind of prior knowledge. (prior knowledge about distribution)
Once we make no prior assumptions about the data-generating distribution, no algorithm can be guaranteed to find a predictor that is as good as the Bayes optimal one
Advantages of Representation Learning: better performance, and adaptation to new tasks with minimal human intervention
Factors: sources of influence… for example:
1) unobserved objects or unobserved forces in the physical world that affect observable quantities
2) constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data
Speech recording: the speaker’s age, sex, accent, and the words being spoken
Car image analysis: the position of the car, its color, and the angle and brightness of the sun
The individual pixels in an image of a red car might be very close to black at night. The shape of the car’s silhouette depends on the viewing angle
Suppose we have a vision system that can recognize cars, trucks, and birds, and these objects can each be red, green, or blue. One way of representing these inputs would be to have a separate neuron or hidden unit that activates for each of the nine possible combinations: red truck, red car, red bird, green truck, and so on. This requires nine different neurons, and each neuron must independently learn the concept of color and object identity. One way to improve on this situation is to use a distributed representation, with three neurons describing the color and three neurons describing the object identity. This requires only six neurons total instead of nine, and the neuron describing redness is able to learn about redness from images of cars, trucks and birds, not just from images of one specific category of objects.
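The two encodings in this example can be written out directly. This is only an illustrative sketch: the function names and the ordering of units are assumptions, and real hidden units would be learned rather than hand-assigned.

```python
COLORS = ["red", "green", "blue"]
OBJECTS = ["car", "truck", "bird"]

def one_unit_per_combination(color, obj):
    """Local code: one unit per (color, object) pair, 9 units in total."""
    units = [0] * (len(COLORS) * len(OBJECTS))
    units[COLORS.index(color) * len(OBJECTS) + OBJECTS.index(obj)] = 1
    return units

def distributed(color, obj):
    """Distributed code: 3 color units followed by 3 object units, 6 in total."""
    color_units = [int(c == color) for c in COLORS]
    object_units = [int(o == obj) for o in OBJECTS]
    return color_units + object_units

red_car = distributed("red", "car")
red_bird = distributed("red", "bird")

# The first unit ("red") is active for both inputs, so anything learned
# about redness from cars is reused for birds.
shares_red_unit = red_car[0] == red_bird[0] == 1

print(one_unit_per_combination("red", "car"))
print(red_car, red_bird, shares_red_unit)
```

In the nine-unit code, "red car" and "red bird" activate unrelated units, so nothing about redness is shared between them.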
Just as two equivalent computer programs will have different lengths depending on which language the program is written in, the same function may be drawn as a flowchart with different depths depending on which functions we allow to be used as individual steps in the flowchart.
For example, an AI system observing an image of a face with one eye in shadow may initially see only one eye. After detecting that a face is present, the system can then infer that a second eye is probably present as well. In this case, the graph of concepts includes only two layers (a layer for eyes and a layer for faces), but the graph of computations includes 2n layers if we refine our estimate of each concept given the other n times. There is no single correct value for the depth of an architecture, just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as “deep.”
While the kinds of neural networks used for machine learning have sometimes been used to understand brain function (Hinton and Shallice, 1991), they are generally not designed to be realistic models of biological function
The earliest predecessors of modern deep learning were simple linear models
One should not view deep learning as an attempt to simulate the brain. Modern deep learning draws inspiration from many fields, especially applied math fundamentals like linear algebra, probability, information theory, and numerical optimization
Larger networks are able to achieve higher accuracy on more complex tasks.
The number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.
We may also discuss prior beliefs as directly influencing the function itself and influencing the parameters only indirectly, as a result of the relationship between the parameters and the function. Additionally, we informally discuss prior beliefs as being expressed implicitly by choosing algorithms that are biased toward choosing some class of functions over another, even though these biases may not be expressed (or even be possible to express) in terms of a probability distribution representing our degree of belief in various functions.
In other words, if we know a good answer for an input x (for example, if x is a labeled training example), then that answer is probably good in the neighborhood of x.
Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representations is drawn from neuroscience
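The exponential growth can be made concrete with a small count. This sketch assumes every variable takes the same number of discrete values; the helper names and the figure of 1000 training examples are illustrative.

```python
def num_configurations(values_per_variable, num_variables):
    """Count distinct joint configurations of discrete variables."""
    return values_per_variable ** num_variables

def coverage(num_examples, values_per_variable, num_variables):
    """Upper bound on the fraction of configurations a training set can hit."""
    total = num_configurations(values_per_variable, num_variables)
    return min(1.0, num_examples / total)

# With 10 values per variable, 1000 examples can cover every configuration
# of 3 variables, but only a vanishing fraction for 10 variables.
for d in (1, 2, 3, 10):
    print(d, num_configurations(10, d), coverage(1000, 10, d))
```

With a fixed budget of examples, most configurations in high dimensions are never observed, which is why a learner needs prior assumptions to generalize to them.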
Depth: deep and shallow networks
Linear models, such as logistic regression and linear regression, are appealing because they can be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.
We can think of φ as providing a set of features describing x, or as providing a new representation for x.
The third approach can capture the benefit of the first approach by being highly generic: we do so by using a very broad family φ(x; θ). Deep learning can also capture the benefit of the second approach. Human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well
The only challenge is to fit the training set. By Occam’s Razor, we start with linear models.
It may be tempting to make f(1) linear as well. Unfortunately, if f(1) were linear, then the feedforward network as a whole would remain a linear function of its input.
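The collapse of stacked linear layers into a single linear map can be checked numerically. The sketch below uses arbitrary illustrative weights (W1, W2 and x are assumptions, not values from the slides) and omits biases for brevity.

```python
def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    """Multiply matrix A by matrix B (both as lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # first linear layer: 2 inputs -> 3 hidden
W2 = [[1.0, -2.0, 0.5]]                      # second linear layer: 3 hidden -> 1 output
x = [0.5, -1.5]

two_layers = matvec(W2, matvec(W1, x))       # apply the layers one after another
one_layer = matvec(matmul(W2, W1), x)        # apply the single merged matrix W2 W1

print(two_layers, one_layer)
```

Both computations give the same output, so the extra layer adds no representational power unless a nonlinearity is placed between the two matrices.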
Most neural networks obtain a nonlinear function by applying an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function
The bold numbers printed on the plot indicate the value that the learned function must output at each point.
If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function f from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form. (Universal Approximation Theorem)
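As a small illustration of "affine transformation followed by an activation function", the sketch below builds a one-hidden-layer network with ReLU units. It assumes XOR as the target function, and the weights are hand-picked for illustration rather than learned; the names `affine`, `relu`, and `network` are hypothetical.

```python
def relu(v):
    """Fixed nonlinearity applied elementwise: max(0, z)."""
    return [max(0.0, z) for z in v]

def affine(W, b, x):
    """Compute W x + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

# Hand-picked parameters (an assumption for illustration, not a training result).
W = [[1.0, 1.0], [1.0, 1.0]]   # hidden-layer weights
c = [0.0, -1.0]                # hidden-layer biases
w_out = [[1.0, -2.0]]          # output weights
b_out = [0.0]                  # output bias

def network(x):
    h = relu(affine(W, c, x))          # affine map, then activation
    return affine(w_out, b_out, h)[0]  # linear output layer

outputs = {x: network(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
print(outputs)
```

No purely linear model can produce these four outputs, but the ReLU between the two affine maps makes the function representable with only two hidden units.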
If we ask a machine learning model to read each sentence and extract the year in which the narrator went to Nepal, we would like it to recognize the year 2009 as the relevant piece of information, whether it appears in the sixth word or in the second word of the sentence. Suppose that we trained a feedforward network that processes sentences of fixed length. A traditional fully connected feedforward network would have separate parameters for each input feature, so it would need to learn all the rules of the language separately at each position in the sentence. By comparison, a recurrent neural network shares the same weights across several time steps.
The convolution operation allows a network to share parameters across time but is shallow. The output of convolution is a sequence where each member of the output is a function of a small number of neighboring members of the input. Recurrent networks share parameters in a different way (see the second bullet)
(Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Because of parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model. This model has no parameter sharing, so the parameter is used only once
Parameter sharing is a kind of prior knowledge.
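Parameter sharing in a convolution can be sketched in a few lines. The example below assumes a 1-D input, a 3-element kernel, and no padding; the input values and the kernel are arbitrary illustrations.

```python
def conv1d(x, kernel):
    """Slide the same kernel over every position of x (no padding)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

x = [1.0, 2.0, 4.0, 7.0, 11.0]
kernel = [-1.0, 0.0, 1.0]      # the same 3 parameters are reused at every location
y = conv1d(x, kernel)

conv_params = len(kernel)       # parameters in the convolutional model
dense_params = len(x) * len(y)  # a fully connected map of the same shape, no sharing

print(y, conv_params, dense_params)
```

Each output depends only on three neighboring inputs, and the parameter count does not grow with the sequence length, whereas the fully connected layer needs one weight per input-output pair.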
The time step index need not literally refer to the passage of time in the real world. Sometimes it refers only to the position in the sequence.
S(t): state of the system (dynamical system)
Each node represents the state at some time t, and the function f maps the state at t to the state at t + 1. The same parameters (the same value of θ used to parametrize f) are used for all time steps.
By unfolding, we avoid cycles in the graph
The RNN has input-to-hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V
Any function computable by a Turing machine can be computed by such a recurrent network of a finite size
The output can be read from the RNN after a number of time steps that is asymptotically linear in the number of time steps used by the Turing machine and asymptotically linear in the length of the input (Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995; Hyotyniemi, 1996). The functions computable by a Turing machine are discrete, so these results regard exact implementation of the function, not approximations.
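A forward pass through such a network can be sketched as follows. This is a minimal illustration under stated assumptions: the hidden activation is taken to be tanh, biases are omitted, and the small matrices U, W, V are arbitrary values chosen for the example.

```python
import math

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rnn_forward(xs, h0, U, W, V):
    """Apply the same U, W, V at every time step of the input sequence xs."""
    h, outputs = h0, []
    for x in xs:
        pre = [u + w for u, w in zip(matvec(U, x), matvec(W, h))]
        h = [math.tanh(z) for z in pre]   # new hidden state
        outputs.append(matvec(V, h))      # output o at this time step
    return outputs, h

U = [[0.5], [-0.5]]              # input-to-hidden (1 input, 2 hidden units)
W = [[0.1, 0.0], [0.0, 0.1]]     # hidden-to-hidden
V = [[1.0, 1.0]]                 # hidden-to-output (1 output)

outputs, h_final = rnn_forward([[1.0], [0.0], [2.0]], [0.0, 0.0], U, W, V)
print(outputs, h_final)
```

The loop body never changes with t: the same three matrices are reused at each step, which is what allows the network to process sequences of any length.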
A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the vector of unnormalized log probabilities.
Unless o is very high-dimensional and rich, it will usually lack important information from the past. This makes the RNN in this figure less powerful, but it may be easier to train because each time step can be trained in isolation from the others, allowing greater parallelization during training
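One common way to turn unnormalized log probabilities into a loss is the negative log-likelihood of the target class under a softmax. The sketch below assumes integer class targets and assumes the sequence loss is the sum of the per-step losses; the function names are illustrative.

```python
import math

def log_softmax(o):
    """Normalize unnormalized log probabilities o in a numerically stable way."""
    m = max(o)
    log_z = m + math.log(sum(math.exp(v - m) for v in o))
    return [v - log_z for v in o]

def nll(o, target_index):
    """Negative log-likelihood of the target class under softmax(o)."""
    return -log_softmax(o)[target_index]

def sequence_loss(outputs, targets):
    """Total loss over a sequence, assumed here to be the sum over time steps."""
    return sum(nll(o, y) for o, y in zip(outputs, targets))

outputs = [[2.0, 0.0, -1.0], [0.0, 0.0, 0.0]]   # one vector o per time step
targets = [0, 2]                                # index of the correct class per step
print(sequence_loss(outputs, targets))
```

Subtracting the maximum before exponentiating leaves the result unchanged but avoids overflow when some entries of o are large.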
Maximum likelihood thus specifies that during training, rather than feeding the model’s own output back into itself, these connections should be fed with the target values specifying what the correct output should be
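The difference between feeding back the model's own output and feeding back the target (often called teacher forcing) can be shown with a toy model. Everything in this sketch is hypothetical: `step` is a stand-in for one time step of a network whose previous output feeds into the next step, and the weights and data are arbitrary.

```python
def step(prev_output, x, w_back=0.5, w_in=1.0):
    """Hypothetical one-step model: next output from the fed-back value and input."""
    return w_back * prev_output + w_in * x

def run(xs, targets=None, o0=0.0):
    """With targets: feed the correct value back. Without: feed the model's own output."""
    prev, outputs = o0, []
    for t, x in enumerate(xs):
        o = step(prev, x)
        outputs.append(o)
        prev = targets[t] if targets is not None else o
    return outputs

xs = [1.0, 1.0, 1.0]
targets = [2.0, 2.0, 2.0]

free_running = run(xs)              # the model's own output is fed back
teacher_forced = run(xs, targets)   # the target value is fed back instead

print(free_running, teacher_forced)
```

With the targets fed back, each step's input no longer depends on the model's earlier predictions, so errors made early in the sequence do not propagate during training.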
Much as almost any function can be considered a feedforward neural network, essentially any function involving recurrence can be considered a recurrent neural network.