Machine Learning in Finance

  1. MACHINE LEARNING IN FINANCE CREATED BY: HAMED VAHEB FALL 2018
  2. Outline: 1. Fundamentals of Machine Learning 2. ML in Tech vs ML in Finance 3. Example: Bank Rating Prediction 4. Deep Learning and Neural Networks 5. Example: Neural Net Copula in Markowitz Problem
  3. What is ML?
  4. 1. Fundamentals of Machine Learning. Major AI Approaches: Logic and Rules-Based Approach • Hard-code knowledge about the world in formal languages • Top-down rules are created for computers • Computers reason about these rules automatically • Example: Project Cyc (Lenat and Guha, 1989)
  5. Major AI Approaches: Logic and Rules-Based Approach. Example within law (expert systems): TurboTax • Personal income tax laws • Represented as logical computer rules • Software computes tax liability
  6. Major AI Approaches: Machine Learning (Pattern-Based Approach) • Learning: the process of converting experience into expertise or knowledge • We wish to program “agents” so that they can “learn” from input data • ML is what computers use to learn about the outside world, much as humans use math and physics for the same purpose • Agent = Architecture + Algorithm • AI systems need the ability to acquire their own knowledge by extracting patterns from raw data
  7. Machine Learning in our daily life
  8. Example: Email Spam Filter
  9. Example: Email Spam Filter
  10. Example: AARON
  11. Formal Definition • Arthur Samuel (1959): field of study that gives computers the ability to learn without being explicitly programmed • Tom Mitchell (1998), well-posed learning problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E • Example: chess. T: playing chess; E: the agent playing against itself; P: number of wins / number of games
  12. Artificial Intelligence: studies “intelligent agents” that perceive their environment and perform different actions to solve tasks that involve mimicking cognitive functions of the human brain (Russell, Norvig). Goals of AI: • Knowledge Representation: ontology, i.e. the set of objects, relations, concepts • Taking Actions, Planning: acting while envisioning the future to achieve goals • Perception and Learning (ML): perception from sensors, learning from experience • Natural Language Processing: ability to read and understand human language • Automated Reasoning: mimicking human reasoning for logical deductions
  13. Present: Applied AI (perception (learning), actions; communication (NLP); knowledge/ontologies; reasoning, planning). Future: Artificial General Intelligence (AGI) (learns and acts autonomously; uses sub-symbolic information; algorithmic theory of cognitive acts; solves any intellectual task)
  14. Agent and Environment (perception, actions) • Perception tasks: there is a fixed action; perception comes via the physical world (through sensors) or digital data (read from a disk) • Action tasks: there are multiple possible actions; they involve planning and forecasting the future, and involve sub-tasks of learning for sequential (multi-step) problems. Actions can be fixed or can vary, and may or may not change the environment
  15. When do we need ML (instead of programming directly)? • Complexity 1. Tasks performed by animals/humans: we can’t extract a well-defined program (driving, speech recognition, image understanding) 2. Tasks beyond human capabilities: analysis of very large and complex datasets (astronomical and genomic data, turning medical archives into medical knowledge, weather prediction) • Adaptivity: programs that adapt to changes in the environment they interact with (handwritten text, spam detection, speech recognition)
  16. Types of learning • Supervised: the environment (teacher) “supervises” the learner by providing extra information (“labels”). We have training (seen) and test (unseen) data, and model p(y|x) • Unsupervised: come up with a summary or compressed version of the data, learn its probability distribution, cluster it (denoising, synthesis) • Reinforcement: intermediate between the two. There is a teacher, but it gives only partial feedback (a reward) for a sequence of actions (e.g. learning the value of each chess board position, self-driving)
  17. Supervised Learning: most common types
  18. Linear Regression Example: satisfaction rate of company employees. Training data: company employees have rated their satisfaction on a scale of 1 to 100. Predictor:
  19. Linear Regression Example: satisfaction rate of company employees. Let’s start with
  20. Linear Regression Example: satisfaction rate of company employees. Cost function: as we minimize J (using gradient descent), the fitted line gets better and better
  21. Linear Regression Example: satisfaction rate of company employees. Best line:
  22. Linear Regression Example: satisfaction rate of company employees. Minimization algorithm: gradient descent
  23. Linear Regression Example: satisfaction rate of company employees. Plot of J. In this case J is convex, so every local minimum is a global minimum
  24. Linear Regression Example: satisfaction rate of company employees. Contours of J
  25. Linear Regression Example: satisfaction rate of company employees. Iterations. For more visualizations: https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220
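The gradient-descent fit described on the linear regression slides can be sketched as follows. This is a minimal illustration rather than the deck's own code: the predictor form `theta0 + theta1 * x`, the mean-squared-error cost `J`, the learning rate, and the synthetic "satisfaction" data are all assumptions made for the example.

```python
import numpy as np

def fit_line_gd(x, y, lr=0.5, n_iters=1000):
    """Fit y ~ theta0 + theta1 * x by batch gradient descent.

    Assumed cost: J = (1 / 2m) * sum((prediction - y) ** 2).
    Returns the two parameters and the value of J at every iteration.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    history = []
    for _ in range(n_iters):
        residual = theta0 + theta1 * x - y
        history.append(float(residual @ residual) / (2 * m))
        # Partial derivatives of J with respect to theta0 and theta1
        grad0 = residual.sum() / m
        grad1 = (residual @ x) / m
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1, history

# Synthetic data (hypothetical): one feature rescaled to [0, 1]
# and a noisy satisfaction score on a 1-100 style scale.
rng = np.random.default_rng(0)
x_data = rng.uniform(0.0, 1.0, size=200)
y_data = 30.0 + 50.0 * x_data + rng.normal(0.0, 2.0, size=200)

theta0, theta1, history = fit_line_gd(x_data, y_data)
print(f"intercept={theta0:.2f}, slope={theta1:.2f}, final J={history[-1]:.3f}")
```

Because this cost is convex, the iterates approach the same line that a closed-form least-squares fit would give; the recorded `history` is what a "plot of J over iterations" would display.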
  26. Unsupervised Learning Example: Dimension Reduction
  27. Supervised vs Unsupervised: Classification vs Clustering
  28. Reinforcement
  29. Linear Regression for Stock Market
  30. Machine Learning Landscape (perception tasks) • Supervised learning: Regression (learn a regression function; given: input/output pairs); Classification (learn a class function; given: input/output pairs) • Unsupervised learning: Clustering (learn a class function; given: inputs only; k: the number of clusters); Representation learning (learn a representer function; given: inputs only)
  31. Machine Learning Landscape (action tasks) • Reinforcement learning: optimization of strategy for a task • IRL (inverse reinforcement learning): learn objectives from behavior
  32. Machine Learning Examples (perception tasks) • Supervised learning: Regression (demand forecast); Classification (spam detection, image recognition, document classification) • Unsupervised learning: Clustering (customer segmentation, anomaly detection); Representation learning (text recognition, machine translation)
  33. Machine Learning Examples (action tasks) • Reinforcement learning (optimization of strategy for a task): robotics, computational advertising • IRL (learn objectives from behavior): imitation learning for robotics
  34. Machine Learning Methods (perception tasks) • Supervised learning: Regression (linear regression, trees: CART, SVM/SVR, ensemble methods, neural networks); Classification (logistic regression, naive Bayes, nearest neighbors, SVM, decision trees, ensemble methods, neural networks) • Unsupervised learning: Clustering (k-means, hierarchical clustering, Gaussian mixture, hidden Markov models, neural networks); Representation learning (PCA, factor models, ICA, dimension reduction, manifold learning, neural networks)
  35. Machine Learning Methods (action tasks) • Reinforcement learning (optimization of strategy for a task): model-based RL, model-free RL, batch/online RL, RL with linear models, neural networks • IRL (learn objectives from behavior): model-based IRL, model-free IRL, batch/online IRL, MaxEnt IRL, neural networks
  36. Machine Learning in Finance (perception tasks) • Supervised learning: Regression (earning prediction, credit loss forecast, algorithmic trading); Classification (rating prediction, default modeling, credit card fraud, anti-money laundering) • Unsupervised learning: Clustering (customer segmentation, stock segmentation); Representation learning (factor modeling, de-noising, regime change detection)
  37. Machine Learning in Finance (action tasks) • Reinforcement learning (optimization of strategy for a task): trading strategies, asset management • IRL (learn objectives from behavior): reverse engineering of consumer behavior, trading strategies, …
  38. ML by Financial Application Areas (perception tasks): Banking, Asset Management, Customer segmentation, Loan defaults, Credit card defaults, Fraud detection, Anti-money laundering, Retail, P2P Lending, Commercial and Investment, Portfolio optimization, Representation Learning, Rating prediction, Default modeling, Client data mining, Recommender systems, Factor modeling, De-noising, Regime change detection, Stock segmentation, Multi-period portfolio optimization, Derivatives trading
  39. ML by Financial Application Areas (action tasks) • Quantitative trading: profit-maximizing trading execution, optimal trade execution, quantitative trading strategies, earning prediction, algorithmic trading, optimal market making
  40. 2. ML in Tech vs ML in Finance. ML in Tech • Perception (image recognition, NLP tasks, etc.). Methods: SL/UL • Action (computational advertising, robotics, self-driving cars, etc.). Methods: SL/UL/RL. Diagram: ML in Tech (image recognition, NLP tasks, computational advertising, robotics) vs ML in Finance (forecasting tasks, valuation tasks)
  41. ML in Finance, perception: forecasting tasks • Security price prediction (stocks, bonds, commodities, etc.). Methods: SL/UL • Prediction of corporate actors’ actions (dividends, mergers, defaults, etc.). Methods: SL/UL/RL • Prediction of individual actors’ actions (loan defaults, fraud, AML, etc.). Methods: SL/UL/RL
  42. ML in Finance, perception: valuation tasks • Asset valuation (stocks, futures, commodities, bonds, etc.). Related to forecasting. Methods: SL/UL • Derivatives valuation. Methods: SL/UL/RL
  43. Big data? ML in Tech: typically yes. ML in Finance: typically no. Data for ML in Tech are huge; most data for ML in Finance are medium-sized, except in HFT
  44. Stationary data? ML in Tech: typically yes. ML in Finance: typically no. Because most financial data are non-stationary, collecting more data, even when possible, is not always helpful
  45. Noise-to-signal ratio: ML in Tech: typically low. ML in Finance: typically high. Financial data are typically quite noisy; the “true” signals are unobservable
  46. Interpretability of results: ML in Tech: typically not important, or not the main focus. ML in Finance: typically either desired or required • Desired for trading • Required for regulation (General Data Protection Regulation, 2018)
  47. Action (RL) tasks: ML in Tech: low-dimensional state-action space, low uncertainty. ML in Finance: high-dimensional state-action space, high uncertainty • ML in Tech: the dimensionality of the state-action space is usually in the hundreds. The action space is often discrete (except in robotics). Uncertainty is low to moderate (think self-driving cars!) • ML in Finance: the dimensionality of the state-action space is often in the thousands. The action space is usually continuous. Uncertainty is low to high (think Brexit!)
  48. A Gentle Model (Statistical Learning Framework) • The learner’s input: domain set (features); label set (discrete or continuous); training data, also called the training set (seen) • The learner’s output: prediction function (hypothesis) • Data-generation model: probability distribution over the domain set • Measure of success: error of the predictor, loss function
  49. Types of Error • Training error: the error measured on the training set • Generalization error (test error): $L_{D,f}(h) \overset{\text{def}}{=} \mathbb{P}_{x \sim D}[h(x) \neq y]$, with $y = f(x)$ the true label • The ability to perform well on previously unobserved inputs is called generalization • What separates machine learning from optimization is that we want the generalization error to be low as well • We estimate the generalization error on a test set of examples collected separately from the training set
  50. Types of Error • We sample the training set, then use it to choose parameters that reduce the training error. Under this process, the expected test error is greater than or equal to the expected training error • The factors determining how well a machine learning algorithm will perform are its ability to 1. make the training error small (failing at this is underfitting) 2. make the gap between training and test error small (failing at this is overfitting)
  51. Papayas Example • No matter what the sample is, $L_S(h_S) = 0$ • $h_S$ predicts label 1 on only a finite number of instances, so $L_D(h_S) = 1/2$ • We have found a predictor whose performance on the training set is excellent, yet its performance on the true “world” is very poor
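A predictor that simply memorizes its training sample reproduces the behaviour on the Papayas slide: zero training error and roughly 1/2 error on fresh data. The sketch below is illustrative only; the square-within-a-square distribution, the labelling rule, and all function names are assumptions chosen so that half of the probability mass carries label 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_label(points):
    # Hypothetical labelling: 1 inside the inner square of area 1, else 0.
    return (np.abs(points).max(axis=1) <= 0.5).astype(int)

def sample(n):
    # Uniform draws from an outer square of area 2, so P(label = 1) = 1/2.
    half_side = np.sqrt(2.0) / 2.0
    return rng.uniform(-half_side, half_side, size=(n, 2))

def make_memorizer(train_x, train_y):
    # Returns the stored label for points seen in training, and 0 otherwise.
    table = {tuple(p): int(lbl) for p, lbl in zip(train_x, train_y)}
    def predict(points):
        return np.array([table.get(tuple(p), 0) for p in points])
    return predict

train_x = sample(500)
train_y = true_label(train_x)
h = make_memorizer(train_x, train_y)

test_x = sample(20000)
test_y = true_label(test_x)

train_error = np.mean(h(train_x) != train_y)  # empirical error on the sample
test_error = np.mean(h(test_x) != test_y)     # estimate of the true error
print(f"training error={train_error:.3f}, test error={test_error:.3f}")
```

On fresh points the memorizer almost surely outputs 0, so it is wrong exactly where the true label is 1, which is about half of the time under this assumed distribution.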
  52. Example
  53. Overfitting occurs when our hypothesis fits the training data “too well” (perhaps like the everyday experience that a person who provides a perfectly detailed explanation for every single one of his actions may raise suspicion). Altering Capacity • A model’s capacity is its ability to fit a wide variety of functions • Capacity is controlled by restricting the hypothesis class (size or complexity), VC dimension, techniques, program bits, … • Restricting to axis-aligned rectangles guards against overfitting, given a sufficiently large training sample • If H is a finite class, then $\mathrm{ERM}_H$ will not overfit, provided the training sample is sufficiently large
  54. Bias-Complexity Tradeoff: Error Decomposition • Approximation error: due to underfitting; the minimum risk achievable by a predictor in the hypothesis class; measures how much risk we incur because we restrict ourselves to a specific class (bias); depends on the chosen hypothesis class; reflects the quality of prior knowledge • Estimation error: due to overfitting; the difference between the approximation error and the error achieved by the learned predictor; exists because the training error is only an estimate of the generalization error; depends on the training set size and on the size or complexity of the hypothesis class
  55. Bias-Complexity Tradeoff
  56. Bias-Complexity Tradeoff. Figure axes: model capacity, data complexity
  57. Generalization • Design matrix • A model is trained using only a training set • A test set is used to estimate the algorithm’s ability to generalize, i.e. to perform well on unseen data
  58. Prior Knowledge • To generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn • The stronger the prior knowledge (or prior assumptions) that one starts the learning process with, the easier it is to learn from further examples. However, the stronger these prior assumptions are, the less flexible the learning is (it is bound, a priori, by the commitment to these assumptions) • Examples: restricting our hypothesis class (finiteness, VC dimension); assumptions on the distribution
  59. Prior Knowledge: Bait Shyness. The rats seem to have some “built in” prior knowledge telling them that, while temporal correlation between food and nausea can be causal, it is unlikely that there would be a causal relationship between food consumption and electrical shocks or between sounds and nausea
  60. Prior Knowledge: Pigeon Superstition
  61. ML vs Statistical Modeling
  62. 3. Bank Failures Example. FDIC • US-based commercial banks are regulated by the FDIC • The FDIC provides insurance for commercial banks and charges them an insurance premium according to an internal (non-public) rating based on the CAMELS supervisory system
  63. Importance
  64. CAMELS • Rating 1: best, rating 5: worst • A bank rated 4 or 5 is likely to be closed soon • Capital inadequacy is the most common cause of a bank closure (other reasons: violation of financial rules, management failures) • If the FDIC decides to close a bank, it takes over both its assets and its liabilities and then tries to sell the assets at the best possible price to pay off the liabilities • CAMELS ratings are not publicly known; however, Call Reports are available • In addition, the FDIC provides historical data for failed banks: https://www.fdic.gov/bank/individual/failed/
  65. Call Report • 28 schedules in total • Form FFIEC 031: for banks with both domestic (US) and foreign offices • Form FFIEC 041: for banks with domestic (US) offices only
  66. Call Report Content (Schedules)
  67. Call Report Content (Schedules)
  68. Correlation Matrix of Features. In this problem we want to distinguish failed (defaulter) banks from non-failed banks • NI: net income • log_TA: logarithm of total assets • TL: total loans • NPL: non-performing loans • Assessment Base: average consolidated assets minus tangible equity • …
  69. Defaulter by log_TA in training data
  70. Defaulter by log_TA in test data
  71. Bank Failures Example
  72. Logistic Regression used for classification
  73. Training
  74. Training
  75. Testing
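The train/test workflow of the bank failure example, with logistic regression as the classifier, can be sketched as below. Nothing here uses the deck's actual FDIC data or reproduces its results: the two standardized columns merely stand in for features such as log_TA and an NPL-based ratio, the "true" coefficients are invented to generate synthetic labels, and the plain gradient-descent fit is an assumed implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(X):
    # Prepend a column of ones so the first weight acts as the intercept.
    return np.hstack([np.ones((len(X), 1)), X])

def fit_logistic(X, y, lr=0.5, n_iters=2000):
    """Minimize the average log-loss by batch gradient descent."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    return sigmoid(add_bias(X) @ w)

# Synthetic stand-in data (hypothetical): column 0 plays the role of a
# size feature, column 1 the role of a bad-loan ratio; label 1 = failed.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))
assumed_w = np.array([-1.0, -1.5, 2.0])  # invented for the simulation
y = (rng.uniform(size=n) < sigmoid(add_bias(X) @ assumed_w)).astype(float)

# Hold out the last 30% as the unseen test set.
X_train, X_test = X[:700], X[700:]
y_train, y_test = y[:700], y[700:]

w = fit_logistic(X_train, y_train)
test_accuracy = np.mean((predict_proba(w, X_test) >= 0.5) == y_test)
print("fitted weights:", np.round(w, 2), "test accuracy:", round(test_accuracy, 3))
```

The signs of the fitted weights are what make this kind of model easy to interpret: in the simulated data a larger size feature lowers the failure probability and a larger bad-loan ratio raises it.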
  76. Deep Learning
  77. 4. Deep Learning and Neural Networks. Representation • The performance of simple machine learning algorithms depends heavily on the representation of the data they are given • Goal: separate the factors of variation • Problem: many factors of variation influence every single piece of data we are able to observe (e.g. a car image at night) • Most applications require us to disentangle the factors of variation and discard the ones we do not care about • Representation learning: use ML to discover not only the mapping from representation to output but also the representation itself • Quintessential example: the autoencoder, the combination of an encoder function, which converts the input data into a different representation, and a decoder function, which converts the new representation back into the original format
  78. Representation: Example
  79. Deep learning solves this problem by introducing representations that are expressed in terms of other, simpler representations (it builds complex concepts out of simpler concepts). Example
  80. Depth • Depth enables the computer to learn a multistep computer program • Layer: the state of the computer’s memory after executing another set of instructions in parallel • Networks with greater depth can execute more instructions in sequence (later instructions can refer back to the results of earlier instructions) • Measuring depth: 1. Depth of the computational graph: the number of sequential instructions (the length of the longest path through a flow chart) 2. Depth of the concepts graph, which describes how concepts are related to each other • The flowchart of the computations needed to compute the representation of each concept may be much deeper than the graph of the concepts themselves
  81. Figure: Depth = 3 vs Depth = 2
  82. Deep Learning and Neural Networks
  83. Deep Learning and Neural Networks
  84. History of DL • Dates back to the 1940s (it only appears to be new) • Different names: 1. 1940s to 1960s: Cybernetics 2. 1980s to 1990s: Connectionism 3. From 2006: Deep Learning 4. Artificial Neural Networks: learning algorithms modeled on biological learning (models of how learning happens or could happen in the brain) • Neural perspective on DL: 1. The brain provides a proof that intelligent behavior is possible 2. Reverse engineer the computational principles behind the brain • Today, neuroscience is regarded as an important source of inspiration for DL researchers, but it is no longer the predominant guide for the field, because to obtain a deep understanding of the actual algorithms used by the brain we would need to monitor the activity of (at the very least) thousands of interconnected neurons simultaneously • The basic idea of having many computational units that become intelligent only via their interactions with each other is inspired by the brain • Algorithms from the 1980s work quite well, but this was not apparent circa 2006 because they were too computationally costly
  85. History of DL • Increasing dataset sizes: some skill is required to get good performance from a DL algorithm. Fortunately, the amount of skill required decreases as the amount of training data increases. The age of “Big Data” has made ML much easier because the key burden of statistical estimation (generalizing to new data after observing only a small amount) has been considerably lightened • Increasing model sizes: animals become intelligent when many of their neurons work together. Larger networks are able to achieve higher accuracy on more complex tasks
  86. Challenges motivating DL • Curse of dimensionality: a statistical challenge arises because the number of possible configurations of x is much larger than the number of training examples
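The statement about configurations outnumbering training examples can be made concrete by counting grid regions. In this sketch (the grid resolution, sample size, and function name are illustrative assumptions) each axis is cut into 10 bins, so there are 10^d regions in d dimensions, and we measure what fraction of them contain at least one of 1000 uniformly drawn points.

```python
import numpy as np

def occupied_fraction(n_samples, n_dims, bins_per_dim=10, seed=0):
    """Fraction of grid cells in [0, 1)^n_dims hit by at least one sample."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_samples, n_dims))
    # Map every point to the integer index of the cell that contains it.
    cells = np.floor(points * bins_per_dim).astype(int)
    cells = np.minimum(cells, bins_per_dim - 1)
    n_occupied = len(np.unique(cells, axis=0))
    return n_occupied / bins_per_dim ** n_dims

fractions = {d: occupied_fraction(1000, d) for d in (1, 2, 3, 6)}
for d, frac in fractions.items():
    print(f"d={d}: {10 ** d} regions, fraction occupied = {frac:.4f}")
```

With a fixed budget of samples, coverage collapses as the dimension grows: in six dimensions at most 1000 of the million regions can contain any data at all, so most regions must be handled by generalization rather than by nearby examples.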
  87. Challenges motivating DL • Local constancy and smoothness: among the most widely used implicit “priors” is the smoothness prior, or local constancy prior. It states that the function we learn should not change very much within a small region. Much of the modern motivation for deep learning is derived from studying the limitations of local template matching and how deep models are able to succeed in cases where local template matching fails (Bengio et al., 2006b). Demo: www.playground.tensorflow.org
  88. Neural Networks: Feedforward Neural Network (MLP) • Goal: approximate some function $f^*$ with some $f(x; \theta)$ • Feedforward: information flows through the function with no feedback connections • Neural: loosely inspired by neuroscience • Network: composing together many different functions, $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$ ($f^{(i)}$ is the $i$-th layer and the final layer is the output layer) • Depth: overall length of the chain • Width: dimensionality of the hidden layers • Hidden layer: the training data does not show the desired output for each of these layers • During NN training, we drive $f(x)$ to match $f^*(x)$ • Each hidden layer is vector valued
  89. Feedforward Neural Network (MLP). Figure: depth (the chain $f^{(1)}$, $f^{(2)}$, $f^{(3)}$) and width
  90. Feedforward Neural Network (MLP): MLP as a kernel technique • Extend linear models to represent nonlinear functions of $x$ by applying the linear model not to $x$ but to a transformed input $\phi(x)$ • How to choose $\phi$: 1. Generic $\phi$: infinite-dimensional (based on the RBF kernel). Enough capacity but poor generalization 2. Manually engineer $\phi$: requires decades of human effort for each separate task 3. Learn $\phi$: this is an example of a deep feedforward network, with $\phi$ defining a hidden layer • The advantage of the third approach is that the human designer only needs to find the right general function family rather than precisely the right function
  91. Feedforward Neural Network (MLP). Example: Learning XOR • After solving: and , where and • Most neural networks establish a nonlinear function by using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function. or , where
  92. When $x_1 = 0$, the model’s output must increase as $x_2$ increases. When $x_1 = 1$, the model’s output must decrease as $x_2$ increases
  93. Deep Learning and Neural Networks
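The XOR example can be checked numerically. The sketch below first fits a purely linear model by least squares, then evaluates a one-hidden-layer ReLU network. The weights `W`, `c`, `w`, `b` are the hand-picked solution given in the Deep Learning textbook listed in the references (Goodfellow, Bengio, and Courville); they are quoted here as an illustration and are not recovered from the slide, whose formulas did not survive extraction.

```python
import numpy as np

# The four XOR inputs and their targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# 1) Linear model fitted by least squares: it cannot represent XOR.
A = np.hstack([np.ones((4, 1)), X])            # columns: bias, x1, x2
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_pred = A @ coef

# 2) One hidden layer: affine transformation followed by a ReLU activation.
W = np.array([[1.0, 1.0], [1.0, 1.0]])         # hidden-layer weights
c = np.array([0.0, -1.0])                      # hidden-layer biases
w = np.array([1.0, -2.0])                      # output weights
b = 0.0                                        # output bias

def xor_net(inputs):
    hidden = np.maximum(0.0, inputs @ W + c)   # h = ReLU(W^T x + c)
    return hidden @ w + b                      # linear output layer

outputs = xor_net(X)
print("linear model:", np.round(linear_pred, 3))
print("ReLU network:", outputs)
```

The hidden layer maps the inputs (0, 1) and (1, 0) to the same hidden vector, which is what lets a final linear layer separate the two classes.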
  94. Recurrent Neural Network (RNN) • For processing a sequence of values $x^{(1)}, \dots, x^{(\tau)}$ ($\tau$ can be variable) • Parameter sharing: using the same parameter for more than one function in a model (tied weights). If we had separate parameters for each value of the time index, we could not generalize to sequence lengths not seen during training, nor share statistical strength across different sequence lengths and across different positions in time. Such sharing is particularly important when a specific piece of information can occur at multiple positions within the sequence (“I went to Nepal in 2009” and “In 2009, I went to Nepal”) • Each member of the output is a function of the previous members of the output, and is produced using the same update rule applied to the previous outputs • RNNs include cycles that represent the influence of the present value of a variable on its own value at a future time step • Any function involving recurrence can be considered a recurrent neural network
  95. Recurrent Neural Network (RNN): Parameter Sharing
  96. Recurrent Neural Network (RNN): Unfolding Computational Graphs. The unfolding process introduces two major advantages: 1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of the transition from one state to another rather than in terms of a variable-length history of states 2. It is possible to use the same transition function f with the same parameters at every time step
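The two advantages of unfolding can be seen in a few lines of code: one transition function with one set of parameters is applied at every time step, so the same model handles sequences of any length. This is a minimal sketch; the tanh transition, the layer sizes, and the random weights are assumptions for illustration, and no training is performed.

```python
import numpy as np

def rnn_final_state(xs, W_hh, W_xh, b, h0=None):
    """Apply h_t = tanh(W_hh h_{t-1} + W_xh x_t + b) along a sequence.

    The same parameters (W_hh, W_xh, b) are reused at every time step.
    """
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b)
    return h

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

short_seq = rng.normal(size=(5, input_size))    # 5 time steps
long_seq = rng.normal(size=(12, input_size))    # 12 time steps

h_short = rnn_final_state(short_seq, W_hh, W_xh, b)
h_long = rnn_final_state(long_seq, W_hh, W_xh, b)
print(h_short.shape, h_long.shape)  # same state size for both lengths
```

Because the model is defined by a state-to-state transition, processing the first five steps and then continuing from that state gives the same result as processing all twelve steps in one pass.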
  97. Recurrent Neural Network (RNN): Some types of RNNs • I. Produce an output at each time step and have recurrent connections between hidden units • II. Produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step • III. Have recurrent connections between hidden units, read an entire sequence, and then produce a single output • The network with recurrent connections only from the output at one time step to the hidden units at the next time step is strictly less powerful because it lacks hidden-to-hidden recurrent connections. For example, it cannot simulate a universal Turing machine. It requires that the output units capture all the information about the past that the network will use to predict the future
  98. Recurrent Neural Network (RNN): type I
  99. Recurrent Neural Network (RNN): type II
  100. Deep Learning and Neural Networks
  101. Recurrent Neural Network (RNN): type III
  102. Recurrent Neural Network (RNN): Teacher Forcing • A procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output $y^{(t)}$ as input at time $t+1$: $\log p(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)}) = \log p(y^{(2)} \mid y^{(1)}, x^{(1)}, x^{(2)}) + \log p(y^{(1)} \mid x^{(1)}, x^{(2)})$ • Teacher forcing avoids back-propagation through time in models that lack hidden-to-hidden connections. It may still be applied to models that have hidden-to-hidden connections as long as they have connections from the output at one time step to values computed in the next time step • As soon as the hidden units become a function of earlier time steps, however, the BPTT algorithm is necessary • Some models may thus be trained with both teacher forcing and BPTT
  103. Deep Learning and Neural Networks
  104. Last Note • Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn • Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation • Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step’s output. These intermediate outputs are not necessarily factors of variation but can instead be analogous to counters or pointers that the network uses to organize its internal processing • Empirically, greater depth does seem to result in better generalization
  105. References 1. Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms 2. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning 3. “Machine Learning in Finance” course (www.coursera.org) 4. Marcos López de Prado, Advances in Financial Machine Learning

Editor's notes

  1. http://aigamedev.com/open/article/top-down-vs-bottom-up-design/
  2. Other examples: anomaly detection (fraud), suggestions on social media, Google News, learning someone’s taste
  3. Another fancy example: speech synch
  4. Knowledge Representation: representing information about the world in a form that a computer system can utilize to solve complex tasks such as diagnosing a medical condition or having a dialog in a natural language. This field incorporates findings from psychology[1] about how humans solve problems and represent knowledge in order to design formalisms that will make complex systems easier to design and build. Also incorporates findings from logic to automate various kinds of reasoning, such as the application of rules or the relations of sets and subsets. (Knowledge-Based approach) Automated Reasoning: he study of automated reasoning helps produce computer programs that allow computers to reason completely, or nearly completely, automatically. Although automated reasoning is considered a sub-field of artificial intelligence, it also has connections with theoretical computer science, and even philosophy. NLP: Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
  5. You can't say to an "Applied AI" agent: go out and find out what to do on your own. Example of sub-symbolic information (Yann LeCun): the phrase "He took his bag and left the room" implies in particular that the person walked out of the room rather than, for instance, jumping out of the window or teleporting to another planet. Other, even more remote tasks include algorithmic theories of creativity, curiosity, and surprise, as pursued by Juergen Schmidhuber. One expects that AI will eventually be able to solve arbitrary intellectual tasks; Ray Kurzweil, a well-known entrepreneur and futurologist, expects this around 2045.
  6. Types are based on the agent’s interaction with the environment. Program synthesis is the task of automatically constructing a program that satisfies a given high-level specification. In contrast to other automatic programming techniques, the specifications are usually non-algorithmic statements in an appropriate logical calculus. Often, program synthesis employs techniques from formal verification. Reinforcement learning example: one may try to learn a value function that describes, for each setting of a chess board, the degree to which White’s position is better than Black’s. Yet the only information available to the learner at training time is the positions that occurred throughout actual chess games, labeled by who eventually won each game.
  7. In the inverse reinforcement learning (IRL) setting, everything is the same as in direct reinforcement learning, except that there is no information on the rewards received by the agent upon taking actions. Instead, we are simply given a sequence of states of the environment and actions taken by the agent, and we are asked what objective the agent pursued when performing these actions.
  8. Demand forecasting: understanding and predicting customer demand to optimize supply decisions in corporate supply chain and business management. Machine translation example: Google Translate.
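  9. As you can see, NNs are present in all types. By the Universal Approximation Theorem, a sufficiently large NN can approximate any function from a broad class (for example, continuous functions on a closed and bounded domain) to arbitrary accuracy.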
  10. Regression is the most commonly used algorithm in Finance
  11. Asset management refers to a systematic approach to the governance and realization of value from the things that a group or entity is responsible for, over their whole life cycles. It may apply both to tangible assets (physical objects such as buildings or equipment) and to intangible assets (such as human capital, intellectual property, goodwill, and/or financial assets).
  12. Quantitative trading (algorithmic trading): a method of executing a large order (too large to fill all at once) using automated, pre-programmed trading instructions that account for variables such as time, price, and volume.
  13. Most common uses
  14. Why reinforcement learning applies to perception tasks in finance: in finance, expectations regarding the future are sometimes embedded in the perception of today’s environment. If this future is influenced by the actions of rational agents, RL might be an appropriate framework (today’s perception affects the future price). Rational financial AI agents: these agents learn to perceive the environment, that is, to digest financial and sometimes non-financial data, and perform certain actions to maximize some measure of performance.
  15. Interpretability is also important in sensitive (life-critical) or moral problems. For more information, see this:
  16. The domain set X is the set of objects that we may wish to label, for example the set of all papayas. We assume that there is some “correct” labeling function f : X → Y and that y_i = f(x_i) for all i; this assumption can be relaxed. Each pair in the training data S is generated by first sampling a point x_i according to D and then labeling it by f. It is important to note that we do not assume that the learner knows anything about the distribution D.
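The data-generation model in this note can be sketched in a few lines. This is only an illustration: the one-dimensional domain, the uniform distribution standing in for D, and the threshold labeling function `true_label` are all assumptions made for the example, not part of the slides.

```python
import random


def true_label(x):
    """Hypothetical 'correct' labeling function f: X -> Y (assumed for illustration)."""
    return 1 if x >= 0.5 else 0


def sample_training_set(m, rng):
    """Draw S = ((x_1, y_1), ..., (x_m, y_m)): sample each x_i from D, then label it with f."""
    sample = []
    for _ in range(m):
        x = rng.random()  # D is assumed uniform on [0, 1); the learner never sees D directly
        sample.append((x, true_label(x)))
    return sample


rng = random.Random(0)
S = sample_training_set(5, rng)
for x, y in S:
    print(f"x = {x:.3f}, y = {y}")
```

The learner only ever receives `S`; both the distribution and `true_label` stay hidden from it.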
  17. [m] = {1, …, m}
  18. The area of the gray square in the picture is 2 and the area of the blue square is 1. Assume that the probability distribution D is such that instances are distributed uniformly within the gray square and the labeling function, f, determines the label to be 1 if the instance is within the inner blue square, and 0 otherwise.
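A small simulation can make the point of this example concrete. The sketch below assumes both squares are centered at the origin (the picture is not available here) and uses a predictor that memorizes the training labels and outputs 0 on any unseen point; that predictor is an assumption chosen to illustrate overfitting.

```python
import math
import random

SIDE_GRAY = math.sqrt(2.0)  # gray square of area 2, assumed centered at the origin
SIDE_BLUE = 1.0             # inner blue square of area 1, same center


def sample_point(rng):
    """Draw an instance uniformly from the gray square (the distribution D)."""
    half = SIDE_GRAY / 2
    return (rng.uniform(-half, half), rng.uniform(-half, half))


def f(point):
    """Labeling function: 1 inside the blue square, 0 otherwise."""
    half = SIDE_BLUE / 2
    return 1 if abs(point[0]) <= half and abs(point[1]) <= half else 0


def memorizing_predictor(training_set):
    """Return a predictor that repeats memorized labels and outputs 0 on unseen points."""
    memory = dict(training_set)
    return lambda point: memory.get(point, 0)


def error(h, labeled_points):
    """Fraction of points on which h disagrees with the given label."""
    return sum(h(x) != y for x, y in labeled_points) / len(labeled_points)


rng = random.Random(0)
train = [(p, f(p)) for p in (sample_point(rng) for _ in range(200))]
fresh = [(p, f(p)) for p in (sample_point(rng) for _ in range(20000))]

h = memorizing_predictor(train)
train_error = error(h, train)
true_error_estimate = error(h, fresh)  # Monte Carlo estimate of the true error
print(train_error, round(true_error_estimate, 3))
```

The training error is zero, while the estimated true error is close to 1/2, the ratio of the blue area to the gray area.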
  19. The first component (the approximation error) reflects the quality of our prior knowledge. Choosing H to be a very rich class decreases the approximation error but at the same time might increase the estimation error, as a rich H might lead to overfitting. On the other hand, choosing H to be a very small set reduces the estimation error but might increase the approximation error or, in other words, might lead to underfitting.
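The two components discussed in this note can be written out explicitly. The notation below is an assumption borrowed from reference 1: L_D denotes the true error under D, and h_S is the predictor the learner returns on the sample S.

```latex
L_{\mathcal{D}}(h_S) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}},
\qquad
\epsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h),
\qquad
\epsilon_{\mathrm{est}} = L_{\mathcal{D}}(h_S) - \epsilon_{\mathrm{app}}
```

Enlarging H can only lower the approximation term, since the minimum is taken over a bigger set, while the estimation term depends on the sample size and on how rich H is.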
  20. Bayesian probability is a special kind of prior knowledge (prior knowledge about the distribution). Once we make no prior assumptions about the data-generating distribution, no algorithm can be guaranteed to find a predictor that is as good as the Bayes optimal one.
  21. Advantages of representation learning: better performance and adaptation to new tasks with minimal human intervention. Factors are sources of influence, for example: 1) unobserved objects or unobserved forces in the physical world that affect observable quantities; 2) constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data. Examples: for a speech recording, the speaker’s age, sex, accent, and the words being spoken; for an image of a car, the position of the car, its color, and the angle and brightness of the sun. The individual pixels in an image of a red car might be very close to black at night, and the shape of the car’s silhouette depends on the viewing angle.
  22. Suppose we have a vision system that can recognize cars, trucks, and birds, and these objects can each be red, green, or blue. One way of representing these inputs would be to have a separate neuron or hidden unit that activates for each of the nine possible combinations: red truck, red car, red bird, green truck, and so on. This requires nine different neurons, and each neuron must independently learn the concept of color and object identity. One way to improve on this situation is to use a distributed representation, with three neurons describing the color and three neurons describing the object identity. This requires only six neurons total instead of nine, and the neuron describing redness is able to learn about redness from images of cars, trucks and birds, not just from images of one specific category of objects.
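The nine-versus-six comparison in this note can be sketched as two encodings. The function names and the ordering of the units are assumptions made for illustration.

```python
from itertools import product

COLORS = ["red", "green", "blue"]
OBJECTS = ["car", "truck", "bird"]


def one_unit_per_combination(color, obj):
    """Nine units: exactly one fires for each (color, object) pair."""
    combos = list(product(COLORS, OBJECTS))
    return [1 if combo == (color, obj) else 0 for combo in combos]


def distributed(color, obj):
    """Six units: three describe the color, three describe the object identity."""
    color_units = [1 if c == color else 0 for c in COLORS]
    object_units = [1 if o == obj else 0 for o in OBJECTS]
    return color_units + object_units


print(one_unit_per_combination("red", "truck"))  # [0, 1, 0, 0, 0, 0, 0, 0, 0]
print(distributed("red", "truck"))               # [1, 0, 0, 0, 1, 0]
print(distributed("red", "bird"))                # [1, 0, 0, 0, 0, 1]
```

In the distributed encoding the first unit is active for both the red truck and the red bird, which is what lets a "redness" unit learn from every object category.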
  23. Just as two equivalent computer programs will have different lengths depending on which language the program is written in, the same function may be drawn as a flowchart with different depths depending on which functions we allow to be used as individual steps in the flowchart. For example, an AI system observing an image of a face with one eye in shadow may initially see only one eye. After detecting that a face is present, the system can then infer that a second eye is probably present as well. In this case, the graph of concepts includes only two layers (a layer for eyes and a layer for faces), but the graph of computations includes 2n layers if we refine our estimate of each concept given the other n times. There is no single correct value for the depth of an architecture, just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as “deep.”
  24. While the kinds of neural networks used for machine learning have sometimes been used to understand brain function (Hinton and Shallice, 1991), they are generally not designed to be realistic models of biological function. The earliest predecessors of modern deep learning were simple linear models. One should not view deep learning as an attempt to simulate the brain. Modern deep learning draws inspiration from many fields, especially applied math fundamentals like linear algebra, probability, information theory, and numerical optimization.
  25. Larger networks are able to achieve higher accuracy on more complex tasks. The number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.
  26. We may also discuss prior beliefs as directly influencing the function itself and influencing the parameters only indirectly, as a result of the relationship between the parameters and the function. Additionally, we informally discuss prior beliefs as being expressed implicitly by choosing algorithms that are biased toward choosing some class of functions over another, even though these biases may not be expressed (or even be possible to express) in terms of a probability distribution representing our degree of belief in various functions. In other words, if we know a good answer for an input x (for example, if x is a labeled training example), then that answer is probably good in the neighborhood of x.
  27. Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representations is drawn from neuroscience.
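The two views of a layer in this note can be sketched as follows. The tanh activation and the parameter values are assumptions chosen for illustration; the slides do not fix either.

```python
import math


def unit(weights, bias, x):
    """One unit: a vector-to-scalar function (weighted sum of inputs, then tanh)."""
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)) + bias)


def layer(weight_rows, biases, x):
    """One layer: a vector-to-vector function built from units that act independently."""
    return [unit(w, b, x) for w, b in zip(weight_rows, biases)]


# Hypothetical parameters: three units, each reading the same 2-dimensional input.
W = [[0.5, -1.0], [1.0, 1.0], [0.0, 2.0]]
b = [0.0, -0.5, 0.1]
x = [1.0, 2.0]

h = layer(W, b, x)
print(h)
```

Each entry of `h` depends only on its own row of `W` and its own bias, so the units could be evaluated in parallel.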
  28. Depth: deep and shallow networks
  29. Linear models, such as logistic regression and linear regression, are appealing because they can be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables. We can think of φ as providing a set of features describing x, or as providing a new representation for x. The third approach can capture the benefit of the first approach by being highly generic: we do so by using a very broad family φ(x; θ). Deep learning can also capture the benefit of the second approach: human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well.
  30. The only challenge is to fit the training set. By Occam’s Razor, we start with linear models. It may be tempting to make f^(1) linear as well. Unfortunately, if f^(1) were linear, then the feedforward network as a whole would remain a linear function of its input. Most neural networks obtain a nonlinear function by using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function.
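The claim that a linear first layer leaves the whole network linear can be checked numerically. In this sketch the weight values are hypothetical, biases are left out for brevity (with biases the composition is affine and the same conclusion holds), and ReLU is assumed as the activation function.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Fixed nonlinear activation applied element-wise."""
    return [max(0.0, x) for x in v]


W1 = [[1.0, -1.0], [2.0, 0.5]]  # first layer (hypothetical weights)
W2 = [[0.5, 1.0]]               # output layer (hypothetical weights)
x = [-1.0, 2.0]
neg_x = [-xi for xi in x]

# Two stacked linear layers collapse into the single linear map W2 @ W1.
stacked = matvec(W2, matvec(W1, x))
collapsed = matvec(matmul(W2, W1), x)


def g(v):
    """Same two layers with a ReLU in between."""
    return matvec(W2, relu(matvec(W1, v)))[0]


# A linear map satisfies g(x) + g(-x) = 0; the ReLU network does not.
symmetry_gap = g(x) + g(neg_x)
print(stacked, collapsed, symmetry_gap)
```

The first two results coincide, while the non-zero `symmetry_gap` shows that inserting the activation function produces a function that is no longer linear.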
  31. The bold numbers printed on the plot indicate the value that the learned function must output at each point.
  32. If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function f from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form. (Universal Approximation Theorem)
  33. If we ask a machine learning model to read each sentence and extract the year in which the narrator went to Nepal, we would like it to recognize the year 2009 as the relevant piece of information, whether it appears in the sixth word or in the second word of the sentence. Suppose that we trained a feedforward network that processes sentences of fixed length. A traditional fully connected feedforward network would have separate parameters for each input feature, so it would need to learn all the rules of the language separately at each position in the sentence. By comparison, a recurrent neural network shares the same weights across several time steps. The convolution operation allows a network to share parameters across time but is shallow. The output of convolution is a sequence where each member of the output is a function of a small number of neighboring members of the input. Recurrent networks share parameters in a different way (second dot)
  34. (Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Because of parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model. This model has no parameter sharing, so the parameter is used only once. Parameter sharing is a kind of prior knowledge.
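The contrast between the two panels can be sketched with a 3-element kernel and a small fully connected layer. Zero padding, the input values, and the kernel values are assumptions made for illustration.

```python
def conv1d(inputs, kernel):
    """Same-size 1-D convolution with a 3-element kernel and zero padding."""
    padded = [0.0] + list(inputs) + [0.0]
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(inputs))
    ]


def fully_connected(inputs, weights):
    """Dense layer: every output has its own weight for every input."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.0, 2.0, 0.0]  # only the central element is non-zero

# The single central parameter is reused at all five input locations.
conv_out = conv1d(inputs, kernel)

# A dense layer needs one separate weight per (output, input) pair to do the same.
n = len(inputs)
dense_weights = [[2.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
dense_out = fully_connected(inputs, dense_weights)

conv_params = len(kernel)  # 3 shared parameters
dense_params = n * n       # 25 parameters, each used once
print(conv_out, dense_out, conv_params, dense_params)
```

Both layers compute the same output here, but the convolution does so with 3 shared parameters while the dense layer stores 25 independent ones.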
  35. The time step index need not literally refer to the passage of time in the real world; sometimes it refers only to the position in the sequence. s(t): the state of the (dynamical) system. Each node represents the state at some time t, and the function f maps the state at t to the state at t + 1. The same parameters (the same value of θ used to parametrize f) are used for all time steps. By unfolding, we avoid cycles in the graph.
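As a worked instance of the unfolding described in this note, the recurrence can be expanded for three time steps. The superscript notation is an assumption chosen to match the description of s, f, and θ.

```latex
s^{(t+1)} = f\left(s^{(t)}; \theta\right)
\quad\Longrightarrow\quad
s^{(3)} = f\left(s^{(2)}; \theta\right) = f\left(f\left(s^{(1)}; \theta\right); \theta\right)
```

The expanded expression no longer refers to itself, so it can be drawn as an acyclic graph in which the same θ appears at every step.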
  36. The RNN has input-to-hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V. Any function computable by a Turing machine can be computed by such a recurrent network of finite size. The output can be read from the RNN after a number of time steps that is asymptotically linear in the number of time steps used by the Turing machine and asymptotically linear in the length of the input (Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995; Hyotyniemi, 1996). The functions computable by a Turing machine are discrete, so these results regard exact implementation of the function, not approximations.
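The roles of U, W, and V can be sketched as a forward pass. This is a minimal illustration under stated assumptions: tanh hidden units, no bias terms, and small hypothetical weight values.

```python
import math


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_forward(xs, U, W, V, h0):
    """Run a simple RNN over a sequence, reusing U, W, and V at every time step."""
    h = h0
    outputs = []
    for x in xs:
        # Input-to-hidden (U) plus hidden-to-hidden (W), followed by tanh.
        pre_activation = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [math.tanh(a) for a in pre_activation]
        # Hidden-to-output (V).
        outputs.append(matvec(V, h))
    return outputs, h


# Hypothetical sizes: 2 inputs, 2 hidden units, 1 output.
U = [[0.5, 0.0], [0.0, 0.5]]
W = [[0.1, 0.2], [0.0, 0.1]]
V = [[1.0, -1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, h_final = rnn_forward(xs, U, W, V, h0=[0.0, 0.0])
print(outputs)
```

The loop body is identical at every step, which is the parameter sharing across time that distinguishes the recurrent network from a feedforward one.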
  37. A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities. Unless o is very high-dimensional and rich, it will usually lack important information from the past. This makes the RNN in this figure less powerful, but it may be easier to train because each time step can be trained in isolation from the others, allowing greater parallelization during training
  38. Maximum likelihood thus specifies that during training, rather than feeding the model’s own output back into itself, these connections should be fed with the target values specifying what the correct output should be (teacher forcing).
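The difference between feeding back the target and feeding back the model's own output can be sketched with a toy scalar model. The model form, the weights `w_in` and `w_back`, and the data are all hypothetical; only the choice of what gets fed back reflects the note.

```python
import math


def step(x, previous_output, w_in, w_back):
    """One step of a toy model whose only recurrence runs from output to the next step."""
    return math.tanh(w_in * x + w_back * previous_output)


def run(xs, targets, w_in, w_back, teacher_forcing):
    """Unroll the model; with teacher forcing, the correct target is fed back."""
    outputs = []
    fed_back = 0.0
    for x, y in zip(xs, targets):
        o = step(x, fed_back, w_in, w_back)
        outputs.append(o)
        # Training with teacher forcing uses the target y; free running uses o.
        fed_back = y if teacher_forcing else o
    return outputs


xs = [0.5, -0.2, 0.1]
targets = [1.0, 0.0, 1.0]

forced = run(xs, targets, w_in=0.8, w_back=0.5, teacher_forcing=True)
free = run(xs, targets, w_in=0.8, w_back=0.5, teacher_forcing=False)
print(forced)
print(free)
```

With teacher forcing, each step's input no longer depends on the model's earlier outputs, which is why the time steps can be trained independently of one another.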
  39. Much as almost any function can be considered a feedforward neural network, essentially any function involving recurrence can be considered a recurrent neural network.