Surveillance Platform for Bank Compliance
Mayur Thakur, Goldman Sachs

Mayur Thakur, Managing Director, Goldman Sachs, at MLconf NYC 2017

655 views


Mayur is head of the Data Analytics Group in the Global Compliance Division. He joined Goldman Sachs as a managing director in 2014.
Prior to joining the firm, Mayur worked at Google, where he designed search algorithms for more than seven years. Previously, he was an assistant professor of computer science at the University of Missouri.
Mayur earned a PhD in Computer Science from the University of Rochester in 2004 and a BTech in Computer Science and Engineering from the Indian Institute of Technology, Delhi, in 1999.

Abstract Summary:

Surveillance platforms for bank compliance
Bank compliance uses models to look for outlier events such as insider trading, spoofing, front running, etc. With the exponential increase in the size of the data and a growing need to use such models, a key question is: How do we scale these models so they run efficiently and at the same time detect outlier events with good precision and recall?

In this talk, we will describe our experience building, from scratch, a Hadoop-based platform for surveillance.

Published in: Technology

  1. Surveillance Platform for Bank Compliance
     Mayur Thakur, Goldman Sachs
  2. Stakes Are High
     • “Banks pay out £166bn over six years: a history of banking misdeeds and fines” – The Guardian
     • “Banks 'pay 60%' of profits in fines and customer payments” – BBC News
     • “Deutsche Bank to Pay $2.5 Billion to Settle Libor Investigation” – The Wall Street Journal
     • “$1.2 Billion Fine for Hedge Fund SAC Capital in Insider Case” – NY Times
  3. Key Technical Challenges
     • Diverse data sets and formats (SQL, flat files, proprietary, etc.)
     • Size of data, updated frequently
       – ~1B* pieces of text per year
       – ~1B edges in a graph
       – 100s of millions of trading events in a day
     • Data from the past can change (e.g., manual trade correction)
       – Causes a cascade of changes
     • Surveillance decisions need to be debuggable
       – Why was trade X on Oct 25, 2015 not flagged?
     • Not real time; often need time guarantees (say, T+1)
     * All numbers are orders of magnitude.
  4. Surveillance Architecture
     [Diagram; labels only]
     • Sources: SQL 1 … SQL n; Flatfile 1 … Flatfile m; Prop 1 … Prop k; HDFS 1 … HDFS q
     • Flattened 1, Flattened 2, … Flattened n
     • Preprocessing pipeline 1, Preprocessing pipeline 2, … Preprocessing pipeline n
     • Surv. 1 …
     • Alerts
     • Bookkeeping
  5. Spoofing Illustration [figure]
  6. A Real World Spoofing Case
     • Navinder Singh Sarao was accused of spoofing
       – …and even contributing to the flash crash of 2010
     • Sarao pled guilty to spoofing in Nov 2016
     • He allegedly made $40M in illegal profit over years.
  7. Review of Regulatory Cases
     • Analyzed six regulatory enforcement cases related to spoofing
     • Identified common factors indicative of spoofing behavior
       – Creating a false impression of demand by placing spoof orders on the opposite side to trigger a price movement (“order imbalance”)
       – Cancellation of spoof orders within a short time after the pivot execution (“time to cancel post execution”)

     Case factors:

     Case                  | Order Imbalance (> 2.5 times) | Time to Cancel Post Execution (< 1 sec)
     Sarao / Flash Crash   | ✓                             | ✓
     Hold Brothers         | ✓                             | ✓
     Coscia / Panther      | ✓                             | ✓
     Visionary Trading     | NA                            | ✓
     Swift                 | ✓                             | 5 secs
     3 Red                 | ✓                             | ✓
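The two factors on this slide can be sketched as simple checks on the orders around one pivot execution. This is a minimal illustration, not the talk's implementation: the `Order` record, the function names, and in particular the definition of imbalance as "opposite-side quantity divided by pivot-side quantity" are assumptions. Only the thresholds (2.5 times, 1 second) come from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the slide's table headers.
IMBALANCE_THRESHOLD = 2.5   # "> 2.5 times"
CANCEL_WINDOW_SECS = 1.0    # "< 1 sec"


@dataclass
class Order:
    """Hypothetical order record; field names are illustrative only."""
    side: str                       # "buy" or "sell"
    quantity: float
    placed_at: float                # seconds since some reference time
    cancelled_at: Optional[float] = None


def order_imbalance(orders: List[Order], pivot_side: str) -> float:
    """Assumed definition: quantity opposite the pivot / quantity on the pivot side."""
    same = sum(o.quantity for o in orders if o.side == pivot_side)
    opposite = sum(o.quantity for o in orders if o.side != pivot_side)
    if same == 0:
        return float("inf") if opposite > 0 else 0.0
    return opposite / same


def time_to_cancel(orders: List[Order], pivot_side: str,
                   pivot_time: float) -> Optional[float]:
    """Shortest delay between the pivot execution and an opposite-side cancel."""
    delays = [
        o.cancelled_at - pivot_time
        for o in orders
        if o.side != pivot_side
        and o.cancelled_at is not None
        and o.cancelled_at >= pivot_time
    ]
    return min(delays) if delays else None


def spoofing_factors(orders: List[Order], pivot_side: str,
                     pivot_time: float) -> dict:
    """Evaluate both factors for one pivot execution."""
    delay = time_to_cancel(orders, pivot_side, pivot_time)
    return {
        "order_imbalance": order_imbalance(orders, pivot_side) > IMBALANCE_THRESHOLD,
        "fast_cancel": delay is not None and delay < CANCEL_WINDOW_SECS,
    }
```

A case such as Swift in the table, where the cancel came about 5 seconds after the execution, would meet the imbalance factor under this sketch but not the fast-cancel factor.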
  8. Transactions Data Pipeline
     • Spoofing implementation has 2 parts: data preprocessing and surveillance logic
     • Data preprocessing pipeline is reused for multiple surveillances
     • ~100M orders, 1B market data points, 100K products, multiple order management systems
     [Diagram; labels only]
     • Inputs: Order 1 … Order n; Exec. 1 … Exec. m; Market 1 … Market k; Account; Product
     • Flattened Order, Flattened Exec, Flattened Market
     • Order Processing Pipeline, Exec. Processing Pipeline, Mkt. Processing Pipeline
     • Related Transactions
     • Surveillances: Spoofing, Front running, … Surv. n
     • Alerts (one per surveillance)
  9. Related Transactions Table
     • Columns: Pivot Exec | Orders | Execs/Cancels | MktData (216.8, 216.9, …)
     • One row of the related transactions table contains information about one pivot execution and all the activity around the time of that execution.
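One way to picture such a row is as a pivot execution joined with every event that falls in a time window around it. The sketch below assumes events are `(timestamp, payload)` tuples and uses a hypothetical symmetric window of `half_width` seconds; the slide does not say how "around the time of that execution" is defined, and the class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

# Hypothetical event shape: (timestamp in seconds, payload).
Event = Tuple[float, Any]


@dataclass
class RelatedTransactionsRow:
    """One row: a pivot execution plus the activity around it."""
    pivot_exec: Event
    orders: List[Event]
    execs_cancels: List[Event]
    mkt_data: List[Event]


def in_window(events: List[Event], center: float, half_width: float) -> List[Event]:
    """Keep events whose timestamp lies within half_width seconds of center."""
    return [e for e in events if abs(e[0] - center) <= half_width]


def build_rows(pivot_execs: List[Event], orders: List[Event],
               execs_cancels: List[Event], mkt_data: List[Event],
               half_width: float = 5.0) -> List[RelatedTransactionsRow]:
    """Build one row per pivot execution (window size is an assumption)."""
    return [
        RelatedTransactionsRow(
            pivot_exec=pivot,
            orders=in_window(orders, pivot[0], half_width),
            execs_cancels=in_window(execs_cancels, pivot[0], half_width),
            mkt_data=in_window(mkt_data, pivot[0], half_width),
        )
        for pivot in pivot_execs
    ]
```

Materializing the surrounding activity once per pivot execution fits the earlier point that the preprocessing pipeline is reused by multiple surveillances: each surveillance can then read a row without joining the raw feeds again. A linear scan per pivot is only suitable for illustration; at the volumes quoted in the talk a sorted or partitioned join would be needed.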
  10. Search Problem
     Given a semi-structured corpus of about 1B documents in a Hadoop cluster, design a search engine over YARN that is fast and satisfies the investigative needs of a variety of users.
     Unique Challenges
     • Cannot move data outside of an already existing Hadoop cluster
     • Support deep scoring algorithms for GS-specific signals (colloquial language, trades, etc.)
     • Unstructured and structured signals
  11. Search Workflow
     [Diagram; labels only] Web Client, Search Master, Ranker, Fast Index Servers, Slow Index Servers, HBase, HDFS, YARN containers
     • Implemented as YARN apps
     • Auth enabled
     • Slow index servers can scale as much as HBase
     Current Index Size
     • # indexed documents > 1 billion
     • # indexed tokens > 500 billion
     • Runs in several TBs (memory and disk)
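The slide names the components but not how they interact. As a rough, in-memory illustration of one plausible arrangement, the sketch below has a master fan a query out to fast and slow index servers, merge the candidates, and hand them to a ranker. Everything here is an assumption made for illustration: the class and function names, the inverted-index layout, and the term-overlap score. It does not use YARN, HBase, or HDFS.

```python
from collections import defaultdict
from typing import Dict, List, Set


class IndexServer:
    """Toy inverted index standing in for a fast or slow index server."""

    def __init__(self, docs: Dict[str, str]):
        self.docs = docs
        self.postings: Dict[str, Set[str]] = defaultdict(set)
        for doc_id, text in docs.items():
            for token in text.lower().split():
                self.postings[token].add(doc_id)

    def lookup(self, tokens: List[str]) -> Set[str]:
        """Return ids of documents containing at least one query token."""
        hits: Set[str] = set()
        for token in tokens:
            hits |= self.postings.get(token, set())
        return hits


def rank(tokens: List[str], candidates: Dict[str, str], top_k: int) -> List[str]:
    """Placeholder ranker: score = number of distinct query tokens in the document."""
    query = set(tokens)

    def score(doc_id: str) -> int:
        return len(query & set(candidates[doc_id].lower().split()))

    # Sort by descending score, then by id so ties are deterministic.
    return sorted(candidates, key=lambda d: (-score(d), d))[:top_k]


def search_master(query: str, fast_servers: List[IndexServer],
                  slow_servers: List[IndexServer], top_k: int = 10) -> List[str]:
    """Fan out to every server, merge the candidates, then rank them."""
    tokens = query.lower().split()
    candidates: Dict[str, str] = {}
    for server in fast_servers + slow_servers:
        for doc_id in server.lookup(tokens):
            candidates[doc_id] = server.docs[doc_id]
    return rank(tokens, candidates, top_k)
```

In a real deployment the "deep scoring algorithms" mentioned on the previous slide would replace the term-overlap score, and the fast and slow tiers would presumably differ in storage and latency rather than being identical objects as they are here.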
