This document describes a proposed surveillance platform for bank compliance. It discusses the large scale of financial data that must be analyzed, including over 1 billion pieces of text and trading events per year. It outlines the technical challenges of working with diverse data formats and sources that can change over time. The proposed architecture preprocesses and flattens data from various sources into a common format before running multiple surveillance algorithms to detect issues like spoofing and generate alerts. Examples of real-world spoofing cases are provided to illustrate the types of patterns the system aims to identify.
2. Stakes Are High
• “Banks pay out £166bn over six years: a history of banking misdeeds and fines” – The Guardian
• “Banks 'pay 60%' of profits in fines and customer payments” – BBC News
• “Deutsche Bank to Pay $2.5 Billion to Settle Libor Investigation” – The Wall Street Journal
• “$1.2 Billion Fine for Hedge Fund SAC Capital in Insider Case” – NY Times
3. Key Technical Challenges
Diverse data sets and formats (SQL, flat files, proprietary, etc.)
Size of data, updated frequently
• ~1B* pieces of text per year
• ~1B edges in a graph
• 100s of millions of trading events in a day
Data from the past can change (e.g., manual trade correction)
• Causes a cascade of changes
Surveillance decisions need to be debuggable
• Why was trade X on Oct 25, 2015 not flagged?
Not real time; often need time guarantees (say, T+1)
* All numbers are orders of magnitude
4. Surveillance Architecture
[Diagram: data sources (SQL 1…n, Flatfile 1…m, Prop 1…k, HDFS 1…q) feed preprocessing pipelines 1…n, which produce flattened data sets 1…n. Surveillances (Surv. 1, …) run on the flattened data and generate alerts. A bookkeeping component is also shown.]
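The flatten-then-surveil flow in this architecture can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: the `FlatRecord` schema, the column names in the SQL row, the pipe-delimited flat-file layout, and the toy `large_trade_surveillance` rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FlatRecord:
    """Hypothetical common format shared by every surveillance."""
    source: str
    trade_id: str
    account: str
    quantity: int
    price: float


def flatten_sql_row(row: Dict[str, object]) -> FlatRecord:
    # Assumed SQL layout: columns named id, acct, qty, px.
    return FlatRecord("sql", str(row["id"]), str(row["acct"]),
                      int(row["qty"]), float(row["px"]))


def flatten_flatfile_line(line: str) -> FlatRecord:
    # Assumed flat-file layout: trade_id|account|quantity|price
    trade_id, account, quantity, price = line.strip().split("|")
    return FlatRecord("flatfile", trade_id, account, int(quantity), float(price))


# A surveillance takes flattened records and returns alert messages.
Surveillance = Callable[[List[FlatRecord]], List[str]]


def large_trade_surveillance(records: List[FlatRecord]) -> List[str]:
    # Toy rule for illustration only; real surveillance logic is far richer.
    return [f"large trade {r.trade_id}" for r in records if r.quantity >= 10_000]


def run_surveillances(records: List[FlatRecord],
                      surveillances: Dict[str, Surveillance]) -> Dict[str, List[str]]:
    # Every surveillance reads the same flattened data, so preprocessing is shared.
    return {name: surv(records) for name, surv in surveillances.items()}


records = [
    flatten_sql_row({"id": 1, "acct": "A1", "qty": 15000, "px": 10.5}),
    flatten_flatfile_line("T2|A2|500|11.0\n"),
]
alerts = run_surveillances(records, {"large_trade": large_trade_surveillance})
```

Because each source-specific flattener is the only code that knows about its source format, adding a new source or surveillance under this sketch does not touch the others.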
6. A Real World Spoofing Case
Navinder Singh Sarao was accused of spoofing
…and even contributing to the flash crash of 2010
Sarao pled guilty to spoofing in Nov 2016
He allegedly made $40M in illegal profit over years.
7. Review of Regulatory Cases
– Analyzed six regulatory enforcement cases related to spoofing
– Identified common factors indicative of spoofing behavior
• Creating a false impression of demand by placing spoof orders on the opposite side to trigger a price movement (“order imbalance”)
• Cancellation of spoof orders within a short time after the pivot execution (“time to cancel post execution”)
Factors by case:
Case                  Order Imbalance (> 2.5 times)   Time to Cancel Post Execution (< 1 sec)
Sarao / Flash Crash   ✓                               ✓
Hold Brothers         ✓                               ✓
Coscia / Panther      ✓                               ✓
Visionary Trading     NA                              ✓
Swift                 ✓                               5 secs
3 Red                 ✓                               ✓
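The two factors above can be expressed as simple checks. The sketch below is illustrative only: the thresholds (2.5 times, 1 second) come from the table, but the `PivotWindow` fields and the definition of order imbalance as the ratio of spoof-side quantity to pivot-side quantity are assumptions, since the slides do not give a formula.

```python
from dataclasses import dataclass
from typing import Dict

IMBALANCE_THRESHOLD = 2.5      # from the table: "> 2.5 times"
MAX_CANCEL_DELAY_SECS = 1.0    # from the table: "< 1 sec"


@dataclass
class PivotWindow:
    """Hypothetical summary of activity around one pivot execution."""
    spoof_side_qty: int       # quantity resting on the side opposite the pivot
    pivot_side_qty: int       # quantity on the pivot side
    pivot_exec_time: float    # seconds since some epoch
    spoof_cancel_time: float  # seconds since the same epoch


def order_imbalance(w: PivotWindow) -> float:
    # Assumed definition: opposite-side quantity relative to pivot-side quantity.
    if w.pivot_side_qty <= 0:
        return float("inf")
    return w.spoof_side_qty / w.pivot_side_qty


def time_to_cancel(w: PivotWindow) -> float:
    # Delay between the pivot execution and cancellation of the spoof orders.
    return w.spoof_cancel_time - w.pivot_exec_time


def spoofing_factors(w: PivotWindow) -> Dict[str, bool]:
    return {
        "order_imbalance": order_imbalance(w) > IMBALANCE_THRESHOLD,
        "time_to_cancel": 0.0 <= time_to_cancel(w) < MAX_CANCEL_DELAY_SECS,
    }


fast_cancel = PivotWindow(3000, 1000, 100.0, 100.4)
slow_cancel = PivotWindow(3000, 1000, 100.0, 105.0)
```

Here `fast_cancel` satisfies both factors, while `slow_cancel` (a 5-second delay, as in the Swift row) satisfies only the imbalance factor.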
8. Transactions Data Pipeline
Spoofing implementation has 2 parts: data preprocessing and surveillance logic
Data preprocessing pipeline is reused for multiple surveillances
~100M orders, 1B market data points, 100K products, multiple order management systems
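[Diagram: order sources (Order 1 … Orders n), execution sources (Exec. 1 … Exec m), market data sources (Market 1 … Market k), and Account and Product data feed the Order, Exec., and Mkt. Processing Pipelines, which produce the Flattened Order, Flattened Exec, and Flattened Market data sets. These are combined into the Related Transactions table, which feeds the Spoofing, Front running, and further surveillances (Surv. n…), each generating alerts.]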
9. Related Transactions Table
[Figure: the Related Transactions table, with columns Pivot Exec, Orders, Execs/Cancels, and MktData (e.g., 216.8, 216.9, …).]
One row of the related transactions table contains information about one pivot execution and all the activity around the time of that execution.
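Building one such row can be sketched as below. The `Event` and `RelatedTransactionsRow` types, the event kinds, and the fixed time window are hypothetical; the slides only state that a row holds one pivot execution plus the surrounding activity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    kind: str     # assumed kinds: "order", "exec", "cancel", "mkt"
    time: float   # seconds
    value: float  # price for "mkt" events, quantity otherwise


@dataclass
class RelatedTransactionsRow:
    pivot_exec: Event
    orders: List[Event]
    execs_cancels: List[Event]
    mkt_data: List[float]


def build_row(pivot: Event, events: List[Event],
              window_secs: float = 5.0) -> RelatedTransactionsRow:
    # Keep every event within the (assumed) window around the pivot execution,
    # excluding the pivot itself, in time order.
    nearby = sorted(
        (e for e in events
         if e is not pivot and abs(e.time - pivot.time) <= window_secs),
        key=lambda e: e.time,
    )
    return RelatedTransactionsRow(
        pivot_exec=pivot,
        orders=[e for e in nearby if e.kind == "order"],
        execs_cancels=[e for e in nearby if e.kind in ("exec", "cancel")],
        mkt_data=[e.value for e in nearby if e.kind == "mkt"],
    )


pivot = Event("exec", 10.0, 100)
events = [
    Event("order", 9.0, 500),
    Event("cancel", 10.5, 500),
    Event("mkt", 9.9, 216.8),
    Event("mkt", 10.1, 216.9),
    Event("order", 30.0, 100),  # outside the window, so it is dropped
    pivot,
]
row = build_row(pivot, events, window_secs=2.0)
```

Grouping the surrounding activity into one row per pivot execution means a surveillance can evaluate each candidate independently, without re-joining orders, executions, and market data.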
10. Search Problem
Given a semi-structured corpus of about 1B documents in a Hadoop cluster, design a search engine over YARN that is fast and satisfies the investigative needs of a variety of users.
Unique Challenges
Cannot move data outside of an already existing Hadoop cluster
Support deep scoring algorithms for GS-specific signals (colloquial language, trades, etc.)
Unstructured and structured signals
11. Search Workflow
[Diagram: components shown are the Web Client, Search Master, Ranker, Fast Index Servers, Slow Index Servers, HBase, YARN containers, and HDFS.]
• Implemented as YARN apps
• Auth enabled
• Slow Index Servers can scale as much as HBase
# indexed documents > 1 billion
# indexed tokens > 500 billion
Current index size runs in several TBs (memory and disk)
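One way a search master might combine index servers and a ranker is sketched below. This is a toy, in-memory illustration under stated assumptions: the slides do not say how the fast and slow index servers divide work, so the sketch simply fans the query out to every server, merges the candidates, and lets a ranker order them. `IndexServer`, `term_frequency_ranker`, and `search` are hypothetical names, and nothing here uses YARN or HBase APIs.

```python
from typing import Callable, Dict, List, Set


class IndexServer:
    """In-memory stand-in for one index server (fast or slow)."""

    def __init__(self, docs: Dict[str, str]):
        self.docs = docs
        self.postings: Dict[str, Set[str]] = {}
        for doc_id, text in docs.items():
            for token in text.lower().split():
                self.postings.setdefault(token, set()).add(doc_id)

    def lookup(self, query: str) -> Set[str]:
        # Return documents that contain every query token.
        tokens = query.lower().split()
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


def term_frequency_ranker(query: str, text: str) -> float:
    # Toy scoring: total occurrences of the query tokens in the document.
    tokens = text.lower().split()
    return float(sum(tokens.count(t) for t in query.lower().split()))


def search(query: str, servers: List[IndexServer],
           ranker: Callable[[str, str], float], top_k: int = 10) -> List[str]:
    # Fan out to every server, merge candidates, then rank (ties broken by id).
    candidates: Dict[str, str] = {}
    for server in servers:
        for doc_id in server.lookup(query):
            candidates[doc_id] = server.docs[doc_id]
    ranked = sorted(candidates.items(),
                    key=lambda kv: (-ranker(query, kv[1]), kv[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


fast = IndexServer({"d1": "cancel the order now", "d2": "lunch plans"})
slow = IndexServer({"d3": "order order cancel"})
results = search("order", [fast, slow], term_frequency_ranker)
```

Keeping the ranker as a separate function is what would let domain-specific scoring be swapped in without changing how candidates are retrieved.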