
Financial Portfolio Management with Java on Steroids - JAX Finance 2016

Marcus Gründler talks about high-performance software at the JAX Finance 2016 conference in London.

Analyzing the financial transaction data of a large number of portfolios, with hundreds of millions of transactions and quotes, is a demanding job for any computing environment.

Analysis systems don't necessarily have extremely low latency requirements as trading systems do, but they have to provide low user response times. Applying massive parallelism over CPU cores or even clusters of machines doesn't help much if you want to achieve response times of a few milliseconds.

Unleashing the power of today's CPUs with a stream-oriented architecture, which is optimal for utilizing the caches and branch prediction of modern CPUs, becomes the cornerstone of such systems.
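As a rough illustration of what a stream-oriented layout can look like in Java, the sketch below stores transactions as one primitive array per field and processes them in a single linear pass. The class `TransactionColumns` and its fields are hypothetical names chosen for this example; they are not taken from the talk or from any aixigo API.

```java
import java.util.Random;

/** Hypothetical columnar ("flattened") transaction store; all names are illustrative. */
final class TransactionColumns {
    final int[] securityId;   // one primitive array per field instead of one object per transaction
    final long[] bookingDay;  // assumed encoding: days since epoch
    final double[] amount;

    TransactionColumns(int size) {
        securityId = new int[size];
        bookingDay = new long[size];
        amount = new double[size];
    }

    /** Streams over a single array front to back, so memory is read strictly in sequence. */
    static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    /** Sums the amounts of one security by reading two arrays in lockstep, without object dereferencing. */
    double sumForSecurity(int id) {
        double total = 0.0;
        for (int i = 0; i < amount.length; i++) {
            if (securityId[i] == id) {
                total += amount[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        TransactionColumns tx = new TransactionColumns(1_000_000);
        Random random = new Random(42);
        for (int i = 0; i < tx.amount.length; i++) {
            tx.securityId[i] = random.nextInt(100);
            tx.bookingDay[i] = 16_000 + i % 365;
            tx.amount[i] = random.nextInt(10_000) / 100.0;
        }
        System.out.println("total amount:       " + sum(tx.amount));
        System.out.println("amount, security 7: " + tx.sumForSecurity(7));
    }
}
```

Each loop touches contiguous memory only, which is the access pattern hardware prefetchers handle best. Whether this yields a speedup in a given system has to be measured, for example with JMH.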

The talk presents insights into a system based on multi-version concurrency control (MVCC), using Apache Kafka and a binary in-memory data representation. Learn how to achieve speedups by a factor of 100 to 200 compared to classic object-oriented design.
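The combination of MVCC with a binary in-memory representation can be pictured with a minimal sketch: an append-only log of fixed-width records in an off-heap `ByteBuffer`, where each committed version is simply a boundary in that log and a reader pinned to version N sees the records of versions 1 to N. The class `VersionedLog`, its record layout, and its method names are assumptions made for this illustration. The sketch does not model Kafka or the system described in the talk.

```java
import java.nio.ByteBuffer;

/** Hypothetical append-only record log with version boundaries; layout and names are assumptions. */
final class VersionedLog {
    // Assumed fixed-width record: int securityId (4 bytes) followed by double amount (8 bytes).
    static final int RECORD_BYTES = Integer.BYTES + Double.BYTES;

    private final ByteBuffer data;   // off-heap storage, no per-record objects
    private final int[] versionEnd;  // versionEnd[v - 1] = record count when version v was committed
    private int recordCount;
    private int versions;

    VersionedLog(int maxRecords, int maxVersions) {
        data = ByteBuffer.allocateDirect(maxRecords * RECORD_BYTES);
        versionEnd = new int[maxVersions];
    }

    /** Appends one record; it stays invisible to readers until the next commit. */
    void append(int securityId, double amount) {
        int offset = recordCount * RECORD_BYTES;
        data.putInt(offset, securityId);
        data.putDouble(offset + Integer.BYTES, amount);
        recordCount++;
    }

    /** Seals everything appended so far as the next version and returns its 1-based number. */
    int commit() {
        versionEnd[versions] = recordCount;
        return ++versions;
    }

    /** Sums all amounts visible to a reader pinned to the given version (versions 1..version). */
    double sumAmounts(int version) {
        int end = versionEnd[version - 1];
        double total = 0.0;
        for (int i = 0; i < end; i++) {
            total += data.getDouble(i * RECORD_BYTES + Integer.BYTES);
        }
        return total;
    }

    public static void main(String[] args) {
        VersionedLog log = new VersionedLog(16, 4);
        log.append(1, 10.0);
        log.append(2, 5.0);
        int v1 = log.commit();
        log.append(1, 2.5);
        int v2 = log.commit();
        System.out.println("view " + v1 + ": " + log.sumAmounts(v1));
        System.out.println("view " + v2 + ": " + log.sumAmounts(v2));
    }
}
```

Because committed records are never modified, a reader holding an older version number keeps a stable view while new versions are appended. This sketch is single-threaded; a concurrent variant would need to publish the version boundary safely.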


Financial Portfolio Management with Java on Steroids - JAX Finance 2016

  1. Marcus Gründler | aixigo AG: Financial Portfolio Management on Steroids
  2. About me: Marcus Gründler, @marcusgruendler, Head of Portfolio Management Systems, Architect at aixigo AG, Germany, www.aixigo.de. JAX Finance 2016, London
  3. Agenda: • Financial portfolio management • Hardware • Java memory • Programming patterns • Scaling
  4. What exactly is Financial Portfolio Management?
  5. Portfolio Management: • Transactions, prices, securities • Financial algorithms • Historical analysis • Time series
  6. Portfolio Management: • No extreme low-latency requirements • High data throughput (1 million records/sec) • Low response times (100 ms per 10,000 records) • Very large datasets
  7. Is high-performance computing possible with Java?
  8. Yes ... read the fine print!
  9. Matrix Sum, row-major access:
         23 101   2  34  88 120   4
         44  12 234 211 112 189  11
         33   1  86 201   3  11  22
         65  32  62  22  34  15  67
         43 178 105 138 192  38  41
         11  58  35  25  27  16  21
  10. Matrix Sum, column-major access (same matrix as on slide 9)
  11. Matrix Sum (10,000 x 10,000 elements): benchmark chart, axis: ops/sec
  12-19. Matrix Sum: row-major access vs. column-major access (comparison chart repeated on slides 12 to 19; values not preserved in the transcript)
  20. Tool Support - JMH: • OpenJDK JMH (Java Microbenchmark Harness) • Eliminates measurement (in)accuracy • Statistically robust measurements • Maven and Jenkins support • http://openjdk.java.net/projects/code-tools/jmh/
  21. Memory Access (diagram): CPU, RAM
  22. Cached Memory Access (diagram): CPU, Cache, RAM
  23. Multi Level Caches (diagram): CPU, L1, L2, L3, RAM
  24. Latency Numbers
                              CPU cycles      Time             Size
         L1 latency           ~4-5 cycles     1.5 ns           32 KB
         L2 latency           ~12 cycles      3.5 ns           256 KB
         L3 latency           ~36 cycles      10.6 ns          8-40 MB
         RAM latency          ~230 cycles     67.6 ns          256 GB
         2 KB over 1 GBit                     20,000 ns
         1 MB from RAM                        250,000 ns
         Disk seek                            10,000,000 ns
         Roundtrip US-EU-US                   150,000,000 ns
      Intel i7-4770 (Haswell), 3.4 GHz
      Sources: http://www.7-cpu.com/cpu/Haswell.html, http://norvig.com/21-days.html#answers
  25. Latency Numbers (same table as on slide 24)
  26. Cache-oblivious Algorithms: • Optimized for minimal memory transfer • All computation with L1 cache • Cache-"oblivious": no knowledge about cache hierarchy • Keeps CPU permanently "under pressure" • Empowers cache prefetching
  27. Plain Matrix Transpose (diagram: a matrix with rows 1 2 3, 4 5 6, 7 ... is rewritten so that these rows become the columns 1 4 7, 2 5 ..., 3 6 ...)
  28. Cache-oblivious Transpose
  29. Matrix Transpose (4096 x 4096 elements): benchmark chart, axis: ops/sec
  30. Matrix Transpose (4096 x 4096 elements): benchmark chart, axes: cores, ops/sec
  31. • Memory access patterns matter • Avoid main memory jumps • Algorithms should support prefetching
  32. How much memory do we consume?
  33. How large is an object? double = 8 bytes, Double = ? bytes, BigDecimal = ? bytes
  34. java.lang.Double: double = 8 bytes, Double = ? bytes, BigDecimal = ? bytes. Layout diagram (byte offsets 0 to 28): object header ("mark word" with • hash code • flags, then class pointer), data, padding to 8-byte alignment
  35. java.lang.Double: double = 8 bytes, Double = ? bytes, BigDecimal = ? bytes. Layout diagram (byte offsets 0 to 28): object header ("mark word" with • hash code • flags, then class pointer), data
  36. double array: double = 8 bytes, Double = 24 bytes, BigDecimal = ? bytes. Layout diagram (byte offsets 0 to 28): "mark word" (• hash code • flags), class pointer, array length, Data 0, Data 1, ...
  37. java.lang.Double: double = 8 bytes, Double = 24 bytes, BigDecimal = ? bytes
  38. BigDecimal: double = 8 bytes, Double = 24 bytes, BigDecimal = ? bytes. Layout diagram (byte offsets 0 to 40): BigDecimal (Σ 40 bytes) holds a ref to BigInteger (Σ 40 bytes), which holds a ref to int[] (Σ >16 bytes); total >96 bytes
  39. BigDecimal: double = 8 bytes, Double = 24 bytes, BigDecimal = >96 bytes
  40. Tool Support - JOL: • OpenJDK tool - JOL (Java Object Layout) • Insight into memory layout • Heap dump analysis • Exact memory usage • Graphical layout visualization • Maven module • http://openjdk.java.net/projects/code-tools/jol/
  41. Tool Support - JOL
         java -jar jol-cli.jar internals java.math.BigDecimal
      or
         java -cp jol-cli.jar:my-own.jar org.openjdk.jol.Main internals foo.MyClass
  42. Tool Support - JOL
         java.math.BigDecimal object internals:
          OFFSET  SIZE        TYPE DESCRIPTION                 VALUE
               0    12             (object header)             N/A
              12     4         int BigDecimal.scale            N/A
              16     8        long BigDecimal.intCompact       N/A
              24     4         int BigDecimal.precision        N/A
              28     4  BigInteger BigDecimal.intVal           N/A
              32     4      String BigDecimal.stringCache      N/A
              36     4             (loss due to the next object alignment)
         Instance size: 40 bytes (estimated, the sample instance is not available)
         Space losses: 0 bytes internal + 4 bytes external = 4 bytes total
  43. Data Locality (diagram): 56 + 520 = 576 bytes; 64 bytes
  44. Memory Layout (diagram): TransactionPlain[100], TransactionCompact[100]; 276 kB
  45. • Keep data compact • Think about data types • Keep memory allocations low • Keep garbage collection rate low
  46. And... Which patterns should I use?
  47. Tree Model
  48. Flattened Data Model
  49. Streaming Data Access (formula graphic; symbols not recoverable from the transcript)
  50. Decoupled Algorithms
  51. • Prefer primitives over classes • Prefer arrays over object trees • Process one array at a time
  52. What about cluster replication?
  53. MVCC (Multi Version Concurrency Control): versions V1, V2, V3, V4; views: View 1, View 1+2, View 1+2+3, View 1+2+3+4
  54. Compact Data: versions V1, V2, V3, V4; views: View 1, View 1+2, View 1+2+3, View 1+2+3+4
  55. From Disk to RAM (diagram): V1, V2, V3 on disk; V1, V2, V3 in off-heap RAM; MappedByteBuffer
  56. Data Distribution: Apache Kafka
  57. Data Distribution
  58. Data Distribution
  59. Summary
  60. • Large speedups through cache-optimized algorithms • Memory layout is crucial • Scale by cache replication • @marcusgruendler
  61. Thank you!
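
The transcript names a cache-oblivious transpose (slides 26 to 30) but does not preserve any code. The following is a minimal sketch of the usual recursive formulation, assuming a matrix stored row-major in a flat `double[]`: the larger dimension is halved until a block is small, then the block is copied with plain loops. The class name and the `LEAF` cut-off of 16 are assumptions for this example. The cut-off only limits recursion overhead and is not tuned to any cache size. This is not the implementation benchmarked in the talk.

```java
/** Sketch of a recursive (cache-oblivious style) transpose; names and cut-off are assumptions. */
final class CacheObliviousTranspose {
    static final int LEAF = 16; // assumed block size below which plain loops are used

    /** Writes the transpose of src (rows x cols, row-major) into dst (cols x rows, row-major). */
    static void transpose(double[] src, double[] dst, int rows, int cols) {
        recurse(src, dst, rows, cols, 0, rows, 0, cols);
    }

    private static void recurse(double[] src, double[] dst, int rows, int cols,
                                int r0, int r1, int c0, int c1) {
        int height = r1 - r0;
        int width = c1 - c0;
        if (height <= LEAF && width <= LEAF) {
            // Small block: both the source and the destination block fit in a few cache lines.
            for (int r = r0; r < r1; r++) {
                for (int c = c0; c < c1; c++) {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        } else if (height >= width) {
            int mid = r0 + height / 2; // split the row range
            recurse(src, dst, rows, cols, r0, mid, c0, c1);
            recurse(src, dst, rows, cols, mid, r1, c0, c1);
        } else {
            int mid = c0 + width / 2;  // split the column range
            recurse(src, dst, rows, cols, r0, r1, c0, mid);
            recurse(src, dst, rows, cols, r0, r1, mid, c1);
        }
    }

    public static void main(String[] args) {
        int n = 4096; // matrix size used on the benchmark slides
        double[] src = new double[n * n];
        for (int i = 0; i < src.length; i++) {
            src[i] = i;
        }
        double[] dst = new double[n * n];
        transpose(src, dst, n, n);
        System.out.println("dst[1] equals src[n]: " + (dst[1] == src[n]));
    }
}
```

Because the recursion keeps shrinking blocks regardless of the actual cache sizes, some recursion level fits into each cache level without the code knowing the hierarchy. A JMH benchmark against the plain double loop would be the way to check the effect on a given machine.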
