GraphConnect Europe 2016 - Moving Graphs to Production at Scale - Ian Robinson

Moving Graphs to Production
Ian Robinson

Overview
•  Solution Architectures
•  Hardware/Software Requirements
•  HA Architecture
•  Backups
•  Monitoring
•  Testing

Solution Architectures
Server
Server with Procedures
Embedded

Server
•  Server infrastructure wraps embedded Neo4j
•  Binary protocol (Bolt)
•  Uniform drivers (Java, .NET, Python, JavaScript)
[Diagram: Application with Driver, Load balancer, three Cypher/Bolt connections, Server wrapping Embedded.]

Server
•  Server-side jar, called from Cypher
•  Execute complex logic on server
•  Close to the data
•  Multiple operations per request
•  Integrate with backend systems
•  Graph global queries, schema introspection, etc.
[Diagram: Application with Driver, Load balancer, Cypher/Bolt and REST API connections, Server with Procedures wrapping Embedded.]
https://github.com/neo4j-contrib/neo4j-apoc-procedures

Server
Embedded
•  Host Neo4j in application’s Java process
•  Access to Neo4j’s Java APIs
[Diagram: Application calling Neo4j's Java APIs.]

Hardware
CPU
•  Intel Core i3 (minimum)
•  Intel Core i7 (recommended)
•  Neo4j scales with the number of cores
•  Requires Enterprise to scale beyond 4 cores
Disk
•  SLC (single-level cell) SSD w/SATA
•  ext4 (recommended), ZFS
•  Increase permitted number of open files to 40,000+
Memory
•  Lots of RAM (for heap + page cache)
•  8-12 GB heap (up to 24 GB)
•  Explicitly set page cache to (store size + 10% + headroom)
–  Otherwise defaults to 50% of RAM - heap-size (75% pre 2.3)
dbms.memory.pagecache.size=10g
neo4j.conf
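
The page cache sizing rule above (store size + 10% + headroom) is simple arithmetic. A minimal sketch follows. The function names and the default headroom of 1 GB are illustrative assumptions, not Neo4j settings.

```python
import math


def page_cache_size_gb(store_size_gb, headroom_gb=1.0):
    """Store size + 10% + headroom, in GB. The 1 GB default headroom is an assumption."""
    return store_size_gb * 1.10 + headroom_gb


def as_conf_line(size_gb):
    """Format the size as a neo4j.conf setting, rounded up to whole gigabytes."""
    return "dbms.memory.pagecache.size=%dg" % math.ceil(size_gb)


# An 8 GB store with 1 GB of headroom comes to about 9.8 GB, rounded up to 10g.
print(as_conf_line(page_cache_size_gb(8.0)))
```

Pick the headroom from the expected store growth between configuration reviews.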

Software
Java
•  OpenJDK 8 or Oracle Java 8
•  IBM JDK 8 on POWER8
•  G1 garbage collector
•  Default from 2.3
•  JDK 1.7.0_71 or later
Operating System
•  Linux
•  HP-UX
•  Windows 2012
wrapper.java.additional=-XX:+UseG1GC
neo4j-wrapper.conf (pre 2.3)

EC2 Instances
•  HVM (hardware virtual machine) over PV (paravirtual)
•  C3 or C4 (compute-optimized)
•  E.g. c4.2xlarge (15 GiB RAM, 8 vCPU, 1000 Mbps EBS throughput)
•  R3 (memory-optimized)
•  E.g. r3.xlarge (30.5 GiB RAM, 4 vCPU)
•  Not EBS-optimized by default
•  Use HA clustering and online backups for increased durability
•  Distribute cluster across Availability Zones in a Region

Local Storage
•  SSD or HDD
•  Highest I/O performance
•  Included in virtual server
•  Up to 8 x 800 GB SSD (i2.8xlarge) or 24 x 2000 GB HDD (d2.8xlarge)
•  Lost when EC2 instance is terminated
Elastic Block Store (EBS)
•  Attached to EC2 instance via network connection
•  Up to 16 TB SSD
•  Persists even if EC2 instance is terminated
•  Use EBS-optimized EC2 instances for dedicated throughput to EBS
•  Provisioned IOPS (io1) for predictable performance
•  Up to 30 IOPS per GiB
–  E.g. 300 GiB volume, 9000 IOPS
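
The IOPS-to-size ratio above can be turned into a quick sizing check. This is a small sketch using only the 30 IOPS per GiB figure from the slide. The function names are illustrative.

```python
MAX_IOPS_PER_GIB = 30  # ratio quoted on the slide for io1 volumes


def max_provisioned_iops(volume_gib):
    """Largest IOPS figure the ratio allows for a volume of this size."""
    return volume_gib * MAX_IOPS_PER_GIB


def min_volume_gib_for(iops):
    """Smallest whole-GiB volume whose ratio limit covers the requested IOPS."""
    return -(-iops // MAX_IOPS_PER_GIB)  # ceiling division


print(max_provisioned_iops(300))
```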

HA Architecture
[Diagram: three Neo4j HA instances (Instance 1, Instance 2, Instance 3), each with Database, Transaction Propagation and Cluster Management components; one instance is labelled Master.]

Cluster Configuration
Joining Cluster
•  ha.initial_hosts (neo4j.conf)
•  List of servers to contact when joining cluster
•  All hosts must be available when starting instance
•  For large clusters, supply only a small number of hosts, e.g. 3
Pull and Push Transactions
•  ha.pull_interval=10s (off by default)
•  ha.tx_push_factor=1 (default, but best efforts only)
Tuning
•  ha.heartbeat_timeout=11s (default)
•  Heartbeats sent, by default, every 5s
•  Increase timeouts if pauses cause heartbeats to be delayed
•  Warning: it will take longer to discover an instance has failed
•  ha.role_switch_timeout=120s (default)
•  Increase if new instances time out while catching up with master on startup

HA Role Endpoints – Useful for Load Balancing
Endpoint                          State     Status Code      Body
/db/manage/server/ha/master       Master    200 OK           true
                                  Slave     404 Not Found    false
                                  Unknown   404 Not Found    UNKNOWN
/db/manage/server/ha/slave        Master    404 Not Found    false
                                  Slave     200 OK           true
/db/manage/server/ha/available    Master    200 OK           master
                                  Slave     200 OK           slave
From 2.3 onwards
dbms.security.ha_status_auth_enabled=false
neo4j.conf
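
A health checker can read the master and slave endpoints above by status code and body. The sketch below assumes the endpoints are reachable without authentication. It uses only the Python standard library, and the function names are illustrative.

```python
import urllib.error
import urllib.request


def is_in_role(status_code, body):
    """Interpret a response from the .../ha/master or .../ha/slave endpoint."""
    return status_code == 200 and body.strip() == "true"


def check_role(base_url, role, timeout=2.0):
    """GET the role endpoint; role is 'master' or 'slave'."""
    url = "%s/db/manage/server/ha/%s" % (base_url.rstrip("/"), role)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_in_role(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # 404 responses surface as HTTPError; the body carries false or UNKNOWN.
        return is_in_role(err.code, err.read().decode("utf-8"))
```

For example, `check_role("http://10.0.1.10:7474", "slave")` would report whether that instance should receive reads. Connection failures propagate as `urllib.error.URLError`, so the caller decides how to treat an unreachable instance.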

HA JMX Endpoint
JSON Response
•  Alive?
•  Role
•  Last committed transaction ID
•  Instances in cluster
–  Role
–  Instance ID
–  Available?
–  URI
Use it to:
•  Identify slaves falling behind
•  Check that everyone agrees on the composition of the cluster
/db/manage/server/jmx/domain/org.neo4j/instance%3Dkernel%230%2Cname%3DHigh%20Availability
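
Both checks can be automated once the relevant fields have been pulled out of each instance's JMX response. The sketch below works on already-extracted values because the exact JSON layout is not shown here. The input shapes (a mapping of instance ID to last committed transaction ID, and a mapping of instance ID to the member IDs it reports) are assumptions made for illustration.

```python
def lagging_instances(last_tx_ids, master_id, max_lag):
    """Return {instance_id: lag} for instances more than max_lag transactions behind the master."""
    master_tx = last_tx_ids[master_id]
    return {
        instance: master_tx - tx
        for instance, tx in last_tx_ids.items()
        if instance != master_id and master_tx - tx > max_lag
    }


def cluster_views_agree(views):
    """True if every instance reports the same set of cluster members."""
    distinct = {frozenset(members) for members in views.values()}
    return len(distinct) <= 1


print(lagging_instances({"1": 1000, "2": 990, "3": 400}, master_id="1", max_lag=100))
```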

Cross DC-Clusters
•  Same subnet (consider using a VPN)
•  Bandwidth between DCs aligned with write throughput
•  Common practice: instances in secondary run as slave-only
•  Restricts master election to the primary
•  When failing over, reconfigure instances in secondary
neo4j.conf (secondary, normal operation):
ha.slave_only=true
neo4j.conf (secondary, after failover):
ha.slave_only=false

Scale Horizontally For High Read Throughput
[Diagram: Application, Load Balancer (e.g. HAProxy, ELB, NGINX), cluster of Master, Slave, Slave.]

Scale Horizontally For High Read Throughput
[Diagram: Application, Read Load Balancer and Write Load Balancer, cluster of Master, Slave, Slave.]

HAProxy Configuration
http://blog.armbruster-it.de/2015/08/neo4j-and-haproxy-some-best-practices-and-tricks/

Configure HAProxy as Read Load Balancer
global
    daemon
    maxconn 256
defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms
frontend http-in
    bind *:80
    default_backend neo4j-slaves
backend neo4j-slaves
    option httpchk GET /db/manage/server/ha/slave
    server s1 10.0.1.10:7474 maxconn 32 check
listen admin
    bind *:8080
    stats enable

Configure HAProxy as Read Load Balancer
The health check decides which instances receive reads. Only a 200 response passes, so only slaves stay in rotation:
    option httpchk GET /db/manage/server/ha/slave
State      Status Code      Body
Master     404 Not Found    false
Slave      200 OK           true
Unknown    404 Not Found    UNKNOWN

Improve Read Performance with Cache Sharding
[Diagram: Application, Load Balancer, instances 1, 2 and 3.]
MATCH (c:Country{name:'Australia'})...
MATCH (c:Country{name:'Zambia'})...
MATCH (c:Country{name:'Norway'})...

Cache Sharding Using Consistent Routing
[Diagram: Application, Load Balancer, instances 1, 2 and 3. The load balancer routes by country name.]
A-I    1
J-R    2
S-Z    3
MATCH (c:Country{name:'Australia'})...   routed to 1
MATCH (c:Country{name:'Norway'})...      routed to 2
MATCH (c:Country{name:'Zambia'})...      routed to 3
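
The routing table above (A-I to instance 1, J-R to instance 2, S-Z to instance 3) can be sketched as a lookup on the first letter of the country name. This only illustrates the idea. The slides do the routing in the load balancer, and the names below are hypothetical.

```python
ROUTES = [("A", "I", 1), ("J", "R", 2), ("S", "Z", 3)]  # ranges from the slide


def route(country_name):
    """Return the instance number that should serve queries for this country."""
    name = country_name.strip()
    if not name:
        raise ValueError("country name is empty")
    first = name[0].upper()
    for low, high, instance in ROUTES:
        if low <= first <= high:
            return instance
    raise ValueError("no route for %r" % country_name)


for country in ("Australia", "Norway", "Zambia"):
    print(country, route(country))
```

Because the same country always lands on the same instance, that instance's page cache stays warm for the part of the graph it serves.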

Configure HAProxy for Cache Sharding
global
    daemon
    maxconn 256
defaults
    mode http
frontend http-in
    bind *:80
    default_backend neo4j-slaves
backend neo4j-slaves
    balance url_param country_code
    server s1 10.0.1.10:7474 maxconn 32
listen admin
    bind *:8080
    stats enable
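
For `balance url_param country_code` to take effect, the application has to put the routing key in the query string of each request. The sketch below builds such a URL. The load balancer host name is hypothetical, and the path shown is Neo4j's transactional HTTP endpoint, used here as an example target.

```python
from urllib.parse import urlencode


def sharded_url(base_url, path, country_code):
    """Append the country_code query parameter that the load balancer hashes on."""
    query = urlencode({"country_code": country_code})
    return "%s%s?%s" % (base_url.rstrip("/"), path, query)


print(sharded_url("http://lb.example.org", "/db/data/transaction/commit", "AU"))
```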

Backups
Modes
•  Full
•  Incremental
•  On top of a previous backup
•  Uses logical logs to apply changes, so logs must be kept at least 2 x backup interval

Consistency Check
•  Part of full backup and standalone tool
•  Evaluate store health
•  -verify false to disable in backup
dbms.tx_log.rotation.retention_policy=7 days (default)
neo4j.conf
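
The rule that logs must be kept for at least 2 x the backup interval can be checked against the configured retention. This is a minimal sketch. The function names are illustrative, and the retention value has to be read from the configuration by hand.

```python
from datetime import timedelta


def min_log_retention(incremental_interval):
    """Logs must be kept for at least twice the incremental backup interval."""
    return 2 * incremental_interval


def retention_is_sufficient(retention, incremental_interval):
    """True if the configured retention satisfies the 2 x interval rule."""
    return retention >= min_log_retention(incremental_interval)


# Default retention of 7 days against hourly incremental backups.
print(retention_is_sufficient(timedelta(days=7), timedelta(hours=1)))
```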

Backup Strategies
•  Local or remote backups
•  If backing up to remote machine, consistency check takes place offline with respect to the database
•  Backup from a dedicated slave or round robin
•  Choose a schedule:
•  Full once per day, incremental every hour
•  To restore from backup:
•  Stop instance
•  Replace graph.db with backup
•  Start instance
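
The three restore steps can be scripted. The sketch below assumes the caller supplies the stop and start actions, for example by invoking the bin/neo4j script. Keeping the previous store as `graph.db.old` is an added precaution, not part of the slide.

```python
import shutil
from pathlib import Path


def replace_store(graph_db_dir, backup_dir):
    """Replace graph.db with the backup. The instance must already be stopped."""
    graph_db = Path(graph_db_dir)
    old = graph_db.with_name(graph_db.name + ".old")
    if old.exists():
        shutil.rmtree(old)           # discard the copy kept by an earlier restore
    if graph_db.exists():
        graph_db.rename(old)         # keep the replaced store for inspection
    shutil.copytree(backup_dir, graph_db)


def restore(stop_instance, start_instance, graph_db_dir, backup_dir):
    """Stop the instance, replace graph.db with the backup, start the instance."""
    stop_instance()
    replace_store(graph_db_dir, backup_dir)
    start_instance()
```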

Backup Strategies
[Diagram: a Backup Server takes backups from instances A, B and C in turn.]
A – full, consistency check
B – full, consistency check
C – full, consistency check
A – incremental
B – incremental
C – incremental
…
A – incremental
B – incremental
C – incremental
A – full, consistency check
B – full, consistency check
C – full, consistency check
bin/neo4j-backup
-from single://neo4j.example.org:20000
-to /backups/201510151318263/graph.db
-verify true|false
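
A scheduler can assemble the command line above from its parts. The sketch below only builds the argument list, using the flags shown on the slide. The function name and the default port parameter are illustrative.

```python
def backup_command(host, target_dir, verify, port=20000):
    """Build the neo4j-backup argument list from the flags shown on the slide."""
    return [
        "bin/neo4j-backup",
        "-from", "single://%s:%d" % (host, port),
        "-to", target_dir,
        "-verify", "true" if verify else "false",
    ]


print(" ".join(backup_command("neo4j.example.org", "/backups/201510151318263/graph.db", True)))
```

The list can be passed directly to `subprocess.run`, with verification switched on for the daily full backup and off for the hourly incrementals.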

Monitoring
Pull
•  Metrics available via JMX and HTTP and in browser
Push
•  Metrics publishing from 2.3 onwards (Enterprise)
•  Node, relationship, property counts
•  Network/cluster
•  Transactions (active, started, committed, rolled back, etc)
•  Neo4j page cache (page faults, evictions, flushes, exceptions)
•  JVM
•  Published to:
•  Graphite
•  Ganglia
•  CSV
metrics.graphite.enabled=true
metrics.graphite.server=52.29.63.174:2003
metrics.prefix=neo4j-1
neo4j.conf

Collate Internal and External Views of the System
System
•  collectd
Database
•  Metrics
•  Tail neo4j.log
HA Endpoints
•  /db/manage/server/ha/master
•  /db/manage/server/ha/slave
Server Latencies
•  http.log
Cypher Queries
•  dbms.logs.query.enabled=true
•  dbms.logs.query.threshold=2s
Application metrics
•  End-to-end latencies

Test at Scale
Soak Tests
•  Representative dataset and queries
•  Peak load and above
Verify
•  Correctness
•  Performance
•  Latency
•  Throughput
•  Stability
Operations
•  Backup
•  Disaster recovery
•  Replace instances

Performance Tip – Use the Cypher Query Planner
•  Without the index: 8,386,880 hits
•  With the index: 59,272 hits
CREATE INDEX ON :Crime(description)

Performance Tip – Write Requests
•  Align the number of concurrent write requests with the number of
Neo4j server threads on the master
•  By default, number of server threads = number of CPUs reported available
by the JVM
•  Configure the number of threads in neo4j.conf using
org.neo4j.server.webserver.maxthreads
•  Service requests from a thread pool in your application
•  Use the thread pool queue depth to apply back pressure
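
One way to apply these points in application code is a fixed-size pool with a bounded number of waiting requests. The sketch below is a rough illustration, not a prescribed design. `BoundedWriter` and `write_fn` are hypothetical names, and `workers` would be aligned with the number of server threads on the master.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedWriter:
    """Thread pool that refuses new writes once its queue is full."""

    def __init__(self, write_fn, workers, queue_depth):
        self._write_fn = write_fn
        self._pool = ThreadPoolExecutor(max_workers=workers)
        # One permit per request that is either running or waiting in the queue.
        self._permits = threading.BoundedSemaphore(workers + queue_depth)

    def submit(self, payload, timeout=None):
        """Queue a write; raise RuntimeError if no slot frees up within timeout."""
        if not self._permits.acquire(timeout=timeout):
            raise RuntimeError("write queue full; apply back pressure upstream")
        try:
            future = self._pool.submit(self._write_fn, payload)
        except BaseException:
            self._permits.release()
            raise
        # Free the slot when the write finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._permits.release())
        return future

    def close(self):
        self._pool.shutdown(wait=True)
```

With `timeout=None` a caller blocks until a slot is free, which slows producers down. With a short timeout the caller gets an error it can turn into a "try again later" response.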

Performance Tip – Batch Writes Using a Queue
[Diagram: multiple Write requests feed a Queue; a Single Thread drains the queue and writes each Batch.]
http://maxdemarzi.com/2013/09/05/scaling-writes/
http://maxdemarzi.com/2014/07/01/scaling-concurrent-writes-in-neo4j/
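
The queue-and-single-thread pattern can be sketched with the standard library. This is a generic illustration, not the implementation from the linked posts. `write_batch` is a hypothetical callable that would commit one batch, for example in a single transaction.

```python
import queue

STOP = object()  # sentinel placed on the queue to shut the writer down


def batch_writer(work_queue, write_batch, max_batch=100):
    """Drain the queue from a single thread, handing items to write_batch in batches."""
    stopping = False
    while not stopping:
        item = work_queue.get()            # block until there is something to write
        if item is STOP:
            break
        batch = [item]
        while len(batch) < max_batch:      # take whatever else is already waiting
            try:
                item = work_queue.get_nowait()
            except queue.Empty:
                break
            if item is STOP:
                stopping = True
                break
            batch.append(item)
        write_batch(batch)
```

Request handlers call `work_queue.put(item)`, and one dedicated thread runs `batch_writer`. Giving the queue a `maxsize` makes producers block when the writer falls behind.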

Performance Tip – JVM
•  Look for GC pauses in debug.log
•  grep blocked data/databases/graph.db/debug.log
•  Caused by
•  Heap too small
•  New/survivor space too small
•  Badly written Cypher query or stored procedure
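
The grep above can be extended to pull out pause lengths and keep only the long ones. The sketch below assumes each matching line contains the word "blocked" and a duration written as a number followed by "ms". That line format is an assumption and may differ between versions.

```python
import re

_PAUSE_MS = re.compile(r"(\d+)\s*ms")  # assumed duration format, e.g. "250ms"


def blocked_pauses(lines, min_ms=0):
    """Return (pause_ms, line) for log lines mentioning 'blocked' with a pause >= min_ms."""
    pauses = []
    for line in lines:
        if "blocked" not in line:
            continue
        match = _PAUSE_MS.search(line)
        if match and int(match.group(1)) >= min_ms:
            pauses.append((int(match.group(1)), line.rstrip("\n")))
    return pauses
```

A typical use is `blocked_pauses(open("data/databases/graph.db/debug.log"), min_ms=1000)`, which lists pauses of a second or more.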

Enable GC Logging
Log will be written to logs/neo4j-gc.log
wrapper.java.additional=-Xloggc:logs/neo4j-gc.log
wrapper.java.additional=-XX:+PrintGCDetails
wrapper.java.additional=-XX:+PrintGCDateStamps
wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
wrapper.java.additional=-XX:+PrintTenuringDistribution
wrapper.java.additional=-XX:+PrintGCCause
neo4j-wrapper.conf
