10. 10
Data Set Overview
• Commercial Real Estate Data
• Both Internal and External Datasets - Quality of the data varies significantly
• Different types of Data: Transactions, Entities, Individuals, Assets
• Relationships are inherently hidden due to the nature of the business
11. 11
Problem Overview
• Find the hidden relationships between different entities in Commercial Real Estate
Data
• Help Business easily analyze the data and gain insights into hidden relationships
• Expose Data through easy to consume APIs
• Both Data Storage and Data Queries need to scale - start small and add more datasets
and users
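To make "hidden relationships" concrete, here is a minimal sketch of the kind of question the platform answers: given links between people, entities, assets and transactions, is there a chain connecting two of them? It uses a plain in-memory map and breadth-first search rather than a graph database, and the class name, method names and sample identifiers are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for a graph store; not the platform's actual code. */
final class RelationshipFinder {
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    /** Records an undirected link, e.g. an individual tied to a company. */
    void link(String a, String b) {
        adjacency.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Breadth-first search: returns a shortest chain from start to goal, or an empty list. */
    List<String> shortestPath(String start, String goal) {
        if (!adjacency.containsKey(start) || !adjacency.containsKey(goal)) {
            return List.of();
        }
        Map<String, String> parent = new HashMap<>(); // node -> node it was reached from
        Deque<String> queue = new ArrayDeque<>();
        parent.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(goal)) {
                // Walk the parent links back to the start, then reverse.
                List<String> path = new ArrayList<>();
                for (String n = goal; n != null; n = parent.get(n)) {
                    path.add(n);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : adjacency.get(current)) {
                if (!parent.containsKey(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        RelationshipFinder g = new RelationshipFinder();
        g.link("Person:Alice", "Entity:Acme LLC");
        g.link("Entity:Acme LLC", "Asset:100 Main St");
        g.link("Asset:100 Main St", "Transaction:T-1");
        System.out.println(g.shortestPath("Person:Alice", "Transaction:T-1"));
    }
}
```

In a graph database the same question becomes a path traversal, which is what lets the storage and the queries scale beyond what fits in one process.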
13. 13
Platform Design Goals
• Private Data Center
• All Open Source Tools
• Ability to Iterate faster
• Multi-tenant – one platform for all lines of business and all teams
• Easily scalable
• Keep it Simple, Stupid
14. 14
Technologies That Power the Platform
• Data Storage
• Graph Database
• Search Index
• Containerization
• API/Application Deployment
• Java Based API Framework
15. 15
JanusGraph
• Forked from Titan DB
• Support for multiple persistence engines
• Integration with geospatial and text search (Elasticsearch)
• Implements Apache TinkerPop Gremlin Server
• Support For Apache TinkerPop Gremlin Language
• Open Source Apache 2.0 Licensing
16. 16
Cassandra
• Elastic and Linear Scalability with Data Growth
• Resiliency to hardware failures
• Replication across data centers for OLTP and OLAP workloads
• Open Source Apache 2.0 Licensing
19. 19
Infrastructure Configuration
• 3 Node Cluster
• Cassandra and JanusGraph servers are co-hosted
• 1 TB SSD on each node, for a total of 3 TB
• 16 cores
• 128 GB total Memory - Ability to scale vertically as we grow
• RHEL 7.3
• Standalone Elasticsearch cluster for text indexing
20. 20
Infrastructure Deployment Configuration
• Use Ansible for JanusGraph and Cassandra deployment
• Configuration expressed as YAML
• Declarative representation of deployment
• Agentless; only requires Python 2.7 on the host
• Simple for small teams
• Consistently reproduce the environment
22. 22
API Layer
• Create read-only APIs that serve the data to the applications
• Use Spring Boot for API development
• Use Docker and Mesos for a scalable API layer
• Publish feedback information to Kafka, which then goes through the
pipeline again
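As a rough illustration of what "read-only" means at the HTTP level, the sketch below accepts GET requests and answers every other method with 405. The slides use Spring Boot for this; the sketch swaps in the JDK's built-in `com.sun.net.httpserver.HttpServer` only so that it runs without dependencies. The `/related/{id}` URL layout and the sample data are assumptions made for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical read-only endpoint; the sample data and URL layout are illustrative only. */
final class ReadOnlyApi {
    // Stand-in for results that would normally come from the graph database.
    private static final Map<String, List<String>> RELATED =
            Map.of("acme", List.of("100-main-st"));
    private static final String PREFIX = "/related/";

    /** Starts the server; passing port 0 lets the OS choose a free port. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext(PREFIX, exchange -> {
            if (!"GET".equals(exchange.getRequestMethod())) {
                // Read-only: refuse anything that could modify data.
                exchange.getResponseHeaders().set("Allow", "GET");
                exchange.sendResponseHeaders(405, -1); // -1 means no response body
                exchange.close();
                return;
            }
            String id = exchange.getRequestURI().getPath().substring(PREFIX.length());
            // Naive JSON array; assumes ids contain no characters that need escaping.
            String json = RELATED.getOrDefault(id, List.of()).stream()
                    .map(s -> "\"" + s + "\"")
                    .collect(Collectors.joining(",", "[", "]"));
            byte[] body = json.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(8080);
        System.out.println("Listening on port " + server.getAddress().getPort());
    }
}
```

Keeping writes out of the API means all changes flow through the ingestion pipeline, which is why feedback is published to Kafka rather than written to the graph directly.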
23. 23
Spring Boot
• Embedded HTTP Server
• Simple Application Configuration
• Easy to design REST endpoints
• Ease of Testing
• Java based (Apache TinkerPop Gremlin Language Variant)
24. 24
Docker
• Simplifies Integration Testing
• Application deployed as generic unit (Containers)
• Configuration provided through environment variables
• Easy to set up a developer environment
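One way to read "configuration provided through environment variables" is sketched below: the application builds its settings from the environment and falls back to local defaults, so the same container image runs unchanged on a laptop and on the cluster. The variable names `GRAPH_HOST` and `GRAPH_PORT` and the class itself are hypothetical; 8182 is used as the default only because it is Gremlin Server's default port.

```java
import java.util.Map;

/** Hypothetical settings holder; the variable names are illustrative, not from the slides. */
final class AppConfig {
    final String graphHost;
    final int graphPort;

    private AppConfig(String graphHost, int graphPort) {
        this.graphHost = graphHost;
        this.graphPort = graphPort;
    }

    /** Builds the config from an environment map, falling back to local defaults. */
    static AppConfig fromEnv(Map<String, String> env) {
        String host = env.getOrDefault("GRAPH_HOST", "localhost");
        int port = Integer.parseInt(env.getOrDefault("GRAPH_PORT", "8182"));
        return new AppConfig(host, port);
    }

    public static void main(String[] args) {
        // In a container these values would be set with `docker run -e GRAPH_HOST=...`.
        AppConfig cfg = fromEnv(System.getenv());
        System.out.println(cfg.graphHost + ":" + cfg.graphPort);
    }
}
```

Taking the environment as a `Map` parameter instead of calling `System.getenv()` inside the method keeps the parsing logic easy to unit test.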
25. 25
Mesos/Marathon
• Scalability: Easy to scale as the load increases. Supports both horizontal and vertical scaling
on a cluster of machines.
• Resiliency: If a container dies, Mesos will act on it as necessary and spawn a new container.
• Multi-Tenancy: Easy to control how resources are used and to prioritize jobs' access to limited
resources.
• Service Discovery and Load Balancing: Easy to load balance and allow services to be
discovered automatically.
• Health Check: Out of the box health checks.
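The resiliency and health check bullets rest on a simple rule: probe each task periodically and replace it once it has failed too many times in a row. The sketch below shows only that counting rule. It is not Marathon's implementation; the class is hypothetical, and the threshold is named after the `maxConsecutiveFailures` setting in Marathon health check definitions.

```java
/** Hypothetical illustration of a consecutive-failure rule; not Marathon's own code. */
final class HealthTracker {
    private final int maxConsecutiveFailures;
    private int consecutiveFailures = 0;

    HealthTracker(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Records one probe result; returns true when the task should be replaced. */
    boolean record(boolean healthy) {
        // A single healthy probe resets the streak, so brief blips do not kill a task.
        consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;
        return consecutiveFailures >= maxConsecutiveFailures;
    }

    public static void main(String[] args) {
        HealthTracker tracker = new HealthTracker(3);
        boolean[] probes = {true, false, false, true, false, false, false};
        for (boolean healthy : probes) {
            System.out.println("healthy=" + healthy + " replace=" + tracker.record(healthy));
        }
    }
}
```

With a threshold of 3, only the final probe in `main` reports that the task should be replaced, because the healthy probe in the middle resets the count.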
29. 29
Concluding Thoughts
• Start simple - do not over-engineer
• Build a solid CI/CD pipeline - makes development faster
• Ensure developers can work in parallel
• Sample Demo code available here:
• https://github.com/experoinc/spring-boot-graph-day
• https://github.com/experoinc/dropwizard-tinkerpop