Datastax and Cassandra at Nexgate
Rich Sutton, CTO
Harold Nguyen, Sr. Data Scientist
A Little About Us
Company – Security & Compliance for Social
▪ Launched April 2013 - Series A from Sierra & WindForce Ventures
– 15 employees, 7 in Engineering (2 Data Scientists)
▪ Security guys from:
▪ Customers:
Key Enterprise Pain Points
① Brand social account sprawl
• Can't inventory, audit, track social media infrastructure
• Can't continuously find fake accounts
② Inbound protection for accounts
• Nothing to detect and remediate account
anomalies / hacks
• No automated coverage for volumes of
inappropriate and malicious content
③ Outbound compliance controls
• Too many admins and apps installed
across multiple accounts
• Little or no automated coverage for
sensitive and regulated data
Novartis Slapped
by the FDA
FINRA begins social
compliance audits
Spam
Where Nexgate Fits
Protecting the social account itself
Nexgate
Protect branded accounts and ensure compliance
▪ Find, audit, and track the actual social accounts of the brand
▪ Catch & remediate social account hacks, tampering, and misuse
▪ Remove bad 'inbound' content including spam, malware, and acceptable-use violations
▪ Enforce usage of approved publishing platforms
▪ Comply with regulations using prebuilt content policies, workflow, and intelligent archiving
Listening Platforms
Mine external social data and conversations
• Find brand 'mentions' and present them with inferences
• Provide volumes of market data that is analyzed for trends, share of voice, etc.
• Social CRM identification of key conversations and influencers that may need engagement
Publishing Platforms
Engage audiences and track outcomes
• Build communities
• Deliver content, custom apps, ads with workflow
• Promotions, contests, and campaigns
:001> Content classification is what we do. The completeness of any
classification system is predicated on the breadth of the corpus of data upon
which it is built.
:002> We made a lazy storage choice.
:003> Some success forced our hand.
:004> Social data is small and jagged.
• Average 1K all in, content and metadata
• Some common small stuff: time, social IDs, parent, account
• Some common big stuff: content, links
• Lots of disparate stuff, specific to the social platform
:005>
Keep in SQL: Fixed length, non-null, heavily indexed, group access
Keep in NoSQL: Variable length, commonly null, non-indexed, single access, text search
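A minimal sketch of how that split might look in application code. The field names are hypothetical (the slides name only the categories of data), and which fields count as "fixed length, non-null" is an assumption made for illustration.

```python
# Hypothetical field names; the slides name only the categories of data.
SQL_FIELDS = ("item_id", "posted_at", "social_id", "account_id")


def split_record(item):
    """Split one social item into a fixed-shape SQL row and a sparse NoSQL document."""
    missing = [name for name in SQL_FIELDS if item.get(name) is None]
    if missing:
        raise ValueError(f"SQL-bound fields must be non-null: {missing}")

    sql_row = {name: item[name] for name in SQL_FIELDS}
    # Everything else is variable-length or often absent, so nulls are dropped, not stored.
    nosql_doc = {
        key: value
        for key, value in item.items()
        if key not in SQL_FIELDS and value is not None
    }
    nosql_doc["item_id"] = item["item_id"]  # shared key that ties the two stores together
    return sql_row, nosql_doc


sql_row, nosql_doc = split_record({
    "item_id": 42,
    "posted_at": 1380000000,
    "social_id": "fb:123",
    "account_id": 7,
    "content": "Great post!",
    "links": None,        # commonly null, so it never reaches the NoSQL document
    "fb_like_count": 3,   # platform-specific field
})
```

Keeping `item_id` in both halves is one simple way to let the relational side join back to the document side.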
:006> Requirements
• Simple, proven horizontal scalability
• Integrated tools for research: search, analysis
• Operational simplicity; nodes all the same
• Enterprise support
:007> Deployment
• Multi-region AWS
• M1 Large instances
• Instance attached storage
• About to scale again
• Separate dev, test, prod clusters
Datastax:
• Start-up pricing, per-core pricing
• On site experts, responsive support
Scale of Data
▪ Over 250 million pieces of social media content in total, spread across Facebook, Twitter, YouTube, Google+, LinkedIn
▪ Currently about half a million new pieces of content per day
– All classified in real time as it comes in
▪ About 50,000 new social media content authors per day
▪ Cassandra is a great choice of database: it allows flexibility for the rapidly changing landscape of social media threats
Data throughput
Average reads = 70 / sec
Average writes = 25 / sec
Fighting Spam with Cassandra
▪ Among the many security and compliance classifications that Nexgate provides, we also have powerful spam detection
▪ Spam can be a single link directing to a fraudulent site (screenshot of a Facebook comment):
▪ Or it can be less obvious, and more personal. This is extremely common. Here, the same user has posted the same message across different social media accounts (screenshot taken from Nexgate product):
Social media spam grew by
355% in the first half of 2013.
Get the report at http://nx.gt/SocialSpamReport
Cassandra and Social Media Spam
▪ Can create spam signatures to catch this type of content
▪ ...but it would be too slow to catch spam in real time.
▪ Cassandra
Define Your Data Model
▪ Even though Cassandra is a schema-less NoSQL database, it is worth carefully defining the data model
▪ Can't just “throw data at it” – that can make for some really inefficient queries
▪ Define the data model based on how you will query the data
▪ For us, we want to find spam content that has been posted multiple times
– Spammers tend to post same-content messages
Spam Multiplicity Data Model
▪ Typical table in Cassandra
– Wide “unconstrained” rows are a nice feature compared with SQL
▪ Row key -> hash of content
▪ Column key -> unique ID (strictly increasing with time)
▪ Column value -> item_id and time of post
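A minimal in-memory sketch of the spam multiplicity layout, with a plain dictionary standing in for the Cassandra table. The hash function (SHA-256), the counter used for unique IDs, and the names `table`, `row_key`, and `record_post` are assumptions for illustration; the slides do not specify them.

```python
import hashlib
import itertools

# Stand-in for the Cassandra table: row key -> {column key -> column value}.
table = {}
# Hypothetical ID source; any strictly increasing ID (for example a time-based one) would do.
_next_id = itertools.count(1)


def row_key(content):
    """Hash of the content. SHA-256 is an assumption; the slides do not name a hash."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def record_post(content, item_id, posted_at, column_key=None):
    """Store one post as a column in the row for its content; returns the column key."""
    if column_key is None:
        column_key = next(_next_id)
    table.setdefault(row_key(content), {})[column_key] = (item_id, posted_at)
    return column_key


record_post("Win a free phone at example.test", "item-1", 1380000000)
record_post("Win a free phone at example.test", "item-2", 1380000090)
record_post("Nice photo!", "item-3", 1380000100)
```

Two posts with identical content land in the same row as two columns, while the unrelated comment gets a row of its own.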
Why this Data Model?
▪ Spammers typically post the same content over and over
▪ Easy to determine how many times a same-content post is made: check the number of columns
▪ Will never double count: writing an existing column key updates that column instead of adding a new one
▪ Indexed by the content, so quick reads and writes
▪ By reading the column values, can extract the time series of duplicated posts
– Can also map back to the original value – we store actual content indexed by the item_id in another Cassandra table
▪ Cassandra is not a magic bullet
– Still need a relational database to glue all the pieces of data together
– Batch processing may need other tools like Hadoop
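The read-side properties of this data model can be sketched against a single wide row. The row contents, the `content_by_item` lookup, and the function names are made up for illustration; a dictionary stands in for Cassandra.

```python
from datetime import datetime, timezone

# One wide row (all posts sharing a content hash): column key -> (item_id, epoch seconds).
row = {
    101: ("item-a", 1380000000),
    102: ("item-b", 1380000060),
    105: ("item-c", 1380003600),
}
# Stand-in for the second table that stores the actual content by item_id.
content_by_item = {"item-a": "same rant", "item-b": "same rant", "item-c": "same rant"}


def duplicate_count(row):
    """How many times the content was posted: the number of columns in the row."""
    return len(row)


def post_times(row):
    """Time series of the duplicated posts, ordered by column key."""
    return [datetime.fromtimestamp(ts, tz=timezone.utc) for _, (_, ts) in sorted(row.items())]


def original_content(row):
    """Map the earliest column back to the stored content via its item_id."""
    first_item_id, _ = row[min(row)]
    return content_by_item[first_item_id]


# Re-writing an existing column key replaces the value, so the count is unchanged.
row[102] = ("item-b", 1380000060)
```

Because the column keys increase with time, sorting by key yields the posts in order, and the gaps between timestamps show how quickly the duplicates arrived.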
Real-world spam multiplicity
▪ This has become invaluable to us for catching spam content in real time – the following “rant” comment was posted 38 times…
– Brands can more easily moderate given automated tools
▪ In another example, a customer received 25,000 inappropriate messages, and this tool helped us automate content removal
Importance of Keeping All Data
▪ Another way to tackle real-time spam is by identifying spammy users
– Since Cassandra effortlessly keeps all the content we have observed, our algorithm takes into account all the posts contributed by an author to determine if they are a spammer
▪ Additionally, it is important to keep all data to train our 100+ classifiers
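The slides do not describe the author-level algorithm, so the following is only an illustrative heuristic, not Nexgate's method: with every post retained, one simple signal is the share of an author's posts that repeat a message the same author already posted. The function name, the normalization, and the sample data are all assumptions.

```python
import hashlib
from collections import Counter, defaultdict


def duplicate_ratio_by_author(posts):
    """posts: iterable of (author_id, content) pairs.

    Returns author_id -> share of that author's posts that repeat one of
    their own earlier messages (0.0 means every post was distinct).
    """
    seen = defaultdict(Counter)
    for author, content in posts:
        # Light normalization (an assumption) so trivial edits still count as repeats.
        normalized = content.strip().lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        seen[author][digest] += 1

    ratios = {}
    for author, counts in seen.items():
        total = sum(counts.values())
        repeats = total - len(counts)  # posts beyond the first copy of each message
        ratios[author] = repeats / total
    return ratios


sample_posts = [
    ("a", "x"), ("a", "x"), ("a", "X "), ("a", "y"),
    ("b", "hello"),
]
ratios = duplicate_ratio_by_author(sample_posts)
```

A real classifier would combine many more signals; the point here is only that such a score needs the author's full posting history to be available.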
Tuning Cassandra
▪ Cassandra has actually been humming along quite nicely!
– Barely any tweaking needed from default values
– No deletes (just the nature of our dataset) => repairs are performed infrequently (repair is done to resolve inconsistencies across all replicas of data due to deletes)
• Fine for us, because repair requires intensive disk I/O
▪ Only times we observed performance issues:
– When the rates of our reads and writes reached a certain threshold
– When the size of the data being inserted was too large
– Heap memory issue with Cassandra 1.1.x
▪ In all cases, Datastax provided a quick and simple solution, mostly just toggling a few parameters in config files and restarting the nodes
Cassandra Community
▪ Community is wonderful - it's really easy to jump on the Cassandra IRC channel and talk to fellow users and developers to get real-time feedback.
– With IRC and mailing list help, we implemented composite columns to detect malware sites on our second day of using Cassandra, three years ago
▪ In fact, when we tested a migration to the latest version of Cassandra and one of our Ruby wrappers didn't play nicely with CQL3, I was able to speak directly with the Ruby wrapper's author on IRC and got an explanation of why it didn't work.
– The same day, I committed a fix and opened a pull request against the Ruby wrapper on GitHub, and the author looked at it the next morning
▪ Datastax support has been invaluable for providing fast feedback and simple solutions
Datastax Additional Tools
▪ OpsCenter is helpful in debugging performance issues
▪ Solr – used to obtain training data for classifiers by phrase matching
▪ Looking forward:
– Datastax Hadoop support, to look into training on labeled data with MapReduce
Thank you Datastax and RelateIQ!
Let us show you: nexgate.com/demo
Follow us:
@NXGate
facebook.com/NXGate

Social Security Company Nexgate's Success Relies on Apache Cassandra

  • 1. Datastax and Cassandra at Nexgate Rich Sutton, CTO Harold Nguyen, Sr. Data Scientist
  • 2. A Little About Us Company – Security & Compliance for Social  Launched April 2013 - Series A from Sierra & WindForce Ventures – 15 employees, 7 in Engineering (2 Data Scientists)  Security guys from:  Customers:
  • 3. Key Enterprise Pain Points ① Brand social account sprawl • Can‟t inventory, audit, track social media infrastructure • Can‟t continuously find fake accounts ② Inbound protection for accounts • Nothing to detect and remediate account anomalies / hacks • No automated coverage for volumes of inappropriate and malicious content ③ Outbound compliance controls • Too many admins and apps installed across multiple accounts • Little or no automated coverage for sensitive and regulated data Novartis Slapped by the FDA FINRA begins social compliance audits Spam
  • 4. Where Nexgate Fits Protecting the social account itself Nexgate Protect branded accounts and ensure compliance  Find, audit, and track the actual social accounts of the brand  Catch & remediate social account hacks, tampering, and misuse  Remove bad „inbound‟ content including spam, malware, and acceptable use  Enforce usage of approved publishing platforms  Comply with regulations using prebuilt content policies, workflow, and intelligent archiving Listening Platforms Mine external social data and conversations • Find brand „mentions‟ and present them with inferences • Provide volumes of market data that is analyzed for trends, share of voice, etc. • Social CRM identification of key conversations and influencers that may need engagement Publishing Platforms Engage audiences and track outcomes • Build communities • Deliver content, custom apps, ads with workflow • Promotions, contests, and campaigns
  • 5. :001> Content classification is what we do. The completeness of any classification system is predicated on the breadth of the corpus of data upon which it is built.
  • 6. :002> We made a lazy storage choice.
  • 7. :003> Some success forced our hand.
  • 8. :004> Social data is small and jagged. • Average 1K all in, content and metadata • Some common small stuff: time, social IDs, parent, account • Some common big stuff: content, links • Lots of disparate stuff, specific to the social platform
  • 9. :005> Keep in SQL: Fixed length, non-null, heavily indexed, group access Keep in NoSQL: Variable length, commonly null, non indexed, single access, text search
  • 10. :006> Requirements • Simple, proven horizontal scalability • Integrated tools for research: search, analysis • Operational simplicity; nodes all the same • Enterprise support
  • 11. :007> Deployment • Multi-region AWS • M1 Large instances • Instance attached storage • About to scale again • Separate dev, test, prod clusters Datastax: • Start-up pricing, per-core pricing • On site experts, responsive support
  • 12.  Over 250 million pieces of social media total content spread across Facebook, Twitter, YouTube, Google+, LinkedIn  Currently about half a million new content per day – All classified in real time as it comes in  About 50,000 new social media content authors per day  Cassandra is a great choice for a database– allows flexibility for the ever rapidly-changing landscape of social media threats Scale of Data
  • 13. Data throughput • Average reads = 70 / sec • Average writes = 25 / sec
  • 14. Fighting Spam with Cassandra • Among the many security and compliance classifications that Nexgate provides, we also have powerful spam detection • Spam can be a single link directing to a fraudulent site (screenshots of a Facebook comment)
  • 15. • Or it can be less obvious, and more personal. This is extremely common. Here, the same user has posted the same message across different social media accounts (screenshot taken from Nexgate product)
  • 16. Social media spam grew by 355% in the first half of 2013. Get the report at http://nx.gt/SocialSpamReport
  • 17. Cassandra and Social Media Spam • Can create spam signatures to catch this type of content • ...but it would be too slow to catch spam in real time • Cassandra
  • 18. Define Your Data Model • Even though Cassandra is a NoSQL schemaless database, it is worth carefully defining the data model • Can't just “throw data at it”: that can make for some really inefficient queries • Define the data model based on how you will query the data • For us, we want to detect spam content that has been posted multiple times – Spammers tend to post same-content messages
  • 19. Spam Multiplicity Data Model • Typical table in Cassandra – Wide “unconstrained” rows are a nice feature compared with SQL • Row key -> hash of content • Column key -> unique ID (strictly increasing with time) • Column value -> item_id and time of post
  • 20. Why This Data Model? • Spammers typically post the same content over and over • Easy to determine how many times the same content has been posted: count the columns in the row • Will never double count, because writing an existing column key simply updates the column instead of adding one • Keyed by the content hash, so reads and writes are quick • By reading the column values, can extract the time series of duplicated posts – Can also map back to the original value: we store the actual content, indexed by item_id, in another Cassandra table • Cassandra is not a magic bullet: still need a relational database to glue all the pieces of data together – Batch processing may need other tools like Hadoop
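The spam multiplicity model from slides 19 and 20 (row key = hash of content, column key = unique ID, column value = item_id and time of post) can be sketched in memory as follows. This is a stand-in built on plain Python dictionaries, not the Cassandra client API; the class name, method names, and the use of SHA-1 as the content hash are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


class SpamMultiplicityTable:
    """In-memory stand-in for the wide-row table (hypothetical names)."""

    def __init__(self):
        # row key (content hash) -> {column key (unique ID) -> (item_id, posted_at)}
        self.rows = defaultdict(dict)

    @staticmethod
    def row_key(content):
        # Assumption: SHA-1 of the raw text; the slides only say "hash of content".
        return hashlib.sha1(content.encode("utf-8")).hexdigest()

    def record_post(self, content, unique_id, item_id, posted_at):
        # Writing an existing column key overwrites it, so a replayed
        # write never inflates the count.
        self.rows[self.row_key(content)][unique_id] = (item_id, posted_at)

    def multiplicity(self, content):
        # Number of columns in the row = number of distinct posts of this content.
        return len(self.rows.get(self.row_key(content), {}))

    def time_series(self, content):
        # Column keys increase with time, so sorting them orders the posts.
        columns = self.rows.get(self.row_key(content), {})
        return [columns[key] for key in sorted(columns)]
```

A short usage example: recording the same post twice leaves the count unchanged, which is the "never double count" property the slide describes.

```python
table = SpamMultiplicityTable()
table.record_post("Click here for free followers", 1, "item-a", 100)
table.record_post("Click here for free followers", 2, "item-b", 105)
table.record_post("Click here for free followers", 2, "item-b", 105)  # replayed write
print(table.multiplicity("Click here for free followers"))
```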
  • 21.
  • 22. Real-world spam multiplicity • This has become invaluable to us for catching spam content in real time – The following “rant” comment was posted 38 times… – Brand can more easily moderate given automated tools • In another example, a customer received 25,000 inappropriate messages, and this tool helped us automate content removal
  • 23. Importance of Keeping All Data • Another way to tackle real-time spam is by identifying spammy users – Since Cassandra effortlessly keeps all the content we observed, our algorithm takes into account all the posts contributed by an author to determine if they are a spammer • Additionally, it is important to keep all data to train our 100+ classifiers
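The slides do not describe how the author-level algorithm works. Purely as an illustration of why keeping every post matters, here is one simple heuristic that needs an author's full history: the share of their posts that repeat other posts of theirs. The function names, the `min_posts` guard, and the `threshold` value are all hypothetical.

```python
from collections import Counter


def duplicate_ratio(posts):
    """Fraction of an author's posts whose text appears more than once."""
    if not posts:
        return 0.0
    counts = Counter(posts)
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(posts)


def looks_spammy(posts, min_posts=5, threshold=0.5):
    # Assumed guard: do not judge authors with very little history.
    return len(posts) >= min_posts and duplicate_ratio(posts) >= threshold
```

A heuristic like this only becomes reliable as history accumulates, which is consistent with the slide's point about never discarding observed content.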
  • 24. Tuning Cassandra • Cassandra has actually been humming along quite nicely! – Barely any tweaking needed from default values – No deletes (just the nature of our dataset) => repairs are rarely performed (repair is done to resolve inconsistencies across all replicas of data due to deletes) • Fine for us, because repair requires intensive disk I/O • Only times we observed performance issues: – When the rates of our reads and writes reached a certain threshold – When the size of the data being inserted was too large – Heap memory issue with Cassandra 1.1.x • In all cases, Datastax provided a quick and simple solution, mostly just toggling a few parameters in config files and restarting the nodes
  • 25. Cassandra Community • Community is wonderful: it's really easy to jump on the Cassandra IRC channel and talk to fellow users and developers to get real-time feedback. – With IRC and mailing list help, implemented composite columns to detect malware sites on the second day of using Cassandra 3 years ago • In fact, when we tested a migration to the latest version of Cassandra and one of our Ruby wrappers didn't play nice with CQL3, I was able to speak directly with the Ruby wrapper's author on IRC and got an explanation of why it didn't work. – The same day, I committed a fix to the Ruby wrapper and opened a pull request on GitHub, and the author looked at it the next morning • Datastax support has been invaluable for providing fast feedback and simple solutions
  • 26. Datastax Additional Tools • OpsCenter is helpful in debugging performance issues • Solr: used to obtain training data for classifiers by phrase matching • Looking forward: – Datastax Hadoop support, to look into training on labeled data with MapReduce
  • 27. Thank you Datastax and RelateIQ! Let us show you: nexgate.com/demo Follow us: @NXGate facebook.com/NXGate

Editor's notes

  1. Understanding and managing the touch points and scale of your social presence: Low barrier to adoption => unmanaged account sprawl. Focus on sentiment alone => miss activity, risks, and opportunities on what your company is responsible for. Manual moderation/measurement => rapidly rising costs, reduced effectiveness, risks of PR crises. Establishing governance policies and processes to protect your brand: Social accounts & applications live outside the corporate network => corporate governance and security risks. Siloed account owners => no auditing of account access. Manual moderation for content on accounts => higher probability of errors and crises.