Patch management for 3rd-party software can be a significant challenge. The raw data for effective vulnerability management is available in Microsoft’s SCCM (software inventory) and NIST’s NVD (vulnerability database). However, extracting the relevant information from complex, sometimes undocumented data structures poses significant challenges.
The stage is set with a brief overview of the SCCM / NVD data structures as well as a look at a (non-typical but interesting!) production environment. Then we’ll take a quick dive into data wrangling / Machine Learning fundamentals applied to this problem: feature extraction, choice of approach, algorithm selection and tuning.
Once the technical challenges are resolved, the path to “Data Nirvana” can still be strewn with significant non-technical hurdles. We will discuss some practical “been there, done that” examples.
2. Overview
• “Pleased to meet you”
• The Playground
• Challenge #1: Complex Data Structures
• Challenge #2: “Dirty” unstructured data
• Challenge #3: People issues
• Lessons learned + Demo
3. Who I am
• Technical Security Architect at Ubisoft
• Previous: 2 large financial institutions, a major retailer, a world-class telco, service bureaus
• Generalist with a passion for all things “technical security”
4. Disclaimer
“Opinions expressed as well
as the content of this
presentation are the
responsibility of the author.
They do not represent Ubisoft
company policy or views.”
5. The Playground: “Find the panda”
• 10K+ team members
• 26 studios in 18 countries
• Windows-centric
• Creativity Rules!
Where is the vulnerable non-Microsoft software installed?
7. The Great Idea
Microsoft’s SCCM: Reliable production software inventory
NIST’s NVD database: Up-to-date vulnerability data
Effective Patch Management
8. The Great Idea: Why?
• Avoids expen$ive licensing by using free public software
• Vuln data can become a JSON feed into SIEM or DFIR “big
data” mining app
• Do the “impossible” with leading-edge technologies
10. MS’ System Center Configuration Mgr
• “The application people love to hate”
• Indispensable for management of enterprise-scale
Windows-centric environment
• Back-end MS SQL database: 1600+ tables, 6200+ views
• Distributed component design leveraging WMI
• On-premises deployment: complex architecture
11. SCCM Components
• 50+ components!!!
• DLLs running (mostly) as threads, also
some separate services
• Communication:
• In-core queues
• Flat files stored in inboxes / outboxes
12. SCCM and WMI
SMS was the original WMI client
“Everything” is architected using WMI:
• Client-side
• Internal control of agent operations
• Discovery of hardware inventory
• Server-side
• SMS Provider is a WMI provider
• Exposes important database objects as WMI objects
• ConfigMgr Console, SCCM auxiliary applications and tools are
implemented as WMI Mgmt Applications.
13. SCCM Discovery - I
• Populates inventory data in SCCM database
• 6 different methods
• Which are enabled depends on site configuration
• 4 methods target AD
• 1 searches the surrounding network
• 1 interacts with the SCCM client
14. SCCM Discovery - II
• AD Forest Discovery: IP subnets, AD sites
• AD Group discovery: AD groups and memberships
• AD User discovery: User accounts,AD attributes
• AD System discovery: Computer discovery
• Heartbeat discovery:
• Enabled by default, and must remain enabled. Are clients healthy and reachable?
• “creates discovery data records (DDRs) containing information about the client
including network location, NetBIOS name, and operational status.”
• Runs every 7 days by default.
• Network discovery: Searches domains, SNMP services, DHCP servers.
Disabled by default.
16. SCCM Discovery - IV
“Make friends with your
SCCM administrator”
• Methods enabled?
• Polling interval?
17. SCCM Data – “Getting to know you”
“Hands-on” Exploring
• MS Sql Studio
Use AD to augment host inventory data
• E.g. OU in Distinguished Name
“Google is Your Friend”
• Also the Safari Technical Library
18. SCCM Data - I
Use Views, not Tables
• More stable interface
• Better documentation
• Permissions already in place
• Performance – avoid locking tables
• MS has done the “heavy lifting” e.g.
joins, stored procedure definitions
• More Community experience
• This is what MS MVPs say to do
Query SQL, not WMI
• More direct, simpler, better performance
19. SCCM Data II – WMI Underpinnings
• WMI Class Name “SMS_xxx” → SQL View Name “v_xxx”
• WMI Property Names → Column names in SQL Views
• View names > 30 chars are truncated
• Column names have “0” appended to avoid conflicts with
SQL reserved words
20. SCCM Data III – View types
• Inventory data:
• Current: v_GS_< group name >
• History: v_HS_< group name >
• Discovery data:
• WMI scalar properties: v_R_< resource type name >
• WMI array properties: v_RA_< architecture name >_< group name >
21. SCCM Data III – View types
v_SchemaViews lists and categorizes ConfigMgr views
22. SCCM Data IV – Inventory groups / views
• v_GroupMap view lists inventory groups and views
• Each one represents a WMI class configured for
inventory collection in client agent settings
DisplayName       InvClassName               InvHistoryClassName        MIFClass
System            v_GS_System                v_HS_System                SYSTEM
Add Remove Pgms   v_GS_ADD_REMOVE_PROGRAMS   v_HS_ADD_REMOVE_PROGRAMS   MICROSOFT|ADD_REMOVE_PROGRAMS|1.0
Memory            v_GS_X86_PC_MEMORY         v_HS_X86_PC_MEMORY         MICROSOFT|X86_PC_MEMORY|1.0
23. SCCM Data V - Collections
• A Collection is “a logical
group of resources in
ConfigMgr”
• v_Collection view:
Collection meta-data
• “All…” columns –
system-wide collections
Name                    Members
All Systems             25106
All Users               22903
All Unknown Computers   8
All Windows Clients     20630
All Windows Servers     3610
24. SCCM Data VI – Which view to use?
• v_R_System
• From AD / Network / Heartbeat Discovery
• Resource_ID
• NetBIOS name, OS, AD domain
• 60+ fields
• v_GS_System
• Updated when Hardware Inventory runs
• Less accurate – host must have an active agent and be scheduled for
hardware inventory
• 10 fields
25. SCCM Data : TL;DR
In most production contexts, the relevant views are:
• v_R_System
• Host / user data
• v_GS_ADD_REMOVE_PROGRAMS
• v_GS_ADD_REMOVE_PROGRAMS_64
• Updated when Hardware Inventory runs
• Installed software registry data
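The join between host data and installed-software data can be sketched as follows. This is illustrative only: the real queries run against the SCCM site database on MS SQL, while an in-memory SQLite DB stands in here, with a handful of assumed column names (ResourceID, Netbios_Name0, DisplayName0, Publisher0, Version0) drawn from the views above.

```python
# Sketch: join v_R_System with v_GS_ADD_REMOVE_PROGRAMS.
# SQLite stands in for the SCCM MS SQL back end; column names are
# based on the views discussed above, trimmed for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE v_R_System (ResourceID INTEGER, Netbios_Name0 TEXT)")
cur.execute("""CREATE TABLE v_GS_ADD_REMOVE_PROGRAMS
               (ResourceID INTEGER, DisplayName0 TEXT,
                Publisher0 TEXT, Version0 TEXT)""")
cur.execute("INSERT INTO v_R_System VALUES (1, 'WKS-001')")
cur.execute("""INSERT INTO v_GS_ADD_REMOVE_PROGRAMS
               VALUES (1, 'Wireshark 1.12.6', 'Wireshark', '1.12.6')""")

# Host inventory joined to installed-software registry data,
# discarding Microsoft entries up front ("3rd-party software only").
rows = cur.execute("""
    SELECT s.Netbios_Name0, a.DisplayName0, a.Publisher0, a.Version0
    FROM v_R_System s
    JOIN v_GS_ADD_REMOVE_PROGRAMS a ON a.ResourceID = s.ResourceID
    WHERE a.Publisher0 NOT LIKE 'Microsoft%'
""").fetchall()
print(rows)
```

In production the same statement (plus the _64 view) would run through a regular MS SQL client; the shape of the join is the point here.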
26. NIST Data
• Two main NIST data sets:
• CPE: Vendor / product dictionary
• CVE: List of vulnerabilities by year
• Formalized, structured format (== XML)
27. NIST’s CPE
CPE == “Common Platform Enumeration”
“Common Platform Enumeration (CPE) is a standardized method of
describing and identifying classes of applications, operating systems, and
hardware devices present among an enterprise's computing assets.”
A master list of all vendors and all their products.
40. NIST’s NVD
“The National Vulnerability Database is the U.S. government
repository of standards-based vulnerability management data
…This data enables automation of vulnerability management,
security measurement, and compliance.” (Wikipedia)
41. NIST NVD Components
A typical NIST NVD entry has the following components:
Component   Name                                    Description
CVE         Common Vulnerabilities and Exposures    The basic vulnerability listing, including CPE vendor / product
CVSS        Common Vulnerability Scoring System     Standardized vulnerability impact
CWE         Common Weakness Enumeration             Augmented, standardized description of the vulnerability
51. NIST NVD Feeds
NVD CVE data available as a daily Feed:
• XML or (new) JSON format
• Compressed gzip or zip archive
• Delta file or full download by year
• Meta file with file sizes / SHA256 hash to determine if feed file has
changed
https://nvd.nist.gov/vuln/data-feeds
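Pulling the fields we need out of one feed entry looks roughly like this. The dict below is a hand-trimmed stand-in shaped like an NVD JSON feed item (the real entries are far richer), so treat the exact nesting as an assumption to check against the live feed schema.

```python
# Sketch: extract CVE id, CPE vendor and CVSS base score from one
# NVD-feed-shaped entry. The structure below is a simplified stand-in
# for the real JSON feed format, not a complete schema.
import json

entry_json = """
{
  "cve": {"CVE_data_meta": {"ID": "CVE-2016-0000"},
          "affects": {"vendor": {"vendor_data": [
              {"vendor_name": "wireshark",
               "product": {"product_data": [{"product_name": "wireshark"}]}}]}}},
  "impact": {"baseMetricV2": {"cvssV2": {"baseScore": 7.5}}}
}
"""
entry = json.loads(entry_json)

cve_id = entry["cve"]["CVE_data_meta"]["ID"]
vendor = entry["cve"]["affects"]["vendor"]["vendor_data"][0]["vendor_name"]
score = entry["impact"]["baseMetricV2"]["cvssV2"]["baseScore"]
print(cve_id, vendor, score)
```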
52. NIST Data : TL;DR
• CPE: Vendor / product dictionary
• CVE: List of vulnerabilities by year
• CVSS: Vuln impact (contained in CVE)
• XML standardized format
• Daily feeds available
53. Complex Data: The solution
The challenge:
• How to extract the unstructured vendor registry data from SCCM?
• How to match this data with the NIST vulnerability data?
The solution:
• Wise choice of Tools
• “Divide and conquer”
54. Make Good Technology choices
python: Good “data science” language
• fuzzywuzzy: Fuzzy matching
• xmltodict: XML parsing
pandas: Data will fit in computer memory. Great python-based data analysis tool.
scikit-learn: Reliable Artificial Intelligence / Machine Learning
algorithms
Docker: Move “skunkworks” project around as required
ansible: Automate provisioning
55. Basic Approach
• Keep it native
• Use Windows to talk to Windows (AD, SCCM)
• Use Linux for Docker / python / pandas / scikit-learn
• Keep it simple
• 3rd-party software only, not Microsoft
• “Divide and conquer”
• Match vendors first
• Then match products for a given vendor
56. Basic Approach cont’d
Use Machine Learning
• Treat this as two separate classification problems.
• Manually label data (especially vendors) since data sets are
small
• Extract features from data using fuzzy matching
57. Sample Vendor Data – Potential Matches
SCCM                          CPE
The GnuPG Project             gnupg
DigitalVolcano Software Ltd   digitalvolcano
NETGEAR Powerline             netgear
MIT Media Lab                 mit
Cisco Systems, Inc.           cisco
DameWare Development, LLC.    dameware
Bump Technologies, Inc.       bump_project
Open Source                   open_source_development_team
58. Sample Vendor Data – SCCM Vendor names
Will the real vendor please stand up?
Cisco                         Oracle
Cisco Consumer Products LLC   Oracle
Cisco Systems                 Oracle and/or its affiliates
Cisco Systems, Inc            Oracle Corporation
Cisco Systems, Inc.           Oracle Corporation.
Cisco WebEx LLC               Oracle USA
                              Oracle, Inc.
59. ML – Feature Extraction
ML Classification Algorithm needs data “features”
Basic approach:
• Tokenization
• Stop words
• Fuzzy matching statistics
• String length
60. ML – Tokenization
• Convert name string into a set of tokens:
• Shift to lower case
• Split string into tokens using separators: _ . , ( ) + !
• Remove “Stop” words
• Tokens that appear often e.g. “Ltd.” “Inc.” “Project” “Software”
• Add little “value” in determining whether there is a match
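A minimal tokenizer along those lines might look like this; the stop-word list and separator set here are illustrative guesses, not the exact production ones.

```python
# Sketch of the tokenization step: lower-case, split on separators,
# drop stop words. Stop-word list is an illustrative subset.
import re

STOP_WORDS = {"ltd", "inc", "llc", "corporation", "project", "software", "systems"}

def tokenize(name):
    tokens = re.split(r"[\s_.,()+!]+", name.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]

print(tokenize("Cisco Systems, Inc."))        # the distinctive token survives
print(tokenize("DameWare Development, LLC."))
```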
61. ML – Fuzzy Matching I
Levenshtein or “edit” distance:
“The Levenshtein distance between two words is the minimum number
of single-character edits (insertions, deletions or substitutions) required
to change one word into the other.” (Wikipedia)
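The definition translates directly into a small dynamic-programming routine (fuzzywuzzy uses an optimized C implementation of the same idea):

```python
# Straightforward DP implementation of the edit distance defined above.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```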
62. ML – Fuzzy Matching II
python FuzzyWuzzy package
https://github.com/seatgeek/fuzzywuzzy
Method             1st string                 2nd string                 Ratio
Simple Ratio       "this is a test"           "this is a test!"          97
Partial Ratio      "this is a test"           "this is a test!"          100
Token Sort Ratio   "fuzzy wuzzy was a bear"   "wuzzy fuzzy was a bear"   100
Token Set Ratio    "fuzzy was a bear"         "fuzzy fuzzy was a bear"   100
63. ML – Feature Extraction
To extract data “features”:
• Use the fuzzywuzzy pkg to calculate match ratios
• Also use string length
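A per-candidate-pair feature vector might be assembled like this. The talk uses fuzzywuzzy's ratios; difflib's SequenceMatcher is substituted below to stay dependency-free, and the exact feature set is an assumption.

```python
# Sketch: build a numeric feature vector for one (SCCM name, CPE name)
# pair. SequenceMatcher.ratio() stands in for fuzzywuzzy's match ratios;
# string lengths are included per the slide above.
from difflib import SequenceMatcher

def features(sccm_name, cpe_name):
    a, b = sccm_name.lower(), cpe_name.lower()
    ratio = SequenceMatcher(None, a, b).ratio()
    token_sort = SequenceMatcher(None,
                                 " ".join(sorted(a.split())),
                                 " ".join(sorted(b.split()))).ratio()
    return [ratio, token_sort, len(a), len(b), abs(len(a) - len(b))]

print(features("Cisco Systems, Inc.", "cisco"))
```

Each vector then becomes one row of the matrix handed to the classifier.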
64. ML – Label the input data sets
Observations:
• Accurately matching vendor data is crucial
• Data set size is small: ~10K vendors
Approach:
• Manually label data taking care to target important vendors
• Use the manually labelled data to train the ML algorithm
• Use ML-classified data + labelled data for final match processing!!
66. ML – Algorithm Selection II
Use simple K-Folds cross-validation
• Split labelled data into k consecutive folds
• Each fold is used once for validation while remaining k – 1 folds
form the training set
• Repeat for each algorithm being tested
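The split itself is simple enough to write by hand (in practice scikit-learn's KFold does this for you); a sketch of the consecutive-folds logic:

```python
# Hand-rolled k-fold split: k consecutive folds, each used once for
# validation while the remaining k-1 folds form the training set.
def k_folds(n_samples, k):
    indices = list(range(n_samples))
    fold_size, extra = divmod(n_samples, k)
    folds, start = [], 0
    for f in range(k):
        size = fold_size + (1 if f < extra else 0)   # spread the remainder
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds

for train, val in k_folds(10, 3):
    print(len(train), len(val))
```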
67. ML – Algorithm Selection III
Random Forest Classifier was the best.
• “Forest” of decision trees
• Diverse set of classifiers built by introducing randomness in
classifier construction
• Prediction of the ensemble is the averaged prediction of
the individual classifiers.
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
68. ML – Algorithm Tuning
“This algorithm has many parameters. How to tune for
maximum accuracy?”
Use Randomized Grid Search with Cross-Validation
• Define initial parameter bounds / possible values
• Randomized search over the parameter space
• Use cross-validation to evaluate estimator accuracy
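With scikit-learn the tuning step can be sketched as below, assuming the labelled feature vectors are already built (synthetic data stands in here, and the parameter bounds are illustrative, not the ones used in production):

```python
# Sketch: Randomized Grid Search with Cross-Validation over a
# Random Forest Classifier. Data and parameter bounds are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```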
69. ML – Software match sample results
Just how good is the matching?
CPE SCCM DisplayName0
cpe:/a:wireshark:wireshark:1.4.3 Wireshark 1.4.3
cpe:/a:videolan:vlc_media_player:1.1.6 VLC media player 1.1.6
cpe:/a:hp:headless_server_registry_update:1.0.0.0 Headless Server Registry Update
cpe:/a:hp:insight_management_agents:8.70.0.0 HP Insight Management Agents
cpe:/a:wireshark:wireshark:1.12.6 Wireshark 1.12.6 (64-bit)
cpe:/a:adobe:indesign_cs4_common_base_files:6.0 Adobe InDesign CS4 Application Feature Set Fil..
cpe:/a:hp:smart_web_printing:4.60 HP Smart Web Printing 4.60
cpe:/a:mozilla:firefox:45.0.1 Mozilla Firefox 45.0.1 (x64 en-US)
cpe:/a:watchguard:watchguard_system_manager:- WatchGuard System Manager 11.5.1
70. Complex Data : TL;DR
• Choose powerful technology: python / pandas / scikit-learn
• Split into 2 separate simple classification problems
• K-Folds Cross-validation picked Random Forest Classifier
• Randomized Grid Search with Cross-validation to tune
72. Then Everything Blew Up!
Discovery: Real-life production data is full of anomalies!
• AD
• 80K extraneous hosts
• SCCM
• Did not manage “everything”
• Some hosts were “missing in action” e.g. laptops
• CPE
• Vendor product naming / versioning varied wildly from vendor to
vendor
• Vendor buyouts / mergers impacted product naming e.g. Java
• Foreign language data / Unicode
73. “Dirty” data solutions I
• Spend hands-on time with the data
• Manual labelling → several code rewrites
• Use Defensive Coding
• Validate all input
• Use python try / except blocks
• Handle Missing data
• The “bane” of pandas
• Either discard or initialize to a known value
74. “Dirty” Data solutions - II
Discard extraneous data as quickly as possible, e.g.:
• Microsoft software data
• Deprecated NVD data
• Unmanaged SCCM hosts
• CVE listings for hardware / OS vulnerabilities
75. “Dirty” Data Solutions - III
Use heuristics to speed up matching
• Vendor:
• Ignore CPE vendors that are 1-2 characters long
• 1st word of the CPE Vendor string has to be in the tokenized WMI SCCM
Publisher0 string somewhere
• The condensed CPE name has to be shorter than the full WMI
“Publisher0”
• Products:
• Release #’s should at least partially match
• At least one word in the CPE product name should be found in the
SCCM equivalent
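The vendor-side heuristics above translate into a quick pre-filter like this; the function name and thresholds mirror the bullets, but the exact production rules are assumptions.

```python
# Sketch of the vendor pre-filter heuristics described above.
def plausible_vendor_match(cpe_vendor, sccm_publisher_tokens, sccm_publisher):
    if len(cpe_vendor) <= 2:                       # ignore 1-2 char CPE vendors
        return False
    first_word = cpe_vendor.split("_")[0]
    if first_word not in sccm_publisher_tokens:    # 1st CPE word must appear
        return False
    # condensed CPE name must be shorter than the full Publisher0 string
    return len(cpe_vendor) < len(sccm_publisher)

print(plausible_vendor_match("cisco", ["cisco", "systems"], "Cisco Systems, Inc."))
print(plausible_vendor_match("hp", ["hp"], "HP"))
```

Rejecting obvious non-matches before the fuzzy-ratio computations keeps the candidate-pair space, and hence runtime, manageable.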
76. “Dirty” Data Solutions – IV
When all else fails, develop code for the “problem” data
e.g. Java product versioning
77. “Dirty” Data Solutions: TL;DR
• Get “intimate” with the data
• “Shields up”: validate, “try”
• “Shoot from the hip”: Kill the “missing” data before it gets
you
• “Take out the garbage” (data)
• Cheat if you have to: Heuristics
• “Plan B”: code around obstacles
79. Present the idea to Ops to get support
• Took my “great idea”
to the SCCM
Production Ops team
• They were kind enough
to meet with me.
• On-site meeting with
SCCM architect on
conference call.
80. Production Ops reaction: Oops! Disaster!
• Talked “technology”
instead of presenting
from Ops viewpoint
• SCCM architect
• “The” key player
• 6 time zones away, end of
his day
• Local meeting was not in
his native language
• The “man in the wall”
81. Blessed by the King! (… Sort of)
• VP came to town
• Heard the prez
• Wanted “his” dashboard:
• For “yesterday”
• Budget: $0 / 0 hr
82. Ops reaction: We are Worried!!!!
• Ops people rapidly
became concerned
about visibility of VP
Dashboard
• Started making noises
about “SCCM DB
Performance”
• Totally understandable
reaction
83. Ops Proposition: “Take our nice siding here”
• Instead of direct
production access, use
a secondary non-prod
DB employed for
reporting / query
• Turned out that this
DB underwent
arbitrary “black box”
ETL of SCCM data
depending on Ops
reporting needs and
visibility requirements!
84. “People” Solutions I
• Operate in “pirate” mode: Budget of 0 hr $0 means:
• Run under the radar
• Be focused and efficient – refactor prototype code into prod-ready batch
classes
• Be flexible, be creative:
• Docker-based project bounced from Ubuntu to Windows to CentOS to
save $
• Run on lab PCs, on scrapped PCs, on laptops, anything that is available
• Make deals
• “Sell your grandmother to the highest bidder” to get that precious direct
production access
85. “People” Solutions II
• Deliver quietly, slowly, and “down-sell” to ease viz concerns
• “Uh, Mr. VP, your dashboard is not quite ready yet …”
• “This is a new app and new technology. Data reliability is still to be
proven …”
• Provide targeted Ops training
• Help the “dump truck” people understand the new-fangled
“airplane” paradigm
• Give Ops control and help them find ways to leverage the new
technology
86. Lessons learned
“What I didn’t do but should have”
• Data wrangling requires time and effort to do well
• Set management and user expectations at the outset
• Think “big”, think “production” to start
• “Take baby steps”: Always runnable continuous
development
• Write test cases before writing code
• Write code in small reusable modules with clean
interfaces
• Document and delegate
87. “People” Solutions: TL;DR
• Operate in “pirate” mode
• Be flexible, be creative
• When necessary:
• “Sell your grandmother to the highest bidder”
• Deliver quietly, slowly, and “down-sell”
• Provide targeted Ops training
• “Lessons learned”