2. One Goal Today
• What are the capabilities of NodeXL, and do I have a
use for it (for research, for exploration, for fun, or some
mix of these)?
3. Overview
1. Network graphs and related terminology
2. Potential uses in research
3. NodeXL (Network Overview, Discovery, and Exploration for Excel)
4. Social media platforms
5. Data extraction runs
6. Data processing
4. Overview (cont.)
7. Network graph data visualizations
8. NodeXL Graph Gallery and the virtual community
9. Beyond NodeXL
10. Presentation review
11. Some general takeaways about network analysis using social media data
12. Questions? Comments?
8. Underlying Data in Matrices (cont.)
• May show whether a relationship exists or not (binary)
• May show strength (or intensity) of a relationship
• May show the direction of a relationship (one-way, two-ways/reciprocated)
• …and other information
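As a minimal sketch of the cases listed above, one small network can be stored as an adjacency matrix and read as binary, weighted, or directed. The account names and message counts are made up for illustration.

```python
# Hypothetical accounts and message counts, for illustration only.
nodes = ["ann", "bob", "cat"]

# weighted[i][j] = number of messages sent from nodes[i] to nodes[j]
weighted = [
    [0, 3, 0],  # ann -> bob (3 messages)
    [1, 0, 0],  # bob -> ann (1 message)
    [0, 2, 0],  # cat -> bob (2 messages)
]

# Binary view: does a relationship exist at all?
binary = [[1 if w > 0 else 0 for w in row] for row in weighted]


def is_reciprocated(matrix, i, j):
    """True when ties run in both directions between i and j."""
    return matrix[i][j] > 0 and matrix[j][i] > 0


def is_symmetric(matrix):
    """An undirected graph has a symmetric matrix; a directed one need not."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))
```

Here the ann/bob tie is reciprocated (two-way), while the cat/bob tie is one-way, so the matrix is not symmetric.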
9. Unit of Analysis
Global Network
• Global network measures: Indicators of the types
of interrelated communities being observed
• Inferences about the state of the community
• Inferences about how power moves, how
information moves
• Inferences about who is influential and how
• Predictive analytics about where this community
is going
• Overlapping networks
About Online Global Networks
• Central masses continue for a time
• Small clusters either meld with larger ones, or
they eventually disappear
• Often held together for a time through
charismatic leaders
• Isolates and pendants usually disappear over time
• Dynamism is a part of all networks
10. Unit of Analysis (cont.)
Nodes
• Node-level measures: Indicators of the egos
• Inferences about the ego even if it is “invisible”
based on its effect on the surrounding egos
and entities
About Online Nodes
• Cyber selves somewhat representational of the
real-world selves
• Messaging / location / imagery / profiles may
be analyzed to infer personality and interests
• Popularity falling under a power law (a few
stars garnering most of the attention, the rest
in the long tail of social aspirants and poseurs)
11. Statistical Measures
Global Network Measures
Betweenness centrality: For a given node, the number of shortest paths
between pairs of other nodes that pass through it (info tends to move along
the shortest paths and closest ties); how much of a bridge a
node is for network connectivity
Closeness centrality: Geodesic path distance between a
node and every other node (farness as the sum of all distances
to all other nodes; closeness as the inverse of farness)
In an undirected graph, distance to all other nodes
In a directed graph, distances to a node are more meaningful
because a node has little control over its in-coming links
Node-level (Local) Measures
Degree centrality: In-degree and out-degree
(relative popularity within the network)
Clustering coefficient: Embeddedness of
single nodes in cliques or ego neighborhoods
with its alters
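Two of the measures on this slide can be sketched in a few lines. This is an illustrative sketch rather than how NodeXL computes them: the four-node network is hypothetical, the graph is assumed to be undirected and connected, and closeness is left unnormalized.

```python
from collections import deque

# Hypothetical undirected network stored as an adjacency dict.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}


def distances_from(graph, source):
    """Breadth-first search: hop counts from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(graph, node):
    """Inverse of farness (sum of geodesic distances to all other nodes)."""
    farness = sum(distances_from(graph, node).values())
    return 1 / farness if farness else 0.0


def clustering_coefficient(graph, node):
    """Share of a node's neighbor pairs that are themselves connected."""
    neighbors = list(graph[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in graph[neighbors[i]]
    )
    return links / (k * (k - 1) / 2)
```

In this toy network, "c" is one hop from everyone and so has the highest closeness, while "a" sits in a closed triangle and has a clustering coefficient of 1.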
12. Statistical Measures (cont.)
Global Network Measures
Eigenvector centrality (diversity): A score for each node based on
the scores of the nodes it is connected to, so that connections
to higher-value or popular nodes result in
a higher value (values between 0 and 1); a measure of
relative influence in a graph
Clustering coefficient: The degree to which nodes aggregate into
tightly knit groups, based on similarity (like co-occurrence) or connectivity,
and expressed visually as proximity or closeness; at the global level, may be
a measure of transitivity
Motif Measures
Dyads, triads, and other structured sub-groupings
Local and experiential for the nodes in terms of
structured connections
May (as in fractals) or may not reflect the overall
structure
Global motif censuses (counts of occurrences of various
types of motif structures in a whole network)
Structural holes as indicators of potential openings for
nodes and links (to build resilience)
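The idea behind eigenvector centrality (a node scores highly when its neighbors score highly) can be sketched with a simple power iteration. This is an illustrative sketch on a hypothetical undirected network, not NodeXL's implementation; scores are scaled so that the top node gets 1.

```python
def eigenvector_centrality(graph, iterations=100):
    """Power iteration on an undirected adjacency dict; max score scaled to 1."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # Each node's new score is its own score plus its neighbors' scores.
        # Keeping the node's own score does not change the final ranking and
        # stops the iteration from oscillating on bipartite graphs.
        updated = {
            node: scores[node] + sum(scores[nb] for nb in graph[node])
            for node in graph
        }
        top = max(updated.values())
        scores = {node: value / top for node, value in updated.items()}
    return scores


# Hypothetical network: a triangle (a, b, c) with d hanging off c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
scores = eigenvector_centrality(graph)
```

Node "d" has only one tie, but because that tie is to the best-connected node its score is well above zero.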
13. Different Network Graphs
Social Network Graphs
• Entities and interrelationships
• Follower – following (formally declared relationships)
• Parasocial relationships online
• Weak ties, fragile linkage
• Reply message / retweet / reply video / likes /
comment-to and others (situationally created
relationships, empirical)
• Entities and contents
• Entities and events
Content Network Graphs
• Based on content similarity
• Based on content proximity to other terms (based on
different-sized “windows” moved across a text)
• Co-occurring terms (content) or tags (metadata)
• Scraped thumbnail images
• May be based on pre-structured content “thesauruses”
or may be extracted and structured in an emergent way
from the text corpuses (or texts)
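The "window moved across a text" idea for content networks can be sketched as follows. The function name, the window size, and the sample sentence are assumptions for illustration; real pipelines would also remove stop words and normalize terms.

```python
from collections import Counter


def cooccurrence_edges(tokens, window=3):
    """Count unordered term pairs that appear within `window` tokens of each other."""
    edges = Counter()
    for i, term in enumerate(tokens):
        # Look ahead at the next (window - 1) tokens.
        for other in tokens[i + 1 : i + window]:
            if other != term:
                edges[tuple(sorted((term, other)))] += 1
    return edges


text = "network graphs show network structure"
edges = cooccurrence_edges(text.split(), window=3)
```

Each counted pair becomes a weighted edge between two terms; a wider window links terms that sit farther apart in the text.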
14. Some Related Terminology
• Structure-mining: The study of networks and interrelationships in order to make
inferences about systems (structure as a “topology” or a “map”)
• Graph: A data visualization of interrelationships (in either 2D or 3D), including
node-link diagrams (usually without set x- y- axes but spatially relational via
Euclidean distance)
• Undirected graph: A graph in which the relationship between nodes is associational,
without arrows at the ends
• Directed graph (digraph): A graph in which the relationship between nodes is directional,
with the potential for arrows at the ends
15. Some Related Terminology (cont.)
• Sociogram / sociograph: A network graph showing interrelationships
between social entities
• Degree: The proximity of relationship (such as a 1, 1.5, or 2 degree
relationship), in terms of directness of ties
• In-degree: The number of relationships coming in to a vertex or node
• Out-degree: The number of relationships going out from a vertex or node
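In-degree and out-degree can be counted directly from a directed edge list. A minimal sketch, with made-up account names and assuming each (source, target) pair appears once:

```python
from collections import Counter

# Hypothetical directed edges (source, target), e.g. "ann replies to bob".
edges = [("ann", "bob"), ("cat", "bob"), ("bob", "ann"), ("dan", "bob")]

# Out-degree counts edges leaving a node; in-degree counts edges arriving.
out_degree = Counter(source for source, _ in edges)
in_degree = Counter(target for _, target in edges)
```

A `Counter` returns 0 for nodes it has not seen, so `in_degree["cat"]` is 0 even though "cat" never appears as a target.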
16. Some Related Terminology (cont.)
• Motifs: Various types of structured relationships between
nodes, in dyadic, triadic, and other polyadic forms
• Clusters: Densely connected groups (subgraphs) in a network, including
islands
• Isolates: Nodes in a network that are not directly connected to any other
node
• Pendants or whiskers: Nodes connected to a network by only one
relationship (link, edge)
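Isolates and pendants are easy to find once each node's neighbors are known. A minimal sketch for an undirected edge list; the node names and the function name are hypothetical.

```python
def isolates_and_pendants(nodes, edges):
    """Return (isolates, pendants) for an undirected edge list."""
    neighbors = {node: set() for node in nodes}
    for a, b in edges:
        if a != b:  # a self-loop does not connect a node to anyone else
            neighbors[a].add(b)
            neighbors[b].add(a)
    isolates = sorted(n for n, nbrs in neighbors.items() if not nbrs)
    pendants = sorted(n for n, nbrs in neighbors.items() if len(nbrs) == 1)
    return isolates, pendants


nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
result = isolates_and_pendants(nodes, edges)
```

Here "e" has no ties (an isolate) and "d" hangs off the cluster by a single tie (a pendant).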
17. Some Related Terminology (cont.)
• Bridging node: A node that sits on the periphery of multiple social
networks and connects them in a way that would not exist otherwise (and so
is influential even if it is peripheral in the respective networks)
• Core-periphery dynamic: A concept of power and influence with those
closest in to the core considered as most influential and those on the
periphery as less so
• Graph diameter: The distance between the two farthest nodes in a network
(in terms of shortest-distance hops between intermediate nodes)
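Graph diameter can be sketched as the largest hop count found by running a breadth-first search from every node. The sketch assumes an undirected, connected graph; the five-node network is hypothetical.

```python
from collections import deque


def eccentricity(graph, source):
    """Longest shortest-path distance (in hops) from source to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())


def diameter(graph):
    """Maximum geodesic distance over all nodes; assumes a connected graph."""
    return max(eccentricity(graph, node) for node in graph)


# Hypothetical network: a - b - c - d, plus e attached to b.
graph = {
    "a": {"b"},
    "b": {"a", "c", "e"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": {"b"},
}
```

The two farthest nodes here are "a" and "d" (or "e" and "d"), three hops apart.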
18. Affordances of Electronic Social Network Analysis (E-SNA)
• Plenty of theory and current research
• Social Networks (Journal, Elsevier)
• Structure-mining (relational topologies) and content-mining (text analysis, cultural
analysis)
• Micro-, meso-, and macro- levels of analysis (zooming in and zooming out for
different levels of granularity): nodes / entities and links / relationships; motifs,
clusters, and branches; entire networks
• Part of “network science”
20. General Research Possibilities
• Social media account profiling (through inferential analysis, data leakage,
de-aliasing to personally identifiable information or “PII”)
• Trending online conversations (by #hashtag, by keyword); human sensor
networks
• Identification of the “mayors of the hashtag” (per Dr. Marc A. Smith of SMRF)
• Public mindset on a topic (by both direct and indirect analysis) (by related
tags networks from “free-form” folk tagging / folksonomies)
21. Some General Research Possibilities (cont.)
• Eventgraphing, event detection and monitoring, and event postmortems
• Reverse-engineering a social-mediated (political, marketing, fund-raising,
or other) campaign; semi-live-tracking a social-mediated campaign
• Discovery of artificial accounts (including AI social bots); some application to
potential fraud analysis
• The “company you keep” concept
22. Some General Research Possibilities (cont.)
• Geolocational applications: location -> messaging; messaging -> location
• “Oppo” (opposition) research (such as for political campaigns) through
open-source intelligence (OSINT)
• Messaging: broadscale themes and particulars
• Inter-relationships
24. A Simplified Research Sequence of Extracting and Analyzing Social Media Information with NodeXL
1. Research question / open exploration / mixed intent
2. Social media strategy: social media platform(s), seeding term(s), and data
extraction parameters
3. Data extractions using NodeXL
4. Data processing
5. Data visualizations
6. Data analysis (in NodeXL)
7. Data analysis (outside of NodeXL)
25. Data Limitations
• Limited data sets (with no knowledge of the “N of all,” at least not without
insider access)
• “Recent” data only (usually reverse listed from present to the past)
• Rate-limited data extractions
• Time-dependent data (with hidden dependencies)
26. Data Limitations (cont.)
• Reliance on (often noisy) textual descriptions of multimedia contents
• Inherent “noise” in metadata, content labeling, content descriptions,
tagging, and related online conversations
• Sparse geolocational data in microblogging messages and in uploaded
imagery / videos (in terms of “exchangeable image file format” or “EXIF”
data)
27. Local NodeXL Mitigations to Data Limitations
• Re-running the data extraction on different machines but with the same
parameters at the same time
• Running the data extractions at slightly different times
• Running multiple and different data visualizations on the same dataset
• Using multiple seeding terms for a particular issue
28. Using an N = All
• Capturing an N = all through Gnip (a company now owned by Twitter) or
a similar company (unless Gnip has an exclusivity contract)
• Working directly with the company or organization behind the social
media platform (particularly their research divisions), but research may
be embargoed (restricted from any release or publication)
29. Using Proper Research Practices
• Posing research questions in strategic ways: Ask ambitiously, but do not
over-claim from results
• Respecting the research traditions and methods of the respective
domain or field
• Applying serious efforts at (dis)confirmation of findings
30. Using Proper Research Practices (cont.)
• Capturing multiple streams of data (often in a cross-platform way)
• Documenting all data extraction parameters, data processing, and
data provenance issues
• Using multiple analytical tools to analyze the captured data
• Comparing cyber info with real-world info (determining where the
cyber-physical confluence lies)
• Using accurate qualifiers to the presented data
32. NodeXL “Template”
Brief History
• Formerly known as .NetMap
• First released in July 2008 as an add-on to
Microsoft Excel
• Available at the Microsoft CodePlex site
• Supported by the Social Media Research
Foundation (SMRF) with the tagline
“Open Tools, Open Data, Open
Scholarship for Social Media”
• Third-party data importer tools for
NodeXL available through links
integrated into NodeXL
APIs and Add-ons
• Application programming interface (API):
protocols for the building of software
applications to interact with (in this case)
public-facing social media platform databases
• Add-on: An addition to a software program to
add functionality
35. Social Media
• Integrated online sites and applications that enable people to …
• Interact
• Inter-communicate
• Share information, digital artifacts and objects, materials, funds, and other elements
• Collaborate (co-create knowledge, fund-raise, support, and others)
• Create continuing and long-term profiles
35
36. Web 2.0 / the Social Web
• Microblogging sites: Twitter, Sina Weibo
• Social networking sites: Facebook, LinkedIn
• Wikis: Wikipedia (with a MediaWiki understructure)
• Video sharing: YouTube, Vimeo
• Image-sharing: Flickr
• Blogs: WordPress (understructure)
• Email:
• Short message service (SMS):
• and others
Note: It helps to immerse oneself in each platform and observe how users use the
platform and how the platform’s community responds to in-world events. It helps to
challenge assumptions about how things actually work vs. how one assumes they work.
37. Social Media Accessible via NodeXL
• Facebook Fan Page Network
• Facebook Personal Network
• Flickr Related Tags Network
• Flickr User’s Network
• MediaWiki Page Network*
• Twitter Search Network (#hashtag, keyword,
other)
• Twitter User’s Network (@account, @group)
• Web 1.0 / Blog Network (via VOSON / “Virtual
Observatory for the Study of Online
Networks”)*
• YouTube User’s Network
• YouTube Video Network (topic)
• [3rd party graph data importers]*
38. Social Media Account Types
• Social media accounts
• Public or private accounts
• Individual or group (often topic-focused) accounts
• Human, cyborg, ‘bot (including socialbots)
39. Application Programming Interfaces (APIs)
• Application Programming Interfaces (APIs) enabling access to some
limited data from the social media platforms
• Often rate-limited by the social media platform
• Enables downloading of a percentage of the available public data (full amount of
dataset not indicated by the API)
• Data released by content creators through the end user license agreements
(EULAs)
• Data scraping also possible
40. Application Programming Interfaces (APIs) (cont.)
• Access requires an email-verified account that is “whitelisted” to access the data
(which also enables the platform’s rate-limiting)
• Some (like Flickr) require a secret and a key
• Terms of access change, and developers may not keep up with changing the
software to maintain access
41. Types of Social Media Data Available
NodeXL
• Topical slice-in-time; dynamic and continuous
(for a certain period of time) (on Twitter)
• Protected user accounts in Facebook (with log-in
authentication into Facebook)
• Public-facing user accounts in Flickr, YouTube,
and fan accounts in Facebook
• Article edits in Wikipedia
Others
• Tweetstreams going back in time (up to about
3,000 per account) (NCapture in NVivo, on
Twitter)
• Geomapping of Tweets (NCapture in NVivo, on
Twitter)
• Links between accounts on social media
platforms to the Surface Web (Maltego
Chlorine 3.6.0)
43. General Parameters of a Data Extraction
• Seeding term(s)
• Boolean data types (sets) [# and #; # and keyword; tag and tag]
• Type of social or content network (or two-mode / bipartite or multi-mode
networks)
• Degree of network (1, 1.5, or 2)
• Number of vertices or messages or videos (size of network), and others
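The "degree of network" setting (1, 1.5, or 2) controls how far the extraction reaches from the seed account. The sketch below assumes the common reading of those levels: 1 keeps only ties touching the ego, 1.5 adds ties among the ego's direct contacts, and 2 adds the contacts' own ties. The function and node names are hypothetical, and this is not NodeXL's importer code.

```python
def ego_network(edges, ego, level):
    """Return the edges kept at level 1, 1.5, or 2 around `ego` (undirected sketch)."""
    # Alters are the ego's direct contacts.
    alters = {b if a == ego else a for a, b in edges if ego in (a, b)} - {ego}
    inner = alters | {ego}
    if level == 1:
        return [(a, b) for a, b in edges if ego in (a, b)]
    if level == 1.5:
        return [(a, b) for a, b in edges if a in inner and b in inner]
    if level == 2:
        return [(a, b) for a, b in edges if a in inner or b in inner]
    raise ValueError("level must be 1, 1.5, or 2")


edges = [("ego", "x"), ("ego", "y"), ("x", "y"), ("y", "z"), ("z", "w")]
```

With these edges, level 1 keeps two ties, level 1.5 adds the x/y tie, and level 2 also brings in the y/z tie; the z/w tie is outside all three.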
46. Graph Metrics
• Selection of desired metrics of the extracted graph
• Processed on the local machine
• May have to process in parts and pieces (instead of “select all”) because of machine
processing limits (saving after each iteration)
47. Graph Metrics (in detail)
• Overall graph metrics
• Vertex degree (undirected graphs only)
• Vertex in-degree (directed graphs only)
• Vertex out-degree (directed graphs only)
• Vertex betweenness and closeness centralities (a
measure of influence in the network based on
“bridging” along shortest paths / transmission /
propagation efficiency)
• Vertex eigenvector centrality (a measure of influence
in the network based on connectivity to influential or
high-scoring nodes)
• Vertex PageRank
• Vertex clustering coefficient
• Vertex reciprocated vertex pair ratio (directed
graphs only)
• Edge reciprocation (directed graphs only)
• Group metrics
• Words and word pairs
• Edge creation by shared content similarity
• Top items
• Twitter search network top items
48. Resulting Global-View Graph Metrics Table
• Vertices
• Unique edges
• Edges with duplicates
• Total edges
• Self-loops
• Reciprocated vertex pair ratio
• Reciprocated edge ratio
• Connected components
• Single-vertex connected components
• Maximum vertices in a connected component
• Maximum edges in a connected component
• Maximum geodesic distance (diameter)
• Average geodesic distance
• Graph density (or sparseness)
• Modularity
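A few of the whole-graph figures in this table can be sketched from a directed edge list. The definitions below are simplified assumptions and may not match NodeXL's exactly (for example, duplicates are simply collapsed here); the node names and dictionary keys are hypothetical.

```python
def global_metrics(nodes, edges):
    """A few whole-graph metrics for a directed edge list (duplicates collapsed)."""
    distinct = set(edges)
    self_loops = sum(1 for a, b in distinct if a == b)
    proper = {(a, b) for a, b in distinct if a != b}
    reciprocated = sum(1 for a, b in proper if (b, a) in proper)
    n = len(nodes)
    return {
        "vertices": n,
        "distinct_edges": len(distinct),
        "self_loops": self_loops,
        # share of directed edges whose reverse edge also exists
        "reciprocated_edge_ratio": reciprocated / len(proper) if proper else 0.0,
        # share of all possible directed ties that are present
        "density": len(proper) / (n * (n - 1)) if n > 1 else 0.0,
    }


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("a", "c"), ("d", "d")]
metrics = global_metrics(nodes, edges)
```

Density near 0 indicates a sparse graph; the a/b pair is the only reciprocated tie in this example.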
52. Toggling between the Graph Visualizations and the Underlying Data
Data Cleaning
• Deletion of information from the graph that
may not be directly relevant (from the data
worksheets)
• De-duplication of messaging (if relevant)
Data Filtering
• Using “Dynamic Filters” to select particular types
of data of interest to show in the graph pane:
relationship date, Tweet Date (UTC), x-axis, y-axis,
in-degree, out-degree, betweenness
centrality, closeness centrality, eigenvector
centrality, PageRank, clustering coefficient,
reciprocated vertex pair ratio, followed,
followers, Tweets, favorites, joined Twitter date
(UTC)
• (and UTC degree time to geo-location)
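Outside of the worksheet, the same cleaning and filtering steps can be sketched in code. This is an illustrative sketch only: the record fields (`author`, `text`, `in_degree`) and function names are assumptions, not NodeXL column names.

```python
def deduplicate(messages):
    """Keep the first copy of each (author, text) pair, preserving order."""
    seen = set()
    kept = []
    for message in messages:
        # Ignore case and surrounding whitespace when comparing texts.
        key = (message["author"], message["text"].strip().lower())
        if key not in seen:
            seen.add(key)
            kept.append(message)
    return kept


def filter_range(rows, column, low, high):
    """Keep rows whose value in `column` lies within [low, high]."""
    return [row for row in rows if low <= row[column] <= high]


messages = [
    {"author": "ann", "text": "Hello #nodexl"},
    {"author": "ann", "text": "hello #nodexl "},
    {"author": "bob", "text": "Hello #nodexl"},
]
vertices = [
    {"name": "ann", "in_degree": 5},
    {"name": "bob", "in_degree": 1},
    {"name": "cat", "in_degree": 3},
]
```

`filter_range` plays the role of a single dynamic-filter slider: rows outside the chosen range are hidden rather than altered.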
55. NodeXL Graph Gallery
• Set up as a place for shared research about social network graphs
• Includes experimental interactive versions of the graphs (if GraphML version
is enabled in the upload by the creators of the data)
• Includes some downloadable datasets
• Enables email-verified account creation (which allows the revision of related
texts and reversing publication of graphs)
• No commenting on others’ graphs or datasets here
56. NodeXL Virtual Community and Resources
• NodeXL on CodePlex
• Source Code (open-source)
• Documentation
• Discussions
• Issues
• License (Ms-PL, Microsoft Public
License)
58. Other (Complementary) Tools
Surface Web Data Collection
• Maltego Chlorine 3.6.0 (commercial
“subscription” license but with a limited
community version)
• NCapture of NVivo 10 (commercial license:
perennial or subscription-type site license)
Text Analysis
• Natural Language Toolkit (NLTK) in Python
(open-source and free)
• AutoMap and NetScenes (CASOS)
(open-source and free)
60. Review: NodeXL Capabilities
NodeXL Capabilities with
Social Media Data
• Data extractions from both social media
platforms and the Surface Web (with the VOSON
or “Virtual Observatory for the Study of Online
Networks” third-party data importer server)
• Additional social media platforms in the works
NodeXL Capabilities
• Network graph data processing
• Network graph analysis
• Graph visualizations
• Multi-lingual data processing
• … and others
• Addition of rudimentary sentiment analysis in
commercial version (“NodeXL Pro”) released in
2015
61. A Short Note about the Sentiment Analysis
Feature
• Based on a positive-negative polarity
• Uses a built-in positive word set and a built-in negative word set
• Customizable
• Enables the addition of a third type of word set (a new construct) based on a
custom-made text set
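The word-set approach described above can be sketched as simple counting. This is an illustrative sketch, not NodeXL's implementation; the word lists are tiny made-up samples and the "custom" set stands in for a user-defined third construct.

```python
import re

# Tiny made-up word sets; real lists would be far larger.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def score_text(text, word_sets):
    """Count how many tokens of `text` fall in each named word set."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        name: sum(token in words for token in tokens)
        for name, words in word_sets.items()
    }


word_sets = {
    "positive": POSITIVE,
    "negative": NEGATIVE,
    "custom": {"nodexl", "graph"},  # hypothetical third, user-defined set
}
result = score_text("Love the graph, hate the awful lag", word_sets)
```

Counting matches this way ignores negation and sarcasm, which is one reason such polarity scoring is best treated as rudimentary.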
64. Some General Takeaways
• Unique aspects of social media platforms and their particular users. The
social media platforms are constantly changing. Their users and their
metrics are critical to understanding the extracted data. Because each platform
draws a particular set of users, only some voices are captured via social media platforms.
• In other words, who is online, and how are they actually using the social
media platforms? What geographical regions are covered? (How does this
skew the data?)
65. Some General Takeaways (cont.)
• Nature of social media platforms. The nature of the social media
platforms is important: whether they are for content sharing, knowledge
structures, social networking (and for what purpose), and so on.
• Rules of engagement change what is seeable and seen in terms of messaging
• Technically, how “entities” and “relationships” are defined depends on the social media
platform. (Read the fine print. Read the developers’ pages.)
• Continuous (dynamic data) vs. slice-in-time (static data); access to historical data
66. Some General Takeaways (cont.)
• A sampling. There are numerous dependencies in terms of data
extractions. The connectivity speed, the busyness of the target servers, the
rate limiting of the application programming interfaces (APIs), the
dynamism of the data, and such, affect what is collected. This sampling is
not a random sample, and it is hard to know how much of the full set
has been captured. In most cases, only a very small sample is acquired.
• Very rarely is a full set possible, and only for particular types of data (such as an article
network from Wikipedia).
67. Some General Takeaways (cont.)
• Data visualizations used with underlying data: The data visualizations are
rich and varied; however, they are always in a sense less than the full set of
information. By definition, data visualizations are data summaries.
• The “graph metrics” table is a critical aspect of the information. Data
visualizations should be used with the underlying data.
68. Some General Takeaways (cont.)
• Understandings of how social media platforms are used: The general public
tends to be a lot faster than one would assume in terms of responding to breaking
events with messaging across the various platforms.
• Any eventgraphing has to draw from all public sources (and across social media
platforms) because each contributes different angles and perspectives on the
events; each also attracts different portions of the population.
• (And of course, a lot of information is not publicly shared, so the whole social
media angle is still somewhat limited.)
69. Some General Takeaways (cont.)
• Speed: With unfolding events on social media, most monitoring has moved to automated
means to surveil and monitor communications. Computational text analytics (and
visual analytics) are applied to the messaging in order to see
• what is trending
• the strength and direction of sentiments (positive or negative)
• the types of emotions expressed and in what textual contexts, and so on.
• There is progress in terms of computational visual analysis for object identification,
facial recognition, and others.
71. Questions? Comments?
• What research questions are you interested in pursuing? What is the
potential role of social media in augmenting your (main) research?
• How do you think you might go about capturing the required social media
information? How would you confirm or disconfirm any findings?
• What are complementary streams of data you could use to bolster your
work?
72. Questions? Comments? (cont.)
• How would you pursue leads that are surfaced from social media? The
Surface Web?
• How would you represent your work in publication and / or presentation (to
show methods, explain complexity, and delimit your assertions)?
• What skills can you hone in order to better exploit public social media data?
What do you perceive as strengths in this area? Weaknesses? Why?
73. Conclusion and Contact
• Dr. Shalin Hai-Jew
• Instructional Designer, iTAC, K-State
• 212 Hale / Farrell Library
• shalin@k-state.edu
• 785-532-5262
• Querying Social Media with NodeXL (an open-source text on the Scalar
platform)