This was a conversation grounded in slides, held with a group of Swedish CIOs in March 2018. We talked about the implications of data collection for those who weren't even directly involved in it: for organisations, talent hunters, and ultimately ecosystems.
3. AGENDA
1. What are "Big Data"?
2. Data and Data Science
3. Machine learning at scale
4. Still to come
4. WHAT ARE BIG DATA?
• Big Data is a broad term used to describe data sets that are large, complex, and cannot be addressed by traditional IT methodologies and applications (Davenport, 2013)
• New technologies, both hardware and software, have had to be designed to manage the volume
6. DIGITAL TRACE DATA
Taxonomy of Digital Data
Understanding the different data types is crucial to correctly addressing problematic areas relating to the use and collection of digital trace data.
Data that we leave behind
Content Data
- User's name and address
- Substantial and personal: can be identified with or linked to a person
- Explicitly shared, or traced through content shared on social media like Facebook
Metadata
- User's IP address, time of login (data about data)
- Strength is in scale: companies can use it to recognise user patterns
- Potentially problematic if it reveals things that we don't want to reveal. For example, the presence of a mobile device at a protest might reveal the identities of protesters
Entrusted Data: content we post on a medium not controlled by us (FB). We don't control what firms do with our traces
Incidental Data: data about us shared by others (tagged photos). We neither influence nor control our data traces
Service Data: information we provide to be able to use a service
Disclosed Data: content that we post online, but on a medium that we control (for example, blogs), limiting our data traces
Behavioural Data: unintentionally shared; captured by services from our devices. For example, time spent on a site
Derived Data: data inferred about us from other data. For example, our credit profiles built by firms using personal data
7. DIGITAL TRACES
• Make existing services more efficient
• Create new services
• Access (or create?) new markets
8. "The loan amounts users are initially presented with currently tend to be either £111 or £265, although I have also achieved figures of £350 and £361. In my informal survey, those using Apple products (a Safari browser, or say an iPhone or an iPad) seemed to be most consistently offered £265. Although tests with some obscure browsers suggest that it is likely that it is less that you are 'uprated' by using Apple products, than you are 'down rated' by using less niche browsers like Firefox and Internet explorer." (Deville 2013)
"The firm has found that people who immediately shove the slider up to the maximum amount on offer, currently £400 for 30 days for a first-time applicant for a personal loan, are more likely than others to default." (Pollock 2012)
10. STRUCTURED VS UNSTRUCTURED DATA
• Structured: clean, organised, in a database format. Has relational properties and can be divided into fields (e.g. what you have been working with in SQL)
  - Thought to be 5-10% of all data
• Semi-structured: unstructured data that has some organisational properties that make it easier to query, but not enough to be considered structured (e.g. your CSV files)
  - Also around 5-10% of data
• Unstructured data: no structure, no clear relational properties (e.g. images, multimedia, business documents)
  - Around 80% of all data
11. AGENDA
1. What are "Big Data"?
2. Data and Data Science
3. Machine learning at scale
4. Still to come
18. "TRAINING" AN ALGO
"A computer program is said to learn from experience (E) with some class of tasks (T) and a performance measure (P) if its performance at tasks in T as measured by P improves with E"
[Diagram] Training phase: training data -> feature extraction -> ML algorithm -> model.
Test phase: test data -> model (learnt during training phase) -> predictions.
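A minimal sketch of the definition above, under invented assumptions: the task T is deciding whether a number is at least 50, the experience E is a growing number of labelled training examples, and the performance measure P is accuracy on separate test data. The threshold learner and all the numbers are hypothetical and chosen only to keep the example small.

```python
def label(x):
    """Ground truth for the toy task T: is the value at least 50?"""
    return 1 if x >= 50 else 0


def train(examples):
    """ML algorithm: put a threshold midway between the two classes seen so far."""
    highest_negative = max(x for x, y in examples if y == 0)
    lowest_positive = min(x for x, y in examples if y == 1)
    return (highest_negative + lowest_positive) / 2  # the learnt "model"


def predict(model, x):
    """Apply the learnt threshold to a new value."""
    return 1 if x >= model else 0


def accuracy(model, test_set):
    """Performance measure P: share of test examples predicted correctly."""
    hits = sum(predict(model, x) == y for x, y in test_set)
    return hits / len(test_set)


pool = [10, 60, 30, 80, 45, 52, 49, 51]
training_data = [(x, label(x)) for x in pool]
# Test data is kept separate from anything used in training.
test_data = [(x, label(x)) for x in range(100) if x not in pool]

accuracies = []
for n in (2, 4, 6, 8):  # growing experience E
    model = train(training_data[:n])
    accuracies.append(accuracy(model, test_data))

print(accuracies)
```

In this toy setting the accuracy never falls as more training examples are used, which is the sense in which the program "learns" under the definition.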
19. TERMINOLOGY
• Features: distinct traits that can be used to describe each item quantitatively.
• Sample: an item to process (e.g. classify). It can be a document, a picture, a sound, a video, a row in a database or CSV file, or anything else you can describe with a fixed set of quantitative traits.
• Feature extraction: reduces each sample to a simpler representation, e.g. a vector of feature values.
• Training data: data used to discover potentially predictive relationships.
• Test data: separate data, held back from training, used to test the model that was built.
21. SUPERVISED LEARNING
• The correct classes of the training data are known
Credit: http://us.hudson.com/legal/blog/postid/513/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth
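One way to picture supervised learning is a nearest-centroid classifier: because every training point comes with its class, we can average the points of each class and assign a new point to the closest average. This is a sketch with made-up points and labels, not a method taken from the slides.

```python
import math
from collections import defaultdict


def fit_centroids(points, labels):
    """Average the training points of each known class."""
    groups = defaultdict(list)
    for point, lab in zip(points, labels):
        groups[lab].append(point)
    return {
        lab: tuple(sum(coords) / len(members) for coords in zip(*members))
        for lab, members in groups.items()
    }


def classify(centroids, point):
    """Predict the class whose centroid is nearest to the new point."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], point))


points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
labels = ["low", "low", "high", "high"]  # the known classes
centroids = fit_centroids(points, labels)

print(classify(centroids, (2.0, 2.0)))  # low
print(classify(centroids, (7.0, 9.0)))  # high
```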
22. UNSUPERVISED LEARNING
• The correct classes of the training data are not known
Credit: http://us.hudson.com/legal/blog/postid/513/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth
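With no classes given, an algorithm can still look for groups in the data. The sketch below uses a one-dimensional k-means loop as an example of this; the values and the starting centres are invented, and k-means is only one of many clustering techniques.

```python
def k_means_1d(values, centres, iterations=10):
    """Alternate between assigning values to the nearest centre and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for value in values:
            nearest = min(range(len(centres)), key=lambda i: abs(value - centres[i]))
            clusters[nearest].append(value)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [
            sum(cluster) / len(cluster) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
    return centres, clusters


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # no labels attached
centres, clusters = k_means_1d(values, centres=[0.0, 5.0])

print(centres)   # [2.0, 11.0]
print(clusters)  # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

The algorithm finds two groups, but it has no way of naming them; interpreting what the groups mean is left to a person.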
23. SEMI-SUPERVISED LEARNING
• A mix of supervised and unsupervised learning: the correct classes are known for only part of the training data
Credit: http://us.hudson.com/legal/blog/postid/513/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth
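A minimal sketch of one semi-supervised approach, self-training, assuming one-dimensional data: start from a couple of labelled values, then repeatedly give the most confidently placed unlabelled value the label of its nearest class average. The data and the confidence rule (distance to the class mean) are hypothetical choices for illustration.

```python
def self_train(labelled, unlabelled):
    """Grow the labelled set by adopting the most confident guess each round."""
    labelled = {lab: list(values) for lab, values in labelled.items()}
    remaining = list(unlabelled)
    while remaining:
        centroids = {lab: sum(vals) / len(vals) for lab, vals in labelled.items()}
        # Most confident guess: the (value, label) pair with the smallest distance.
        value, lab = min(
            ((v, candidate) for v in remaining for candidate in centroids),
            key=lambda pair: abs(pair[0] - centroids[pair[1]]),
        )
        labelled[lab].append(value)
        remaining.remove(value)
    return labelled


seed_labels = {"low": [1.0], "high": [12.0]}        # the few known classes
unlabelled = [2.0, 3.0, 4.0, 9.0, 10.0, 11.0]       # the many unknown ones
result = self_train(seed_labels, unlabelled)

print(sorted(result["low"]))   # [1.0, 2.0, 3.0, 4.0]
print(sorted(result["high"]))  # [9.0, 10.0, 11.0, 12.0]
```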
24. REINFORCEMENT LEARNING
• Allows the machine or software agent to learn its behavior based on feedback from the environment.
• This behavior can be learnt once and for all, or keep on adapting as time goes by.
Credit: http://us.hudson.com/legal/blog/postid/513/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth
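Learning from feedback can be sketched with a two-armed bandit: the agent tries actions, receives rewards from the environment, and updates its estimate of each action's value. The epsilon-greedy rule, the reward probabilities, and the parameter values below are assumptions made for this example.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best estimate, sometimes explore."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # the agent's current beliefs
    pulls = [0] * len(reward_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore at random
        else:
            action = max(range(len(estimates)), key=estimates.__getitem__)  # exploit
        # Feedback from the environment: reward of 1 with the arm's hidden probability.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        pulls[action] += 1
        # Incremental average: nudge the estimate towards the observed reward.
        estimates[action] += (reward - estimates[action]) / pulls[action]
    return estimates, pulls


estimates, pulls = run_bandit([0.2, 0.8])
print(estimates, pulls)
```

Because the estimates are updated after every step, the agent keeps adapting as time goes by; freezing the estimates after a training period would correspond to behavior that is learnt once and for all.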
28. MANAGERIAL CHALLENGES
• Leadership
Set clear goals, define success, ask the right questions, be creative, create a vision, deal with stakeholders ...
• Talent management
Obvious: Data scientists, computer scientists.
Also: Those who can reframe questions so that data can answer them, design experiments, visualize and interpret data, speak the language of business.
• Technology
Commonly used: Hadoop. IT departments will need to adapt.
• Decision making
Bring people who understand the problem together with the relevant data.
• Company culture
Stop relying on hunches. Ask yourself "What do we know?", not "What do we think?"
29. RECOMMENDATIONS
• Self-regulate
• Be transparent / educate your customers
• Need for clear rules around ownership
• Public infrastructure?
• Is data collection anti-competitive?
• Trust?
30. AGENDA
1. What are "Big Data"?
2. Data and Data Science
3. Machine learning at scale
4. Still to come
32. ARTIFICIAL INTELLIGENCE
• "[The automation of] activities that we associate with human thinking, activities such as decision-making, problem solving, learning ..." (Bellman, 1978)
• "A field of study that seeks to explain and emulate intelligent behavior in terms of computational processes" (Schalkoff, 1990)
• Turing Test: "Is a machine able to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human?"