Talk by Paolo Bajardi at the second session of the training course for trade union leaders "Le parole dell'innovazione e il lavoro" (The words of innovation and work), jointly designed by ISMEL and the Turin secretariats of CGIL, CISL and UIL, and held between March and May 2019.
Big data. Opportunities and risks
1. BIG DATA
OPPORTUNITIES AND RISKS
Le parole dell'innovazione e il lavoro
Paolo Bajardi, PhD
Applied Data Science Manager, ISI Foundation
Turin - April 4, 2019
2. ISI Foundation
www.isi.it
‣ basic and applied research
‣ 35+ years of history
‣ ~50+ researchers
‣ Turin, Italy & New York, USA
‣ international network
‣ supported by:
• institutional philanthropy
• research grants
• industrial partnerships
‣ focus on
• data science & AI
• complex systems science
• computational social science, computational epidemiology
6. […] Companies are placing big bets on data and analytics. But
adapting to an era of more data-driven decision making has not
always proven to be a simple proposition for people or
organizations. Many are struggling to develop talent, business
processes, and organizational muscle to capture real value
from analytics.
McKinsey Insights (2016)
10. digital traces
historical perspective
limited time horizon
limited reproducibility
limited context
privacy and data protection
available as a side effect of ordinary activities
high coverage, access to large scales
possibility of automated processing
11. Italy:
73% of the population accesses the Internet
57% of the population uses social media
51% access it from a smartphone
6+ hours per day online
wearesocial.com/blog/2018/01/global-digital-report-2018
13. {"id"=>12296272736,
"text"=>
"An early look at Annotations:
http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453",
"created_at"=>"Fri Apr 16 17:55:46 +0000 2010",
"in_reply_to_user_id"=>nil,
"in_reply_to_screen_name"=>nil,
"in_reply_to_status_id"=>nil,
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"screen_name"=>"twitterapi",
"name"=>"Twitter API",
"description"=>
"The Real Twitter API. I tweet about API changes, service issues and
happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",
"url"=>"http://apiwiki.twitter.com",
"location"=>"San Francisco, CA",
"profile_background_color"=>"c1dfee",
"profile_background_image_url"=>
"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",
"profile_background_tile"=>false,
"profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",
"profile_link_color"=>"0000ff",
"profile_sidebar_border_color"=>"87bc44",
"profile_sidebar_fill_color"=>"e0ff92",
"profile_text_color"=>"000000",
"created_at"=>"Wed May 23 06:01:13 +0000 2007",
"contributors_enabled"=>true,
"favourites_count"=>1,
"statuses_count"=>1628,
"friends_count"=>13,
"time_zone"=>"Pacific Time (US & Canada)",
"utc_offset"=>-28800,
"lang"=>"en",
"protected"=>false,
"followers_count"=>100581,
The tweet's unique ID. These IDs are roughly sorted & developers should treat them as opaque (http://bit.ly/dCkppc).
Text of the tweet. Consecutive duplicate tweets are rejected. 140 character max (http://bit.ly/4ud3he).
Tweet's creation date.
DEPRECATED
The ID of an existing tweet that this tweet is in reply to. Won't be set unless the author of the referenced tweet is mentioned.
The screen name & user ID of replied to tweet author.
Truncated to 140 characters. Only possible from SMS.
The author of the tweet. This embedded object can get out of sync.
The author's user ID.
The author's user name.
The author's screen name.
The author's biography.
The author's URL.
The author's "location". This is a free-form text field, and there are no guarantees on whether it can be geocoded.
Rendering information for the author. Colors are encoded in hex values (RGB).
The creation date for this account.
Whether this account has contributors enabled (http://bit.ly/50npuu).
Number of favorites this user has.
Number of tweets this user has.
Number of users this user is following.
The timezone and offset (in seconds) for this user.
The user's selected language.
metadata
14. "profile_sidebar_border_color"=>"87bc44",
"profile_sidebar_fill_color"=>"e0ff92",
"profile_text_color"=>"000000",
"created_at"=>"Wed May 23 06:01:13 +0000 2007",
"contributors_enabled"=>true,
"favourites_count"=>1,
"statuses_count"=>1628,
"friends_count"=>13,
"time_zone"=>"Pacific Time (US & Canada)",
"utc_offset"=>-28800,
"lang"=>"en",
"protected"=>false,
"followers_count"=>100581,
"geo_enabled"=>true,
"notifications"=>false,
"following"=>true,
"verified"=>true},
"contributors"=>[3191321],
"geo"=>nil,
"coordinates"=>nil,
"place"=>
{"id"=>"2b6ff8c22edd9576",
"url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",
"name"=>"SoMa",
"full_name"=>"SoMa, San Francisco",
"place_type"=>"neighborhood",
"country_code"=>"US",
"country"=>"The United States of America",
"bounding_box"=>
{"coordinates"=>
[[[-122.42284884, 37.76893497],
[-122.3964, 37.76893497],
[-122.3964, 37.78752897],
[-122.42284884, 37.78752897]]],
"type"=>"Polygon"}},
"source"=>"web"}
The creation date for this account.
Whether this account has contributors enabled (http://bit.ly/50npuu).
Number of favorites this user has.
Number of tweets this user has.
Number of users this user is following.
The timezone and offset (in seconds) for this user.
The user's selected language.
Whether this user is protected or not. If the user is protected, then this tweet is not visible except to "friends".
Number of followers for this user.
Whether this user has geo enabled (http://bit.ly/4pFY77).
DEPRECATED in this context
Whether this user has a verified badge.
The geo tag on this tweet in GeoJSON (http://bit.ly/b8L1Cp).
The contributors' (if any) user IDs (http://bit.ly/50npuu).
DEPRECATED
The place associated with this Tweet (http://bit.ly/b8L1Cp).
The place ID
The URL to fetch a detailed polygon for this place
The printable names of this place
The type of this place - can be a "neighborhood" or "city"
The country this place is in
The bounding box for this place
The application that sent this tweet
Map of a Twitter Status Object
Raffi Krikorian <raffi@twitter.com>
18 April 2010
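
The status object mapped above can be handled as a nested dictionary. The following is a minimal Python sketch, assuming a hand-copied subset of the fields shown on the slides. The helper names `split_content_and_metadata`, `parse_created_at` and `bounding_box_center` are hypothetical and not part of any Twitter library. It separates the short text of the tweet from the metadata that travels with it:

```python
from datetime import datetime

# Subset of the status object shown on the slides, rewritten as a Python dict.
tweet = {
    "id": 12296272736,
    "text": "An early look at Annotations: "
            "http://groups.google.com/group/twitter-api-announce/"
            "browse_thread/thread/fa5da2608865453",
    "created_at": "Fri Apr 16 17:55:46 +0000 2010",
    "user": {
        "id": 6253282,
        "screen_name": "twitterapi",
        "location": "San Francisco, CA",
        "followers_count": 100581,
        "utc_offset": -28800,
        "lang": "en",
    },
    "place": {
        "full_name": "SoMa, San Francisco",
        "place_type": "neighborhood",
        "bounding_box": {
            "type": "Polygon",
            "coordinates": [[
                [-122.42284884, 37.76893497],
                [-122.3964, 37.76893497],
                [-122.3964, 37.78752897],
                [-122.42284884, 37.78752897],
            ]],
        },
    },
    "source": "web",
}


def split_content_and_metadata(status):
    """Return (text, metadata); metadata is every top-level field except the text."""
    metadata = {key: value for key, value in status.items() if key != "text"}
    return status["text"], metadata


def parse_created_at(status):
    """Parse the creation timestamp into a timezone-aware datetime."""
    return datetime.strptime(status["created_at"], "%a %b %d %H:%M:%S %z %Y")


def bounding_box_center(place):
    """Average the corners of the place's bounding box: (longitude, latitude)."""
    corners = place["bounding_box"]["coordinates"][0]
    lon = sum(corner[0] for corner in corners) / len(corners)
    lat = sum(corner[1] for corner in corners) / len(corners)
    return lon, lat


text, metadata = split_content_and_metadata(tweet)
print(len(text), "characters of text;", len(metadata), "top-level metadata fields")
print("created:", parse_created_at(tweet).isoformat())
print("approximate place center (lon, lat):", bounding_box_center(tweet["place"]))
```

Even in this reduced form, the message itself is a single field, while the remaining fields describe who wrote it, when, from where and with which application.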
18. J. Ginsberg et al., Nature 457, 1012 (2009)
google.org/flutrends
“Implicit” signals
36. ‣ more behavioral data from digital platforms
‣ large cohorts, visibility of entire communities,
resolution of individual behaviors
over long time horizons
‣ growing use of unstructured data
‣ ever closer connection between the physical world and
the digital world: sensors, smart environments,
Internet of Things
‣ growing use of non-traditional and/or external data,
new partnerships built around data exchange
trends
37. ‣ automated methods can be used to extract
regularities and generate hypotheses, using inferential
statistics, data mining, machine learning, natural
language processing, data visualization, etc.
‣ mathematical models are built on a rich
substrate of data (transactions, social media, mobility,
expressed or inferred preferences) and are informed by large
databases and by real-time data streams
‣ the model and the reality of a system can be compared
at unprecedented speeds and scales
the digital image of the world
is increasingly faithful to reality
trends
39. “model”?
• mathematical model
• statistical model
• generative model
• machine learning model
• descriptive model
• dynamical model
• agent-based model
• predictive model (of unknown factors)
• predictive model (of the future)
• …
41. example: predicting an epidemic
[Figure: population and human mobility across geographic scales. Panels: population layer, short range mobility layer (commuting), long range mobility layer (air travel).]
Balcan et al. PNAS 2009
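
The idea of coupling populations through mobility can be illustrated with a toy model. The following is a minimal Python sketch, assuming a discrete-time SIR model on two populations linked by a small travel term. It is not the model of Balcan et al., and every parameter value (`beta`, `gamma`, `coupling`) and function name is an illustrative assumption rather than a fitted estimate:

```python
def simulate_two_patch_sir(beta=0.3, gamma=0.1, coupling=0.01, days=200):
    """Discrete-time (one step per day) SIR model on two coupled populations.

    All parameter values are illustrative assumptions, not fitted estimates.
    Returns a list of (day, infectious fraction in patch 0, in patch 1).
    """
    # Fractions of each population: susceptible, infectious, recovered.
    s = [0.999, 1.0]  # the outbreak is seeded in patch 0 only
    i = [0.001, 0.0]
    r = [0.0, 0.0]
    history = []
    for day in range(days):
        history.append((day, i[0], i[1]))
        # Force of infection: mostly local infectious people, plus a small
        # share "imported" from the other patch through travel.
        force = [
            beta * ((1 - coupling) * i[0] + coupling * i[1]),
            beta * ((1 - coupling) * i[1] + coupling * i[0]),
        ]
        new_infections = [force[k] * s[k] for k in range(2)]
        new_recoveries = [gamma * i[k] for k in range(2)]
        for k in range(2):
            s[k] -= new_infections[k]
            i[k] += new_infections[k] - new_recoveries[k]
            r[k] += new_recoveries[k]
    return history


def peak_day(history, patch):
    """Day on which the infectious fraction of the given patch is largest."""
    return max(history, key=lambda row: row[1 + patch])[0]


history = simulate_two_patch_sir()
print("peak day in seeded patch:", peak_day(history, 0))
print("peak day in connected patch:", peak_day(history, 1))
```

In this sketch the second population is infected only through the travel term, so its epidemic peak arrives later than in the seeded population. Setting `coupling=0.0` keeps the second population free of infection.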
56. “The CNN (convolutional neural network) achieves performance
on par with all tested experts across both tasks, demonstrating
an artificial intelligence capable of classifying skin cancer
with a level of competence comparable to dermatologists.”
59. mathematical models, complex systems,
computational social science, statistics, …
data mining, machine learning,
natural language processing, …
data from digital platforms
domain expertise
decisions & policies
from data to models to decisions
60. “This is a world where massive amounts of data and applied
mathematics replace every other tool that might be brought to
bear. Out with every theory of human behavior, from linguistics
to sociology. Forget taxonomy, ontology, and psychology. Who
knows why people do what they do? The point is they do it, and
we can track and measure it with unprecedented fidelity. With
enough data, the numbers speak for themselves.”
Chris Anderson (2008)
61. Bias and algorithmic discrimination:
ethical and regulatory challenges
‣ access to data
‣ bias in data sources
‣ model interpretability
‣ big (personal) data
‣ industrial data and new partnerships
65. “ […] ensure that by using big data algorithms [firms] are not
accidentally classifying people based on categories that
society has decided— by law or ethics— not to use, such
as race, ethnic background, gender, and sexual orientation.”
Edith Ramirez, chair of the Federal Trade Commission
algorithmic discrimination
[Figure: scatter plot of two classes of points (+ and o). Axes: sensitive attribute (M, F) and non-sensitive attribute.]
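
One way to read this slide is that a rule can discriminate without ever looking at the sensitive attribute, if a non-sensitive attribute is correlated with it. The following is a minimal Python sketch of that proxy effect. The records, the threshold and the function names are invented for illustration and are not taken from the talk:

```python
# Hypothetical records: (sensitive attribute, non-sensitive attribute).
# The values are invented so that the non-sensitive attribute is, on average,
# higher in group "M" than in group "F".
records = [
    ("M", 7), ("M", 8), ("M", 6), ("M", 9), ("M", 4),
    ("F", 3), ("F", 5), ("F", 2), ("F", 6), ("F", 4),
]


def decide(non_sensitive_value, threshold=5):
    """A decision rule that never looks at the sensitive attribute."""
    return non_sensitive_value >= threshold


def positive_rate_by_group(rows, threshold=5):
    """Share of positive decisions within each sensitive group."""
    totals, positives = {}, {}
    for group, value in rows:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + int(decide(value, threshold))
    return {group: positives[group] / totals[group] for group in totals}


rates = positive_rate_by_group(records)
gap = abs(rates["M"] - rates["F"])
print("positive decision rate by group:", rates)
print("gap between groups:", gap)
```

Comparing decision rates across groups in this way is one simple audit. It does not by itself say whether the difference is justified, which is the ethical and regulatory question raised above.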
75. the new perspective on personal data
http://www.weforum.org/issues/rethinking-personal-data
76. challenge: new institutional partnerships
non-traditional skills + access to non-traditional data
RESEARCH & INSIGHTS
REVENUE
RESPONSIBILITY
RECIPROCITY
REPUTATION
REGULATORY COMPLIANCE
• commercial data
• sensitive data
• big / fast data