Joe Olson
Data Architect
Smart Chicago Collaborative
27 Mar 2014
joe.olson@cct.org
(All the cool buzzwords in one place!)
Social Media,
Cloud Computing,
Machine Learning,
Open Source, and
Big Data Analytics
Social Media - Twitter
• What can we learn from Twitter?
• 400 million tweets per day
source: http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jack-dorsey-twitter
• 218 million users
source: http://techcrunch.com/2013/10/03/bweeting/
• Excellent source of sentiment
• Excellent source of big data
• Prototyping
• Modeling natural language
• Resume padding
Social Media - Twitter
• How do we get at the data?
• Twitter provided APIs:
• https://dev.twitter.com/docs
• Streaming
• Set up a real time data stream (json) based on keywords
• REST (v1.1)
• Make REST requests, and get results
• Possible parameters:
• Geospatial bounding box
• By time
• By user, hashtag, retweets etc
• Fire hose
• Big $$$. Big data
Social Media - Twitter
• Information & Obstacles
• Who
• What
• At best: Plain English (!)
• Worse: (Spanish or Arabic or Portuguese...)
• Worst: “Textspeak” symbols :-0, UTF-8 characters, etc.
• Absolute Worst: combination of all of them
• Where
• 1-2% with latitude / longitude
• Geocode
• When
Social Media - Twitter
JSON Tweet example:
"created_at": "Sun Oct 27 13:57:40 +0000 2013",
"id": 394462908261740540,
"text": "Flu :(",
"source": "<a href=\"http://twitmania.com\" rel=\"nofollow\">TwitMania™</a>",
"user": {
    "id": 594141140,
    "name": "Yultiana Farida N",
    "screen_name": "yultiana",
    "followers_count": 231,
    "friends_count": 252,
    "created_at": "Tue May 29 23:58:25 +0000 2012",
    "statuses_count": 2397
},
"geo": null,
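To make the who / what / where / when breakdown concrete, here is a minimal Python sketch that parses a trimmed copy of the tweet above. The `summarize` helper and the enclosing braces are assumptions added for illustration; only the field names and values come from the slide.

```python
import json
from datetime import datetime

# Trimmed copy of the slide's tweet, wrapped in braces so it parses as JSON.
raw = '''
{
  "created_at": "Sun Oct 27 13:57:40 +0000 2013",
  "text": "Flu :(",
  "user": {"screen_name": "yultiana", "followers_count": 231},
  "geo": null
}
'''

def summarize(tweet_json):
    """Hypothetical helper: pull the who / what / where / when fields from one tweet."""
    tweet = json.loads(tweet_json)
    # Timestamp layout as shown on the slide: weekday, month, day, time, offset, year.
    when = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "who": tweet["user"]["screen_name"],
        "what": tweet["text"],
        "where": tweet.get("geo"),  # None when the tweet carries no coordinates
        "when": when.isoformat(),
    }

summary = summarize(raw)
print(summary)
```

A tweet with `"geo": null`, like this one, would need a separate geocoding step before it could be placed on a map.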
Cloud Computing
• What does cloud computing bring to the table?
• Amazon’s EC2:
• Commoditized hardware
• Low cost
• Only charged for resources you use
• No long term commitments
• Scalable
• "Throwaway" mentality
**IF** you play by their rules!
Cloud Computing – AWS
• Tools
• Virtual Machines
• # of Processors, RAM, OS, disk capacity and I/O – all configurable
• Price range: $.02/hr - $4.60/hr
• Licensed OSes cost 50% more than Linux OSes
• Archive Storage
• S3 / Glacier
• Work Queues
• SQS
• Data Stores
• DynamoDB (key-value store), Redshift (analysis store)
• Virtual Networking
• Routers, VPN gateways, access control lists, etc
• APIs
• Command line
• HTTPS REST
• Native programming languages (Python, bash, PHP, Java etc.)
Ideal for rapid prototyping / proofs of concept
Cloud Computing – AWS
• APIs
• Basic
• Start an instance (and start billing)
• Stop an instance (stop billing)
• Insert item into queue
• Remove item from queue
• Write to backup store
• Ultra advanced
• Reserved vs. on demand vs. spot instances
• Price can drop as much as 80% due to market demand
• Instance can disappear at any time
Big Data Analytics
• Can we skirt the “big data” problem by distilling the tweets
down from millions and millions of “noise” tweets into a more
desirable data set?
• Enrich in real time, rather than on archived data, and avoid the
overhead of map/reduce?
• Possible Enrichment of raw data:
• Classification – separate tweets into “relevant” and “irrelevant”
• Geocoding – improve on the 1-2% ?
• Aggregation –> map reduce
• Mapping -> Reduce Function -> Output
• AWS – Elastic Map Reduce
• Clustering
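The Mapping -> Reduce Function -> Output flow can be sketched in plain Python. This is a single-process illustration of the idea, not Elastic MapReduce; the sample tweets, the `KEYWORDS` tuple, and both function names are assumptions made up for the example.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data and keywords.
tweets = [
    "Flu :(",
    "I think I have a cold",
    "Cold and flu season again",
    "Nice weather today",
]
KEYWORDS = ("flu", "cold")

def map_tweet(text):
    """Map step: emit a count of 1 for each keyword found in one tweet."""
    lowered = text.lower()
    # Plain substring matching; a real pipeline would tokenize first.
    return Counter(k for k in KEYWORDS if k in lowered)

def reduce_counts(left, right):
    """Reduce step: merge two partial keyword counts."""
    return left + right

totals = reduce(reduce_counts, map(map_tweet, tweets), Counter())
print(dict(totals))
```

Because each tweet is mapped independently and the reduce step is associative, the same two functions could in principle be spread across many machines.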
Machine Learning
• Classification: relevant, or irrelevant?
• Human trained model
• Once model is established, bounce new data off it for
classification
• Validation of model
• Accuracy = (Total # of classifications – Mismatches between machine and human) / Total # of classifications
• Crowdsourcing – AWS Mechanical Turk
• Improve model by feeding disagreements back into the model
• Our best text classification model to date: accuracy in the low 90% range
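The accuracy measure on this slide, and the idea of feeding disagreements back into the model, can be sketched as follows. The labels and tweet texts are invented for illustration, and `accuracy` and `disagreements` are hypothetical helper names.

```python
def accuracy(machine_labels, human_labels):
    """(total classifications - machine/human mismatches) / total classifications."""
    if len(machine_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    total = len(human_labels)
    if total == 0:
        raise ValueError("need at least one classification")
    mismatches = sum(1 for m, h in zip(machine_labels, human_labels) if m != h)
    return (total - mismatches) / total

def disagreements(texts, machine_labels, human_labels):
    """Return (text, human label) pairs where the machine disagreed with the human."""
    return [(t, h) for t, m, h in zip(texts, machine_labels, human_labels) if m != h]

# Invented example data.
texts = ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e"]
human = ["relevant", "irrelevant", "relevant", "irrelevant", "relevant"]
machine = ["relevant", "irrelevant", "irrelevant", "irrelevant", "relevant"]

score = accuracy(machine, human)
retrain = disagreements(texts, machine, human)
print(score, retrain)
```

The `retrain` list is the set of examples a human-labelled correction pass would add to the training data.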
Open Source
• Friendly to the commoditized computing paradigm
• Don’t have to worry about licensing issues
• Contributes to the “throwaway” discipline
• Don’t have to re-invent the wheel (collaboration)
• Solutions applicable to all parts of the architecture
• Acquire data: Node.js – non blocking
• Analyze data: R – statistical engine
• Store and query data: MongoDB (document store) or Riak (key-value database)
Architecture
• We know Twitter is providing a mountain of data from all parts
of the world
• We know Amazon is providing a framework of low cost, on-demand,
no commitment computing
• Open source is providing a rich tool set
• Goals:
• Architect with cost in mind!
• Enrichment - Real time and after-the-fact enrichment (open data)
• Scalable
• Decoupled
• Service based
• Rapid development
• Prove the concepts
Architecture - Acquire
• Acquire the data from Twitter
• If classifying in real time:
• Store then classify?
• Classify then store?
• Tools
• Twitter streaming API
• Keywords
• Node.js
• Several different packages to interface with Twitter APIs
• Amazon
• EC2
• SQS (?) Extremely useful, but drives the cost up
Architecture - Analyze
• Classification interface
• Service based – HTTP REST
• Push or pull?
• Push – classifiers listen on port 80
• Pull – classifier starts pulling from an established work queue
• Both highly scalable and flexible with respect to cost.
• Stateless
• R
• Human trained machine learning packages available
• Cloud friendly – no licenses
• Automatable – from install through configuration to execution
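The pull model can be sketched with Python's standard library: stateless workers take tweets from a shared work queue and classify them. The in-process `queue.Queue` stands in for a hosted queue such as SQS, and `classify` is a placeholder keyword rule, not the R model described in these slides.

```python
import queue
import threading

def classify(text):
    """Placeholder rule standing in for a trained model (assumption)."""
    return "relevant" if "food poisoning" in text.lower() else "irrelevant"

def worker(work_queue, results):
    """Pull items until a None sentinel arrives; keeps no state between items."""
    while True:
        text = work_queue.get()
        if text is None:
            work_queue.task_done()
            break
        results.append((text, classify(text)))
        work_queue.task_done()

work_queue = queue.Queue()
results = []

# Scale by changing the number of workers; each one is interchangeable.
workers = [threading.Thread(target=worker, args=(work_queue, results)) for _ in range(2)]
for w in workers:
    w.start()

for text in ["Ugh! I got food poisoning", "Nice weather today", "Flu :("]:
    work_queue.put(text)
for _ in workers:
    work_queue.put(None)  # one stop sentinel per worker

work_queue.join()
for w in workers:
    w.join()
print(results)
```

Because the workers hold no state, adding or removing one only changes throughput, which is what makes the design flexible with respect to cost.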
Architecture - Store
• Store JSON as an object (document store) or normalize (relational
database)?
• Relational databases
• disk I/O intensive – not cloud friendly
• allow complex indexing
• Easy to get a business intelligence front end on them
• Requires a schema / ETL
• Key-value document stores
• Designed to be scalable – don’t need fast disks
• Indexing is not nearly as flexible as RDBMS
• More difficult to front a UI – no “drag and drop” tools
• No schema / ETL needed.
• Not as mature
• MongoDB / Riak
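The trade-off between storing the JSON whole and normalizing it can be illustrated with a small sketch. It uses an in-memory SQLite database purely as a stand-in for both styles (it is not MongoDB or Riak), and the table names, the tweet `id` of 1, and the chosen columns are assumptions.

```python
import json
import sqlite3

# Illustrative tweet; the id is made up, the other values follow the earlier slide.
tweet = {
    "id": 1,
    "text": "Flu :(",
    "user": {"screen_name": "yultiana", "followers_count": 231},
    "geo": None,
}

conn = sqlite3.connect(":memory:")

# Document style: keep the whole JSON object, no schema or ETL step needed.
conn.execute("CREATE TABLE tweets_doc (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO tweets_doc VALUES (?, ?)", (tweet["id"], json.dumps(tweet)))

# Relational style: decide on columns up front and flatten the nested user object.
conn.execute(
    "CREATE TABLE tweets_rel ("
    "id INTEGER PRIMARY KEY, text TEXT, screen_name TEXT, followers_count INTEGER)"
)
conn.execute(
    "INSERT INTO tweets_rel VALUES (?, ?, ?, ?)",
    (tweet["id"], tweet["text"], tweet["user"]["screen_name"], tweet["user"]["followers_count"]),
)

# Relational rows can be filtered and indexed on any column.
rel_text = conn.execute(
    "SELECT text FROM tweets_rel WHERE screen_name = ?", ("yultiana",)
).fetchone()[0]

# The stored document is fetched by key and inspected in application code.
doc = json.loads(conn.execute("SELECT body FROM tweets_doc WHERE id = ?", (1,)).fetchone()[0])
print(rel_text, doc["user"]["followers_count"])
```

The relational path needs the flattening step (the ETL cost noted above) but makes ad hoc queries easy; the document path accepts any tweet shape unchanged.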
Architecture – Presentation
• Least need for cloud friendly scalability here?
• Options
• Licensed BI software – Tableau, Endeca, Jaspersoft, Pentaho
• Open source BI software – SpagoBI
• Roll your own - PHP, Ruby, Visual Basic, Javascript, etc
• Connect to an existing system instead?
Costs – Real Time Classification
• Number of tweets collected per day: 1,000,000 (comfortable – about 0.25% of the 400 million daily tweets)
• Machine used on EC2 to acquire (node.js): micro
• $.02/hr * 24 hrs = $.48/day
• Machine used on EC2 to classify (R): small (x2)
• $.06/hr * 24 hrs = $1.44/day*2 = $2.88/day
• Machine used on EC2 to store (MongoDB): large
• $.24/hr * 24 hrs = $5.76 /day
• Machine used on EC2 for GUI (Apache): small
• $.06/hr * 24 = $1.44
• Total: $0.48 + $2.88 + $5.76 + $1.44 = $10.56/day
• $10.56 / 1,000,000 tweets = $0.00001056 per tweet (about 0.001 cents per tweet)
Can add more zeros if you relax real-time classification (spot instances)
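The daily cost arithmetic on this slide can be reproduced with a short sketch. The hourly rates and machine counts are the ones listed above; the variable names are illustrative, and the per-tweet figure is expressed in dollars.

```python
HOURS_PER_DAY = 24
TWEETS_PER_DAY = 1_000_000

# (role, hourly rate in dollars, number of instances), as listed on the slide.
machines = [
    ("acquire (node.js), micro", 0.02, 1),
    ("classify (R), small", 0.06, 2),
    ("store (MongoDB), large", 0.24, 1),
    ("GUI (Apache), small", 0.06, 1),
]

daily_cost = sum(rate * HOURS_PER_DAY * count for _, rate, count in machines)
cost_per_tweet = daily_cost / TWEETS_PER_DAY

print(f"daily cost: ${daily_cost:.2f}")
print(f"cost per tweet: ${cost_per_tweet:.8f}")
```

Changing a rate or an instance count in `machines` shows how quickly the per-tweet figure moves, for example when pricing spot instances instead of on-demand ones.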
Costs - Archive
• Size of average tweet: 2.5 KB
• Cost to archive:
• S3: $0.095 per GB per month
• Roughly $0.0000002 per tweet per month
• Glacier: $0.01 per GB per month
• Roughly $0.00000002 per tweet per month
• Compression will add even more zeros, but will require more
computing power, and mean more latency for post collection
data analysis. Can be automated.
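The per-tweet archive figures follow from the average tweet size and the storage price. The sketch below assumes the slide's prices are dollars per GB per month and uses decimal units (1 GB = 1,000,000 KB); `monthly_cost_per_tweet` is a hypothetical helper name.

```python
TWEET_KB = 2.5          # average tweet size from the slide
KB_PER_GB = 1_000_000   # decimal units (assumption)

def monthly_cost_per_tweet(price_per_gb_month):
    """Dollars per month to keep one average-sized tweet in storage."""
    return TWEET_KB / KB_PER_GB * price_per_gb_month

s3_cost = monthly_cost_per_tweet(0.095)
glacier_cost = monthly_cost_per_tweet(0.01)

print(f"S3:      ${s3_cost:.10f} per tweet per month")
print(f"Glacier: ${glacier_cost:.10f} per tweet per month")
```

Compression would shrink `TWEET_KB` and the cost with it, at the price of the extra compute and latency noted above.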
Use Cases
• Foodborne Chicago (http://foodborne.smartchicagoapps.org/)
• Public-private partnership with City of Chicago Dept. of Public Health
and Smart Chicago Collaborative
• Reach out to city residents on Twitter tweeting about food poisoning
symptoms, in an attempt to get them to log information in the City’s
311 database (via the Open311 API)
• Once in the 311 database, it follows established City workflows, and
becomes actionable
• Numbers (1 year):
• 2,390 tweets classified as related to food poisoning
• 282 tweets responded to
• 205 reports submitted
• 145 inspections
• Real time classification examples:
• “Ugh! I got food poisoning from the McDonald’s on Halstead!”
http://184.73.52.31/cgi-bin/R/fp_classifier?text=Ugh!%20I%20got%20food%20poisoning%20from%20McDonalds%20on%20Halstead
• “U of Chicago releases a new paper on the effects of food poisoning”
http://184.73.52.31/cgi-bin/R/fp_classifier?text=U%20of%20Chicago%20releases%20new%20paper%20on%20the%20effects%20of%20food%20poisoning
• Video: http://www.youtube.com/watch?v=RNf9XQ_25Yw&feature=youtu.be
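The classifier request URLs shown above pass the tweet as a percent-encoded `text` query parameter. Here is a minimal sketch of building such a URL in Python; it only constructs the string and makes no request, since whether the endpoint is still reachable is unknown. The `classifier_url` helper is an assumption for illustration.

```python
from urllib.parse import quote

# Endpoint as it appears on the slide.
BASE_URL = "http://184.73.52.31/cgi-bin/R/fp_classifier"

def classifier_url(text):
    """Build a classifier request URL with the tweet text percent-encoded."""
    # Keep "!" unescaped to match the slide's URLs; spaces become %20.
    return f"{BASE_URL}?text={quote(text, safe='!')}"

url = classifier_url("Ugh! I got food poisoning from McDonalds on Halstead")
print(url)
```

Any HTTP client could then issue a GET against the resulting URL, which fits the stateless, service-based classifier design described earlier.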
Use Cases
• Disease Tracker
• Large scale attempt to track disease occurrences in the United
States.
• Sponsored by the Dept. of HHS
• Approximately 1 million tweets a day (cold, flu) classified in real
time
• EC2 scalable instances
• Geolocation
• Cost to run for 6 months: $850
Future Directions
• Turnkey service
• Can all this functionality be abstracted down to a pushbutton
service?
• Open data
• Can you advertise the data collected, how you enriched it, and
allow others to come along and enrich it as well?
• General purpose bridge between Twitter and issue tracking
databases
• Big industry problem
Github Sources
• Tweet Collector
• https://github.com/smartchicago/TweetCollector
• Classifier Code
• https://github.com/corynissen/foodborne_classifier

Open Data Summit Presentation by Joe Olson

  • 1. JoeOlson DataArchitect SmartChicagoCollaborative 27Mar2014 joe.olson@cct.org (All the cool buzzwords in one place!) Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics
  • 2. Social Media - Twitter • What can we learn from Twitter? • 400 million tweets per day source: http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jack-dorsey-twitter • 218 million users source: http://techcrunch.com/2013/10/03/bweeting/ • Excellent source of sentiment • Excellent source of big data • Prototyping • Modeling natural language • Resume padding
  • 3. Social Media - Twitter • How do we get at the data? • Twitter provided APIs: • https://dev.twitter.com/docs • Streaming • Set up a real time data stream (json) based on keywords • REST (v1.1) • Make REST requests, and get results • Possible parameters: • Geospatial bounding box • By time • By user, hashtag, retweets etc • Fire hose • Big $$$. Big data
  • 4. Social Media - Twitter • Information & Obstacles • Who • What • At best: Plain English (!) • Worse: (Spanish or Arabic or Portuguese...) • Worst: “Textspeak” symbols :-0, UTF8 chars, etc. • Absolute Worst: combination of all of them • Where • 1-2% with latitude / longitude • Geocode • When
  • 5. Social Media - Twitter JSON Tweet example: • "created_at":"Sun Oct 27 13:57:40 +0000 2013", • "id":394462908261740540, • "text":"Flu :(", • "source":"<a href="http://twitmania.com" rel="nofollow">TwitMania™</a>", • "user":{ • "id":594141140, • "name":"Yultiana Farida N", • "screen_name":"yultiana", • "followers_count":231, • "friends_count":252, • "created_at":"Tue May 29 23:58:25 +0000 2012", • "statuses_count":2397, • }, • "geo":null,
  • 6. Cloud Computing • What does cloud computing bring to the table? • Amazon’s EC2: • Commoditized hardware • Low cost • Only charged for resources you use • No long term commitments • Scalable • "Throwaway" mentality **IF** you play by their rules!
  • 7. Cloud Computing – AWS • Tools • Virtual Machines • # of Processors, RAM, OS, disk capacity and I/O – all configurable • Price range: $.02/hr - $4.60/hr • Licensed OSes cost 50% more than Linux OSes • Archive Storage • S3 / Glacier • Work Queues • SQS • Data Stores • Dynamo (key value store), Red Shift (analysis store) • Virtual Networking • Routers, VPN gateways, access control lists, etc • APIs • Command line • HTTPS REST • Native programming languages (Python, bash, PHP, Java etc.) Ideal for rapid prototyping / proof of concepts
  • 8. Cloud Computing – AWS • APIs • Basic • Start an instance (and start billing) • Stop an instance (stop billing) • Insert item into queue • Remove item from queue • Write to backup store • Ultra advanced • Reserved vs. on demand vs. spot instances • Price can drop as much as 80% due to market demand • Instance can disappear at any time
  • 9. Big Data Analytics • Can we skirt the “big data” problem by distilling the tweets down from millions and millions “noise” tweets into a more desirable data set? • Enrich in real time, rather than on archived data, and avoid the overhead of map/reduce? • Possible Enrichment of raw data: • Classification – separate tweets into “relevant” and “irrelevant” • Geocoding – improve on the 1-2% ? • Aggregation –> map reduce • Mapping -> Reduce Function -> Output • AWS – Elastic Map Reduce • Clustering
  • 10. Machine Learning • Classification: relevant, or irrelevant? • Human trained model • Once model is established, bounce new data off it for classification • Validation of model • Accuracy = (Total # of classifications – Mismatches between machine and human) / Total # of classifications • Crowdsourcing – AWS Mechanical Turk • Improve model by feeding disagreements back into the model • Our best text classification model to date: accuracy in the low 90% range
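A small sketch of the validation step described above: accuracy is the share of classifications where machine and human agree, and the disagreements are collected so they can be fed back as training data. The function names and sample labels are hypothetical.

```python
def accuracy(machine_labels, human_labels):
    """(total classifications - machine/human mismatches) / total classifications."""
    if len(machine_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not machine_labels:
        raise ValueError("need at least one classification")
    total = len(machine_labels)
    mismatches = sum(1 for m, h in zip(machine_labels, human_labels) if m != h)
    return (total - mismatches) / total


def disagreements(texts, machine_labels, human_labels):
    """Return (text, human_label) pairs where the machine disagreed with the human."""
    return [
        (text, human)
        for text, machine, human in zip(texts, machine_labels, human_labels)
        if machine != human
    ]


# Made-up example data, for illustration only.
texts = ["t1", "t2", "t3", "t4", "t5"]
machine = ["relevant", "irrelevant", "relevant", "relevant", "irrelevant"]
human = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant"]

score = accuracy(machine, human)
retrain_set = disagreements(texts, machine, human)
print(score, retrain_set)
```

The `retrain_set` pairs carry the human label, on the assumption that the human (or crowdsourced) judgment is treated as ground truth when retraining.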
  • 11. Open Source • Friendly to the commoditized computing paradigm • Don’t have to worry about licensing issues • Contributes to the “throwaway” discipline • Don’t have to re-invent the wheel (collaboration) • Solutions applicable to all parts of the architecture • Acquire data: Node.js – non blocking • Analyze data: R – statistical engine • Store and query data: MongoDB (document store) or Riak (key-value database)
  • 12. Architecture • We know Twitter is providing a mountain of data from all parts of the world • We know Amazon is providing a framework of low cost, on-demand, no commitment computing • Open source is providing a rich tool set • Goals: • Architect with cost in mind! • Enrichment - Real time and after-the-fact enrichment (open data) • Scalable • Decoupled • Service based • Rapid development • Prove the concepts
  • 13. Architecture - Acquire • Acquire the data from Twitter • If classifying in real time: • Store then classify? • Classify then store? • Tools • Twitter streaming API • Keywords • Node.js • Several different packages to interface with Twitter APIs • Amazon • EC2 • SQS (?) Extremely useful, but drives the cost up
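The deck's collector is written in Node.js; the Python sketch below only illustrates the keyword-filtering idea on a stream of newline-delimited JSON. The stream is simulated with a list of strings, and `KEYWORDS` and `matching_tweets` are hypothetical names.

```python
import json

KEYWORDS = ("flu", "food poisoning")  # assumption: example tracking terms


def matching_tweets(lines, keywords=KEYWORDS):
    """Yield decoded tweets whose text contains any keyword (case-insensitive)."""
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines; streams commonly send them as keep-alives.
            continue
        tweet = json.loads(line)
        text = tweet.get("text", "").lower()
        if any(keyword in text for keyword in keywords):
            yield tweet


# Simulated stream: one JSON document per line, with a blank keep-alive line.
simulated_stream = [
    '{"id": 1, "text": "Flu :("}',
    "",
    '{"id": 2, "text": "Lovely weather in Chicago"}',
    '{"id": 3, "text": "Pretty sure that was FOOD POISONING"}',
]

kept = list(matching_tweets(simulated_stream))
print([tweet["id"] for tweet in kept])
```

Whether the kept tweets are stored first and classified later, or classified first and then stored, is the design question the slide raises; this filter sits in front of either choice.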
  • 14. Architecture - Analyze • Classification interface • Service based – HTTP REST • Push or pull? • Push – classifiers listen on port 80 • Pull – classifier starts pulling from an established work queue • Both highly scalable and flexible with respect to cost. • Stateless • R • Human trained machine learning packages available • Cloud friendly – no licenses • Automatable – from install, configuration, execution
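A rough sketch of the pull model: a stateless classifier drains items from a work queue. Python's in-process `queue.Queue` stands in for SQS here, and the keyword rule in `classify` is a deliberately naive placeholder for the trained R model, so both are assumptions rather than the deck's implementation.

```python
import queue

RELEVANT_TERMS = ("food poisoning", "threw up")  # assumption: placeholder rule


def classify(text):
    """Stateless stand-in classifier: the result depends only on the input text."""
    lowered = text.lower()
    if any(term in lowered for term in RELEVANT_TERMS):
        return "relevant"
    return "irrelevant"


def drain(work_queue):
    """Pull items until the queue is empty and classify each one."""
    results = []
    while True:
        try:
            text = work_queue.get_nowait()
        except queue.Empty:
            break
        results.append((text, classify(text)))
        work_queue.task_done()
    return results


work = queue.Queue()
for text in ["I think I got food poisoning last night", "Great pizza downtown"]:
    work.put(text)

classified = drain(work)
print(classified)
```

Because `classify` keeps no state between calls, any number of workers can run `drain` against the same queue, which is what makes the pull model easy to scale up or down with cost.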
  • 15. Architecture - Store • Store JSON as an object (document store) or normalize (relational database)? • Relational databases • disk I/O intensive – not cloud friendly • allow complex indexing • Easy to get a business intelligence front end on them • Requires a schema / ETL • Key-value document stores • Designed to be scalable – doesn’t need fast disks • Indexing is not nearly as flexible as RDBMS • More difficult to front a UI – no “drag and drop” tools • No schema / ETL needed. • Not as mature • MongoDB / Riak
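The store-as-document versus normalize trade-off can be illustrated with one tweet stored both ways. SQLite (from the Python standard library) stands in for both kinds of store here, purely as an assumption for the sketch; the deck itself uses MongoDB or Riak for the document side. Table and column names are hypothetical.

```python
import json
import sqlite3

tweet = {
    "id": 394462908261740540,
    "text": "Flu :(",
    "created_at": "Sun Oct 27 13:57:40 +0000 2013",
    "user": {"id": 594141140, "screen_name": "yultiana"},
}

conn = sqlite3.connect(":memory:")

# Option 1: document style. Store the JSON as-is; no schema design or ETL step.
conn.execute("CREATE TABLE raw_tweets (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO raw_tweets VALUES (?, ?)", (tweet["id"], json.dumps(tweet))
)

# Option 2: normalized. A schema plus an ETL step that splits the document.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, screen_name TEXT)")
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, "
    "user_id INTEGER REFERENCES users(id), text TEXT, created_at TEXT)"
)
user = tweet["user"]
conn.execute("INSERT INTO users VALUES (?, ?)", (user["id"], user["screen_name"]))
conn.execute(
    "INSERT INTO tweets VALUES (?, ?, ?, ?)",
    (tweet["id"], user["id"], tweet["text"], tweet["created_at"]),
)

# The normalized form supports joins and flexible indexing.
joined = conn.execute(
    "SELECT u.screen_name, t.text FROM tweets t JOIN users u ON u.id = t.user_id"
).fetchone()

# The document form hands back the original object unchanged.
stored_document = json.loads(
    conn.execute(
        "SELECT body FROM raw_tweets WHERE id = ?", (tweet["id"],)
    ).fetchone()[0]
)
print(joined, stored_document["text"])
```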
  • 16. Architecture – Presentation • Least need for cloud friendly scalability here? • Options • Licensed BI software – Tableau, Endeca, Jaspersoft, Pentaho • Open source BI software – SpagoBI • Roll your own - PHP, Ruby, Visual Basic, Javascript, etc • Connect to an existing system instead?
  • 17. Costs – Real Time Classification • Number of tweets collected per day: 1,000,000 (comfortable – about 0.25% of daily tweet volume) • Machine used on EC2 to acquire (node.js): micro • $0.02/hr * 24 hrs = $0.48/day • Machine used on EC2 to classify (R): small (x2) • $0.06/hr * 24 hrs = $1.44/day * 2 = $2.88/day • Machine used on EC2 to store (MongoDB): large • $0.24/hr * 24 hrs = $5.76/day • Machine used on EC2 for GUI (Apache): small • $0.06/hr * 24 hrs = $1.44/day • $0.48 + $2.88 + $5.76 + $1.44 = $10.56/day; $10.56 / 1,000,000 tweets = $0.00001056 per tweet (about 0.001 cents) • Can add more zeros if you relax real-time classification (spot instances)
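The arithmetic on this slide can be reproduced with a short script. The hourly rates and machine counts are the ones quoted above (2014 prices); `Decimal` is used so the dollar amounts stay exact. Note the per-tweet result is in dollars, which works out to roughly a thousandth of a cent.

```python
from decimal import Decimal

HOURS_PER_DAY = 24
TWEETS_PER_DAY = 1_000_000

# (role, hourly rate in dollars, number of machines), as quoted on the slide.
machines = [
    ("acquire (node.js, micro)", Decimal("0.02"), 1),
    ("classify (R, small)", Decimal("0.06"), 2),
    ("store (MongoDB, large)", Decimal("0.24"), 1),
    ("GUI (Apache, small)", Decimal("0.06"), 1),
]

daily_cost = sum(rate * HOURS_PER_DAY * count for _, rate, count in machines)
dollars_per_tweet = daily_cost / TWEETS_PER_DAY
cents_per_tweet = dollars_per_tweet * 100

print(f"daily cost:      ${daily_cost}")
print(f"cost per tweet:  ${dollars_per_tweet:f}")
print(f"                 {cents_per_tweet:f} cents")
```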
  • 18. Costs - Archive • Size of average tweet: 2.5 KB • Cost to archive: • S3: $0.095 per GB/month • about $0.0000002 per tweet per month • Glacier: $0.01 per GB/month • about $0.00000002 per tweet per month • Compression will add even more zeros, but it requires more computing power and adds latency to post-collection data analysis. Can be automated.
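A quick check of the archive figures, assuming the quoted prices are dollars per GB per month and that 1 GB = 1024 * 1024 KB (the slide does not say which convention it uses; the rounded result is the same either way).

```python
TWEET_SIZE_KB = 2.5
KB_PER_GB = 1024 * 1024  # assumption: binary gigabytes

# Dollars per GB per month, as quoted on the slide (2014 prices).
PRICE_PER_GB_MONTH = {"s3": 0.095, "glacier": 0.01}


def archive_cost_per_tweet(price_per_gb_month, tweet_size_kb=TWEET_SIZE_KB):
    """Monthly storage cost in dollars for one average-sized tweet."""
    return tweet_size_kb / KB_PER_GB * price_per_gb_month


s3_cost = archive_cost_per_tweet(PRICE_PER_GB_MONTH["s3"])
glacier_cost = archive_cost_per_tweet(PRICE_PER_GB_MONTH["glacier"])

print(f"S3:      ${s3_cost:.10f} per tweet per month")
print(f"Glacier: ${glacier_cost:.10f} per tweet per month")
```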
  • 19. Use Cases • Foodborne Chicago (http://foodborne.smartchicagoapps.org/) • Public-private partnership with City of Chicago Dept. of Public Health and Smart Chicago Collaborative • Reach out to city residents on Twitter who are tweeting about food poisoning symptoms, in an attempt to get them to log information in the City’s 311 database (via the Open311 API) • Once in the 311 database, a report follows established City workflows and becomes actionable • Numbers (1 year): • 2,390 tweets classified as related to food poisoning • 282 tweets responded to • 205 reports submitted • 145 inspections • Real time classification examples: • “Ugh! I got food poisoning from the McDonald’s on Halstead!” http://184.73.52.31/cgi-bin/R/fp_classifier?text=Ugh!%20I%20got%20food%20poisoning%20from%20McDonalds%20on%20Halstead • “U of Chicago releases a new paper on the effects of food poisoning” http://184.73.52.31/cgi-bin/R/fp_classifier?text=U%20of%20Chicago%20releases%20new%20paper%20on%20the%20effects%20of%20food%20poisoning • Video: http://www.youtube.com/watch?v=RNf9XQ_25Yw&feature=youtu.be
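The example URLs above show the classifier being called as a plain HTTP GET with the tweet text percent-encoded in a `text` parameter. A minimal sketch of building such a URL follows; it only constructs the string and does not call the endpoint, which dates from 2014 and may no longer respond. The helper name `classifier_url` is hypothetical.

```python
from urllib.parse import quote

# Endpoint as it appears in the slide's example URLs.
CLASSIFIER_ENDPOINT = "http://184.73.52.31/cgi-bin/R/fp_classifier"


def classifier_url(text):
    """Build a GET URL with the tweet text percent-encoded in the text parameter."""
    # safe="!" keeps exclamation marks literal, matching the slide's examples.
    return f"{CLASSIFIER_ENDPOINT}?text={quote(text, safe='!')}"


url = classifier_url("Ugh! I got food poisoning from McDonalds on Halstead")
print(url)
```

Keeping the interface this simple is what lets any client, in any language, act as a caller of the classification service.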
  • 20. Use Cases • Disease Tracker • Large scale attempt to track disease occurrences in the United States. • Sponsored by the Dept. of HHS • Approximately 1 million tweets a day (cold, flu) classified in real time • EC2 scalable instances • Geolocation • Cost to run for 6 months: $850
  • 21. Future Directions • Turnkey service • Can all this functionality be abstracted down to a pushbutton service? • Open data • Can you advertise the data collected and how you enriched it, and allow others to come along and enrich it as well? • General purpose bridge between Twitter and issue tracking databases • Big industry problem
  • 22. Github Sources • Tweet Collector • https://github.com/smartchicago/TweetCollector • Classifier Code • https://github.com/corynissen/foodborne_classifier