Big Data examples
1. How Big Data Can Help Your Business: Case Studies from ReadWriteWeb
David Strom
StampedeCon
August 1, 2012
david@strom.com
Download this here: http://slideshare.net/davidstrom
3. Some oddball stuff
• Planes, trains and automobiles
• Fun with maps
• Big and little ovens
• Lessons learned from P&G
• Noteworthy scientists
• And of course sex!
19. Three skills for big data CEOs
• Strategic data planning. Data is the new raw material for any business.
• Analytical skills. CEOs should be incredibly smart about asking the right questions.
• Technology skills. Embrace the technology and make it a key part of your CEO skill set.
Let’s look at planes, trains and automobiles first.
http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
Meanwhile, the immediacy and accessibility of Twitter provide a real-time glimpse into consumers' frustration, as you can see in this collection of just three tweets. Jeffrey Breen of Cambridge Aviation Research put this together to show sentiment analysis.
Here is his flowchart of how he put this all together, using R and various other data collection tools.
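Breen did this work in R; the core scoring idea, per the linked article, is simply counting matches against positive and negative word lists. Here is a minimal Python sketch of that idea. The word lists and sample tweets below are illustrative, not his actual data.

```python
# Naive word-count sentiment scoring in the spirit of Breen's R approach:
# score = (# positive-word matches) - (# negative-word matches).
# The word lists and sample tweets are invented for illustration.
import re

POSITIVE = {"great", "love", "thanks", "awesome", "helpful"}
NEGATIVE = {"delayed", "lost", "worst", "rude", "cancelled"}

def score(tweet):
    """Return a crude sentiment score for one tweet."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "Thanks @airline, great crew and an on-time arrival!",
    "Flight cancelled again and my bag is lost. Worst airline.",
]
for t in tweets:
    print(score(t), t)
```

Crude as it is, averaging such scores per airline over thousands of tweets is enough to rank carriers by customer frustration, which is what Breen's charts showed.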
http://www.forbes.com/sites/toddwoody/2012/05/23/fedex-delivers-on-green-goals-with-electric-trucks/
To tackle what is essentially a Big Data dilemma, FedEx is collaborating with General Electric (which is providing the company with commercial charging stations), the utility Con Edison, and Columbia University researchers, who are developing artificial intelligence programs to manage when and where the electric trucks charge in a 10-vehicle pilot project. "We're collecting data on what is the load on the facility, what is the load of each truck, how many miles does that truck drive," says Sondhi. "The algorithms from Columbia will identify that a truck is going to drive 16 miles tomorrow, so don't give it 30 amps, give it 8 amps so we minimize the load on the entire facility."
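The scheduling idea Sondhi describes, giving each truck only the current its predicted mileage requires and keeping the whole depot under a load cap, can be sketched roughly as below. The proportional rule and every constant here are assumptions for illustration, not the actual Columbia algorithm.

```python
def allocate_amps(predicted_miles, max_amps_per_truck=30.0,
                  amps_per_mile=0.5, facility_cap_amps=60.0):
    """Give each truck charging current proportional to its predicted
    mileage, clipped per truck, then scale everything down if the
    facility cap would be exceeded. All constants are illustrative,
    not the FedEx/GE/Columbia pilot's real values."""
    want = [min(m * amps_per_mile, max_amps_per_truck) for m in predicted_miles]
    total = sum(want)
    if total <= facility_cap_amps:
        return want
    scale = facility_cap_amps / total
    return [w * scale for w in want]

# e.g. a truck predicted to drive 16 miles gets 8 amps, not the full 30
print(allocate_amps([16, 40, 100]))
```

The point of the design is the same as in the pilot: the prediction lets the depot shave peak load instead of charging every truck at full current.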
http://www.wired.com/autopia/2012/05/ford-sync-insurance/
Currently, Ford collects and aggregates data from the 4 million vehicles that use its in-car sensing and remote app management software to create a virtuous cycle of information. The data lets Ford engineers glean information on a range of issues: how drivers are using their vehicles, the driving environment, electromagnetic forces affecting the vehicle, and other road conditions that could help them improve the quality, safety, fuel economy and emissions of the vehicle. Here you see a typical Sync dash of a Ford sedan. Drivers willing to share how many miles they've traveled could get discounts between 10 and 40 percent in exchange for giving State Farm a more accurate picture of their vehicle-use habits, obtained by electronically accessing the Sync telematics systems in the cars. Your car has become a data hub, with USB ports, an SD card reader, Bluetooth connections to your phone and even a mobile Wi-Fi hotspot.
http://transport.wspgroup.fi/hklkartta/defaultEn.aspx
You can watch the positions of the various trains in Helsinki as they move about the map here.
Speaking of maps, there are thousands of big data mapping apps. Google Maps is certainly popular, but another service, Crowdmap, makes it even easier. Here is a map of sexual violence against Syrian women that was built with that service at https://womenundersiegesyria.crowdmap.com/
http://geospaced.blogspot.co.uk/2012/07/world-wine-web.html
David Smith put this together, covering about 400 wineries in the Napa Valley area. Not only can you scroll and zoom the map, but clicking on one of the winery markers will tell you its address and whether an appointment is required for tastings. He worked with Barry Rowlingson, who used OpenStreetMaps and his own R package to build this map:
http://www.inside-r.org/howto/quantifying-uncertainty-it-estimates
Accurate estimates of IT work effort are critical for deciding where in technology a business should invest. Lacking experience with similar projects, the business is often at a loss for hard data. This article describes how R's power and convenience help with the elicitation task: quantifying the uncertainty around IT project lifespans using probability distributions. R's built-in functionality makes elicitation painless, and the methodology can be implemented in a user-friendly format. The power of R's probability toolbox allowed the authors to rapidly prototype an application that brought the basic concepts of elicitation to the IT project management space.
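The article's elicitation idea is to turn a few expert guesses into a full probability distribution for project duration. The authors worked in R; here is a minimal Python sketch of the same idea using a triangular distribution built from optimistic/likely/pessimistic guesses. The choice of distribution and all the numbers are my assumptions, not the article's.

```python
# Elicitation sketch: turn three expert guesses (in weeks) into a
# distribution of project durations via Monte Carlo sampling.
import random
import statistics

def simulate_durations(optimistic, likely, pessimistic, n=100_000, seed=42):
    """Draw n project durations from a triangular distribution whose
    low/mode/high are the three elicited guesses."""
    rng = random.Random(seed)
    return [rng.triangular(optimistic, pessimistic, likely) for _ in range(n)]

draws = sorted(simulate_durations(optimistic=4, likely=8, pessimistic=20))
print("median weeks:", round(statistics.median(draws), 1))
print("90th percentile weeks:", round(draws[int(0.9 * len(draws))], 1))
```

The useful output is not a single estimate but the spread: here the long right tail pulls the median well above the "likely" guess, which is exactly the kind of uncertainty the article argues IT planners should be quantifying.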
http://www.inside-r.org/howto/towards-ideal-steel-plant-online-liquid-steel-temperature-prediction-using-r
R proved a suitable means of providing accurate, understandable and automatable models for the desired temperature predictions, both for implementing the calculations and for external control of its functionality in a process automation environment. The mathematical approach, R code and framework program enable steel plant production engineers and technical staff to plan, carry out and adjust their work on the basis of highly stable and precise temperature preset values. Traditionally, plants add offsets and thresholds to the assumed heat target temperatures, spending extra processing time and energy at each step to stay on the safe side and deliver the melt above, rather than below, the final casting temperature. The new temperature prediction model instead allows for the optimization of process stability, throughput and material quality, especially in ladle treatment.
We are looking at a hospital autoclave, which is used for sterilizing instruments. It is just one type of industrial equipment that Axeda, working with other companies, is rigging with sensors and cellular connections. Each of these devices has an IP address and an Internet connection, so its use can be monitored remotely, and its supply, maintenance and management can all be optimized without anyone having to go look at the machine itself. "Typically engineers would find logs through customer tickets and it would take months to find trends based on call center traffic." Instead, you can collect data about uptime, need for repairs, machine run completion and detergent levels into a smartphone app that hospital employees can use.
Startup Compass collects data from tens of thousands of startups around the world, then creates best practices, recommendations and benchmarks to help entrepreneurs make better product and business decisions. First, startups learn which key performance indicators actually matter; most startups don't even know which KPIs they should track or why. Second, they learn how their KPIs compare to other companies', so they know if they're on the right track -- see, for example, their customer acquisition costs. "The third thing they learn is what actions they need to be taking. We help businesses take the next steps."
http://practicalanalytics.wordpress.com/2012/02/28/proctor-gamble-quadrupling-analytics-expertise/
This is Procter & Gamble's Business Sphere big data situation room in their Cincinnati HQ. A big data analyst drives these large screens, which display visualizations of sales, market share, ad spending and the like, so everyone in the meeting sees the same information, based on 4 billion daily transactions of P&G products. P&G isn't after new data types; it still wants to share and analyze point-of-sale, inventory, ad spending and shipment data. What's new is the higher frequency and speed at which P&G gets that data, and the finer granularity. Even with all this gear, P&G has only about two-thirds of the real-time data it needs.
They are trying to address the why: was it a bad TV ad, out-of-stock shelves, or a competitor's new product or price cut that caused a problem? Right now, the P&G IT team is working on automating that analysis, so employees get alerts when key events such as a supply chain snafu or rival product launch happen. Their data visualizations can answer questions such as: Is a sales dip in detergent in France because of one retailer, so that's where to focus? Is that retailer buying less only in France, or across Europe?
http://www.readwriteweb.com/cloud/2012/02/strata-2012-3-essential-skills.php
Diego Saenz of Data Driven CEO
Jeff Jonas is a data scientist who now works for IBM. One of his jobs was designing the casino security systems in Las Vegas, where he currently lives. He worked for the surveillance intelligence group of several casinos, automating various manual processes and adding facial recognition software that was key to slowing down the MIT card-counting group. "We built [another] system to immediately identify risk in real time so they could get these people out of the casino quickly." This software is still offered by IBM as its InfoSphere Identity Insight event processing and identity tracking technology.
"If someone has three phone numbers, no big deal. On the other hand, if someone has five different dates of birth, that just doesn't seem quite right, does it? That would be confusing. Why is this important? Well, if you are looking to analytics to make important decisions, wouldn't you want to know during the decision-making process if there was related confusion ... before [any] action is taken."
http://www.readwriteweb.com/enterprise/2011/09/measuring-the-lifespan-of-shar.php
Hilary Mason found that shortened links posted to Twitter have a mean half-life of 2.8 hours. Facebook boosts that to 3.2 hours, and direct sharing has a half-life of 3.4 hours. YouTube, however, beats them all hands down with a half-life of 7.4 hours. In other words, you might get a slight edge by posting to Facebook versus Twitter (if you don't do both), but the content matters most. Good (or controversial) stuff rises to the top and has a longer life; uninteresting stuff sinks quickly.
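In this study a link's "half-life" is the time by which it has received half of all the clicks it will ever get. That computation is simple to sketch in Python; the click timestamps below are invented for illustration.

```python
def half_life(click_times_hours):
    """Half-life of a shared link: the timestamp of the click that
    crosses the halfway mark of all clicks the link ever received."""
    ordered = sorted(click_times_hours)
    return ordered[len(ordered) // 2]

# Invented click timestamps (hours after posting) for one shortened link:
clicks = [0.1, 0.3, 0.5, 0.9, 1.4, 2.2, 3.1, 5.0, 9.0, 24.0]
print("half-life (hours):", half_life(clicks))
```

Aggregating this statistic over many links per platform is what yields the per-network figures quoted above.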
You need to start thinking about how to make your data sets smaller. "Big Data usually refers to a data set that is too big to fit into your available memory, or too big to store on your own hard drive, or too big to fit into an Excel spreadsheet," says Mason. This is the "scrub" step: the smaller the dataset, the easier it is to manipulate.
Mason and others have mentioned the now-iconic Enron email archive, which has since passed into the public domain. It is used by a number of big data researchers to test their email algorithms and is available from a number of online academic websites.
http://strataconf.com/strata2012/public/schedule/detail/22449
Jesper Andersen gave this talk at Strata earlier this year. He showed how to integrate basic public data from the city, street and mapping data from Open Street Maps, real estate and rental listings, data from social services like Foursquare, Yelp and Instagram, and analysis of street photographs from mapping services to create a holistic view of a very famous street in San Francisco: Haight Street. Surprisingly, you'll find a lot of Swedish folks on the upper half of Haight Street. Not surprisingly for San Francisco, many people on Haight speak Spanish or Japanese. Tweet stream analysis found more negative sentiment on the lower part of the street, which corresponds with higher crime stats.
AP Creates New Big Data Approach to its Article Archive (David Strom, March 19th, 2012)
If you are looking for large content repositories, you probably can't get much larger than the article archive of the Associated Press. Today they announced they have launched a content analysis tool that is used to search the millions of articles in their archives and create custom archive products for their customers. Users can query for particular keywords, and the AP can use the search query traffic to see trending topics and deliver article collections to particular B2B customers. For example, they could create references on a particular subject or moment in time. The project makes use of a solution from MarkLogic, a major Big Data enabler used by many different kinds of publishers for this type of purpose, such as Lexis/Nexis. We have written about prior efforts by the AP to modernize their archives, such as a project to provide non-profits with free information feeds. The AP didn't start out with the MarkLogic solution; they first tried to implement a more traditional relational database structure, only to run into problems. Their archives are in XML, which made it difficult to design the right kind of data structures, and they didn't have consistent metadata across the archives. The MarkLogic implementation took 16 weeks from start to finish and was the first time the AP had made use of their services.
It enables them to run complex Boolean searches across millions of articles in the content archive and get back precise results in seconds or minutes instead of days or weeks. This much quicker response time is already transforming their B2B product offerings and helping them manage searches of unstructured content in near real time.
http://www.readwriteweb.com/hack/2012/02/data-scraping-comes-of-age-wit.php
The company is called ScraperWiki.com and was started by Julian Todd and Aidan McGuire, two U.K.-based analysts who have long been involved in opening up government data to the public.
This shows data mined from UN peacekeeping troop levels, one example of what you can do with the ScraperWiki site. They host lots of public data sets that anyone can analyze, and they try to help journalists publish the information.
Appistry's software is used in FedEx's logistics apps, Sprint's fraud detection services, and at defense contractor Northrop Grumman. San Francisco-based Presidio Health used a variety of its products to boost its cloud performance. "Presidio had to handle a 16-times increase in data volume in a year and replace some aging hardware," says its CTO Thomas Gregory. It was able to increase its computing power by 70% without increasing the costs of its IT equipment. "We didn't want a lot of capital expense, and we wanted an environment that was safe and could spread our risk around." The company uses a combination of Eclipse and Spring-based open source software along with Appistry for its cloud services management. "Appistry has integration with Spring; it was easy to use and saved us months of effort to move our software into this environment," he said. "Plus we don't have to expose any of our services externally."
http://blog.okcupid.com/index.php/gay-sex-vs-straight-sex/
OK, on to sex. The dating site OkCupid looked through more than 4 million matches it has made to find patterns in gay and straight sexual preferences. The median number of sexual partners for both men and women is six, exploding the myth that gays are more promiscuous.
Here are straight people who either have had or would like to have a same-sex experience, across the continental U.S. and lower Canada. You can see some sharp geographic divides. Awesomely, the mountain West lives up to its Brokeback reputation, and Canada is orange nearly coast to coast. Even in the yellow and blue areas, you can see pockets of gay curiosity in interesting places: Austin, Madison, Asheville. Anywhere soy milk is served, basically. This is based on millions of responses: on average, active users have answered about 3,000 questions, hidden the profiles of several thousand users they aren't interested in, and voted for about 4,000 profiles.
When OkCupid asked its members factual questions, this is how the answers sorted out by sexual preference and gender. We always knew that women were smarter.
Kaggle routinely hosts big data contests, and this one, which concluded last month, was a way for Facebook to evaluate prospective employees. More than 400 people submitted entries.
http://www.theatlantic.com/technology/archive/2012/05/the-perfect-milk-machine-how-big-data-transformed-the-dairy-industry/256423/
Still think big data is a lot of bull? Well, not according to the USDA. Out of 8 million Holstein dairy cows in the United States, there is exactly one bull that has been scientifically calculated to be the very best in the land. He goes by the name of Badger-Bluff Fanny Freddie and already has 346 daughters on the books. Equations predicted from his DNA that he would be the best bull: USDA research geneticists reviewed pedigree records and looked at things such as milk production and fat and protein content to optimize the breed. To give you an idea of how this industry has changed, in 1942 the average dairy cow produced less than 5,000 pounds of milk in its lifetime. Now, the average cow produces over 21,000 pounds.