Titanic prediction
1. Research Triangle Analysts
rtpanalysts.org
• Intro to Kaggle.com
• Titanic Getting Started Competition
• Prediction Problem with two outcome Levels
• Opportunity for an extended Data Shootout with Kaggle.com
providing data, scoring, tutorials, forums.
• Public domain data allows for detailed discussion of modeling
issues and solutions without client data confidentiality concerns.
• A common ground for in depth learning and debates on
analytics topics.
• Participants of all levels of expertise welcome
• You influence the direction of this effort by your
participation. Post questions and thoughts on
rtpanalysts.org .
• Welcome!
Slides by Linda Schumacher. Contact via Research Triangle Analysts LinkedIn group member list
2. Classification Problems
• Two levels, or outcomes
• Data → Model → Predictions
• Examples
– Find customers who are likely to buy product
– Identify patients likely to be admitted to hospital
– Categorize cells as cancerous or benign
– Who survives the Titanic disaster?
3. Classifier - Trees
• Decision Trees
– Root: All Passengers
– First split, on gender: Female | Male
– Later splits, on class: First Class | Second | Third Class
– Later splits, on age: Age < 16 | Age >= 16
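A decision tree like this one can be read as nested if/else rules. The sketch below is a minimal illustration in Python. It assumes that the class split sits under the Female branch and the age split under the Male branch (the slide layout does not make this certain), and the outcome at each leaf is a hypothetical guess, not a fitted value. The function name predict_survival and the sample passengers are invented for the example.

```python
def predict_survival(sex, pclass, age):
    """Return 1 (survived) or 0 (did not survive) from hand-written rules.

    Leaf outcomes are illustrative assumptions, not values learned from data.
    """
    if sex == "female":
        # Assumed split: first and second class vs. third class.
        return 1 if pclass in (1, 2) else 0
    # Male branch: split on age at 16, the threshold shown on the slide.
    if age is not None and age < 16:
        return 1
    return 0


# Made-up passengers: (sex, class, age).
passengers = [
    ("female", 1, 29.0),
    ("female", 3, 40.0),
    ("male", 2, 8.0),
    ("male", 3, 35.0),
]
predictions = [predict_survival(*p) for p in passengers]
print(predictions)
```

A tree-building tool would choose the splits and leaf outcomes from the training data rather than by hand.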
4. Classifier - Logistic Regression
• Equation – Logistic Regression
• F(x) = sigmoid(age+class-embarked+gender)
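The equation can be sketched in a few lines of Python: the sigmoid turns a weighted sum of the inputs into a number between 0 and 1 that is read as a survival probability. The weights, the bias, and the feature names (age, pclass, is_female) below are hypothetical values picked for illustration, and the embarked term is left out to keep the sketch short. A real model would estimate the weights from the training data.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def survival_probability(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the inputs."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)


# Hypothetical weights for illustration only; not fitted to any data.
weights = {"age": -0.03, "pclass": -1.0, "is_female": 2.5}
bias = 2.0

woman_first_class = {"age": 30, "pclass": 1, "is_female": 1}
man_third_class = {"age": 30, "pclass": 3, "is_female": 0}

p_woman = survival_probability(woman_first_class, weights, bias)
p_man = survival_probability(man_third_class, weights, bias)
print(round(p_woman, 3), round(p_man, 3))
```

To get a two-level prediction, compare the probability with a cutoff such as 0.5.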
5. Titanic Data
• Passenger List
– Name, class, fare, embarked, family
members, age, cabin, etc.
– Survival
• Training Set of 891 Passengers
• Test Set of 418
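A quick first look at the training file can be done with the Python standard library. The sketch below assumes the file has a header row with columns named Survived and Age, as in the Kaggle train.csv; check the header of your download. The three sample rows are made-up placeholders so the code runs without the file, and the helper name summarize is invented for the example.

```python
import csv
import io


def summarize(csv_file):
    """Count rows, survivors, and missing ages in a train.csv-style file."""
    rows = list(csv.DictReader(csv_file))
    survivors = sum(1 for r in rows if r["Survived"] == "1")
    missing_age = sum(1 for r in rows if r["Age"] == "")
    return {"rows": len(rows), "survivors": survivors, "missing_age": missing_age}


# Made-up placeholder rows, not real passenger records.
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
    "3,1,3,female,\n"
)
summary = summarize(sample)
print(summary)

# With the downloaded file (the path is an assumption):
# with open("train.csv", newline="") as f:
#     print(summarize(f))  # "rows" should be 891, per the slide
```

Counting missing values early matters because age is one of the inputs used by both classifiers shown earlier.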
6. Kaggle.com
• Data
• Tutorials
– Tools – Excel, Python
– Models – Trees, Random Forests
• Submission
• Leaderboard
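A submission is a CSV file of predictions for the test set. The sketch below assumes a two-column layout, PassengerId and Survived, with one row per test passenger; confirm the exact format on the competition page before uploading. The function name write_submission and the three IDs and predictions are placeholders.

```python
import csv
import io


def write_submission(passenger_ids, predictions, out_file):
    """Write one row per test passenger: PassengerId, predicted Survived."""
    if len(passenger_ids) != len(predictions):
        raise ValueError("need exactly one prediction per passenger")
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])  # assumed header
    for pid, pred in zip(passenger_ids, predictions):
        writer.writerow([pid, int(pred)])


# Placeholder IDs and predictions, written to memory instead of disk.
buffer = io.StringIO()
write_submission([892, 893, 894], [0, 1, 0], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write a real file, pass an object opened with open("submission.csv", "w", newline="") in place of the in-memory buffer.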
7. Where to Start
• create a Kaggle account
http://www.kaggle.com/account/register
• read and agree to the rules if you choose to continue
• enter the Kaggle Titanic Competition
http://www.kaggle.com/c/titanic-gettingStarted
• download train.csv and test.csv
• If you choose to use R, download it from
http://www.r-project.org/ You will have to choose a
‘mirror’ (download site), usually a university or research site
• If you share code or data outside of your Kaggle
team, be sure to post a copy on Kaggle Titanic Forum
see http://www.kaggle.com/c/titanic-gettingStarted/details/rules
8. Benefits
• Extended Data Shoot-Out
• Tailor participation
• Opportunities
- New classifiers
- New tools, languages
- Training vs test error
- Round Table Discussion of Solutions
- Compare model results
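The gap between training and test error can be shown with a deliberately overfit model. The sketch below is a toy illustration: the "memorizer" stores every training row, so its training error is zero, but it has to guess on rows it has not seen. All rows are made up, and the names error_rate and fit_memorizer are invented for the example.

```python
def error_rate(model, rows):
    """Fraction of rows where the model's prediction differs from the label."""
    wrong = sum(1 for features, label in rows if model(features) != label)
    return wrong / len(rows)


def fit_memorizer(train_rows):
    """Deliberately overfit model: memorize every training row exactly."""
    table = {features: label for features, label in train_rows}
    return lambda features: table.get(features, 0)  # unseen rows default to 0


# Made-up rows: ((sex, class), survived). Not real Titanic records.
train = [
    (("female", 1), 1),
    (("female", 2), 1),
    (("male", 3), 0),
    (("male", 1), 0),
]
held_out = [
    (("female", 3), 1),
    (("male", 2), 0),
    (("female", 1), 1),
    (("male", 3), 0),
]

model = fit_memorizer(train)
train_error = error_rate(model, train)
held_out_error = error_rate(model, held_out)
print(train_error, held_out_error)
```

Because the labels of the Kaggle test set are not provided, setting aside part of the training data as a held-out set is one way to estimate test error before submitting.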