R - datascience

R FOR DATA
SCIENCE
M O D U L E T O L E A R N I N G R A N D H O W T O H A N D L E D ATA
T H AT P U R P O S E D T O D ATA S C I E N C E
A U T H O R : I K E K U R N I AT I

DAY 1 : 6 H O U R S
1. Pengenalan singkat Data Science
2. Data Mining
3. History of R
4. Getting started with R
5. R NUT bolts
6. Getting data IN & OUT of R
7. Using the reader package
8. Using textual data
9. Interface to the outside world
10. Sub setting R objects
11. Vectorized Operation
12. Dates and time

DAY 2 : 7 H O U R S
1. Simple CaseAnalysis
1. Analysis Correlation
2. Linear Regression
2. CRIPS –DM Methodology
3. PredictiveAnalytic
1. Estimation
2. Classification
3. Clustering
3. Text Analysis

TARGET & GOAL
Target Peserta
Orang-orang yang ingin mempelajari R untuk
data science
Goal
• G1.Peserta mengerti, memahami
Konseptual data science, data mining dan
Mesin Learning.
• G2. Peserta mampu menggunakan R
• G3. Peserta mampu menggunakan R untuk
beberapa contoh dasar Data Science

PERQUISITES
• R Installation : https://cran.r-project.org
• R Studio :https://www.rstudio.com
• Internet Connection

PRE TEST
1. Sebutkan apa yang anda ketahui tentang
Data Science
2. Apa yang anda ketahui tentang data
mining
3. Apa yang anda ketahui tentang machine
learning
4. Apa yang anda ketahui tentang R
5. Dapatkan anda menggunakan R

Data science mencakup disiplin ilmu yang
luas, berdasarkan diagram diatas terdapat
3 disiplin ilmu yang berfokus pada data
science.
Data science adalah ilmu interdisiplin
yang berarti data science terbentuk dari
berbagai ilmu pengetahuan. Menurut
Staven Geringer Raleigh (2014),
pembentuk data science atau ilmu data
dapat diilustrasikan dalam diagram venn
berikut,
DATA SCIENCE

PUSTAKA
1. "Data Mining Curriculum". ACM SIGKDD. 2006-04-30. Retrieved 2014-01-27.
2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques Third Edition, Elsevier, 2012
3. Ian H. Witten, Frank Eibe, Mark A. Hall, Data mining: Practical Machine Learning Tools and Techniques
3rd Edition, Elsevier, 2011
4. Markus Hofmann and Ralf Klinkenberg, RapidMiner: Data Mining Use Cases and Business Analytics
Applications, CRC Press Taylor & Francis Group, 2014
5. Daniel T. Larose, Discovering Knowledge in Data: an Introduction to Data Mining, John Wiley & Sons, 2005
6. Ethem Alpaydin, Introduction to Machine Learning, 3rd ed., MIT Press, 2014
7. Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
8. Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook Second Edition,
Springer, 2010
9. Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data:
Algorithms and Applications, World Scientific, 2007
10. Wilson, Brian. Soft System Metodhology. British Library Cataloguing in Publication Data, 2001.
11. Emmanuel Paradis. R for Beginners. Institut des Sciences de l’E ́volution Universit ́e Montpellier II F-34095
Montpellier c ́edex 05 France. 2005
12. Roger D. Peng. R Programming for Data Science. Leanpub book. 2015

APA YANG
MENDASARI
LAHIRNYA DATA
MINING ?
P E S E R TA M E M A H A M I A PA YA N G M E N DA S A R I
L A H I R N YA DATA M I N I N G . S E H I N G G A
P E S E R TA M E M A H A M I A PA P E N T I N G N YA
DATA M I N I N G

YANG MENDASARI MUNCULNYA DATA
MINING
Data Yang Besar
Belum di
Eksplorasi
Teknik
Komputasi
& Ilmu
Komputer
Kebutuhan
Membuka
Informasi yg
tersembunyi

Banking Hospital Corporate Education
Cuaca Sport
MANUSIA MEMPRODUKSI DATA

PERTUMBUHAN DATA
Astronomi
• Sloan Digital Sky Survey
– New Mexico,2000
– 140TB over 10 years
• Large Synoptic SurveyTelescope
– Chile, 2016
– Will acquire 140TB every five days
Biologi dan Kedokteran
• European Bioinformatics Institute (EBI)
– 20PB of data (genomic data doubles in size each year)
kilobyte (kB) 103
megabyte (MB) 106
gigabyte (GB) 109
terabyte (TB) 1012
petabyte (PB) 1015
exabyte (EB) 1018
zettabyte (ZB) 1021
yottabyte (YB) 1024

PERTUMBUHAN DATA
• Mobile Electronics market
– 5B mobile phones in use in 2010
– 150M tablets was sold in 2012 (IDC)
– 200M is global notebooks shipments in 2012 (Digitimes Research)
• Web and Social Networks generates
amount of data
– Google processes 100 PB per day,3 million servers
– Facebook has 300 PB of user data per day
– Youtube has 1000PB video storage
– 235 TBs data collected by the US Library of Congress
– 15 out of 17 sectors in the US have more data stored per company than
the US Library of Congress
kilobyte (kB) 103
megabyte (MB) 106
gigabyte (GB) 109
terabyte (TB) 1012
petabyte (PB) 1015
exabyte (EB) 1018
zettabyte (ZB) 1021
yottabyte (YB) 1024

PERUBAHAN KULTUR DAN PERILAKU
(Insight, Big DataTrends
for Media, 2015)
19

EKSPLORASI
Menggali Lebih
dalam
Mengangkat
Kepermukaan
MenemukanSesuatu
yang berarti

TEKNIK KOMPUTASI & ILMU
KOMPUTER

APA YANG
DIMAKSUD
DENGAN DATA
MINING?
P E S E R TA M E M A H A M I A PA YA N G D I M A K S U D
D E N G A N D ATA M I N I N G D A N A PA S A J A
TA R G E T G O A L D A R I D ATA M I N I N G

DE FIN ISI DATA
M IN IN G & GOA L
O F DATA M IN IN G
Data mining adalah proses komputasi untuk
menemukan pola dalam data set yang besar
dengan melibatkan metodeArtificial
intelligence, Machine learning, statistik, dan
sistem basis data.
Tujuan dari proses data mining adalah untuk
mengekstrak informasi dari kumpulan data dan
mengubahnya menjadi struktur yang dimengerti
untuk digunakan lebih lanjut.
"Data Mining Curriculum".ACM SIGKDD.2006-
04-30.Retrieved2014-01-27.

AKAR ILMU DATA MINING
Statistic
Artificial Intelligence
Pattern Recognition
Basis Data

Komputational
Visualisasi
Data
Statistika
Machine
Learning
Artificial
Intelleigence
Asosiasi
Sekuensial
Pattern
Recognition
Basis Data
Basis Data

TA H A PA N U TA M A
DATA M IN IN G

DATA SET
Matrik Data
Data
Transaksi
Data
Dokumen
Record
WWW
Struktur
Molekul
Graph
Data Spasial
Data
Temporal
Data
Sekuensial
Data
Urutan
Genetik
Ordered
data set

C O N TO H
PE N E R A PA N DATA
M IN IN G• Penentuan pasokan listrik PLN untuk wilayah Jakarta
• Prediksi profile tersangka koruptor dari data pengadilan
• Perkiraan harga saham dan tingkat inflasi
• Analisis pola belanja pelanggan
• Memisahkan minyak mentah dan gas alam
• Menentukan kelayakan seseorang dalam kredit KPR
• Penentuan pola pelanggan yang loyal pada perusahaan
operator telepon
• Deteksi pencucian uang dari transaksi perbankan
• Deteksi serangan (intrusion) pada suatu jaringan

DATA - INFORMASI – PENGETAHUAN
Data Kehadiran Pegawai
NIP TGL DATANG PULANG
1103 02/12/2004 07:20 15:40
1142 02/12/2004 07:45 15:33
1156 02/12/2004 07:51 16:00
1173 02/12/2004 08:00 15:15
1180 02/12/2004 07:01 16:31
1183 02/12/2004 07:49 17:00
Romi satrioWahono, Slide & Presentation,2016

InformasiAkumulasi Bulanan Kehadiran Pegawai
NIP Masuk Alpa Cuti Sakit Telat
1103 22
1142 18 2 2
1156 10 1 11
1173 12 5 5
1180 10 12

Pola Kebiasaan Kehadiran Mingguan Pegawai
Senin Selasa Rabu Kamis Jumat
Terlambat 7 0 1 0 5
Pulang
Cepat
0 1 1 1 8
Izin 3 0 0 1 4
Alpa 1 0 2 0 2

DATA - INFORMASI – PENGETAHUAN -
KEBIJAKAN
• Kebijakan penataan jam kerja karyawan khusus
untuk hari senin dan jumat
• Peraturan jam kerja:
– Hari Senin dimulai jam 10:00
– Hari Jumat diakhiri jam 14:00
– Sisa jam kerja dikompensasi ke hari lain

DATA MINING PADA BUSINESS
INTELLIGENCE
37
Increasing potential
to support business
decisions
End User
Business Analyst
Data Analyst
DBA
Decision
Making
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
Data Exploration
Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems

MODEL ADALAH?
• Model (of any kind) are not description of the real world they are descriptions of ways of
thinking about the real world. ( BrianWilson, 2001: 4)

PERAN UTAMA
DATA MINING Estimation
Prediction
ClassificationClustering
Association

METODE DATA MINING
1. Estimation (Estimasi):
– Linear Regression,Neural Network,Support Vector Machine,etc
2. Prediction/Forecasting (Prediksi/Peramalan):
– Linear Regression,Neural Network,Support Vector Machine,etc
3. Classification (Klasifikasi):
– Naive Bayes,K-Nearest Neighbor,C4.5, ID3, CART, Linear
Discriminant Analysis,Logistic Regression,etc
4. Clustering (Klastering):
– K-Means,K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means,etc
5. Association (Asosiasi):
– FP-Growth,A Priori, Coefficient of Correlation,Chi Square,etc

ESTIMASI
• Menemukan/ memberikan suatu nilai yang belum diketahui, misal menerka berapa jumlah garam
yang harus di import ketika informasi mengenai garam diketahui.Metode yang digunakan antara
lain Point Estimation dan Confidence Interval Estimations, Simple Linear Regression dan Correlation,
dan Multiple Regression.
• Digunakan ketika menghitung waktu estimasi

ESTIMASI WAKTU PENGIRIMAN KFC
Custome
r
Jumlah Pesanan
(P)
Jumlah Traffic Light
(TL)
Jarak (J) Waktu Tempuh
(T)
1 3 3 3 16
2 1 7 4 20
3 2 4 6 18
4 4 6 8 36
...
1000 2 4 2 12
Waktu Tempuh (T) = 0.48P + 0.23TL + 0.5J
Pembelajaran dengan
Metode Estimasi (Regresi Linier)
Label

ASOSIASI
• Association (asosiasi), fungsi ini mengidentifikasi item-
item produk yang kemungkinan dibeli konsumen
bersamaan dengan produk lain. Metode atau algoritma
dalam fungsi ini adalah Apriori, Generalized Sequential
Pattern (GSP), FP-Growthdan GRI algorithm.
• apabila data yang kita gunakan seperti data transaksi, jadi
pada metode Asosiasi ini sangat cocok untuk data
transaksi misalnya transaksi belanja pembeli pada suatu
swalayan atau minimarket, metode yang menghubungkan
antara item, misal transaksi pertama orang membeli kopi,
gula, teh. lalu orang kedua membeli barang kopi, susu,
shampo, sabun dan lainya. kemudian data ini kita olah
menggunakan metode asosiasi contohnya adalah FP-
Growth .

ATURAN ASOSIASI PEMBELIAN BARANG
Pembelajaran dengan
Metode Asosiasi (FP-Growth)

PENGETAHUAN BERUPA ATURAN ASOSIASI

PREDIKSI
• Untuk memperkirakan nilai masa mendatang,missal memprediksi stok barang satu tahun ke
depan.Fungsi ini mencakup metode Neural Network,DecisionTree,dan k–Nearest Neighbor.
• Digunakan apabila data yang kita peroleh merupakan data numerik seperti rentan waktu atau
time series

CLUSTERING
• pengelompokan mengidentifikasi data yang
memiliki karakteristik tertentu.Metode dalam
fungsi ini diantaranya Hierarchical Clustering,
metode K-Means,dan Self Organizing
Map (SOM).
• Gunakan apabila datanya tidak memiliki label,
makanya metode klastering juga dikenal
dengan metode unsupporvise learning atau
pembelajarannya tidak memerlukan guru,
berbeda dengan metode Estimasi,Prediksi dan
klasifikasi yang merupakan metode supporvise
learning.

KLASIFIKASI
• penemuan model atau fungsi yang menjelaskan atau membedakan konsep atau kelas data,
dengan tujuan untuk dapat memperkirakan kelas dari suatu objek yang labelnya tidak diketahui.
Metode yang digunakan antara lain Neural Network, DecisionTree, k-Nearest Neighbor, dan Naive
Bayes.
• Gunakan apabila atribut data bisa berupa numerik atau nominal dengan label data nominal

KLASIFIKASI KELULUSAN MAHASISWA
NIM Gender Nilai
UN
Asal
Sekolah
IPS1 IPS2 IPS3 IPS 4 ... Lulus
Tepat
Waktu
10001 L 28 SMAN 2 3.3 3.6 2.89 2.9 Ya
10002 P 27 SMA DK 4.0 3.2 3.8 3.7 Tidak
10003 P 24 SMAN 1 2.7 3.4 4.0 3.5 Tidak
10004 L 26.4 SMAN 3 3.2 2.7 3.6 3.4 Ya
...
...
11000 L 23.4 SMAN 5 3.3 2.8 3.1 3.2 Ya
50
Pembelajaran dengan
Metode Klasifikasi (C4.5)
Label

PENGETAHUAN BERUPA POHON KEPUTUSAN

PENGETAHUAN (POLA/MODEL)
1. Formula/Function (Rumus atau Fungsi Regresi)
– WAKTU TEMPUH = 0.48 + 0.6 JARAK + 0.34 LAMPU + 0.2
PESANAN
2. DecisionTree (Pohon Keputusan)
3. Tingkat Korelasi
4. Rule (Aturan)
– IF ips3=2.8 THEN lulustepatwaktu
5. Cluster (Klaster)

EVALUASI (AKURASI, ERROR, ETC)
1. Estimation:
– Error:Root Mean Square Error (RMSE),MSE,MAPE,etc
2. Prediction/Forecasting (Prediksi/Peramalan):
– Error:Root Mean Square Error (RMSE) ,MSE,MAPE,etc
3. Classification:
– Confusion Matrix:Accuracy
– ROC Curve:Area Under Curve (AUC)
4. Clustering:
– Internal Evaluation:Davies–Bouldin index,Dunn index,
– External Evaluation: Rand measure,F-measure, Jaccard index,Fowlkes–
Mallows index,Confusion matrix
5. Association:
– Lift Charts:Lift Ratio
– Precision and Recall (F-measure)

LATIHAN
1. Sebutkan 5 peran utama data mining!
2. Jelaskan perbedaan estimasi dan prediksi!
3. Jelaskan perbedaan prediksi dan klasifikasi!
4. Jelaskan perbedaan klasifikasi dan klastering!
5. Jelaskan perbedaan klastering dan association!
6. Jelaskan perbedaan estimasi dan klasifikasi!
7. Jelaskan perbedaan estimasi dan klastering!
8. Jelaskan perbedaan supervised dan unsupervised learning!
9. Sebutkan tahapan utama proses data mining!
54

CROSS INDUSTRY
STANDARD
PROCESS FOR
DATA MINING

Cross Industry Standard
Process for Data Mining,
commonly known by its
acronym CRISP-DM,[1] is
a data mining process model
that describes commonly used
approaches that data mining
experts use to tackle
problems.

CRIPS-DM
• Business Understanding: Memahami scope bisnis, mendefinisikan problem yang akan
diselesaikan
• Data Understanding: Inisialisasi data, Pengumpulan data, Mengidentifikasi kualitas data (ada atau
tidaknya missing value, lengkap atau tidak lengkap data yang dibutuhkan untuk proses data mining,
dsb)
• Data preparation: Feature selection, Handle missing value, Data cleansing.
• Modeling: Pemilihan Model berdasarkan kategori problem dan output yang diharapkan, Run
Model.
• Evaluation: Mengevaluasi apakah model yang sudah dibuat memenuhi tujuan bisnis,
Mengidentifikasi problem bisnis ,Mampu mengidentifikasi masalah bisnis yang sudah didefinisikan.
• Deployment: Implementasi model, setting model agar dapat dijalankan secara continous.

https://www.datacamp.com/community/tutorials/r-or-python-
for-data-analysis

MESIN LEARNING
CABANG DARI ARTIFICIAL INTELLIGENCE
Tom Mitchel, 1998
T: Task:Tugas; P:Performance:Nilai hasil kerja; E: Experience: Pengalaman
komputer dikatakan melakukan learning apabila dalam mengerjakan tugas T, hasil kerjanya P semakin
baik dengan bertambahnya pengalaman E

JENIS DAN CARA KERJA
MESIN LEARNING
• Supervised learning
manusia memberi seperangkat contoh hasil yang benar, komputer
menggunakan contoh tersebut untuk menemukan hasil bagi data
masukan lain
• Unsupervised learning
manusia tidak campur tangan memberikan jawaban yang benar,
komputer dibiarkan menemukan sendiri pola dalam data masukan
• Reinforced learning
Mesin mencoba langkah-langkah dan mendapat umpan balik positif
atau negatif pada setiap langkah tersebut

SUPERVISED LEARNING
Ada proses diberitahu kumpulan data yang benar kemudian
yang dilakukan adalah memecahkan permasalahannya
dengan algoritma yang sesuai, misal;
untuk mencari prediksi dari nilai riil (continuous value)
digunakan pendekatan regression problem (contoh:
linear/multivariate regression) [2], biasanya untuk nilai
yang sangat banyak meskipun diskrit maka akan
dianggap sebagai angka riil
untuk mencari prediksi dari nilai diskrit/bulat (discrete
value) digunakan pendekatan classification
problem (contoh: logistic regression), misalnya untuk
memilah benar atau salah, memilah nilai a atau b atau
c, dst

ALGORITMA SUPERVISED
LEARNING
•Decision tree
•Nearest - Neighbor Classifier
•Naive Bayes Classifier
•Artificial Neural Network
•Support Vector Machine
•Fuzzy K-Nearest Neighbor

UNSUPERVISED LEARNING
Unsupervised Learning
Tidak ada pemberitahuan mana data yang benar
Mencari struktur (pattern) dari data yang ada,
kemudian melakukan pengelompokan
berdasarkan informasi yang dimiliki, misal:
algoritma clustering (pengelompokan data)
•contoh penggunaan unsupervised learning ini
misalnya pada social network analysis,
memecahkan cocktail party problem

ALGORITMA UNSUPERVISED
LEARNING
•K-Means
•Hierarchical Clustering
•DBSCAN
•Fuzzy C-Means
•Self-Organizing Map

REINFORCEMENT LEARNING
Reinforcement Learning adalah salah satu paradigma baru di dalam learning theory. RL
dibangun dari proses mapping (pemetaan) dari situasi yang ada di environment (states) ke
bentuk aksi (behavior) agar dapat memaksimalkan reward. Agent yang bertindak sebagai sang
learner tidak perlu diberitahukan behavior apakah yang akan sepatutnya dilakukan, atau dengan
kata lain, biarlah sang learner belajar sendiri dari pengalamannya. Ketika ia melakukan sesuatu
yang benar berdasarkan rule yang kita tentukan, ia akan mendapatkan reward, dan begitu juga
sebaliknya.
RL secara umum terdiri dari 4 komponen dasar, yaitu :
• Policy : kebijaksanaan
• Reward function
• Value function
• Model of environment

BASIC FEATURES OF R
R
Statistical
Features
Basic Statistic
Statistic
Graphics
Probability
Distribution
Programming
Features
Distribute
Computing
R Package

LIMITATION OF R
At a higher level one “limitation” of R is that its functionality is
based on consumer demand and (voluntary) user contributions.

Official Manuals
CRAN21 are
•An Introduction to R22
•R Data Import/Export23
•Writing R Extensions24: Discusses how to write and organize R packages
•R Installation and Administration25: This is mostly for building R from the source code)
•RInternals26:This manual describes the low level structure of Ran dissimilarly for developers
•and R core members
•R Language Definition27: This documents the R language and, again, is primarily for developers
R RESOURCES

G E T T I N G S TA R T E D
W I T H R

GETTING STARTED WITH THE R INTERFACE

R - S T U D I O -
R P R O J E C T

Using Projects
RStudio projects make it straightforward to divide your
work into multiple contexts, each with their own
working directory, workspace, history, and source
documents.
Creating Projects
RStudio projects are associated with R working
directories. You can create an RStudio project:
• In a brand new directory
• In an existing directory where you already have R
code and data
• By cloning a version control (Git or Subversion)
repository
To create a new project use the Create
Project command (available on the Projects menu and
on the global toolbar):

When a new project is created RStudio:
1. Creates a project file (with an .Rproj extension) within the project directory. This
file contains various project options (discussed below) and can also be used as a
shortcut for opening the project directly from the filesystem.
2. Creates a hidden directory (named .Rproj.user) where project-specific temporary
files (e.g. auto-saved source documents, window-state, etc.) are stored. This
directory is also automatically added to .Rbuildignore, .gitignore, etc. if required.
3. Loads the project into RStudio and display its name in the Projects toolbar (which
is located on the far right side of the main toolbar)

Working with Projects
Opening Projects
There are several ways to open a project:
1. Using the Open Project command (available from both the
Projects menu and the Projects toolbar) to browse for and
select an existing project file (e.g. MyProject.Rproj).
2. Selecting a project from the list of most recently opened
projects (also available from both the Projects menu and
toolbar).
3. Double-clicking on the project file within the system shell
(e.g. Windows Explorer, OSX Finder, etc.).

When a project is opened within RStudio the following
actions are taken:
1. A new R session (process) is started
2. The .Rprofile file in the project's main directory (if any) is
sourced by R
3. The .RData file in the project's main directory is loaded (if
project options indicate that it should be loaded).
4. The .Rhistory file in the project's main directory is loaded
into the RStudio History pane (and used for Console
Up/Down arrow command history).
5. The current working directory is set to the project directory.
6. Previously edited source documents are restored into editor
tabs
7. Other RStudio settings (e.g. active tabs, splitter positions,
etc.) are restored to where they were the last time the project
was closed.

Quitting a Project
When you are within a project and choose to either Quit, close the
project, or open another project the following actions are taken:
1. RData and/or .Rhistory are written to the project directory (if
current options indicate they should be)
2. The list of open source documents is saved (so it can be restored
next time the project is opened)
3. Other RStudio settings (as described above) are saved.
4. The R session is terminated.

Working with Multiple Projects at Once
You can work with more than one RStudio project at a time by
simply opening each project in its own instance of RStudio.
There are two ways to accomplish this:
1. Use the Open Project in New Window command located
on the Project menu.
2. Opening multiple project files via the system shell (i.e.
double-clicking on the project file).

Project Options
There are several options that can be set on a
per-project basis to customize the behavior of
RStudio. You can edit these options using
the Project Options command on
the Projectmenu:

General
Note that the General project options are all overrides of existing global
options. To inherit the default global behavior for a project you can
specify (Default) as the option value.
1. Restore .RData into workspace at startup — Load the .RData file
(if any) found in the initial working directory into the R workspace
(global environment) at startup. If you have a very large .RData file
then unchecking this option will improve startup time considerably.
2. Save workspace to .RData on exit — Ask whether to save .RData
on exit, always save it, or never save it. Note that if the workspace
is not dirty (no changes made) at the end of a session then no
prompt to save occurs even if Ask is specified.
3. Always save history (even when not saving .RData) — Make
sure that the .Rhistory file is always saved with the commands from
your session even if you choose not to save the .RData file when
exiting.

Editing
1. Index R source files — Determines whether R source files within
the project directory are indexed for code navigation (i.e. go to
file/function, go to function definition). Normally this should
remain enabled, however if you have a project directory with
thousands of files and are concerned about the overhead of
monitoring and indexing them you can disable indexing here.
2. Insert spaces for tab — Determine whether the tab key inserts
multiple spaces rather than a tab character (soft tabs). Configure the
number of spaces per soft-tab.
3. Text encoding — Specify the default text encoding for source files.
Note that source files which don't match the default encoding can
still be opened correctly using the File : Reopen with
Encoding menu command.

Version Control
1. Version control system — Specify the version control system to
use with this project. Note that RStudio automatically detects the
presence of version control for projects by scanning for a .git or
.svn directory. Therefore it isn't normally necessary to change this
setting. You may want to change the setting for the following
reasons:
• You have both a .git and .svn directory within the project and
wish to specify which version control system RStudio should
bind to.
• You have no version control setup for the project and you
want to add a local git repository (equivalent to executing git
init from the project root directory).
2. Origin — Read-only display of the remote origin (if any) for the
project version control repository.

ENTERING INPUT
At the R prompt we type expressions. The <- symbol is the assignment operator
The grammar of the language determines whether
an expression is complete or not.
The # character indicates a comment. Anything to the
right of the # (including the # itself) is ignored. This is
the only comment character in R. Unlike some other
languages, R does not support multi-line comments or
comment blocks.

E VA L UAT IO N
When a complete expression is
entered at the prompt, it is
evaluated and the result of the
evaluated expression is returned.
The result may be auto-printed.

E VA L UAT IO N
The numbers in the square
brackets are not part of the vector
itself, they are merely part of the
printed output.

R OBJECTS
R has five basic or “atomic” classes of objects:
• character
• numeric (real numbers)
• integer
• complex
• logical (True/False)

NUMBERS
• Numbers in R are generally treated as numeric objects (i.e. double precision real
numbers). This means that even if you see a number like “1” or “2” in R, which you
might think of as integers, they are likely represented behind the scenes as numeric
objects (so something like “1.00” or “2.00”). This isn’t important most of the
time...except when it is.
• If you explicitly want an integer, you need to specify the L suffix. So entering 1 in R
gives you a numeric object; entering 1L explicitly gives you an integer object.
• There is also a special number Inf which represents infinity. This allows us to
represent entities like 1 / 0. This way, Inf can be used in ordinary calculations; e.g. 1 /
Inf is 0. The value NaN represents an undefined value (“not a number”); e.g. 0 / 0; NaN
can also be thought of as a missing value (more on that later).

ATTRIBUTES
R objects can have attributes, which are like metadata for the object. These metadata can
be very useful in that they help to describe the object. For example, column names on a
data frame help to tell us what data are contained in each of the columns. Some examples
of R object attributes are
• names, dimnames
• dimensions (e.g. matrices, arrays)
• class (e.g. integer, numeric)
• length
• other user-defined attributes/metadata
Attributes of an object (if any) can be accessed using the attributes() function. Not all R
objects contain attributes, in which case the attributes() function returns NULL.

C R E AT IN G
V E C TO R S
The c() function can be used to
create vectors of objects by
concatenating things together.
Note that in the above example, T and F are short-hand ways to specify
TRUE and FALSE. However, in general one should try to use the explicit
TRUE and FALSE values when indicating logical values. The T and F
values are primarily there for when you’re feeling lazy. You can also
use the vector() function to initialize vectors.

M IXIN G O BJE C T S
There are occasions when
different classes of R objects get
mixed together. Sometimes this
happens by accident but it can also
happen on purpose. So what
happens with the following code?
In each case above, we are mixing objects of two different
classes in a vector. But remember that the only rule about
vectors says this is not allowed. When different objects are
mixed in a vector, coercion occurs so that every element in the
vector is of the same class.
In the example above, we see the effect of implicit coercion.
What R tries to do is find a way to represent all of the objects in
the vector in a reasonable fashion. Sometimes this does exactly
what you want and...sometimes not. For example, combining a
numeric object with a character object will create a character
vector, because numbers can usually be easily represented as
strings.

E XPL IC IT
C O E R C IO N
Objects can be explicitly coerced
from one class to another using
the as.* functions, if available.Sometimes, R can’t figure out how to coerce an object and this
can result in NAs being produced.
When nonsensical coercion takes place, you will usually get a
warning from R.

M AT R IC E S
Matrices are vectors with a
dimension attribute.The
dimension attribute is itself an
integer vector of length 2
(number of rows, number of
columns)
Matrices are constructed
column-wise, so entries
can be thought of starting
in the “upper left” corner
and running down the
columns.

M AT R IC E S
Matrices can also be created
directly from vectors by
adding a dimension attribute.
Matrices can be created by
column-binding or row-
binding with the cbind() and
rbind() functions.

L IST S
1. Lists are a special type of vector that can
contain elements of different classes.
2. Lists are a very important data type in R
and you should get to know them well.
Lists, in combination with the various
“apply” functions discussed later, make for
a powerful combination.
3. Lists can be explicitly created using the
list() function, which takes an arbitrary
number of arguments.
We can also create an empty list of
a prespecified length with the
vector() function

FAC TO R S
Factors are used to represent
categorical data and can be unordered
or ordered.One can think of a factor as
an integer vector where each integer
has a label. Factors are important in
statistical modeling and are treated
specially by modelling functions like lm()
and glm().
Using factors with labels is better than
using integers because factors are self-
describing. Having a variable that has
values “Male” and “Female” is better
than a variable that has values 1 and 2.
Factor objects can be created with the
factor() function.

Often factors will be automatically
created for you when you read a
dataset in using a function like
read.table().Those functions often
default to creating factors when
they encounter data that look like
characters or strings.
The order of the levels of a factor
can be set using the levels
argument to factor().This can be
important in linear modelling
because the first level is used as
the baseline level.

MISSING VALUES
Missing values are denoted by NA or NaN for
q undefined mathematical operations.
• is.na() is used to test objects if they are NA
• is.nan() is used to test for NaN
• NA values have a class also, so there are
integer NA, character NA, etc.
• A NaN value is also NA but the converse is
not true

DATA FRAME
• Data frames are used to store tabular data in R. They are an important type of object in R and are used
in a variety of statistical modeling applications. Hadley Wickham’s package dplyr has an optimized set
of functions designed to work efficiently with data frames.
• Data frames are represented as a special type of list where every element of the list has to have the same
length. Each element of the list can be thought of as a column and the length of each element of the list
is the number of rows.
• Unlike matrices, data frames can store different classes of objects in each column. Matrices must have
every element be the same class (e.g. all integers or all numeric).
• In addition to column names, indicating the names of the variables or predictors, data frames have a
special attribute called row.names which indicate information about each row of the data frame.
• Data frames are usually created by reading in a dataset using the read.table() or read.csv(). However,
data frames can also be created explicitly with the data.frame() function or they can be coerced from
other types of objects like lists.
• Data frames can be converted to a matrix by calling data.matrix(). While it might seem that the
as.matrix() function should be used to coerce a data frame to a matrix, almost always, what you want is
the result of data.matrix().

NAMES
R objects can have names, which is very useful for writing
readable code and self-describing objects. Here is an example of
assigning names to an integer vector.
Matrices can have both column and row names.
Column names and row names can be set
separately using the colnames() and rownames()
functions.

DATA FRAME
Note that for data frames, there is a separate function for setting the row names, the row.names() function.
Also, data frames do not have column names, they just have names (like lists). So to set the column
names of a data frame just use the names() function. Yes, I know its confusing. Here’s a quick summary:

G E T T I N G D A T A I N & O U T

PRINCIPAL FUNCTIONS READING DATA
INTO R
• read.table, read.csv, for reading tabular data
• readLines, for reading lines of a text file
• source, for reading in R code files (inverse of dump)
• dget, for reading in R code files (inverse of dput)
• load, for reading in saved workspaces
• unserialize,for reading single R objects in binary form

There are of course, many R packages that have been developed to read in all kinds of other datasets, and you may need to resort to
one of these packages if you are working in a specific area.
There are analogous functions for writing data to files
• write.table, for writing tabular data to text files (i.e. CSV) or connections
• writeLines, for writing character data line-by-line to a file or connection
• dump, for dumping a textual representation of multiple R objects
• dput, for outputting a textual representation of an R object
• save, for saving an arbitrary number of R objects in binary format (possibly compressed) to a file.
• serialize, for converting an R object into a binary format for outputting to a connection (or file).

READING DATA FILES WITH READ.TABLE()
1. The read.table() function is one of the most commonly used functions for reading data. The
help file for read.table() is worth reading in its entirety if only because the function gets used
a lot (run ?read.table in R). I know, I know, everyone always says to read the help file, but
this one is actually worth reading.
2. The read.table() function has a few important arguments

FEW IMPORTANT ARGUMENT
• file, the name of a file, or a connection
• header, logical indicating if the file has a header line
• sep, a string indicating how the columns are separated
• colClasses, a character vector indicating the class of each column in the dataset
• nrows, the number of rows in the dataset. By default read.table() reads an entire file.
• comment.char, a character string indicating the comment character. This defalts to "#". If
there are no commented lines in your file, it’s worth setting this to be the empty string "".
• skip, the number of lines to skip from the beginning
• stringsAsFactors, should character variables be coded as factors? This defaults to TRUE
because back in the old days, if you had data that were stored as strings, it was because those
strings represented levels of a categorical variable. Now we have lots of data that is text data
and they don’t always represent categorical variables. So you may want to set this to be
FALSE in those cases. If you always want this to be FALSE, you can set a global option via
options(stringsAsFactors = FALSE). I’ve never seen so much heat generated on discussion
forums about an R function argument than the stringsAsFactors argument. Seriously.

For small to moderately sized datasets, you can usually call
read.table without specifying any other arguments
In this case, R will automatically
• skip lines that begin with a #
figure out how many rows there are (and how much memory
needs to be allocated)
• figure what type of variable is in each column of the table.

READING IN LARGER DATASETS WITH
READ.TABLE
• Read the help page for read.table, which contains many hints
• Make a rough calculation of the memory required to store your dataset
(see the next section for an example of how to do this). If the dataset is
larger than the amount of RAM on your computer, you can probably stop
right here.
• Set comment.char = "" if there are no commented lines in your file.
• Use the colClasses argument. Specifying this option instead of using the
default can make ’read.table’ run MUCH faster, often twice as fast. In
order to use this option, you have to know the class of each column in
your data frame. If all of the columns are “numeric”,for example, then you
can just set colClasses = "numeric".

A quick an dirty way to figure out the classes of each column is
the following:
Setnrows. This doesn’t make R run faster but it helps with
memory usage.

In general, when using R with larger datasets, it’s also useful to know a
few things about your system.
• How much memory is available on your system?
• What other applications are in use? Can you close any of
them?
• Are there other users logged into the same system?
• What operating system are you using? Some operating system
scan limit the amount of memory a single process can access

CALCULATING MEMORY REQUIREMENTS
FOR R OBJECTS
• Because R stores all of its objects physical memory,
it is important to be cognizant of how much memory
is being used up by all of the data objects residing in
your workspace. One situation where it’s
particularly important to understand memory
requirements is when you are reading in a new
dataset into R. Fortunately, it’s easy to make a back
of the envelope calculation of how much memory
will be required by a new dataset.
• For example, suppose I have a data frame with
1,500,000 rows and 120 columns, all of which are
numeric data. Roughly, how much memory is
required to store this data frame? Well, on most
modern computers double precision floating point
numbers38 are stored using 64 bits of memory, or 8
bytes. Given that information, you can do the
following calculation

So the dataset would require about 1.34 GB of RAM. Most computers these days have
at least that much RAM. However, you need to be aware of
• what other programs might be running on your computer, using up RAM
• what other R objects might already be taking up RAM in your workspace
Reading in a large dataset for which you do not have enough RAM is one easy way to
freeze up your computer (or at least your R session). This is usually an unpleasant
experience that usually requires you to kill the R process, in the best case scenario, or
reboot your computer, in the worst case. So make sure to do a rough calculation of
memeory requirements before reading in a large dataset.

C O N N E C T T O E X I S T I N G
D A T A S O U R C E

Overview
The RStudio Connections Pane makes it possible to easily connect to a variety of data
sources, and explore the objects and data inside the connection. It extends, and is designed
to work with, a variety of other tools for working with databases in R. You can read more
about these other tools on the Databases with RStudio site.The Connection Pane helps
you to connect to existing data sources. It is not a connection manager like you would see
in PGAdmin, Toad, or SSMS. Like the Data Import feature, it helps you craft an R statement
that you can run to help work with your data in R. It also remembers the R statement so
that you can reconnect easily, and provides a means of exploring the data source once
you're connected.

Prerequisites
The Connections Pane is currently available only in the preview
release of RStudio 1.1. If you plan to work with ODBC data sources
in the Connections Pane, you’ll also need the latest version of the odbc
package from Github, which you can install as follows:
Connect to existing data sources
There are two ways to connect to an existing data source:
• Use the New Connection button
• Click the New Connection button to create a new data connection.

Opening a Data Connection
Data connections are typically ephemeral and are closed
when your R session ends or is restarted. To re-establish a
data connection, click the Connections tab. This shows a list
of all the connections RStudio knows about (see Connections
History below for details).
1. R Console will create the connection immediately by
executing the code at the R console.
2. New R Script will put your connection into a new R script,
and then immediately run the script.
3. New R Notebook will create a new R Notebook with a setup
chunk that connects to the data, and then immediately run the
setup chunk.
4. Copy to Clipboard will place the connection code onto the
clipboard, to make it easy to insert into an existing script or
document.

Exploring Connections
When you select a connection that is currently connected, you can explore the
objects and data in the connection.
Use the blue expanding arrows on the left to drill down to the object you’re
interested in. If the object contains data, you’ll see a table icon on the right; click
on it to see the first 1,000 rows of data in the object.

U S I N G
T H E R E A D R
P A C K A G E

• The readr package is recently developed by Hadley Wickham to deal with
reading in large flat files quickly. The package provides replacements for
functions like read.table() and read.csv(). The analogous functions in readr are read_table()
and read_csv(). This functions are oven much faster than their base R analogues
and provide a few other nice features such as progress meters.
• For the most part, you can read use read_table() and read_csv() pretty much
anywhere you might use read.table() and read.csv(). In addition, if there are non-fatal
problems that occur while reading in the data, you will get a warning and the
returned data frame will have some information about which
rows/observations triggered the warning. This can be very helpful for
“debugging” problems with your data before you get neck deep in data
analysis.

U S I N G D P U T ( ) A N D
D U M P ( )
One way to pass data around is by deparsing the R object with dput() and reading
it back in (parsing it) using dget().
Notice that the dput() output is in the form of R code and that it preserves
metadata like the class of the object, the row names, and the column names.
The output of dput() can also be saved directly to a file.
Multiple objects can be deparsed at once using the dump function and read back
in using source.

W E C A N D U M P ( ) R
O B J E C T S TO A F I LE
B Y PA S S I N G A
C H A R AC T E R
V E C TO R O F T H E I R
N A M E S .
We can dump() R objects to a file by passing a character vector of their names.

I N T E R F A C E S
T O T H E O U T S I D E
W O R L D

FILE CONNECTIONS
Connections to text files can be created with the file() function.
The file() function has a number of arguments that are common to many other connection
functions so it’s worth going into a little detail here.
• description is the name of the file
• open is a code indicating what mode the file should be opened in
The open argument allows for the following options:
• “r” open file in read only mode
• “w” open a file for writing (and initializing a new file)
• “a” open a file for appending
• “rb”, “wb”, “ab” reading, writing, or appending in binary mode
(Windows)

In practice, we often don’t need to deal with the connection interface directly as many functions for reading and writing data
just deal with it in the background. For example, if one were to explicitly use connections to read a CSV file in to R, it might
look like this,
• In the background, read.csv() opens a
connection to the file foo.txt, reads from it,
and closes the connection when its done.
• The above example shows the basic approach
to using connections. Connections must be
opened, then the are read from or written to,
and then they are closed.

READING LINES OF A TEXT FILE
Text files can be read line by line using the readLines() function. This function is useful for reading text files that
may be unstructured or contain non-standard data.
• For more structured text data like CSV files or tab-delimited files, there are other functions like read.csv() or
read.table().
• The above example used the gzfile() function which is used to create a connection to files compressed using the gzip
algorithm. This approach is useful because it allows you to read from a file without having to uncompress the file first,
which would be a waste of space and time.
• There is a complementary function writeLines() that takes a character vector and writes each element of the vector one
line at a time to a text file.

READING FROM A URL CONNECTION
The readLines() function can be useful for reading in lines of webpages. Since web pages are
basically text files that are stored on a remote server, there is conceptually not much difference
between a web page and a local text file. However, we need R to negotiate the communication
between your computer and the web server. This is what the url() function can do for you, by
creating a url connection to a web server.
This code might take time depending on your connection speed.

• While reading in a simple web page is sometimes useful, particularly if data are embedded in
the web page somewhere. However, more commonly we can use URL connection to read in
specific data files that are stored on web servers.
• Using URL connections can be useful for producing a reproducible analysis, because the code
essentially documents where the data came from and how they were obtained. This is approach
is preferable to opening a web browser and downloading a dataset by hand. Of course, the code
you write with connections may not be executable at a later date if things on the server side are
changed or reorganized.

S U B S E T T I N G R
O B J E C T S

SUB SETTING R OBJECTS
There are three operators that can be used to extract subsets of R objects
• The [ operator always returns an object of the same class as the original. It can be
used to select multiple elements of an object
• The [[ operator is used to extract elements of a list or a data frame. It can only be
used to extract a single element and the class of the returned object will not
necessarily be a list or data frame.
• The$operatorisusedtoextractelementsofalistordataframebyliteralname.Itssemantics
are similar to that of [[.

S U B S E T T I N G A
V E C TO R
The [ operator can be used to extract multiple
elements of a vector by passing the operator an
integer sequence. Here we extract the first four
elements of the vector.
Vectors are basic objects in R and they
can be subsetted using the [ operator.
The sequence does not have to be in order; you can
specify any arbitrary integer vector
We can also pass a logical
sequence to the [ operator to
extract elements of a vector
that satisfy a given
condition. For example,
here we want the elements
of x that come
lexicographically after the
letter “a”.

Another, more compact, way to do this would be to skip the
creation of a logical vector and just subset the vector directly
with the logical expression.

SU B SE T T IN G A
M AT R IX
Matrices can be subsetted in the usual way
with (i,j) type indices. Here, we create
simple $2times 3$ matrix with the matrix
function.
We can access the $(1, 2)$ or the $(2, 1)$ element of this
matrix using the appropriate indices.
Indices can also be missing. This behavior is used to
access entire rows or columns of a matrix.

D R O P P I N G M AT R I X
D I M E N S I O N S
By default, when a single element of a
matrix is retrieved, it is returned as a
vector of length 1 rather than a $1times
1$ matrix. Often, this is exactly what we
want, but this behavior can be turned off
by setting drop = FALSE.
Similarly, when we extract a single row
or column of a matrix, R by default
drops the dimension of length 1, so
instead of getting a $1times 3$ matrix
after extracting the first row, we get a
vector of length 3. This behavior can
similarly be turned off with the drop =
FALSE option.
Be careful of R’s
automatic dropping of
dimensions. This is a
feature that is often quite
useful during interactive
work, but can later come
back to bite you when
you are writing longer
programs or functions.

SU B SE T T IN G
L IST S
Lists in R can be subsetted using all three of the operators mentioned
above, and all three are used for different purposes.
The [[ operator can be used to extract single elements
from a list. Here we extract the first element of the list.

One thing that differentiates the [[ operator from the $ is that the [[ operator can be used with
computed indices. The $ operator can only be used with literal names.

SUBSETTING NESTED ELEMENTS OF A
LIST
The [[ operator can take an integer sequence if you want to
extract a nested element of a list.

EXTRACTING MULTIPLE ELEMENTS OF A
LIST
The [ operator can be used to extract multiple elements from a list. For example, if you wanted to extract the
first and third elements of a list, you would do the following
Note that x[c(1, 3)] is NOT the same as x[[c(1, 3)]].
Remember that the [ operator always returns an object of the same class
as the original. Since the original object was a list, the [ operator returns
a list. In the above code, we returned a list with two elements (the first
and the third).

PARTIAL MATCHING
Partial matching of names is allowed with [[ and $. This is often very useful during interactive work if the object you’re
working with has very long element names. You can just abbreviate those names and R will figure out what element
you’re referring to.
In general, this is fine for interactive work, but you shouldn’t resort to partial matching if you are
writing longer scripts, functions, or programs. In those cases, you should refer to the full element
name if possible. That way there’s no ambiguity in your code.

REMOVING NA VALUES
A common task in data analysis is removing missing values (NAs).
What if there are multiple R objects and you want to take the subset
with no missing values in any of those objects?

REMOVING NA VALUES
You can use complete.cases on data frames too.

V E C T O R I Z E D
O P E R AT I O N S

Many operations in R are vectorized, meaning that operations occur in
parallel in certain R objects. This allows you to write code that is
efficient, concise, and easier to read than in non-vectorized languages.
The simplest example is when adding two vectors together.
Another operation you can do in a vectorized manner is logical
comparisons. So suppose you wanted to know which elements of a
vector were greater than 2. You could do he following.
Here are other vectorized logical operations.
Notice that these logical operations return a logical
vector of TRUE and FALSE.
Of course, subtraction, multiplication and division are
also vectorized

VECTORIZED MATRIX OPERATIONS
Matrix operations are also vectorized, making for nicly compact notation. This way, we can do element-by-
element operations on matrices without having to loop over every element.

• R has developed a special representation for dates and
times. Dates are represented by the Date class and times
are represented by the POSIXct or the POSIXlt class. Dates
are stored internally as the number of days since 1970-
01-01 while times are stored internally as the number
of seconds since 1970-01-01.
• It’s not important to know the internal representation
of dates and times in order to use them in R. I just
thought those were fun facts.

DATES IN R
Dates are represented by the Date class and can be coerced from a character string using the as.Date() function. This
is a common way to end up with a Date object in R.
You can see the internal representation of a Date object by using the unclass() function.

TIMES IN R
Times are represented by the POSIXct or the POSIXlt class. POSIXct is just a very large integer under the hood.
It use a useful class when you want to store times in something like a data frame. POSIXlt is a list underneath
and it stores a bunch of other useful information like the day of the week, day of the year, month, day of the
month. This is useful when you need that kind of information.
There are a number of generic functions that work on dates and times to help you extract pieces of dates and/or
times.
• weekdays: give the day of the week
• months: give the month name
• quarters: give the quarter number (“Q1”, “Q2”, “Q3”, or “Q4”)
Times can be coerced from a character string using the as.POSIXlt or as.POSIXct function.

• Strptime() function in case your dates are written
in a different format.
• Strptime() takes a character vector that has dates
and times and converts them into to a POSIXlt
object.

OPERATIONS ON DATES AND TIMES
You can use mathematical operations on dates and times. Well, really just + and -. You can do comparisons too
(i.e. ==, <=)

Here’s an example where two different time zones are in play
(unless you live in GMT timezone, in which case they will be
the same!).

SUMMARY
• Dates and times have special classes in R that allow for
numerical and statistical calculations
• Date subset the Date class
• Time subset the POSIXctand POSIXltclass
• Character strings can be coerced to Date/Time classes using
the strptime function or the as.Date, as.POSIXlt, or as.POSIXct

M A N A G I N G D A T A F R A M E S
W I T H T H E
D P L Y R P A C K A G E

DATA FRAME - DPLYR (PACKAGE)
• The data frame is a key data structure in statistics and in R. The basic structure of a data
frame is that there is one observation per row and each column represents a variable, a
measure, feature, or characteristic of that observation. R has an internal implementation of
data frames that is likely the one you will use most often. However, there are packages on
CRAN that implement data frames via things like relational databases that allow you to
operate on very very large data frames (but we won’t discuss them here).
• Given the importance of managing data frames, it’s important that we have good tools for
dealing with them. In previous chapters we have already discussed some tools like the
subset() function and the use of [ and $ operators to extract subsets of data frames. However,
other operations, like filtering, re-ordering, and collapsing, can often be tedious operations in
R whose syntax is not very intuitive. The dplyr package is designed to mitigate a lot of these
problems and to provide a highly optimized set of routines specifically for dealing with data
frames.

THE DPLYR PACKAGE
• The dplyr package was developed by Hadley Wickham of RStudio and is an optimized and
distilled version of his plyr package. The dplyr package does not provide any “new” functionality
to R per se, in the sense that everything dplyr does could already be done with base R, but it
greatly simplifies existing functionality in R.
• One important contribution of the dplyr package is that it provides a “grammar” (in particular,
verbs) for data manipulation and for operating on data frames. With this grammar, you can
sensibly communicate what it is that you are doing to a data frame that other people can
understand (assuming they also know the grammar).

DPLYR GRAMMAR
Some of the key “verbs” provided by the dplyr package are
• select: return a subset of the columns of a data frame, using a flexible notation
• filter: extract a subset of rows from a data frame based on logical conditions
• arrange: reorder rows of a data frame
• rename: rename variables in a data frame
• mutate: add new variables/columns or transform existing variables
• summarise / summarize: generate summary statistics of different variables in the dataframe,
possibly within strata
• %>%: the “pipe” operator is used to connect multiple verb actions together into a pipeline
The dplyr package as a number of its own data types that it takes advantage of. For example, there is
a handy print method that prevents you from printing a lot of data to the console. Most of the time,
these additional data types are transparent to the user and do not need to be worried about.

COMMON DPLYR FUNCTION PROPERTIES
The dplyr package can be installed from CRAN or from GitHub using the devtools package and the install_github()
function. The GitHub repository will usually contain the latest updates to the package and the development version.
To install from CRAN, just run
After installing the package it is
important that you load it into your R
session with the library() function.

SELECT( )
For the examples in this chapter we will be using a dataset containing air pollution and temperature data for the city of
Chicago53 in the U.S. The dataset is available from my web site. After unzipping the archive, you can load the data
into R using the readRDS() function.
You can see some basic characteristics of the dataset with the dim() and str() functions
The select() function can be used to select
columns of a data frame that you want to
focus on. Often you’ll have a large data
frame containing “all” of the data, but any
given analysis might only use a subset of
variables or observations. The select()
function allows you to get the few
columns you might need.

Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could for example
use numerical indices. But we can also use the names directly.

Note that the : normally cannot be used with names or strings, but inside the select() function you can use it to specify a
range of variable names. You can also omit variables using the select() function by using the negative sign. With select()
you can do

The select() function also allows a special syntax that allows you to specify variable names
based on patterns. So, for example, if you wanted to keep every variable that ends with a
“2”, we could do
Or if we wanted to keep every variable that starts with a “d”, we
could do

FILTER()
• The filter() function is used to extract subsets of rows from a data frame. This function is similar to the existing
subset() function in R but is quite a bit faster in my experience.
• Suppose we wanted to extract the rows of the chicago data frame where the levels of PM2.5 are greater than 30
(which is a reasonably high level), we could do

You can see that there are now only 194 rows in the data frame and the distribution of the pm25tmean2 values is.
We can place an arbitrarily complex logical sequence inside of filter(), so we could for example extract the rows where
PM2.5 is greater than 30 and temperature is greater than 80 degrees Fahrenheit.

ARRANGE()
• The arrange() function is used to reorder rows of a data
frame according to one of the variables/- columns.
Reordering rows of a data frame (while preserving
corresponding order of other columns) is normally a
pain to do in R. The arrange() function simplifies the
process quite a bit.
• Here we can order the rows of the data frame by date, so
that the first row is the earliest (oldest) observation and
the last row is the latest (most recent) observation.
We can now check the first few rows
We can now check the first few rows

RENAME ()
Renaming a variable in a data frame in R is surprisingly
hard to do! The rename() function is designed to make
this process easier.
Here you can see the names of the first five variables in
the chicago data frame.
The dptp column is supposed to represent the dew point
temperature and the pm25tmean2 column provides the
PM2.5 data. However, these names are pretty obscure
or awkward and probably be renamed to something
more sensible.

MUTATE()
• The mutate() function exists to compute transformations of variables in a data frame. Often, you
want to create new variables that are derived from existing variables and mutate() provides a clean
interface for doing that.
• Here we create a pm25detrend variable that subtracts the mean from the pm25 variable.

There is also the related transmute() function, which does the same thing as mutate() but then drops all non-
transformed variables. Here we detrend the PM10 and ozone (O3) variables.

GROUP_BY()
• The group_by() function is used to generate
summary statistics from the data frame
within strata defined by a variable.
• The general operation here is a combination
of splitting a data frame into separate pieces
defined by a variable or group of variables
(group_by()), and then applying a summary
function across those subsets (summarize()).
First, we can create a year varible using as.POSIXlt().
Now we can create a separate data frame that splits the
original data frame by year.
Finally, we compute summary statistics for each year in
the data frame with the summarize() function.

Finally, we compute summary statistics for each year in the data frame
with the summarize() function.
summarize() returns a data frame with year as the first column, and then
the annual averages of pm25, o3, and no2.

From the table, it seems there isn’t a strong relationship between pm25 and o3, but there appears to be a positive
correlation between pm25 and no2. More sophisticated statistical modeling can help to provide precise answers to
these questions, but a simple application of dplyr functions can often get you most of the way there.

%>%
The pipeline operater %>% is very handy for stringing together
multiple dplyr functions in a sequence of operations. Notice above
that every time we wanted to apply more than one function, the
sequence gets buried in a sequence of nested function calls that
is difficult to read, i.e.
Take the example that we just did in the last section where we computed
the mean of o3 and no2 within quintiles of pm25. There we had to
1. create a new variable pm25.quint
2. split the data frame by that new variable
3. compute the mean of o3 and n]o2 in the sub-groups defined by
pm25.quint
That can be done with the following sequence in a single R expression.

SUMMARY
The dplyr package provides a concise set of operations for managing data frames.
With these functions we can do a number of complex operations in just a few lines of
code. In particular, we can often conduct the beginnings of an exploratory analysis
with the powerful combination of group_by() and summarize(). Once you learn the
dplyr grammar there are a few additional benefits:
• dplyr can work with other data frame “backends” such as SQL databases.
There is an SQL interface for relational databases via the DBI package
• dplyr can be integrated with the data.table package for large fast tables
The dplyr package is handy way to both simplify and speed up your data frame
management code. It’s rare that you get such a combination at the same time!

C O R R E L A T I O N A N A L Y S I S

• Korelasi Pearson merupakan salah satu ukuran korelasi
yang digunakan untuk mengukur kekuatan dan arah
hubungan linier dari dua veriabel.
• Dua variabel dikatakan berkorelasi apabila perubahan
salah satu variabel disertai dengan perubahan variabel
lainnya, baik dalam arah yang sama ataupun arah yang
sebaliknya.
• Harus diingat bahwa nilai koefisien korelasi yang kecil (tidak
signifikan) bukan berarti kedua variabel tersebut tidak saling
berhubungan. Mungkin saja dua variabel mempunyai
keeratan hubungan yang kuat namun nilai koefisien
korelasinya mendekati nol, misalnya pada kasus hubungan non
linier.
• Dengan demikian, koefisien korelasi hanya mengukur
kekuatan hubungan linier dan tidak pada hubungan
non linier. Harus diingat pula bahwa adanya
hubungan linier yang kuat di antara variabel tidak
selalu berarti ada hubungan kausalitas, sebab-akibat.

• Korelasi mempunyai kemungkinan pengujian hipotesis dua
arah (two tailed).
• Korelasi searah jika nilai koefesien korelasi diketemukan positif;
sebaliknya jika nilai koefesien korelasi negatif, korelasi disebut
tidak searah.
• Yang dimaksud dengan koefesien korelasi ialah suatu
pengukuran statistik kovariasi atau asosiasi antara dua variabel.
• Jika koefesien korelasi diketemukan tidak sama dengan nol (0),
maka terdapat hubungan antara dua variabel tersebut.
• Jika koefesien korelasi diketemukan +1. maka hubungan
tersebut disebut sebagai korelasi sempurna atau hubungan
linear sempurna dengan kemiringan (slope) positif. Sebaliknya.
jika koefesien korelasi diketemukan -1. maka hubungan
tersebut disebut sebagai korelasi sempurna atau hubungan
linear sempurna dengan kemiringan (slope) negatif.
• Dalam korelasi sempurna tidak diperlukan lagi pengujian
hipotesis mengenai signifikansi antar variabel yang
dikorelasikan, karena kedua variabel mempunyai hubungan
linear yang sempurna. Artinya variabel X mempunyai
hubungan sangat kuat dengan variabel Y.

Korelasi Pearson, misalnya, menunjukkan adanya kekuatan
hubungan linier dalam dua variabel.
Linieritas artinya asumsi adanya hubungan dalam bentuk garis lurus
antara variabel. Linearitas antara dua variabel dapat dinilai melalui
observasi scatterplots bivariat. Jika kedua variabel berdistribusi normal
dan behubungan secara linier, maka scatterplot berbentuk oval; jika
tidak berdistribusi normal scatterplot tidak berbentuk oval.
Dalam praktinya kadang data yang digunakan akan menghasilkan
korelasi tinggi tetapi hubungan tidak linier; atau sebaliknya korelasi
rendah tetapi hubungan linier. Dengan demikian agar linieritas
hubungan dipenuhi, maka data yang digunakan harus mempunyai
distribusi normal.
Konsep Linieritasdan Korelasi

CONTOH JENIS PERTANYAAN PENELITIAN KORELASI PEARSON
Apakah ada hubungan yang
signifikan antara usia, yang diukur
dalam tahun, dan tinggi badan,
diukur dalam inci?
Adakah hubungan antara kepuasan
kerja,yang diukur dengan index
kepuasan kerja,dan pendapatan,
diukur dalam rupiah?
Contoh 3..
Contoh 4 …
Contoh 5 …
Participan diminta menemukan contoh Pertanyaan yang
bisa diajukan dalam analisis korelasi

Sarah is a regional sales manager for a nationwide supplier of fossil fuels
for home heating. Recent volatility in market prices for heating oil
specifically, coupled with wide variability in the size of each order for
home heating oil, has Sarah concerned. She feels a need to understand
the types of behaviors and other factors that may influence the demand
for heating oil in the domestic market. What factors are related to heating
oil usage, and how might she use a knowledge of such factors to better
manage her inventory, and anticipate demand? Sarah believes that data
mining can help her begin to formulate an understanding of these factors
and interactions.
Perspective & Concept
Sarah’s goal is to better understand how her company can succeed in the
home heating oil market. She recognizes that there are many factors that
influence heating oil consumption, and believes that by investigating the
relationship between a number of those factors, she will be able to better
monitor and respond to heating oil demand. She has selected correlation
as a way to model the relationship between the factors she wishes to
investigate. Correlation is a statistical measure of how strong the
relationships are between attributes in a data set.
Business Understanding

•Insulation: This is a density rating, ranging from one to ten,
indicating the thickness of each home’s insulation. A home with a
density rating of one is poorly insulated, while a home with a
density of ten has excellent insulation.
• Temperature: This is the average outdoor ambient temperature
at each home for the most recent year, measure in degree
Fahrenheit.
• Heating_Oil: This is the total number of units of heating oil
purchased by the owner of each home in the most recent year.
• Num_Occupants: This is the total number of occupants living
in each home.
• Avg_Age: This is the average age of those occupants.
• Home_Size: This is a rating, on a scale of one to eight, of the
home’s overall size. The
•higher the number, the larger the home.
Data Understanding

Default Method
## Default S method:cor.test(x, y,alternative = c("two.sided",
"less", "greater"), method = c("pearson","kendall", "spearman"),
exact = NULL,conf.level = 0.95, continuity = FALSE, ...)
http://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.test.html

library(dplyr)
library(stats)
library(ggcorrplot)
setwd("/Users/ikekurniati/Documents/R/CBI")
data_heating_oil<-
read.csv(file="Chapter04DataSet.csv",
header=TRUE, sep=",")
str(data_heating_oil)
cor_heating_oil<-cor(data_heating_oil)
cormat_heating_oil<- round(cor_heating_oil, 2)
#The package reshape is required to melt the
correlation matrix:
library(reshape2)
ggcorrplot(cormat_heating_oil)
Modeling

T A S K - C O R R E L A T I O N

Context
The World Happiness Report is a landmark survey of the state of global
happiness. The first report was published in 2012, the second in 2013, the
third in 2015, and the fourth in the 2016 Update. The World Happiness
2017, which ranks 155 countries by their happiness levels, was released
at the United Nations at an event celebrating International Day of
Happiness on March 20th. The report continues to gain global
recognition as governments, organizations and civil society increasingly
use happiness indicators to inform their policy-making decisions.
Leading experts across fields – economics, psychology, survey analysis,
national statistics, health, public policy and more – describe how
measurements of well-being can be used effectively to assess the
progress of nations. The reports review the state of happiness in the world
today and show how the new science of happiness explains personal and
national variations in happiness.
Content
The happiness scores and rankings use data from the Gallup World Poll.
The scores are based on answers to the main life evaluation question asked
in the poll. This question, known as the Cantril ladder, asks respondents to
think of a ladder with the best possible life for them being a 10 and the
worst possible life being a 0 and to rate their own current lives on that
scale. The scores are from nationally representative samples for the years
2013-2016 and use the Gallup weights to make the estimates
representative. The columns following the happiness score estimate the
extent to which each of six factors – economic production, social support,
life expectancy, freedom, absence of corruption, and generosity –
contribute to making life evaluations higher in each country than they are
in Dystopia, a hypothetical country that has values equal to the world’s
lowest national averages for each of the six factors. They have no impact
on the total score reported for each country, but they do explain why some
countries rank higher than others.
Data: input_happines.xlsx

Dystopia is an imaginary country that has the world’s least-happy people.
The purpose in establishing Dystopia is to have a benchmark against which
all countries can be favorably compared (no country performs more poorly
than Dystopia) in terms of each of the six key variables, thus allowing each
sub-bar to be of positive width. The lowest scores observed for the six key
variables, therefore, characterize Dystopia. Since life would be very
unpleasant in a country with the world’s lowest incomes, lowest life
expectancy, lowest generosity, most corruption, least freedom and least
social support, it is referred to as “Dystopia,” in contrast to Utopia.
What Is Dystopia ?
The residuals, or unexplained components, differ for each country, reflecting
the extent to which the six variables either over- or under-explain average
2014-2016 life evaluations. These residuals have an average value of
approximately zero over the whole set of countries. Figure 2.2 shows the
average residual for each country when the equation in Table 2.1 is applied
to average 2014- 2016 data for the six variables in that country. We combine
these residuals with the estimate for life evaluations in Dystopia so that the
combined bar will always have positive values. As can be seen in Figure 2.2,
although some life evaluation residuals are quite large, occasionally
exceeding one point on the scale from 0 to 10, they are always much smaller
than the calculated value in Dystopia, where the average life is rated at 1.85
on the 0 to 10 scale.
What Is Residual?
The following columns: GDP per Capita, Family, Life Expectancy, Freedom,
Generosity, Trust Government Corruption describe the extent to which these
factors contribute in evaluating the happiness in each country. The Dystopia
Residual metric actually is the Dystopia Happiness Score(1.85) + the Residual
value or the unexplained value for each country as stated in the previous
answer.If you add all these factors up, you get the happiness score so it might be
un-reliable to model them to predict Happiness Scores.
What do the columns succeeding the Happiness Score(like
Family,Generosity,etc.) describe?

L I N E A R R E G R E S S I O N

FRANCIS GALTON
regresi (regression) sebagai nama
proses umum untuk memprediksi
satu variabel, yaitu tinggi badan
anak dengan menggunakan
variabel lain, yaitu tinggi badan
orang tua
http://onlinelibrary.wiley.com/doi/10.1111/j.1740-
9713.2011.00509.x/full

Regresi linier adalah model prediksi yang menggunakan kumpulan data
latih dan scoring untuk menghasilkan prediksi numerik dalam data.
Regresi linier menggunakan tipe data numerik untuk semua atributnya.
Regresi Linier adalah formula menggunakan rumus aljabar untuk
menghitung kemiringan garis untuk menentukan di mana pengamatan akan
jatuh di sepanjang garis imajiner melalui scoring data.
Setiap atribut dalam kumpulan data dievaluasi secara statistik karena
kemampuannya untuk memprediksi atribut target.
Definisi?

Description
lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of
covariance (although aov may provide a more convenient interface for these).
Usage
lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...) Arguments
Default Linier Regression usage
Description
This function gives internal and cross-validation measures of predictive
accuracy for ordinary linear regression. The data are randomly assigned to
a number of `folds'. Each fold is removed, in turn, while the remaining
data is used to re-fit the regression model and to predict at the deleted
observations.
Usage
cv.lm(df = houseprices, form.lm = formula(sale.price ~ area), m=3, dots
= FALSE, seed=29, plotit=TRUE, printit=TRUE)

library(UsingR)
library(ggplot2)
data(father.son)
head(father.son)
names(father.son)
ggplot(father.son,aes(x=fheight,y=sheight)) +
geom_point(size=2,alpha=0.7) + xlab("Height of
father") + ylab("Height of son") + ggtitle("Father-
son Height Data")
model_reg<-lm(sheight ~ fheight,data=father.son)
predicted_df <-
data.frame(pred=predict(model_reg,father.son))
ggplot(father.son,aes(x=fheight,y=sheight)) +
geom_point(size=2,alpha=0.7) + xlab("Height of
father") + ylab("Height of son") +
ggtitle("Father-son Height Data")+
geom_smooth(method=lm,se=FALSE,
color='blue',size=1,linetype="solid")+
xlab("Height of father") + ylab("Height of son")

Sarah, the regional sales manager. Business is her sales is booming,
her sales team is signing up thousands of new clients, and she
wants to be sure the company will be able to meet this new level
of demand. She was so pleased with our assistance in finding
correlations in her data, she now is hoping we can help her do
some prediction as well. She knows that there is some correlation
between the attributes in her data set (things like temperature,
insulation, and occupant ages), and she’s now wondering if she
can use the data set from Chapter 4 to predict heating oil usage for
new customers. You see, these new customers haven’t begun
consuming heating oil yet, there are a lot of them (42,650 to be
exact), and she wants to know how much oil she needs to expect
to keep in stock in order to meet these new customers’ demand.
Can she use data mining to examine household attributes and
known past consumption quantities to anticipate and meet her
new customers’ needs?

Business Understanding
Sarah’s new data mining objective is pretty clear: she wants to
anticipate demand for a consumable product. We will use a linear
regression model to help her with her desired predictions. She
has data, 1,218 observations from the Chapter 4 data set that give
an attribute profile for each home, along with those homes’ annual
heating oil consumption. She wants to use this data set as training
data to predict the usage that 42,650 new clients will bring to her
company. She knows that these new clients’ homes are similar in
nature to her existing client base, so the existing customers’ usage
behavior should serve as a solid gauge for predicting future usage
by new customers.

As a review, our data set from Chapter 4 contains the following
attributes:
• Insulation: This is a density rating, ranging from one to ten,
indicating the thickness of
•each home’s insulation. A home with a density rating of one is
poorly insulated, while a
•home with a density of ten has excellent insulation.
• Temperature: This is the average outdoor ambient temperature
at each home for the
•most recent year, measure in degree Fahrenheit.
• Heating_Oil: This is the total number of units of heating oil
purchased by the owner of
•each home in the most recent year.
• Num_Occupants: This is the total number of occupants living
in each home.
• Avg_Age: This is the average age of those occupants.
• Home_Size: This is a rating, on a scale of one to eight, of the
home’s overall size. The
•higher the number, the larger the home.
Data Understanding

The R package "party" is used to create
decision trees.
Decision tree adalah jenis algoritma supervice
learning (having a pre-defined target variable) yang
banyak digunakan untuk klasifikasi. Decision tree
digunakan untuk variabel input dan output
kategorikal dan kontinyu.

Basic Formula:
Input Data: Data yang digunakan adalah data sample dari package R
party yaitu readingSkill. Data readingSkill menjelaskan score
seseorang dengan beberapa variabel input age, shoesize, dan score.
Data ini digunnakan untuk mengklasifikasikan native speaker Yes OR
not .

Richard works for a large online retailer. His company is launching a next-generation eReader soon, and
they want to maximize the effectiveness of their marketing. They have many customers, some of whom
purchased one of the company’s previous generation digital readers. Richard has noticed that certain types
of people were the most anxious to get the previous generation device, while other folks seemed to content
to wait to buy the electronic gadget later. He’s wondering what makes some people motivated to buy
something as soon as it comes out, while others are less driven to have the product.
Richard’s employer helps to drive the sales of its new eReader by offering specific products and services for
the eReader through its massive web site—for example, eReader owners can use the company’s web site to
buy digital magazines, newspapers, books, music, and so forth. The company also sells thousands of other
types of media, such as traditional printed books and electronics of every kind. Richard believes that by
mining the customers’ data regarding general consumer behaviors on the web site, he’ll be able to figure out
which customers will buy the new eReader early, which ones will buy next, and which ones will buy later on.
He hopes that by predicting when a customer will be ready to buy the next-gen eReader, he’ll be able to
time his target marketing to the people most ready to respond to advertisements and promotions.

Organizational Understanding
Richard wants to be able to predict the timing of buying behaviors, but he also
wants to understand how his customers’ behaviors on his company’s web site
indicate the timing of their purchase of the new eReader. Richard has studied the
classic diffusion theories that noted scholar and sociologist Everett Rogers first
published in the 1960s. Rogers surmised that the adoption of a new technology or
innovation tends to follow an ‘S’ shaped curve, with a smaller group of the most
enterprising and innovative customers adopting the technology first, followed by
larger groups of middle majority adopters, followed by smaller groups of late
adopters (Figure 10-1).

Those at the front of the blue curve are the smaller group that are first to
want and buy the technology. Most of us, the masses, fall within the middle
70-80% of people who eventually acquire the technology. The low end tail
on the right side of the blue curve are the laggards, the ones who
eventually adopt. Consider how DVD players and cell phones have
followed this curve.
Understanding Rogers’ theory, Richard believes that he can categorize his
company’s customers into one of four groups that will eventually buy the
new eReader: Innovators, Early Adopters, Early Majority or Late Majority.
These groups track with Rogers’ social adoption theories on the diffusion
of technological innovations, and also with Richard’s informal
observations about the speed of adoption of his company’s previous
generation product. He hopes that by watching the customers’ activity on
the company’s web site, he can anticipate approximately when each person
will be most likely to buy an eReader. He feels like data mining can help
him figure out which activities are the best predictors of which category a
customer will fall into. Knowing this, he can time his marketing to each
customer to coincide with their likelihood of buying.
Organizational Understanding

Richard has engaged us to help him with his project. We have decided to use a
decision tree model in order to find good early predictors of buying behavior.
Because Richard’s company does all of its business through its web site, there is a
rich data set of information for each customer, including items they have just
browsed for, and those they have actually purchased. He has prepared two data sets
for us to use. The training data set contains the web site activities of customers who
bought the company’s previous generation reader, and the timing with which they
bought their reader. The second is comprised of attributes of current customers
which Richard hopes will buy the new eReader. He hopes to figure out which
category of adopter each person in the scoring data set will fall into based on the
profiles and buying timing of those people in the training data set.
In analyzing his data set, Richard has found that customers’ activity in the areas of
digital media and books, and their general activity with electronics for sale on his
company’s site, seem to have a lot in common with when a person buys an eReader.
With this in mind, we have worked with Richard to compile data sets comprised of
the following attributes:
Data Understanding

• User_ID: A numeric, unique identifier assigned to each person who has an account on the company’s
web site.
• Gender: The customer’s gender, as identified in their customer account. In this data set, it is recorded a
‘M’ for male and ‘F’ for Female. The Decision Tree operator can handle non- numeric data types.
• Age: The person’s age at the time the data were extracted from the web site’s database. This is calculated
to the nearest year by taking the difference between the system date and the person’s birthdate as
recorded in their account.
• Marital_Status: The person’s marital status as recorded in their account. People who indicated on their
account that they are married are entered in the data set as ‘M’. Since the web site does not distinguish
single types of people, those who are divorced or widowed are included with those who have never been
married (indicated in the data set as ‘S’).
• Website_Activity: This attribute is an indication of how active each customer is on the company’s web
site. Working with Richard, we used the web site database’s information which records the duration of
each customers visits to the web site to calculate how frequently, and for how long each time, the
customers use the web site. This is then translated into one of three categories: Seldom, Regular, or
Frequent.
• Browsed_Electronics_12Mo: This is simply a Yes/No column indicating whether or not the person
browsed for electronic products on the company’s web site in the past year.
• Bought_Electronics_12Mo: Another Yes/No column indicating whether or not they purchased an
electronic item through Richard’s company’s web site in the past year.
• Bought_Digital_Media_18Mo: This attribute is a Yes/No field indicating whether or not the person
has purchased some form of digital media (such as MP3 music) in the past year and a half. This attribute
does not include digital book purchases.
Data Understanding

• Bought_Digital_Books: Richard believes that as an indicator of buying behavior relative to the company’s new
eReader, this attribute will likely be the best indicator. Thus, this attribute has been set apart from the purchase of
other types of digital media. Further, this attribute indicates whether or not the customer has ever bought a digital
book, not just in the past year or so.
• Payment_Method: This attribute indicates how the person pays for their purchases. In cases where the person
has paid in more than one way, the mode, or most frequent method of payment is used. There are four options:
• Bank Transfer—payment via e-check or other form of wire transfer directly from the bank to the
company.
• Website Account—the customer has set up a credit card or permanent electronic funds transfer on their
account so that purchases are directly charged through their account at the time of purchase.
• Credit Card—the person enters a credit card number and authorization each time they purchase
something through the site.
• Monthly Billing—the person makes purchases periodically and receives a paper or electronic bill which
they pay later either by mailing a check or through the company web site’s payment system.
• eReader_Adoption: This attribute exists only in the training data set. It consists of data for customers who
purchased the previous-gen eReader. Those who purchased within a week of the product’s release are recorded in
this attribute as ‘Innovator’. Those who purchased after the first week but within the second or third weeks are
entered as ‘Early Adopter’. Those who purchased after three weeks but within the first two months are ‘Early
Majority’. Those who purchased after the first two months are ‘Late Majority’. This attribute will serve as our label
when we apply our training data to our scoring data.
Data Understanding

What is K Means Clustering?
K Means Clustering is an unsupervised learning algorithm that tries to
cluster data based on their similarity. Unsupervised learning means that
there is no outcome to be predicted, and the algorithm just tries to find
patterns in the data. In k means clustering, we have the specify the
number of clusters we want the data to be grouped into. The algorithm
randomly assigns each observation to a cluster, and finds the centroid of
each cluster. Then, the algorithm iterates through two steps:
• Reassign data points to the cluster whose centroid is closest.
• Calculate new centroid of each cluster.
These two steps are repeated till the within cluster variation cannot be
reduced any further. The within cluster variation is calculated as the
sum of the euclidean distance between the data points and their
respective cluster centroids.

The iris dataset contains data about sepal length, sepal width,
petal length, and petal width of flowers of different species. Let
us see what it looks like:

After a little bit of exploration, I found that Petal.Length and Petal.
Width were similar among the same species but varied considerably
between different species, as demonstrated below:
library(ggplot2)ggplot(iris,aes(Petal.Length,Petal.Width,color =
Species)) + geom_point()

Clustering
Okay, now that we have seen the data, let us try to cluster it.
Since the initial cluster assignments are random, let us set the
seed to ensure reproducibility.

Since we know that there are 3 species involved, we ask the algorithm
to group the data into 3 clusters, and since the starting assignments are
random, we specify nstart = 20. This means that R will try 20 different
random starting assignments and then select the one with the lowest
within cluster variation.
We can see the cluster centroids, the clusters that each data point was
assigned to, and the within cluster variation.
Let us compare the clusters with the species.

Analisis Komponen Utama (Principal Component
Analysis) adalah analisis multivariate yang
mentransformasi variabel-variabel asal yang saling
berkorelasi menjadi variabel-variabel baru yang tidak
saling berkorelasi dengan mereduksi sejumlah variabel
tersebut sehingga mempunyai dimensi yang lebih kecil
namun dapat menerangkan sebagian besar keragaman
variabel aslinya.
Principal component analysis (PCA) is a statistical
procedure that uses an orthogonal transformation to
convert a set of observations of possibly correlated
variables into a set of values of linearly uncorrelated
variables called principal components. ... PCA is
sensitive to the relative scaling of the original variables.
https://tgmstat.wordpress.com/2013/11/28/computing-and-
visualizing-pca-in-r/#ref1

Computing the Principal Components (PC)
I will use the classical iris dataset for the demonstration. The
data contain four continuous variables which corresponds to
physical measures of flowers and a categorical variable
describing the flowers’ species.
The data contain four continuous variables which corresponds
to physical measures of flowers and a categorical variable
describing the flowers’ species.

We will apply PCA to the four continuous variables and use the
categorical variable to visualize the PCs later. Notice that in the
following code we apply a log transformation to the continuous
variables as suggested by [1] and set center and scale.
equal to TRUE in the call to prcomp to standardize the variables prior to
the application of PCA:
Since skewness and the magnitude of the variables influence the
resulting PCs, it is good practice to apply skewness transformation,
center and scale the variables prior to the application of PCA. In the
example above, we applied a log transformation to the variables but we
could have been more general and applied a Box and Cox
transformation [2]. See at the end of this post how to perform all those
transformations and then apply PCA with only one call to
the preProcess function of the caret package.

Analyzing the results
The prcomp function returns an object of class prcomp, which
have some methods available. The print method returns the
standard deviation of each of the four PCs, and their rotation (or
loadings), which are the coefficients of the linear combinations
of the continuous variables.

The plot method returns a plot of the variances (y-axis)
associated with the PCs (x-axis). The Figure below is useful to
decide how many PCs to retain for further analysis. In this
simple case with only 4 PCs this is not a hard task and we can
see that the first two PCs explain most of the variability in the
data.

We can use the predict function if we observe new
data and want to predict their PCs values. Just for
illustration pretend the last two rows of the iris data
has just arrived and we want to see what is their PCs
values:

I
It projects the data on the first two PCs. Other PCs
can be chosen through the argument choices of the
function. It colors each point according to the flowers’
species and draws a Normal contour line
with ellipse.prob probability (default to ) for each
group. More info about ggbiplot can be obtained by
the usual ?ggbiplot.I think you will agree that the
plot produced by ggbiplot is much better than the
one produced by biplot(ir.pca)(Figure below).

PCA on caret package
As I mentioned before, it is possible to first apply a Box-Cox
transformation to correct for skewness, center and scale each
variable and then apply PCA in one call to
the preProcessfunction of the caret package.

T E X T M I N I N G U N T U K S E N T I M E N T
A N A L Y S I S

• Komputational
• Visualisasi
Statistika
• Machine
Learning
Artificial
Intelleigence
• Asosiasi
• Sekuensial
Pattern
Recognition
• Basis Data
Basis Data
Definisi Text Mining
Text mining mengacu pada pencarian
informasi, pertambangan data, mesin-
learning, statistik, dan komputasi
linguistic terhadap informasi yang
disimpan sebagai teks(Bridge, C 2011).

Proses Text Mining
Data
Teks
Tokenisaisi
Sentimen
Positif
Algoritma Machine
Learning
Sentimen
Negatif
End
Input Proses Output
Twitter data
Autentifikasi
berdasarkan
Token akun
Ekstrak
berdasarkan
filter
Data Preparation
Visualisasi
sentimen
analisisdalam
Bentuk grafik

R P A C K A G E
S E N T I M E N T
T I M O T H Y J U R K A

R- Package Sentiment (classify)
R menyediakan library sentiment dalam R package yang di
buat oleh Timothy Jurka. Dalam package sentiment ini
berfungsi dua fungsi yaitu classify_emotion dan
classify_polarity.
• classify_emotion. Fungsi ini membantu mengklasifikasikan
emotion kedalam beberapa klasifikasi yaitu: anger, fear, joy,
sadness and supprise.
• classify_polarity. Mengkasifikasikan kedalam respon
positive, negative dan neutral.

teknik analisis sentimen dapat diklasifikasikan ke dalam dua kategori:
• Lexicon based: Teknik ini bergantung pada kamus kata yang
dijelaskan dengan orientasi, digambarkan sebagai polaritas positive,
negative dan netral. Metode ini memberikan hasil presisi tinggi
selama leksikon digunakan memiliki cakupan yang baik dari kata-
kata yang dihadapi dalam teks yang dianalisis.
• Learning Based: Teknik ini memerlukan pelatihan classifier dengan
contoh polaritas dikenal disajikan sebagai teks diklasifikasikan ke
dalam kelas yang positif, negatif dan netral.
TeknikAnalisis Sentiment

R- Package Sentiment
Classify_polarity.R
Classify_emotion.R
Subjectivity.csv.gz
Emotion.csv.gz

S E N T I M E N A N A L I S I S M E N G G U N A K A N
T E X T M I N I N G S O C I A L M E D I A T W I T T E R
S E B A G A I C O N T R O L I N G P A S A R
P A R I W I S A T A I N D O N E S I A

R - datascience

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Ähnlich wie R - datascience

Ähnlich wie R - datascience (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

R - datascience