I am sharing the slides I used for teaching my "Data Science by R" class. You can sign up for a class at http://www.nycdatascience.com/ (NYC Data Science Academy). We offer classes in R, Python, Processing, D3.js, Hadoop, and more.
2. Data Visualization
http://nycdatascience.com/part4_en/
Data visualization
We will study the basic and advanced plotting functions in R, with a focus on
methods for exploring data through visualization.
· The related functions in R
· The properties of a single variable
· Displaying compositions
· The relationship between variables
· Exhibiting change over time
· Geographic information
Case study and exercise: Analyzing the NBA data with graphics
2 of 98
2/4/14, 7:31 AM
Data visualization
A figure is worth a thousand words.
data <- read.table('data/anscombe.txt', header = TRUE)
data <- data[,-1]  # drop the first column
head(data)
  x1 x2 x3 x4   y1   y2    y3   y4
1 10 10 10  8 8.04 9.14  7.46 6.58
2  8  8  8  8 6.95 8.14  6.77 5.76
3 13 13 13  8 7.58 8.74 12.74 7.71
4  9  9  9  8 8.81 8.77  7.11 8.84
5 11 11 11  8 8.33 9.26  7.81 8.47
6 14 14 14  8 9.96 8.10  8.84 7.04
Data visualization
Let's calculate some summary statistics. First compute the mean of each column, and then
compute the correlation coefficient of each of the four (x, y) pairs.
colMeans(data)
x1 x2 x3 x4 y1 y2 y3 y4
9.0 9.0 9.0 9.0 7.5 7.5 7.5 7.5
sapply(1:4,function(x) cor(data[,x],data[,x+4]))
[1] 0.816 0.816 0.816 0.817
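The means and correlations above are nearly identical across the four data sets, which is why a plot is needed to tell them apart. The following is a minimal sketch of that comparison. It assumes that R's built-in `anscombe` data frame can stand in for the course file data/anscombe.txt (the head() output above matches it); the axis limits and the name `fits` are chosen only for illustration.

```r
# Assumption: the built-in `anscombe` data frame stands in for data/anscombe.txt.
data(anscombe)

# Per-pair summaries: correlation plus the fitted regression line.
fits <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(cor = cor(x, y), intercept = coef(fit)[[1]], slope = coef(fit)[[2]])
}))
print(round(fits, 2))

# Draw the four scatter plots on shared axes, each with its regression line.
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, xlim = c(3, 19), ylim = c(3, 13), pch = 19,
       main = paste("Data set", i))
  abline(lm(y ~ x), col = "red")
}
par(op)
```

The fitted lines come out almost the same for all four pairs, while the point clouds look quite different from one another.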
Some basic principles
1. Determine the target of visualization from the beginning
· Exploratory visualization
· Explanatory visualization
2. Understand the characteristics of the data and the audience
· Which variables are important and interesting
· Consider the role and background of the audience
· Select a proper mapping
3. Keep it concise but give enough information
ggplot package
Polishing your plots for publication
library(ggplot2)
p <- ggplot(data=mpg, mapping=aes(x=cty,y=hwy)) +
  geom_point(aes(colour=class,size=displ),
             alpha=0.5,position = "jitter") +
  geom_smooth() +
  scale_size_continuous(range = c(4, 10)) +
  facet_wrap(~ year,ncol=1) +
  # opts() is no longer available in current ggplot2; ggtitle() sets the title
  ggtitle('Vehicle model and fuel consumption') +
  labs(y='Highway miles per gallon',
       x='Urban miles per gallon',
       size='Displacement',
       colour = 'Model')
Histogram
We can customize the histogram as follows:
p <- ggplot(iris,aes(x=Sepal.Length))+
  geom_histogram(binwidth=0.1,   # Set the bin width
                 fill='skyblue', # Set the fill color
                 colour='black') # Set the border color
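To see what the binwidth setting does, it can help to compare two values side by side. The sketch below is illustrative only: the second bin width (0.5) and the object names are assumptions, and `layer_data()` is used to inspect the bins that ggplot2 computed.

```r
library(ggplot2)

# The same histogram with two bin widths (0.5 is chosen only for illustration).
p_narrow <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.1, fill = "skyblue", colour = "black")
p_wide <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", colour = "black")

# layer_data() returns the binned data that ggplot2 computed for the layer.
bins_narrow <- layer_data(p_narrow)
bins_wide <- layer_data(p_wide)
nrow(bins_narrow)  # many narrow bins
nrow(bins_wide)    # fewer, wider bins
```

Every observation falls into exactly one bin, so both versions account for all 150 rows of iris; only the level of detail changes.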
Histograms plus density curve
The main role of a histogram is to show counts by group and the shape of a distribution. The
distribution of a sample is of great importance in traditional statistics. Another way to show
the distribution of data is the kernel density estimate: a smooth curve, estimated from the data,
that represents the distribution. We can display the histogram and the density curve at the same
time.
p <- ggplot(iris,aes(x=Sepal.Length)) +
geom_histogram(aes(y=..density..),
fill='skyblue',
color='black') +
geom_density(color='black',
linetype=2,adjust=2)
Density curve
Like the bin width of a histogram, the adjust parameter controls how smooth the density curve is:
the default bandwidth is multiplied by adjust. We draw multiple density curves with different
values. The smaller the value, the more volatile and sensitive the curve is.
p <- ggplot(iris,aes(x=Sepal.Length)) +
geom_histogram(aes(y=..density..), # Note: set y to relative frequency
fill='gray60',
color='gray') +
geom_density(color='black',linetype=1,adjust=0.5) +
geom_density(color='black',linetype=2,adjust=1) +
geom_density(color='black',linetype=3,adjust=2)
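The effect of adjust can also be checked numerically with base R's `density()`, which scales its default bandwidth by `adjust`. This is a small sketch rather than the slides' own code; the object names are made up for illustration.

```r
x <- iris$Sepal.Length

# stats::density() multiplies its default bandwidth by `adjust`.
d_half <- density(x, adjust = 0.5)
d_one  <- density(x, adjust = 1)
d_two  <- density(x, adjust = 2)
c(d_half$bw, d_one$bw, d_two$bw)  # the bandwidths actually used

# Overlay the three estimates with the same line types as the ggplot2 version.
plot(d_half, lty = 1, main = "Effect of adjust on the density estimate")
lines(d_one, lty = 2)
lines(d_two, lty = 3)
legend("topright", lty = 1:3,
       legend = c("adjust = 0.5", "adjust = 1", "adjust = 2"))
```

A smaller bandwidth follows individual observations more closely, which is why the adjust = 0.5 curve is the most wiggly of the three.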
Density curve
Density curves are also convenient for comparing groups. For example, to compare the
Sepal.Length distributions of the three iris species:
p <- ggplot(iris,aes(x=Sepal.Length,fill=Species)) + geom_density(alpha=0.5,color='gray')
print(p)
Boxplot
In addition to histograms and density curves, we can also use boxplots to show the distribution of
one-dimensional data. Boxplots are also convenient for comparing groups.
p <- ggplot(iris,aes(x=Species,y=Sepal.Length,fill=Species)) + geom_boxplot()
print(p)
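The numbers behind a boxplot can be inspected directly. The sketch below uses base R's `fivenum()` and `boxplot.stats()`; note that base R's hinges may differ slightly from the quartiles that ggplot2 draws, so treat it as an approximation of the plot above.

```r
# Five-number summary (min, lower hinge, median, upper hinge, max) per species.
five <- tapply(iris$Sepal.Length, iris$Species, fivenum)
print(five)

# boxplot.stats() also reports the points that fall beyond the whiskers.
outliers <- lapply(split(iris$Sepal.Length, iris$Species),
                   function(v) boxplot.stats(v)$out)
print(outliers)
```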
Violin plot
A violin plot contains more information than a boxplot about the (sub-)distributions of the data:
p <- ggplot(iris,aes(x=Species,y=Sepal.Length,fill=Species)) + geom_violin()
print(p)
Stacked bar chart
The number of vehicles of each class in the mpg dataset, with each bar split by year
mpg$year <- factor(mpg$year)
p <- ggplot(mpg,aes(x=class,fill=year)) +
geom_bar(color='black')
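The stacked bars above show counts. If shares within each class are of interest, one option is `position = "fill"`, which rescales every bar to height 1. This is a sketch of that variant, not necessarily what the slides used next; `mpg2`, `p_fill`, and `share` are illustrative names.

```r
library(ggplot2)

# Treat year as a categorical variable, as in the slide.
mpg2 <- transform(as.data.frame(mpg), year = factor(year))

# position = "fill" stretches every bar to height 1, so segments show shares.
p_fill <- ggplot(mpg2, aes(x = class, fill = year)) +
  geom_bar(color = "black", position = "fill") +
  labs(y = "Proportion within class")

# The same proportions as a table: each row (class) sums to 1.
share <- prop.table(table(mpg2$class, mpg2$year), margin = 1)
print(round(share, 2))
```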
Rose diagram
A wind rose, a chart commonly used by meteorologists, describes the distribution of wind speed
and direction at a specific place.
set.seed(1)
# Randomly generate 100 wind directions, and divide them into 16 intervals.
dir <- cut_interval(runif(100,0,360),n=16)
# Randomly generate 100 wind speeds, and divide them into 4 intensities.
mag <- cut_interval(rgamma(100,15),4)
sample <- data.frame(dir=dir,mag=mag)
# Map wind direction to the x-axis, frequency to the y-axis and speed to the fill color.
# Then transform the coordinate system to polar coordinates.
p <- ggplot(sample,aes(x=dir,fill=mag)) +
geom_bar()+ coord_polar()
The proportion structure of continuous data
data <- read.csv('data/soft_impact.csv', header = TRUE)
library(reshape2)
data.melt <- melt(data,id='Year')
p <- ggplot(data.melt,aes(x=Year,y=value,
group=variable,fill=variable)) +
geom_area(color='black',size=0.3,
position=position_fill()) +
scale_fill_brewer()
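Because data/soft_impact.csv is not included here, the following sketch uses a small made-up data set, already in the long format that melt() produces, to show what position_fill() does: within each year, every value is divided by that year's total. All names and numbers in `toy` are hypothetical.

```r
library(ggplot2)

# Hypothetical long-format data: three made-up series over five years.
toy <- data.frame(
  Year     = rep(2001:2005, times = 3),
  variable = rep(c("A", "B", "C"), each = 5),
  value    = c(10, 12, 15, 18, 20,  5, 6, 6, 7, 9,  5, 4, 3, 3, 2)
)

# What position_fill() displays: each value divided by its year's total.
toy$share <- ave(toy$value, toy$Year, FUN = function(v) v / sum(v))

p_share <- ggplot(toy, aes(x = Year, y = value,
                           group = variable, fill = variable)) +
  geom_area(color = "black", position = position_fill()) +
  scale_fill_brewer()
```

The `share` column is computed by hand only to make the rescaling explicit; the plot itself works from the raw `value` column.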
Scatter plot of multidimensional data
Represent different years with different shapes
mpg$year <- factor(mpg$year)
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) + geom_point(aes(color=year,shape=year))
print(p)
Scatter plot of multidimensional data
With large data sets, the points in a scatter plot may obscure each other due to overplotting. We can
add a small random disturbance (jitter) to the points to reduce this problem.
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) + geom_point(aes(color=year),alpha=0.5,position = "jitter")
print(p)
Scatter plot of multidimensional data
To show the trend in the scatter plot, we can add a regression line.
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(color=year),alpha=0.5,position = "jitter") +
geom_smooth(method='lm')
print(p)
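The straight line that geom_smooth(method='lm') draws is an ordinary least-squares fit, so its coefficients can be obtained with lm(). A minimal sketch, assuming mpg is the data set that ships with ggplot2:

```r
# Assumption: mpg is the data set shipped with the ggplot2 package.
mpg <- ggplot2::mpg

# Fit highway mileage as a linear function of city mileage.
fit <- lm(hwy ~ cty, data = mpg)
coef(fit)               # intercept and slope of the regression line
summary(fit)$r.squared  # share of the variance in hwy explained by cty
```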
Scatter plot of multidimensional data
In addition to color, we can also use the size of the dot to reflect another variable, such as
engine displacement. Some refer to plots like this as "bubble charts".
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(color=year,size=displ),alpha=0.5,position = "jitter") +
geom_smooth(method='lm') +
scale_size_continuous(range = c(4, 10))
Scatter plot of multidimensional data
Instead of showing all the variables in one picture, we can also split the data into multiple panels,
one for each level of a variable. This method is called grouping, conditioning, or faceting.
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(colour=class,size=displ),
alpha=0.5,position = "jitter") +
geom_smooth() +
scale_size_continuous(range = c(4, 10)) +
facet_wrap(~ year,ncol=1)
ggplot exercise II
· Make a scatter plot of the diamonds data
· Use transparency and small points; look into the size and alpha options of geom_point()
· Use a binned chart to observe the density of points; look into stat_bin2d()
· Estimate the data density; look into stat_density2d() and use
+ coord_cartesian(xlim=c(0,1.5), ylim=c(0,6000))
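One possible starting point for this exercise is sketched below. It assumes the plot is price against carat (suggested by the axis limits in the last bullet, but not stated in the exercise), and the point size, alpha, and number of bins are arbitrary illustrative values.

```r
library(ggplot2)

# 1. Scatter plot with small, transparent points to reduce overplotting.
p1 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(size = 0.5, alpha = 0.1)

# 2. Rectangular bins: the fill colour encodes the number of points per cell.
p2 <- ggplot(diamonds, aes(x = carat, y = price)) +
  stat_bin2d(bins = 50)

# 3. 2D kernel density contours; coord_cartesian() zooms without dropping data.
p3 <- ggplot(diamonds, aes(x = carat, y = price)) +
  stat_density2d() +
  coord_cartesian(xlim = c(0, 1.5), ylim = c(0, 6000))
```

Printing p1, p2, and p3 in turn shows how each approach copes with the roughly 54,000 overlapping points.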
Scatter plot of multidimensional data
A typical scatter plot shows the relationship between two variables. When you want to look at
many bivariate relationships at once, you can use a scatter plot matrix.
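The slide does not show code for this, so here is a minimal sketch using base R's `pairs()` on the four numeric iris columns. The choice of data set, colours, and point size is an assumption made for illustration.

```r
# Four numeric measurements from iris; colour marks the species.
num <- iris[, 1:4]
pairs(num, col = as.integer(iris$Species) + 1, pch = 19, cex = 0.6,
      main = "Scatter plot matrix of the iris measurements")

# The matching correlation matrix gives a numeric summary of each panel.
cors <- cor(num)
print(round(cors, 2))
```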
Change over time
To visualize time series data, the first step is to look at how the variable changes over time.
For example, we'll look at US unemployment figures from the economics dataset.
fillcolor <- ifelse(economics$unemploy[440:470]<8000,'steelblue','red4')
p <- ggplot(economics[440:470,],aes(x=date,y=unemploy)) +
geom_bar(stat='identity',
fill=fillcolor)
Change over time
For a short time series, we can use a bar chart, and different colors can distinguish groups of
values, such as positive and negative ones. For a long time series the bars become crowded, so a
thin line topped with a point can be used in place of each bar.
p <- ggplot(economics[300:470,],aes(x=date,ymax=psavert,ymin=0)) +
geom_linerange(color='grey20',size=0.5) +
geom_point(aes(y=psavert),color='red4') +
theme_bw()
Change over time
When the data is denser, we can use a line graph or an area chart to show the trend.
Important time points or intervals can also be marked in the time series graph, such as
highlighting the 1980s as a key period.
fill.color <- ifelse(economics$date >= as.Date('1980-01-01') &
                     economics$date < as.Date('1990-01-01'),
                     'steelblue','red4')
p <- ggplot(economics,aes(x=date,ymax=psavert,ymin=0)) +
  geom_linerange(color=fill.color,size=0.9) +
  # annotate() draws the label once instead of once per data row
  annotate('text',x=as.Date('1985-01-01'),y=13,label='1980s') +
  theme_bw()
Map
Two ways to draw a map
· Download geographic information data, draw the geographical boundaries, and then
mark areas and locations as needed
· Download a bitmap from Google Maps, and then mark location and path information on top of
it
Drawing a map of China based on a bitmap
Another way to draw a map of China is to download a file containing bitmap data from
Google or OpenStreetMap, and then overlay point and line elements on it with ggplot2. This
file does not include latitude and longitude information; it is just a simple bitmap, which
allows fast mapping.
library(ggmap)
library(XML)
webpage <-'http://data.earthquake.cn/datashare/globeEarthquake_csn.html'
tables <- readHTMLTable(webpage,stringsAsFactors = FALSE)
raw <- tables[[6]]
data <- raw[,c(1,3,4)]
names(data) <- c('date','lan','lon')
data$lan <- as.numeric(data$lan)
data$lon <- as.numeric(data$lon)
data$date <- as.Date(data$date, "%Y-%m-%d")
# Read the map data from Google with the ggmap package, and mark the earthquake data on the map.
# Note: current versions of ggmap need a Google API key (see register_google()) for get_googlemap().
earthquake <- ggmap(get_googlemap(center = 'china', zoom=4,maptype='terrain'),extent='device') +
  geom_point(data=data,aes(x=lon,y=lan),colour = 'red',alpha=0.7)+
  theme(legend.position = "none")
R and interactive visualization
googleVis is an R package that provides an interface between R and the Google Visualization API. It
allows the user to use the Google Visualization API for data visualization without the need to upload data.
We want to compare the development trajectories of the Group of 20 countries over the past several years. To
obtain the data, we selected three variables from the World Bank database, which reflect the
changes in GDP, CO2 emissions, and life expectancy between 2001 and 2009.
library(googleVis)
library(WDI)
DF <- WDI(country=c("CN","RU","BR","ZA","IN",'DE','AU','CA','FR','IT','JP','MX','GB','US'
M <- gvisMotionChart(DF, idvar="country", timevar="year",
xvar='EN.ATM.CO2E.KT',
yvar='NY.GDP.MKTP.CD')
plot(M)
Exercise III: Analyzing NBA data
· Calculate the winning rate for each season, and draw a bar chart
· Calculate the winning rate at home and on the road for each season, and draw a bar chart
· Using the home side's scores by season, draw a set of four histograms
· Using the home side's scores by season, draw boxplots for the five seasons
· Draw boxplots of the scores in all games for the home side and the opposing side
· Calculate the average and the winning percentage against each opponent, and make a scatter plot to find
the strong and the weak teams.