R Activity in BIOSTATISTICS
Autida, Trexia B.
Sutliz, Larry J.
Torrejas, April Rose C.
BSE-BIOLOGY 3
TTh 8:30 – 10:00 A.M.
> #ACTIVITY 1
> #1.1. Pick two different integers of your choice and assign x to be the smaller integer and y
the bigger integer.
> x<-3
> y<-8
> #1.2. Find the results of these R commands:
> #a
> x+y
[1] 11
> x<-5
> y<-9
> x+y
[1] 14
> #b
> sqrt(x)
[1] 2.236068
> #c
> x^2
[1] 25
> #d
> y-5
[1] 4
> #e
> y-x
[1] 4
> #f
> x/y
[1] 0.5555556
> #g
> x*y
[1] 45
> #h
> y*7
[1] 63
> #i
> log(y)
[1] 2.197225
> #j
> factorial(x)
[1] 120
> #1.3. Find the value of the following expressions.
> #a.
> sqrt(x^2+y^2)
[1] 10.29563
> #b
> sqrt((y-x)/(x*y))
[1] 0.2981424
> #c
> ((x*y)/y)^2
[1] 25
> #d
> factorial(y)/(2*factorial(x))
[1] 1512
> #1.II.DATA ENTRY
> #1. Suppose you list your commute times for two weeks (10 days) and you listed the
following times in minutes,
17 16 20 24 22 15 21 15 17 22
a) Enter these numbers into R.
b) Find the longest commute time and the minimum commute time.
c) Arrange the data in increasing order.
d) List the number categories
> #a
> H<-c(17,16,20,24,22,15,21,15,17,22)
> H
[1] 17 16 20 24 22 15 21 15 17 22
> #b
> max(H)
[1] 24
> min(H)
[1] 15
> #c
> sort(H)
[1] 15 15 16 17 17 20 21 22 22 24
> #d
> table(H)
H
15 16 17 20 21 22 24
 2  1  2  1  1  2  1
> #2. Your cell phone bills vary from month to month. Suppose your year has the following
monthly amounts
460 330 390 370 460 300 480 320 490 350 300 480
a) Enter this data into a variable called phone.bill.
b) How much have you spent this year on cell phone bill?
c) What is the smallest amount you spent in a month?
d) What is the largest?
e) Give the amounts greater than 400.
> #a
> phone.bill<-c(460,330,390,370,460,300,480,320,490,350,300,480)
> phone.bill
[1] 460 330 390 370 460 300 480 320 490 350 300 480
> #b
> sum(phone.bill)
[1] 4730
> #c
> min(phone.bill)
[1] 300
> #d
> max(phone.bill)
[1] 490
> #e
> phone.bill[phone.bill>400]
[1] 460 460 480 490 480
> #3. Suppose 4 people are asked three questions: their weight (lbs), height (cm), and gender.
The data are as follows:
Weight Height Gender
150 65 female
135 61 female
210 70 male
140 65 female
166 61 male
a) Enter the data in R.
b) Extract a data frame which holds only the weight and the height column..
c) Extract the information of the tallest person.
d) Make a table with assigned names (of your own choice) on the 4 people.
> #3.a
> Weight<-c(150,135,210,140,166)
> Height<-c(65,61,70,65,61)
> Gender<c("female","female","male","female","male")
Error: object 'Gender' not found
> Gender<-c("female","female","male","female","male")
> df<-data.frame(Weight,Height,Gender)
> df
  Weight Height Gender
1    150     65 female
2    135     61 female
3    210     70   male
4    140     65 female
5    166     61   male
> #3.b
> df[,1:2]
  Weight Height
1    150     65
2    135     61
3    210     70
4    140     65
5    166     61
> #3.c
> df['3',]
  Weight Height Gender
3    210     70   male
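Item 3.c picks row 3 by hand. A minimal sketch that lets R locate the tallest person instead, assuming the same data as in 3.a (rebuilt here so the block runs on its own; `tallest` is a name introduced only for illustration):

```r
# Rebuild the data frame from item 3.a
Weight <- c(150, 135, 210, 140, 166)
Height <- c(65, 61, 70, 65, 61)
Gender <- c("female", "female", "male", "female", "male")
df <- data.frame(Weight, Height, Gender)

# which.max() returns the position of the first maximum of Height,
# so the row index does not have to be typed by hand
tallest <- df[which.max(df$Height), ]
tallest
```

If two people tie for the maximum height, `which.max()` returns only the first; `df[df$Height == max(df$Height), ]` would return all of them.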
> #3.d
> row.names(df)<-c("Trexia","April","Larry","April","Stefan")
> df
       Weight Height Gender
Trexia    150     65 female
April     135     61 female
Larry     210     70   male
April     140     65 female
Stefan    166     61   male
> #ACTIVITY 2
> #2.1. The hours per week spent on computer games by 30 selected high school students are recorded below.
5 10 5 13 6 6
20 6 8 9 7 8
8 8 6 9 20 9
11 5 10 10 7 5
15 14 9 20 20 6
a) Enter the data into R.
b) Make a histogram indicating the title and labels.
> hours<-c(5,20,8,11,15,10,6,8,5,14,5,8,6,10,9,13,9,9,10,20,6,7,20,7,20,6,8,9,5,6)
> hist(hours,col="green")
> #2.2. Enter the following data in R and create a line graph. Make your graph appear more attractive.
x 2 3 4 6 8 9 10 10 11 12
y 4 5 6 8 8 10 10 11 12 12
> X<-c(2,3,4,6,8,9,10,10,11,12)
> Y<-c(4,5,6,8,8,10,10,11,12,12)
> plot(X,Y,type="o",xlab="X",ylab="Y",ylim=c(1,15),main="X and Y",col="green")
> plot(X,Y,type="o",xlab="X",ylab="Y",ylim=c(1,15),main="X and Y",col="blue")
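Item 2.1(b) asks for a title and axis labels, which the `hist(hours,col="green")` call leaves at their defaults. A minimal sketch of one way to supply them; the title and label wording is an assumption, not taken from the exercise:

```r
# Hours per week on computer games for the 30 students in item 2.1
hours <- c(5,20,8,11,15,10,6,8,5,14,5,8,6,10,9,13,9,9,10,20,
           6,7,20,7,20,6,8,9,5,6)

# main sets the title, xlab/ylab set the axis labels (wording is illustrative)
h <- hist(hours,
          main = "Weekly Hours Spent on Computer Games",
          xlab = "Hours per week",
          ylab = "Number of students",
          col  = "green")
```

`hist()` also returns the bin breaks and counts invisibly, so `h$counts` can be inspected after plotting.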
> #2.3. Load the built-in data set cars by typing
> cars
in the command line. Make a line graph which indicates the points on the graph, with appropriate labels and title.
> cars
speed dist
1 4 2
2 4 10
> plot(cars$speed,cars$dist,type="o",xlab="Speed",ylab="Distance",ylim=c(0,120),main="Speed and Distance",col="black")
>#ACTIVITY 3
>#A.
>#3.1. You want to buy a reconditioned cellphone and find that, over three months of watching at
Gaisano mall, you see the following prices (suppose the cellphones are all similar)
9000 9500 9400 9400 10000 9500 10300 10200
Use R commands to find
a) the mean value
b) the median
c) What is the variance?
d) The standard deviation?
> prices<-c(9000,9500,9400,9400,10000,9500,10300,10200)
> mean(prices)
[1] 9662.5
> median(prices)
[1] 9500
> var(prices)
[1] 205535.7
> sd(prices)
[1] 453.3605
>#3.2. Fifteen randomly selected statistics students were asked for the number of hours they
spent in studying at night. The resulting data are as follows:
2 2 3 1 4 5 2 3 2 4 3 2 1 1 1.5
> study<-c(2,2,3,1,4,5,2,3,2,4,3,2,1,1,1.5)
> #a. Arrange the data in increasing order
> sort(study)
[1] 1.0 1.0 1.0 1.5 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 4.0 4.0 5.0
30 18.0 80 51.0
31 20.6 87 77.0
>#a. Use summary command on the data. What does it give?
> summary(trees)
Girth Height Volume
Min. : 8.30 Min. :63 Min. :10.20
1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
Median :12.90 Median :76 Median :24.20
Mean :13.25 Mean :76 Mean :30.17
3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
Max. :20.60 Max. :87 Max. :77.00
>#b. What is the total of all Girths?
> sum(trees$Girth)
[1] 410.7
>#c. Give the table of frequencies of the heights.
> table(trees$Height)
63 64 65 66 69 70 71 72 74 75 76 77 78 79 80 81 82 83 85 86 87
1 1 1 1 1 1 1 2 2 3 2 1 1 1 5 2 1 1 1 1 1
>#d. Give the volume of the trees greater than 40.
> trees$Volume[trees$Volume>40]
[1] 42.6 55.4 55.7 58.3 51.5 51.0 77.0
>#3.4. Below is a cost and return analysis in marketing tomato using different packaging
material.
>#a. Enter the data in R which gives a similar table.
>Packaging.material<-c("Control","Cellophane","Wooden.box")
> Product.quantity<-c(20,20,20)
> Gross.Income<-c(400,540,700)
> Cost.Pesos<-c(275,258,821)
> Net.return<-c(124,281,378)
> df<-data.frame(Packaging.material,Product.quantity,Gross.Income,Cost.Pesos,Net.return)
> df
Product.quantity Gross.Income Cost.Pesos Net.return
Control 20 400 275 124
Cellophane 20 540 258 281
Wooden Box 20 700 821 378
>#b.What is the mean cost?
> mean(Cost.Pesos)
[1] 451.3333
>#c) Which material has the highest net return?
> df[3,]
Product.quantity Gross.Income Cost.Pesos Net.return
Wooden Box 20 700 821 378
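Item c) selects row 3 by inspection. A minimal sketch that finds the row with the highest net return programmatically, assuming the vectors entered in item a) (rebuilt here so the block is self-contained; `best` is a hypothetical name):

```r
# Data as entered in item 3.4(a)
Packaging.material <- c("Control", "Cellophane", "Wooden.box")
Product.quantity   <- c(20, 20, 20)
Gross.Income       <- c(400, 540, 700)
Cost.Pesos         <- c(275, 258, 821)
Net.return         <- c(124, 281, 378)
df <- data.frame(Packaging.material, Product.quantity,
                 Gross.Income, Cost.Pesos, Net.return)

# Row whose Net.return is largest
best <- df[which.max(df$Net.return), ]
best
```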
>#d) What is the standard deviation of the gross income?
> sd(Gross.Income)
[1] 150.1111
>#ACTIVITY 4
>#4.1. Find the value of:
>#a. 7P7
> factorial(7)/factorial(7-7)
[1] 5040
>#b) 99 C 66
> choose(99,66)
[1] 1.974439e+26
>#c. 9C6
> choose(9,6)
[1] 84
>#d. 10P5
> factorial(10)/factorial(10-5)
[1] 30240
>#e. 50C50
> choose(50,50)
[1] 1
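The permutation counts in 4.1 are typed out as `factorial(n)/factorial(n-k)` each time. Since nPk = nCk × k!, a small helper can wrap the idea; `perm` is a hypothetical name introduced here, not a base R function:

```r
# Number of ordered selections of k items from n: nPk = n! / (n - k)!
# Written via choose() so large n does not overflow factorial() as quickly
perm <- function(n, k) choose(n, k) * factorial(k)

perm(7, 7)    # item a: 7P7
perm(10, 5)   # item d: 10P5
```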
>#4.2. Let z be a standard normal random variable. Find the following probabilities
a. P ( z < 1.65 )
b. P (-0.25 < z < 1.64)
c. P( z > 1.91)
> #a
> pnorm(1.65)
[1] 0.9505285
>#b
> pnorm(1.64)-pnorm(-0.25)
[1] 0.5482037
>#c
> 1-pnorm(1.91)
[1] 0.02806661
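The three probabilities in 4.2 can also be written without manual subtraction from 1. A brief sketch using the `lower.tail` argument of `pnorm()` and `diff()`; the variable names are illustrative only:

```r
# a. P(z < 1.65): lower tail is the default
p.a <- pnorm(1.65)

# b. P(-0.25 < z < 1.64): difference of two lower-tail probabilities
p.b <- diff(pnorm(c(-0.25, 1.64)))

# c. P(z > 1.91): ask for the upper tail directly
p.c <- pnorm(1.91, lower.tail = FALSE)

c(p.a, p.b, p.c)
```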
>#B
>#4.3. A set of scores in a Statistics examination is approximately normally distributed with a
mean of 74 and a standard deviation of 7.9. find the probability that a student received a score
between 75 and 80.
> pnorm(80,74,7.9)-pnorm(75,74,7.9)
[1] 0.2258569
>#4.4. A multiple choice quiz has 10 questions, each with four possible answers of which only
one is the correct answer. What is the probability that sheer guesswork would yield at most 1
correct answer?
> y<-c(0,1)
> sum(dbinom(y,10,.25))
[1] 0.2440252
>#4.5. A family has 6 children. Find the probability P that there are
>#a.3 boys and 3 girls
> dbinom(3,6,0.5)
[1] 0.3125
>#b.fewer boys than girls.
> #let x=number of boys where boys are fewer than girls
> #let n=number of trials=6
> #let p=probability of getting a boy=0.5
> x<-c(2,1,0)
>sum(dbinom(x,6,0.5))
[1] 0.34375
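The sums of `dbinom()` terms in 4.4 and 4.5(b) are cumulative probabilities, so `pbinom()` gives the same values in one call. A minimal sketch (variable names are illustrative):

```r
# 4.4: P(at most 1 correct) with n = 10 guesses and p = 0.25
at.most.one <- pbinom(1, size = 10, prob = 0.25)

# 4.5(b): fewer boys than girls among 6 children means 0, 1, or 2 boys
fewer.boys <- pbinom(2, size = 6, prob = 0.5)

at.most.one
fewer.boys
```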
>#ACTIVITY 5
>#5.1. Students use many kinds of criteria when selecting courses. "Teacher who is a very easy
grader" is often one criterion. Three teachers are scheduled to teach statistics.
> Professor.1<-c(12,16,35)
> Professor.2<-c(11,29,30)
> Professor.3<-c(27,25,15)
> grades<-data.frame(Professor.1,Professor.2,Professor.3)
> grades
  Professor.1 Professor.2 Professor.3
1          12          11          27
2          16          29          25
3          35          30          15
> row.names(grades)<-c("A","B","C")
> grades
  Professor.1 Professor.2 Professor.3
A          12          11          27
B          16          29          25
C          35          30          15
> chisq.test(grades)
Pearson's Chi-squared test
data: grades
X-squared = 21.318, df = 4, p-value = 0.0002739
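The transcript stops at the test output. A short sketch of how the result could be inspected further: the expected counts (to check the usual rule of thumb that they are at least 5) and the p-value, which is then compared with whatever significance level the exercise specifies. The object name `res` is an assumption for illustration:

```r
# Grade counts (rows A, B, C) for the three professors, as entered above
grades <- data.frame(Professor.1 = c(12, 16, 35),
                     Professor.2 = c(11, 29, 30),
                     Professor.3 = c(27, 25, 15))
row.names(grades) <- c("A", "B", "C")

res <- chisq.test(grades)
res$expected   # expected counts under independence
res$p.value    # compare with the chosen significance level
```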
>#5.2. Test the hypothesis that the average running time of films produced by company 2
exceeds the running time of films produced by company 1 by 10 minutes against the one-sided
alternative that the difference is more than 10 minutes. Use a 0.10 level of significance and
assume the distributions of times to be approximately normal.
> company1<-c(102,86,98,109,92)
> company2<-c(81,165,97,134,92,87,114)
> t.test(company1,company2,mu=10,alt="greater",conf.level=0.90)
Welch Two Sample t-test
data: company1 and company2
t = -1.8689, df = 7.376, p-value = 0.9491
alternative hypothesis: true difference in means is greater than 10
90 percent confidence interval:
-29.62057 Inf
sample estimates:
mean of x mean of y
97.4 110.0
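Note on argument order: `t.test(company1,company2,mu=10,alt="greater")` tests whether mean(company1) minus mean(company2) exceeds 10, which is why the statistic above is negative. Assuming the exercise intends the difference company 2 minus company 1, a sketch of the call with the samples swapped follows; the output is not reproduced here.

```r
company1 <- c(102, 86, 98, 109, 92)
company2 <- c(81, 165, 97, 134, 92, 87, 114)

# H0: mean(company2) - mean(company1) = 10
# H1: mean(company2) - mean(company1) > 10   (Welch test, unequal variances)
res <- t.test(company2, company1, mu = 10,
              alternative = "greater", conf.level = 0.90)
res
```

The sample difference is 110.0 - 97.4 = 12.6 minutes; the decision at the 0.10 level follows from comparing `res$p.value` with 0.10.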
>#5.3. As an aid for improving students' study habits, nine students were randomly selected to attend a
seminar on the importance of education in life. The table shows the number of hours each
student studied per week before and after the seminar. At α = 0.10, did attending the seminar
increase the number of hours the students studied per week?
Before 9 12 6 15 3 18 10 13 7
After 9 17 9 20 2 21 15 22 6
> before<-c(9,12,6,15,3,18,10,13,7)
> after<-c(9,17,9,20,2,21,15,22,6)
> t.test(before,after,mu=0,alt="greater",paired=T,conf.level=0.90)
Paired t-test
data: before and after
t = -2.8, df = 8, p-value = 0.9884
alternative hypothesis: true difference in means is greater than 0
90 percent confidence interval:
-4.663128 Inf
sample estimates:
mean of the differences
-3.111111
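Note on direction: with `before` listed first and `alt="greater"`, the call above tests whether hours decreased (before minus after greater than 0), so its large p-value does not answer the question asked. A sketch of the test for an increase, taking differences as after minus before:

```r
before <- c(9, 12, 6, 15, 3, 18, 10, 13, 7)
after  <- c(9, 17, 9, 20, 2, 21, 15, 22, 6)

# H0: mean(after - before) = 0
# H1: mean(after - before) > 0   (seminar increased study hours)
res <- t.test(after, before, mu = 0, alternative = "greater",
              paired = TRUE, conf.level = 0.90)
res
```

By symmetry the statistic becomes t = 2.8 on 8 degrees of freedom, and the one-sided p-value is the complement of the one printed above (about 1 - 0.9884 = 0.0116), which is below 0.10.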
> #ACTIVITY 6
> #6.1. An educator wants to see how the number of absences a student in her class has affects
the students' grade.
> #a) Enter the data in R.
> #b) Make a scatter plot on the data.
> #c) Plot the regression line on the scatter plot.
> #d) Use lm command for the regression analysis.
> #e) What is the equation of the regression line?
> #f) What is the estimated grade when the student has 7 absences?(Note: use
the equation of the line formula and input it in R)
> no.of.absence<-c(10,12,2,0,8,5)
> final.grade<-c(70,65,96,94,75,82)
> plot(final.grade ~ no.of.absence)
> lm(final.grade ~ no.of.absence)
Call:
lm(formula = final.grade ~ no.of.absence)
Coefficients:
(Intercept) no.of.absence
96.784 -2.668
> a<-c(96.784)
> b<-c(-2.668)
> x=no.of.absence
> y=(a+(b*x))
> no.of.absence<-c(7)
> y=a+(b*no.of.absence)
> y
[1] 78.108
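Items c) and f) can also be handled from the fitted model object rather than from retyped coefficients, which avoids overwriting `no.of.absence`. A minimal sketch, assuming the same data; `fit` and `pred` are names introduced for illustration and the plot labels are assumptions:

```r
no.of.absence <- c(10, 12, 2, 0, 8, 5)
final.grade   <- c(70, 65, 96, 94, 75, 82)

fit <- lm(final.grade ~ no.of.absence)

# c) scatter plot with the fitted regression line drawn on top
plot(final.grade ~ no.of.absence,
     xlab = "Number of absences", ylab = "Final grade",
     main = "Final Grade vs. Absences")
abline(fit, col = "red")

# f) estimated grade for a student with 7 absences
pred <- predict(fit, newdata = data.frame(no.of.absence = 7))
pred
```

`predict()` uses the unrounded coefficients, so its result can differ from the hand calculation in the last decimal places.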
>#6.2. Consider the following data of weight (g) and length (cm) of milkfish:
Weight 150 139 100 145 121 128 143 155 138 153
Length 28 25 20 27 23 20 28 28 26 29
> #1. Generate a scatter plot on the data (length vs.weight).
> #2. Fit the regression line on the scatter plot.
> #3. Carefully examine the regression plot. What does this indicate?
> #4. Find the correlation coefficient r and interpret.
> weight<-c(150,139,100,145,121,128,143,155,138,153)
> length<-c(28,25,20,27,23,20,28,28,26,29)
> plot(length~weight)
> lm(length~ weight)
Call:
lm(formula = length ~ weight)
Coefficients:
(Intercept) weight
1.1075 0.1771
> a<-c(1.1075)
> b<-c(0.1771)
> y=a+(b*weight)
> y
[1] 27.6725 25.7244 18.8175 26.7870 22.5366 23.7763 26.4328 28.5580 25.5473
[10] 28.2038
> cor(length,weight,method="pearson")
[1] 0.8939989
> # r = 0.894 is close to 1, which indicates a strong positive linear
relationship between weight and length.
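Item 2 of 6.2 asks for the regression line on the scatter plot, which the transcript does not draw. A minimal sketch, assuming the same milkfish data; the vector is named `len` here (an assumption) so it does not mask the base function `length()`:

```r
weight <- c(150, 139, 100, 145, 121, 128, 143, 155, 138, 153)
len    <- c(28, 25, 20, 27, 23, 20, 28, 28, 26, 29)

fit <- lm(len ~ weight)

# Scatter plot of length against weight with the fitted line added
plot(len ~ weight, xlab = "Weight (g)", ylab = "Length (cm)",
     main = "Milkfish Length vs. Weight")
abline(fit)

# Pearson correlation coefficient (the default method of cor())
r <- cor(len, weight)
r
```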
> #6.3.
> #a. Estimate the equation of the regression line.
> #b. Predict the moisture content of the raw material if the relative
humidity is 50.
> #c. Compute the sample coefficient of determination and interpret.
> x<-c(46,53,37,42,34,29,60,44,41,48,33,40)
> y<-c(12,14,11,13,10,8,17,12,10,21,9,13)
> lm(y~ x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-0.7367 0.3133
> a<-c(-0.7367)
> b<-c(0.3133)
> y=a+(b*x)
> # b)
> x<-c(50)
> y=a+(b*x)
> y
[1] 14.9283
> # c)
> x<-c(46,53,37,42,34,29,60,44,41,48,33,40)
> y<-c(12,14,11,13,10,8,17,12,10,21,9,13)
> cor(y,x,method="pearson")
[1] 0.7612409
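> # r = 0.7612409 indicates a fairly strong positive linear relationship between x and y.
> # The coefficient of determination is r^2 = 0.7612409^2, about 0.579, so roughly 58% of
the variation in y is explained by its linear relationship with x.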
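Item c) asks for the coefficient of determination, which is the square of the correlation coefficient computed above. A short sketch of two equivalent ways to obtain it, plus the prediction for item b) from the fitted model; `fit`, `r2`, and `pred50` are illustrative names:

```r
x <- c(46, 53, 37, 42, 34, 29, 60, 44, 41, 48, 33, 40)
y <- c(12, 14, 11, 13, 10, 8, 17, 12, 10, 21, 9, 13)

fit <- lm(y ~ x)

# c) coefficient of determination, read from the model summary
r2 <- summary(fit)$r.squared
r2
cor(x, y)^2   # same value for simple linear regression

# b) predicted y when x = 50, using unrounded coefficients
pred50 <- predict(fit, newdata = data.frame(x = 50))
pred50
```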
> #ACTIVITY 7
> #7.1. The following data represent the scores in the final examination obtained by 4
students in mathematics, English, and biology:
Student   Mathematics   English   Biology
1         68            57        61
2         83            94        86
3         72            81        59
4         55            73        66
Use a 0.05 level of significance to test the hypothesis that
>#a. the courses are of equal difficulty;
>#b. the students have equal ability.
> Math<-c(68,83,72,55)
> Eng<-c(57,94,81,73)
> Bio<-c(61,86,59,66)
> student<-c("1","2","3","4")
> #a
> df<-data.frame(student,Math,Eng,Bio)
> df
student Math Eng Bio
1 1 68 57 61
2 2 83 94 86
3 3 72 81 59
4 4 55 73 66
> subj<-stack(df)
Warning message:
In stack.data.frame(df) : non-vector columns will be ignored
> subj
values ind
1 68 Math
2 83 Math
3 72 Math
4 55 Math
5 57 Eng
6 94 Eng
7 81 Eng
8 73 Eng
9 61 Bio
10 86 Bio
11 59 Bio
12 66 Bio
>
> anova(lm(values~ind, data=subj))
Analysis of Variance Table
Response: values
Df Sum Sq Mean Sq F value Pr(>F)
ind 2 154.5 77.25 0.4407 0.6568
Residuals 9 1577.8 175.31
> #b
> stud<-c(1,2,3,4,1,2,3,4,1,2,3,4)
> New<-data.frame(subj,stud)
> New
values ind stud
1 68 Math 1
2 83 Math 2
3 72 Math 3
4 55 Math 4
5 57 Eng 1
6 94 Eng 2
7 81 Eng 3
8 73 Eng 4
9 61 Bio 1
10 86 Bio 2
11 59 Bio 3
12 66 Bio 4
> anova(lm(values~ind+stud, data=subj))
Analysis of Variance Table
Response: values
Df Sum Sq Mean Sq F value Pr(>F)
ind 2 154.50 77.25 0.3947 0.6863
stud 1 12.15 12.15 0.0621 0.8095
Residuals 8 1565.60 195.70
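In the table above `stud` has 1 degree of freedom because it was entered as a number, so R fits a straight-line trend across student IDs instead of comparing the four students. A sketch of the same two-way layout with the student treated as a factor; the data frame `scores` is a hypothetical name built from the values shown:

```r
# Scores stacked by subject, with subject and student both as factors
scores <- data.frame(
  values = c(68, 83, 72, 55,   # Math
             57, 94, 81, 73,   # Eng
             61, 86, 59, 66),  # Bio
  ind    = factor(rep(c("Math", "Eng", "Bio"), each = 4)),
  stud   = factor(rep(1:4, times = 3))
)

# Two-way ANOVA without interaction (one observation per cell)
tab <- anova(lm(values ~ ind + stud, data = scores))
tab
```

With `stud` as a factor it takes 3 degrees of freedom and the residual has 6; the rows for `ind` and `stud` then address hypotheses a) and b) respectively.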
>#7.2. The strains of rats were studied under 2 environmental conditions for their performances
in a maze test
>#Use a 0.05 level of significance to test the hypothesis that
>#a.there is no difference in error scores for different environments;
>#b.there is no difference in error scores for different strains; the environments and strains of rats
> Environment<-c("free","free","free","free","restricted","restricted","restricted","restricted")
11 36 Mixed free
12 14 Mixed free
13 60 Mixed restricted
14 89 Mixed restricted
15 35 Mixed restricted
16 126 Mixed restricted
17 101 Dull free
18 94 Dull free
19 33 Dull free
20 83 Dull free
21 136 Dull restricted
22 120 Dull restricted
23 38 Dull restricted
24 153 Dull restricted
> #a
> anova(lm(values~Environment, data=New.rats))
Analysis of Variance Table
Response: values
Df Sum Sq Mean Sq F value Pr(>F)
Environment 1 8067 8066.7 5.5509 0.02779 *
Residuals 22 31971 1453.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> #b
> anova(lm(values~ind, data=New.rats))
Analysis of Variance Table
Response: values
Df Sum Sq Mean Sq F value Pr(>F)
ind 2 11834 5917.2 4.4059 0.02525 *
Residuals 21 28203 1343.0
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(lm(values~ind+Environment, data=New.rats))
Analysis of Variance Table
Response: values
Df Sum Sq Mean Sq F value Pr(>F)
ind 2 11834.3 5917.2 5.8771 0.009824 **
Environment 1 8066.7 8066.7 8.0121 0.010334 *
Residuals 20 20136.3 1006.8
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>#7.3. The data on the initial weight of quail (Coturnix japonica) subjected to different light
bulbs are given below.
Treatment   Observation
            1    2    3    4
A           75   76   78   80
B           79   76   75   76
C           79   78   75   78
D           76   78   76   78
Analyze the data using alpha 0.01. Compare it when alpha is 0.05. Interpret your results.
> Obs1=c(75,79,79,76)
> Obs2=c(76,76,78,78)
> Obs3=c(78,75,75,76)
> Obs4=c(80,76,78,78)
> Treatment=c("A","B","C","D")
> Trmt=data.frame(Treatment,Obs1,Obs2,Obs3,Obs4)
> Trmt
Treatment Obs1 Obs2 Obs3 Obs4
1 A 75 76 78 80
2 B 79 76 75 76
3 C 79 78 75 78
4 D 76 78 76 78
> LAREXIL=stack(Trmt)
Warning message:
In stack.data.frame(Trmt) : non-vector columns will be ignored
> Treat=data.frame(LAREXIL,Treatment)
> Treat
values ind Treatment
1 75 Obs1 A
2 79 Obs1 B
3 79 Obs1 C
4 76 Obs1 D
5 76 Obs2 A
6 76 Obs2 B
7 78 Obs2 C
8 78 Obs2 D
9 78 Obs3 A
10 75 Obs3 B
11 75 Obs3 C
12 76 Obs3 D
13 80 Obs4 A
14 76 Obs4 B
15 78 Obs4 C
16 78 Obs4 D
> anova(lm(values~ind, data=Treat, conf.level=0.99))
Analysis of Variance Table
Response: values
Df Sum Sq Mean Sq F value Pr(>F)
ind 3 8.1875 2.7292 1.065 0.4002
Residuals 12 30.7500 2.5625
Warning message:
In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
extra argument ‘conf.level’ is disregarded.
> anova(lm(values~ind, data=Treat, conf.level=0.95))
Analysis of Variance Table
Response: values
Df Sum Sq Mean Sq F value Pr(>F)
ind 3 8.1875 2.7292 1.065 0.4002
Residuals 12 30.7500 2.5625
Warning message:
In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
extra argument ‘conf.level’ is disregarded.
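Two points about the calls above: `lm()` has no `conf.level` argument (hence the warning), and the model `values~ind` compares the observation columns rather than the light-bulb treatments. A sketch that models `Treatment` and applies both significance levels by comparing the p-value with alpha; the data frame is rebuilt here so the block is self-contained, and `tab` and `p` are illustrative names:

```r
# Weights stacked column by column (Obs1..Obs4), treatments A-D within each
Treat <- data.frame(
  values    = c(75, 79, 79, 76,
                76, 76, 78, 78,
                78, 75, 75, 76,
                80, 76, 78, 78),
  Treatment = factor(rep(c("A", "B", "C", "D"), times = 4))
)

# One-way ANOVA on treatment
tab <- anova(lm(values ~ Treatment, data = Treat))
tab

# The significance level enters only at the decision step
p <- tab["Treatment", "Pr(>F)"]
p < 0.01   # reject H0 at alpha = 0.01?
p < 0.05   # reject H0 at alpha = 0.05?
```

The same F table serves both levels; only the comparison with alpha changes.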