Analytics Recipes : Some Exercises in R

Representation of Data and Frequency Distributions

Problem Set 1, (a)

states<-c("Punjab","Haryana","Tamil Nadu","Andhra Pradesh",
"Uttar Pradesh","Others")
procurement<-c(5486,1248,589,3987,1296,1654)
rice.proc<-data.frame(states,procurement)
rice.proc<-transform(rice.proc,percentage=(rice.proc$procurement/14260)*100)
rice.proc

##           states procurement percentage
## 1         Punjab        5486     38.471
## 2        Haryana        1248      8.752
## 3     Tamil Nadu         589      4.130
## 4 Andhra Pradesh        3987     27.959
## 5  Uttar Pradesh        1296      9.088
## 6         Others        1654     11.599

attach(rice.proc)
pie(procurement,labels=c("Punjab(38.47%)","Haryana(8.75%)","Tamil Nadu(4.13%)",
"Andhra Pradesh(27.96%)","Uttar Pradesh(9.09%)","Others(11.60%)"),
main="Procurement of Rice('000 tonnes)\nduring Oct. 1993 to Sept. 1994",
col=c("royalblue4","orchid","chocolate4","olivedrab3","orange3","purple1","violetred4"))

plot of chunk unnamed-chunk-1

detach(rice.proc)

Comment: Procurement of rice is maximum (38.47%) for Punjab during the given time period, this implies that there was a big crisis for rice in Punjab i.e. production of rice is very poor. Similarly procurement is minimum (4.13%)for Tamil Nadu implying a good production of rice there for that period.

Problem Set 1, (b)

consumption<-matrix(c(6.93,4.70,0.95,0.56,0.52,5.45,4.84,0.42,0.19,0.15),
nrow=2,ncol=5,byrow=T)
rownames(consumption)<-c("Rural","Urban")
colnames(consumption)<-c("Rice","Wheat","Jowar","Bazra","Maize")
consumption

##       Rice Wheat Jowar Bazra Maize
## Rural 6.93  4.70  0.95  0.56  0.52
## Urban 5.45  4.84  0.42  0.19  0.15

barplot(consumption,beside=T,main="Consumption of cereals (in kg./month)\n
during 1989-'90",legend.text=rownames(consumption))

plot of chunk unnamed-chunk-2

Comment: Except wheat, consumption is higher for other cereals for rural India than the urban. Even it is remarkably higher for rice. This indicate rural Indians are mostly dependent on rice. Wheat is consumed in more or less same quantity in rural & urban areas. In some parts rural peoples consumed Jowar, Bazra & Maize in a relatively small amount than rice. But the overall consumption is higher for rural people, which indicates that rural Indians are dependent over these cereals, where as urban Indians are familiar with modern foods.

Problem Set 2, (a)

obv<-c(11.5,8.8,10.1,8.2,9.3,10,9.7,10.1,10.3,10.3,11.3,9.8,
10.7,9.8,10.7,9.8,9.3,8.6,10.4,9.8,11.3,8.4,9,10.7,9.6,11.2)
freq.tab<-table(obv) #creates frequency table
freq<-unname(freq.tab)
x<-as.numeric(names(freq.tab)) #set of distinct observations
am=sum(x*freq)/sum(freq) #arithmatic mean
am

## [1] 9.95

gm=prod(x^freq)^(1/sum(freq)) #geometric mean
gm

## [1] 9.91

hm=sum(freq)/sum(freq/x) #harmonic mean
hm

## [1] 9.869

mode<-x[which(freq==max(freq))] #obs. with max. frequency
mode

## [1] 9.8

obv<-sort(obv) #ordered sample
n<-length(obv)
median<-(obv[n/2]+obv[(n/2)+1])/2
median

## [1] 9.9

std.dev<-sqrt(sum((obv-mean(obv))^2)/n)
std.dev

## [1] 0.8863

md1<-sum(abs(obv-10))/n #mean dev. about 10
md1

## [1] 0.7115

md2<-sum(abs(obv-mean(obv)))/n #mean dev. about mean
md2

## [1] 0.7115

range<-max(obv)-min(obv) #range
range

## [1] 3.3

quantile(obv)

##     0%    25%    50%    75%   100% 
##  8.200  9.375  9.900 10.625 11.500

u<-unname(quantile(obv))
qrtl.dev<-u[4]-u[2] #quartile deviation
qrtl.dev

## [1] 1.25

m1<-sum(x*freq)/sum(freq)  #1st order raw moment
m1

## [1] 9.95

m2<-sum(x^2*freq)/sum(freq)#2nd order raw moment
m2

## [1] 99.79

m3<-sum(x^3*freq)/sum(freq)#3rd order raw moment
m3

## [1] 1008

m4<-sum(x^4*freq)/sum(freq)#4th order raw moment
m4

## [1] 10266

M2<-(m2-m1^2)             #2nd order central moment
M2

## [1] 0.7856

M3<-m3-(3*m2*m1)+(2*m1^3) #3rd order central moment
M3

## [1] -0.096

M4<-m4-(4*m3*m1)+(6*m2*m1^2)-(3*m1^4) #4th order central moment
M4

## [1] 1.434

g1<-M3/(std.dev^3)        #1st measure of skewness
g1

## [1] -0.1379

b1<-g1^2                  #2nd measure of skewness
b1

## [1] 0.01901

g2<-(M4/(std.dev^4))-3    #1st measure of kurtosis
g2

## [1] -0.6759

b2<-g2+3                  #2nd measure of kurtosis
b2

## [1] 2.324

Comment: On the basis of g1 & g2 we can say that the distribution is negatively skewed and platykurtic.

Problem Set 2, (b)

x<-1:8 #observations
freq<-c(21,25,43,51,40,39,12,10)
obv<-rep(x,freq)
am=sum(x*freq)/sum(freq) #arithmatic mean
am

## [1] 4.158

gm=prod(x^freq)^(1/sum(freq)) #geometric mean
gm

## [1] 3.675

hm=sum(freq)/sum(freq/x) #harmonic mean
hm

## [1] 3.088

mode<-x[which(freq==max(freq))] #obs. with max. frequency
mode

## [1] 4

obv<-sort(obv) #ordered sample
n<-length(obv)
median<-(obv[n/2]+obv[(n/2)+1])/2
median

## [1] 4

std.dev<-sqrt(sum((obv-mean(obv))^2)/n)
std.dev

## [1] 1.811

md1<-sum(abs(obv-10))/n #mean dev. about 10
md1

## [1] 5.842

md2<-sum(abs(obv-mean(obv)))/n #mean dev. about mean
md2

## [1] 1.478

range<-max(obv)-min(obv) #range
range

## [1] 7

quantile(obv)

##   0%  25%  50%  75% 100% 
##    1    3    4    6    8

u<-unname(quantile(obv))
qrtl.dev<-u[4]-u[2] #quartile deviation
qrtl.dev

## [1] 3

m1<-sum(x*freq)/sum(freq)  #1st order raw moment
m1

## [1] 4.158

m2<-sum(x^2*freq)/sum(freq)#2nd order raw moment
m2

## [1] 20.56

m3<-sum(x^3*freq)/sum(freq)#3rd order raw moment
m3

## [1] 113.3

m4<-sum(x^4*freq)/sum(freq)#4th order raw moment
m4

## [1] 673.3

M2<-(m2-m1^2)             #2nd order central moment
M2

## [1] 3.278

M3<-m3-(3*m2*m1)+(2*m1^3) #3rd order central moment
M3

## [1] 0.5451

M4<-m4-(4*m3*m1)+(6*m2*m1^2)-(3*m1^4) #4th order central moment
M4

## [1] 25.47

g1<-M3/(std.dev^3)        #1st measure of skewness
g1

## [1] 0.09184

b1<-g1^2                  #2nd measure of skewness
b1

## [1] 0.008434

g2<-(M4/(std.dev^4))-3    #1st measure of kurtosis
g2

## [1] -0.6294

b2<-g2+3                  #2nd measure of kurtosis
b2

## [1] 2.371

Comment: Comment: On the basis of the measures g1 & g2 we can say that this distribution is positively skewed and platykurtic.

Problem Set 2, (c)

height<-seq(147.05,182.05,5)
freq<-c(1,3,24,58,60,27,2,2)
a<-rep(height,freq)
hist(a,main="Histogram of freq. distn. of height
\nof 177 Indian adult males",ylim=c(0,70),xlab="Height(cm.)",
ylab="Frequency Density",col="darkseagreen2")

plot of chunk unnamed-chunk-5

mean<-sum(height*freq)/sum(freq) #A.M
mean

## [1] 164.7

dfrm<-data.frame(height,freq)
dfrm<-transform(dfrm,cumfreq.l=cumsum(freq))
attach(dfrm)

## The following objects are masked _by_ .GlobalEnv:
## 
##     freq, height

me.c<-min(which(dfrm$cumfreq.l>=88.5))#'mc' for median class
me.l=164.55;me.u=169.55 #class boundaries of median class 
f0=freq[me.c] #freq. of median class
cl.width=5 #class width
median<-me.l+(sum(freq)/2-cumfreq.l[me.c-1])*cl.width/f0 #Median
median

## [1] 164.8

mo.c<-which(freq==max(freq)) #modal class
mo.l<-height[mo.c]-2.5;mo.u<-height[mo.c]+2.5 #cls. boundaries of modal cls.
mode<-mo.l+
(((freq[mo.c]-freq[mo.c-1])*cl.width)/(2*freq[mo.c]-freq[mo.c-1]-freq[mo.c+1])) #Mode
mode

## [1] 164.8

Comment: Mean, median and mode are more or less equal, it implies that the frequency distribution of height of 177 adult Indian males is more or less symmetric. A histogram of this distribution gives the graphical proof.

Problem Set 2, (d)

word.length<-c(5,4,3,5,8,6,6,3,4,3,4,4,5,8,2,6,7,6,4,5,6,4,9,6,4,2,
2,2,9,2,3,3,3,2,4,7,7,2,4,4,4,3,4,4,2,4,4,9,3,7,4,5,12,6,3,5,2,5,10,
3,5,7,3,3,3,6,2,5,3,3,3,2,4,5,8,5,3,4,4,6,7,2,3,5,5,5,3,2,4,5)
fr.tab<-table(word.length) #frequency table
fr.tab

## word.length
##  2  3  4  5  6  7  8  9 10 12 
## 13 19 20 15  9  6  3  3  1  1

freq<-unname(fr.tab) #extracts the frequencies only
freq

##  [1] 13 19 20 15  9  6  3  3  1  1

wl<-as.numeric(names(fr.tab))
plot(wl,freq,type="h",lwd=10,main="Column Diagram for Word Length Data",
xlab="word length",ylab="Frequency")

plot of chunk unnamed-chunk-6

plot(wl,freq,type="l",lwd=2,main="Frequency Polygon for Word Length Data",
xlab="Word length",ylab="Frequency")
grid(10,10,col="black",lty=1,lwd=1.25)

plot of chunk unnamed-chunk-6

cum.fr.l<-cumsum(freq)
cum.fr.g<-rev(cumsum(rev(freq)))
plot(wl,cum.fr.l,type="s",main="Cumulative Frequency Diagram\nWord length Data",
xlab="Word length",ylab="Cum. freq",col="red")
lines(wl,cum.fr.g,type="s",col="blue",lty=1,lwd=1)
legend(7.5,45,legend=c("less than type","greater than type"),
       col=c("red","blue"),lty=c(1,1),lwd=c(1,1))

plot of chunk unnamed-chunk-6

Problem Set 2, (e)

height<-seq(147.05,182.05,5)
freq<-c(1,3,24,58,60,27,2,2)
b<-rep(height,freq)
hist(b,freq=T,main="Histogram of freq. distn. of height
\nof 177 Indian adult males",ylim=c(0,70),xlab="Height(cm.)",
ylab="Frequency Density",col="orange")

plot of chunk unnamed-chunk-7

Comment: The frequency distribution of 177 Indian adult males seems to be more or less symmetric height measure 165 cm.

Problem Set 3,(d)

library(car)
res<-rchisq(50,df=5)
par(mfrow=c(1,2))
qqPlot(res,"norm",mean=0,sd=1,main="Normal Q-Q Plot",xlab="Theoretical Quantiles",
ylab="Sample Quantiles",col="dark green",envelope=F)
box()
qqPlot(res,"chisq",df=5,main="Chi-squared Q-Q Plot",xlab="Theoretical Quantiles",
ylab="Sample Quantiles",col="dark green",envelope=F) 
box()

plot of chunk unnamed-chunk-8

par(mfrow=c(1,1))

Problem Set 4,(a)

#Hypergeometric Distribution
N=50;p=0.08;m=10
(a1<-rhyper(50,N*p,N*(1-p),m))

##  [1] 0 2 3 1 0 4 0 0 0 1 1 1 1 2 3 0 1 1 2 2 0 1 1 1 0 0 0 0 1 1 1 0 1 0 2
## [36] 2 2 0 0 1 0 1 0 1 0 0 1 0 1 0

hist(a1,prob=T,breaks=40,main="Hypergeometric Distribution")

plot of chunk unnamed-chunk-9

#Binomial Distribution
(a2=rbinom(50,size=10,prob=0.08))

##  [1] 1 1 0 1 0 0 0 1 1 1 1 1 3 1 0 0 2 0 1 2 1 0 1 0 1 2 2 1 1 0 1 0 1 0 1
## [36] 1 0 1 1 3 0 2 2 0 0 0 1 2 1 1

hist(a2,prob=T,breaks=40,main="Binomial Distribution")

plot of chunk unnamed-chunk-9

#Poisson Distribution
(a3=rpois(50,lambda=5))

##  [1]  5  6  8  0  7 11  2  6  1  5  4  1  5  3  2  4  1  6 13  3  4  4  4
## [24]  4  5  6  3  4  2  4  7  2  6  6  7  7  3  4  7  4  3  6  5  5  5  6
## [47]  6  2  5  3

hist(a3,prob=T,breaks=40,main="Poisson Distribution")

plot of chunk unnamed-chunk-9

#Negative binomial Distribution
(a4=rnbinom(50,size=2,p=0.3))

##  [1]  3  3  4  4  1  3  0  4  2  5  4  6 12  8  1  4 10  2  2  2  0  0  1
## [24]  0  4  4  5  4  1  6  2  1  4  3  4 12  7  4  4  4  2  3  2  2  0  5
## [47]  8  5  2  2

hist(a4,prob=T,breaks=40,main="Negative binomial distribution")

plot of chunk unnamed-chunk-9

#Uniform Distribution
(a5=runif(50,min=-1,max=1))

##  [1]  0.83248  0.43611  0.64405  0.12942 -0.83009  0.36332 -0.88719
##  [8] -0.12689 -0.89908  0.23645  0.36525 -0.65092 -0.92070  0.31048
## [15] -0.11837  0.38925  0.61704  0.07788  0.28311  0.71714  0.48361
## [22]  0.39368 -0.15009  0.73115  0.55297 -0.76352 -0.41193 -0.28702
## [29] -0.07498 -0.19975  0.44213  0.51524  0.01614 -0.28080  0.07127
## [36]  0.10289  0.48865 -0.78649 -0.67313  0.57839 -0.46182 -0.53311
## [43] -0.61379  0.01316  0.21253  0.11320  0.03991  0.31441 -0.62265
## [50] -0.39472

hist(a5,prob=T,breaks=40,main="Uniform distribution")

plot of chunk unnamed-chunk-9

#Exponential Distribution
(a6=rexp(50,rate=1))

##  [1] 0.021322 2.818142 1.665370 1.127690 0.408367 3.226427 0.010526
##  [8] 3.191992 1.392076 1.246930 0.084053 1.352184 1.077719 0.069680
## [15] 0.852086 0.215419 0.919785 1.619913 0.190371 0.316022 1.819239
## [22] 1.070554 1.150554 1.368658 0.049182 1.527954 2.501414 0.433714
## [29] 1.749039 3.569746 2.025098 0.682150 0.991434 0.482948 0.427279
## [36] 0.492884 0.838371 0.884066 0.684527 0.733355 0.497777 0.425395
## [43] 0.678024 0.407726 0.003668 0.529263 0.100722 0.165243 3.525454
## [50] 2.918235

hist(a6,prob=T,breaks=40,main="Exponential distribution")

plot of chunk unnamed-chunk-9

#Gamma Distribution
(a7=rgamma(50,shape=2,scale=1/0.5))

##  [1]  1.5194  8.7323  5.0112  5.3808  2.5689  3.7560  0.9024  5.1878
##  [9]  8.8795  6.7652  0.8639  6.3527  7.8125  3.7566  5.7860 10.5738
## [17]  5.1313  3.2750  0.7868  5.2295  1.1398  7.9196  2.8757  4.4374
## [25]  0.2272  2.0854  1.7733  5.8107  3.2011  7.2478  9.2464  1.9531
## [33]  1.8189  4.5381  0.4502  3.8431  4.6504  6.3185  3.8842  3.1025
## [41]  9.7407  7.6229  1.4624  5.3390  4.3490  4.4336 11.1100  5.0727
## [49]  4.0605  0.9832

hist(a7,prob=T,breaks=40,main="Gamma distribution")

plot of chunk unnamed-chunk-9

#Normal Distribution
(a8=rnorm(50,mean=2,sd=0.9))

##  [1]  2.26133  1.62027 -0.25955  1.93170  2.20045  1.64113  2.28744
##  [8]  2.98327  2.25620  0.52294  2.50472  2.34229 -0.43761  2.01171
## [15]  2.82413  2.04092  1.68098  2.44506  2.20083  3.61741  3.81555
## [22]  0.61219  2.37247 -0.42067  1.55787  1.34497  2.30083  1.32917
## [29]  1.44494  2.91677  2.75784  0.92687  1.19043  1.39576  3.05951
## [36]  2.09502  1.23137  0.41385  0.10059  1.50195  2.27822  2.87851
## [43] -0.04716  0.49346  2.35194  1.96581  1.51957  2.16626  1.56421
## [50]  2.31095

hist(a1,prob=T,breaks=40,main="Norma distribution, N(2,0.81)")

plot of chunk unnamed-chunk-9

(rnorm(50,mean=2,sd=0.5))

##  [1] 1.1892 2.8363 2.5936 2.3212 1.9049 2.5503 2.5789 2.4428 2.2551 2.4952
## [11] 1.0741 2.6539 1.9142 1.9648 0.9494 1.7294 1.9934 1.8712 1.8771 2.6616
## [21] 2.7913 2.4826 2.6862 2.0696 1.2455 2.0866 2.3095 2.5894 2.2323 2.2689
## [31] 2.4324 1.8221 1.9627 3.2848 2.6822 1.4466 2.2866 2.9505 1.9575 2.6219
## [41] 2.9105 1.9332 2.1261 1.7391 2.3877 1.7545 2.2162 2.0320 1.8253 1.5066

(rnorm(50,mean=2,sd=1.5))

##  [1]  1.876242  6.037625 -0.451847  1.373633  0.532232  3.510680  3.340926
##  [8] -0.007715  0.088956  4.453317  2.911617  0.921625  1.311006 -0.743663
## [15]  2.808198  2.514291  3.509133  2.187453  1.499554  1.807960  2.455559
## [22]  1.474674  1.485533  0.930782  2.142526  3.338360  1.575877  1.907328
## [29]  2.005079  2.448050  2.132124  1.944677  3.466303 -0.168645  3.908298
## [36]  0.821826  2.516435  5.170779  2.320600  3.440130  1.425027  2.446219
## [43]  1.769183  2.097993  1.625106  2.596603  0.977233 -0.519476  1.495581
## [50] -0.776313

#Log-Normal Distribution
(a9=rlnorm(50,meanlog=0,sdlog=1))

##  [1]  0.3005  1.7389 13.1347  4.2291  0.6998  1.8213  0.2785  2.1887
##  [9]  1.4761  0.6681  0.8359  1.4582  0.3899  3.0194  1.4543  1.4540
## [17]  0.4870  3.6922  1.5046  0.7333  0.4288  1.6527  0.6560  0.8621
## [25]  0.4482  0.3156  1.4359  1.5796  5.8439  0.8201  4.7405  1.3964
## [33]  2.5449  4.5905  1.6255  2.0441  0.8883  0.2330  1.0264  0.7197
## [41]  0.4117  0.9750  0.5959  2.5151  0.5787  1.2945  0.3047  1.4198
## [49]  1.4584  0.5964

hist(a9,prob=T,breaks=40,main="Log-Normal distribution")

plot of chunk unnamed-chunk-9

#Beta Distribution
(a10=rbeta(50,shape1=1.5,shape2=1.4))

##  [1] 0.66212 0.61942 0.20544 0.95167 0.91171 0.95260 0.40890 0.54824
##  [9] 0.29276 0.55729 0.81590 0.56886 0.56937 0.03990 0.26728 0.46295
## [17] 0.33556 0.06203 0.06079 0.41014 0.48468 0.47872 0.24377 0.76013
## [25] 0.56997 0.27136 0.29022 0.51441 0.44391 0.87780 0.85589 0.09426
## [33] 0.54265 0.15311 0.46802 0.70016 0.17142 0.86240 0.23439 0.54585
## [41] 0.58151 0.69536 0.46311 0.64763 0.05720 0.85511 0.55383 0.55512
## [49] 0.19872 0.68647

hist(a10,prob=T,breaks=40,main="Beta distribution, Beta(1.5,1.4)")

plot of chunk unnamed-chunk-9

(rbeta(50,shape1=1.1,shape2=1.4))

##  [1] 0.32144 0.96026 0.42660 0.36804 0.62129 0.82827 0.94386 0.62923
##  [9] 0.30584 0.92903 0.55857 0.96414 0.31448 0.14435 0.84991 0.69109
## [17] 0.20793 0.49479 0.51844 0.41784 0.24067 0.20343 0.52068 0.38265
## [25] 0.51435 0.32633 0.16032 0.64395 0.42573 0.22513 0.13632 0.04522
## [33] 0.71912 0.23443 0.51307 0.44959 0.54495 0.18560 0.17618 0.21729
## [41] 0.33008 0.28586 0.66376 0.16088 0.39746 0.92421 0.14569 0.85733
## [49] 0.06345 0.73436

(rbeta(50,shape1=1.5,shape2=1.1))

##  [1] 0.22491 0.97405 0.57967 0.47966 0.42494 0.20674 0.88001 0.28277
##  [9] 0.70045 0.62073 0.79499 0.03310 0.53514 0.63557 0.91030 0.21056
## [17] 0.05013 0.30046 0.79456 0.49406 0.86579 0.98466 0.43654 0.14420
## [25] 0.82091 0.87598 0.33492 0.90129 0.06676 0.90037 0.78858 0.45385
## [33] 0.15637 0.62456 0.45036 0.79109 0.47455 0.77335 0.76393 0.62468
## [41] 0.20989 0.96819 0.32499 0.99826 0.79783 0.44038 0.40867 0.63830
## [49] 0.64034 0.56435

Analytics Recipes

Sunday, 14 September 2014

Some Exercises in R

Representation of Data and Frequency Distributions

Koushik Khan

Tuesday, September 09, 2014

Problem Set 1, (a)

Problem Set 1, (b)

Problem Set 2, (a)

Problem Set 2, (b)

Problem Set 2, (c)

Problem Set 2, (d)

Problem Set 2, (e)

Problem Set 3,(d)

Problem Set 4,(a)

No comments:

Post a Comment

About Me