Random samples


Sampling from larger datasets is a common practice in data mining and machine learning. For example, you may want to select two random samples, creating a predictive model from one and validating its effectiveness on the other.
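A minimal sketch of such a two-sample split. The built-in mtcars data frame, the 70/30 proportion, and the seed are assumptions chosen for illustration, not part of the original example; the split relies on sample(), which the rest of this section describes.

```r
# Assumption: mtcars stands in for "a larger dataset"; 70/30 is an arbitrary split.
set.seed(1234)  # assumed seed, only so the split can be reproduced

n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n), replace = FALSE)

train <- mtcars[train_idx, ]   # sample used to build a model
valid <- mtcars[-train_idx, ]  # remaining rows, used to validate it
```

Because the validation rows are selected with a negative index, the two samples never overlap and together cover every row of the data frame.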

The sample() function enables you to take a random sample (with or without replacement) of size n from a dataset. You could take a random sample of size 3 from the leadership dataset using the statement:

mysample <- leadership[sample(1:nrow(leadership), 3, replace=FALSE),]

The first argument to the sample() function is a vector of elements to choose from. Here, the vector holds the integers from 1 to the number of observations in the data frame. The second argument is the number of elements to be selected, and the third argument indicates sampling without replacement. The sample() function returns the randomly sampled elements, which are then used to select rows from the data frame.
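The effect of the third argument can be seen in a short sketch. It assumes the built-in mtcars data frame in place of leadership (which is not defined in this post) and an arbitrary seed; both are illustrative choices.

```r
# Assumption: mtcars replaces the leadership data frame used in the text.
set.seed(42)  # assumed seed, only to make the draws repeatable

rows <- seq_len(nrow(mtcars))  # same values as 1:nrow(mtcars) here

# Without replacement: each row index can be drawn at most once
idx_no_repl <- sample(rows, 3, replace = FALSE)

# With replacement: the same row index may be drawn more than once,
# so the sample size may even exceed the number of rows
idx_repl <- sample(rows, 50, replace = TRUE)

# The sampled indices select rows from the data frame
mysample <- mtcars[idx_no_repl, ]
```

Calling set.seed() before sample() is optional; without it, each run returns a different sample.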

R has extensive facilities for sampling, including drawing and calibrating survey samples (see the sampling package) and analyzing complex survey data (see the survey package).

Other methods that rely on sampling, including bootstrapping and resampling statistics, are described later.

Using SQL statements to manipulate data frames

Until now, you’ve been using R statements to manipulate data. But many data analysts come to R well versed in Structured Query Language (SQL). It would be a shame to lose all that accumulated knowledge. Therefore, before we end, let me briefly mention the existence of the sqldf package. (If you’re unfamiliar with SQL, please feel free to skip this section.)

# install.packages("sqldf")
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
## Loading required package: RSQLite
newdf <- sqldf("select * from mtcars where carb=1 order by mpg", row.names = TRUE)

newdf
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
s <- sqldf("select avg(mpg) as avg_mpg, avg(disp) as avg_disp, gear from mtcars where cyl in (4,6) group by gear")
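For readers mapping the SQL back to R statements, the two queries above can be approximated in base R. This is a sketch of one possible translation, not the way sqldf works internally; the names newdf_base and s_base are hypothetical.

```r
# Roughly: select * from mtcars where carb=1 order by mpg
carb1 <- mtcars[mtcars$carb == 1, ]
newdf_base <- carb1[order(carb1$mpg), ]  # row names are kept by default

# Roughly: select avg(mpg) as avg_mpg, avg(disp) as avg_disp, gear
#          from mtcars where cyl in (4,6) group by gear
s_base <- aggregate(cbind(mpg, disp) ~ gear,
                    data = subset(mtcars, cyl %in% c(4, 6)),
                    FUN = mean)
names(s_base) <- c("gear", "avg_mpg", "avg_disp")  # match the SQL aliases
```

Note that the column order differs from the SQL result: aggregate() places the grouping variable gear first.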
Tank (Xiao-Ning Zhang)
PhD Student @ Data Miner & Coder

I’m a PhD student majoring in Bioinformatics and Biostatistics who loves programming in languages such as C/C++, Java, Python, and R.
