Sampling
Sampling from larger datasets is a common practice in data mining and machine learning. For example, you may want to select two random samples, creating a predictive model from one and validating its effectiveness on the other.
The sample() function enables you to take a random sample (with or without replacement) of size n from a dataset. You could take a random sample of size 3 from the leadership dataset using the statement:
mysample <- leadership[sample(1:nrow(leadership), 3, replace=FALSE),]
The first argument to the sample() function is a vector of elements to choose from. Here, the vector is the integers from 1 to the number of observations in the data frame. The second argument is the number of elements to be selected, and the third argument indicates sampling without replacement. The sample() function returns the randomly sampled elements, which are then used as row indices to select rows from the data frame.
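As a rough illustration of the two-sample idea mentioned earlier, the sketch below splits a data frame into a training set and a validation set. It uses the built-in mtcars data rather than leadership so that it runs on its own; the 70/30 split, the seed value, and the names train_idx, train, and valid are assumptions chosen for the example, not part of the original text.

```r
# Fix the seed so the same rows are drawn on every run (1234 is arbitrary).
set.seed(1234)

n <- nrow(mtcars)

# Draw 70% of the row indices without replacement for the training set.
train_idx <- sample(seq_len(n), size = floor(0.7 * n), replace = FALSE)

train <- mtcars[train_idx, ]    # rows that were drawn
valid <- mtcars[-train_idx, ]   # every row that was not drawn

nrow(train)   # 22
nrow(valid)   # 10
```

Because the indices are drawn without replacement, negative indexing with -train_idx leaves exactly the rows that were not selected, so the two sets do not overlap.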
R has extensive facilities for sampling, including drawing and calibrating survey samples (see the sampling package) and analyzing complex survey data (see the survey package).
Other methods that rely on sampling, including bootstrapping and resampling statistics, are described later.
Using SQL statements to manipulate data frames
Until now, you’ve been using R statements to manipulate data. But many data analysts come to R well versed in Structured Query Language (SQL). It would be a shame to lose all that accumulated knowledge. Therefore, before we end, let me briefly mention the sqldf package, which lets you run SQL SELECT statements against data frames. (If you’re unfamiliar with SQL, please feel free to skip this section.)
# install.packages("sqldf")
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
## Loading required package: RSQLite
newdf <- sqldf("select * from mtcars where carb=1 order by mpg", row.names = TRUE)
newdf
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
s <- sqldf("select avg(mpg) as avg_mpg, avg(disp) as avg_disp, gear from mtcars where cyl in (4,6) group by gear")
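For readers who want to relate the grouped query to the R statements used earlier, here is a minimal base R sketch that should compute the same averages. It assumes aggregate() is an acceptable stand-in for the SQL group by; the names subset_cars and s_base are made up for this example, and the column order differs from the sqldf result (gear comes first).

```r
# Keep only 4- and 6-cylinder cars (the SQL "where cyl in (4,6)" clause).
subset_cars <- mtcars[mtcars$cyl %in% c(4, 6), ]

# Average mpg and disp within each gear value (the "group by gear" clause).
s_base <- aggregate(cbind(mpg, disp) ~ gear, data = subset_cars, FUN = mean)

# Rename the averaged columns to match the aliases used in the SQL statement.
names(s_base) <- c("gear", "avg_mpg", "avg_disp")
s_base
```

The result has one row per gear value found among the 4- and 6-cylinder cars.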