Data
Info about variables:
- Sales - Unit sales (in thousands) at each location
- CompPrice - Price charged by competitor at each location
- Income - Community income level (in thousands of dollars)
- Advertising - Local advertising budget for company at each location (in thousands of dollars)
- Population - Population size in region (in thousands)
- Price - Price company charges for car seats at each site
- ShelveLoc - A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
- Age - Average age of the local population
- Education - Education level at each location
- Urban - A factor with levels No and Yes to indicate whether the store is in an urban or rural location
- US - A factor to indicate whether the store is in the US or not
Libraries
#install.packages("tidyverse")
#install.packages("GGally")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Exercises:
Reading data
- Visually check the file for missing values. Use the read_delim command (indicating the symbol for missing values if needed) to read in the data set Carseats_mod.csv. Make sure that all variables that, according to the description, should be numeric are numeric after the data are read in.
carseats = read_delim("Carseats_mod.csv", delim=",", na = c(".", "NA"))
## Rows: 400 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): ShelveLoc, Urban, US
## dbl (8): Sales, CompPrice, Income, Advertising, Population, Price, Age, Educ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
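Missing values in the file are represented by the "." symbol. With that symbol declared in the na argument, the column specification shows that all eight variables described as numeric were read as dbl.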
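To illustrate why the na argument matters, here is a minimal sketch on a made-up three-row file. The toy_csv contents are an assumption for illustration, not rows from Carseats_mod.csv. When "." is not declared as missing, the column containing it is read as character; when it is declared, the column stays numeric.

```r
library(readr)

# Hypothetical inline data; I() tells readr to treat the string as file contents
toy_csv <- I("Sales,Advertising,US\n9.5,11,Yes\n7.4,.,No\n10.1,3,Yes\n")

# "." is not declared as missing, so Advertising is parsed as character
raw <- read_delim(toy_csv, delim = ",", show_col_types = FALSE)

# "." is declared as missing, so Advertising stays numeric with one NA
clean <- read_delim(toy_csv, delim = ",", na = c(".", "NA"), show_col_types = FALSE)

is.character(raw$Advertising)
is.numeric(clean$Advertising)
```

Both calls should return TRUE, which is a quick way to confirm that the missing-value symbol was the only thing preventing numeric parsing.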
- Use the summary() function to look at summary information about the variables. Do you see anything which may indicate data errors? What? How many missing values are in the data set?
summary(carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77.0 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115.0 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :124.5 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :124.9 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135.0 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175.0 Max. :120.00 Max. :29.000
## NA's :2
## Population Price ShelveLoc Age
## Min. : 10.0 Min. : 24.0 Length:400 Min. : 9.0
## 1st Qu.:139.0 1st Qu.:100.0 Class :character 1st Qu.:39.0
## Median :272.0 Median :117.0 Mode :character Median :54.0
## Mean :264.8 Mean :115.8 Mean :53.2
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.0
## Max. :509.0 Max. :191.0 Max. :80.0
## NA's :1
## Education Urban US
## Min. :10.0 Length:400 Length:400
## 1st Qu.:12.0 Class :character Class :character
## Median :14.0 Mode :character Mode :character
## Mean :13.9
## 3rd Qu.:16.0
## Max. :18.0
##
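There are 3 missing values in the data set (the NA's rows in the summary report 2 and 1). A minimum Age of 9 may indicate a data error, since Age is the average age of the local population.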
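The count of missing values can also be computed directly instead of being read off the summary. The sketch below uses a made-up data frame (toy and its values are assumptions for illustration); the same two calls could be applied to carseats.

```r
# Hypothetical toy data with three missing values
toy <- data.frame(
  Advertising = c(11, NA, 3, NA),
  Age = c(42, 65, NA, 50)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Total number of missing values in the data frame
total_na <- sum(is.na(toy))

print(na_per_column)
print(total_na)
```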
Data cleaning
- Use the na.omit() command to define a new data set carseats2 where rows with missing values in the original data are left out.
carseats2 <- na.omit(carseats)
summary(carseats2)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.00
## 1st Qu.: 5.360 1st Qu.:115 1st Qu.: 43.00 1st Qu.: 0.00
## Median : 7.490 Median :125 Median : 69.00 Median : 5.00
## Mean : 7.472 Mean :125 Mean : 68.66 Mean : 6.61
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.00
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.00
## Population Price ShelveLoc Age
## Min. : 10.0 Min. : 24.0 Length:397 Min. : 9.00
## 1st Qu.:139.0 1st Qu.:100.0 Class :character 1st Qu.:39.00
## Median :272.0 Median :117.0 Mode :character Median :54.00
## Mean :264.9 Mean :115.9 Mean :53.11
## 3rd Qu.:398.0 3rd Qu.:131.0 3rd Qu.:66.00
## Max. :509.0 Max. :191.0 Max. :80.00
## Education Urban US
## Min. :10.00 Length:397 Length:397
## 1st Qu.:12.00 Class :character Class :character
## Median :14.00 Mode :character Mode :character
## Mean :13.91
## 3rd Qu.:16.00
## Max. :18.00
The new data set contains 397 rows.
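The row count drops from 400 to 397, which equals the number of missing values, so each missing value appears to sit in a different row. A minimal sketch of that check on made-up data (toy is an assumption for illustration) compares the rows removed by na.omit() with the number of incomplete rows:

```r
# Hypothetical toy data: two rows contain a missing value
toy <- data.frame(
  Advertising = c(11, NA, 3, 7),
  Age = c(42, 65, NA, 50)
)

# Rows that have at least one missing value
rows_with_na <- sum(!complete.cases(toy))

# na.omit() drops exactly those rows
toy2 <- na.omit(toy)

nrow(toy) - nrow(toy2) == rows_with_na
```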
- Use the select() command to select the column Sales and the columns from Price to US from carseats2, and apply the ggpairs() function to the result. Are any possible data errors visible from the output?
carseats2 %>%
select(Sales, Price:US) %>%
ggpairs()
The US factor has 3 different levels: Yes, No and yes. The lowercase yes is most likely a miscoded Yes.
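Inconsistent labels like this can also be spotted without a plot by tabulating the values of a categorical column. The sketch below uses a made-up vector (toy_us is an assumption for illustration); on the real data the same call could be applied to carseats2$US.

```r
# Hypothetical toy values with one miscoded label
toy_us <- c("Yes", "No", "yes", "Yes", "No", "Yes")

# Frequency of each distinct label; a rare near-duplicate is suspicious
level_counts <- table(toy_us)

print(level_counts)
```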
Data transformation
- Use the mutate() command with if_else() to correct the data error in one of the factor variables. Store the corrected data under the same name, carseats2.
levels(as.factor(carseats2$US))
## [1] "No" "yes" "Yes"
carseats2 <- carseats2 %>%
mutate(US = if_else(US == "yes", "Yes", US))
levels(as.factor(carseats2$US))
## [1] "No" "Yes"
Now the US factor has 2 levels: Yes and No.
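The variable description calls ShelveLoc, Urban and US factors, while read_delim stores them as character columns. One possible follow-up, which is not part of the exercise, is to convert them with factor() and an explicit level order. The sketch below uses made-up values (toy is an assumption for illustration), and ordering ShelveLoc as Bad, Medium, Good is a choice made here for readability.

```r
# Hypothetical toy data with character columns
toy <- data.frame(
  ShelveLoc = c("Bad", "Good", "Medium", "Good"),
  US = c("Yes", "No", "Yes", "Yes")
)

# Convert to factors with an explicit, meaningful level order
toy$ShelveLoc <- factor(toy$ShelveLoc, levels = c("Bad", "Medium", "Good"))
toy$US <- factor(toy$US, levels = c("No", "Yes"))

levels(toy$ShelveLoc)
levels(toy$US)
```

With explicit levels, any value outside the listed set becomes NA, so a leftover miscoded label would show up as a missing value rather than as an extra level.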
Summary statistics
- Use the summarize() command from the dplyr package to compute a table with the mean sales amount for each combination of Urban and ShelveLoc.
carseats2 %>%
group_by(Urban, ShelveLoc) %>%
summarize(MeanSales = mean(Sales))
## # A tibble: 6 × 3
## # Groups: Urban [2]
## Urban ShelveLoc MeanSales
## <chr> <chr> <dbl>
## 1 No Bad 5.55
## 2 No Good 9.75
## 3 No Medium 7.24
## 4 Yes Bad 5.52
## 5 Yes Good 10.3
## 6 Yes Medium 7.34
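The "Groups: Urban [2]" line in the output means the result is still grouped by Urban. If an ungrouped table is preferred, summarize() accepts a .groups argument. A minimal sketch on made-up data (toy and its values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical toy data: two Urban/ShelveLoc combinations
toy <- tibble(
  Urban = c("No", "No", "Yes", "Yes"),
  ShelveLoc = c("Bad", "Bad", "Good", "Good"),
  Sales = c(4, 6, 9, 11)
)

# .groups = "drop" returns a plain, ungrouped tibble
mean_sales <- toy %>%
  group_by(Urban, ShelveLoc) %>%
  summarize(MeanSales = mean(Sales), .groups = "drop")

print(mean_sales)
```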
Data visualization
- Use ggplot() with geom_point to produce a scatter plot of Price (on the x-axis) and Sales, and color the points according to shelving location. Discuss the graph: does it show the expected relationships between the variables?
ggplot(carseats2, aes(Price, Sales, color = ShelveLoc)) +
geom_point()
The scatter plot shows two relationships:
- between Price and Sales (car seats with a lower price usually have more sales)
- between ShelveLoc and Sales (car seats with a better shelving location usually have more sales)
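One way to make both relationships easier to see is to overlay a fitted line per shelving location. This is a sketch under assumptions: toy and its values are made up for illustration, and a straight-line fit is only one possible choice of smoother. On the real data the same layers could be added to the carseats2 plot.

```r
library(ggplot2)

# Hypothetical toy data: sales fall with price, Good shelving sells more
toy <- data.frame(
  Price = c(80, 100, 120, 140, 80, 100, 120, 140),
  Sales = c(9, 8, 6, 5, 13, 12, 10, 9),
  ShelveLoc = rep(c("Bad", "Good"), each = 4)
)

# Scatter plot with one linear trend line per shelving location
p <- ggplot(toy, aes(Price, Sales, color = ShelveLoc)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Overall slope of Sales on Price; a negative value matches the plot
slope <- coef(lm(Sales ~ Price, data = toy))[["Price"]]

print(p)
print(slope)
```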