Homework 1

Aleksei Parm

2022-09-08

About

This homework explores the Carseats_mod.csv data set.

GitHub

Data

Info about variables:

  • Sales - Unit sales (in thousands) at each location
  • CompPrice - Price charged by competitor at each location
  • Income - Community income level (in thousands of dollars)
  • Advertising - Local advertising budget for company at each location (in thousands of dollars)
  • Population - Population size in region (in thousands)
  • Price - Price company charges for car seats at each site
  • ShelveLoc - A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
  • Age - Average age of the local population
  • Education - Education level at each location
  • Urban - A factor with levels No and Yes to indicate whether the store is in an urban or rural location
  • US - A factor to indicate whether the store is in the US or not

Libraries

#install.packages("tidyverse")
#install.packages("GGally")

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Exercises:

Reading data

  1. Visually check the file for missing values. Use the read_delim command (indicating the symbol for missing values if needed) to read in the data set Carseats_mod.csv. Make sure that all variables that should be numeric according to the description are numeric after the data is read in.
carseats <- read_delim("Carseats_mod.csv", delim = ",", na = c(".", "NA"))
## Rows: 400 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): ShelveLoc, Urban, US
## dbl (8): Sales, CompPrice, Income, Advertising, Population, Price, Age, Educ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Missing values in the file are represented by the "." symbol.
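A minimal sketch of why the na argument matters for the column types. The three-row excerpt below is invented for illustration and only mimics the layout of Carseats_mod.csv; the explicit col_types specification is an optional addition, not something the exercise requires.

```r
library(readr)

# Hypothetical excerpt in the layout of Carseats_mod.csv (values invented),
# with "." marking a missing value.
csv_text <- I("Sales,CompPrice,Price,US\n9.5,138,120,Yes\n11.22,.,83,Yes\n10.06,113,.,No\n")

# Without na = ".", any column that contains "." is parsed as character.
raw <- read_delim(csv_text, delim = ",", show_col_types = FALSE)
sapply(raw, class)

# With na = c(".", "NA") and explicit column types, the numeric columns
# stay numeric and "." becomes NA.
demo <- read_delim(
  csv_text,
  delim = ",",
  na = c(".", "NA"),
  col_types = cols(
    Sales = col_double(),
    CompPrice = col_double(),
    Price = col_double(),
    US = col_character()
  )
)
sapply(demo, class)
```

Spelling out col_types makes the read fail loudly (with parsing problems) if a supposedly numeric column contains unexpected text, instead of silently turning it into a character column.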


  2. Use the summary() function to look at summary information about the variables. Do you see anything which may indicate data errors? What? How many missing values are in the data set?
summary(carseats)
##      Sales          CompPrice         Income        Advertising    
##  Min.   : 0.000   Min.   : 77.0   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115.0   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :124.5   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :124.9   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135.0   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175.0   Max.   :120.00   Max.   :29.000  
##                   NA's   :2                                        
##    Population        Price        ShelveLoc              Age      
##  Min.   : 10.0   Min.   : 24.0   Length:400         Min.   : 9.0  
##  1st Qu.:139.0   1st Qu.:100.0   Class :character   1st Qu.:39.0  
##  Median :272.0   Median :117.0   Mode  :character   Median :54.0  
##  Mean   :264.8   Mean   :115.8                      Mean   :53.2  
##  3rd Qu.:398.5   3rd Qu.:131.0                      3rd Qu.:66.0  
##  Max.   :509.0   Max.   :191.0                      Max.   :80.0  
##                  NA's   :1                                        
##    Education       Urban                US           
##  Min.   :10.0   Length:400         Length:400        
##  1st Qu.:12.0   Class :character   Class :character  
##  Median :14.0   Mode  :character   Mode  :character  
##  Mean   :13.9                                        
##  3rd Qu.:16.0                                        
##  Max.   :18.0                                        
## 

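There are 3 missing values in the data set: 2 in CompPrice and 1 in Price. The minimum Age of 9 is implausibly low for the average age of a local population and may indicate a data error.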
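The missing-value counts can also be computed directly rather than read off the summary() output. A small sketch on an invented data frame (the column names mirror the data set, the values do not):

```r
# Toy data frame with invented values.
demo <- data.frame(
  CompPrice = c(138, NA, 113, NA),
  Price = c(120, 83, NA, 97),
  Age = c(42, 65, 59, 55)
)

na_per_column <- colSums(is.na(demo))  # missing values in each column
na_total <- sum(is.na(demo))           # missing values in the whole data frame

na_per_column
na_total
```

Applied to carseats, the same two calls should reproduce the NA's rows shown by summary().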

Data cleaning

  3. Use the na.omit() command to define a new data set carseats2 in which rows with missing values in the original data are left out.
carseats2 <- na.omit(carseats)
summary(carseats2)
##      Sales          CompPrice       Income        Advertising   
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.00  
##  1st Qu.: 5.360   1st Qu.:115   1st Qu.: 43.00   1st Qu.: 0.00  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.00  
##  Mean   : 7.472   Mean   :125   Mean   : 68.66   Mean   : 6.61  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.00  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.00  
##    Population        Price        ShelveLoc              Age       
##  Min.   : 10.0   Min.   : 24.0   Length:397         Min.   : 9.00  
##  1st Qu.:139.0   1st Qu.:100.0   Class :character   1st Qu.:39.00  
##  Median :272.0   Median :117.0   Mode  :character   Median :54.00  
##  Mean   :264.9   Mean   :115.9                      Mean   :53.11  
##  3rd Qu.:398.0   3rd Qu.:131.0                      3rd Qu.:66.00  
##  Max.   :509.0   Max.   :191.0                      Max.   :80.00  
##    Education        Urban                US           
##  Min.   :10.00   Length:397         Length:397        
##  1st Qu.:12.00   Class :character   Class :character  
##  Median :14.00   Mode  :character   Mode  :character  
##  Mean   :13.91                                        
##  3rd Qu.:16.00                                        
##  Max.   :18.00

The new data set contains 397 rows.
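Here the 3 missing values cost exactly 3 rows (400 down to 397), so each missing value sits in a different row. That is not guaranteed in general, as this sketch on invented data shows; drop_na() from tidyr is used only as a comparison and is not part of the exercise.

```r
library(tidyr)

# Toy data frame with invented values: row 2 has two missing values.
demo <- data.frame(
  CompPrice = c(138, NA, 113, NA),
  Price = c(120, NA, 97, 128)
)

n_missing <- sum(is.na(demo))              # number of missing values
cleaned <- na.omit(demo)                   # drop every row with at least one NA
n_dropped <- nrow(demo) - nrow(cleaned)    # number of rows removed

n_missing
n_dropped

# tidyr::drop_na() keeps the same rows.
same_rows <- nrow(drop_na(demo)) == nrow(cleaned)
same_rows
```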


  4. Use the select() command to select the column Sales and the columns from Price to US from carseats2, and apply the ggpairs() function to the result. Are any possible data errors visible from the output?
carseats2 %>%
  select(Sales, Price:US) %>%
  ggpairs()

The US variable has 3 distinct values: Yes, No and yes. The lowercase yes is a data error.
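Stray category values like this can also be found without a plot by tabulating each character column. A minimal sketch with count() from dplyr on invented data:

```r
library(dplyr)

# Toy data with an inconsistently capitalised value (invented for illustration).
demo <- tibble(
  Urban = c("Yes", "No", "Yes", "Yes", "No"),
  US = c("Yes", "yes", "No", "Yes", "No")
)

# One row per distinct value, with its frequency in column n.
us_counts <- count(demo, US)
us_counts
```

A value that appears only a handful of times and differs from another only in capitalisation is a likely typo.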

Data transformation

  5. Use the mutate() command with if_else() to correct the data error in one of the factor variables, and store the corrected data under the same name carseats2.
levels(as.factor(carseats2$US))
## [1] "No"  "yes" "Yes"
carseats2 <- carseats2 %>%
  mutate(US = if_else(US == "yes", "Yes", US))

levels(as.factor(carseats2$US))
## [1] "No"  "Yes"

The US variable now has 2 levels: Yes and No.
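The variable description calls US a factor, while read_delim stores it as character. As an optional follow-up (not required by the exercise), the corrected column could be converted to a factor with explicit levels, so that any remaining unexpected value would show up as NA. A sketch on invented data:

```r
library(dplyr)

# Toy data (invented) containing the same kind of error.
demo <- tibble(US = c("Yes", "yes", "No", "Yes"))

fixed <- demo %>%
  mutate(
    US = if_else(US == "yes", "Yes", US),    # same correction as above
    US = factor(US, levels = c("No", "Yes")) # anything else would become NA
  )

levels(fixed$US)
anyNA(fixed$US)
```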

Summary statistics

  6. Use the summarize() command from the dplyr package to compute a table with the mean sales amount for each combination of Urban and ShelveLoc.
carseats2 %>%
  group_by(Urban, ShelveLoc) %>%
  summarize(MeanSales = mean(Sales))
## # A tibble: 6 × 3
## # Groups:   Urban [2]
##   Urban ShelveLoc MeanSales
##   <chr> <chr>         <dbl>
## 1 No    Bad            5.55
## 2 No    Good           9.75
## 3 No    Medium         7.24
## 4 Yes   Bad            5.52
## 5 Yes   Good          10.3 
## 6 Yes   Medium         7.34
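For comparing combinations it can help to lay the same means out as a two-way table, with one row per Urban value and one column per shelf location. A sketch on invented data, using pivot_wider() from tidyr (an addition, not part of the exercise):

```r
library(dplyr)
library(tidyr)

# Toy data with invented values.
demo <- tibble(
  Urban = c("No", "No", "Yes", "Yes", "Yes", "No"),
  ShelveLoc = c("Bad", "Good", "Bad", "Good", "Good", "Bad"),
  Sales = c(5, 10, 6, 11, 9, 7)
)

wide <- demo %>%
  group_by(Urban, ShelveLoc) %>%
  summarize(MeanSales = mean(Sales), .groups = "drop") %>%
  pivot_wider(names_from = ShelveLoc, values_from = MeanSales)

wide
```

Setting .groups = "drop" returns an ungrouped tibble and suppresses the grouping message that summarize() otherwise prints.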

Data visualization

  7. Use ggplot() with geom_point() to produce a scatter plot of Price (on the x-axis) and Sales, and color the points according to shelving location. Discuss the graph: does it show the expected relationships between the variables?
ggplot(carseats2, aes(Price, Sales, color = ShelveLoc)) +
  geom_point()

The scatter plot shows the expected relationships:

  • between Price and Sales (car seats with a lower price usually have higher sales)
  • between ShelveLoc and Sales (car seats in a better shelving location usually have higher sales)
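One way to make both relationships easier to read off the plot is to add a linear trend line per shelving location. The sketch below uses simulated data, not the homework data: the coefficients and the shelf_effect values are invented purely so that the example runs on its own.

```r
library(ggplot2)

# Simulated data (all numbers are assumptions chosen for illustration).
set.seed(1)
n <- 60
demo <- data.frame(
  Price = runif(n, 60, 160),
  ShelveLoc = rep(c("Bad", "Medium", "Good"), each = n / 3)
)
shelf_effect <- c(Bad = 0, Medium = 2, Good = 5)
demo$Sales <- 13 - 0.06 * demo$Price +
  unname(shelf_effect[as.character(demo$ShelveLoc)]) + rnorm(n)

# Scatter plot with one fitted straight line per shelving location.
p <- ggplot(demo, aes(Price, Sales, color = ShelveLoc)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p
```

Downward-sloping lines indicate that sales fall as price rises, and vertically separated lines indicate a difference between shelving locations at the same price.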