# Review of the first day’s material

---

# Project management

---

## Best practices for project management

.pull-left[
- Use R Projects
- Use `here::here()`
- Use a style guide for code and file naming
- Use a consistent folder layout
- Document your code or let your code speak for itself
]

.pull-right[
- Use sections in scripts to divide up your file
- Save data in `data/` and R scripts in `R/`
- Keep R scripts concise and focused on a single goal
- Use `source()` to run code in another script
- Don't repeat yourself (DRY): create functions instead
]
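
---

## Example: sections, functions, and project-relative paths

A minimal sketch of how a few of these practices might look in one script. The function name `summarise_numeric()` and the file name `sleep.csv` are hypothetical, chosen only for illustration. The `here` call is skipped if that package is not installed.

```r
# Functions ----
# Hypothetical helper: rounded mean and standard deviation of a numeric vector
summarise_numeric <- function(x, digits = 2) {
    c(
        mean = round(mean(x, na.rm = TRUE), digits),
        sd = round(sd(x, na.rm = TRUE), digits)
    )
}

# Analysis ----
# Reuse the function rather than copying and pasting the same calculation
extra_summary <- summarise_numeric(sleep$extra)
mpg_summary <- summarise_numeric(mtcars$mpg)
extra_summary
mpg_summary

# Paths ----
# Build file paths from the project root instead of using setwd()
if (requireNamespace("here", quietly = TRUE)) {
    sleep_path <- here::here("data", "sleep.csv")  # hypothetical file name
    print(sleep_path)
}
```

Comments that end in four or more dashes show up as sections in the RStudio document outline, which makes longer scripts easier to navigate.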

---

## Basics of R

.pull-left[
```r
# vector
1:10
#>  [1]  1  2  3  4  5  6  7  8  9 10
c("a", "b")
#> [1] "a" "b"

# data.frame
head(sleep, 2)
#>   extra group ID
#> 1   0.7     1  1
#> 2  -1.6     1  2

# Object assignment
my_name <- "Luke"
my_name
#> [1] "Luke"
```
]

.pull-right[
```r
# Viewing data.frames
colnames(sleep)
#> [1] "extra" "group" "ID"
str(sleep)
#> 'data.frame':	20 obs. of  3 variables:
#>  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
#>  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
summary(sleep)
#>      extra        group        ID   
#>  Min.   :-1.600   1:10   1      :2  
#>  1st Qu.:-0.025   2:10   2      :2  
#>  Median : 0.950          3      :2  
#>  Mean   : 1.540          4      :2  
#>  3rd Qu.: 3.400          5      :2  
#>  Max.   : 5.500          6      :2  
#>                          (Other):8
```
]

---

# Data management and wrangling

---

## Best practices for wrangling

.pull-left[
- Don't edit your raw data
- Wrangle and manage your data using code
- Save the final wrangled data as a CSV file in the `data/` folder
- Try to keep data "tidy" (each column and row together should uniquely describe a data value)
]

.pull-right[
- Make use of the `%>%` pipe to chain functions together
- Use the common data wrangling "verbs":
    - dplyr: `mutate()`, `select()`, `rename()`, `filter()`, `arrange()`,
    `group_by()`, `summarise()`
    - tidyr: `gather()`, `spread()`
]
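
---

## Example: chaining verbs with the pipe

A small sketch of the verbs above, applied to the built-in `sleep` dataset rather than the course data. The new column names (`ExtraSleepHours`, `SleptMore`, `MeanExtra`) are made up for this example.

```r
library(dplyr)

# Chain verbs with the pipe: each step passes its result to the next
sleep_wrangled <- sleep %>%
    as_tibble() %>%
    rename(ExtraSleepHours = extra) %>%
    mutate(SleptMore = ExtraSleepHours > 0) %>%
    filter(group == "1") %>%
    arrange(desc(ExtraSleepHours)) %>%
    select(ID, ExtraSleepHours, SleptMore)
sleep_wrangled

# Summarise by group: one row per level of `group`
sleep_means <- sleep %>%
    group_by(group) %>%
    summarise(MeanExtra = mean(extra))
sleep_means
```

Reading the chain from top to bottom gives the order in which the steps run, which is usually easier to follow than nested function calls.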

---

## Final exercise: Review of mutate and select

```r
nhanes_wrangled <- NHANES %>% 
*   mutate(MoreThan5DaysActive = if_else(PhysActiveDays >= 5, TRUE, FALSE)) %>%
*   select(SurveyYr, Gender, Age, Poverty, BMI, BPSysAve, BPDiaAve, TotChol,
*          DiabetesAge, nBabies, MoreThan5DaysActive, AlcoholDay) %>%
    rename(TotalCholesterol = TotChol, NumberOfBabies = nBabies, 
           DrinksOfAlcoholInDay = AlcoholDay, AgeDiabetesDiagnosis = DiabetesAge) %>% 
    filter(Age >= 18, Age <= 75)
nhanes_wrangled
```

```
#> # A tibble: 10,000 x 12
#>   SurveyYr Gender   Age Poverty   BMI BPSysAve BPDiaAve TotChol DiabetesAge
#>   <fct>    <fct>  <int>   <dbl> <dbl>    <int>    <int>   <dbl>       <int>
#> 1 2009_10  male      34    1.36  32.2      113       85    3.49          NA
#> 2 2009_10  male      34    1.36  32.2      113       85    3.49          NA
#> 3 2009_10  male      34    1.36  32.2      113       85    3.49          NA
#> 4 2009_10  male       4    1.07  15.3       NA       NA   NA             NA
#> # … with 9,996 more rows, and 3 more variables: nBabies <int>,
#> #   MoreThan5DaysActive <lgl>, AlcoholDay <int>
```
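
---

## Aside: a comparison is already logical

As a side note to the `mutate()` step, a comparison such as `PhysActiveDays >= 5` already returns `TRUE`, `FALSE`, or `NA`, so wrapping it in `if_else(..., TRUE, FALSE)` is optional. The sketch below checks this on a few hypothetical values; `if_else()` becomes more useful when the two outcomes are something other than `TRUE` and `FALSE`, such as text labels.

```r
library(dplyr)

# Hypothetical values, including a missing one
phys_active_days <- c(1, 5, 7, NA)

with_if_else <- if_else(phys_active_days >= 5, TRUE, FALSE)
comparison_only <- phys_active_days >= 5

# Both approaches give the same logical vector
identical(with_if_else, comparison_only)

# if_else() with labels instead of TRUE/FALSE
activity_label <- if_else(phys_active_days >= 5, "active", "less active")
activity_label
```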

---

## Final exercise: Review of rename

```r
nhanes_wrangled <- NHANES %>% 
    mutate(MoreThan5DaysActive = if_else(PhysActiveDays >= 5, TRUE, FALSE)) %>%
    select(SurveyYr, Gender, Age, Poverty, BMI, BPSysAve, BPDiaAve, TotChol,
           DiabetesAge, nBabies, MoreThan5DaysActive, AlcoholDay) %>%
*   rename(TotalCholesterol = TotChol, NumberOfBabies = nBabies,
*          DrinksOfAlcoholInDay = AlcoholDay, AgeDiabetesDiagnosis = DiabetesAge) %>%
    filter(Age >= 18, Age <= 75)
nhanes_wrangled
```

```
#> # A tibble: 10,000 x 12
#>   SurveyYr Gender   Age Poverty   BMI BPSysAve BPDiaAve TotalCholesterol
#>   <fct>    <fct>  <int>   <dbl> <dbl>    <int>    <int>            <dbl>
#> 1 2009_10  male      34    1.36  32.2      113       85             3.49
#> 2 2009_10  male      34    1.36  32.2      113       85             3.49
#> 3 2009_10  male      34    1.36  32.2      113       85             3.49
#> 4 2009_10  male       4    1.07  15.3       NA       NA            NA   
#> # … with 9,996 more rows, and 4 more variables:
#> #   AgeDiabetesDiagnosis <int>, NumberOfBabies <int>,
#> #   MoreThan5DaysActive <lgl>, DrinksOfAlcoholInDay <int>
```

---

## Final exercise: Review of filter

```r
nhanes_wrangled <- NHANES %>% 
    mutate(MoreThan5DaysActive = if_else(PhysActiveDays >= 5, TRUE, FALSE)) %>%
    select(SurveyYr, Gender, Age, Poverty, BMI, BPSysAve, BPDiaAve, TotChol,
           DiabetesAge, nBabies, MoreThan5DaysActive, AlcoholDay) %>%
    rename(TotalCholesterol = TotChol, NumberOfBabies = nBabies, 
           DrinksOfAlcoholInDay = AlcoholDay, AgeDiabetesDiagnosis = DiabetesAge) %>% 
*   filter(Age >= 18, Age <= 75)
nhanes_wrangled
```

```
#> # A tibble: 6,964 x 12
#>   SurveyYr Gender   Age Poverty   BMI BPSysAve BPDiaAve TotalCholesterol
#>   <fct>    <fct>  <int>   <dbl> <dbl>    <int>    <int>            <dbl>
#> 1 2009_10  male      34    1.36  32.2      113       85             3.49
#> 2 2009_10  male      34    1.36  32.2      113       85             3.49
#> 3 2009_10  male      34    1.36  32.2      113       85             3.49
#> 4 2009_10  female    49    1.91  30.6      112       75             6.7 
#> # … with 6,960 more rows, and 4 more variables:
#> #   AgeDiabetesDiagnosis <int>, NumberOfBabies <int>,
#> #   MoreThan5DaysActive <lgl>, DrinksOfAlcoholInDay <int>
```
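
---

## Aside: combining conditions in filter

Conditions separated by commas in `filter()` are combined with "and", so a row is kept only when every condition is `TRUE`. A minimal sketch on a hypothetical `ages` table, showing three ways to write the same age range:

```r
library(dplyr)

# Hypothetical data, for illustration only
ages <- tibble(Age = c(4, 18, 34, 75, 80))

with_commas <- ages %>% filter(Age >= 18, Age <= 75)
with_and <- ages %>% filter(Age >= 18 & Age <= 75)
with_between <- ages %>% filter(between(Age, 18, 75))

with_commas
```

`between()` includes both end points, so it matches the `>=` and `<=` version.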

---

## Final exercise: Review of gather

```r
nhanes_wrangled %>% 
*   gather(Measure, Value, -SurveyYr, -Gender) %>%
    group_by(SurveyYr, Gender, Measure) %>% 
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>% 
    arrange(Measure, Gender, SurveyYr) %>% 
    spread(SurveyYr, Mean)
```

```
#> # A tibble: 69,640 x 4
#>   SurveyYr Gender Measure Value
#>   <fct>    <fct>  <chr>   <dbl>
#> 1 2009_10  male   Age        34
#> 2 2009_10  male   Age        34
#> 3 2009_10  male   Age        34
#> 4 2009_10  female Age        49
#> # … with 6.964e+04 more rows
```

---

## Final exercise: Review of group_by and summarise

```r
nhanes_wrangled %>% 
    gather(Measure, Value, -SurveyYr, -Gender) %>% 
*   group_by(SurveyYr, Gender, Measure) %>%
*   summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>%
    arrange(Measure, Gender, SurveyYr) %>% 
    spread(SurveyYr, Mean)
```

```
#> # A tibble: 40 x 4
#> # Groups:   SurveyYr, Gender [4]
#>   SurveyYr Gender Measure               Mean
#>   <fct>    <fct>  <chr>                <dbl>
#> 1 2009_10  female Age                   44.0
#> 2 2009_10  female AgeDiabetesDiagnosis  48.1
#> 3 2009_10  female BMI                   29.0
#> 4 2009_10  female BPDiaAve              67.7
#> # … with 36 more rows
```

---

## Final exercise: Review of arrange

```r
nhanes_wrangled %>% 
    gather(Measure, Value, -SurveyYr, -Gender) %>%
    group_by(SurveyYr, Gender, Measure) %>% 
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>% 
*   arrange(Measure, Gender, SurveyYr) %>%
    spread(SurveyYr, Mean)
```

```
#> # A tibble: 40 x 4
#> # Groups:   SurveyYr, Gender [4]
#>   SurveyYr Gender Measure  Mean
#>   <fct>    <fct>  <chr>   <dbl>
#> 1 2009_10  female Age      44.0
#> 2 2011_12  female Age      44.2
#> 3 2009_10  male   Age      43.1
#> 4 2011_12  male   Age      43.9
#> # … with 36 more rows
```

---

## Final exercise: Review of spread

```r
nhanes_wrangled %>% 
    gather(Measure, Value, -SurveyYr, -Gender) %>%
    group_by(SurveyYr, Gender, Measure) %>% 
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>% 
    arrange(Measure, Gender, SurveyYr) %>% 
*   spread(SurveyYr, Mean)
```

```
#> # A tibble: 20 x 4
#> # Groups:   Gender [2]
#>   Gender Measure              `2009_10` `2011_12`
#>   <fct>  <chr>                    <dbl>     <dbl>
#> 1 female Age                       44.0      44.2
#> 2 female AgeDiabetesDiagnosis      48.1      46.5
#> 3 female BMI                       29.0      28.6
#> 4 female BPDiaAve                  67.7      70.0
#> # … with 16 more rows
```
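
---

## Aside: pivot_longer and pivot_wider

Since tidyr 1.0.0, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. The older functions still work, but the sketch below shows how the same reshaping might be written with the newer ones. The `toy` data is made up for illustration and is not taken from NHANES; the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with the same layout as the wrangled dataset
toy <- tibble(
    SurveyYr = c("2009_10", "2009_10", "2011_12", "2011_12"),
    Gender = c("female", "male", "female", "male"),
    Age = c(40, 50, 42, 52),
    BMI = c(25, 27, 26, 28)
)

# Long form: one row per SurveyYr, Gender, and Measure (like gather())
toy_long <- toy %>%
    pivot_longer(
        cols = -c(SurveyYr, Gender),
        names_to = "Measure",
        values_to = "Value"
    )

# Summarise, then spread the survey years into columns (like spread())
toy_wide <- toy_long %>%
    group_by(SurveyYr, Gender, Measure) %>%
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2), .groups = "drop") %>%
    arrange(Measure, Gender, SurveyYr) %>%
    pivot_wider(names_from = SurveyYr, values_from = Mean)
toy_wide
```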