class: center, middle, inverse, title-slide

# Review of the first day’s material

---
layout: true

<div class="my-footer">
<span>
<img src="../img/au_logo_black.png" alt="Aarhus University" width="140">
</span>
</div>

---
class: center, middle

# Project management

---

## Best practices for project management

.pull-left[

- Use R Projects
- Use `here::here()`
- Use a style guide for code and file naming
- Use a consistent folder layout
- Document your code or let your code speak for itself

]

.pull-right[

- Use Sections to split a script into parts
- Save data in `data/` and R scripts in `R/`
- Keep each R script concise and focused on one goal
- Use `source()` to run code in another script
- Don't repeat yourself (DRY): create functions

]

---

## Basics of R

.pull-left[

```r
# vector
1:10
#>  [1]  1  2  3  4  5  6  7  8  9 10
c("a", "b")
#> [1] "a" "b"

# data.frame
head(sleep, 2)
#>   extra group ID
#> 1   0.7     1  1
#> 2  -1.6     1  2

# Object assignment
my_name <- "Luke"
my_name
#> [1] "Luke"
```

]

.pull-right[

```r
# Viewing data.frames
colnames(sleep)
#> [1] "extra" "group" "ID"
str(sleep)
#> 'data.frame': 20 obs. of  3 variables:
#>  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
#>  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
summary(sleep)
#>      extra        group        ID
#>  Min.   :-1.600   1:10   1      :2
#>  1st Qu.:-0.025   2:10   2      :2
#>  Median : 0.950          3      :2
#>  Mean   : 1.540          4      :2
#>  3rd Qu.: 3.400          5      :2
#>  Max.   : 5.500          6      :2
#>                          (Other):8
```

]

---
class: center, middle

# Data management and wrangling

---

## Best practices for wrangling

.pull-left[

- Don't edit your raw data
- Wrangle and manage your data using code
- Save the final wrangled data as a csv file in the `data/` folder
- Try to keep data "tidy" (column and row should uniquely describe the data value)

]

.pull-right[

- Make use of the `%>%` pipe to chain functions together
- Use the common data wrangling "verbs":
    - dplyr: `mutate()`, `select()`, `rename()`, `filter()`, `arrange()`, `group_by()`, `summarise()`
    - tidyr: `gather()`, `spread()`

]

---

## Final exercise: Review of mutate and select

```r
nhanes_wrangled <- NHANES %>%
*   mutate(MoreThan5DaysActive = if_else(PhysActiveDays >= 5, TRUE, FALSE)) %>%
*   select(SurveyYr, Gender, Age, Poverty, BMI, BPSysAve, BPDiaAve, TotChol,
*          DiabetesAge, nBabies, MoreThan5DaysActive, AlcoholDay) %>%
    rename(TotalCholesterol = TotChol, NumberOfBabies = nBabies,
           DrinksOfAlcoholInDay = AlcoholDay, AgeDiabetesDiagnosis = DiabetesAge) %>%
    filter(Age >= 18, Age <= 75)

nhanes_wrangled
```

```
#> # A tibble: 10,000 x 12
#>   SurveyYr Gender   Age Poverty   BMI BPSysAve BPDiaAve TotChol DiabetesAge
#>   <fct>    <fct>  <int>   <dbl> <dbl>    <int>    <int>   <dbl>       <int>
#> 1 2009_10  male      34    1.36  32.2      113       85    3.49          NA
#> 2 2009_10  male      34    1.36  32.2      113       85    3.49          NA
#> 3 2009_10  male      34    1.36  32.2      113       85    3.49          NA
#> 4 2009_10  male       4    1.07  15.3       NA       NA   NA             NA
#> # … with 9,996 more rows, and 3 more variables: nBabies <int>,
#> #   MoreThan5DaysActive <lgl>, AlcoholDay <int>
```

---

## Final exercise: Review of rename

```r
nhanes_wrangled <- NHANES %>%
    mutate(MoreThan5DaysActive = if_else(PhysActiveDays >= 5, TRUE, FALSE)) %>%
    select(SurveyYr, Gender, Age, Poverty, BMI, BPSysAve, BPDiaAve, TotChol,
           DiabetesAge, nBabies, MoreThan5DaysActive, AlcoholDay) %>%
*   rename(TotalCholesterol = TotChol, NumberOfBabies = nBabies,
*          DrinksOfAlcoholInDay = AlcoholDay, AgeDiabetesDiagnosis = DiabetesAge) %>%
    filter(Age >= 18, Age <= 75)

nhanes_wrangled
```

```
#> # A tibble: 10,000 x 12
#>   SurveyYr Gender   Age Poverty   BMI BPSysAve BPDiaAve TotalCholesterol
#>   <fct>    <fct>  <int>   <dbl> <dbl>    <int>    <int>            <dbl>
#> 1 2009_10  male      34    1.36  32.2      113       85             3.49
#> 2 2009_10  male      34    1.36  32.2      113       85             3.49
#> 3 2009_10  male      34    1.36  32.2      113       85             3.49
#> 4 2009_10  male       4    1.07  15.3       NA       NA            NA
#> # … with 9,996 more rows, and 4 more variables:
#> #   AgeDiabetesDiagnosis <int>, NumberOfBabies <int>,
#> #   MoreThan5DaysActive <lgl>, DrinksOfAlcoholInDay <int>
```

---

## Final exercise: Review of filter

```r
nhanes_wrangled <- NHANES %>%
    mutate(MoreThan5DaysActive = if_else(PhysActiveDays >= 5, TRUE, FALSE)) %>%
    select(SurveyYr, Gender, Age, Poverty, BMI, BPSysAve, BPDiaAve, TotChol,
           DiabetesAge, nBabies, MoreThan5DaysActive, AlcoholDay) %>%
    rename(TotalCholesterol = TotChol, NumberOfBabies = nBabies,
           DrinksOfAlcoholInDay = AlcoholDay, AgeDiabetesDiagnosis = DiabetesAge) %>%
*   filter(Age >= 18, Age <= 75)

nhanes_wrangled
```

```
#> # A tibble: 6,964 x 12
#>   SurveyYr Gender   Age Poverty   BMI BPSysAve BPDiaAve TotalCholesterol
#>   <fct>    <fct>  <int>   <dbl> <dbl>    <int>    <int>            <dbl>
#> 1 2009_10  male      34    1.36  32.2      113       85             3.49
#> 2 2009_10  male      34    1.36  32.2      113       85             3.49
#> 3 2009_10  male      34    1.36  32.2      113       85             3.49
#> 4 2009_10  female    49    1.91  30.6      112       75             6.7
#> # … with 6,960 more rows, and 4 more variables:
#> #   AgeDiabetesDiagnosis <int>, NumberOfBabies <int>,
#> #   MoreThan5DaysActive <lgl>, DrinksOfAlcoholInDay <int>
```

---

## Final exercise: Review of gather

```r
nhanes_wrangled %>%
*   gather(Measure, Value, -SurveyYr, -Gender) %>%
    group_by(SurveyYr, Gender, Measure) %>%
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>%
    arrange(Measure, Gender, SurveyYr) %>%
    spread(SurveyYr, Mean)
```

```
#> # A tibble: 69,640 x 4
#>   SurveyYr Gender Measure Value
#>   <fct>    <fct>  <chr>   <dbl>
#> 1 2009_10  male   Age        34
#> 2 2009_10  male   Age        34
#> 3 2009_10  male   Age        34
#> 4 2009_10  female Age        49
#> # … with 6.964e+04 more rows
```

---

## Final exercise: Review of group_by and summarise

```r
nhanes_wrangled %>%
    gather(Measure, Value, -SurveyYr, -Gender) %>%
*   group_by(SurveyYr, Gender, Measure) %>%
*   summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>%
    arrange(Measure, Gender, SurveyYr) %>%
    spread(SurveyYr, Mean)
```

```
#> # A tibble: 40 x 4
#> # Groups:   SurveyYr, Gender [4]
#>   SurveyYr Gender Measure               Mean
#>   <fct>    <fct>  <chr>                <dbl>
#> 1 2009_10  female Age                   44.0
#> 2 2009_10  female AgeDiabetesDiagnosis  48.1
#> 3 2009_10  female BMI                   29.0
#> 4 2009_10  female BPDiaAve              67.7
#> # … with 36 more rows
```

---

## Final exercise: Review of arrange

```r
nhanes_wrangled %>%
    gather(Measure, Value, -SurveyYr, -Gender) %>%
    group_by(SurveyYr, Gender, Measure) %>%
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>%
*   arrange(Measure, Gender, SurveyYr) %>%
    spread(SurveyYr, Mean)
```

```
#> # A tibble: 40 x 4
#> # Groups:   SurveyYr, Gender [4]
#>   SurveyYr Gender Measure  Mean
#>   <fct>    <fct>  <chr>   <dbl>
#> 1 2009_10  female Age      44.0
#> 2 2011_12  female Age      44.2
#> 3 2009_10  male   Age      43.1
#> 4 2011_12  male   Age      43.9
#> # … with 36 more rows
```

---

## Final exercise: Review of spread

```r
nhanes_wrangled %>%
    gather(Measure, Value, -SurveyYr, -Gender) %>%
    group_by(SurveyYr, Gender, Measure) %>%
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>%
    arrange(Measure, Gender, SurveyYr) %>%
*   spread(SurveyYr, Mean)
```

```
#> # A tibble: 20 x 4
#> # Groups:   Gender [2]
#>   Gender Measure              `2009_10` `2011_12`
#>   <fct>  <chr>                    <dbl>     <dbl>
#> 1 female Age                       44.0      44.2
#> 2 female AgeDiabetesDiagnosis      48.1      46.5
#> 3 female BMI                       29.0      28.6
#> 4 female BPDiaAve                  67.7      70.0
#> # … with 16 more rows
```
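
---

## Final exercise: The same pipeline on toy data

A minimal sketch of the gather, group_by, summarise, arrange and spread steps on a small made-up data frame, so that each result can be checked by hand. The `toy` and `toy_summary` objects and all of their values are hypothetical (they are not taken from NHANES); only the function calls mirror the exercise. The added `ungroup()` is an assumption to keep the final result free of grouping.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two survey years, two genders, two measures
toy <- tibble(
    SurveyYr = rep(c("2009_10", "2011_12"), each = 4),
    Gender = rep(c("female", "female", "male", "male"), times = 2),
    Age = c(30, 50, 40, 60, 32, 52, 44, 64),
    BMI = c(24, 26, 28, NA, 25, 27, 29, 31)
)

toy_summary <- toy %>%
    # Wide to long: one row per person and measure
    gather(Measure, Value, -SurveyYr, -Gender) %>%
    # One mean per year, gender and measure; na.rm = TRUE skips the missing BMI
    group_by(SurveyYr, Gender, Measure) %>%
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>%
    ungroup() %>%
    arrange(Measure, Gender, SurveyYr) %>%
    # Long to wide: one column per survey year
    spread(SurveyYr, Mean)

toy_summary
```

The result has one row per Gender and Measure (4 rows) and one column per survey year. The male BMI mean for `2009_10` is 28, because the missing value is dropped before averaging.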