Efficiency in data analysis and best practices

Session details

Objectives

  1. To learn some simple practices to be efficient in data analysis.
  2. To learn how functions are structured and some best practices for making and documenting them.
  3. To create your own functions to let the code repeat the work for you.
  4. To learn about functional programming and vectorisation.
  5. To apply functions (map) that do iterations at a high level, so you can do your work without worrying too much about the details.
  6. To briefly learn about parallel processing (and how easy it is).

At the end of this session you will be able to:

  • Create your own functions.
  • Avoid repeating yourself.
  • Avoid for loops by using functional programming tools like map.

Best practices for being efficient

Here are a few tips for being efficient. Also check out the resources at the bottom for more places to learn about being efficient.

  • Aggressively keep your code as simple and concise as you can. The less code you have to write, debug, and maintain, the better. If you repeat code, stop and think how to not repeat it. You will save time in the end by taking time to think and plan.
  • Try to always think in terms of wrangling data as data.frames. There is a lot of power and utility in structuring most things as data.frames (e.g. “how can I get the data.frame into the form I need to make a figure?”).
  • Keep learning. Keep reading books and help documentation. Keep taking courses and workshops. And start teaching R to others! hint hint 😉 Seriously, it’s a great way for you to learn as it challenges you to know what you are talking about and teaching!
  • When encountering a difficult problem, think about how to solve it, then try to find R packages or functions that solve the problem for you. You may hear some people say “oh, don’t bother with R packages, do everything in base R”… don’t listen to them. Do you build a computer from scratch every time you want to do any type of work? Or a car when you want to drive somewhere? No, you don’t. Make use of other people’s hard work to make tools that simplify your life.
  • Use a style guide and create descriptive object names. Follow the principles of tidy tools when creating functions and code.
  • Split up your analyses steps into individual files (e.g. “model” file, “plot” file). Then source those files as needed or save the output in data/ to use it in other files.
  • Try not to let R scripts get too long (e.g. more than 500 lines of code). Keep scripts as small and as focused as possible (each with a single purpose); every script should have a clear end goal.
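The “save the output in data/ to use it in other files” pattern from the last two points can be sketched as below. The object and file name are made up for illustration, and a temporary file stands in for something like here::here("data/model-results.rds") in a real project.

```r
# In the "model" file: save an intermediate result to disk.
model_results <- data.frame(term = "Height", estimate = 0.5)  # made-up output
results_path <- tempfile(fileext = ".rds")  # stands in for data/model-results.rds
saveRDS(model_results, results_path)

# In the "plot" file: read the saved result back in and carry on.
model_results <- readRDS(results_path)
```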

Creating functions

Before we begin, create a new file for this session. Make sure you are in the correct R Project.

usethis::use_r("efficiency-session")

As mentioned before, all actions in R are functions. For instance, the + is a function, mean() is a function, [] is a function, and so on. So creating your functions can make doing your work easier and more efficient. Making functions always has a basic structure of:

  1. Giving a name to the function (e.g. mean).
  2. Starting the function call using function(), assigning it to the name with <-. This tells R that the name is a function object.
  3. Optionally providing arguments to give to the function call, for instance function(arg1, arg2, arg3).
  4. Filling out the body of the function, with the arguments (if any) contained inside, that does some action.
  5. Optionally, but strongly encouraged, use return to indicate what you want the function to output.
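As an aside on “all actions in R are functions”: you can confirm this yourself by calling an operator with regular function syntax, wrapping its name in backticks.

```r
# Operators like + are ordinary functions under the hood;
# backticks let you call them by name.
`+`(1, 2)
#> [1] 3

# Subsetting with [ is a function call too.
`[`(c("a", "b", "c"), 2)
#> [1] "b"
```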

While there is no minimum or maximum number of arguments you can provide for a function (you could have zero or dozens of arguments), it’s good practice for yourself and for others to use as few arguments as possible to get the job done.

So, the structure is:

name <- function(arg1, arg2) {
    # body of function
    ... code ...
    return(output)
}

… and an example:

add_nums <- function(num1, num2) {
    added <- num1 + num2
    return(added)
}

You can use the new function by running the code above and then calling it with the arguments you want to give it.

add_nums(1, 2)
#> [1] 3
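As a small extra sketch (not part of the session, just for illustration): giving an argument a default value keeps the argument list short for the common case while still allowing customisation.

```r
# Variation on add_nums: num2 defaults to 1, so callers can
# omit it for the common case.
add_nums2 <- function(num1, num2 = 1) {
    added <- num1 + num2
    return(added)
}

add_nums2(5)      # uses the default num2 = 1
#> [1] 6
add_nums2(5, 10)  # overrides the default
#> [1] 15
```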

The function name is fairly good… add_nums can be read as “add numbers”. It’s also good practice to add some formal documentation to the function. Use “Insert Roxygen Skeleton” in the “Code” menu (or type Ctrl-Alt-Shift-R) to add template documentation right above the function. It looks like:

#' Title
#'
#' @param num1 
#' @param num2 
#'
#' @return
#' @export
#'
#' @examples
add_nums <- function(num1, num2) {
    added <- num1 + num2
    return(added)
}

The Title line is where you type a brief sentence or a few words that describe the function. Creating a new paragraph below this line allows you to add a more detailed description. The other items are:

  • The @param lines describe what each argument is for.
  • @return describes what output the function provides. Is it a data.frame? A plot? What else does the output give?
  • @export tells R that this function should be accessible to the user of your package. Delete it for non-package functions.
  • @examples lines below this are used to show examples of how to use the function. Very useful when making packages, but not really needed in this case.

Filled out for our function, the documentation might look like:
#' Add two numbers together.
#'
#' This is just an example function to show how to create one.
#'
#' @param num1 A number here.
#' @param num2 A number here.
#'
#' @return Returns the sum of the two numbers.
#'
add_nums <- function(num1, num2) {
    added <- num1 + num2
    return(added)
}

Functions to wrangle

Let’s get to something you might actually do in a data analysis project. Let’s suppose you want to compare the mean and standard deviation of several variables by a given categorical variable, such as SurveyYr or Gender, that you might want to include as an exploratory table in a report or document. Without making a function, you would normally run a series of functions for each of the categorical variables. For instance:

NHANES %>% 
    select(Gender, BMI, TotChol, Pulse, BPSysAve, BPDiaAve) %>% 
    gather(Measurement, Value, -Gender) %>% 
    na.omit() %>% 
    group_by(Gender, Measurement) %>% 
    summarise(Mean = round(mean(Value), 2),
              SD = round(sd(Value), 2),
              MeanSD = str_c(Mean, " (", SD, ")")) %>% 
    select(-Mean, -SD) %>% 
    spread(Gender, MeanSD)
#> # A tibble: 5 x 3
#>   Measurement female         male          
#>   <chr>       <chr>          <chr>         
#> 1 BMI         26.77 (7.9)    26.55 (6.81)  
#> 2 BPDiaAve    66.39 (13.5)   68.58 (15.09) 
#> 3 BPSysAve    116.31 (17.82) 120.02 (16.45)
#> 4 Pulse       75.43 (11.93)  71.67 (12.1)  
#> 5 TotChol     4.96 (1.08)    4.8 (1.07)

Nice, but what if we now want to do it for SurveyYr instead of Gender? We would have to replace every Gender with SurveyYr. Instead, let’s create a function. First, copy and paste the code above into a function structure.

mean_sd_by_group <- function() {
    by_group_output <- NHANES %>% 
        select(Gender, BMI, TotChol, Pulse, BPSysAve, BPDiaAve) %>% 
        gather(Measurement, Value, -Gender) %>% 
        na.omit() %>% 
        group_by(Gender, Measurement) %>% 
        summarise(Mean = round(mean(Value), 2),
                  SD = round(sd(Value), 2),
                  MeanSD = str_c(Mean, " (", SD, ")")) %>% 
        select(-Mean, -SD) %>% 
        spread(Gender, MeanSD)
    return(by_group_output)
}

Great, now we need to add the arguments and place them in the correct locations. (I tend to put a . before my argument names to differentiate them from other variables.)

mean_sd_by_group <- function(.dataset, .groupvar) {
    by_group_output <- .dataset %>% 
        select(.groupvar, BMI, TotChol, Pulse, BPSysAve, BPDiaAve) %>% 
        gather(Measurement, Value, -.groupvar) %>% 
        na.omit() %>% 
        group_by(.groupvar, Measurement) %>% 
        summarise(Mean = round(mean(Value), 2),
                  SD = round(sd(Value), 2),
                  MeanSD = str_c(Mean, " (", SD, ")")) %>% 
        select(-Mean, -SD) %>% 
        spread(.groupvar, MeanSD)
    return(by_group_output)
}
mean_sd_by_group(NHANES, Gender)
#> Error in .f(.x[[i]], ...): object 'Gender' not found

What happened? We got an error. We’ve encountered a problem due to “non-standard evaluation” (or NSE). NSE is used throughout the tidyverse packages as well as in base R. It’s one of the first things computer scientists complain about when they use R, because it is so foreign to other programming languages. But NSE is what allows you to use formulas (e.g. y ~ x + x2 in modelling) or to type out select(Gender, BMI); in many other languages this would have to be select("Gender", "BMI"). It gives a lot of flexibility for interactive data analysis, but can cause headaches when programming in R. So here we quote the variable name and use the _at() versions of the dplyr functions combined with vars(). The tidyr functions gather and spread don’t require these changes.

mean_sd_by_group <- function(.dataset, .groupvar) {
    by_group_output <- .dataset %>% 
        select_at(vars(.groupvar, BMI, TotChol, Pulse, BPSysAve, BPDiaAve)) %>% 
        gather(Measurement, Value, -.groupvar) %>% 
        na.omit() %>% 
        group_by_at(vars(.groupvar, Measurement)) %>% 
        summarise(Mean = round(mean(Value), 2),
                  SD = round(sd(Value), 2),
                  MeanSD = str_c(Mean, " (", SD, ")")) %>% 
        select(-Mean, -SD) %>% 
        spread(.groupvar, MeanSD)
    return(by_group_output)
}
mean_sd_by_group(NHANES, "Gender")
#> # A tibble: 5 x 3
#>   Measurement female         male          
#>   <chr>       <chr>          <chr>         
#> 1 BMI         26.77 (7.9)    26.55 (6.81)  
#> 2 BPDiaAve    66.39 (13.5)   68.58 (15.09) 
#> 3 BPSysAve    116.31 (17.82) 120.02 (16.45)
#> 4 Pulse       75.43 (11.93)  71.67 (12.1)  
#> 5 TotChol     4.96 (1.08)    4.8 (1.07)

Now we can also use other categorical variables:

mean_sd_by_group(NHANES, "SurveyYr")
#> # A tibble: 5 x 3
#>   Measurement `2009_10`     `2011_12`    
#>   <chr>       <chr>         <chr>        
#> 1 BMI         26.88 (7.47)  26.44 (7.28) 
#> 2 BPDiaAve    66.66 (14.22) 68.3 (14.45) 
#> 3 BPSysAve    117.6 (17.09) 118.7 (17.39)
#> 4 Pulse       73.49 (12.17) 73.63 (12.14)
#> 5 TotChol     4.93 (1.08)   4.83 (1.07)
mean_sd_by_group(NHANES, "Diabetes")
#> # A tibble: 5 x 3
#>   Measurement No             Yes           
#>   <chr>       <chr>          <chr>         
#> 1 BMI         26.16 (7.09)   32.56 (8.15)  
#> 2 BPDiaAve    67.43 (14.37)  68.04 (14.17) 
#> 3 BPSysAve    117.23 (16.73) 127.95 (19.42)
#> 4 Pulse       73.58 (12.05)  73.32 (13.22) 
#> 5 TotChol     4.89 (1.07)    4.78 (1.15)

Nifty eh? If you want a fancy table when using R Markdown, use the kable() function from the knitr package.

mean_sd_by_group(NHANES, "Diabetes") %>% 
    knitr::kable(caption = "Mean and SD of some measurements by Diabetes status.")
Table: Mean and SD of some measurements by Diabetes status.

|Measurement |No             |Yes            |
|:-----------|:--------------|:--------------|
|BMI         |26.16 (7.09)   |32.56 (8.15)   |
|BPDiaAve    |67.43 (14.37)  |68.04 (14.17)  |
|BPSysAve    |117.23 (16.73) |127.95 (19.42) |
|Pulse       |73.58 (12.05)  |73.32 (13.22)  |
|TotChol     |4.89 (1.07)    |4.78 (1.15)    |

A massive advantage of using functions is that if you want to make a change to all your code, you can very easily do it in the R function and it will change all your other code too!

If you want to make sure that whoever uses your function does not pass a wrong argument, you can use “defensive programming” via the stopifnot() function. This forces the function to run only if .groupvar is a character (e.g. "this") argument.

mean_sd_by_group <- function(.dataset, .groupvar) {
    stopifnot(is.character(.groupvar))
    by_group_output <- .dataset %>% 
        select_at(vars(.groupvar, BMI, TotChol, Pulse, BPSysAve, BPDiaAve)) %>% 
        gather(Measurement, Value, -.groupvar) %>% 
        na.omit() %>% 
        group_by_at(vars(.groupvar, Measurement)) %>% 
        summarise(Mean = round(mean(Value), 2),
                  SD = round(sd(Value), 2),
                  MeanSD = str_c(Mean, " (", SD, ")")) %>% 
        select(-Mean, -SD) %>% 
        spread(.groupvar, MeanSD)
    return(by_group_output)
}

A good practice after you’ve created your function is to explicitly indicate which function comes from which package, since you shouldn’t put library() calls inside a function. We do this using the packagename::function_name format (you don’t need to do this for base R functions). Plus, let’s add some documentation! (Ctrl-Alt-Shift-R)

#' Calculate mean and standard deviation by a grouping variable.
#'
#' @param .dataset The dataset
#' @param .groupvar Variable to group by
#'
#' @return Output a data frame.
#'
mean_sd_by_group <- function(.dataset, .groupvar) {
    stopifnot(is.character(.groupvar))
    by_group_output <- .dataset %>% 
        dplyr::select_at(dplyr::vars(.groupvar, BMI, 
                                     TotChol, Pulse, BPSysAve, BPDiaAve)) %>% 
        tidyr::gather(Measurement, Value, -.groupvar) %>% 
        na.omit() %>% 
        dplyr::group_by_at(dplyr::vars(.groupvar, Measurement)) %>% 
        dplyr::summarise(Mean = round(mean(Value), 2),
                  SD = round(sd(Value), 2),
                  MeanSD = stringr::str_c(Mean, " (", SD, ")")) %>% 
        dplyr::select(-Mean, -SD) %>% 
        tidyr::spread(.groupvar, MeanSD)
    return(by_group_output)
}

Functions to plot

Let’s do this for plots too, since we’ll likely be making lots of those! Let’s say we’ve already made a nice theme and now want to create several scatter plots. The non-function code:

ggplot(NHANES, aes(x = Height, y = Weight, colour = Gender)) +
    geom_point(size = 2, alpha = 0.3) +
    scale_colour_viridis_d(option = "B", begin = 0.2, end = 0.7) +
    theme_minimal() +
    theme(text = element_text(colour = "grey40"))
#> Warning: Removed 366 rows containing missing values (geom_point).

Like before, maybe we want to see the scatter plot by different variables. So let’s get it into function form:

plot_scatter_by_group <- function() {
    scatter_plot <- ggplot(NHANES, aes(x = Height, y = Weight, 
                                       colour = Gender)) +
        geom_point(size = 2, alpha = 0.3) +
        scale_colour_viridis_d(option = "B", begin = 0.2, end = 0.7) +
        theme_minimal() +
        theme(text = element_text(colour = "grey40"))
    return(scatter_plot)
}

Then let’s add in the arguments.

plot_scatter_by_group <- function(.dataset, .xvar, .yvar, .groupvar) {
    scatter_plot <- ggplot(.dataset, aes(x = .xvar, y = .yvar, 
                                         colour = .groupvar)) +
        geom_point(size = 2, alpha = 0.3) +
        scale_colour_viridis_d(option = "B", begin = 0.2, end = 0.7) +
        theme_minimal() +
        theme(text = element_text(colour = "grey40"))
    return(scatter_plot)
}

Like the dplyr code above, this function won’t work because ggplot2 also uses NSE: it will be confused by x = .xvar, thinking you are asking for a column literally named .xvar in the dataset. So we need to change aes() to aes_string(), which forces ggplot2 to read .xvar as a character string containing the name of the column you want to plot.

plot_scatter_by_group <- function(.dataset, .xvar, .yvar, .groupvar) {
    scatter_plot <- ggplot(.dataset, aes_string(x = .xvar, y = .yvar, 
                                                colour = .groupvar)) +
        geom_point(size = 2, alpha = 0.3) +
        scale_colour_viridis_d(option = "B", begin = 0.2, end = 0.7) +
        theme_minimal() +
        theme(text = element_text(colour = "grey40"))
    return(scatter_plot)
}
plot_scatter_by_group(NHANES, "Height", "Weight", "Gender")
#> Warning: Removed 366 rows containing missing values (geom_point).

Let’s add in the stopifnot, ggplot2:: explicit calls, and the roxygen documentation (Ctrl-Alt-Shift-R):

#' Create a scatter plot with colouring by categorical variable.
#'
#' @param .dataset Data to plot.
#' @param .xvar Variable for the x axis.
#' @param .yvar Variable for the y axis.
#' @param .groupvar The variable to group/color by.
#'
#' @return Creates a ggplot.
#'
plot_scatter_by_group <- function(.dataset, .xvar, .yvar, .groupvar) {
    stopifnot(is.character(.xvar), is.character(.yvar), is.character(.groupvar))
    scatter_plot <- 
        ggplot2::ggplot(.dataset, ggplot2::aes_string(x = .xvar, y = .yvar, 
                                                      colour = .groupvar)) +
        ggplot2::geom_point(size = 2, alpha = 0.3) +
        ggplot2::scale_colour_viridis_d(option = "B", begin = 0.2, end = 0.7) +
        ggplot2::theme_minimal() +
        ggplot2::theme(text = ggplot2::element_text(colour = "grey40"))
    return(scatter_plot)
}

Now we can do other plots easily:

plot_scatter_by_group(NHANES, "BMI", "TotChol", "SurveyYr")
#> Warning: Removed 1594 rows containing missing values (geom_point).

plot_scatter_by_group(NHANES, "Poverty", "BMI", "Education")
#> Warning: Removed 3368 rows containing missing values (geom_point).

plot_scatter_by_group(NHANES, "BMI", "BPSysAve", "Diabetes")
#> Warning: Removed 1518 rows containing missing values (geom_point).

Iterating over items

You’ll notice above that we copied and pasted multiple plot_scatter_by_group() calls to create three plots. Sometimes this is ok, but when you find yourself doing mostly the same thing repeatedly, it’s better to reduce the repetition (i.e. don’t repeat yourself, DRY). Here we will get into iteration and “functional programming” via the map function from the purrr package. Functional programming makes use of vectors and vectorisation to “loop” over things, rather than using explicit for loops. There are several reasons to avoid for loops in R:

  1. They can be computationally slow and use a lot of RAM, especially when written carelessly (e.g. growing an object inside the loop).
  2. They are difficult to read and debug.
  3. They aren’t very expressive, meaning they don’t state a higher “purpose”, just that they loop over something.

But functionals, like map and most functions inside purrr (or base R functions like lapply and sapply), let you ignore the tiny details and focus on your intent. You intend to sum a vector of numbers, not to loop over each number and add it to a running total (as you would in a for loop).
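To make the contrast concrete, here is a small sketch (the numbers are made up) of the same kind of task done with a for loop, a vectorised function, and a purrr functional:

```r
numbers <- c(1.5, 2.5, 3.0)

# The for-loop way: you manage the running total yourself.
total <- 0
for (number in numbers) {
    total <- total + number
}
total
#> [1] 7

# The vectorised way: state your intent directly.
sum(numbers)
#> [1] 7

# The functional way: apply a function to each element.
purrr::map_dbl(numbers, ~ .x * 2)
#> [1] 3 5 6
```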

For the example, let’s say we want to calculate the mean and SD for several categorical variables. We can use map to easily do this, by giving it a vector of variables and providing the function to apply to each:

map(c("SurveyYr", "Diabetes", "Gender"),
    # the ~ creates an anonymous function from the code that follows
    # .x is replaced by each value of the first argument ("SurveyYr", etc.)
    ~mean_sd_by_group(NHANES, .x))
#> [[1]]
#> # A tibble: 5 x 3
#>   Measurement `2009_10`     `2011_12`    
#>   <chr>       <chr>         <chr>        
#> 1 BMI         26.88 (7.47)  26.44 (7.28) 
#> 2 BPDiaAve    66.66 (14.22) 68.3 (14.45) 
#> 3 BPSysAve    117.6 (17.09) 118.7 (17.39)
#> 4 Pulse       73.49 (12.17) 73.63 (12.14)
#> 5 TotChol     4.93 (1.08)   4.83 (1.07)  
#> 
#> [[2]]
#> # A tibble: 5 x 3
#>   Measurement No             Yes           
#>   <chr>       <chr>          <chr>         
#> 1 BMI         26.16 (7.09)   32.56 (8.15)  
#> 2 BPDiaAve    67.43 (14.37)  68.04 (14.17) 
#> 3 BPSysAve    117.23 (16.73) 127.95 (19.42)
#> 4 Pulse       73.58 (12.05)  73.32 (13.22) 
#> 5 TotChol     4.89 (1.07)    4.78 (1.15)   
#> 
#> [[3]]
#> # A tibble: 5 x 3
#>   Measurement female         male          
#>   <chr>       <chr>          <chr>         
#> 1 BMI         26.77 (7.9)    26.55 (6.81)  
#> 2 BPDiaAve    66.39 (13.5)   68.58 (15.09) 
#> 3 BPSysAve    116.31 (17.82) 120.02 (16.45)
#> 4 Pulse       75.43 (11.93)  71.67 (12.1)  
#> 5 TotChol     4.96 (1.08)    4.8 (1.07)

It seems deceptively simple but there is a massive amount of power from doing this sort of coding. Let’s try with the plots.

map(c("Diabetes", "Education", "HomeOwn"),
    ~plot_scatter_by_group(NHANES, "Height", "Weight", .x))
#> [[1]]
#> Warning: Removed 371 rows containing missing values (geom_point).

#> 
#> [[2]]
#> Warning: Removed 2842 rows containing missing values (geom_point).

#> 
#> [[3]]
#> Warning: Removed 426 rows containing missing values (geom_point).

What if we want to vary two things at once? We can use map2 for that! (Note: the two vectors must be the same length.)

map2(c("TotChol", "Pulse"),
    c("Diabetes", "Education"),
    # .y is replaced by each value of the second vector given to map2
    ~plot_scatter_by_group(NHANES, "Height", .x, .y))
#> [[1]]
#> Warning: Removed 1587 rows containing missing values (geom_point).

#> 
#> [[2]]
#> Warning: Removed 3072 rows containing missing values (geom_point).

If you want to do all combinations of two vectors, use the purrr function cross_df() to create a data.frame of the combinations. Note that, like most purrr functions, it takes a list as input.

all_combinations <- cross_df(list(
    xvar = c("TotChol", "Pulse"),
    groupvar = c("Diabetes", "Education")
    ))
all_combinations
#> # A tibble: 4 x 2
#>   xvar    groupvar 
#>   <chr>   <chr>    
#> 1 TotChol Diabetes 
#> 2 Pulse   Diabetes 
#> 3 TotChol Education
#> 4 Pulse   Education
map2(all_combinations$xvar, all_combinations$groupvar,
    ~plot_scatter_by_group(NHANES, "BMI", .x, .y))

Using map is really useful when importing raw data that is in a standard form and structure. Combined with the fs package (fs for filesystem), this is very powerful. To show how to do it, let’s first export NHANES into the data/ folder.

NHANES %>% 
    filter(SurveyYr == "2009_10") %>% 
    write_csv(here::here("data/nhanes-2009_10.csv"))

NHANES %>% 
    filter(SurveyYr == "2011_12") %>% 
    write_csv(here::here("data/nhanes-2011_12.csv"))

Then we can use dir_ls() (for “directory list”) from fs combined with the regexp (regular expression) argument to list all files whose names contain nhanes, followed by anything in between (.*), and ending ($) in csv.

data_files <- fs::dir_ls(here::here("data/"), regexp = "nhanes-.*csv$")
data_files
#> /home/luke/Documents/teaching/rqawr-course/data/nhanes-2009_10.csv
#> /home/luke/Documents/teaching/rqawr-course/data/nhanes-2011_12.csv

Then we can import them all at once using map_dfr (map, then combine into a single data.frame by stacking the rows):

data_files %>% 
    map_dfr(read_csv)
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   SurveyYr = col_character(),
#>   Gender = col_character(),
#>   AgeDecade = col_character(),
#>   Race1 = col_character(),
#>   Race3 = col_logical(),
#>   Education = col_character(),
#>   MaritalStatus = col_character(),
#>   HHIncome = col_character(),
#>   HomeOwn = col_character(),
#>   Work = col_character(),
#>   BMICatUnder20yrs = col_logical(),
#>   BMI_WHO = col_character(),
#>   Testosterone = col_logical(),
#>   Diabetes = col_character(),
#>   HealthGen = col_character(),
#>   LittleInterest = col_character(),
#>   Depressed = col_character(),
#>   SleepTrouble = col_character(),
#>   PhysActive = col_character(),
#>   TVHrsDay = col_logical()
#>   # ... with 12 more columns
#> )
#> See spec(...) for full column specifications.
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   SurveyYr = col_character(),
#>   Gender = col_character(),
#>   AgeDecade = col_character(),
#>   Race1 = col_character(),
#>   Race3 = col_character(),
#>   Education = col_character(),
#>   MaritalStatus = col_character(),
#>   HHIncome = col_character(),
#>   HomeOwn = col_character(),
#>   Work = col_character(),
#>   BMICatUnder20yrs = col_character(),
#>   BMI_WHO = col_character(),
#>   Diabetes = col_character(),
#>   HealthGen = col_character(),
#>   LittleInterest = col_character(),
#>   Depressed = col_character(),
#>   SleepTrouble = col_character(),
#>   PhysActive = col_character(),
#>   TVHrsDay = col_character(),
#>   CompHrsDay = col_character()
#>   # ... with 13 more columns
#> )
#> See spec(...) for full column specifications.
#> # A tibble: 10,000 x 76
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <dbl> <chr>    <chr>  <dbl> <chr>         <dbl> <chr> <chr> <chr>    
#>  1 51624 2009_10  male      34 30-39           409 White <NA>  High Sch…
#>  2 51624 2009_10  male      34 30-39           409 White <NA>  High Sch…
#>  3 51624 2009_10  male      34 30-39           409 White <NA>  High Sch…
#>  4 51625 2009_10  male       4 0-9              49 Other <NA>  <NA>     
#>  5 51630 2009_10  female    49 40-49           596 White <NA>  Some Col…
#>  6 51638 2009_10  male       9 0-9             115 White <NA>  <NA>     
#>  7 51646 2009_10  male       8 0-9             101 White <NA>  <NA>     
#>  8 51647 2009_10  female    45 40-49           541 White <NA>  College …
#>  9 51647 2009_10  female    45 40-49           541 White <NA>  College …
#> 10 51647 2009_10  female    45 40-49           541 White <NA>  College …
#> # … with 9,990 more rows, and 67 more variables: MaritalStatus <chr>,
#> #   HHIncome <chr>, HHIncomeMid <dbl>, Poverty <dbl>, HomeRooms <dbl>,
#> #   HomeOwn <chr>, Work <chr>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <chr>, BMI_WHO <chr>,
#> #   Pulse <dbl>, BPSysAve <dbl>, BPDiaAve <dbl>, BPSys1 <dbl>,
#> #   BPDia1 <dbl>, BPSys2 <dbl>, BPDia2 <dbl>, BPSys3 <dbl>, BPDia3 <dbl>,
#> #   Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <dbl>,
#> #   UrineFlow1 <dbl>, UrineVol2 <dbl>, UrineFlow2 <dbl>, Diabetes <chr>,
#> #   DiabetesAge <dbl>, HealthGen <chr>, DaysPhysHlthBad <dbl>,
#> #   DaysMentHlthBad <dbl>, LittleInterest <chr>, Depressed <chr>,
#> #   nPregnancies <dbl>, nBabies <dbl>, Age1stBaby <dbl>,
#> #   SleepHrsNight <dbl>, SleepTrouble <chr>, PhysActive <chr>,
#> #   PhysActiveDays <dbl>, TVHrsDay <chr>, CompHrsDay <chr>,
#> #   TVHrsDayChild <dbl>, CompHrsDayChild <dbl>, Alcohol12PlusYr <chr>,
#> #   AlcoholDay <dbl>, AlcoholYear <dbl>, SmokeNow <chr>, Smoke100 <chr>,
#> #   Smoke100n <chr>, SmokeAge <dbl>, Marijuana <chr>, AgeFirstMarij <dbl>,
#> #   RegularMarij <chr>, AgeRegMarij <dbl>, HardDrugs <chr>, SexEver <chr>,
#> #   SexAge <dbl>, SexNumPartnLife <dbl>, SexNumPartYear <dbl>,
#> #   SameSex <chr>, SexOrientation <chr>, PregnantNow <chr>

Nifty eh?

Easier to use parallel processing with map

One big reason to get comfortable with map is that you can then use parallel processing to speed up your analysis, via the future_ versions of the map functions from the furrr package.

library(furrr)
# Spread the work across multiple cores or R sessions
plan(multiprocess)

# Use parallel processing
future_map2(all_combinations$xvar, all_combinations$groupvar,
    ~plot_scatter_by_group(NHANES, "BMI", .x, .y))

Final exercise

Time: Until end of session.

Create a new file for this exercise:

usethis::use_r("efficiency-exercise")

Copy and paste the template code (further below) into the R/efficiency-exercise.R file. Simplify the code below into three functions. Then iterate: use one map call to read in and wrangle each dataset, followed by two map calls to create the two different plots for each dataset.

nhanes_2009 <- read_csv(here::here("data/nhanes-2009_10.csv"))

nhanes_2009 %>% 
    mutate(ProblemBMIs = !between(BMI, 18.5, 40)) %>% 
    filter(!is.na(ProblemBMIs)) %>% 
    select(Age, Poverty, Pulse, BPSysAve, BPDiaAve, TotChol,
           SleepHrsNight, PhysActiveDays, ProblemBMIs) %>% 
    gather(Measurement, Value, -ProblemBMIs) %>% 
    na.omit() %>% 
    ggplot(aes(y = Value, x = Measurement, colour = ProblemBMIs)) +
    geom_jitter(position = position_dodge(width = 0.6)) +
    scale_color_viridis_d(end = 0.8) +
    labs(y = "", x = "") +
    theme_minimal() +
    theme(legend.position = c(0.85, 0.85)) +
    coord_flip()

nhanes_2009 %>% 
    mutate(ProblemBMIs = !between(BMI, 18.5, 40)) %>% 
    filter(!is.na(ProblemBMIs)) %>% 
    select(Age, Poverty, Pulse, BPSysAve, BPDiaAve, TotChol,
           SleepHrsNight, PhysActiveDays, ProblemBMIs) %>% 
    gather(Measurement, Value, -ProblemBMIs) %>% 
    na.omit() %>% 
    ggplot(aes(x = Value, fill = ProblemBMIs)) +
    geom_density(alpha = 0.35) +
    facet_wrap(~Measurement, scales = "free") +
    scale_fill_viridis_d(end = 0.8) +
    labs(y = "", x = "") +
    theme_minimal() +
    theme(legend.position = c(0.85, 0.15),
          strip.text = element_text(face = "bold")) 

nhanes_2011 <- read_csv(here::here("data/nhanes-2011_12.csv"))
    
nhanes_2011 %>% 
    mutate(ProblemBMIs = !between(BMI, 18.5, 40)) %>% 
    filter(!is.na(ProblemBMIs)) %>% 
    select(Age, Poverty, Pulse, BPSysAve, BPDiaAve, TotChol,
           SleepHrsNight, PhysActiveDays, ProblemBMIs) %>% 
    gather(Measurement, Value, -ProblemBMIs) %>% 
    na.omit() %>% 
    ggplot(aes(y = Value, x = Measurement, colour = ProblemBMIs)) +
    geom_jitter(position = position_dodge(width = 0.6)) +
    scale_color_viridis_d(end = 0.8) +
    labs(y = "", x = "") +
    theme_minimal() +
    theme(legend.position = c(0.85, 0.85)) +
    coord_flip()

nhanes_2011 %>% 
    mutate(ProblemBMIs = !between(BMI, 18.5, 40)) %>% 
    filter(!is.na(ProblemBMIs)) %>% 
    select(Age, Poverty, Pulse, BPSysAve, BPDiaAve, TotChol,
           SleepHrsNight, PhysActiveDays, ProblemBMIs) %>% 
    gather(Measurement, Value, -ProblemBMIs) %>% 
    na.omit() %>% 
    ggplot(aes(x = Value, fill = ProblemBMIs)) +
    geom_density(alpha = 0.35) +
    facet_wrap(~Measurement, scales = "free") +
    scale_fill_viridis_d(end = 0.8) +
    labs(y = "", x = "") +
    theme_minimal() +
    theme(legend.position = c(0.85, 0.15),
          strip.text = element_text(face = "bold")) 

Use this as a template for completing the exercise:

read_mutate_gather <- function(.file_path) {
    .file_path %>% 
        ___
}

plot_jitter <- function(.dataset) {
    .dataset %>% 
        ___
}

plot_density <- function(.dataset) {
    .dataset %>% 
        ___
}

# Read in and wrangle each csv file 
data_list <- 
    c(here::here("data/nhanes-2009_10.csv"),
      here::here("data/nhanes-2011_12.csv")) %>% 
    # (keep as list, i.e. don't use map_dfr)
    ___(___) 

# Plot each data with jitters
___(data_list, ___)

# Plot each data with density
___(data_list, ___)

If you’ve finished the exercise early, try using the furrr package to implement parallel processing on these map calls.

Resources for learning and help

For learning:

For help: