Monday, January 22, 2018

Defaults in R can make debugging incredibly hard for beginners

I am teaching a new undergraduate statistics class at Stanford, and an important part of the course is teaching students to run their own analyses using R/RStudio.  Most of the students have never coded before, and debugging turns out to be one of the major challenges. Working with students over the last few days I have found that a couple of the default features in R can combine to make debugging very difficult on occasion.  Changing these defaults could have a big impact on new users' early learning experiences.

One of the datasets that we use is the NHANES dataset via the NHANES library.  Over the last few days several students have experienced very strange problems, where the NHANES data frame doesn’t contain the appropriate data, even after restarting R and reloading the NHANES library.  It turns out that this is due to several “features” in R:
  • Users are asked when exiting whether to save the workspace image, and the default is to save it.
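  • The global workspace is saved as a file named .RData in the working directory (by default the home directory in RStudio, hence ~/.RData), and this file is automatically loaded by default upon starting R.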
  • When a package is loaded that contains a data object, this object is masked by any object in the global workspace with the same name.  

Here is an example.  First I load the NHANES library, and check that the NHANES data frame contains the appropriate data.

> library(NHANES)
> head(NHANES)
     ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3    Education MaritalStatus    HHIncome HHIncomeMid
1 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
2 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
3 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
4 51625  2009_10   male   4       0-9        49 Other  <NA>         <NA>          <NA> 20000-24999       22500
5 51630  2009_10 female  49     40-49       596 White  <NA> Some College   LivePartner 35000-44999       40000
6 51638  2009_10   male   9       0-9       115 White  <NA>         <NA>          <NA> 75000-99999       87500

Now let’s say that I accidentally set NHANES to some other value:

> NHANES=NA
> NHANES
[1] NA

Now I quit RStudio, clicking the default “Save” option to save the workspace, and then restart RStudio. I get a message telling me that the workspace was loaded, and I see that my altered version of the NHANES variable still exists.  I would think that reloading the NHANES library should fix this, but this is what happens:

> library(NHANES)

Attaching package: ‘NHANES’

The following object is masked _by_ ‘.GlobalEnv’:

    NHANES

> NHANES
[1] NA

That is, objects in the global environment take precedence over newly loaded objects.  Anyone who didn't know how to parse that message would have no idea that the loading operation has no visible effect: the name NHANES still refers to the broken object.  The only way to rid ourselves of this broken variable is either to restart R after removing ~/.RData, or to remove the variable from the global workspace:

> rm(NHANES, envir = globalenv())
> library(NHANES)
> head(NHANES)
     ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3    Education MaritalStatus    HHIncome HHIncomeMid
1 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
2 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
3 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
4 51625  2009_10   male   4       0-9        49 Other  <NA>         <NA>          <NA> 20000-24999       22500
5 51630  2009_10 female  49     40-49       596 White  <NA> Some College   LivePartner 35000-44999       40000
6 51638  2009_10   male   9       0-9       115 White  <NA>         <NA>          <NA> 75000-99999       87500
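
For readers who want to check for this kind of masking directly, here is a minimal sketch. It uses the built-in mtcars dataset from the datasets package as a stand-in for NHANES, purely so that it runs without installing anything; that substitution is my assumption, not part of the original problem. It relies only on find(), exists(), and the :: operator from base R.

```r
# Simulate the accident: overwrite the name of a packaged dataset in the global workspace.
# (mtcars stands in for NHANES here; this is an assumption made for illustration.)
assign("mtcars", NA, envir = globalenv())

# find() lists every place on the search path that holds an object with this name,
# in the order R looks them up. The global workspace comes first.
where_found <- find("mtcars")
print(where_found)

# TRUE when an object with this name lives in the global workspace itself
masked <- exists("mtcars", envir = globalenv(), inherits = FALSE)
print(masked)

# The :: operator asks the package directly, so the original data is still reachable
intact <- datasets::mtcars
print(is.data.frame(intact))

# Removing the global copy makes the plain name resolve to the package version again
rm("mtcars", envir = globalenv())
print(is.data.frame(mtcars))
```

If find() reports ".GlobalEnv" ahead of the package, the copy in the workspace is the one R will use.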

This seems like a combination of really problematic default behaviors to me: automatically saving and then loading the global workspace by default, and masking objects loaded from libraries with objects in the workspace.  Together they have resulted in hours of unnecessary confusion and frustration for my students, at exactly the point in their learning curve where such setbacks do the most harm.

I have one simple suggestion for the R developers: Please turn off automatic loading of the workspace by default.  It would be as simple as changing the default on one radio box, and it would potentially save new users lots of time and frustration.

Until that happens, beginning R users should do the following:

  • In the RStudio Preferences panel (under the General tab), unselect the “Restore .RData into workspace on startup” option.  
  • I would also recommend setting the “Save workspace to .RData on exit” preference to “Never”, since I find that I generally only restart R when I want the entire workspace cleared out, so this option will never be of use to me.
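
If a saved workspace file already exists, it can help to look inside it before deciding whether to delete it. The sketch below is one possible way to do that, not an official tool: masking_objects() is a hypothetical helper name, and the demo builds a throwaway workspace file in a temporary location (again using mtcars as a stand-in) rather than touching a real ~/.RData.

```r
# Hypothetical helper: report which objects in a saved workspace file share a name
# with an object in an attached package, and would therefore mask it once restored.
masking_objects <- function(rdata_path) {
  saved <- new.env()
  load(rdata_path, envir = saved)   # load into a scratch environment, not the session
  saved_names <- ls(saved)
  # Keep only the names that also exist in some attached package
  Filter(function(nm) any(grepl("^package:", find(nm))), saved_names)
}

# Demo: write a throwaway workspace file containing one clashing name and one harmless name
demo_env <- new.env()
assign("mtcars", NA, envir = demo_env)
assign("my_scratch_value", 1, envir = demo_env)
demo_file <- tempfile(fileext = ".RData")
save(list = ls(demo_env), file = demo_file, envir = demo_env)

clashes <- masking_objects(demo_file)
print(clashes)

# Outside RStudio, the same two preferences can be expressed when starting or leaving R:
#   R --no-save --no-restore-data    (command-line flags)
#   quit(save = "no")                (leave the session without saving the workspace)
```

To inspect a real saved workspace, pass its path instead of demo_file, for example masking_objects("~/.RData").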

1 comment:

  1. I agree. I found exactly the same problem when I first started using R and quickly discovered the solutions you describe here. But the problem is the quirky default behavior. And I suspect most experienced R users don’t even think about it anymore (because they have changed their default settings). But it is annoying and especially so when you discover there is no appropriate command to clear the workspace and the recommended method is to restart R
