I am teaching a new undergraduate statistics class at Stanford, and an important part of the course is teaching students to run their own analyses using R/RStudio. Most of the students have never coded before, and debugging turns out to be one of the major challenges. Working with students over the last few days, I have found that a couple of the default features in R can combine to make debugging very difficult. Changing these defaults could have a big impact on new users' early learning experiences.
One of the datasets that we use is the NHANES dataset via the NHANES library. Over the last few days several students have experienced very strange problems, where the NHANES data frame doesn’t contain the appropriate data, even after restarting R and reloading the NHANES library. It turns out that this is due to several “features” in R:
- Users are asked when exiting whether to save the workspace image, and the default is to save it.
- The global workspace (saved as a .RData file in the working directory, here ~/.RData) is automatically loaded by default when R starts.
- When a package is loaded that contains a data object, this object is masked by any object in the global workspace with the same name.
Here is an example. First I load the NHANES library, and check that the NHANES data frame contains the appropriate data.
> library(NHANES)
> head(NHANES)
ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education MaritalStatus HHIncome HHIncomeMid
1 51624 2009_10 male 34 30-39 409 White <NA> High School Married 25000-34999 30000
2 51624 2009_10 male 34 30-39 409 White <NA> High School Married 25000-34999 30000
3 51624 2009_10 male 34 30-39 409 White <NA> High School Married 25000-34999 30000
4 51625 2009_10 male 4 0-9 49 Other <NA> <NA> <NA> 20000-24999 22500
5 51630 2009_10 female 49 40-49 596 White <NA> Some College LivePartner 35000-44999 40000
6 51638 2009_10 male 9 0-9 115 White <NA> <NA> <NA> 75000-99999 87500
Now let’s say that I accidentally set NHANES to some other value:
> NHANES=NA
> NHANES
[1] NA
Now I quit RStudio, clicking the default “Save” option to save the workspace, and then restart RStudio. I get a message telling me that the workspace was loaded, and I see that my altered version of the NHANES variable still exists. I would think that reloading the NHANES library should fix this, but this is what happens:
> library(NHANES)
Attaching package: ‘NHANES’
The following object is masked _by_ ‘.GlobalEnv’:
NHANES
> NHANES
[1] NA
That is, objects in the global environment take precedence over newly loaded objects. Someone who didn't know how to parse that message would have no idea that loading the package has had no effect on what the name NHANES refers to. The only way to rid ourselves of this broken variable is either to restart R after removing ~/.RData, or to remove the variable from the global workspace:
> rm(NHANES, envir = globalenv())
> library(NHANES)
> head(NHANES)
ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education MaritalStatus HHIncome HHIncomeMid
1 51624 2009_10 male 34 30-39 409 White <NA> High School Married 25000-34999 30000
2 51624 2009_10 male 34 30-39 409 White <NA> High School Married 25000-34999 30000
3 51624 2009_10 male 34 30-39 409 White <NA> High School Married 25000-34999 30000
4 51625 2009_10 male 4 0-9 49 Other <NA> <NA> <NA> 20000-24999 22500
5 51630 2009_10 female 49 40-49 596 White <NA> Some College LivePartner 35000-44999 40000
6 51638 2009_10 male 9 0-9 115 White <NA> <NA> <NA> 75000-99999 87500
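The same masking behavior can be reproduced without installing anything. The following is a minimal sketch that uses the built-in mtcars data set (from the datasets package, which is attached by default) as a stand-in for NHANES. It assumes nothing else in the session is named mtcars, and it uses assign() and rm() with an explicit globalenv() only so that it behaves the same wherever it is run. It also shows two things not used above: conflicts(), which reports names that appear more than once on the search path, and the pkg::name form, which reaches the package copy even while it is masked.

```r
# Stand-in for the NHANES example, using the built-in `mtcars` data set.

# "Accidentally" overwrite the name in the global environment.
assign("mtcars", NA, envir = globalenv())

# Name lookup reaches the global environment before any attached package,
# so the broken copy is the one that is found.
masked_value <- get("mtcars")
print(masked_value)

# conflicts(detail = TRUE) returns, for each search path entry, the names
# that also occur elsewhere on the search path.
global_conflicts <- conflicts(detail = TRUE)[[".GlobalEnv"]]
is_masked <- "mtcars" %in% global_conflicts
print(is_masked)

# pkg::name looks only inside the package, so it still finds the real data.
package_copy <- datasets::mtcars
print(dim(package_copy))

# Remove only the global copy; the package copy becomes visible again.
rm(list = "mtcars", envir = globalenv())
restored <- get("mtcars")
print(identical(restored, package_copy))
```

For the NHANES case the equivalent escape hatch would presumably be NHANES::NHANES, assuming the package exposes its data that way.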
This seems like a combination of really problematic default behaviors to me: automatically saving and then loading the global workspace by default, and masking objects loaded from libraries with objects in the workspace. Together they have resulted in hours of unnecessary confusion and frustration for my students, at exactly the point in their learning curve where it does the most harm.
I have one simple suggestion for the R developers: Please turn off automatic loading of the workspace by default. It would be as simple as changing the default on one checkbox, and it would potentially save new users lots of time and frustration.
Until that happens, beginning R users should do the following:
- Under the Preferences panel (the General tab in RStudio), uncheck the “Restore .RData into workspace on startup” option.
- I would also recommend setting the “Save workspace to .RData on exit” preference to “Never”. I generally only restart R when I want the entire workspace cleared out, so a saved workspace is never of use to me.
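If a saved workspace already exists, it can help to see what it would restore before deleting it. The following is a minimal sketch: list_saved_objects is a hypothetical helper name of my own, and a temporary file stands in for ~/.RData so that the example is self-contained. It relies on load() accepting an envir argument, which lets the saved objects land in a throwaway environment instead of the global workspace.

```r
# Hypothetical helper: list the objects stored in a saved workspace file
# without restoring them into the global environment.
list_saved_objects <- function(path) {
  scratch <- new.env()         # throwaway environment
  load(path, envir = scratch)  # objects are loaded into `scratch` only
  sort(ls(scratch, all.names = TRUE))
}

# Self-contained demonstration: write a small workspace file to a temporary
# location (standing in for ~/.RData) that contains a broken NHANES object.
demo_env <- new.env()
assign("NHANES", NA, envir = demo_env)
demo_file <- tempfile(fileext = ".RData")
save(list = "NHANES", envir = demo_env, file = demo_file)

# Inspect the file; the global environment is left untouched.
saved_names <- list_saved_objects(demo_file)
print(saved_names)

# On a real system, one could inspect the actual file if it exists:
# if (file.exists("~/.RData")) list_saved_objects("~/.RData")
```

From the console, quit(save = "no") ends the session without writing a workspace file, which has the same effect as choosing not to save in the exit dialog.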