Saturday, April 3, 2021

Teaching statistics online (for the first time): Lessons learned


We just completed the 2021 Winter Quarter at Stanford, and it was my first time teaching my introductory statistics course fully online.  The goal of this post is to talk through how it went and the lessons learned.  I apologize in advance for the length of this!

Course structure

This course (Stats 60/Psych 10) is the pre-calculus version of intro statistics that is meant to serve a broad range of students across the university (including, but certainly not limited to, Psychology students), with a roughly even mix of students from each class (freshman through senior). The enrollment this quarter was well over 100 students.  This is the fourth time that I have taught the class, but the first time teaching it fully online.  We made a number of major changes to accommodate the online format:

Modular course structure: We reorganized the course around Canvas modules, each of which was focused on a particular topic and lasted one week.  Each module included:

  • Pre-recorded lecture videos (usually 2-3 videos, 10-15 minutes each), with quiz questions embedded (using Panopto) to ensure engagement
  • Readings from my open-source textbook
  • A quiz (which had to be completed with 100% correct for credit, and could be retaken as many times as necessary to achieve that score)
  • A problem set (included in every other module)
  • Milestones for the final project (in some modules); the final project was an independent data analysis project using openly available data, completed in groups of 3-4 students.

At any point in the quarter, the students could access the modules for the current week and the following two weeks, which gave them an opportunity to get ahead when needed but also ensured some degree of spaced learning.

Fully flipped class: There were no standard lectures during the synchronous class sessions (which met 3 times weekly, 50 minutes per session);  pre-recorded lecture videos were provided for each module, and students were required to watch them and complete the questions embedded in the videos (using Panopto on Canvas).  Students were required to attend at least one of these synchronous sessions, or alternatively to complete a written makeup assignment.  The synchronous sessions were organized roughly as follows:

  • Monday: review of core concepts  (usually with me drawing on the Zoom whiteboard), and group activities
  • Wednesday: Live coding, including problem set review in weeks following a problem set deadline
  • Friday: Answering questions (which students could post to a Google Doc each week) and breakout room activities

Schedule-driven grading: Inspired by Patrick Watson’s outstanding post, I decided to move to a schedule-driven grading system, in which students start the quarter with 105 points and lose 2.5 points for every assignment that they don’t complete (including attending at least one course session or completing a makeup exercise, and attending discussion sections).  Thus, they always know exactly what their grade is (assuming they complete everything else in the quarter).  Most grading was for completion; for the problem sets, we ran the submissions through an automated testing system, and students with multiple errors were given a chance to revise their submission. For lecture attendance, students self-reported their attendance; this was double-checked on occasion against the Zoom logs, with no major discrepancies. The goal in general was to ensure not just completion but mastery of the material.
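
Just to make the arithmetic concrete, here is a tiny R sketch of the grade calculation (my own illustration, not code we actually used in the course):

    # sketch of the schedule-driven grade calculation described above
    grade_points <- function(n_missed) max(105 - 2.5 * n_missed, 0)
    grade_points(0)   # 105: completed everything
    grade_points(4)   # 95: missed four assignments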

Purpose-built R tutorials:  An essential part of the class is learning to perform statistical analyses using R.  This is challenging because many of the students in the course have never coded before, and 10 weeks is a very short time to teach this!  In past years we have used Datacamp tutorials, but found that they were not well aligned with the specific topics that we were teaching.  For this year, I developed a set of interactive R tutorials (using the awesome learnr package) which were specifically built to emphasize the skills that we wanted them to have, with minimal distraction.  This also allowed some changes in how we teach R.  For example, in recent years we have introduced pipes from the very beginning of teaching the tidyverse, but found that they were very difficult for many students to conceptualize.  This year we started with the tidyverse without using pipes (i.e. using sequences of individual commands), and only introduced pipes in the last few weeks of the course.  Again, it’s hard to draw firm conclusions since so many things changed, but this particular change definitely seemed to reduce confusion in the early stages of learning R.
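
To make that contrast concrete, here is a minimal sketch (not taken from the actual tutorials) of the same operation written first as a sequence of individual tidyverse commands and then as a pipe, using the built-in iris dataset:

    # illustrative example using the built-in iris dataset (not course material)
    library(tidyverse)

    # without pipes: store each intermediate result in its own variable
    setosa_only <- filter(iris, Species == "setosa")
    setosa_summary <- summarize(setosa_only, mean_length = mean(Sepal.Length))

    # with pipes: chain the same two steps together
    setosa_summary <- iris %>%
      filter(Species == "setosa") %>%
      summarize(mean_length = mean(Sepal.Length))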

Google Colab:  In the past we have tried having students install RStudio on their own computers, or using rstudio.cloud for cloud access. However, the former is problematic since many students have Chromebooks that can’t run RStudio, and the latter now charges a substantial amount.  We made the choice to try Google Colab as our coding platform; the ability to run R notebooks is somewhat obscured (they have to be created using a special link) but once created they work well.

Problem sets with embedded tests: In the past we have given students a “skeleton” code file that provided them with some scaffolding for their problem sets and ensured that variables had the correct names (so that our automated testing system was less likely to fail).  This year, we decided to embed a test into each code cell in the skeleton, so that students could immediately see for each cell whether they had correctly completed the problem (primarily by testing to see that their variables had the correct sizes, types, and values), giving them a “good job!” message when they did so.
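
To give a flavor of what this looked like, here is a hypothetical version of one such cell; the problem, variable name, and checks are all made up for illustration rather than copied from the actual skeletons:

    # Hypothetical problem: store the mean of the numbers 1 through 10 in a
    # variable called my_mean
    my_mean <- mean(1:10)   # student's answer goes here

    # embedded test: check that the variable exists and has the right type,
    # length, and value, then print an encouraging message
    if (exists("my_mean") &&
        is.numeric(my_mean) &&
        length(my_mean) == 1 &&
        isTRUE(all.equal(my_mean, 5.5))) {
      message("good job!")
    } else {
      message("not quite - check your answer and try again")
    }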

Sections: In the past, we have used discussion sections for concept review in addition to working on the group projects. This year, we decided to dedicate section time solely to working on group projects, given that students often need a lot of time to work on these.

Lessons learned:

Overall I thought the new course structure worked incredibly well, and the students seemed to agree.   On the question of “How much did you learn from this course”, more than 2/3 of the students said “A lot” or “A great deal”.  We didn’t obtain these overall ratings last year due to the COVID onset, but in 2019 only 45% rated the course at this level, suggesting that we have significantly improved the student experience (that difference is significant at p<.001 if you need statistical evidence :-).  Clearly one needs to take any such comparison with a grain of salt since many things have changed, but the qualitative comments from the students were also markedly more positive this year.  In what follows I will paraphrase some of the comments from the student evaluations.  
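
(For anyone curious how that comparison might be run, here is the kind of two-proportion test one could use in R; the counts below are made-up values chosen only to roughly match the reported percentages, not the actual evaluation numbers.)

    # hypothetical counts, chosen only to roughly match the reported percentages
    # (~67% of respondents in 2021 vs. 45% in 2019); not the real evaluation data
    prop.test(x = c(100, 68), n = c(150, 150))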

Modular structure: Students appreciated the organization of the modules; the course received very high ratings on the question “How organized was this course?”, with 95% saying “Extremely organized” or “Very organized”; only 59% of students rated it that well in 2019.

Flipped class structure: I felt that the ability to talk with students and walk through problems on the digital whiteboard was really effective at helping me understand which concepts they were struggling with.  I was also able to address the questions that they had posted to the weekly Google Doc, which allowed me to spend much more time focusing on concepts that needed additional attention.  Finally, the chat function in Zoom seemed to encourage questions from students who might have been reluctant to speak up in class before.

Schedule-driven grading: Perhaps not surprisingly, the students loved the grading system; on the course evaluation question of “How did you feel about the schedule-driven grading system?”, 90% of students were strongly positive and 7% were weakly positive.  A number of students mentioned in their comments that the grading system allowed them to “focus on learning” rather than worrying about their grade.  This was particularly noted by students with no coding experience, for whom the course can be quite daunting.  The only major complaints regarding the grading system were from students who thought that they would have been more engaged in the course if they had been graded for accuracy rather than completion.

Purpose-built tutorials: These were a big hit, in particular because they directly matched the problem sets; whereas in previous years students might have to search online resources to find how to solve a particular problem, this year every coding concept that they needed for the problem sets had been covered in the tutorials.  And they didn't have to learn a lot of R concepts that would never show up in class.

Colab: In comparison with previous years, there was very little friction around the coding platform; in general Colab worked very well.  In particular, it made it very easy to share my notebooks from class, so that students could view them and create a copy that they could edit themselves if they wanted.  77% of students rated Colab positively, and only 5% rated it negatively.  The main complaint was that it doesn’t allow simultaneous editing of a notebook by multiple people (a la Google Docs), which occasionally led to collisions when students were simultaneously editing a shared notebook.  I will definitely choose Colab again for next year’s class!

I would say that the one thing that didn’t work well was breakout rooms during the synchronous class sessions; the majority of students rated them as Moderately useful (33%), Slightly useful (17%), or Not useful at all (18%).  One major issue is that many students would apparently join the breakout room but then keep their cameras turned off and not participate in the discussion; occasionally this would leave a student as the only responsive individual out of 5 or 6 people in a breakout room.  Let’s hope that when I teach it again in Winter 2022 we will be in person and not over Zoom…

That said, one thing that I think actually worked better over Zoom than in person was live coding. It was certainly better for me as the instructor, because I was able to use my large monitor and have more information at my fingertips (sharing only my notebook screen) than I can using my laptop in a lecture hall, where my entire screen is in view.  In addition, it’s much easier for students to type coding answers into the chat window than it is to say them out loud.  I am definitely going to consider a hybrid approach going forward, in which the live coding sessions are held remotely even if the remainder of the class is held in person.

In closing, I have to give a shout out to my awesome teaching team, who helped make the experience of teaching this class so seamless and enjoyable, as well as to my students who remained engaged despite the challenges of online learning.  I’m looking forward to teaching the course in person next year, but I think that the experience of taking it fully online will definitely improve the course even when it’s back in a physical classroom.


Sunday, November 29, 2020

Editing lecture videos using DaVinci Resolve

In my previous post I described a simple workflow for generating lecture videos.  One limitation of this workflow is that when slides are shared as a virtual background in Zoom (which I like to do in order to keep my webcam image on the screen next to the slides), the cursor is not captured in the video recording.  Since I occasionally need to highlight a portion of the image, this means that I need to edit the video files to add that highlighting.  To do this I decided to use DaVinci Resolve 16, which is a powerful video editing tool that is available for free.  It has a bit of a learning curve, but the power appears to be well worth it; here I will show my workflow for adding annotations to a lecture video.  I'm mostly doing this so that I remember how to do it next time around, but hopefully it might also be useful for others.

In this example I am discussing Z-scores, and I want to highlight the locations of Z = 0 (i.e. the mean) and Z = 1 (one standard deviation) on the normal distribution.  After opening DaVinci Resolve, I start a new project for my video, which will open a browser for the media in my project.  I then import my lecture video using File -> Import File -> Import Media; it will ask whether you want to change the video settings to match the current project, which I accept.  The video will now appear in the media browser in the top left; drag the video onto the timeline browser in the lower leftmost portion of the screen, which will add it to your timeline.  Now go to the "Edit" window by clicking the Edit button at the bottom (third icon from the left).  You will then see your video in the timeline at the bottom, along with a preview window at the top:



Now let's add our annotation. First, find the location in the video where you want to add the annotation.  Then, right click in the area just to the left of the timeline, and add a new track:


Now open the Effects library using the tab at the top of the screen (if it's not already open), and navigate to the Effects panel under the Toolbox option.  You should see an option for "Adjustment Clip" - grab this and drag it to the location on your video that you had identified.



Then, select and right click on the Adjustment Clip that was just created, and choose "Open in Fusion Page":



This will open the Fusion editor, which is a powerful tool for all sorts of video edits.  You will see a section in the bottom left showing two nodes for MediaIn1 and MediaOut1; what you need to do is add a Paint node into the line connecting those nodes, which you can do by right-clicking on the line and adding a Paint tool node:


You will now see details about the Paint tool in the inspector to the top right.  Here is perhaps the most important thing to know:  The tool that is opened by default (the "Multistroke" tool) doesn't do what we want it to do here, which is to create a graphic that remains on the screen for the length of our Adjustment Clip.  To do that, select the simple "Stroke" tool, which for me is the fourth icon in the panel above the preview:




You should then see a set of controls for Stroke1 in the Inspector to the top right:


Choose the color for your annotation and change any other features of interest; I will use a red paintbrush, so I click on the color chooser and pick a red color.  Then you simply start painting in the preview window:


Now go back to the editing window, and adjust the length of the Adjustment Clip as needed for your video.  You can add as many additional annotations as needed using this same method.  

Occasionally I realize that I have said something incorrect in the video. Rather than re-recording or trying to edit the video itself, I simply add a title to the screen noting that I misspoke. This is easy using the Title feature from the Effects library in the editing page:




Once you are done, simply export the video using the QuickExport feature and you are ready to go!


Thursday, November 26, 2020

A quick and dirty workflow for creating lecture videos

I'm currently in the midst of fully reworking my undergraduate statistics course for online learning, which includes creating about 40 short (5-10 minute) mini-lectures for the students to view asynchronously.  As much as I would love to put many hours into generating high-touch videos, my time is limited so I needed a workflow that would allow me to generate these videos with as little overhead as possible.  Here is what I came up with.

Platform: I'm using a MacBook Pro, as part of the home office setup described in my previous post.

Software: I use Keynote to create the slides, Zoom to record the presentation, and QuickTime Player for cleaning up the video.

Slide Prep:  First, it's important to make sure that your slides don't use any builds, because (at least for Keynote) the Zoom "Slides as virtual background" feature doesn't support builds.  So just separate your builds out into separate slides.  Second, because your head will appear in the bottom right of the screen, you should make sure that there is no essential material that appears in that location.  In the worst case, you can always just move your head out of the way, which I imagine is somewhat amusing for the viewer.

Recording workflow:

1. Start a Zoom meeting, and start Screen Sharing. Under the Advanced tab, choose "Slides as virtual background", share the screen, and then choose your presentation file.



The slides will load, with a small image of your head in the bottom right.  



One exception to this workflow is if you need to present video as part of your presentation, which doesn't work with the "Slides as virtual background" option. In this case you'll need to use a regular screen share, which will lose the talking head in the corner.

2. Start recording, being sure to select "Record on this computer".

3. After you start recording, give yourself a few seconds to settle and get any fidgets out of the way.  Then start talking.  I try to give the entire lecture without stopping, realizing that I will probably make a few mistakes, and that's ok.  Occasionally I find myself totally flummoxed part way through, or realizing that I need to make a big change, in which case I simply quit and start over.  Since each of the videos is relatively short, I don't lose that much time if I have to bail partway through.

One tip that I still find somewhat difficult to follow: Try to finish your comments about a particular slide before you flip to the next slide.  I find that I have a habit of flipping forward to the next slide as I am finishing my comments about a slide.  This is usually fine for a talk, but for these lectures I am using Panopto within Canvas to embed quiz questions in the video, which I usually want to place at a transition between slides. However, if I am still talking about the previous slide after I have transitioned to the next slide, the quiz placement becomes awkward.

4. When you get to the end of the lecture, give yourself a few seconds of stillness on the last slide or on a blank slide inserted after the last slide.

5. End the Zoom session using the End button (no need to stop sharing).  This will cause Zoom to save the video to a file, which will pop up in the Finder once it's done. 

Post-processing workflow:

1.  Open the mp4 file from the Zoom recording folder in QuickTime Player. 

2. Find the point where you want to start the video, just before you start talking.  With the player paused at that location, choose "Split Clip" from the Edit menu. Click on the leftmost section in the timeline, and press Delete to remove that leading section, then click Done to save the change.  Now do the same for the end of the video, finding the point where you want to end and removing the trailing section.

There is a "Trim clip" feature that one can use to do this in a single step rather than two,  but I find that it's easier to be precise about where the trimming happens using the Split Clip method.

3. Close and save the video to a new .mp4 file.

I find that this method takes me only a minute or so to post-process each video once it's recorded.  Of course, you could do much fancier stuff if you wanted; in that case I would check out DaVinci Resolve, which is one of the most amazing pieces of free software ever created but has a pretty steep learning curve for serious video editing.

Uploading the video:

If you are using Canvas and your instance supports Panopto, then I would recommend using that method to upload the videos, since it provides viewing statistics (e.g. for recording which students have watched the video) and also allows embedding quiz questions within the video.  

As always, suggestions are welcome in the comments below!

Tuesday, October 13, 2020

Home office setup

One of my colleagues recently asked me about my home office setup, after noting that my video and audio quality is generally quite good on our frequent Zoom calls.  Whenever anyone asks me a question like this, I take it as a good excuse to write a blog post!

We have all spent a lot of time in our home offices since March, and I’m lucky that we have a guest bedroom that I was able to repurpose as my home office.  I’ve ended up spending a bit of money to make it nice, but I think in general the investments have been good.  However, I’ve also gone cheap/DIY when I can.  Here is a photo of my desk setup:



Here’s a quick rundown of the various items:
  1. Camera: Logitech C930e
  2. Microphone: Audio-Technica ATR2100x-USB with Sterling Audio Sterling SM5 Shock Mount
  3. Pop Filter: Stedman Proscreen XL
  4. Mic boom: Rode PSA1
  5. Headphones (over-ear): Audio-Technica ATH-AD700X Open-air
  6. Headphones (in-ear): Apple AirPods
  7. Lighting: Homemade diffusers with Cree 5000K LED bulbs
  8. Green screen: Homemade
  9. Chair: Steelcase Leap

Camera: This webcam was scavenged from my lab at the beginning of the pandemic, back when it was impossible to find a webcam in stock for purchase.  It works fine, though I wouldn't say that the picture quality is amazing.  After seeing one of my colleagues get amazing video quality by using their DSLR as a webcam, I tried it out with our relatively ancient Canon Rebel - the color was much better but its video was way too laggy, and the camera/tripod setup took up too much room on my desk, so I’ve stuck with the Logitech.  I use the Webcam Settings App for Mac to zoom the image so that my head takes up most of the image without having to lean into the camera.

Microphone setup:  I wanted to get a boom mic rather than a stand mic, mostly because I didn’t want a stand mic taking up extra space on my desktop.  I know that many people use either a lapel mic or a mic integrated into their headset, but neither of those sounded attractive to me.  The microphone connects via USB to my computer, and works really well. I went for a nicer mic in part because I was planning to record an audio version of my statistics book to provide to my students, and I’ve been really happy with the sound quality.  The shock mount does a good job of isolating low-frequency noise from the desk, though a tiny bit of keyboard noise is evident when I’m typing, even with the mic pointed directly away from the keyboard.  The Rode mic boom can sometimes be difficult to keep in position, but works fine for my purposes.  I don’t use the pop filter for Zoom calls, but it has been important for recording spoken word material, which otherwise sounds like I’m spitting on the listener.

Headphones: I generally alternate between in-ear and over-ear headphones throughout the day.  I love the AirPods, but after a while they start hurting my ears, and they don’t have enough battery life to get me through a full day of Zoom meetings.  The Audio-Technica headphones were my first open-back headphones, and I am definitely a convert - they let you hear the outside world, and don’t leave you with that closed-in feel that you get from closed-back headphones.  They are also super comfortable.  These are standard wired headphones, which I like both because they don’t have the lag of Bluetooth headphones (not so important for Zoom calls but essential when I’m playing guitar), and also because I will never be stuck with a dead battery.

Lighting: Everything else involved buying some equipment, so for the lighting setup I decided to go DIY (with lots of help and encouragement from my designer/wife Jen).  I wanted a simple two-point lighting setup from the two sides of my monitor, so we started with a couple of old table lamps that we had around the house.  I took a couple of empty wooden picture frames and attached each one to one of the lamp arms using a plastic cable stay, which is not exactly bulletproof but so far has lasted several months without failing.




To create a diffuser I started with some architectural tracing paper, which I affixed in a sleeve around the picture frames.  Ultimately this wasn’t quite enough diffusion (I was still seeing strong reflections of the light in my glasses), so I also attached a piece of standard printer paper to the front with a binder clip.  I still get a bit of point glare, but it’s not too bad:



I'll probably try to do some more tweaking to resolve that.  We started with some warmer bulbs but I didn’t love the color, so I replaced them with Cree 5000K LED bulbs which I’m pretty happy with.

Green screen: I don’t usually use a green screen, but sometimes I need it if I want to play with video editing software for lecture videos. This one is also DIY - basically a wheeled clothing rack with a green fleece blanket attached using some large binder clips. 




Definitely not pretty, but gets the job done.

Chair: After spending the first few months of quarantine sitting in a cheapo office chair (and feeling the effects by the end of the day), I decided to splurge on a serious office chair. I already had a Steelcase Leap in my campus office, so I knew I would be happy with it. It has not disappointed - it’s definitely not cheap, but if you need a really good chair and have the budget I would definitely recommend it. Your butt will thank you!

I'm interested to hear your thoughts and any tips on how to further optimize the setup.

Thursday, July 23, 2020

Vacation fun: Making traditional Texas chili con carne

I’m on vacation at home this week, and one afternoon when it was especially gray (because San Francisco in July) I decided to cook up some chili con carne.  The recipe that I use is a modification of one that I can no longer find online, written by Reece Lagnuas back when he was a butcher in Austin.  I documented this cook because the recipe is such a great rendition of the traditional Texas chili that I grew up eating that I thought it should be out there for everyone to try.  And this batch turned out to be especially good!

Your reward at the end of this journey
Be forewarned, this dish requires a pretty substantial time commitment - from start of prep until the dish was cooking it took me about 90 minutes.  Once it’s cooking you just need to check it occasionally to make sure it’s simmering and not cooking too hard - it should be ready to eat within 2-3 hours.  Perfect activity for a cool, gray vacation afternoon!

Also - I've never written a recipe before, I apologize in advance for how verbose it is...





Ingredients:

  • Meat: I hope it goes without saying that you should only cook with humanely raised meat.  The meat for this cook came from our neighborhood butcher shop, Avedanos, which supports local family farms.  
    • ~ 2.5 pounds pork shoulder
    • ~ 2.5 pounds brisket (preferably from the fattier end, known variously as the point or deckle)
    • if you have your own meat grinder then buy them whole, otherwise ask your butcher to grind them as coarsely as possible 
  • 1 large onion - diced relatively small
  • fresh chiles
    • 2 red bell peppers
    • 2 poblano peppers
    • 2 large jalapeno peppers
  • dried chiles - I use a varying mix, this time it was:
    • 2 chile ancho
    • 3 dried pasilla
    • 2 chile guajillo
  • seasonings: you can mix all of these together as they will be added at the same time
    • salt (start with 1 tbsp; we like it salty so usually add more to taste later in the cook)
    • ground black pepper (1 tsp)
    • cayenne pepper (if you want it spicy - for this cook, I added about 1/3 tsp of Penzey’s Black & Red, which is a mix of black and cayenne pepper - the end result had just a very tiny bit of spicy kick)
    • chili powder (3 tbsp)
    • ground cumin seed (2 tsp)
    • garlic powder (1 tbsp)
    • onion powder (1 tbsp)

Steps:

Roast the fresh peppers.  The goal here is to char the skins so that they come off easily after steaming.  I used my outdoor gas grill, but you can also do this directly over the burner of a gas range.  If you don’t have gas then it sounds like you can also use an electric range or toaster oven.  You want the skins to be charred black over as much of the pepper as possible, so you will need to turn them regularly; the larger peppers will probably take much longer than the small ones.  Once they are nicely charred, then put them in a loosely sealed container to steam for at least 20 minutes.

Roasting the fresh chiles on the backyard gas grill


Roast and rehydrate the dried chiles.  This will require a hot pan (I used the same Dutch oven that I will use to cook the chili) and about a quart of boiling water.  Heat the pan on high and toss in the chiles, turning them regularly to prevent burning.  When they start to smell roasty, place them in a heatproof bowl for soaking.  Before you soak them, use some scissors to cut small holes in the side of each chile - this will make it easier to get any air out and submerge the chiles fully. After cutting the holes, pour the boiling water over the chiles.


Roasting the dried chiles

Prepare the meat.  If your meat was ground by your butcher then you can skip this step.  I like to grind the meat myself, since butchers often need time to set up their grinder for a coarse grind.  I use the meat grinder attachment for our KitchenAid mixer.  When grinding meat, it’s important for both the meat and grinder to be as cold as possible, so I put both of them in the freezer for about an hour before grinding the meat.  Chop the meat into strips or chunks that are small enough to fit in the grinder feed; I like to leave most of the fat on and remove it later during the cook, but sometimes I will trim away large fat pieces.

Action shot - grinding the brisket


Clean and chop the chiles.  Remove the skins from the fresh chiles (they should come off easily after steaming), and also remove the stem, seeds, and membranes inside each chile.  Don’t wash them!  For the dried chiles, try to remove as much of the seeds and membrane as possible (don’t worry about the skins).  Then chop them until they are nearing the consistency of a paste; this generally takes a lot of work.
Dried chiles after roasting
Another action shot - chopping chiles

Time to start cooking!  Add about 2 tbsp of oil to the large pot, and cook the onions on relatively high heat until they are just starting to brown, stirring constantly.





Blooming the spices - the smell is amazing
Add the spice mixture once the onions are starting to brown, and stir constantly for a minute or two. You should smell the spices bloom, especially the cumin seed.

Brown the meat. After blooming the spices, add the meat and cook for several minutes until it is starting to brown. You should be able to smell the meat browning and start to see fat from the meat rendering out in the pan.

Add the chile paste and mix it into the meat. Then add just enough water to cover the meat; for this cook it was about 6 cups.

About 3 hours in - almost done!

Bring to a boil and then reduce to a simmer.  The chili will then cook for at least 2 hours and preferably 3 or more hours; for this cook, it went a bit more than 3 hours.

Skim extra grease.  A couple of hours into the cook, there will likely be a substantial amount of grease on the top of the chili.  I like to remove some of this before serving, so that the chili isn’t too greasy. There are probably fancy ways to do this, but I simply use a Chinese soup spoon to skim the fat off of the top.  This time around I ended up removing about 1.5 cups of fat.

When you are ready to eat, taste the chili and add salt as needed.

Enjoy!  I don’t generally like adulterating my chili with any additions, but this time I tried it with a bit of guacamole on the side, and it was really good.

This recipe makes a lot of food — we usually have enough left over from this recipe for two additional meals (for two people).  The chili keeps well in the freezer for at least a month, though it rarely lasts that long around here...






Friday, January 24, 2020

Talking remotely: Lessons learned so far

Since making my commitment to reduce air travel for academic purposes, I’ve been giving a lot more remote talks.  In the last 5 months I have given 10 remote talks - many thanks to those who have agreed to host me virtually rather than in person:

September:
  • National Academies Data Science in the Cloud workshop, Washington, DC
  • National Academies Brain Health Across the Lifespan workshop, Washington, DC
October:
  • Cognitive Science Colloquium, Institut d'Etudes Cognitives, École Normale Supérieure, Paris, France.
  • Johns Hopkins University Dept. of Electrical and Computer Engineering, Distinguished Lecturer Series, Baltimore, MD
  • NIMH  Talk Series on Machine Learning in Brain Imaging, Neuroscience, and Psychology, Bethesda, MD
November:
  • Santa Fe Institute, Cognitive Regime Shift meeting, Santa Fe, NM
  • Montreal Neurological Institute, Open Science Symposium, Montreal
  • Johns Hopkins University Dept. of Biostatistics, Baltimore, MD
January:
  • IBI Data Standards and Sharing Working Group, Tokyo, Japan
  • Max Planck School of Cognition, Berlin, Germany
Some of these were already bunched together so they wouldn’t have required separate flights, but even considering that, my back-of-the-envelope calculation shows that these flights would have resulted in almost 7 tons of CO2 being generated (as estimated using https://www.icao.int/environmental-protection/Carbonoffset/Pages/default.aspx).  Not to mention lots of physiological stress from jet lag, and travel costs to be borne by my hosts. So in many ways it’s been a huge win for everyone.

An important issue, however, is what the experience was like, both for my hosts and the attendees and for myself.  The visits varied from talks with a short Q&A session, to extended visits in which my talk was followed by individual meetings with researchers. For me the experience has been very positive — certainly not as good as being there in some ways, but still very satisfying.  The least satisfying experience for me as a speaker has been in situations where I give a talk without time for Q&A afterwards.  I think that my hosts have also largely found it to be a positive experience, at least from the feedback that I’ve received. In one case, I was the pilot test for hosting extended virtual visits, and afterwards they told me that the experience had convinced them to do it regularly.  

Going through these talks has taught me a few lessons about how to improve the experience, both for the speaker and for the audience.
  1. Always set up a time with the host to test things out in advance in the actual venue, preferably at least a few days before the talk.
  2. On the day of the talk, arrange to meet the host online at least 15 minutes before the scheduled talk time.  Even when everything is well oiled, problems can arise, and you don’t want to be debugging them in front of an audience.
  3. Give your host your cell phone number, and keep your phone handy so that they have an alternate way to contact you if necessary. 
  4. In general I think it’s good for a virtual talk to be a bit shorter than a regular talk, simply because it’s easier for people to fade off when you are not present to look them in the eye.  Erring on the side of going short rather than long is also a good general principle — as an audience member I have rarely been upset when a talk went shorter than expected, and it gives more time for questions, which are usually the most interesting part anyway.
  5. For longer talks (over an hour), give the audience a short intermission.  For example, for my talk to the Max Planck School of Cognition (a 90 min talk with 30 mins for questions), I asked the audience to stand up and stretch out about half way through, which they seemed to appreciate.
I also have several suggestions for hosts of virtual visits:
  1. *Please* use a standard commercial conferencing system (like Zoom or Webex) rather than a home-grown system - especially one that requires me to install special software! Having to install new software or log into a new system is just another potential point of failure for the talk. In general I have had the best experiences when using Zoom or Skype, but I’m sure there are other systems that are also good.
  2. As a speaker I particularly like being able to see a chat window on my screen as I’m talking, so that people can post questions during the talk.  This works well with systems like Zoom, but often doesn’t exist at all in home-grown systems. 
  3. Please provide a camera so that the speaker can see the audience.  Talking without seeing the audience is much less pleasant and also makes it impossible to tell if people are disengaged, or if there is an unexpected problem with the A/V system.  
  4. Make clear to the audience up front how questions will work.  I prefer having them submitted by chat window, but if they are going to be spoken, then there should be microphones dedicated to questions, and these should be tested beforehand to make sure that the speaker can hear them.
  5. For extended visits, it has worked well to have a single Zoom room for the entire day, which individuals come into or out of throughout the day for their scheduled meetings.  Please remember that people sitting in front of a computer have biological needs just like people who are physically present, so schedule regular bio-breaks during the day.
  6. For events that are more discussion based, it's important to have multiple microphones spread around the room so that the virtual attendees can hear what is being said.  If someone is going to be writing on a whiteboard, it's also important to have a camera on the board.
Please leave other thoughts or suggestions in the comments below!


Thursday, December 12, 2019

Computing models for a neuroimaging lab

I had a conversation with a colleague recently about how to set up computing for a new neuroimaging lab.  I thought that it might be useful for other new investigators to hear the various models that we discussed and my view of their pros and cons.  My guess is that many of the same issues are relevant for other types of labs outside of neuroimaging as well - let me know in the comments below if you have further thoughts or suggestions!

The simplest model: Personal computers on premise

The simplest model is for each researcher in the lab to have their own workstation (or laptop) on which all of their data live and all of their computing is performed.

Pros:

  • Easy to implement
  • Freedom: Each researcher can do whatever they want (within the bounds of the institution’s IT policies) as they have complete control over their machine. NOTE: I have heard of institutions that do not allow anyone on campus to have administrative rights over their own personal computer.  Before one ever agrees to take a position, I would suggest inquiring about the IT policies and making sure that they don’t prevent this; if they do, then ask the chair to add language to your offer letter that explicitly provides you with an exception to that policy.  Otherwise you will be completely at the mercy of the IT staff — and this kind of power breeds the worst behavior in those staff.  More generally, you should discuss IT issues with people at an institution before accepting any job offer, preferably with current students and postdocs, since they will be more likely to be honest about the challenges.

Cons:

  • Lack of scalability: Once a researcher needs to run more jobs than there are cores on the machine, they could end up waiting a very long time for those jobs to complete, and/or crash the machine due to insufficient resources.  These systems also generally have limited disk space.
  • Underuse: One can of course buy workstations with lots of cores/RAM/storage, which can help address the previous point to some degree.  However, then one is paying a lot of money for resources that will sit underutilized most of the time.
  • Admin issues: Individual researchers are responsible for managing their own systems. This means that each researcher in the lab will likely be using different versions of each software package, unless some kind of standardized container system is implemented.  This also means that each researcher needs to spend their precious time dealing with software installation issues, etc, unless there is a dedicated system admin, which costs $$$$.
  • Risk: The systems used for these kinds of operations are generally commodity-level systems, which are more likely to fail compared to enterprise-level systems (discussed below).  Unless the lab has a strict policy for backup or duplication (e.g. on Dropbox or Box) then it’s almost certain that at some point data will be lost.  There is also a non-zero risk of personal computers being stolen or lost.
Verdict: I don’t think this is generally a good model for any serious lab.  The only strong reason that I could see for having local workstations for data analysis is if one’s analysis requires a substantial amount of graphics-intensive manual interaction.

Virtual machines in the cloud

Under this model, researchers in the lab house their data on a commercial cloud service, and spin up virtual machines on that service as needed for data analysis purposes.

Pros:

  • Flexibility: This model allows the researcher to allocate just enough resources for the job at hand.  For the smallest jobs, one can sometimes get by with the free resources available from these providers (I will use Amazon Web Services [AWS] as an example here since it’s the one I’m most familiar with).  On AWS, one can obtain a free t2.micro instance (with 1 GB RAM and 1 virtual CPU); this will not be enough to do any real analysis, but could be sufficient for many other functions such as working with files.  At the other end, one can also allocate a c5.24xlarge instance with 96 virtual CPUs and 192 GiB of RAM for about $4/hour (see the rough cost sketch after this list).  This range of resources should encompass the needs of many labs.  Similarly, storage can be scaled in an effectively unlimited way.
  • Resource-efficiency: You only pay for what you use.
  • Energy-efficiency: Cloud services are thought to be much more energy-efficient compared to on-premise computers, due to their higher degree of utilization (i.e. they are not sitting idle most of the time) and the fact that they often obtain their power from renewable resources.  AWS estimates that cloud computing can reduce carbon emissions by up to 88% compared to on-premise computers.
  • Resilience: Occasionally the hardware on a cloud VM goes out.  When this happens, you simply spin up a new one --- no hardware replacement cost.
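
As a rough illustration of how those hourly prices translate into a yearly budget, here is a back-of-the-envelope sketch in R; the usage numbers are pure assumptions, and actual AWS pricing varies by region and changes over time:

    # rough back-of-the-envelope estimate; usage numbers are assumptions
    hourly_rate <- 4.0          # approximate c5.24xlarge on-demand price (USD/hour)
    hours_per_analysis <- 12    # assumed runtime for one large analysis
    analyses_per_year <- 40     # assumed number of such analyses per year
    # yearly compute cost (before storage and data-transfer charges): ~$1,920
    hourly_rate * hours_per_analysis * analyses_per_year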

Cons:

  • Administration and training: Since most scientists will not have experience spinning up and administering cloud systems, there will be some necessary training to make this work well; preferably, one would have access to a system administrator with cloud experience.  Researchers need to be taught, for example, to shut down expensive instances after using them, lest the costs begin to skyrocket.
  • Costs: Whereas the cost of a physical computer is one-time, cloud computing has ongoing costs.  If one is going to be a serious user of cloud computing, then they will need to deeply understand the cost structure of their cloud computing services.  For example, there are often substantial costs to upload and download data from the cloud, in addition to the costs of the resources themselves.  Cloud users should also implement billing alarms, particularly to catch any cases where credentials are compromised. In one instance in my lab, criminals obtained our credentials (which were accidentally checked into Github) and spent more than $20,000 within about a day; this was subsequently refunded by AWS, but it caused substantial anxiety and extra work.  
  • Scalability: There will be many cases in which an analysis cannot be feasibly run on a single cloud instance in reasonable time (e.g., running fMRIprep on a large dataset).  One can scale beyond single instances, but this requires a substantial amount of work, and is really only feasible if one has a serious cloud engineer involved. It is simply not a good use of a scientist’s time to figure out how to spin up and manage a larger cluster on a cloud service; I know this because I’ve done it, and those are many hours that I will never get back that could have been used to do something more productive (like play guitar, do yoga, or go for a nice long walk).  One could of course spin up many individual instances and manually run jobs across them, but this requires a lot of human effort, and there are better solutions available, as I outline below.

Verdict: For a relatively small lab with limited analysis needs and reasonably strong system administration skills or support, I think this is a good solution.   Be very careful with your credentials!

Server under a desk (SUAD)

Another approach for many labs is a single powerful on-premise server shared by multiple researchers in the lab, usually located in some out-of-the-way location so that no one (hopefully) spills coffee on it or walks away with it.  It will often have a commodity-grade disk array attached to it for storage.

Pros:
  • Flexibility: As with the on-premise PC model, the administrator has full control.
Cons:
  • Basically all the same cons as the on-premise PC model, with the added con that it's a single point of failure for the entire lab.
  • Same scaling issues as cloud VMs
  • Administration: I know that there are numerous labs where either faculty or graduate students are responsible for server administration.  This is a terrible idea!  Mostly because it's time they could better spend reading, writing, exercising, or simply having a fun conversation over coffee.
Verdict: Don't do it unless you or your grad students really enjoy spending your time diagnosing file system errors and tuning firewall rules.

Cluster in a closet (CIIC)

This is a common model for researchers who have outgrown the single-computer-per-researcher or SUAD model.  It’s the model that we followed when I was a faculty member at UCLA, and that I initially planned to follow when I moved from UCLA to UT Austin in 2009.  The CIIC model generally involves a rack-mounted system with some number of compute nodes and a disk array for storage.  Usually shoved in a closet that is really too small to accommodate it.

Pros:

  • Scalability: CIIC generally allows for much better scalability. With current systems, one can pack more than 1000 compute cores alongside substantial storage within a single full-height rack.  Another big difference that allows much greater scalability is the use of a scheduling (or queueing) system, which allows jobs to be submitted and then run as resources are available.  Thus, one can submit many more jobs than the cluster can handle at any one time, and the scheduler will deal with this gracefully. It also prevents problems that happen often under the SUAD model when multiple users log in and start jobs on the server and overrun its resources.
  • Flexibility: One can configure one’s cluster however they want, because they will have administrative control over the system.

Cons:

  • Administration: Administering a cluster well is a complex job that needs a professional system administrator, not a scientist moonlighting as a sysadmin; again, I know this because I lived it.  In particular, as a cluster gets bigger, the temptation for criminals to compromise it grows as well, and only a professional sysadmin is going to be able to keep up with cybercriminals who break into systems for a living.
  • Infrastructure: Even a reasonably sized cluster has substantial infrastructure requirements that are unlikely to be met by a random closet in the lab.  The first is power: A substantial cluster will likely need a dedicated power line to supply it.  The second is cooling: Computers generate lots of heat, to a degree that most regular rooms will not be able to handle.  On more than one occasion we had to shut down the cluster at UCLA because of overheating, and this can also impact the life of the computer’s components.  The third is fire suppression: If a fire starts in the closet, you don’t want regular sprinklers dumping a bunch of water on your precious cluster. It is for all of these reasons that many campuses are no longer allowing clusters in campus buildings, instead moving them to custom-built data centers that can address all of these needs.
  • Cost: The cost of purchasing and running a cluster can be high. Commercial-level hardware is expensive, and when things break you have to find money to replace them, because your team and colleagues will have come to rely on them.
  • Training: Once you move to a cluster with more than a single node, you will need to use a scheduler to submit and run jobs. This requires a change in mindset about how to do computing, and some researchers find it annoying at first.  It definitely requires letting go of a certain level of control, which is aversive for many people. 
  • Interactivity: It can be more challenging to do interactive work on a remote cluster than on a local workstation, particularly if it is highly graphics-intensive work.  One usually interacts with these systems using a remote window system (like VNC), and these often don’t perform very well.
Verdict: Unless you have the resources and a good sysadmin, I’d shy away from running your own cluster.  If you are going to do so, locate it in a campus data center rather than in a closet.

High-performance computing centers

When I moved from UCLA to UT Austin in 2009, I had initially planned to set up my own CIIC. However, once I arrived I realized that I had another alternative, which was to instead take advantage of the resources at the Texas Advanced Computing Center (TACC), the local high-performance computing (HPC) center (which also happens to be world-class).  My lab did all of its fMRI analyses using the TACC systems, and I have never looked back. Since moving to Stanford, we now also take advantage of the cluster at the Stanford Research Computing Facility, while continuing to use the TACC resources as well.

Pros:

  • Scalability: Depending on the resources available at one’s HPC center, one can often scale well beyond the resources of any individual lab.  For example, on the Frontera cluster at TACC (its newest, currently the 5th most powerful supercomputer on Earth), a user can request up to 512 nodes (28,672 cores) for up to 48 hrs.  That's a lot of Freesurfer runs. The use of scheduling systems also makes the management of large jobs much easier.  These centers also usually make large-scale storage available for a reasonable cost.  
  • Professional management: HPC centers employ professional system administrators whose expertise lies in making these systems work well and fixing them when they break.  And the best part is that you generally don’t have to pay their salary! (At least not directly).

Cons:

  • Training: The efficient usage of HPC resources requires that researchers learn a new model for computing, and a new set of tools required for job submission and management. For individuals with solid UNIX skills this is rarely a problem, but for researchers without those skills it can be a substantial lift.
  • Control: Individual users will not have administrative control (“root”) on HPC systems, which limits the kinds of changes one can make to the system. Conversely, the administrators may decide to make changes that impact one’s research (e.g. software upgrades).  
  • Sharing: Using HPC systems requires good citizenship, since the system is being shared by many users.  Most importantly: Users must *never* run jobs on the login node, as tempting as that might sometimes be.  
  • Waiting: Sometimes the queues will become filled up and one may have to wait a day for one's jobs to run (especially just before the annual Supercomputing conference).  
  • Access:  If one’s institution has an HPC center, then one may have access to those resources.  However, not all such centers are built alike.  I’ve been lucky to work with centers at Texas and Stanford that really want researchers to succeed.  However, I have heard horror stories at other institutions, particularly regarding HPC administrators who see users as an annoyance rather than as customers, or who have a very inflexible approach to system usage that doesn’t accommodate user needs.  For researchers without local HPC access, there may be national resources that one can gain access to, such as the XSEDE network in the US.

Verdict:  For a lab like mine with significant computing needs, I think that HPC is the only way to go, assuming that one has access to a good HPC center.  Once you live through the growing pains, it will free you up to do much larger things and stop worrying about your cluster overheating because an intruder is using it to mine Bitcoin.

These are of course just my opinions, and I'm sure others will disagree.  Please leave your thoughts in the comment section below!