Tuesday, November 27, 2018

Automated web site generation using Bookdown, CircleCI, and Github

For my new open statistics book (Statistical Thinking for the 21st Century), I used Bookdown, which is a great tool for writing a book using RMarkdown.  However, as the book came together, the time to build it grew to more than 10 minutes because of the many simulations and Bayesian model estimation.  Since each output type (currently three: Gitbook, PDF, and EPUB) requires a separate build run, rebuilding the full book distribution became quite an undertaking.  For this reason, I decided to implement an automated solution using the CircleCI continuous integration service.  We already use this service for many of the software development projects in our lab (such as fMRIPrep and MRIQC), so it was a natural choice for this project as well.

The use of CircleCI for this project is made particularly easy by the fact that both the book source and the web site for the book are hosted on Github.  Hooks between Github and CircleCI enable two important features.  First, they allow us to automatically trigger a rebuild of the site whenever there is a new push to the source repo.  Second, they allow CircleCI to push a new copy of the book files to the separate repo that the site is served from.

Here are the steps for setting this up.  If anything is unclear, see the Makefile and the CircleCI config.yml file in the repo.  And if you come across anything that I missed, please leave a comment below!

  1. Create a CircleCI account linked to the relevant GitHub account.
  2. Add the source repo to CircleCI.
  3. Create the CircleCI config.yml file.  Here is the content of my config file, with comments added to explain each step:

version: 2
jobs:
  build:
    docker:
# this is my custom Docker image
      - image: poldrack/statsthinking21

CircleCI spins up a VM specified by a Docker image, to which we can then add any additional software that is needed.  I initially started with an image with R and the tidyverse preinstalled (https://hub.docker.com/r/rocker/tidyverse/), but installing all of the R packages as well as the TeX distribution needed to compile the PDF took a very long time, quickly using up the 1,000 build minutes per month that come with the CircleCI free plan.  To save this time I built a custom Docker container (Dockerfile) that incorporates all of the dependencies needed to build the book; this way, CircleCI can simply pull the container from my DockerHub repo and run it straight away rather than having to build a bunch of R packages.

    steps:
      - add_ssh_keys:
          fingerprints:
            - "73:90:5e:75:b6:2c:3c:a3:46:51:4a:09:ac:d9:84:0f"

In order to be able to push to a github repo, CircleCI needs a way to authenticate itself.  A relatively easy way to do this is to generate an SSH key and install the public key portion as a “deploy key” on the Github repo, then install the private key as an SSH key on CircleCI.  I had problems with this until I realized that it requires a very specific type of SSH key (a PEM key using RSA encryption), which I generated on my Mac using the following command:

ssh-keygen -m PEM -t rsa -C "poldrack@gmail.com"
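
The fingerprint listed under add_ssh_keys has the colon-separated form that ssh-keygen prints for MD5 fingerprints, so it can be useful to check a key locally before pasting anything into the config.  The following is only a sketch: the function names, the output directory, and the empty passphrase are assumptions for illustration, not part of the original setup.

```shell
#!/bin/sh
# Sketch: create a PEM-format RSA key pair and print its MD5 fingerprint.
# make_deploy_key and key_fingerprint are hypothetical helper names.

make_deploy_key() {
    key_dir=$1
    comment=$2
    mkdir -p "$key_dir"
    # -m PEM and -t rsa match the key type described above.
    # -N "" sets an empty passphrase (an assumption, so the key can be used headlessly).
    # -f chooses the output file; -q suppresses the progress output.
    ssh-keygen -q -m PEM -t rsa -C "$comment" -N "" -f "$key_dir/deploy_key"
}

key_fingerprint() {
    # -l prints the fingerprint of the given key file, -E md5 selects the MD5 form.
    # The second field looks like "MD5:aa:bb:..."; strip the prefix.
    ssh-keygen -l -E md5 -f "$1" | awk '{print $2}' | sed 's/^MD5://'
}
```

For example, calling make_deploy_key with a fresh directory and an email address, then key_fingerprint on the resulting deploy_key.pub, prints a value that can be compared with the fingerprint CircleCI shows for the uploaded private key.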


# check out the repo to the VM - it also becomes the working directory
      - checkout
# I forgot to install ssh in the docker image, so install it here as we will need it for the github push below
      - run: apt-get update && apt-get install -y ssh
# now run all of the rendering commands
      - run:
           name: rendering pdf
           command: |
             make render-pdf
      - run:
           name: rendering epub
           command: |
             make render-epub
      - run:
           name: rendering gitbook
           command: |
             make render-gitbook

The Makefile in the source repo contains the commands to render the book in each of the three formats that we distribute: Gitbook, PDF, and EPUB.  Here we build each of those.
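
The actual targets are defined in the Makefile in the repo.  As a rough idea of what such a target might wrap, here is a sketch that maps a short format name to a bookdown output format and calls bookdown::render_book.  The helper name render_format, the DRY_RUN switch, and the index.Rmd input file are assumptions for illustration and may not match the real Makefile.

```shell
#!/bin/sh
# Sketch: render one output format with bookdown from the command line.
# render_format is a hypothetical helper; index.Rmd is an assumed input file.

render_format() {
    case "$1" in
        pdf)     fmt="bookdown::pdf_book" ;;
        epub)    fmt="bookdown::epub_book" ;;
        gitbook) fmt="bookdown::gitbook" ;;
        *)
            echo "unknown format: $1" >&2
            return 1
            ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it (useful without R installed).
        echo "Rscript -e 'bookdown::render_book(\"index.Rmd\", \"$fmt\")'"
    else
        Rscript -e "bookdown::render_book(\"index.Rmd\", \"$fmt\")"
    fi
}
```

Running render_format pdf, render_format epub, and render_format gitbook in turn would correspond to the three rendering steps in the config; setting DRY_RUN=1 only prints the commands.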

# push the rendered site files to its repo on github
      - run:
           name: check out site repo
           command: |
             cd /tmp
             ssh-keyscan github.com >> ~/.ssh/known_hosts

The ssh-keyscan command adds github.com's host key to known_hosts, which is needed for the ssh connection to github below to run headlessly.  Without it, the git clone command will sit and wait at the host authentication prompt for a keypress that will never come.

# clone the site repo into a separate directory
             git clone git@github.com:psych10/thinkstats.git
             cd thinkstats
# copy all of the site files into the site repo directory
             cp -r ~/project/_book/* .
             git add .
# necessary config to push
             git config --global user.email poldrack@gmail.com
             git config --global user.name "Russ Poldrack"
             git commit -m"automated update"
             git push origin master
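
One thing to watch for with a deploy step like this: git commit exits with a non-zero status when there is nothing to commit, so a build whose rendered output is identical to the last one could be reported as failed.  A possible guard, written as a sketch with a hypothetical helper named publish_site (not part of the original config), is to commit and push only when something is actually staged:

```shell
#!/bin/sh
# Sketch: commit and push the site checkout only when its contents changed.
# publish_site is a hypothetical helper; arguments are the checkout directory
# and the branch to push (default: master, as in the config above).

publish_site() (
    # The function body runs in a subshell, so cd does not affect the caller.
    site_dir=$1
    branch=${2:-master}
    cd "$site_dir" || exit 1
    git add .
    # git diff --cached --quiet exits 0 when the index matches the last commit.
    if git diff --cached --quiet; then
        echo "no changes to publish"
        exit 0
    fi
    git commit -q -m "automated update" || exit 1
    git push -q origin "$branch"
)
```

In the config above this would replace the final git add, git commit, and git push lines, called as publish_site /tmp/thinkstats after the files have been copied in and the user name and email have been configured.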

That’s it! CircleCI should now build and deploy the book any time there is a new push to the repo.  Don’t forget to add a CircleCI badge to the README to show off your work!   

Tuesday, November 20, 2018

Statistical Thinking for the 21st Century - a new intro statistics book

I have published an online draft of my new introductory statistics book, titled "Statistical Thinking for the 21st Century", at http://thinkstats.org.  This book was written for my undergraduate statistics course at Stanford, which I started teaching last year.  The first time around I used Andy Field's An Adventure in Statistics, which I really like but which most of my students disliked because the statistical content was buried within a lot of story.  In addition, there are a number of topics (overfitting, cross-validation, reproducibility) that I wanted to cover in the course but that weren't covered deeply in the book.  So I decided to write my own, basically transcribing my lecture slides into a set of RMarkdown notebooks and generating a book using Bookdown.

There are certainly many claims in the book that are debatable, and almost certainly things that I have gotten wrong as well, given that I Am Not A Statistician.  If you have the time and energy, I'd love to hear your thoughts/suggestions/corrections - either by emailing me, or by posting issues at the github repo. 

I am currently looking into options for publishing this book in low-cost paper form - if you would be interested in using such a book for a course you teach, please let me know.  Either way, the electronic version will remain freely available online.