Bootcamp registration data: Visualizing

About

This page documents code used to visualize the Bootcamp 2026 registration data.

Setup

library(gt)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyr)

Import

We have saved an anonymized version of the data in data_public.

bootcamp_26 <- readr::read_csv("data_public/bootcamp-2026-registrations-public.csv", show_col_types = FALSE)

dim(bootcamp_26)

[1] 94  8

str(bootcamp_26)

spc_tbl_ [94 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ timestamp  : POSIXct[1:94], format: "2026-03-05 10:40:52" "2026-03-05 10:42:42" ...
 $ attend_days: chr [1:94] "Mon May 11, Tue May 12" "Mon May 11, Tue May 12" "Mon May 11, Tue May 12" "Mon May 11, Tue May 12" ...
 $ dept       : chr [1:94] "Psychology" "Psychology" "Psychology" "Psychology" ...
 $ position   : chr [1:94] "Graduate student" "Graduate student" "Graduate student" "Graduate student" ...
 $ dropped_out: chr [1:94] NA NA NA NA ...
 $ college    : chr [1:94] "CLA" "CLA" "CLA" "CLA" ...
 $ .default   : chr [1:94] "Unknown" "Unknown" "Unknown" "Unknown" ...
 $ .missing   : chr [1:94] "Unknown" "Unknown" "Unknown" "Unknown" ...
 - attr(*, "spec")=
  .. cols(
  ..   timestamp = col_datetime(format = ""),
  ..   attend_days = col_character(),
  ..   dept = col_character(),
  ..   position = col_character(),
  ..   dropped_out = col_character(),
  ..   college = col_character(),
  ..   .default = col_character(),
  ..   .missing = col_character()
  .. )
 - attr(*, "problems")=<pointer: 0xae0fad360>

Tabular summaries

Positions

bootcamp_26 |>
  dplyr::group_by(position) |>
  dplyr::summarise(n_registrants = n())

# A tibble: 7 × 2
  position                    n_registrants
  <chr>                               <int>
1 Graduate student                       47
2 Instructor/Teaching Faculty             3
3 Postdoc/Research Faculty               10
4 Staff                                  10
5 Tenure-stream Faculty                  12
6 Undergraduate student                  10
7 <NA>                                    2

Visualization can lead to more data cleaning

We note that we do not have a position assigned to two individuals.

We will need to go back to the cleaning steps to diagnose and fix that problem manually.

What I might do in this case is look the person up in the Penn State directory.

Then, if I found their position, I would add the correction as a step in my cleaning protocol, leaving the original raw data untouched.
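One way to record such a manual fix as a cleaning step is sketched below. It is only an illustration under stated assumptions: the toy data, the `reg_id` key, and the `position_fixes` lookup table are all hypothetical, since the public file has no registrant identifier. The point is that the corrections live in their own table and are joined in, so the raw data are never edited by hand.

```r
library(dplyr)

# Toy stand-in for the cleaned registrations.
# `reg_id` is a hypothetical key; the public data has no such column.
registrations <- tibble::tibble(
  reg_id   = 1:4,
  position = c("Staff", NA, "Graduate student", NA)
)

# Hypothetical manual corrections, e.g. from a directory lookup.
position_fixes <- tibble::tibble(
  reg_id       = c(2L, 4L),
  position_fix = c("Postdoc/Research Faculty", "Staff")
)

# Fill in only the missing positions; existing values are left alone.
registrations_fixed <- registrations |>
  dplyr::left_join(position_fixes, by = "reg_id") |>
  dplyr::mutate(position = dplyr::coalesce(position, position_fix)) |>
  dplyr::select(-position_fix)

registrations_fixed
```

Because `dplyr::coalesce()` takes the first non-missing value, a correction never overwrites a position the registrant actually reported.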

Colleges

bootcamp_26 |>
  dplyr::group_by(college) |>
  dplyr::summarise(n_registrants = n()) |>
  kableExtra::kbl()

college	n_registrants
AgSci	4
CLA	22
Comm	1
ECoS	7
EMS	2
Education	4
Engineering	16
HHD	22
ICDS	1
IST	2
Libraries	2
Medicine	1
OVPR	1
NA	9

Missing colleges

We see that n=9 people have missing values for the college variable. Let’s see if we can learn more about these.

bootcamp_26 |>
  dplyr::filter(is.na(college)) |>
  dplyr::select(position, dept, college)

# A tibble: 9 × 3
  position                 dept                               college
  <chr>                    <chr>                              <chr>  
1 Staff                    NARC                               <NA>   
2 Tenure-stream Faculty    Psychology, CLA                    <NA>   
3 Graduate student         <NA>                               <NA>   
4 Postdoc/Research Faculty Meteorology & Atmospheric Sciences <NA>   
5 Undergraduate student    <NA>                               <NA>   
6 <NA>                     <NA>                               <NA>   
7 Postdoc/Research Faculty Meteorology & Atmospheric Sciences <NA>   
8 Graduate student         Energy Technology and Management   <NA>   
9 Graduate student         Nutrition Sciences                 <NA>

Back to the drawing board

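We see one unfamiliar department (‘NARC’), one non-standard entry (“Psychology, CLA”) we can standardize, one department that appears twice and should be easy to assign to a college (“Meteorology & Atmospheric Sciences”), two departments with no college assigned (“Energy Technology and Management” and “Nutrition Sciences”), and three registrants with no department at all, whom we would need names for to investigate further.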
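A minimal sketch of what the standardization step might look like follows. It uses a small toy tibble, not the real data. The Psychology-to-CLA mapping matches the rows shown earlier on this page; the mapping of Meteorology & Atmospheric Sciences to EMS is an assumption that should be checked before it goes into the cleaning protocol.

```r
library(dplyr)

# Toy rows that mimic the patterns seen in the missing-college table.
toy <- tibble::tibble(
  dept    = c("Psychology, CLA", "Meteorology & Atmospheric Sciences", "Psychology"),
  college = c(NA, NA, "CLA")
)

fixed <- toy |>
  dplyr::mutate(
    # Standardize the non-standard department label first.
    dept = dplyr::if_else(dept == "Psychology, CLA", "Psychology", dept),
    # Then fill in a college only where it is missing.
    college = dplyr::case_when(
      !is.na(college) ~ college,
      dept == "Psychology" ~ "CLA",
      dept == "Meteorology & Atmospheric Sciences" ~ "EMS",  # assumption: verify
      TRUE ~ NA_character_
    )
  )

fixed
```

Departments that match none of the rules keep a missing college, so they still show up in the next round of checks.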

Departments

bootcamp_26 |>
  dplyr::group_by(dept) |>
  dplyr::summarise(n_registrants = n())

# A tibble: 47 × 2
   dept                                  n_registrants
   <chr>                                         <int>
 1 Agricultural & Biological Engineering             1
 2 BBH                                               4
 3 Biology                                           2
 4 Biomedical Engineering                            1
 5 CTSI                                              1
 6 Chemical Engineering                              4
 7 Chemical/Biomedical Engineering                   1
 8 Civil Engineering                                 1
 9 College of Education                              1
10 Communication Arts & Sciences                     1
# ℹ 37 more rows

Position by college

Since some registrants are missing a college or a position, the cross-tabulation cannot place every registrant in a cell, but let’s make one anyway.

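Using the xtabs() function from the {stats} package, which ships with R. Note that the output omits registrants with a missing college or position.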

xtabs(formula = ~ college + position, data = bootcamp_26)

             position
college       Graduate student Instructor/Teaching Faculty
  AgSci                      1                           0
  CLA                       19                           1
  Comm                       1                           0
  ECoS                       2                           2
  Education                  3                           0
  EMS                        0                           0
  Engineering                4                           0
  HHD                       13                           0
  ICDS                       0                           0
  IST                        1                           0
  Libraries                  0                           0
  Medicine                   0                           0
  OVPR                       0                           0
             position
college       Postdoc/Research Faculty Staff Tenure-stream Faculty
  AgSci                              2     1                     0
  CLA                                0     1                     1
  Comm                               0     0                     0
  ECoS                               0     2                     0
  Education                          0     0                     0
  EMS                                1     0                     1
  Engineering                        2     0                     4
  HHD                                1     2                     4
  ICDS                               0     1                     0
  IST                                1     0                     0
  Libraries                          0     1                     1
  Medicine                           1     0                     0
  OVPR                               0     1                     0
             position
college       Undergraduate student
  AgSci                           0
  CLA                             0
  Comm                            0
  ECoS                            1
  Education                       1
  EMS                             0
  Engineering                     6
  HHD                             1
  ICDS                            0
  IST                             0
  Libraries                       0
  Medicine                        0
  OVPR                            0
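The table above has no NA row or column because xtabs() drops incomplete rows by default. The sketch below uses a small made-up data frame, not the registration data, to show how the `addNA` argument of xtabs() keeps those rows as their own level.

```r
# Toy data: one row is missing a college, another is missing a position.
toy <- data.frame(
  college  = c("CLA", "CLA", NA, "HHD"),
  position = c("Staff", NA, "Staff", "Staff")
)

# Default behaviour: incomplete rows are dropped before counting.
tab_default <- xtabs(~ college + position, data = toy)

# With addNA = TRUE, missing values are counted under an NA level.
tab_with_na <- xtabs(~ college + position, data = toy, addNA = TRUE)

tab_default
tab_with_na
```

Comparing `sum(tab_default)` with `nrow(toy)` is a quick way to see how many rows a cross-tabulation has silently left out.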

Or using {tidyverse} functions.

bootcamp_26 |>
  dplyr::count(college, position) |>
  tidyr::pivot_wider(names_from = position, values_from = n, values_fill = 0) |>
  gt()

college	Graduate student	Postdoc/Research Faculty	Staff	Instructor/Teaching Faculty	Tenure-stream Faculty	Undergraduate student	NA
AgSci	1	2	1	0	0	0	0
CLA	19	0	1	1	1	0	0
Comm	1	0	0	0	0	0	0
ECoS	2	0	2	2	0	1	0
EMS	0	1	0	0	1	0	0
Education	3	0	0	0	0	1	0
Engineering	4	2	0	0	4	6	0
HHD	13	1	2	0	4	1	1
ICDS	0	0	1	0	0	0	0
IST	1	1	0	0	0	0	0
Libraries	0	0	1	0	1	0	0
Medicine	0	1	0	0	0	0	0
OVPR	0	0	1	0	0	0	0
NA	3	2	1	0	1	1	1