library(gt)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyr)

This page documents code used to visualize the Bootcamp 2026 registration data.
We have saved an anonymized version of the data in data_public.
bootcamp_26 <- readr::read_csv("data_public/bootcamp-2026-registrations-public.csv", show_col_types = FALSE)

dim(bootcamp_26)
[1] 94 8

str(bootcamp_26)
spc_tbl_ [94 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ timestamp : POSIXct[1:94], format: "2026-03-05 10:40:52" "2026-03-05 10:42:42" ...
$ attend_days: chr [1:94] "Mon May 11, Tue May 12" "Mon May 11, Tue May 12" "Mon May 11, Tue May 12" "Mon May 11, Tue May 12" ...
$ dept : chr [1:94] "Psychology" "Psychology" "Psychology" "Psychology" ...
$ position : chr [1:94] "Graduate student" "Graduate student" "Graduate student" "Graduate student" ...
$ dropped_out: chr [1:94] NA NA NA NA ...
$ college : chr [1:94] "CLA" "CLA" "CLA" "CLA" ...
$ .default : chr [1:94] "Unknown" "Unknown" "Unknown" "Unknown" ...
$ .missing : chr [1:94] "Unknown" "Unknown" "Unknown" "Unknown" ...
- attr(*, "spec")=
.. cols(
.. timestamp = col_datetime(format = ""),
.. attend_days = col_character(),
.. dept = col_character(),
.. position = col_character(),
.. dropped_out = col_character(),
.. college = col_character(),
.. .default = col_character(),
.. .missing = col_character()
.. )
- attr(*, "problems")=<pointer: 0xae0fad360>
bootcamp_26 |>
  dplyr::group_by(position) |>
  dplyr::summarise(n_registrants = n())

# A tibble: 7 × 2
position n_registrants
<chr> <int>
1 Graduate student 47
2 Instructor/Teaching Faculty 3
3 Postdoc/Research Faculty 10
4 Staff 10
5 Tenure-stream Faculty 12
6 Undergraduate student 10
7 <NA> 2
We note that we do not have a position assigned to two individuals.
We will need to go back to the cleaning steps to diagnose and fix that problem manually.
What I might do in this case is look the person up in the Penn State directory.
Then, if I found their department, I would add that as a step in my cleaning protocol, leaving the original raw data untouched.
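One way to record such a manual fix as a cleaning step is sketched below. It is only an illustration: the public data has no identifier column, so the `registrant_id` key, the toy rows, and the corrected values are all hypothetical. The point is that the corrections live in their own small table and are applied to a copy, so the raw data stays untouched.

```r
library(dplyr)

# Toy stand-in for the raw registrations; `registrant_id` is a hypothetical key
raw <- tibble::tibble(
  registrant_id = c(1L, 2L, 3L),
  position = c("Staff", NA, NA),
  dept = c("Psychology", NA, "Biology")
)

# Manual corrections found by directory lookup (values invented for illustration)
position_fixes <- tibble::tibble(
  registrant_id = c(2L, 3L),
  position = c("Graduate student", "Postdoc/Research Faculty")
)

# Apply the fixes to a copy; `raw` itself is never modified
cleaned <- dplyr::rows_update(raw, position_fixes, by = "registrant_id")

cleaned
```

Keeping the fixes in a separate table also documents exactly which records were changed by hand.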
bootcamp_26 |>
  dplyr::group_by(college) |>
  dplyr::summarise(n_registrants = n()) |>
  kableExtra::kbl()

| college | n_registrants |
|---|---|
| AgSci | 4 |
| CLA | 22 |
| Comm | 1 |
| ECoS | 7 |
| EMS | 2 |
| Education | 4 |
| Engineering | 16 |
| HHD | 22 |
| ICDS | 1 |
| IST | 2 |
| Libraries | 2 |
| Medicine | 1 |
| OVPR | 1 |
| NA | 9 |
We see that n=9 people have missing values for the college variable. Let’s see if we can learn more about these.
bootcamp_26 |>
  dplyr::filter(is.na(college)) |>
  dplyr::select(position, dept, college)

# A tibble: 9 × 3
position dept college
<chr> <chr> <chr>
1 Staff NARC <NA>
2 Tenure-stream Faculty Psychology, CLA <NA>
3 Graduate student <NA> <NA>
4 Postdoc/Research Faculty Meteorology & Atmospheric Sciences <NA>
5 Undergraduate student <NA> <NA>
6 <NA> <NA> <NA>
7 Postdoc/Research Faculty Meteorology & Atmospheric Sciences <NA>
8 Graduate student Energy Technology and Management <NA>
9 Graduate student Nutrition Sciences <NA>
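We see one odd department (“NARC”), one non-standard entry (“Psychology, CLA”) we can standardize, one department that should be easy to standardize (“Meteorology & Atmospheric Sciences”, which appears twice), two departments with no college assigned (“Energy Technology and Management” and “Nutrition Sciences”), and three registrants with no department at all, whose names we’ll need in order to investigate further.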
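A standardization step for the first two cases might look like the sketch below. It uses a toy tibble rather than bootcamp_26, and the mapping of “Meteorology & Atmospheric Sciences” to EMS is an assumption that should be confirmed before it goes into the cleaning protocol. “NARC” and the missing departments are deliberately left alone.

```r
library(dplyr)

# Toy rows mirroring some of the department strings shown above
missing_college <- tibble::tibble(
  dept = c("Psychology, CLA", "Meteorology & Atmospheric Sciences", "NARC", NA),
  college = NA_character_
)

standardized <- missing_college |>
  dplyr::mutate(
    # Fill in the college first, while the original dept string is still available
    college = dplyr::case_when(
      dept == "Psychology, CLA" ~ "CLA",
      # Assumption: this department belongs to EMS; confirm before using
      dept == "Meteorology & Atmospheric Sciences" ~ "EMS",
      TRUE ~ college
    ),
    # Then strip the college suffix from the department name
    dept = dplyr::case_when(
      dept == "Psychology, CLA" ~ "Psychology",
      TRUE ~ dept
    )
  )

standardized
```

Rows that match none of the rules fall through to `TRUE ~ ...` and keep their original values, including `NA`.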
bootcamp_26 |>
  dplyr::group_by(dept) |>
  dplyr::summarise(n_registrants = n())

# A tibble: 47 × 2
dept n_registrants
<chr> <int>
1 Agricultural & Biological Engineering 1
2 BBH 4
3 Biology 2
4 Biomedical Engineering 1
5 CTSI 1
6 Chemical Engineering 4
7 Chemical/Biomedical Engineering 1
8 Civil Engineering 1
9 College of Education 1
10 Communication Arts & Sciences 1
# ℹ 37 more rows
Since there are missing values for college, the cross-tabulation will be incomplete, but let’s make one anyway.
Using the xtabs() function from the {stats} package.
xtabs(formula = ~ college + position, data = bootcamp_26)

| college | Graduate student | Instructor/Teaching Faculty | Postdoc/Research Faculty | Staff | Tenure-stream Faculty | Undergraduate student |
|---|---|---|---|---|---|---|
| AgSci | 1 | 0 | 2 | 1 | 0 | 0 |
| CLA | 19 | 1 | 0 | 1 | 1 | 0 |
| Comm | 1 | 0 | 0 | 0 | 0 | 0 |
| ECoS | 2 | 2 | 0 | 2 | 0 | 1 |
| Education | 3 | 0 | 0 | 0 | 0 | 1 |
| EMS | 0 | 0 | 1 | 0 | 1 | 0 |
| Engineering | 4 | 0 | 2 | 0 | 4 | 6 |
| HHD | 13 | 0 | 1 | 2 | 4 | 1 |
| ICDS | 0 | 0 | 0 | 1 | 0 | 0 |
| IST | 1 | 0 | 1 | 0 | 0 | 0 |
| Libraries | 0 | 0 | 0 | 1 | 1 | 0 |
| Medicine | 0 | 0 | 1 | 0 | 0 | 0 |
| OVPR | 0 | 0 | 0 | 1 | 0 | 0 |
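By default, xtabs() drops any row with a missing value in either variable, which is why the counts above add up to 84 rather than 94 registrants. A minimal sketch of keeping those rows with the `addNA` argument, using invented toy data in place of bootcamp_26:

```r
# Toy data standing in for bootcamp_26 (values invented for illustration)
toy <- data.frame(
  college = c("CLA", "CLA", "HHD", NA, NA),
  position = c("Staff", "Graduate student", NA, "Staff", NA)
)

# Default behaviour: rows with an NA in either variable are dropped
tab_default <- xtabs(~ college + position, data = toy)

# addNA = TRUE gives missing values their own <NA> row and column
tab_with_na <- xtabs(~ college + position, data = toy, addNA = TRUE)

sum(tab_default)
sum(tab_with_na)
```

Comparing the two totals against the number of rows is a quick check on how many registrants a cross-tabulation has silently left out.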
Or using {tidyverse} functions.
bootcamp_26 |>
  dplyr::count(college, position) |>
  tidyr::pivot_wider(names_from = position, values_from = n, values_fill = 0) |>
  gt()

| college | Graduate student | Postdoc/Research Faculty | Staff | Instructor/Teaching Faculty | Tenure-stream Faculty | Undergraduate student | NA |
|---|---|---|---|---|---|---|---|
| AgSci | 1 | 2 | 1 | 0 | 0 | 0 | 0 |
| CLA | 19 | 0 | 1 | 1 | 1 | 0 | 0 |
| Comm | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| ECoS | 2 | 0 | 2 | 2 | 0 | 1 | 0 |
| EMS | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| Education | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| Engineering | 4 | 2 | 0 | 0 | 4 | 6 | 0 |
| HHD | 13 | 1 | 2 | 0 | 4 | 1 | 1 |
| ICDS | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| IST | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Libraries | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| Medicine | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| OVPR | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| NA | 3 | 2 | 1 | 0 | 1 | 1 | 1 |
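The `NA` row and column in this table come from the registrants with a missing college or position. Until those records are fixed in the cleaning steps, one option is to give the missing values an explicit label before counting. The sketch below uses toy data, and the label "Unknown" is just a choice made for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for bootcamp_26 (values invented for illustration)
toy <- tibble::tibble(
  college = c("CLA", "CLA", "HHD", NA),
  position = c("Staff", "Graduate student", NA, "Staff")
)

crosstab <- toy |>
  # Give missing values an explicit label before counting
  tidyr::replace_na(list(college = "Unknown", position = "Unknown")) |>
  dplyr::count(college, position) |>
  tidyr::pivot_wider(names_from = position, values_from = n, values_fill = 0)

crosstab
```

The result can be piped into gt() in the same way as the table above.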