Data Gathering and Cleaning

Set-up

We ensure that all package dependencies are installed.

if (!require(tidyverse)) {
  install.packages("tidyverse")
}
if (!require(googledrive)) {
  install.packages("googledrive")
}

## Loading required package: googledrive

suppressPackageStartupMessages(library("tidyverse"))

TODO: Convert code from magrittr pipe (%>%) to R pipe (|>).

Download data

The survey was generated and data collected using Google Forms. The survey questions are here: https://forms.gle/oT2ekzCsw7KVU8YU8.

We have separated the data update process from the generation of this report since some manual cleaning of the department names must be done first. So, the typical workflow is to run the following at the console:

source('../R/functions.R')
update_data(force_update = TRUE, google_credentials = Sys.getenv("GMAIL_ROG"))
survey <- open_survey()

clean_names(survey) |>
  show_unique_depts()

The output of show_unique_depts() can then be compared to the code below, and updates made as needed to handle various edge cases.

This workflow could be improved.

One idea would be to use the targets package to update the data at specified intervals and then trigger the data cleaning operations.

Load data

Load the data file.

if (file.exists("csv/open-science-survey-2022-fall.csv")) {
  survey <- readr::read_csv("csv/open-science-survey-2022-fall.csv", show_col_types = FALSE)
} else {
  message("File not found: ", "csv/open-science-survey-2022-fall.csv")
  survey <- NULL
}

There are \(n = 104\) responses.

Clean data

Examine the variable names.

if (is.null(survey)) {
  warning("Error loading data file")
} else {
  names(survey)
}

##  [1] "Timestamp"                                                                                                                                                                                                
##  [2] "What Penn State campus do you represent?"                                                                                                                                                                 
##  [3] "What is your primary department/unit?"                                                                                                                                                                    
##  [4] "What is your position at Penn State?"                                                                                                                                                                     
##  [5] "How many years have passed since you completed that degree?"                                                                                                                                              
##  [6] "What are the primary types of digital data that are used in your research? (choose all that apply)"                                                                                                       
##  [7] "Do you collect data that have legal or ethical restrictions governing who may access it or how it may be used?"                                                                                           
##  [8] "Where do you store data for active projects where data collection and analysis is still ongoing?"                                                                                                         
##  [9] "How important to you is sharing data from active projects with research collaborators at Penn State or outside of Penn State?"                                                                            
## [10] "How convenient is it for you to share data from active projects with research collaborators at Penn State or outside of Penn State?"                                                                      
## [11] "What are the main barriers to sharing data from active projects with research collaborators?"                                                                                                             
## [12] "How important to you is sharing data from completed projects with the broader research community (i.e., not direct collaborators)?"                                                                       
## [13] "Which of the following obstacles make sharing data with the research community harder for you? Mark all that apply."                                                                                      
## [14] "Do research funders in your field require data sharing?"                                                                                                                                                  
## [15] "Do journals in your field require data sharing?"                                                                                                                                                          
## [16] "If you have shared data with the research community, where have you shared it?"                                                                                                                           
## [17] "How well-equipped do you feel you, your colleagues, and trainees are to meet data management and sharing requirements of sponsors/funders or journals?"                                                   
## [18] "How often do you create computer scripts or data analysis code in the conduct of your research?"                                                                                                          
## [19] "How often do you share computer scripts or data analysis code with direct research collaborators ?"                                                                                                       
## [20] "Do you create other kinds of software in the conduct of your research?"                                                                                                                                   
## [21] "How often do you use open source code sharing tools (e.g., GitHub, GitLab, BitBucket)?"                                                                                                                   
## [22] "Do funders in your field require code sharing?"                                                                                                                                                           
## [23] "Do journals in your field require code sharing?"                                                                                                                                                          
## [24] "How often do you openly share other materials related to your research (protocols, reagents, samples, apparatus, designs, etc.) with other researchers?"                                                  
## [25] "What is your experience with/knowledge of open science practices?"                                                                                                                                        
## [26] "Describe your awareness of the FAIR (findable, accessible, interoperable, reusable) principles pertaining to research data."                                                                              
## [27] "Do you apply FAIR principles in your own data management and sharing practices?"                                                                                                                          
## [28] "Have you heard of the \"reproducibility crisis\" in science?"                                                                                                                                             
## [29] "Is there a reproducibility crisis in your area of research?"                                                                                                                                              
## [30] "How much benefit would you derive from a center at Penn State focused on supporting the adoption of best practices in data management and sharing, code sharing, open science, and reproducible research?"
## [31] "Select the services that would most benefit your research if offered by such a center."                                                                                                                   
## [32] "Any final comments about data management, data sharing, and open science?"                                                                                                                                
## [33] "(Optional) Provide us with your contact information if you would like us to follow up."                                                                                                                   
## [34] "What is the highest post-secondary degree you have earned?"                                                                                                                                               
## [35] "How often do you share computer scripts or data analysis code openly?"

Let’s rename them.

full_questions <- names(survey)

short_names <- c(
  "timestamp",
  "campus",
  "department",
  "position",
  "years_since_degree",
  "data_types",
  "restricted_data",
  "storage_active_projects",
  "importance_sharing_collab",
  "convenience_sharing_collab",
  "barriers_sharing_collab",
  "importance_share_community",
  "barriers_share_community",
  "funders_require_data_sharing",
  "journals_require_data_sharing",
  "where_shared_community",
  "equipped_data_mgmt_sharing",
  "create_analysis_code",
  "share_analysis_code_collab",
  "create_other_code",
  "use_code_sharing_tools",
  "funders_require_code_sharing",
  "journals_require_code_sharing",
  "share_materials_community",
  "knowledge_open_science",
  "awareness_FAIR",
  "apply_FAIR",
  "heardof_reproducibility_crisis",
  "my_area_reproducibility_crisis",
  "benefit_psu_center",
  "service_psu_center",
  "comments",
  "contact_info",
  "highest_degree_earned",
  "share_analysis_code_community"
)

if (length(short_names) == length(names(survey))) {
  names(survey) <- short_names
} else {
  message("Name vector lengths differ; no change made.")
}
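
The original question wording is kept in full_questions, so it can be paired with the short names to form a simple data dictionary. The sketch below uses only two of the questions listed above to keep it short; the object name question_key is an assumption made for illustration and does not appear elsewhere in this report.

```r
# Two of the original question strings and their short names.
full_questions <- c(
  "What Penn State campus do you represent?",
  "What is your position at Penn State?"
)
short_names <- c("campus", "position")

# Build the lookup only when the two vectors line up.
stopifnot(length(full_questions) == length(short_names))
question_key <- data.frame(
  short_name = short_names,
  question = full_questions,
  stringsAsFactors = FALSE
)

# Look up the wording behind a short name.
question_key$question[question_key$short_name == "position"]
```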

Some of the variables have values that are easy to parse. Others are more challenging: data_types, for example, comes from a “choose all that apply” question, so a single response can list several options.

Modify `timestamp`

Convert timestamp to a standard date-time format, interpreting the times in the America/New_York time zone.

survey <- survey |>
  dplyr::mutate(timestamp = lubridate::mdy_hms(timestamp, tz = "America/New_York"))
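
As a quick check of the parsing step, the sketch below applies lubridate::mdy_hms() to a made-up timestamp string in month/day/year order. The value is illustrative and is not taken from the survey.

```r
library(lubridate)

# Hypothetical raw value in month/day/year hour:minute:second order.
raw_timestamp <- "10/21/2022 14:03:22"

parsed <- mdy_hms(raw_timestamp, tz = "America/New_York")

# The result is a date-time (POSIXct) tagged with the requested time zone.
class(parsed)
tz(parsed)
```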

Modify `campus`

Make all campus locations lowercase and replace spaces with underscores.

survey <- survey |>
  dplyr::mutate(campus = tolower(campus)) |>
  # str_replace_all() replaces every space, not just the first one
  dplyr::mutate(campus = stringr::str_replace_all(campus, " ", "_"))
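
Campus names can contain more than one space, so it matters whether one or all of them are replaced. A small sketch with made-up values (not drawn from the survey responses) shows the difference between stringr::str_replace(), which changes only the first match, and stringr::str_replace_all():

```r
library(stringr)

# Hypothetical campus labels, for illustration only.
campus <- tolower(c("University Park", "Penn State Harrisburg"))

first_only  <- str_replace(campus, " ", "_")     # first space only
every_space <- str_replace_all(campus, " ", "_") # all spaces

first_only
every_space
```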

Modify `department`

survey <- clean_depts(survey)

Modify `position`

Make lowercase and replace spaces and dashes with underscores.

survey <- survey |>
  dplyr::mutate(position = tolower(position)) |>
  dplyr::mutate(position = stringr::str_replace_all(position, "[ -]", "_"))

Modify `years_since_degree`

TODO: Make ordinal
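
One possible way to address this TODO is to convert the column to an ordered factor. The sketch below uses placeholder category labels, since the actual response options are not shown in this report; they would need to be replaced with the values found in survey$years_since_degree.

```r
# Placeholder labels: substitute the actual response options from the survey.
year_levels <- c("0-5 years", "6-10 years", "11-20 years", "More than 20 years")

# Toy responses standing in for survey$years_since_degree.
years_since_degree <- c("6-10 years", "0-5 years", "More than 20 years")

years_ordered <- factor(years_since_degree, levels = year_levels, ordered = TRUE)

# Ordered factors support comparisons that follow the level order.
years_ordered < "11-20 years"
```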

Modify `data_types`

survey <- survey |>
  dplyr::mutate(collect_audio = stringr::str_detect(data_types, "Audio files")) |>
  dplyr::mutate(collect_video = stringr::str_detect(data_types, "Video files")) |>
  dplyr::mutate(collect_photos = stringr::str_detect(data_types, "Digital photographs and/or other images")) |>
  dplyr::mutate(
    collect_computer_data = stringr::str_detect(
      data_types,
      "Data automatically generated from or by computer programs"
    )
  ) |>
  dplyr::mutate(collect_sensor = stringr::str_detect(data_types, "Data collected from sensors")) |>
  dplyr::mutate(collect_docs = stringr::str_detect(data_types, "Documents or reports")) |>
  dplyr::mutate(collect_models = stringr::str_detect(data_types, "Models/algorithms")) |>
  dplyr::mutate(collect_obs = stringr::str_detect(data_types, "Observational data")) |>
  dplyr::mutate(collect_sims = stringr::str_detect(data_types, "Simulation data, models, and software code")) |>
  dplyr::mutate(
    collect_procedures = stringr::str_detect(data_types, "Standard operating procedures and protocols")
  ) |>
  dplyr::mutate(collect_txt = stringr::str_detect(data_types, "Text files")) |>
  dplyr::mutate(collect_genomic = stringr::str_detect(data_types, "Genomic")) |>
  dplyr::mutate(collect_image = stringr::str_detect(data_types, "Image data")) |>
  dplyr::mutate(collect_surveys = stringr::str_detect(data_types, "Survey results")) |>
  dplyr::mutate(collect_spreadsheets = stringr::str_detect(data_types, "Spreadsheets")) |>
  dplyr::mutate(collect_interviews = stringr::str_detect(data_types, "interview transcripts")) |>
  dplyr::mutate(collect_gis = stringr::str_detect(data_types, "Geographic Information Systems")) |>
  dplyr::mutate(collect_sketches = stringr::str_detect(data_types, "Sketches, diaries in digital form")) |>
  dplyr::mutate(collect_vr = stringr::str_detect(data_types, "Virtual reality, 3D models")) |>
  dplyr::mutate(collect_xml_json = stringr::str_detect(data_types, "Structured text files")) |>
  dplyr::mutate(collect_web_social = stringr::str_detect(data_types, "Websites and blogs"))
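
Because each indicator column above follows the same pattern, the same result can be produced from a lookup of column names and search strings. The following is a minimal sketch on a toy data frame with only three of the options, not a replacement for the code above. The toy responses and the use of stringr::fixed() (to match the option text literally rather than as a regular expression) are assumptions made for illustration.

```r
library(stringr)

# Toy data standing in for the survey; the responses are invented.
toy <- data.frame(
  data_types = c("Audio files, Video files", "Text files", NA),
  stringsAsFactors = FALSE
)

# New column name -> literal text to look for in data_types.
patterns <- c(
  collect_audio = "Audio files",
  collect_video = "Video files",
  collect_txt   = "Text files"
)

# Add one logical column per pattern; missing responses stay NA.
for (col in names(patterns)) {
  toy[[col]] <- str_detect(toy$data_types, fixed(patterns[[col]]))
}

toy
```

Matching on each option's text, rather than splitting the response into pieces, also sidesteps the fact that some options contain commas themselves.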

Modify `restricted_data`

survey <- survey |>
  dplyr::mutate(
    restricted_ethical = stringr::str_detect(restricted_data, "ethical concerns"),
    restricted_legal_ip = stringr::str_detect(restricted_data, "legal/intellectual"),
    restricted_sponsor = stringr::str_detect(restricted_data, "contractual restrictions"),
    restricted_none = stringr::str_detect(restricted_data, "No; My data are not restricted")
  )

Modify `storage_active_projects`

TODO: Handle other options
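
One way to approach this TODO is to strip the known option strings (the same ones matched in the code for this section) from each response and inspect whatever text is left. This is only a sketch: the responses are invented, and it assumes that multiple selections are separated by commas.

```r
library(stringr)

# Option strings matched in this section.
known_options <- c(
  "External USB or flash drive", "Personal/lab computer",
  "Departmental/college server", "ICDS/ROAR allocation",
  "Microsoft OneDrive/SharePoint", "Google Drive", "Dropbox", "Box"
)

# Invented responses standing in for survey$storage_active_projects.
storage <- c(
  "Google Drive, Dropbox",
  "Box, Lab notebook scanner",
  "Personal/lab computer"
)

# Remove every known option, treating each one as literal text.
leftover <- storage
for (opt in known_options) {
  leftover <- str_remove_all(leftover, fixed(opt))
}

# Drop separators left at either end, then keep the non-empty remainders.
leftover <- str_trim(str_remove_all(leftover, "^[, ]+|[, ]+$"))
other_text <- unique(leftover[leftover != ""])
other_text
```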

survey <- survey |>
  dplyr::mutate(store_usb = stringr::str_detect(storage_active_projects,
                                                "External USB or flash drive")) |>
  dplyr::mutate(store_pc_lab = stringr::str_detect(storage_active_projects,
                                                   "Personal/lab computer")) |>
  dplyr::mutate(
    store_dept_coll_server = stringr::str_detect(storage_active_projects,
                                                 "Departmental/college server")
  ) |>
  dplyr::mutate(store_icds = stringr::str_detect(storage_active_projects,
                                                 "ICDS/ROAR allocation")) |>
  dplyr::mutate(
    store_onedrive = stringr::str_detect(storage_active_projects,
                                         "Microsoft OneDrive/SharePoint")
  ) |>
  dplyr::mutate(store_googledrive = stringr::str_detect(storage_active_projects,
                                                        "Google Drive")) |>
  dplyr::mutate(store_dropbox = stringr::str_detect(storage_active_projects,
                                                    "Dropbox")) |>
  dplyr::mutate(store_box = stringr::str_detect(storage_active_projects,
                                                "Box"))

Modify `barriers_sharing_collab`

TODO: Clean these.

Modify `barriers_share_community`

TODO: Handle “other” cases.

survey <- survey |>
  dplyr::mutate(
    barriers_sharing_security = stringr::str_detect(
      barriers_share_community,
      "Ensuring security/restricting access"
    )
  ) |>
  dplyr::mutate(
    barriers_sharing_curation = stringr::str_detect(
      barriers_share_community,
      "Taking time to curate, organize, document data"
    )
  ) |>
  dplyr::mutate(
    barriers_sharing_alter_before_share = stringr::str_detect(
      barriers_share_community,
      "Altering data to make it suitable to share"
    )
  ) |>
  dplyr::mutate(
    barriers_sharing_resources = stringr::str_detect(barriers_share_community,
                                                     "Insufficient resources for sharing")
  ) |>
  dplyr::mutate(
    barriers_sharing_staff = stringr::str_detect(
      barriers_share_community,
      "Lack of available or knowledgeable staff"
    )
  )

Modify `where_shared_community`

survey <- survey |>
  dplyr::mutate(share_inst_repo = stringr::str_detect(where_shared_community,
                                                      "Institutional repository")) |>
  dplyr::mutate(
    share_journal_suppl = stringr::str_detect(
      where_shared_community,
      "Supplemental material linked to journal article"
    )
  ) |>
  dplyr::mutate(share_lab_web = stringr::str_detect(where_shared_community,
                                                    "Lab/project website")) |>
  dplyr::mutate(share_ext_repo = stringr::str_detect(where_shared_community,
                                                     "External data repository")) |>
  dplyr::mutate(
    share_govt_repo = stringr::str_detect(where_shared_community,
                                          "Government data repository")
  ) |>
  dplyr::mutate(share_consortia = stringr::str_detect(where_shared_community,
                                                      "Research consortia"))

Modify `knowledge_open_science`

survey <- survey |>
  dplyr::mutate(
    knowledge_open_science = dplyr::recode(
      knowledge_open_science,
      `No experience/knowledge` = "None",
      `Limited experience/knowledge` = "Limited",
      `Some experience/knowledge` = "Some",
      `Considerable experience/knowledge` = "Considerable",
      `Extensive experience/knowledge` = "Extensive"
    )
  )
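
After recoding, the values are still plain character strings, so tables and plots would sort them alphabetically. If an ordinal variable is wanted, one option (a sketch, not part of the original pipeline) is to convert the column to an ordered factor using the five labels defined above. The toy values below stand in for the survey column.

```r
# Labels produced by the recode step, from least to most experience.
knowledge_levels <- c("None", "Limited", "Some", "Considerable", "Extensive")

# Toy values standing in for survey$knowledge_open_science.
knowledge <- c("Some", "None", "Extensive", "Limited")

knowledge_ordered <- factor(knowledge, levels = knowledge_levels, ordered = TRUE)

# Sorting now follows the scale rather than the alphabet.
sort(knowledge_ordered)
```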

Modify `awareness_FAIR`

survey <- survey |>
  dplyr::mutate(
    awareness_FAIR = dplyr::recode(
      awareness_FAIR,
      `No awareness` = "None",
      `Limited awareness` = "Limited",
      `Some awareness` = "Some",
      `Considerable awareness` = "Considerable",
      `Extensive awareness` = "Extensive"
    )
  )

Modify `benefit_psu_center`

survey <- survey |>
  dplyr::mutate(
    benefit_psu_center = dplyr::recode(
      benefit_psu_center,
      `No benefit` = "None",
      `Minimal benefit` = "Minimal",
      `Some benefit` = "Some",
      `Considerable benefit` = "Considerable",
      `Extensive benefit` = "Extensive"
    )
  )
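
With a character column, dplyr::recode() leaves any value it does not recognize unchanged, so unexpected responses pass through silently. A short check like the following can surface them. It is a sketch on toy data, and the "Unsure" response is invented to show what an unexpected value looks like.

```r
library(dplyr)

# Toy responses; the last one is deliberately not one of the expected options.
benefit <- c("No benefit", "Some benefit", "Unsure")

recoded <- recode(
  benefit,
  `No benefit` = "None",
  `Minimal benefit` = "Minimal",
  `Some benefit` = "Some",
  `Considerable benefit` = "Considerable",
  `Extensive benefit` = "Extensive"
)

expected <- c("None", "Minimal", "Some", "Considerable", "Extensive")

# Values that were not recoded pass through unchanged and show up here.
setdiff(recoded, expected)
```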

Modify `service_psu_center`

survey <- survey |>
  dplyr::mutate(help_data_review_qa = stringr::str_detect(service_psu_center,
                                                          "Data review and quality")) |>
  dplyr::mutate(help_data_mgmt_plan = stringr::str_detect(service_psu_center,
                                                          "Data management plan")) |>
  dplyr::mutate(help_data_doc = stringr::str_detect(service_psu_center,
                                                    "Data documentation")) |>
  dplyr::mutate(
    help_data_analysis_verif = stringr::str_detect(service_psu_center,
                                                   "Third party verification")
  ) |>
  dplyr::mutate(
    help_student_staff_train = stringr::str_detect(service_psu_center,
                                                   "Training and technical assistance")
  ) |>
  dplyr::mutate(
    help_data_deidentif = stringr::str_detect(service_psu_center,
                                              "De-identification or anonymization")
  ) |>
  dplyr::mutate(
    help_funder_compliance = stringr::str_detect(service_psu_center,
                                                 "Ensuring compliance with funding")
  ) |>
  dplyr::mutate(
    help_where_to_share = stringr::str_detect(service_psu_center,
                                              "recommendation of suitable")
  )

Re-export cleaned data

readr::write_csv(survey, "csv/open-science-survey-2022-fall-clean.csv")

Select and export contact data for follow-up

contact_info <- survey |>
  dplyr::select(contact_info) |>
  dplyr::filter(!is.na(contact_info))

readr::write_csv(contact_info,
                 "csv/open-science-survey-2022-fall-contact-info.csv")
