I couldn't find what I was looking for anywhere else, so I hope I'm not asking something that is already solved. Sorry if I am.
I want to loop through each column individually for multiple dataframes and apply a function to check the data quality.
I want to find:
- number of missing values
- percentage of missing values
- number of empty rows
- percentage of empty rows
- number of distinct values
- percent of distinct values
- number of duplicates
- percentage of duplicates
- one example of a value in a row that is not empty "" and not missing
- (and any other information you suggest could tell me something about the data quality)
I then want to save the information in a dataframe that I can easily download, looking something like this:
table_name | column_name | # missing values | % missing values | # empty rows | etc...
Can this be done?
I have named my dataframes "a", "b" and "c" (there are actually 80, but I use three here to keep things simple) and stored them in a list called "table_list". The dataframes vary in their number of variables/columns.
I have made this function:
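library(dplyr)

analyze <- function(i) {
  # [[ ]] extracts the dataframe itself; [ ] would return a one-element list
  data <- table_list[[i]]
  # Note: all counts below are over the whole table, not per column
  # Find number of missing values
  number_missing_values <- sum(is.na(data))
  # Find percentage of missing values
  percentage_missing_values <- 100 * number_missing_values / nrow(data)
  # Find number of empty rows
  number_empty_rows <- sum(data == "", na.rm = TRUE)
  # Find percentage of empty rows
  percentage_empty_rows <- 100 * number_empty_rows / nrow(data)
  # Find number of distinct values
  number_distinct_values <- nrow(distinct(data))
  # Find percent of distinct values
  percentage_distinct_values <- 100 * number_distinct_values / nrow(data)
  # Return the results so they are not discarded when the function exits
  list(number_missing_values = number_missing_values,
       percentage_missing_values = percentage_missing_values,
       number_empty_rows = number_empty_rows,
       percentage_empty_rows = percentage_empty_rows,
       number_distinct_values = number_distinct_values,
       percentage_distinct_values = percentage_distinct_values)
}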
This function lacks (not sure how to do it):
- number of duplicates
- percentage of duplicates
- one example of a value in a row that is not empty "" and not missing
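For the three missing pieces, here is a minimal sketch in base R that works on a single column (a vector) rather than a whole dataframe. The helper name `column_extras` is hypothetical, and it assumes "duplicates" means every repeat of a value after its first occurrence, which is what `duplicated()` flags (repeated `NA`s are counted too).

```r
# Hypothetical helper: duplicate counts and one example value for a single column.
column_extras <- function(x) {
  n <- length(x)
  # duplicated() flags every repeat after the first occurrence
  n_dup <- sum(duplicated(x))
  # Keep only values that are neither NA nor the empty string
  filled <- x[!is.na(x) & as.character(x) != ""]
  list(
    number_duplicates = n_dup,
    percentage_duplicates = if (n > 0) 100 * n_dup / n else NA_real_,
    example_value = if (length(filled) > 0) as.character(filled[1]) else NA_character_
  )
}

# Toy column: one repeated value, one empty string, one NA
column_extras(c("x", "", NA, "x", "y"))
```

The example value is converted to character so that columns of different types can later share one result column.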
I was planning to apply this function in this for-loop:
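results <- list()
for (i in seq_along(table_list)) {
  # Loop over positions in the list, since analyze() expects an index
  results[[i]] <- analyze(i)
}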
I'm also not sure how to turn the result into a dataframe like the one I illustrated above, with the different column names.
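One possible way to get a table of that shape is sketched below, assuming `table_list` is a named list so that the names can serve as `table_name`. The helpers `analyze_column` and `analyze_table`, the toy tables, and the output file name are all made up for illustration. Each column produces a one-row dataframe, and the rows are stacked with `rbind`.

```r
# Toy stand-ins for the real tables; the list names become table_name.
table_list <- list(
  a = data.frame(id = c(1, 2, 2, NA), name = c("x", "", "x", NA),
                 stringsAsFactors = FALSE),
  b = data.frame(code = c("p", "q", "q"), stringsAsFactors = FALSE)
)

# Hypothetical helper: one summary row for a single column of a single table.
analyze_column <- function(x, table_name, column_name) {
  n <- length(x)
  n_missing <- sum(is.na(x))
  n_empty <- sum(as.character(x) == "", na.rm = TRUE)
  # Distinct values among the non-missing entries ("" counts as a value)
  n_distinct <- length(unique(x[!is.na(x)]))
  data.frame(
    table_name = table_name,
    column_name = column_name,
    number_missing_values = n_missing,
    percentage_missing_values = 100 * n_missing / n,
    number_empty_rows = n_empty,
    percentage_empty_rows = 100 * n_empty / n,
    number_distinct_values = n_distinct,
    percentage_distinct_values = 100 * n_distinct / n,
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: summarise every column of one table.
analyze_table <- function(table_name) {
  data <- table_list[[table_name]]  # [[ ]] extracts the dataframe itself
  rows <- lapply(names(data), function(col) {
    analyze_column(data[[col]], table_name, col)
  })
  do.call(rbind, rows)
}

# Stack the per-table summaries into one dataframe
quality_report <- do.call(rbind, lapply(names(table_list), analyze_table))
rownames(quality_report) <- NULL
print(quality_report)

# Save for download; the file name is only an example
# write.csv(quality_report, "quality_report.csv", row.names = FALSE)
```

Tables with zero rows would give `NaN` percentages in this sketch, so they may need separate handling.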
What am I getting wrong here, and what should I do differently?
