I have recently migrated to R from Stata. I am unsure how to perform compute descriptive statistics on grouped and ungrouped observations.
Here's my data:
dput(DF)
structure(list(Product_Name = c("iPhone", "iPhone", "iPhone",
"iPhone", "iPhone", "iPhone", "Nexus 6P", "Nexus 6P", "Nexus 6P",
"Nexus 6P", "Nexus 6P", "Nexus 6P"), Product_Type = c("New",
"New", "Refurbished", "New", "New", "Refurbished", "Refurbished",
"Refurbished", "Refurbished", "Refurbished", "Refurbished", "Refurbished"
), Year = c(2006, 2011, 2009, 2008, 2011, 2009, 2012, 2007, 2013,
2015, 2009, 2010), Units = c(100, 200, 300, 400, 500, 600, 700,
200, 120, 125, 345, 340)), .Names = c("Product_Name", "Product_Type",
"Year", "Units"), row.names = c(NA, 12L), class = "data.frame")
My data has products sold by year and type. Each product could a refurbished product or a new product. Further, if it was sold before 2010, I would mark it as sold in "Time 1" otherwise I would mark it as sold in "Time 2".
Here's my code for this:
DF[DF$Year<2010,"Time"]<-"1"
DF[DF$Year>=2010,"Time"]<-"2"
Now, I want to generate descriptive statistics for these time periods.
DF %>%
group_by(Product_Name, Product_Type,Time) %>%
dplyr::summarise(Count = n(),
Sum_Units = sum(Units,na.rm=TRUE),
Avg_Units = mean(Units,na.rm = TRUE),
Max_Units=max(Units,na.rm = TRUE))
If we run above code, we would get descriptive statistics by Product_Name, Product_Type, and Time (i.e. grouped descriptive statistics). However, this is not what I want. I want descriptive statistics with and without considering groupings with Product_Type and Time. Meaning, I would want to compute descriptive statistics assuming that products were sold in Time 1 OR Time 2 (i.e. all years) and irrespective of the type of product sold, while retaining some of the grouped information above.
Expected output:
dput(DFOut)
structure(list(Product_Name = c("iPhone", "Nexus 6P"), New_Units_Sum_Time1 = c(500,
NA), Refurbished_Units_Sum_Time_1 = c(900, 545), Sum_Units_Time1 = c(1400,
545), Sum_Units_Time2 = c(700, 1285), Sum_Units_Time_1_And_2 = c(2100,
1830), Avg_Units_Time1 = c(350, 272.5), Avg_Units_Time2 = c(350,
321.25), Avg_Units_Time_1_And_2 = c(350, 305), Max_Units_Time1 = c(600,
345), Max_Units_Time2 = c(500, 700), Max_Units_Time_1_And_2 = c(600,
700)), .Names = c("Product_Name", "New_Units_Sum_Time1", "Refurbished_Units_Sum_Time_1",
"Sum_Units_Time1", "Sum_Units_Time2", "Sum_Units_Time_1_And_2",
"Avg_Units_Time1", "Avg_Units_Time2", "Avg_Units_Time_1_And_2",
"Max_Units_Time1", "Max_Units_Time2", "Max_Units_Time_1_And_2"
), row.names = 1:2, class = "data.frame")
In the output, you would see that I have some descriptive statistics:
a) based on the type of product and the period it was sold (e.g. New_Units_Sum_Time1 i.e. New and Time1). Please note that in the output, I have only shown the combination of New and Time1. If you can guide me how to produce descriptive statistics for other combinations of Refurbished and Time, it would be awesome.
b) based on ignoring type of product but not ignoring the period sold (e.g. Sum_Units_Time1 i.e. Time1)
c) based on ignoring both type of product and period it was sold (e.g. Sum_Units_Time_1_And_2).
Ditto for Avg and Mean.
How can I do this? I'd appreciate any help. I am really struggling on this.
Please note that I manually created DFOut using Excel. Although I triple checked it, there could be some manual errors--I will be more than happy to clarify them if there are questions. Thank you for your time.
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] grDevices datasets stats graphics grid tcltk utils methods base
other attached packages:
[1] tables_0.8 Hmisc_4.0-2 Formula_1.2-1 survival_2.40-1
[5] ResourceSelection_0.3-0 magrittr_1.5 stringr_1.1.0 bit64_0.9-5
[9] bit_1.1-12 tufterhandout_1.2.1 knitr_1.15.1 rmarkdown_1.3
[13] tufte_0.2 corrplot_0.77 purrr_0.2.2 readr_1.0.0
[17] tibble_1.2 tidyverse_1.1.1 cowplot_0.7.0 plotly_4.5.6
[21] ggplot2_2.2.1 maps_3.1.1 directlabels_2015.12.16 tidyr_0.6.1
[25] ggthemes_3.3.0 R2HTML_2.3.2 lubridate_1.6.0 xts_0.9-7
[29] zoo_1.7-14 lattice_0.20-34 corrgram_1.10 hexbin_1.27.1
[33] sm_2.2-5.4 compare_0.2-6 installr_0.18.0 psych_1.6.12
[37] reshape2_1.4.2 readstata13_0.8.5 pastecs_1.3-18 boot_1.3-18
[41] vcd_1.4-3 car_2.1-4 xlsxjars_0.6.1 rJava_0.9-8
[45] debug_1.3.1 dplyr_0.5.0 foreign_0.8-67 gmodels_2.16.2
[49] openxlsx_4.0.0 plyr_1.8.4
loaded via a namespace (and not attached):
[1] minqa_1.2.4 colorspace_1.3-2 class_7.3-14 modeltools_0.2-21 mclust_5.2.2
[6] rprojroot_1.2 htmlTable_1.9 base64enc_0.1-3 MatrixModels_0.4-1 flexmix_2.3-13
[11] mvtnorm_1.0-5 xml2_1.1.1 codetools_0.2-15 splines_3.3.2 mnormt_1.5-5
[16] robustbase_0.92-7 jsonlite_1.2 nloptr_1.0.4 pbkrtest_0.4-6 broom_0.4.1
[21] cluster_2.0.5 kernlab_0.9-25 httr_1.2.1 backports_1.0.5 assertthat_0.1
[26] Matrix_1.2-7.1 lazyeval_0.2.0 acepack_1.4.1 htmltools_0.3.5 quantreg_5.29
[31] tools_3.3.2 gtable_0.2.0 Rcpp_0.12.9 trimcluster_0.1-2 gdata_2.17.0
[36] nlme_3.1-128 iterators_1.0.8 fpc_2.1-10 lmtest_0.9-34 lme4_1.1-12
[41] rvest_0.3.2 gtools_3.5.0 dendextend_1.4.0 DEoptimR_1.0-8 MASS_7.3-45
[46] scales_0.4.1 TSP_1.1-4 hms_0.3 parallel_3.3.2 SparseM_1.74
[51] RColorBrewer_1.1-2 gridExtra_2.2.1 rpart_4.1-10 latticeExtra_0.6-28 stringi_1.1.2
[56] gclus_1.3.1 mvbutils_2.7.4.1 foreach_1.4.3 checkmate_1.8.2 seriation_1.2-1
[61] caTools_1.17.1 prabclus_2.2-6 bitops_1.0-6 evaluate_0.10 htmlwidgets_0.8
[66] R6_2.2.0 gplots_3.0.1 DBI_0.5-1 haven_1.0.0 whisker_0.3-2
[71] mgcv_1.8-16 nnet_7.3-12 modelr_0.1.0 KernSmooth_2.23-15 viridis_0.3.4
[76] readxl_0.1.1 data.table_1.10.0 forcats_0.2.0 digest_0.6.12 diptest_0.75-7
[81] stats4_3.3.2 munsell_0.4.3 registry_0.3 viridisLite_0.1.3 quadprog_1.5-5