1

I have been struggling to fix a database that contains market size information. Because it is internal data I can't share the original file. The original file is just the example but much much bigger.

Right now I do not even know what tool to use, or even what function I could use. Sorry if maybe SE is not the appropriate place to be asking this because it is more about codding, but any general guidance helps a bunch! Thank you so much!

Description of data: I have a worksheet containing the market size of certain products per month for multiple countries. Example(Row 2): In January 2018(Column B) the market size of Albania for apples was 100kg. The data contains information from 2018 to 2019, and multiple countries and products.

The objective: Have a complete file of market shares for all periods, products and dates.

The problem: Sometimes dates are skipped. In the example, the market size for Honduras between February(02) and May (04) of 2018 are missing. The entries do not exist in the original file, but that does not mean that the market size does not exist. Instead, it means that the market size for that period, for that particular product, for that particular country, was 0.

The solution I am trying for: Create market size entries equal to 0 for these missing periods.

What I have tried: My first attempt was to fix this in Tableau by trying to input 0s in the missing fields, but then I realized that Tableau is not ideal for this job, and also I realized that the problem is not that the entries are NA, but instead that there are no entries. Tableau example

So I moved on to Excel, where I have tried to first identify the places in which there is missing information. Basic missing identifier Now I know where the missing data is, but I have no clue how to create the entries based on whether column E is equal to "missing" or not.

Finally I realized I should try to use R because there are over 400,000 entries and every calculation I make on Excel takes ages unless the software crashes first. However, I am not sure even how to begin to solve this with R.

Thanks in advance for any help! If anyone can tell me first what tool to use and maybe some functions that can create entries for missing values in a sequence, I would already have something to start with.

If you think my approach is completely off then let me know that as well.

Example of data:
Country Date Product MarketSize
Honduras 201801 apples 100
Honduras 201802 apples 500
Honduras 201806 apples 900
Honduras 201808 apples 1000
Honduras 201809 apples 300
Thailand 201905 pencils 100
Thailand 201906 pencils 0

Example of desired result:
Country Date Product MarketSize
Honduras 201801 apples 100
Honduras 201802 apples 500
Honduras 201803 apples 0
Honduras 201804 apples 0
Honduras 201805 apples 0
Honduras 201806 apples 900
Honduras 201808 apples 1000
Honduras 201809 apples 300
Honduras 201810 apples 0
Honduras 201811 apples 0
Honduras 201812 apples 0
Honduras 201901 apples 0
Honduras 201902 apples 0
Honduras 201903 apples 0
Honduras 201904 apples 0
Honduras 201905 apples 0
Honduras 201906 apples 0
Honduras 201907 apples 0
Honduras 201908 apples 0
Honduras 201909 apples 0
Honduras 201910 apples 0
Thailand 201801 pencils 0
Thailand 201802 pencils 0
Thailand 201803 pencils 0
Thailand 201804 pencils 0
Thailand 201805 pencils 0
Thailand 201806 pencils 0
Thailand 201807 pencils 0
Thailand 201808 pencils 0
Thailand 201809 pencils 0
Thailand 201810 pencils 0
Thailand 201811 pencils 0
Thailand 201812 pencils 0
Thailand 201901 pencils 0
Thailand 201902 pencils 0
Thailand 201903 pencils 0
Thailand 201904 pencils 0
Thailand 201905 pencils 100
Thailand 201906 pencils 0
Thailand 201907 pencils 0
Thailand 201908 pencils 0
Thailand 201909 pencils 0
Thailand 201910 pencils 0

1 Answers1

0

You are not looking to sum up related items, simply to fill in the blanks, so to speak?

Excel is fine for this, except for the slowness you report. I cannot address R as I do not use it, but I would think something very similar to the Excel fix would be easily done.

Excel fix: in an empty column, like the one just to the right of the one with the missing entries (call that column E because the data takes up the first four columns), put a formula like the following into the top cell, then fill down with it (use Fill, copy and paste, whatever you like best) to the end of the column:

=IF( D1=0, 0, D1 )

Excel, you see, already considers the blank cells to be 0 so blank cells will result in the formula putting a 0 in that cell in column E. If not blank, whatever it has already will be the result.

Then copy that column of cells and Paste|Special|Values in column D, the source data column. Now you have an actual value in EVERY cell in that column.

Then delete the spare column ("helper column") and you're all set.

R should certainly have some similar functionality, so you would just be adapting the typing of the formula to fit R's nomenclature.

Jeorje
  • 16