There is a free public data set available that has over 300 fields. I would like to create an ETL process to update and store the data locally in MySQL or SQL Server. Because a single record is too wide to fit in one database table (it exceeds the engine's row-size/column limits), I will probably have to de-normalize, or perhaps normalize a different way. The following is a representation of my data set:
+------------+---------+---------+-----+-----------+---------+
|  Column1   | Column2 | Column3 | ….  | Column319 | ColumnN |
+------------+---------+---------+-----+-----------+---------+
| HUEBERT    | ALLISON | L       | DR. |           | M.D.    |
| YOUNGMAYKA | CYNTHIA |         |     |           | P.A.    |
| DIFILIPPO  | EMIL    | A       | DR. |           | M.D.    |
| THACKER    | RICHARD | RANDALL | DR. |           | D.O.    |
|            |         |         |     |           |         |
| ROTHSTEIN  | MARK    | TERRY   | DR. |           | M.D.    |
| GIBBS      | ELMER   | RICKEY  | DR. |           | M.D.    |
| MARKMAN    | ALAN    | WILLIAM | DR. |           | M.D.    |
| TROTCHIE   | DEBBIE  | C       |     |           | APN     |
| DYSART     | STANLEY | H       |     |           | M.D.    |
|            |         |         |     |           |         |
| GRUNERT    | GEORGE  | M       | DR. |           | M.D.    |
| GOLDBERG   | STEVEN  | M       | DR. |           | M.D.    |
| DUBOSE     | JON     |         | DR. |           | MD      |
+------------+---------+---------+-----+-----------+---------+
I would like to de-normalize in this fashion, splitting each row across two tables (essentially a vertical partition).
TableLeft:
+------------+---------+---------+
|  Column1   | Column2 | Column3 |
+------------+---------+---------+
| HUEBERT    | ALLISON | L       |
| YOUNGMAYKA | CYNTHIA |         |
| DIFILIPPO  | EMIL    | A       |
| THACKER    | RICHARD | RANDALL |
|            |         |         |
| ROTHSTEIN  | MARK    | TERRY   |
| GIBBS      | ELMER   | RICKEY  |
| MARKMAN    | ALAN    | WILLIAM |
| TROTCHIE   | DEBBIE  | C       |
| DYSART     | STANLEY | H       |
|            |         |         |
| GRUNERT    | GEORGE  | M       |
| GOLDBERG   | STEVEN  | M       |
| DUBOSE     | JON     |         |
+------------+---------+---------+
TableRight:
+-----+-----------+---------+
| ….  | Column319 | ColumnN |
+-----+-----------+---------+
| DR. |           | M.D.    |
|     |           | P.A.    |
| DR. |           | M.D.    |
| DR. |           | D.O.    |
|     |           |         |
| DR. |           | M.D.    |
| DR. |           | M.D.    |
| DR. |           | M.D.    |
|     |           | APN     |
|     |           | M.D.    |
|     |           |         |
| DR. |           | M.D.    |
| DR. |           | M.D.    |
| DR. |           | MD      |
+-----+-----------+---------+
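Roughly, the DDL I have in mind looks like this (MySQL syntax as a sketch only; record_id is a surrogate key I would add, or take from the file's own identifier, to tie the halves together, and the column types/sizes are placeholders):

-- Sketch only: both halves share record_id, so a complete record is a
-- simple 1:1 join.  A FOREIGN KEY from TableRight.record_id to
-- TableLeft.record_id could be added to enforce that link.
CREATE TABLE TableLeft (
    record_id  BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    Column1    VARCHAR(100),   -- e.g. last name
    Column2    VARCHAR(100),   -- e.g. first name
    Column3    VARCHAR(100)    -- e.g. middle name
    -- ... roughly the first half of the 300+ fields
);

CREATE TABLE TableRight (
    record_id  BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    -- ... the remaining fields, e.g. the prefix column holding 'DR.'
    Column319  VARCHAR(100),
    ColumnN    VARCHAR(20)     -- credential, e.g. 'M.D.'
);

-- Reassembling a full record then costs one join per extra table:
CREATE VIEW FullRecord AS
SELECT l.*, r.Column319, r.ColumnN
  FROM TableLeft  l
  JOIN TableRight r ON r.record_id = l.record_id;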
The entire data set will probably be around 10 GB, or approximately 5 million rows, and in practice the split will likely need more than two tables, so it might take 4-6 joins to get an entire row of data back.
What are the standard practices for working with data that has to be partitioned in this way?
You can view the Excel file with the first 1000 records here.
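For the load/refresh step, this is roughly what I am picturing on the MySQL side; the file path, the CSV layout, and the assumption that the record ID is the first field are all placeholders:

-- Sketch only: read the same source file twice, keeping a different
-- column subset each pass.  Assumes a CSV with a header row and the
-- record ID in the first field; requires local_infile to be enabled.
LOAD DATA LOCAL INFILE '/data/provider_extract.csv'
    REPLACE INTO TABLE TableLeft
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    IGNORE 1 LINES
    -- in reality all 300+ source fields are listed here in file order,
    -- with @dummy for the ones TableLeft does not keep
    (record_id, Column1, Column2, Column3);

LOAD DATA LOCAL INFILE '/data/provider_extract.csv'
    REPLACE INTO TABLE TableRight
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    IGNORE 1 LINES
    -- again all 300+ fields in order; only the right-half names are kept
    (record_id, @dummy, @dummy, @dummy, Column319, ColumnN);

REPLACE (rather than the default duplicate-key handling) is what I was thinking of using so that re-running the load refreshes rows that already exist, which is how I would cover the "update" part of the ETL.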