CREATE TABLE hostname_table
(
id INT NOT NULL AUTO_INCREMENT,
hostname CHAR(65) NOT NULL,
interval_avg INT,
last_update DATETIME NOT NULL,
numb_updates INT,
PRIMARY KEY (id)
)
I have this table, and I import 500-600k rows of data into it. I do not check for duplicates when writing to the database, because I want to know how many duplicates of each host there is, and I also want to know the intervals between each update of said hostname.
Example values in hostname_table:
id  hostname          interval_avg  last_update          numb_updates
1   www.host.com      60            2012-04-25 20:22:21  1
2   www.hostname.com  10            2012-04-25 20:22:21  5
3   www.name.com      NULL          2012-04-25 20:22:21  NULL
4   www.host.com      NULL          2012-04-25 20:22:26  NULL
5   www.host.com      NULL          2012-04-25 20:22:36  NULL
Example of what I want it to look like when I have cleaned it up:
id  hostname          interval_avg  last_update          numb_updates
1   www.host.com      25            2012-04-25 20:22:36  3
2   www.hostname.com  10            2012-04-25 20:22:21  5
3   www.name.com      NULL          2012-04-25 20:22:21  NULL
With a huge database like this, I dont want to send too many queries to obtain this goal, but I believe 3 queries are the minimum for an operation like this(if I am wrong, please correct me). Each hour there will be ~500k new rows where ~50% or more will be duplicates, therefore its vital to get rid of those duplicates as efficiently as possible while still keeping a record of how many and how often the duplicates occured(hense the interval_avg and numb_update update).
This is a three step problem, and I was hoping the community here would give a helping hand.
So to summarize in pseudocode, I need help optimizing these queries;
- select all last_update and interval_avg values, get sum(numb_update), get count(duplicates) foreach hostname,
- update interval_avg in min(id), update numb_updates in min(id), update last_update in min(id) with the value from max(id),
- delete all duplicates except min(id)
SOLVED. I have optimized one part by 94%, and another part by ~97% over the course of a couple of days researching. I truely hope this will help other searching for the same solutions. mySQL and large databases can be a big problem if you choose the wrong solution. (I changed the last_update column from DATETIME to INT(10), and I changed from a formated time to a timestamp as value in my final solution to be able to get the max(last_update) and min(last_update) values)
(Thanks to GolezTrol for helping with parts of the problem)
 
     
    