I am trying to import a recent Wikipedia dump into a MySQL database. The problem is, I am inserting a 50 GiB text table using INSERT INTO text MySQL commands and I want to convert these into a text file.
My text.sql file has the following structure:
INSERT INTO text (old_id,old_text,old_flags) VALUES (id1,'text1','flags1'),(id2,'text2','flags2'),...,(idN,'textN','flagsN');
However, using mysql -u USERNAME -p DBNAME < text.sql is very slow. I am already disabling autocommit,unique_checks and foreign_key_checks, and I am enclosing all transactions within a START TRANSACTION; ... COMMIT; block, but the import process is still very slow.
After researching, I read here that using LOAD DATA INFILE; can be much faster than using INSERT commands. Therefore, I am looking to convert text.sql to text.txt as follows:
id1,'text1','flags1'
id2,'text2','flags2'
...
idN,'textN','flagsN'
I was thinking of using awk for this, but my experience with regular expressions is very limited. Furthermore, each INSERT command is given in a single line, as shown above, making it for me even more difficult to extract the values.
Given that the text.sql file is 50 GiB, would you recommend using awk or to develop a C/C++ program? If awk is a good approach, how could I achieve the conversion?
Input #1 example:
INSERT INTO text (old_id,old_text,old_flags) VALUES (id1,'text1','flags1'),(id2,'text2','flags2'),(id3,'text3','flags3');
Output #1 example:
id1,'text1','flags1'
id2,'text2','flags2'
id3,'text3','flags3'
Input #2 example: (with parenthesis in the values)
INSERT INTO page (page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len,page_content_model) VALUES (10,0,'AccessibleComputing','',1,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,631144794,69,'wikitext'),(12,0,'Anarchism','',0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,703037144,180446,'wikitext');
Output #2 example:
10,0,'AccessibleComputing','',1,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,631144794,69,'wikitext'
12,0,'Anarchism','',0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,703037144,180446,'wikitext'
Input #3 example: (with escaped ' or ")
INSERT INTO text (old_id,old_text,old_flags) VALUES (631144794,'#REDIRECT [[Computer accessibility]]\n\n{{Redr|move|from CamelCase|up}}','utf-8'),(703037144,'{{Redirect2|Anarchist|Anarchists
|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}\n{{pp-move-indef}}\n{{Use British English|date=January 2014}}','utf-8');
Output #3 example:
631144794,'#REDIRECT [[Computer accessibility]]\n\n{{Redr|move|from CamelCase|up}}','utf-8'
703037144,'{{Redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}\n{{pp-move-indef}}\n{{Use British English|date=January 2014}}','utf-8'
edit: after conducting some more research, it appears that examples #2 and #3 may not be converted using regular expressions: sources: #1, #2.