I have 2 tables (url_feed and clean_domains). I'm trying to copy all of the data from url_feed into clean_domains and at the same time insert the domain from the url column.
Also, it will change the status from queued to complete in url_feed after it copies the row to clean_domains.
Here's what the 2 tables look like:
url_feed
id | url | matches | networks | status
1 | http://whatever.com/example1.php | 5 | Facebook::Twitter Inc | queued
2 | http://www.example.com/other-stuff.php | 2 | MySpace::Facebook::Twitter | queued
3 | http://www.test.com/random-text | 12 | Instagram::Twitter | queued
clean_domains
id | domain | url | matches | networks | status
1 | whatever.com | http://whatever.com/example1.php | 5 | Facebook::Twitter Inc | queued
2 | example.com | http://www.example.com/other-stuff.php | 2 | MySpace::Facebook::Twitter | queued
3 | test.com | http://www.test.com/random-text | 12 | Instagram::Twitter | queued
Here's my code:
<?php
$con=mysqli_connect("localhost","redacted","redacted","redacted");
mysqli_query($con,"INSERT INTO clean_domains
(id,domain,url,matches,networks)
SELECT
id,
SUBSTRING_INDEX(REPLACE(REPLACE(REPLACE(REPLACE(url, 'http://', ''), 'https://', ''), 'http://www.', ''), 'https://www.', ''), '/', 1),
url,
matches,
networks
FROM url_feed
WHERE status = 'queued'");
mysqli_query($con,"UPDATE url_feed
SET status = 'complete'
WHERE status = 'queued' AND
id IN (SELECT id
FROM clean_domains)");
mysqli_close($con);
?>
My code is working for 99% of domains, but I can't quite figure out how to make it work perfectly.
Here's the 3 times it seems to not work perfectly:
Colons -
http://example.com:88/testing- URLs like this are having their domains output asexample.com:88whereas I'd want it to beexample.com.IP Addresses -
http://188.123.44.12/test.php- For IPs it seems to be correctly entering the IP address into the database. In the example it would enter188.123.44.12as thedomain- but I don't want that. I only want domain names, so if it's an IP, it shouldn't be copied over. It should be marked ascompleteinurl_feedand move on to the next row.Subdomains -
http://subdomain.whatever.example.com/test.html- This is being entered into thedomaincolumn assubdomain.whatever.example.comwhen I want it to beexample.cominstead.
The only way I can think of verifying if a domain being entered is truly a domain would be to run a whois query on each of them. If it doesn't come back as a valid one, it removed the first block of text. For example, it wouldn't get a valid result for subdomain.whatever.example.com so then it tries whatever.example.com and then it tries example.com until the result is valid or it skips it and marks the status column as complete.
Any ideas on what I can change to make this work properly?
Here's where I am right now with this:
$_url_string = 'https://testfewfew.dsd.google.co.uk/testing/whatever';
preg_match("/[a-z0-9\-]{1,63}\.[a-z\.]{2,6}$/", parse_url($_url_string, PHP_URL_HOST), $_domain_tld);
echo $_domain_tld[0];