I have a database with just over 70 million rows in it. The data was originally parsed and imported from roughly 70,000 XML files. These files are updated every week, so I need to scan through them (via a cron job that runs at 2 AM on Sundays) and update rows that have changed and insert new rows.
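For reference, the crontab entry is along these lines (the script path here is a placeholder, not the real one):

0 2 * * 0 /usr/bin/php /path/to/import.php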
$operatorSQL = "INSERT IGNORE INTO `operator` (`reference`, `national_operator_code`, `operator_code`, `operator_short_name`, `operator_name_on_license`, `trading_name`) VALUES (:reference, :nationalOperatorCode, :operatorCode, :operatorShortName, :operatorNameOnLicense, :tradingName);";
$serviceSQL = "INSERT IGNORE INTO `service` (`service_code`, `private_code`, `date_start`, `date_end`, `mode`, `description`, `origin`, `destination`) VALUES (:serviceCode, :privateCode, :dateStart, :dateEnd, :mode, :description, :origin, :destination);";
$serviceOperatorSQL = "INSERT IGNORE INTO `service_operator` (`service_code`, `operator_reference`) VALUES (:serviceCode, :operatorReference);";
$journeyPatternSQL = "INSERT IGNORE INTO `journey_pattern` (`reference`, `direction`, `destination_display`, `vehicle_type_code`, `vehicle_type_description`) VALUES (:reference, :direction, :destinationDisplay, :vehicleTypeCode, :vehicleTypeDescription);";
$journeyPatternRouteSQL = "INSERT IGNORE INTO `journey_pattern_route` (`journey_pattern_reference`, `route_reference`) VALUES (:reference, :routeReference);";
$journeyPatternSectionLink = "INSERT IGNORE INTO `journey_pattern_section_link` (`journey_pattern_reference`, `journey_pattern_section_reference`) VALUES (:reference, :journeyPatternSectionReference);";
$journeyPatternSectionSQL = "INSERT IGNORE INTO `journey_pattern_section` (`reference`) VALUES (:reference);";
$lineSQL = "INSERT IGNORE INTO `service_line` (`service_code`, `name`) VALUES (:serviceCode, :name);";
$timingLinkSQL = "INSERT IGNORE INTO `journey_pattern_timing_link` (`reference`, `stop_from`, `stop_from_timing`, `stop_from_sequence_number`, `stop_from_activity`, `stop_to`, `stop_to_timing`, `stop_to_sequence`, `stop_to_activity`, `run_time`, `direction`) VALUES (:reference, :stopFrom, :stopFromTiming, :stopFromSequenceNumber, :stopFromActivity, :stopTo, :stopToTiming, :stopToSequenceNumber, :stopToActivity, :runTime, :direction);";
$timingLinkJpsSQL = "INSERT INTO `journey_pattern_timing_link_jps` (`journey_pattern_timing_link`, `journey_pattern_section_reference`) VALUES (:linkReference, :sectionReference);";
$timingLinkRouteLinkRefSQL = "INSERT INTO `journey_pattern_timing_link_rlr` (`journey_pattern_timing_link`, `route_link_reference`) VALUES (:linkReference, :routeLinkReference);";
$routeSQL = "INSERT IGNORE INTO `route` (`reference`, `private_code`, `description`) VALUES (:reference, :privateCode, :description);";
$routeSectionSQL = "INSERT IGNORE INTO `route_section` (`reference`) VALUES (:reference);";
$routeLinkSQL = "INSERT IGNORE INTO `route_link` (`reference`, `stop_from`, `stop_to`, `direction`, `distance`) VALUES (:reference, :stopFrom, :stopTo, :direction, :distance);";
$routeLinkSectionSQL = "INSERT INTO `route_link_section` (`route_link_reference`, `route_section_reference`) VALUES (:routeLinkReference, :routeSectionReference);";
$vehicleJourneySQL = "INSERT IGNORE INTO `vehicle_journey` (`reference`, `private_code`, `departure`) VALUES (:reference, :privateCode, :departure);";
$vehicleJourneyServiceSQL = "INSERT IGNORE INTO `vehicle_journey_service` (`vehicle_journey_reference`, `service_reference`) VALUES (:reference, :serviceRef);";
$vehicleJourneyLineSQL = "INSERT IGNORE INTO `vehicle_journey_line` (`vehicle_journey_reference`, `service_line_reference`) VALUES (:reference, :lineRef);";
$vehicleJourneyJpSQL = "INSERT IGNORE INTO `vehicle_journey_jp` (`vehicle_journey_reference`, `journey_pattern_reference`) VALUES (:reference, :journeyPatternRef);";
Above are all of the SQL queries that are performed. You will notice that most of them use the IGNORE clause on the INSERT statement; this is just to make sure that if any files contain duplicate data, no errors will stop the script. Instead, the duplicate row is simply skipped and the script moves on.
I don't feel this is the most efficient way of doing it, however: when I run the script again after the initial insert of all the data, it's just as slow as the original import. Surely if 99.9% of the rows are unchanged it should skim through? Any ideas why this is happening?