I want to insert a JSON file (also available as CSV) into a MySQL database using the CakePHP framework. The basics are clear, but the surrounding requirements make it difficult:
- The JSON/CSV file is large (approx. 200 MB, up to 200,000 lines).
- The file contains several fields. These fields need to be mapped to fields with different names in the MySQL database.
- The CSV contains a field named art_number. This field is also present in the MySQL database. The art_number is unique, but it is not the primary key in MySQL. I want to update the MySQL record if the CSV and the database have the same art_number; if not, a new record should be created.
- Several fields of the CSV file need to be processed before they are stored, and additional fields need to be added.
- The CSV contains an image_URL. If the record is NEW to the database (unknown art_number), this image should be copied, modified (with Imagick), and stored on the server.
- The whole job needs to run on a daily basis.
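Given the 200 MB file size, the first hard constraint is reading the CSV line by line instead of loading it into memory at once. A minimal sketch in plain PHP (the file path, batch size, and the `processBatch()` helper are assumptions for illustration):

```php
<?php
// Stream the CSV so memory usage stays flat regardless of file size.
$handle = fopen('/path/to/import.csv', 'r');
if ($handle === false) {
    throw new RuntimeException('Could not open import file');
}

$header = fgetcsv($handle);   // first line: column names
$batch = [];
$batchSize = 500;             // tune to available memory

while (($row = fgetcsv($handle)) !== false) {
    $batch[] = array_combine($header, $row);  // map columns to names

    if (count($batch) >= $batchSize) {
        processBatch($batch); // hypothetical: upsert this chunk
        $batch = [];
    }
}
if ($batch) {
    processBatch($batch);     // flush the remainder
}
fclose($handle);
```

Working in fixed-size batches also gives you natural transaction boundaries and restart points.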
As you can see, there is a lot going on, with some limitations (memory, runtime, etc.), and I am not sure how to approach this from an architecture point of view. For example, should I first insert everything into a separate "import" database table and then run through the remaining steps separately? What is a good way to map the database IDs to the CSV lines? CakePHP can either create a new record or update an existing one if I manage to map the ID based on the art_number. Also, converting and copying up to 200,000 images seems to be a big issue. So how do I break this down into smaller chunks?
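For the art_number mapping, the CakePHP ORM can make the create-or-update decision per row without you tracking primary keys yourself. A hedged sketch, assuming CakePHP 3.x, a hypothetical `Products` table, and an invented field mapping and `image_pending` flag column:

```php
<?php
use Cake\ORM\TableRegistry;

// Upsert one CSV row keyed on art_number. All field names are assumptions.
function upsertRow(array $csvRow): void
{
    $products = TableRegistry::getTableLocator()->get('Products');

    // Map CSV column names to database column names, with example processing.
    $data = [
        'art_number' => $csvRow['art_number'],
        'name'       => trim($csvRow['title']),        // example cleanup step
        'price'      => (float)$csvRow['price_gross'], // example type cast
    ];

    $existing = $products->find()
        ->where(['art_number' => $data['art_number']])
        ->first();

    if ($existing === null) {
        // Unknown art_number: create the record and mark it so a later
        // image job knows it still has to fetch and convert the image.
        $data['image_pending'] = true;                 // hypothetical column
        $entity = $products->newEntity($data);
    } else {
        $entity = $products->patchEntity($existing, $data);
    }

    $products->saveOrFail($entity);
}
```

Flagging new records instead of processing images inline is one way to decouple the fast database pass from the slow Imagick pass.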
I would appreciate your help in finding the right strategy here. What do I need to consider in terms of memory and speed? Does it make sense to split the process into different jobs? If so, how would you do that?
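One common pattern for the daily run is to split the pipeline into separate CakePHP console commands that cron runs in sequence, so each stage can fail and restart independently. The command names, paths, and chunk limit below are assumptions, not a prescribed setup:

```shell
# Hypothetical crontab: import the rows first, then work through pending
# images in bounded chunks so a single run never touches all 200,000.
0 2 * * * cd /var/www/app && bin/cake csv_import >> logs/import.log 2>&1
0 3 * * * cd /var/www/app && bin/cake image_process --limit 2000 >> logs/images.log 2>&1
```

A chunked image job that picks up where it left off also keeps each cron run within a predictable memory and runtime budget.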