Question modified slightly to improve understandability.
My goal is to optimize a web application which has a very bad DB design and whose DB I can't touch (no ALTER TABLE, no new database). I can operate on the code itself, on the filesystem, or via a proxy. By "optimizing" I mean: reduce the requests that reach the webapp (as opposed to those served directly from the filesystem), keep DB queries to a minimum, and reduce the number of distinct URLs called (keeping caching in mind).
Let me try to construct a fictitious example, just to provide something to talk about. Imagine this scenario:
- I have a PHP webapp which exposes a database of a million different people. Each person decided at some point whether they are happy or sad.
- When I visit `person.php?id=x` (x = 1, ..., 1000000), the page creates a link to `show_picture_of_person.php?id=x`.
- `show_picture_of_person.php` goes into the million-row DB, and the DB tells me whether the person is sad or happy by returning an image. I don't know what this image is unless I extract it from the DB; once extracted, I can analyze it and tell whether it shows a sad face or a happy face. The script `show_picture_of_person.php` actually outputs an image, and the DB stores the image itself in a BLOB. Images are always either sad.jpg or happy.jpg.

What I would like to have, instead of a million links to `show_picture_of_person.php?id=x`, is two links: one for sad.jpg and one for happy.jpg. Possible solutions in my mind:
- I write a script which calls all the possible combinations of `show_picture_of_person.php`, saves all the images, works out which ones are identical, and then writes a lookup table. I put that lookup table into a PHP function, `make_sensible_url("show_picture_of_person.php?id=x") -> happy.jpg`, which is called from the `person.php` script. I am worried about the performance of the PHP engine itself here: an array like that would by itself be a 50+ MB file! (A sketch of this follows the list.)
- Same as above, but instead of constructing an array in PHP, I create a filesystem of a million text files; inside each text file is the name of the actual static image file (avoiding repetitions). The function `make_sensible_url("show_picture_of_person.php?id=x")` simply reads and outputs the content of the file. I like this approach: no DB calls, and reading from the fs should be faster than the huge PHP array of solution 1. (Also sketched below.)
- I change `person.php` so that there is no link to `show_picture_of_person.php`; instead I inline `data:` images. The issue with this is that, for x visits to `person.php`, I still make 2x DB queries (one for the person data and one to extract the image blob), and I would like only x queries. It also increases the size of the page (in my real case I have ~20 images on one page, so lots of queries and lots of bytes). (Sketched last, below.)
- I don't know what else...
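To make solution 1 concrete, here is a minimal sketch of the offline part; the base URL, the output paths, and the file name `lookup.php` are placeholders of my own, not part of the real app:

```php
<?php
// One-off crawler, run offline from the CLI (never under web load):
// fetch every id, fingerprint the returned bytes, save each distinct
// image once as a static file, and emit the lookup table as PHP code.

$map  = [];   // person id  => canonical static file name
$seen = [];   // md5(bytes) => canonical static file name

for ($id = 1; $id <= 1000000; $id++) {
    $bytes = file_get_contents("http://localhost/show_picture_of_person.php?id=$id");
    $hash  = md5($bytes);
    if (!isset($seen[$hash])) {
        // First time this exact image appears: store it once on disk.
        $name = "img_$hash.jpg";                 // ends up as just 2 files here
        file_put_contents("/var/www/static/$name", $bytes);
        $seen[$hash] = $name;
    }
    $map[$id] = $seen[$hash];
}

// Write the table out as an includable PHP file.
file_put_contents("lookup.php", "<?php\nreturn " . var_export($map, true) . ";\n");
```

and the request-time side, called from `person.php`:

```php
<?php
function make_sensible_url($id) {
    static $map = null;
    if ($map === null) {
        $map = require __DIR__ . "/lookup.php";   // loaded once per request
    }
    return "/static/" . $map[$id];
}
```

Even with an opcode cache, which compiles the include only once, a million-entry PHP array typically costs more RAM in memory than the raw file size suggests, which is exactly the performance worry above.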
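For solution 2, a sketch along these lines; the `map/` directory layout and the two-digit sharding are my own assumption, added so that no single directory has to hold a million entries:

```php
<?php
// Filesystem variant: one tiny text file per id, generated offline by the
// same crawler, each containing just the canonical image name. Sharding by
// the last two digits keeps every directory to ~10,000 entries.
// Directories would be created once by the offline script.

function map_path($id) {
    $shard = substr(str_pad((string)$id, 2, "0", STR_PAD_LEFT), -2); // "00".."99"
    return "/var/www/map/$shard/$id.txt";
}

// Offline, while crawling:
//   file_put_contents(map_path($id), $canonicalName);

// At request time, in person.php:
function make_sensible_url($id) {
    return "/static/" . trim(file_get_contents(map_path($id)));
}
```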
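And solution 3 would look roughly like this; the table and column names are invented, since I don't know what `show_picture_of_person.php` really queries:

```php
<?php
// data: URI variant: person.php inlines the image bytes instead of
// emitting a link to show_picture_of_person.php.

function inline_picture(PDO $db, $id) {
    $stmt = $db->prepare("SELECT picture FROM people WHERE id = ?");
    $stmt->execute([$id]);
    $blob = $stmt->fetchColumn();

    // Second DB round-trip per person, and base64 inflates the payload
    // by roughly a third -- the two drawbacks noted above.
    return '<img src="data:image/jpeg;base64,' . base64_encode($blob) . '">';
}
```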
How would you solve this? Thanks!
For completeness' sake, here was the original question:
This is the scenario:
- a database with various tables whose data is not properly indexed (let's say, for argument's sake, that we have 5,000 unique objects represented in around 50,000 rows, so duplicates are present)
- we are in a situation in which the database is not modifiable (this also means that we can't introduce another table)
- we have a PHP app exposing these objects
- there exist around 1 million PHP calls (all legitimate), each returning one of those 5,000 objects (e.g.: bad_url.php?id=bar, bad_url.php?id=foo, ...)
- there is no easy way to programmatically decide which of the 5,000 objects will be returned

Our goal is to somehow convert the million+ calls into calls of the form giveme.php?id=x, where x is one of the 5,000 unique objects.
Just to give you an idea of a theoretical approach:
- we could index all the million calls and map each one to the distinct object it returns
- we could create a hash table or something similar, plus a PHP function which would work as give_me_nice_url("bad_url.php?....")
- my feeling is that such a solution would result in a 50-100 MB array; I'm not sure how performant that would be running in real time under load

My question is: which approach would you use to solve this issue and handle the large data set? Is there a better way than a lookup table like the one in my solution? Remember that I can't use a database in the final production setup.