I have a PHP web form that accepts file uploads (image and text), from which text is extracted (OCR for images; .pdf, .doc, etc. stripped to plain text). The text extraction is performed by using exec to invoke a jar file/command-line process (I am not in control of the source for either), which returns the text. In ordinary testing there is no issue; however, with 5 simultaneous PDF uploads (each about 5 MB) the server load maxes out. The entire process takes 10-15 seconds per upload, and the load drops back to normal immediately afterwards.
I am assuming the issue is with Java and the resources allocated to a new JRE for each exec call. Manually invoking the jar file from the command line takes about 10 seconds, so nearly the same as a single upload response. Running the extraction as a background process is not possible, because the HTTP response contains the data processed from the uploaded files' text. I considered forking the process, but that doesn't help with the server load (it will probably make it worse). I am hoping to avoid rewriting the service entirely in Java.
Is there a way to pre-load the JRE for the Java process, or to pipe successive files to the same running process, or something similar?
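
To make the "same process" idea concrete, here is a rough sketch of what I imagine: a single long-running JVM that reads one file path per line on stdin and writes back a length-prefixed reply, so the JRE start-up cost is paid only once. Everything here is an assumption on my part. ExtractionWorker, serve and placeholderExtract are names I made up, and placeholderExtract just reads the file as UTF-8, since I don't know whether the jar exposes anything callable from a wrapper like this.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Function;

public class ExtractionWorker {

    // Hypothetical stand-in for the real extraction call. It only reads the
    // file as UTF-8 text; the jar's actual entry point is unknown to me.
    public static String placeholderExtract(String path) {
        try {
            return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads one file path per line and answers each with either
    // "OK <byte count>\n" followed by exactly that many UTF-8 bytes,
    // or "ERR <message>\n". Returns when the input stream is closed.
    public static void serve(InputStream in, OutputStream out,
                             Function<String, String> extractor) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String path;
        while ((path = reader.readLine()) != null) {
            if (path.isEmpty()) {
                continue; // ignore blank lines
            }
            try {
                // Extract first, so nothing is written if extraction fails.
                byte[] text = extractor.apply(path).getBytes(StandardCharsets.UTF_8);
                out.write(("OK " + text.length + "\n").getBytes(StandardCharsets.UTF_8));
                out.write(text);
            } catch (RuntimeException e) {
                String msg = String.valueOf(e.getMessage()).replace('\n', ' ');
                out.write(("ERR " + msg + "\n").getBytes(StandardCharsets.UTF_8));
            }
            out.flush(); // the caller is waiting for this reply
        }
    }

    public static void main(String[] args) throws IOException {
        serve(System.in, System.out, ExtractionWorker::placeholderExtract);
    }
}
```

The PHP side would write a path, read the "OK n" line, then read n bytes. For separate HTTP requests to share the one JVM, I assume this would have to sit behind a socket or a queue rather than plain stdin, which is the part I am unsure about.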