I tried this on my 2015 i5 laptop (two cores, four threads). I made some test data like this:
$ mkdir sample
$ cd sample
$ vipsheader ../fg.png ../bg.png
../fg.png: 200x200 uchar, 4 bands, srgb, pngload
../bg.png: 500x500 uchar, 4 bands, srgb, pngload
$ for i in {0..1000}; do cp ../fg.png fg$i.png; done
$ for i in {0..1000}; do cp ../bg.png bg$i.png; done
So that's about 1,000 each of the 500x500 and 200x200 PNG images (1,001, strictly, since {0..1000} is inclusive).
First, the base case (ImageMagick 6.9.10):
$ time for i in {0..1000}; do convert bg$i.png -page +10+10 fg$i.png -background none -flatten out$i.png; done
real 0m49.461s
user 1m4.875s
sys 0m6.690s
49s is about 20 ops/second.
Next, I tried with GNU parallel. This is a simple way to run enough of them in parallel to keep all cores loaded:
$ time parallel convert bg{}.png -page +10+10 fg{}.png -background none -flatten out{}.png ::: {0..1000}
real 0m32.278s
user 1m46.428s
sys 0m11.897s
32s is 31 ops/second. This is on a two-core laptop -- you'd see a better speedup with a larger desktop machine.
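If you don't have GNU parallel to hand, roughly the same thing can be sketched in Python. This is only an illustration, not something I timed: the names `convert_command` and `run_all` are made up for this example, and it assumes `convert` is on your PATH and that one job per logical CPU is a sensible default.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def convert_command(i):
    """Build the same ImageMagick command line as the shell loop, for index i."""
    return ["convert", f"bg{i}.png", "-page", "+10+10", f"fg{i}.png",
            "-background", "none", "-flatten", f"out{i}.png"]


def run_one(i):
    """Run a single convert process and return its exit code."""
    return subprocess.run(convert_command(i)).returncode


def run_all(indices, workers=None):
    """Run one convert per index, keeping `workers` processes in flight."""
    # Assumption: default to one job per logical CPU.
    workers = workers or os.cpu_count() or 1
    # The threads only wait on child processes, so the GIL is not a bottleneck.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, indices))


if __name__ == "__main__":
    if len(sys.argv) > 1:
        codes = run_all(range(int(sys.argv[1])))
        print(sum(1 for c in codes if c != 0), "failures")
    else:
        print("usage: parallel_convert.py COUNT")
```

You would still pay the cost of starting a new process for every image, which is part of why the in-process approach is quicker.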
Finally, I wrote a tiny pyvips program to do your task. pyvips is the Python binding for libvips, but there are Go bindings too.
#!/usr/bin/env python3

import pyvips

for i in range(0, 1000):
    bg_name = "bg" + str(i) + ".png"
    fg_name = "fg" + str(i) + ".png"
    out_name = "out" + str(i) + ".png"
    # sequential access streams pixels top to bottom, with no random access
    bg = pyvips.Image.new_from_file(bg_name, access="sequential")
    fg = pyvips.Image.new_from_file(fg_name, access="sequential")
    # place fg over bg with its top-left corner at (10, 10)
    result = bg.composite2(fg, "over", x=10, y=10)
    result.write_to_file(out_name)
I see:
$ time ~/try/try289.py
real 0m25.887s
user 0m36.625s
sys 0m1.442s
26s is about 40 ops/second. You'd be able to get it a bit quicker if you ran several in parallel.
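As a rough sketch of what running several in parallel could look like, here is one way to spread the same composite over a pool of worker processes. I have not timed this version. The helper names (`job_names`, `composite_one`, `run_all`) and the `chunksize` value are assumptions made up for illustration, and it assumes the default pool size (one worker per CPU) suits your machine.

```python
import multiprocessing
import sys


def job_names(i):
    """Return the (background, foreground, output) filenames for index i."""
    return f"bg{i}.png", f"fg{i}.png", f"out{i}.png"


def composite_one(i):
    """Composite fg over bg at (10, 10) for one index, as in the loop above."""
    # Imported inside the worker so the parent can start without loading libvips.
    import pyvips

    bg_name, fg_name, out_name = job_names(i)
    bg = pyvips.Image.new_from_file(bg_name, access="sequential")
    fg = pyvips.Image.new_from_file(fg_name, access="sequential")
    bg.composite2(fg, "over", x=10, y=10).write_to_file(out_name)
    return out_name


def run_all(count, workers=None):
    """Process indices 0..count-1 across a pool of worker processes."""
    with multiprocessing.Pool(processes=workers) as pool:
        # chunksize batches indices to cut down on inter-process messages
        # (16 is an arbitrary illustrative value).
        for _ in pool.imap_unordered(composite_one, range(count), chunksize=16):
            pass


if __name__ == "__main__":
    if len(sys.argv) > 1:
        run_all(int(sys.argv[1]))
    else:
        print("usage: composite_parallel.py COUNT")
```

Processes rather than threads keep each worker's libvips state separate. libvips already threads some of the work internally, so the gain on a two-core machine is likely to be modest.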
One of the limits you are hitting is the PNG format -- libpng, the usual PNG library, is single-threaded, and rather slow. If you are willing to try TIFF, you can get quite a bit more speed.
TIFF with deflate compression is functionally similar to PNG: both are lossless, and PNG uses deflate too. If I try:
$ vips copy fg.png "fg.tif[compression=deflate]"
$ vips copy bg.png "bg.tif[compression=deflate]"
$ ls -l bg.*
-rw-r--r-- 1 john john 19391 Dec 27 20:48 bg.png
-rw-r--r-- 1 john john 16208 Jan 2 18:36 bg.tif
So it's actually slightly smaller, in this case. If I change the pyvips program to be:
    bg_name = "bg" + str(i) + ".tif"
    fg_name = "fg" + str(i) + ".tif"
    out_name = "out" + str(i) + ".tif[compression=deflate]"
And run it, I see:
$ time ~/try/try289.py
real 0m17.618s
user 0m23.234s
sys 0m1.823s
17.6s is about 57 ops/second.
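For reference, the ops/second figures are just jobs divided by wall-clock time. A small sketch of the arithmetic, using the `real` timings above; the job counts are my reading of the loops (`{0..1000}` is 1,001 jobs in the shell runs, `range(0, 1000)` is 1,000 in the Python runs), and `ops_per_second` is a name made up for this example.

```python
# (label, number of jobs, wall-clock seconds from the `real` lines)
runs = [
    ("convert, serial loop", 1001, 49.461),
    ("convert, GNU parallel", 1001, 32.278),
    ("pyvips, PNG", 1000, 25.887),
    ("pyvips, TIFF deflate", 1000, 17.618),
]


def ops_per_second(jobs, seconds):
    """Throughput as completed jobs per second of wall-clock time."""
    return jobs / seconds


baseline = ops_per_second(runs[0][1], runs[0][2])
for label, jobs, seconds in runs:
    rate = ops_per_second(jobs, seconds)
    # Speedup is relative to the serial convert loop.
    print(f"{label:24s} {rate:5.1f} ops/s  {rate / baseline:4.2f}x")
```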