So, let's summarize a few things, based on the reactions so far.
Probably the best way is to combine all possibilities. :-)
If this is the first incoming request (within the session, that is enough), we can immediately check it against multiple criteria. On the server side we can maintain a dynamic database built from user-agent strings and IP addresses. We can create this DB by mirroring public databases. (Yes, there are several public, regularly updated databases available on the internet for identifying bots. They contain not only user-agent strings but source IPs too.)
Every incoming request can then be quickly checked against that database. If the filter says "OK", we can mark the client as a trusted bot and serve the request.
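For illustration, a minimal PHP sketch of such a check. The file name known_bots.json and its structure are just assumptions here; the mirrored data could equally live in a real database table:

```
<?php
// Minimal sketch, assuming the mirrored bot list is stored locally as
// known_bots.json (hypothetical name/format), e.g.:
// { "user_agents": ["Googlebot", "bingbot"], "ips": ["66.249.66.1"] }

function isTrustedBot(string $userAgent, string $ip): bool
{
    $db = json_decode(file_get_contents(__DIR__ . '/known_bots.json'), true);

    // Match on a known user-agent substring first...
    foreach ($db['user_agents'] as $botUa) {
        if (stripos($userAgent, $botUa) !== false) {
            return true;
        }
    }
    // ...then fall back to matching the source IP.
    return in_array($ip, $db['ips'], true);
}

// Usage at the top of the request handler:
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
$ip = $_SERVER['REMOTE_ADDR'] ?? '';

if (isTrustedBot($ua, $ip)) {
    // Mark as a trusted bot and serve the request normally.
}
```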
We have a problem if there is no user-agent info in the request at all... (Actually, this was the origin of my question.) What do we do if there is no user-agent info? :-)
We need to make a decision here.
The easiest way is to simply deny these requests and consider them abnormal. Of course, from this point on we may lose some real users, but according to our stats that is not a big risk, I think. It is also possible to send back a human-readable message like "Sorry, but your browser doesn't send user-agent info, so your request was denied" - or whatever. If this is a bot, there will be no one to read it anyway. If this is a humanoid, we can kindly give her/him usable instructions.
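A minimal PHP sketch of that policy; the wording of the message and the 403 status are of course just examples:

```
<?php
// Minimal sketch of the "deny if no user-agent" policy.
// A human-readable message gives a real user usable instructions.

$ua = trim($_SERVER['HTTP_USER_AGENT'] ?? '');

if ($ua === '') {
    http_response_code(403);
    header('Content-Type: text/plain; charset=utf-8');
    echo "Sorry, but your browser doesn't send user-agent info, "
       . "so your request was denied. Please enable it or use another browser.";
    exit;
}
// Otherwise continue with normal request handling...
```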
If we decide not to deny these requests, we can initiate the post-tracking mechanism MrCode suggested here. OK, we serve THAT request, but start collecting behaviour info. How? For example, note the IP address in the DB (greylist it) and reference a fake CSS file in the response - one served not statically by the webserver but by our server-side language: PHP, Java or whatever we are using. If this is a robot, it is very unlikely to try to download a CSS file, while a real browser definitely will - probably within a very short time frame (e.g. 1-2 seconds). We can easily continue the process in the action serving the fake CSS file: just look up the IP in the greylist DB and, if we judge the behaviour normal, whitelist that IP address (for example).
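Roughly, in PHP, it could look like the sketch below. The SQLite file (greylist.sqlite) and the endpoint name (fake-style.php) are just assumptions for illustration; any storage and URL scheme would do:

```
<?php
// Minimal sketch of the greylist + fake CSS idea (requires pdo_sqlite).

function greylistDb(): PDO
{
    $pdo = new PDO('sqlite:' . __DIR__ . '/greylist.sqlite');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $pdo->exec('CREATE TABLE IF NOT EXISTS greylist (ip TEXT PRIMARY KEY, listed_at INTEGER)');
    return $pdo;
}

// Called from the page serving the suspicious request:
// greylist the IP and reference the dynamically generated "CSS" file.
function greylistAndTagResponse(string $ip): string
{
    greylistDb()->prepare('INSERT OR REPLACE INTO greylist (ip, listed_at) VALUES (?, ?)')
                ->execute([$ip, time()]);
    return '<link rel="stylesheet" href="/fake-style.php">';
}

// Body of fake-style.php: a real browser fetched the CSS, so remove the IP
// from the greylist (i.e. whitelist it) and return an empty stylesheet.
function serveFakeCss(string $ip): void
{
    greylistDb()->prepare('DELETE FROM greylist WHERE ip = ?')->execute([$ip]);
    header('Content-Type: text/css');
    echo "/* nothing to see here */";
}
```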
If we get another request from a greylisted IP address (a rough sketch follows after this list):
a) within the 1-2 second time frame: we may delay our response a few seconds (waiting for the parallel thread - maybe it will download the fake CSS meanwhile) and periodically check the greylist DB to see whether the IP address has disappeared or not;
b) beyond the 1-2 second time frame: we simply deny the request.
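A rough PHP sketch of that check, assuming the same greylist.sqlite table as in the previous snippet; the 2-second window and the 0.5-second polling interval are just example values:

```
<?php
// Minimal sketch of handling a repeat request from a greylisted IP.

function greylistDb(): PDO
{
    $pdo = new PDO('sqlite:' . __DIR__ . '/greylist.sqlite');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    return $pdo;
}

function handleGreylistedRequest(string $ip): void
{
    $stmt = greylistDb()->prepare('SELECT listed_at FROM greylist WHERE ip = ?');
    $stmt->execute([$ip]);
    $listedAt = $stmt->fetchColumn();

    if ($listedAt === false) {
        return; // not greylisted (already whitelisted) - serve normally
    }

    if (time() - (int)$listedAt <= 2) {
        // Case a) still inside the time frame: wait a bit and poll the DB,
        // hoping the parallel fake-CSS request removes the entry meanwhile.
        for ($i = 0; $i < 10; $i++) {
            usleep(500000); // 0.5 s
            $stmt->execute([$ip]);
            if ($stmt->fetchColumn() === false) {
                return; // IP disappeared from the greylist - serve normally
            }
        }
    }

    // Case b) the time frame has passed (or polling gave up): deny.
    http_response_code(403);
    exit('Request denied.');
}
```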
So, something like that... How does it sound?