To block or not to block stupid HTTP proxy software
A lot of HTTP proxy firewalls used by companies scan web pages (i.e., HTTP responses) received by users behind the firewall. That is reasonable, as they have a legitimate need to protect their network from malware. What is not reasonable is that those proxy firewalls then pre-fetch every URL mentioned anywhere in the returned web page, regardless of whether the human (and their browser) who made the original request would ever request those URLs.
This behavior most often manifests itself as HTTP GET requests with a user-agent value of “Mozilla/4.0 (compatible;)” interspersed among other requests from the same source, albeit with a different user-agent value. Searching Google for “Mozilla/4.0 (compatible;)” turns up several discussions of that user-agent value.
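If you want to check your own logs for this pattern, something like the following Python sketch will surface it. It is only an illustration, not the tooling I actually use: it assumes the common Apache/nginx combined log format, and the access.log path is a placeholder you would replace with your own.

```python
#!/usr/bin/env python3
"""Sketch: count requests per source IP that carry the
suspect "Mozilla/4.0 (compatible;)" user-agent.
Assumes the combined log format; log path is a placeholder."""
import re
from collections import Counter

# Combined log format: IP ... "request" status size "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) .*?"[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')
PROXY_UA = "Mozilla/4.0 (compatible;)"

hits = Counter()
with open("access.log") as log:      # placeholder path
    for line in log:
        m = LINE_RE.match(line)
        if m and m.group(2) == PROXY_UA:
            hits[m.group(1)] += 1    # tally by source IP

# Source addresses sending that user-agent most often
for ip, count in hits.most_common(10):
    print(f"{count:6d}  {ip}")
```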
This behavior by HTTP proxy firewalls is extremely obnoxious, not least because it makes reading and interpreting web server logs more difficult. It also adds load that the server would not otherwise have to handle, especially on one running WordPress, which relies heavily on PHP.
Having said that, I no longer blacklist based on that user-agent header. A careful review of my HTTP access logs showed that while doing so might have blocked a few instances of malware, it more often blocked access from proxy firewalls used by major corporations. I wish those companies would employ more intelligent proxy firewalls that don’t fetch URLs the people behind the firewall are unlikely to fetch, but it isn’t worthwhile to penalize them by blacklisting their public addresses.
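That kind of review can be approximated in the same vein as the earlier sketch: for each source address that ever sent the suspect user-agent, list the other user-agent values seen from it. A mix of the suspect value and ordinary browser values from one corporate address points to a proxy firewall pre-fetching, not malware. Again a rough sketch, under the same assumptions about log format and path.

```python
#!/usr/bin/env python3
"""Sketch: for each source IP that sent the suspect user-agent,
show the other user-agents seen from the same IP. A mix of the
suspect value and normal browser values suggests a proxy firewall
rather than malware. Same format and path assumptions as above."""
import re
from collections import defaultdict

LINE_RE = re.compile(r'^(\S+) .*?"[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')
PROXY_UA = "Mozilla/4.0 (compatible;)"

agents = defaultdict(set)            # source IP -> user-agents seen
with open("access.log") as log:      # placeholder path
    for line in log:
        m = LINE_RE.match(line)
        if m:
            agents[m.group(1)].add(m.group(2))

for ip, uas in agents.items():
    if PROXY_UA in uas and len(uas) > 1:
        print(ip)
        for ua in sorted(uas - {PROXY_UA}):
            print(f"    {ua}")
```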