Symptoms
Summary: Name resolution fails intermittently, causing applications to fail. The nscd does not cache hosts which resolve to multiple IP addresses.
Occurrences
At what times did this problem occur (used to estimate frequency):
Observations
The nscd does not seem to cache some host entries. Setting the debug level to a value greater than 1 in /etc/nscd.conf shows that for some hostnames a cache miss is returned on every request, while caching works correctly for others.
Experimentation shows that the caching systematically fails for hosts which resolve to multiple IP addresses.
To make matters worse, the nscd does correctly cache lookup failures, so once a lookup has failed, all subsequent requests to resolve that host fail as well until the negative cache entry expires (negative-time-to-live config parameter, 20s by default).
This situation is extremely bad on our T3, because the DMZ nameserver that we use is protected against too many requests from the same host within a short time span. Since every uncached lookup goes to the nameserver, we run into this protection and get host lookup failures in these cases. The problem was noted with CRAB jobs trying to resolve cmsdbsprod for registering data sets.
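For reference, the two nscd settings discussed above would look roughly like the fragment below in /etc/nscd.conf. This is only a sketch: the log file path and the exact debug level are assumptions, and the negative-time-to-live value simply restates the 20s default mentioned above.

```
# /etc/nscd.conf (fragment, illustrative values)

# Assumed log location; adjust to the local setup.
logfile                 /var/log/nscd.log

# Any value greater than 1 makes the cache lookups visible in the log.
debug-level             2

# Failed host lookups are remembered for this many seconds.
negative-time-to-live   hosts   20
```

The nscd has to be restarted to pick up changes to this file. If a failure has been cached, running nscd with the -i hosts option should invalidate the hosts cache without waiting for the negative entry to expire.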
Test example:
cmsdbsprod resolves to two IP addresses:
host cmsdbsprod.cern.ch
cmsdbsprod.cern.ch has address 128.142.142.178
cmsdbsprod.cern.ch has address 128.142.142.133
A little stress test:
for ((n=1; n<200; n++)); do gethostip cmsdbsprod.cern.ch; done
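To put a number on the failure rate, the stress test can be wrapped in a small helper that counts how many invocations fail. The following is a minimal sketch, not part of the original test: the function name count_failures is hypothetical, and it assumes that the lookup command exits with a non-zero status when the resolution fails.

```shell
#!/bin/sh
# Run a command repeatedly and print how many invocations failed.
# Usage: count_failures <iterations> <command> [args...]
# The function body runs in a subshell so its variables stay local.
count_failures() (
    iterations=$1
    shift
    failures=0
    n=0
    while [ "$n" -lt "$iterations" ]; do
        # Discard the command's output; only its exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        n=$((n + 1))
    done
    echo "$failures"
)
```

For example, count_failures 199 gethostip cmsdbsprod.cern.ch repeats the loop above and prints the number of failed lookups instead of the resolved addresses.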
nscd.log entry example:
...
19128: handle_request: request received (Version = 2) from PID 26386
19128: GETHOSTBYNAME (cmsdbsprod.cern.ch)
19128: Haven't found "cmsdbsprod.cern.ch" in hosts cache!
19128: handle_request: request received (Version = 2) from PID 26386
19128: GETHOSTBYNAME (cmsdbsprod.cern.ch)
19128: Haven't found "cmsdbsprod.cern.ch" in hosts cache!
...
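The cache misses for one host can be counted directly from such a log. This is a rough sketch under the assumption that the message format is exactly the one shown in the excerpt above; the function name count_cache_misses is hypothetical and the log path has to be supplied by the caller.

```shell
#!/bin/sh
# Count "Haven't found ... in hosts cache" messages for one hostname.
# Usage: count_cache_misses <hostname> <logfile>
count_cache_misses() {
    # -F treats the pattern as a fixed string, so the dots in the
    # hostname are not interpreted as regular expression wildcards.
    grep -cF -- "Haven't found \"$1\" in hosts cache" "$2"
}
```

A single miss per host is expected when an entry is first looked up. A count that keeps growing with every request, as in the stress test, points to the caching problem described here.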
Solution or Workaround
Googling turned up only one reference to this problem (http://bugs.gentoo.org/196241). There, upgrading glibc to 2.8 was recommended, but there is no reply from the submitter. We currently run glibc-2.3.4 on our SL4 installations.
An ugly workaround is to hardcode the few hosts that cause problems in the /etc/hosts files. This has been done for the moment.
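The /etc/hosts lines for such a host can be derived from the output of the host command. The helper below is only a sketch: the function name host_output_to_hosts_lines is hypothetical, and it assumes the "NAME has address IP" output format shown in the test example. It prints the lines to standard output and leaves it to the administrator to review them and add them to /etc/hosts.

```shell
#!/bin/sh
# Convert "NAME has address IP" lines (as printed by the host command)
# into "IP<TAB>NAME" lines in /etc/hosts format.
# Reads from standard input and writes to standard output.
host_output_to_hosts_lines() {
    awk '$2 == "has" && $3 == "address" { print $4 "\t" $1 }'
}
```

For example, host cmsdbsprod.cern.ch | host_output_to_hosts_lines prints one line per address. Hardcoded entries go stale when the addresses change, so they should be removed again once a proper fix is in place.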
Monitoring for this condition
-- DerekFeichtinger - 17 Jun 2009