I recently ran into a bug where the compiled version of Simple IoT would not work on one of my computers. Interestingly, go run ... worked fine. What was the difference?
At first, it seemed a race condition was at play: the compiled version was faster and would likely have different timing characteristics. Simple IoT (SIOT) is a highly concurrent application with many clients running in parallel. However, after spending some time debugging, I could not find any issues with startup. The NATS client in SIOT simply could not get data from the embedded NATS server. Still suspecting a race condition, I tried an external NATS server instead of the embedded one, but it made no difference.
I finally watched the network traffic with Wireshark and could not see any NATS traffic in the failing version. The NATS client was not even sending any requests. I then looked at /etc/hosts and there was no localhost entry. After adding one, everything worked properly.
So why did the compiled binary fail when go run ... worked? My best guess is that the compiled binary was built with CGO disabled and go run builds with it enabled. This results in the Go runtime using a different DNS resolver. Apparently, the pure Go DNS resolver does not resolve localhost if it is not in your /etc/hosts. (This is not verified, just my best guess at the moment.)
The Go net package documentation describes the two resolvers as follows:
On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo.
By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.
After discussing with @khem, we decided that using 127.0.0.1 instead of localhost was probably more reliable. Interestingly, others have reached the same conclusion (see videos below). The reasons include:
- lookup takes more time
- localhost results in two IPs: IPv4 and IPv6
- IPv6 is known to cause problems
- some apps/machines don’t support localhost, or don’t have localhost configured correctly (like mine)
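One way to act on these points in code is to rewrite a localhost server URL to the IPv4 loopback literal before connecting, so that no name lookup is needed at all. The sketch below is illustrative only: normalizeLoopback is a hypothetical helper, not taken from Simple IoT, and the nats:// URLs with port 4222 (the NATS default port) are assumed examples.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// normalizeLoopback is a hypothetical helper (not from Simple IoT) that
// rewrites a "localhost" host in a URL to the IPv4 loopback literal, so
// connecting does not depend on name resolution or /etc/hosts.
func normalizeLoopback(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	if !strings.EqualFold(u.Hostname(), "localhost") {
		return rawURL, nil // leave every other host untouched
	}
	if port := u.Port(); port != "" {
		u.Host = net.JoinHostPort("127.0.0.1", port)
	} else {
		u.Host = "127.0.0.1"
	}
	return u.String(), nil
}

func main() {
	// Example inputs are assumptions chosen for illustration.
	for _, in := range []string{"nats://localhost:4222", "nats://example.com:4222"} {
		out, err := normalizeLoopback(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", in, out)
	}
}
```

Defaulting to the literal address in configuration achieves the same thing with no code; a helper like this only matters when users may still supply localhost themselves.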
So to keep things simple and more likely to work in most cases, Simple IoT now sets the default NATS server to