Outsourced Clue

Providing big company technology recommendations to the masses

Archive for the ‘Troubleshooting’ Category

SOLVED: Problems with Safari 4, Nginx and Connections being reset


We had some issues with Safari 4 (and only Safari 4) behind our Nginx load balancer setup; we use Nginx to load balance between a few Apache servers. It turns out Safari 4 doesn’t like the keepalive timeout to be anything but 0.  Our nginx configuration had keepalive_timeout set to 65 (the value in the stock nginx.conf), and with that setting the site would consistently fail to deliver the full content back to Safari 4 clients.

Setting the keepalive_timeout value to 0 solved the problem.  Hopefully this helps someone out there.
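For reference, here is a minimal sketch of where the setting lives. This is a fragment, not our full config: the upstream name apache_backends and the backend addresses are placeholders made up for illustration.

```nginx
http {
    # Hypothetical backend pool; the name and addresses are placeholders.
    upstream apache_backends {
        server 10.0.0.11:80;
        server 10.0.0.12:80;
    }

    # 0 disables keep-alive connections to clients.
    keepalive_timeout 0;

    server {
        listen 80;

        location / {
            proxy_pass http://apache_backends;
        }
    }
}
```

Keep in mind that turning keep-alive off means every request opens a new client connection, so it is a workaround with a cost rather than a free fix.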

Here is some more info on the issue from Ruby Forum.

Written by sleach

April 16th, 2009 at 4:38 pm

Posted in Troubleshooting


Supporting “dig +trace” using an Unbound recursive/caching DNS server


dig +trace example.com is an extremely useful DNS debugging tool.  It walks the delegation path from the root, showing the answer each authoritative DNS server in the path handed out, which helps you track down some obscure DNS errors.  For example, here is a dig +trace for “outsourcedclue.com”.

 

; <<>> DiG 9.6.0-P1 <<>> +trace outsourcedclue.com
;; global options: +cmd
.			518073	IN	NS	F.ROOT-SERVERS.NET.
.			518073	IN	NS	M.ROOT-SERVERS.NET.
.			518073	IN	NS	B.ROOT-SERVERS.NET.
.			518073	IN	NS	D.ROOT-SERVERS.NET.
.			518073	IN	NS	K.ROOT-SERVERS.NET.
.			518073	IN	NS	A.ROOT-SERVERS.NET.
.			518073	IN	NS	H.ROOT-SERVERS.NET.
.			518073	IN	NS	J.ROOT-SERVERS.NET.
.			518073	IN	NS	E.ROOT-SERVERS.NET.
.			518073	IN	NS	L.ROOT-SERVERS.NET.
.			518073	IN	NS	C.ROOT-SERVERS.NET.
.			518073	IN	NS	G.ROOT-SERVERS.NET.
.			518073	IN	NS	I.ROOT-SERVERS.NET.
;; Received 512 bytes from 10.1.11.1#53(10.1.11.1) in 1 ms

com.			172800	IN	NS	I.GTLD-SERVERS.NET.
com.			172800	IN	NS	H.GTLD-SERVERS.NET.
com.			172800	IN	NS	J.GTLD-SERVERS.NET.
com.			172800	IN	NS	G.GTLD-SERVERS.NET.
com.			172800	IN	NS	F.GTLD-SERVERS.NET.
com.			172800	IN	NS	B.GTLD-SERVERS.NET.
com.			172800	IN	NS	A.GTLD-SERVERS.NET.
com.			172800	IN	NS	D.GTLD-SERVERS.NET.
com.			172800	IN	NS	L.GTLD-SERVERS.NET.
com.			172800	IN	NS	E.GTLD-SERVERS.NET.
com.			172800	IN	NS	M.GTLD-SERVERS.NET.
com.			172800	IN	NS	C.GTLD-SERVERS.NET.
com.			172800	IN	NS	K.GTLD-SERVERS.NET.
;; Received 496 bytes from 202.12.27.33#53(M.ROOT-SERVERS.NET) in 147 ms

outsourcedclue.com.	172800	IN	NS	ns1.softlayer.com.
outsourcedclue.com.	172800	IN	NS	ns2.softlayer.com.
;; Received 170 bytes from 192.35.51.30#53(F.GTLD-SERVERS.NET) in 45 ms

outsourcedclue.com.	86400	IN	A	208.43.45.4
outsourcedclue.com.	86400	IN	NS	ns2.softlayer.com.
outsourcedclue.com.	86400	IN	NS	ns1.softlayer.com.
;; Received 98 bytes from 67.228.255.5#53(ns2.softlayer.com) in 42 ms

 

I use Unbound as my recursive/caching DNS server of choice, and one day I noticed it didn’t support “dig +trace”.  Distraught, I dug into why.  A buddy suggested that perhaps Unbound wasn’t allowing non-recursive queries, which +trace relies on.  Digging into the documentation, I discovered the allow_snoop option of the access-control directive.  So, for example, if your config file looks like this:

server:
    access-control: 0.0.0.0/0 refuse
    access-control: 127.0.0.0/8 allow
    access-control: 10.1.11.0/24 allow

 

just add the following directive to support “dig +trace” from the IPs that need it:

    access-control: 10.1.11.0/24 allow_snoop

Now you can dig +trace to your heart’s content!
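Put together, the server block might end up looking like the sketch below. One assumption on my part: since allow_snoop grants both recursive and non-recursive access, this version swaps the existing allow line for that netblock rather than listing the same netblock twice.

```
server:
    access-control: 0.0.0.0/0 refuse
    access-control: 127.0.0.0/8 allow
    # allow_snoop permits recursive and non-recursive queries from this netblock
    access-control: 10.1.11.0/24 allow_snoop
```

Only hand out allow_snoop to networks you trust, since non-recursive queries let a client inspect what is sitting in the cache.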

Written by sleach

April 13th, 2009 at 4:59 pm

Posted in DNS,Troubleshooting


Broken Caching DNS Server Causes Headaches


By now you have read the many reports of The Planet’s data center fire. Long story short, there was an explosion on the first floor of their Houston facility (old ev1servers data center) that affected network connectivity, servers and a ton of other items. Some buddies of mine, Pelago, have their gear in this facility. Luckily, their servers were fine (no downtime), but there were spotty network issues for 3 days, starting on Sunday, that are finally resolved as of yesterday.

One item that caused me/us a ton of headaches was spotty connectivity to their payment processor. They utilize the SOAP interface for submitting their payment information (when someone signs up for their Intervals project management application etc.). What we were seeing was all connections to the SOAP service (accessed over normal HTTPS) timing out. After some digging, the weird part was that they were only timing out when run via the PHP interpreter embedded in Apache (i.e. when run as part of the normal web process). If we ran the same code via the PHP command line interpreter, it worked fine. It was driving us mad: the network path to the payment processor (Sage) looked fine using the normal network troubleshooting tools. In addition, we could easily simulate pulling down the SOAP WSDL file using curl etc. So it wasn’t network path related, but we still couldn’t figure it out.

On a hunch, I decided to watch the DNS traffic during the transaction, and lo and behold, I saw DNS queries going to The Planet’s recursive servers (which was odd, as I always configure local caching servers). The queries were for the AAAA (IPv6 address) record of the gateway, and they were timing out, resulting in multiple retransmissions. When we ran the script via the command line, it would query the local caching servers (as it should) and get a NOERROR response with an empty answer section right away (the correct response, since the payment processor didn’t have AAAA records); it would then fall back to querying for the A record and succeed.

Here are the packet traces for those interested in the FAILURE scenario (names and IPs changed to protect the non-innocent):

11:30:22.301692 IP (tos 0x0, ttl 64, id 34396, offset 0, flags [DF], proto 17, length: 62) 1.2.3.4.47502 > 2.3.4.5.domain: [bad udp cksum de1d!] 20370+ AAAA? endpoint.paymentprocessor.net. (34)
11:30:27.302284 IP (tos 0x0, ttl 64, id 29396, offset 0, flags [DF], proto 17, length: 62) 1.2.3.4.47501 > 2.3.4.5.domain: [bad udp cksum e11d!] 20370+ AAAA? endpoint.paymentprocessor.net. (34)
11:30:32.303732 IP (tos 0x0, ttl 64, id 34397, offset 0, flags [DF], proto 17, length: 62) 1.2.3.4.47502 > 2.3.4.5.domain: [bad udp cksum de1d!] 20370+ AAAA? endpoint.paymentprocessor.net. (34)
11:30:37.304502 IP (tos 0x0, ttl 64, id 49399, offset 0, flags [DF], proto 17, length: 79) 1.2.3.4.47502 > 2.3.4.5.domain: [bad udp cksum b9c4!] 38226+ AAAA? endpoint.paymentprocessor.net.longerdomain.com. (51)
11:30:42.305874 IP (tos 0x0, ttl 64, id 54400, offset 0, flags [DF], proto 17, length: 79) 1.2.3.4.47503 > 2.3.4.5.domain: [bad udp cksum b6c4!] 38226+ AAAA? endpoint.paymentprocessor.net.longerdomain.com. (51)

You can see the retransmits, five seconds apart, with no response ever coming back. Now here is the CORRECT transaction:

05:30:54.793846 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 62) 1.2.3.4.50700 > 3.4.5.6.domain: [bad udp cksum d6dc!] 33819+ AAAA? endpoint.paymentprocessor.net. (34)
05:30:54.842103 IP (tos 0x0, ttl 60, id 0, offset 0, flags [DF], proto 17, length: 126) 3.4.5.6.domain > 1.2.3.4.50700: [udp sum ok] 33819 q: AAAA? endpoint.paymentprocessor.net. 0/1/0 ns: paymentprocessor.net. SOA ns.example.com. soacontact.example.com. 1064587759 4800 2400 950400 2400 (98)

The Planet’s recursive server was broken in the way it handled queries for AAAA records: it just dropped them on the floor instead of returning a NOERROR. Now knowing why the system was timing out, I still couldn’t figure out why Apache was using The Planet’s recursive servers, which was causing the timeout problem, and not the local caching servers (which worked fine). Knowing that some apps cache the list of recursive servers instead of re-reading /etc/resolv.conf each time, I restarted Apache, tested the script again, and lo and behold, it started using the proper recursive servers.

What I think happened was this: when Apache was first started, the box was using The Planet’s recursive servers. When the resolvers were later switched in /etc/resolv.conf, Apache, having been up all this time (hundreds of days), never reconsulted the /etc/resolv.conf file for the new recursive IPs, and continued to use the old ones. So when The Planet had their fire and all their problems, the script was still using The Planet’s recursives, and would have connectivity problems. It was only Apache (and, come to find out later, Postfix) that was having this issue. I still don’t know if it was the recursive being “broken” or that the recursive had trouble reaching the authoritative servers for paymentprocessor.net (made up name).  I would assume the former since I was able to retrieve the A record fine from the same authoritative server.
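If you want to catch this kind of stale-resolver situation sooner, one rough approach is to compare the destination IPs you see in a DNS packet capture against what /etc/resolv.conf currently lists. The Python below is only a sketch of that comparison; the function names parse_nameservers and unexpected_resolvers are made up for illustration, and it assumes you have already pulled the destination IPs out of the capture by hand.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver IPs listed in resolv.conf-style text, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def unexpected_resolvers(observed_ips, resolv_conf_text):
    """Return observed DNS destination IPs that are not configured resolvers."""
    configured = set(parse_nameservers(resolv_conf_text))
    return sorted(set(observed_ips) - configured)
```

To use it, read the current contents of /etc/resolv.conf into a string and pass it in along with the destination IPs from the capture. Anything that comes back is a server some process is still querying even though the config no longer lists it, which is a good hint that a long-running daemon needs a restart.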

All in all, it was a frustrating three days (more so for my buddies than me).  The Planet has some serious accountability issues right now; that kind of downtime is not acceptable for an enterprise data center company.

Written by sleach

June 5th, 2008 at 12:55 pm

Posted in DNS,Troubleshooting
