Quick Tip: Find information about a US zip code using DNS
Want to know more about a particular zip code? Open http://$ZIPCODE.us in your browser. i.e.
Build an RPM of Python 2.5 on CentOS 5 / Redhat Enterprise (RHEL) 5
It’s such a pain to get a newer version of Python installed on Redhat/CentOS. RHEL 4/CentOS 4 comes with Python 2.3, and RHEL 5/CentOS 5 comes with Python 2.4. I have noticed more and more apps requiring Python >= 2.5, so I had to find a good way to build an RPM of Python 2.5. Based on some sites I found out there and some mods I made, here are the instructions:
% sudo yum install autoconf bzip2-devel db4-devel elf-utils \
expat-devel findutils gcc-c++ gdbm-devel glibc-devel gmp-devel \
mesa-libGL-devel libX11-devel libtermcap-devel ncurses-devel \
openssl-devel pkgconfig readline-devel sqlite-devel tar \
tix-devel tk-devel rpm-build zlib-devel
% test -f ~/.rpmmacros || echo %_topdir %\(echo \"\$HOME\"\)/rpm >> ~/.rpmmacros
% mkdir -p $HOME/rpm/{BUILD,RPMS,SOURCES,SPECS}
% wget ftp://mirrors.kernel.org:/fedora/releases/10/Fedora/source/SRPMS/python-2*.src.rpm
% rpm -ivh python-2*.src.rpm
% rm python-2*.src.rpm
% sed -i -e 's/DBLIBVER=4.7/DBLIBVER=4.3/' $HOME/rpm/SOURCES/python-2.5-config.patch
% sed -i -e 's/db4-devel >= 4.7/db4-devel >= 4.3/' $HOME/rpm/SPECS/python.spec
% rpmbuild --define '__python_ver 25' -bb $HOME/rpm/SPECS/python.spec
SOLVED: Problems with Safari 4, Nginx and Connections being reset
We had some issues with Safari 4 (only) and our Nginx load balancer setup (we use Nginx to load balance between a few Apache servers). It turns out Safari 4 doesn’t like the keepalive_timeout setting to be anything but 0. The default for nginx was 65, and with that value the site would consistently fail to return the full content to Safari 4 users.
Setting the keepalive_timeout value to 0 solved the problem. Hopefully this helps someone out there.
Here is some more info on the issue from Ruby Forum.
Supporting “dig +trace” using an Unbound recursive/caching DNS server
dig +trace example.com is an extremely useful tool for debugging DNS. It walks the delegation path, showing the answer each authoritative DNS server in the path handed out, which helps you track down obscure DNS errors. For example, here is a dig +trace for “outsourcedclue.com”:
; <<>> DiG 9.6.0-P1 <<>> +trace outsourcedclue.com
;; global options: +cmd
. 518073 IN NS F.ROOT-SERVERS.NET.
. 518073 IN NS M.ROOT-SERVERS.NET.
. 518073 IN NS B.ROOT-SERVERS.NET.
. 518073 IN NS D.ROOT-SERVERS.NET.
. 518073 IN NS K.ROOT-SERVERS.NET.
. 518073 IN NS A.ROOT-SERVERS.NET.
. 518073 IN NS H.ROOT-SERVERS.NET.
. 518073 IN NS J.ROOT-SERVERS.NET.
. 518073 IN NS E.ROOT-SERVERS.NET.
. 518073 IN NS L.ROOT-SERVERS.NET.
. 518073 IN NS C.ROOT-SERVERS.NET.
. 518073 IN NS G.ROOT-SERVERS.NET.
. 518073 IN NS I.ROOT-SERVERS.NET.
;; Received 512 bytes from 10.1.11.1#53(10.1.11.1) in 1 ms

com. 172800 IN NS I.GTLD-SERVERS.NET.
com. 172800 IN NS H.GTLD-SERVERS.NET.
com. 172800 IN NS J.GTLD-SERVERS.NET.
com. 172800 IN NS G.GTLD-SERVERS.NET.
com. 172800 IN NS F.GTLD-SERVERS.NET.
com. 172800 IN NS B.GTLD-SERVERS.NET.
com. 172800 IN NS A.GTLD-SERVERS.NET.
com. 172800 IN NS D.GTLD-SERVERS.NET.
com. 172800 IN NS L.GTLD-SERVERS.NET.
com. 172800 IN NS E.GTLD-SERVERS.NET.
com. 172800 IN NS M.GTLD-SERVERS.NET.
com. 172800 IN NS C.GTLD-SERVERS.NET.
com. 172800 IN NS K.GTLD-SERVERS.NET.
;; Received 496 bytes from 202.12.27.33#53(M.ROOT-SERVERS.NET) in 147 ms

outsourcedclue.com. 172800 IN NS ns1.softlayer.com.
outsourcedclue.com. 172800 IN NS ns2.softlayer.com.
;; Received 170 bytes from 192.35.51.30#53(F.GTLD-SERVERS.NET) in 45 ms

outsourcedclue.com. 86400 IN A 208.43.45.4
outsourcedclue.com. 86400 IN NS ns2.softlayer.com.
outsourcedclue.com. 86400 IN NS ns1.softlayer.com.
;; Received 98 bytes from 67.228.255.5#53(ns2.softlayer.com) in 42 ms
I use Unbound as my recursive/caching DNS server of choice, and one day I noticed it didn’t support “dig +trace”. Distraught, I dug into why. A buddy suggested that perhaps Unbound wasn’t allowing the non-recursive queries that +trace relies on. Digging into the documentation, I discovered the allow_snoop option of the access-control directive. For example, if your config file looks like this:
server:
access-control: 0.0.0.0/0 refuse
access-control: 127.0.0.0/8 allow
access-control: 10.1.11.0/24 allow
just add the following directive to support “dig +trace” from the IPs that need it:
access-control: 10.1.11.0/24 allow_snoop
Now you can dig +trace to your heart’s content!
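To see what +trace does under the hood, here is a minimal sketch that pulls the delegation (NS) targets out of dig output, so you can follow one hop by hand with a non-recursive query. The function name delegation_ns is made up for illustration. +norecurse is the dig option that clears the recursion-desired flag, which, as far as I understand it, is the kind of query allow_snoop lets through.

```shell
# Hypothetical helper: read dig output on stdin and print the NS targets
# (the servers the answer delegates to), one per line.
delegation_ns() {
    # Record lines look like: "com. 172800 IN NS a.gtld-servers.net."
    awk '$3 == "IN" && $4 == "NS" { print $5 }' | sort -u
}

# Example usage (needs network access, so it is left commented out):
#   dig +norecurse @f.root-servers.net outsourcedclue.com | delegation_ns
```

Pick one of the printed names, query it with +norecurse in turn, and repeat until you get an answer; that is roughly the walk +trace automates.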
Setting up Unbound Recursive/Caching DNS Server on CentOS/Redhat
UPDATE: Modified for some changes and the latest version of Unbound (this includes 1.3.0)
NOTE: If you are upgrading from a previous version, I would delete your *.pem files, regenerate them, and make sure to chown them to the unbound user/group.
There has been a lot of noise lately about the recently published DNS caching server vulnerability (and with good reason). A lot of patching of BIND and other vulnerable resolvers has commenced. Unbound, an open source recursive/caching resolver from the NLnet Labs guys, doesn’t have the problem, and is just a good all-around caching server. In this tutorial, I will show you how to set up a reliable and secure caching server.
Unbound 1.2.1 is the latest version. As with everything on Redhat/CentOS, I install packages via RPM. The Unbound tarball comes with a spec file, so let’s use that (this is sort of a mini-tutorial on how to build RPMs as well). I am using yum here; for the purposes of this document, you can substitute up2date-nox for yum if you are using Redhat 4.
1. Install rpm-build: yum install -y rpm-build
2. Create the directory tree needed for building RPMs (I use $HOME/rpm):
mkdir -p ~/rpm/RPMS ~/rpm/SRPMS ~/rpm/SPECS ~/rpm/SOURCES ~/rpm/BUILD
3. Tell rpmbuild where to find its top-level dir:
echo "%_topdir $HOME/rpm" > $HOME/.rpmmacros
4. Download unbound into the $HOME/rpm/SOURCES directory:
cd $HOME/rpm/SOURCES && wget http://unbound.net/downloads/unbound-latest.tar.gz
5. Now we want to extract the spec file and edit it:
tar zxf unbound-latest.tar.gz && cp unbound-1.2.1/contrib/unbound.spec $HOME/rpm/SPECS && rm -rf unbound-1.2.1
You need to edit the spec file and update the Version directive to 1.2.1.
6. Let’s build the RPM now; it only requires flex and openssl-devel to be installed:
cd $HOME/rpm/SPECS && rpmbuild -bb unbound.spec
7. After lots of output, you should have a shiny new RPM in $HOME/rpm/RPMS/$arch, where $arch is either i386 or x86_64.
8. Now let’s install it; this will also create the “unbound” user and group:
rpm -ivh unbound-1.2.1-1.i386.rpm (or unbound-1.2.1-1.x86_64.rpm)
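If you would rather script the spec-file edit mentioned above than open an editor, something along these lines should do it. This is only a sketch: set_spec_version is a made-up helper name, and it assumes the spec file has a single line starting with "Version:".

```shell
# Hypothetical helper: point the Version: line of a spec file at a new version.
# Usage: set_spec_version SPECFILE VERSION
set_spec_version() {
    spec=$1
    version=$2
    tmp=$(mktemp) || return 1
    # Rewrite only the line that starts with "Version:"; everything else is copied as-is.
    if sed "s/^Version:.*/Version: $version/" "$spec" > "$tmp"; then
        # Write back through the original file so its permissions are kept.
        cat "$tmp" > "$spec"
    fi
    rm -f "$tmp"
}

# Example (path assumed from the steps above):
#   set_spec_version "$HOME/rpm/SPECS/unbound.spec" 1.2.1
```

It goes through a temp file rather than sed -i so it does not depend on GNU sed.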
OK – we are all done with installation, it created a few directories and files
- /var/unbound – this is the main directory for all of the files. The configuration we are going to set up is for a chroot’d instance running in this directory
- /etc/init.d/unbound – The startup script
- /etc/unbound.conf – a symlink to the main config file in /var/unbound/unbound.conf
- The binary files, docs etc.
Let’s configure the thing now. There are a TON of configuration items, which can be viewed at this link, but we don’t need to worry about all those now (feel free to review at a later date). Here is the config I am using on most of my machines:
server:
verbosity: 1
interface:
interface: 127.0.0.1
do-ip6: no
access-control: 0.0.0.0/0 refuse
access-control: 127.0.0.0/8 allow_snoop
access-control: 1.2.3.0/24 allow_snoop
chroot: /var/unbound
remote-control:
control-enable: yes
The key items are interface and access-control. A secure recursive server is NOT open to the world, only to your internal/controlled networks. So with the access-control items we refuse all queries by default (you can use a firewall for this too, but I chose the config items in this case). Then we allow queries from localhost (access-control: 127.0.0.0/8 allow_snoop) and from our local network (access-control: 1.2.3.0/24 allow_snoop). Nobody else can query this new recursive server. The interface option tells the system which IP address to listen on (for example, if you run an authoritative server on this same machine, they will both use port 53, so each needs its own address).
Also, allow_snoop lets you support dig +trace.
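To sanity-check which rule a given client would hit, here is a rough sketch. It assumes, as I read the Unbound documentation, that the most specific matching netblock wins regardless of the order of the access-control lines. The helper name acl_lookup is made up, it only understands IPv4, and it prints "no-match" rather than guessing at Unbound's built-in default.

```shell
# Hypothetical helper: print the action of the most specific IPv4
# access-control rule (read from stdin) that covers the given address.
# Usage: acl_lookup 1.2.3.77 < unbound.conf
acl_lookup() {
    awk -v ip="$1" '
        # Convert a dotted quad to a plain number.
        function toint(addr,    p) {
            split(addr, p, ".")
            return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
        }
        BEGIN { best = -1; action = "no-match" }
        $1 == "access-control:" {
            split($2, net, "/")
            bits = net[2] + 0
            size = 2 ^ (32 - bits)          # number of addresses in the netblock
            # Same netblock if both addresses fall in the same size-aligned bucket.
            if (int(toint(ip) / size) == int(toint(net[1]) / size) && bits > best) {
                best = bits
                action = $3
            }
        }
        END { print action }
    '
}
```

With the config above, 1.2.3.77 should come back as allow_snoop (the /24 beats the catch-all /0), while an outside address such as 8.8.8.8 falls through to refuse.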
Let’s fire this bad boy up (first let’s verify the config file – need to run this under sudo as root) and set it to run at boot:
root# cd /var/unbound && unbound-checkconf unbound.conf
unbound-checkconf: no errors in unbound.conf
root# unbound-control-setup
root# chkconfig unbound on
root# /etc/init.d/unbound start
We should be all good to go now, let’s test it:
dig google.com @localhost
You should have gotten the results back for google.com etc. If it didn’t work, check /var/log/messages, it will show if unbound started properly or not. Good luck!
Programmers are Causing Global Warming (Repost)
I posted this before on an old blog, thought I would repost it here:
Catchy title eh? I have decided that all of the issues of global warming can be attributed to programmers. Lazy programmers. “What is he talking about?” I hear shouted from the fourth row. I am talking about the extremely common mantra of “just throw hardware at the problem”. Instead of spending time to actually plan and optimize software, people throw up a quick piece of crap, and hope that it scales. When it doesn’t, they just buy bigger and more hardware. Problem solved.
I spoke about “How can so many people outgrow their data centers” before; this is really a follow-up to that entry. The gist of it is that there is a LOT of electricity being wasted by half-ass solutions. This wasted electricity in turn releases carbon into the atmosphere, which causes global warming (this is of course a very watered-down scientific analysis of global warming, but I am a simple man who thinks in simple terms).
I was at the San Jose NANOG conference a few months ago, and sat in on an interesting panel titled Hot Time in the Big IDC: Power, cooling, and the data center. It was a round table discussion about what can be done about the severe lack of power and cooling that is affecting data centers around the world. This shortage affects their customers all too often (speaking from experience as well as from talking to buddies who have similar challenges). It has become a nightmare to get sufficient power in data centers. Most will make you commit to a full cage if you need more breakers than are allocated for a single rack. Anyways, back to the panel. There were some pretty influential representatives from some large organizations: Cisco, Sun and Switch and Data (which purchased PAIX), to name a few. These individuals discussed some of the challenges facing IDCs these days, and ways to solve them. The hardware people discussed how they are working to develop faster machines that draw less electricity and need less cooling. The data center/exchange people discussed some of their plans for bringing in more advanced cooling solutions. All of the topics were definitely paths they should take, but NO one touched on the most logical way to alleviate the problem. I wanted to stand up and yell “Hey Chuckos! If programmers and systems engineers just spent more time designing a proper system, then you would have AT LEAST a 50% reduction in cooling and capacity needs”. I say at least because there are no hard numbers or facts I can point at to come up with a truly accurate number.
I can tell you from experience, I am amazed at some applications I have seen and how poorly they scale. Sometimes the fix is as simple as slapping an index on the proper column. I have seen an application that ran for years with the main sales report taking 4-5 minutes to run. A single index was placed on the proper column, and the time went down to 2 seconds. Larger database systems were purchased for this customer just so the system wouldn’t be “so slow”. This is an all too common issue that I know some of the more astute readers of this entry (if there are any readers of this entry) come across often.
So what do we do, you ask? To help yourself and to help the world (give a man a fish and he eats for a day, show a man how to fish and he eats forever, or something like that), just sit down before your project starts and think about where the bottlenecks could be, and how you can alleviate them. Then, understand how a computer and network actually work. Armed with this information, you should be able to design and develop a scalable system that doesn’t require 10 web servers, 5 database servers, and 5 application servers. And that, my friends, would help save the world.
UPDATE: Dan Pritchett, from eBay, takes the discussion a step further.
Great Page Showing Some Cool Geek Posters
Kudos to the fellas at Pingdom for gathering a page showing some pretty slick geek posters. My favorite is probably the CAIDA network map (I have one like that from back in 2001. Memories…).
Broken Caching DNS Server Causes Headaches
By now you have read the many reports of The Planet’s data center fire. Long story short, there was an explosion on the first floor of their Houston facility (old ev1servers data center) that affected network connectivity, servers and a ton of other items. Some buddies of mine, Pelago, have their gear in this facility. Luckily, their servers were fine (no downtime), but there were spotty network issues for 3 days, starting on Sunday, that are finally resolved as of yesterday.
One item that caused me/us a ton of headaches was spotty connectivity to their payment processor. They utilize the SOAP interface for submitting their payment information (when someone signs up for their Intervals project management application etc.). What we were seeing was all connections to the SOAP service (accessed over normal HTTPS) timing out. After some digging, the weird part was that they were only timing out when run via the PHP interpreter embedded in Apache (i.e. when run as part of the normal web process). If we ran it via the PHP command line interpreter, it worked fine. It was driving us mad; the network path to Sage looked fine using the normal network troubleshooting tools. In addition, we could easily simulate pulling down the SOAP WSDL file using curl etc. So it wasn’t network path related, but we still couldn’t figure it out.
On a hunch, I decided to watch the DNS traffic during the transaction, and lo and behold, I saw DNS queries going to The Planet’s recursive servers (which was odd, as I always configure local caching servers). It was querying for the AAAA record (the IPv6 address record) for the gateway and timing out, resulting in multiple retransmissions. When we ran the script via the command line, it would query the local caching servers (as it should) and get a NOERROR right away (the correct response, since the payment processor didn’t have AAAA records); it would then fall back, query for the A record, and succeed.
Here are the packet traces for those interested in the FAILURE scenario (names and IPs changed to protect the non-innocent):
11:30:22.301692 IP (tos 0x0, ttl 64, id 34396, offset 0, flags [DF], proto 17, length: 62) 1.2.3.4.47502 > 2.3.4.5.domain: [bad udp cksum de1d!] 20370+ AAAA? endpoint.paymentprocessor.net. (34)
11:30:27.302284 IP (tos 0x0, ttl 64, id 29396, offset 0, flags [DF], proto 17, length: 62) 1.2.3.4.47501 > 2.3.4.5.domain: [bad udp cksum e11d!] 20370+ AAAA? endpoint.paymentprocessor.net. (34)
11:30:32.303732 IP (tos 0x0, ttl 64, id 34397, offset 0, flags [DF], proto 17, length: 62) 1.2.3.4.47502 > 2.3.4.5.domain: [bad udp cksum de1d!] 20370+ AAAA? endpoint.paymentprocessor.net. (34)
11:30:37.304502 IP (tos 0x0, ttl 64, id 49399, offset 0, flags [DF], proto 17, length: 79) 1.2.3.4.47502 > 2.3.4.5.domain: [bad udp cksum b9c4!] 38226+ AAAA? endpoint.paymentprocessor.net.longerdomain.com. (51)
11:30:42.305874 IP (tos 0x0, ttl 64, id 54400, offset 0, flags [DF], proto 17, length: 79) 1.2.3.4.47503 > 2.3.4.5.domain: [bad udp cksum b6c4!] 38226+ AAAA? endpoint.paymentprocessor.net.longerdomain.com. (51)
You can see the retransmits. Now here is the CORRECT transaction:
05:30:54.793846 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 62) 1.2.3.4.50700 > 3.4.5.6.domain: [bad udp cksum d6dc!] 33819+ AAAA? endpoint.paymentprocessor.net. (34)
05:30:54.842103 IP (tos 0x0, ttl 60, id 0, offset 0, flags [DF], proto 17, length: 126) 3.4.5.6.domain > 1.2.3.4.50700: [udp sum ok] 33819 q: AAAA? endpoint.paymentprocessor.net. 0/1/0 ns: paymentprocessor.net. SOA ns.example.com. soacontact.example.com. 1064587759 4800 2400 950400 2400 (98)
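In a longer capture the retransmits are harder to eyeball, so here is a small sketch that counts them. It assumes tcpdump output in the same format as the traces above, where a query shows up as "ID+ TYPE? name" and a retry reuses the same ID. The function name count_dns_retransmits is made up for illustration.

```shell
# Hypothetical helper: read tcpdump text (stdin or files) and list DNS queries
# that were sent more than once, prefixed with how many times they were seen.
count_dns_retransmits() {
    awk '
        # Query lines look like: "... 20370+ AAAA? name. (34)"
        match($0, /[0-9]+\+ [A-Z]+\? [^ ]+/) {
            seen[substr($0, RSTART, RLENGTH)]++
        }
        END {
            for (q in seen)
                if (seen[q] > 1)
                    print seen[q], q
        }
    ' "$@" | sort
}

# Example (capture file name is an assumption):
#   count_dns_retransmits dns-trace.txt
```

Response lines carry "ID q: TYPE? name" with no plus sign after the ID, so they are not counted; a query that shows up here with no matching response is the one to chase.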
The Planet’s recursive server was broken in the way it handled queries for AAAA records: it just dropped them on the floor instead of returning a NOERROR. Even knowing why the system was timing out, I still couldn’t figure out why Apache was using The Planet’s recursive servers, which was causing the timeout problem, and not using the local caching servers (which worked fine). Knowing that some apps cache the list of recursive servers instead of re-reading /etc/resolv.conf each time, I restarted Apache, tested the script again, and lo and behold, it started using the proper recursive servers.
What I think happened was this: when Apache was first started, the machine was using The Planet’s recursive servers. When the servers were later switched in /etc/resolv.conf, Apache, having been up all this time (hundreds of days), never reconsulted /etc/resolv.conf for the new recursive IPs and continued to use the old ones. So when The Planet had their fire and all their problems, the script was still using The Planet’s recursives and would have connectivity problems. It was only Apache (and, as we found out later, Postfix) that was having this issue. I still don’t know if it was the recursive being “broken” or that the recursive had trouble reaching the authoritative servers for paymentprocessor.net (made up name). I would assume the former, since I was able to retrieve the A record fine from the same authoritative server.
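One way to catch this sooner is to keep a copy of the resolver list from when a long-running daemon was started and compare it against the current file. The sketch below does only that: both helper names are made up, the snapshot file is something you would have to save yourself, and it compares files rather than inspecting what a running process actually has cached.

```shell
# Hypothetical helper: print the nameserver addresses from a resolv.conf-style file.
# Usage: list_nameservers [FILE]   (defaults to /etc/resolv.conf)
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Hypothetical helper: succeed (exit 0) if the resolver list in FILE differs
# from a snapshot taken earlier, e.g. when a long-running daemon was started.
# Usage: resolvers_changed SNAPSHOT [FILE]
resolvers_changed() {
    [ "$(list_nameservers "$1")" != "$(list_nameservers "${2:-/etc/resolv.conf}")" ]
}

# Example (snapshot path is an assumption):
#   resolvers_changed /var/tmp/resolv.conf.at-start && echo "restart Apache and Postfix"
```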
All in all, it was a frustrating three days (more so for my buddies than me). The Planet has some serious accountability issues right now; that kind of downtime is not acceptable for an enterprise data center company.