Thursday, January 14, 2010 - a Sysadminish Tale

<@Juerd> frettled: Blog about the mess you found when
you first logged in on feather yesterday :)


This will, incidentally, also explain why Trac is kind of unavailable now. Feather is a Xen guest (a virtual machine, hereafter "VM") that hosts several important services for the Perl 6 community. There's SVN web access, a Trac installation, and a bunch of other stuff I honestly don't know the half of.

Recently, the VM started running out of memory too often for comfort. What was going on? Juerd asked for help in tracking down the problem, as he didn't have the time to do so himself. And needing some distraction from work -- something to help me procrastinate -- I volunteered.

By now, you're probably banging your head on your keyboard in sympathy with me for saying something that may have been slightly less than brilliant. You know the feeling; Matt Trout reaches his right hand towards you, the world is suddenly in slow-motion, you see his hand closing in on you, his grin widening, and a voice saying "thhhhhaaaaannnnnkssss ffffooooorrr vooooluuunnnteeeerrriiiinnnnng", and you're basically up that creek with all the mud and dirt in it.

After handing over an SSH public key and getting sudo access (yeah, yeah, I know), I had a look anyway.

First, I went on a brief but wild goose chase: I found some error messages regarding ConsoleKit which appeared to become more frequent just before the server ran out of memory, checked the Debian version (an unholy mix of Debian unstable and Debian experimental with lots of package updates pending someone's attention), and generally tried to get a feel for how the system was configured.

We already knew that Apache might somehow be responsible for gobbling up the available memory, so my first action was to have a look at the last 100,000 lines of the Apache access log, using a simplistic log analysis script.

But which log? There were three Apache log directories to choose from. I (correctly) guessed that the one called simply /var/log/apache2 was the interesting one; the others seemed to be legacy directories that should have been removed ages ago.

According to the script, there were 0 accesses in the last 100,000 lines.

Knowing the script, that was not so strange, because it makes a few assumptions regarding the log format, using a regexp belonging to the days before named captures and whatnot:
while (<>) {
    if (/^(\S+) (\S+) - - \[[^\]]*\] \"(GET|POST) \S* HTTP(|\/1\.[01])\" \d{3} (\d+) \"/) {
That regexp is longer than this blog's line width deserves. There's nothing to be proud of here.

Anyway, I first had to remove the first capture; feather's logs weren't showing the virtualhost as the first column, and access types were most certainly not limited to only GET and POST:
frettled@feather:~$ sudo awk '{print $6}' /var/log/apache2/access.log|
sort -u

After straightening that up (and adding %v to the LogFormat specifications in the Apache config for future use), I got the following result:
Use of uninitialized value $size in addition (+)
at /usr/local/sbin/bandwidthips line 39, <> line 1002.
Use of uninitialized value $size in addition (+)
at /usr/local/sbin/bandwidthips line 40, <> line 1002.
Use of uninitialized value $size in addition (+)
at /usr/local/sbin/bandwidthips line 44, <> line 1002.
AAAARRGH! Idiot! Imbecile! Inept half-wit! Yep, I'd forgotten to renumber my captures. See, this is why Perl should be in version 5.10.1 or 6 when fiddling with those bloody annoying regexps.
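With named captures, renumbering simply stops being a thing: drop a capture and the rest keep their names. A sketch of what that matching loop could look like in Perl 5.10+ -- the field layout here is an assumption (a common Apache combined-ish format), not necessarily feather's actual LogFormat:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one access-log line into a hashref of named fields, or return
# undef. The layout (host, ident, user, time, request, status, size)
# is assumed, not copied from feather's real config.
sub parse_line {
    my ($line) = @_;
    return undef unless $line =~ /
        ^(?<host>\S+) \s \S+ \s \S+ \s \[[^\]]*\] \s
        "(?<method>[A-Z]+) \s (?<path>\S+) \s HTTP\/1\.[01]" \s
        (?<status>\d{3}) \s (?<size>\d+|-)
    /x;
    return { %+ };    # copy of the named-capture hash
}

# Tally bytes and hits per client, most hits first.
my (%bytes, %hits);
while (my $line = <>) {
    my $f = parse_line($line) or next;
    $hits{ $f->{host} }++;
    $bytes{ $f->{host} } += $f->{size} eq '-' ? 0 : $f->{size};
}
printf "%s: Bytes = %d, Hits = %d\n", $_, $bytes{$_}, $hits{$_}
    for sort { $hits{$b} <=> $hits{$a} } keys %hits;
```

Adding or removing a capture no longer shifts $3 into $4 behind your back, which is exactly the trap I fell into.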
frettled@feather:~$ sudo tail -100000 /var/log/apache2/access.log|
/usr/local/sbin/hitips|head
Bytes = 14329487 (3.44%), Hits = 51503 (51.82%)
Bytes = 132116084 (31.73%), Hits = 18111 (18.22%)
Bytes = 50846948 (12.21%), Hits = 6236 (6.27%)
Bytes = 54880221 (13.18%), Hits = 1894 (1.9%)
Bytes = 460200 (0.11%), Hits = 1894 (1.9%)
Bytes = 433388 (0.1%), Hits = 1686 (1.69%)
Bytes = 1726871 (0.41%), Hits = 1635 (1.64%)
Bytes = 429358 (0.1%), Hits = 1609 (1.61%)
Bytes = 8461415 (2.03%), Hits = 1238 (1.24%)
Bytes = 18945415 (4.55%), Hits = 1126 (1.13%)
So, uhm, around 52% of the hits come from feather3, and nearly 25% from Google's indexer. Lovely.

Looking at the accesses from feather3, I quickly saw that they mostly had to do with svnweb.

Juerd had already stopped Apache, but someone -- I don't know who -- started it again at 12:00, probably worried that SVN and friends weren't working.

I then followed the running processes using the top command, updating each second (top d1), sorting by memory usage (typing M while top was running), hoping to catch some quickly growing processes.

Nope. None, zilch, nada. Nothing that appeared horribly wrong. Sure, the apache2 processes used some memory (30-60 MB resident set, 50-100 MB virtual), but nothing seemed out of the ordinary. I changed the update interval to every third second -- top sometimes uses an inordinate amount of CPU, depending on magic -- and waited. After a while, a couple of apache2 processes were using more CPU and memory than the others, around 60-90 MB resident. And they were growing. According to lsof, they were active in the svnweb directory (and used a metric shitload of libraries). And after growing, they didn't release memory; they just kept using it. But it wasn't enough to exhaust memory; there was still a bunch of free RAM.
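Staring at top works, but it's easy to miss a spike while blinking. The same idea can be automated: snapshot the resident set size of each apache2 process, wait, snapshot again, and report the growers. A Perl sketch -- the ps flags are standard procps ones, and the 1 MB threshold and 10-second interval are arbitrary guesses:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Snapshot resident set sizes of all apache2 processes.
# Returns { pid => rss_in_kb }.
sub apache_rss {
    my %rss;
    for (`ps -C apache2 -o pid=,rss=`) {
        my ($pid, $rss) = split ' ';
        $rss{$pid} = $rss if defined $rss;
    }
    return \%rss;
}

# Given two snapshots, return the pids whose RSS grew by more than
# $threshold kB in between. Pids that appeared only in the second
# snapshot are ignored (no baseline to compare against).
sub growers {
    my ($before, $after, $threshold) = @_;
    return grep {
        exists $before->{$_} && $after->{$_} - $before->{$_} > $threshold
    } keys %$after;
}

# Live sampling only when asked for, so the subs are reusable.
if (@ARGV && $ARGV[0] eq '--watch') {
    my $before = apache_rss();
    sleep 10;    # arbitrary sampling interval
    my $after  = apache_rss();
    print "PID $_ grew by ", $after->{$_} - $before->{$_}, " kB\n"
        for growers($before, $after, 1024);    # 1 MB: a guess
}
```

Run it in a loop from cron or a screen session and the growth gets logged even when nobody is watching -- which, as it turned out, is exactly when the interesting things happened.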

So that was perhaps svnweb's fault, then?


But then my time ran out, and I had to drop the ball, leaving the top process running.

Five minutes later, the memory ran out again. It was as if someone had been waiting for me to go idle before triggering the very problem I was looking for.


svnweb kind of remained the main suspect, until Juerd caught whatever was happening at the right time.

And catching what happens at the right time is bloody important.

Here's what he found, using Apache's server status:

Well, that's not svnweb. That's Trac. And the IP addresses belong to Google.

And that's spam, effectively creating a DoS or DDoS attack on our services as a side effect when search engines try to index the Trac webpages. It probably isn't intentional, but spammers just don't care.
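A cheap stopgap while the real fixes happen is to keep well-behaved crawlers out of the expensive Trac views entirely. A robots.txt sketch -- the paths are guesses at a typical Trac URL layout, not feather's actual one:

```
# Keep crawlers away from the expensive, effectively endless Trac views.
User-agent: *
Disallow: /changeset/
Disallow: /timeline
Disallow: /browser/
Disallow: /log/
Disallow: /search
```

It doesn't stop the spammers themselves, but it stops search engines from amplifying their mess into a DoS.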

So, what can we do to protect feather from suffering from such attacks in the future?

There's a lot that can be done. It takes effort. It takes time. It takes someone.

Here are a few suggestions on how to improve the robustness of the kind of services feather provides:
  • Add a captcha to the web form. The disadvantage is that this does not really save processing resources, but it probably should be done anyway.
  • Add an unnecessary and bogus input field to the web form, e.g. "Phone number". This input field should be hidden with CSS so that web browsers don't display it; if a submission contains data in that field, you can be nearly 100% certain it's spam from a web scraper that fills in every field automatically. Filter it out.
  • Change the webserver delegation architecture, so that each Apache process isn't loading tens of megabytes of libraries and keeping them in memory. Off-loading to shorter-living FastCGI daemons or similar solutions, or even sacrificing program startup speed by using CGI+suexec, etc., may be decent starting points.
  • Consider using a front-end proxy like Varnish to gloss over underlying nastiness.
  • Start with a new VM and migrate services to that one, gradually.
  • Document configuration choices and what each web service does/is there for, so that the next sysadmin coming along can make educated guesses quicker. :)
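The hidden-field trick from the list above takes only a few lines on the server side. A sketch in Perl, assuming a CGI-style handler that has the submitted form fields in a hashref; the decoy field name `phone` is purely illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A human's browser never renders the CSS-hidden decoy field, so any
# non-whitespace value in it means a scraper filled in the form.
# The matching form markup would hide it, e.g.:
#   <div style="display:none"><input name="phone"></div>
sub is_honeypot_spam {
    my ($params) = @_;               # hashref of submitted form fields
    my $decoy = $params->{phone};    # illustrative decoy field name
    return defined $decoy && $decoy =~ /\S/;
}
```

Reject (or silently discard) submissions for which this returns true; legitimate users never see the field, so there's no usability cost.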
These tasks can rather easily be split into manageable one-person projects.

Does this sound interesting to you, or did I lose you at the third line of this blog entry?

Pop in on #perl6 on the freenode IRC network and say so.

1 comment:

Anonymous said...

You simply can't have a Trac up with no protection, be it authentication or captchas or whatever. I've given up using Trac without authentication myself.