Ksplice for Fedora!

Posted in Ksplice on August 31st, 2010 by Keith Winstein2 Comments

In response to many requests, Ksplice is proud to announce we’re now providing Uptrack free of charge for Fedora!

Fedora will join Ubuntu Desktop among our free platforms, and will give Fedora users rebootless updates as long as Fedora maintains each major kernel release.

However, of note: Fedora is the only Linux distribution that migrates to a new Linux kernel version family (e.g. 2.6.33 to 2.6.34) during the lifetime of the product. This kernel version family migration is such a major version change that Ksplice recommends a reboot for this version change. These migrations occur roughly twice per year and only in Fedora; all of the other important Fedora kernel updates can be applied rebootlessly, as can the kernel updates for the rest of our supported Linux distributions.

We’ve also submitted the Uptrack client for integration into a later version of Fedora and are working with the Fedora folks to help make rebootless updates part of the distribution itself. Thanks for all your feedback about Uptrack and keep it coming!

Best,
the Ksplice Team

Share :
  • Twitter
  • Reddit
  • Digg
  • Facebook
  • del.icio.us
  • StumbleUpon

Six things I wish Mom told me (about ssh)

Posted in System administration on August 26th, 2010 by Waseem Daher36 Comments

If you’ve ever seriously used a Linux system, you’re probably already familiar with at least the basics of ssh. But you’re hungry for more. In this post, we’ll show you six ssh tips that’ll help take you to the next level. (We’ve also found that they make for excellent cocktail party conversation talking points.)

(1) Take command!

Everyone knows that you can use ssh to get a remote shell, but did you know that you can also use it to run commands on their own? Well, you can–just stick the command name after the hostname! Case in point:

wdaher@rocksteady:~$ ssh bebop uname -a
Linux bebop 2.6.32-24-generic #39-Ubuntu SMP Wed Jul 28 05:14:15 UTC 2010 x86_64 GNU/Linux

Combine this with passwordless ssh logins, and your shell scripting powers have just leveled up. Want to figure out what version of Python you have installed on each of your systems? Just stick ssh hostname python -V in a for loop, and you’re done!

Some commands, however, don’t play along so nicely:

wdaher@rocksteady:~$ ssh bebop top
TERM environment variable not set.

What gives? Some programs need a pseudo-tty, and aren’t happy if they don’t have one (anything that wants to draw on arbitrary parts of the screen probably falls into this category). But ssh can handle this too–the -t option will force ssh to allocate a pseudo-tty for you, and then you’ll be all set.

# Revel in your process-monitoring glory
wdaher@rocksteady:~$ ssh bebop -t top
# Or, resume your session in one command, if you're a GNU Screen user
wdaher@rocksteady:~$ ssh bebop -t screen -dr

(2) Please, try a sip of these ports

But wait, there’s more! ssh’s ability to forward ports is incredibly powerful. Suppose you have a web dashboard at work that runs at analytics on port 80 and is only accessible from the inside the office, and you’re at home but need to access it because it’s 2 a.m. and your pager is going off.

Fortunately, you can ssh to your desktop at work, desktop, which is on the same network as analytics. So if we can connect to desktop, and desktop can connect to analytics, surely we can make this work, right?

Right. We’ll start out with something that doesn’t quite do what we want:

wdaher@rocksteady:~$ ssh desktop -L 8080:desktop:80

OK, the ssh desktop is the straightforward part. The -L port:hostname:hostport option says “Set up a port forward from port (in this case, 8080) to hostname:hostport (in this case, desktop:80).”

So now, if you visit http://localhost:8080/ in your web browser at home, you’ll actually be connected to port 80 on desktop. Close, but not quite! Remember, we wanted to connect to the web dashboard, running on port 80, on analytics, not desktop.

All we do, though, is adjust the command like so:

wdaher@rocksteady:~$ ssh desktop -L 8080:analytics:80

Now, the remote end of the port forward is analytics:80, which is precisely where the web dashboard is running. But wait, isn’t analytics behind the firewall? How can we even reach it? Remember: this connection is being set up on the remote system (desktop), which is the only reason it works.

If you find yourself setting up multiple such forwards, you’re probably better off doing something more like:

wdaher@rocksteady:~$ ssh -D 8080 desktop

This will set up a SOCKS proxy at localhost:8080. If you configure your browser to use it, all of your browser traffic will go over SSH and through your remote system, which means you could just navigate to http://analytics/ directly.

(3) Til-de do us part

Riddle me this: ssh into a system, press Enter a few times, and then type in a tilde. Nothing appears. Why?

Because the tilde is ssh’s escape character. Right after a newline, you can type ~ and a number of other keystrokes to do interesting things to your ssh connection (like give you 30 extra lives in each continue.) ~? will display a full list of the escape sequences, but two handy ones are ~. and ~^Z.

~. (a tilde followed by a period) will terminate the ssh connection, which is handy if you lose your network connection and don’t want to wait for your ssh session to time out.  ~^Z (a tilde followed by Ctrl-Z) will put the connection in the background, in case you want to do something else on the host while ssh is running. An example of this in action:

wdaher@rocksteady:~$ ssh bebop
wdaher@bebop:~$ sleep 10000
wdaher@bebop:~$ ~^Z [suspend ssh]

[1]+  Stopped                 ssh bebop
wdaher@rocksteady:~$ # Do something else
wdaher@rocksteady:~$ fg # and you're back!

(4) Dusting for prints

I’m sure you’ve seen this a million times, and you probably just type “yes” without thinking twice:

wdaher@rocksteady:~$ ssh bebop
The authenticity of host 'bebop (192.168.1.106)' can't be established.
RSA key fingerprint is a2:6d:2f:30:a3:d3:12:9d:9d:da:0c:a7:a4:60:20:68.
Are you sure you want to continue connecting (yes/no)?

What’s actually going on here? Well, if this is your first time connecting to bebop, you can’t really tell whether the machine you’re talking to is actually bebop, or just an impostor pretending to be bebop. All you know is the key fingerprint of the system you’re talking to.  In principle, you’re supposed to verify this out-of-band (i.e. call up the remote host and ask them to read off the fingerprint.)

Let’s say you and your incredibly security-minded friend actually want to do this. How does one actually find this fingerprint? On the remote host, have your friend run:

sbaker@bebop:~$ ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub
2048 a2:6d:2f:30:a3:d3:12:9d:9d:da:0c:a7:a4:60:20:68 /etc/ssh/ssh_host_rsa_key.pub (RSA)

Tada! They match, and it’s safe to proceed. From now on, this is stored in your list of ssh “known hosts” (in ~/.ssh/known_hosts), so you don’t get the prompt every time. And if the key ever changes on the other end, you’ll get an alert–someone’s trying to read your traffic! (Or your friend reinstalled their system and didn’t preserve the key.)

(5) Losing your keys

Unfortunately, some time later, you and your friend have a falling out (something about Kirk vs. Picard), and you want to remove their key from your known hosts. “No problem,” you think, “I’ll just remove it from my list of known hosts.” You open up the file and are unpleasantly surprised: a jumbled file full of all kinds of indecipherable characters. They’re actually hashes of the hostnames (or IP addresses) that you’ve connected to before, and their associated keys.

Before you proceed, surely you’re asking yourself: “Why would anyone be so cruel? Why not just list the hostnames in plain text, so that humans could easily edit the file?” In fact, that’s actually how it was done until very recently. But it turns out that leaving them in the clear is a potential security risk, since it provides an attacker a convenient list of other places you’ve connected (places where, e.g., an unwitting user might have used the same password).

Fortunately, ssh-keygen -R <hostname> does the trick:

wdaher@rocksteady:~$ ssh-keygen -R bebop
/home/wdaher/.ssh/known_hosts updated.
Original contents retained as /home/wdaher/.ssh/known_hosts.old

I’m told there’s still no easy way to remove now-bitter memories of your friendship together, though.

(6) A connection by any other name…

If you’ve read this far, you’re an ssh pro. And like any ssh pro, you log into a bajillion systems, each with their own usernames, ports, and long hostnames. Like your accounts at AWS, Rackspace Cloud, your dedicated server, and your friend’s home system.

And you already know how to do this. username@host or -l username to specify your username, and -p portnumber to specify the port:

wdaher@rocksteady:~$ ssh -p 2222 bob.example.com
wdaher@rocksteady:~$ ssh -p 8183 waseem@alice.example.com
wdaher@rocksteady:~$ ssh -p 31337 -l waseemio wsd.example.com

But this gets really old really quickly, especially when you need to pass a slew of other options for each of these connections. Enter .ssh/config, a file where you specify convenient aliases for each of these sets of settings:

Host bob
    HostName bob.example.com
    Port 2222
    User wdaher

Host alice
    HostName alice.example.com
    Port 8183
    User waseem

Host self
    HostName wsd.example.com
    Port 31337
    User waseemio

So now it’s as simple as:

wdaher@rocksteady:~$ ssh bob
wdaher@rocksteady:~$ ssh alice
wdaher@rocksteady:~$ ssh self

And yes, the config file lets you specify port forwards or commands to run as well, if you’d like–check out the ssh_config manual page for the details.

ssh! It’s (not) a secret

This list is by no means exhaustive, so I turn to you: what other ssh tips and tricks have you learned over the years? Leave ’em in the comments!

When you ssh in at 2 a.m., it won’t be to reboot

Say goodbye to your middle-of-the-night outage window for installing kernel updates. Install them without rebooting, with Ksplice Uptrack!

Share :
  • Twitter
  • Reddit
  • Digg
  • Facebook
  • del.icio.us
  • StumbleUpon

Essay: 3G and me

Posted in Essays, Fun on August 20th, 2010 by Keith Winstein26 Comments

In 2002, I got my first cell phone.

June was stuffy in Manhattan, and my summer internship copy-editing the New York Sun, the now-defunct right-wing newspaper, was just about to start. I swam through the humid air past Madison Square Park to get to the store before closing.

“You want this one,” said the salesman at the RadioShack, pointing to a sleek model then on sale. “It’s a 3G phone. It’ll work with Sprint’s new 3G network they’re rolling out later this summer.”

“Ok,” I said. Sure enough, it had 3G:

Sanyo SCP-6200. QUALCOMM 3G CDMA

Fig. 1: Sprint's Sanyo 3G phone, circa mid-2002. An orange of more recent vintage looks on.

A few months later — after all the Sun‘s editorials casting doubt on whether lead paint can really poison you had been edited and sent off to our eight readers, and I was back at school — Sprint did roll out their 3G network:

Sprint launched nationwide 3G service in the 2002 third quarter. The service,
marketed as “PCS Vision”, allows consumer and business customers to use their
Vision-enabled PCS devices to take and receive pictures, check personal and
corporate e-mail, play games with full-color graphics and polyphonic sounds and
browse the Internet wirelessly with speeds up to 144 kbps (with average speeds
of 50 to 70 kbps).

I called Sprint and tried to subscribe. “Sir, you need a 3G phone to sign up,” they told me.

“I have one!” I said proudly. “It says 3G CDMA right on the back!”

“Oh, I’m sorry sir. We’ve changed the labeling of that model. That phone doesn’t have true 3G. It doesn’t say that on the back any more. If you like I would be happy to sell you the next model, the SCP-6400, which has true 3G.”

“No, thanks,” I said, thinking that 3G was pretty much a crock, while wryly appreciating RadioShack’s ability to make you feel cheated even on a $30 cellphone.

Sure enough, when my phone died and had to be replaced, I saw the new one only said “QUALCOMM CDMA” — no more “3G”. It had been revised downward.

Meanwhile, Sprint’s competitors were busy deploying their own nationwide 3G networks. Cingular, then a joint venture of SBC and BellSouth, trumpeted each step in the process:

June 2003:

ATLANTA, June 30 — Cingular Wireless today announced the world’s
first commercial deployment of wireless services using Enhanced Datarate
for Global Evolution (EDGE) technology. Cingular’s initial EDGE service
offering is in its Indianapolis market, with subsequent deployments
expected later in the year.

Building on more than a decade of wireless data experience, Cingular’s
EDGE technology enables true “third generation” (3G) wireless data
services with data speeds typically three times faster than those
available on GSM/GPRS networks.

Or October 2003:

Cingular began offering its 3G service EDGE (Enhanced Datarate for Global
Evolution) in Indianapolis in July, becoming the first commercial wireless
company in the world to offer the service.

Or June 2004:

This year, further enhancements have been made to the network with the
launch of EDGE in Connecticut, a high-speed wireless data service which
gives customers true “third generation” (3G) wireless data services with
data speeds typically three times faster than what was available on GPRS.

Those of you who care about these things will probably be jumping up and down right now, and/or closing the browser window. “EDGE isn’t 3G!” you are saying. “It’s 2.9G at best! And neither is 1xRTT, which is all the Sanyo SCP-6200 had. That’s barely 2.5G! Maybe 2.75G on a clear day.”

These people, who while enthusiastic sometimes seem to have been born yesterday, would point to the kerfuffle when Apple released the original iPhone in 2007 for Cingular and only supported EDGE. As the Wall Street Journal wrote:

Detractors and fans are going toe to toe on online forums. Much of the latest criticism is zooming in on Apple’s choice of technologies to use with the new phone and its decision to partner exclusively with AT&T Inc.’s Cingular Wireless, which is being rebranded as AT&T.

For example, the iPhone won’t use the fastest wireless Internet connection available, relying on so-called second-generation, or 2G, rather than faster 3G networks now being rolled out by major wireless carriers. Because of this, industry experts expect features of the iPhone such as Web browsing and downloading not to be very fast.

Tim Cook, Apple’s chief operating officer, said during a conference call with analysts yesterday the company is sold on Cingular’s 2G EDGE network because “it’s much more widespread and widely deployed in the U.S.” Mr. Cook didn’t comment on whether Apple will eventually support 3G but said, “Obviously we would be where the technology is over time.” Some people refer to EDGE as 2.5G.

By 2007, Cingular/AT&T was happy to downgrade its EDGE offerings in favor of a newer kind of 3G (known as W-CDMA or UMTS). From an interview with AT&T’s chief, Randall Stephenson, in the New York Times in June 2007:

”I got to tell you, carrying this thing around and experiencing those kinds of speeds on a wireless handset, your imagination begins to run in terms of what’s possible,” he said, ”and by the way, there’s not a 3G network available in Ottumwa, Iowa,” referring to the so-called third generation of Web-enabled cellphones that require faster networks. ”If you want to sell these devices in a variety of places, Edge is the only opportunity you have.”

AT&T has invested $16 billion in its network over the last two years, and the network is now designed to handle the expected increase in wireless data users, he said, adding: ”Capacity won’t be an issue. The network is ready.”

Ok, what are some quick takeaways here?

  • What Sprint sold as “3G” in 2002 (1xRTT voice), it rescinded later that year and relabeled the phones.
  • What counted as “3G” for Sprint in 2003 (1xRTT data), isn’t any more either.
  • What in 2004 constituted “true ‘third generation’ (3G)” to Cingular/AT&T, the company had retroactively downgraded to 2G or 2.5G or 2.9G by 2007.
  • From an engineer’s perspective, the 3G interfaces, if you read a book on telecom engineering, are CDMA2000 (including 1xRTT and EV-DO), EDGE, and W-CDMA (including UMTS, with or without HSUPA and HSDPA). The International Telecommunications Union has published a standard for third-generation wireless communications, known as IMT-2000, that includes those three and a few others.
  • To a first approximation, the first launch of “3G” in the United States, around 2002 and 2003, was a dud. The carriers responded by dusting themselves off, redoubling their efforts, deploying a new thing and retroactively downgrading their old “3G” product to be… some smaller number of G’s. “3G” itself it not a technical term with a whole lot of meaning, especially as it lumps together so many incompatible, competing air interface protocols. The situation for consumers was less confused in Europe, where GSM and W-CDMA are dominant, governments auctioned new frequencies set aside for “3G,” and the carrier offerings were more distinct.
  • The same song-and-dance is likely to play out over “4G” — a term that engineers tentatively apply to a forthcoming ITU standard called IMT-Advanced, and carriers apply to whatever they want you to buy now. You might notice that Sprint is currently selling Mobile WiMAX as “4G.” Mobile WiMAX is part of IMT-2000 — the 3G standard. Verizon Wireless is selling something called “LTE” as “4G” — it ain’t in IMT-Advanced either. Today’s “4G” products are like the “3G” of 2002 and 2003 — they will become “3.75G” as soon as the next hot thing comes out.

But the point I really want to make is: this is all a red herring. Focusing on the protocol between your cell phone and the tower — or worse, spending money on that basis — is letting yourself be distracted. It’s like the secret pick-me-up in Geritol, concocted by Madison Avenue instead of a chemist.

A cell phone is essentially sharing a swath of radio spectrum with a bunch of other people within a cell. Think of it like a cable modem or any other ISP. You can have the world’s most sophisticated modem, but if it’s trying to talk in a tiny slice of spectrum shared with everybody else within miles around (because there aren’t enough towers to divide you up into cells), it’ll still be awful.

Consider, for example, the performance I get from a Verizon “3G” USB modem:

3060 packets transmitted, 3007 received, 1% packet loss, time 3061925ms
rtt min/avg/max/mdev = 121.554/404.199/22307.520/1213.055 ms, pipe 23

Pretty sad! But hey, it’s 3G. In truth, a lot of boring factors control the performance of your cell phone data transmissions, principally:

  1. How much spectrum has the carrier licensed in my city, and how much is allocated to this kind of modulation?
  2. How many other people am I sharing the local tower with? In other words, how big is my cell, and how many towers has the carrier built or contracted with?
  3. How much throughput are my cellmates trying to consume?
  4. How much throughput has the carrier built in its back-end network connecting to the tower?

You might notice that all of these meat-and-potatoes factors involve the carrier spending money, and they all involve gradual improvement in behind-the-scenes infrastructure that’s hard to get customers excited. Persuading you to buy a new cell phone with a sophisticated modem and sign up for a two-year contract is a different story. So they don’t sell you something measurable where they could be held accountable; they sell how sweet it feels to be using a sophisticated radio modem protocol to talk to them.

Don’t get me wrong — UMTS and EV-DO are sophisticated protocols, and a lot of smart people and clever techniques made them legitimate engineering accomplishments. But the boring factors — the raw resources being shared among the nearby customers — dictate your performance just as much as incremental improvements in the air interface. What we really ought to care about is the same as with any Internet service provider — the throughput and latency and reliability you get to the endpoints you want to reach. That’s what matters, not the sophistication of one piece of the puzzle.

If the carrier sold you “384 kbps Internet access anywhere in the coverage area, outdoors,” that would be something you could hold them accountable for. The carrier might even have to put a brake on signing up new customers until it could build new towers or license more spectrum for everybody to share, if it made that guarantee.

Some have proposed even more freely enterprising business models — like having your phone get minute-to-minute bids from the local towers on who will carry your traffic for what price, and accept the lowest bidder who offers acceptable performance.

Selling you “3G” — well, that’s a lot easier to live up to. And it changes every year. So don’t tell me how many G’s your new phone has. We’ve loved and lost so many G’s at this point. Tell me you got a new phone where you pay to get 1 Mbps and 100 ms rtt to major exchange points. When the market moves forward enough to make that a reality, that’ll be a generation worth celebrating.

Want more tendentious histories of the recent technological past?

We can’t promise that, but our RSS feed is the best shot you’ve got! Subscribe now and you’ll never go hungry again.

Share :
  • Twitter
  • Reddit
  • Digg
  • Facebook
  • del.icio.us
  • StumbleUpon

Strace — The Sysadmin’s Microscope

Posted in System administration on August 6th, 2010 by Greg Price16 Comments

Sometimes as a sysadmin the logfiles just don’t cut it, and to solve a problem you need to know what’s really going on. That’s when I turn to strace — the system-call tracer.

A system call, or syscall, is where a program crosses the boundary between user code and the kernel. Fortunately for us using strace, that boundary is where almost everything interesting happens in a typical program.

The two basic jobs of a modern operating system are abstraction and multiplexing. Abstraction means, for example, that when your program wants to read and write to disk it doesn’t need to speak the SATA protocol, or SCSI, or IDE, or USB Mass Storage, or NFS. It speaks in a single, common vocabulary of directories and files, and the operating system translates that abstract vocabulary into whatever has to be done with the actual underlying hardware you have. Multiplexing means that your programs and mine each get fair access to the hardware, and don’t have the ability to step on each other — which means your program can’t be permitted to skip the kernel, and speak raw SATA or SCSI to the actual hardware, even if it wanted to.

So for almost everything a program wants to do, it needs to talk to the kernel. Want to read or write a file? Make the open() syscall, and then the syscalls read() or write(). Talk on the network? You need the syscalls socket(), connect(), and again read() and write(). Make more processes? First clone() (inside the standard C library function fork()), then you probably want execve() so the new process runs its own program, and you probably want to interact with that process somehow, with one of wait4(), kill(), pipe(), and a host of others. Even looking at the clock requires a system call, clock_gettime(). Every one of those system calls will show up when we apply strace to the program.

In fact, just about the only thing a process can do without making a telltale system call is pure computation — using the CPU and RAM and nothing else. As a former algorithms person, that’s what I used to think was the fun part. Fortunately for us as sysadmins, very few real-life programs spend very long in that pure realm between having to deal with a file or the network or some other part of the system, and then strace picks them up again.

Let’s look at a quick example of how strace solves problems.

Use #1: Understand A Complex Program’s Actual Behavior

One day, I wanted to know which Git commands take out a certain lock — I had a script running a series of different Git commands, and it was failing sometimes when run concurrently because two commands tried to hold the lock at the same time.

Now, I love sourcediving, and I’ve done some Git hacking, so I spent some time with the source tree investigating this question. But this code is complex enough that I was still left with some uncertainty. So I decided to get a plain, ground-truth answer to the question: if I run “git diff“, will it grab this lock?

Strace to the rescue. The lock is on a file called index.lock. Anything trying to touch the file will show up in strace. So we can just trace a command the whole way through and use grep to see if index.lock is mentioned:

  $ strace git status 2>&1 >/dev/null | grep index.lock
  open(".git/index.lock", O_RDWR|O_CREAT|O_EXCL, 0666) = 3
  rename(".git/index.lock", ".git/index") = 0
  $ strace git diff 2>&1 >/dev/null | grep index.lock
  $

So git status takes the lock, and git diff doesn’t.

Interlude: The Toolbox

To help make it useful for so many purposes, strace takes a variety of options to add or cut out different kinds of detail and help you see exactly what’s going on.

In Medias Res, If You Want

Sometimes we don’t have the luxury of starting a program over to run it under strace — it’s running, it’s misbehaving, and we need to find out what’s going on. Fortunately strace handles this case with ease. Instead of specifying a command line for strace to execute and trace, just pass -p PID where PID is the process ID of the process in question — I find pstree -p invaluable for identifying this — and strace will attach to that program, while it’s running, and start telling you all about it.

Times

When I use strace, I almost always pass the -tt option. This tells me when each syscall happened — -t prints it to the second, -tt to the microsecond. For system administration problems, this often helps a lot in correlating the trace with other logs, or in seeing where a program is spending too much time.

For performance issues, the -T option comes in handy too — it tells me how long each individual syscall took from start to finish.

Data

By default strace already prints the strings that the program passes to and from the system — filenames, data read and written, and so on. To keep the output readable, it cuts off the strings at 32 characters. You can see more with the -s option — -s 1024 makes strace print up to 1024 characters for each string — or cut out the strings entirely with -s 0.

Sometimes you want to see the full data flowing in just a few directions, without cluttering your trace with other flows of data. Here the options -e read= and -e write= come in handy.

For example, say you have a program talking to a database server, and you want to see the SQL queries, but not the voluminous data that comes back. The queries and responses go via write() and read() syscalls on a network socket to the database. First, take a preliminary look at the trace to see those syscalls in action:

  $ strace -p 9026
  Process 9026 attached - interrupt to quit
  read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
  poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
  write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
  read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
  poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
  write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
  [...]

Those write() syscalls are the SQL queries — we can make out the SELECT foo FROM bar, and then it trails off. To see the rest, note the file descriptor the syscalls are happening on — the first argument of read() or write(), which is 3 here. Pass that file descriptor to -e write=:

  $ strace -p 9026 -e write=3
  Process 9026 attached - interrupt to quit
  read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
  poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
  write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
   | 00000  30 00 00 00 03 53 45 4c  45 43 54 20 74 69 6d 65  0....SEL ECT time |
   | 00010  73 74 61 6d 70 20 46 52  4f 4d 20 61 72 74 69 66  stamp FR OM artif |
   | 00020  61 63 74 73 20 57 48 45  52 45 20 69 64 20 3d 20  acts WHE RE id =  |
   | 00030  31 34 35 34                                       1454              |

and we see the whole query. It’s both printed and in hex, in case it’s binary. We could also get the whole thing with an option like -s 1024, but then we’d see all the data coming back via read() — the use of -e write= lets us pick and choose.

Filtering the Output

Sometimes the full syscall trace is too much — you just want to see what files the program touches, or when it reads and writes data, or some other subset. For this the -e trace= option was made. You can select a named suite of system calls like -e trace=file (for syscalls that mention filenames) or -e trace=desc (for read() and write() and friends, which mention file descriptors), or name individual system calls by hand. We’ll use this option in the next example.

Child Processes

Sometimes the process you trace doesn’t do the real work itself, but delegates it to child processes that it creates. Shell scripts and Make runs are notorious for taking this behavior to the extreme. If that’s the case, you may want to pass -f to make strace “follow forks” and trace child processes, too, as soon as they’re made.

For example, here’s a trace of a simple shell script, without -f:

  $ strace -e trace=process,file,desc sh -c \
     'for d in .git/objects/*; do ls $d >/dev/null; done'
  [...]
  stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
  clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4b68af5770) = 11948
  wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11948
  --- SIGCHLD (Child exited) @ 0 (0) ---
  wait4(-1, 0x7fffc3473604, WNOHANG, NULL) = -1 ECHILD (No child processes)

Not much to see here — all the real work was done inside process 11948, the one created by that clone() syscall.

Here’s the same script traced with -f (and the trace edited for brevity):

  $ strace -f -e trace=process,file,desc sh -c \
     'for d in .git/objects/*; do ls $d >/dev/null; done'
  [...]
  stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
  clone(Process 10738 attached
  child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5a93f99770) = 10738
  [pid 10682] wait4(-1, Process 10682 suspended
   
  [pid 10738] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
  [pid 10738] dup2(3, 1)                  = 1
  [pid 10738] close(3)                    = 0
  [pid 10738] execve("/bin/ls", ["ls", ".git/objects/28"], [/* 25 vars */]) = 0
  [... setup of C standard library omitted ...]
  [pid 10738] stat(".git/objects/28", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
  [pid 10738] open(".git/objects/28", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
  [pid 10738] getdents(3, /* 40 entries */, 4096) = 2480
  [pid 10738] getdents(3, /* 0 entries */, 4096) = 0
  [pid 10738] close(3)                    = 0
  [pid 10738] write(1, "04102fadac20da3550d381f444ccb5676"..., 1482) = 1482
  [pid 10738] close(1)                    = 0
  [pid 10738] close(2)                    = 0
  [pid 10738] exit_group(0)               = ?
  Process 10682 resumed
  Process 10738 detached
  <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10738
  --- SIGCHLD (Child exited) @ 0 (0) ---

Now this trace could be a miniature education in Unix in itself — future blog post? The key thing is that you can see ls do its work, with that open() call followed by getdents().

The output gets cluttered quickly when multiple processes are traced at once, so sometimes you want -ff, which makes strace write each process’s trace into a separate file.

Use #2: Why/Where Is A Program Stuck?

Sometimes a program doesn’t seem to be doing anything. Most often, that means it’s blocked in some system call. Strace to the rescue.

  $ strace -p 22067
  Process 22067 attached - interrupt to quit
  flock(3, LOCK_EX

Here it’s blocked trying to take out a lock, an exclusive lock (LOCK_EX) on the file it’s opened as file descriptor 3. What file is that?

  $ readlink /proc/22067/fd/3
  /tmp/foobar.lock

Aha, it’s the file /tmp/foobar.lock. And what process is holding that lock?

  $ lsof | grep /tmp/foobar.lock
  command   21856       price    3uW     REG 253,88       0 34443743 /tmp/foobar.lock
  command   22067       price    3u      REG 253,88       0 34443743 /tmp/foobar.lock

Process 21856 is holding the lock. Now we can go figure out why 21856 has been holding the lock for so long, whether 21856 and 22067 really need to grab the same lock, etc.

Other common ways the program might be stuck, and how you can learn more after discovering them with strace:

  • Waiting on the network. Use lsof again to see the remote hostname and port.
  • Trying to read a directory. Don’t laugh — this can actually happen when you have a giant directory with many thousands of entries. And if the directory used to be giant and is now small again, on a traditional filesystem like ext3 it becomes a long list of “nothing to see here” entries, so a single syscall may spend minutes scanning the deleted entries before returning the list of survivors.
  • Not making syscalls at all. This means it’s doing some pure computation, perhaps a bunch of math. You’re outside of strace‘s domain; good luck.

Uses #3, #4, …

A post of this length can only scratch the surface of what strace can do in a sysadmin’s toolbox. Some of my other favorites include

  • As a progress bar. When a program’s in the middle of a long task and you want to estimate if it’ll be another three hours or three days, strace can tell you what it’s doing right now — and a little cleverness can often tell you how far that places it in the overall task.
  • Measuring latency. There’s no better way to tell how long your application takes to talk to that remote server than watching it actually read() from the server, with strace -T as your stopwatch.
  • Identifying hot spots. Profilers are great, but they don’t always reflect the structure of your program. And have you ever tried to profile a shell script? Sometimes the best data comes from sending a strace -tt run to a file, and picking through to see when each phase of your program started and finished.
  • As a teaching and learning tool. The user/kernel boundary is where almost everything interesting happens in your system. So if you want to know more about how your system really works — how about curling up with a set of man pages and some output from strace?

Trying to figure out when to read from our blog?

strace won’t help you, but our RSS feed will! Subscribe now and you’ll never wait4 another link!

Share :
  • Twitter
  • Reddit
  • Digg
  • Facebook
  • del.icio.us
  • StumbleUpon

Choose Your Own Sysadmin Adventure

Posted in Fun, System administration on July 30th, 2010 by Christian Ternus8 Comments

Today is System Administrator Appreciation Day, and being system administrators ourselves,  we here at Ksplice decided to have a little fun with this holiday.

We’ve taken a break, drank way too much coffee, and created a very special Choose Your Own Adventure for all the system administrators out there.

Click here to begin the adventure.

Feedback and comments welcome. Above all: Happy System Administrator Appreciation Day. Share the love with your friends, colleagues, and especially any sysadmins you might know.

Wish your life as a sysadmin were a little less adventurous?

Ksplice Uptrack allows you to apply Linux security updates without rebooting, so you don’t have to interrupt your marathon StarCraft session to babysit a 3AM reboot.

Share :
  • Twitter
  • Reddit
  • Digg
  • Facebook
  • del.icio.us
  • StumbleUpon

Learning by doing: Writing your own traceroute in 8 easy steps

Posted in Networking on July 29th, 2010 by Leonid Grinberg30 Comments

Anyone who administers even a moderately sized network knows that when problems arise, diagnosing and fixing them can be extremely difficult. They’re usually non-deterministic and difficult to reproduce, and very similar symptoms (e.g. a slow or unreliable connection) can be caused by any number of problems — congestion, a broken router, a bad physical link, etc.

One very useful weapon in a system administrator’s arsenal for dealing with network issues is traceroute (or tracert, if you use Windows). This is a neat little program that will print out the path that packets take to get from the local machine to a destination — that is, the sequence of routers that the packets go through.

Using traceroute is pretty straightforward. On a UNIX-like system, you can do something like the following:

    $ traceroute google.com
    traceroute to google.com (173.194.33.104), 30 hops max, 60 byte packets
     1  router.lan (192.168.1.1)  0.595 ms  1.276 ms  1.519 ms
     2  70.162.48.1 (70.162.48.1)  13.669 ms  17.583 ms  18.242 ms
     3  ge-2-20-ur01.cambridge.ma.boston.comcast.net (68.87.36.225)  18.710 ms  19.192 ms  19.640 ms
     4  be-51-ar01.needham.ma.boston.comcast.net (68.85.162.157)  20.642 ms  21.160 ms  21.571 ms
     5  pos-2-4-0-0-cr01.newyork.ny.ibone.comcast.net (68.86.90.61)  28.870 ms  29.788 ms  30.437 ms
     6  pos-0-3-0-0-pe01.111eighthave.ny.ibone.comcast.net (68.86.86.190)  30.911 ms  17.377 ms  15.442 ms
     7  as15169-3.111eighthave.ny.ibone.comcast.net (75.149.230.194)  40.081 ms  41.018 ms  39.229 ms
     8  72.14.238.232 (72.14.238.232)  20.139 ms  21.629 ms  20.965 ms
     9  216.239.48.24 (216.239.48.24)  25.771 ms  26.196 ms  26.633 ms
    10 173.194.33.104 (173.194.33.104)  23.856 ms  24.820 ms  27.722 ms

Pretty nifty. But how does it work? After all, when a packet leaves your network, you can’t monitor it anymore. So when it hits all those routers, the only way you can know about that is if one of them tells you about it.

The secret behind traceroute is a field called “Time To Live” (TTL) that is contained in the headers of the packets sent via the Internet Protocol. When a host receives a packet, it checks if the packet’s TTL is greater than 1 before sending it on down the chain. If it is, it decrements the field. Otherwise, it drops the packet and sends an ICMP TIME_EXCEEDED packet to the sender. This packet, like all IP packets, contains the address of its sender, i.e. the intermediate host.

traceroute works by sending consecutive requests to the same destination with increasing TTL fields. Most of these attempts result in messages from intermediate hosts saying that the packet was dropped. The IP addresses of these intermediate hosts are then printed on the screen (generally with an attempt made at determining the hostname) as they arrive, terminating when the maximum number of hosts have been hit (on my machine’s traceroute the default maximum is 30, but this is configurable), or when the intended destination has been reached.

The rest of this post will walk through implementing a very primitive version of traceroute in Python. The real traceroute is of course more complicated than what we will create, with many configurable features and modes. Still, our version will implement the basic functionality, and at the end, we’ll have a really nice and short Python script that will do just fine for performing a simple traceroute.

So let’s begin. Our algorithm, at a high level, is an infinite loop whose body creates a connection, prints out information about it, and then breaks out of the loop if a certain condition has been reached. So we can start with the following skeletal code:

    #!/usr/bin/python

    def main(dest):
        while True:
            # ... open connections ...
            # ... print data ...
            # ... break if useful ...
            pass

    if __name__ == "__main__":
        main('google.com')

Step 1: Turn a hostname into an IP address.

The socket module provides a gethostbyname() method that attempts to resolve a domain name into an IP address:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        while True:
            # ... open connections ...
            # ... print data ...
            # ... break if useful ...
            pass

    if __name__ == "__main__":
        main('google.com')

Step 2: Create sockets for the connections.

We’ll need two sockets for our connections — one for receiving data and one for sending. We have a lot of choices for what kind of probes to send; let’s use UDP probes, which require a datagram socket (SOCK_DGRAM). The routers along our traceroute path are going to send back ICMP packets, so for those we need a raw socket (SOCK_RAW).

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 3: Set the TTL field on the packets.

We’ll simply use a counter which begins at 1 and which we increment with each iteration of the loop. We set the TTL using the setsockopt module of the socket object:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)

            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 4: Bind the sockets and send some packets.

Now that our sockets are all set up, we can put them to work! We first tell the receiving socket to listen to connections from all hosts on a specific port (most implementations of traceroute use ports from 33434 to 33534 so we will use 33434 as a default). We do this using the bind() method of the receiving socket object, by specifying the port and an empty string for the hostname. We can then use the sendto() method of the sending socket object to send to the destination host (on the same port). The first argument of the sendto() method is the data to send; in our case, we don’t actually have anything specific we want to send, so we can just give the empty string:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))

            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 5: Get the intermediate hosts’ IP addresses.

Next, we need to actually get our data from the receiving socket. For this, we can use the recvfrom() method of the object, whose return value is a tuple containing the packet data and the sender’s address. In our case, we only care about the latter. Note that the address is itself actually a tuple containing both the IP address and the port, but we only care about the former. recvfrom() takes a single argument, the blocksize to read — let’s go with 512.

It’s worth noting that some administrators disable receiving ICMP ECHO requests, pretty much specifically to prevent the use of utilities like traceroute, since the detailed layout of a network can be sensitive information (another common reason to disable them is the ping utility, which can be used for denial-of-service attacks). It is therefore completely possible that we’ll get a timeout error, which will result in an exception. Thus, we’ll wrap this call in a try/except block. Traditionally, traceroute prints asterisks when it can’t get the address of a host. We’ll do the same once we print out results.

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
            except socket.error:
                pass
            finally:
                send_socket.close()
                recv_socket.close()

            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 6: Turn the IP addresses into hostnames and print the data.

To match traceroute‘s behavior, we want to try to display the hostname along with the IP address. The socket module provides the gethostbyaddr() method for reverse DNS resolution. The resolution can fail and result in an exception, in which case we’ll want to catch it and make the hostname the same as the address. Once we get the hostname, we have all the information we need to print our data:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            curr_name = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
                try:
                    curr_name = socket.gethostbyaddr(curr_addr)[0]
                except socket.error:
                    curr_name = curr_addr
            except socket.error:
                pass
            finally:
                send_socket.close()
                recv_socket.close()

            if curr_addr is not None:
                curr_host = "%s (%s)" % (curr_name, curr_addr)
            else:
                curr_host = "*"
            print "%d\t%s" % (ttl, curr_host)

            ttl += 1
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 7: End the loop.

There are two conditions for exiting our loop — either we have reached our destination (that is, curr_addr is equal to dest_addr)1 or we have exceeded some maximum number of hops. We will set our maximum at 30:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        max_hops = 30
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            curr_name = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
                try:
                    curr_name = socket.gethostbyaddr(curr_addr)[0]
                except socket.error:
                    curr_name = curr_addr
            except socket.error:
                pass
            finally:
                send_socket.close()
                recv_socket.close()

            if curr_addr is not None:
                curr_host = "%s (%s)" % (curr_name, curr_addr)
            else:
                curr_host = "*"
            print "%d\t%s" % (ttl, curr_host)

            ttl += 1
            if curr_addr == dest_addr or ttl > max_hops:
                break

    if __name__ == "__main__":
        main('google.com')

Step 8: Run the code!

We’re done! Let’s save this to a file and run it! Because raw sockets require root privileges, traceroute is typically setuid. For our purposes, we can just run the script as root:

    $ sudo python poor-mans-traceroute.py
    [sudo] password for leonidg:
    1       router.lan (192.168.1.1)
    2       70.162.48.1 (70.162.48.1)
    3       ge-2-20-ur01.cambridge.ma.boston.comcast.net (68.87.36.225)
    4       be-51-ar01.needham.ma.boston.comcast.net (68.85.162.157)
    5       pos-2-4-0-0-cr01.newyork.ny.ibone.comcast.net (68.86.90.61)
    6       pos-0-3-0-0-pe01.111eighthave.ny.ibone.comcast.net (68.86.86.190)
    7       as15169-3.111eighthave.ny.ibone.comcast.net (75.149.230.194)
    8       72.14.238.232 (72.14.238.232)
    9       216.239.48.24 (216.239.48.24)
    10     173.194.33.104 (173.194.33.104)

Hurrah! The data matches the real traceroute‘s perfectly.

Of course, there are many improvements that we could make. As I mentioned, the real traceroute has a whole slew of other features, which you can learn about by reading the manpage. In the meantime, I wrote a slightly more complete version of the above code that allows configuring the port and max number of hops, as well as specifying the destination host. You can download it at my github repository.

Alright folks, What UNIX utility should we write next? strace, anyone? :-) 2

1 This is actually not quite how the real traceroute works. Rather than checking the IP addresses of the hosts and stopping when the destination address matches, it stops when it receives a ICMP “port unreachable” message, which means that the host has been reached. For our purposes, though, this simple address heuristic is good enough.

2 Ksplice blogger Nelson took up a DIY strace on his personal blog, Made of Bugs.

Fixing network problems? Don’t let reboots slow you down!

With Ksplice Uptrack, you can apply kernel patches without rebooting. Spend less time babysitting your updates and more time playing with network utilities!

Share :
  • Twitter
  • Reddit
  • Digg
  • Facebook
  • del.icio.us
  • StumbleUpon

Source diving for sysadmins

Posted in System administration on July 13th, 2010 by Greg Brockman6 Comments

As a system administrator, I work with dozens of large systems every day–Apache, MySQL, Postfix, Dovecot, and the list goes on from there. While I have a good idea of how to configure all of these pieces of software, I’m not intimately familiar with all of their code bases. And every so often, I’ll run into a problem which I can’t configure around.

When I’m lucky, I can reproduce the bug in a testing environment. I can then drop in arbitrary print statements, recompile with debugging flags, or otherwise modify my application to give me useful data. But all too often, I find that either the bug vanishes when it’s not in my production environment, or it would simply take too much time or resources to even set up a testing deployment. When this happens, I find myself left with no alternative but to sift through the source code of the failing system, hoping to find clues as to the cause of the bug of the day. Doing so is never painless, but over time I’ve developed a set of techniques to make the source diving experience as focused and productive and possible.

To illustrate these techniques, I’ll walk you through a real-world debugging experience I had a few weeks ago. I am a maintainer of the XVM project, an MIT-internal VPS service. We keep the disks of our virtual servers in shared storage, and we use clustering software to coordinate changes to the disks.

For a long time, we had happily run a cluster of four nodes. After receiving a grant for new hardware, we attempted to increase the size of our cluster from four nodes to eight nodes. But once we added the new nodes to the cluster, disaster struck. With five or more nodes in the cluster, no matter what we tried, we received the same error message:

root@babylon-four:~# lvs
 cluster request failed: Cannot allocate memory
 Can't get lock for xenvg
 Skipping volume group xenvg
 cluster request failed: Cannot allocate memory
 Can't get lock for babylon-four

And to make matters even more exciting, by the time we observed the problem, users had already booted enough virtual servers that we did not have the RAM to go back to four nodes. So there we were, with a broken cluster to debug.

Tip 1: Check the likely causes of failure first.

It can be quite tempting to believe that a given problem is caused by a bug in someone else’s code rather than your own error in configuration. In reality, the common bugs in large, widely-used projects have already been squashed, meaning the most likely cause of error is something that you are doing wrong. I’ve lost track of the number of times I was sure I encountered a bug in some software, only to later discover that I had forgotten to set a configuration variable. So when you encounter a failure of some kind, make sure that your environment is not obviously at fault. Check your configuration files, check resource usage, check log files.

In the case of XVM, after seeing memory errors, we naturally figured we were out of memory–but free -m showed plenty of spare RAM. Thinking a rogue process might be to blame, we ran ps aux and top, but no process was consuming abnormal amounts of RAM or CPU. We consulted man pages, we scoured the relevant configuration files in /etc, and we even emailed the clustering software’s user list, trying to determine if we were doing something wrong. Our efforts failed to uncover any problems on our end.

Tip 2: Gather as much debugging output as you can. You’re going to need it.

Once you’re sure you actually need to do a source dive, you should make sure you have all the information you can get about what your program is doing wrong. See if your program has a “debugging” or “verbosity” level you can turn up. Check /var/log/ for dedicated log files for the software under consideration, or perhaps check a standard log such as syslog. If your program does not provide enough output on its own, try using strace -p to dump the system calls it’s issuing.

Before doing our clustering-software source dive, we cranked debugging as high as it would go to get the following output:

Got new connection on fd 5
Read on local socket 5, len = 28
creating pipe, [9, 10]
Creating pre&post thread
in sub thread: client = 0x69f010
Sub thread ready for work.
doing PRE command LOCK_VG 'V_xenvg' at 1 (client=0x69f010)
lock_resource 'V_xenvg-1', flags=0, mode=1
Created pre&post thread, state = 0
Writing status 12 down pipe 10
Waiting for next pre command
read on PIPE 9: 4 bytes: status: 12
background routine status was 12, sock_client=0x69f010
Send local reply

Note that this spew does not contain an obvious error message. Still, it had enough information for us to ultimately track down and fix the problem that beset us.

Tip 3: Use the right tools for the job

Perhaps the worst part of living in a world with many debugging tools is that it’s easy to waste time using the wrong ones. If you are seeing a segmentation fault or an apparent deadlock, then your first instinct should be to reach for gdb. gdb has all sorts of nifty capabilities, including the ability to attach to an already-running process. But if you don’t have an obvious crash site, often the information you glean from dynamic debugging is too narrow or voluminous to be helpful. Some, such as Linus Torvalds, have even vehemently opposed debuggers in general.

Sometimes the simplest tools are the best: together grep and find can help you navigate an entire codebase knowing only fragments of text or filenames (or guesses thereof). It can also be helpful to use a language-specific tool. For C, I recommend cscope, a tool which lets you find symbol usages or definitions.

XVM’s clustering problem was with a multiprocess network-based program, and we had no idea where the failure was originating. Both properties would have made the use of a dynamic debugger quite onerous. Thus we elected to dive into the source code using nothing but our familiar command-line utilities.

Tip 4: Know your error.

Examine your system’s failure mode. Is it printing an error message? If so, where is that error message originating? What can possibly cause that error? If you don’t understand the symptoms of a failure, you certainly won’t be able to diagnose its cause.

Often, grep as you might, you won’t find the text of the error message in the codebase under consideration. Rather, a standard UNIX error-reporting mechanism is to internally set the global variable errno, which is converted to a string using strerror.

Here’s a Python script that I’ve found useful for converting the output of strerror to the corresponding symbolic error name. (Just pass the script any substring of your error as an argument.)

#!/usr/bin/env python
import errno, os, sys
msg = ' '.join(sys.argv[1:]).lower()
for i in xrange(256):
    err = os.strerror(i)
    if msg in err.lower():
        print '%s [errno %d]: %s' % (errno.errorcode.get(i, '(unknown)'), i, err)

This script shows that the “Cannot allocate memory” message we had seen was caused by errno being set to the code ENOMEM.

Tip 5: Map lines of output to lines of code.

You can learn a lot about the state of a program by determining which lines of code it is executing. First, fetch the source code for the version of the software you are running (generally one of apt-get source and yumdownloader –source). Using your handy command-line tools, you should then be able to trace lines of debugging output back to the lines of code that emitted them. You can thus determine a set of lines that are definitively being executed.

Returning to the XVM example, we used apt-get source to fetch the relevant source code and dpkg -l to verify we were running the same version. We then ran a grep for each line of debugging output we had obtained. One such invocation,
grep -r "lock_resource '.*'" .,
showed that the corresponding log entry was emitted by a line in the middle of a function entitled _lock_resource.

Tip 6: Be systematic.

If you’ve followed the preceding tips, you’ll know what parts of the code the program is executing and how it’s erroring out. From there, you should work systematically, eliminating parts of the code that you can prove are not contributing to your error. Be sure you have actual evidence for your conclusions–the existence of a bug indicates that the program is in an unexpected state, and thus the core assumptions of the code may be violated.

At this point in the XVM debugging, we examined the _lock_resource function. After the debugging message we had in our logs, all paths of control flow except one printed a message we had not seen. That path terminated with an error from a function called saLckResourceLock. Hence we had found the source of our error.

We also noticed that _lock_resource transforms error values returned by the function it calls using using ais_to_errno. Reading the body of ais_to_errno, we noted it just maps internal error values to standard UNIX error codes. So instead of ENOMEM, the real culprit was one of SA_AIS_ERR_NO_MEMORY, SA_AIS_ERR_NO_RESOURCES, or SA_AIS_ERR_NO_SECTIONS. This certainly explained why we could see this error message even on machines with tens of gigabytes of free memory!

Ultimately, our debugging process brought us to the following block of code:

if (global_lock_count == LCK_MAX_NUM_LOCKS)     {
error = SA_AIS_ERR_NO_RESOURCES;
goto error_exit;
}

This chunk of code felt exactly right. It was bound by some hard-coded limit (namely, LCK_MAX_NUM_LOCKS, the maximum number of locks) and hitting it returned one of the error codes we were seeking. We bumped the value of the constant and have been running smoothly ever since.

Tip 7: Make sure you really fixed it.

How many times have you been certain you finally found an elusive bug, spent hours recompiling and redeploying, and then found that the bug was actually still there? Or even better, the bug simply changed when it appears, and you failed to find this out before telling everyone that you fixed it?

It’s important that after squashing a bug, you examine, test, and sanity-check your changes. Perhaps explain your reasoning to someone else. It’s all too easy to “fix” code that isn’t broken, only cover a subset of the relevant cases, or introduce a new bug in your patch.

After bumping the value of LCK_MAX_NUM_LOCKS, we checked the project’s changelog. We found a commit increasing the maximum number of locks without any changes to code, so our patch seemed safe. We explained our reasoning and findings to the other project developers, quietly deployed our patched version, and then after a week of stability sent an announce email proclaiming that we had fixed the cluster.

Your turn

What techniques have you found useful for debugging unfamiliar code?

Don’t resort to source-diving to fix bugs in your kernel

Instead, let the Ksplice Uptrack team fix your kernel bugs within milliseconds and without rebooting.

Share :
  • Twitter
  • Reddit
  • Digg
  • Facebook
  • del.icio.us
  • StumbleUpon

Building Filesystems the Way You Build Web Apps

Posted in Fun, Programming on July 8th, 2010 by Evan Broder18 Comments

FUSE is awesome. While most major Linux filesystems (ext3, XFS, ReiserFS, btrfs) are built-in to the Linux kernel, FUSE is a library that lets you instead write filesystems as userspace applications. When something attempts to access the filesystem, those accesses get passed on to the FUSE application, which can then return the filesystem data.

It lets you quickly prototype and test filesystems that can run on multiple platforms without writing kernel code. You can easily experiment with strange and unusual interactions between the filesystem and your applications. You can even build filesystems without writing a line of C code.

FUSE has a reputation of being used only for toy filesystems (when are you actually going to use flickrfs?), but that’s really not fair. FUSE is currently the best way to read NTFS partitions on Linux, how non-GNOME and legacy applications can access files over SFTP, SMB, and other protocols, and the only way to run ZFS on Linux.

But because the FUSE API calls separate functions for each system call (i.e. getattr, open, read, etc.), in order to write a useful filesystem you need boilerplate code to translate requests for a particular path into a logical object in your filesystem, and you need to do this in every FUSE API function you implement.

Take a page from web apps

This is the kind of problem that web development frameworks have also had to solve, since it’s been a long time since a URL always mapped directly onto a file on the web server. And while there are a handful of approaches for handling URL dispatch, I’ve always been a fan of the URL dispatch style popularized by routing in Ruby on Rails, which was later ported to Python as the Routes library.

Routes dissociates an application’s URL structure from your application’s internal organization, so that you can connect arbitrary URLs to arbitrary controllers. However, a more common use of Routes involves embedding variables in the Routes configuration, so that you can support a complex and potentially arbitrary set of URLs with a comparatively simple configuration block. For instance, here is the (slightly simplified) Routes configuration from a Pylons web application:

from routes import Mapper

def make_map():
    map = Mapper()
    map.minimization = False

    # The ErrorController route (handles 404/500 error pages); it should
    # likely stay at the top, ensuring it can always be resolved
    map.connect('error/{action}/{id}', controller='error')

    map.connect('/', controller='status', action='index')
    map.connect('/{controller}', action='index')
    map.connect('/{controller}/{action}')
    map.connect('/{controller}/{action}/{id}')

    return map

In this example, {controller}, {action}, and {id} are variables which can match any string within that component. So, for instance, if someone were to access /spend/new within the web application, Routes would find a controller named spend, and would call the new action on that method.

RouteFS: URL routing for filesystems

Just as URLs take their inspiration from the filesystem, we can use the ideas from URL routing in our filesystem. And to make this easy, I created a project called RouteFS. RouteFS ties together FUSE and Routes, and it’s great because it lets you specify your filesystem in terms of the filesystem hierarchy instead of in terms of the system calls to access it.

RouteFS was originally developed as a generalized solution to a real problem I faced while working on the Invirt project at MIT. We wanted a series of filesystem entries that were automatically updated when our database changed (specifically, we were using .k5login files to control access to a server), so we used RouteFS to build a filesystem where every filesystem lookup was resolved by a database query, ensuring that our filesystem always stayed up to date.

Today, however, we’re going to be using RouteFS to build the very thing I lampooned FUSE for: toy filesystems. I’ll be demonstrating how to build a simple filesystem in less than 60 lines of code. I want to continue the popular theme of exposing Web 2.0 services as filesystems, but I’m also a software engineer at a very Git- and Linux-heavy company. The popular Git repository hosting site Github has an API for interacting with the repositories hosted there, so we’ll use the Python bindings for the API to build a Github filesystem, or GithubFS. GithubFS lets you examine the Git repositories on Github, as well as the different branches of those repositories.

Getting started

If you want to follow along, you’ll first need to install FUSE itself, along with the Python FUSE bindings – look for a python-fuse or fuse-python package. You’ll also need a few third-party Python packages: Routes, RouteFS, and github2. Routes and RouteFS are available from the Python Cheeseshop, so you can install those by running easy_install Routes RouteFS. For github2, you’ll need the bleeding edge version, which you can get by running easy_install http://github.com/ask/python-github2/tarball/master

Now then, let’s start off with the basic shell of a RouteFS filesystem:

#!/usr/bin/python

import routes
import routefs

class GithubFS(routefs.RouteFS):
    def make_map(self):
        m = routes.Mapper()
        return m

if __name__ == '__main__':
    routefs.main(GithubFS)

As with the web application code above, the make_map method of the GithubFS class creates, configures, and returns a Python Routes mapper, which RouteFS uses for dispatching accesses to the filesystem. The routefs.main function takes a RouteFS class and handles instantiating the class and mounting the filesystem.

Populating the filesystem

Now that we have a filesystem, let’s put some files in it:

#!/usr/bin/python

import routes
import routefs

class GithubFS(routefs.RouteFS):
    def __init__(self, *args, **kwargs):
        super(GithubFS, self).__init__(*args, **kwargs)

        # Maps user -> [projects]
        self.user_cache = {}

    def make_map(self):
        m = routes.Mapper()
        m.connect('/', controller='list_users')
        return m

    def list_users(self, **kwargs):
        return [user
            for user, projects in self.user_cache.iteritems()
            if projects]

if __name__ == '__main__':
    routefs.main(GithubFS)

Here, we add our first Routes mapping, connecting '/', or the root of the filesystem, to the list_users controller, which is just a method on the filesystem’s class. The list_users controller returns a list of strings. When the controller that a path maps to returns a list, RouteFS automatically makes that path into a directory. To make a path be a file, you just return a single string containing the file’s contents.

We’ll use the user_cache attribute to keep track of the users that we’ve seen and their repositories. This will let us auto-populate the root of the filesystem as users get looked up.

Let’s add some code to populate that cache:

#!/usr/bin/python

from github2 import client
import routes
import routefs

class GithubFS(routefs.RouteFS):
    def __init__(self, *args, **kwargs):
        super(GithubFS, self).__init__(*args, **kwargs)

        # Maps user -> [projects]
        self.user_cache = {}
        self.github = client.Github()

    def make_map(self):
        m = routes.Mapper()
        m.connect('/', controller='list_users')
        m.connect('/{user}', controller='list_repos')
        return m

    def list_users(self, **kwargs):
        return [user
            for user, projects in self.user_cache.iteritems()
            if projects]

    def list_repos(self, user, **kwargs):
        if user not in self.user_cache:
            try:
                self.user_cache[user] = [r.name
                    for r in self.github.repos.list(user)]
            except:
                self.user_cache[user] = None

        return self.user_cache[user]

if __name__ == '__main__':
    routefs.main(GithubFS)

That’s enough code that we can start interacting with the filesystem:

opus:~ broder$ ./githubfs /mnt/githubfs
opus:~ broder$ ls /mnt/githubfs
opus:~ broder$ ls /mnt/githubfs/ebroder
anygit	    githubfs	 pyhesiodfs	 python-simplestar
auto-aklog  ibtsocs	 python-github2  python-zephyr
bluechips   libhesiod	 python-hesiod
debmarshal  ponyexpress  python-moira
debothena   pyafs	 python-routefs
opus:~ broder$ ls /mnt/githubfs
ebroder

Users and projects and branches, oh my!

You can see a slightly more fleshed-out filesystem on (where else?) Github. GithubFS lets you look at the current SHA-1 for each branch in each repository for a user:

opus:~ broder$ ./githubfs /mnt/githubfs
opus:~ broder$ ls /mnt/githubfs/ebroder
anygit	    githubfs	 pyhesiodfs	 python-simplestar
auto-aklog  ibtsocs	 python-github2  python-zephyr
bluechips   libhesiod	 python-hesiod
debmarshal  ponyexpress  python-moira
debothena   pyafs	 python-routefs
opus:~ broder$ ls /mnt/githubfs/ebroder/githubfs
master
opus:~ broder$ cat /mnt/githubfs/ebroder/githubfs/master
cb4fc93ba381842fa0c2b34363d52475c4109852

What next?

Want to see more examples of RouteFS? RouteFS itself includes some example filesystems, and you can see how we used RouteFS within the Invirt project. But most importantly, because RouteFS is open source, you can incorporate it into your own projects.

So, what cool tricks can you think of for dynamically generated filesystems?

Stop worrying about in-kernel filesystems

Still rebooting for this month’s round of ext4 bugs? Let Ksplice Uptrack fix the filesystems in your kernel without rebooting, so you can spend your time writing userspace filesystems instead!

Share :
  • Twitter
  • Reddit
  • Digg
  • Facebook
  • del.icio.us
  • StumbleUpon

Let’s Play Vulnerability Bingo!

Posted in System administration, security on July 1st, 2010 by Jessica McKellar6 Comments

Dear Fellow System Administrators,

I like excitement in my life. I go on roller coasters, I ride my bike without a helmet, I make risky financial decisions. I treat my servers no differently. When my Linux vendor releases security updates, I think: I could apply these patches, but why would I? If I did, I’d have to coordinate with my users to schedule a maintenance window for 2am on a Sunday and babysit those systems while they reboot, which is seriously annoying, hurts our availability, and interrupts my beauty sleep (and trust me, I need my beauty sleep). Plus, where’s the fun in having a fully-patched system? Without open vulnerabilities, how else would I have won a ton of money in my office’s Vulnerability Bingo games?
vulnerability bingo card

How can I get in on some Vulnerability Bingo action, you ask? Simple: get yourself some bingo cards, be sure not to patch your systems, and place chips on appropriate squares when your machines are compromised. Or, as a fun variant, place chips when your friends’ machines get compromised! For the less adventurous, place chips as relevant Common Vulnerabilities and Exposures are announced.

What’s really great is the diversity of vulnerabilities. In 2009 alone, Vulnerability Bingo featured:

physically proximate denial of service attacks (CVE-2009-1046).

local denial of service attacks (CVE-2009-0322, CVE-2009-0031, CVE-2009-0269, CVE-2009-1242, CVE-2009-2406, CVE-2009-2407, CVE-2009-2287, CVE-2009-2692, CVE-2009-2909, CVE-2009-2908, CVE-2009-3290, CVE-2009-3547, CVE-2009-3621, CVE-2009-3620) coming in at least 5 great flavors: faults, memory corruption, system crashes, hangs, and the kernel OOPS!

And the perennial favorite, remote denial of service attacks (CVE-2009-1439, CVE-2009-1633, CVE-2009-3613, CVE-2009-2903) including but not limited to system crashes, IOMMU space exhaustion, and memory consumption!

How about leaking potentially sensitive information from kernel memory (CVE-2009-0676, CVE-2009-3002, CVE-2009-3612, CVE-2009-3228) and remote access to potentially sensitive information from kernel memory (CVE-2009-1265)?

Perhaps I can interest you in some privilege escalation (CVE-2009-2406, CVE-2009-2407, CVE-2009-2692, CVE-2009-3547, CVE-2009-3620), or my personal favorites, arbitrary code execution (CVE-2009-2908) and unknown impact (CVE-2009-0065, CVE-2009-1633, CVE-2009-3638).

Sometimes you get a triple threat like CVE-2009-1895, which “makes it easier for local users to leverage the details of memory usage to (1) conduct NULL pointer dereference attacks, (2) bypass the mmap_min_addr protection mechanism, or (3) defeat address space layout randomization (ASLR)“. Three great tastes that taste great together — and a great multi-play Bingo opportunity!

Linux vendors release kernel security updates almost every month (take Red Hat for example), so generate some cards and get in on the action before you miss the next round of exciting CVEs!

Happy Hacking,
Ben Bitdiddle
System Administrator
HackMe Inc.


You Don’t Want to Win Vulnerability Bingo

Don’t be like Ben Bitdiddle. Applying kernel updates doesn’t have to be a compromise between availability and security.

Ksplice Uptrack allows you to apply security updates without rebooting. When your Linux vendor release patches, the Uptrack service will send your machines rebootless versions that can be applied with zero disruption to you, your customers, or your running applications. Seeing is believing: roll out a fully-functional free trial on an unlimited number of machines and see what rebootless updates can do for you.

Share :
  • Twitter
  • Reddit
  • Digg
  • Facebook
  • del.icio.us
  • StumbleUpon

Attack of the Cosmic Rays!

Posted in System administration on June 24th, 2010 by Nelson Elhage102 Comments

It’s a well-documented fact that RAM in modern computers is susceptible to occasional random bit flips due to various sources of noise, most commonly high-energy cosmic rays. By some estimates, you can even expect error rates as high as one error per 4GB of RAM per day! Many servers these days have ECC RAM, which uses extra bits to store error-correcting codes that let them correct most bit errors, but ECC RAM is still fairly rare in desktops, and unheard-of in laptops.

For me, bitflips due to cosmic rays are one of those problems I always assumed happen to “other people”. I also assumed that even if I saw random cosmic-ray bitflips, my computer would probably just crash, and I’d never really be able to tell the difference from some random kernel bug.

A few weeks ago, though, I encountered some bizarre behavior on my desktop, that honestly just didn’t make sense. I spent about half an hour digging to discover what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM. I can’t prove whether the problem was due to cosmic rays, bad RAM, or something else, but in any case, I hope you find this story interesting and informative.

The problem

The symptom that I observed was that the expr program, used by shell scripts to do basic arithmetic, had started consistently segfaulting. This first manifested when trying to build a software project, since the GNU autotools make heavy use of this program:

[nelhage@psychotique]$ autoreconf -fvi
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: configure.ac: tracing
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
…

dmesg revealed that the segfaulting program was expr:

psychotique kernel: [105127.372705] expr[7756]: segfault at 1a70 ip 0000000000001a70 sp 00007fff2ee0cc40 error 4 in expr

And I was easily able to reproduce the problem by hand:

[nelhage@psychotique]$ expr 3 + 3
Segmentation fault

expr definitely hadn’t been segfaulting as of a day ago or so, so something had clearly gone suddenly, and strangely, wrong. I had no idea what, but I decided to find out.

Check the dumb things

I run Ubuntu, so the first things I checked were the
/var/log/dpkg.log and /var/log/aptitude.log files, to determine whether any suspicious packages had been upgraded recently. Perhaps Ubuntu accidentally let a buggy package slip into the release. I didn’t recall doing any significant upgrades, but maybe dependencies had pulled in an upgrade I had missed.

The logs revealed I hadn’t upgraded anything of note in the last several days, so that theory was out.

Next up, I checked env | grep ^LD. The dynamic linker takes input from a number of environment variables, all of whose names start with LD_. Was it possible I had somehow ended up setting some variable that was messing up the dynamic linker, causing it to link a broken library or something?

[nelhage@psychotique]$ env | grep ^LD
[nelhage@psychotique]$

That, too, turned up nothing.

Start digging

I was fortunate in that, although this failure is strange and sudden, it seemed perfectly reproducible, which means I had the luxury of being able to run as many tests as I wanted to debug it.

The problem is a segfault, so I decided to pull up a debugger and figure out where it’s segfaulting. First, though, I’d want debug symbols, so I could make heads or tails of the crashed program. Fortunately, Ubuntu provides debug symbols for every package they ship, in a separate repository. I already had the debug sources enabled, so I used dpkg -S to determine that expr belongs to the coreutils package:

[nelhage@psychotique]$ dpkg -S $(which expr)
coreutils: /usr/bin/expr

And installed the coreutils debug symbols:

[nelhage@psychotique]$ sudo aptitude install coreutils-dbgsym

Now, I could run expr inside gdb, catch the segfault, and get a stack trace:

[nelhage@psychotique]$ gdb --args expr 3 + 3
…
(gdb) run
Starting program: /usr/bin/expr 3 + 3
Program received signal SIGSEGV, Segmentation fault.
0x0000000000001a70 in ?? ()
(gdb) bt
#0  0x0000000000001a70 in ?? ()
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
#2  0x00000000004027dd in eval4 (evaluate=true) at expr.c:773
#3  0x000000000040291d in eval3 (evaluate=true) at expr.c:812
#4  0x000000000040208d in eval2 (evaluate=true) at expr.c:842
#5  0x0000000000402280 in eval1 (evaluate=<value optimized out>) at expr.c:921
#6  0x0000000000402320 in eval (evaluate=<value optimized out>) at expr.c:952
#7  0x0000000000402da5 in main (argc=2, argv=0x0) at expr.c:329

So, for some reason, the eval5 function has jumped off into an invalid memory address, which of course causes a segfault. Repeating the test a few time confirmed that the crash was totally deterministic, with the same stack trace each time. But what is eval5 trying to do that’s causing it to jump off into nowhere? Let’s grab the source and find out:

[nelhage@psychotique]$ apt-get source coreutils
[nelhage@psychotique]$ cd coreutils-7.4/src/
[nelhage@psychotique]$ gdb --args expr 3 + 3
# Run gdb, wait for the segfault
(gdb) up
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
745           if (nextarg (":"))
(gdb) l
740       trace ("eval5");
741     #endif
742       l = eval6 (evaluate);
743       while (1)
744         {
745           if (nextarg (":"))
746             {
747               r = eval6 (evaluate);
748               if (evaluate)
749                 {

I used the apt-get source command to download the source package from Ubuntu, and ran gdb in the source directory, so it could find the files referred to by the debug symbols. I then used the up command in gdb to go up a stack frame, to the frame where eval5 called off into nowhere.

From the source, we see that eval5 is trying to call the nextarg function. `gdb` will happily tell us where that function is supposed to be located:

(gdb) p nextarg
$1 = {_Bool (const char *)} 0x401a70 <nextarg>

Comparing that address with the address in the stack trace above, we see that they differ by a single bit. So it appears that somewhere a single bit has been flipped, causing that call to go off into nowhere.

But why?

So there’s a flipped bit. But why, and how did it happen? First off, let’s determine where the problem is. Is it in the expr binary itself, or is something more subtle going on?

[nelhage@psychotique]$ debsums coreutils | grep FAILED
/usr/bin/expr                                                             FAILED

The debsums program will compare checksums of files on disk with a manifest contained in the Debian package they came from. In this case, examining the coreutils package, we see that the expr binary has in fact been modified since it was installed. We can verify how it’s different by downloading a new version of the package, and comparing the files:

[nelhage@psychotique]$ aptitude download coreutils
[nelhage@psychotique]$ mkdir coreutils
[nelhage@psychotique]$ dpkg -x coreutils_7.4-2ubuntu1_amd64.deb coreutils
[nelhage@psychotique]$ cmp -bl coreutils/usr/bin/expr /usr/bin/expr
 10113 377 M-^? 277 M-?

aptitude download downloads a .deb package, instead of actually doing the installation. I used dpkg -x to just extract the contents of the file, and cmp to compare the packaged expr with the installed one. -b tells cmp to list any bytes that differ, and -l tells it to list all differences, not just the first one. So we can see that two bytes differ, and by a single bit, which agrees with the failure we saw. So somehow the installed expr binary is corrupted.

So how did that happen? We can check the mtime (“modified time”) field on the program to determine when the file on disk was modified (assuming, for the moment, that whatever modified it didn’t fix up the mtime, which seems unlikely):

[nelhage@psychotique]$ ls -l /usr/bin/expr
-rwxr-xr-x 1 root root 111K 2009-10-06 07:06 /usr/bin/expr*

Curious. The mtime on the binary is from last year, presumably whenever it was built by Ubuntu, and set by the package manager when it installed the system. So unless something really fishy is going on, the binary on disk hasn’t been touched.

Memory is a tricky thing.

But hold on. I have 12GB of RAM on my desktop, most of which, at any moment, is being used by the operating system to cache the contents of files on disk. expr is a pretty small program, and frequently used, so there’s a good chance it will be entirely in cache, and my OS has basically never touched the disk to load it, since it first did so, probably when I booted my computer. So it’s likely that this corruption is entirely in memory. But how can we test that? Simple: by forcing the OS to discard the cached version and re-read it from disk.

On Linux, we can do this by writing to the /proc/sys/vm/drop_caches file, as root. We’ll take a checksum of the binary first, drop the caches, and compare the checksum after forcing it to be re-read:

[nelhage@psychotique]$ sha256sum /usr/bin/expr
4b86435899caef4830aaae2bbf713b8dbf7a21466067690a796fa05c363e6089  /usr/bin/expr
[nelhage@psychotique]$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
[nelhage@psychotique]$ sha256sum /usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  /usr/bin/expr
[nelhage@psychotique]$ sha256sum coreutils/usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  coreutils/usr/bin/expr

And behold, the file changed. The corruption was entirely in memory. And, furthermore, expr no longer segfaults, and matches the version I downloaded earlier.

(The sudo tee idiom is a common one I used to write to a file as root from a normal user shell. sudo echo 3 > /proc/sys/vm/drop_caches of course won’t work because the file is still opened for writing by my shell, which doesn’t have the required permissions).

Conclusion

As I mentioned earlier, I can’t prove this was due to a cosmic ray, or even a hardware error. It could have been some OS bug in my kernel that accidentally did a wild write into my memory in a way that only flipped a single bit. But that’d be a pretty weird bug.

And in fact, since that incident, I’ve had several other, similar problems. I haven’t gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands. But even with bad RAM, I’d guess that flipped bits come from noise somewhere — they’re just susceptible to lower levels of noise. So it could just mean I’m more susceptible to the low-energy cosmic rays that are always falling. Regardless of whatever the cause was, though, I hope this post inspires you to think about the dangers of your RAM corrupting your work, and that the tale of my debugging helps you learn some new tools that you might find useful some day.

Now that I’ve written this post, I’m going to go memtest my machine and check prices on ECC RAM. In the meanwhile, leave your stories in the comments — have you ever tracked a problem down to memory corruption? What are your practices for coping with the risk of these problems?

Edited to add a note that this could well just be bad RAM, in addition to a one-off cosmic-ray event.


Out-of-this-world convenience!

Memory errors didn’t force me to restart my system. Don’t let kernel updates force you to restart yours. Try Ksplice Uptrack, and win back your 3am maintenance window.

Share :
  • Twitter
  • Reddit
  • Digg
  • Facebook
  • del.icio.us
  • StumbleUpon