Monday, November 24, 2008

Dallas One-Wire Temperature Network - Followup

I managed to get the little 5-volt 1 wire network I mentioned earlier, built up and running. Even fishing the wires wasn't too hard, "thanks" to the distinct lack of insulation in some of my walls. I did have to patch a section of the main run (I used a 100 ft run with short stubs to hold the sensors) when I put a staple through the cable while attaching it to the rafters. Doh!

The graphs make it look colder inside than it really is as I purposely put the sensors in all the cold spots. The kitchen has a zoned radiator that we can use if we are hanging out there, and the wild swings in my office are just the result of me closing the door when I'm not in there. The dining room is the warmest and I will eventually add a sensor there to put an upper bound on the data.



I didn't really like any of the pre-canned software options for it, so I rolled my own using xmgrace, digitemp, rsync, and cron. This is pretty crude, but it is a decent placeholder.

Monday, November 10, 2008

One-Wire Home Temperature Network with Linux - a prototype

So, I live in Maine. The locals say, quite correctly, that its the way life oughta be. The way we keep massive hordes away from our paradise is to make it kinda chilly 6 months of the year. The fact that I live in a circa 1830 farmhouse is a constant reminder of that.



Last winter I froze a kitchen water pipe in the basement. It was due to a gap in the foundation. I patched up the hole and insulated the pipe. (Oddly I inherited 90% of the water pipes insulated, but not this stretch.) The night it froze was not the coldest of the year (it was probably 20 above the coldest mark), but there was a strong wind and no snow on the ground so the cold air just swept through the gap onto the pipe. The rest of the winter passed without further incident.

I put a thermometer in the vicinity, but frankly the basement is kinda dark and creepy and you don't go venturing into the back in the winter without a reason. So the thermometer wasn't all that useful. The sensible thing to do would be to put a $20 wireless thermometer (e.g. AcuRite Digital Wireless Weather Thermometer Indoor Outdoor) down there with the readout in the kitchen.

But why do that, when we could go for geek overkill?

The wireless thermometer has some downsides: the receiver is clutter for something rarely used, I still have to remember to look at it, it has no log and the most interesting data is when I am sleeping, it only measures one place per piece of clutter, it lacks an alarm facility, and it needs batteries that inevitably will die in situ.

Clearly this is a job for a $500 computer instead, right?

It appears this is generally done with a "one-wire" network. 1-wire is a dallas semi standard for simple devices that can be powered and read with one wire. You daisy chain them together, typically using one wire from a piece of cat-5. You really need 2 wires, the second is for ground.

This is really neat stuff. There are sensors for temp, humidity, pressure, and even ground water for your garden. A bunch of places sell this stuff in kits, where the sensors come in a little box with an rj-45 "in" and an "out" if you want to chain another senor off of it. Or you can just buy the sensors for a lot less and solder the leads onto the ground and data wires wherever you want a reading. Each one has a serial number it reports as part of the 1-wire protocol so you can tell them apart, and its pretty easy to auto explore the chain and then power each sensor individually to get a reading (don't read them simultaneously).

The temperature gadgets, completely assembled wirth rj-45 connectors and little cases can run $25 each. But the sensors themselves are just $4 or less in single quantities. (Buy them by the thousand and you can get them for a buck.) I bought just the sensors.

In addition to the sensors, you also need a driver circuit to drive the power, do the polling, etc. There are several schematics for building them, but I gave into my software engineering side and bought a prebuilt one wire usb interface, for $28.



My primary interest is in the basement, but as long as I'm stringing a cable along the rafters I might as well measure a few different points. So I'll grab my office, the basement, the dining room, and the kitchen. I might stick the final sensor outside and cover it in shrinkwrap so there is a "control" number to compare the others to. We are heating this season with a pellet stove instead of central heat, so this information ought to help determine the effectiveness of the various fan placements I'm considering.

Once everything is in place the data can be captured a myriad of ways. The most common are the one wire filesystem, or by using digitemp. (google bait: if digitemp returns CRC errors, clear the configuration file it saves - this cost me hours of resoldering connections that were just fine.) After that normal linux graphing software can go to town.

I need to order a couple more sensors to layout the final network - I wanted to build a prototype first. It was easy enough.

To do it you'll need: an RJ-11 crimping tool (not rj-45 for ethernet), an rj-11 end. Some cat-5 or cat-3 that is long enough to run your network, and a soldering tool.

The way I wired it, only the middle two wires matter for the crimp. I have blue/white on the left and blue on the right as viewed with the clip down and the contacts away from you. We're going to use white for the signal and blue for the ground. Attach one of the sensors to the end of the cable. The signal pin is the middle one, and the ground pin is on the left. (Left is defined with the flat side of the sensor facing up and the leads pointed down. The right lead is not used, you can trim it off (I've just bent it out of place in this picture).



With the end sensor in place, you can add as many more as you like along the line by just stripping the insulation in place and soldering the leads right onto the cable wherever they need to go. This 1-wire stuff is extremely forgiving of my attempts to pretend that I know what I'm doing with electronics hardware.



Obviously you need to tape up or shrink wrap all the joints, I left them open on the prototype for photos. With this in place, digitemp -a -r 800 happily reports two different sensors within half a degree of each other. Huzzah!

Now its off to grab the sensors I need for the real thing, installing the cable in the basement along with the other probe points, and getting a graph and alarm server going on the IP network. Such fun!

Saturday, November 8, 2008

DNS Prefetching for Firefox

Recently I implemented patches to implement DNS prefetching in Firefox. I am primarily interested in their impact on Fennec (aka Mobile Firefox), but it looks like they will land first in Firefox 3.1 beta2. The, hopefully final, glitches are being shaken out of the patch now.

Google Chrome has a feature like this too.

DNS resolutions are always dominated by latency instead of bandwidth. Particularly on mobile networks the latencies are very high. That makes them perfect candidates for speculative pre-fetching. The advantage is in the latency improvement - instead of waiting for a hostname lookup when you click a link, do that lookup while you're reading the page the link is embedded in. Because the lookups are so small (generally one runt packet in and out) the cost of any wasted over optimistic lookups really doesn't impact the performance of browsing. Good payoff at low cost, the best of both worlds.

The basic benefit is simple: if you click on a link using a new hostname, you save a round trip time. On some networks this can be a substantial improvement (800ms or more) in responsiveness. Some describe this simply as "figuring out the IP address of every link before you click on it".

The Firefox implementation takes this approach one step further than just pre-resolving anchor href hostnames. It uses the prefetch logic on URLs that are being included in the current document. By this I mean that it uses the prefetch logic on things like images, css, and jscript that are being loaded right away, in addition to anchor links which might be clicked on at a slightly later time.

At first that seems non-sensical. How can you pre-fetch the DNS for something you are fetching right now? Where does the "pre" come in? The answer is not so much in the definition of "pre" as it is in the definition of "right now". Most HTTP User Agents, Firefox being no exception, limit their number of simultaneous connections and hosts. Typical pages embed quite a few objects and it is easy to run into these limits. When this happens the browser queues some of the requests. The Firefox pre-fetch DNS implementation allows those queued requests to overlap the high latency host resolution with whatever transfers might be going on without creating an excess level of parallelization.

While this is just a secondary benefit, it can be meaningful. For example, on the day I grabbed a snapshot of http://planet.mozilla.org/ it required 23 unique DNS resolutions in order to render the base page. Most of these were in img URLs. When loading the page with the prefetch patches, even with a cold cache, 16 of them were either fully completed when needed for the first connection or at least already in progress. The result, measured on an EDGE network, was a 4% overall improvement in page load time. Not bad for something that does not reduce bandwidth consumption in any way.

Configuration

Basically, it just works. You don't need to do anything. But there are a few configurables out there for both browsers and content providers.

First, as a browser you might want no part of this. Fair enough - its your browser. If you set the preference network.dns.disablePrefetch to true the prefetch code will never take effect, no matter what any other configuration is set to.

Furthermore, as a security measure, prefetching of embedded link hostnames is not done from documents loaded over https. If you want to allow it in that context too, just set the preference network.dns.disablePrefetchFromHTTPS to true.

Content providers have a couple neat tricks available too. These are meant to be compatible with Chrome.

For content to opt out of DNS pre-resolution it can be served with the x-dns-prefetch-control HTTP header set to off. The equivalent meta http-equiv element can be used instead of a response header too:
<meta http-equiv="x-dns-prefetch-control" content="off">

Setting content to on will reverse the effect. You can never turn pre-fetching on in a browser that has it disabled by preference, but you can undo the impact of a previous x-dns-prefetch control command. In this way, different content provider policies can apply to different portions of the document.

The last configuraton possibility allows the content provider to force the lookup of a particular hostname without providing an anchor using that name. This is done with the link tag:
<link rel="dns-prefetch" href="http://www.spreadfirefox.com/">

The href attribute can contain a full URL, or just a hostname. Hostname only attributes should preceed the hostname with two slashes:
<link rel="dns-prefetch" href="//www.spreadfirefox.com">

Content providers might use the link notation in a site-wide home page in order to preload hostnames that are widely used throughout the site but perhaps not on the home page.

Sunday, September 28, 2008

Asynchronous DNS lookups on Linux with getaddrinfo_a()

A little while back I posted about a bug in the glibc getaddrinfo() implementation which resulted in many CNAME lookups having to be repeated. At that time I teased a future post on the topic of the little known getaddrinfo_a() function - here it is.

type "man getaddrinfo_a". I expect you will get nothing. That was the case for me. Linux is full of non-portable, under-documented, but very powerful interfaces and tools. The upside of these tools is great - I recently referred to netem and ifb which can be used for easy network shaping, and interfaces like tee(), splice() and epoll() are also hugely powerful, but woefully underutilized. I always get a thrill when I stumble across one of these.

Part of the reason for their low profile is portability. And there are times when that matters - though I think it is cited as a bedrock principle more than is really necessary. I think the larger reason is that some of these techniques lack the documentation, web pages, and references in programmer pop-culture necessary to be ubiquitously useful.

Maybe this post will help getaddrinfo_a find its mojo.

This little jewel is a standard part of libc, and has been for many years - you can be assured that it will be present in the runtime of any distribution of the last several years.

getaddrinfo_a() is an asynchronous interface to the DNS resolution routine - getaddrinfo(). Instead of sitting there blocked while getaddrinfo() does its thing, control is returned to your code immediately and your code is interrupted at a later time with the result when it is complete.

Most folks will realize that this is a common need when dealing with DNS resolution. It is a high latency operation and when processing log files, etc, you often have a need to do a lot of them at a time. The asynchronous interface lets you do them in parallel - other than the waiting-for-the-network time, there is very little CPU or even bandwidth overhead involved in a DNS lookup. As such, it is a perfect thing to do in parallel. You really do get linear scaling.

The best documentation seems to be in the design document from Ulrich Drepper. This closely reflects the reality of what was implemented. Adam Langley also has an excellent blog post with an illustration on how to use it. Actually, the header files are more or less enough info too, if you know that getaddrinfo_a() even exists in the first place.

The good news about the API is that you can submit addresses in big batches with one call.

The bad news about the API is that it offers callback either via POSIX signal handling, or by spawning a new thread and running a caller supplied function on it. My attitude is generally to avoid making signal handling a core part of any application, so that's right out. Having libraries spawn threads is also a little disconcerting, but the fact that that mechanism is used here for the callback is really minor compared to how many threads getaddrinfo_a() spawns internally.

I had assumed that the invocation thread would send a few dns packets out onto the wire and then spawn a single thread listening for and multiplexing the responses.. or maybe the listening thread would send out the requests as well and then multiplex the responses. But reading the code shows it actually creates a pretty sizable thread pool wherein each thread calls and blocks on getaddrinfo().

This is more or less the technique most folks roll together by hand, and it works ok - so it is certainly nice to have predone and ubiquitously available in libc rather than rolling it by hand. And it is ridiculous to code it yourself when you are already linking to a library that does it that way. But it seems to have some room for improvement internally in the future.. if that happens, its nice to know that at least the API for it is settled and upgrades should be seamless.

One last giant caveat - in libc 2.7 on 64 bit builds, getaddrinfo_a() appears to overflow the stack and crash immediately on just about any input. This is because the thread spawned internally is created with a 16KB stack which is not enough to initialize the name resolver when using 64 bit data types. Oy! The fix is easy, but be aware that some users may bump into this until fixed libcs are deployed.

Tuesday, September 23, 2008

Simulating Wireless Networks With Linux

I have been working on enhancing network performance for the upcoming Firefox on Mobile. I jokingly refer to this work as "finding useful things to do with latency" as the browser is targeted at cell phone networks. These networks have latencies of hundreds, sometimes over a thousand, milliseconds.

From time to time I hope to talk on this blog about interesting things I have found or done while looking into this.

One of the cool things I consed up in the effort is a python script to emulate one of these networks over localhost. Just run the script, along with an XML file that describes the network you're looking to simulate, and then you can run any networking application you want across localhost to measure the impact of any potential changes you want to make.

The script relies on netem and ifb. In that sense, it doesn't really add anything fundamental by itself. Those are outstanding, but poorly understood tools.

By rolling that together in a script, and providing XML profiles for 3G, edge, bluetooth, evdo, hspd, and gprs wireless networks I was able to provide a meaningful testbed for evaluating default preferences for concurrency and pipeline depth, as well as the impact of changes to DNS pre-fetching and the pipelining implementation. All good stuff. Some of them need their own posts.

If you're interested in the tool - this is the release announcement. It is bundled as part of my local copy of the firefox development tree, but the tool is easily separable from that for use on something else.

Thursday, September 11, 2008

The minutia of getaddrinfo() and 64 bits

I have been spending some time recently improving the network behavior of Firefox in mobile (i.e. really high latency, sort of low bandwidth) environments.

The manifestations du jour of that are some improvements to the mozilla DNS system.

In that pursuit, I was staring at some packet traces of the DNS traffic generated from my 64 bit linux build (using my LAN instead of the slow wireless net) and I saw this gem:




That is the same request (and response) duplicated. Doing it an extra time appears to cost all of .3ms on my LAN, but on a cell phone that could delay resolution (and therefore page load) time by a full second - very noticeable lag.

I started by combing through the firefox DNS code looking for the bug I assumed I had accidentially put in the caching layer. But I confirmed there was just one call to libc's getaddrinfo() being made for that name.

Then I figured it was some kind of large truncated DNS record from blogspot which necessitated a refetch. Looking further into it the record was really quite normal. The response was just 127 bytes and exactly the same each time - it contained 2 response records: one A record and one CNAME record. Both had reasonably sized names.

I found the same pattern with another CNAME too: 2 out of 6.

And so the debugging began in earnest. Cliff Stoll was looking for his 75 cents, and I was out to find my extra round trip time.

I did not find an international conspiracy, but after whipping together a debuggable libc build I determined that the DNS answer parser placed a "struct host" and some scratch space for parsing into a buffer passed into it from lower on the stack. If the answer parser couldn't parse the response in that space an "ERANGE" error was returned and the caller would increase the buffer size and try again. But the try again involved the whole network dance again instead of just the parsing.

So what I was seeing was that the original buffer of 512 bytes was too small, but a second try with 1024 worked fine. Fair enough, it just seems like an undersized default to fail at such a common case.

And then it made sense. For most of its life, it hasn't been undersized - while the DNS response hasn't change the "struct host" did when I went to 64 bit libraries. struct host is comprised of 48 pointers and a 16 byte buffer. On a 32 bit arch that's 208 bytes, but with 8 byte pointers it is 400. With a 512 byte ceiling, 400 is a lot to give up.

64 bit has a variety of advantages and disadvantages, but an extra RTT was a silent-penalty I hadn't seen before.

This patch fixes things up nicely.

This is good concrete opportunity to praise the pragmatism of developing on open source. It is not about the bug (everyone has them, if this even is one - it is borderline), it is about the transparency. If this was a closed OS, the few avenues available to me would have been incredibly onerous and possibly expensive. Instead I was able to resolve it in an afternoon by myself (and mail off the patch to the outstanding glibc development team).

At some future time, I'll have a similarly thrilling story about the little known but widely deployed getaddrinfo_a().

Friday, May 30, 2008

Firefox Add-Ons - webfolder webDAV add-on updated for Firefox 3

I have had reason to do a little server side WebDAV work these days.

WebDav clients for Linux aren't all that common. There is a filesystem gateway (davfs), and cadaver. Cadaver is a decent command line app; like ncftp.

And more recently, I was introduced to webfolders. This is a addon for firefox that does a perfectly good job of creating a file manager interface for DAV sites.

Point one of this post: webfolders is cool - download it yourself.

Point two of this post: webfolders as listed on the website does not support firefox 3. And of course you are using firefox 3.

Point three of this post: you really ought to be using firefox 3 - it is much faster than firefox 2.

Point four of this post: I want it all - so I have done the trivial work of updating webfolders for firefox 3. Just changed a few constants and paths, and life is good. It is here for download (any OS!)

Friday, April 18, 2008

Measuring performance of Linux Kernel likely() and unlikely()

A little while back I wrote about how prominent likely() and unlikely() are in the Linux kernel, and yet I could not find any performance measurements linked to them.

Today I made some measurements myself.

But first a quick review - likely and unlikely are just macros for gcc's __builtin_expect(), which in turn allows the compiler to generate code compatible with the target architecture's branch prediction scheme. The GCC documentation really warns against using this manually too often:

You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
The kernel certainly makes liberal use of it. Accroding to LXR 2.6.24 had 1608 uses of likely and 2075 uses of unlikely in the code. LXR didn't have an index of the just released 2.6.25 yet - but I'd bet it is likely to be more now.

My methodology was simple, I choose several benchmarks commonly used in kernel land and I ran them against vanilla 2.6.25 and also against a copy I called "notlikely" which simply had the macros nullified using this piece of programming genius:




The tests I ran were lmbench, netperf, bonnie++, and the famous "how fast can I compile the kernel?" test.

The test hardware was an all 64 bit setup on a 2.6Ghz core-2 duo with 2GB of ram and a SATA disk. Pretty standard desktop hardware.

The core 2 architecture has a pretty fine internal branch prediction engine without the help of these external hints. But with such extensive use of the macros (3500+ times!), I expected to see some difference shown by the numbers.

But I didn't see any measurable difference. Not at all.

Not a single one of those tests showed anything that I wouldn't consider overlapping noise. I had 3 data points for each test on each kernel (6 points per test) and each test had several different facets. Out of dozens of different facets, there wasn't a single criteria where the measurement was always better or worse on one kernel.

And this disappoints me. Because I like micro optimizations damn it! And in general this one seems to be a waste of time other than the nice self documenting code it produces. Perhaps the gcc advice is correct. Perhaps the Core-2 is so good that this doesn't matter. Perhaps there is a really compelling benchmark that I'm just not running.

I say it is a waste in general because I am sure there are specific circumstances and code paths where this makes a measurable difference. There certainly must be a benchmark that can show it - but none of these broad based benchmarks were able to show anything useful. That doesn't mean the macro is over used, it seems harmless enough too, but it probably isn't worth thinking too hard about it either.

hmm.

Monday, April 14, 2008

Monitoring IP changes with NETLINK ROUTE sockets

Yesterday I read a mailing list query asking how to get event driven Linux IP address changes (i.e. without having to poll for them).

I agreed with the attitude of the post. The most important thing about scaling is to make sure the work your code does is proportional to the real event stream. Seems obvious enough, but lots of algorithms screw up that basic premise.

Any time based polling algorithm betrays this scaling philosophy because work is done every tick independent of the events to be processed. You're always either doing unnecessary work or adding latency to real work by waiting for the next tick. The select() and poll() APIs also betray it as these are proportional to the amount of potential work (number of file descriptors) instead of the amount of real work (number of active descriptors) - epoll() is a better answer there.

Event driven is the way to go.

Anyhow, back to the original poster. I knew netlink route sockets could do this - and I had used them in the past for similar purposes. I had to get "man 7 netlink"and google going to cobble together an example and only then did I realize how difficult it is to get started with netlink - it just has not been very widely used and documented.

So the point of this post is to provide a little google juice documentation for event driven monitoring of new IPv4 addresses using netlink. At least this post has a full example - the code is below.

If you need to use this functionality - I recommend man 3 and 7 of both netlink and rtnetlink.. and then go read the included header files, and use my sample as a guide. In this basic way you can get address adds, removals, link state changes, route changes, interface changes, etc.. lots of good stuff. It is at the heart of the iproute tools (ip, ss, etc..) as well most of the userspace routing software (zebra, xorp, vyatta, etc..).

Tuesday, April 8, 2008

IP Georeferencing

IP Georeferencing is a pretty cool toolbox item on today's web. Simply put, its the process of taking an IP address and converting it to some geographical (city, lat, long, whatever) information.

It gets commonly used for security filters, log analysis, ad targeting, community building, etc..

Maxmind deserves a shout out for making available databases to do this. There is nothing inherent about an IP number that gives you this information, so they just ned to build out of band databases to get the job done. They have two copies of the databases, one for free and one a bit more accurate that is priced very reasonably.

The databases come with a number of different libraries for using them: C, Java, PHP, etc.. The libraries are released under the LGPL.

Recently I was doing a project that needed to lookup scads and scads of addresses, so I put a little muscle into improving the lookup routines in the C code.

I'm happy to say I was able to improve things anywhere from 2x to 8x in terms of overall lookups, depending on what exactly was being lookedup. There were a bunch of changes, but the primary one was the addition of a variable length radix lookup mechanism that changed the average number of comparisons from 27 to 4 - not rocket science, but the right tool for the job.

I'm even more happy to say I sent those back as LGPL contributions of my own. The code is on my open source contribution web page.

Saturday, April 5, 2008

Linux Selective Acknowledgment (SACK) CPU Overhead

Last year I tossed an e-mail from the Linux kernel networking list in my "projtodo" folder.

The mail talked about how the Linux TCP stack in particular, and likely all TCP stacks in general, likely had an excessive-CPU attack exposure when confronted with malicious SACK options. I found the mail intriguing but unsatisfying. It was well informed speculation but didn't have any hard data, nor was there any simple way to gather some. Readers of the other posts on this blog will know I really dig measurements. The issue at hand was pretty obviously is a problem - but how much of one?

A few weeks ago I had the chance to develop some testing code and find out for myself - and IBM DeveloperWorks has published the summary of my little project. The executive summary is "its kinda bad, but not a disaster, and hope is on the way". There is had data and some pretty pictures in the article itself.

The coolest part of the whole endeavor, other than scratching the "I wonder" itch, was getting to conjure up a userspace TCP stack from raw sockets. It was, of course, woefully incomplete as it was just meant to trigger a certain behavior in its peer instead of being generally useful or reliable - but nonetheless entertaining.

Thursday, April 3, 2008

Linux Kernel - likely() not measured?

The other day on kernelnewbies, the able Robert Day wondered whether or not anyone had quantified the effects of the likely() and unlikely() macros scattered all over the Linux kernel.

He received no less than 5 replies telling him how the macros worked. (If you are curious, this is the best explanation) - but not a single piece of measurement or other characterization.

LXR shows over 3500 uses of those macros in the kernel, and nobody has any data for any scenario? Wowzers.

Doing before/after benchmarks with those macros changed to nops would be an interesting project. Could use the usual suspects of linux kernel performance enhancements to test (lmbench, compile test, some kind of network load generator, etc..)

Comments with pointers to data would be cool.

Monday, March 10, 2008

The Everything Old Is New Again Meme

I've been struck lately by the dismissive comments in technology reviews along the lines of "of course this is really just a new skin on a really old idea". They are particularly striking, because the reviews are essentially overall positive about the technology they just dismissed as a rehash! If they really were just recycled implementations of old worn ideas, then why are we so excited now - why waste the bits talking about them?

I'm left thinking that there just isn't enough respect for a quality implementation that takes into account the real needs of the current market. To some, it is all about the idea. I'm the last to diss a great idea, but let's face it - folks who think the idea is everything tend to overlook various little bits of reality that don't mesh well with their idea (we call those implementation details) and in general just commoditize the implementation. Ideas like this make great conversations but crappy products.

The truth is these next generation technologies are usually quality implementation with some of the substantial "implementaion details" overcome. To trivialize those details is a mistake - the difficulty of a quality implementation is often overlooked by folks who think the main idea is everything. Often they require some real innovative thinking on their own. Anybody who has taken a tour of duty in one (or two or three) startups will tell you that neither idea nor execution are to be taken for granted - this is hard stuff when you're blazing new trails.

A couple common examples:
Thinking about just these two examples, it isn't hard to see why they are breaking through now instead of the earlier incarnations made reference to. While market conditions and external factors have changed, it isn't simply that their train has arrived and they were dusted off for the occasion - real work has made them better.

Tuesday, March 4, 2008

Calgary IOMMU - At What Price?

The Calgary IOMMU is a feature of most IBM X-Series (i.e. X86_64) blades and motherboards. If you aren't familiar with an IOMMU, it is strongly analogous to a regular MMU but applied to a DMA context. Their original primary use was for integrating 32 bit hardware into 64 bit systems. But another promising use for them is enforcing safe access in the same way an MMU can.

In normal userspace, if a program goes off the rails and accesses some memory it does not have permissions for a simple exception can be raised. This keeps the carnage restricted to the application that made the mistake. But if a hardware device does the same thing through DMA, whole system disaster is sure to follow as nothing prevents the accesses from happening. The IOMMU can provide that safety.

An IOMMU unit lets the kernel setup mappings much like normal memory page tables. Normal RAM mappings are cached with TLB entries, and IOMMU maps are cached TCE entries that play largely the same role.

By now, I've probably succeeded in rehashing what you already knew. At least it was just three paragraphs (well, now four).

The pertinent bit from a characterization standpoint is a paper from the 2007 Ottawa Linux Symposium. In The Price of Safety: Evaluating IOMMU Performance Muli Ben-Yehuda of IBM and some co-authors from Intel and AMD do some measurements using the Calgary IOMMU, as well as the DART (which generally comes on Power based planers).

I love measurements! And it takes guts to post measurements like this - in its current incarnation on Linux the cost of safety from the IOMMU is a 30% to 60% increase in CPU! Gah!

Some drill down is required, and it turns out this is among the worst cases to measure. But still - 30 to 60%! The paper is short and well written, you should read it for yourself - but I will summarize the test more or less as "measure the CPU utilization while doing 1 Gbps of netperf network traffic - measure with and without iommu". The tests are also done with and without Xen, as IOMMU techniques are especially interesting to virtualization, but the basic takeaways are the same in virtualized or bare metal environments.

The "Why so Bad" conclusion is management of the TCE. The IOMMU, unlike the TLB cache of an MMU, only allows software to remove entries via a "flush it all" instruction. I have certainly measured that when TLBs need to be cleared during process switching that can be a very measurable event on overall system performance - it is one reason while an application broken into N threads runs faster than the same application broken into N processes.

But overall, this is actually an encouraging conclusion - hardware may certainly evolve to give more granular access to the TCE tables. And there are games that can be played on the management side in software that can reduce the number of flushes in return for giving up some of the safety guarantees.

Something to be watched.

Thursday, February 28, 2008

Appliances and Hybrids

In my last post, which was about the fab book Network Algorithmics, I mentioned intelligent in-network appliances. I also mentioned that "the world doesn't need to be all hardware or all software".

I believe this to be an essential point - blending rapidly improving commodity hardware with a just a touch of custom ASIC/FPGA components, glued together by a low level architecture aware systems programming approach makes unbelievably good products that can be made an produced relatively inexpensively.

These appliances and hardware are rapidly showing up everywhere - Tony Bourke of Load Balancing digest makes a similar (why choose between ASIC or Software?) post today. Read it.

Wednesday, February 27, 2008

Network Algorithmics

In late 2004 George Varghese published an amazing book on the design and implementation of networking projects: Network Algorithmics

You won't find the boilerplate "this is IP, this is TCP, this is layer blah" crapola here. What makes this special is that it describes the way the industry really builds high end switches, routers, high end server software and an ever growing array of intelligent in-network appliances. The way that is really done, is very different than presented in a classic networking text book.

The material is not algorithms, rather it is a way of looking at things and breaking down abstraction barriers through techniques like zero copy, tag hinting, lazy evaluation, etc.. It makes the point that layers are how you describe protocols - not how you want to implement them. The world doesn't have to be all hardware or all software, this helps train your mind on how to write systems that harmonize them both by taking the time to really understand the architecture.

To me, this book is a fundamental bookshelf item. You don't hear it often mentioned with the likes of Stevens, Tanenbaum, and Cormen - but amongst folks in the know, it is always there. More folks ought to know.

Friday, February 15, 2008

SLA as medians, percentiles, or averages - as told by Amazon's Dynamo

I am going to recommend another published paper because of how it talks about characterizing its own performance. That's not the point of the paper, but I found it really interesting anyhow.

The paper is Dynamo: Amazon’s Highly Available Key-value Store, by a number of good folks over at Amazon.com (including the CTO Werner Vogels). It was published in the peer reviewed SOPS '07.

The bulk of the paper is about Dynamo, which apparently is a home grown key-value storage system that has a lot of inherent replication, scalability, and reliability. A good read.

Instead of the big picture I wanted to focus on a detail. The paper rejects the notion of defining SLAs with median expectations. Instead, Dynamo uses 99.9 percentiles. (It also rejects using means, but that is pretty nonsensical anyhow, isn't it?). The central idea is that the SLA defines acceptable usage for all users - not just for half of them (the median).

This matters in real life in a very real way. There is normally one version of an operation and then an ala carte menu of choices they might layer on that.. lots of folks use the vanilla version, but if 10, 20, 30, or even 40 percent of users are using some features that require extra processing - they are totally left out of the SLA. The canonical Amazon case is a user with a very long purchasing history - an important case but not the average case. 99.9 installs a threshold for a reasonable definition of "everybody" while still leaving room for the occasional pathological case. The authors point out that for more money you can have more nines, but the law of diminishing returns certainly applies.

You have to wonder if after today's S3 outage they wish they had bought another nine or two ;) (I have no reason to think that had anything to do with Dynamo - I'm just poking fun - what Amazon has built for S3/EC2 is very impressive)

I was happily nodding along as I read the paper when this came up:
Dynamo is built for latency sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds
My initial reaction was not charitable: Whoa - that's not really setting the bar very high, is it guys? Hundreds of millis to read and write a key/value pair? You say yourself in the introductory pages that these services are often layered on top of each other! Sure, it is more latency sensitive than an overnight data warehouse operation, but that's hardly an impressive responsiveness threshold.

But, I was too harsh. I hadn't internalized what shifting from the 50th percentile to the 99.9th really meant. The value isn't representative of what the typical user will see - it is representative of the worst you can stomach. In effect, it loses its marketing value - which is how a lot of SLAs are used in the real world.

The Dynamo paper backs this up. Figure 4 shows both the average and 99.9 percentiles for both reads and writes. average reads ran around 15ms, writes around 25. 99.9 percentiles were respectively in the 150 and 250 neighborhoods. This all makes much more sense, especially when dealing with disk drives.

In the end, I think you need more than one datapoint in order to make an effective characterization. Amazon clearly wouldn't be happy if everybody was seeing 150ms latencies - though they can stomach it if literally it is a one in a thousand kind of occurrence.

Maybe SLAs should be expressed as 4 tuples of 25/50/75/99.9 .. I've developed benchmarks that way and felt that helped keep the subsequent optimizations honest.

Even 90/99.9 is mostly what you need to keep a cap on the outliers while still getting a feel for what somebody is likely to see.

Thursday, January 31, 2008

AjaxScope Paper Provides Javascript Characterization

Emre Kıcıman and Benjamin Livshits from Microsoft Research present some interesting data in their SOPS paper - AjaxScope: A Platform for Remotely Monitoring the Client-Side Behavior of Web 2.0 Applications.

The paper is mostly about AjaxScope, a neat insturmentation and profiling tool for JavaScript.

What I want to highlight, though, are some of the measurements they present for both IE (6 and 7) as well as Firefox (2). This is real data in a refereed journal of the ACM, not a Gartner-style whitepaper.

Among the interesting nuggets are IE's 35x slower performance in String cat operations, and Firefox's 4x slower Array join execution time. The authors also put the intrinsics into context by measuring the performance of common portal pages - IE beats Firefox on msn.com, but Firefox turns the tables on Yahoo!.

Lots more interesting data, and a useful tool, in the paper. Read it.

Saturday, January 19, 2008

Web Syndication Format Market Share

For quite a while my todo list has had an instruction to find and characterize the popularity of RSS vs ATOM . Which syndication format is more popular?

Atom seems on the face of it to be a better format than RSS, but some of what it addresses are not really wide spread problems for operations. Market share will tell if it was a solution looking for a real problem or not. Atom is about 2 years old - and it is pretty common to see atom feeds available around the net now.

Measuring the breakdown among my own set of feeds that I read isn't terribly useful. I have a bias in my selections - it isn't like measuring my connectivity or transport properties where I am representative as a sample.

For the record: I have 112 feeds, 54 of them in atom and 58 in some kind of rss.

The best information I could find was from syndic8.com. But frankly, it wasn't very satisfying. The site didn't feel very complete, and in the end only showed essentially the ratio between RSS and Atom offerings. They listed about 1/2 a million feeds - 82% of which were some flavor of RSS.

What I want to know is the ratio between active usages (i.e. fetches) of the two formats. Lots of sites offer both formats - but which do users actually consume?

Feedburner clearly has this info - but I couldn't find it published anywhere.

Does anybody have more information?

Thursday, January 17, 2008

Characterization Zealots Unite!

The eponymous Kode Vicious over at ACM's Queue magazine has an excellent rant on the value of measuring instead of assuming. I read it in print a ways back, now that it is in digital form it deserved a blog shoutout.