Reducing network interrupt overhead (pfSense/BSD)

unhappy_mage

I just bought a new box (Intel Atom 230-based) and I want to use it as a router. I'm running pfSense, which is easy to configure, but I'm getting lousy speeds through it. The box currently has four interfaces: re0/re1 are PCI Express Realtek 8111s, and bge0/bge1 are on a dual-port PCI-X card plugged into the board's PCI slot.

My internet connection is 100-megabit Ethernet, delivered over Cat5 by my university. Right now, though, I can only route about 70 megabits to on-campus destinations, and I think network traffic is killing the CPU. Here's a snippet from `top -S` to show what I mean:
Code:
  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
   27 root        1 -68    -     0K     8K CPU0   0   4:42 100.00% irq16: bge0
   11 root        1 171 ki31     0K     8K RUN    1   3:49 72.17% idle: cpu1
   18 root        1 -28    -     0K     8K WAIT   1   1:28 23.58% swi5: +
As you can see, bge0 is using an entire virtual core by itself, and swi5 is using the rest of the hyperthreaded core. This is with pfSense's WAN interface set to bge0 and LAN set to re0, but I see similar CPU load with every other interface assignment. I disabled USB on the board so it wouldn't share an IRQ with the network card, on the off chance that would help; no luck. I also tried turning off HT: then irq16 takes two-thirds of the CPU, swi5 takes the other third, and things run slower still.

Right now, as far as I can tell, the IRQs are shared like this:
IRQ 16: VGA re0 bge0
IRQ 17: re1 bge1
This mapping persists even when I set "first PCI card IRQ" to 10 in the BIOS.
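For anyone following along, FreeBSD can show per-device interrupt rates directly, which makes the sharing visible (standard base-system commands; output will obviously differ per box):

```shell
# Interrupt counts and rates per handler; devices sharing an IRQ
# show up together on the same irqNN line
vmstat -i

# The IRQ assignments are also logged at boot
dmesg | grep -i irq
```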

My brother has a machine with dual Pentium 3s that can route traffic faster than this, using Intel ethernet cards. Maybe I'll see if I can borrow a dual-port Intel card from somewhere.

Any suggestions other than "get real NICs"?
 
In real FreeBSD you can use polling(4) to eliminate interrupt load from a NIC. I'm not sure if it's possible with pfSense, because it requires building a custom kernel.

Otherwise, I'd check for interrupt moderation options on your NIC.
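For reference, on a stock FreeBSD kernel of that era the polling setup looks roughly like this (a sketch; it assumes a kernel built with `options DEVICE_POLLING`, and the exact knobs vary between releases):

```shell
# Kernel must be compiled with: options DEVICE_POLLING
# Enable polling per interface (FreeBSD 6.x-and-later syntax)
ifconfig bge0 polling

# Tunables that trade interactivity for throughput; user_frac is the
# percentage of each tick reserved for userland work
sysctl kern.polling.user_frac=50
sysctl kern.polling.burst_max=150
```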
 
Unfortunately, with that kind of throughput you'll probably need real network cards or a beefier CPU.
Also, that dual P3 being faster doesn't surprise me any, especially with the Intel network cards.
 
> In real FreeBSD you can use polling(4) to eliminate interrupt load from a NIC. I'm not sure if it's possible with pfSense, because it requires building a custom kernel.
>
> Otherwise, I'd check for interrupt moderation options on your NIC.
Polling makes CPU load drop to near zero; unfortunately, traffic drops to 30 Mbit/s or so.
> Unfortunately, with that kind of throughput you'll probably need real network cards or a beefier CPU.
> Also, that dual P3 being faster doesn't surprise me any, especially with the Intel network cards.
The Atom benchmarks faster than a single Pentium 3 at 1.13 GHz, according to this source, and an 866 MHz P3 is only about 75% as fast as that. FreeBSD's network stack seems to process interrupts single-threaded, so the second logical CPU doesn't have much impact. I'm guessing (hoping!) that the slow speeds are NIC- and not CPU-bound.

I borrowed a single-port Intel 82550EY card to see if that helps.
 
Here's some data using the Intel card. `top -S`:
Code:
  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
   35 root        1 -68    -     0K     8K CPU0   0   1:01 97.66% irq16: fxp0 uhci3
   11 root        1 171 ki31     0K     8K RUN    1   1:46 73.78% idle: cpu1
   18 root        1 -28    -     0K     8K WAIT   1   0:21 28.56% swi5: +
   12 root        1 171 ki31     0K     8K RUN    0   0:53  3.37% idle: cpu0
fxp0 is now bound to irq16, and pushing traffic through that interface pegs the CPU at 100%! In other words, the problem may be CPU horsepower after all. Toggling hardware checksum offload doesn't change anything. Turning on polling doesn't slow things down as much as before, but still substantially, and CPU usage still gets much lower.
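In case anyone wants to replicate the checksum test: the offload flags can be toggled from the shell as well as from the GUI (standard ifconfig(8) flags; fxp0 as in the output above):

```shell
# Disable hardware checksum offload on the Intel 10/100
ifconfig fxp0 -rxcsum -txcsum

# Re-enable it
ifconfig fxp0 rxcsum txcsum
```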
 
I've got an Atom dual-core board running pfSense 2.0-ALPHA here (FreeBSD 8.0 kernel). It can manage about 200 Mbit/s of routed throughput, with one core pegged and the other pretty much idle, via a single port on a PCI-X em0 NIC in a router-on-a-stick configuration. I think your CPU should be able to handle 100 Mbit/s. The dual-port NIC I'm using can be had new on eBay for about $30.

So my suggestion is 'get real NICs' :p. If you're running an old release, you might also try 1.2.3, as its FreeBSD 7.2 kernel has much improved hardware support.
 
> I've got an Atom dual-core board running pfSense 2.0-ALPHA here (FreeBSD 8.0 kernel). It can manage about 200 Mbit/s of routed throughput, with one core pegged and the other pretty much idle, via a single port on a PCI-X em0 NIC in a router-on-a-stick configuration. I think your CPU should be able to handle 100 Mbit/s. The dual-port NIC I'm using can be had new on eBay for about $30.
I ordered a dual-port Intel card (which presumably uses the em driver) from eBay. Maybe it'll be here by Christmas, but I can stand to wait.
> So my suggestion is 'get real NICs' :p. If you're running an old release, you might also try 1.2.3, as its FreeBSD 7.2 kernel has much improved hardware support.
I'm running 1.2.3, so perhaps it's worth trying the 2.0 alpha. Any opinions on whether a 64-bit or 32-bit release is more likely to go fast?
 
2.0 is still quite buggy. Once you get it set up the way you want it's pretty solid, but there are lots of bugs in the UI and so on that can make getting there difficult. I wouldn't really recommend it for anything important.

The 64-bit releases are brand spanking new; they've only started publishing them in the past week, and I don't really see how you'd get a performance gain that way, so I'd say stick to 32-bit for now.
 
Hmm, I think you can get the polling performance up a bit by increasing the kernel tick rate.
First, get the current rate with `sysctl kern.hz`, then put kern.hz=N in /boot/loader.conf for some larger value of N. Try doubling it, perhaps?
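Spelled out (stock FreeBSD paths; N=2000 here is just the suggested doubling):

```shell
# Check the current tick rate
sysctl kern.hz

# Double it at the next boot; loader tunables live in /boot/loader.conf
echo 'kern.hz=2000' >> /boot/loader.conf
```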
 
> 2.0 is still quite buggy. Once you get it set up the way you want it's pretty solid, but there are lots of bugs in the UI and so on that can make getting there difficult. I wouldn't really recommend it for anything important.
That's okay, this isn't a particularly important box. I can afford to mess around with it a bit.
> The 64-bit releases are brand spanking new; they've only started publishing them in the past week, and I don't really see how you'd get a performance gain that way, so I'd say stick to 32-bit for now.
Okay. I think I'll wait for the Intel card to get here before I experiment much more; that way hopefully I can spend less time overall on it.
> Hmm, I think you can get the polling performance up a bit by increasing the kernel tick rate.
> First, get the current rate with `sysctl kern.hz`, then put kern.hz=N in /boot/loader.conf for some larger value of N. Try doubling it, perhaps?
It's already set to 1000 Hz, which I would suspect to be sufficient. I'll give it a try when I mess around with the Intel card et al.
 
> It's already set to 1000 Hz, which I would suspect to be sufficient. I'll give it a try when I mess around with the Intel card et al.

Looking at it, you got about 30 Mbit/s with polling, which is, er, about 3750 bytes per tick at HZ=1000. That's just a couple of full-size packets, so I suppose the driver or the hardware is doing something silly. Try the Intel cards, indeed.
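The back-of-the-envelope math, for anyone checking (decimal megabits and HZ=1000 assumed):

```shell
# 30 Mbit/s of routed traffic spread over 1000 polling ticks per second
mbit=30
hz=1000
bytes_per_tick=$(( mbit * 1000000 / 8 / hz ))
echo "${bytes_per_tick} bytes per tick"   # 3750: only ~2.5 full 1500-byte frames
```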
 
I got my dual-port Intel card in the mail over the holiday break, and I plugged it in today. Without polling, a single HTTP download from on-campus runs at about 92 Mbit/s, and multiple concurrent streams push it to 96 or so (which I'm guessing is about all I should expect to get), with CPU usage around 65%. Turning on polling drops CPU usage to 35% and doesn't slow things down any.

Now all I need to do is figure out how to physically fit this thing into the case :p
 
Ah, good to hear - that's how it's supposed to act.
As for the case, I suggest taking a hacksaw to it. ;)
 
Not quite so simple: I'm trying to mount this inside a case that already has another machine in it :D Here's the problem (sorry for the phone pic in bad light):
[image: router-mount.jpg]

The card on top is the new Intel card; the one on the bottom is my SAS expander. The whole thing needs to come down until the lid closes against the top edge of the case that's visible there, but the Intel card hits the SAS expander before that's possible.

So, I'm looking at what it would take to move the SAS expander to a different slot. Don't you love it when a little change snowballs into a big change? :rolleyes:
 
Is there enough room on that Atom board to fit a riser for the network card and turn it horizontal over the board?
 
> Is there enough room on that Atom board to fit a riser for the network card and turn it horizontal over the board?

Not really. Here's the other side of the network card:
[image: card-other-side.jpeg]

Both the memory and the boot disk (the little black thingy is a SATA SSD) appear to stick up far enough to be a problem. The riser can't move the card to the left (from this perspective) either, because the back of the case is right there. (See the line coming out parallel to the edge of the board? That's where the fans on the Norco 4220 are.)

So, long story short, I think I'm gonna try moving the SAS expander and see if that helps, and if it doesn't I'm gonna take the Intel card out. It's good to at least know the solution, even if I can't implement it.
 
Ok, that's an impressively tight layout. What exactly are you building here, a router/storage solution implemented as one box / two PCs?
 
> Ok, that's an impressively tight layout. What exactly are you building here, a router/storage solution implemented as one box / two PCs?

Yep. The storage box is running OpenSolaris, and the router's running pfSense. Here's a list of components:
(server)
  • Norco 4220 case
  • Intel S3210SHLC board
  • 8GB ECC DDR2
  • Q6600
  • 80GB 2.5" disk (boot)
  • LSI SAS3442E-R controller
  • HP SAS expander
  • ACARD 9010B memory device (log device for ZFS)
  • 6*Hitachi 7k1000.B 1TB drives, raidz2
  • Dual-port Intel GigE card
(router)
The ACARD device and the router box are attached to the lid of the case, with long cables so I can easily take it apart. I made my own "standoffs" from #6 screws, washers, and short pieces of clear plastic tubing. I'm rather happy with how they turned out.

I'm planning to expand this box by buying an SSD (the Intel ones are quite nice, if nothing cheaper and equally good comes along) and six 2TB drives. Then I'll create a new pool, turn on dedup, and move the data from my current pool onto it. After that I can destroy the current pool and add its disks to the new one as a second vdev. A few years down the road I can replace the 1TB disks with 4 or 5TB ones, and if I need more capacity I can add a third set of six. It's quite a versatile setup.
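That plan maps onto ZFS commands roughly like this (a sketch: the new device names and the pool name `huge2` are made up, and dedup needs a sufficiently recent OpenSolaris build):

```shell
# New raidz2 pool on the six 2TB disks (hypothetical device names),
# with deduplication enabled
zpool create huge2 raidz2 c0t10d0 c0t11d0 c0t12d0 c0t13d0 c0t14d0 c0t15d0
zfs set dedup=on huge2

# Copy everything over, preserving snapshots and properties
zfs snapshot -r huge@migrate
zfs send -R huge@migrate | zfs recv -d huge2

# Retire the old pool, then add its 1TB disks back as a second vdev
zpool destroy huge
zpool add huge2 raidz2 c0t0d0 c0t2d0 c0t3d0 c0t4d0 c0t7d0 c0t8d0
```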
 
Good, glad you got the CPU usage sorted. I was going to say that I'm running an Atom 330 board with an onboard Atheros PCIe gigabit NIC for the WAN and an Intel PRO/1000 PT PCIe card for the LAN.

With pfSense 2.0 Beta-1 I push about 30 Mbps at around 14% CPU usage. I was skeptical of the Atheros and thought I might have an interrupt issue, but it's doing as well as the Intel.

I once had a VIA EPIA 500 MHz board with a few Realtek cards. It would max out on interrupts at 15-17 Mbps. No substitute for real NICs!!

Riley
 
What kind of speeds are you seeing with that system and ZFS on sustained reads and writes?

If you're interested, you could also check out the new Supermicro X7SPA-H: dual integrated Intel gigabit ports and a dual-core D510 CPU, with a PCIe slot. That, or the HF variant with IPMI, will probably be what I pick up (I'm debating this versus a VM box that combines file server/router/FTP and such). Wireless-N cards getting BSD support would probably speed up that decision, though.
 
> What kind of speeds are you seeing with that system and ZFS on sustained reads and writes?
Well, these benchmarks aren't necessarily applicable to anyone else's setup: the pool's 78% full, it's been in use for a while, and so on. It's also hard to get good read numbers because the box has 8GB of memory caching things. On top of that, there's been a problem in recent months with LSI cards and SAS expanders that causes messages like this to appear:
Code:
Jan 10 12:30:36 fs scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,2940@1c/pci1000,3080@0 (mpt1):
Jan 10 12:30:36 fs         Disconnected command timeout for Target 8
and no I/O happens for ~20 seconds before each such message.

That said, here are some benchmarks. I uncompressed a snapshot of the Linux kernel source, then timed extracting it locally:
Code:
$ sync; time (gtar xf linux-2.6.32.3.tar; sync)
Here's some output from `zpool iostat -v huge 5` while it ran:
Code:
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
huge        4.24T  1.19T    202  4.39K  25.3M  39.6M
  raidz2    4.24T  1.19T    202  4.39K  25.3M  39.5M
    c0t8d0      -      -     41    193  3.56M  11.1M
    c0t7d0      -      -     41    186  3.66M  10.5M
    c0t4d0      -      -     50    183  4.56M  10.5M
    c0t3d0      -      -     58    186  5.30M  10.8M
    c0t2d0      -      -     52    185  4.71M  10.7M
    c0t0d0      -      -     45    204  3.83M  10.7M
  c7d1         8K  1008M      0      1      0  21.5K
11.187 seconds elapsed.

Then I tried the same thing over NFS:
Code:
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
huge        4.24T  1.19T    121  1.97K  14.9M  11.6M
  raidz2    4.24T  1.19T    121      0  14.9M      0
    c0t8d0      -      -     35      0  2.25M      0
    c0t7d0      -      -     35      0  2.32M      0
    c0t4d0      -      -     41      0  3.04M      0
    c0t3d0      -      -     26      0  2.34M      0
    c0t2d0      -      -     31      0  2.43M      0
    c0t0d0      -      -     44      0  3.00M      0
  c7d1        12K  1008M      0  1.97K      0  11.6M
Here we see the advantage of the log disk. NFS I/O is synchronous (to guarantee consistency if the client or server crashes), so it gets committed to the log device. Here it's doing about 1,970 writes per second for a total of only 11.6 MB/s, or about 6 KB per write. Unpacking over NFS is still substantially slower (1:41.062), but not the blowout it used to be before the log device.
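The per-write size falls straight out of those two numbers (decimal MB assumed):

```shell
# 11.6 MB/s spread across ~1,970 synchronous NFS writes per second
bytes_per_sec=11600000
writes_per_sec=1970
echo $(( bytes_per_sec / writes_per_sec ))   # 5888 bytes, i.e. roughly 6 KB per write
```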

Sequential I/O is on the order of 150 MB/s for writes and 200 MB/s for reads, except during the stalls mentioned above. Using vdbench, I measured an appropriately large run of 1280k random reads at 50 MB/s, and the array does ~300 random 8k reads per second (a somewhat odd workload for raidz2, but it still manages that rate).

> If you're interested, you could also check out the new Supermicro X7SPA-H: dual integrated Intel gigabit ports and a dual-core D510 CPU, with a PCIe slot. That, or the HF variant with IPMI, will probably be what I pick up (I'm debating this versus a VM box that combines file server/router/FTP and such). Wireless-N cards getting BSD support would probably speed up that decision, though.
That does look like a nice board. The reason I bought an Atom, though, was that I could get a complete bundle for $130, and that board's $180.
 