ZFS Monster - Phase III - 3200MB/s Write

packetboy

Here's an iostat while running a filebench 500GB write test:


Code:
                            capacity     operations    bandwidth
pool                      used  avail   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
monster                   146G  86.9T      1  26.1K  83.1K  3.20G
  c0t5000CCA221D6F680d0  3.05G  1.81T      0    556      0  68.6M
  c0t5000CCA221D33A00d0  3.05G  1.81T      0    557  4.89K  68.4M
  c0t5000CCA221D40FB0d0  3.05G  1.81T      0    559      0  68.9M
  c0t5000CCA221D86A2Dd0  3.04G  1.81T      0    558      0  68.7M
  c0t5000CCA221D87AEEd0  3.05G  1.81T      0    557      0  68.6M
  c0t5000CCA221D87B16d0  3.06G  1.81T      0    565  9.78K  68.2M
  c0t5000CCA221D332FCd0  3.05G  1.81T      0    564      0  68.8M
  c0t5000CCA221D419C9d0  3.03G  1.81T      0    560  9.78K  68.4M
  c0t5000CCA221D420A3d0  3.04G  1.81T      0    565  4.89K  68.5M
  c0t5000CCA221D420E5d0  3.03G  1.81T      0    554      0  68.1M
  c0t5000CCA221D420EAd0  3.04G  1.81T      0    549      0  67.9M
  c0t5000CCA221D420F9d0  3.04G  1.81T      0    551      0  68.0M
  c0t5000CCA221D420FBd0  3.04G  1.81T      0    552      0  68.1M
  c0t5000CCA221D1050Fd0  3.04G  1.81T      0    559      0  68.6M
  c0t5000CCA221D3121Cd0  3.01G  1.81T      0    555      0  68.1M
  c0t5000CCA221D4212Ed0  3.03G  1.81T      0    558  4.89K  68.0M
  c0t5000CCA221D33315d0  3.03G  1.81T      0    550      0  67.8M
  c0t5000CCA221D42042d0  3.04G  1.81T      0    565      0  68.8M
  c0t5000CCA221D42049d0  3.04G  1.81T      0    558      0  68.7M
  c0t5000CCA221D42054d0  3.03G  1.81T      0    552  4.89K  67.9M
  c0t5000CCA221D42102d0  3.05G  1.81T      0    555      0  68.6M
  c0t5000CCA221D42116d0  3.04G  1.81T      0    557      0  68.5M
  c0t5000CCA221D87517d0  3.03G  1.81T      0    564  4.89K  68.5M
  c0t5000CCA221DA8BD0d0  3.06G  1.81T      0    563  4.89K  69.0M
  c0t5000CCA221DA9542d0  3.06G  1.81T      0    560  4.89K  68.9M
  c0t5000CCA221DAB8EBd0  3.04G  1.81T      0    554  4.89K  68.2M
  c0t5000CCA221DABAA2d0  3.04G  1.81T      0    558      0  68.0M
  c0t5000CCA221DAD356d0  3.03G  1.81T      0    549      0  67.7M
  c0t5000CCA221DB27EDd0  3.07G  1.81T      0    559      0  68.9M
  c0t5000CCA221DB083Cd0  3.06G  1.81T      0    560      0  68.4M
  c0t5000CCA221DC3FA6d0  3.02G  1.81T      0    562  9.78K  68.2M
  c0t5000CCA221DF7FBBd0  3.05G  1.81T      0    557      0  68.2M
  c0t5000CCA221DF73D1d0  3.03G  1.81T      0    553      0  68.3M
  c0t5000CCA221DFA2EBd0  3.05G  1.81T      0    562      0  68.9M
  c0t5000CCA221DFD2F8d0  3.03G  1.81T      0    560      0  67.9M
  c0t5000CCA221DFD37Ad0  3.03G  1.81T      0    557      0  68.3M
  c0t5000CCA221DFD300d0  3.04G  1.81T      0    557  4.89K  68.6M
  c0t5000CCA221DFD311d0  3.03G  1.81T      0    550      0  68.0M
  c0t5000CCA221DFD398d0  3.02G  1.81T      0    549      0  67.6M
  c0t5000CCA221DFD971d0  3.03G  1.81T      0    555      0  68.6M
  c0t5000CCA221DFD972d0  3.03G  1.81T      0    560      0  68.3M
  c0t5000CCA221DFDA8Fd0  3.03G  1.81T      0    558      0  68.3M
  c0t5000CCA221DFDA97d0  3.03G  1.81T      0    552      0  68.0M
  c0t50024E900368D10Ad0  3.03G  1.81T      0    558  4.89K  68.2M
  c0t50024E900368D11Dd0  3.01G  1.81T      0    555  4.89K  67.9M
  c0t50024E900368D105d0  3.03G  1.81T      0    552      0  67.9M
  c0t50024E900368D118d0  3.02G  1.81T      0    555      0  68.1M
  c0t50024E900368D163d0  3.04G  1.81T      0    558      0  68.3M

To achieve this I had to utilize 48 drives across two SC847 JBOD chassis. The front and rear backplanes of each SC847 have their own independent SAS wide port (i.e. I'm NOT using the option to cascade the front and rear expander backplanes internally). Each of the 4 backplanes connects to a dedicated LSI 9200-8e (i.e. I'm only using ONE wide port on each HBA).

The 4 9200-8e are in PCI slots 1,2 and 6,7 so that they are split across two independent PCI-e 2.0 busses on the Supermicro X8DTH-6F motherboard.

I've got a few more Hitachi drives I can add to this thing...let's see if I can break 4GB/s.
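For reference, the iostat above shows all 48 disks sitting directly under the pool, i.e. a plain stripe of single-disk vdevs. A minimal sketch of how a pool like that is put together (device names here are placeholders, not the real WWN-style c0t...d0 names):

Code:
# listing bare disks with no raidz/mirror keyword makes each one its own top-level vdev,
# so writes are striped across all of them
zpool create monster c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
# ...repeat for the remaining drives, or add them later:
zpool add monster c1t6d0 c1t7d0

# verify layout and watch per-disk bandwidth
zpool status monster
zpool iostat -v monster 5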
 
Kewl! Good scaling. What's your CPU usage during that kind of throughput, and how much of it is attributed to interrupts? Are you using MSI interrupts?

The 4 9200-8e are in PCI slots 1,2 and 6,7 so that they are split across two independent PCI-e 2.0 busses on the Supermicro X8DTH-6F motherboard.
I think you mean PCI-express, eh? And just a minor thing: PCI-express is not a bus but a point-to-point interface; it is independent by nature, unlike a bus, which is by nature shared among multiple devices. So PCI-express and PCI are almost opposites of each other; it used to be called 3GIO, or Third Generation I/O.

I still have much reading to do in your older threads, but you surely made some good progress!
 
Kewl! Good scaling. What's your CPU usage during that kind of throughput, and how much of it is attributed to interrupts? Are you using MSI interrupts?

how do I tell? prstat -av ??

how do I determine if I'm using MSI interrupts?
 
As you're using OpenSolaris, things could be different. But normally 'top' would output your interrupt usage separately, and you can just look at the idle % of CPU utilization to see how loaded your CPU is on average. Shift+S inside top lets me see system processes too, along with a lot of I/O-related stuff and many ZFS threads.

Does OpenSolaris also use sysctl? You could try sysctl -a | grep -i interrupt
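(For what it's worth, Solaris doesn't have a BSD-style sysctl; intrstat(1M) and the mdb kernel debugger are the usual tools for this. A rough sketch, assuming an x86 build -- exact output format varies by release:)

Code:
# per-device interrupt counts and the %time each CPU spends in interrupt handlers
intrstat 5

# list interrupt vectors; the Type column shows Fixed vs. MSI/MSI-X per driver instance
echo "::interrupts" | mdb -k | egrep 'Type|mpt'

# per-CPU intr/ithr columns alongside usr/sys/idl
mpstat 5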

Just curious, but have you taken this into production yet? If not, you could try my livecd, import your pool, do some benchmarks, and compare speeds and CPU utilization between OpenSolaris and FreeBSD; that would be cool! If you have the time, of course. :)
 
Before filebench run:

Code:
bash-3.00# ./intrtime.pl 60
   Interrupt         Time(ns)   %Time
     pci-ide          1661675    0.00
        ehci          1801327    0.00
         igb          2577830    0.00
        uhci          4227271    0.01
     mpt_sas          9147009    0.02
  TOTAL(dur)      60001118895  100.00
  TOTAL(int)  145188176789778 241975.78

During Filebench run - 120 seconds

Code:
bash-3.00# ./intrtime.pl 120
dtrace: 2224 dynamic variable drops with non-empty dirty list
dtrace: 234690 dynamic variable drops with non-empty dirty list
   Interrupt         Time(ns)   %Time
         igb          1338103    0.00
     pci-ide          3357218    0.00
        ehci          4104837    0.00
        uhci          9288294    0.01
     mpt_sas 34906783410485942 29088760.50
  TOTAL(dur)     120000930980  100.00
  TOTAL(int) 2364977487937637115 1970799283.49

I have a feeling that there were SO many interrupts that some kind of counter wrapped or something.


intrtime acquired from here:

http://www.brendangregg.com/dtrace.html
 
Got these kicked off in the middle of a filebench run...looks like between 10 and 20% interrupt time.

Code:
bash-3.00# ./intrtime.pl 10
dtrace: 2157 dynamic variable drops with non-empty dirty list
   Interrupt         Time(ns)   %Time
         igb            97228    0.00
     pci-ide           261105    0.00
        ehci           329774    0.00
        uhci           695235    0.01
     mpt_sas        628364259    6.28
  TOTAL(dur)      10000860198  100.00
  TOTAL(int) 22661133719164741 226591845.81

bash-3.00# ./intrtime.pl 10
   Interrupt         Time(ns)   %Time
         igb            15636    0.00
     pci-ide           267390    0.00
        ehci           342312    0.00
        uhci           724838    0.01
     mpt_sas       1941867971   19.42
  TOTAL(dur)      10001046369  100.00
  TOTAL(int)  156683213855178 1566668.21
 
Let's assume 15% interrupt time; not bad at all for that kind of throughput! I've seen cases where having many disks would bottleneck the CPU due to interrupts; that's likely driver- and hardware-specific.

Did you also check total CPU utilization? From simple 'top' output, if OpenSolaris has such a thing. ;-)
BTW, DTrace is great! I still have to explore its true power.
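For the 'top' question: top isn't part of a stock Solaris install; prstat and the mpstat/vmstat family are the usual stand-ins. A minimal sketch:

Code:
prstat -a 5     # top-like view, per process plus a per-user summary
mpstat 5        # per-CPU usr/sys/idle plus interrupt (intr/ithr) counts
vmstat 5        # one-line overall us/sy/id summary per interval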
 
I think you mean PCI-express, eh? And just a minor thing: PCI-express is not a bus but a point-to-point interface; it is independent by nature, unlike a bus, which is by nature shared among multiple devices. So PCI-express and PCI are almost opposites of each other; it used to be called 3GIO, or Third Generation I/O.

I think he means independent IOHs, which, depending on how everything is set up, may or may not mean independent QPIs.

Still waiting for someone to try to max out a 2P-4 IOH X75xx system.

packetboy, did you try using only two cards with both wide ports connected? What's the difference in performance? In theory, the 9200 shouldn't be bottlenecked by PCIe, and certainly at those bandwidths it shouldn't be bottlenecked by being all on 1 IOH.
 
I spent HOURS on this today and have figured out a major issue.

I decided to start simple and use a single LSI 9200-8e and incrementally add 1 drive at a time to a stripe: benchmark, add 1 drive, benchmark, etc...

So it went like this:

Code:
Drives  Throughput(MB/s)
 1      120
 2      210-220
 3      320-340
 4      440-460
 5      550-580
 6      670-690
 7      780-800
 8      870-920
 9      920-960
10      920-960

So basically, it ramps up nicely until you get past 8 drives, then flatlines.

If I add another 10 drives to the second SAS2 wide port on the 9200-8e, throughput jumps to 1500MB/s, so I know it's NOT the card or PCI-e that is the bottleneck. I've tested the drives independently and they can each do about 120MB/s, so I know that's not the problem.

There's really only one plausible explanation, and that is that the SAS wide link between the HBA and the LSI expander inside the SC847 enclosure is negotiating to 3Gbps instead of 6Gbps.

Note that we are flatlining just under 1000MB/s on a given SAS2 wide port.

Here's what I calculate throughput should be on a SAS and/or SAS2 wide port:

SAS2:

6Gb/s line rate = 600MB/s after 8B/10B encoding; allow roughly 20% for protocol overhead = ~480MB/s per lane

SAS2 wide = 480MB/s * 4 = 1920MB/s

SAS wide would be half that:

SAS wide = 960MB/s

Which is literally the ceiling I'm seeing.
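A back-of-the-envelope version of that arithmetic (the ~20% protocol-overhead factor is a rough assumption on my part, not a measured number):

Code:
# line rate / 10 = MB/s after 8B/10B encoding (10 bits on the wire per data byte),
# then knock off ~20% as a rough allowance for SAS protocol/framing overhead
echo "SAS2 lane: $((6000/10*8/10)) MB/s"      # ~480
echo "SAS2 x4  : $((6000/10*8/10*4)) MB/s"    # ~1920
echo "SAS1 x4  : $((3000/10*8/10*4)) MB/s"    # ~960  <- the ceiling I'm hitting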

What's really annoying is that the LSI BIOS (SAS Configuration Utility) has a column that shows the maximum link speed and negotiated link speed for each device...unfortunately it ONLY shows this for the *drives* and NOT for the enclosure-to-HBA link!

Anyone have any idea how I can query the HBA link speed from Solaris itself?

I've already got an open ticket with LSI for a different issue (the newest LSI BIOS throws an MPT BIOS error when used with the Supermicro X8DTH-6F motherboard ... I get the same error with the Supermicro VM server hardware I have as well, an X8DTT-H ... but they are somewhat similar chipsets). Anyway, looks like I'll have additional questions for them when I talk to them again on Monday.

BTW: I am VERY happy I went with the LSI board...their tech support has been AWESOME...no wait times, immediate escalation to L2 support. They have almost the exact same Supermicro board and will be trying to replicate my problems...just amazing.

Stay tuned.
 
packetboy, why are you not using the motherboard SAS channels? Any problem with them?
I'm planning to use the same board for ZFS storage.
 
Anyone have any idea how I can query the HBA link speed from Solaris itself?

LsiUtil should be able to display (or modify) that info. You can find lsiutil on the download page for the Fibre Channel HBAs.

You should be able to see whether the 8 phys on the card are running at 1.5, 3.0 or 6.0 Gb/s, and whether the phys are running as 8 individual links or as two wide links.

I guess you could hook the card to the expander backplane without any drives attached, then see what lsiutil says the negotiated link speed is.
 
Ahhhh...finally...a version of LSIutil that works with SAS2008-based controllers:

Code:
bash-3.00# ./lsiutil.i386 

LSI Logic MPT Configuration Utility, Version 1.63, June 4, 2009

5 MPT Ports found

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC
 1.  mpt_sas0          LSI Logic SAS2008 02      200      02003200     0
 2.  mpt_sas1          LSI Logic SAS2008 02      200      05000d00     0
 3.  mpt_sas15         LSI Logic SAS2008 02      200      05000d00     0
 4.  mpt_sas18         LSI Logic SAS2008 02      200      05000d00     0
 5.  mpt_sas7          LSI Logic SAS2008 02      200      05000d00     0

Select a device:  [1-5 or 0 to quit]

It *says* the HBA-to-expander links are 6Gbps (I have TWO connected):

Code:
Current active firmware version is 05000d00 (5.00.13)
Firmware image's version is MPTFW-05.00.13.00-IT
  LSI Logic
  Not Packaged Yet
x86 BIOS image's version is MPT2BIOS-7.05.01.00 (2010.02.09)

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 8

SAS2008's links are 6.0 G, 6.0 G, 6.0 G, 6.0 G, 6.0 G, 6.0 G, 6.0 G, 6.0 G

 B___T___L  Type       Vendor   Product          Rev      SASAddress     PhyNum
 0  10   0  EnclServ   LSI CORP SAS2X36          0414  50030480007bb27d    36
 0  12   0  EnclServ   LSI CORP SAS2X36          0414  50030480007baefd    36
 0  13   0  Disk       ATA      SAMSUNG HD203WI  0002  50030480007baecc    12
 0  14   0  Disk       ATA      SAMSUNG HD203WI  0002  50030480007baecb    11
 0  15   0  Disk       ATA      SAMSUNG HD203WI  0002  50030480007bb262    34
 0  16   0  Disk       ATA      Hitachi HDS72202 A3EA  50030480007baecd    13
 0  17   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007baece    14
 0  18   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007baecf    15
 0  19   0  Disk       ATA      Hitachi HDS72202 A3EA  50030480007bb252    18
 0  20   0  Disk       ATA      Hitachi HDS72202 A3EA  50030480007bb253    19
 0  21   0  Disk       ATA      Hitachi HDS72202 A3EA  50030480007bb254    20
 0  22   0  Disk       ATA      Hitachi HDS72202 A3EA  50030480007bb255    21
 0  23   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007baed0    16
 0  24   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007baed7    23
 0  25   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007bb25e    30
 0  26   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007bb25f    31
 0  27   0  Disk       ATA      Hitachi HDS72202 A3EA  50030480007bb257    23
 0  28   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007baed8    24
 0  29   0  Disk       ATA      Hitachi HDS72202 A3EA  50030480007bb256    22
 0  30   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007bb260    32
 0  31   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007bb261    33
 0  32   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007bb263    35
 0  33   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007baed9    25
 0  34   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007baeda    26
 0  35   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007baedb    27
 0  36   0  Disk       ATA      Hitachi HDS72202 A28A  50030480007baedc    28

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 10

Interrupt Coalescing is enabled, timeout is 10 microseconds, depth is 4
Enable interrupt coalescing:  [0=No, 1=Yes, default is 1]
Enter timeout:  [1-1000, 0=disable, default is 10]
Enter depth:  [1-128, 0=disable, default is 4]

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] ?
Invalid answer, try again:

 1.  Identify firmware, BIOS, and/or FCode
 2.  Download firmware (update the FLASH)
 4.  Download/erase BIOS and/or FCode (update the FLASH)
 8.  Scan for devices
10.  Change IOC settings (interrupt coalescing)
13.  Change SAS IO Unit settings
16.  Display attached devices
20.  Diagnostics
21.  RAID actions
23.  Reset target
42.  Display operating system names for devices
43.  Diagnostic Buffer actions
45.  Concatenate SAS firmware and NVDATA files
59.  Dump PCI config space
60.  Show non-default settings
61.  Restore default settings
66.  Show SAS discovery errors
69.  Show board manufacturing information
97.  Reset SAS link, HARD RESET
98.  Reset SAS link
99.  Reset port
 e   Enable expert mode in menus
 p   Enable paged mode
 w   Enable logging

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 69

Seg/Bus/Dev/Fun    Board Name       Board Assembly   Board Tracer
 0  131   0   0     SAS9200-8e       H3-25260-01D     P245131110     

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 59

PCI location is Segment 0, Bus 131, Device 0, Function 0 (combined: 008300)

PCI Config Space
0000 : 00721000 00100146 01070002 00000040 0000e801 f8e3c004 00000000 f8e40004
0020 : 00000000 00000000 00000000 30b01000 f8e80000 00000050 00000000 0000010a
0040 : 00000000 00000000 00000000 00000000 06036801 00000008 00000000 00000000
0060 : 00000000 00008300 0002d010 10008025 00092037 00000482 10820000 00000000
0080 : 00000000 00000000 00000000 00000016 00000000 00000000 00000002 00000000
00a0 : 00000000 00000000 0081c005 fee14000 00000000 00000043 00000000 00000000
00c0 : 000e0011 00002001 00003801 00000000 0000a803 00000000 00000000 00000000
00e0 : 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 66

No discovery errors found

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 16

SAS2008's links are 6.0 G, 6.0 G, 6.0 G, 6.0 G, 6.0 G, 6.0 G, 6.0 G, 6.0 G

 B___T     SASAddress     PhyNum  Handle  Parent  Type
        500605b00201de50           0001           SAS Initiator
        500605b00201de50           0002           SAS Initiator
        500605b00201de50           0003           SAS Initiator
        500605b00201de50           0004           SAS Initiator
        500605b00201de50           0005           SAS Initiator
        500605b00201de50           0006           SAS Initiator
        500605b00201de50           0007           SAS Initiator
        500605b00201de50           0008           SAS Initiator
 0   9  50030480007bb27f     0     0009    0001   Edge Expander
 0  10  50030480007bb27d    36     000a    0009   SAS Initiator and Target
 0  11  50030480007baeff     4     000b    0002   Edge Expander
 0  12  50030480007baefd    36     000c    000b   SAS Initiator and Target
 0  13  50030480007baecc    12     000d    000b   SATA Target
 0  14  50030480007baecb    11     000e    000b   SATA Target
 0  15  50030480007bb262    34     000f    0009   SATA Target
 0  16  50030480007baecd    13     0010    000b   SATA Target
 0  17  50030480007baece    14     0011    000b   SATA Target
 0  18  50030480007baecf    15     0012    000b   SATA Target
 0  19  50030480007bb252    18     0013    0009   SATA Target
 0  20  50030480007bb253    19     0014    0009   SATA Target
 0  21  50030480007bb254    20     0015    0009   SATA Target
 0  22  50030480007bb255    21     0016    0009   SATA Target
 0  23  50030480007baed0    16     0017    000b   SATA Target
 0  24  50030480007baed7    23     0018    000b   SATA Target
 0  25  50030480007bb25e    30     0019    0009   SATA Target
 0  26  50030480007bb25f    31     001a    0009   SATA Target
 0  27  50030480007bb257    23     001b    0009   SATA Target
 0  28  50030480007baed8    24     001c    000b   SATA Target
 0  29  50030480007bb256    22     001d    0009   SATA Target
 0  30  50030480007bb260    32     001e    0009   SATA Target
 0  31  50030480007bb261    33     001f    0009   SATA Target
 0  32  50030480007bb263    35     0020    0009   SATA Target
 0  33  50030480007baed9    25     0021    000b   SATA Target
 0  34  50030480007baeda    26     0022    000b   SATA Target
 0  35  50030480007baedb    27     0023    000b   SATA Target
 0  36  50030480007baedc    28     0024    000b   SATA Target

Type      NumPhys    PhyNum  Handle     PhyNum  Handle  Port  Speed
Adapter      8          0     0001  -->    3     0009     0    6.0
                        1     0001  -->    2     0009     0    6.0
                        2     0001  -->    1     0009     0    6.0
                        3     0001  -->    0     0009     0    6.0
                        4     0002  -->    7     000b     1    6.0
                        5     0002  -->    6     000b     1    6.0
                        6     0002  -->    5     000b     1    6.0
                        7     0002  -->    4     000b     1    6.0

Expander    38          0     0009  -->    3     0001     0    6.0
                        1     0009  -->    2     0001     0    6.0
                        2     0009  -->    1     0001     0    6.0
                        3     0009  -->    0     0001     0    6.0
                       18     0009  -->    0     0013     0    3.0
                       19     0009  -->    0     0014     0    3.0
                       20     0009  -->    0     0015     0    3.0
                       21     0009  -->    0     0016     0    3.0
                       22     0009  -->    0     001d     0    3.0
                       23     0009  -->    0     001b     0    3.0
                       30     0009  -->    0     0019     0    3.0
                       31     0009  -->    0     001a     0    3.0
                       32     0009  -->    0     001e     0    3.0
                       33     0009  -->    0     001f     0    3.0
                       34     0009  -->    0     000f     0    3.0
                       35     0009  -->    0     0020     0    3.0
                       36     0009  -->    0     000a     0    6.0

Expander    38          4     000b  -->    7     0002     1    6.0
                        5     000b  -->    6     0002     1    6.0
                        6     000b  -->    5     0002     1    6.0
                        7     000b  -->    4     0002     1    6.0
                       11     000b  -->    0     000e     1    3.0
                       12     000b  -->    0     000d     1    3.0
                       13     000b  -->    0     0010     1    3.0
                       14     000b  -->    0     0011     1    3.0
                       15     000b  -->    0     0012     1    3.0
                       16     000b  -->    0     0017     1    3.0
                       23     000b  -->    0     0018     1    3.0
                       24     000b  -->    0     001c     1    3.0
                       25     000b  -->    0     0021     1    3.0
                       26     000b  -->    0     0022     1    3.0
                       27     000b  -->    0     0023     1    3.0
                       28     000b  -->    0     0024     1    3.0
                       36     000b  -->    0     000c     1    6.0

Enclosure Handle   Slots       SASAddress       B___T (SEP)
           0001      8      500605b00201de50
           0002     25      50030480007bb27f    0  10
           0003     22      50030480007baeff    0  12
 
I'm very disappointed....LSI tech support says if you connect 3Gbps SATA drives to a 6Gbps expander backplane, the backplane will only deliver 3Gbps of throughput on its uplink port (i.e. the SAS2 wide link between the HBA and expander).

This makes NO sense to me.

It was my understanding that SAS expanders worked like 'switches', but apparently not.

Said another way, in order to realize the benefit of a SAS2 expander chassis you must use *ALL* SAS2 (or SATA3, I presume) drives!

That really sucks!!!!
 
I'm very disappointed....LSI tech support says if you connect 3Gbps SATA drives to a 6Gbps expander backplane, the backplane will only deliver 3Gbps of throughput on its uplink port (i.e. the SAS2 wide link between the HBA and expander).

This makes NO sense to me.

It was my understanding that SAS expanders worked like 'switches', but apparently not.

Said another way, in order to realize the benefit of a SAS2 expander chassis you must use *ALL* SAS2 (or SATA3, I presume) drives!

That really sucks!!!!

It sucks that the expander/LSI doesn't auto-negotiate the faster speed, but if the expander doesn't have some way of buffering or otherwise slowing the link speed to the attached devices, I can see how it would default to the slowest speed attached to the network.

Edit: Apparently this is a limitation of the SAS switching topology, which works by link switching instead of packet switching like an Ethernet device. It makes a perverse sort of sense to limit to the lowest common denominator in that case.

Edit 2: Basically each narrow link in the 4-wide link speaks to a single device through the expander. The expander will automatically negotiate a link speed that everything in the chain can handle; in this case you have a 6Gb HBA, a 6Gb expander, and a SATA II drive whose 3Gb/s link carries about 2.4Gb/s of data after 8B/10B encoding. Allowing roughly another 20% for protocol overhead gives ~245MB/sec per SAS link. Your 4-wide connection therefore gives ~980MB/sec theoretical bandwidth under this setup, which is almost exactly what you're seeing now.

The only way I can see to fix this is if you have the SAS2-847EL2 backplanes, the ones that support multipathing: hook the 2nd 4-wide port to it and have one HBA per expander. This would give you about 1.8-1.9 GB/sec per expander, or almost 8GB/sec of actual throughput across all 4 expanders, which should be more than adequate for the purposes of this system. Hell, it's approaching the burst speed of a single stick of DDR3-1066, which is kinda absurd in its own right. You can page all of your RAM to disk in less time than it takes me to load some webpages.

That or switching the drives over to the 6Gb SATA Seagate drives, which would bring the expander back up to the speeds you were expecting.
 
I'm very disappointed....LSI tech support says if you connect 3Gbps SATA drives to a 6Gbps expander backplane, the backplane will only deliver 3Gbps of throughput on its uplink port (i.e. the SAS2 wide link between the HBA and expander).

It was my understanding that SAS expanders worked like 'switches', but apparently not.

Said another way, in order to realize the benefit of a SAS2 expander chassis you must use *ALL* SAS2 (or SATA3, I presume) drives!

That sucks.

I hate that hardware and software companies write their tech docs like a lawyer would, where a comma, an "or", or an "and" can change everything, and where it's more important that you can decipher what's NOT said rather than what is said (or implied).

It's all marketing BS and legal speak.

You wrote: "or SATA3, I presume".
Don't do that. Don't presume... Ask them, then ask follow-up questions.
Consider storage software and hardware vendors to be used car salesmen. If they aren't flat-out lying to you, then they are giving you as little information as possible to let you draw your own false conclusions.

When you ask: What's the difference between the engines in these two cars?
The car salesman may say: "They are both 5.0 liter".
While that may be true, the salesman didn't answer your question.

From what I understand, with SAS-1 the HBA would do the device discovery and map every disk.
With SAS-2, the expander can do its own device discovery, then tell the HBA "hey HBA, I have 24 disks attached". Then the SAS-2 HBA "could" still run at 6Gb, and multiplex 2x 3Gb SAS disks for each 6Gb phy.
 
That sucks.
Then the SAS-2 HBA "could" still run at 6Gb, and multiplex 2x 3Gb SAS disks for each 6Gb phy.

As far as I can tell, they can't do that, because of the way the protocol works. Throughput per channel is still limited to the average link speed of all accessed devices. If everything is SATA II, then it doesn't matter if you have a 12Gb/sec connection; the HBA will still throttle itself down so each of the 4 links in the 8088 cable talks to each HDD at a rate it understands.

Edit: And yes, most storage vendors will tell you whatever you want to hear in order to make the sale.
 
As far as I can tell, they can't do that, because of the way the protocol works. Throughput per channel is still limited to the average link speed of all accessed devices. If everything is SATA II, then it doesn't matter if you have a 12Gb/sec connection; the HBA will still throttle itself down so each of the 4 links in the 8088 cable talks to each HDD at a rate it understands.

That's where sas-2 is supposed to be different. The expander can negotiate the link speed, rather than leaving it up to the hba.

Several docs on the PMC-Sierra site mention support for link multiplexing: "Support for SAS-2 multiplexing allows legacy SAS and SATA HDDs to take advantage of the devices' improved bandwidth support"
This PDF talks about how multiplexing will "save money".

But even if the HBA-to-expander link does support multiplexing, I guess it all comes down to how fast the silicon on the expander can rotate between the attached drives.

SAS-2 is also supposed to support a standard (vendor-neutral) method for zoning. But I have not yet seen any tools released to configure zones, and I doubt I ever will.
 
packetboy, why are you not using the motherboard SAS channels? Any problem with them?
I'm planning to use the same board for ZFS storage.

The motherboard is in its own case that has NO storage drives...ALL the storage is in external Supermicro SC847 (JBOD) enclosures...thus we needed external SAS ports vs. the internal SAS ports on the motherboard.

The Supermicro SC213 case we are using for the storage controller has a ton of 2.5in drive carriers across the front. We *could* have used the internal SAS ports with appropriate cable adapters; however, we figured we'd reserve them for 2.5in boot disks and L2ARC SSDs.


The onboard SAS2008 is definitely recognized by Solaris...though it defaults to the SAS2008-IR firmware...I have yet to reflash it with SAS2008-IT, as Supermicro has neglected to include a Solaris version of the sas2flash utility....grrrr.
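For anyone stuck on the same thing: since there's no Solaris sas2flash, the IR-to-IT crossflash typically has to be done from a DOS or EFI boot environment. A rough sketch from memory of LSI's documented usage (the firmware image name below is a placeholder -- follow the readme that ships with the IT firmware package for your exact board):

Code:
sas2flash -listall                          # confirm the controller and current (IR) firmware are visible
sas2flash -o -e 6                           # erase the existing flash region (only if the readme calls for it)
sas2flash -o -f 2008it.bin -b mptsas2.rom   # write the IT-mode firmware and option ROM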
 
It sucks that the expander/LSI doesn't auto-negotiate the faster speed, but if the expander doesn't have some way of buffering or otherwise slowing the link speed to the attached devices, I can see how it would default to the slowest speed attached to the network.

According to lsiutil the HBA/expander link *IS* negotiating to 6Gbps on all 4 channels.

But it looks like the expander throttles everything down to 3Gbps somehow...this must be what they mean by 'link switching' but I can't find anything that explains exactly how this works.
 
http://www.scsita.org/aboutscsi/sas/tutorials/SAS_Link_layer_2_public.pdf
Page 35 shows how it will 'talk more slowly' over a 6Gb link to a slower SATA II HDD running at 3Gb. Basically it fills every other dword with a filler block, so the effective command and data transaction rate is the same 3 Gb/sec.

So no matter how fast each of the 4 links can hop from device to device, they can still only pull the average of the connected drives' link speeds. I would be willing to bet $20 that if you found some old 1.5Gb SATA 1 drives, you would see the same thing happen, only your max throughput would be ~500MB/sec.

And because of the SATA over SAS tunneling protocol, there is no nice way to buffer a bunch of reads and writes to the disk in order to somehow aggregate them through the 6Gb link.
 
I hope to see a benchmark with 72 x 1TB SSDs on the SC417 when it releases... ^-^
 
Packetboy,
Ask LSI if your controller and these expanders support Time Division Multiplexing (TDM). That is the feature that can multiplex commands for two 3Gb disks onto one 6Gb link.

If it's not currently supported, why not? Will it be? When?

The Marvell expander chips claim to support TDM. PMC-Sierra hints that their controller chips might support TDM... check with LSI.

If you decide to do something like going out and buying a dozen 6Gb drives to test with, buy a dozen of the 6Gb near-line SAS drives.
 
Hey all,

Very interested to see how this turns out. I have been following a lot of the ZFS threads, as I was about to put in a large purchase for work utilizing multiple 847 chassis; I had assumed a SAS2 HBA + expander would enable 6Gb/s aggregate bandwidth from 3Gb/s drives.

Basically, take Page 35 of the PDF that Eschertias linked and reverse it. I assume this is basically what multiplexing does - I just don't see why it isn't working.

[attachment: sas2.jpg]
 
Interesting...in this press release for the LSI SAS2x36 expander chip:

http://www.eetasia.com/ART_8800508838_499491_NP_1ad9512d.HTM

The 6Gbit/s SAS will provide up to double the transfer rate of 3Gbit/s SAS solutions and offer greater scalability, while also providing new features including expander zoning, spread spectrum clocking and support for the T10 Protection Information Model for greater reliability, Decision Feedback Equalization for longer cables, and *multiplexing*. SAS-2 also provides backward compatibility with 3Gbit/s SAS technologies.

Yet, when you pull the datasheet from LSI's website, it says *nothing* about multiplexing.
 
Very interested to see how this turns out. I have been following a lot of the ZFS threads, as I was about to put in a large purchase for work utilizing multiple 847 chassis; I had assumed a SAS2 HBA + expander would enable 6Gb/s aggregate bandwidth from 3Gb/s drives.

Basically, take Page 35 of the PDF that Eschertias linked and reverse it. I assume this is basically what multiplexing does - I just don't see why it isn't working.

Just like the pic in this PMC-Sierra SAS doc, section 2.1, figure 3, page 10.

It's an older doc, but Marvell and PMC-Sierra both tout support for 3Gb disk multiplexing in their chips. But I wonder if it would work only with 3Gb SAS disks (not SATA).

I guess someone could call ATTO and ask them, as I think their cards use PMC-Sierra chips.
 
Packetboy - have you asked LSI for any further information or clarification? This really worries me if this is the case, because I was hoping to use a Supermicro+LSI solution..
 
OK..I found an official LSI press release that says this chipset does "SAS Multiplexing":


http://www.lsi.com/news/product_news/2008/2008_03_05b.html

It supports SAS and SATA data transfer rates of 1.5, 3, and 6Gb/s and incorporates enhanced features such as Spread Spectrum Clocking for EMI reduction, T10-based zoning for network storage security applications, and SAS multiplexing to maximize bandwidth performance to legacy storage devices.

Despite this, LSI tech support says that the upstream SAS links will NOT run any faster than the underlying speed of the attached drive (i.e. 3Gbps).

Supermicro's response is worse: they say the problem is that SATA drives are half-duplex....though true, I believe that statement is irrelevant to this discussion.

Getting a little frustrating.

I suspect that perhaps the problem is that the expander supports multiplexing but the LSI 9200-8e HBA does not.
 
OK..I found an official LSI press release that says this chipset does "SAS Multiplexing":


http://www.lsi.com/news/product_news/2008/2008_03_05b.html

It supports SAS and SATA data transfer rates of 1.5, 3, and 6Gb/s and incorporates enhanced features such as Spread Spectrum Clocking for EMI reduction, T10-based zoning for network storage security applications, and SAS multiplexing to maximize bandwidth performance to legacy storage devices.

Despite this, LSI tech support says that the upstream SAS links will NOT run any faster than the underlying speed of the attached drive (i.e. 3Gbps).

Supermicro's response is worse: they say the problem is that SATA drives are half-duplex....though true, I believe that statement is irrelevant to this discussion.

Getting a little frustrating.

I suspect that perhaps the problem is that the expander supports multiplexing but the LSI 9200-8e HBA does not.

The other option is that it supports exactly what it says: SAS multiplexing. AKA, it doesn't work with SATA tunneling.
 
Hey Packetboy - any update on this??

I was travelling in Europe for two weeks...just got back...assumed my email would be stuffed with responses from SuperMicro and LSI...zippo.

I called SMC again today and lit a fire...they have been having a hard time getting LSI's response on this question.

Basically it comes down to this...the literature on the LSI expander *says* it supports multiplexing...SMC *says* they are using LSI's firmware, NOT their own custom firmware...so I say again, LSI: where's the multiplexing feature you *say* exists?

SAS2 is pretty new for SuperMicro...I don't think they even understood the importance of multiplexing to making their enclosure really shine. They were willing to accept LSI's standard line of using SAS2 drives if you want SAS2 throughput on your uplinks...what they don't understand is that a SuperMicro chassis full of 3Gbps SATA drives can drive enough throughput to saturate a SAS2 wide port just as easily as one filled with SAS2 drives...of course the difference is having to spend $15,000 on SAS2 drives vs. $5,000 on SATA drives.

If Supermicro wants an absolute killer platform they'd be smart to get this working as it would hands down be the cheapest way to build tier-2 storage.

I can actually live with non-multiplexing, as I'm fronting this server with a 1Gbps network connection, so at best it'll only be able to move 100MB/s over the wire...a fraction of what the array can deliver. However, I'll also be doing disk-to-disk backups...it would be nice to get full SAS2 throughput for those....we'll see.

BTW: My Supermicro contact alluded to a new SAS2 2.5in chassis that they are rolling out...72! drives in 4U ... 88! drives if you want a JBOD configuration.
 
Basically it comes down to this...the literature on the LSI expander *says* it supports multiplexing...SMC *says* they are using LSI's firmware, NOT their own custom firmware...so I say again, LSI: where's the multiplexing feature you *say* exists?

Dell, SuperMicro, etc., do not write their own firmware for products based on LSI chips. LSI writes the firmware. Then the OEMs can turn features on/off via software/firmware switches, and release their own "custom" firmware.
 
Still no word from SMC/LSI on the multiplexing thing.

Finished building the second ZFS monster box (the first box has been deployed)...from start to finish (raw Solaris install, ISC DHCP, and Amanda) took about 1 business day...amazingly easy.

BTW: I decided to install the new Solaris 10 9/10 ... it seems to work well with this setup (LSI HBAs, motherboard, etc.).

Decided to test server-to-server backups...the ONLY tuning I did was to make sure that jumbo frames were enabled on both hosts and the switch (1GbE).

Code:
bash-3.00$ /usr/sbin/ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000 
igb0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 9000 index 2
        inet 192.168.2.5 netmask ffffff00 broadcast 192.168.2.255

Here's an iostat with two Amanda backup streams running in parallel...it has no problem saturating the 1Gbps Ethernet link....I've started with just 10 Hitachi 2TBs in RAIDZ2 for this stage...will drop in another 20 drives before deployment...but for now...

Code:
 pool: zulu03
 state: ONLINE
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        zulu03                     ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c0t5000CCA221D87C67d0  ONLINE       0     0     0
            c0t5000CCA221D419C9d0  ONLINE       0     0     0
            c0t5000CCA221D8681Cd0  ONLINE       0     0     0
            c0t5000CCA221D42054d0  ONLINE       0     0     0
            c0t5000CCA221DFF08Cd0  ONLINE       0     0     0
            c0t5000CCA221DFF27Cd0  ONLINE       0     0     0
            c0t5000CCA221E7A014d0  ONLINE       0     0     0
            c0t5000CCA221E68A4Bd0  ONLINE       0     0     0
            c0t5000CCA221E76C0Dd0  ONLINE       0     0     0
            c0t5000CCA221E76C9Ed0  ONLINE       0     0     0

-bash-3.00$ /usr/sbin/zpool iostat 30 10

----------  -----  -----  -----  -----  -----  -----
amandahold   966G  4.49T      0    918      0   112M
rpool       5.02G   459G      0      0      0      0
tap0200      122K  1.81T      0      0      0      0
tap0300      110K  1.81T      0      0      0      0
zulu03      44.6G  18.1T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
amandahold   969G  4.49T      0    921      0   113M
rpool       5.02G   459G      0      0    119      0
tap0200      122K  1.81T      0      0      0      0
tap0300      110K  1.81T      0      0      0      0
zulu03      44.6G  18.1T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
amandahold   973G  4.49T      0    926      0   113M
rpool       5.02G   459G      0      0      0      0
tap0200      122K  1.81T      0      0      0      0
tap0300      110K  1.81T      0      0      0      0
zulu03      44.6G  18.1T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
amandahold   974G  4.49T      0    907      0   111M
rpool       5.02G   459G      0      0  1.78K      0
tap0200      122K  1.81T      0      0      0      0
tap0300      110K  1.81T      0      0      0      0
zulu03      44.6G  18.1T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----

113MB/s * 8 bits/Byte ≈ 900Mbps -- essentially saturating the GbE link.
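For comparison, a back-of-the-envelope TCP payload ceiling for GbE with 9000-byte jumbo frames, assuming standard Ethernet framing (8B preamble, 18B header+FCS, 12B inter-frame gap) and 40B of IP+TCP headers per frame:

Code:
echo "scale=1; 1000 * (9000 - 40) / (9000 + 8 + 18 + 12)" | bc
# ~991 Mbit/s of usable TCP payload, so ~900Mbps of backup traffic is within
# roughly 10% of what the link can physically carry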

Darn...I REALLY want to throw some 10Gbps at this...unfortunately cash is depleted on this project.
 
On the other hand, SMC is not a tier-one company and doesn't have the resources to throw at tuning and tweaking this stuff to utter optimum efficiency. I think they struggle just to get stuff out the door. Plus they're a motherboard company; this whole JBOD thing is new to them.

From what I can tell, just having 3G SATA drives working at all in a large array is problem enough. Wait till you get several boxes of them in the same cabinet and they start hosing each other up due to EMI. Yes, it's happening to me; I have 10-12 SMC chassis - 836, 846, 847 - with all of the 836s now converted to JBODs hanging off 1068s off 2 847s and 2 846s. I've had two massive glitch-ups that have cost me several TB of lost data - yes, you can lose data with ZFS if things get confused enough. Fortunately it's replicated data, so it's not a showstopper, but...
 
BTW, where is this version of lsiutil that works with SAS2008s? LSI claims no such thing exists, at least when I asked them a month ago. It isn't 1.62. I filed an RFE, and Sun basically said "not our tool, and we're not going to write one".

I've been using a tool called smartmon-ux; see www.santools.com (and no, I'm not the author - I've actually ended up putting the author on retainer to advise on building our next-gen ZFS box; I suppose this would be Gen4, based on X8DAH boards with 96G and rafts of 2TB Constellation SAS drives hanging off 9211/9200s).
 
Here's what LSI emailed me...it's v 1.63

http://www.sendspace.com/file/34lpe9

Code:
# ./lsiutil.i386

LSI Logic MPT Configuration Utility, Version 1.63, June 4, 2009

3 MPT Ports found

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC
 1.  mpt_sas0          LSI Logic SAS2008 02      200      02003200     0
 2.  mpt_sas1          LSI Logic SAS2008 02      200      05000d00     0
 3.  mpt_sas2          LSI Logic SAS2008 02      200      05000d00     0

Code:
Select a device:  [1-3 or 0 to quit] 2

 1.  Identify firmware, BIOS, and/or FCode
 2.  Download firmware (update the FLASH)
 4.  Download/erase BIOS and/or FCode (update the FLASH)
 8.  Scan for devices
10.  Change IOC settings (interrupt coalescing)
13.  Change SAS IO Unit settings
16.  Display attached devices
20.  Diagnostics
21.  RAID actions
23.  Reset target
42.  Display operating system names for devices
43.  Diagnostic Buffer actions
45.  Concatenate SAS firmware and NVDATA files
59.  Dump PCI config space
60.  Show non-default settings
61.  Restore default settings
66.  Show SAS discovery errors
69.  Show board manufacturing information
97.  Reset SAS link, HARD RESET
98.  Reset SAS link
99.  Reset port
 e   Enable expert mode in menus
 p   Enable paged mode
 w   Enable logging

The unfortunate thing is that the 'Display operating system names for devices' function does NOT work:

Main menu, select an option: [1-99 or e/p/w or 0 to quit] 42

Code:
 B___T___L  Type       Operating System Device Name
 0  10   0  Disk       
 0  11   0  Disk       
 0  12   0  Disk       
 0  13   0  Disk       
 0  14   0  Disk       
 0  15   0  Disk       
 0  16   0  Disk       
 0  17   0  Disk       
 0  18   0  Disk       
 0  19   0  Disk       
 0  20   0  EnclServ
 
Hurray! Thanks dude.

The SANtools guy is working on an enhancement that will do device name -> enclosure/slot mapping for the SM backplanes. Should be a couple weeks.

The challenge at this point is getting SAS-based SSDs to use for slogs that will co-exist on the backplane. The OCZ Vertex 2 Pros are doing OK, but so far my experience has been that mixing SAS and SATA on the same backplane tends to be a recipe for disaster. Escher's PDF from HP gives me a good idea of why (thanks).
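For reference, wiring in a separate log device once a suitable SSD does show up on the backplane is a one-liner; a minimal sketch (pool and device names are placeholders, and mirroring the slog is generally the safer choice):

Code:
# single slog device
zpool add tank log c5t0d0

# or a mirrored slog pair
zpool add tank log mirror c5t0d0 c5t1d0

zpool status tank    # the log vdev shows up under its own 'logs' heading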
 
Would you be able to post a pic of the internals of these cases? I'd like to see what they look like.

Also, what drives are you using, and roughly how much did this all cost?
 
Would you be able to post a pic of the internals of these cases? I'd like to see what they look like.

Also, what drives are you using, and roughly how much did this all cost?

There are all sorts of good pictures around the web for the SC847 cases. I think Supermicro's site has some as well. The build quality is rather high - it's not a Dell and not everything is perfect but it is mostly well thought out.

I have been using Seagate 1TB drives with various models of these cabinets with a lot of success.
 