Blog Archives

New OpenSolaris IP administration

The overall objective of the Brussels project is to provide standard interfaces for network administration. Phase I of the project brought a unified way of administering datalinks, including aggregations, in OpenSolaris/Solaris 10 with the dladm(1M) command. Now phase II of the project has integrated the same for IP administration with ipadm(1M).

Besides making it easier to set up and manage interfaces by using ipadm instead of ifconfig and/or editing various files in /etc, it will also make it much easier to get an overview of all addresses and interfaces on a host. One thing that immediately strikes me is that it will now be possible (without complicated regular expressions) to easily parse addresses and other information from active IP interfaces. Ifconfig does not provide the best output for parsing. All this will soon be available for testing in a development build (b137).

Example:
Set up a new persistent address on e1000g0 and view it:

# ipadm create-addr -i e1000g0 192.168.0.242/24
# ipadm show-addr
INTF      ORIGIN    STATE   ZONES   FLAGS   ADDR/MASK
e1000g0   static    ok      global  U-p     192.168.0.242/24
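
To illustrate the point about parsing, here is a minimal sketch that pulls interface and address pairs out of show-addr style text. It assumes the six-column layout shown in this post (INTF, ORIGIN, STATE, ZONES, FLAGS, ADDR/MASK); the layout may differ in other builds, and the function name list_addrs is made up for this example.

```shell
# Print "interface address/mask" for each row of show-addr style text on stdin.
# Assumes the six-column layout shown in the post; list_addrs is a hypothetical name.
list_addrs() {
    # Skip the header row and anything that does not have all six columns.
    awk '$1 != "INTF" && NF >= 6 { print $1, $NF }'
}
```

It would be used as `ipadm show-addr | list_addrs`. No regular expressions are needed because every row has the same whitespace-separated columns.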

PSARC 2009/306 Brussels II – ipadm and libipadm
PSARC 2010/080 Brussels II addendum
6827318 Brussels Phase II aka ipadm(1m)

Don’t keep opening those files

Constantin has a wonderful little script that will configure your home server to reduce its power consumption. Unfortunately for me, he uses a loop that falls into one of my scripting pet peeves. Within Constantin’s script there is absolutely no problem, but I have in the past seen customers burnt by this, to the point where I have been able to work magic and get some really spectacular performance increases.

The problem is here:

for i in $disks ; do
	echo "device-thresholds	$i	5m" >>$TMPFILE
done

Let me emphasize again that in this case there is no real problem but I believe that if you do the right thing where it does not matter you will get it right when it does.

The issue is that every time around that loop the script opens the output file, writes one line and closes the file. This is bad when the file system is a local one, but over NFS it is a performance disaster. With a small change the number of opens can be reduced to just one. Over NFS, all the data has to be sent to the server, and the server has to confirm it is on stable storage, before each close can complete, so reducing the number of closes gives a spectacular improvement.
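
As a sketch of that small change applied to the loop above, the redirection moves from the echo to the done, so the file is opened once for the whole loop. The function wrapper and its argument names are my own, added only so the fragment is self-contained; the line written is the same as in the original loop.

```shell
# Write one device-thresholds line per disk, opening the output file only once.
# write_thresholds and its arguments are hypothetical, added for illustration.
write_thresholds() {
    tmpfile=$1
    shift
    for i in "$@"; do
        printf 'device-thresholds\t%s\t5m\n' "$i"
    done >> "$tmpfile"      # one open and one close for the entire loop
}
```

Called as `write_thresholds "$TMPFILE" $disks`, it produces the same file contents as the original loop.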

Here is a test I ran over a very slow NFS v4  link (Wifi) to my home server:

brompton% cat /tmp/loop
#!/bin/ksh

i=0
out=${1:-${0##*/}.out}
: > $out
while (( i < 1000))
do
     echo $i >> $out
     let i=i+1
done
brompton% time /tmp/loop
/tmp/loop  0.15s user 0.51s system 2% cpu 26.850 total
brompton%

Compare that with the good case:

brompton% cat /tmp/loop2
#!/bin/ksh
i=0
out=${1:-${0##*/}.out}
while (( i < 1000))
do
    echo $i
    let i=i+1
done > $out
brompton% time /tmp/loop2
/tmp/loop2  0.01s user 0.01s system 5% cpu 0.384 total
brompton%

From 26 seconds to 0.384 seconds. Now this is a very bad case, as the loop does nothing else and the network latency is particularly bad, but nonetheless the principle is the same: even if you have 10G ethernet, you don’t want to be leaving that performance on the table.
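
Sometimes the loop cannot be redirected as a whole, for example when its body also writes elsewhere. A variation on the same idea, sketched here under that assumption, is to open the file once on a spare file descriptor and write through it. The function name and the choice of descriptor 3 are arbitrary.

```shell
# Write the numbers 0..count-1 to a file, opening it only once via descriptor 3.
# write_numbers is a hypothetical name; any free descriptor would do.
write_numbers() {
    out=$1
    count=$2
    exec 3> "$out"          # open (and truncate) the file once
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$i" >&3       # write through the already-open descriptor
        i=$((i + 1))
    done
    exec 3>&-               # close it once, after the last write
}
```

The effect on the number of opens and closes is the same as redirecting at the done.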

What is the maximum number of commands queued to a LUN

This is not quite a one-liner, as I’m reusing the code from a previous post to print out the devices in a human-readable form. Otherwise it is just a one-liner, and it was when I typed it in.

The question posed here was: what is the maximum number of commands sent to a LUN at any one time? Clearly this will max out at the throttle for the device. However, since the customer had already tuned the throttle down and the problem had gone away, what was interesting was what their configuration was capable of sending to the LUN:

#!/usr/sbin/dtrace -qCs
/* Map an sd_lun to its dev_info so the driver name and instance can be printed. */
#define SD_TO_DEVINFO(un) ((struct dev_info *)((un)->un_sd->sd_dev))
#define DEV_NAME(un) \
 stringof(`devnamesp[SD_TO_DEVINFO(un)->devi_major].dn_name) /* ` */
#define DEV_INST(un) (SD_TO_DEVINFO(un)->devi_instance)
fbt:*sd:*sd_start_cmds:entry
{
        /* Keep the largest number of commands seen in the driver for each LUN. */
        @[DEV_NAME((struct sd_lun *)arg0), DEV_INST((struct sd_lun *)arg0)] =
            max(((struct sd_lun *)arg0)->un_ncmds_in_driver);
}
END
{
        printa("%s%d %@d\n", @);
}

This produces a nice list of disk devices and the maximum number of commands that have been sent to them at any one time:

# dtrace -qCs /var/tmp/max_sd.d -n 'tick-5sec { exit(0) }'
sd2 1
sd0 70
#

Combine that with the D script from the latency bubble posting earlier and you can drill down on where your IO is waiting.

Latency Bubbles follow up

Following on from the latency bubbles in your IO posting, I have been asked two questions about this post privately:

  1. How can you map those long numbers in the output into readable entries, e.g. sd0?
  2. How can I confirm that disksort has been turned off?

The first one just requires another glob of D:

#
#pragma D option quiet
#define SD_TO_DEVINFO(un) ((struct dev_info *)((un)->un_sd->sd_dev))
#define DEV_NAME(un) \
       stringof(`devnamesp[SD_TO_DEVINFO(un)->devi_major].dn_name) /* ` */
#define DEV_INST(un) (SD_TO_DEVINFO(un)->devi_instance)
/* Note the time at which each buf enters the driver. */
fbt:ssd:ssdstrategy:entry, fbt:sd:sdstrategy:entry
{
        bstart[(struct buf *)arg0] = timestamp;
}
/* On completion, walk from the scsi_pkt back to the buf, the xbuf and the sd_lun. */
fbt:ssd:ssdintr:entry, fbt:sd:sdintr:entry
/ arg0 != 0 /
{
        this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private;
}
fbt:ssd:ssdintr:entry, fbt:sd:sdintr:entry
/ this->buf /
{
        this->priv = (struct sd_xbuf *) this->buf->b_private;
}
fbt:ssd:ssdintr:entry, fbt:sd:sdintr:entry
/ this->priv /
{
        this->un = this->priv->xb_un;
}
/* Aggregate the elapsed time in milliseconds, keyed by driver name and instance. */
fbt:ssd:ssdintr:entry, fbt:sd:sdintr:entry
/ this->buf && bstart[this->buf] && this->un /
{
        @l[DEV_NAME(this->un), DEV_INST(this->un)] =
                lquantize((timestamp - bstart[this->buf])/1000000, 0,
                60000, 60000);
        @q[DEV_NAME(this->un), DEV_INST(this->un)] =
                quantize((timestamp - bstart[this->buf])/1000000);
        bstart[this->buf] = 0;
}

The second required a little bit of mdb. Yes, you can also get the same from dtrace, but mdb gives the immediate answer, firstly for all the disks that use the sd driver and then for instance 1:

# echo '*sd_state::walk softstate | ::print -at "struct sd_lun" un_f_disksort_disabled' | mdb -k
300000ad46b unsigned un_f_disksort_disabled = 0
60000e23f2b unsigned un_f_disksort_disabled = 0
# echo '*sd_state::softstate 1 | ::print -at "struct sd_lun" un_f_disksort_disabled' | mdb -k
300000ad46b unsigned un_f_disksort_disabled = 0

Latency bubbles in your disk IO

The following was written in response to an email from a customer about monitoring IO, prompted by my scsi.d postings. Tim covers where disk IO requests can be queued in his posting titled “Where can I/O queue up in sd/ssd”, which I would recommend as a starting point.

The disk IO sub-systems are built to provide maximum throughput, which is most often the right thing. However, the weakness of tuning for throughput is that occasionally you can get some bizarre behaviour when it comes to latency. The way that optimum IO bandwidth is achieved is by sorting each IO by logical block address (LBA) and then issuing them in order to minimize head seek. This is documented in the disksort(9F) manual page.

So suppose you have a sequence of writes to blocks N, N+1, N+2, N-200, N+3, N+4, N+5, N+6, N+7 in that order, and your LUN has a queue depth, and therefore throttle, of 2 [1]. The IOs will actually be delivered to the LUN in this order: N, N+1, N+2, N+3, N+4, N+5, N+6, N+7, N-200. Hence there will be a significant latency applied to the IO going to LBA N-200, and in practice it is possible to have IO requests delayed on the waitq for many seconds (I have a pathological test case that can hold them there for the time it takes to perform an IO on nearly every block on the LUN, literally hours). You had better hope that that IO was not your important one!
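
The reordering can be sketched with a toy model. This is not the sd driver’s code: it assumes a simple one-way elevator in which the first throttle requests go straight to the LUN and everything else waits, all queued at once, with the next request always being the lowest LBA at or beyond the last one issued, wrapping to the lowest LBA only when nothing lies ahead. The function name is made up, and N is taken to be 1000 for the example.

```shell
# Toy model of a one-way elevator sort; an illustration, not the driver's algorithm.
# Usage: disksort_order THROTTLE LBA...   (LBAs given in arrival order)
disksort_order() {
    throttle=$1
    shift
    printf '%s\n' "$@" | awk -v throttle="$throttle" '
        # Below the throttle the request goes straight to the LUN.
        NR <= throttle { print $1; last = $1 + 0; next }
        # Above it the request is held on the wait queue.
        { waitq[++n] = $1 + 0 }
        END {
            remaining = n
            while (remaining > 0) {
                best = 0; lowest = 0
                for (i = 1; i <= n; i++) {
                    if (!(i in waitq)) continue
                    if (lowest == 0 || waitq[i] < waitq[lowest]) lowest = i
                    if (waitq[i] >= last && (best == 0 || waitq[i] < waitq[best])) best = i
                }
                if (best == 0) best = lowest   # nothing ahead: wrap to the lowest LBA
                print waitq[best]
                last = waitq[best]
                delete waitq[best]
                remaining--
            }
        }'
}
```

Running `disksort_order 2 1000 1001 1002 800 1003 1004 1005 1006 1007` lists 800 last, matching the order described above. In a real workload new requests keep arriving ahead of the head, which is what keeps pushing the N-200 request back.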

This issue only comes into play when the disk driver has reached the throttle for the device, as up until that point each IO can be passed straight to the LUN for processing [2]. Once the driver has reached the throttle for the LUN it begins queuing IO requests internally and by default will sort them to get maximum throughput. Clearly, the lower the throttle, the sooner you get into this potential scenario.

Now for the good news. For most disk arrays sorting by LBA does not make much sense, since the LUN will be made up of a number of drives and there will be a read cache and a write cache. So for these devices it makes sense to disable disksort and deliver the IO requests to the LUN in the order in which they are delivered to the disk driver. If you look in the source for sd.c you will see that we do this by default for most common arrays. To achieve this there is a flag, “disable disksort”, that can be set in sd.conf or ssd.conf depending on which driver is in use. See Michael’s blog entry about editing sd.conf. While you are reading that entry, note that you can use it to set the throttle for individual LUNs, so you do not have to set [s]sd_max_throttle, which will penalize all devices rather than just the one you were aiming for if you have just one that only has a small queue depth. You will see below why a small queue depth can be a really bad thing.

So how could you spot these latency bubbles?

It will come as no surprise that the answer is dtrace. Using my pathological test case, but with it set to run for only 10 minutes against a single spindle, the following D produces a clear indication that all is not well:

fbt:ssd:ssdstrategy:entry, fbt:sd:sdstrategy:entry
{
      start[(struct buf *)arg0] = timestamp;
}
fbt:ssd:ssdintr:entry, fbt:sd:sdintr:entry
/ start[(this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private)] != 0 /
{
       this->un = ((struct sd_xbuf *) this->buf->b_private)->xb_un;
       @[this->un] = lquantize((timestamp - start[this->buf])/1000000,
            60000, 600000, 60000);
       @q[this->un] = quantize((timestamp - start[this->buf])/1000000);
       start[this->buf] = 0;
}

This produces the following output [3]; the times are in milliseconds:

dtrace: script 'ssdrwtime.d' matched 4 probes
^C
6597960853440
        value  ------------- Distribution ------------- count
      < 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 377204
        60000 |                                         0
       120000 |                                         0
       180000 |                                         0
       240000 |                                         0
       300000 |                                         0
       360000 |                                         0
       420000 |                                         0
       480000 |                                         2
       540000 |                                         300
    >= 600000 |                                         0
6597960853440
        value  ------------- Distribution ------------- count
           -1 |                                         0
            0 |                                         40
            1 |                                         9
            2 |                                         6
            4 |                                         17
            8 |                                         23
           16 |                                         6
           32 |                                         36
           64 |@@                                       15407
          128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   361660
          256 |                                         0
          512 |                                         0
         1024 |                                         0
         2048 |                                         0
         4096 |                                         0
         8192 |                                         0
        16384 |                                         0
        32768 |                                         0
        65536 |                                         0
       131072 |                                         0
       262144 |                                         0
       524288 |                                         302
      1048576 |                                         0

Now recall that my test case is particularly unpleasant, but it demonstrates the point. 300 IO requests took over 9 minutes, and they only actually got to complete as the test case was shutting down, while the vast majority of the IO requests completed in less than 256ms.

Now let’s run the same pathological test with disksort disabled:

dtrace: script 'ssdrwtime.d' matched 4 probes
^C
6597960853440
value  ------------- Distribution ------------- count
< 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 543956
  60000 |                                         0
6597960853440
value  ------------- Distribution ------------- count
   -1 |                                         0
    0 |                                         30
    1 |                                         21
    2 |                                         30
    4 |                                         0
    8 |                                         0
   16 |                                         50
   32 |                                         3
   64 |                                         384
  128 |                                         505
  256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  531169
  512 |@                                        11764
 1024 |                                         0

Notice that the majority of the IO requests now took longer, falling in the 256ms bucket rather than the 128ms bucket, but none of the IO requests took many minutes.

Now my test case is pathological, but if you have drives with small queue depths and disksort is still enabled, you are open to some quite spectacular latency bubbles. To mitigate this my advice is:

  1. Don’t ever set the global [s]sd_max_throttle in /etc/system. Use the [s]sd.conf file to set the appropriate throttle for each device.

  2. Consider what is more important to you: throughput or latency. If it is latency, or if your LUN is on a storage array, then turn off disksort using the [s]sd.conf file.

  3. If you have pathological applications then understand that the IO subsystem can give you throughput or bounded latency, not both. So separate out the IO devices that need throughput from those for which latency is more important.

  4. Be aware that even “dumb” disk drives often implement disksort internally, so in some cases they can give similar issues when they have a queue depth of greater than 2 [4]. In those cases you may find it better to throttle them down to a queue depth of 2 and disable disksort in [s]sd to get the most predictable latency, albeit at the expense of throughput. If this is your issue then you can spot it either by using scsi.d directly or by modifying it to produce aggregations like those above. I’ll leave that as an exercise for the reader.


[1] The queue depth of a LUN is the number of commands that it can handle at the same time. The throttle is usually set to the same number and is used by the disk driver to prevent it sending more commands than the device can cope with.

[2] Now the LUN itself may then reorder the IO if it has more than two IOs in its internal queue.

[3] Edited to remove output for other drives.

[4] With a queue depth of 2 the drive cannot sort the IO requests. It has to have one as active and the next one waiting. When the active one completes, the waiting one will be actioned before a new command can come from the initiator.
