A storage cluster is born

It is time to add space to our VMware cluster for storage of VMs for our customers.

We initially started with Compellent SAN storage. It worked well and had a lot of space, but the performance was not what I was looking for (it is all SATA based), F/C is far more complicated than we need for this project, and it is just plain expensive to upgrade (adding a shelf of 16 450GB SAS disks costs more than the entire solution presented here).

Spin forward a few months: I took one of our older NetApp FAS270c systems, ripped out the 73GB 10K disks, went to eBay and purchased 2 shelves of 300GB 10K F/C disks, and after an afternoon of shuffling we had ourselves a cluster for storage. Performance is absolutely great (and predictable). Expanding it? Expensive, just like the Compellent. I also wanted to investigate some of the new things that can be done with storage, like inline (and synchronous) compression and data deduplication.

History lesson: years and years ago I did a lot with Solaris, and I have kept my feet wet playing with OpenSolaris and ZFS. I won’t bore you with the great details of why ZFS is the shit (links at the bottom of the post) or why Solaris needs to live on forever (because no one can thread at the kernel level like Solaris), but I will tell you that Solaris (or OpenSolaris) with ZFS is a combination that is very tough to beat. So last year I used the Sun Try-and-Buy program to test out a 7110, and I absolutely loved the interface, and the price drop that occurred while I had it! But someone within Sun decided that, no, the new pricing was not available to Try-and-Buy customers. I was floored by this. I shipped the 7110 back, and anyway, I really wanted a cluster, not a single head with a single JBOD enclosure.

In the end I wanted a cluster!

That’s 2 heads, automatic failover, etherchannel/trunk/IPMP/LACP (oh hell, I just wanted multiple ethernets bonded together if possible), basic reporting (I can SNMP for the real stuff), and finally a price tag that I can feel good about for storing our customer data.
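
(Aside for the Solaris-minded: bonding a pair of ethernet ports into an LACP aggregate is roughly a one-liner with dladm, something like the sketch below. The interface names and address are made up, the exact syntax varies a bit by build, and NexentaStor wraps the same thing in its web GUI.)

# bond two physical NICs into a single LACP aggregate (example interface names)
$ dladm create-aggr -L active -l e1000g0 -l e1000g1 aggr0
$ ifconfig aggr0 plumb 192.168.10.10 netmask 255.255.255.0 up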

This is where NexentaStor comes in! They have the pieces all put together so I don’t have to self-engineer something. I have a vendor I can harass or ask questions of. I can focus on what makes my business work instead of working to make my business stable.

I chose NexentaStor because I wanted something more than just a hand-built box, and I wanted a working implementation of high availability without a lot of hackery(tm) on my part. I actually do not like building servers, messing with a ton of configuration, and creating job security because things get so complicated that I can’t go on vacation. I have run OpenSolaris with ZFS on multiple systems; I am comfortable with it and happy with it, but I want something that others can handle if I am not around. That means web GUI management, basic reporting, and a simplified command line without all of the UNIXisms that drive a layman batty!

Do check out their website; we are using the commercial version, and there are developer and community editions available as well.

My parts list for the head units (there are 2 so double all parts):

  • 1 SuperMicro SYS-6016T-NTRF4+ 1U chassis + motherboard (originally specced, later changed because of availability)
  • 2 Intel Xeon E5520 Nehalem
  • 6 x 4GB registered ECC DIMMs (24GB total)
  • 1 SuperMicro AOC-USAS-L8i SAS HBA
  • 1 CABL-0167L Backplane Cable
  • 2 Western Digital 500GB RE3 SATA (in mirror for boot)
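
(For the ZFS-curious: mirroring those two RE3 boot disks is a single zpool attach once the OS is on the first one. A rough sketch with made-up device names; the NexentaStor installer will set up the syspool mirror for you, and on plain OpenSolaris the root pool is usually called rpool instead of syspool.)

# attach the second RE3 to the system pool to form a boot mirror (example device names)
$ zpool attach syspool c0t0d0s0 c0t1d0s0
# and make the second disk bootable as well
$ installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0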

My parts list for the JBOD shelving units (there are 2 so double all parts):

  • 1 SuperChassis SC846E2-R900B 4U 24 bay SAS with SAS Expander
  • 2 CABL-0166L external SAS cables
  • 1 SuperMicro CSE-PTJBOD-CB1 power card
  • 2 CABL-0168L internal SAS internal->external cables
  • 24 Seagate ST31000424SS 1TB SAS disks (22 installed, 2 on shelf)
  • 2 Crucial RealSSD 256GB 2.5″ SSD for mirrored ZIL (see the note just after this list)
  • 2 AOC-SMP-LSISS9253 SAS interposer cards for SATA->SAS interconnections
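
(The promised ZIL note: once the data pool exists, wiring those two RealSSDs in as a mirrored log device is a one-liner, roughly as below. The device names are invented; NexentaStor exposes the same operation in its volume GUI.)

# add the two SSDs to the pool as a mirrored ZIL (separate log device)
$ zpool add volume02 log mirror c2t22d0 c2t23d0
$ zpool status volume02    # the pair shows up under a "logs" section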

Basic nerd data…

Note – benchmarking is very hard to do well. I can easily show you bonnie++ output saturating a gig ethernet, and I can show you output from multiple VMs saturating 2 gig ethernets, but that doesn’t really tell me anything.

In fact, I can show you output from 8 VMs pushing a total of 4Gbps of aggregate throughput, 2Gbps of write traffic alongside 2Gbps of read traffic (well, not quite a full base-2 version of that, more like base 10), but I’d rather show you what things look like with what is on the server right now. That is, the numbers are really boring, and flat graphs just don’t look cool.

(As an aside, reaching over 10K I/Os per second for both reads and writes is easy with this solution. Reaching 20K reads per second is much harder, while write scaling kept climbing towards the 30K mark before service transaction time started creeping up over 1.2ms. Yah, thems are nerd numbers.)
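
(If you want to bore yourself with the same streaming numbers, a typical bonnie++ run against the NFS mount looks something like the line below; the path is made up, and you want the data size comfortably larger than the client's RAM so you are not just benchmarking its page cache.)

# 16GB of sequential I/O, skip the small-file tests (-u drops privileges when run as root)
$ bonnie++ -d /mnt/nfs-test -s 16384 -n 0 -u nobody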

One head, one volume, 4 Windows Server 2008 R2 systems installed, patched, and sitting idle. Each VM is configured with 2GB RAM and a 200GB disk, shared over an NFS connection on bonded gigabit ethernet.
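
(Under the hood the NFS export is just a property on the ZFS dataset, roughly as below; the network here is made up, and NexentaStor manages the share and its access lists from the GUI.)

# share the dataset over NFS, read/write plus root access for the VMware hosts
$ zfs set sharenfs='rw=@192.168.10.0/24,root=@192.168.10.0/24' volume02/volume02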

/$ df -k /volumes/volume02/volume02
Filesystem             size   used  avail capacity  Mounted on
volume02/volume02       13T    16G    13T     1%   volume02/volume02

So about 16GB is in use, but the volume and share have deduplication turned on, and this is where the real fun starts.
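
(Dedup and compression are plain dataset properties; from the shell the whole thing is roughly the lines below, though on NexentaStor it is a checkbox when you create the volume or share.)

# turn on deduplication and compression for the dataset backing the VMs
$ zfs set dedup=on volume02/volume02
$ zfs set compression=on volume02/volume02
$ zfs get dedup,compression volume02/volume02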

The systems are as described above: installed, patched, and idle. Nothing is going on, and before you ask why, it is because I have torn down all of my benchmarking VMs. I got tired of seeing that I could saturate the ethernet network over and over while the storage system stayed fine, quiet, and bored out of its proverbial skull. I just can’t create enough synthetic load to really cause problems, not from I/O and not from pure streaming reads and writes.

/$ zpool list volume02
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
volume02  18.9T  17.7G  18.9T     0%  1.29x  ONLINE  -

1.29x – if I read this right, the pool is holding 1.29 times as much data as the space it actually consumes, which works out to a bit over 20% saved compared to not using dedupe. From reading the different white papers and blog posts, deduped data imposes no performance overhead for reads whatsoever, while writes are only marginally affected. That much space back for essentially free is very hard to argue with.
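
(The arithmetic, for anyone checking me: the dedup ratio is logical data divided by the physical space it actually consumes, so the fraction of space saved is 1 - 1/ratio. A quick check with bc:)

$ echo 'scale=3; 1 - 1/1.29' | bc -l
.225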

Whew. While this post isn’t all that deep, it does sum up what I have been playing with for the last couple of weeks.

Below are pictures of the system(s) and links to pertinent data you may enjoy.

Deduplication now in ZFS – Virtually All The Time

ZFS Deduplication : Jeff Bonwick’s Blog

First Look at ZFS Deduplication – The Blog of Ben Rockwood

Nexenta Adds Dedupe To Open-Source ZFS Storage – Storage – IT …

NexentaStor FAQ

Nexenta Systems announces record growth in 2009 | NAS Storage Server

Nexenta Systems

and onto…

taking the twins to the data center for mounting (head units on a cart)

new storage system in the rack

43 Replies to “A storage cluster is born”

  • Looks like a fun build!

    A few geeky questions.. ;)

    Why 3gb SAS instead of 6gb SAS for future-proof’in, especially since your RealSSD’s can’t get maxed out on 3gb! :) (Also hope you don’t run into OpenSolaris bug 6900767 – that bit us on a few OpenSolaris boxes with the 1068 controllers. Annoying that is.) I have a hunch the reason is “lack of availability of 6gb backplanes”, but hey, gotta ask. ;)
    Are you spacing the SSDs out across the 4 3gb lanes, since each of the SSDs could easily saturate a single lane on reads?
    How are the RealSSD’s performing for ZIL’s? I worry about the slowdown over time with writes to them without TRIM, etc. are you planning on removing and scrubbing the discs occasionally to get the performance back? I would’ve loved to go with them for ZIL along with cache, but was a bit nervous about that. If you’ve found a way to deal with it that’d make my next build easier!
    How are you cabling it? Cascade shelf-to-shelf, and then connect a HBA on each end of it?

    In any case, nice build! Care to post a total cost? :)

    • When will I reach 3Gbps?

      Guessing never! So was not really worried about that.

      If (huge if at that) I hit a performance issue I’ll build another and move the VMs at that time. I am not interested in THE ONE storage system to rule them all.

      Pricing? I have spreadsheets! Part numbers were listed and easy to extrapolate :)

      • Well, it’s pretty easy to hit 3gbps on a channel with 6-12 drives on it (depending on how things are stacked), running with some sort of RAID level (as that causes more data to go across the drives than just the actual amount of traffic being written or read) especially if 1-2 of those drives are SSDs.. but I suppose the real question is will it actually *matter* [ie – cause slowness to the point where it’s visible to the end user], and the answer is almost certainly no. ;)

        Are you running compression in combination with dedupe? I’ve been quite impressed with the I/O that is still possible through a system that is compressing and deduplicating at the same time.

        • Yep, compression and dedupe on; one of the reasons I went with ZFS-based testing in the first place.

          I think ZFS should get more love from the Linux community, which should swallow the earnest pride that makes the GPL such a POS at times. FreeBSD can do it!

          But I ain’t using FreeBSD for this as I want HA.

          As for the 3Gbps stuff, I built my raid groups around 7 drives each, and while it is technically true I can hit that internally, I think reality will show it won’t be an issue. If it does become an issue I can rebuild the heads into 2U systems, spin up another HBA and multipath, or better yet build another basket for my eggs :)

          • I’m totally with you – wishing for a ZFS port for, well, everything. ;) btrfs is interesting, but it’d be nice to have the effort go into improving zfs.. ah well.

            Totally agreed on the more eggs statement! I do like the approach of limiting the size and cost of an individual basket so more can be created as needed.. ;)

  • One more geeky question – where did the interposer card fit in? Not finding anything but “buy” links.. is it a small adapter card that fits in between the drive and the backplane in the sled, or something else? Pictures would be appreciated, but probably difficult to get as I’m imagining this is in heavy use already. ;)

    Thx!

    • Oh that’s exactly where the interposer fits.

      SATA disk (or SSD) -> interposer -> SAS bus

      As far as pictures – oops, everything is mounted and running, sorry :( Not heavy yet, but that will change in the coming days!

      • Yeah, that’s kind of what I figured! :)

        Did you need to use a special sled to fit the interposer in? Or do the standard sleds in the enclosure you picked up have a second set of screw holes that give you enough room to add the interposer?

        Thanks!

        • We used them for the SSD, so no idea what would happen on a normal SATA disk.

          Probably just move up one set of holes for the screws…

            • They are mounted inside of adapters, then mounted into the drive sleds, then mounted in the external JBOD enclosures.

            • I figure I’m a bit late to the commenting game, but do you have a picture of the interposer/SSD/adapter all put together? I’ve found very little information on these interposers, and from what I can find, they don’t *quite* fit in the supermicro 2.5″ to 3.5″ “adapter”.

            • Sorry, no, didn’t take pictures before it was all put together.

              I am working on building another for a customer and will put up pictures of everything this time around.

            • Too bad. I’m having a heck of a time figuring out how everything fits with the interposer. We got the MCP-220-00043-0N 3.5″-2.5″ adapter, but with the interposer attached to the drive, it sticks out of the drive bay too far. We also tried the drive in an ICYDock adapter and had the same result.

              I didn’t see it mentioned – what adapter did you use?

  • The RealSSD drives have a pretty big DRAM cache sans supercap – doesn’t that make the ZIL volatile?

    • It could, but the ZIL devices are housed in the external enclosure and will migrate with the volume in the case of an HA failover.

      The chance of a power outage is pretty low (though not impossible) and there are 2 ZIL devices in mirror configuration to help prevent issues in the case of some kind of failure.

      Not perfect though close enough for this kind of thing.

      This link has some data to help: http://www.anandtech.com/show/2909

  • Great info in this post! I’m going to build out a storage system and am using your config as a guide. I’m not that familiar with SuperMicro and wondered how you attach the external storage to the 1U 6016? The SuperMicro AOC-USAS-L8i SAS HBA looks like an internal RAID card (and also looks like some custom SuperMicro UIO format where the server you listed looks like it uses PCI-E only)? Sorry for the noob questions, but there seems to be a lot of moving pieces in this storage thing.

    • There are a ton of parts to this.

      The external storage is connected via external SAS connectors.

      We used PCIe cards where we could; the one UIO card we did use was for ethernet. Since we didn’t need hardware RAID, the SAS HBA just presents the disks as JBOD.

  • Hello,

    This is Spandana Goli from Nexenta Systems.

    Having read your blog posts on NexentaStor, we are happy to learn about your positive experience.

    Is there an email address I could reach you at to discuss further opportunities?

    Regards,
    Spandana

  • Thank you for your post. I’ve played around with OpenSolaris, Nexenta Core and NexentaStor in the past and all of them look very promising.

    What NexentaStor licenses are required for your HA configuration? I would guess 2x Enterprise [Silver/Gold] Edition xTB + 1x HA Plugin?

    • Did raidz2 groups, three per volume, 2 volumes, plus hot and cold spares.

      Also have write logs mirrored per volume.
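
      (If it helps to picture the layout, one volume built that way looks roughly like this in zpool terms. Device names are invented, and in practice it was all set up through the NexentaStor GUI rather than by hand:)

      # three 7-disk raidz2 groups, a hot spare, and a mirrored log per volume
      $ zpool create volume01 \
          raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
          raidz2 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0 \
          raidz2 c2t14d0 c2t15d0 c2t16d0 c2t17d0 c2t18d0 c2t19d0 c2t20d0 \
          spare c2t21d0 \
          log mirror c3t0d0 c3t1d0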

  • ZFS now runs well on FreeBSD. You can even run a mirrored ZFS boot volume nowadays, just as you could with UFS using GEOM and a kernel tweak. I’ve been running the latter (mirrored UFS boot volume) for more than 5 years now, and am considering upgrading to ZFS upon the next maintenance cycle.

    On another note, one of our non-critical FreeBSD boxes had the highest (#1) uptime on netcraft, until someone complained about it…

    http://www.osnews.com/thread?334847

    It’s now been going for 7+ years, without downtime. Would I use FreeBSD again? Without a doubt!

    • I love FreeBSD and use it throughout my environment.

      But, as I stated, I wanted HA: 2 heads, a shared SAS bus, and automated failover in the case of a head unit failure.

      I continue to support, mirror, and make love to FreeBSD :)

  • Thanks for the great article! Very informative.

    Do you do any Multipath-ing to the JBOD units?

    What about redundancy between the two JBOD boxes? if the JBOD enclosure fails are you protected?

    I’m trying to weigh my options between two separate nodes with AutoCDP + simple failover vs Shared Storage + HA.

    • Hiya, for this installation, no multipath was done on the SCSI layer.

      The head units are 1U tall; there isn’t enough room to run 2 cables out of each unit in a clean, non-hackery way.

      • So after all that work… getting interposers for the sata ssds, and such – you didn’t actually cross connect the head units?

        So in effect you have Head unit#1 connected to JBOD#1 – and Head unit#2 connected to JBOD#2 only?

        And then you just ran software sync and active-passive fail-over between the 2 head units? ( i.e. I think you did what TonyB called – “AutoCDP + simple failover” ? )

        -Michael

        • Uhm, no.

          Each head connects to JBOD1 and JBOD2, active clustering via RSF-1 is running and working.

          The heads themselves are not cross connected.
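
          (Rough sketch, in lieu of a photo:)

          Head1 --SAS-- JBOD1     Head1 --SAS-- JBOD2
          Head2 --SAS-- JBOD1     Head2 --SAS-- JBOD2
          (no SAS link between the heads; RSF-1 handles the failover coordination)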

          • Thanks for the clarification… this is one of those “a picture would have said a thousand words” things. :-)

            Any chance you can post a picture of the back of the head units?

            And/or is perhaps your part list above on each of the head units supposed to say “( 2 ) CABL-0167L Backplane Cable ” ?

            This SAS stuff is new to me… but I am looking for something to do with my SC846 cases I have laying around. :-)

  • late to the party but…

    how is this build working out for you a year later-ish? (im considering something very similar) Reliability? etc

    also, how did the install/setup of the HA plugin go? was it relatively straight forward?

    • Well, truth be told…I am not the most happy of people.

      I was just told I should make a followup post, so I better get going on that. I’ll put in details.

      I’ll get’er done soon.

      • Please do! We built basically the same setup as you. I just came across your post tonight. Anyways we’ve had great luck with our set up.
        I’m really interested in hearing what’s going on.

      • Excellent, looking forward to hearing about your trials and tribulations. Planning to start my own in the next few weeks. Thanks!

Comments are closed.