At the new office, we have several really big EMC Fibre Channel SANs. We have them configured in a fairly complicated, high-availability way. Each server has two HBA controller cards, and each HBA is connected to redundant storage controllers over a Cisco Fibre Channel switch. This installation is expensive, but the performance is excellent. The hardware is top notch and its automated failover features are really solid. The EMC hardware engineers seem to have their heads screwed on straight.
But... but... but... What moron wrote their software?
To run an EMC SAN, you need to configure settings on the storage controller, use client-side software to mount the volumes, and then manage the backups with another software suite. That doesn't sound too bad, right?
The storage manager software is a Java-based website that uses LDAP authentication (yay... you don't need another password). To use the SAN, you create RAID groups by selecting the disks that should be included, then you define LUNs inside the RAID group, then assign the RAID group to a storage group, and then assign the storage group to a client. Then you turn clockwise three times, throw salt over your shoulder, and (if all goes well) the LUN is available to the client.
I've actually simplified the steps, believe it or not. At the end of this, you get a LUN that acts like a disk to the server-side operating system. It has a really useful name like LUN138, but you can assign something more human-readable like "ExchangeServer01-StorageGroup1" or whatever. That will make your SAN management easier (you would assume).
Now, let's move to the client side. On the HBA fiber card, there is a world wide name (similar to a MAC address), and there is a way to assign a human-readable name like ExchangeServer01-HBA1 instead of a long hexadecimal string (which should make things easier, you assume). Once you have everything set there, you flip back to the SAN manager to register the HBA card. And, of course, you can't read the human-readable name. All you get is the world wide name, and you have to flip back and forth to figure out which 20+ character hexadecimal number belongs to which card.
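Since the SAN manager won't do the lookup for you, one workaround is to keep your own cheat sheet. Here is a minimal Python sketch of that idea. The WWN values and the `HBA_NAMES` table are made up for illustration, and I'm assuming a 64-bit world wide name written as 16 hex digits, with or without colons; adjust the length check if your tools show something longer.

```python
import re

# Hypothetical inventory: these WWNs are invented, not real hardware IDs.
HBA_NAMES = {
    "10000000c9123456": "ExchangeServer01-HBA1",
    "10000000c9123457": "ExchangeServer01-HBA2",
}


def normalize_wwn(raw: str) -> str:
    """Drop separators and case so differently formatted WWNs compare equal."""
    digits = re.sub(r"[^0-9a-fA-F]", "", raw).lower()
    if len(digits) != 16:  # assumption: a 64-bit WWN is 16 hex digits
        raise ValueError(f"expected 16 hex digits, got {len(digits)}: {raw!r}")
    return digits


def hba_name(raw_wwn: str) -> str:
    """Return the human-readable name for a WWN, or a placeholder if unknown."""
    return HBA_NAMES.get(normalize_wwn(raw_wwn), "unknown HBA")


if __name__ == "__main__":
    print(hba_name("10:00:00:00:C9:12:34:56"))
```

It is not pretty, but it beats squinting at two windows of hex.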
After you get registered on the SAN, you load software on the client to mount the volumes. This client-side software does not use LDAP. Instead, it uses a locally controlled password. Once you find out the password and load the software, it scans the storage groups available to the client and displays the LUN "ExchangeServer01-StorageGroup1"... Well, no - it displays a long 20+ character hexadecimal number that you've never seen before. If you're only mounting one or two, that's not a big deal, but if you are putting in five or six, that can be kind of annoying. To make it worse, these LUNs show up in Computer Management as Disk1, Disk2, Disk3, etc., and there is no other useful information as to which disk is which LUN.
So here's a scenario: you create four LUNs for four Exchange storage groups on one Enterprise server and you would like to mount the LUNs in a particular pattern. To make the LUNs match that pattern, you will need to flip between three different windows comparing hexadecimal strings to decide which disk in the Computer Management window belongs to which LUN. Wouldn't it be nice if the EMC software could read the EMC-created human-readable tags in the EMC storage system?
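The matching itself is just a join on the hexadecimal ID, which is what makes doing it by eye so irritating. A rough Python sketch of the join, assuming you have copied the IDs out of each window by hand (every ID and name below is invented, and `label_disks` is my own helper, not anything EMC ships):

```python
# Hypothetical data copied by hand from each tool; all IDs are invented.
LUN_NAMES = {  # hex ID shown in the storage manager -> name you assigned
    "6006016012340000AAAA000000000001": "ExchangeServer01-StorageGroup1",
    "6006016012340000AAAA000000000002": "ExchangeServer01-StorageGroup2",
}
DISK_IDS = {  # Windows disk number -> hex ID shown by the client software
    1: "6006016012340000aaaa000000000002",
    2: "6006016012340000aaaa000000000001",
    3: "6006016012340000aaaa0000000000ff",
}


def label_disks(disk_ids, lun_names):
    """Map each Windows disk number to its LUN name by joining on the hex ID."""
    by_id = {uid.lower(): name for uid, name in lun_names.items()}
    return {
        disk: by_id.get(uid.lower(), "unmatched")
        for disk, uid in sorted(disk_ids.items())
    }


if __name__ == "__main__":
    for disk, name in label_disks(DISK_IDS, LUN_NAMES).items():
        print(f"Disk{disk}: {name}")
```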
We haven't even gotten to the backup system. If you are rolling out a couple of servers with 5 or 6 mounted LUNs per server, your head is already hurting, so make sure you take a long break before starting the next step.
Coffee would probably be a good idea.
EMC Replication Manager uses cluster-by-cluster snapshotting to create a backup very quickly. You can snapshot a dozen 200 gigabyte LUNs simultaneously and it will take about an hour to an hour and a half. Pretty slick, I suppose, but (again) the software experience leaves something to be desired.
But, before you frustrate yourself with Replication Manager, you have to dive back into the storage manager. To make a replica, you need a backup LUN that has an identical cluster count and cluster size. And, of course, there is no "Create new LUN with these settings" or "Make replication LUN" option, so you have to do it manually. If you are one cluster off, the replication will fail, so you had better write and type carefully. Go ahead, make a human-readable name for the LUN - it won't do any good, but it will make you feel better.
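Because one wrong digit is enough to sink the replica, it is worth double-checking the numbers before you ever open Replication Manager. A small Python sketch of that sanity check follows. `LunGeometry` and `replica_mismatches` are names I made up, and you would still have to type the values in from the storage manager yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LunGeometry:
    """Hand-entered LUN settings; field names are illustrative only."""
    cluster_count: int
    cluster_size: int  # bytes per cluster


def replica_mismatches(source: LunGeometry, replica: LunGeometry) -> list:
    """List every setting where the replica LUN differs from the source."""
    problems = []
    if source.cluster_count != replica.cluster_count:
        problems.append(
            f"cluster count: source {source.cluster_count}, "
            f"replica {replica.cluster_count}"
        )
    if source.cluster_size != replica.cluster_size:
        problems.append(
            f"cluster size: source {source.cluster_size}, "
            f"replica {replica.cluster_size}"
        )
    return problems


if __name__ == "__main__":
    source = LunGeometry(cluster_count=52428800, cluster_size=4096)
    replica = LunGeometry(cluster_count=52428799, cluster_size=4096)
    for problem in replica_mismatches(source, replica):
        print("MISMATCH -", problem)
```

An empty list means the two LUNs agree on both settings.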
After you put the LUN in the storage group assigned to the Replication Manager, you can then move to the Replication Manager software. You'll need another password for this, too. Another local password, of course. Oh, and you might as well make a service account with domain administrator access now; you'll need it later. Also, don't ever (never, ever) change the password on that account.
So, you log on to the Replication Manager, add the new LUNs to a storage group (no, not the storage manager storage group, a replication storage group), assign it a name (no, it won't pick up the other name you already gave it), create an application group that defines the source LUN and give it a name (no, of course it won't pick up the name you already gave it), and then create a job (and, yes, you need to give the job a name). After you jump through all these hoops, it will create a Windows Scheduled Task to run the replication job.
Yep, you read that right. This expensive, complicated, high end software uses the incredibly unreliable Windows task service.
Believe it or not, it gets worse. The replication service uses a high-numbered TCP port (up around 65000) to manage the service. Ports in that range are not reserved and get dynamically assigned by a variety of programs on a temporary basis. Since these high-numbered ports are a free-for-all, most programmers make allowances for conflicts. You saw this coming, didn't you - EMC's programmers did not make allowances for conflicts. For example, suppose an MMC console is open in one session on one server, gathering information from another server (Exchange System Manager, for example), and that MMC happens to grab the EMC Replication Manager port. This will make the snapshot fail completely. To top it off, failed snapshots are not overwritten, so if you have some sort of overwrite pattern going, that pattern will probably fail the next time it's run, too.
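For the record, "making allowances" is not hard. Here is a minimal Python sketch of one common approach: try the port you want, and if somebody else already has it, ask the operating system for any free one. This is my own illustration, not how Replication Manager works, and a real service would also need some way to tell its clients which port it ended up on.

```python
import socket


def bind_with_fallback(preferred_port: int, host: str = "127.0.0.1") -> socket.socket:
    """Listen on preferred_port, or on an OS-chosen free port if it is taken."""
    for port in (preferred_port, 0):  # port 0 means "let the OS pick one"
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # the port is in use; try the next option
            continue
        sock.listen()
        return sock
    raise OSError("could not bind to any port")


if __name__ == "__main__":
    listener = bind_with_fallback(65000)  # 65000 is a placeholder port
    print("listening on port", listener.getsockname()[1])
    listener.close()
```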
And you'll love EMC's fix - make a registry edit on every single server that blocks other programs from using that port. They aren't even offering us a script to make the change.
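If you end up scripting it yourself, something like the Python sketch below would at least generate the commands for you. Big caveats: I'm assuming the edit in question is the `ReservedPorts` value under the Tcpip `Parameters` key, which is the usual way to reserve a port on Windows Server of this vintage; check that against whatever EMC actually tells you. The server names and port 65000 are placeholders. Also note that `reg add` with `/f` replaces whatever is already in that value, so look at the existing data first, and expect that the servers may need a restart before the change takes effect.

```python
# Assumption: the fix is the ReservedPorts value; verify before using.
TCPIP_PARAMS = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def reserve_port_command(server: str, port: int) -> list:
    """Build a 'reg add' command that reserves one port on a remote server."""
    key = "\\\\" + server + "\\" + TCPIP_PARAMS  # \\server\HKLM\...
    return [
        "reg", "add", key,
        "/v", "ReservedPorts",
        "/t", "REG_MULTI_SZ",
        "/d", f"{port}-{port}",  # ReservedPorts takes a range, e.g. 65000-65000
        "/f",  # overwrite without prompting (replaces existing data!)
    ]


if __name__ == "__main__":
    servers = ["EXCHANGE01", "EXCHANGE02"]  # hypothetical server names
    for server in servers:
        # Print rather than run, so the commands can be reviewed first.
        print(" ".join(reserve_port_command(server, 65000)))
```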
Are all of the EMC programs written by their summer interns or something? It's amazingly bad. These guys need to read The Inmates Are Running the Asylum or Joel on Software to learn how to program useful stuff....