Thursday, May 19, 2011

Getting OSVault to Work from Open-Source

Every so often we get a request from someone saying, basically, "I downloaded your code and am having trouble getting it to work, can you help me?". The short answer is "Yes, but we do that for pay, not for free."

The longer answer is, "It's not easy building, from scratch, a hierarchical storage system that can scale from 256GBytes to 10Petabytes, and it takes a LOT of time and effort to get it working reliably."

Here are some of the common issues:
1. You need a working DMAPI/XDSM kernel. This code IS NOT in the LINUX mainstream kernel, and you need to get a working copy from somewhere. SGI is the best place, since they used to maintain the DMAPI code as part of XFS. You will need to go back to the 2.6.24 or 2.6.27 kernel to do it easily. After 2.6.27, SGI removed the DMAPI code from XFS, so it's much more difficult to get DMAPI into kernels after that. Plus, if you go back to 2.6.27, then you will have trouble with support for things like the Intel ICH10 chipset on your motherboard. If you have ICH10, then you can't boot from the DVD or the SATA drives with earlier kernels. So you need to patch in ICH10 support, which is another time-consuming process. And while you're at it, you probably need all the other LINUX patches between 2.6.27 and 2.6.39. (A quick check that DMAPI actually made it into your kernel is shown after this list.)

2. You need to use the XFS file system. Yes, I know that IBM JFS has DMAPI also, but that code doesn't work with a lot of things, like NFS server. XFS is all we use now for the OSVault work.

3. You need to use programs that DON'T use memory-mapped files, because memory-mapped I/O can bypass the DMAPI intercept for non-resident files. VSFTPD is one program that does this; use ProFTPd instead. (One rough way to check what a program is doing is shown after this list.)

4. You have to get dmapid talking to your kernel. You should run the XFS test routines for DMAPI to make sure your kernel is OK. If you go straight to dmapid and there is a kernel problem, dmapid will probably just give an error and exit (or it may just sit there and do nothing, since it's not receiving events from the kernel). Make sure you understand how to mount with the DMAPI options (an example mount command follows this list).

5. You need to use CentOS 4 or 5. We haven't tried it with other distributions. And you need to be careful about kernel updates from CentOS, which will replace your DMAPI kernel. It's best to set up your own YUM server and put updates into that (and change your /etc/yum.repos.d directory contents); a simpler stopgap is shown after this list.

6. You need to get your tape or optical libraries working correctly. The MCLIB directory has a lot of programs that must work with the library. Don't try migration until those work. You can try network migration, but you have to do all the SSH security setups first. If you don't see your drives and library in /proc/scsi/scsi (see the check after this list), go back to square one and start all over again until you do. If "mcstat" doesn't work, don't go any further until it does.

7. You need to initialize the MySQL database so that things like "listmedia" report correctly. There are scripts in the open source release that do this. If "listmedia" doesn't work, then migration will never work. (A quick database sanity check follows this list.)
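
For issue 1, a quick sanity check that the DMAPI patches actually made it into your kernel (the config option name below is the one the SGI patches use; verify against your own kernel configuration):

grep -i dmapi /boot/config-`uname -r`
# a DMAPI-enabled kernel should show something like CONFIG_XFS_DMAPI=y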
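
For issue 3, one rough way to see whether a transfer program memory-maps the files it serves is to trace it and look for mmap calls on the descriptors it opens under the DMAPI-managed mount point (the daemon name below is just a placeholder for whatever program you are checking):

strace -f -e trace=open,mmap -o /tmp/ftpd.trace your_ftp_daemon
# transfer a file, then match the fd numbers in the mmap calls against the
# open calls on files under your managed file system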
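
For issue 4, the mount has to tell the kernel that the file system is DMAPI-managed. With the SGI patches the mount typically looks something like this (check the option names against the documentation that comes with your DMAPI patches):

mount -t xfs -o dmapi,mtpt=/cache /dev/CACHE/CACHE /cache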
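
For issue 5, a simple stopgap until you have your own YUM server is to tell yum to leave the kernel packages alone by adding one line to the [main] section of /etc/yum.conf:

exclude=kernel*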
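
For issue 6, start by confirming that the kernel sees the hardware at all:

cat /proc/scsi/scsi
# you should see a "Sequential-Access" entry for each tape drive and a
# "Medium Changer" entry for the library robot before you bother with mcstat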
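
For issue 7, before worrying about "listmedia", make sure MySQL is running and that the setup scripts actually created the tables (the database name below is only a guess; use whatever name the scripts in the release create):

service mysqld status
mysql -e "show tables" osvault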

There are many more issues to address, but if you are just starting out with the open source distribution of OSVault from dvdvault.sourceforge.net, you can expect to spend anywhere from 80 to 1000 hours getting a new system up, depending on your skill with the LINUX kernel. The majority of that time is spent in kernel work and in hardware configuration setup. Keep in mind that correctly installing a full LINUX server with RAID boot can take 1-2 days after RAID initialization. Applying all the patches to a LINUX kernel for DMAPI (issue 1) takes at least 3 days of work, since you have to verify each one.

Now, if you want to build OSVault in a high-availability configuration, you're definitely ambitious! Our first HA configuration took 2 months to set up and fully test, even though we know all of the software intimately.

The open source code on SourceForge is the same code we use for our products. Our most active tape-based system using this code restores between 30 and 100 files from tape each day and migrates (writes) over 400GBytes of new data every day, and it's been running at that rate for 20 months now using a Qualstar TLS-88264 tape library.

Wednesday, April 13, 2011

An Example of Backing Up Client Systems to OSVault Servers

In our office, we use a tape-based OSVault server as the destination for our system backups. That OSVault server uses a Qualstar 66-slot, LTO-3 tape library with two drives that we purchased several years ago. It has 4TBytes of spinning disk in a RAID-6 configuration. The backup programs we use (Toucan and Simple Backup Suite) can back up our systems to disk storage and are open-sourced, so they meet our needs adequately.

Our servers are located in our data center in downtown Denver, but very few people actually work there. Most work out of other offices with high-speed Internet links. So the backups were initially set up to write directly to the OSVault server over our company VPN.

We found several limitations with that setup. The first was that the backups needed to be throttled at times so that other Internet activity (web and VOIP, for example) wasn't degraded. Also, we found that we were transferring A LOT of data every day, usually at night. A full backup from one employee laptop could exceed 250GBytes.

Our solution was to put a small InfiniDisc system in the office as the backup destination. Those InfiniDisc systems generally have 500GBytes or 1000GBytes of RAID-1 storage and can receive data at around 50MBytes/second, versus the 1MBytes/second direct Internet links. So backups ran quickly to the InfiniDisc, and then the InfiniDisc system would move the resulting backup at a reduced data rate to the OSVault server. The little InfiniDisc servers cost less than $650 each.
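
The office-to-datacenter move can be done with almost any tool that can throttle itself; here is a sketch using rsync (the host name, paths and rate cap are examples, not our exact setup):

rsync -av --bwlimit=1024 /backups/ backup@osvault.example.com:/cache/office-backups/
# --bwlimit is in KBytes/second, so 1024 caps the transfer at roughly 1MByte/second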

After a couple of years of doing this, we have found the solution works well for us. The costs are very low (less than 3 cents per gigabyte stored) and we get some great advantages:
- We get disk-to-disk-to-tape backup without any licensing costs for a traditional tape backup solution
- We have our most recent backup on the local InfiniDisc for quick restore on a Gigabit Ethernet network
- We don't have to move our incremental backups to the OSVault server, reducing Internet usage
- We have remote access anywhere to our backups stored on the OSVault server
- We can expand our storage capability anytime simply by adding more LTO-3 tape cartridges
- And, it's a set-and-forget setup

Wednesday, March 9, 2011

Fixing the 1TByte inode problem in XFS file systems

If you have an XFS file system that you fill completely, then add more hard disk space to it, you can run into a situation where the file system reports "no space available" but a "df" command shows plenty of space available. This is caused by the inability to allocate new inodes in that XFS file system.

Now, XFS dynamically allocates inodes, so you might be wondering how this could happen. The reason is that, unless you say otherwise, inode numbers are limited to 32-bit values, and since the inode number encodes where the inode lives on disk, all inodes must fit in the first 1TByte of storage in the file system. But you completely filled that first 1TByte earlier, so XFS can't allocate any more inodes now.

You could just switch to the "inode64" mount option and continue on, but that risks compatibility problems with NFS and with DMAPI (MySQL won't store a 64-bit inode properly in an "int" variable).
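
For reference, the quick-and-dirty alternative is just a mount option, something like the line below; we don't recommend it for OSVault systems for the reasons above:

mount -t xfs -o inode64 /dev/CACHE/CACHE /cache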

Getting around this problem is actually not too difficult IF you know how. If you don't know how, it can be a very frustrating exercise.

To fix the situation, you need to move files that occupy some of the storage in the first 1TByte of the file system. To do that, try the following:

1. Run xfs_info on your XFS mount point. For example:
[root@osvault ~]# xfs_info /cache/
meta-data=/dev/CACHE/CACHE isize=256 agcount=375, agsize=64469728 blks
= sectsz=512 attr=1
data = bsize=4096 blocks=24157093888, imaxpct=25
= sunit=0 swidth=0 blks, unwritten=1
naming = version 2 bsize=4096
log =internal bsize=4096 blocks=32768, version=1
= sectsz=512 sunit=0 blks, lazy-count=0
realtime =none extsz=4096 blocks=0, rtextents=0

Notice that in this case the "agsize" is about 64 million blocks and the "bsize" is 4K, so the Allocation Groups are roughly 256GBytes each. That means the first 1TByte of storage is in the first 4 Allocation Groups, so you want to find the largest files that live in that first 1TByte. If your allocation groups are sized differently, divide 1TByte by the allocation group size to get the number of allocation groups that hold the first 1TByte (there is a small script after these steps that does this arithmetic).

2. Figure out if those first allocation groups are full by running:

for ag in `seq 0 1 5`; do echo "freespace in AG $ag"; xfs_db -r -c "freesp -s -a $ag" /dev/CACHE/CACHE | grep "total free"; done

If the total free blocks in an Allocation Group (AG) are less than about 40, then you can't create inodes in that allocation group. So now you want to find some files in that allocation group and move them out of the file system and then back in again. It's important that you "mv" the file, rather than "cp" it, so that the original file is deleted from the XFS file system.

3. Now run "xfs_bmap -v filename" on all of the files in your filesystem. Yes, it's tedious, so you probably want to script it. One approach is to run an "ls /mountpoint", send the output to a temporary file, and edit the command onto the beginning of each line, but beware of spaces, quotes and parentheses in your filenames (see the find one-liner after these steps for a safer way).

4. Examine the output from all those xfs_bmap runs and search for lines with " 0 " (that's a space, then a numeral zero, then a space) in the AG column. That will find the files in Allocation Group 0. Repeat this for Allocation Groups 1, 2, and 3 (in this example). Every file in those first few allocation groups is a candidate. All you have to do is move each file to a temporary location on another file system, then move it back into the XFS file system (a two-line example follows these steps). The new instance of the file will land in other allocation groups, and your XFS file system will be able to allocate inodes for new files again.
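
To script the arithmetic in step 1, something like this works (plug in the agsize and bsize values that xfs_info printed for your file system; the numbers below are from the example above):

AGSIZE_BLOCKS=64469728   # agsize from xfs_info
BSIZE=4096               # bsize from xfs_info
AG_BYTES=$((AGSIZE_BLOCKS * BSIZE))
echo "Each allocation group is $AG_BYTES bytes"
echo "AGs covering the first 1TByte: $(( (1000000000000 + AG_BYTES - 1) / AG_BYTES ))"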
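
One way to script step 3 without hand-editing an "ls" listing (this also copes with spaces, quotes and parentheses in file names); /cache is the mount point from the example above:

find /cache -type f -exec xfs_bmap -v {} + > /tmp/bmap.out
# /tmp/bmap.out now holds an extent map for every file, each preceded by its name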
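
And step 4 is just a pair of moves per file; /cache/somefile and /scratch are placeholders, and /scratch must be on a different file system so the original blocks really are freed:

mv /cache/somefile /scratch/
mv /scratch/somefile /cache/
# the file comes back with a new inode and new extents outside the full AGs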

Friday, January 14, 2011

The Challenges of Building a Really Large Storage System

Recently, dropping prices on disk drives have made me wonder if a really large storage system is practical based only on spinning disks, rather than a tape or optical library on the back end. So I decided to work up a 500TByte disk-based storage system that could be counted on to hold data securely for 5-10 years and see what the total cost would be.

Technical Challenges of Large Disk Farms
It is possible to buy a single RAID system today that supports 500TBytes of storage, but I have seen the following disastrous consequences of taking that approach:
1. A single firmware bug, in two instances, resulted in the loss of all storage in the RAID system. This happened on two different manufacturers' platforms, and both were major players in the RAID storage market.
2. A bad batch of disk drives at one customer (155 drives out of 1400 installed) had a flawed FLASH memory chip (too much phosphorus in the chip's ceramic coating destroyed the leads to the chip). On one VERY important day in 2000, 5 of those drives in a single RAID system (and 3 in one RAID set) decided to die at the same time. The end result was having to reload an Oracle database from tape for 4-8 hours, but the bigger consequence was a $500,000 fine from a federal judge for failure to process certain legal requirements in time.
3. The cost of these large RAID systems can be prohibitive, with 16-drive trays costing $5,000 to $8,000 and the electricity to run them also being high (450 watts or more). One customer swore they had a fully redundant, adequate electrical supply for 11 drive trays, but within a week the single circuit breaker covering those 11 trays tripped, the system went down, and it took rewiring to get it back up. A very large school district's IT department was offline during that time.

A Disk-Only Solution
So, how about a different approach to 500TBytes of storage? Say we use 16-drive trays of 2TByte green disk drives at a cost of around $8,000 each. Each tray can be RAID-6, giving us about 24TBytes per tray, with each tray configured as iSCSI volumes. So we would need 21 trays (500TBytes divided by 24TBytes per tray, rounded up) to get our 500TBytes of storage, a couple of redundant network switches to hook everything together, and one tray configured with LINUX as a server. The total cost of all this storage would be about $180,000 (trays, switches, a cabinet to hold them, and power strips).

Now let's look at the utility costs of running this storage. Placing the storage at a colocation facility would run right around $56,000 per year for floor space and electricity, leaving out network connection costs. The electricity becomes a major factor, since the whole unit pulls more than 62 AMPs at 120 volts running idle.

So the three-year cost of operating this unit (assuming one cold stand-by system) would be about $348,000, or about 67 cents per gigabyte.

Of course backup costs and offsite storage costs aren't included in this calculation.

Now, let's compare this to a 500TByte OSVault implementation....

If we really wanted to stay low cost, we would put in a Qualstar 268-slot tape library and three of the same storage trays (24TBytes each). That gives us about 430TBytes of storage in the tape library and 72TBytes of spinning disk storage. The total purchase price of this unit (tape library, three trays of storage/server, 268 LTO-4 tapes, network switches) is about $105,000, and the OSVault software is free (unless you need our help installing it). The cost to put it in a colocation facility is around $7,000 per year, so the three-year cost is around $126,000, or about 1/3 the cost of a totally disk-based solution.

Think Green
If you are thinking "green", the OSVault solution uses only 11 AMPs to manage 500TBytes of storage (at 120 volts).

So, for those who think the cost delta is not too large, remember that the disk-only solution has no disaster recovery plan priced in, which could again double your costs.

I will grant you that the labor costs of the OSVault solution are greater, since you have to take tapes offsite (second copies, for example), but you are still looking at a total managed price for the OSVault solution of about 17 cents per gigabyte per year, versus around 50-70 cents per gigabyte per year for a totally disk-based system, in the best case.

Infinite Storage Anyone?
And the really nice part of the OSVault solution? It can double to 1Petabyte of storage for only 20% more cost. And that tape-based storage has at least 4 times the shelf life of disk-based storage.