Tuesday, October 8, 2013

True Disaster Recovery in a Cloud Storage Environment

Recent news articles have discussed the Chapter 11 bankruptcy of a large cloud storage provider. This provider marketed itself as completely safe, with up to 5 physical locations for customer data and a large amount of venture financing to ensure continuity.  That bankruptcy left several large customers with only 2-4 weeks to retrieve data that took months or years to store.  Speculation is that upwards of 20 petabytes of data may need to be retrieved from those systems, and access to that data is now in the hands of a bankruptcy court.

As this case shows, disaster in a cloud environment is not limited to fire or flood; it also includes the insolvency of your provider.  Even if the data is read out of the cloud storage to another storage device at 1 gigabit per second, it could take years before all of the data is safely transferred.  In the meantime, access to your valuable data is constantly in jeopardy and out of your control.
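The "years" estimate can be checked with simple arithmetic. The sketch below assumes decimal units (1 petabyte = 10^15 bytes) and a link that is fully utilized the whole time, both of which are simplifications; the function names are only illustrative.

```python
PETABYTE = 10**15          # bytes, decimal units (an assumption)
SECONDS_PER_DAY = 24 * 3600


def transfer_years(total_bytes, link_bits_per_second):
    """Years needed to move total_bytes over a fully utilized link."""
    seconds = total_bytes * 8 / link_bits_per_second
    return seconds / (365 * SECONDS_PER_DAY)


def required_gbps(total_bytes, days):
    """Sustained rate, in gigabits per second, to finish within `days`."""
    return total_bytes * 8 / (days * SECONDS_PER_DAY) / 10**9


years = transfer_years(20 * PETABYTE, 10**9)
rate = required_gbps(20 * PETABYTE, 28)
print(f"20 PB at 1 Gbit/s: about {years:.1f} years")
print(f"20 PB in 4 weeks needs about {rate:.0f} Gbit/s sustained")
```

Under those assumptions, 20 petabytes at 1 gigabit per second works out to roughly five years, and meeting a four-week deadline would need a sustained rate on the order of 66 gigabits per second.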

This is one of the reasons that Open Source Storage uses OSVault and tape-based archival for our storage systems.  The other reasons are cost and reliability.  With dedicated tape volumes per customer, and having those volumes owned by our customers, we can move a complete set of data from our site to another site in less than one day.  We use open formats to store data, so that moving to other vendors does not require complex and costly manual processes.  And our costs of storage are still 75% less than other cloud storage providers.

Tape-based archives give us some distinct, valuable advantages.  We can make second copies and move those second copies offsite for very little incremental cost (about $10 per terabyte capital cost and less than $0.05 per terabyte storage cost per year).  Accidental or deliberate deletion of data is difficult or impossible: we keep maps of all data stored on tape in multiple locations, and we do not recycle tapes, so even deleted data can be recovered.  And we can grow our systems from small to large without expensive or time-consuming hardware replacement or upgrades.
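As a rough illustration of those per-terabyte figures, the sketch below estimates the incremental cost of keeping an offsite second copy. It treats $0.05 per terabyte per year as an upper bound and ignores everything else (labor, transport, drives), so it is an estimate under stated assumptions rather than a quote.

```python
CAPITAL_PER_TB = 10.00        # dollars, one-time, from the figure above
STORAGE_PER_TB_YEAR = 0.05    # dollars per year, upper bound from the figure above


def second_copy_cost(terabytes, years):
    """Incremental cost in dollars of an offsite second copy kept for `years`."""
    capital = terabytes * CAPITAL_PER_TB
    storage = terabytes * STORAGE_PER_TB_YEAR * years
    return capital + storage


print(f"500 TB for 5 years: ${second_copy_cost(500, 5):,.2f}")
```

For 500 terabytes kept for five years, this comes to about $5,000 in capital plus about $125 in storage.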

As an exercise, we implemented a 500TByte disaster recovery to a separate location.  The total cost of the additional hardware was less than $20,000 (although we had already purchased the hardware as part of our disaster planning).  The total time required to get access to all files in the 500TByte file system was less than one day.

So, when planning your cloud-based or in-house storage environment, please consider the following issues in your disaster recovery planning:

  • Is my data stored in two separate locations, to safeguard against flood, fire and other disasters?
  • Is my data safe from individual actions, either mistakes or deliberate?
  • Can I get my data physically relocated without costly or time-consuming effort?
  • Do I have the resources and trained personnel to implement a disaster recovery?
  • Is access to my data completely within my control during any disaster?

Thursday, September 19, 2013

Recovering From A Failed Tape Cartridge

If you have been storing files on a tape cartridge and it starts to exhibit serious I/O errors, you need to act quickly to make sure you don't lose any files.  Luckily, tape cartridges rarely fail, and they usually fail when you first start writing data to them.

You will need to use the Command Line Interface to run most of these steps:

First Step - Run "unmigrate -s /cache -v LABEL -r",   where LABEL is the barcode label of the tape that is getting errors.  This will make sure that all files on the tape that are still on disk are marked as not migrated, so that they will be written to another tape at a later time.  The "-r" option means "don't recall", so the tape will not be loaded to get files back.

Second Step - After running the "unmigrate", go to the web GUI, click on the tape label in question, then select "Display Current Tape Contents".  The popup window that appears will show all the files that are still on the suspect tape.  Hopefully, this display shows no files.  If it shows no files, you are done and should eject that tape from your tape library and destroy it.

Third Step - If you are at this step, then you have files on the tape that are not fully on disk storage and you need to recover those.  If you have been making second copies ("copy2" tapes), then you can just switch to the second copy for those files and restore them.  If you do not have a second copy, go to the "Try to Recover From Bad Tape" step below.

To switch to the second copy, run "switchmig -s /cache -v LABEL".  For every file whose primary copy is on the tape with barcode LABEL, this program makes the primary copy point to wherever the second copy was stored.  After "switchmig" completes, run "check_db" to make sure that the database has the new locations.

Fourth Step -  Now that you can recover the files from the second copy, you can elect to just leave things that way, or you can unmigrate all those files that were originally on the bad tape, so that there are two new copies on other tapes.  If you wish to just use the previous second copy as the primary copy now, just stop at this point and the next second copy migration will make another copy of the file.

Fifth Step - How to Unmigrate the files from second copy media back to disk. This step is more involved and requires you to use a LinkFile created in /var/tmp.   If you look at a directory listing of /var/tmp, you will see the linkfiles.  Select the linkfile from yesterday,

Try to Recover From Bad Tape Step
The procedure to follow at this point depends on whether the bad tape is a TAR tape or an LTFS tape.  For LTFS tapes, you should follow the LTFS recovery processes detailed by IBM.  For TAR format tapes, you should run "unmigrate -s /cache -w".   This will create a restore process that will recover all files in order, until a serious I/O error occurs.
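
The decision flow in the steps above can be summarized as a small helper that lists the commands to run, in order. This is only a sketch: it uses just the command lines quoted in this post, the function and parameter names are made up for illustration, and it prints the commands rather than running them. The answer to "are files still on the tape?" comes from the GUI check in the Second Step.

```python
import shlex


def recovery_commands(label, files_left_on_tape, has_second_copy, tape_format="tar"):
    """Return the CLI commands, in order, for a tape showing I/O errors."""
    # First Step: mark files still on disk as not migrated, without recalling.
    commands = [["unmigrate", "-s", "/cache", "-v", label, "-r"]]
    if not files_left_on_tape:
        return commands  # Second Step showed no files: eject and destroy the tape.
    if has_second_copy:
        # Third Step: make the second copy the primary, then verify the database.
        commands.append(["switchmig", "-s", "/cache", "-v", label])
        commands.append(["check_db"])
    elif tape_format == "tar":
        # No second copy, TAR tape: restore files in order until a serious error.
        commands.append(["unmigrate", "-s", "/cache", "-w"])
    # LTFS tapes without a second copy follow IBM's LTFS recovery process instead.
    return commands


for argv in recovery_commands("OS0901L5", files_left_on_tape=True, has_second_copy=True):
    print(shlex.join(argv))
```

The barcode OS0901L5 is just a placeholder; substitute the label of the failing tape.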

Thursday, May 2, 2013

Using the OSVault GUI To Recycle Tapes

In release 5.0.38 of the OSVault GUI, a button was added to automate the un-migration of files from a tape volume, as shown in the example below.  In conjunction with the Media Usage report, you can use this screen to move all files off a tape that is not very full so that the tape can be removed from the library or recycled.


When you go to the Media screen and click on a tape volume, the above window will appear.  You should then click "Display Current Tape Contents" to see which files are on the tape and to get a count of them.  If that tape is mostly empty, you can then click the "Unmigrate Volume" button, which will start the process to make sure that all files on the tape are restored back to disk and marked as not resident on any tape volume.

When the Unmigrate window opens, you must then confirm that you want to unmigrate that tape by again clicking on the button with the tape volume displayed.  That will then start the process of finding all files in /cache that are copied (or only resident) on the selected tape volume.  Once all files are identified, a work queue is built of those files and the restore of files only on tape is started back to disk storage.   This process will take some time to complete.  The window will display the progress.
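
To make the two phases concrete, here is a minimal sketch of how such a work queue might be built. The record layout (path, whether the data is still on disk, which tape volumes hold a copy) is an assumption made for illustration, not the actual OSVault database schema.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str        # location under /cache
    on_disk: bool    # True if the file data is still on disk storage
    volumes: tuple   # barcode labels of tapes holding a copy (hypothetical field)


def build_unmigrate_queue(records, label):
    """Return (all files touching the tape, files that must be restored first)."""
    affected = [r for r in records if label in r.volumes]
    # Files whose data exists only on tape need a restore back to disk.
    restore_queue = [r.path for r in affected if not r.on_disk]
    return [r.path for r in affected], restore_queue


records = [
    FileRecord("/cache/a.dat", True, ("OS0901L5",)),
    FileRecord("/cache/b.dat", False, ("OS0901L5", "OS0902L5")),
    FileRecord("/cache/c.dat", False, ("OS0903L5",)),
]
affected, to_restore = build_unmigrate_queue(records, "OS0901L5")
print(affected, to_restore)
```

Files that are already on disk only need their tape reference cleared, which is why the restore queue is usually shorter than the list of affected files.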

When the unmigrate is complete, you should repeat the above process to ensure that all files are marked as not present on the selected tape volume.



Tuesday, January 8, 2013

Technical Details on LTFS Implementation with OSVault

For the past two years, we have been working on getting LTFS support into OSVault.  That work is complete and this posting is an attempt to outline the technical details in that development effort.

For those who may not be familiar with LTFS, you can read a writeup about it at Wikipedia.org.  The important thing to note is that LTFS requires LTO-5 or LTO-6 tape drives.  LTFS cannot work on LTO-4 or earlier tape models.  Also, converting an existing OSVault TAR-based tape system to LTFS is not an easy task and is not outlined in this posting.

Our implementation of LTFS support in OSVault allows files to be stored in the XFS file system on spinning disk and automatically copied to a tape in the LTFS format.  When the XFS file system is becoming full, the data portion of older files is truncated from the XFS file system, but only for files that have already been copied to LTFS tape.  If a truncated file is opened later, OSVault detects this through the DMAPI interface and automatically restores the file to spinning disk from the tape cartridge before the first I/O on the file is allowed to proceed.
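
The copy, truncate, and recall cycle can be illustrated with a toy in-memory model. This is not how OSVault or DMAPI is implemented; the class and method names are invented, and "oldest first" is simply the order in which files were written.

```python
class ToyArchive:
    """In-memory stand-in for a disk cache backed by one tape copy per file."""

    def __init__(self, disk_capacity):
        self.disk_capacity = disk_capacity
        self.disk = {}   # name -> bytes, or None once the data is truncated
        self.tape = {}   # name -> bytes copied to tape
        self.age = []    # file names, oldest first

    def used(self):
        return sum(len(d) for d in self.disk.values() if d is not None)

    def write(self, name, data):
        self.disk[name] = data
        self.age.append(name)

    def migrate(self):
        """Copy every resident file that has no tape copy yet."""
        for name, data in self.disk.items():
            if data is not None and name not in self.tape:
                self.tape[name] = data

    def free_space(self):
        """Truncate the oldest migrated files until usage fits the capacity."""
        for name in self.age:
            if self.used() <= self.disk_capacity:
                break
            if name in self.tape and self.disk[name] is not None:
                self.disk[name] = None

    def open(self, name):
        """Recall a truncated file from tape before returning its data."""
        if self.disk[name] is None:
            self.disk[name] = self.tape[name]
        return self.disk[name]


archive = ToyArchive(disk_capacity=10)
archive.write("old.dat", b"12345678")
archive.write("new.dat", b"abcdefgh")
archive.migrate()
archive.free_space()
print(archive.disk["old.dat"], archive.open("old.dat"))
```

The key property is that a file is never truncated unless a tape copy exists, and a reader never sees the truncated state because the recall happens inside the open.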

There are five major steps in getting LTFS support into OSVault:

  1. The LTFS file system support had to be installed in the LINUX system that OSVault runs under
  2. The AUTOFS implementation for SCSI medium changers (tape libraries) had to be developed that supported LTFS mounts and unmounts
  3. The LIN_TAPE device driver had to be ported to the DMAPI kernel
  4. The Migration (copying) program had to be modified to support LTFS
  5. An LTFS_INV program had to be developed to label all media in a robotic tape library and to update the database with that information


LTFS File System

Getting the LTFS file system was the easiest part of the implementation, since we used IBM's LTFS packages. Currently we are using LTFS 1.3.0 and that software is installed using an IBM-provided RPM package for CentOS 5.8.
Using the standard IBM package gives us several advantages beyond labor savings in the development process.  The biggest is that this baseline is the first implementation of LTFS and has had the most test time (since 2010).  When interchanging our tapes with other LTFS systems, we can be reasonably sure that any difficulties are due to the other product, since we are using an industry-standard implementation and have not made any modifications for our product.

AUTOFS Implementation

This was probably the most difficult part of the LTFS development.  We have been using AUTOFS 4.1.2 historically for mounting and unmounting DVD and Blu-Ray media in OSVault.  Tape (TAR) support in OSVault did not use AUTOFS.  However, due to the nature of LTFS, it was determined that the best way to support LTFS in OSVault was to use AUTOFS (and the automount daemon) to mount and dismount tape media.
AUTOFS allows the LINUX system to show every tape in the tape library as a separate folder under the /archive mount point.  So, if tapes in the library have barcodes of OS0901L5, OS0902L5, OS0903L5, OS0904L5, and OS0905L5, then the folders /archive/OS0901L5, /archive/OS0902L5, /archive/OS0903L5, /archive/OS0904L5, and /archive/OS0905L5 are available to write to and read from.
In the above example, even though there are 5 tapes in the library, you can randomly access folders even if you have as few as one LTO-5 drive (although we recommend at least 3 drives in a production environment).  The AUTOFS implementation will remove/unmount media to put requested media into a drive without the user needing to worry about that.
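
A toy model of that behavior is sketched below: more tapes than drives, with media swapped on demand. The eviction rule used here (unmount the least recently used tape) is an assumption chosen for the example; this post does not say which mounted tape the automounter gives up first.

```python
from collections import OrderedDict


class DrivePool:
    """Toy model of a tape library with fewer drives than tapes."""

    def __init__(self, drive_count):
        self.drive_count = drive_count
        self.mounted = OrderedDict()  # barcode -> folder, least recently used first

    def access(self, barcode):
        """Return the folder for a tape, swapping media if no drive is free."""
        if barcode in self.mounted:
            self.mounted.move_to_end(barcode)     # already in a drive
        else:
            if len(self.mounted) >= self.drive_count:
                self.mounted.popitem(last=False)  # unmount the least recently used
            self.mounted[barcode] = f"/archive/{barcode}"
        return self.mounted[barcode]


pool = DrivePool(drive_count=1)
for barcode in ["OS0901L5", "OS0903L5", "OS0901L5"]:
    print(pool.access(barcode), list(pool.mounted))
```

With a single drive every change of tape forces a swap, which is one reason to have several drives in production.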
The AUTOFS changes included the creation of two new automounter modules, mount_ltfs and lookup_ltfs.  Those two modules deal with the robotic operations of moving media into and out of tape drives, finding the correct media in the robotic library, verifying that the tape in the drive has the same internal label as the barcode on the outside, and figuring out which tape drive to use for various operations.
A custom change to AUTOFS was to add code that performs an LTFS-specific set of operations when the tape drive mount times out from inactivity.  In most AUTOFS modules, only a generic umount is done on the folder when it times out.  In our LTFS support, it is also necessary to issue an unload to the tape drive, to take the media off the recording heads and return the tape into the cartridge (a generally accepted practice when a tape drive is not in use).
One of the advantages of using AUTOFS for LTFS access is that it is possible for networked users to directly read from the tape media, bypassing the /cache file system in OSVault.

LIN_TAPE Driver Port to DMAPI Kernel

We use the DMAPI interface to control the automatic restore of file data from removable media back to the XFS file system on spinning disk.  The DMAPI code is not in the mainline LINUX kernel, and OSStorage has had to maintain their own baseline of kernel patches.  We started with the 2.6.24-rc3 baseline that had DMAPI included in the XFS file system many years ago, and have been adding patches to that kernel to keep up with new devices and other kernel changes over the years.
Adding the LIN_TAPE driver (IBM's standard open-source driver for LTO tape drives) to this kernel was challenging.  LIN_TAPE had many conditionals in the source code based on the 2.6.24 release, which our kernel reported as supporting.  But since we had a Release Candidate, some of the SCSI interfaces had changed and our LIN_TAPE port did not match up correctly.  So the LINUX SCSI Interface code in LIN_TAPE had to be modified to match our kernel's implementation.

Migration to LTFS

OSVault transparently restores files from LTFS to the XFS file system when requested.  But prior to that restore, the file had to be copied to tape media.  We decided to modify the migration code that supports writing to BD-RE media (Blu-Ray rewritable media) and apply it to LTFS.  Not many changes were required, and this migration code has been extensively used in our lab and at customer sites since as early as 2003.  The major changes required adding an "ltfs" format to the MySQL database and changing the database calls to support LTFS media entries.  Since BD-RE media only goes to 100GBytes, a lot of subsequent testing was required to make sure that this migration program would continue to work up to 2.5TBytes of storage per media.

LTFS_INV Program

A tape inventory program had to be developed that could query a SCSI Medium Changer (Robotic tape library) and create a database of tape media found in that tape library.  That "ltfs_inv" program has been tested on Qualstar and SpectraLogic brands of tape libraries, but should work on any tape library that supports the SCSI standard for Medium Changers and has LTO-5 or LTO-6 tape drives.
The ltfs_inv program loads each tape in the library into a tape drive, checks whether it already has an LTFS file system on it, and if not, formats the tape with the LTFS file system. The internal volume label of the LTFS file system is also set to match the barcode on the outside of the tape cartridge, and that internal volume label is checked each time OSVault mounts that tape.
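
The inventory logic can be sketched against a simulated library. Nothing here talks to real hardware or reflects the actual ltfs_inv source: the input is a hypothetical mapping from barcode to the tape's internal LTFS label (None for an unformatted tape), and "formatting" just assigns the label.

```python
def inventory(slots):
    """Build database rows from a simulated library.

    slots maps barcode -> internal LTFS label, or None if the tape
    has no LTFS file system yet.
    """
    database = []
    for barcode, internal_label in sorted(slots.items()):
        formatted_now = internal_label is None
        if formatted_now:
            # Stand-in for formatting: the new label is set to the barcode.
            internal_label = barcode
        database.append({
            "barcode": barcode,
            "label": internal_label,
            "formatted_now": formatted_now,
            # The same comparison would be repeated at every mount.
            "label_matches": internal_label == barcode,
        })
    return database


rows = inventory({"OS0901L5": "OS0901L5", "OS0902L5": None, "OS0903L5": "WRONG"})
for row in rows:
    print(row)
```

A tape whose internal label does not match its barcode is flagged rather than reformatted, since it already holds an LTFS file system.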