Virtual Machine (VM) Data Loss: Prevention and Recovery
By Drew Robb
Most companies these days are heavily virtualized. Gartner figures suggest that virtualization rates of 75% or higher are the norm. But benefits such as reduced complexity, better server utilization, increased flexibility, and greater responsiveness to business needs are not the whole picture. The software layer added to create virtual machines (VMs) can make it difficult for admins to know which physical system is running which VMs, and what storage is involved. This added complexity can lead to issues such as data loss.
Surveys by Ontrack reveal the leading reasons for VM data loss to be user error, ransomware, hardware failures, formatting, metadata corruption, and RAID storage corruption. Hardware issues are similar to those suffered by physical systems: failing drives, failing controllers, failing server components, and power issues.
RAID and virtualization don't always mix
RAID corruption is often harder to deal with in virtualized environments. RAID stripes data across disks and reassembles it when requested. However, the complexity of virtualization, coupled with deduplication and compression, can cause problems when RAID controllers try to map where all the information resides across many disks.
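To make the striping idea concrete, here is a minimal sketch in Python. It assumes the simplest possible layout (RAID 0 style, one block per stripe unit, no parity), and the function names are hypothetical, chosen only for illustration. Real controllers use larger stripe units and more elaborate layouts.

```python
def locate_block(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk).

    Assumes plain round-robin striping with one block per stripe unit.
    """
    if num_disks < 1:
        raise ValueError("need at least one disk")
    if logical_block < 0:
        raise ValueError("logical block must be non-negative")
    return logical_block % num_disks, logical_block // num_disks


def stripe(data: bytes, num_disks: int, block_size: int) -> list[list[bytes]]:
    """Split data into blocks and distribute them round-robin across disks."""
    disks: list[list[bytes]] = [[] for _ in range(num_disks)]
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for number, block in enumerate(blocks):
        disk, _ = locate_block(number, num_disks)
        disks[disk].append(block)
    return disks


def reassemble(disks: list[list[bytes]]) -> bytes:
    """Read the blocks back in logical order using the same mapping."""
    total_blocks = sum(len(disk) for disk in disks)
    parts = []
    for number in range(total_blocks):
        disk, offset = locate_block(number, len(disks))
        parts.append(disks[disk][offset])
    return b"".join(parts)
```

The data only comes back intact if the reader uses the same disk order and block size as the writer. With the wrong disk order, every block is still present but the result is scrambled, which is one reason mapping problems at the RAID layer are so disruptive.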
Reformatting of disks, virtual disks, arrays, and volumes, or re-installation of software, are additional causes of data loss in virtualized environments. With VMs, reformatting at the guest or host level often leads to loss of data. Buggy patches and updates are another source of corruption, leading to host file corruption and guest file system damage.
Data loss related to thin provisioning should also be considered. Space may only have been provisioned for immediate needs. Blocks of storage are then added to the virtual disk as it grows. This can be a source of complexity and further fragmentation of virtual environments. Metadata pointers can get lost or damaged, which makes it difficult to rebuild virtual disks.
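The role of those metadata pointers can be illustrated with a toy model. The sketch below is not any vendor's on-disk format. `ThinDisk`, `block_map`, and `physical` are hypothetical names, and it assumes a simple scheme in which physical blocks are allocated in the order they are first written.

```python
class ThinDisk:
    """Toy thin-provisioned disk: physical space is allocated on first write."""

    def __init__(self, virtual_blocks: int, block_size: int = 4):
        self.virtual_blocks = virtual_blocks
        self.block_size = block_size
        # Metadata: virtual block number -> index into the physical store.
        self.block_map: dict[int, int] = {}
        # Backing store; it grows only when a new virtual block is written.
        self.physical: list[bytes] = []

    def _check(self, vblock: int) -> None:
        if not 0 <= vblock < self.virtual_blocks:
            raise IndexError("virtual block out of range")

    def write(self, vblock: int, data: bytes) -> None:
        self._check(vblock)
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        if vblock not in self.block_map:
            # First write to this block: allocate the next physical block.
            self.block_map[vblock] = len(self.physical)
            self.physical.append(data)
        else:
            self.physical[self.block_map[vblock]] = data

    def read(self, vblock: int) -> bytes:
        self._check(vblock)
        pblock = self.block_map.get(vblock)
        if pblock is None:
            # Never written: a thin disk reports zeros without using space.
            return bytes(self.block_size)
        return self.physical[pblock]
```

In this model the physical store holds blocks in allocation order, not logical order. If `block_map` is lost, the data blocks are still there, but nothing records which virtual address each one belongs to.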
Human error obviously plays a part in some of the issues mentioned above. But it comes into its own in areas such as virtual disks being deleted by mistake, and VMs being overwritten or having their space reassigned. And then there is snapshot chain corruption, in which one of a series of snapshots is corrupted, deleted, or otherwise unavailable. This can foul up backups and make it difficult to recover data.
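A snapshot chain can be thought of as a linked list in which each snapshot depends on its parent. The following sketch assumes a simplified, hypothetical catalog (a dictionary mapping each snapshot ID to its parent ID, with `None` marking the base disk) and shows how a single missing link cuts off everything older than it.

```python
from typing import Optional


def walk_chain(
    snapshots: dict[str, Optional[str]], newest: str
) -> tuple[list[str], Optional[str]]:
    """Follow parent links from the newest snapshot back toward the base.

    Returns (snapshots reached, ID of the first missing snapshot or None).
    """
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = newest
    while current is not None:
        if current not in snapshots:
            # The chain is broken here; older data cannot be reached.
            return chain, current
        if current in seen:
            raise ValueError(f"cycle detected at snapshot {current!r}")
        seen.add(current)
        chain.append(current)
        current = snapshots[current]
    return chain, None


# Hypothetical catalog: snap3 -> snap2 -> snap1 -> base
catalog = {"base": None, "snap1": "base", "snap2": "snap1", "snap3": "snap2"}
intact = walk_chain(catalog, "snap3")

# Simulate an accidental deletion in the middle of the chain.
damaged_catalog = {k: v for k, v in catalog.items() if k != "snap1"}
broken = walk_chain(damaged_catalog, "snap3")
```

Here `intact` reaches all four entries, while `broken` stops after `snap3` and `snap2` and reports `snap1` as the missing link, even though the base disk itself is still present.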
VM data recovery options
Obviously, backup is the first line of defense. But backups are not always complete and may become corrupted. In those cases, recovery might be possible through data recovery service providers. It is sometimes possible to directly recover data from physical drives. This is done by taking an image of the drives and reading whatever raw data might be available.
Data can also be recovered from logical volumes (LUNs) or RAID arrays. If the RAID controller is available, it can be used to track down data spread across virtual disks. By determining what the configuration should be, engineers can virtually rebuild the array and gain access to the storage. In the event of RAID controller corruption, the controller can often be emulated so that what is missing can be rebuilt.
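One ingredient of such a rebuild, in single-parity layouts such as RAID 5, is that a missing block in a stripe equals the XOR of the surviving blocks. The sketch below illustrates only that arithmetic. It is not a description of any recovery firm's tooling, the function names are hypothetical, and it ignores real-world details such as parity rotation and stripe unit size.

```python
from functools import reduce
from typing import Optional


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe: list[Optional[bytes]]) -> list[bytes]:
    """Rebuild the single missing member (marked None) of one stripe.

    The stripe holds the data blocks plus one parity block, in any order.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing member")
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


# Three data blocks plus their parity; then one member is lost.
data = [b"VMFS", b"NTFS", b"ReFS"]
full_stripe = data + [xor_blocks(data)]
damaged = [full_stripe[0], None, full_stripe[2], full_stripe[3]]
recovered = rebuild_missing(damaged)
```

After the rebuild, `recovered` matches `full_stripe`. The calculation only works if the blocks passed in really belong to the same stripe, which is why establishing the correct array configuration comes first.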
Moving up to a higher degree of difficulty is recovery at the host file system level. In VMware, this would be VMFS and in Hyper-V, NTFS or ReFS. Even when data isn’t available directly at the storage level, recovery firms can sometimes trace data from the basic storage data blocks, map it to the host level and recompile it.
If that doesn’t bear fruit, specialist firms have additional tools they can employ to extend further into the guest file system level. Further, it may be possible to reach into the guest file level and access data lurking in application files such as SQL, Exchange, SharePoint, Oracle and Office files.
Take the case of a Korean MSP. One of its clients had a NetApp FAS8060 system containing 161 x 900 GB SAS HDDs. A technician mistakenly issued a wipe command on some LUNs, and 45 GB of data went AWOL from the Sybase server. The MSP contacted Ontrack by telephone. A data recovery expert told the MSP to take the storage offline to avoid further overwrite damage. The hard drives were connected remotely to Ontrack’s Remote Data Recovery server. The drives had to be sorted into groups and manually rebuilt to a point in time as close as possible to the moment of the erroneous wipe command. The recovered logical volumes passed integrity checks and were made operational with no loss of data.
The reality of VM data loss
Virtualization may save time and hide complexity from the user. But it comes with its own set of challenges, one of which is a rising incidence of corruption and data loss. Whether through volume corruption, deleted volumes, ransomware, deleted or corrupted virtual backups, RAID and hardware failures, or deleted or corrupt files within virtualized storage systems, data loss is a reality for anyone managing virtual systems - and an issue many don't think of until they lose data.