© Dennis Leeuw dleeuw at made-it dot com
Last updated: 11 Mar 2010
License: GPL
| System usage | / | /home | iSCSI | |
|---|---|---|---|---|
| Logical Volume(s) | /dev/vg-1/lv-1 | /dev/vg-1/lv-2 | /dev/vg-1/lv-3 | unused |
| Volume Group(s) | /dev/vg-1 | | | |
| Physical Volume(s) | /dev/md0 | /dev/sdc1 | /dev/sdd1 | |
| | /dev/sda1 | /dev/sdb1 | | |
| iSCSI | iqn...-1 | iqn...-2 | iqn...-3 | |
| Hardware | SCSI disk | NAS | | |
Let us first look at the terms SAN and NAS. SAN stands for Storage Area Network and NAS for Network Attached Storage. The term NAS is mostly used when storage is provided to a network as file shares, like the Microsoft™ shares that are offered through SMB or CIFS, but also NFS shares.
A SAN offers disks from a network to a server. This is most often an iSCSI network or a Fibre Channel network.
The line between a SAN and a NAS was clearer when the terms were invented than it is now. An iSCSI share can be offered to the network and a CIFS share can be provided to a server. The clearest distinction between the two is still that a share offered by a NAS can be mounted by more than one host at the same time, with some kind of locking preventing access problems to files, while a share offered by a SAN can only be used by a single host in read/write mode.
See http://www.pathname.com/fhs/
mkdir rmdir touch rm cp mv ln
For every file you create there needs to be some extra data, like the filename, the owner, and the access rights. This additional data is called meta-data, while the actual file is called the data.
chmod, chown
A block device is a device that you can read blocks from and write blocks to. Harddisks, DVD-drives, and USB-sticks are all block devices. A block device can read block 12 and then block 5709, so access does not have to be in sequence.
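To get a feeling for this random access without touching a real disk, the sketch below uses a scratch file as a stand-in for a block device and reads two non-adjacent blocks with dd. The 512 byte block size, the temporary file and the marker strings are assumptions made for the example.

```shell
# Scratch file standing in for a block device (created by mktemp).
img=$(mktemp)
bs=512

# Create 5710 empty blocks, then place a marker at the start of
# block 12 and of block 5709 (conv=notrunc keeps the rest intact).
dd if=/dev/zero of="$img" bs=$bs count=5710 2>/dev/null
printf 'twelve' | dd of="$img" bs=$bs seek=12 conv=notrunc 2>/dev/null
printf 'far'    | dd of="$img" bs=$bs seek=5709 conv=notrunc 2>/dev/null

# Read block 12 and then block 5709 directly; the blocks in between
# are never read. tr strips the zero padding of the block.
first=$(dd if="$img" bs=$bs skip=12 count=1 2>/dev/null | tr -d '\0')
second=$(dd if="$img" bs=$bs skip=5709 count=1 2>/dev/null | tr -d '\0')
echo "$first $second"

rm -f "$img"
```

On a real block device the same skip and seek options work, but there a mistake overwrites real data, so experiment on files first.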
Probably the only character device you may come across when dealing with storage is a tape unit.
A block is a size unit. The size of a block on a harddisk is 512, 1024, 2048 or 4096 bytes. More is also possible. The block size means that when you write a file of 12 bytes to a disk that is formatted with a blocksize of 2048, the file occupies 2048 bytes. You might think that it is then best to always use the smallest blocksize possible. This is not true. The smaller the blocksize, the more blocks are needed on a disk to hold the data, which means that the chance of fragmentation gets higher, which degrades performance.
Larger block sizes will help disk I/O performance when using large files, such as databases. A larger block can be read at once and there are fewer blocks to read, so less searching is done.
There of course also needs to be some table that tells which blocks belong to which file. The more entries in this table, the longer it takes to retrieve the entire file.
A harddisk cannot read blocks smaller than 512 bytes. So smaller files can only be read by reading 512 bytes and discarding the rest.
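The space a file takes up can be estimated with a little shell arithmetic. The function below is only a sketch: the name occupied_bytes is made up for this example, and it ignores metadata and filesystem tricks such as storing very small files inside the inode.

```shell
# Bytes a file occupies on disk for a given block size.
# usage: occupied_bytes <file size in bytes> <block size in bytes>
occupied_bytes() {
    size=$1
    bs=$2
    # Round up to whole blocks: ceil(size / bs) with integer arithmetic.
    blocks=$(( (size + bs - 1) / bs ))
    echo $(( blocks * bs ))
}

occupied_bytes 12 2048      # the 12 byte file from the text: 2048
occupied_bytes 12 512       # same file, smallest block size: 512
occupied_bytes 5000 4096    # needs two blocks of 4096: 8192
```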
Each block device has an entry in /dev. IDE-disks start with hd and SCSI-devices start with sd. The disks are identified by letters of the alphabet, and partitions are identified by numbers. So the first partition on the first IDE-disk is called /dev/hda1.
Use ls -l on /dev/disk/ to see more information about how everything is interrelated. All files are symlinks in the directories below /dev/disk, so you can see which id, path, label or uuid is mapped to which block device.
ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root 9 2009-11-16 09:46 ata-WDC_WD800JD-75MSA3_WD-WMAM9ARM8462 -> ../../sda
lrwxrwxrwx 1 root root 10 2009-11-16 09:46 ata-WDC_WD800JD-75MSA3_WD-WMAM9ARM8462-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-11-16 09:46 ata-WDC_WD800JD-75MSA3_WD-WMAM9ARM8462-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 2009-11-16 09:46 scsi-SATA_WDC_WD800JD-75M_WD-WMAM9ARM8462 -> ../../sda
lrwxrwxrwx 1 root root 10 2009-11-16 09:46 scsi-SATA_WDC_WD800JD-75M_WD-WMAM9ARM8462-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-11-16 09:46 scsi-SATA_WDC_WD800JD-75M_WD-WMAM9ARM8462-part2 -> ../../sda2
ls -l /dev/hd*
ls -l /dev/sd*
mknod:
mknod -m 666 hda b 3 0
mknod -m 666 hda1 b 3 1
mknod -m 666 hda2 b 3 2
mknod -m 666 hda3 b 3 3
To play around with disks and test what is described here, you probably do not want to destroy your existing disks. If you do not have spare disks to work with, you can create some disk image files. To do so, first create one or more files like this:
dd if=/dev/zero of=disk-image bs=1 count=1M
This creates a file called "disk-image" of 1 Megabyte containing only zeros. Note though that you need to set at least count=2M if you want to format the image with a journal (ext3). The remainder of the text will assume a 3M file.
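Writing single bytes is slow for larger images. As a sketch of an alternative, the commands below create the 3M file in three blocks of 1 MiB and check the resulting size. The scratch directory is an assumption for the example, and the bs=1M and stat -c notations are those of GNU dd and GNU stat.

```shell
# Scratch directory so that no existing file is overwritten.
dir=$(mktemp -d)
img="$dir/disk-image"

# Three blocks of 1 MiB each instead of millions of 1 byte writes.
dd if=/dev/zero of="$img" bs=1M count=3 2>/dev/null

# %s prints the file size in bytes: 3 * 1024 * 1024 = 3145728.
img_size=$(stat -c %s "$img")
echo "$img_size"

rm -rf "$dir"
```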
Since we are now having a flat file with nothing on it we should make it more like a disk:
parted disk-image mklabel loop
If you have an image that is larger than 2 Terabyte, use gpt as the disk label.
Next we will create a single partition spanning the entire "disk":
parted disk-image mkpart primary ext2 0 3M
If you want to test LVM or RAID, you need to set the following too:
parted disk-image set 1 lvm on
And use the RAID or LVM tools to configure the device.
Assuming that we, at this point, do not use RAID or LVM, we need a way to treat the created file as a "normal" block device:
losetup /dev/loop0 disk-image
This assumes that this is the first loop device we use on our system. Use:
losetup -a
to view all loop devices and what they are connected to, or use:
losetup -f
to find the first free loop device.
From this point on we can use our normal tools to deal with the file. With the mkfs tools you can create a filesystem on it:
mkfs.ext3 -b 2048 -L image1 /dev/loop0
This creates an Ext3 filesystem with a blocksize of 2048 and a volume name of "image1". To view what the superblock has to say about the created image use:
tune2fs -l /dev/loop0
To be able to access the contents, we need to mount it:
mount -t ext3 /dev/loop0 /mnt
If you do an ls of the /mnt directory you should now see a "lost+found" directory.
To remove everything we have done:
umount /mnt
losetup -d /dev/loop0
rm -f disk-image
Since the Linux kernel supports a lot of different filesystem types, there needs to be some sort of abstraction between the actual filesystem and the tools you use to manipulate files. Otherwise every tool would need to know about all the different filesystems, or you would need a copy command per filesystem type. Since automation is about making life easier, most of the tools talk to the kernel as if there is just a single filesystem: VFS (Virtual File System). The VFS then passes the requests on to the actual device driver (module).
| File Access | ||
| VFS | ||
| ext3 | iso9660 | NFS |
| Disk | CD-ROM | Network |
As you can see from the above table a filesystem can be local, like a harddisk, CD-ROM or USB-stick, but network filesystems are also handled through VFS. Linux also has some special filesystems like /proc and /sys.
Since everything within a UNIX™ system is a file, that means that files, special files and even directories are files. All these files are represented on the disk using inodes. To represent a disk (or partition) the VFS system uses Superblocks. More on this in the following sections.
The Superblock contains all the information about a certain filesystem (partition). It maintains all the details about the filesystem, like the block size, the total number of blocks available on the filesystem, the number of free blocks, the number of inodes on the filesystem, and the root inode of this filesystem.
The Superblock only handles metadata of mounted filesystems. It contains amongst others:
| device | The device this Superblock belongs to |
| blocksize | The blocksize of this filesystem |
| fstype | The filesystem type for this filesystem |
The first thing we need to describe is the dentry (directory entry) table, which describes which filename has which inode (index node). It is comparable to a DNS lookup table. It translates names to numbers (filenames to inode numbers).
The inode is a table with a unique number containing metadata that describes a file and its access rights. An inode holds amongst other things:
| device | A device identifier of the device that the file is stored on |
| number | Contains the inode number |
| count | The number of references currently held on this inode (usage counter) |
| mode | The file mode, like 0750. So it indicates the file type and the access rights. |
| nlink | The number of hardlinks to this file |
| uid | The User ID of the owner of this file |
| gid | The Group ID of this file |
| size | The size of the file |
| atime | The last access time |
| mtime | The last modification time |
| ctime | The last change time |
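The relation between filenames, inode numbers and the nlink field can be observed from the shell. The following is a small sketch that assumes GNU stat; the file names in the scratch directory are made up for the example.

```shell
# Work in a scratch directory so nothing existing is touched.
dir=$(mktemp -d)
echo "hello" > "$dir/original"

# A hard link is a second directory entry pointing to the same inode.
ln "$dir/original" "$dir/hardlink"
# A symlink is a separate file with its own inode that stores a path.
ln -s "$dir/original" "$dir/symlink"

# %i = inode number, %h = number of hard links, %n = file name.
stat -c '%i %h %n' "$dir/original" "$dir/hardlink" "$dir/symlink"

ino_orig=$(stat -c %i "$dir/original")
ino_hard=$(stat -c %i "$dir/hardlink")
ino_sym=$(stat -c %i "$dir/symlink")
nlink=$(stat -c %h "$dir/original")

rm -rf "$dir"
```

The first two lines of the output show the same inode number with a link count of 2, while the symlink has an inode number of its own.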
http://www.linuxhq.com/guides/TLK/fs/filesystem.html
When formatting a device for an ext2 filesystem you can use the -b option to tell the system the blocksize:
mke2fs -b 4096 <device>
To see the blocksize of a certain filesystem use:
stat -f [<device>|<mount-point>]
or
dumpe2fs -h <device>
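If you only want the number, for example in a script, stat can print just the fields you ask for. A short sketch, assuming GNU stat and using the root filesystem as the mount point:

```shell
# %S = fundamental block size of the filesystem, %T = filesystem type.
fs_blocksize=$(stat -f -c %S /)
fs_type=$(stat -f -c %T /)
echo "/ is of type $fs_type and has a block size of $fs_blocksize bytes"
```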
http://en.wikipedia.org/wiki/Ext3
Disk limitations
When a disk is full, it is full: you cannot grow it. Even if another partition has space, you cannot shrink one and grow the other without backing up both partitions first.
Create an LVM, note that this is destructive for the data on the disks:
pvcreate /dev/md0 /dev/md1
Use pvdisplay to show you what is done.
To revert this action use:
pvremove /dev/md0 /dev/md1
Create the volume group:
vgcreate vol1 /dev/md0 /dev/md1
Use vgdisplay to see what has been done. For a less verbose output use vgscan.
To rename a Volume Group:
vgrename vol1 vg-1
To remove a Volume Group:
vgremove vg-1
Create the Logical Volume(s):
lvcreate --name lv-1 --size 40G vg-1
lvcreate --name lv-2 --size 5G vg-1
lvcreate --name lv3 --size 1G vg-1
Use lvdisplay to see what happened. For a less verbose output use lvscan.
To rename Logical Volumes use:
lvrename vg-1 lv3 lv-3
Do NOT use the following commands on a live filesystem. They are only in this section to learn from when you have a new Logical Volume, and for completeness. Resizing existing, mounted, filesystems is described in the section Dealing with live systems.
To "grow" a Logical Volume use:
lvextend -L1.5G /dev/vg-1/lv-3
To "shrink" a Logical Volume use:
lvreduce -L1G /dev/fileserver/media
WARNING: Reducing active logical volume to 1.00 GB
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce media? [y/n]: y
Reducing logical volume media to 1.00 GB
Logical volume media successfully resized
Your Logical Volumes are now ready for use. The only thing you still need to do is to create filesystems on them and mount them.
First we need to unmount the filesystem:
umount /var/share
Check to make sure that the filesystem is unmounted:
df -h
Enlarge the Logical Volume:
lvextend -L50G /dev/vg-1/lv-3
Check the current file system to make sure that it is without errors:
e2fsck -f /dev/vg-1/lv-3
The output of e2fsck reports the number of blocks that have been checked. If you ever want to resize the disk back to its current size, make a note of the number of blocks so you can use it to shrink the disk again.
Extend the ext3 file system to also use the additional disk space:
resize2fs /dev/vg-1/lv-3
After this you can remount the filesystem and be happy.
Storing all your data on your server is worth nothing when a simple disk crash destroys it all. So backups are needed. But there is more to backing up than just copying data so you can restore it in case of an emergency. Another scenario is that you want to move data from your primary storage disks to tape, because you do not need it online, but cannot throw it away. So you store the data on cheaper storage, like tapes. The last backup scenario is to back up data where you want to be able to retrieve older versions of a single file. This means you do some version control on a document by making, for example, daily copies, so you can always go back to an earlier version later on.
All these different scenarios have an impact on how you do your backups. Another important aspect is the safety of the backed up data. If for example your backup machine is next to the machine holding the data you have a copy, but when a fire breaks out it is of no use, because most likely both machines will be lost.
There are different solutions to this problem. If you back up to tape, you could move the tape with the backup off site. To restore the data after a fire you might need to buy a new tapestreamer, but your data is safe. You could also decide to place the entire tapestreamer off site and do your backup across the network, or even across the Internet. It all depends on the amount of data, how much security you need, and how much money you are willing to spend.
http://www.bacula.org/2.4.x-manuals/en/developers/developers.pdf
The design of your backup system is dependent on your local situation. The simplest form of backup is to copy your data to another harddisk. Before you act it is always good to think first. This section discusses some points to think about before diving into a backup design.
The first thing to realize is the possible failure scenarios, and probably more importantly how to avoid them. A couple of examples:
If disaster strikes it is good to have a plan according to which you can recover. It is important that the plan is accessible after the disaster, so an online version might not be the best solution. In this plan the following points should be addressed:
Notes before you start your design:
Windows users are used to the archive bit, which is set when a file needs to be backed up and cleared once the backup is done. GNU/Linux and Unix in general use another scheme. The ext2 and ext3 file systems provide an atime, a ctime and an mtime for every file. The atime is the time the file was last accessed (read), the ctime is the time the inode was last changed and the mtime is the time the file was last modified. The last two probably need a little more explanation. When the mtime changes, the ctime changes too, so when a file is written to, both change. When you change the rights on a file, or change the ownership, you only adjust information that is stored in the inode, so only the ctime changes. This is why backup software will use the ctime to decide which files to back up. (See stat and touch.)
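The difference between mtime and ctime is easy to see for yourself. The sketch below assumes GNU stat and a filesystem that stores timestamps with at least one second resolution; the scratch file comes from mktemp.

```shell
f=$(mktemp)
echo "data" > "$f"

# %Y = mtime, %Z = ctime, both in seconds since the epoch.
mtime_before=$(stat -c %Y "$f")
ctime_before=$(stat -c %Z "$f")

# Changing the access rights only touches the inode
# (mktemp creates the file with mode 600, so 640 is a real change).
sleep 1
chmod 640 "$f"
mtime_after_chmod=$(stat -c %Y "$f")
ctime_after_chmod=$(stat -c %Z "$f")

# Writing to the file changes the data, so both timestamps move.
sleep 1
echo "more" >> "$f"
mtime_after_write=$(stat -c %Y "$f")
ctime_after_write=$(stat -c %Z "$f")

echo "mtime: $mtime_before $mtime_after_chmod $mtime_after_write"
echo "ctime: $ctime_before $ctime_after_chmod $ctime_after_write"
rm -f "$f"
```

After the chmod only the ctime has moved forward, after the write both have.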
The emergency-backup consists of a full backup followed by a couple of incrementals, after which this cycle repeats itself.
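One possible way to get such a cycle, shown here only as a sketch, is the listed-incremental mode of GNU tar. The choice of GNU tar, the scratch directory and the file names are assumptions for the example and not a recommendation for a production setup.

```shell
work=$(mktemp -d)
mkdir "$work/data"
echo "first" > "$work/data/old.txt"
sleep 1

# Full (level 0) backup; tar records what it has seen in the snapshot file.
tar -C "$work" --listed-incremental="$work/snapshot" -cf "$work/full.tar" data

# Something new appears after the full backup.
sleep 1
echo "second" > "$work/data/new.txt"

# Incremental backup: work on a copy of the snapshot file, so the
# level 0 state stays available for the next incremental.
cp "$work/snapshot" "$work/snapshot.1"
tar -C "$work" --listed-incremental="$work/snapshot.1" -cf "$work/incr1.tar" data

# The incremental archive only holds what changed since the full backup.
incr_list=$(tar -tf "$work/incr1.tar")
echo "$incr_list"

rm -rf "$work"
```

To restore, the full archive is extracted first and the incrementals after it, in the order in which they were made.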
Characterization:
Requirements are:
Questions that need to be answered:
The archive-backup consists only of a full backup of the data after which the data can be removed from the primary disks. To make sure this data is redundant it needs to be written to two tapes, which should be stored separate from one another.
Characterization:
Requirements:
Questions that need to be answered:
If you have dropped a tape cartridge, restore its content and back it up again, since dropped tapes have a shorter lifespan.
To maximize tape life, tape cartridges should be kept in an atmosphere free of contaminating dust particles and corrosive gases or chemicals. Cartridges should always be acclimated to the operating environment prior to mounting the cartridge on the drive. A minimum of 24 hours of acclimation time is generally recommended to make sure the cartridge is at the same humidity and temperature as the drive for newly received tapes.
The National Bureau of Standards publication, Care and Handling of Computer Magnetic Storage Media, recommends that magnetic tape be stored at 65 +/- 3 degrees Fahrenheit and 40% +/- 5% Relative Humidity.
National Media Laboratory = NML
Studies by the NML indicate that magnetic media, properly cared for, should have a lifetime which equals or exceeds that of the recording technology (10 to 20 years).
tar mt st
Take notes from: http://www.backupcentral.com/phpBB2/two-way-mirrors-of-external-mailing-lists-3/emc-networker-19/recommendations-for-new-tape-library-62149/index-15.html
http://searchstorage.techtarget.co.uk/news/column/0,294698,sid181_gci1295968,00.html
http://mailman.eng.auburn.edu/pipermail/veritas-bu/2009-April/103851.html
http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1267114354407+28353475&threadId=1212744
http://storage.ittoolbox.com/groups/vendor-selection/storage-select/tape-autoloader-selection-1832275
See: http://en.wikipedia.org/wiki/Linear_Tape-Open
Common optical media formats have a storage life of 30 years or more.
Store discs in a cool, dry environment away from direct light. Discs stored between 23 degrees F (-5 degrees C) and 86 degrees F (30 degrees C) can last up to 100 years
Do not leave the disc in direct sunlight or in a hot, humid environment--like your car on a summer day--as these conditions could warp and damage the disc
Do not allow moisture to condense on the disc.
hdparm -i gives you more information about the hardware, like disk manufacturer, serial number and disk geometry.
Serial ATA is ATA over serial lines. SATA uses smaller cables than parallel ATA, which leaves more room and thus allows better cooling in the computer housing. SATA also no longer uses the master/slave setup and is hot-pluggable.
Serial-attached SCSI has thinner cables, less bulky connectors and allows for longer cables. The hardware is cheaper and less prone to crosstalk.
| RAID level | Technology | Min disks | Max failures | Capacity | Benefit | Downside |
|---|---|---|---|---|---|---|
| 0 | Striping | 2 | 0 | N*s | Speed | No redundancy |
| 1 | Mirroring | 2 | 1 | s | Read double speed, write single speed | |
| 3 | Parity | 3 | 1 | (N-1)*s | Safety is gained by a parity disk, which only costs one disk | Disk is only available for read or write |
| 4 | Parity per stripe | 3 | 1 | (N-1)*s | Safety is gained by a parity disk, which only costs one disk, read and write access | |
| 5 | Parity distributed over disks | 3 | 1 | (N-1)*s | No limit speed of parity disk | |
| 6 | 2 parity blocks distributed over disks | 4 | 2 | (N-2)*s | No limit speed of parity disk | High write penalty |
A combination of the above options is possible. One could create e.g. RAID 10, with 2 times 2 mirrored drives that hold a stripe set, or RAID 01, where you have a stripe set which is then mirrored. Capacity (N*s)/2, failure 1 disk, or an entire mirror set.
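The capacity column of the table can be turned into a small calculator. This is just a sketch of the formulas above; the function name raid_capacity is made up, N is the number of disks and s the size of a single disk in any unit you like.

```shell
# usage: raid_capacity <RAID level> <number of disks N> <size per disk s>
raid_capacity() {
    level=$1
    n=$2
    s=$3
    case "$level" in
        0)     echo $(( n * s )) ;;           # striping: N*s
        1)     echo "$s" ;;                   # mirroring: s
        3|4|5) echo $(( (n - 1) * s )) ;;     # one disk worth of parity
        6)     echo $(( (n - 2) * s )) ;;     # two disks worth of parity
        10|01) echo $(( n * s / 2 )) ;;       # mirrored stripes: (N*s)/2
        *)     echo "unknown RAID level: $level" >&2; return 1 ;;
    esac
}

# Four disks of 500 GB each:
raid_capacity 0 4 500     # 2000
raid_capacity 5 4 500     # 1500
raid_capacity 6 4 500     # 1000
raid_capacity 10 4 500    # 1000
```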
Simply said iSCSI is a protocol that allows you to send SCSI commands over IP networks. It is a combination of a NAS and a SAN, where it takes the best of both worlds.
iSCSI is client-server-based. The client is called the initiator and the server is called the target. The initiator sends SCSI commands to the target and the target sends the results back to the initiator. In hardware comparison, the initiator is the SCSI adapter and the target is the SCSI disk. With iSCSI the SCSI-bus is replaced by a network and TCP/IP.
The basic concept from a Linux point of view is that iSCSI makes block devices (disks) available across the network. This means you can partition it and create a filesystem on it. Because of this the initiator needs to have exclusive rights on the target. The target can only supply an iSCSI device to a single initiator.
There are two types of initiators:
Software initiators consist of one or more kernel drivers, a TCP-stack and a network card (NIC).
SCSI is CPU intensive, especially during high I/O loads. Hardware initiators offload the main CPU of a computer by implementing a SCSI ASIC on a network card together with some form of TOE (TCP Offload Engine). The combination is often called an HBA (Host Bus Adapter). The HBA appears to the host as a "normal" SCSI adapter. With an option ROM it is even possible to boot a system from an iSCSI disk.
TOE handles part or all of the TCP-stack handling on the NIC, which frees resources on the main CPU.
Since a host can supply multiple disks to the network, an IP address alone is not sufficient to address a target. Next to the IP address, a port is used to connect to an iSCSI host. The combination of an IP address and a port is called the iSCSI portal. This way you can run multiple portals on a single host. The default port for the iSCSI portal is TCP/3260 and if a system port is needed port TCP/860 should be used.
iSCSI also has a globally unique addressing scheme to refer to a target. Both the initiator and target need to have such an iSCSI address. The address could look like this: iqn.1987-05.com.cisco:01.4ee667decaeb.
| Type | Date | Naming Authority | String defined by the naming authority |
|---|---|---|---|
| iqn | 1987-05 | com.cisco | 01.4ee667decaeb |
| iqn | 1992-05 | com.emc | ck2000544005070000-5 |
| iqn | 1992-05 | com.example | storage.disk1.sys1.xyz |
The Type field can hold three possible values: iqn (iSCSI Qualified Name), eui (IEEE EUI-64 format) or naa (T11 Network Address Authority 64 or 128-bit identifier).
The Date field holds the date of the first full month an authority was registered.
The Naming Authority is the reversed domain name of the authority. Most of the time, however, an IP address is used for the actual setup and not the supplied domain name. So the domain name that is actually used might not be that relevant.
The String can hold any naming convention the authority wants to use. It is by the way legal to leave this field empty.
iSCSI addresses are written in the following format: iqn.1987-05.com.cisco:01.4ee667decaeb. There is a colon (:) between the domain part and the actual device addressing. All other elements are separated by dots (.).
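To make the format concrete, the sketch below splits an iqn address into the four fields of the table using only shell parameter expansion. The function name parse_iqn is made up for this example, and the eui and naa types are not handled.

```shell
# Split an address of the form type.date.naming-authority:string
parse_iqn() {
    addr=$1
    prefix=${addr%%:*}               # everything before the first colon
    case "$addr" in
        *:*) string=${addr#*:} ;;    # everything after the first colon
        *)   string="" ;;            # the string part may be left empty
    esac
    type=${prefix%%.*}               # first dot separated element
    rest=${prefix#*.}
    date=${rest%%.*}                 # second dot separated element
    authority=${rest#*.}             # remainder: the reversed domain name
    echo "type=$type date=$date authority=$authority string=$string"
}

parse_iqn iqn.1987-05.com.cisco:01.4ee667decaeb
parse_iqn iqn.1992-05.com.example:storage.disk1.sys1.xyz
```

For the first address this prints type=iqn date=1987-05 authority=com.cisco string=01.4ee667decaeb.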
An initiator can ask a target for a list of available devices; this process is called Discovery or auto-discovery. This is of course the easiest way of connecting to an iSCSI target. The other way is to directly supply the portal and target or to use iSNS (Internet Storage Name Service). iSNS is like a DNS for storage networks.
To provide an iSCSI disk to the network, the underlying hardware can be any kind of disk and does not need to be a SCSI disk or a real disk for that matter, you could as easily provide RAM disks to the network or a disk image.
Fix
Set the initiator name in /etc/initiatorname.iscsi so that you can configure your targets as needed:
InitiatorName=iqn.1987-05.com.cisco:01.7fcfb3a9ad78
Start the iSCSI service:
service iscsi start
With iscsi-ls you can now see your iSCSI targets.