Random Musings about Raw Disks vs File Systems and Asynchronous I/O
(note: this is heavily Solaris information)

There are two reasons to put Sybase devices on raw disks:
1. Performance: Raw disk partitions (meaning disk slices that do NOT
have a file system built upon them) allow the operating system to
access the data in an asynchronous (non-serial) manner.  Sybase performs
far better on raw partitions when accessing data and log devices
asynchronously (but not tempdev; see below).

2. Reliability: Most file systems buffer writes to disk
in order to speed up physical access.  Sybase works as a transactional
system, meaning that once the Engine believes a transaction has
been committed to disk, it deletes all transactional log evidence
of the transaction.  If the underlying device is a file system,
and if the writes are buffered, then a gap exists between commit
and physical writing to disk.  If the Sybase server crashes at this
point, unrecoverable data corruption will occur.  If this corruption
is on the master device, the entire server can be lost.  This may seem
like a one-in-a-million problem, but it's common: it shows up as the
dreaded 605 error.
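
A rough sketch of that gap, in Python rather than anything Sybase
specific.  The two function names are made up for illustration; the only
real calls are os.write and os.fsync.  The first function returns as soon
as the file system has accepted the bytes (they may still be in its write
buffer), the second returns only after the OS has been asked to flush
them to the device.

```python
import os


def append_buffered(path: str, record: bytes) -> None:
    """Append a record and return once the OS has accepted it.

    Closing the file hands the bytes to the operating system, but they
    may still be sitting in the file system's write buffer.
    """
    with open(path, "ab") as f:
        f.write(record)


def append_durable(path: str, record: bytes) -> None:
    """Append a record and return only after fsync() has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        offset = 0
        while offset < len(record):
            # os.write may write fewer bytes than requested
            offset += os.write(fd, record[offset:])
        os.fsync(fd)  # ask the OS to push buffered data to the device
    finally:
        os.close(fd)
```

Both calls leave the same bytes in the file; the difference only matters
if the machine goes down between the write and the flush.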

Despite the increased difficulty in administration, DBAs must insist
on raw partitions for data and log devices.  Some admins and programmers
destroy the efficiency of their databases because of their insistence on
using file based devices (there is more peace of mind in being able to do
an ls -l on your file than in administering a raw device that you can't
"see").  I see this more often than not with Oracle database
configurations; despite the fact that Oracle can work with raw disk
partitions, admins often put the devices on flat files.

- UPDATE: Apparently in Sybase 12.0, significant advances have been
made in the Sybase product line from a recovery standpoint, to the point
where they're no longer recommending the use of raw devices for
implementations on Sun and HP systems, PROVIDED you are installing on a
large scale 64-bit machine with ample physical memory.  With enough RAM
at your disposal, writes can be cached and performance does not take a
hit from using synchronous file system devices.

----
- Some devices are better off on file systems; if you have tempdb intensive
activity, putting tempdev on a file system can easily result in
a 33% performance increase (I've documented such results for my last
two P&T clients).  This is because tempdb is accessed in a serial,
synchronous manner by Sybase.  And, since tempdb is dropped and
recreated upon each restart of Sybase, there are no recovery concerns
(like the ones raised above).

- Some argue effectively that test and dev environments are easier
to manage/administer by putting devices on file systems.  You're not
terribly worried about performance, and the ease of deleting files
from file systems for quick turnaround time demanded by developers
when upgrading environments sometimes outweighs the recovery hassles
inherent to such a config.

- There are some who argue effectively that DSS based data devices
will benefit from being placed on flat files.  This is because the O/S
typically does far larger I/Os than Sybase is able to do (most at
64K these days), and thus when reading large amounts of data which
are more or less contiguous on the data device, read performance should
improve.  The O/S will buffer 64K of I/O before Sybase requests it.

- Occasionally I've configured Sybase systems to have file-system
based devices for some system devices: sybsystemprocs and sybsyntax.
These are regenerable and subject to little write activity.
I recommend NEVER putting the master device on a flat file.  Update: the
11.5 install manual actually recommends putting sysprocsdev on flat files.

- Note: if you use file system based devices, Sybase will preallocate
the entire device size in your directory.  It writes its pertinent
data, and zeros out the rest of the file.  If these files are deleted
(which I've seen happen) you have major corruption on your hands.

- Some sites set up their Sybase devices using pointers (soft links)
to the devices instead of the actual devices themselves.  I personally
do NOT recommend doing this; there's no need to add another breakpoint
in a Sybase configuration.  If the directory holding those soft links
gets corrupted or unmounted, the Sybase server is in peril.  However,
if you do this, Sybase will follow the link and use async I/O if the
link is to a raw partition.  In fact, the entries in /dev/rdsk are
themselves soft links to the disk entries in the /devices directory
(on Solaris anyway).
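
If you want to check what a device path really points at, here is a small
sketch.  classify_device is a made-up helper name; it follows any soft
links with os.path.realpath and reports the file type of the final
target.  It only tells you the type of the file, not which I/O mode
Sybase will actually end up using.

```python
import os
import stat
from typing import Tuple


def classify_device(path: str) -> Tuple[str, str]:
    """Follow soft links and return (resolved_path, kind_of_file)."""
    resolved = os.path.realpath(path)  # resolves every link in the chain
    mode = os.stat(resolved).st_mode
    if stat.S_ISCHR(mode):
        kind = "character (raw) device"
    elif stat.S_ISBLK(mode):
        kind = "block device"
    elif stat.S_ISREG(mode):
        kind = "regular file"
    else:
        kind = "other"
    return resolved, kind
```

Pointing it at an entry under /dev/rdsk on a Solaris box should show the
resolved path living under /devices, as described above.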

- You can tell, without investigating the sysdevices table, how Sybase
is accessing each device by examining the errorlog.  Upon startup,
Sybase analyzes each device and determines whether or not it can
access the data asynchronously (by default it will attempt async I/O
over standard/synchronous I/O).
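
A hedged sketch of pulling those startup lines out of the errorlog.  The
exact wording of the messages differs between versions, so the two marker
strings below are ASSUMPTIONS, not quotes from any manual; check your own
errorlog and pass in whatever text it really uses.  The function name is
made up as well.

```python
from typing import Iterable, List, Tuple


def io_mode_lines(
    log_lines: Iterable[str],
    async_marker: str = "asynchronous i/o",  # assumed wording
    sync_marker: str = "standard unix i/o",  # assumed wording
) -> List[Tuple[str, str]]:
    """Return (mode, line) pairs for lines mentioning either marker."""
    async_marker = async_marker.lower()
    sync_marker = sync_marker.lower()
    found = []
    for line in log_lines:
        lowered = line.lower()
        if async_marker in lowered:
            found.append(("async", line.rstrip("\n")))
        elif sync_marker in lowered:
            found.append(("standard", line.rstrip("\n")))
    return found
```

Usage would be something like io_mode_lines(open(path_to_errorlog)),
where path_to_errorlog is wherever your server writes its errorlog.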

- Security note: using raw devices makes unauthorized access to your
data files a more complicated endeavor, if you are concerned about security
of data stored within your databases.  A flat file data device can
be perused easily, whereas data stored in raw devices cannot.  It is still
possible to convert raw devices to flat files (through the dd command)
but it's a hassle that some hackers may not bother with.

- If you plan to use a raw disk, then you must format the disk correctly.
In Solaris (I don't know about other OSs) the first cylinder of a new disk
is used to contain the disk partitioning information for the disk.
If you create a Sybase device on a raw disk that includes cylinder 0,
you will eventually have corruption.  Always set up the first
partition (slice 0) to start on cylinder 1 at the earliest (I normally
leave a few cylinders of free room, depending on the size of the disk).
On AIX, you must set "vstart=2" on all disks to avoid this problem (and you
must set vstart higher if you're mirroring your devices).  AIX
attempts to write logical volume information to the initial parts of disks.
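
A small sanity check for the cylinder 0 rule, as a sketch.  The Slice
type and the function name are hypothetical, and the cylinder numbers
have to be typed in from whatever your partitioning tool reports; nothing
here reads the disk label itself.  Only pass in the slices you intend to
hand to Sybase (on Solaris, slice 2 conventionally covers the whole disk,
cylinder 0 included, so it would always be flagged).

```python
from typing import Iterable, List, NamedTuple


class Slice(NamedTuple):
    name: str
    first_cylinder: int
    last_cylinder: int


def slices_touching_cylinder_zero(slices: Iterable[Slice]) -> List[str]:
    """Return the names of slices whose cylinder range includes cylinder 0."""
    return [s.name for s in slices if s.first_cylinder <= 0 <= s.last_cylinder]
```

An empty list back means none of the listed slices overlaps the disk
label area.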

- It's impossible to force Sybase to access a synchronous device in
an asynchronous manner.  There is a server wide option that seems
to force the engine to read all devices synchronously, but it's a
misleading option.  In fact, I think it's a bug in Sybase 11.0.x.  If you do a
1> sp_configure "async"
2> go

you'll see an "allow server async i/o" option.  It defaults to
"0," or off.  Yet, if you reconfigure it and reboot, it does not change.
That's because, as explained above, Sybase determines accessibility on
a per-device basis, not at a server level.

- The only way I know of that allows async behavior on a filesystem type
device is if you're using the Veritas File System (vxfs).  Vxfs allows
all the write buffering inherent in a file system plus the ability to
have async behavior.

- NFS mounted raw devices, I believe, will result in standard/synchronous
access.  This is because the NFS system somehow serializes access over
its nfsd protocol.

- The message "dstartio: I/O request repeatedly delayed..." in the server
errorlog indicates an overload of the I/O activity caused by async behavior.
Sybase has run out of disk I/O structures internally and has had to
delay your I/O activity.  If you look (on Solaris) in the file
/usr/include/sys/asynch.h you'll see a #define MAXASYNCHIO line with a
number (typically 200 on default Solaris installations).  If you set the
three async I/O server parameters to this default value (disk I/O structures,
max async I/Os per engine, and max async I/Os per server) you should never
see this message.  However some sites report still seeing the dstartio
messages even when setting these parameters to the default.  Usually, 500
disk I/O structures will be more than sufficient.  (On other O/Ss, you'll
want to set cnblkio, cnblkmax, and cnmaxaio_server to the same value.
This info should be documented in the Error Messages guide.)  In fact,
one DBA reports that Sybase Tech support recommends bumping up the
MAXASYNCHIO value to 500 in 2.5.1 and greater.

In later versions of Solaris, MAXASYNCHIO can be defined in /etc/system.
However I have yet to see a system that has done this.
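
Two small helpers for the checks above, as a sketch; both function names
are made up.  The first pulls the number off a "#define MAXASYNCHIO" line
given the text of the header file, the second counts how many errorlog
lines carry the dstartio message quoted above.  Neither touches the
Sybase configuration itself.

```python
import re
from typing import Iterable, Optional

# Matches a line such as "#define MAXASYNCHIO 200"
_MAXASYNCHIO_RE = re.compile(
    r"^\s*#\s*define\s+MAXASYNCHIO\s+(\d+)", re.MULTILINE
)

DSTARTIO_TEXT = "dstartio: I/O request repeatedly delayed"


def read_maxasynchio(header_text: str) -> Optional[int]:
    """Return the MAXASYNCHIO value from header text, or None if absent."""
    match = _MAXASYNCHIO_RE.search(header_text)
    return int(match.group(1)) if match else None


def count_dstartio(log_lines: Iterable[str]) -> int:
    """Count errorlog lines that contain the dstartio delay message."""
    return sum(1 for line in log_lines if DSTARTIO_TEXT in line)
```

For example, read_maxasynchio(open("/usr/include/sys/asynch.h").read())
on a Solaris box, and count_dstartio(open(path_to_errorlog)) with
path_to_errorlog being wherever your errorlog lives.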

- Backup/Recovery: you cannot reliably back up a Sybase raw device
using dd.  This is because of the "dirty page" concept: when a
transaction is committed, only the log information is actually
written to disk.  The rest of the information still resides in
Sybase's memory (until it is either flushed out of the buffer cache by
incoming pages or written by a checkpoint command).  Thus, unless
the Sybase server is shut down (thus guaranteeing that all data pages
in memory have been written to disk) dumps of the raw devices will
not have all the database's information.  Always use the
Sybase dump commands to back up databases; that's what the backupserver
was written for and that's why you should use it.  (Plus, if you dd
a raw disk, you'll have to back up the entire disk whereas a Sybase
dump database only backs up the actual data pages filled...much more
efficient).

- Solaris KAIO: Back in Solaris 2.4, KAIO (Kernelized async I/O) was
not a standard part of the operating system and required a separate
patch (at least jumbo patch 101945-34 and Sun patch 102020-05).  Sybase
was supporting 4.9.2 and 10.0.x at that time and had certified on
Solaris both with and without KAIO.
Once we hit Solaris 2.5, KAIO was a standard part of the O/S.  KAIO
also allowed the relaxing of the Sybase recommended (n-1) engines on
a multi-cpu box (because KAIO kernels handled I/O more efficiently).
Sybase 10 and below servers had to raise three kernel parameters
(cnblkio, cnblkmax, cnmaxaio_server) from defaults of 200 to 500
to account for the aggressive I/O.  Failing to tune these three
parameters would lead to many "dstartio" kernel messages.

----------------
SGI Issues
- SGI's file systems can be configured for asynch I/O.  They use a concept
called "directed I/O" to go directly to the user space.  

----------------
AIX Issues

- I worked in an AIX shop once that used file systems primarily, and
I believe IBM AIX also has an asynchronous file system capability.
(But it's NOT JFS.  JFS is a very fast file system, but still has
write buffering data corruption possibilities).
- AIX disks are known to have performance differences depending on
how close to the center of the spinning disk the segments are that
you create the devices upon.  Thus, you always want to put your
most actively used devices on partitions closest to the center of
the disk.  In fact, we often *saved* the inside segments for
future devices.
- You must set vstart above 0 to avoid having raw disks blow out the
logical volume information (see above).  It is suggested to use a multiple
of 8MB when creating devices (to avoid an extent overhang) and to start
specifically at 512K so as to avoid having a future "alter database"
command over extend by 1/2 MB.

----------------------
HP Issues:
- Terminology: block device == file system or cooked device.  A character
device is a raw partition device

- With Sybase 10.x over HP/UX 10.x the sysadmin
had to physically enable Async I/O behavior before Sybase could use it.
You had to follow these steps (as root on the Unix box):
1.  /etc/mknod /dev/async c 101 5
2.  chmod 0660 /dev/async
3.  chown sybase /dev/async
4.  chgrp  /dev/async
OR using SAM:
1. select  Drivers
   -async  I/O
   make state as in from pending
and then do steps 2-4 above.

This is documented in the Sybase setup manual for HP specific servers.

- In old versions of Sybase over HP (10.x over HP/UX 9.x), you had to
configure for async I/O backwards from normal.  You had to actually
configure the devices as write buffered block devices to get async I/O
(in fact if you used raw devices you could only get standard I/O).
This was fixed in Sybase 11.x over HP/UX 10.x.  Now Sybase is trying
to phase out the use of block/file system devices altogether over HP.

---
Windows NT considerations:
- Windows NT installations CAN use raw partitions; it's definitely rare
though to have a raw disk installed in a PC.  If the disk is dedicated,
have the NT admin reformat/unformat.  It's a bit odd to access raw disks
in NT though, for users familiar with the Unix way of accessing them.
- NT installs are limited in the number of raw partitions (since they're
drive-letter based; 25 as a max, usually less b/c of reserved drive
letters A and B for floppies).

- However, an engineer with Sybase reports that Sybase tests of NT
raw versus file-system based devices showed a negligible performance
difference (3% better in 11.5, 8% in 11.9).  Also,
unlike Unix filesystems, NT does NOT buffer data writes to its file
systems, and thus there is no added risk to data recovery.  It is
recommended to use file system-based devices (NTFS) in NT installations.

---
Linux Considerations:
- The 11.0.3 version released initially by Sybase for Linux did NOT
support either raw disks or async I/O (the latter a result of the former).
With ESD#6 however, released mid July, 2000, KAIO was included for
Redhat kernels 2.2.14 and above.  Surf to www.sybase.com/linux/ase.
(Note: one of the links seems to be broken; if you see a link to
tsam-web1.sybase.com try using wim2.sybase.com).

---
A compelling argument AGAINST putting the master device on raw partitions:
Posted to Sybase-L on 8/13/98 by James McAllister 

"We configure our master devices as plain UNIX files (along with tempdb and
sybsystemprocs devices). Our rationale is that disaster recovery is
significantly improved. For example, assume the master device is
inoperable. Recovering the master device from tape takes many steps.

1. Run "buildmaster", remembering to get the exact size correct.
2. Start the server in single-user mode.
3. Fix sysservers, so you can talk to the backup server.
4. Load the database from tape.
5. Reload any other database on the master device.

Sound simple? Things get REAL complicated if you have ever "altered" the
master database or any other database on the master device, created user
databases on the master device, dropped tempdb segments from the master
device, or have more than 10 devices.

Any or all of these factors may mean you have to perform manual surgery on
sysusages.

Compare that to restoring a UNIX file master device.

1. Pop the tape in.
2. ufsrestore -i
3. Start the SQL server.

Because you have to have a functional, generic master database before you
can restore your actual master database, restoring master from tape is
entirely different than restoring other databases. For that reason, we use
ufsdump for the master device and Sybase dumps for user databases.

Another reason for doing this is upgrades. Since they mainly affect
databases on the master and sysprocs devices, you can "cp" the files before
the upgrade and have an easy fallback if the upgrade is unsuccessful.

In other words,

ufsdump
   - only backs up plain UNIX files
   - database must be quiescent
   - simplifies restoring master

Sybase dump
   - backs up an entire database
   - database can be active      
   - simplifies restoring user databases

Note that plain files tend to be more vulnerable to corruption upon system
crash, because OS-buffered writes are lost. So in other words, you are
trading an enormously improved recovery task for a slightly increased
chance you will have to use it. My experience is that it is a good
trade-off."