8.0 Data Types
Prior to the arrival of MASM, most assemblers provided very little capability
for declaring and allocating complex data types. Generally, you could allocate
bytes, words, and other primitive machine structures. You could also set aside a
block of bytes. As high-level languages improved their ability to declare and
use abstract data types, assembly language fell farther and farther behind. Then
MASM came along and changed all that[24].
Unfortunately, many longtime assembly language programmers haven't bothered to
learn the new MASM syntax for things like arrays, structures, and other data
types. Likewise, many new assembly language programmers don't bother learning
and using these data typing facilities because they're already overwhelmed by
assembly language and want to minimize the number of things they've got to
learn. This is really a shame because MASM data typing is one of the biggest
improvements to assembly language since mnemonics replaced binary opcodes for
machine-level programming.
Note that MASM is a "high-level" assembler. It does things assemblers for
other chips won't do, like checking the types of operands and reporting errors if
there are mismatches. Some people who are used to assemblers on other machines
find this annoying. However, it's a great idea in assembly language for the same
reason it's a great idea in HLLs[25]. These features
have one other beneficial side effect: they help others understand what you're
trying to do in your programs. It should come as no surprise, then, that this
style guide will encourage the use of these features in your assembly language
programs.
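As a rough illustration of this operand checking, the sketch below shows one
size mismatch that MASM rejects next to the forms it accepts. The names Count,
IsDone, dseg, cseg, and TypeDemo are invented for this example, and the code
assumes DS already points at dseg when TypeDemo runs.

```asm
dseg    segment para public 'data'
Count   word    0               ;16-bit variable.
IsDone  byte    0               ;8-bit variable.
dseg    ends

cseg    segment para public 'code'
        assume  cs:cseg, ds:dseg

TypeDemo proc   near
;       mov     al, Count       ;Rejected: AL is 8 bits, Count is a word.
        mov     ax, Count       ;Accepted: both operands are 16 bits.
        mov     al, byte ptr Count ;Accepted: explicit override, L.O. byte.
        mov     IsDone, al      ;Accepted: both operands are 8 bits.
        ret
TypeDemo endp

cseg    ends
        end
```

Removing the semicolon from the first instruction should make the assembler
report an operand mismatch rather than silently loading half of Count. The
"byte ptr" form is the override to use when the mismatch is intentional.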
8.1 Defining New Data Types with TYPEDEF
MASM provides a small number of primitive data types. For typical
applications, bytes, sbytes, words, swords, dwords, sdwords, and various
floating point formats are the most commonly used scalar data types available.
You may construct more abstract data types by using these built-in types. For
example, if you want a character, you'd normally declare a byte variable. If you
wanted a 16-bit integer, you'd typically use the sword (or word) declaration. Of
course, when you encounter a variable declaration like "answer byte ?" it's a
little difficult to figure out what the real type is. Do we have a character, a
boolean, a small integer, or something else here? Ultimately it doesn't matter
to the machine; a byte is a byte is a byte. Its interpretation as a character,
boolean, or integer value is defined by the machine instructions that operate on
it, not by the way you define it. Nevertheless, this distinction is important to
someone who is reading your program (perhaps they are verifying that you've
supplied the correct instruction sequence for a given data object). MASM's
typedef directive can help make this distinction clear.
In its simplest form, the typedef directive behaves like a textequ. It lets
you replace one string in your program with another. For example, you can create
the following definitions with MASM:
char typedef byte
integer typedef sword
boolean typedef byte
float typedef real4
IntPtr typedef far ptr integer
Once you have declared these names, you can define char, integer, boolean,
and float variables as follows:
MyChar char ?
I integer ?
Ptr2I IntPtr I
IsPresent boolean ?
ProfitsThisYear float ?
- Rule:
- Use the existing MASM data types as type building blocks. For most data
types you create in your program, you should declare explicit type names using
the typedef directive. There is really no excuse for using the built-in
primitive types[26].
8.2 Creating Array Types
MASM provides an interesting facility for reserving blocks of storage - the
DUP operator. This operator is unusual (among assembly languages) because its
definition is recursive. The basic definition is (using HyGram notation):
DupOperator = expression ws* 'DUP' ws* '(' ws* operand ws* ')' %%
Note that "expression" expands to a valid numeric value (or numeric
expression), "ws*" means "zero or more whitespace characters" and "operand"
expands to anything that is legal in the operand field of a MASM word/dw,
byte/db, etc., directive[27]. One would typically use
this operator to reserve a block of memory locations as follows:
ArrayName integer 16 dup (?) ;Declare array of 16 words.
This declaration would set aside 16 contiguous words in memory.
The interesting thing about the DUP operator is that any legal operand field
for a directive like byte or word may appear inside the parentheses, including
additional DUP expressions. The DUP operator simply says "duplicate this object
the specified number of times." For example, "16 dup (1,2)" says "give me 16
copies of the value pair one and two." If this operand appeared in the operand
field of a byte directive, it would reserve 32 bytes, containing the alternating
values one and two.
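A minimal sketch of that byte directive follows. The names Pairs, PairsSize,
and dseg are made up for illustration; the "$ - Pairs" expression simply asks
the assembler how many bytes the declaration reserved.

```asm
dseg    segment para public 'data'
Pairs   byte    16 dup (1,2)    ;Bytes 1,2,1,2,... (16 copies of the pair).
PairsSize =     $ - Pairs       ;Assembly-time byte count: 16 * 2 = 32.
dseg    ends
        end
```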
So what happens if we apply this technique recursively? Well, "4 dup ( 3 dup
(0))" when read recursively says "give me four copies of whatever is inside the
(outermost) parentheses." This turns out to be the expression "3 dup (0)" that
says "give me three zeros." Since the original operand says to give four copies
of three copies of a zero, the end result is that this expression produces 12
zeros. Now consider the following two declarations:
Array1 integer 4 dup ( 3 dup (0))
Array2 integer 12 dup (0)
Both definitions set aside 12 integers in memory (initializing each to zero).
To the assembler these are nearly identical; to the 80x86 they are absolutely
identical. To the reader, however, they are obviously different. Were you to
declare two identical one-dimensional arrays of integers, using two different
declarations would make your program inconsistent and, therefore, harder to read.
However, we can exploit this difference to declare multidimensional arrays.
The first example above suggests that we have four copies of an array containing
three integers each. This corresponds to the popular row-major array access
function. The second example above suggests that we have a one-dimensional
array containing 12 integers.
- Guideline:
- Take advantage of the recursive nature of the DUP operator to declare
multidimensional arrays in your programs.
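To show how a nested DUP declaration lines up with row-major access, here is
one possible sketch. The names Matrix, NumRows, NumCols, Row, Col, and
GetElement are hypothetical, the code assumes DS already points at dseg, and
the final shift assumes two-byte integers.

```asm
integer typedef sword

NumRows =       4
NumCols =       3

dseg    segment para public 'data'
Matrix  integer NumRows dup (NumCols dup (0)) ;Four rows of three integers.
Row     integer 2
Col     integer 1
dseg    ends

cseg    segment para public 'code'
        assume  cs:cseg, ds:dseg

; GetElement-   Returns Matrix[Row][Col] in AX. Modifies BX and DX.
;               Row-major offset = (Row*NumCols + Col) * 2 bytes.

GetElement proc near
        mov     ax, Row
        mov     bx, NumCols
        imul    bx              ;DX:AX = Row * NumCols.
        add     ax, Col         ;AX = Row*NumCols + Col.
        shl     ax, 1           ;Scale by two bytes per integer.
        mov     bx, ax
        mov     ax, Matrix[bx]  ;Fetch the selected element.
        ret
GetElement endp

cseg    ends
        end
```

Because the declaration spells out both dimensions, changing NumRows or
NumCols adjusts the storage and the index computation together.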
8.3 Declaring Structures in Assembly Language
MASM provides an excellent facility for declaring and using structures,
unions, and records[28]; for some reason, many
assembly language programmers ignore them and manually compute offsets to fields
within structures in their code. Not only does this produce hard-to-read code,
the result is nearly unmaintainable as well.
- Rule:
- When a structure data type is appropriate in an assembly language program,
declare the corresponding structure in the program and use it. Do not compute
the offsets to fields in the structure manually; use the standard structure
"dot-notation" to access fields of the structure.
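The contrast this rule draws might look like the following sketch. The names
Point, x, y, Origin, and SetY are invented for this example, and the code
assumes DS already points at dseg.

```asm
integer typedef sword

Point   struct
x       integer ?
y       integer ?
Point   ends

dseg    segment para public 'data'
Origin  Point   {0, 0}
dseg    ends

cseg    segment para public 'code'
        assume  cs:cseg, ds:dseg

; SetY-         Stores AX into the y field of Origin.

SetY    proc    near
;       mov     word ptr Origin+2, ax ;Manual offset: breaks if fields move.
        mov     Origin.y, ax    ;Dot-notation: MASM computes the offset.
        ret
SetY    endp

cseg    ends
        end
```

If someone later inserts a field ahead of y, the dot-notation version still
assembles to the right address, while the commented-out manual offset would
quietly store into the wrong field.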
One problem with using structures occurs when you access structure fields
indirectly (i.e., through a pointer). Indirect access always occurs through a
register (for near pointers) or a segment/register pair (for far pointers). Once
you load a pointer value into a register or register pair, the program doesn't
readily indicate what pointer you are using. This is especially true if you use
the indirect access several times in a section of code without reloading the
register(s). One solution is to use a textequ to create a special symbol that
expands as appropriate. Consider the following code:
s struct
a integer ?
b integer ?
s ends
.
.
.
r s {}
ptr2r dword r
.
.
.
les di, ptr2r
mov ax, es:[di].s.a ;No indication this is ptr2r!
.
.
.
mov es:[di].s.b, bx ;Really no indication!
Now consider the following:
s struct
a integer ?
b integer ?
s ends
sPtr typedef far ptr s
.
.
.
q s {}
r sPtr q
r@ textequ <es:[di].s>
.
.
.
les di, r
mov ax, r@.a ;Now it's clear this is using r
.
.
.
mov r@.b, bx ;Ditto.
Note that the "@" symbol is a legal identifier character to MASM, hence "r@"
is just another symbol. As a general rule you should avoid using symbols like
"@" in identifiers, but it serves a good purpose here - it indicates we've got
an indirect pointer. Of course, you must always make sure to load the pointer
into ES:DI when using the textequ above. If you use several different
segment/register pairs to access the data that "r" points at, this trick may not
make the code any more readable since you will need several text equates that all
mean the same thing.
8.4 Data Types and the UCR Standard Library
The UCR Standard Library for 80x86 Assembly Language Programmers (version 2.0
and later) provides a set of macros that let you declare arrays and pointers
using a C-like syntax. The following example demonstrates this capability:
var
integer i, j, array[10], array2[10][3], *ptr2Int
char *FirstName, LastName[32]
endvar
These declarations emit the following assembly code:
i integer ?
j integer ?
array integer 10 dup (?)
array2 integer 10 dup ( 3 dup (?))
ptr2Int dword ?
FirstName dword ?
LastName char 32 dup (?)
For those comfortable with C/C++ (and other HLLs) the UCR Standard Library
declarations should look very familiar. For that reason, their use is a good
idea when writing assembly code that uses the UCR Standard Library.
[1] Someone who uses TASM all the time may think this is
fine, but consider those individuals who don't. They're not familiar with TASM's
funny syntax so they may find several statements in this program to be
confusing.
[2] Simplified segment directives do make it easier to
write assembly language programs that interface with HLLs. However, they only
complicate matters in stand-alone assembly language programs.
[3] A lot of old-time programmers believe that assembly
instructions should appear in upper case. A lot of this has to do with the fact
that old IBM mainframes and certain personal computers like the original Apple
II only supported upper case characters.
[4] Note, by the way, that I am not suggesting that this
error checking/handling code should be absent from the program. I am only
suggesting that it not interrupt the normal flow of the program while reading
the code.
[5] Doing so (inserting an 80x86 tutorial into your
comments) would wind up making the program less readable to those who already
know assembly language since, at the very least, they'd have to skip over this
material; at the worst they'd have to read it (wasting their time).
[6] Or whatever other natural language is in use at the
site(s) where you develop, maintain, and use the software.
[7] You may substitute the local language in your area if
it is not English.
[8] In fact, just the opposite is true. One should get
concerned if both implementations are identical. This would suggest poor
planning on the part of the program's author(s) since the same routine must now
be maintained in two different programs.
[9] Or whatever make program you normally use.
[10] This happens because shorter functions invariably
have stronger coupling, leading to integration errors.
[11] Technically, this is incorrect. In some very
special cases MASM will generate better machine code if you define your
variables before you use them in a program.
[12] Older assemblers on other machines have required
the labels to begin in column one, the mnemonic to appear in a specific column,
the operand to appear in a specific column, etc. These were examples of
fixed-format source line translators.
[13] See the next section concerning comments for more
information.
[14] This document will simply use the term comments
when referring to standalone comments.
[15] Since the label, mnemonic, and operand fields are
all optional, it is legal to have a comment on a line by itself.
[16] It could be worse; you should see what the
"superoptimizer" outputs for the signum function. It's even shorter and harder
to understand than this code.
[17] This is true regardless of what metric you use to
determine the "best" code sequence.
[18] Of course, if the program is a class assignment,
you may want to check your instructor's cheating policy before showing your work
to your classmates!
[19] The designer of the SNOBOL4 and Icon programming
languages.
[20] Note that this does not imply that it is hard to
write easy-to-read C programs, only that if one is sloppy, one can easily write
something that is nearly impossible to understand.
[21] Okay, this is a cheap shot. In fact, most of the
assembly code on this planet is poorly written.
[22] Actually, MASM 6.x does, but we'll ignore that fact
here.
[23] Sometimes, for performance reasons, the code
sequence above is justified since straight-line code executes faster than code
with jumps. If the program rarely executes the ELSE portion of an if statement,
always having to jump over it could be a waste of time. But if you're optimizing
for speed, you will often need to sacrifice readability.
[24] Okay, MASM wasn't the first, but such techniques
were not popularized until MASM appeared.
[25] Of course, MASM gives you the ability to override
this behavior when necessary. Therefore, the complaints from "old-hand" assembly
language programmers that this is insane are groundless.
[26] Okay, using some assembler that doesn't support
typedef would probably be a good excuse!
[27] For brevity, the productions for these objects do
not appear here.
[28] MASM records are equivalent to bit fields in C/C++.
They are not equivalent to records in Pascal.