T
Terje Mathisen
- Jan 1, 1970
- 0
Yes, since division is much slower, if you can turn your number into a
fraction, and then multiply by 10 repeatedly, with a guarantee that the
rounding error in the initial "turning the number into a fraction"
operation is upwards only, but less than one place in the last bit, it
should be a valid and faster algorithm.
I (re-)invented & proved division by reciprocal multiplication many
years ago, and wrote code that did a complete verification: I tested
every possible 32-bit input divided by every possible 32-bit divisor,
not by checking all 2^64 combinations, but by verifying the correct
answer around every single transition, i.e. dividends just above, at and
just below an integer multiple of the divisor.
When I was thinking of an algorithm that went for speed at all costs, I
was thinking of something that multiplied by 100 repeatedly - and used
one byte-wide decimal add operation (the x86 architecture having such a
thing) for the last step of the conversion.
Please write some code & time it! Yes, there is a set of decimal
operations in x86, and most of them are speed-equivalent to doing a full
DIV. :-(
Basically, one has a seven-bit number from 0 to 99; the last three bits
need no conversion, they have a binary value from 0 to 7, which is the
same in packed BCD; the first four bits would be masked, and used to
index into a table...
x'00', x'08', x'16', x'24', x'32', x'40', x'48', x'56', x'64', x'72',
x'80', x'88', x'96'
which is only one *small* table.Perhaps this is too cumbersome to be
faster than a single multiply instruction, however.
"Perhaps"?
Please read & understand the code I posted: I don't use any MULs at all
for the successive factors of 10.
What I do is to split the initial number into two halves, i.e. modulo
100000, then convert both halves simultaneously to ascii.
The conversions take place using "LEA reg,[reg+reg*4]" to multiply by 5
in a single cycle without multiplying, while getting the final factor of
2 by using masks that are one bit lower each time.
BTW, your table above would still require a BCD carry propagation, i.e.
to convert 53 (bin) to ascii you would get '48' + 5, then have to add
them together, getting '4D' (in hex), then either use a slow DAA (?)
opcode or add '06' hex and detect that you got an overflow into the next
nybble.
Terje