ugly math

Terje Mathisen · Nov 27, 2006

Yes, since division is much slower, if you can turn your number into a
fraction, and then multiply by 10 repeatedly, with a guarantee that the
rounding error in the initial "turning the number into a fraction"
operation is upwards only, but less than one place in the last bit, it
should be a valid and faster algorithm.

I (re-)invented & proved division by reciprocal multiplication many
years ago, and wrote code that did a complete verification: I tested
every possible 32-bit input divided by every possible 32-bit divisor,
not by checking all 2^64 combinations, but by verifying the correct
answer around every single transition, i.e. dividends just above, at and
just below an integer multiple of the divisor.

When I was thinking of an algorithm that went for speed at all costs, I
was thinking of something that multiplied by 100 repeatedly - and used
one byte-wide decimal add operation (the x86 architecture having such a
thing) for the last step of the conversion.

Please write some code & time it! Yes, there is a set of decimal
operations in x86, and most of them are speed-equivalent to doing a full
DIV. :-(

Basically, one has a seven-bit number from 0 to 99; the last three bits
need no conversion, they have a binary value from 0 to 7, which is the
same in packed BCD; the first four bits would be masked, and used to
index into a table...

x'00', x'08', x'16', x'24', x'32', x'40', x'48', x'56', x'64', x'72',
x'80', x'88', x'96'

which is only one *small* table.Perhaps this is too cumbersome to be
faster than a single multiply instruction, however.

"Perhaps"?

Please read & understand the code I posted: I don't use any MULs at all
for the successive factors of 10.

What I do is to split the initial number into two halves, i.e. modulo
100000, then convert both halves simultaneously to ascii.

The conversions take place using "LEA reg,[reg+reg*4]" to multiply by 5
in a single cycle without multiplying, while getting the final factor of
2 by using masks that are one bit lower each time.

BTW, your table above would still require a BCD carry propagation, i.e.
to convert 53 (bin) to ascii you would get '48' + 5, then have to add
them together, getting '4D' (in hex), then either use a slow DAA (?)
opcode or add '06' hex and detect that you got an overflow into the next
nybble.

Terje

glen herrmannsfeldt · Nov 30, 2006

Terje Mathisen wrote:

(snip)

I (re-)invented & proved division by reciprocal multiplication many
years ago, and wrote code that did a complete verification: I tested
every possible 32-bit input divided by every possible 32-bit divisor,
not by checking all 2^64 combinations, but by verifying the correct
answer around every single transition, i.e. dividends just above, at and
just below an integer multiple of the divisor.

That is nice. I probably would have guessed that there would be
some cases where it rounded wrong, but that would have been a guess.
I would have guessed that it would take at least one more bit than the
dividend length to get it right.

Now, will it still work for 64 bits?

-- glen

Terje Mathisen · Nov 30, 2006

glen said:
Terje Mathisen wrote:

(snip)

That is nice. I probably would have guessed that there would be some
cases where it rounded wrong, but that would have been a guess. I
would have guessed that it would take at least one more bit than the
dividend length to get it right.

That's correct: You do need a 33-bit reciprocal, rounded up.

This means that for that half of divisors where the reciprocal's 33rd
bit is 1, you get to round it up to a 32-bit reciprocal (with trailing
zeroes), but for the other half you need some more work:

Either you must add half the input value once more to the low half of
the product, then propagate any carries, or you can take advantage of
the fact that the top reciprocal bit is always 1, so you can multiply by
the next 32 bits, then add the missing part afterwards. Both of these
suffer from the need to temporarily work with more than 32 bits:

First method
mov eax,[reciprocal_rounded_down] ; First 32 bits of reciprocal
mul ebx ; Value to be divided
shr ebx
add eax,ebx
adc edx,0
mov cl,[shift]
shr edx,cl ; EDX has exact result

Second method:
mov eax,[reciprocal_without_top_bit]
mul ebx
add edx,ebx ; Can overflow into 33 bits!
rcr edx,1
mov cl,[shift]
shr edx,cl ; EDX has exact result

Agner Fog & I corresponded on this algorithm for a while, he discovered
a much better approach to use in the case of too-long reciprocals:

Instead of faking a 33-bit reciprocal, he determined that it is OK to
adjust the input value instead, and use a rounded-down 32-bit reciprocal:

mov eax,[reciprocal_rounded_down]
add ebx,1 ; Value to be divided, adjusted up. (Can overflow!!!)
jc skip_mul
mul ebx
skip_mul:
mov cl,[shift]
shr edx,cl ; EDX has exact result

If the cost of that branch can be high, either because maximal inputs
are relatively common, or because your code is running out of branch
predictor entries, then you can make it branchless like this:

mov eax,[reciprocal_rounded_down]
inc ebx ; Value to be divided, adjusted up. (Can overflow!!!)
mul ebx
test ebx,ebx
cmovz edx,[reciprocal_rounded_down]
mov cl,[shift]
shr edx,cl ; EDX has exact result

The best choice for a semi-constant division would be an algorithm that
always works, right? It turns out that you can do this in various ways,
but the trailing 32-bit reciprocal + extra add is probably the easiest:

Works for any divisor, calculate a 33-bit accurate reciprocal, where the
33rd bit is rounded up, then remove the top bit:

mov eax,[reciprocal_without_top_bit]
mul ebx
add edx,ebx ; Can overflow into 33 bits!
rcr edx,1
mov cl,[shift]
shr edx,cl ; EDX has exact result

Now, will it still work for 64 bits?

Yes, as long as you follow the same rules!

In fact, on a 64-bit system you can use the same approach even in C as
long as you have some kind of compiler intrinsic to give you access to
the high half of a 64x64->128 bit multiplication. You can do it even if
you don't have a carry flag to temporarily give you an extra bit.

I.e. this works on Alpha and most others.

uint64_t reciprocal_without_top_bit;
unsigned shift;

void setup_divisor(uint64_t divisor);

inline
uint64_t semi_fixed_divide(uint64_t n)
{
uint64_t prod = MULH(n, reciprocal_without_top_bit);
prod += n; // Adjust for missing top bit
int carry = prod < n; // Overflow? 0 or 1
prod = (prod >> 1) | (carry << 63);
prod >>= shift;
return prod;
}

Terje

Moore's Lobby Podcast

Menu

Categories

Platforms

Content

Connect With Us

Network

ugly math

ugly math

Terje Mathisen

glen herrmannsfeldt

Terje Mathisen

Similar threads