bfloat16 #545

ryao · 2024-12-19T20:36:33Z

The weights are natively bfloat16. Rather than converting them to float up front, you could keep them as bfloat16 in memory and convert between float and bfloat16 on the fly using a union type and a bitshift. Since forward() is memory bandwidth bound, halving the size of the weights should roughly double its performance. The only caveat is that you would need to handle subnormal numbers when converting from float to bfloat16.
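A minimal sketch of the union-and-bitshift conversion described above, in C. The names `bf16_t`, `f32_bits`, `bf16_to_float`, `float_to_bf16` and `dot_bf16` are hypothetical and not taken from the project; the float to bfloat16 direction here simply truncates the low 16 bits rather than rounding to nearest.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not from the project's source. */
typedef uint16_t bf16_t;

typedef union {
    float    f;
    uint32_t u;
} f32_bits;

/* bfloat16 holds the upper 16 bits of an IEEE-754 binary32 value,
 * so widening is a 16-bit left shift. */
static inline float bf16_to_float(bf16_t h) {
    f32_bits v;
    v.u = (uint32_t)h << 16;
    return v.f;
}

/* Narrowing by truncation: drops the low 16 mantissa bits
 * (rounds toward zero, no special-case handling). */
static inline bf16_t float_to_bf16(float x) {
    f32_bits v;
    v.f = x;
    return (bf16_t)(v.u >> 16);
}

/* Dot product that reads bfloat16 weights and widens each one on the fly,
 * so only 2 bytes per weight are fetched from memory. */
static inline float dot_bf16(const bf16_t *w, const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += bf16_to_float(w[i]) * x[i];
    }
    return acc;
}
```

Something like `dot_bf16` could stand in for the inner loop of a matrix-vector product, assuming the weights are stored as raw 16-bit values. This sketch passes subnormal values through unchanged; it does not address the subnormal handling mentioned above.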

There are two ways of handling subnormals:

1. Check for subnormal numbers via issubnormal() and zero them when converting from float to BF16.
2. Set bit 15 of the MXCSR (the flush-to-zero flag) on amd64 CPUs (non-portable).

Presumably, both could be selected via a C preprocessor check: the issubnormal() check would be used on non-amd64 processors, while bit 15 of the MXCSR would be set on amd64 processors.
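A rough sketch of how the preprocessor switch might look, under a few assumptions. The names `BF16_USE_MXCSR`, `bf16_init` and `float_to_bf16_ftz` are hypothetical. `fpclassify(x) == FP_SUBNORMAL` is used in place of issubnormal() because the latter is only available from C23 onward. The MXCSR flag is set through the `_MM_SET_FLUSH_ZERO_MODE` macro from `<xmmintrin.h>`. Note that flush-to-zero only affects the results of SSE arithmetic, so on the amd64 path a subnormal loaded straight from memory would still pass through the shift unchanged.

```c
#include <math.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define BF16_USE_MXCSR 1   /* amd64: rely on the MXCSR flush-to-zero flag */
#else
#define BF16_USE_MXCSR 0   /* elsewhere: explicit subnormal check */
#endif

typedef uint16_t bf16_t;   /* hypothetical name */

/* Call once in every thread that does the float math:
 * MXCSR is per-thread state. A no-op on non-amd64 targets. */
static inline void bf16_init(void) {
#if BF16_USE_MXCSR
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);  /* sets MXCSR bit 15 */
#endif
}

/* Truncating float -> bfloat16 conversion that zeroes subnormals. */
static inline bf16_t float_to_bf16_ftz(float x) {
    union { float f; uint32_t u; } v;
    v.f = x;
#if !BF16_USE_MXCSR
    if (fpclassify(x) == FP_SUBNORMAL) {
        v.u &= 0x80000000u;  /* keep the sign bit, clear everything else */
    }
#endif
    return (bf16_t)(v.u >> 16);
}
```

The design choice here is that the amd64 path pays no per-element branch, at the cost of changing floating-point behaviour for the whole thread, while the portable path leaves the floating-point environment alone and checks each value.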
