Mixtral 8x7B going 378 tokens per second on CPU

PeterStuer · 2024-04-29T13:05:46

I think it's processing the tokenized prompt at 378 tokens per second, but generating new tokens at only 11 tokens per second.
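To make the difference concrete, here's a rough sketch of how the two rates combine into wall-clock time for a single request. Only the 378 and 11 tok/s figures come from the screenshot; the function names and the example token counts are made up for illustration, and it assumes both rates stay constant, which real runs won't quite do.

```python
# Rough sketch: how prompt-eval speed and generation speed combine.
# Assumption: both rates are constant over the whole request.

PROMPT_RATE = 378.0  # tok/s for prompt evaluation (from the screenshot)
GEN_RATE = 11.0      # tok/s for generated tokens (from the screenshot)


def total_seconds(prompt_tokens, output_tokens,
                  prompt_rate=PROMPT_RATE, gen_rate=GEN_RATE):
    """Wall-clock time: read the prompt, then emit the output tokens."""
    prompt_time = prompt_tokens / prompt_rate
    gen_time = output_tokens / gen_rate
    return prompt_time + gen_time


def generation_share(prompt_tokens, output_tokens,
                     prompt_rate=PROMPT_RATE, gen_rate=GEN_RATE):
    """Fraction of the total time spent generating output tokens."""
    gen_time = output_tokens / gen_rate
    return gen_time / total_seconds(prompt_tokens, output_tokens,
                                    prompt_rate, gen_rate)


if __name__ == "__main__":
    # Hypothetical request: 1000-token prompt, 500-token answer.
    t = total_seconds(1000, 500)
    share = generation_share(1000, 500)
    print(f"total: {t:.1f} s, generation share: {share:.0%}")
```

For that hypothetical request, the prompt takes under 3 seconds and the answer takes about 45, so the 11 tok/s number is the one you actually feel.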

dangoodmanUT · 2024-04-29T12:49:19

That's only for prompt eval. It's only spitting out 10 tok/s, if I am understanding that screenshot correctly.

kkielhofner · 2024-04-29T13:58:34

Very misleading, likely deliberately so. As usual, it’s working: here we are talking about it.

It’s evaluating the input prompt at that speed. That's an important metric and impressive on its own, but “going X tokens per second” in the mind of anyone looking at this means generated tokens.

That number: 11 tokens per second

378 on input is impressive, 11 on output is too.

The need to taint legitimate and real progress by stretching, misleading, etc. is yet another classic issue greatly exacerbated by social media.

d-z-m · 2024-04-29T13:55:03

Seems misleading to phrase it this way if you aren't talking about inference tok/s.

pella · 2024-04-29T18:22:18

with 8 Memory Channels + AMD Ryzen Threadripper PRO 7995WX

T-A · 2024-04-29T15:06:10

... with 96 cores. Is that a Threadripper?