Mixtral 8x22B (mistral.ai)
533 points by meetpateltech 14 days ago | 242 comments



First test I tried to run a random taxation question through it

Output: https://gist.github.com/IAmStoxe/7fb224225ff13b1902b6d172467...

Within the first paragraph, it outputs:

> GET AN ESSAY WRITTEN FOR YOU FROM AS LOW AS $13/PAGE

Thought that was hilarious.


The `mixtral:8x22b` tag still points to the text completion model – instruct is on the way, sorry!

Update: mixtral:8x22b now points to the instruct model:

  ollama pull mixtral:8x22b
  ollama run mixtral:8x22b


Wait. Isn't it a breaking change to change the underlying model like this? Wouldn't people start running into consistency issues in production? (given ollama appears to be oriented towards backend use)

Sure, in theory. But if you're moving so fast that you're already running the base 8x22B model from last week, you can easily fix this.

I've long thought that if you want reproducibility and reliability, you need to pin your deps.

So, IMO, the change is very much worth it to reduce confusion going forward.


That's not the model this post is about. You used the base model, not trained for tasks. (The instruct model is probably not on ollama yet.)


Yeah, this is exactly what happens when you ask a base model a question. It'll just attempt to continue what you already wrote based on its training set, so if you, say, have it continue a story you've written, it may wrap up the story and then ask you to subscribe for part 2, followed by a bunch of social media comments with reviews.


It can be fun, though, to prompt a text completion with something like "I'm thinking about" and just see what random thing it completes it with.

I absolutely did not:

ollama run mixtral:8x22b

EDIT: I like how you ninja-edited your comment ;)


Considering "mixtral:8x22b" on ollama was last updated yesterday, and Mixtral-8x22B-Instruct-v0.1 (the topic of this post) was released about 2 hours ago, they are not the same model.


Are we looking at the same page?

https://imgur.com/a/y6XfpBl

And even the direct tag page: https://ollama.com/library/mixtral:8x22b shows 40-something minutes ago: https://imgur.com/a/WNhv70B


Let me clarify.

Mixtral-8x22B-v0.1 was released a couple days ago. The "mixtral:8x22b" tag on ollama currently refers to it, so it's what you got when you did "ollama run mixtral:8x22b". It's a base model only capable of text completion, not any other tasks, which is why you got a terrible result when you gave it instructions.

Mixtral-8x22B-Instruct-v0.1 is an instruction-following model based on Mixtral-8x22B-v0.1. It was released two hours ago and it's what this post is about.

(The last updated 44 minutes ago refers to the entire "mixtral" collection.)


And where does it say that's the instruct model?


I get:

ollama run mixtral:8x22b

Error: exception create_tensor: tensor 'blk.0.ffn_gate.0.weight' not found


You need to update ollama to 0.1.32.


Thanks. That did it.


Not instruct tuned. You're (actually) "holding it wrong".


Looks like an issue with the quantization that ollama (i.e. llama.cpp) uses and not the model itself. It's common knowledge from Mixtral 8x7B that quantizing the MoE gates is pernicious to model perplexity. And yet they continue to do it. :)


No, it's unrelated to quantization, they just weren't using the instruct model.


Does anyone have a good layman's explanation of the "Mixture-of-Experts" concept? I think I understand the idea of having "sub-experts", but how do you decide what each specialization is during training? Or is that not how it works at all?


Ignore the "experts" part, it misleads a lot of people [0]. There is no explicit specialization in the most popular setups, it is achieved implicitly through training. In short: MoEs add multiple MLP sublayers and a routing mechanism after each attention sublayer and let the training procedure learn the MLP parameters and the routing parameters.

In a longer, but still rough, form...

How these transformers work is roughly:

``` x_{l+1} = mlp_l(attention_l(x_l)) ```

where `x_l` is the hidden representation at layer l, `attention_l` is the attention sublayer at layer l, and `mlp_l` is the MLP sublayer at layer l.

This MLP layer is very expensive because it is fully connected (i.e. every input has a weight to every output). So, instead of creating an even bigger, more expensive MLP to get more capability, MoEs create K MLP sublayers (the "experts") and a router that decides which of them to use. This router spits out an importance score for each MLP "expert"; you then choose the top T MLPs and take an average weighted by importance, so roughly:

``` x_{l+1} = \sum_e mlp_{l,e}(attention_l(x_l)) * importance_score_{l, e} ```

where `importance_score_{l, e}` is the score computed by the router at layer l for "expert" e, i.e. `importance_score_{l} = router_l(attention_l(x_l))` with `router_l` a small learned layer. Note that here we are summing over all experts, but in reality we choose only the top T, often 2, and use those.
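If a toy implementation helps, here is a rough PyTorch sketch of that routed-MLP sublayer (my own simplification, not Mixtral's actual code; the auxiliary load-balancing loss and the real dimensions are omitted):

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class MoEFeedForward(nn.Module):
      def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_t=2):
          super().__init__()
          # K "expert" MLPs plus a small router that scores them per token.
          self.experts = nn.ModuleList([
              nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                            nn.Linear(d_hidden, d_model))
              for _ in range(n_experts)])
          self.router = nn.Linear(d_model, n_experts)
          self.top_t = top_t

      def forward(self, x):                                  # x: (tokens, d_model)
          scores = F.softmax(self.router(x), dim=-1)         # importance per expert
          weights, chosen = scores.topk(self.top_t, dim=-1)  # keep only the top T
          weights = weights / weights.sum(-1, keepdim=True)  # renormalize over top T
          out = torch.zeros_like(x)
          for slot in range(self.top_t):
              for e, expert in enumerate(self.experts):
                  mask = chosen[:, slot] == e                # tokens routed to expert e
                  if mask.any():
                      out[mask] += weights[mask, slot, None] * expert(x[mask])
          return out

  # One MoE sublayer applied to 5 token representations:
  y = MoEFeedForward()(torch.randn(5, 64))

Only T of the K expert MLPs ever run for a given token, which is where the compute savings come from.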

[0] some architectures do, in fact, combine domain experts to make a greater whole, but not the currently popular flavor


So it is somewhat like a classic random forest or maybe bagging, where you're trying to stop overfitting, but you're also trying to train that top layer to know who could be the "experts" given the current inputs, so that you're minimising the number of MLP sublayers called during inference?


Yea, it's very much bagging + top layer (router) for the importance score!


Would this be a reasonable explanation?

> MLPs are universal function approximators, but these models are big enough that it is better to train many small functions rather than a single unified function. MoE is a mechanism to force different parts of the model to learn distinct functions.


It misses the crucial detail that every transformer layer chooses the experts independently from the others. Of course they still indirectly influence each other since each layer processes the output of the previous one.


This is a bit of a misnomer. Each expert is a sub-network that specializes in some kind of sub-understanding we can't readily characterize.

During training, the routing network is penalized if it does not distribute training tokens roughly evenly across the experts. This prevents any one or two experts from becoming the primary ones.

The result of this is that each token has essentially even probability of being routed to one of the sub models, with the underlying logic of why that model is an expert for that token being beyond our understanding or description.


I heard MoE reduces inference costs. Is that true? Don't all the sub networks need to be kept in RAM the whole time? Or is the idea that it only needs to run compute on a small part of the total network, so it runs faster? (So you complete more requests per minute on same hardware.)

Edit: Apparently each part of the network is on a separate device. Fascinating! That would also explain why the routing network is trained to choose equally between experts.

I imagine that may reduce quality somewhat though? By forcing it to distribute problems equally across all of them, whereas in reality you'd expect task type to conform to the pareto distribution.


>I heard MoE reduces inference costs

Computational costs, yes. You still take the same amount of time processing the prompt, but each token generated during inference costs less computationally than if you were running it through _all_ the experts.


It should increase quality since those layers can specialize on subsets of the training data. This means that getting better in one domain won't make the model worse in all the others anymore.

We can't really tell what the router does. There have been experiments where the router in the early blocks was compromised, and quality only suffered moderately. In later layers, as the embeddings pick up more semantic information, it matters more and might approach our naive understanding of the term "expert".


The latter. Yes, it all needs to stay in memory.


Why do we expect this to perform better? Couldn’t a regular network converge on this structure anyways?


It doesn't perform better and until recently, MoE models actually underperformed their dense counterparts. The real gain is sparsity. You have this huge x parameter model that is performing like an x parameter model but you don't have to use all those parameters at once every time so you save a lot on compute, both in training and inference.


It is a type of ensemble model. A regular network could do it, but a MoE will select a subset to do the task faster than the whole model would.


Here's my naive intuition: in general bigger models can store more knowledge but take longer to do inference. MoE provides a way to blend the advantages of having a bigger model (more storage) with the advantages of having smaller models at inference time (faster, less memory required). When you do inference, tokens hit a small layer that is load balancing the experts then activate 1 or 2 experts. So you're storing roughly 8 x 22B "worth" of knowledge without having to run a model that big.

Maybe a real expert can confirm if this is correct :)


Sounds like the "you only use 10% of your brain" myth, but actually real this time.


Almost :) the model chooses experts in every block. For a typical 7B with 8 experts there will be 8^32=2^96 paths through the whole model.


Not quite, you don't save memory, only compute.


A decent loose analogy might be database sharding.

Basically you're sharding the neural network by "something" that is itself tuned during the learning process.


Would it be analogous to say instead of having a single Von Neumann who is a polymath, we’re posing the question to a pool of people who are good at their own thing, and one of them gets picked to answer?


Not really. The “expert” term is a misnomer; it would be better put as “brain region”.

Human brains seem to do something similar, inasmuch as blood flow (and hence energy use) per region varies depending on the current problem.


Any idea why everyone seems to be using 8 experts? (Or was GPT-4 using 16?) Did we just try different numbers and found 8 was the optimum?


Probably because 8 GPUs is a common setup, and with 8 experts you can put each expert on a different GPU


Has anyone tried MoE at smaller scales? e.g. a 7B model that's made of a bunch of smaller ones? I guess that would be 8x1B.

Or would that make each expert too small to be useful? TinyLlama is 1B and it's almost useful! I guess 8x1B would be Mixture of TinyLLaMAs...


There is Qwen1.5-MoE-A2.7B, which was made by upcycling the weights of Qwen1.5-1.8B, splitting it and finetuning it.


Yes there are many fine tunes on huggingface. Search "8x1B huggingface"


The previous mixtral is 8x7B


Nobody decides. The network itself determines which expert(s) to activate based on the context. It uses a small neural network for the task.

It typically won't behave like human experts - you might find one of the networks is an expert in determining where to place capital letters or full stops for example.

MoEs do not really improve accuracy - instead, they reduce the amount of compute required. And, assuming you have a fixed compute budget, that in turn might mean you can make the model bigger to get better accuracy.


Not quite a layman's explanation, but if you're familiar with the implementation(s) of vanilla decoder only transformers, mixture-of-experts is just a small extension.

During inference, instead of a single MLP in each transformer layer, MoEs have `n` MLPs and a single layer "gate" in each transformer layer. In the forward pass, softmax of the gate's output is used to pick the top `k` (where k is < n) MLPs to use. The relevant code snippet in the HF transformers implementation is very readable IMO, and only about 40 lines.

https://github.com/huggingface/transformers/blob/main/src/tr...


It’s not “experts” in the typical sense of the word. There is no discrete training to learn a particular skill in one expert. It’s more closely modeled as a bunch of smaller models grafted together.

These models are actually a collection of weights for different parts of the system. It’s not “one” neural network. Transformers are composed of layers of transformations to the input, and each step can have its own set of weights. There was a recent video on the front page that had a good introduction to this. There is the MLP, there are the attention heads, etc.

With that in mind, a MoE model is basically where one of those layers has X different versions of the weights, and then an added layer (another neural network with its own weights) that picks the version of “expert” weights to use.


It's really a kind of enforced sparsity, in that it requires that only a limited number of blocks be active at a time during inference. Which blocks will be active for each token is decided by the network itself as part of training.

(Notably, MoE should not be conflated with ensemble techniques, which is where you would train entire separate networks, then use heuristic techniques to run inference across all of them simultaneously and combine the results.)


The simplest way to think about it is a form of dropout but instead of dropping weights, you drop an entire path of the network


As always, code is the best documentation: https://github.com/ggerganov/llama.cpp/blob/8dd1ec8b3ffbfa2d...


maybe there's one that is maitre d'llm?


There is some good documentation around mergekit available that actually explains a lot and might be a good place to start.


Correct, the experts are determined by Algo, not anything humans would understand.


"64K tokens context window" I do wish they had managed to extend it to at least 128K to match the capabilities of GPT-4 Turbo

Maybe this limit will become a joke when looking back? Can you imagine reaching a trillion tokens context window in the future, as Sam speculated on Lex's podcast?


How useful is such a large input window when most of the middle isn't really used? I'm thinking mostly about coding. But when putting even, say, 20k tokens into the input, a good chunk doesn't seem to be "remembered" or used for the output


While you're 100% correct, they are working on ways to make the middle useful, such as "Needle in a Haystack" testing. When we say we wish for context length that large, I think it's implied we mean functionally. But you do make a really great point.


maybe we'll look back at token context windows like we look back at how much ram we have in a system.


I agree with this in the sense that once you have enough, you stop caring about the metric.


And how much RAM do you need to run Mixtral 8*22B? Probably not enough on a personal laptop.


Generally about ~1gb ram per billion parameters. I've run a 30b model (vicuna) on my 32gb laptop (but it was slow).


I run it fine on my 64gb RAM beast.


At what quantization? 4-bit is 80GB. Less than 4-bit is rarely good enough at this point.


Is that normal RAM or GPU RAM?


64GB is not GPU RAM, but system RAM. Consumer GPUs have 24GB at most, those with good value/price have way less. Current generation workstation GPUs are unaffordable; used can be found on ebay for a reasonable price, but they are quite slow. DDR5 RAM might be a better investment.


While you need a lot more HBM (or unified memory if you're on a Mac) to run these LLM models, my overarching point is that at this point most systems don't have RAM constraints for most of the software you need to run, and as a result RAM becomes less of a selling point except in very specialized instances like graphic design or 3D rendering work.

If we have cheap billion token context windows, 99% of your use cases aren't going to hit anywhere close to that limit and as a result, your models will "just run"


I still don’t have enough RAM though ?


RAM is simply too useful.


Wasn't there a paper yesterday that turned context evaluation linear (instead of quadratic) and made effectively unlimited context windows possible? Between that and 1.58-bit quantization I feel like we're overdue for an LLM revolution.


So far, people have come up with many alternatives for quadratic attention. Only recently have they proven their potential.


tons and tons of papers, most of them had some disadvantages. Can't have the cake and eat it too:

https://arxiv.org/html/2404.08801v1 Meta Megalodon

https://arxiv.org/html/2404.07143v1 Google Infini-Attention

https://arxiv.org/html/2402.13753v1 LongRoPE

and a ton more


FWIW, the 128k context window for GPT-4 is only for input. I believe the output content is still only 4k.


How does that make any sense on a decoder-only architecture?


It's not about the model. The model can output more - it's about the API.

A better phrasing would be that they don't allow you to output more than 4k tokens per message.

Same with Anthropic and Claude, sadly.


you can always put the unfinished output as the input to continue forever until reaching the full 128k context window


Great to see such free-to-use and self-hostable models, but it's sad that "open" now means only that. One cannot replicate this model without access to the training data.


There's a large amount of liability in disclosing your training data.


I expect we'll see some AI companies in the future throwing away the training dataset. Maybe some have already.

During a court case, the other side can demand discovery over your training dataset, for example to see if it contains a particular copyrighted work.

But if you've already deleted the dataset, you're far more likely to win any case against you that hinges on what was in the dataset if the plaintiff can't even prove their work was included.

And you can argue that the dataset was very expensive to store (which is true), and therefore deleted shortly after training was complete. You have no obligation to keep something for the benefit of potential future plaintiffs you aren't even aware of yet.


Calling the model 'truly open' without that is not technically correct though.


It's open enough for all practical purposes IMO.


It's not "open enough" to do an honest evaluation of these systems by constructing adversarial benchmarks.


As open as an executable binary that you are allowed to download and use for free.


In the age of SaaS I’ll take it. It’s not like I have a few million dollars to pay for training even if I had all the code and data.


...And a massive pile of cash/compute hardware.


not that massive, we're talking six figures. There was a blog post about this a while back on the front page of HN.


6 figures are a massive pile of cash.


It's... really not, considering the audience here. Even less massive if 2-3 engineers get together to do it.


It is considering what you get for it, and it's not lower-end six figures, most likely seven. The JetMoE team released their training cost estimate: it took them $100k to train what's effectively a 2.2B model on 1.25T tokens. Compare that to the still-tiny Mistral 7B, which is 3x larger and was trained on 4x more data, and you get a figure more around $1.7M. These are the absolute smallest production-viable LLMs.

For something like Mixtral 8x22B with ~40B active params you'd be looking at the $10M range, and if something gets screwed up during training you can be left with a dud and nothing to show for it, like Llama-2-33B. It's like buying millions worth of lootboxes and hoping something good drops.


for finetuning or parameter training from scratch?



That's for an 8B model.


This is over trivializing it, but there isn't much more inherent complexity in training an 8B or larger model other than more money, more compute, more data, more time. Overall, the principles are similar.


Assuming cost grows linearly with the number of parameters, that's 7.5 figures instead of 6 for the 8x22B model.


What's the best way to run this on my Macbook Pro?

I've tried LMStudio, but I'm not a fan of the interface compared to OpenAI's. The lack of automatic regeneration every time I edit my input, like on ChatGPT, is quite frustrating. I also gave Ollama a shot, but using the CLI is less convenient.

Ideally, I'd like something that allows me to edit my settings quite granularly, similar to what I can do in OpenLM, with the QoL from the hosted online platforms, particularly the ease of editing my prompts that I use extensively.



Not sure why your comment was downvoted. ^ is absolutely the right answer.

Open WebUI is functionally identical to the ChatGPT interface. You can even use it with the OpenAI APIs to have your own pay-per-use GPT-4. I did this.


Hey can you guys elaborate how this works? I'm looking at the Ollama section in their docs and it talks about load balancing? I don't understand what that means in this context.


You probably want to look at the getting started guide

https://docs.openwebui.com/getting-started/

IIUC the load balancing page is for people who want to run openwebui at a larger scale

https://en.wikipedia.org/wiki/Load_balancing_(computing)


You can try Msty as well. I am the author.

https://msty.app


openrouter.ai is a fantastic idea if you don't want to self host


It ranks between Mistral Small and Mistral Medium on my NYT Connections benchmark and is indeed better than Command R Plus and Qwen 1.5 Chat 72B, which were the top two open weights models. Grok 1.0 is not an instruct model, so it cannot be compared fairly.


Can you share the details about the benchmark?

Uses an archive of 267 NYT Connections puzzles (try them yourself if unfamiliar). Three different 0-shot prompts, words in both lowercase and uppercase. One attempt per puzzle. Partial credit is awarded if not all lines are solved correctly. Top humans get near 100.

Most other benchmarks don't clearly show the difference between the top models and the rest. This may be because they are older and have been over-optimized or perhaps because they are just easier.


These LLMs are making RAM great again.

Wish I had invested in the extra 32GB for my mac laptop.


You can't upgrade it?

Edit: I haven't owned a laptop for years, probably could have surmised they'd be more user hostile nowadays.


Everything is soldered in these days.

It's complete garbage. And most of the other vendors just copy Apple so even things like Lenovo have the same problems.

The current state of laptops is such trash


Plenty of laptops still have SO-DIMM, such as EliteBook for example.

People need to vote with their wallet, and not buy stuff that goes against their principles.


There needs to be more fidelity than "vote with wallet". Let's say I decided to not purchase your product. Why?

The question remains unanswered. Perhaps I didn't see it for sale or Bob in accounting just got one and I didn't want to look like I was copying Bob.

Even at scale this doesn't work. Let's say Lenovo switches to making all of their laptops hot pink with bedazzled rhinestone butterflies and sales plummet. You could argue it was the wrong pink or that the butterflies didn't shimmer enough ... any hypothesis you wish.

The market provides an extremely low information poor signal that really doesn't suggest any course of action.

If we really want something better, there needs to be more fruitful and meaningful communication lines. I've come up with various ideas over the years but haven't really implemented them.


You misunderstand the signal. The signal is “you’re doing something wrong”. Companies have tremendous incentive to figure out what that is. They do huge amounts of market research and customer feedback.


With SO-DIMM you gain expandability at the cost of higher power draw and latency as well as lower throughput.

> SO-DIMM memory is inherently slower than soldered memory. Moreover, considering the fact that SO-DIMM has a maximum speed of 6,400MHz means that it won’t be able to handle the DDR6 standard, which is already in the works.

https://fossbytes.com/camm2-ram-standard/


There are so many variables though ... most of the time you have to compromise on a few things.


These days with Apple Silicon, the RAM is part of the SoC package. It's not even soldered to the board; it's mounted on the same package as the chip. Although TBF, they also offer insane memory bandwidths.


Yes it’s almost like we got some benefit from iterating on these aspects of hardware design, as opposed to the typical HN grump characterisation of unbridled evil whenever a laptop isn’t exactly like how they were in 2006.


I don't think it should be too hard to see why a community of hackers take issue with laptops becoming harder to repair and upgrade, considering they're typically the target demographic for such features. Especially when the result is trading almost all forms of autonomy over their devices for marginal increases in RAM speed...


> most of the other vendors just copy Apple

Weird conspiracy theories aside, the low power variant of RAM (LPDDR) has to be soldered onto the motherboard, so laptops designed for longer battery life have been using it for years now.

The good news is that a newer variant of low power RAM has just been standardized that features low power RAM in memory modules, although they attach with screws and not clips.

https://fossbytes.com/camm2-ram-standard/


> And most of the other vendors just copy Apple

I don't think it's so much "copying Apple" as much as it's learning that they can do things like solder in the RAM to cut costs and discovering that most people won't care and will still buy it.


> Everything is soldered in these days.

[...insert sound of old timers laughing at newbies who declared the T480/T490 overhyped for no reason because "no one cares about upgradability anymore"...]

:)


I really really like my Macbook Pro. But dammit, you can't upgrade the thing (Mac laptops aren't upgrade-able anymore). I got M1 Max in 2021 with 32GB of RAM. I did not anticipate needing more than 32GB for anything I'd be doing on it. Turns out, a couple of years later, I like to run local LLMs that max out my available memory.


I say 2021, but truth is the supply chain was so trash that year that it took almost a year to actually get delivered. I don't think I actually started using the thing until 2022.


I got downvoted for saying a true fact? That I ordered the new M1 Max in 2021 and it took almost a year for me to actually get it? It's true.


Maybe they think that you aren’t meaningfully contributing to the conversation, which seems very, very likely.


You are getting downvoted because you vaguely suggested something negative about an Apple product, as is my comment below


FWIW, I am downvoting this comment because you’re whinging, not because I have an issue with which companies you do or do not like.


So far in this thread, you've implied you're giving downvotes to comments barely even critical of Apple, and ironically you're whining about the common sentiment on HN being against repair-hostile consumerist products (as Apple typically produces)

Both of which seem to be exactly what GP took issue with, and was calling out.


People are absurd with their downvotes. I got downvoted for saying it took almost a year for my MacBook to arrive once I ordered it. It's true. But it's also true that supply chains were a wreck at the time. Apple wasn't the only tech gadget that took forever to arrive.


> mac laptop


My 32-GB RAM feels so inadequate now though!

It feels absolutely amazing to build an AI startup right now. It's as if your product automatically becomes cheaper, more reliable, and more scalable with each new major model release.

- We first struggled with limited context windows [solved]

- We had issues with consistent JSON output [solved]

- We had rate limiting and performance issues for the large 3rd party models [solved]

- Hosting our own OSS models for small and medium complex tasks was a pain [solved]

Obviously every startup still needs to build up defensibility and focus on differentiating with everything “non-AI”.


We are going to quickly reach the point where most of these AI startups (which do nothing but provide thin wrappers on top of public LLMs) aren't going to be needed at all. The differentiation will need to come from the value of the end product put in front of customers, not the AI backend.


The same happened to image recognition. We have great algorithms for many years now. You can't make a company out of having the best image recognition algorithm, but you absolutely can make a company out of a device that spots defects in the paintjob in a car factory, or that spots concrete cracks in the tunnel segments used by a tunnel boring machine, or by building a wildlife camera that counts wildlife and exports that to a central website. All of them just fine-tune existing algorithms, but the value delivered is vastly different.

Or you can continue selling shovels. Still lots of expensive labeling services out there, to stay in the image-recognition parallel


The key thing is AI models are services not products. The real world changes, so you have to change your model. Same goes for new training data (examples, yes/no labels, feedback from production use), updating biases (compliance, changing societal mores). And running models in a highly-available way is also expertise. Not every company wants to be in the ML-ops business.


The dynamic does seem to be different with the newer systems. Larger more general systems are better than small specialized models.

GPT-4 is SOTA at OCR and sentiment classification, for example.


Sure, in the same way SaaS companies are just thin wrappers on top of databases and the open web.


You will find that a disproportionately large amount of work and innovation in an AI product is in the backing model (GPT, Mixtral, etc.). While there's a huge amount of work in databases and the open web, SaaS products typically add a lot more than a thin API layer and a shiny website (well some do but you know what I mean)


I'd argue the comment before you is describing accessibility, features, and services -- yes, the core component has a wrapper, but that wrapper differentiates the use.


> It's as if your product automatically becomes cheaper, more reliable, and more scalable with each new major model release.

and so do your competitor's products.


Any business idea built almost exclusively on AI, without adding much value, is doomed from the start. AI is not good enough to make humans obsolete yet. But a well finetuned model can for sure augment what individual humans can do.


The progress is insane. A few days ago I started being very impressed with LLM coding skills. I wanted Golang code, instead of Python, which you can see in many demos. The prompt was:

Write a Golang func which accepts the path to a .gpx file and outputs a JSON string with points (x = total distance in km, y = elevation). Don't use any library.


If you don't mind, I'm trying to experiment w/ local models more. Just now getting into messing w/ these but I'm struggling to come up w/ good use cases.

Would you happen to know of any cool OSS model projects that might be good inspiration for a side project?

Wondering what most people use these local models for


I've got a couple I've done and it's been really enjoyable.

I think the real value in using local models is exposing them to personal/unique information that only you have, thus getting novel and unique outcomes that no public model could provide.

1. Project 1 - Self Knowledge

  - Download/extract all of my emails and populate them into a vector database, like Chroma[0]
  - For each prompt, do a search of the vector store and return N matches
  - Provide both the prompt and the search results to the LLM, instructing it to use the results as context or in the answer itself

2. Project 2 - Chat with a Friend

  - I exported the chat and text history between me and a good friend that passed away
  - I created a vector store of our chat history in chunks, each consisting of 6 back-and-forth interactions
  - When I "chat" with the LLM, a search is first conducted for matching chunks from the vector store, which are then used as "style" and knowledge context for the response. Optional: You can use SillyTavern[1] for a more "rich" chat experience

The above lets me chat, at least superficially, with my friend. It's nice for simple interactions and banter; I've found it to be a positive and reflective experience.

[0] https://www.trychroma.com/
[1] https://sillytavernai.com/
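If anyone wants a starting point for project 1, here's a rough, untested sketch assuming chromadb with its default embedding function and a local model served by ollama; the email texts and the question are toy stand-ins for your own exported data:

  import chromadb
  import requests

  # Toy stand-ins for your exported mail and the user's question.
  email_texts = ["Trip confirmation: your flight to Lisbon departs on June 3rd.",
                 "Invoice #42 from the plumber, payment due next week."]
  email_ids = ["mail-1", "mail-2"]
  user_prompt = "When am I flying to Lisbon?"

  # Index: chromadb embeds the documents with its default embedding function.
  client = chromadb.PersistentClient(path="./email_db")
  emails = client.get_or_create_collection("emails")
  emails.add(documents=email_texts, ids=email_ids)

  # Retrieve: pull the N chunks most similar to the prompt.
  hits = emails.query(query_texts=[user_prompt], n_results=2)
  context = "\n\n".join(hits["documents"][0])

  # Generate: hand the prompt plus retrieved context to a local model via ollama.
  answer = requests.post("http://localhost:11434/api/generate", json={
      "model": "mixtral",
      "prompt": f"Use this context:\n{context}\n\nQuestion: {user_prompt}",
      "stream": False,
  }).json()["response"]
  print(answer)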


thank you tons. this is a great start and I'm gonna use these for learning :)

u rock


Pleasure is all mine :D Have fun and please reach out if you get stuck!

No ideas about side projects or anything "productive" but for a concrete example look at SillyTavern. Making fictional characters. Finding narratives, stories, role-play for tabletop games. You can even have group chats of AI characters interacting. No good use cases for profit but plenty right now for exploration and fun.


I'm experimenting with using them to help me make a mod for a game (Vic3). It has a lot of config files and I'm using AI to help generate the data structures and parsers/serializers.

It's coming along but very hit or miss with the small models I'm using (all my poor 6750XT 12GB can manage).

I could be doing something wrong though; even with GPT-4 I'm struggling a bit - it's very lazy and doesn't want to fully populate the data structure fields (the game file objects can have dozens/hundreds of fields). I'm probably just not using the right incantation/magic phrase/voodoo?


One idea that I've been mulling over; Given how controllable linux is from the command line, I think it would be somewhat easy to set up a voice to text to a local LLM that could control pretty much everything on command.

It would flat out embarass alexa. Imagine 'Hal play a movie', or 'Hal play some music' and it's all running locally, with your content.


There are a few projects doing this. This one piqued my interest as having a potentially nice UX after some maturity. https://github.com/OpenInterpreter/01


...until one day it hallucinates or has existential crisis and does something funny with "dd" or "fdisk/sfdisk"... :)

> We had issues with consistent JSON output [solved]

It says the JSON output is constrained via their platform (on la Plateforme).

Does that mean JSON output is only available in the hosted version? Are there any small models that can be self hosted that output valid JSON.


> Are there any small models that can be self hosted that output valid JSON.

Yes, for example this one is optimized for function calling and JSON output: https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B
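And if you're serving it yourself, I believe ollama (and most llama.cpp-based servers) can also constrain decoding to valid JSON. A rough sketch, using the mixtral tag just as an example of any locally pulled model; you still describe the schema you want in the prompt:

  import json
  import requests

  prompt = ("Return a JSON object with keys 'city' and 'country' for: "
            "'Mistral AI is based in Paris, France.'")
  resp = requests.post("http://localhost:11434/api/generate", json={
      "model": "mixtral",   # any locally pulled model tag
      "prompt": prompt,
      "format": "json",     # constrains decoding to valid JSON
      "stream": False,
  })
  print(json.loads(resp.json()["response"]))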


> Does that mean JSON output is only available in the [self]-hosted version?

I would assume so. They probably constrain JSON output so that the JSON response doesn't bork the front-end/back-end of la Plateforme itself as it moves through their code back to you.


How are you approaching hosting? vLLM?


Dumb question: Are "non-instructed" versions of LLMs just raw, no-guardrail versions of the "instructed" versions that most end-users see? And why does Mixtral need one, when OpenAI LLMs do not?


LLM’s are first trained to predict the next most likely word (or token if you want to be accurate) from web crawls. These models are basically great at continuing unfinished text but can’t really be used for instructions e.g. Q&A or chatting - this is the “non-instructed” version. These models are then fine tuned for instructions using additional data from human interaction - these are the “instructed” versions which are what end users (e.g. ChatGPT, Gemini, etc.) see.


Very helpful, thank you.



I appreciate the correction, thanks!


Taking a moment to thank Mistral for coming through with the open release. In a just world Mistral and Meta would be the ones being decorated with the 'Open AI' medal. For now in AI, if it has 'Open' in the name, it isn't.

Reminds me of open source vs free software/GPL years ago, when the internet was taking off.

folks like ESR were weak with their definition of open source and it wasn't as good.

stallman said:

... the obvious meaning for the expression “open source software” is “You can look at the source code.” This is a much weaker criterion than free software ...

https://www.gnu.org/philosophy/free-software-for-freedom.htm...

https://www.gnu.org/philosophy/open-source-misses-the-point....


I'm really excited about this model. Just need someone to quantize it to ~3 bits so it'll run on a 64GB MacBook Pro. I've gotten a lot of use from the 8x7b model. Paired with llamafile and it's just so good.


Can you explain your use case? I tried to get into offline LLMs on my machine and even Android, but without discrete graphics it's a slow hog, so I didn't enjoy it. But suppose I buy one, what then?


Yes, I have a side project that uses local whisper.cpp to transcribe a podcast I love and shows a nice UI to search and filter the contents. I use Mixtral 8x7b in chat interface via llamafile primarily to help me write python and sqlite code and as a general Q&A agent. I ask it all sorts of technical questions, learn about common tools, libraries, and idioms in an ecosystem I'm not familiar with, and then I can go to official documentation and dig in.

It has been a huge force multiplier for me and most importantly of all, it removes the dread of not knowing where to start and the dread of sending your inner monologue to someone's stupid cloud.

If you're curious: https://github.com/noman-land/transcript.fish/ though this doesn't include any Mixtral stuff because I don't use it programmatically (yet). I soon hope to use it to answer questions about the episodes like who the special guest is and whatnot, which is something I do manually right now.


I run Mistral-7B on an old laptop. It's not very fast and it's not very good, but it's just good enough to be useful.

My use case is that I'm more productive working with a LLM but being online is a constant temptation and distraction.

Most of the time I'll reach for offline docs to verify. So the LLM just points me in the right direction.

I also miss Google offline, so I'm working on a search engine. I thought I could skip crawling by just downloading Common Crawl, but unfortunately it's enormous and mostly junk or unsuitable for my needs. So my next project is how to data-mine Common Crawl to extract just the interesting (to me) bits...

When I have a search engine and a LLM I'll be able to run my own Phind, which will be really cool.


Presumably you could run things like PageRank, I'm sure people do this sort of thing with CommonCrawl. There are lots of variants of graph connectivity scoring methods and classifiers. What a time to be alive eh?


> Can you explain your use case?

pretty sure you can run it un-censored... that would be my use case


Shopping for a new mbp. Do you think going with more ram would be wise?


Unfortunately, yes. Get as much as you can stomach paying for.


I'm considering switching my function calling requests from OpenAI's API to Mistral. Are they using similar formats? What's the easiest way to use Mistral? Is it by using Huggingface?


easiest is probably with ollama [0]. I think the ollama API is OpenAI compatible.

[0]https://ollama.com/
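For example, something like this (untested sketch) should work against ollama's OpenAI-compatible endpoint, so switching from OpenAI's API is mostly a base_url change; note that function-calling/tool support varies by server, so check that separately:

  from openai import OpenAI

  # The api_key is required by the client but ignored by a local ollama server.
  client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

  resp = client.chat.completions.create(
      model="mixtral:8x22b",
      messages=[{"role": "user", "content": "Say hello in French."}],
  )
  print(resp.choices[0].message.content)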


Most inference servers are OpenAI-compatibile. Even the "official" llama-cpp server should work fine: https://github.com/ggerganov/llama.cpp/blob/master/examples/...


Ollama runs locally. What's the best option for calling the new Mixtral model on someone else's server programmatically?


Openrouter lists several options: https://openrouter.ai/models/mistralai/mixtral-8x22b


The development never stops. In a few years we will look back and see how far the models have come: how we couldn't run LLaMA 70B on a MacBook Air, and now we can.


Yes it's pretty cool. There was a neat comparison of deep learning development that I think resonates quite well here.

Around 5 years ago, it took an average user some pretty significant hardware, software, and time (around a full night) to try to create a short deepfake. Now, you don't need any fancy hardware and you can get decent results within 5 minutes on your average computer.



Good to continue to see a permissive license here.


I can't even begin to describe how excited I am for the future of AI.


Curious to see how it performs against GPT-4.

Mixtral 8x22B beats Command R+, which is at GPT-4 level on the LMSYS leaderboard.


LMSYS leaderboard is just one benchmark (that I think is fundamentally flawed). GPT-4 is clearly better.


Which alterative benchmarks do you recommend?


I don't have a perfect solution, except for the obvious answer that the best we can do is a combination of multiple benchmarks. It's harder now than ever because you also want to test long contexts, and older benchmarks are over-optimized. I think it would be best to overweigh newer benchmarks and underweigh benchmarks on which objectively dumb models like Claude 3 Haiku score well. And I have a big problem with HumanEval, which is Python-only and has only 164 problems but is used as the catch-all for coding abilities.

Is this the best permissively licensed model out there?


Today. Might change tomorrow at the pace this sector is at.


So far it is Command R+. Let's see how this will fare on Chatbot Arena after a few weeks of use.


> So far it is Command R+

Most people would not consider Command R+ to count as the "best permissively licensed model" since CC-BY-NC is not usually considered "permissively licensed" – the "NC" part means "non-commercial use only"


My bad, I remembered wrongly it was Apache too.


I'm confused on the instruction fine-tuning part that is mentioned briefly, in passing. Is there an open weight instruct variant they've released? Or is that only on their platform? Edit: It's on HuggingFace, great, thanks replies!




How much VRAM is needed to run this?


80GB in 4bit.

But because it only activates two of the eight experts per token (~39B of the 141B parameters), it can run on a fast CPU in reasonable time. So 96GB of DDR4 will do; 96GB of DDR5 is better.
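Rough back-of-envelope, using the ~141B total / ~39B active parameter counts quoted elsewhere in the thread:

  141B params x 4 bits/param ~ 70.5 GB of weights  ->  ~80 GB with KV cache and overhead
   39B active params x 4 bits ~ 19.5 GB of weights actually read per generated token

So you need the full ~80 GB resident either way, but each token only touches roughly a quarter of it, which is why memory bandwidth rather than raw FLOPs tends to be the bottleneck on CPU.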


WizardLM-2 8x22b (which was a fine tune of the Mixtral 8x22b base model) at 4bit was only 80GB.


We rolled out Mixtral 8x22b to our LLM Litmus Test at s0.dev for Cody AI. Don't have enough data to say it's better or worse than other LLMs yet, but if you want to try it out for coding purposes, let me know your experience.


Seems that Perplexity Labs already offers a free demo of it.

https://labs.perplexity.ai/


That's the old/regular model. This post is about the new "instruct" model.


The instruct version is now on labs.

Isn't equating active parameters with cost a little unfair since you still need full memory for all the inactive parameters?


Well, since it affects inference speed it means you can handle more in less time, needing less concurrency.


Fewer parameters at inference time makes a massive difference in cost for batch jobs, assuming vram usage is the same


Is this release a pleasant surprise? Mistral weakened their commitment to open source when they partnered with Microsoft.

It's nice they're using some of the money from their commercial and proprietary models, to improve the state of the art for open source (open weights) models.


Mistral just released the most powerful open weight model in the history of humanity.

How did they weaken their commitment to open weights?


> Mistral just released the most powerful open weight model in the history of humanity.

Well, yeah, it's very welcome, but 'history of humanity' is hyperbole given ChatGPT isn't even two years old.

> How did they weaken their commitment to open weights?

Before https://web.archive.org/web/20240225001133/https://mistral.a... versus after https://web.archive.org/web/20240227025408/https://mistral.a... the Microsoft partnership announcement:

> Committing to open models.

to

> That is why we started our journey by releasing the world’s most capable open-weights models

There were similar changes on their about the Company page.


Pricing?

Found it: https://mistral.ai/technology/#pricing

It'd be useful to add a link to the blog post. While it's an open model, most will only be able to use it via the API.


That looks expensive compared to what groq was offering: https://wow.groq.com/


Can't wait for 8x22B to make it to Groq! Having an LLM at near GPT-4 performance with Groq speed would be incredible, especially for real-time voice chat.


I also assume groq is 10-15x faster


It's open source, you can just download and run it for free on your own hardware.


Well, I don't have hardware to run a 141B parameters model, even if only 39B are active during inference.


It will be quantized in a matter of days and runnable on most laptops.


8 bit is 149G. 4 bit is 80G.

I wouldn’t call this runnable on most laptops.


"Who among us doesn't have 8 H100 cards?"


Four V100s will do. They're about $1k each on ebay.


$1500 each, plus the server they go in, plus plus plus plus.


Sure, but it's still a lot less than 8 h100s.

~$8k for an LLM server with 128GB of VRAM vs like $250k+ for 8 H100s.


$8K sounds like a lot more than "free".

If you are looking to play with the model without installing it locally, we've added it to our playground at https://trypromptly.com/playground.

Page not found

Sorry missed this. It was hidden behind login before. It should now be reachable.

I have been using Mixtral daily since it was released for all kinds of writing and coding tasks. Love it and massively invested in Mistral's mission.

Keep on doing this great work.

Edit: been using the previous version, seems like this one is even better?


It wasn't clear but how much hardware does it take to run Mixtral 8x22B (mistral.ai) next to me locally?


A MacBook with 64GB of RAM


At what quantization?


We need larger context windows, otherwise we’re running the same path with marginally different results.


Did anyone have success getting danswer and ollama to work together?


labs.perplexity.ai now has mixtral-8x22b-instruct.

I asked it what its knowledge cutoff was, and it said 2021-09.

Anyone know why it's trained on such old data?


Is 8x22B gonna make it to Le Chat in the near future?


How does this compare to ChatGPT4?


is this different than their "large" model


I just find it hilarious how approximately 100% of models beat all other models on benchmarks.


Benchmarks published by the company itself should be treated no differently than advertising. For actual signal check out more independent leaderboards and benchmarks (like HuggingFace, Chatbot Arena, MMLU, AlpacaEval). Of course, even then it is impossible to come up with an objective ranking since there is no consensus on what to even measure.


Just because of the pace of innovation and scaling, right now, it seems pretty natural that any new model is going to be better than the previous comparable models.


Benchmarks are often weird because of what a benchmark inherently needs to be.

If you compare LLMs by asking them to tell you how to catch dragonflies - the free text chat answer you get will be impossible to objectively evaluate.

Whereas if you propose four ways to catch dragonflies and ask each model to choose option A, B, C or D (or check the relative probability the model assigns to those four output logits) the result is easy to objectively evaluate - you just check if it chose the one right answer.

Hence a lot of the most famous benchmarks are multiple-choice questions - even though 99.9% of LLM usage doesn't involve answering multiple-choice questions.
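For the logit-probability variant, the scoring loop is only a few lines. A rough sketch with a small stand-in model (gpt2 here purely because it's tiny to download; the answer options are made-up placeholders):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")            # tiny stand-in model
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  prompt = ("Question: How do you catch dragonflies?\n"
            "A) with a net\nB) with a hammer\nC) with glue\nD) by shouting\n"
            "Answer:")
  ids = tok(prompt, return_tensors="pt").input_ids
  with torch.no_grad():
      logits = model(ids).logits[0, -1]                  # next-token logits

  # Compare the scores the model assigns to each option letter.
  scores = {opt: logits[tok.encode(" " + opt)[0]].item() for opt in "ABCD"}
  print(max(scores, key=scores.get))                     # the model's "choice"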


gotta cherry pick your benchmarks as much as possible


What do you mean?


Virtually every announcement of a new model release has some sort of table or graph matching it up against a bunch of other models on various benchmarks, and they're always selected in such a way that the newly-released model dominates along several axes.

It turns interpreting the results into an exercise in detecting which models and benchmarks were omitted.


It would make sense, wouldn't it? Just as we've seen rising fuel efficiency, safety, dependability, etc. over the lifecycle of a particular car model.

The different teams are learning from each other and pushing boundaries; there's virtually no reason for any of the teams to release a model or product that is somehow inferior to a prior one (unless it had some secondary attribute such as requiring lower end hardware).

We're simply not seeing the ones that came up short; we don't even see the ones where it fell short of current benchmarks because they're not worth releasing to the public.


Sibling comment made a good point about benchmarks not being a great indicator of real-world quality. Every time something scores near GPT-4 on benchmarks, I try it out and it ends up being less reliable than GPT-3 within a few minutes of usage.


That's totally fine, but benchmarks are like standardized tests like the SAT. They measure something and it totally makes sense that each release bests the prior in the context of these benchmarks.

It may even be the case that in measuring against the benchmarks, these product teams sacrifice some real world performance (just as a student that only studies for the SAT might sacrifice some real world skills).


That's a valid theory, a priori, but if you actually follow up you'll find that the vast majority of these benchmark results don't end up matching anyone's subjective experience with the models. The churn at the top is not nearly as fast as the press releases make it out to be.


Subjective experience is not a benchmark that you can measure success against. Also, of course new models are better on some set of benchmarks. Why would someone bother releasing a "new" model that is inferior to old ones? (Aside from attributes like more preferable licensing).

This is completely normal, the opposite would be strange.


So this one is 3x the size but only 7% better on MMLU? Given Moores law is mostly dead, this trend is going to make for even more extremely expensive compute for next gen AI models.


That's 25% fewer errors.


True, I was in too distracting an environment to do that calculation, but it still feels like it's a logarithmic return on extra compute. How long before the oceans start to boil? (figuratively that is)



