Use GPT-3 incorrectly: reduce costs 40x and increase speed by 5x (buildt.ai)
295 points by Buoy on Feb 8, 2023 | 45 comments



If speed and price are concerns, use the FOSS models available on the Hugging Face Hub: https://hf.co/models. Thousands of models, different sizes and tasks. Download locally and fine-tune, if necessary.

For those specifically interested in text embeddings, here is a good analysis: https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...


Normally I'd point out how these are a lot less capable than GPT-3. But the article ends up fine-tuning Babbage, and multiple free models can outperform Babbage, so this is very solid advice.


Starting with HF models and moving to a large model like GPT3 when the task calls for it is a good approach to take with almost all tasks.


How about fine-tuning/testing with Davinci and then scaling down to the other models or HF once you've proven it works? I believe the OpenAI docs propose this approach (minus HF, of course).


That's up to you. Many don't want to open an account and pay in order to explore what's possible. There are LLMs available on the HF Hub, such as google/flan-t5-xl.
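A few lines with the transformers library is enough to try it locally - a rough sketch (the prompt and generation settings are just illustrative):

    # Rough sketch: run google/flan-t5-xl locally via Hugging Face transformers.
    # Assumes `pip install transformers torch`; prompt and settings are illustrative.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = "google/flan-t5-xl"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    prompt = "Summarize: open models on the Hugging Face Hub come in many sizes."
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))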


This works well. Starting with bigger models lets us explore what's possible. I find it easier to scale down from there.


The Medium article you posted analyzes OpenAI's old embeddings. OpenAI has a single new embedding model [0] that replaces all the old models, and it's also super cheap ($0.0004 / 1K tokens).

0. https://openai.com/blog/new-and-improved-embedding-model/


The first comment on that article has details on the new model. I'm not the original author, but per their testing they paid $70 to encode 1M records. The embeddings are 1536 dimensions, which requires a lot of vector storage. The HF Hub has open models with 384 or 768 dimensions that work well for a lot of use cases.
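For example, with the sentence-transformers package (a rough sketch; all-MiniLM-L6-v2 is one of the 384-dimension models):

    # Rough sketch: 384-dim embeddings with an open model from the HF Hub.
    # Assumes `pip install sentence-transformers`; the model choice is illustrative.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384 dims
    vectors = model.encode(["first record", "second record"])
    print(vectors.shape)  # (2, 384) - 4x less vector storage than 1536 dims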


Based on my understanding of the article, it sounds like OP has re-discovered/re-invented Knowledge Distillation [0], which is a reasonably well-researched technique, frequently applied to get models to run on compute-constrained devices, or to take very deep models and use them to train very shallow but low-latency models.

[0] https://en.m.wikipedia.org/wiki/Knowledge_distillation
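The classic formulation trains the small student to match the softened output distribution of the big teacher. A minimal sketch in PyTorch, with hypothetical teacher/student modules:

    # Rough sketch of a knowledge distillation loss (teacher/student are hypothetical).
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: KL divergence between temperature-scaled distributions.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: ordinary cross-entropy against the ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Inside a training loop (teacher frozen, student trainable):
    #     with torch.no_grad():
    #         teacher_logits = teacher(batch)
    #     loss = distillation_loss(student(batch), teacher_logits, labels)
    #     loss.backward(); optimizer.step()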


Glad to know that's what it's called! For the avoidance of any doubt, I'm not trying to claim I 'discovered' any of this - just wanted to share some useful learnings :D


When you independently discover commonly used techniques (often without even realizing it), it's usually a good sign you're working on and thinking about cool stuff! Eventually one of those ideas will be entirely novel (you've probably already had many of them), and that's how scientific progress is made.


I came here to say this. This entire article should have been 5 lines, had they known about Knowledge Distillation.

Anecdotally, a similar thing happened to me when I "independently invented" synthetic data generation for deep learning back in 2016... Turns out my brilliant idea was not that original after all


The more I learn, the more I realize how rarely I've thought of a "new idea." In fact, I'm not convinced such a thing even exists. An idea is generally a set of connections between pre-existing ideas, so perhaps a novel subset of those connections would qualify as a "new idea," but one could argue it's more of a rearrangement of existing concepts. (What was the first idea? The first thought? Does thought require language, and if so, is consciousness an emergent property of language?)

When I realized how rare it is to have a unique thought, I reframed my approach to evaluating my own ideas. Instead of worrying about novelty, I worry about relevance.

If I can't find a pre-existing similar idea, I assume I'm on the wrong track unless I can identify a compelling reason why I'm in a unique position to have an idea that nobody has ever thought or at least felt like communicating publicly.

If I can find such a pre-existing idea, then I know I'm on the right track, and often it validates a pattern I identified between a subset of connections in my previous knowledgebase, solidifying the pattern in my mind. Regardless of whether I can label it a "new idea," I can use it to work toward the next step in whatever line of thinking caused me to notice the pattern in the first place.


I tried this method some time ago; it does work, but it is not consistent. Sometimes you get good results identical to davinci, as the article mentions, but you risk getting random text not related to the prompt.

But I haven’t fine tuned on a large number of records, so that might solve the consistency problem.


I'd suggest you try on 10k examples or more (variety is key ofc) and see how you get on!


This is great, thanks for sharing. I've been pretty reluctant to use davinci in production due to costs, but the other models seem significantly less capable. It seems like there's still a lot to learn about working with these models at scale.


Yes, definitely give this a go. For our use case davinci is prohibitively expensive in production, so we literally cannot use it given the number of requests we make. Interestingly, I saw some OpenAI documentation the other day at the YC event which basically said that with a large enough dataset (I recall >= 30k examples) all of the models (yes, even ada) start to perform similarly, so bear that in mind!


My impression is that the workflow is supposed to be davinci for prototyping, then fine-tuning and downgrading to curie in several steps as you polish prompts and break down tasks (see https://github.com/openai/openai-cookbook/blob/main/techniqu...).
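For reference, the fine-tune step in that workflow was roughly this with the API as it was at the time (a rough sketch; the file name and separator/stop conventions are just the ones the cookbook suggests):

    # Rough sketch of the legacy fine-tuning flow (2023-era openai Python library).
    # Assumes OPENAI_API_KEY is set and train.jsonl holds lines like
    # {"prompt": "input text\n\n###\n\n", "completion": " desired output END"}
    import openai

    upload = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
    job = openai.FineTune.create(training_file=upload.id, model="curie")
    print(job.id)  # poll the job; the fine-tuned model name appears when it finishes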

I haven't gotten beyond davinci for prototyping though. It's cheap enough that a month of tinkering with a chatbot was $7 in Jan.


Any other obvious low-hanging-fruit tricks for GPT-3? Currently using Davinci-003 for my use case because the price didn't seem incredibly expensive. Was planning on fine-tuning a model down the road, but didn't get to it for the MVP because it seemed like a prohibitively large amount of work.


Does OpenAI provide guarantees now around fine tuned models being loaded 24/7? Last time I fine tuned Curie models they took forever to load and that killed it for me as we’re building an interactive application.


If you're using it to the point that you need it guaranteed up 24/7, you may be better off not treating it as a SaaS API and instead building your own model. It seems somewhat likely OpenAI has some pretty insane changes coming to access and pricing structure, so I'd be cautious about becoming reliant on them (at least for the next 3 months).


This isn't really a comment on this post, but have you thought about building a JetBrains plugin version of your product?


We've had some demand for this; if there's enough, we'll definitely consider it when we hire more engineering resources, which are the current constraining factor.


Thanks for sharing! I’ve been looking at trying cheaper models but have been scared about quality. Definitely more tempted to try these now!


Definitely give them a go; we use fine-tuned ada a bunch for classification work, for example. I personally think the smaller models are overlooked and don't get enough love - if OpenAI increased the context window of a model like babbage to 8k tokens, I feel that would be as big a deal as a marginal improvement to davinci, purely because so many use cases rely on low-latency, high-request-volume models.
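For anyone curious, the classification fine-tunes are just prompt/completion pairs where the completion is a short class label. A rough sketch of building such a file (labels, separator, and examples here are made up):

    # Rough sketch: build a classification fine-tune file for a small model like ada.
    import json

    examples = [
        {"text": "the build is broken again", "label": " bug"},
        {"text": "please add dark mode", "label": " feature"},
    ]
    with open("classify.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps({
                "prompt": ex["text"] + "\n\n###\n\n",  # separator between input and label
                "completion": ex["label"],             # short label with a leading space
            }) + "\n")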


Is this just fine tuning a cheaper model like Curie on the outputs of a more expensive model like Davinci? I proposed this as a question/idea to an OpenAI employee on Twitter and he talked me out of it.


What was their argument?

I think it depends on how the answers from davinci are used. If the answers are used verbatim to fine-tune a cheaper model, then that could be problematic. However, if they are used to generate a fine-tuning dataset which is then corrected manually, I don't see the problem.


He didn't give a great explanation. Just kinda kept dismissing it as a bad idea and suggesting other paths.

I don't see why using the Davinci outputs verbatim would be a problem in certain situations. The goal is just to get a fine-tuned cheaper model (like Curie) closer to Davinci performance in some narrow problem domain. Of course it's never going to be as good or broad as Davinci with this approach, but the lower cost may outweigh that. Just surprised more people haven't tried and benchmarked this approach...but I'm no expert here so there is probably a good reason.
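If someone does want to try it, the rough shape (with the completions API as it existed at the time; the prompts and file names are placeholders) would be something like:

    # Rough sketch: collect davinci outputs as training data for a cheaper fine-tune.
    import json
    import openai

    prompts = ["Explain X in one sentence.", "Summarize Y."]  # your narrow domain here
    with open("distill.jsonl", "w") as f:
        for p in prompts:
            resp = openai.Completion.create(
                model="text-davinci-003", prompt=p, max_tokens=128, temperature=0
            )
            f.write(json.dumps({
                "prompt": p + "\n\n###\n\n",
                "completion": " " + resp.choices[0].text.strip() + " END",
            }) + "\n")
    # distill.jsonl can then be used to fine-tune curie, ideally after the manual
    # review pass suggested above.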


Damn, the text is almost unreadable because of the strange proportions of the body font and the low line height.


They're talking about Google's current responses, which everyone has been clearly suggesting are no longer actual responses to a search but an entirely truncated and curated deal.

Google is primed to be a rear-view-mirror technology if it keeps reducing its actual usefulness.


Wow some of these comments seem harsh. The author isn't claiming to have invented these techniques and it's cool to see them being applied in the "real world" to save money.


Yep, that was the aim, I'm just trying to put some stuff out there that helped us out!


Agreed. Fine-tuning GPT-3 is often called out as expensive and sometimes unnecessary given K-shot & embeddings. I'm glad they shared their success story even if it's not ground-breaking.


Thanks - we used to use a bunch of k-shot prompts (particularly with our previous idea before we pivoted when we got into YC), but with the davinci model we were sending ~3.5k tokens per invocation, which in the long term was costing far more than finetuning!


Check out https://text-generator.io for embeddings for search; they are better because they are cheaper and faster and take linked images into account (they actually embed text, images, and code in the same space).

https://text-generator.io/blog/embed-images-text-and-code

Your training trick is a neat/great innovation, but also keep in mind it is likely overfitting, meaning that when you get a bit of new data you need to index and search, that model isn't going to do well at all at embedding it. Said differently, that training works well if you can cover the types of data you're going to see in production really well at training time. If not, there's a big accuracy drop for unseen data due to overfitting.


I'm clearly not versed in AI, but at least two of their examples are quite obviously wrong. Their Study Notes example asks for five key facts about Ancient Rome -- one of them is borderline incorrect and only applies to the Roman Empire, another is an overgeneralization, and there are only two of them, not five. The "Receipts" example gets the total sum wrong.

Is this the future!?


Looks like a spam blogger's dream come true, for sure.


The “code” example starts with an image recognition neural net prompt and ends 400 lines later by moving a turtle around the screen.

I mean, we know AI isn’t very good at these things, but it’s pretty bold to use these as marketing examples!


Chat: “Yawn, this is basic textbook stuff”

Me: These people are wizards


We're in the HN eye of the storm.


[flagged]


This is literally just a textbook HN snarky comment.


The title in question is misleading, and the negative feedback you received is likely aimed at that.

The primary objective of ChatGPT is not to serve as a finished product answering queries, but rather to undergo fine-tuning with respect to given prompts. This model is capable of utilizing a variety of sub-models and executing queries as necessary.

Let's consider an example where a user inputs the sentence "What is the capital of France?" This is how ChatGPT will answer you:

1. Preprocessing: The chat interface processes the user's input and recognizes that the user is asking for the capital of a country.

2. Generation of prompt: Based on this information, the chat interface generates the following prompt: "The capital of France is __."

3. Input to GPT-3: The prompt and the user's input are fed into the GPT-3 model.

4. Generation of response: The GPT-3 model then generates a response based on the prompt and input, for example: "The capital of France is Paris."

5. Postprocessing: The response generated by the GPT-3 model is postprocessed by the chat interface to ensure that it is grammatically correct and in the correct format. The final response returned to the user is: "The capital of France is Paris."

What I mean by that is that of course you will be able to use smaller, more efficient models to get faster results, and what ChatGPT really is, is the world training it on what prompts look like. The "GPT part" is only used after heavy preprocessing; I would not be surprised if it already knew how to use its smaller models itself =)


This does not sound at all like how any LLM I know of works. E.g., no model the size of ChatGPT should need grammar correction. Having that run client-side sounds even weirder. Where did you get this information from?


From ChatGPT itself and other interfaces around it.

And I didn't say client-side; there is probably simply an API wrapper in between.

I also believe that this is how its moderation works. ChatGPT hasn't had all the "bad stuff" removed from its data; there is simply an output filter.

The question still is how to get from complex prompts to actual answers, and boy, have I tried complex prompts!

Of course I'm looking in from the outside with no validity to my claim, but if I were to write a bot from scratch, that's also how I would do it.


> From ChatGPT itself

That is a horrendously bad source. ChatGPT doesn't know how ChatGPT works, because nothing about how ChatGPT works was published on the internet at the time when ChatGPT was trained. Frankly, there's still nothing on the internet about this.

> Of course I'm looking in from the outside with no validity to my claim, but if I were to write a bot from scratch, that's also how I would do it.

ChatGPT wasn't "written" as much as trained. Writing a bot and training one require completely different mentalities.



