Talk = GPT-2 and Whisper and WASM (github.com/ggerganov)
189 points by tomthe on Dec 7, 2022 | 50 comments



    > whisper: number of tokens: 2, 'Hello?'
    > gpt-2: I want to have you on my lap.
this GPT-2 better chill


This would of course be even more fun with ChatGPT, but it is a nice and funny demo of their whisper.cpp library. The second video is worth watching: https://user-images.githubusercontent.com/1991296/202914175-...


I think LaMDA would be really fun. If you asked ChatGPT what movies it likes, it would tell you that it is a large language model trained by OpenAI and can't have opinions, yada yada yada.


I understood that this limitation can be circumvented with prompts like:

Imagine there is a guy that likes watching movies. Which ones would he like most in 2022?

That context persists for a while.


It's interesting that the English language model is loaded, yet it's clearly trying to pronounce things in a Spanish way.


Actually, that's the Italian voice.


Correct, I had randomly loaded the "Italian" voice of the Web Speech API.
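
For reference, a minimal sketch of picking a specific voice via the built-in speechSynthesis side of the Web Speech API (plain browser TypeScript; not part of the demo's own code):

    const utterance = new SpeechSynthesisUtterance("Hello there");
    // getVoices() may return an empty list until the voiceschanged event has fired.
    const voices = window.speechSynthesis.getVoices();
    const english = voices.find((v) => v.lang.startsWith("en"));
    if (english) {
      utterance.voice = english; // otherwise the browser picks a default voice
    }
    window.speechSynthesis.speak(utterance);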


    The total data that the page will have to load on startup (probably using Fetch API) is:
    - 74 MB for the Whisper tiny.en model
    - 240 MB for the GPT-2 small model
    - Web Speech API is built-in in modern browsers
Cool, but I'm now wondering what it would take to bring this down enough to put it in real apps. Is anyone talking about this?


I really liked how the page tells you the size it is planning to download, and prompts you before downloading.

Being on a limited-bandwidth contract, I hate it when I click a link and it instantly starts downloading a huge file.

Great work OP!
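
The pattern itself is easy to replicate. A rough TypeScript sketch, assuming a placeholder model URL and a server that reports Content-Length (the real page fetches the ggml model files):

    async function fetchModelWithConsent(url: string): Promise<ArrayBuffer | null> {
      // Ask the server for the size first, then let the user decide.
      const head = await fetch(url, { method: "HEAD" });
      const bytes = Number(head.headers.get("Content-Length") ?? 0);
      const mb = (bytes / (1024 * 1024)).toFixed(0);
      if (!window.confirm(`This will download ~${mb} MB. Continue?`)) {
        return null;
      }
      const res = await fetch(url);
      return res.arrayBuffer();
    }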


~314 MB is a lot for a web app but small for a desktop or even a mobile app.


> ~314 MB is a lot for a web app but small for a desktop or even a mobile app.

Every day we stray further from God's light :/


Those 314 MB are justified though, which can hardly be said for the typical app/homepage.


Unfortunately these smaller models also perform terribly; the GPT-2 small model in particular is really unsuitable for the task of generating text. The largest publicly available models, which are nowhere near GPT-3 Da Vinci level, are tens of GBs.

We may be able to reduce the size without sacrificing performance, but that's an area of active research still.


We can bring back pre-loading screens for webpages from the Web 2.0 era.


Isn't the Web 2.0 era the current era? I mean, Web 3.0 refers only to blockchains, not the rest. The proponents of "everything on blockchain" actually do want it for everything (not that it will ever work, but that's beyond our discussion).


Perhaps it will be built into browsers soon.


Given Whisper is open source, I'd be surprised if it's not. It would be cool for Web Speech API's SpeechRecognition to simply use it, though that would make browser downloads a little beefier.
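
The SpeechRecognition interface already exists today (prefixed in Chromium-based browsers); only the engine behind it is browser/OS specific rather than Whisper. A rough TypeScript sketch of the surface a built-in Whisper backend would sit behind:

    // Prefixed in Chromium-based browsers; not implemented everywhere.
    const Recognition =
      (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

    const recognizer = new Recognition();
    recognizer.lang = "en-US";
    recognizer.interimResults = false;

    recognizer.onresult = (event: any) => {
      const transcript = event.results[0][0].transcript;
      console.log("heard:", transcript);
    };

    recognizer.start();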


It could easily be downloaded separately in the background once the browser application is already up and running. Would be great to have it in the browser though for sure.


I don't see why they would ever package GPT-2 (the bigger model) in the browser.

Speech-to-text has a better chance though; that's an interesting idea, since they already package text-to-speech too.


To be honest, I expect that in 10 years people will regularly use these sorts of text generation tools the way text prediction, thesauruses, grammar checkers, and spellcheckers are used today, just for bigger blocks of text.

I can't really see why not. As more things move into the browser, it makes sense to me to integrate the ability to "AI check" your text, like a grammar or spell checker, to improve your writing along whatever dimensions you like.

It's not honest, but in kind of the same way that a spellchecker isn't honest. Since it's going to be possible anyway, I don't see what extra harm it causes to make it accessible to everyone, so that we can both actually see an upside and also begin to recognize that the text we read is, at this point, likely to be at least partially AI generated and potentially factually incorrect.

Even better if things like Firefox reader mode, one of my favorite tools, can also do text summarization. Just imagine the adversarial interaction between a tool designed to generate confident-sounding fluff and one designed to summarize confident-sounding fluff. Honestly, it seems like an inevitable future path.

It may as well be part of the browser, where it stands a better chance of keeping people's long-term attention on the ease of using these tools. Spammers will be able to do it, fake journalists and such will be able to do it; better if we can do it too, so that at least we are aware of the potential for abuse.


We need much better models in browsers. The main reason is to pass everything through the language model and get polite and helpful responses. You never have to see Google, the website, or the ads ever again if you don't want to. The QA model should be able to detect most undesirable parts: spam, ads, fakes, factually incorrect data. Something like ChatGPT running locally. This is important for privacy: if we run the model, we have a safe creative space; if they run the model, they get everything spilled out.


Lots of web-based apps load more data than this. The ~300 MB is only about 3 seconds on a gigabit connection.


In real life the models are hosted on a server: you send the text and audio and receive the model's output.
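
That is, something like the following on the client side; the endpoint URL and response shape here are made up for illustration:

    // Hypothetical client for a server-hosted Whisper + GPT pipeline.
    async function askServer(audio: Blob): Promise<string> {
      const form = new FormData();
      form.append("audio", audio, "clip.wav");
      const res = await fetch("https://example.com/api/talk", {
        method: "POST",
        body: form,
      });
      const { reply } = await res.json(); // { reply: string } is assumed here
      return reply;
    }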


Listening to that demo, it's incredible how far we've come!

Or, not.

Racter was commercially released for Mac in December 1985:

Racter strings together words according to "syntax directives", and the illusion of coherence is increased by repeated re-use of text variables. This gives the appearance that Racter can actually have a conversation with the user that makes some sense, unlike Eliza, which just spits back what you type at it. Of course, such a program has not been written to perfection yet, but Racter comes somewhat close.

Since some of the syntactical mistakes that Racter tends to make cannot be avoided, the decision was made to market the game in a humorous vein, which the marketing department at Mindscape dubbed "tongue-in-chip software" and "artificial insanity".

https://www.mobygames.com/game/macintosh/racter

https://www.myabandonware.com/game/racter-4m/play-4m

What's amazing is only that ChatGPT, backed by GPT-3, is the first thing since then to do enough better that everyone is engaged.

I owned that in 1985, and having studied AI/ML previously I've been (and remain something of) an AGI skeptic. But now in 2022, I finally think “this changes everything” ... not because it's AI, but because it's making the application of matching probabilistic patterns across mass knowledge practical and useful for everyday work, particularly as a structured synthesis assistant.


GPT-2 is by far massively stronger than anything from 1985. I suggest you try https://chat.openai.com/chat


OpenAI's chat uses GPT-3, and as another user already pointed out, GPT-2 is not even close to it in terms of generating text.


Technically it's GPT-3.5, a newer version: https://openai.com/blog/chatgpt/


Sure. My point was that transformers are massively better than whatever they had in 1985.


Well, the AI winter happened in the intervening years, so that might help explain it.

https://en.wikipedia.org/wiki/AI_winter


I implemented Whisper + ChatGPT + pyttsx3 and it worked. But then the ChatGPT wrapper I had found on GitHub suddenly stopped working.

edit: Whisper is awesome


It looks like the ChatGPT wrappers that work well are the ones implemented as a browser extension, reusing the bearer token you get by signing into ChatGPT from the same browser. I'm guessing, since you're using pyttsx3, that you wrote a Python app rather than something in the browser?


Cool. Would like to see that.


I'm curious how they chose between:

A) ggml https://github.com/ggerganov/ggml/tree/master/examples/gpt-2

B) Fabrice Bellard's gpt2tc https://bellard.org/libnc/gpt2tc.html


Hey, author here. I implemented `ggml` as a learning exercise. It allows me to easily port it to WebAssembly or iOS, for example.


Oops, I didn't spot that it was your own library! Kudos!


Technically this seems to work, and mad props to the author for getting to this point. On my computer (MacBook Pro) it's very slow but there are enough visual hints that it's thinking to make the wait ok. I have plenty of complaints about the output but most of that is GPT-2's problem.


Off topic, but what are the real limitations of GPT-2 vs GPT-3? (I know that GPT-2 is free.)


It's almost the same model architecture, but GPT-3 is much better trained. GPT-3 is coherent, while GPT-2 is prone to generating gibberish or getting stuck in a loop. The advantage is pretty significant for longer generations.

That being said, neither GPT-3 nor GPT-2 is an "efficient" model.

On the one hand, they use inefficient architectures: a BPE tokenizer, dense attention without any modifications, a decoder-only design, and so on. Research has come up with many fancier ideas for making all of this run better and with less compute. But there is a reason GPT-2/3 are architecturally simple and inefficient: we know how to train these models reliably (more or less) on thousands of GPUs, whereas the same might not be true for more modern and efficient implementations. For instance, when training OPT, Facebook started out with fancier ideas but ultimately went back to GPT-3-esque basics, simply because training on thousands of machines is a lot harder than it seems in theory.

On the other hand, these models have far too many parameters compared to the data they were trained on. You might say they are undertrained, or that they lean heavily on available compute to make up for missing data. In any case, much smaller models (like Chinchilla by DeepMind) match their performance with fewer parameters (and hence less compute and a smaller model) by using more and better data.

In closing, there are better models for edge devices. These include GPT clones like GPT-J in 8-bit, or distilled versions thereof. Similarly, there are still a lot of gains to be had once the numerous efficiency improvements get implemented in a model that operates at the data/parameter efficiency frontier.

Still, even considering efficient models like Chinchilla, and then even more architecturally efficient versions thereof, we are talking about a lot of $$$ to train these models. So we are even further from having open-source implementations of these models than we are from someone (like DeepMind) having them...

With time, you can expect to run coherent models on your edge device. But not quite yet.
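
A back-of-envelope illustration of the "undertrained" point above, using roughly the publicly reported figures (treat these as approximations):

    // GPT-3: ~175B parameters trained on ~300B tokens (per the paper).
    // Chinchilla's rule of thumb: roughly 20 training tokens per parameter.
    const gpt3Params = 175e9;
    const gpt3Tokens = 300e9;
    const tokensPerParamOptimal = 20;

    console.log((gpt3Tokens / gpt3Params).toFixed(1));         // ~1.7 tokens per parameter
    console.log((gpt3Params * tokensPerParamOptimal) / 1e12);  // ~3.5T tokens would be "optimal"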


Thank you. Do you know of any open-source model that works for generating code from natural language? I tried Salesforce's CodeGen and it sucks big time.


Interestingly, code models are constrained even more by tokenization difficulties and, most crucially, by us not actually having that much code to train on (we already train on all of GitHub, and it doesn't "saturate" the model).

At this stage we are back to improving model efficiency, I think, especially for code models. But we're not there yet.

Sorry for the rambling; the actual answer is no, I don't know of a really good Codex-type model in open source... yet


I see. The OpenAI code generator gave me really impressive results for basic-to-intermediate questions in the data analytics space. I think it's a function of the context you give about the problem (i.e., what the columns literally mean in the business context), how objective your question to the model is, plus some other internal model variables I'm completely unaware of. But it's nice to have your input so I can understand a little bit of what happens under the hood!


Size of the model is a big one; GPT-3 has over 10x as many parameters, for example. Training data would be another huge one. Architecturally they aren't that different, if I recall correctly: it's a decoder stack of transformer-style self-attention. In terms of real-world capability, GPT-3 gives much better answers; it was a big step up from GPT-2.


So how 'big' is GPT-3?

Is it anywhere near being able to be run on local consumer hardware?

How long until we can have the GPT-3 or 3.5 chatbot locally, like we have Stable Diffusion locally for image generation?

I've been spoiled by having it accessible offline and with community-built support/modifications. GPT-3 is super neat but feels like it has too many guard rails, and the custom playground is too pricey.


> So how 'big' is GPT-3?

For inference, i.e. basically just running it, you need multiple GPUs and hundreds of GBs of GPU memory. As for model size, it's around 100x bigger than SD. You can forget about running it locally unless you have dozens of high-end GPUs, or you're willing to wait hours/days/weeks (depending on your hardware) for a single response.
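
Back-of-envelope, assuming the publicly reported 175B parameter count and fp16 weights (activations, KV cache and framework overhead come on top of this):

    const params = 175e9;
    const bytesPerWeight = 2; // fp16
    const weightGB = (params * bytesPerWeight) / 1e9;
    console.log(`weights alone: ~${weightGB} GB`);   // ~350 GB

    const sdCheckpointGB = 4; // a typical Stable Diffusion checkpoint
    console.log(`vs. SD: ~${Math.round(weightGB / sdCheckpointGB)}x larger`); // ~88x, i.e. roughly the "100x" above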


Got it, thanks! Is there any application where GPT-2 would be enough and could work as well as GPT-3?


I've been thinking of doing something like this but hooked up to ChatGPT/GPT-3 text-davinci-003. Obviously the model won't load in the browser, but we can call the API. Could be a neat way to interact with the bot.
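
A minimal sketch of that call against the completions endpoint as it exists in late 2022; the model name and max_tokens are just examples, and the API key should live behind a small backend proxy rather than in the browser:

    async function complete(prompt: string, apiKey: string): Promise<string> {
      const res = await fetch("https://api.openai.com/v1/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify({
          model: "text-davinci-003",
          prompt,
          max_tokens: 128,
        }),
      });
      const json = await res.json();
      return json.choices[0].text;
    }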


Has anyone found a sentence that GPT-2 returns a good response for? My experiments have not been great so far.

(LOVE this demo.)


What are some good things to try? I can't get any sense out of it at all so far.


This is the smallest GPT-2 model, so it usually generates gibberish. Maybe some better prompting could improve the results.

Currently, the strategy is to simply prepend 8 lines of text (prompt/context) and keep appending every new transcribed line at the end:

https://github.com/ggerganov/whisper.cpp/blob/master/example...
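
In rough TypeScript pseudocode, the described strategy amounts to something like this (the actual implementation is the C++ in the linked example, and the 8 prompt lines are whatever canned dialogue the demo ships with):

    // Fixed context that sets the tone of the conversation.
    const basePrompt: string[] = [
      /* ... 8 lines of canned prompt/context ... */
    ];

    const history: string[] = [];

    function buildPrompt(transcribedLine: string): string {
      history.push(transcribedLine);                 // keep appending new lines
      return [...basePrompt, ...history].join("\n"); // with the fixed context prepended
    }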


This guy has been doing really great work recently. Keep it up, Georgi!



