Hacker News
Generative A.I. arrives in the gene editing world of CRISPR (nytimes.com)
90 points by msmanek 10 days ago | 56 comments





Reading their blog post, I wonder if an LLM is really the best way to do this. If I got it right, they used the LLM to enumerate potential protein DNA sequences. Does that really need an LLM? Enumeration is not novel, nor are LLMs particularly good at it. If you want to computationally parallelize the search over a large enumeration space, it would be much easier to simply, well, do that instead of taking a detour via a statistical parrot.

In a nutshell this sounds more like a case of "we wanted something with AI in the title".


It's not an English LLM, but a "protein" language model, where tokens represent amino acids or nucleotides. Training a transformer language model on such data simply learns a distribution over sequences of tokens. Conceptually it's a fine approach, in many ways the "right" or most elegant method, and not a stretch at all.
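To make the "distribution over sequences of tokens" framing concrete, here's a toy sketch: a bigram model stands in for the transformer, the corpus sequences are made up, and the point is only that "generation" is just sampling tokens from learned conditional frequencies.

```python
from collections import defaultdict
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def train_bigram(sequences):
    """Count residue-pair frequencies, approximating P(next | current)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def sample(counts, start, length, rng=random.Random(0)):
    """Autoregressively sample a sequence, as LM generation does."""
    seq = [start]
    for _ in range(length - 1):
        nxt = counts[seq[-1]]
        if not nxt:  # unseen context: fall back to uniform
            seq.append(rng.choice(AMINO_ACIDS))
            continue
        tokens, weights = zip(*nxt.items())
        seq.append(rng.choices(tokens, weights=weights)[0])
    return "".join(seq)

corpus = ["MKTAYIAKQR", "MKLVINGKTL", "MKTFFVAGNP"]  # invented sequences
model = train_bigram(corpus)
print(sample(model, "M", 10))
```

A real protein LM replaces the bigram counts with a transformer conditioning on the whole preceding sequence, but the interface (train on token sequences, sample token by token) is the same.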

I enjoyed the feeling when I made this connection talking with a startup doing this a while back. It's just a different "language" and although it's not a given that LLMs can operate in it, it's a reasonable thing to try, and it turns out they can.

Personally I think it was obvious that LLMs were going to be useful for protein modelling since the previous generation used HMMs very successfully. Pfam (a library of HMMs for classifying proteins into preexisting known families) is one of the most important resources we have because of the power of HMMs to model sequential language.

I suspect we will need to move from sequential modelling to graphical modelling to level-up again, though.


> I suspect we will need to move from sequential modelling to graphical modelling to level-up again, though.

Out of curiosity, would you mind elaborating on this?


I don't work in the field so I'm probably just repeating something Hinton already said, but it seems to me like attempting to model things in reality that have graph-like structures (like interacting pairs of residues in a 3D protein structure) using sequences with finite context lengths is ultimately going to be less efficient than modelling graphs directly. My guess is this work roughly describes what I'm thinking of: https://www.cis.upenn.edu/~mkearns/papers/barbados/jordan-tu...

It could also be that I completely misunderstand context in sequential models and that what I'm describing is already being used, or has been evaluated and found unsuccessful.


> Learning a transformer language model on such data simply learns a distribution over sequences of tokens.

If statistical distributions can model higher level polypeptide structure, then it could be useful.


Yeah, but your training is bottlenecked by the lack of ground truth. Some things were (I presume) or will be easy to do with LLMs, like protein structure, because every part of every protein is source data (and there are millions of known structures). But suppose you want to estimate clearance, or LD50. For how many proteins do we know the serum clearance? 1,000? 10,000 maybe?

I don't have a direct answer to your question. My guess is that LLMs are too limited to make truly great solutions in biology but sequential modelling is a key component that will not be replaced any time soon. For example, transformers were key to AlphaFold's success, but they still needed many other steps to make accurate predictions.

I worked on a predecessor to LLMs: HMMs for protein modelling. They were, and for most people still are, the best way to model protein sequences. It's usually done as prediction rather than generation (i.e., you use the model to classify an unknown sequence into a known category, rather than asking the model to generate new instances of a category). HMMs for proteins are a bit stuffy: they model local changes well but struggle with the long-range interactions that LLMs seem to excel at (for example, an HMM will do a good job of letting you stuff a few more residues into a protein in a localized region such as a hinge, but is not so great at modelling groups of residues that are far apart in sequence space but close in protein space).
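For anyone unfamiliar with how HMM-based classification works, scoring a sequence against a model boils down to the forward algorithm. Below is a minimal sketch; the two "families", their single match state, and all the probabilities are invented for illustration and look nothing like a real Pfam profile HMM.

```python
import math

def forward_loglike(seq, states, start_p, trans_p, emit_p):
    """Forward algorithm: log P(seq | model), summed over all state paths."""
    alpha = {s: start_p[s] * emit_p[s][seq[0]] for s in states}
    for obs in seq[1:]:
        alpha = {
            s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][obs]
            for s in states
        }
    return math.log(sum(alpha.values()))

# Two invented "families" over a toy two-letter alphabet: family X emits
# mostly "A", family Y mostly "G". A single match state keeps the sketch tiny.
STATES = ("m",)
COMMON = {"start_p": {"m": 1.0}, "trans_p": {"m": {"m": 1.0}}}
family_x = dict(COMMON, emit_p={"m": {"A": 0.9, "G": 0.1}})
family_y = dict(COMMON, emit_p={"m": {"A": 0.1, "G": 0.9}})

def classify(seq):
    """Classification = argmax of log-likelihood across family models."""
    scores = {
        name: forward_loglike(seq, STATES, **params)
        for name, params in [("X", family_x), ("Y", family_y)]
    }
    return max(scores, key=scores.get)

print(classify("AAGA"))  # prints "X": mostly-A sequence scores higher under X
```

A real profile HMM adds match/insert/delete states per column, which is exactly where the "good at local stuffing, bad at long-range pairs" behaviour described above comes from.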

One detail of the bitter lesson is, imho, that statistical parrots are better than they "should" be, probably for the same reason that mathematics is unexpectedly proficient in modelling physics: to some degree, the models recapitulate the true latent space of the underlying system well enough to generalize outside the original observation space.


I think your intuition is off here. The number of sequences to enumerate is much greater than the number of atoms in the universe. You need a smart way to enumerate these, and that's what the LLM is for. The statistical parrot is not a detour, it's a shortcut.

The real power of LLMs is they can model anything as a “language” given the right sequence training data.

Warning: the following is my opinion.

In the same way that MLP “neurons” are universal approximators, it seems that LLMs are universal mappers.

They have the potential to help us organize and translate the immense quantity of data being generated by modern methods in all respective disciplines. We might create a model that translates English to protein synthesis, and vice versa, which would be pretty useful given my lay understanding of biochem.

To your point - this probably is NOT the best way to do this in an objective sense. But to my mind we are hitting upper limits as finite beings and need things like this, which utilize native language constructs, to move forward.


First, the search space is way too large for brute-force enumeration. We're talking something like 10^300 combinations. Also, the hard part isn't just listing amino acid sequences, it's finding ones that do what you want them to. The only way to figure that out is by testing them, which is difficult and expensive. So you need an algorithm that is good at listing only sequences that are likely to work. That's precisely what LLMs are good at: finding patterns and sequences that are correlated in a useful way.
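The back-of-the-envelope behind that number, assuming (my assumption, not the parent's wording) a ~300-residue protein with 20 possible amino acids per position:

```python
import math

# 20 choices per position, ~300 positions: 20^300 = 10^(300 * log10(20)),
# dwarfing the ~10^80 atoms commonly estimated for the observable universe.
search_space = 20 ** 300
atoms_in_universe = 10 ** 80             # common rough estimate
print(math.log10(search_space))          # ≈ 390.3
print(search_space > atoms_in_universe)  # True
```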

Well hopefully it's trained on genetic DNA sequences and not Reddit threads. If so, it should do pretty well predicting the next sequence given previous sequences. There are probably all sorts of undiscovered patterns.

To be fair, having AI in the title landed it on the front page of HN, so...

Is this going to be as good as when AI arrived in the world of materials science?

https://www.404media.co/google-says-it-discovered-millions-o...

Or is it only just going to be as good at generating headlines?




Hey thanks:) That was nice of you

Very kind of you! Thank you!

Imagine an AI learning from photos/videos of a person and their DNA sequence, plus a list of diseases, health records, etc. Then asking it for predictions while giving it feedback afterwards so it can tune itself.

You could even guarantee privacy. That would be some really useful data.


You mean exactly what 23andMe tried to do, and failed miserably at.

We are still early. Eventually you'll be able to change your race, gender, add reptile eyes, regrow limbs etc. Has to start somewhere. Need more data.

I really, really wanted to see a new generation of tattoo technology based on fluorescence and squid chromatophores. However, for the time being, the vast majority of gene editing will be for well-understood medical conditions where all the alternatives have been excluded. Germline (or even somatic) modification for recreational purposes or for non-urgent medical reasons is definitely still considered highly suspect by society as a whole, and I don't see that changing overnight. Some things still work better in sci-fi than reality.

Exactly. We should start building a global database connecting DNA with medical history.

You mean, like UKBB and All of Us already do, but less nationally focused? The approach seems fraught due to the complexity of medical ethics, the variation in national laws, and strongly held nationalist positions.

Not a big fan? It could be provably private. You could have a kit with a random username/password. It could be done. People just have a bad taste from 23andMe.

I can already change my race. I just check a different box on government forms...

There was no privacy there

I like this a lot; you could have a multimodal setup with a DNA transformer, an image transformer and an LLM. Extremely fundable startup.

The real endgame here isn't to just enumerate and then patent those sequences, right?

Captain Trips

One of the four horsemen of the AI apocalypse.


So the six finger hands were just a foreshadowing?

What can possibly go wrong if we let ChatGPT edit our DNA?

This model has nothing to do with ChatGPT other than transformers. And as someone who could desperately use some advances in gene editing, this lowbrow dismissal is frustrating.

More and more I find the flippant dismissal of the amazing breakthroughs that are constantly streaming in kind of disheartening.

Nobody is impressed by anything anymore, nobody is excited about the future - everything’s just “terrible” and everything looking forward is bad. I hate that mindset and I’m sick and tired of cynicism poisoning practically every online space. Thank you for saying it’s “lowbrow” and giving OP a little bit of a hard time.


The breakthroughs in tech are great; the way they are marketed, sold, and pushed into our lives by means of adtech everywhere... not so much.

Part of me says - and this is somewhat of a hot take - "then stop using platforms that mash ads into everything?" I don't know, like, I'm here for the rebel yell against global industrial capitalism - let's replace it with the solar punk world of our dreams, but... like, we're not going to do that any time soon, so instead of complaining about "how terrible everything is" let's try to do better. Let's be excited about this stuff. Let's be positive.

Have you seen how generative AI thinks hands work? Just a few edits and reality can catch up.

Have you seen DNA? Mistakes and dupes and hallucinations all over the place. Ever since Sherlock Crick and Doctor Watson started meddling with it.

It's one thing to analyze it. It's an entirely different thing to let a machine of dubious abilities create new DNA.

Isn't DNA in itself a machine of dubious abilities? It's only functional because what functions is what survives, imagine the amount of 'unsurvived' because of how shit the code is.

Machines that undergo accelerated evolution I would trust more, under rigorous guidelines.

I just need to be able to test the guidelines and results. A clinical trial process to make an objective decision from that point.


And how does ChatGPT edit your DNA?

I still consider biological life as the best ‘robot’ because it can create more of itself.

As long as robots are incapable of reproduction I don't see the threat.

One could say all machines today are infertile.


What about computer viruses?

Computer viruses run on hardware they do not create but merely hijack. Unless a virus took control of a semiconductor fab, it's hard to argue that they are alive/reproducing in the context of this discussion.

> Unless a virus took control of a semiconductor fab, it's hard to argue that they are alive/reproducing in the context of this discussion.

It does not seem implausible at this point to imagine a virus which gets control of some currency, uses it to place an order for parts to be assembled, delivered to a location, connected to power, etc.

It is, in a sense, taking control by pulling the levers supplied by our society. Is that alive?

I would say no, it is not 'alive'… but I would also paraphrase Dijkstra: "The question of whether an AI or computer virus is 'alive' is no more interesting than the question of whether a submarine can swim."


Parasites do the same thing: hijack a piece of hardware and use it to reproduce. Computer viruses have even formed a kind of symbiotic ecosystem with blackhats: the scammer provides resources to help the virus reproduce, and the virus provides access to the scammer in turn.

In a way, all life hijacks hardware (the material world) that it doesn't create to reproduce itself.


> Parasites do the same thing

Only analogously. The reality is that what we call computer viruses are merely instructions running on a computer and are not substantially distinct from the computer in the same way that parasites or physical viruses are distinct from biological tissue.


Computer viruses would need to gain the ability to mutate by themselves, improving their code over time, though.

Simulated evolution is trivial to implement, but my guess is also a bit pointless from the point of view of most people writing viruses — the viruses might mutate to not give them money.
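For what it's worth, "trivial" is fair: a complete toy evolutionary loop fits in a few dozen lines. The target string, rates, and population sizes below are all arbitrary; this is just mutation plus selection toward a fixed fitness function, nothing like actual self-modifying malware.

```python
import random

TARGET = "REPLICATE"                       # arbitrary fitness target
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def fitness(s):
    """Number of positions matching the target."""
    return sum(a == b for a, b in zip(s, TARGET))

def mutate(s, rng, rate=0.1):
    """Replace each character with a random one with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else c
                   for c in s)

def evolve(pop_size=50, generations=300, seed=0):
    rng = random.Random(seed)
    pop = ["".join(rng.choice(ALPHABET) for _ in TARGET)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) == len(TARGET):
            break
        survivors = pop[: pop_size // 5]               # selection
        pop = [pop[0]] + [mutate(rng.choice(survivors), rng)
                          for _ in range(pop_size - 1)]  # elitism + mutation
    return max(pop, key=fitness)

print(evolve())
```

Which also illustrates the parent's point: the loop optimizes whatever the fitness function rewards, and nothing else.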

Those mutations would probably be favored by evolution, if anything. They would attract less attention and allow more focus on reproduction.

I think of robots as being physically embodied somehow. I don't think of a software program like a virus as being a robot.

[flagged]


Do you mean for AI to do the entire job of researching, creating, testing, manufacturing, and distributing a cure, or just for AI to be involved? And do you mean completely eradicating a disease, or just producing a cure for it? And do you mean an outright cure, or also a treatment or vaccine? If the latter in all cases, here's an example:

https://www.technologyreview.com/2022/08/26/1058743/i-was-th...

If you mean the former in any case, it'll probably be a while, if ever in our lifetimes.


It cured my investment interest in it.

[flagged]


Die Instantly Man is the worst superhero...


