Dave, I have some interesting functions you can look at.
December 15, 2023 12:40 PM

FunSearch: Making new discoveries in mathematical sciences using an LLM. "This work represents the first time a new discovery has been made for challenging open problems in science or mathematics using LLMs. FunSearch discovered new solutions for the cap set problem, a longstanding open problem in mathematics. In addition, to demonstrate the practical usefulness of FunSearch, we used it to discover more effective algorithms for the “bin-packing” problem, which has ubiquitous applications such as making data centers more efficient."
posted by storybored (18 comments total) 17 users marked this as a favorite
 
Do you want Holodeck Moriarty?

Because this is how you get Holodeck Moriarty.
posted by RonButNotStupid at 12:43 PM on December 15, 2023 [5 favorites]


It is not Holodeck Moriarty. It is, however, really smart and useful. I kind of love it.
posted by phooky at 1:10 PM on December 15, 2023 [1 favorite]


Featuring MetaFilter's own escabeche.
posted by egregious theorem at 1:16 PM on December 15, 2023 [12 favorites]


I count myself as an LLM skeptic, but this looks like an amazing application!

I'm curious to read more about the automated "evaluator", which guards against hallucinations and incorrect ideas.

LLMs making stuff up has been a central obstacle to their utility (and to me their interest). I take it the evaluator is not itself an LLM/deep learning system, so I wonder across how many domains it can filter out LLM hallucinations and confabulations.

MIT Tech Review article on this.
posted by airing nerdy laundry at 1:20 PM on December 15, 2023 [5 favorites]


eponysterical
posted by clew at 1:47 PM on December 15, 2023 [1 favorite]


I take it the evaluator is not itself an LLM/deep learning system, so I wonder across how many domains it can filter out LLM hallucinations and confabulations.

I don’t know much about the problem domain here - my initial assumption was that it’s one in which the “evaluator” step means something in the vicinity of literally calling `eval()` on the generated code to check whether it works?
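
Something in this vicinity is what I mean; everything below (the names, the toy check) is invented just to show the shape:

```python
# Hypothetical sketch: run whatever the LLM produced and see if it behaves.
candidate_source = """
def solve(x):
    return x * 2   # whatever the model spat out
"""

namespace = {}
try:
    exec(candidate_source, namespace)      # load the generated code
    result = namespace["solve"](21)        # poke it with a known input
    print("passes check:", result == 42)   # compare against a known answer
except Exception as err:
    print("rejected:", err)                # broken/hallucinated code just gets dropped
```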
posted by atoxyl at 2:10 PM on December 15, 2023 [1 favorite]


Evaluation on the sample bin-packing problem could be something like using a simple greedy packing heuristic as the baseline you want to improve upon.

Say the LLM proposes some candidate solution it has built. The evaluator rewards solutions that increase packing efficiency relative to the baseline; the further a solution improves on the baseline, the smaller its error.

The method described in the linked article seems vulnerable to overtraining, i.e. if you are running a data center, you probably prefer more generic solutions over ones specific to the particular weights/jobs you are trying to pack/schedule into bins/cluster nodes.

But perhaps their evaluator takes that into consideration by applying the proposed solution to multiple bin-packing input sets, to get a sense of how broadly the solution could be applied to generic problems.
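
Roughly what I'm picturing, with all the names and numbers below invented:

```python
import random

def first_fit(items, capacity):
    """Plain greedy baseline: put each item into the first bin it fits in."""
    bins = []
    for item in items:
        for b in bins:
            if sum(b) + item <= capacity:
                b.append(item)
                break
        else:
            bins.append([item])
    return len(bins)

def evaluate(candidate_packer, capacity=100, trials=20):
    """Score a proposed packer by how many bins it saves versus the baseline,
    averaged over several random instances so it can't just memorize one."""
    total = 0.0
    for _ in range(trials):
        items = [random.randint(1, capacity) for _ in range(50)]
        total += first_fit(items, capacity) - candidate_packer(items, capacity)
    return total / trials   # evaluate(first_fit) is 0; a useful candidate scores above 0
```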
posted by They sucked his brains out! at 3:01 PM on December 15, 2023 [2 favorites]


LLMs making stuff up has been a central obstacle to their utility (and to me their interest). I take it the evaluator is not itself an LLM/deep learning system, so I wonder across how many domains it can filter out LLM hallucinations and confabulations.

I don't know what they're using as the evaluator. I think atoxyl is right that they're probably just running the generated program and checking it with not-ML-stuff, probably in a sandbox that kills programs that go too long.

However, I just want to quickly note that LLMs don't have to be generative. The LLMs that got the hype are based on decoders, which are noted for their expressive language abilities. But just as, if not more, important are encoders, which have excellent receptive language abilities. If you need a reliable decision made about a piece of text ("is this spam?", "is this an important sentence in the context of the rest of the article?", etc.) and you want to consume it in an automated way, you would be better-served with an encoder-heavy LLM. (As an aside, translation software often uses several layers of encoders to extract meaning from the original text and then a decoder to express that in the target language).

Anyway this is all to say that there are certainly applications in which an encoder-based LLM could be used as an evaluator and, unlike their decoder-based brethren, could be trusted not to go into free jazz odyssey mode.
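
For instance, something like this; the model below is just a stock example of an encoder-style classifier, not anything FunSearch itself uses:

```python
from transformers import pipeline

# Encoder-based model fine-tuned for sentiment, used purely as a yes/no judge.
# No free-form generation happens, so there's nothing to ramble or hallucinate.
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

result = classifier("This thread is a delight to read.")[0]
print(result["label"], result["score"])   # e.g. POSITIVE 0.99...
```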
posted by a faded photo of their beloved at 3:23 PM on December 15, 2023 [9 favorites]


But perhaps their evaluator takes that into consideration by applying the proposed solution to multiple bin-packing input sets, to get a sense of how broadly the solution could be applied to generic problems.

I'm skimming the Nature paper, and it does appear that the evaluator scores the program on a set of several different inputs, and the final score is the average or some other aggregate score. The evaluator is also tasked with dropping candidate programs that give invalid solutions or fail to execute entirely.

Their technique of iteratively generating prompts from previously generated programs also seems designed to prevent overfitting. Rather than iterating on the previous best known program, they sample one program each from several collections that are siloed from one another and only periodically mixed. I would guess this keeps the LLM prompts from localizing too much in the "space" of possible programs.
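
Very schematically, something like the below; this is my own sketch of the bookkeeping, not theirs:

```python
import random

# Each island holds (score, program_source) pairs, seeded with a trivial program.
islands = [[(0.0, "def priority(item, bins): return 0")] for _ in range(4)]

def sample_prompt_programs():
    """Draw prompt examples from a single island, so islands explore independently."""
    island = random.choice(islands)
    return sorted(island, reverse=True)[:2]   # a couple of its better-scoring programs

def periodic_reset():
    """Every so often, reseed the weaker half of the islands from the strongest one."""
    islands.sort(key=lambda island: max(island)[0])
    best = islands[-1]
    for i in range(len(islands) // 2):
        islands[i] = list(best)
```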
posted by egregious theorem at 3:32 PM on December 15, 2023 [3 favorites]


Posting again to avoid abusing the edit: another notable feature is that the code being generated is constrained by a "skeleton" program that provides a basic structure, and the LLM only modifies specified parts of the program logic. So the program skeleton can contain prior knowledge about the problem and can be used to focus the work of the LLM on particularly tricky parts of the problem, such as writing a good utility function for a greedy algorithm.
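
For bin packing, my read is that the skeleton has roughly this shape, with the greedy loop fixed and only the priority/utility function open to rewriting by the LLM; this is a paraphrase of the idea, not their actual code:

```python
def priority(item, remaining_capacities):
    """The only part the LLM rewrites: score each open bin for this item (higher is better)."""
    return [-(cap - item) for cap in remaining_capacities]   # best-fit-ish placeholder

def pack(items, capacity):
    """Fixed skeleton: greedy online packing that just consults priority()."""
    remaining = []                                    # leftover capacity per open bin
    for item in items:
        candidates = [i for i, cap in enumerate(remaining) if cap >= item]
        if not candidates:
            remaining.append(capacity - item)         # no bin fits, so open a new one
            continue
        scores = priority(item, [remaining[i] for i in candidates])
        best = candidates[max(range(len(scores)), key=lambda k: scores[k])]
        remaining[best] -= item
    return len(remaining)                             # number of bins used
```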
posted by egregious theorem at 3:40 PM on December 15, 2023 [1 favorite]


The method described in the linked article seems vulnerable to overtraining

I'm not sure I understand this. It seems trivial to generate random examples to train on, and there's a known baseline to compare to...? Can't one also penalize code complexity, similar to an Occam's factor? I could easily see underfitting being a greater risk (e.g. collapse to a greedy strategy).
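
By an Occam's factor I just mean something as crude as, say (the weighting below is made up):

```python
def penalized_score(packing_gain, program_source, lam=0.01):
    # Made-up Occam-style penalty: dock points for longer generated programs.
    return packing_gain - lam * len(program_source)
```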
posted by dsword at 5:22 PM on December 15, 2023


Smells like a genetic algorithm, which is a perfectly cromulent way to search a weird optimization space.

The article says that it does better on low-complexity programs, which makes sense. The model only has a fixed number of attention units.

We've come a long way from the Doug Lenat GOFAI stuff, haven't we?
posted by credulous at 6:49 PM on December 15, 2023


Almost-relevant SMBC
posted by lalochezia at 7:08 PM on December 15, 2023 [3 favorites]


However, I just want to quickly note that LLMs don't have to be generative. The LLMs that got the hype are based on decoders, which are noted for their expressive language abilities.

I don't think there's really any intrinsic behavioral difference between encoder-decoder transformers and the autoregressive decoder-only style most everyone has settled on; the latter 'won' IMHO because they're simpler to implement. You can train either style to produce freeform generative output, or to produce structured output (e.g., train them in a supervised fashion as simple classifiers).
posted by Pyry at 8:08 PM on December 15, 2023 [2 favorites]


I think this differs from the generic LLM use case of a conversationally-directed research assistant for verbal knowledge. The fact that it is possible to prune the outputs to exclude the bullshit and nonfunctional suggestions from the LLM is what makes the results possible here. That doesn't work for the generic use case, which would require a strong general AI to be the evaluator.
posted by Aardvark Cheeselog at 8:26 PM on December 16, 2023


This falls under Harry Potter Magic Wand LLM hype. It's not completely bogus, but the emphasis on the ML/AI aspect is more than a little disingenuous. It is somewhat similar to Simulated annealing, a statistical technique used to solve complex problems by mimicking how a physical system changes over time as it cools.
There are many related optimization algorithms, including genetic algorithms as used here, and it seems likely that ML/LLM could be used with many of them. So it's a riff on prior art, not a shiny-shiny ultimate human level intelligence AI breakthrough.
My cynical side can't help but wonder if this will transform into fund raising for a unicorn start up claiming to cure all disease, produce unlimited free energy, end world hunger, erase the national deficit, organize your email, clean your house and do the laundry. We are very much in the magic wand phase of the current AI/ML/LLM frenzy, and the one thing you can count on is a lot of money going to any project with enough of the hot search terms (or buzz words, if you are from an older generation.)
posted by Metacircular at 12:24 AM on December 17, 2023


Personally, I feel like "Harry Potter Magic Wand LLM hype" undersells this a bit, because the claims here really don't require LLMs to be something more than they are. It's true, this has a lot of the flavor of a genetic algorithm or similar stochastic optimization technique. The novel part is the use of the LLM to generate the proposal. It's not a complete revolution, but an improved proposal generation step, especially one that exploits the structure of the search space well (no general intelligence needed; exploiting the structure of language is exactly what LLMs do), can dramatically improve the performance of a stochastic algorithm.
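
Schematically, I mean a loop like the one below, where everything is standard stochastic search and the only LLM-flavored part is the proposal step (all names here are invented):

```python
def search(evaluate, llm_propose, seed_program, steps=1000):
    """Generic propose-evaluate-keep loop; swap llm_propose for a random mutator
    and you have an ordinary genetic/stochastic search."""
    best = seed_program
    best_score = evaluate(best)
    pool = [(best_score, best)]
    for _ in range(steps):
        candidate = llm_propose(pool)        # the LLM writes a variant of past programs
        try:
            score = evaluate(candidate)      # deterministic check, no trust in the LLM needed
        except Exception:
            continue                         # broken proposals are simply dropped
        pool.append((score, candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best
```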

Of course, time will tell.
posted by egregious theorem at 6:56 PM on December 17, 2023


> Metacircular: "This falls under Harry Potter Magic Wand LLM hype. [...] There are many related optimization algorithms, including genetic algorithms as used here, and it seems likely that ML/LLM could be used with many of them. So it's a riff on prior art, not a shiny-shiny ultimate human level intelligence AI breakthrough. "

After sitting with the paper for a few days now, I would like to push back against this just a bit. I'm not sure who is promoting this as "a shiny-shiny ultimate human level intelligence AI breakthrough" (I definitely don't see that in the Nature paper and can't really see that in the Google article), but I think it's much, much more than a mere "riff on prior art" (which, tbh, practically all research is a riff on prior art, shoulders of giants etc...).

IMHO, the key steps in the innovation here are: 1) instead of the usual GA thing of applying the evolutionary operations (crossover, mutation, etc...) on encodings of solutions, they're applying GA to programs that generate solutions, 2) instead of evolving the entire program itself, they keep a skeleton of the program fixed and evolve only one key part (the priority function) within that skeleton, and 3) using LLMs to do the evolutionary operation. Now, of course, there is a good amount of prior art for #1 -- in fact, they test their LLM-based method against the more traditional way of doing #1 in section A.2 -- but, to my knowledge, the results have typically not been very successful. For #2, I can't quite think of specific prior art that matches this technique but it seems somewhat familiar; the closest match I can recall without doing an actual lit review is how certain BRKGA setups handle the problem of ensuring that every given encoding corresponds to a valid candidate solution. Point #3 is clearly novel.
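
For #3, my understanding is that the "evolutionary operation" amounts to little more than prompt construction along these lines; the format below is my own invention, not theirs:

```python
def build_prompt(parent_programs):
    """Stand-in for the LLM 'crossover' step: show a few scored parents (best last)
    and ask the model to write the next version."""
    prompt = "Improve on the following priority functions.\n\n"
    for i, (score, source) in enumerate(sorted(parent_programs)):
        prompt += f"# version {i}, score {score}\n{source}\n\n"
    prompt += f"# version {len(parent_programs)}, improved\ndef priority(item, bins):\n"
    return prompt
```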

Meanwhile, the most interesting thing about this is mentioned in both the Nature paper and Google article but is not really given the emphasis I think it deserves. It seems to me that the key breakthrough that allowed them to achieve a new record for cap sets is the restriction of the search space to only "symmetric" solutions (well, not solutions per se, but admissible sets). The specific symmetry that they're exploiting does not appear to be obvious at all a priori and, for some reason, the description of the symmetry is buried deep in the paper in section D. This symmetry was only made apparent because the humans running the system stopped to take a look at the code and try to interpret what it was doing, e.g.: what kinds of solutions was it giving higher scores to. Personally, I find this part extremely interesting, but it only gets one paragraph in the Google article (the para beginning "What’s more, this interpretability of FunSearch’s programs can provide actionable insights to researchers."). In the paper, they do an ablation study where they check to see what would have happened if they varied some of the various properties of their approach (e.g.: using a different LLM, not using the skeleton setup, etc...) ... but I don't see where they compared what would have happened if they had not exploited this symmetry property.

Now, I do have a sneaking suspicion that this wasn't given more emphasis because, well, it runs counter to the mainstream of Big AI hype where the machines will do all our thinking for us. It seems to me that the key step here was when the humans recognized what the machine was trying to do and then restructured things to take advantage of the insight. Would the machine have eventually arrived at the same destination if they let it keep running? Possibly? But ultimately that's not what happened. The real question here is: was this insight just a one-time fluke? Would exploring other problems with this methodology yield other similar results? Their results for online bin-packing aren't nearly as impressive as their cap set result (and came without a similar kind of machine-assisted insight), which indicates to me that this is nowhere near a magic bullet, panacea, "Harry Potter Magic Wand", etc... But also, scoreboard don't lie: they did in fact attain new records in the cap set problem. The natural next step here is to try out this methodology on, well, every other combinatorial search and/or optimization problem out there. Fundamentally, what it seems like this approach is doing is providing the first (?) effective way of searching an algorithm space rather than a solution space; it remains to be seen whether their cap set result was a lucky one-off or if this can actually be a useful, generalizable method.
posted by mhum at 11:49 AM on December 21, 2023 [1 favorite]




This thread has been archived and is closed to new comments