Hacker News | prettyblocks's comments

I think good results come from spending a substantial amount of time reading the claude-code documentation and setting up an environment that lets you work in this new paradigm. Asking it a few things, getting poor results, and deciding "it doesn't work" when you ask it to correct itself won't get you that far.

I asked claude:

take this game for ti-89: https://gist.githubusercontent.com/mattmanning/1002653/raw/b... 10eaae3bd5298b8b2c86e16fb4404/drugwars.txt

and make a single page web app in one single html/css/js file where it draws the ti-89 on the screen and you play the game in that calculator

---

It came up with this in one single shot:

https://drugwars-8od.pages.dev/


Thanks! Paid off the debt in 6 days and got bored.

I seriously remember this game being hard... I finished with $500k cash, $11m in the bank, and $0 debt. 100/100 score. High school me is very proud.

I imagine you can do this with a hook that fires every time claude stops responding:

https://code.claude.com/docs/en/hooks-guide
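Something like this Stop-hook entry in `.claude/settings.json` could do it (the command is illustrative, and you should check the hooks guide for the exact schema):

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "./notify-done.sh"
          }
        ]
      }
    ]
  }
}
```

The Stop event fires when Claude finishes responding, so whatever `notify-done.sh` does (play a sound, send a push notification) runs at that point.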


I think the biggest case for fine tuning is probably that you can take small models, fine tune them for applications that require structured output, and then run cheap inference at scale. "Frontier LLMs can do it with enough context" is not really a strong argument against fine-tuning, because they're expensive to run.


Especially for super constrained applications. I don't care if the language model I use for my extremely specific business domain can solve PhD-level math or recite the works of Shakespeare. I'd trade all of that for pure task-specific accuracy.


Can you share more details about your use case? The good applications of fine tuning are usually pretty niche, which tends to make people feel like others might not be interested in hearing the details.

As a result it's really hard to read about real-world use cases online. I think a lot of people would love to hear more details - at least I know I would!


If you treat LLMs as generic transformers, you can fine tune with a ton of examples of input output pairs. For messy input data with lots of examples already built, this is ideal.

At my day job we have experimented with fine-tuned transformers for our receipt-processing workflow. We take images of receipts, run them through OCR (this step might not even be necessary, but we already do it at scale anyway), and then take the OCR output text blobs and "transform" them into structured receipts with details like retailer, zip code, transaction timestamps, line items, sales taxes, totals, etc.

I trained a small LLM (mistral-7b) via SFT with 1000 (maybe 10,000? I don't remember) examples from receipts in our database from 2019. When I tested the model on receipts from 2020 it hit something like 98% accuracy.

The key that made this work so well is that we had a ton of data (potentially billions of example input/output pairs) and we could easily evaluate the correctness by unpacking the json output and comparing with our source tables.

Note that this isn't running in production; it was an experiment. There are edge cases I didn't consider, and there's a lot more to it: accurate evals, when to re-train, dealing with net-new receipt types, retailers, and new languages (we're doing global expansion right now so it's top of mind), general diversity of edge cases in your training data, etc.
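For concreteness, here's a minimal sketch of what those SFT pairs and the eval could look like. The field names, structure, and thresholds are all made up for illustration; the real schema is obviously richer:

```python
import json

# Hypothetical training pair: OCR text blob in, structured receipt JSON out.
def make_sft_example(ocr_text, receipt_row):
    target = {
        "retailer": receipt_row["retailer"],
        "zip_code": receipt_row["zip_code"],
        "timestamp": receipt_row["timestamp"],
        "line_items": receipt_row["line_items"],
    }
    return {"prompt": ocr_text, "completion": json.dumps(target)}

# Eval: unpack the model's JSON and compare field-by-field to the source table.
def field_accuracy(model_output, receipt_row):
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output counts as fully wrong
    fields = ["retailer", "zip_code", "timestamp", "line_items"]
    correct = sum(parsed.get(f) == receipt_row[f] for f in fields)
    return correct / len(fields)
```

The nice property, as noted above, is that the eval is fully mechanical: you never need a human to grade outputs, just a join against the source tables.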


Exactly, inference cost is a very good reason to fine tune with something like Qwen


Wouldn’t it be better to use a grammar in the token sampler? Tuning is fine, but it doesn’t guarantee syntactically correct structured output. A grammar-aware sampler could.
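A toy illustration of the idea (not any real sampler's API): at each decoding step, restrict the model's candidates to tokens the grammar's state machine will accept, so the output can't be syntactically invalid no matter what the model prefers.

```python
# Toy grammar-constrained sampling: a "grammar" that only accepts the
# token sequence for the string '{"ok":true}', one token per step.
GRAMMAR = ['{', '"ok"', ':', 'true', '}']

def allowed_tokens(state):
    # state = number of tokens emitted so far
    return {GRAMMAR[state]} if state < len(GRAMMAR) else set()

def constrained_pick(logits, state):
    # logits: dict of token -> score from the model at this step
    legal = allowed_tokens(state)
    candidates = {t: s for t, s in logits.items() if t in legal}
    if not candidates:
        raise ValueError("no grammatically legal token available")
    return max(candidates, key=candidates.get)

def generate(model_logits_per_step):
    out, state = [], 0
    for logits in model_logits_per_step:
        out.append(constrained_pick(logits, state))
        state += 1
    return "".join(out)
```

A real implementation (e.g. GBNF-style grammars) tracks a much richer automaton state, but the masking step is the same shape: illegal tokens simply never get sampled.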


I think both should be done, they don't really serve the same purpose.


I agree.

Also for certain use cases there are constraints like embedded hardware systems with no internet access. These LLMs have to be trained to specialize for clearly defined use cases under hardware constraints.

Frontier LLMs also rarely function in isolation; instead they orchestrate a system of specialized units, aka subsystems and agents.

While costs and effort are one thing, being able to downsize these monster LLMs through fine-tuning in the first place is extremely valuable.


> "Frontier LLMs can do it with enough context" is not really a strong argument against fine-tuning, because they're expensive to run.

I am no expert in this topic, but I am wondering whether a large cached context is actually cheap to serve, and whether frontier models would be cost-efficient too in such a setting?


I'd like to read more about that if anyone has any suggestions.


I am no expert in this topic either, but it's easy to observe that the price for cached tokens is usually 10x cheaper from the major providers.


I agree- I'm currently trying to learn how I can embed a fine tuned tiny model into my c++ game so it can provide a narrative in prose of certain game-event logs. It needs to be as tiny as possible so it doesn't take resources away from the running game.


> I agree- I'm currently trying to learn how I can embed a fine tuned tiny model into my c++ game so it can provide a narrative in prose of certain game-event logs.

Unless your game states have combinatorial explosion, would it not be better to generate all of that at build time? If templated, you can generate a few hundred thousand templates to cover any circumstance, then instantiate and stitch those templates together at game runtime.


There are a bunch of tutorials on how to use GRPO to fine tune a small Qwen. Depending what you're doing LoRA or even just prefix tuning can give pretty good results with no special hardware.
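For intuition, the LoRA idea itself is tiny: freeze the pretrained weight matrix W and learn a low-rank update BA on top of it. A numpy sketch of the mechanics (not the peft library API; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                          # hidden size 8, LoRA rank 2
W = rng.normal(size=(d, d))          # frozen pretrained weights
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init so BA starts at 0

def lora_forward(x):
    # Effective weight is W + B @ A, but only A and B receive gradients.
    return x @ W.T + x @ (B @ A).T
```

Because B is zero-initialized, the adapted model starts out exactly equal to the base model, and training only touches the 2·8·2 = 32 adapter parameters instead of the 64 in W; at realistic sizes that gap is what makes fine-tuning cheap.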


How small a model are we talking? Don't even the smallest models which would work need gigabytes of memory?


> How small a model are we talking? Don't even the smallest models which would work need gigabytes of memory?

I dunno, for game prose I expect that a tiny highly quantized model would be sufficient (generating no more than a paragraph), so 300MB - 500MB maybe? Running on CPU not GPU is feasible too, I think.
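Back-of-the-envelope, that estimate checks out: weight memory is roughly parameter count times bits per weight. For, say, a 0.5B-parameter model:

```python
def model_size_mb(params, bits_per_weight):
    # Weights only; KV cache and activations add on top of this.
    return params * bits_per_weight / 8 / 1e6

print(model_size_mb(0.5e9, 4))   # 4-bit quantized 0.5B model: 250.0 MB
print(model_size_mb(0.5e9, 8))   # 8-bit: 500.0 MB
```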


This is literally what I'm waiting for. I want a ~8B model that works well with OpenClaw.


I don't think you will get that anytime soon, because for a model to work well with something like OpenClaw it needs a massive context window.


but but but but unified memory! (jk, I don't actually believe in Apple marketing words)

There might be future optimizations. Like having your small model do CoT to figure out where to look for the relevant memory.


Qwen 9B doesn't?


Nothing is really usable outside Opus.

I've tried too. Wasted a few days trying out even high end paid models.


Could be, but it likely won't be able to support the massive context window required for performance on par with Sonnet 4.6.


I think I recall reading that Hendrix tried to emulate the sounds of cartoons with his guitar, and later, when he was in the army, did the same trying to reproduce the sounds of fighter jets. Not sure if urban legend, but cool origin story.


[flagged]


Ethically pure 60s musicians are pretty hard to come by


Ethical humans are pretty hard to come by if you put them under a microscope.


"Not beating women" doesn't require a microscope.


I agree but when you’re dealing with celebrities people sometimes lie and exaggerate, and third parties sometimes extrapolate beyond any semblance of grounded facts. So most people subject to that level of scrutiny and fame are likely to have some allegations against them whether true or not.

Hendrix’s girlfriend Kathy Etchingham claims he never abused her. Some third parties dispute her claims about her relationship.

His arrest record suggests at least some type of altercation with a previous girlfriend but it’s far from clear cut to me.

People are complex and reality is complex. I myself was subject to false accusations about abuse from a disgruntled ex girlfriend (who actually WAS in fact physically and mentally abusive to me and I have the scars to prove it).

But regardless, I have zero issues reflecting on a person’s accomplishments and talents even in the context of them being a horrible person. In fact, I find that part of the intrigue of really talented people. Reality and people are quite multi-dimensional. The only general rule I know is that nobody is perfect and holding up ANYone as some example of moral perfection is almost certainly wrong.


OT1H, yeah: angry misogynist bad.

OTOH: why does his accidental fatal pathway tarnish him morally to you?

Very mixed message: "Don't beat women nor vomit!"


You can just put "Never use subagents" in your CLAUDE.md and it will honor it, no?


IME CLAUDE.md rarely gets fully honored. I've left HN comments before about how I had to convert some CLAUDE.md instructions to deterministic pre-commit checks due to how often they were ignored. My guesstimate is that it is about 70% reliable. That's with Opus 4.5. I've since switched to GPT-5.2 and now GPT-5.3 Codex and use Codex CLI, Pi and OpenCode, not CC, so maybe things have changed with a new system prompt or with the introduction of Opus 4.6.
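A hypothetical sketch of what "convert a CLAUDE.md rule to a deterministic check" can look like: a pre-commit script that hard-fails on a banned pattern instead of hoping the model remembers the instruction. The rule and pattern here are invented examples:

```python
import re
import sys

# Invented rule: CLAUDE.md said "never commit console.log calls";
# here it becomes a deterministic pre-commit failure instead.
BANNED = re.compile(r"console\.log\(")

def check_file(name, text):
    violations = [i + 1 for i, line in enumerate(text.splitlines())
                  if BANNED.search(line)]
    return [f"{name}:{n}: banned console.log" for n in violations]

def main(staged):
    # staged: {filename: contents}; in a real hook you'd read the git index.
    errors = []
    for name, text in staged.items():
        errors.extend(check_file(name, text))
    for e in errors:
        print(e, file=sys.stderr)
    return 1 if errors else 0   # non-zero exit blocks the commit
```

Unlike a prose instruction, this is right 100% of the time for the cases it covers, which is the whole appeal.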


I would argue that the ability to crawl and scrape is core to the original ethos of the internet and all the hoops people jump through to block non-abusive scraping of content is in fact more anti-social than circumventing these mechanisms.


I find that even though this isn't standard, these CLI tools will scan the repo for .md files and for the most part execute the skills accordingly. Having said that, I would much prefer standards, not just for this but for plugins as well.


Standards for plugins makes sense, because you're establishing a protocol that both sides need to follow to be able to work together.

But I don't see why you need a strict standard for "an informal description of how to do a particular task". I say "informal" because it's necessarily written in prose -- if it were formal, it'd be a shell script.


Yes, it's good, but it doesn't have access to any sensor apis.

