Artificial Intelligence (Previously "Chat GPT")
-
@Kirwan yeah we've got an internal gpt that's typically better, but can't produce docx - I'm sure I can probably get a bot sorted if it's gonna be a regular thing, was just annoying when I'd prompt it to add something and it would change two other things and drop a third.
-
@Bones said in Artificial Intelligence (Previously "Chat GPT"):
@Kirwan yeah we've got an internal gpt that's typically better, but can't produce docx - I'm sure I can probably get a bot sorted if it's gonna be a regular thing, was just annoying when I'd prompt it to add something and it would change two other things and drop a third.
Yeah ok, I've been playing with Task MCP servers for that. There's one called TaskMaster and another called Shrimp-Tasks. These help it plan, save the steps into a file, and keep its head straight about what it's doing. Funny to watch it argue with itself with thinking enabled.
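The core trick those servers use is simple enough to sketch. This is a rough, hypothetical illustration of the idea (not the actual TaskMaster or Shrimp-Tasks API, which I haven't checked): persist the plan to a file so the model re-reads the authoritative step list each turn instead of trusting its own drifting context. All the names here (`save_plan`, `next_pending`, `mark_done`, `tasks.json`) are made up for the example.

```python
import json
from pathlib import Path

# Hypothetical sketch of what a task-planning MCP server does:
# keep the plan on disk so each turn works from the file,
# not from whatever the model half-remembers.

TASKS_FILE = Path("tasks.json")

def save_plan(steps):
    """Write the planned steps to disk, each starting as 'pending'."""
    plan = [{"id": i, "step": s, "status": "pending"} for i, s in enumerate(steps)]
    TASKS_FILE.write_text(json.dumps(plan, indent=2))
    return plan

def next_pending():
    """Re-read the file and return the first unfinished step, or None."""
    plan = json.loads(TASKS_FILE.read_text())
    return next((t for t in plan if t["status"] == "pending"), None)

def mark_done(task_id):
    """Mark one step complete without touching the others."""
    plan = json.loads(TASKS_FILE.read_text())
    for t in plan:
        if t["id"] == task_id:
            t["status"] = "done"
    TASKS_FILE.write_text(json.dumps(plan, indent=2))

save_plan(["add the new section", "leave everything else alone"])
mark_done(0)
print(next_pending()["step"])  # prints "leave everything else alone"
```

Because the remaining steps come back from the file each time, the "add one thing, change two others, drop a third" failure mode is harder to hit: the model can only act on the step the file hands it.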
-
Dropping in for some Fern wisdom.
Looking at levelling up some of the AI usage. Have access to Copilot, which seems 'fine' but probably a bit behind the curve. Considering springing for ChatGPT Plus. Are there better ones to look at?
Usage will be in a professional environment typically processing language into standard forms - parsing email chains and dictation into templates (basically doing my donkey work for me).
-
Wait a week (or so) and evaluate Grok 4.
They are all very similar at this point (the top models). Claude is the best for coding, some are better for visual stuff, but for pure text stuff you won't see a huge difference so you can be driven by price.
The Grok team iterate faster. OpenAI are the 'leaders' right now, but with the exodus of talent and the huge data centers their competitors are setting up, I'm not sure how long that will last.
I'd recommend picking a monthly plan over a yearly one so you can swap as you need to.
-
@grok explain this to me as though I'm an 8 year old
*Okay! Imagine you have a big toy puzzle that’s super tricky, and it’s called ARC-AGI-2. This puzzle is special because it’s designed to test how smart something is at figuring out new things, kind of like how you learn to solve a new game without anyone telling you all the rules first. Grown-ups use it to see how smart their robot helpers, like AI (which is like a super smart computer brain), can be.
Now, there’s a robot brain called Grok 4, made by a team called xAI. It’s like the smartest kid in class who just figured out how to solve 15.9% of that tricky puzzle! That’s a big deal because the other robot brains could only solve about half that much before. It’s like going from solving 8 pieces to solving 15 pieces of a 100-piece puzzle!
The people at ARC Prize, who made the puzzle, checked Grok 4’s work and said, “Wow, you’re really good!” They even showed it on a big chart (like a scoreboard) to prove it. This makes Grok 4 the new champion, and it’s exciting because it means robots are getting better at thinking like us.
But here’s a little grown-up stuff: some people are worried because making Grok 4 super smart used a lot of energy, and that’s not always good for the planet. It’s like using extra batteries that might hurt the Earth. So, while it’s cool that Grok 4 is so smart, grown-ups are also thinking about how to keep things safe and happy for everyone!*
-
The main reason this graph is important is that the questions are private and formulated so they aren't questions found on the internet. So the models can't be trained to the benchmark, and they haven't seen the questions before.
That's why all the models previously did so poorly. The guy who runs this describes anything under 10% as noise.
This is the first step towards a model working out answers from the much overused term "first principles". It's a huge achievement: the first step to these things creating new knowledge (and yes, I know Google's model has created a new algorithm, but this is a different approach from what they did).
The other test was "Humanity's Last Exam", where the models were previously tapping out at 25%. These are all PhD-and-above questions across many different domains; no single human could possibly answer them all, it would take a team of experts.
Grok Heavy got 50.7% correct.