Q-learning // relax, no AGI or just hype?)

sbagency
4 min read · Nov 23, 2023


UPD from Sam Altman's interview:

https://www.theverge.com/2023/11/29/23982046/sam-altman-interview-openai-ceo-rehired#content

AGI can't be based on a single algorithm or method; rather, it is a continually evolving, intricate concept that is constantly being rethought. Q* is just one of dozens of CS methods, each with its own pros and cons.

https://www.linkedin.com/posts/yann-lecun_please-ignore-the-deluge-of-complete-nonsense-activity-7133900073117061121-tTmG
https://www.linkedin.com/posts/drjimfan_part-ii-reverse-engineering-q-alphago-activity-7133887837585768448-FR6q
https://www.linkedin.com/pulse/alleged-breakthrough-q-future-ai-stefan-wendin-ksgef

In my decade spent on AI, I’ve never seen an algorithm that so many people fantasize about. Just from a name, no paper, no stats, no product. So let’s reverse engineer the Q* fantasy. LONG READ, Part I:

To understand the powerful marriage between Search and Learning, we need to go back to 2016 and revisit AlphaGo, a glorious moment in AI history.
It’s got 4 key ingredients:

1. Policy NN (Learning): responsible for selecting good moves. It estimates the probability of each move leading to a win.

2. Value NN (Learning): evaluates the board and predicts the winner from any given legal position in Go.

3. MCTS (Search): stands for “Monte Carlo Tree Search”. It simulates many possible sequences of moves from the current position using the policy NN, and then aggregates the results of these simulations to decide on the most promising move. This is the “slow thinking” component that contrasts with the fast token sampling of LLMs. (A simplified sketch follows this list.)

4. A groundtruth signal to drive the whole system. In Go, it’s as simple as the binary label “who wins”, which is decided by an established set of game rules. You can think of it as a source of energy that *sustains* the learning progress.
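To make ingredients 1-3 concrete, here is a heavily simplified, one-ply sketch of how a policy NN (move priors) and a value NN (position evaluation) can guide an MCTS-style search. The `game`, `policy_net`, and `value_net` interfaces are hypothetical placeholders for illustration, not AlphaGo's actual code.

```python
import math

# One-ply, root-only sketch of MCTS guided by a policy NN and a value NN.
# `game`, `policy_net`, and `value_net` are hypothetical interfaces.
def select_move(game, policy_net, value_net, n_simulations=200, c_puct=1.0):
    moves = game.legal_moves()
    priors = policy_net(game)                # {move: prior probability it is a good move}
    visits = {m: 0 for m in moves}           # how often each move has been explored
    total_value = {m: 0.0 for m in moves}    # sum of value-NN evaluations per move

    for _ in range(n_simulations):
        # PUCT-style selection: trade off prior, average value, and visit count
        def score(m):
            q = total_value[m] / visits[m] if visits[m] else 0.0
            u = c_puct * priors[m] * math.sqrt(1 + sum(visits.values())) / (1 + visits[m])
            return q + u
        move = max(moves, key=score)

        # "Slow thinking": look ahead one move and let the value NN judge the result
        leaf_value = value_net(game.play(move))
        visits[move] += 1
        total_value[move] += leaf_value

    # The most-visited move is typically the most robust choice
    return max(moves, key=lambda m: visits[m])
```

The real algorithm expands a full tree and backs values up through it; this sketch keeps only the root to show where each network plugs in.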

How do the components above work together?

AlphaGo does self-play, i.e. playing against its own older checkpoints. As self-play continues, *both Policy NN and Value NN are improved iteratively*: as the policy gets better at selecting moves, the value NN obtains better data to learn from, and in turn it provides better feedback to the policy. A stronger policy also helps MCTS explore better strategies.
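A rough sketch of that loop, reusing `select_move` from the snippet above (the method names `fit_moves`, `fit_outcomes`, `winner`, etc. are assumed placeholders, not AlphaGo's actual training API):

```python
# Minimal sketch of one self-play iteration: MCTS-guided games generate data,
# and both networks are refit on it. All interfaces are hypothetical.
def self_play_iteration(new_game, policy_net, value_net, n_games=100):
    dataset = []
    for _ in range(n_games):
        game, trajectory = new_game(), []
        while not game.is_over():
            move = select_move(game, policy_net, value_net)  # search guided by both NNs
            trajectory.append((game, move))
            game = game.play(move)
        dataset.append((trajectory, game.winner()))          # ingredient 4: who won

    policy_net.fit_moves(dataset)      # policy imitates the moves the search preferred
    value_net.fit_outcomes(dataset)    # value learns to predict the final winner
    return policy_net, value_net
```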

That completes an ingenious “perpetual motion machine”. In this way, AlphaGo was able to bootstrap its own capabilities and beat the human world champion, Lee Sedol, 4–1 in 2016. *An AI can never become super-human just by imitating human data alone.*

---
Q* will likely have 4 corresponding components as above. Let’s take a deep dive in Part II! [link to post]

Update)
https://arxiv.org/pdf/2107.04805.pdf

Explain Q-learning to a 5 y.o.

Alright, imagine you have a little robot friend, and you want to teach it how to do something fun, like finding a hidden treasure in your house. At first, the robot doesn’t know where the treasure is, so it tries different things like going left, right, forward, or backward. Every time it tries something, you give it a sticker if it does a good job or a frowny face if it doesn’t. The robot’s goal is to collect as many stickers as possible.

Now, the robot is very smart, and it remembers which actions got stickers and which got frowny faces. Over time, it figures out the best way to move around to get the most stickers. This is a bit like Q-learning. The robot is learning the best way to do something by trying different actions, getting feedback, and getting better each time. So, just like how you learn by trying things and getting feedback, the robot learns to find the treasure by getting stickers and avoiding frowny faces!

https://twitter.com/icreatelife/status/1727741299703328799

Q-learning and Q* (pronounced “Q-star”) are concepts from a field of AI called reinforcement learning. It's like teaching a computer or robot to learn from its experiences, much like how we learn from our successes and failures. Imagine you're teaching a robot to play a game. In reinforcement learning, the robot tries different actions and learns which ones lead to the best outcomes. Q-learning is a specific method for learning, where the robot uses a table to keep track of its actions and the rewards it gets for each action. Q* is the best possible version of this table: it represents the best actions the robot can take in any situation to get the most rewards. The goal of Q-learning is to make the robot's table as good as Q*.

A Reuters article published today (source and link in the comments) says that OpenAI researchers warned their board of directors about a powerful AI discovery, Q*, that could potentially threaten humanity. The article says this was one of several reasons the board ousted CEO Sam Altman.
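For the technically curious, that "table" is exactly what tabular Q-learning maintains. Here is a minimal sketch, assuming a hypothetical Gym-style environment with `reset()` and `step(action)` methods:

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning. `env` is a hypothetical environment whose
# step(action) returns (next_state, reward, done); states must be hashable.
def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> expected return

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Explore sometimes; otherwise pick the action with the best Q so far
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Q-learning update: nudge Q toward reward + discounted best future value
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state

    return Q
```

With enough exploration, the learned table approaches the optimal action-value function, conventionally written Q*, which is where the "Q-star" notation comes from.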

https://twitter.com/cwolferesearch/status/1727727148859797600


Written by sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.
