Category: Safety & Ethics
Definition
RLHF (Reinforcement Learning from Human Feedback) trains AI systems to prefer responses that humans judge as good over those judged as bad.
How It Works
First, a base model is trained on large amounts of text data. Then, human labelers rate or rank its outputs, for example as helpful, harmful, or neutral. Those judgments are used to train a reward model that scores responses, and the AI is fine-tuned with reinforcement learning to produce responses the reward model scores highly.
This process makes AI more helpful and less likely to generate harmful content than models trained only on raw text data.
Why It Matters
RLHF aligns AI behavior with human preferences. ChatGPT, Claude, and other consumer AI products use this technique to be more helpful and less harmful.
Without RLHF, models trained only to predict the next word often produce fluent but unhelpful, evasive, or inappropriate responses.