AI Alignment

Aligning AI systems with human interests.

Latest Posts

My views on “doom”

I’m often asked: “what’s the probability of a really bad outcome from AI?” In this post I answer 10 versions of that question.

Can we efficiently distinguish different mechanisms?

Can a model produce coherent predictions based on two very different mechanisms without there being any efficient way to distinguish them?

Can we efficiently explain model behaviors?

It may be impossible to automatically find explanations. That would complicate ARC’s alignment plan, but our work can still be useful.

AI alignment is distinct from its near-term applications

Not everyone will agree about how AI systems should behave, but no one wants AI to kill everyone.

Finding gliders in the game of life

Walking through a simple concrete example of ARC’s approach to ELK based on mechanistic anomaly detection.

Mechanistic anomaly detection and ELK

An approach to ELK based on finding the “normal reason” for model behaviors on the training distribution and flagging anomalous examples.

Eliciting latent knowledge

How can we train an AI to honestly tell us when our eyes deceive us?

Answering questions honestly given world-model mismatches

I expect AIs and humans to think about the world differently. Does that make it more complicated for an AI to “honestly answer questions”?

A naive alignment strategy and optimism about generalization

I describe a very naive strategy for training a model to “tell us what it knows.”

Teaching ML to answer questions honestly instead of predicting human answers

I discuss a three-step plan for learning to answer questions honestly instead of predicting what a human would say.
