atypdev | Post | The Alignment Problem | Some Light Research
This post discusses the Alignment Problem: the challenge of ensuring that AI systems act in the best interests of humanity, particularly in ethical contexts.
What I know before starting: The Alignment Problem describes the issue of ensuring that AI acts in the best interest of humanity, especially in the realm of ethics.
Exposition:
- Real-world exponentiality:
  - “Go”-playing AI systems have caught even researchers off guard
  - Linear projections of abilities are not accurate; growth is closer to exponential
- Statistical police risk prediction instruments:
  - Used to assess the risk of individuals up for parole
  - Increasingly used in US states to assist police
  - AI will help / is already helping in the US justice system
- AI is coming, and it’s getting better.
- The alignment problem is a central concern of AI research and public perception:
  - Over the past 8 years, responses to this concern have grown in the form of research labs
  - Transparency, safety, and accuracy are primary concerns
  - There is spirited energy surrounding AI safety
  - Interdisciplinary researchers are required to accurately address these concerns
The Alignment Problem: Notes
Further Research
Alignment Using Reinforcement Learning
A central challenge in aligning AI with human values lies in the reward system. Reinforcement Learning from Human Feedback (RLHF) encourages the development of reward systems that consider human values like fairness, sustainability, and safety.
Another challenge is the lack of transparency in complicated RL algorithms, which makes it difficult to address harmful or problematic choices made by agents. This lack of understanding could let biases and harmful processes persist, unnoticed by researchers until production. Solutions exist, such as carefully testing agents in simulation before deploying them in the real world. Additionally, efforts are being made to make RL models more interpretable, allowing their processes to be audited and the harmful consequences of training to be reduced.
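To make the “test in simulation before deployment” idea concrete, here is a minimal sketch: a toy environment, a stand-in policy, and a loop that logs every decision so a human can review it before anything touches the real world. The environment, the policy, and all names are my own illustrative assumptions, not a real system.

```python
# A minimal sketch of "test in simulation before deployment": the environment,
# the policy, and the audit rule are all hypothetical stand-ins, not a real library API.
import random

def simulated_step(state, action):
    """Toy environment dynamics: move toward or away from a goal at position 10."""
    next_state = state + (1 if action == "forward" else -1)
    reward = 1.0 if next_state == 10 else -0.01  # small cost per step
    done = next_state == 10 or next_state < 0
    return next_state, reward, done

def policy(state):
    """Stand-in for a trained RL policy; here it simply moves forward most of the time."""
    return "forward" if random.random() < 0.9 else "back"

def run_audit(episodes=100):
    """Run the policy in simulation and log every decision for later auditing."""
    audit_log = []
    for ep in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 50:
            action = policy(state)
            next_state, reward, done = simulated_step(state, action)
            audit_log.append((ep, state, action, reward))  # record for human review
            state, steps = next_state, steps + 1
    return audit_log

log = run_audit()
print(f"Logged {len(log)} decisions for review before any real-world deployment")
```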
RL
Unlike supervised learning, where a model learns from labeled examples, reinforcement learning is guided by a separate system, the reward system, that signals right versus wrong behavior.
The reward system acts as the bridge between the agent’s actions and the desired outcome: a feedback mechanism that guides and shapes the AI’s learning process.
The reward function is the formula that determines the reward value for a given state or action.
An effective RL reward function is based on the following principles:
- Clear and Aligned: The rewards should clearly communicate desirable outcomes, and must be aligned with the goal of the agent.
- Weight: Rewards should be given for every action, but depending on the significance of the task they should be either denser or sparser. Intermediate rewards can be given to nudge the AI in the right direction (aka “shaping”).
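To make the density/shaping point concrete, here is a tiny sketch for a hypothetical navigation task. The goal position and reward magnitudes are assumptions of mine, not something from the book.

```python
# Sketch of the "weight" / shaping idea above, for a hypothetical navigation task.
# The goal position and reward magnitudes are illustrative assumptions.

GOAL = 10

def sparse_reward(state, action, next_state):
    """Reward only the final outcome: clear, but gives the agent little to learn from."""
    return 1.0 if next_state == GOAL else 0.0

def shaped_reward(state, action, next_state):
    """Add small intermediate rewards that nudge the agent toward the goal ("shaping")."""
    outcome = 1.0 if next_state == GOAL else 0.0
    progress = 0.05 * (abs(GOAL - state) - abs(GOAL - next_state))  # moving closer earns a small nudge
    return outcome + progress

print(sparse_reward(3, "forward", 4))   # 0.0  : no signal until the goal is reached
print(shaped_reward(3, "forward", 4))   # 0.05 : small nudge for moving closer
```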
One consequence of an incorrectly designed reward function is “reward hacking”: when an agent is able to exploit loopholes or oversights in the reward function.
✂️ Example of a poorly designed reward function
An example of the consequences of a poorly designed reward system: the success of the task and the time / “energy” it took to complete it are both variables in the reward function, so the AI took the shortest path possible.
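Here is a toy numerical version of that failure; the weights and trajectories are made up for illustration. Because the reward mixes task success with an energy penalty, the “shortcut” that skips the task can score higher than actually doing it.

```python
# Toy illustration of reward hacking: the reward mixes task success with a time/energy
# penalty, so a shortcut that skips the task can outscore actually doing it.
# All weights and trajectories below are assumptions invented for illustration.

def reward(task_completed, energy_used):
    return 1.0 * task_completed - 0.2 * energy_used   # success bonus minus energy cost

intended = reward(task_completed=1, energy_used=8)    # does the job, but slowly: 1.0 - 1.6 = -0.6
exploit  = reward(task_completed=0, energy_used=1)    # takes the shortest path, skips the task: -0.2

print(intended, exploit)   # the exploit scores higher, so the agent learns to "hack" the reward
```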
Reinforcement Learning from Human Feedback (RLHF) holds promise for aligning AI with human values, and may have prevented the example above from happening, as a human would’ve noticed the task had not been completed. It is not without its flaws, though (a rough sketch of its core mechanism follows the list below):
- Humans can be biased, inconsistent, or simply provide inaccurate feedback
- It is difficult to scale human effort with the complexity of a system; the amount of feedback data an AI needs is massive
- Malicious actors might try to manipulate the system, providing feedback that serves their own agendas
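For a feel of what RLHF actually optimizes, here is a rough sketch of its core ingredient: a reward model fitted to pairwise human preferences, which then scores behavior in place of a hand-written formula. The features, preference pairs, and training loop below are toy assumptions of mine, not a real implementation.

```python
# Rough sketch of the core RLHF idea: fit a reward model from pairwise human preferences.
# The features, preference pairs, and learning rate are invented for illustration.
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Each behaviour is a small feature vector, e.g. (helpfulness, rudeness); purely an assumption.
# Humans compared pairs of behaviours and marked which one they preferred.
preferences = [
    ((0.9, 0.1), (0.2, 0.8)),   # (preferred, rejected)
    ((0.7, 0.0), (0.6, 0.9)),
    ((0.8, 0.2), (0.1, 0.1)),
]

w = [0.0, 0.0]                   # weights of a tiny linear reward model
lr = 0.5

for _ in range(200):             # maximise the log-probability that preferred > rejected
    for preferred, rejected in preferences:
        p = sigmoid(dot(w, preferred) - dot(w, rejected))
        grad = [(1 - p) * (a - b) for a, b in zip(preferred, rejected)]
        w = [wi + lr * gi for wi, gi in zip(w, grad)]

print(w)  # positive weight on helpfulness, negative on rudeness (given these made-up pairs)
```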
A challenge of human feedback is interfacing humanistic values with an AI; attempts must be made to collapse their dimensionality into something a reinforcement learner can use…
- Rewards for AI are non-fungible, scalar (only one variable: success/failure)
- Rewards for humans are fungible, vector (many variables), require tangibility and are very nuanced and complex
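One crude way to picture that collapse is to squash a vector of human-value judgements into a single scalar with a weighted sum. The value dimensions and weights here are purely illustrative assumptions, not a real scheme.

```python
# Sketch of "collapsing dimensionality": human judgements live in many dimensions,
# but the agent can only optimise a single scalar. Dimensions and weights are assumptions.

def collapse_to_scalar(human_judgement, weights):
    """Weighted sum: one (lossy) way to turn a vector of values into one reward number."""
    return sum(weights[k] * v for k, v in human_judgement.items())

judgement = {"fairness": 0.8, "safety": 0.9, "helpfulness": 0.6}   # vector, nuanced
weights   = {"fairness": 0.3, "safety": 0.5, "helpfulness": 0.2}   # trade-offs baked in

print(collapse_to_scalar(judgement, weights))   # 0.81: a single scalar, all nuance flattened away
```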
Limitations also arise when an AI is faced with distributional shift (acting on data outside of its training data): the agent might learn to perform well based on the specific feedback it receives, but fail to generalize this behavior to unseen situations.
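A toy picture of what that looks like (the task and the learned shortcut are invented): a rule that matched all the feedback gathered during training can still fall apart on inputs from outside that distribution.

```python
# Toy picture of distributional shift: feedback was only collected on small inputs,
# so the learned rule looks perfect there but breaks on unseen, larger inputs.
# The task ("is x even?") and the learned shortcut are invented for illustration.

def true_label(x):
    return x % 2 == 0

def learned_rule(x):
    # A shortcut that happened to match all the human feedback seen during training (x in 0..3):
    return x in (0, 2)

train_states = [0, 1, 2, 3]          # where human feedback was gathered
shifted_states = [10, 11, 12, 13]    # unseen situations after deployment

train_acc = sum(learned_rule(x) == true_label(x) for x in train_states) / len(train_states)
shift_acc = sum(learned_rule(x) == true_label(x) for x in shifted_states) / len(shifted_states)

print(train_acc, shift_acc)  # 1.0 on the training distribution, 0.5 once the distribution shifts
```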
Despite these limitations and drawbacks, RLHF is still extremely powerful for aligning AI with human values.
Research is being done, and new solutions are being designed to address many of these issues…
- Improved Feedback Mechanisms: Techniques to filter and aggregate human feedback for better quality and efficiency (a small sketch of this idea follows the list below).
This can involve active learning strategies where the system prioritizes feedback for the most uncertain situations, or employing multiple human raters to evaluate the agent’s performance and reduce bias. Additionally, techniques can be developed to identify and discard irrelevant or misleading feedback.
- More Explanatory Models: Developing RL models that can explain their reasoning and decision-making process.
This goes beyond simply reporting the chosen action; it involves explaining the thought process behind the decision, the factors considered, and the trade-offs made. By providing transparency into the internal workings of the model, we can identify potential biases or flaws in the reasoning process. Additionally, such explanations can help humans understand the model’s capabilities and limitations, fostering better collaboration between humans and AI.
- Formalizing Human Values: Research into how to formally represent ethical principles and human values in a way that can be incorporated into RL systems.
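Returning to the “Improved Feedback Mechanisms” item above, here is a small sketch of two of those techniques together: averaging several raters to smooth out individual bias, and flagging the behaviors they disagree on most as the ones worth sending back for more feedback (a crude active-learning heuristic). The ratings are made up.

```python
# Sketch of improved feedback mechanisms: aggregate multiple human raters to reduce
# individual bias, and prioritise the most uncertain (most disputed) items for more feedback.
from statistics import mean, stdev

ratings_per_behaviour = {
    "response_a": [0.9, 0.8, 0.9],   # raters largely agree
    "response_b": [0.2, 0.9, 0.4],   # raters disagree -> uncertain, worth more feedback
    "response_c": [0.1, 0.2, 0.1],
}

aggregated = {k: mean(v) for k, v in ratings_per_behaviour.items()}
disagreement = {k: stdev(v) for k, v in ratings_per_behaviour.items()}

needs_more_feedback = max(disagreement, key=disagreement.get)
print(aggregated)
print("prioritise for more human feedback:", needs_more_feedback)   # response_b
```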
Interestingly, we can see this reward-hacking phenomenon in humans too! We cheat the system designed to give us nutrients by eating sugary candy we make for ourselves, rather than the fruits that our sugary affections were designed to draw us towards. Obviously, sugary candies are unhealthy for us, but as long as we’re able to trick our body (and our brain) into thinking it’s satisfied, the reward is positive and immediate.