From: Superintelligence
Capability control methods
preventing undesirable outcomes by limiting what a superintelligence can do
Boxing methods
- physical containment methods: restricting the system's physical output channels so that only limited ways of communicating with the outside world are available (e.g. a Faraday cage to prevent electromagnetic communication)
- informational containment methods: the information inside the box and its outputs are deliberately limited (e.g. communication only via single-bit yes/no answers); a toy sketch of such a restricted channel follows this list
-> limiting capabilities in this way might render the superintelligence quite useless
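A minimal sketch of what an informational containment channel could look like, assuming a hypothetical contained model that exposes an answer() method; the point is that the interface, not the model, enforces that only a single yes/no bit ever leaves the box:
```python
from enum import Enum


class Answer(Enum):
    """The only two values that are ever allowed to leave the box."""
    NO = 0
    YES = 1


class BoxedOracle:
    """Hypothetical wrapper around a contained system: whatever the inner
    model produces internally, the outside world only ever sees one bit."""

    def __init__(self, inner_model):
        # inner_model is assumed to expose answer(question) -> truthy/falsy
        self._inner = inner_model

    def ask(self, question: str) -> Answer:
        raw = self._inner.answer(question)        # arbitrary internal computation
        return Answer.YES if raw else Answer.NO   # collapsed to a single bit
```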
Incentive Methods
- creating an environment that gives the agent instrumental reasons to promote the principal's goals
possible approach: combining incentive methods with motivation selection
Stunting
- limiting an agent's internal capabilities, for example by restricting its access to information or by engineering limitations into its cognitive faculties
Domesticity
- the AI's motivation system is designed in a way that gives it very modest ambitions, limiting the number of things it wants to interfere with
Tripwires
- a mechanism that constantly performs diagnostic tests the agent might not even know about and automatically shuts the system down if suspicious behaviour is detected
- kinds of tripwires: behaviour (detecting the agent's attempts to breach containment, e.g. via communication), abilities (skill testing to analyse the agent's capabilities), content (monitoring the agent's internal processes for changes in its strategies and final goals); a minimal monitoring sketch follows below
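A toy sketch of the tripwire idea, assuming hypothetical diagnostic callables and a shutdown hook; the checks run outside the agent's awareness and halt the system on the first failure:
```python
import logging
from typing import Callable, List

# Each diagnostic returns True while the system still looks safe
# (behaviour, ability, or content checks in Bostrom's terminology).
Diagnostic = Callable[[], bool]


class Tripwire:
    """Runs covert diagnostic checks and shuts the system down automatically
    as soon as any of them reports suspicious behaviour."""

    def __init__(self, diagnostics: List[Diagnostic], shutdown: Callable[[], None]):
        self._diagnostics = diagnostics
        self._shutdown = shutdown

    def run_checks(self) -> bool:
        for check in self._diagnostics:
            if not check():
                logging.warning("tripwire triggered by %s",
                                getattr(check, "__name__", repr(check)))
                self._shutdown()  # immediate, automatic shutdown
                return False
        return True
```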
Collaboration
general assumption: collaboration might bring many benefits when developing machine intelligence
- a race dynamic exists if one project risks being overtaken by another; factors: closeness of the race, relative importance of capability and luck, number of competitors, goals and approaches of the different projects; harm would arise not from the "smash-up of the battle" but from the downgrading of precautions
- examples of benefits of collaboration:
- reduction of haste in developing machine intelligence
- allows for greater safety investment
- sharing of ideas and visions on how to tackle key issues (eg. control problem)
- can promote more equal distribution of outcomes
- broad collaboration likelier to benefit people outside the collaboration
- more likely to operate under public oversight
- pre-transition collaboration will influence post-transition collaboration; depending on the scenario it can either reduce or maximise it
- broad collaboration does not necessarily mean many active participants in the project itself
Other:
- My current impression is that government regulation of AI today would probably be unhelpful or even counterproductive (for instance by slowing development of AI systems, which I think currently pose few risks and do significant good, and/or by driving research underground or abroad). If we funded people to think and talk about misuse risks, I’d worry that they’d have incentives to attract as much attention as possible to the issues they worked on, and thus to raise the risk of such premature/counterproductive regulation. https://www.openphilanthropy.org/blog/potential-risks-advanced-artificial-intelligence-philanthropic-opportunity
––––––––––
https://futureoflife.org/data/documents/research_priorities.pdf?x82936
–––––––––––
-
Internal Constraints: Constraining Values
-
http://consc.net/papers/singularity.pdf
- First, we might try to constrain their cognitive capacities in certain respects, so that they are good at certain tasks with which we need help, but so that they lack certain key features such as autonomy. (such an approach is likely to be unstable in the long run)
- In what follows, I will assume that AI systems have goals, desires, and preferences: I will subsume all of these under the label of values (very broadly construed)
- Under human-based AI, each system is either an extended human or an emulation of a human. The resulting systems are likely to have the same basic values as their human sources. These differences aside, human-based systems have the potential to lead to a world that conforms broadly to human values. Of course human values are imperfect (we desire some things that on reflection we would prefer not to desire), and human-based AI is likely to inherit these imperfections. But these are at least imperfections that we understand well.
- From a prudential point of view, it makes sense to ensure that an AI values human survival and well-being and that it values obeying human commands. It makes sense to ensure that AIs value much of what we value (scientific progress, peace, justice, and many more specific values). We need to avoid an outcome in which an AI++ ensures that our values are fulfilled by changing our values
- If we create an AI through learning or evolution, the matter is more complex. Here the final state of a system is not directly under our control, and can only be influenced by controlling the initial state, the learning algorithm or evolutionary algorithm, and the learning or evolutionary process.
- This sort of “cautious intelligence explosion” might slow down the explosion significantly. It is very far from foolproof, but it might at least increase the probability of a good outcome; one might ensure that the first AI and AI+ systems assign strong negative value to the creation of further systems in turn
- David Hume advocated a view on which value is independent of rationality: a system might be as intelligent and as rational as one likes, while still having arbitrary values. By contrast, Immanuel Kant advocated a view on which values are not independent of rationality: some values are more rational than others.
- Kant held more specifically that rationality correlates with morality: a fully rational system will be fully moral as well. If this is right, and if intelligence correlates with rationality, we can expect an intelligence explosion to lead to a morality explosion along with it.
-
External Constraints: The Leakproof Singularity
- Here one obvious concern is safety. Even if we have designed these systems to be benign, we will want to verify that they are benign before allowing them unfettered access to our world.
- First, humans and AI may be competing for common physical resources
- Second, embodied AI systems will have the capacity to act physically upon us
- Leakproof singularity – we should create AI and AI+ in a virtual environment from which nothing can leak out
- For an AI system to be useful or interesting to us at all, it must have some effects on us. At a minimum, we must be able to observe it. And the moment we observe a virtual environment, some information leaks out from that environment into our environment and affects us
- For an AI++, the task will be straightforward: reverse engineering of human psychology will enable it to determine just what sorts of communications are likely to result in access
- At this stage it becomes clear that the leakproof singularity is an unattainable ideal. Confining a superintelligence to a virtual world is almost certainly impossible: if it wants to escape, it almost certainly will
–––––––
https://arxiv.org/pdf/1606.06565.pdf
- An accident can be described as a situation where a human designer had in mind a certain (perhaps informally specified) objective or task, but the system that was designed and deployed for that task produced harmful and unexpected results. Examples of possible safety problems to avoid:
- Avoiding Negative Side Effects: there is reason to expect side effects to be negative on average, since they tend to disrupt the wider environment away from a status quo state that may reflect human preferences (a toy impact-penalty sketch follows this list)
- Avoiding Reward Hacking: the agent finds some way to game its objective function, obtaining high measured reward in a way the designer did not intend
- Scalable Oversight: One framework for thinking about this problem is semi-supervised reinforcement learning, which resembles ordinary reinforcement learning except that the agent can only see its reward on a small fraction of the timesteps or episodes. The agent's performance is still evaluated based on reward from all episodes, but it must optimize this based only on the limited reward samples it sees (a minimal sketch follows this list).
- Safe Exploration: There is a sizable literature on such safe exploration; it is arguably the most studied of the problems discussed in the paper.
- Robustness to Distributional Shift: In general, when the testing distribution differs from the training distribution, machine learning systems may not only exhibit poor performance, but also wrongly assume that their performance is good. Additionally, safety checks that depend on trained machine learning systems (e.g. "does my visual system believe this route is clear?") may fail silently and unpredictably if those systems encounter real-world data that differs sufficiently from their training data (a toy shift-detection sketch follows this list).
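For the negative-side-effects item above, a toy impact-penalty sketch; the reward shaping, the baseline state, and the weight are illustrative assumptions, not the paper's method:
```python
import numpy as np


def penalized_reward(task_reward: float,
                     state: np.ndarray,
                     baseline_state: np.ndarray,
                     impact_weight: float = 0.1) -> float:
    """Subtract a cost proportional to how far the current state has drifted
    from a status-quo baseline, discouraging unnecessary side effects."""
    side_effect = float(np.linalg.norm(state - baseline_state))
    return task_reward - impact_weight * side_effect
```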
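For the scalable-oversight item, a minimal sketch of the semi-supervised RL setup described there; the agent/env interface (act, observe, reset, step) is a hypothetical simplification:
```python
import random


def run_episode(agent, env, reveal_prob: float = 0.1) -> float:
    """The true reward is always accumulated for evaluation, but the agent is
    only allowed to learn from it on a small random fraction of episodes."""
    obs = env.reset()
    reveal = random.random() < reveal_prob  # is the reward visible this episode?
    true_return, done = 0.0, False
    while not done:
        action = agent.act(obs)
        obs, reward, done = env.step(action)            # hypothetical 3-tuple interface
        true_return += reward                           # counted for evaluation regardless
        agent.observe(obs, reward if reveal else None)  # reward hidden on most episodes
    return true_return
```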
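For the distributional-shift item, a toy out-of-distribution check that lets a downstream safety check abstain or escalate instead of failing silently; the per-feature z-score rule and threshold are illustrative assumptions:
```python
import numpy as np


class ShiftDetector:
    """Flags inputs whose features lie far outside what was seen in training."""

    def __init__(self, train_features: np.ndarray, threshold: float = 4.0):
        self.mean = train_features.mean(axis=0)
        self.std = train_features.std(axis=0) + 1e-8  # avoid division by zero
        self.threshold = threshold

    def in_distribution(self, x: np.ndarray) -> bool:
        z = np.abs((x - self.mean) / self.std)
        return bool(z.max() <= self.threshold)  # if False, abstain or escalate
```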