Can open-weight models ever be safe?
Opinion: Bengüsu Özcan, Alex Petropoulos and Max Reddel argue that technical safeguards, societal preparedness, and new standards could make open-weight models safer
From Brussels to Beijing, governments are promoting more open AI ecosystems. Washington even sees advanced open-weight models as a matter of geostrategic value. Yet the significant risks that come with this ambition — and the policy actions required to mitigate them — remain largely unaddressed.
Today, the most powerful open-weight models lag closed systems in performance by about 15%. On current trajectories, they are likely to keep closing that gap in the foreseeable future, especially if policy and commercial pressures incentivize their development. As these models grow more capable, their openness raises the stakes: while not fully open-source, their weights — the parameters that define their behavior — can be fine-tuned for almost any application, from a customer service bot to an automated cyberattack.
Once an open-weight AI model is released, copies spread quickly, and jailbreaking or fine-tuning the models to behave in ways they were not designed to requires little expertise or cost. On Hugging Face, thousands of “uncensored” or “abliterated” versions of advanced open-weight models — variants of the originals with every safety guardrail deliberately removed — are already available.
This proliferation illustrates an underlying problem: the decision to release open-weight models rests solely on the judgment of their developers, with no shared standards to guide them. In July 2025, OpenAI delayed its open-weight release, gpt-oss, for additional safety testing: to determine whether the model could be released, the company deliberately fine-tuned and stress-tested it under controlled conditions to perform malicious tasks. OpenAI ultimately concluded that gpt-oss did not pose “substantially” greater danger than other existing models — open-weight or closed — and moved forward with the release; since then, hundreds of abliterated versions have emerged. This approach may seem sensible, but it rests on a moving target, tied to internal frameworks and made without external oversight. Within weeks, DeepSeek and Alibaba followed with open-weight models of similar capability but without presenting comparable evaluations. Perhaps they too judged their releases safe — but without a universal standard, we can only take their word for it.
Yet, open-weight models matter. They let startups, educators, and governments access tools that would otherwise stay locked away, and allow independent researchers to study and stress-test AI systems directly. The risk is that their dangers may only become clear after a major misuse, an incident that could quickly turn public opinion and endanger openness itself. The UK-led International AI Safety Report has already warned that future open-weight releases are likely to aid even moderately skilled actors in cyberattacks or biological experiments.
Open-weight models sit on a knife’s edge. Handled well, they expand access and drive innovation; handled poorly, one misuse could end openness altogether. To mitigate this trade-off, we see three pillars that need to reinforce one another: stronger safeguards, more resilient societies, and standards that establish clear thresholds for open release.
The first line of defense is to make models harder to misuse, even when their weights are freely available. Promising ideas include tamper-resistant architectures and “machine unlearning” techniques that selectively erase risky knowledge from inside models, but none are without problems. Researchers have criticized these approaches for either weakening useful capabilities or being easy to circumvent. Even so, it is far too early to dismiss them.
More targeted investment to improve these safeguards is essential. If developers like DeepSeek, Alibaba, Mistral, Meta — and now OpenAI — are serious about openness, they must match their commitment to releasing models with equal commitment to making them safe.
Even with the best safeguards, powerful general-purpose technologies will always carry some dual-use potential. That risk must be matched with societal resilience. Domain experts in areas where misuse risk is highest should lead this adaptation. For example, critical infrastructure providers, from energy grids to nuclear facilities, are adopting AI cautiously, but that caution must extend into stronger cybersecurity standards to resist AI-assisted attacks. In biosecurity, as AI lowers the barrier to expert-level know-how, critical chokepoints need heightened oversight, such as more rigorous screening of DNA synthesis requests.
Examples of how this could work already exist. The Organisation for the Prohibition of Chemical Weapons has a Scientific Advisory Board that routinely evaluates how new technologies affect chemical weapons treaties. NATO created a dedicated Cyber Defence Centre to protect Allied networks and critical infrastructure from emerging threats. If agencies in these fields partner with AI safety and security institutes, they can share intelligence in both directions, feeding domain-specific risks into AI evaluations, while gaining early insight into emerging AI capabilities.
Finally, the way open AI models are released needs to be rethought. This means a tiered approach: evaluating the risks of an open-weight model at a given capability level and deciding whether it can be released safely under current conditions, or what additional safeguards would be needed first. In practice, more advanced systems would face stricter conditions before release, and in the future, some model weights might only be accessible to verified users and for approved use cases in high-stakes domains. Analogies exist — such as restricted access to medical databases — but applying this principle to AI would require careful new design.
Establishing such a framework will take time. It will have to span the full spectrum of governance, from voluntary industry efforts to binding law, and will require ongoing adaptation as capabilities and misuse evolve. Progress today is scattered across that spectrum, from voluntary guidelines to early-stage legislation such as the EU AI Act. The task ahead is to converge these efforts into standards that are credible, transparent, and widely accepted, so that developers and policymakers alike know when openness is justified and when restraint is necessary.
Without substantial progress, each new open-weight model is another uncontrolled experiment on the public. Openness and safety are not mutually exclusive, but the window to design safe openness is closing fast.
Author bios
Bengüsu Özcan is an Advanced AI Researcher at the Centre for Future Generations, working on risk thresholds, dual-use aspects of advanced AI, and future AI scenarios.
Alex Petropoulos is an Advanced AI Researcher at the Centre for Future Generations. His research focuses on institutional design and CERN for AI.
Max Reddel is the Advanced AI Director at the Centre for Future Generations, where he leads the strategy on CERN for AI, institutional design, compute governance, and the geopolitics of AGI. He also serves on the board of the Talos Network and advises European institutions on responsible innovation and technology.