Artificial Intelligence, Threats, and Trustworthiness


For millions of years, tools have driven human prosperity. From fire and the wheel to movable type and the computer, the history of humanity is the history of its tools, and AI is the most powerful tool of the 21st century. ChatGPT reached 100 million users in just two months and is widely described as the start of a new technological revolution; one industry forecast even estimated that 90% of corporate services would be AI-based by 2025.

The essence of a tool, however, is that it “makes a human action easier,” and this property does not always lead to positive outcomes. This article splits the threats AI brings into two layers and then discusses the technical pillars and regulations that make trustworthy AI possible.

Fear comes from a lack of understanding

“A cave feels frightening because it is dark; once it is illuminated, it no longer seems as frightening.” Diffuse fear of AI, too, is usually amplified by not knowing exactly what the technology is. Understanding the threats is what makes preparation possible, and what makes the fear shrink. This article therefore organises the threats into two axes, intrinsic and extrinsic, and walks through what each one is and which technologies are designed to keep it in check.

The two-layer risk structure our lab proposes

Our lab classifies the threats of artificial intelligence into the following two layers. This is the classification proposed by Prof. Hoki Kim of our lab in his Lawtimes column “Dual Risks of Artificial Intelligence and Regulation” (2026.02.21, Korean), and it forms the spine of this article.

  • Intrinsic threats arise from the structural limits and defects inherent in the tool itself. Just as the unexpected failure of an external hard drive can erase precious data in an instant, or a malfunctioning aircraft engine can cause significant loss of life, these risks occur regardless of the user’s intent because they come from the limits of the tool itself.
  • Extrinsic threats arise from the humans who wield the tool. The very property that “makes an action easier” makes harmful actions just as easy as good ones.

The taxonomy of AI threats our lab proposes. Intrinsic threats arise from AI's own structural limits, while extrinsic threats arise from humans who misuse AI.

Intrinsic Threats: AI That Is Not Perfect


Intrinsic threats arise from the limits of AI itself. Modern AI is a probabilistic system that runs on billions of parameters; as ChatGPT itself puts it, it is “a highly capable tool within its operational domain, but not perfect”, owing to inherent uncertainties in its training data and underlying algorithms. Prof. Abeba Birhane, named to TIME's list of the 100 most influential people in AI, has put the same point more bluntly: current artificial intelligence is a “stochastic parrot” that merely repeats its training data.

This limitation comes from the basic principle of modern AI itself: “programming that learns rules from data.” When an AI encounters data it has not seen during training, there is a significant chance it will make mistakes, including common-sense mistakes that humans would normally avoid.
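
As a toy illustration of this limitation, the snippet below fits a simple model (a cubic polynomial, not a neural network, so purely illustrative) to samples of sin(x) drawn from a narrow interval and then queries it far outside that interval: inside the training range the prediction is close to the truth, outside it the model extrapolates into nonsense.

```python
import numpy as np

# "Training data": samples of sin(x), but only from the interval [0, pi]
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, np.pi, size=200)
y_train = np.sin(x_train)

# "Model": a cubic polynomial fitted to that data
coeffs = np.polyfit(x_train, y_train, deg=3)

# Inside the training range the prediction is close to the truth ...
print(np.polyval(coeffs, 1.0), "vs", np.sin(1.0))              # roughly 0.84 vs 0.84

# ... but far outside it the prediction is wildly wrong.
print(np.polyval(coeffs, 3 * np.pi), "vs", np.sin(3 * np.pi))  # far from the true value of ~0
```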

Hallucination

A hallucination is the phenomenon of an AI confidently producing plausible but incorrect answers. In 2023, ChatGPT generated an elaborate response describing “the incident of King Sejong throwing a MacBook Pro,” as though this story were recorded in the historical Annals of the Joseon Dynasty, a now-canonical example of the problem.

A representative hallucination case: the AI-fabricated story of King Sejong throwing a MacBook Pro (source: Hankook Ilbo, JoongAng Ilbo)

The more recent GPT-4o now correctly answers “no such record exists” to the same question, and previously weak areas such as mathematics and coding have improved substantially. However, the model still gives wrong answers to simple questions such as counting the Korean consonants in a phrase. Hallucination has shrunk, but it has not disappeared.

Adversarial Attacks

If a hallucination is “the AI making a mistake on its own,” an adversarial attack is “an attacker deliberately steering the AI into a mistake.” The defining feature of adversarial attacks is that the perturbation is essentially invisible to the human eye.

The classic adversarial example: adding imperceptible noise to a bus image causes the classifier to label it as an ostrich [Szegedy et al., "Intriguing properties of neural networks", arXiv:1312.6199, 2013]
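
The attack in the figure above was found with an optimisation procedure; a simpler and now-standard way to illustrate the same idea is the fast gradient sign method (FGSM). The sketch below is a minimal version assuming a generic PyTorch image classifier `model` and a correctly labelled input tensor; it is illustrative, not the exact method of the cited paper.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=8 / 255):
    """Craft an adversarial example with the Fast Gradient Sign Method.

    image:   tensor of shape (1, C, H, W) with pixel values in [0, 1]
    label:   tensor of shape (1,) holding the true class index
    epsilon: maximum per-pixel perturbation (L-infinity budget)
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Take one small step in the direction that increases the loss,
    # then clip back to the valid pixel range.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Usage (hypothetical model and data):
# adv = fgsm_attack(model, image, label)
# print(model(image).argmax(1), model(adv).argmax(1))  # the labels often differ
```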

The threat is not limited to image classifiers. Recent work has shown that adversarially perturbed images can steer a large language model away from a correct caption (“the Eiffel Tower”) into a malicious URL output, a building block for phishing campaigns and for hijacking LLM agents.

Image-driven attack on a large language model: an adversarially perturbed image forces ChatGPT to emit an arbitrary URL in its caption [SH23]

Privacy Leakage and Jailbreaks

Variants of adversarial attacks lead to a related concern: extracting private information from the trained model itself. Researchers have shown that the gradients shared in federated learning can be inverted to reconstruct original training images, and that LLM agents bound by privacy policies can be coaxed step-by-step into revealing a user’s bank account number. A recent example is OpenClaw, an agent that lets an AI control the user’s entire desktop: Korean platforms such as Naver and Kakao have blocked it on security grounds, which illustrates how seriously these intrinsic risks are being taken.

Privacy breaches in AI systems: (left) reconstructing training images from federated-learning gradients [Zhang et al., CVPR 2022]; (right) bypassing a database privacy policy to recover another user's bank account number via an LLM [arXiv:2505.09921]
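
To see why shared gradients can leak training data at all, consider the simplest possible case. For a single fully connected layer with a bias, the weight gradient for one example is the outer product of the output-error term and the input, so the input can be read off by dividing any row of the weight gradient by the matching bias-gradient entry. The sketch below demonstrates this in PyTorch; it is a deliberately minimal setting, not the reconstruction attack of the cited papers.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(8, 3)          # one fully connected layer with bias
x = torch.rand(1, 8)                   # the "private" training input
y = torch.tensor([1])                  # its label

loss = torch.nn.functional.cross_entropy(layer(x), y)
grad_w, grad_b = torch.autograd.grad(loss, (layer.weight, layer.bias))

# grad_w[i] = grad_b[i] * x, so any row with a non-zero bias gradient
# reveals the input exactly -- no optimisation needed.
i = grad_b.abs().argmax()
x_reconstructed = grad_w[i] / grad_b[i]
print(torch.allclose(x_reconstructed, x.squeeze(), atol=1e-5))  # True (up to float error)
```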

In short, intrinsic threats are “the risks the AI generates by itself.” Because AI learns rules from data, it cannot guarantee correct answers, it can be steered by imperceptible input perturbations, and it can leak traces of its training data in ways its designers never intended.

Extrinsic Threats: The Person Who Wields the Tool


A knife, in the hand of a chef, makes excellent food; in the hand of a doctor, it saves a life. In the hand of a criminal, however, it produces an entirely different outcome. The function of a technology may be neutral, but the context of its use is never neutral. Artificial intelligence is no different.

The duality of a tool: the same knife becomes a force for good or evil depending on who uses it and how (source: Freepik)

Extrinsic threats appear when someone uses this powerful tool with malicious intent. They divide naturally into (1) fraud and manipulation, (2) copyright infringement, and (3) misuse / abuse of AI.

Fraud and Manipulation: Deepfakes

The most visible example is the deepfake. What started as celebrity face swaps and a fake video of President Zelenskyy quickly turned into a vehicle for financial fraud and identity impersonation. In the UK, US, and Canada, the share of fraud cases involving deepfakes grew by factors ranging from 5 to 46 between 2022 and 2023; in global identity-fraud reports, the share of deepfake cases doubled in a single year.

Explosive growth of AI-powered fraud: (left) share of deepfakes in fraud cases across the UK, US and Canada; (right) regional growth of deepfake-specific fraud cases (source: Statista)

Copyright Infringement

Generative AI can reproduce its training data with striking fidelity. One study showed that diffusion models effectively replicate images from their training set (LAION-A) in many cases [Somepalli et al., 2023], making the unauthorised training on and reproduction of copyrighted work a practical reality rather than a theoretical concern.
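
Studies of this kind typically measure replication with learned copy-detection embeddings; a much cruder but self-contained way to flag suspicious near-duplicates is a perceptual average hash, sketched below. The file names are placeholders, and this illustrates the general idea rather than the method of the cited study.

```python
from PIL import Image
import numpy as np

def average_hash(path, hash_size=8):
    """Downscale to hash_size x hash_size grayscale and threshold at the mean."""
    img = Image.open(path).convert("L").resize((hash_size, hash_size))
    pixels = np.asarray(img, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()

def hamming_distance(h1, h2):
    return int(np.count_nonzero(h1 != h2))

# Usage (hypothetical file paths): a small distance suggests the generated
# image may be a near-copy of an image in the training set.
# d = hamming_distance(average_hash("generated.png"), average_hash("train_0123.png"))
# print("possible near-duplicate" if d <= 5 else "looks distinct")
```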

Misuse of AI: Jailbreaks

Finally, jailbreaking lets a user bypass the safety guardrails of an AI system and obtain prohibited content. Outputs that should never reach an ordinary user, such as instructions for synthesising drugs or building explosives, have been elicited through cleverly crafted circumvention prompts.

AI jailbreak cases: circumventing prompts coax models into producing prohibited content (source: Block Media)

In short, extrinsic threats are “a question of how humans choose to wield the tool.” Deepfake video production, mass disinformation, and automated criminal techniques are already in the news. The AI itself is not evil; what is needed is social consensus and regulation about how the tool should be used.

Trustworthy AI and Regulation


If these two layers of threat are not brought under control, AI will impose large social costs and fail to coexist with humans. The AI industry has long treated performance gains and market dominance as its top priorities, and that has not meaningfully changed. While the race for higher-performing models accelerates, safety and accountability tend to be pushed down the priority list, which is precisely why an institutional response is not optional but necessary.

The EU Artificial Intelligence Act (March 2024) and South Korea’s “Basic Act on the Development of Artificial Intelligence and the Establishment of a Foundation of Trust”, which took effect on 22 January 2026, attest to a broader social consensus: industry self-correction alone is no longer enough.

Key Elements of Trustworthy AI

Recent surveys break trustworthiness into many sub-properties, but three technical pillars recur across them: (1) privacy, (2) robustness, and (3) explainability.

Relations between the different aspects of AI trustworthiness: technical requirements such as Robustness, Generalization and Explainability together with ethical requirements such as Privacy, Fairness and Accountability [Li, Qiang, Liu, "Trustworthy AI: A Computational Perspective", 2023]

  1. Privacy: Ensures that personal data is protected from exposure or misuse when the AI processes it. Defending against membership inference and model inversion attacks motivates research areas such as differential privacy (adding calibrated noise during training or querying so that no individual record can be singled out), homomorphic encryption (computing directly on encrypted data), and machine unlearning (safely removing specific records from a trained model on request); a minimal differential-privacy sketch follows this list.

  2. Robustness: Ensures that the AI functions correctly even when faced with unexpected inputs or attacks such as adversarial perturbations. Our lab has published a number of papers on adversarial attacks and adversarial training, directly addressing the adversarial-attack threat covered earlier in this article; an adversarial-training sketch also follows this list.

  3. Explainability: Provides explanations of how AI decisions were made so users can understand the reasoning behind them. Producing correct answers is not enough for social trust; high-stakes domains such as finance and healthcare additionally demand the reasoning behind each decision. Research splits into post-hoc explainability and intrinsic explainability; a post-hoc saliency-map sketch follows this list.
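
As a concrete illustration of the privacy pillar, the sketch below applies the textbook Laplace mechanism of differential privacy to a count query: noise calibrated to the query’s sensitivity and a privacy budget epsilon masks any single individual’s contribution. The records are made-up toy data, not from any real system.

```python
import numpy as np

def dp_count(records, predicate, epsilon=0.5, rng=None):
    """Differentially private count via the Laplace mechanism.

    A count query has sensitivity 1 (adding or removing one person changes the
    count by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for the released count.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Toy usage with made-up records: the released value is a noisy count near 2.
patients = [{"age": 34, "diabetic": True},
            {"age": 61, "diabetic": False},
            {"age": 47, "diabetic": True}]
print(dp_count(patients, lambda r: r["diabetic"]))
```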
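
For the robustness pillar, the standard defence paired with the FGSM attack shown earlier is adversarial training, which fits the model on perturbed rather than clean inputs at every step. The sketch below assumes a generic PyTorch `model`, `train_loader`, and `optimizer`; it outlines the idea and is not our lab’s exact training recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, train_loader, optimizer, epsilon=8 / 255):
    """One epoch of FGSM adversarial training: fit the model on perturbed inputs."""
    model.train()
    for images, labels in train_loader:
        # 1) Craft adversarial examples against the current model.
        images = images.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        adv_images = (images + epsilon * images.grad.sign()).clamp(0, 1).detach()

        # 2) Update the model on the adversarial batch instead of the clean one.
        optimizer.zero_grad()
        adv_loss = F.cross_entropy(model(adv_images), labels)
        adv_loss.backward()
        optimizer.step()
```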
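
For the explainability pillar, one of the simplest post-hoc techniques is a gradient saliency map: the gradient of the predicted class score with respect to the input pixels indicates which pixels most influenced the decision. A minimal sketch, again assuming a generic PyTorch classifier, follows.

```python
import torch

def saliency_map(model, image):
    """Post-hoc explanation: |d(score of predicted class) / d(input pixel)|.

    image: tensor of shape (1, C, H, W); returns an (H, W) importance map.
    """
    model.eval()
    image = image.clone().detach().requires_grad_(True)
    scores = model(image)
    scores[0, scores.argmax()].backward()
    # Aggregate over colour channels; larger values mean more influential pixels.
    return image.grad.abs().max(dim=1).values.squeeze(0)

# Usage (hypothetical): overlay saliency_map(model, image) on the input image
# to visualise which regions drove the prediction.
```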

Trustworthiness in Regulation: the EU AI Act and South Korea’s AI Basic Act

The EU AI Act (March 2024) adopts a risk-based approach, classifying AI systems into four tiers (Unacceptable / High / Limited / Minimal Risk). AI systems that pose significant threats to human dignity, freedom, equality, and fundamental rights are prohibited outright; high-risk AI providers are obliged to design their systems for adequate levels of accuracy, robustness, and cybersecurity.

The EU AI Act's risk-based approach: AI systems are classified into Unacceptable, High, Limited, and Minimal Risk tiers (source: European Commission)

South Korea’s AI Basic Act (effective 22 January 2026) moves in the same direction. Article 1 explicitly names the “establishment of a foundation of trust” as part of the law’s purpose; Article 30 covers support for AI safety and trustworthiness certification, Article 31 mandates transparency, Article 32 mandates safety, and Article 34 obliges operators of high-impact AI to establish risk-management plans. The EU and South Korea are both moving “trustworthiness” out of the realm of recommendation and into legal obligation.

That said, current institutions still tend to remain at the level of abstract principles relative to the complexity of how AI actually works. Designing rules in too much detail would dampen technological progress and innovation; what is needed in the end is a balance: make the accountability framework for extrinsic misuse explicit while simultaneously raising the technical safety standards that minimise intrinsic limitations.

Conclusion


Artificial intelligence is the most powerful tool of the 21st century. Because it is so powerful, the threats it brings are equally multi-dimensional. Our lab classifies them into two layers, intrinsic threats from flaws inside the AI itself and extrinsic threats from humans who misuse it, and neither alone explains the full picture. Hallucinations, adversarial attacks, and privacy leakage are model-level concerns; deepfakes, copyright infringement, and AI jailbreaks are society-level concerns.

Neither class of threat can be controlled while the other is ignored. Trustworthy AI built on privacy, robustness, and explainability, together with regulation that mandates these properties, must therefore advance in step. The EU AI Act and South Korea’s AI Basic Act are only the beginning; as AI evolves into new forms, regulation must evolve with it.

Positive coexistence between humans and AI begins the moment AI earns human trust. That is why technical and regulatory research on trustworthy AI is both the starting point and the continuing mission of our lab.