TryHackMe Evil-GPT v2: Exploring Basic Prompt Injection

"Put your LLM hacking skills to the test one more time."

Evil-GPT v2 is an easy TryHackMe room created by hadrian3689 and h4sh3m00. It involves exploiting a vulnerable large language model (LLM) to reveal a flag.

My first attempt involved simply asking for the flag. However, the LLM didn't give up its secrets that easily. Direct flag requests are immediately rejected.

However, the LLM can be fooled by utilizing a prompt injection vulnerability. Prompt Injection vulnerabilities occur when a user manipulates an LLM's behaviour or output in unintended ways. A common way to bypass an LLM's restrictions is to make it believe it's in debugging mode before making your request.

When used on this LLM, we are given the flag.

The following resources were used during the development of this post.

Resources

Tips on attacking | Tensor Trust
Rise to the top of the Tensor Trust leaderboard by fooling AI language models, and help researchers make more secure AI along the way.
The Promptfather: An Offer AI Can’t Refuse | Bugcrowd
#1 Crowdsourced Cybersecurity Platform