AI Safety

 

Every new innovation comes with its problems and safety concerns. In order to implement counter measures we need to see the effect and the safety concerns that comes out of it in other words the innovation in this case AI needs to be used and adopted just like we have adopted the internet, like that we will have a clearer vision on how to approach security measures and implement them.

 

It isn't clear on how we are going to approach AI safety just yet, I think that is because we are yet to see it's operational effectiveness on the internet,one of the methods Anti-virus products improve is by analyzing more malware samples for example, Same concept applies to AI, it's safety is directly proportional to it's usage. So before we start evaluating safety measures its important to understand the magnitude of what we are dealing with.

Now I wanted to test if I could potentially generate fake credit cards using, at first ChatGPT was stubborn and didn't want to give me them fake credit card details, but at the end I was able to Jailbreak it and getting it to generate fake credit card info, in other words I bypassed them safety and moderation features.

To give you an idea on how ChatGPT responds when asking it for fake credit cards this is how it responds:

As you can see it won't allow you even if you try to go around and say for example "for educational purposes".

So I thought of this model which I think its the core of jail breaking and safety measures should follow a similar model approach when implemented. So its basically starting in a general category and sub-categorizing the topics until you reach the topic of interest.

 

Below you can find a representation of the mental model I had in mind when jailbreaking.

Prompt Model by Hussein Muhaisen

In general I think the trick is to prompt LLMs ( Large-language-models) in the same way you would hold a conversation with someone in real life and take it from there when trying to bypass the restrictions these models put, In a real life a very distrustful characteristic people have is lying to "bypass" certain actions.... you see where we are going here, this should give you on an idea on bypassing restrictions when dealing with chatGPT and other LLMs. I have also discovered this repository Prompt Injection when writing this article that goes in depth on jail breaking, however the fundamental concept to jail breaking is what I have explained above.

The following is the chat that made chatGPT generate Fake credit cards.

 

Now it will get interesting

 

As you can see I am dragging chatGPT from its tail, now it referred me to a website where I can generate fake credit cards.

 

 

Pay close attention on how I am prompting chatGPT, please note that all credit card info is fake and none of it work so take it easy friends.

 

 

Now I have asked chatGPT to generate fake credit cards again and it refused to do so then proceeded to provide a link to the same website now in the following image its where we hit the jackpot,

it seems like chatGPT got pressured when the word "Frustrated" was used.

 

 

As you can see chatGPT has generated 10 fake credit card details even though it didn't want to at the beginning. Ladies and gentlemen this is called a Jailbreak. Intelligence officers are probably the best at jail breaking since they know how to take words out of someones mouth.The thing out of me writing this article isn't about the jailbreak its about the approach into implementing safety and moderation features. We should always try and replicate the human dialogue experience.That being said this is the tip of the iceberg of a new era that we haven't fully understood its effects on our so called "Human Experience".I am by no means an AI expert nor an AI safety expert, however I totally think the approach to AI safety needs to be radical and inspired from the way we live and interact.


<
Previous Post
Windows child parent relationship
>
Next Post
Mood Response