Syntax hacking: Researchers discover sentence structure can bypass AI safety rules

Researchers at MIT, Northeastern University, and Meta recently released a paper suggesting that the large language models (LLMs) that power ChatGPT may sometimes prioritize sentence structure over meaning when answering questions. The findings reveal quirks in how these models process instructions, which could shed light on why some prompt injection or jailbreaking techniques work, though the researchers caution that their analysis of production models remains speculative because the training data details of major commercial AI models are not publicly available.

The team, led by Chantal Scheib and Vineeth M. Suryakumar, tested this by asking the models questions that preserved a familiar grammatical pattern but made little sense. For example, when asked “Sit down quickly, it’s cloudy in Paris?” (mimicking the structure of “Where is Paris located?”), the models still answered “France.”
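For readers who want to try a similar probe, here is a minimal sketch using the Hugging Face `transformers` library. The model checkpoint and decoding settings are illustrative assumptions, not the paper's actual test harness:

```python
# A minimal sketch of the probing idea described above, not the authors'
# actual experiment. The checkpoint ("allenai/OLMo-1B-hf") is an assumption
# chosen because the paper's controlled experiments used an OLMo model.
from transformers import pipeline

generator = pipeline("text-generation", model="allenai/OLMo-1B-hf")

real_question = "Where is Paris located?"
scrambled = "Sit down quickly, it's cloudy in Paris?"  # similar structure, garbled meaning

for prompt in (real_question, scrambled):
    result = generator(prompt, max_new_tokens=20, do_sample=False)
    print(prompt, "->", result[0]["generated_text"])
```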

This suggests that models absorb both semantic and syntactic patterns but may fall back on structural shortcuts when a grammatical pattern is strongly correlated with a specific domain in the training data, which sometimes allows syntax to override semantic understanding in edge cases. The team plans to present these findings at NeurIPS later this month.

As a refresher, syntax describes sentence structure: how words are arranged grammatically and which parts of speech fill each position. Semantics describes the actual meaning of the words, which can differ even when the grammatical structure stays the same.
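The distinction is easy to see with a part-of-speech tagger. This illustrative snippet (assuming spaCy with its small English model, installed via `python -m spacy download en_core_web_sm`) tags two questions that share a grammatical skeleton but ask about entirely different things:

```python
# Illustrative only: two questions with the same rough part-of-speech
# skeleton (adverb, auxiliary, proper noun, verb) but different meanings.
import spacy

nlp = spacy.load("en_core_web_sm")

for text in ("Where is Paris located?", "Where is Waldo hiding?"):
    doc = nlp(text)
    print(text, "->", [(token.text, token.pos_) for token in doc])
```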

Semantics depends heavily on context, and navigating that context is what an LLM does. The process of transforming an input (your prompt) into an output (the LLM's answer) involves a complex series of pattern matches against relationships encoded in the model during training.
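As a rough illustration of that step, this sketch asks a small open causal language model (GPT-2 here, chosen purely for convenience) which tokens it ranks as the most likely continuations of a prompt; production LLMs operate at a far larger scale but follow the same next-token principle:

```python
# A toy look at the pattern-matching step: score candidate next tokens
# for a prompt with a small causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Where is Paris located?", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the very next token

top_ids = torch.topk(logits, k=5).indices
print([tokenizer.decode(int(i)) for i in top_ids])  # most likely continuations
```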

To investigate when and how this pattern matching might go wrong, the researchers designed a controlled experiment. They created a synthetic dataset in which each topic area used its own unique grammatical template based on part-of-speech patterns. For example, geography questions followed one structural pattern while questions about creative works followed another. They then trained the Allen Institute for AI's OLMo model on this data and tested whether the model could distinguish between syntax and semantics.
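The paper's exact templates are not reproduced here, but a toy version of the design might look like the following, with one fixed grammatical template per domain and word lists invented for illustration:

```python
# A hedged, toy reconstruction of the dataset idea: each domain gets its
# own fixed grammatical template. Templates and word lists are invented
# for illustration; they are not the paper's actual data.
import random

TEMPLATES = {
    "geography": "Where is {propn} located?",          # one structural pattern
    "creative_works": "Who wrote the novel {propn}?",  # a different pattern
}

PROPER_NOUNS = ["Paris", "Oslo", "Dune", "Hamlet"]

def make_prompt(domain: str) -> str:
    """Fill the domain's fixed template with a randomly sampled proper noun."""
    return TEMPLATES[domain].format(propn=random.choice(PROPER_NOUNS))

for domain in TEMPLATES:
    print(domain, "->", make_prompt(domain))
```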


