Artificial Intelligence

Dark Web ChatGPT Unleashed: Meet DarkBERT

We’re still early in the snowball effect set off by the release of Large Language Models (LLMs) like ChatGPT into the wild. Paired with the open-sourcing of other GPT (Generative Pre-trained Transformer) models, the number of applications employing AI is exploding, and ChatGPT itself can reportedly be used to create highly advanced malware.

As time passes, the number of applied LLMs will only increase, each specializing in its own area and trained on carefully curated data for a specific purpose. One such application just dropped, and it was trained on data from the dark web itself. DarkBERT, as its South Korean creators called it, has arrived, and its release paper also gives an overall introduction to the dark web.

DarkBERT is based on the RoBERTa architecture, an AI approach developed back in 2019. RoBERTa is itself a retrained version of the earlier BERT model: its authors found that BERT had been significantly undertrained when released, and that the same architecture had more performance to give under a better training procedure.


To train the model, the researchers crawled the Dark Web through the anonymizing Tor network, then filtered the raw data (applying techniques such as deduplication, category balancing, and data pre-processing) to generate a Dark Web database.
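As a rough illustration of two of the filtering steps named above, deduplication and category balancing, here is a minimal Python sketch. It is not the researchers' pipeline: the page records, the `category` labels, the hash-based duplicate check, and the per-category cap are all assumptions made for the example.

```python
import hashlib
import random
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def deduplicate(pages):
    """Keep the first page seen for each distinct normalized text."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(normalize(page["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def balance_by_category(pages, max_per_category, seed=0):
    """Cap each category at max_per_category pages by random downsampling."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for page in pages:
        by_category[page["category"]].append(page)
    balanced = []
    for category in sorted(by_category):
        group = by_category[category]
        if len(group) > max_per_category:
            group = rng.sample(group, max_per_category)
        balanced.extend(group)
    return balanced


# Hypothetical crawled pages; real data would come from the crawler.
raw_pages = [
    {"category": "forum", "text": "Selling   fresh accounts"},
    {"category": "forum", "text": "selling fresh accounts"},  # duplicate after normalization
    {"category": "forum", "text": "Looking for a vendor"},
    {"category": "forum", "text": "New marketplace thread"},
    {"category": "market", "text": "Listing: example item"},
]

corpus = balance_by_category(deduplicate(raw_pages), max_per_category=2)
```

Exact-hash matching only catches copies that are identical after normalization; a production pipeline would likely need near-duplicate detection as well, and the cap of two pages per category is purely illustrative.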

DarkBERT is the result of feeding that database to the RoBERTa language model. The resulting model can analyze a new piece of Dark Web content, written in its own dialects and heavily coded messages, and extract useful information from it.

Saying that English is the business language of the Dark Web wouldn’t be entirely correct, but the language used there is a specific enough concoction that the researchers believed a dedicated LLM had to be trained on it. In the end, they were right: the researchers showed that DarkBERT outperformed other large language models, which should allow security researchers and law enforcement to penetrate deeper into the recesses of the web. That is, after all, where most of the action is.

As with other LLMs, that doesn’t mean DarkBERT is finished, and further training and tuning can continue to improve its results. How it will be used, and what knowledge can be gleaned, remains to be seen.

Culled from