DarkBERT is ChatGPT for the Dark Web: How it works
We are only months into the AI frenzy that ChatGPT started, and large language models (LLMs) and the applications built on them have already become hugely popular. LLMs are usually trained on large datasets drawn from the Surface Web. But what about an LLM trained on the Dark Web, the internet's seedy underbelly? That is DarkBERT. Let's see what it is.
Why does this story matter?
The Dark Web is synonymous with malicious and illegal activity. So what can an LLM trained on Dark Web data bring to the table? It may sound nightmarish to many people, but DarkBERT could be an answer to a new problem: AI-powered cybercrime. With AI making cybercrime easier to commit than ever, it is essential to have an antidote.
DarkBERT is based on the RoBERTa architecture
DarkBERT is an LLM developed by a group of South Korean researchers. It is based on the RoBERTa architecture. RoBERTa, or Robustly Optimized BERT Pretraining Approach, was developed in 2019 by researchers from Facebook (now Meta) and the University of Washington. DarkBERT's development and usage are detailed in a yet-to-be-peer-reviewed paper titled "DarkBERT: A Language Model for the Dark Side of the Internet."
Researchers used Tor to crawl the Dark Web
The researchers used the Tor network to collect DarkBERT's training data. Tor is an anonymizing network, not a firewall, and it is the usual way to reach Dark Web sites. By crawling the Dark Web through it, they collected 6.1 million pages. To filter that raw crawl down to useful training text, they applied techniques such as text preprocessing, deduplication, and category balancing. The resulting dataset was then used to train the RoBERTa-based model.
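To make two of those filtering steps concrete, here is a minimal Python sketch of deduplication and category balancing. It is not the researchers' pipeline: the page records, the category labels, the exact-match hashing, and the per-category cap are all assumptions chosen for illustration.

```python
import hashlib
import random
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(pages):
    """Keep the first page for each distinct normalized text (exact-match dedup)."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(normalize(page["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def balance_categories(pages, max_per_category, seed=0):
    """Cap each category at max_per_category pages, sampling at random."""
    by_category = defaultdict(list)
    for page in pages:
        by_category[page["category"]].append(page)
    rng = random.Random(seed)
    balanced = []
    for category in sorted(by_category):
        group = by_category[category]
        if len(group) > max_per_category:
            group = rng.sample(group, max_per_category)
        balanced.extend(group)
    return balanced


# Hypothetical toy crawl: the texts and category names are made up.
pages = [
    {"text": "Market  listing one", "category": "market"},
    {"text": "market listing one", "category": "market"},  # near-verbatim copy
    {"text": "Market listing two", "category": "market"},
    {"text": "Market listing three", "category": "market"},
    {"text": "Forum thread about leaks", "category": "forum"},
]

unique_pages = deduplicate(pages)
balanced_pages = balance_categories(unique_pages, max_per_category=2)
print(len(pages), len(unique_pages), len(balanced_pages))  # prints: 5 4 3
```

Exact-match hashing only catches copies that are identical after normalization. A real crawl would likely need near-duplicate detection as well, which this sketch leaves out.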
RoBERTa does not predict the next sentence while training
The researchers chose RoBERTa as the base model because it drops the Next Sentence Prediction (NSP) task during pretraining. According to them, this may help when training on Dark Web text, where sentence-like structures are less common than on the Surface Web.
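Without NSP, a RoBERTa-style model learns from masked language modeling alone: some tokens are hidden and the model is trained to recover them. The Python sketch below builds one such training example. It is deliberately simplified and rests on assumptions: it splits on whitespace rather than using subword tokens, always substitutes the mask token (BERT-style masking sometimes keeps or randomizes the chosen token instead), and the sample text is invented.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa's mask token string


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one masked-language-modeling example.

    labels[i] holds the original token where position i was masked, else None.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so every example carries a training signal.
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# Hypothetical fragment with little sentence structure, as the text describes.
tokens = "price list updated contact via pgp key only escrow accepted".split()
masked, labels = mask_tokens(tokens, seed=1)
print(masked)
print(labels)
```

Each example needs only a single span of text, not a pair of consecutive sentences, which is why this objective suits text that is not organized into clean sentences. RoBERTa also redraws the mask each time a sequence is seen (dynamic masking); calling `mask_tokens` with a different seed per pass would approximate that.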
DarkBERT has several applications in cybersecurity
Because it was trained on text collected from the Dark Web, DarkBERT can be an ally in fighting cybercrime. According to the researchers, it can help monitor sites where ransomware groups sell or publish confidential data leaked from organizations. It can also help sift through Dark Web forums to spot exchanges of illicit information, and it can help monitor such exchanges based on keywords.