Cybersecurity Projects for Beginners with Open Datasets

Sharpen your data science and cybersecurity skills by doing a project with open datasets

Ensar Seker
2 min read · Jun 9, 2020

Hacker News Posts

This dataset contains Hacker News posts from the 12 months up to September 26, 2016. It includes the following columns:

title: title of the post (self-explanatory)

URL: the URL of the item being linked to

num_points: the number of upvotes the post received

num_comments: the number of comments the post received

author: the name of the account that made the post

created_at: the date and time the post was made (the time zone is Eastern Time in the US)
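As a starting point, the columns above can be read with Python's standard `csv` module. This is a minimal sketch: the sample rows are invented for illustration, and the exact header spelling (`url` in lowercase) and the `month/day/year hour:minute` date format are assumptions that should be checked against the downloaded file.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows, NOT taken from the real dataset.
# Header names and the date format are assumptions.
SAMPLE_CSV = """title,url,num_points,num_comments,author,created_at
Example post about TLS,https://example.com/tls,12,3,alice,9/26/2016 3:26
Ask HN: How do you store secrets?,,45,30,bob,9/25/2016 21:05
"""


def load_posts(handle, date_format="%m/%d/%Y %H:%M"):
    """Parse rows into dicts with typed fields.

    created_at stays a naive datetime; per the column description
    it should be read as US Eastern Time.
    """
    posts = []
    for row in csv.DictReader(handle):
        posts.append({
            "title": row["title"],
            "url": row["url"],
            "num_points": int(row["num_points"]),
            "num_comments": int(row["num_comments"]),
            "author": row["author"],
            "created_at": datetime.strptime(row["created_at"], date_format),
        })
    return posts


posts = load_posts(io.StringIO(SAMPLE_CSV))
most_discussed = max(posts, key=lambda p: p["num_comments"])
print(most_discussed["title"], most_discussed["created_at"].hour)
```

With the real file, replace `io.StringIO(SAMPLE_CSV)` with `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the download.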

Link to the dataset

Credit Card Fraud Detection

The dataset contains transactions made with credit cards in September 2013 by European cardholders.

It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
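The imbalance is worth quantifying before training anything. The sketch below uses only the two counts quoted above. The inverse-frequency class weights are one common heuristic for unbalanced data, shown here as an assumption about how you might proceed, not as something the dataset prescribes.

```python
# Counts quoted in the dataset description.
fraud_count = 492
total_count = 284_807
legit_count = total_count - fraud_count

# Share of transactions that are fraudulent.
fraud_rate = fraud_count / total_count

# A model that always predicts "not fraud" already reaches this accuracy,
# which is why plain accuracy is a misleading metric here.
always_legit_accuracy = legit_count / total_count

# Inverse-frequency weights: n_samples / (n_classes * n_samples_in_class).
class_weights = {
    0: total_count / (2 * legit_count),  # legitimate transactions
    1: total_count / (2 * fraud_count),  # frauds
}

print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-legit accuracy: {always_legit_accuracy:.4%}")
print(f"class weights: {class_weights}")
```

Because a trivial classifier scores so high on accuracy, metrics such as precision, recall, or the area under the precision-recall curve are usually more informative on this kind of data.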

Link to the dataset

Fake and Real News

You can use this dataset to build a classifier that determines whether an article is fake news or not.
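One simple baseline for this kind of task is a bag-of-words naive Bayes classifier. The sketch below is a small multinomial naive Bayes with add-one smoothing written with the standard library only. The class name, the tokenizer, and the four training headlines are all invented for illustration; none of them come from the dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesText:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up headlines used only to exercise the code.
train_texts = [
    "shocking miracle cure doctors hate",
    "you won't believe this secret celebrity scandal",
    "senate passes budget bill after debate",
    "central bank holds interest rates steady",
]
train_labels = ["fake", "fake", "real", "real"]

model = NaiveBayesText().fit(train_texts, train_labels)
print(model.predict("secret miracle cure scandal"))
print(model.predict("bank holds rates after debate"))
```

On the real dataset you would train on the article text with a held-out test split, and a library implementation would likely be a better choice than this hand-rolled version.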

Link to the dataset

US Military Academy Dataset (Including Data Capture from National Security Agency (NSA))

The National Security Agency permitted both the recording and release of the following datasets.

The dataset also aims to give users a means to correlate IP addresses found in the PCAP files with the IP addresses of hosts on the internal USMA network.

Link to the dataset

DARPA Intrusion Detection Data Sets [1998/1999]

The goal of the DARPA Cyber Grand Challenge (CGC) was to engender a new generation of autonomous cyber defense capabilities that combine the speed and scale of automation with reasoning abilities exceeding those of human experts.

Link to the dataset

Detecting Malicious URLs

The long-term goal of this research is to construct a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, we have explored techniques that involve classifying URLs based on their lexical and host-based features, as well as online learning to process large numbers of examples and adapt quickly to evolving URLs over time.
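To make "lexical features" concrete, here is a minimal sketch that derives a few of them from the URL string alone, using `urllib.parse` from the standard library. The feature names and the choice of features are illustrative assumptions, not the feature set used by the researchers. Host-based features (for example WHOIS or DNS lookups) need network access and are not covered.

```python
import re
from urllib.parse import urlsplit

# Matches a dotted-quad host such as 192.168.0.1 (no range check on octets).
IPV4_PATTERN = re.compile(r"(\d{1,3}\.){3}\d{1,3}")


def lexical_features(url):
    """Return a dict of simple, hypothetical lexical features for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "host_is_ipv4": bool(IPV4_PATTERN.fullmatch(host)),
        "num_path_segments": len([s for s in parts.path.split("/") if s]),
        "has_at_symbol": "@" in url,
        "has_query": bool(parts.query),
    }


# Made-up example URLs.
for example in ("http://192.168.0.1/login/update.php?id=42",
                "https://example.com/"):
    print(example, lexical_features(example))
```

Each dict can be turned into a feature vector and fed to a classifier. An online learner that updates on every new example fits the goal of adapting to evolving URLs.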

Link to the dataset

Dataset of Probing Attacks (Port Scan)

This dataset contains probing attacks (port scans) performed with nmap, unicornscan, hping3, zmap and masscan.

It also presents a way to extract background traffic to serve as “normal” traffic, supporting the development of machine learning algorithms in IDS research. In this project, the background traffic currently comes from the MAWILab datasets.
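Before reaching for machine learning, a simple counting heuristic gives a useful baseline: a source that touches many distinct ports on one destination is probably scanning. The sketch below assumes flows have already been reduced to `(source IP, destination IP, destination port)` tuples. That record layout, the threshold of 10 ports, and the sample addresses are all hypothetical.

```python
from collections import defaultdict


def find_scanners(flows, min_distinct_ports=10):
    """Flag (src, dst) pairs where src contacted many distinct ports on dst.

    flows: iterable of (src_ip, dst_ip, dst_port) tuples (assumed layout).
    Returns a dict mapping each flagged pair to its distinct-port count.
    """
    ports_by_pair = defaultdict(set)
    for src, dst, dst_port in flows:
        ports_by_pair[(src, dst)].add(dst_port)
    return {
        pair: len(ports)
        for pair, ports in ports_by_pair.items()
        if len(ports) >= min_distinct_ports
    }


# Made-up traffic: one host sweeps ports 20-39, another only uses HTTPS.
flows = [("10.0.0.5", "10.0.0.9", port) for port in range(20, 40)]
flows += [("10.0.0.7", "10.0.0.9", 443)] * 50

suspects = find_scanners(flows)
print(suspects)
```

This only catches vertical scans (many ports on one host). Horizontal scans across many hosts, or slow scans spread over time, need different grouping keys and time windows.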

Link to the dataset

For more datasets
