Cybersecurity Projects for Beginners with Open Datasets
Sharpen your data science and cybersecurity skills by doing a project with open datasets
Hacker News Posts
This dataset contains Hacker News posts from the 12 months up to September 26, 2016. It includes the following columns:
title: title of the post (self-explanatory)
URL: the URL of the item being linked to
num_points: the number of upvotes the post received
num_comments: the number of comments the post received
author: the name of the account that made the post
created_at: the date and time the post was made (the time zone is Eastern Time in the US)
Link to the dataset
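As a starting point, here is a minimal sketch of how you might explore these columns using only the Python standard library. It computes the average number of comments per hour of posting. The sample rows are made up for illustration, and the lowercase `url` header and the `created_at` format (`%m/%d/%Y %H:%M`) are assumptions; check them against the actual file before relying on this.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Made-up sample rows that only mimic the columns listed above; not real data.
SAMPLE_CSV = """title,url,num_points,num_comments,author,created_at
Ask HN: How do you learn security?,,12,8,alice,9/26/2016 3:26
Show HN: A tiny port scanner,http://example.com/scanner,40,15,bob,9/26/2016 3:40
A post nobody commented on,http://example.com/quiet,1,0,carol,9/25/2016 14:05
"""


def average_comments_by_hour(csv_file, time_format="%m/%d/%Y %H:%M"):
    """Return {hour_of_day: mean num_comments} for the posts in csv_file.

    time_format is an assumption about how created_at is written.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        created = datetime.strptime(row["created_at"], time_format)
        totals[created.hour] += int(row["num_comments"])
        counts[created.hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}


averages = average_comments_by_hour(io.StringIO(SAMPLE_CSV))
print(averages)
```

To run it on the real file, pass an open file handle instead of the in-memory sample, for example `open("hacker_news_posts.csv", newline="", encoding="utf-8")` (the file name here is hypothetical).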
Credit Card Fraud Detection
The dataset contains transactions made with credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
Link to the dataset
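The imbalance matters in practice because plain accuracy becomes misleading. The short calculation below uses only the two counts quoted above and shows what happens with a hypothetical "classifier" that labels every transaction as legitimate. It is an arithmetic illustration, not a model.

```python
# Counts quoted in the dataset description above.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492

# Share of transactions that are fraudulent (the positive class).
fraud_rate = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS

# Hypothetical baseline: predict "legitimate" for every transaction.
correct_predictions = TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS
baseline_accuracy = correct_predictions / TOTAL_TRANSACTIONS

# Recall on the fraud class: frauds caught / total frauds. The baseline catches none.
baseline_recall = 0 / FRAUD_TRANSACTIONS

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"baseline recall:   {baseline_recall:.4%}")
```

The baseline scores above 99.8% accuracy while detecting no fraud at all, which is why metrics such as recall, precision, or the area under the precision-recall curve are usually a better fit for a project on this dataset.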
Fake and Real News
You can use this dataset to build an algorithm that determines whether an article is fake news or not.
Link to the dataset
US Military Academy Dataset (Including Data Capture from National Security Agency (NSA))
The National Security Agency permitted both the recording and release of the following datasets.
The dataset's authors also attempt to give users a means to correlate IP addresses found in the PCAP files with hosts on the internal USMA network.
Link to the dataset
DARPA Intrusion Detection Data Sets [1998/1999]
The goal of the DARPA CGC was to engender a new generation of autonomous cyber defense capabilities that combined the speed and scale of automation with reasoning abilities exceeding those of human experts.
Link to the dataset
Detecting Malicious URLs
The long-term goal of this research is to construct a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, we have explored techniques that involve classifying URLs based on their lexical and host-based features, as well as online learning to process large numbers of examples and adapt quickly to evolving URLs over time.
Link to the dataset
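To give a feel for what "lexical features" can look like, here is a minimal sketch that turns a URL string into a few numeric and boolean features using the Python standard library. The specific features are illustrative assumptions, not the feature set used in the research above, and host-based features (which need lookups such as WHOIS or DNS) are left out.

```python
import ipaddress
from urllib.parse import urlsplit


def lexical_features(url):
    """Extract a few simple lexical features from a URL string.

    The feature choice is illustrative only; it is not taken from the dataset.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""

    # Raw IP addresses in place of a domain name are a common heuristic signal.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_path_segments": len([s for s in parts.path.split("/") if s]),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip,
    }


# 192.0.2.10 is a documentation-only address, used here as a made-up example.
features = lexical_features("http://192.0.2.10/login/update.php?id=123")
print(features)
```

A dictionary like this can be fed to any classifier that accepts tabular input once you have labeled URLs from the dataset.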
Dataset of Probing Attacks (Port Scan)
This dataset contains probing attacks (port scans) performed with nmap, unicornscan, hping3, zmap, and masscan.
It also presents a way to extract background traffic to be used as “normal” traffic, to support the development of machine learning algorithms in IDS research. In this project, the current source of that background traffic is the MAWILab datasets.
Link to the dataset
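Before reaching for machine learning, it can help to try a simple rule-based baseline. The sketch below flags any source IP that contacts many distinct ports on a single destination. This is a deliberately naive heuristic and not the method used by the dataset's authors; the flow tuples, the function name, and the threshold of 10 ports are all assumptions made for illustration.

```python
from collections import defaultdict


def find_port_scanners(flows, min_distinct_ports=10):
    """Return sorted source IPs that hit many distinct ports on one destination.

    flows is an iterable of (src_ip, dst_ip, dst_port) tuples.
    min_distinct_ports is an arbitrary illustrative threshold.
    """
    ports_seen = defaultdict(set)
    for src_ip, dst_ip, dst_port in flows:
        ports_seen[(src_ip, dst_ip)].add(dst_port)

    scanners = {
        src_ip
        for (src_ip, _dst_ip), ports in ports_seen.items()
        if len(ports) >= min_distinct_ports
    }
    return sorted(scanners)


# Made-up traffic using documentation-only IP ranges.
flows = [("198.51.100.7", "192.0.2.1", port) for port in range(20, 40)]  # sweeps 20 ports
flows += [("203.0.113.5", "192.0.2.1", 443)] * 50  # many connections, one port

scanners = find_port_scanners(flows)
print(scanners)
```

A rule like this misses slow or distributed scans, which is one reason labeled scan traffic plus "normal" background traffic is useful for training and evaluating learned detectors.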
For more datasets