Overview
With thanks to the National Science Foundation (NSF), Duke University has embarked on a project titled “CICI: RSSD: Massive Internal System Traffic Research Analysis and Logging Dataset” (MISTRAL Dataset). The project leverages and expands an internal network monitoring fabric and data collection points to create a reference scientific security dataset (RSSD) and an associated data pipeline and analysis techniques. It aims to improve detection of abnormal or malicious activities impacting identified science drivers and associated cyberinfrastructure in Duke datacenters and research labs. The project is a key step in a larger effort to detect and respond to security incidents or Indicators of Compromise (IoC) for scientific application workflows and workloads in a privacy-preserving manner, including developing the data handling capacity to evolve threat intelligence and alerting. Specifically, the MISTRAL Dataset will capture domain science workflow behavior so that the behavior can be analyzed and interpreted and an expected or standard characterization of the science workflow can be identified. Thus, the highest-order goals of the project extend beyond creation of an RSSD and encompass improvements in the protection and security of the science cyberinfrastructure itself and associated improvements to application workflows.
Principles
- MISTRAL aims to preserve activity observed on networks associated with research endeavors.
- MISTRAL uses one-way hashing to mask unique, identifying information about organizations that use the framework.
- “Normal” or expected data collected by MISTRAL should be differentiated from abnormal or unusual activity.
- Traffic flows associated with personal use raise greater privacy concerns than those associated with science.
- Post-processing on raw data is required to add enriched data sources and/or mask identifiable information.
- MISTRAL data will be summarized in different formats to manage privacy expectations.
Practices
- MISTRAL retains the following data elements, or unique hashes of them (see the IP address masking practice below):
  - Source/destination IP addresses
  - Source/destination ports
  - File types
- MISTRAL masks IP addresses associated with the institution using a unique one-way hash for each IP address captured.
  - Example 1: Research machine communicating with an Internet node:
    <IP Address HASH VALUE 1>, Port 7643 -> <140.82.112.3>, Port 443
  - Example 2: Two research machines in the same institution communicating with each other:
    <IP Address HASH VALUE 1>, Port 9632 -> <IP Address HASH VALUE 2>, Port 139
- MISTRAL masks identifiable elements with unique one-way hashes, including:
  - Names/IDs
  - File names/paths
- MISTRAL data sets group and flag traffic flows by type of network traffic:
  - Science vs. personal
  - Performance indicators
  - Security indicators
  - Expected vs. unexpected
- Summarized data sets will initially include:
  - Masked/obfuscated
  - Science area types
  - Infrastructure
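
The IP address masking practice above, with its two examples, can be illustrated with a short sketch. This is not the project's actual pipeline: the source does not name a hash algorithm, so the use of keyed HMAC-SHA-256 here is an assumption, as are the names `INSTITUTION_NETWORKS`, `SECRET_KEY`, `mask_ip`, and `mask_flow` and the placeholder network range. A keyed hash is chosen for the sketch because the IPv4 address space is small enough that an unkeyed hash could be reversed by hashing every possible address.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical values for illustration only; not taken from the MISTRAL project.
# 192.0.2.0/24 is a range reserved for documentation (RFC 5737).
INSTITUTION_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
SECRET_KEY = b"replace-with-a-securely-stored-key"


def is_institutional(ip: str) -> bool:
    """Return True if the address falls inside an institutional network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INSTITUTION_NETWORKS)


def mask_ip(ip: str) -> str:
    """Replace an institutional IP with a keyed one-way hash.

    Addresses outside the institution are returned unchanged, as in
    Example 1 above. HMAC-SHA-256 is an assumed choice of algorithm.
    """
    if not is_institutional(ip):
        return ip
    canonical = str(ipaddress.ip_address(ip))
    return hmac.new(SECRET_KEY, canonical.encode("ascii"), hashlib.sha256).hexdigest()


def mask_flow(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    """Format a flow record in the style of the examples above."""
    return (
        f"<{mask_ip(src_ip)}>, Port {src_port} -> "
        f"<{mask_ip(dst_ip)}>, Port {dst_port}"
    )


if __name__ == "__main__":
    # Research machine to an Internet node: only the source is masked.
    print(mask_flow("192.0.2.10", 7643, "140.82.112.3", 443))
    # Two research machines in the same institution: both sides are masked.
    print(mask_flow("192.0.2.10", 9632, "192.0.2.20", 139))
```

Because the hash is deterministic for a given key, the same research machine maps to the same hash value in every flow, which keeps flows correlatable without exposing the address itself.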