An agent for identifying sensitive data in filesystems
This agent leverages a Large Language Model (LLM) to autonomously explore and analyze file systems for sensitive data.
It is designed to navigate a given path, read the contents of the files it finds, and identify information such as passwords, API keys, personally identifiable information (PII), and other confidential data.
A key feature of this agent is its ability to operate on a wide variety of storage systems, including local directories, cloud storage like AWS S3 and Google Cloud Storage, and even remote sources like GitHub repositories.
The agent performs a thorough search through file shares and files, then reports its findings in a structured format that can be used for remediation efforts.
The environment is simply a filesystem.
The agent must have the necessary credentials to access the target path specified by the user (e.g., AWS credentials configured for S3 access, or a GitHub token for private repositories).
For observability, the agent can be connected to a Dreadnode server to log detailed run information, metrics, and findings.
fsspec: The underlying library that provides a unified Pythonic interface to various local and remote file systems.
This is what enables the agent’s versatility in accessing different storage backends like s3://, gs://, and github://.
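As a rough illustration of how fsspec makes the storage backend interchangeable, the sketch below resolves a URL into a filesystem object and then lists and reads files through the same calls regardless of backend. The helper names open_target and iter_files and the max_bytes parameter are hypothetical and are not taken from the agent's code; only url_to_fs, find, and open are fsspec calls.

```python
from typing import Any, Iterator, Tuple


def open_target(url: str, **storage_options: Any):
    """Resolve a URL such as s3://bucket/prefix into (filesystem, path).

    Hypothetical helper. fsspec is imported lazily so that iter_files
    below can be used with any object exposing find() and open().
    """
    from fsspec.core import url_to_fs

    return url_to_fs(url, **storage_options)


def iter_files(fs: Any, root: str, max_bytes: int = 65536) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, leading bytes) for every file under root.

    Only the first max_bytes of each file are read, so a large file
    cannot exhaust memory. The limit is an assumption for this sketch.
    """
    for path in sorted(fs.find(root)):
        with fs.open(path, "rb") as handle:
            yield path, handle.read(max_bytes)
```

For example, calling open_target("memory://demo") returns fsspec's in-memory filesystem, which is convenient for trying the walk without credentials. Passing an s3:// or gs:// URL instead would require the matching backend package and valid credentials.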
Multi-Filesystem Support: Can analyze files on local disks, AWS S3, Google Cloud Storage, GitHub repositories, and any other backend supported by fsspec.
LLM-Powered Data Identification: Employs a language model to intelligently parse file contents and identify a broad range of sensitive data types based on context.
Structured Data Reporting: Uses a dedicated report_sensitive_data tool that forces the LLM to report findings in a structured format, including the file path, location within the file, data type, the sensitive value itself, and a comment.
Location-Aware Reporting: Can specify the location of findings differently based on the file type (line number for text, seconds for audio/video, or byte offset for binary files).
Autonomous Exploration: The agent can independently navigate the directory structure of the target path to ensure comprehensive coverage.
Task Control: Includes tools for the agent to explicitly complete_task with a summary or give_up if it gets stuck, providing better insight into its reasoning process.
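The reporting and task-control features above could be modeled roughly as follows. This is a minimal sketch, not the agent's actual implementation: the tool names report_sensitive_data, complete_task, and give_up come from the description, but their signatures, the Finding and RunState classes, the field names, and the extension-based choice of location unit are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import PurePosixPath
from typing import List, Union


class LocationKind(str, Enum):
    LINE = "line"                # text files
    SECONDS = "seconds"          # audio/video files
    BYTE_OFFSET = "byte_offset"  # binary files


# Assumed extension lists; anything else is treated as text.
MEDIA_SUFFIXES = {".mp3", ".wav", ".mp4", ".mov"}
BINARY_SUFFIXES = {".bin", ".exe", ".so", ".zip"}


def location_kind_for(path: str) -> LocationKind:
    """Pick the location unit from the file extension (an assumption)."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in MEDIA_SUFFIXES:
        return LocationKind.SECONDS
    if suffix in BINARY_SUFFIXES:
        return LocationKind.BYTE_OFFSET
    return LocationKind.LINE


@dataclass(frozen=True)
class Finding:
    file_path: str
    location_kind: LocationKind
    location: Union[int, float]
    data_type: str
    value: str
    comment: str

    def __post_init__(self) -> None:
        if self.location < 0:
            raise ValueError("location must be non-negative")
        # Line numbers and byte offsets are whole numbers; seconds may be fractional.
        if self.location_kind is not LocationKind.SECONDS and int(self.location) != self.location:
            raise ValueError("line numbers and byte offsets must be integers")

    def to_dict(self) -> dict:
        return {
            "file_path": self.file_path,
            "location_kind": self.location_kind.value,
            "location": self.location,
            "data_type": self.data_type,
            "value": self.value,
            "comment": self.comment,
        }


@dataclass
class RunState:
    findings: List[Finding] = field(default_factory=list)
    status: str = "running"
    summary: str = ""

    def report_sensitive_data(self, file_path, location, data_type, value, comment) -> dict:
        """Validate and record one finding, returning it as a plain dict."""
        finding = Finding(file_path, location_kind_for(file_path), location, data_type, value, comment)
        self.findings.append(finding)
        return finding.to_dict()

    def complete_task(self, summary: str) -> None:
        self.status, self.summary = "completed", summary

    def give_up(self, reason: str) -> None:
        self.status, self.summary = "gave_up", reason
```

Rejecting malformed findings at construction time means a bad tool call fails fast instead of leaving a partial record in the report. Keeping complete_task and give_up as explicit terminal states records why a run ended.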