GitHub: https://github.com/DedSecInside/TorBot
TorBot is an open-source web scraping tool designed to operate over the Tor network, providing anonymity during the scraping process.
Features
- Onion Crawler (.onion)
- Returns the page title (or the host name if no page title is available) and the address, with a short description of the site.
- Save links to a database (Not done)
- Output the HTML from a site or save it to an HTML file. (Not done)
- Save the link tree as a JSON file.
- Crawl custom domains
- Check if the link is live
- Built-in Updater
- Build a visual tree of link relationships that can be quickly viewed or saved to a file
...(will be updated)
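As a rough illustration of the title fallback, the JSON link tree, and the text tree described above, here is a minimal sketch. It is not TorBot's actual code: the node layout (`url`, `title`, `children`) and the helper names `display_name`, `make_node`, and `render_tree` are assumptions made for this example only.

```python
import json
from urllib.parse import urlparse


def display_name(title, url):
    """Use the page title if present, otherwise fall back to the host name."""
    if title and title.strip():
        return title.strip()
    return urlparse(url).hostname


def make_node(url, title=None, children=None):
    """Build one node of a hypothetical link tree."""
    return {
        "url": url,
        "title": display_name(title, url),
        "children": children or [],
    }


def render_tree(node, indent=0):
    """Return one indented line per node, walking the tree depth-first."""
    lines = ["  " * indent + f"{node['title']} ({node['url']})"]
    for child in node["children"]:
        lines.extend(render_tree(child, indent + 1))
    return lines


tree = make_node(
    "https://example.com",
    "Example Domain",
    [
        make_node("https://example.com/about", "About"),
        make_node("https://example.com/contact"),  # no title: host name is used
    ],
)

tree_json = json.dumps(tree, indent=2)    # what a JSON save could contain
tree_text = "\n".join(render_tree(tree))  # what a text tree view could print
```

Keeping the tree as plain dictionaries means the same structure can be written out with `json.dumps` and rendered as text without any conversion step.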
Dependencies
- Tor (Optional)
- Python ^3.9
- Poetry (Optional)
Python Dependencies
(see pyproject.toml or requirements.txt for more details)
Installation
TorBot
Using venv
- With Python ^3.9 (venv is part of the standard library):
python -m venv torbot_venv
source torbot_venv/bin/activate
pip install -r requirements.txt
pip install -e .
./main.py --help
Using docker
docker build -t {image_name} .
# Running without Tor
docker run {image_name} poetry run python torbot -u https://example.com --depth 2 --visualize tree --save json --disable-socks5
# Running with Tor
docker run --network="host" {image_name} poetry run python torbot -u https://example.com --depth 2 --visualize tree --save json
Options
usage: Gather and analyze data from Tor sites.
optional arguments:
-u URL, --url URL Specify a website link to crawl
--depth DEPTH Specify max depth of crawler (default 1)
-h, --help Show this help message and exit
--host Set IP address for SOCKS5 proxy (defaults to 127.0.0.1)
--port Set port for SOCKS5 proxy (defaults to 9050)
-v Displays DEBUG level logging, default is INFO
--version Show the current version of TorBot.
--update Update TorBot to the latest stable version
-q, --quiet Prevents display of header and IP address
--save FORMAT Save results in a file. (tree, JSON)
--visualize FORMAT Visualizes tree of data gathered. (tree, JSON, table)
-i, --info Info displays basic info of the scanned site
--disable-socks5 Executes HTTP requests without using SOCKS5 proxy
- NOTE: -u is mandatory for crawling
Read more about torrc here: Torrc
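To make the --host, --port, and --disable-socks5 options more concrete, the sketch below shows one way such settings could be turned into a proxy mapping for an HTTP client. This is an assumption about how a client might be configured, not a description of TorBot's internals; `build_proxies` is a hypothetical helper. The `socks5h://` scheme is the one the `requests` library accepts when SOCKS support is installed, and it resolves host names through the proxy, which .onion addresses need.

```python
def build_proxies(host="127.0.0.1", port=9050, disable_socks5=False):
    """Return a proxy mapping for an HTTP client, or an empty dict for direct requests.

    The defaults mirror the documented option defaults (127.0.0.1 and 9050).
    """
    if disable_socks5:
        # Direct connection: no proxy is configured at all.
        return {}
    proxy_url = f"socks5h://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


tor_proxies = build_proxies()
direct_proxies = build_proxies(disable_socks5=True)

# With the third-party requests library (plus its SOCKS extra) this mapping
# could be passed as: requests.get(url, proxies=tor_proxies)
```

Returning an empty mapping for the direct case keeps the calling code identical whether or not a Tor proxy is in use.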
Curated Features
- Visualization Module Revamp
- Implement BFS Search for webcrawler
- Improve stability (Handle errors gracefully, expand test coverage, etc.)
- Increase test coverage
- Save the most recent search results to a database
- Randomize Tor Connection (Random Header and Identity)
- Keyword/Phrase Search
- Social Media Integration
- Increase anonymity
- Screenshot capture
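For the planned BFS search item, a depth-limited breadth-first traversal could look roughly like the sketch below. It runs against a toy in-memory link graph rather than the network, and it assumes that depth counts link hops from the start URL; `bfs_crawl` and `get_links` are hypothetical names, not part of TorBot.

```python
from collections import deque


def bfs_crawl(start, get_links, max_depth=1):
    """Visit pages breadth-first, following links up to max_depth hops from start.

    get_links is any callable that returns the outgoing links of a URL.
    Returns the visited (url, depth) pairs in visiting order.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real pages.
graph = {
    "a": ["b", "c"],
    "b": ["d", "a"],
    "c": ["d"],
    "d": [],
}

result = bfs_crawl("a", lambda url: graph.get(url, []), max_depth=2)
```

The `seen` set prevents revisiting pages that link back to each other, and the queue guarantees that every page at one depth is visited before any page at the next.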
Developer Guidelines
We welcome contributions to this project! Here are a few guidelines to follow:
- Fork the repository and create a new branch for your contribution.
- Make sure your code passes all tests by running pytest before submitting a pull request to the dev branch.
- Follow the PEP8 style guide for Python code.
- Make sure to add appropriate documentation for any new features or changes.
- When submitting a pull request, please provide a detailed description of the changes made.