GitHub: https://github.com/DedSecInside/TorBot

TorBot is an open-source web scraping tool designed to operate over the Tor network, providing anonymity during the scraping process.

Features

  1. Onion Crawler (.onion)
  2. Returns the page title (or the host name if no title is available) and the address, with a short description of the site.
  3. Save links to a database (Not done)
  4. Output the HTML from a site or save it to an HTML file. (Not done)
  5. Save the link tree as a JSON file.
  6. Crawl custom domains
  7. Check if the link is live
  8. Built-in Updater
  9. Build a visual tree of link relationships that can be quickly viewed or saved to a file

...(will be updated)
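To make features 5 and 9 more concrete, here is a minimal sketch of a link tree held as nested dictionaries, saved as JSON and rendered as indented text. The node layout and the helper names (`make_node`, `render_tree`) are assumptions made for illustration and are not TorBot's internal data model.

```python
import json

# Hypothetical node layout: a URL, a liveness flag, and a list of child nodes.
def make_node(url, live=True, children=None):
    return {"url": url, "live": live, "children": children or []}

def render_tree(node, indent=0):
    """Return the tree as a list of indented text lines."""
    status = "live" if node["live"] else "dead"
    lines = ["  " * indent + f"{node['url']} [{status}]"]
    for child in node["children"]:
        lines.extend(render_tree(child, indent + 1))
    return lines

tree = make_node(
    "https://example.com",
    children=[
        make_node("https://example.com/about"),
        make_node("https://example.com/missing", live=False),
    ],
)

# Serialize the tree; writing tree_json to a file would give a saved link tree.
tree_json = json.dumps(tree, indent=2)

rendered = render_tree(tree)
print("\n".join(rendered))
```

Keeping the tree as plain dictionaries and lists means the same structure can feed both the JSON output and the text view without a conversion step.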

Dependencies

  • Tor (Optional)
  • Python ^3.9
  • Poetry (Optional)

Python Dependencies

(see pyproject.toml or requirements.txt for more details)

Installation

TorBot

Using venv

  • With Python's built-in venv module (TorBot itself requires Python ^3.9):

python -m venv torbot_venv
source torbot_venv/bin/activate
pip install -r requirements.txt
pip install -e .
./main.py --help

Using docker

docker build -t {image_name} .

# Running without Tor
docker run {image_name} poetry run python torbot -u https://example.com --depth 2 --visualize tree --save json --disable-socks5

# Running with Tor
docker run --network="host" {image_name} poetry run python torbot -u https://example.com --depth 2 --visualize tree --save json

Options

usage: Gather and analyze data from Tor sites.

optional arguments:
  -u URL, --url URL     Specify a website link to crawl
  --depth DEPTH         Specify max depth of crawler (default 1)
  -h, --help            Show this help message and exit
  --host                Set IP address for SOCKS5 proxy (defaults to 127.0.0.1)
  --port                Set port for SOCKS5 proxy (defaults to 9050)
  -v                    Displays DEBUG level logging, default is INFO
  --version             Show the current version of TorBot.
  --update              Update TorBot to the latest stable version
  -q, --quiet           Prevents display of header and IP address
  --save FORMAT         Save results in a file. (tree, JSON)
  --visualize FORMAT    Visualizes tree of data gathered. (tree, JSON, table)
  -i, --info            Info displays basic info of the scanned site
  --disable-socks5      Executes HTTP requests without using SOCKS5 proxy

  • NOTE: -u is mandatory for crawling.
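As an illustration of how the proxy-related options fit together, the sketch below rebuilds a few of the flags listed above with `argparse` and derives a SOCKS5 proxy URL from them. It is not TorBot's actual parser: the `build_parser` and `proxy_url` helpers and the `socks5://` URL form are assumptions, and only the flag names and defaults come from the help text.

```python
import argparse

def build_parser():
    """Recreate a subset of the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Gather and analyze data from Tor sites.")
    parser.add_argument("-u", "--url", required=True, help="website link to crawl")
    parser.add_argument("--depth", type=int, default=1, help="max crawl depth")
    parser.add_argument("--host", default="127.0.0.1", help="SOCKS5 proxy address")
    parser.add_argument("--port", type=int, default=9050, help="SOCKS5 proxy port")
    parser.add_argument("--disable-socks5", action="store_true",
                        help="send requests without the SOCKS5 proxy")
    return parser

def proxy_url(args):
    """Return the SOCKS5 proxy URL, or None when the proxy is disabled."""
    if args.disable_socks5:
        return None
    return f"socks5://{args.host}:{args.port}"

# With only -u and --depth given, the proxy falls back to the documented defaults.
args = build_parser().parse_args(["-u", "https://example.com", "--depth", "2"])
print(proxy_url(args))
```

Passing `--disable-socks5` makes `proxy_url` return `None`, which corresponds to the "running without Tor" case.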

Read more about torrc here: Torrc

Curated Features

  • Visualization Module Revamp
  • Implement BFS Search for webcrawler
  • Improve stability (Handle errors gracefully, expand test coverage, etc.)
  • Increase test coverage
  • Save the most recent search results to a database
  • Randomize Tor Connection (Random Header and Identity)
  • Keyword/Phrase Search
  • Social Media Integration
  • Increase anonymity
  • Screenshot capture
Developer Guidelines

We welcome contributions to this project! Here are a few guidelines to follow:

  1. Fork the repository and create a new branch for your contribution.
  2. Make sure your code passes all tests by running pytest before submitting a pull request to the dev branch.
  3. Follow the PEP8 style guide for Python code.
  4. Make sure to add appropriate documentation for any new features or changes.
  5. When submitting a pull request, please provide a detailed description of the changes made.
