
Websucker

Agent for Sucking the Web

Features

  • Crawling of the best domains
  • Crawling of unvisited domains
  • Text mining
  • Evaluation of domains
  • Daily report
  • Database summary

Requirements

  • Python 3
  • A running Cassandra 3.11 instance
  • Optionally, Beanstalkd for the work queue

Installation

Create and activate a virtual environment:

python -m virtualenv ./venv
source ./venv/bin/activate

Install package:

pip install https://git.kemt.fei.tuke.sk/dano/websucker-pip/archive/master.zip

Initialize and set up the database

If you have Cassandra installed, first initialize the database schema using the cqlsh command. The schema can be found in the schema.sql file.
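Assuming cqlsh is on your PATH and can reach the node, loading the schema is a one-liner (the -f flag executes CQL statements from a file):

```shell
# Load the schema into a local Cassandra node.
# For a remote node, append its host and port: cqlsh -f schema.sql <host> <port>
cqlsh -f schema.sql
```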

Set up the database connection using environment variables (for example, if Cassandra runs on another machine):

export CASSANDRA_HOST=localhost
export CASSANDRA_PORT=9142
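For a remote setup, the same variables point at the other machine (db.example.org below is a placeholder hostname, not part of the project):

```shell
# Placeholder values: substitute your own Cassandra host and port.
export CASSANDRA_HOST=db.example.org
export CASSANDRA_PORT=9142
# Sanity check that the variables are visible to child processes:
echo "Cassandra at ${CASSANDRA_HOST}:${CASSANDRA_PORT}"
```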

Usage

websuck --help

Create initial domain list

Save the list of domains to a file, e.g.:

echo www.sme.sk > domains.txt
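To seed the crawler with more than one domain, write one domain per line (www.example.org below is a placeholder, not a suggested target):

```shell
# One domain per line; the agent reads the file line by line.
printf '%s\n' www.sme.sk www.example.org > domains.txt
cat domains.txt
```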

Visit the initial domains in the file

websuck --visit file domains.txt

Visit unvisited domains

websuck --visit unvisited 100
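Crawling unvisited domains lends itself to scheduling. A crontab sketch, assuming the virtualenv from the installation step (replace /path/to/venv with your actual location):

```shell
# Crontab sketch: crawl 100 unvisited domains every night at 02:00.
# /path/to/venv is a placeholder for wherever you created the virtualenv.
0 2 * * * /path/to/venv/bin/websuck --visit unvisited 100
```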