
Websucker

Agent for Sucking the Web

Features

  • Crawling of the best domains
  • Crawling of unvisited domains
  • Text mining
  • Evaluation of domains
  • Daily report
  • Database summary

Requirements

  • Python 3
  • A running Cassandra 3.11 instance
  • Optionally, Beanstalkd for the work queue

Installation

Create and activate a virtual environment:

python -m virtualenv ./venv
source ./venv/bin/activate

Install package:

pip install https://git.kemt.fei.tuke.sk/dano/websucker-pip/archive/master.zip

Initialize and set up the database

If you have Cassandra installed, first initialize the database schema using the cqlsh command. The schema can be found in the schema.sql file.
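Assuming cqlsh is on your PATH and can reach the node, loading the schema is a one-liner (the -f flag executes CQL statements from a file):

```shell
# Load the schema into a local Cassandra node.
# For a remote node, append its host and port: cqlsh -f schema.sql <host> <port>
cqlsh -f schema.sql
```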

Set up the database connection using environment variables (for example, if Cassandra runs on another machine):

export CASSANDRA_HOST=localhost
export CASSANDRA_PORT=9142
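For a remote setup, the same variables point at the other machine (db.example.org below is a placeholder hostname, not part of the project):

```shell
# Placeholder values: substitute your own Cassandra host and port.
export CASSANDRA_HOST=db.example.org
export CASSANDRA_PORT=9142
# Sanity check that the variables are visible to child processes:
echo "Cassandra at ${CASSANDRA_HOST}:${CASSANDRA_PORT}"
```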

Usage

websuck --help

Create initial domain list

Save the list of domains to a file, e.g.:

echo www.sme.sk > domains.txt
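To seed the crawler with more than one domain, write one domain per line (www.example.org below is a placeholder, not a suggested target):

```shell
# One domain per line; the agent reads the file line by line.
printf '%s\n' www.sme.sk www.example.org > domains.txt
cat domains.txt
```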

Visit the initial domains in the file

websuck --visit file domains.txt

Visit unvisited domains

websuck --visit unvisited 100
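Crawling unvisited domains lends itself to scheduling. A crontab sketch, assuming the virtualenv from the installation step (replace /path/to/venv with your actual location):

```shell
# Crontab sketch: crawl 100 unvisited domains every night at 02:00.
# /path/to/venv is a placeholder for wherever you created the virtualenv.
0 2 * * * /path/to/venv/bin/websuck --visit unvisited 100
```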