Websucker

Agent for Sucking the Web

Features

  • Crawling of the best-ranked domains
  • Crawling of unvisited domains
  • Text mining
  • Evaluation of domains
  • Daily reports
  • Database summary

Requirements

  • Python 3
  • A running Cassandra 3.11 instance
  • Optional: Beanstalkd for the work queue

Installation

Create and activate a virtual environment:

python -m virtualenv ./venv
source ./venv/bin/activate

Install the package:

pip install https://git.kemt.fei.tuke.sk/dano/websucker-pip/archive/master.zip

Initialize and set up the database

If you have Cassandra installed, first initialize the database schema with the cqlsh command; the schema can be found in the schema.sql file.
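A minimal way to load the schema, assuming cqlsh is on your PATH, Cassandra is running locally, and schema.sql comes from the repository checkout:

```shell
# Load the schema file into the local Cassandra instance.
cqlsh localhost -f schema.sql
```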

If the database runs on another machine, point the crawler at it with environment variables:

export CASSANDRA_HOST=localhost
export CASSANDRA_PORT=9142
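If the variables may already be set elsewhere, you can keep existing values and fall back to the defaults above; this is plain shell parameter expansion, not a feature of websucker itself:

```shell
# Keep any existing value, otherwise fall back to the defaults above.
export CASSANDRA_HOST="${CASSANDRA_HOST:-localhost}"
export CASSANDRA_PORT="${CASSANDRA_PORT:-9142}"
```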

Usage

websuck --help

Create an initial domain list

Save the list of domains to a file, one domain per line, e.g.:

echo www.sme.sk > domains.txt
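Since the seed file is one domain per line, several domains can be written at once; the second domain here is only an illustrative entry:

```shell
# Write a seed list with one domain per line
# (www.aktuality.sk is just an illustrative second entry).
printf '%s\n' www.sme.sk www.aktuality.sk > domains.txt
```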

Visit the initial domains from the file:

websuck --visit file domains.txt

Visit unvisited domains (here, a batch of 100):

websuck --visit unvisited 100
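A longer crawl can simply repeat the batch command above; this loop is only a sketch (websucker documents no daemon mode), and it assumes websuck is installed and the database is reachable:

```shell
# Sketch: crawl unvisited domains in three repeated batches of 100.
for batch in 1 2 3; do
    websuck --visit unvisited 100
done
```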