# Websucker

Agent for sucking the Web

## Features

- Crawling of best domains
- Crawling of unvisited domains
- Text mining
- Evaluation of domains
- Daily report
- Database summary

## Requirements

- Python 3
- A running Cassandra 3.11 instance
- Beanstalkd (optional), used as the work queue

## Installation

Create and activate a virtual environment:

    python -m virtualenv ./venv
    source ./venv/bin/activate

Install package:

    pip install https://git.kemt.fei.tuke.sk/dano/websucker-pip/archive/master.zip

### Initialize and setup database

If you have Cassandra installed, first initialize the database schema with the cqlsh command. The schema is in the schema.sql file.
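
As a minimal sketch of that step, assuming cqlsh is on the PATH and schema.sql sits in the current directory (both are assumptions, adjust them to your setup), the schema could be loaded like this:

```shell
# Assumption: schema.sql is in the current directory; adjust the path if needed.
SCHEMA_FILE="schema.sql"

if ! command -v cqlsh >/dev/null 2>&1; then
    schema_status="cqlsh not found on PATH"
elif [ ! -f "$SCHEMA_FILE" ]; then
    schema_status="$SCHEMA_FILE not found"
elif cqlsh -f "$SCHEMA_FILE"; then
    # -f makes cqlsh execute the statements in the file and then exit
    schema_status="schema loaded"
else
    schema_status="cqlsh reported an error"
fi
echo "$schema_status"
```

For a database on another machine, cqlsh also accepts the host and port as positional arguments.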


If the database runs on another machine, set its host and port with environment variables:


    export CASSANDRA_HOST=localhost
    export CASSANDRA_PORT=9142


## Usage

    websuck --help


### Create initial domain list

Save the list of domains to a file, for example:

    echo www.sme.sk > domains.txt
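
For more than one seed domain, a sketch like the following writes one domain per line and drops duplicates. The one-domain-per-line layout is an assumption based on the example above, and www.example.com is only a placeholder.

```shell
# Assumption: the file holds one domain per line.
# www.example.com is a placeholder; replace it with real seed domains.
printf '%s\n' www.sme.sk www.example.com www.sme.sk | sort -u > domains.txt

# Show the resulting list
cat domains.txt
```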


### Visit initial domains in file


    websuck --visit file domains.txt

### Visit unvisited domains

    websuck --visit unvisited 100