# Websucker

Agent for sucking the Web

## Features

- Crawling of best domains
- Crawling of unvisited domains
- Text mining
- Evaluation of domains
- Daily report
- Database summary

## Requirements

- Python 3
- A running Cassandra 3.11 instance
- Beanstalkd for the work queue (optional)

## Installation

Create and activate a virtual environment:

    python -m virtualenv ./venv
    source ./venv/bin/activate

Install the package:

    pip install https://git.kemt.fei.tuke.sk/dano/websucker-pip/archive/master.zip

### Initialize and set up the database

If Cassandra is installed, first initialize the database schema with the `cqlsh` command. The schema is in the `schema.sql` file.

If Cassandra runs on another machine, point Websucker at it with environment variables:

    export CASSANDRA_HOST=localhost
    export CASSANDRA_PORT=9142

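With the variables above in place, loading the schema could be sketched as below. The helper name `init_schema` is hypothetical, and the fallback port 9042 is Cassandra's default native transport port rather than a value taken from this README. The sketch relies on `cqlsh` accepting a file through `-f`, followed by an optional host and port.

```shell
# Hypothetical helper (not part of Websucker): load the schema with cqlsh.
# $1 = path to the schema file (defaults to schema.sql in the current directory).
init_schema() {
    schema_file="${1:-schema.sql}"
    # Fall back to a local Cassandra on its default native port (assumption).
    host="${CASSANDRA_HOST:-localhost}"
    port="${CASSANDRA_PORT:-9042}"

    # Fail early with a clear message instead of a cqlsh error.
    if [ ! -f "$schema_file" ]; then
        echo "schema file not found: $schema_file" >&2
        return 1
    fi

    # cqlsh usage: cqlsh [options] [host [port]]
    cqlsh -f "$schema_file" "$host" "$port"
}
```

Calling `init_schema` with no argument looks for `schema.sql` in the current directory.
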
## Usage

    websuck --help

### Create initial domain list

Save the list of domains to a file, e.g.:

    echo www.sme.sk > domains.txt

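For more than one seed domain, a minimal sketch might look like this. It assumes the file holds one domain per line, which the single-domain example does not confirm. `www.example.com` and `www.example.org` are placeholders.

```shell
# Assumption: domains.txt holds one domain per line.
# www.example.com and www.example.org are placeholder names.
printf '%s\n' www.sme.sk www.example.com www.example.org www.sme.sk > domains.txt

# Drop duplicate entries in place so each seed domain is listed once.
sort -u -o domains.txt domains.txt
```

Sorting with `-u` also keeps the list stable when more domains are appended later and the command is run again.
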
### Visit initial domains in file

    websuck --visit file domains.txt

### Visit unvisited domains

    websuck --visit unvisited 100

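To keep a crawl going, the same command can be repeated in batches. The loop below is only a sketch. The helper name `crawl_batches`, the default of three batches, and the choice to stop on the first failure are assumptions, as is reading the trailing number as the count of domains per run.

```shell
# Hypothetical helper (not part of Websucker): run several crawl batches in a row.
# $1 = number of batches (default 3)
# $2 = value passed after "unvisited" (default 100, assumed to be a domain count)
crawl_batches() {
    batches="${1:-3}"
    per_batch="${2:-100}"
    i=1
    while [ "$i" -le "$batches" ]; do
        echo "batch $i of $batches"
        # Same invocation as in this section; stop if websuck reports an error.
        websuck --visit unvisited "$per_batch" || return 1
        i=$((i + 1))
    done
}
```

For example, `crawl_batches 5 100` would run the command five times, one after another.
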