# Websucker
Agent for sucking the Web.
## Features
- Crawling of best domains
- Crawling of unvisited domains
- Text mining
- Evaluation of domains
- Daily report
- Database Summary
## Requirements
- Python 3
- A running Cassandra 3.11 instance
- Beanstalkd (optional), used as a work queue
## Installation
Create and activate a virtual environment:

    python -m virtualenv ./venv
    source ./venv/bin/activate

Install the package:

    pip install https://git.kemt.fei.tuke.sk/dano/websucker-pip/archive/master.zip

### Initialize and setup database
If you have Cassandra installed, first initialize the database schema with the `cqlsh` command. The schema is in the `schema.sql` file.

If the database runs on another machine, point Websucker at it using environment variables:

    export CASSANDRA_HOST=localhost
    export CASSANDRA_PORT=9142
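
As a rough illustration of how these settings fit together, the Python sketch below reads `CASSANDRA_HOST` and `CASSANDRA_PORT` and builds a `cqlsh` command line that would load `schema.sql`. The helper names and the fallback values (`localhost`, and Cassandra's default native port 9042) are assumptions made for this example, not necessarily the defaults Websucker itself applies.

```python
import os
import shlex


def cassandra_endpoint(env=None):
    """Read host and port from the environment.

    The fallbacks are assumptions for this sketch, not Websucker's defaults.
    """
    env = os.environ if env is None else env
    host = env.get("CASSANDRA_HOST", "localhost")
    port = int(env.get("CASSANDRA_PORT", "9042"))
    return host, port


def schema_load_command(schema_path="schema.sql", env=None):
    """Build the argument list: cqlsh -f FILE HOST PORT."""
    host, port = cassandra_endpoint(env)
    return ["cqlsh", "-f", schema_path, host, str(port)]


# Show the command for the values used in the export lines above.
example_env = {"CASSANDRA_HOST": "localhost", "CASSANDRA_PORT": "9142"}
print(shlex.join(schema_load_command(env=example_env)))
```

The returned list can be passed to `subprocess.run`, or the printed line can be pasted into a shell.
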
## Usage
    websuck --help

### Create initial domain list
Save the list of domains to a file, for example:

    echo www.sme.sk > domains.txt
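
For a longer seed list, it can help to clean the entries before saving them. The sketch below is one possible way to do that in Python. It assumes the file holds one host name per line, which is an extrapolation from the single-domain example above, and the function name `normalize_domains` is hypothetical.

```python
from urllib.parse import urlsplit


def normalize_domains(entries):
    """Return unique lowercase host names, keeping first-seen order."""
    seen = set()
    result = []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue
        # urlsplit only recognizes the host part after a "//" separator.
        if "//" not in entry:
            entry = "//" + entry
        host = urlsplit(entry).hostname  # lowercased, without port or path
        if host and host not in seen:
            seen.add(host)
            result.append(host)
    return result


seeds = ["www.sme.sk", " https://WWW.SME.sk/path ", "", "example.org"]
for domain in normalize_domains(seeds):
    print(domain)
```

Redirecting the printed output to `domains.txt` gives a file in the same shape as the one created with `echo`.
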
### Visit initial domains in file
    websuck --visit file domains.txt

### Visit unvisited domains
    websuck --visit unvisited 100