Daniel Hládek
Websucker
Agent for Sucking of the Web
Features
- Crawling of best domains
- Crawling of unvisited domains
- Text mining
- Evaluation of domains
- Daily report
- Database Summary
Requirements
- Python 3
- A running Cassandra 3.11 instance
- Optional: Beanstalkd for the work queue
Installation
Activate virtual environment:
python -m virtualenv ./venv
source ./venv/bin/activate
Install package:
pip install https://git.kemt.fei.tuke.sk/dano/websucker-pip/archive/master.zip
Initialize and set up the database
If you have Cassandra installed, first initialize the database schema using the cqlsh command. The schema can be found in the schema.sql file.
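The schema step could be scripted roughly as follows. This is only a sketch: it assumes cqlsh is on the PATH and that schema.sql is in the current directory, and it relies on the standard cqlsh -f option for executing a file of CQL statements. The function names cqlsh_schema_command and init_schema are made up for this example and are not part of websucker.

```python
import subprocess


def cqlsh_schema_command(schema_path="schema.sql", host=None, port=None):
    """Build the argument list for loading a CQL schema file with cqlsh.

    cqlsh accepts an optional host and port as positional arguments
    after its options: cqlsh [options] [host [port]].
    """
    command = ["cqlsh", "-f", schema_path]
    if host is not None:
        command.append(host)
        if port is not None:
            command.append(str(port))
    return command


def init_schema(schema_path="schema.sql", host=None, port=None):
    """Run cqlsh on the schema file (hypothetical helper).

    Raises subprocess.CalledProcessError if cqlsh exits with an error.
    """
    subprocess.run(cqlsh_schema_command(schema_path, host, port), check=True)
```

Calling init_schema() with no arguments would run cqlsh -f schema.sql against cqlsh's default host and port.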
If the database runs on another machine, configure the connection using environment variables:
export CASSANDRA_HOST=localhost
export CASSANDRA_PORT=9142
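For illustration, a program might pick up these two variables as sketched below. How websucker itself reads them is not shown in this README, so the function name cassandra_settings and the fallback values are assumptions (9042 is Cassandra's usual native transport port; the example above uses 9142).

```python
import os


def cassandra_settings(env=None):
    """Return (host, port) taken from CASSANDRA_HOST and CASSANDRA_PORT.

    The defaults are assumptions for this sketch, not values confirmed
    by websucker.
    """
    env = os.environ if env is None else env
    host = env.get("CASSANDRA_HOST", "localhost")
    # Environment variables are strings, so the port must be converted.
    port = int(env.get("CASSANDRA_PORT", "9042"))
    return host, port
```

Passing a plain dictionary instead of os.environ makes the function easy to try out without changing the shell environment.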
Usage
websuck --help
Create initial domain list
Save the list of domains to a file, e.g.
echo www.sme.sk > domains.txt
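For a longer list, the file could be generated from Python. The sketch below assumes the format is one domain per line, as the echo example suggests; the helper name write_domain_list and the cleanup it does (trimming whitespace, lowercasing, dropping duplicates) are illustrative choices, not documented websucker behavior.

```python
def write_domain_list(domains, path):
    """Write unique, trimmed domain names to path, one per line.

    Returns the list of names actually written, in first-seen order.
    """
    seen = set()
    cleaned = []
    for domain in domains:
        name = domain.strip().lower()
        # Skip blank entries and names that were already added.
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    with open(path, "w", encoding="utf-8") as handle:
        for name in cleaned:
            handle.write(name + "\n")
    return cleaned
```

For example, write_domain_list(["www.sme.sk"], "domains.txt") produces the same file as the echo command above.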
Visit initial domains in file
websuck --visit file domains.txt
Visit unvisited domains
websuck --visit unvisited 100