flask-scraper

Scraper engine based on Selenium Python bindings. Uses Flask for its API and ReactJS for its client.

Requirements

Install

  1. Set up Python environment:

    python -m venv .env
    source .env/bin/activate
    python -m pip install -r requirements.txt

    Windows users should use activate.bat instead:

    .env\Scripts\activate.bat
  2. Set up Node.js environment:

    npm install
  3. Build client:

    npm run build
    
  4. Edit composer/standalone-chrome with your VNC password. Replace flaskscraper@123.

  5. Edit docker-compose.yml with your PostgreSQL database details. Afterwards, start the service with:

    sudo docker compose up

    Windows users should omit the sudo:

    docker compose up
  6. Navigate to home page:

    http://localhost:5000
    
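For step 5, a fragment like the following is typical when using the official postgres image. This is a hypothetical sketch, not the repository's actual file: the service name, image tag, and credentials are assumptions — match them to what docker-compose.yml actually defines, and keep them consistent with your DATABASE_URI.

```yaml
# Hypothetical fragment -- align names with the repository's real compose file.
services:
  db:
    image: postgres:13              # assumed tag
    environment:
      POSTGRES_USER: scraper        # assumed credentials; also update
      POSTGRES_PASSWORD: change-me  # DATABASE_URI to match these values
      POSTGRES_DB: postgres
    ports:
      - "5432:5432"
```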


Development & Testing

Environment Variables

  • PYTHONUNBUFFERED: Forces Python output to be unbuffered. Set to true
  • FLASK_ENV: Configures the Flask server. Set to development
  • NODE_ENV: Configures Webpack. Set to development
  • DATABASE_URI: Connection string for the PostgreSQL database. Default postgres://postgress:postgress@localhost/postgres
  • SELENIUM_URI: Base URL of the Selenium API server. Should not include a trailing slash. Default http://localhost:4444
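The Flask server presumably reads these variables at startup. The sketch below shows one way to do that with the standard library; the function name and the defensive stripping of a trailing slash from SELENIUM_URI are assumptions, not the project's actual code.

```python
import os

# Hypothetical configuration loader -- the defaults mirror the variables
# documented above, but this is a sketch, not the project's own code.
def load_config(env=None):
    """Read the documented environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    return {
        "database_uri": env.get(
            "DATABASE_URI", "postgres://postgress:postgress@localhost/postgres"
        ),
        # SELENIUM_URI must not end with a slash, so strip one defensively.
        "selenium_uri": env.get("SELENIUM_URI", "http://localhost:4444").rstrip("/"),
        "debug": env.get("FLASK_ENV", "production") == "development",
    }
```

For example, `load_config({"SELENIUM_URI": "http://selenium:4444/"})["selenium_uri"]` yields `http://selenium:4444`, with the trailing slash removed.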

Directory Structure

.
+-- src
|   +-- client
|   |   +-- static
|   |       +-- index.htm       # react single page app
|   |       +-- favicon.ico
|   |       +-- main.js         # webpack bundle file
|   +-- server
|       +-- app.py              # flask application file
|       +-- conftest.py         # pytest configuration file
|       +-- routes
|           +-- scraper         # scraper scripts are stored here
+-- .browserlist                # configuration used by babel-loader
+-- .babelrc                    # babel-loader configuration file
+-- docker-compose.yml          # docker service configuration
+-- Dockerfile                  # docker file for flask container
+-- package.json                # node.js configuration
+-- setup.py                    # python package configuration
+-- requirements.txt            # python dependencies

React.js client files are found in the src/client directory. These are compiled using Webpack into the src/client/static directory. See README.md for more information.

Flask REST server files are found in the src/server directory. You can add new scripts by creating a folder in the src/server/routes/scraper directory. See README.md for more information.
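A new script under src/server/routes/scraper would typically pair a route handler with a parsing helper. The layout below is a hypothetical sketch (the class and function names are made up here, and wiring into the project's Flask routes is omitted); only the standard library is used so the helper runs stand-alone.

```python
# Hypothetical scraper helper: extract all link targets from a rendered page.
# In a real script, the HTML would come from the Selenium session
# (driver.page_source) reached through SELENIUM_URI.
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href attribute values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page_source):
    """Parse an HTML string and return the link targets it contains."""
    collector = LinkCollector()
    collector.feed(page_source)
    return collector.links
```

A route handler could then return `extract_links(...)` as JSON; that wiring depends on how the project registers its routes, which is not shown here.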

Run Selenium in Docker

docker run \
    --rm -d -p 4444:4444/tcp -p 5900:5900/tcp \
    --name selenium \
    -e SE_NODE_SESSION_TIMEOUT=240 \
    -e SE_NODE_MAX_SESSIONS=16 \
    -v /dev/shm:/dev/shm \
    selenium/standalone-chrome:91.0
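Once the container is up, you can check that the server is ready before pointing SELENIUM_URI at it. The sketch below assumes the Grid `/status` endpoint exposed by Selenium 4 standalone images; the helper name is an invention for illustration.

```python
import json
from urllib.request import urlopen

# Hypothetical readiness check -- assumes the Selenium Grid 4 /status
# endpoint, which reports {"value": {"ready": true, ...}} when up.
def selenium_ready(base_url="http://localhost:4444"):
    """Return True if the Selenium server reports itself ready."""
    try:
        with urlopen(f"{base_url}/status", timeout=5) as resp:
            return bool(json.load(resp)["value"]["ready"])
    except (OSError, KeyError, ValueError):
        # Connection refused, timeout, or an unexpected response body.
        return False
```

If this returns False, check `docker ps` and the container logs before starting the Flask server.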

Start Development Server (Linux / bash)

export NODE_ENV=development
export FLASK_ENV=development
source .env/bin/activate
npm run watch &
python -m flask run

Start Development Server (Windows / powershell)

$env:NODE_ENV = "development"
$env:FLASK_ENV = "development"
.env\Scripts\Activate.ps1
Start-Process -NoNewWindow npm -ArgumentList "run", "watch"
python -m flask run
