Movie review scraper to generate data for natural language processing
Text data in the form of sentences is useful for many applications, including NLP-based sentiment analysis and simple classification models. Gathering such data by hand is often tedious and frustrating. This project generates it for you in a few simple steps.
- Python
- Scrapy (Python package)
pip install scrapy
This project was created with Scrapy 2.5.0; later versions should also work.
First, go to the Rotten Tomatoes website and search for the movie whose reviews you want. Open the critic or audience reviews and click "View All". The URL now shown in your browser is the starting point for the crawler. Copy it and open Scrapy Web Crawler/crawler/crawler/spiders/review_spider.py. Paste the copied URL into the start_urls list and also into the next_page variable below it.
To crawl the reviews and store them in a CSV file, navigate to the project directory in a terminal and run:
scrapy crawl reviews -o reviews.csv
Here, you can replace reviews.csv with any filename of your choice. By default, the crawler stores the first 500 reviews of the movie (25 pages). To change this, open the spider file mentioned above and change the number of pages from 25 to the required amount.
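Once the crawl has finished, the exported CSV can be loaded for downstream NLP work. The following is a minimal sketch using only the Python standard library. The function name `load_reviews` and the column name `review` are assumptions made for illustration; check the header row of your own CSV file and pass the matching column name.

```python
import csv


def load_reviews(csv_path, text_column="review"):
    """Return the non-empty values of one column from a CSV export.

    `text_column` is a hypothetical default; the real column name depends
    on the fields the spider yields.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early with a helpful message if the column is missing.
        if text_column not in (reader.fieldnames or []):
            raise KeyError(
                f"column {text_column!r} not found; "
                f"available columns: {reader.fieldnames}"
            )
        reviews = []
        for row in reader:
            text = (row[text_column] or "").strip()
            if text:  # skip rows with a blank review
                reviews.append(text)
        return reviews
```

For example, `load_reviews("reviews.csv")` would return a list of review strings that can be passed to a tokenizer or classifier.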
- Instagram - @AnishMulay
- Email - f20180907@goa.bits-pilani.ac.in