How to Install Scrapy, a Web Crawling Tool, on Ubuntu

Scrapy is open-source software used for extracting data from websites. The Scrapy framework is developed in Python, and it performs the crawling job in a fast, simple, and extensible way. We have created a Virtual Machine (VM) in VirtualBox with Ubuntu 14.04 LTS installed on it.

Install Scrapy

Scrapy depends on Python, the Python development libraries, and pip. The latest Python version is pre-installed on Ubuntu, so we only have to install pip and the Python development libraries before installing Scrapy.

Pip is the replacement for easy_install as the Python package installer. It is used for the installation and management of Python packages.

To install pip package, run:

$ sudo apt-get install python-pip
[Figure 1: Pip installation]

Next, we have to install the Python development libraries using the following command. If this package is not installed, the installation of the Scrapy framework generates an error about the Python.h header file.

$ sudo apt-get install python-dev
[Figure 2: Python development libraries]

The Scrapy framework can be installed either from a deb package or from source code. Here, we installed it using pip (the Python package manager).

$ sudo pip install scrapy
[Figure 3: Scrapy installation]

A successful installation of Scrapy takes some time.

[Figure 4: Successful installation of the Scrapy framework]

Data Extraction Using the Scrapy Framework (Basic Tutorial)

We will use Scrapy to extract the names of the stores that provide cards from the fatwallet.com website. First of all, we created a new Scrapy project named “store_name” using the following command.

$ sudo scrapy startproject store_name
[Figure 5: Creation of a new project in the Scrapy framework]

The above command creates a directory titled “store_name” at the current path. This main project directory contains the files and folders shown in Figure 6.

$ sudo ls -lR store_name

[Figure 6: Contents of the store_name project]

A brief description of each file/folder is given below:

  • scrapy.cfg is the project configuration file.
  • store_name/ is another directory inside the main directory; it contains the Python code of the project.
  • store_name/items.py defines the items that will be extracted by the spider.
  • store_name/pipelines.py is the pipelines file, used for post-processing extracted items.
  • store_name/settings.py holds the settings of the store_name project.
  • store_name/spiders/ is the directory that contains the spiders for crawling.
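For reference, the generated scrapy.cfg is short. In Scrapy releases of that era it looked roughly like the following (the exact contents vary by version):

```
# Automatically created by: scrapy startproject
[settings]
default = store_name.settings

[deploy]
project = store_name
```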

As we are interested in extracting the store names of the cards from the fatwallet.com site, we updated the contents of the store_name/items.py file as shown below.

import scrapy

class StoreNameItem(scrapy.Item):
    name = scrapy.Field()   # the name of a card store, to be extracted by the spider

After this, we have to write a new spider under the store_name/spiders/ directory of the project. A spider is a Python class that consists of the following mandatory attributes:

  1. The name of the spider (name).
  2. The starting URLs the spider crawls (start_urls).
  3. The parse method, which contains the regexes for extracting the desired items from the page response. The parse method is the most important part of a spider.

We created the spider “store_name.py” under the store_name/spiders/ directory and added the following Python code to extract the store names from the fatwallet.com site. The output of the spider is written to the file StoreName.txt.

from scrapy.selector import Selector
from scrapy.spider import BaseSpider
import re

class StoreNameSpider(BaseSpider):
    name = "storename"
    allowed_domains = ["fatwallet.com"]
    start_urls = ["http://fatwallet.com/cash-back-shopping/"]

    def parse(self, response):
        output = open('StoreName.txt', 'w')
        resp = Selector(response)

        # Select every table row that lists a store, covering all
        # class variants used on the page.
        tags = resp.xpath('//tr[@class="storeListRow"] | '
                          '//tr[@class="storeListRow even"] | '
                          '//tr[@class="storeListRow even last"] | '
                          '//tr[@class="storeListRow last"]').extract()
        for i in tags:
            i = i.encode('utf-8', 'ignore').strip()
            match = re.search(r'class="storeListStoreName">.*?<', i, re.I | re.S)
            if match:
                store_name = match.group()
                # Keep only the text between '>' and '<'
                store_name = re.search(r'>.*?<', store_name, re.S).group()
                store_name = store_name.strip('><')
                # Decode the HTML entity for '&'
                store_name = store_name.replace('&amp;', '&')
                output.write(store_name + "\n")
        output.close()
[Figure 7: Output of the spider code]
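The regex chain inside parse can be exercised on its own, outside Scrapy. The snippet below is a minimal sketch: the sample table row is made up, mirroring the storeListRow markup the spider expects, and it applies the same extraction steps.

```python
import re

# A made-up table row mirroring the markup the spider's XPath selects.
row = ('<tr class="storeListRow">'
       '<td><a class="storeListStoreName">Barnes &amp; Noble</a></td>'
       '</tr>')

store_name = ''
match = re.search(r'class="storeListStoreName">.*?<', row, re.I | re.S)
if match:
    store_name = match.group()                                   # attribute plus the text up to the next tag
    store_name = re.search(r'>.*?<', store_name, re.S).group()   # '>Barnes &amp; Noble<'
    store_name = store_name.strip('><')                          # drop the tag delimiters
    store_name = store_name.replace('&amp;', '&')                # decode the HTML entity

print(store_name)  # Barnes & Noble
```

Once the spider itself is in place, it would be run from the project directory with `scrapy crawl storename` (the value of the spider's name attribute).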

NOTE: The purpose of this tutorial is only the understanding of the Scrapy framework.
