Quickstart

Requirements

  • Python 2.7 or above

Installation

The quick way:

pip install scrapinghub

You can also install the library with MessagePack support, which provides better response times and lower bandwidth usage:

pip install scrapinghub[msgpack]

Basic usage

Instantiate a new client:

>>> from scrapinghub import ScrapinghubClient
>>> client = ScrapinghubClient('APIKEY')

Work with your projects:

>>> client.projects.list()
[123, 456]

Run new jobs from the client:

>>> project = client.get_project(123)
>>> project.jobs.run('spider1', job_args={'arg1': 'val1'})
<scrapinghub.client.Job at 0x106ee12e8>

Access your job's data:

>>> job = client.get_job('123/1/2')
>>> for item in job.items.iter():
...     print(item)
{
    'name': ['Some other item'],
    'url': 'http://some-url/other-item.html',
    'size': 35000,
}
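
Because items arrive as plain dictionaries like the one printed above, ordinary iterator tools can be applied to them. The following is a minimal sketch, not part of the library: fake_items is a hypothetical stand-in for a job's item iterator (so the example runs without an API key), and first_n is an illustrative helper name.

```python
from itertools import islice


def fake_items():
    """Hypothetical stand-in for a job's item iterator (made-up data)."""
    yield {'name': ['Some item'], 'url': 'http://some-url/item.html', 'size': 100000}
    yield {'name': ['Some other item'], 'url': 'http://some-url/other-item.html', 'size': 35000}
    yield {'name': ['Third item'], 'url': 'http://some-url/third-item.html', 'size': 500}


def first_n(items, n):
    """Return at most n items from any iterable without consuming the rest."""
    return list(islice(items, n))


# Take only the first two items and sum their 'size' fields.
sample = first_n(fake_items(), 2)
total_size = sum(item['size'] for item in sample)
print(len(sample), total_size)
```

With a real job, the same helper could presumably be handed the job's item iterator in place of fake_items().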

Many more features await you.

Tests

The package is covered by integration tests based on the VCR.py library. Recorded cassette files in tests/*/cassettes are used instead of HTTP requests to real services, which simplifies and speeds up development.

By default, tests use VCR.py's once mode to:

  • replay previously recorded interactions.
  • record new interactions if there is no cassette file.
  • cause an error to be raised for new requests if there is a cassette file.

This means that if you add new integration tests and run all tests as usual, only new cassettes will be created; all existing cassettes will stay unmodified.
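
The three cases listed above can be summarised as a small decision function. This is only an illustrative sketch of the behaviour described here, not VCR.py's actual implementation; the function name once_mode_action and its two boolean parameters are assumptions made for the example.

```python
def once_mode_action(cassette_exists, request_recorded):
    """Describe what the once mode does for a single request.

    Hypothetical helper: it mirrors the three bullet points, nothing more.
    """
    if not cassette_exists:
        # No cassette file yet: the interaction is recorded.
        return 'record'
    if request_recorded:
        # Cassette exists and contains this request: replay it.
        return 'replay'
    # Cassette exists but the request is new: an error is raised.
    return 'error'


print(once_mode_action(cassette_exists=False, request_recorded=False))
print(once_mode_action(cassette_exists=True, request_recorded=True))
print(once_mode_action(cassette_exists=True, request_recorded=False))
```

The last case is why a new request added to an already-recorded test fails until its cassette is refreshed.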

To ignore existing cassettes and use real services, pass the flag:

py.test --ignore-cassettes

To update or recreate all the cassettes from scratch, use:

py.test --update-cassettes

Note that internally this command erases the entire folder of cassettes.
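
To make the consequence of erasing the folder concrete, here is a rough sketch that runs on a throwaway temporary directory. It does not reproduce the test suite's own code: erase_cassettes and the file name old-recording.gz are hypothetical, chosen only for illustration.

```python
import os
import shutil
import tempfile


def erase_cassettes(cassettes_dir):
    """Remove the cassette folder and everything in it (hypothetical helper)."""
    shutil.rmtree(cassettes_dir, ignore_errors=True)


# Build a temporary folder holding one stale recording.
root = tempfile.mkdtemp()
cassettes = os.path.join(root, 'cassettes')
os.makedirs(cassettes)
with open(os.path.join(cassettes, 'old-recording.gz'), 'w') as handle:
    handle.write('stale recording')

existed_before = os.path.exists(cassettes)
erase_cassettes(cassettes)
exists_after = os.path.exists(cassettes)

# Clean up the temporary root as well.
shutil.rmtree(root)
print(existed_before, exists_after)
```

Since every existing recording is lost, it may be worth committing or backing up the cassettes before running the update.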