Quickstart

Requirements

  • Python 2.7 or above

Installation

The quick way:

pip install scrapinghub

It is recommended to install the library with MessagePack support, it provides better response time and improved bandwidth usage:

pip install scrapinghub[msgpack]

Basic usage

Instantiate a new client with your Scrapinghub API key:

>>> from scrapinghub import ScrapinghubClient
>>> apikey = '84c87545607a4bc0****************' # your API key as a string
>>> client = ScrapinghubClient(apikey)

Note

Your Scrapinghub API key is available at https://app.scrapinghub.com/account/apikey after you sign up with the service.

List your deployed projects:

>>> client.projects.list()
[123, 456]

Run a new job for one of your projects:

>>> project = client.get_project(123)
>>> project.jobs.run('spider1', job_args={'arg1': 'val1'})
<scrapinghub.client.Job at 0x106ee12e8>>

Access your job’s output data:

>>> job = client.get_job('123/1/2')
>>> for item in job.items.iter():
...     print(item)
{
    'name': ['Some item'],
    'url': 'http://some-url/item.html',
    'value': 25,
}
{
    'name': ['Some other item'],
    'url': 'http://some-url/other-item.html',
    'value': 35,
}
...

Checkout all the other features in Overview or in the more detailed API Reference.

Tests

The package is covered with integration tests based on VCR.py library: there are recorded cassettes files in tests/*/cassettes used instead of HTTP requests to real services, it helps to simplify and speed up development.

By default, tests use VCR.py once mode to:

  • replay previously recorded interactions.
  • record new interactions if there is no cassette file.
  • cause an error to be raised for new requests if there is a cassette file.

It means that if you add new integration tests and run all tests as usual, only new cassettes will be created, all existing cassettes will stay unmodified.

To ignore existing cassettes and use real services, please provide a flag:

py.test --ignore-cassettes

If you want to update/recreate all the cassettes from scratch, please use:

py.test --update-cassettes

Note that internally the above command erases the whole folder with cassettes.