scrapinghub.Connection

This module was the first Python library for communicating with the Scrapinghub API.

[WARNING] It is deprecated; please use scrapinghub.ScrapinghubClient instead.
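
For reference, the equivalent calls in the new client look roughly like this (a minimal sketch assuming the scrapinghub.ScrapinghubClient API from python-scrapinghub 2.x; check the current docs for exact signatures):

>>> from scrapinghub import ScrapinghubClient
>>> client = ScrapinghubClient('APIKEY')
>>> project = client.get_project(123)
>>> job = project.jobs.run('myspider')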

Overview

First, you connect to Scrapinghub:

>>> from scrapinghub import Connection
>>> conn = Connection('APIKEY')
>>> conn
Connection('APIKEY')

You can list the projects available to your account:

>>> conn.project_ids()
[123, 456]

And select a particular project to work with:

>>> project = conn[123]
>>> project
Project(Connection('APIKEY'), 123)
>>> project.id
123

To schedule a spider run (it returns the job id):

>>> project.schedule('myspider', arg1='val1')
u'123/1/1'
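
Because schedule() returns the job id, you can capture it and select the job right away, using only calls documented on this page:

>>> job_id = project.schedule('myspider', arg1='val1')
>>> job = project.job(job_id)
>>> job.id
u'123/1/1'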

To get the list of spiders in the project:

>>> project.spiders()
[
  {u'id': u'spider1', u'tags': [], u'type': u'manual', u'version': u'123'},
  {u'id': u'spider2', u'tags': [], u'type': u'manual', u'version': u'123'}
]

To get all finished jobs:

>>> jobs = project.jobs(state='finished')

jobs is a JobSet. JobSet objects are iterable, yielding Job objects as you iterate, so you typically use one like this:

>>> for job in jobs:
...     # do something with job

Or, if you just want to get the job ids:

>>> [x.id for x in jobs]
[u'123/1/1', u'123/1/2', u'123/1/3']
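
jobs() passes its keyword arguments through as filters to the jobs/list API endpoint; as a sketch, assuming that endpoint also accepts spider and count filters:

>>> jobs = project.jobs(state='finished', spider='myspider', count=10)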

To select a specific job:

>>> job = project.job(u'123/1/2')
>>> job.id
u'123/1/2'

To retrieve all scraped items from a job:

>>> for item in job.items():
...     # do something with item (it's just a dict)
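
For example, a minimal sketch that dumps every scraped item to a JSON Lines file (the filename items.jl is arbitrary):

>>> import json
>>> with open('items.jl', 'w') as f:
...     for item in job.items():
...         f.write(json.dumps(item) + '\n')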

To retrieve all log entries from a job:

>>> for logitem in job.log():
...     # logitem is a dict with logLevel, message, time
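
As a sketch, you could keep only the errors; this assumes logLevel carries the numeric values used by Python's logging module (ERROR is 40):

>>> import logging
>>> for logitem in job.log():
...     if logitem['logLevel'] >= logging.ERROR:
...         print(logitem['message'])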

To get job info:

>>> job.info['spider']
'myspider'
>>> job.info['started_time']
'2010-09-28T15:09:57.629000'
>>> job.info['tags']
[]
>>> job.info['fields_count']['description']
1253

To mark a job with the tag consumed:

>>> job.update(add_tag='consumed')

To mark several jobs with the tag consumed (JobSet also supports the update() method):

>>> project.jobs(state='finished').update(add_tag='consumed')
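
Together these calls support a simple incremental-processing loop; the lacks_tag filter below is an assumption about the jobs/list endpoint, so verify it before relying on it:

>>> for job in project.jobs(state='finished', lacks_tag='consumed'):
...     for item in job.items():
...         pass  # process each item here
...     job.update(add_tag='consumed')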

To delete a job:

>>> job.delete()

To delete several jobs (JobSet also supports the delete() method):

>>> project.jobs(state='finished').delete()
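
Job and JobSet also expose a stop() method (see the class reference below); a sketch for cancelling work, assuming 'running' is a valid state filter:

>>> job.stop()
>>> project.jobs(state='running').stop()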

Module contents

Scrapinghub API Client Library

exception scrapinghub.legacy.APIError(message, _type=None)

Bases: exceptions.Exception

ERR_AUTH_ERROR = 'err_auth_error'
ERR_BAD_REQUEST = 'err_bad_request'
ERR_DEFAULT = 'err_default'
ERR_NOT_FOUND = 'err_not_found'
ERR_SERVER_ERROR = 'err_server_error'
ERR_VALUE_ERROR = 'err_value_error'
class scrapinghub.legacy.Connection(apikey=None, password='', _old_passwd='', url=None)

Bases: object

Main class to access Scrapinghub API.

API_METHODS = {'run': 'run', 'addversion': 'scrapyd/addversion', 'jobs_stop': 'jobs/stop', 'schedule': 'schedule', 'as_spider_properties': 'as/spider-properties', 'items': 'items', 'jobs_list': 'jobs/list', 'reports_add': 'reports/add', 'as_project_slybot': 'as/project-slybot', 'jobs_count': 'jobs/count', 'jobs_update': 'jobs/update', 'listprojects': 'scrapyd/listprojects', 'eggs_list': 'eggs/list', 'spiders': 'spiders/list', 'eggs_add': 'eggs/add', 'eggs_delete': 'eggs/delete', 'jobs_delete': 'jobs/delete', 'log': 'log'}
DEFAULT_ENDPOINT = 'https://app.scrapinghub.com/api/'
auth
project_ids()

Returns a list of projects available for this connection and credentials.

project_names()
class scrapinghub.legacy.Job(project, id, info)

Bases: scrapinghub.legacy.RequestProxyMixin

MAX_RETRIES = 180
RETRY_INTERVAL = 60
add_report(key, content, content_type='text/plain')
delete()
id
items(offset=0, count=None, meta=None)
log(**params)
stop()
update(**modifiers)
class scrapinghub.legacy.JobSet(project, **params)

Bases: scrapinghub.legacy.RequestProxyMixin

count()

Returns the total number of results matching the current filters. The count and offset parameters are not taken into account.

delete()
stop()
update(**modifiers)
class scrapinghub.legacy.Project(connection, projectid)

Bases: scrapinghub.legacy.RequestProxyMixin

autoscraping_project_slybot(spiders=(), outputfile=None)
autoscraping_spider_properties(spider, start_urls=None)
job(id)
jobs(**params)
name
schedule(spider, **params)
spiders(**params)
class scrapinghub.legacy.RequestProxyMixin

Bases: object