scrapinghub.Connection¶
This module is the original Python library for communicating with the Scrapinghub API.
[WARNING] This module is deprecated; use scrapinghub.ScrapinghubClient instead.
Overview¶
First, you connect to Scrapinghub:
>>> from scrapinghub import Connection
>>> conn = Connection('APIKEY')
>>> conn
Connection('APIKEY')
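The constructor also accepts a url and a connection_timeout (see the Connection signature under Module contents), which is useful for pointing at a non-default endpoint; a minimal sketch, reusing the default endpoint value purely for illustration:
>>> conn = Connection('APIKEY', url='https://app.zyte.com/api/', connection_timeout=60)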
You can list the projects available to your account:
>>> conn.project_ids()
[123, 456]
And select a particular project to work with:
>>> project = conn[123]
>>> project
Project(Connection('APIKEY'), 123)
>>> project.id
123
To schedule a spider run (it returns the job id):
>>> project.schedule('myspider', arg1='val1')
u'123/1/1'
To get the list of spiders in the project:
>>> project.spiders()
[
{u'id': u'spider1', u'tags': [], u'type': u'manual', u'version': u'123'},
{u'id': u'spider2', u'tags': [], u'type': u'manual', u'version': u'123'}
]
To get all finished jobs:
>>> jobs = project.jobs(state='finished')
jobs is a JobSet. JobSet objects are iterable, yielding Job objects, so you typically use them like this:
>>> for job in jobs:
... # do something with job
Or, if you just want to get the job ids:
>>> [x.id for x in jobs]
[u'123/1/1', u'123/1/2', u'123/1/3']
To select a specific job:
>>> job = project.job(u'123/1/2')
>>> job.id
u'123/1/2'
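Job also exposes a stop() method (listed under Module contents), so a running job selected this way can be stopped directly; the job id here is illustrative:
>>> running_job = project.job(u'123/1/4')
>>> running_job.stop()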
To retrieve all scraped items from a job:
>>> for item in job.items():
... # do something with item (it's just a dict)
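items() also accepts offset and count arguments (see the Job.items() signature under Module contents), which lets you sample a large job instead of streaming everything; a minimal sketch:
>>> first_ten = list(job.items(count=10))
>>> next_ten = list(job.items(offset=10, count=10))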
To retrieve all log entries from a job:
>>> for logitem in job.log():
... # logitem is a dict with logLevel, message, time
To get job info:
>>> job.info['spider']
'myspider'
>>> job.info['started_time']
'2010-09-28T15:09:57.629000'
>>> job.info['tags']
[]
>>> job.info['fields_count']['description']
1253
To mark a job with tag consumed:
>>> job.update(add_tag='consumed')
To mark several jobs with tag consumed (JobSet also supports the update() method):
>>> project.jobs(state='finished').update(add_tag='consumed')
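update() forwards arbitrary modifiers to the API, so, assuming the endpoint also accepts a remove_tag modifier (not shown in the examples above), undoing the tag would look like this:
>>> job.update(remove_tag='consumed')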
To delete a job:
>>> job.delete()
To delete several jobs (JobSet also supports the delete() method):
>>> project.jobs(state='finished').delete()
Module contents¶
Scrapinghub API Client Library
- exception scrapinghub.legacy.APIError(message, _type=None)¶
Bases: Exception
- ERR_AUTH_ERROR = 'err_auth_error'¶
- ERR_BAD_REQUEST = 'err_bad_request'¶
- ERR_DEFAULT = 'err_default'¶
- ERR_NOT_FOUND = 'err_not_found'¶
- ERR_SERVER_ERROR = 'err_server_error'¶
- ERR_VALUE_ERROR = 'err_value_error'¶
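Failed API calls raise APIError, and the constants above distinguish failure modes; a minimal sketch, assuming the _type constructor argument is stored as an attribute on the exception instance:
>>> from scrapinghub.legacy import APIError
>>> try:
...     project.schedule('nonexistent-spider')
... except APIError as exc:
...     if exc._type == APIError.ERR_NOT_FOUND:
...         print('spider does not exist')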
- class scrapinghub.legacy.Connection(apikey=None, password='', _old_passwd='', url=None, connection_timeout=None)¶
Bases: object
Main class to access Scrapinghub API.
- API_METHODS = {'addversion': 'scrapyd/addversion', 'as_project_slybot': 'as/project-slybot', 'as_spider_properties': 'as/spider-properties', 'eggs_add': 'eggs/add', 'eggs_delete': 'eggs/delete', 'eggs_list': 'eggs/list', 'items': 'items', 'jobs_count': 'jobs/count', 'jobs_delete': 'jobs/delete', 'jobs_list': 'jobs/list', 'jobs_stop': 'jobs/stop', 'jobs_update': 'jobs/update', 'listprojects': 'scrapyd/listprojects', 'log': 'log', 'reports_add': 'reports/add', 'run': 'run', 'schedule': 'schedule', 'spiders': 'spiders/list'}¶
- DEFAULT_ENDPOINT = 'https://app.zyte.com/api/'¶
- property auth¶
- project_ids()¶
Returns a list of projects available for this connection and credentials.
- project_names()¶
- class scrapinghub.legacy.Job(project, id, info)¶
Bases: RequestProxyMixin
- MAX_RETRIES = 180¶
- RETRY_INTERVAL = 60¶
- add_report(key, content, content_type='text/plain')¶
- delete()¶
- property id¶
- items(offset=0, count=None, meta=None)¶
- log(**params)¶
- stop()¶
- update(**modifiers)¶
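add_report() attaches a text blob to a job under a given key, with a content type defaulting to text/plain; a minimal sketch where the key and content are purely illustrative:
>>> job.add_report('summary', 'scrape finished: 1253 items, 0 errors')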
- class scrapinghub.legacy.JobSet(project, **params)¶
Bases: RequestProxyMixin
- count()¶
Returns the total result count for the current filters. The count and offset parameters are not taken into account.
- delete()¶
- stop()¶
- update(**modifiers)¶
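Since JobSet supports count(), stop(), update() and delete(), bulk operations never require iterating jobs one by one; for example, assuming 'running' is a valid value for the state filter:
>>> project.jobs(state='running').count()
3
>>> project.jobs(state='running').stop()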
- class scrapinghub.legacy.Project(connection, projectid)¶
Bases: RequestProxyMixin
- autoscraping_project_slybot(spiders=(), outputfile=None)¶
- autoscraping_spider_properties(spider, start_urls=None)¶
- job(id)¶
- jobs(**params)¶
- property name¶
- schedule(spider, **params)¶
- spiders(**params)¶
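jobs() passes its keyword arguments through to the underlying jobs/list endpoint, so results can be filtered by more than state; assuming the endpoint's spider filter, a per-spider query would look like:
>>> project.jobs(spider='myspider', state='finished')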
- class scrapinghub.legacy.RequestProxyMixin¶
Bases: object