scrapinghub.Connection
This module was the first Python client library for the Scrapinghub API.
[WARNING] It is deprecated; use scrapinghub.ScrapinghubClient instead.
Overview
First, you connect to Scrapinghub:
>>> from scrapinghub import Connection
>>> conn = Connection('APIKEY')
>>> conn
Connection('APIKEY')
You can list the projects available to your account:
>>> conn.project_ids()
[123, 456]
And select a particular project to work with:
>>> project = conn[123]
>>> project
Project(Connection('APIKEY'), 123)
>>> project.id
123
To schedule a spider run (it returns the job id):
>>> project.schedule('myspider', arg1='val1')
u'123/1/1'
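The returned job id encodes three numbers as project/spider/run (here 123 is the project id, the first 1 the spider, the second 1 the run). A small helper can split it into its parts; note that parse_job_id and the JobId field names below are our own illustration, not part of the library:

```python
from collections import namedtuple

# Hypothetical helper, not part of scrapinghub: split a "project/spider/run"
# job id such as u'123/1/1' into its three numeric components.
JobId = namedtuple('JobId', ['project', 'spider', 'run'])

def parse_job_id(job_id):
    project, spider, run = job_id.split('/')
    return JobId(int(project), int(spider), int(run))
```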
To get the list of spiders in the project:
>>> project.spiders()
[
{u'id': u'spider1', u'tags': [], u'type': u'manual', u'version': u'123'},
{u'id': u'spider2', u'tags': [], u'type': u'manual', u'version': u'123'}
]
To get all finished jobs:
>>> jobs = project.jobs(state='finished')
jobs is a JobSet. JobSet objects are iterable and, when iterated, yield Job objects, so you typically use one like this:
>>> for job in jobs:
... # do something with job
Or, if you just want to get the job ids:
>>> [x.id for x in jobs]
[u'123/1/1', u'123/1/2', u'123/1/3']
To select a specific job:
>>> job = project.job(u'123/1/2')
>>> job.id
u'123/1/2'
To retrieve all scraped items from a job:
>>> for item in job.items():
... # do something with item (it's just a dict)
To retrieve all log entries from a job:
>>> for logitem in job.log():
... # logitem is a dict with logLevel, message, time
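Because each log entry is a dict, filtering by severity is straightforward. A sketch, assuming logLevel holds numeric levels compatible with the standard logging module constants (10=DEBUG ... 50=CRITICAL); adapt the comparison if your entries carry string levels:

```python
import logging

def errors_only(log_entries, threshold=logging.ERROR):
    """Keep only log entry dicts at or above the given numeric level.

    Assumes each entry's 'logLevel' value is a numeric logging level;
    entries without a 'logLevel' key are dropped.
    """
    return [e for e in log_entries if e.get('logLevel', 0) >= threshold]
```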
To get job info:
>>> job.info['spider']
'myspider'
>>> job.info['started_time']
'2010-09-28T15:09:57.629000'
>>> job.info['tags']
[]
>>> job.info['fields_count']['description']
1253
To mark a job with tag consumed:
>>> job.update(add_tag='consumed')
To mark several jobs with tag consumed (JobSet also supports the update() method):
>>> project.jobs(state='finished').update(add_tag='consumed')
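Tags make a simple work queue: fetch finished jobs that lack a tag, process each one, then mark it consumed so the next run skips it. A sketch of that loop; we assume here that jobs() accepts a lacks_tag filter (verify against your endpoint before relying on it), and consume_finished_jobs is our own name:

```python
def consume_finished_jobs(project, handle):
    """Process each finished job not yet tagged 'consumed', then tag it.

    `handle` is any callable taking a Job. Returns the number of jobs
    processed. Assumes project.jobs() accepts a `lacks_tag` filter.
    """
    processed = 0
    for job in project.jobs(state='finished', lacks_tag='consumed'):
        handle(job)
        job.update(add_tag='consumed')
        processed += 1
    return processed
```

Running it twice is safe: the second run finds no jobs lacking the tag and processes nothing.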
To delete a job:
>>> job.delete()
To delete several jobs (JobSet also supports the delete() method):
>>> project.jobs(state='finished').delete()
Module contents
Scrapinghub API Client Library
exception scrapinghub.legacy.APIError(message, _type=None)
    Bases: exceptions.Exception

    ERR_AUTH_ERROR = 'err_auth_error'
    ERR_BAD_REQUEST = 'err_bad_request'
    ERR_DEFAULT = 'err_default'
    ERR_NOT_FOUND = 'err_not_found'
    ERR_SERVER_ERROR = 'err_server_error'
    ERR_VALUE_ERROR = 'err_value_error'

class scrapinghub.legacy.Connection(apikey=None, password='', _old_passwd='', url=None, connection_timeout=None)
    Bases: object

    Main class to access the Scrapinghub API.

    API_METHODS = {'addversion': 'scrapyd/addversion', 'as_project_slybot': 'as/project-slybot', 'as_spider_properties': 'as/spider-properties', 'eggs_add': 'eggs/add', 'eggs_delete': 'eggs/delete', 'eggs_list': 'eggs/list', 'items': 'items', 'jobs_count': 'jobs/count', 'jobs_delete': 'jobs/delete', 'jobs_list': 'jobs/list', 'jobs_stop': 'jobs/stop', 'jobs_update': 'jobs/update', 'listprojects': 'scrapyd/listprojects', 'log': 'log', 'reports_add': 'reports/add', 'run': 'run', 'schedule': 'schedule', 'spiders': 'spiders/list'}

    DEFAULT_ENDPOINT = 'https://app.scrapinghub.com/api/'

    auth

    project_ids()
        Returns a list of projects available to this connection and its credentials.

    project_names()

class scrapinghub.legacy.Job(project, id, info)
    Bases: scrapinghub.legacy.RequestProxyMixin

    MAX_RETRIES = 180
    RETRY_INTERVAL = 60

    add_report(key, content, content_type='text/plain')

    delete()

    id

    items(offset=0, count=None, meta=None)

    log(**params)

    stop()

    update(**modifiers)

class scrapinghub.legacy.JobSet(project, **params)
    Bases: scrapinghub.legacy.RequestProxyMixin

    count()
        Returns the total number of results matching the current filters; the count and offset parameters are not taken into account.

    delete()

    stop()

    update(**modifiers)

class scrapinghub.legacy.Project(connection, projectid)
    Bases: scrapinghub.legacy.RequestProxyMixin

    autoscraping_project_slybot(spiders=(), outputfile=None)

    autoscraping_spider_properties(spider, start_urls=None)

    job(id)

    jobs(**params)

    name

    schedule(spider, **params)

    spiders(**params)

class scrapinghub.legacy.RequestProxyMixin
    Bases: object