scrapinghub.HubstorageClient
The library can be used for interaction with spiders, jobs and scraped data through storage.scrapinghub.com endpoints.
[WARNING] This client is deprecated; please use scrapinghub.ScrapinghubClient instead.
Overview
First, use your API key for authorization:
>>> from scrapinghub import HubstorageClient
>>> hc = HubstorageClient(auth='apikey')
>>> hc.server_timestamp()
1446222762611
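The value returned by server_timestamp() appears to be a Unix epoch time in milliseconds. That is an assumption based on its magnitude, since the unit is not stated here. A minimal sketch of turning such a value into a timezone-aware datetime, using a hypothetical helper named ms_to_datetime:

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_datetime(timestamp_ms):
    """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
    # Assumption: the timestamp is expressed in milliseconds, not seconds.
    # Integer timedelta arithmetic avoids float rounding of the fraction.
    return _EPOCH + timedelta(milliseconds=timestamp_ms)
```

Under that assumption, ms_to_datetime(1446222762611) corresponds to 2015-10-30 16:32:42.611 UTC.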
Project
To get project settings or jobs summary:
>>> project = hc.get_project('1111111')
>>> project.settings['botgroups']
[u'botgroup1', ]
>>> project.jobsummary()
{u'finished': 6,
u'has_capacity': True,
u'pending': 0,
u'project': 1111111,
u'running': 0}
Spider
To get the ID of a spider from its name:
>>> project.ids.spider('foo')
1
To see the summaries of the last jobs:
>>> summaries = project.spiders.lastjobsummary(count=3)
To get the last job summary for a single spider:
>>> summary = project.spiders.lastjobsummary(spiderid='1')
Job
A job can be retrieved directly by its key (project_id/spider_id/job_id):
>>> job = hc.get_job('1111111/1/1')
>>> job.key
'1111111/1/1'
>>> job.metadata['state']
u'finished'
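Because a job key has the form project_id/spider_id/job_id, it can be split locally without any API call. The sketch below uses hypothetical names (JobKey, parse_job_key) and assumes all three parts are integers, as the examples on this page suggest:

```python
from collections import namedtuple

# Hypothetical container for the three parts of a job key.
JobKey = namedtuple("JobKey", ["project_id", "spider_id", "job_id"])


def parse_job_key(key):
    """Split 'project_id/spider_id/job_id' into its three integer parts."""
    parts = key.split("/")
    if len(parts) != 3:
        raise ValueError("expected 'project_id/spider_id/job_id', got %r" % key)
    # Assumption: every part is an integer; int() raises ValueError otherwise.
    return JobKey(*(int(part) for part in parts))
```

For example, parse_job_key('1111111/1/1').spider_id evaluates to 1.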
Creating a new job requires a spider name:
>>> job = hc.push_job(projectid='1111111', spidername='foo')
>>> job.key
'1111111/1/1'
Priority can be between 0 and 4 (from lowest to highest), the default is 2.
To push a job at the project level with the highest priority:
>>> job = project.push_job(spidername='foo', priority=4)
>>> job.metadata['priority']
4
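Since the priority has to lie between 0 and 4, it can be convenient to validate it locally before the request is sent. The wrapper below is a hypothetical helper, not part of the library; it only relies on project.push_job() accepting the spidername and priority keyword arguments shown above:

```python
MIN_PRIORITY, MAX_PRIORITY, DEFAULT_PRIORITY = 0, 4, 2


def push_job_checked(project, spidername, priority=DEFAULT_PRIORITY, **kwargs):
    """Validate the priority range locally, then delegate to project.push_job()."""
    if not isinstance(priority, int) or not MIN_PRIORITY <= priority <= MAX_PRIORITY:
        raise ValueError(
            "priority must be an integer between %d and %d, got %r"
            % (MIN_PRIORITY, MAX_PRIORITY, priority)
        )
    # Extra keyword arguments (for example spider_args) are passed through as-is.
    return project.push_job(spidername=spidername, priority=priority, **kwargs)
```

A call such as push_job_checked(project, 'foo', priority=4) then behaves like the direct call, while priority=5 fails fast with a ValueError.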
Pushing a job with spider arguments:
>>> project.push_job(spidername='foo', spider_args={'arg1': 'foo', 'arg2': 'bar'})
A running job can be cancelled by calling request_cancel():
>>> job.request_cancel()
>>> job.metadata['cancelled_by']
u'John'
To delete a job:
>>> job.purged()
>>> job.metadata['state']
u'deleted'
Job details
Job details can be found in the job's metadata and its scrapystats:
>>> job = hc.get_job('1111111/1/1')
>>> job.metadata['version']
u'5123a86-master'
>>> job.metadata['scrapystats']
...
u'downloader/response_count': 104,
u'downloader/response_status_count/200': 104,
u'finish_reason': u'finished',
u'finish_time': 1447160494937,
u'item_scraped_count': 50,
u'log_count/DEBUG': 157,
u'log_count/INFO': 1365,
u'log_count/WARNING': 3,
u'memusage/max': 182988800,
u'memusage/startup': 62439424,
...
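The scrapystats value behaves like a plain dict, so simple health indicators can be derived from it. The following sketch uses a hypothetical helper, summarize_stats, and only the stat names visible in the output above; any of them may be absent for a given job, which is why every lookup has a default:

```python
def summarize_stats(stats):
    """Derive a few ratios from a scrapystats-like dict."""
    responses = stats.get("downloader/response_count", 0)
    ok = stats.get("downloader/response_status_count/200", 0)
    items = stats.get("item_scraped_count", 0)
    return {
        # Share of responses with HTTP status 200 (None if nothing was downloaded).
        "ok_ratio": float(ok) / responses if responses else None,
        # Average number of scraped items per downloaded response.
        "items_per_response": float(items) / responses if responses else None,
        "warnings": stats.get("log_count/WARNING", 0),
    }
```

It would typically be called as summarize_stats(job.metadata['scrapystats']).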
Anything can be stored in metadata. Here is an example of how to add tags:
>>> job.update_metadata({'tags': 'obsolete'})
Jobs
To iterate through the metadata of all jobs in a project (in descending order):
>>> jobs_metadata = project.jobq.list()
>>> [j['key'] for j in jobs_metadata]
['1111111/1/3', '1111111/1/2', '1111111/1/1']
The jobq metadata fieldset is less detailed than job.metadata, but it contains a few additional fields as well.
Additional fields can be requested using the jobmeta parameter.
If it is used, it is up to the user to list all the required fields: apart from the requested ones, only a few default fields are returned:
>>> metadata = next(project.jobq.list())
>>> metadata.get('spider', 'missing')
u'foo'
>>> jobs_metadata = project.jobq.list(jobmeta=['scheduled_by'])
>>> metadata = next(jobs_metadata)
>>> metadata.get('scheduled_by', 'missing')
u'John'
>>> metadata.get('spider', 'missing')
'missing'
By default, jobq.list() returns at most the last 1000 results. Pagination is available using the start parameter:
>>> jobs_metadata = project.jobq.list(start=1000)
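A paging loop over jobq.list() could look like the sketch below. It is a hypothetical helper built on two assumptions that are not confirmed here: that start is a zero-based offset into the result list, and that count limits the page size. A page shorter than page_size is treated as the last one.

```python
def iter_all_jobs(jobq, page_size=1000, **filters):
    """Yield job metadata dicts page by page until a short page is returned."""
    start = 0
    while True:
        # Assumption: `start` is an offset and `count` caps the page size.
        page = list(jobq.list(start=start, count=page_size, **filters))
        for metadata in page:
            yield metadata
        if len(page) < page_size:
            break
        start += page_size
```

It would be used as for metadata in iter_all_jobs(project.jobq, spider='foo'), with any filters forwarded unchanged to jobq.list().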
There are several filters like spider, state, has_tag, lacks_tag, startts and endts. To get jobs filtered by tags:
>>> jobs_metadata = project.jobq.list(has_tag=['new', 'verified'], lacks_tag='obsolete')
A list of tags is combined with OR, so the call above is expected to return jobs that have either the ‘new’ or the ‘verified’ tag.
To get a given number of the last finished jobs for a spider:
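The OR behaviour of a tag list can be mirrored locally, which is handy for checking expectations against metadata already fetched. This is only a sketch of the semantics described above, not the server implementation; the helper name is hypothetical, and treating lacks_tag as "none of these tags" is an assumption.

```python
def matches_tag_filter(job_tags, has_tag=(), lacks_tag=()):
    """Return True if a job's tags would pass the has_tag/lacks_tag filters."""
    # Accept a single tag given as a plain string, as in lacks_tag='obsolete'.
    if isinstance(has_tag, str):
        has_tag = [has_tag]
    if isinstance(lacks_tag, str):
        lacks_tag = [lacks_tag]
    tags = set(job_tags)
    # has_tag: at least one of the listed tags must be present (OR).
    if has_tag and not tags & set(has_tag):
        return False
    # Assumption: lacks_tag rejects a job carrying any of the listed tags.
    return not tags & set(lacks_tag)
```

With has_tag=['new', 'verified'] and lacks_tag='obsolete', a job tagged only 'verified' passes, while one tagged 'verified' and 'obsolete' does not.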
>>> jobs_metadata = project.jobq.list(spider='foo', state='finished', count=3)
There are four possible job states, which can be used as values when filtering by state:
pending
running
finished
deleted
Items
To iterate through items:
>>> items = job.items.iter_values()
>>> for item in items:
...     pass  # do something; each item is just a dict
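Since each item is a plain dict, a common next step is to dump the items to a JSON Lines file. The helper below is a hypothetical sketch that accepts any iterable of dicts and assumes the items are JSON-serializable:

```python
import json


def write_items_jsonl(items, fileobj):
    """Write each item as one JSON document per line; return the item count."""
    count = 0
    for item in items:
        # Assumption: every value in the item can be serialized by json.dumps.
        fileobj.write(json.dumps(item, sort_keys=True) + "\n")
        count += 1
    return count
```

It can be fed directly from the iterator, for example write_items_jsonl(job.items.iter_values(), open('items.jl', 'w')).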
Logs
For example, to iterate through the first 10 log entries:
>>> logs = job.logs.iter_values(count=10)
>>> for log in logs:
...     pass  # do something; each log is a dict with log level, message and time keys
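Log entries can be filtered by severity on the client side. The sketch below is a hypothetical helper resting on two assumptions that should be checked against real data: that the level is stored under the key 'level', and that it uses the numeric levels of Python's logging module.

```python
import logging


def iter_logs_at_least(logs, min_level=logging.WARNING):
    """Yield only the log entries whose level is at least min_level."""
    for entry in logs:
        # Assumption: entries look like {'level': 30, 'message': ..., 'time': ...}.
        if entry.get("level", logging.NOTSET) >= min_level:
            yield entry
```

For example, iter_logs_at_least(job.logs.iter_values()) would keep warnings and errors only.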
Collections
Let’s store a hash and timestamp pair for the foo spider. The usual workflow with Collections would be:
>>> collections = project.collections
>>> foo_store = collections.new_store('foo_store')
>>> foo_store.set({'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'})
>>> foo_store.count()
1
>>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7')
{u'value': u'1447221694537'}
>>> # iterate over _key & value pair
... list(foo_store.iter_values())
[{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
>>> # filter by multiple keys - only values for keys that exist will be returned
... list(foo_store.iter_values(key=['002d050ee3ff6192dcbecc4e4b4457d7', 'blah']))
[{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
>>> foo_store.delete('002d050ee3ff6192dcbecc4e4b4457d7')
>>> foo_store.count()
0
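Because iter_values(key=[...]) returns values only for keys that exist, it can be used to find out which keys are not stored yet, for instance to skip pages that were already seen. A minimal sketch with a hypothetical helper name:

```python
def missing_keys(store, keys):
    """Return the keys (in input order) that are not present in the store."""
    keys = list(keys)
    if not keys:
        # Avoid calling iter_values() with an empty filter, whose
        # behaviour is not described here.
        return []
    # Only existing keys come back, each row carrying its '_key'.
    found = {row["_key"] for row in store.iter_values(key=keys)}
    return [key for key in keys if key not in found]
```

With the store from the example above (before the delete), missing_keys(foo_store, ['002d050ee3ff6192dcbecc4e4b4457d7', 'blah']) would return ['blah'].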
Frontier
Typical workflow with Frontier:
>>> frontier = project.frontier
Add a request to the frontier:
>>> frontier.add('test', 'example.com', [{'fp': '/some/path.html'}])
>>> frontier.flush()
>>> frontier.newcount
1
Add requests with additional parameters:
>>> frontier.add('test', 'example.com', [{'fp': '/'}, {'fp': 'page1.html', 'p': 1, 'qdata': {'depth': 1}}])
>>> frontier.flush()
>>> frontier.newcount
2
To delete the slot example.com from the frontier:
>>> frontier.delete_slot('test', 'example.com')
To retrieve requests for a given slot:
>>> reqs = frontier.read('test', 'example.com')
To delete a batch of requests:
>>> frontier.delete('test', 'example.com', '00013967d8af7b0001')
To retrieve fingerprints for a given slot:
>>> fps = [req['requests'] for req in frontier.read('test', 'example.com')]
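The read and delete calls above can be combined into a simple consume loop: read the batches of a slot, process every request, then delete each processed batch. This is a sketch with hypothetical names (drain_slot, handle_request), and it assumes that each batch dict returned by frontier.read() exposes its identifier under the key 'id' next to 'requests'.

```python
def drain_slot(frontier, name, slot, handle_request):
    """Process all requests currently readable from a slot, batch by batch."""
    processed = 0
    # Materialize the batches first so nothing is deleted while still reading.
    batches = list(frontier.read(name, slot))
    for batch in batches:
        for request in batch["requests"]:
            handle_request(request)
            processed += 1
        # Assumption: the batch identifier is available as batch['id'].
        frontier.delete(name, slot, batch["id"])
    return processed
```

A call like drain_slot(frontier, 'test', 'example.com', print) would print every request in the slot and return how many were handled.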
Module contents
HubStorage client library
- class scrapinghub.hubstorage.HubstorageClient(auth=None, endpoint=None, connection_timeout=None, max_retries=None, max_retry_time=None, user_agent=None, use_msgpack=True)
Bases: object
- DEFAULT_CONNECTION_TIMEOUT_S = 60.0
- DEFAULT_ENDPOINT = 'https://storage.scrapinghub.com/'
- DEFAULT_USER_AGENT = 'python-scrapinghub/2.7.0'
- RETRY_DEFAULT_EXPONENTIAL_BACKOFF_MS = 500
- RETRY_DEFAULT_JITTER_MS = 500
- RETRY_DEFAULT_MAX_RETRIES = 3
- RETRY_DEFAUT_MAX_RETRY_TIME_S = 60.0
- property batchuploader