scrapinghub.HubstorageClient
============================

The library can be used for interaction with spiders, jobs and scraped data through ``storage.scrapinghub.com`` endpoints.

.. warning::

    ``HubstorageClient`` is deprecated; please use `scrapinghub.ScrapinghubClient`_ instead.


Overview
--------

First, use your API key for authorization::

    >>> from scrapinghub import HubstorageClient
    >>> hc = HubstorageClient(auth='apikey')
    >>> hc.server_timestamp()
    1446222762611
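
The value returned above looks like a Unix timestamp in milliseconds; this is an assumption, since the unit is not stated here. Under that assumption, a minimal sketch for turning it into a timezone-aware ``datetime`` could look as follows (``ms_to_datetime`` is a hypothetical helper, not part of the library)::

    from datetime import datetime, timedelta, timezone

    def ms_to_datetime(ms):
        """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
        # Assumption: the server timestamp counts milliseconds since 1970-01-01 UTC.
        epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
        return epoch + timedelta(milliseconds=ms)

    print(ms_to_datetime(1446222762611).isoformat())
    # prints 2015-10-30T16:32:42.611000+00:00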

Project
^^^^^^^

To get the project settings or a summary of its jobs::

    >>> project = hc.get_project('1111111')
    >>> project.settings['botgroups']
    [u'botgroup1', ]
    >>> project.jobsummary()
    {u'finished': 6,
     u'has_capacity': True,
     u'pending': 0,
     u'project': 1111111,
     u'running': 0}

Spider
^^^^^^

To get the spider ID that corresponds to a spider name::

    >>> project.ids.spider('foo')
    1

To see the summaries of the most recent jobs::

    >>> summaries = project.spiders.lastjobsummary(count=3)

To get the last job summary for a single spider::

    >>> summary = project.spiders.lastjobsummary(spiderid='1')

Job
^^^

A job can be **retrieved** directly by its key (``project_id/spider_id/job_id``)::

    >>> job = hc.get_job('1111111/1/1')
    >>> job.key
    '1111111/1/1'
    >>> job.metadata['state']
    u'finished'
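
Because a job key packs three identifiers into one string, it can be handy to split it back into its parts. The sketch below does so locally, without calling the API; ``JobKey`` and ``parse_job_key`` are hypothetical names, and treating all three parts as integers is an assumption based on the example values shown here::

    from collections import namedtuple

    JobKey = namedtuple('JobKey', ['project_id', 'spider_id', 'job_id'])

    def parse_job_key(key):
        """Split 'project_id/spider_id/job_id' into three integer fields."""
        parts = key.split('/')
        if len(parts) != 3:
            raise ValueError('expected project_id/spider_id/job_id, got %r' % key)
        # Assumption: every component is a decimal integer.
        return JobKey(*(int(part) for part in parts))

    print(parse_job_key('1111111/1/1'))
    # prints JobKey(project_id=1111111, spider_id=1, job_id=1)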

**Creating** a new job requires a spider name::

    >>> job = hc.push_job(projectid='1111111', spidername='foo')
    >>> job.key
    '1111111/1/1'

Priority ranges from 0 (lowest) to 4 (highest); the default is 2.

To push a job from the project level with the highest priority::

    >>> job = project.push_job(spidername='foo', priority=4)
    >>> job.metadata['priority']
    4
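
The priority range described above can be checked on the client side before a job is pushed. This is only a sketch: ``checked_priority`` and the three constants are hypothetical names, and whether the server itself rejects out-of-range values is not covered here::

    LOWEST_PRIORITY, DEFAULT_PRIORITY, HIGHEST_PRIORITY = 0, 2, 4

    def checked_priority(priority=DEFAULT_PRIORITY):
        """Return priority unchanged if it is an integer in 0..4, else raise."""
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(priority, bool) or not isinstance(priority, int):
            raise TypeError('priority must be an integer')
        if not LOWEST_PRIORITY <= priority <= HIGHEST_PRIORITY:
            raise ValueError('priority must be between 0 and 4')
        return priority

    print(checked_priority())   # prints 2
    print(checked_priority(4))  # prints 4

The checked value could then be passed on, for example as ``project.push_job(spidername='foo', priority=checked_priority(4))``.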

Pushing a job with spider arguments::

    >>> project.push_job(spidername='foo', spider_args={'arg1': 'foo', 'arg2': 'bar'})

A running job can be **cancelled** by calling ``request_cancel()``::

    >>> job.request_cancel()
    >>> job.metadata['cancelled_by']
    u'John'

To **delete** a job::

    >>> job.purged()
    >>> job.metadata['state']
    u'deleted'

Job details
^^^^^^^^^^^

Job details can be found in the job's metadata and its ``scrapystats``::

    >>> job = hc.get_job('1111111/1/1')
    >>> job.metadata['version']
    u'5123a86-master'
    >>> job.metadata['scrapystats']
    ...
    u'downloader/response_count': 104,
    u'downloader/response_status_count/200': 104,
    u'finish_reason': u'finished',
    u'finish_time': 1447160494937,
    u'item_scraped_count': 50,
    u'log_count/DEBUG': 157,
    u'log_count/INFO': 1365,
    u'log_count/WARNING': 3,
    u'memusage/max': 182988800,
    u'memusage/startup': 62439424,
    ...

Anything can be stored in the metadata. Here is an example of how to add tags::

    >>> job.update_metadata({'tags': 'obsolete'})

Jobs
^^^^

To iterate through the metadata of all jobs in a project (in descending order)::

    >>> jobs_metadata = project.jobq.list()
    >>> [j['key'] for j in jobs_metadata]
    ['1111111/1/3', '1111111/1/2', '1111111/1/1']

The jobq metadata fieldset is less detailed than ``job.metadata``, but it also contains a few fields of its own.
Additional fields can be requested using the ``jobmeta`` parameter.
If it is used, the user must list all the required fields explicitly: apart from the requested ones, only a few default fields are returned::

    >>> metadata = next(project.jobq.list())
    >>> metadata.get('spider', 'missing')
    u'foo'
    >>> jobs_metadata = project.jobq.list(jobmeta=['scheduled_by'])
    >>> metadata = next(jobs_metadata)
    >>> metadata.get('scheduled_by', 'missing')
    u'John'
    >>> metadata.get('spider', 'missing')
    'missing'

By default, ``jobq.list()`` returns at most the last 1000 results. Pagination is available using the ``start`` parameter::

    >>> jobs_metadata = project.jobq.list(start=1000)
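
To walk through more than one page, the ``start`` value can be advanced in a loop. The sketch below assumes that ``start`` is a zero-based offset and that it can be combined with ``count``; both are assumptions. ``iter_all_jobs`` is a hypothetical helper, and ``fake_list`` only stands in for ``project.jobq.list`` so that the example runs without network access::

    def iter_all_jobs(list_page, page_size=1000):
        """Yield every entry by requesting consecutive pages.

        list_page is any callable accepting start and count keyword
        arguments, for example project.jobq.list.
        """
        start = 0
        while True:
            page = list(list_page(start=start, count=page_size))
            for entry in page:
                yield entry
            # A short page means there is nothing left to fetch.
            if len(page) < page_size:
                break
            start += page_size

    # Stand-in data in descending order, like the listing shown earlier.
    FAKE_JOBS = [{'key': '1111111/1/%d' % n} for n in range(5, 0, -1)]

    def fake_list(start=0, count=1000):
        return iter(FAKE_JOBS[start:start + count])

    print([job['key'] for job in iter_all_jobs(fake_list, page_size=2)])
    # prints ['1111111/1/5', '1111111/1/4', '1111111/1/3', '1111111/1/2', '1111111/1/1']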

Several filters are available, such as ``spider``, ``state``, ``has_tag``, ``lacks_tag``, ``startts`` and ``endts``.
To get jobs filtered by tags::

    >>> jobs_metadata = project.jobq.list(has_tag=['new', 'verified'], lacks_tag='obsolete')

A list of tags is combined with ``OR``, so in the case above jobs with either the 'new' or the 'verified' tag are expected.
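
The tag semantics can be modelled locally to make the ``OR`` behaviour concrete. This is a sketch of the described rule, not the server's implementation; ``matches_tags`` is a hypothetical helper, and treating several ``lacks_tag`` values as "none of them may be present" is an assumption, since only a single value is shown here::

    def matches_tags(job_tags, has_tag=(), lacks_tag=()):
        """Return True if a job's tags pass the has_tag/lacks_tag filters."""
        def as_set(value):
            # Accept either a single tag or an iterable of tags.
            return {value} if isinstance(value, str) else set(value)

        tags = set(job_tags)
        wanted, unwanted = as_set(has_tag), as_set(lacks_tag)
        # has_tag: at least one of the wanted tags must be present (OR).
        if wanted and not (tags & wanted):
            return False
        # lacks_tag: none of the unwanted tags may be present (assumption).
        return not (tags & unwanted)

    print(matches_tags(['new'], has_tag=['new', 'verified'], lacks_tag='obsolete'))
    # prints True
    print(matches_tags(['new', 'obsolete'], has_tag=['new', 'verified'], lacks_tag='obsolete'))
    # prints False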

To get a certain number of the most recent finished jobs for a given spider::

    >>> jobs_metadata = project.jobq.list(spider='foo', state='finished', count=3)

There are 4 possible job states, which can be used as values for filtering by state:

- pending
- running
- finished
- deleted
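
As an illustration of filtering by state, the following sketch counts the listed jobs for each of the four states. ``count_jobs_by_state`` is a hypothetical helper, and ``fake_list_by_state`` merely stands in for ``project.jobq.list`` so that the example runs offline; with the real call, each count would be capped by the default result limit::

    JOB_STATES = ('pending', 'running', 'finished', 'deleted')

    def count_jobs_by_state(list_jobs, spider):
        """Map each state to the number of jobs listed for one spider."""
        counts = {}
        for state in JOB_STATES:
            counts[state] = sum(1 for _ in list_jobs(spider=spider, state=state))
        return counts

    # Stand-in data: the states of three imaginary jobs.
    FAKE_STATES = ['finished', 'finished', 'running']

    def fake_list_by_state(spider, state):
        return (s for s in FAKE_STATES if s == state)

    print(count_jobs_by_state(fake_list_by_state, 'foo'))
    # prints {'pending': 0, 'running': 1, 'finished': 2, 'deleted': 0}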


Items
^^^^^

To iterate through items::

    >>> items = job.items.iter_values()
    >>> for item in items:
    ...     pass  # do something; item is just a dict
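
One common thing to do with the items is to dump them to a JSON Lines file. The sketch below assumes every item is JSON-serializable; ``write_jsonlines`` is a hypothetical helper, and the in-memory buffer and sample dicts stand in for a real file and for ``job.items.iter_values()``::

    import io
    import json

    def write_jsonlines(items, stream):
        """Write each item dict as one JSON document per line; return the count."""
        written = 0
        for item in items:
            stream.write(json.dumps(item, sort_keys=True) + '\n')
            written += 1
        return written

    # In real use, pass job.items.iter_values() and an open file instead.
    buffer = io.StringIO()
    print(write_jsonlines([{'title': 'a'}, {'title': 'b'}], buffer))
    # prints 2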

Logs
^^^^

To iterate through the first 10 log entries, for example::

    >>> logs = job.logs.iter_values(count=10)
    >>> for log in logs:
    ...     pass  # do something; log is a dict with log level, message and time keys
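
A log entry can be rendered as a single readable line. The sketch below rests on several assumptions that are not confirmed here: that the keys are named ``level``, ``message`` and ``time``, that ``level`` uses the numeric levels of Python's ``logging`` module, and that ``time`` is in milliseconds since the Unix epoch. ``format_log_entry`` is a hypothetical helper and ``sample`` is made-up data::

    import logging
    from datetime import datetime, timedelta, timezone

    def format_log_entry(entry):
        """Render one log dict as 'ISO-time LEVEL message'."""
        epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
        # Assumption: 'time' holds milliseconds since the Unix epoch.
        when = epoch + timedelta(milliseconds=entry['time'])
        # Assumption: 'level' is a numeric level such as logging.INFO (20).
        level = logging.getLevelName(entry['level'])
        return '%s %s %s' % (when.isoformat(), level, entry['message'])

    sample = {'time': 1447160494937, 'level': 20, 'message': 'Spider opened'}
    print(format_log_entry(sample))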

Collections
^^^^^^^^^^^

Let's store a hash and timestamp pair for the ``foo`` spider. The usual workflow with `Collections`_ would be::

    >>> collections = project.collections
    >>> foo_store = collections.new_store('foo_store')
    >>> foo_store.set({'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'})
    >>> foo_store.count()
    1
    >>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7')
    {u'value': u'1447221694537'}
    >>> # iterate over _key & value pair
    ... list(foo_store.iter_values())
    [{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
    >>> # filter by multiple keys - only values for keys that exist will be returned
    ... list(foo_store.iter_values(key=['002d050ee3ff6192dcbecc4e4b4457d7', 'blah']))
    [{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
    >>> foo_store.delete('002d050ee3ff6192dcbecc4e4b4457d7')
    >>> foo_store.count()
    0

Frontier
^^^^^^^^

A typical workflow with `Frontier`_ starts from the project's frontier object::

    >>> frontier = project.frontier

Add a request to the frontier::

    >>> frontier.add('test', 'example.com', [{'fp': '/some/path.html'}])
    >>> frontier.flush()
    >>> frontier.newcount
    1

Add requests with additional parameters::

    >>> frontier.add('test', 'example.com', [{'fp': '/'}, {'fp': 'page1.html', 'p': 1, 'qdata': {'depth': 1}}])
    >>> frontier.flush()
    >>> frontier.newcount
    2

To delete the slot ``example.com`` from the frontier::

    >>> frontier.delete_slot('test', 'example.com')

To retrieve requests for a given slot::

    >>> reqs = frontier.read('test', 'example.com')

To delete a batch of requests::

    >>> frontier.delete('test', 'example.com', '00013967d8af7b0001')

To retrieve fingerprints for a given slot::

    >>> fps = [req['requests'] for req in frontier.read('test', 'example.com')]


Module contents
---------------

.. automodule:: scrapinghub.hubstorage
    :members:
    :undoc-members:
    :show-inheritance:

.. _scrapinghub.ScrapinghubClient: ../client/overview.html