API Reference

Client object

class scrapinghub.client.ScrapinghubClient(auth=None, dash_endpoint=None, connection_timeout=60, **kwargs)

Main class to work with Scrapinghub API.

Parameters:
  • auth – (optional) Scrapinghub APIKEY or other SH auth credentials. If not provided, it is read from the SH_APIKEY or SHUB_JOBAUTH environment variables, in that order. SHUB_JOBAUTH is available by default in Scrapy Cloud, but it does not provide access to all endpoints (e.g. job scheduling); it does allow access to job data, collections and the crawl frontier. If you need full access to Scrapy Cloud features, provide a Scrapinghub APIKEY through this argument or set the SH_APIKEY environment variable.
  • dash_endpoint – (optional) Scrapinghub Dash panel url.
  • **kwargs – (optional) Additional arguments for HubstorageClient constructor.
Variables:

projects – projects collection, Projects instance.

Usage:

>>> from scrapinghub import ScrapinghubClient
>>> client = ScrapinghubClient('APIKEY')
>>> client
<scrapinghub.client.ScrapinghubClient at 0x1047af2e8>
close(timeout=None)

Close client instance.

Parameters:timeout – (optional) float timeout secs to stop gracefully.
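
Usage (a minimal sketch, assuming the client from the example above; the timeout value is illustrative):

```python
>>> client.close(timeout=10)  # wait up to 10 seconds for a graceful stop
```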
get_job(job_key)

Get a Job with a given job key.

Parameters:job_key – job key string in format project_id/spider_id/job_id, where all the components are integers.
Returns:a job instance.
Return type:Job

Usage:

>>> job = client.get_job('123/1/1')
>>> job
<scrapinghub.client.jobs.Job at 0x10afe2eb1>
get_project(project_id)

Get scrapinghub.client.projects.Project instance with a given project id.

The method is a shortcut for client.projects.get().

Parameters:project_id – integer or string numeric project id.
Returns:a project instance.
Return type:Project

Usage:

>>> project = client.get_project(123)
>>> project
<scrapinghub.client.projects.Project at 0x106cdd6a0>

Activity

class scrapinghub.client.activity.Activity(cls, client, key)

Representation of collection of job activity events.

Not a public constructor: use a Project instance to get an Activity instance. See the activity attribute.

Please note that the list() method can use a lot of memory; for a large amount of activity events it’s recommended to iterate through them via the iter() method instead (all params and available filters are the same for both methods).

Usage:

  • get all activity from a project:

    >>> project.activity.iter()
    <generator object jldecode at 0x1049ee990>
    
  • get only last 2 events from a project:

    >>> project.activity.list(count=2)
    [{'event': 'job:completed', 'job': '123/2/3', 'user': 'jobrunner'},
     {'event': 'job:started', 'job': '123/2/3', 'user': 'john'}]
    
  • post a new event:

    >>> event = {'event': 'job:completed',
    ...          'job': '123/2/4',
    ...          'user': 'jobrunner'}
    >>> project.activity.add(event)
    
  • post multiple events at once:

    >>> events = [
    ...    {'event': 'job:completed', 'job': '123/2/5', 'user': 'jobrunner'},
    ...    {'event': 'job:cancelled', 'job': '123/2/6', 'user': 'john'},
    ... ]
    >>> project.activity.add(events)
    
add(values, **kwargs)

Add new event(s) to the project activity.

Parameters:values – a single event or a list of events, where event is represented with a dictionary of (‘event’, ‘job’, ‘user’) keys.
iter(count=None, **params)

Iterate over activity events.

Parameters:count – limit amount of elements.
Returns:a generator object over a list of activity event dicts.
Return type:types.GeneratorType[dict]
list(*args, **kwargs)

Convenient shortcut to list iter results.

Please note that the list() method can use a lot of memory; for a large amount of elements it’s recommended to iterate through them via the iter() method instead (all params and available filters are the same for both methods).

Collections

class scrapinghub.client.collections.Collection(client, collections, type_, name)

Representation of a project collection object.

Not a public constructor: use Collections instance to get a Collection instance. See Collections.get_store() and similar methods.

Usage:

  • add a new item to collection:

    >>> foo_store.set({'_key': '002d050ee3ff6192dcbecc4e4b4457d7',
    ...                'value': '1447221694537'})
    
  • count items in collection:

    >>> foo_store.count()
    1
    
  • get an item from collection:

    >>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7')
    {'value': '1447221694537'}
    
  • get all items from collection:

    >>> foo_store.iter()
    <generator object jldecode at 0x1049eef10>
    
  • iterate over _key & value pairs:

    >>> for elem in foo_store.iter(count=1):
    ...     print(elem)
    {'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'}
    
  • get generator over item keys:

    >>> keys = foo_store.iter(nodata=True, meta=["_key"])
    >>> next(keys)
    {'_key': '002d050ee3ff6192dcbecc4e4b4457d7'}
    
  • filter by multiple keys, only values for keys that exist will be returned:

    >>> foo_store.list(key=['002d050ee3ff6192dcbecc4e4b4457d7', 'blah'])
    [{'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'}]
    
  • delete an item by key:

    >>> foo_store.delete('002d050ee3ff6192dcbecc4e4b4457d7')
    
  • remove the entire collection with a single API call:

    >>> foo_store.truncate()
    
count(*args, **kwargs)

Count collection items with the given filters.

Returns:amount of elements in collection.
Return type:int
create_writer(start=0, auth=None, size=1000, interval=15, qsize=None, content_encoding='identity', maxitemsize=1048576, callback=None)

Create a new writer for a collection.

Parameters:
  • start – (optional) initial offset for writer thread.
  • auth – (optional) set auth credentials for the request.
  • size – (optional) set initial queue size.
  • interval – (optional) set interval for writer thread.
  • qsize – (optional) setup max queue size for the writer.
  • content_encoding – (optional) set different Content-Encoding header.
  • maxitemsize – (optional) max item size in bytes.
  • callback – (optional) some callback function.
Returns:

a new writer object.

Return type:

scrapinghub.hubstorage.batchuploader._BatchWriter

If a callback is provided, it shouldn’t try to inject more items into the queue, otherwise it can lead to deadlocks.
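
Usage: a minimal sketch of batch-writing many items into a store collection (foo_store and the item layout are illustrative; the writer exposes write() and close() as described above):

```python
>>> writer = foo_store.create_writer(size=500, interval=10)
>>> for i in range(2000):
...     writer.write({'_key': 'key-%d' % i, 'value': str(i)})
>>> writer.close()  # flush pending items and stop the writer thread
```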

delete(keys)

Delete item(s) from collection by key(s).

Parameters:keys – a single key or a list of keys.

The method returns None (original method returns an empty generator).

get(key, **params)

Get item from collection by key.

Parameters:
  • key – string item key.
  • **params – (optional) additional query params for the request.
Returns:

an item dictionary if exists.

Return type:

dict

iter(key=None, prefix=None, prefixcount=None, startts=None, endts=None, requests_params=None, **params)

A method to iterate through collection items.

Parameters:
  • key – a string key or a list of keys to filter with.
  • prefix – a string prefix to filter items.
  • prefixcount – maximum number of values to return per prefix.
  • startts – UNIX timestamp at which to begin results.
  • endts – UNIX timestamp at which to end results.
  • requests_params – (optional) a dict with optional requests params.
  • **params – (optional) additional query params for the request.
Returns:

an iterator over items list.

Return type:

collections.Iterable[dict]
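
For example, the key-prefix and timestamp filters above can be combined (a sketch; the prefix and timestamp values are illustrative):

```python
>>> for item in foo_store.iter(prefix='002d', startts=1447221694537):
...     print(item)
```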

list(key=None, prefix=None, prefixcount=None, startts=None, endts=None, requests_params=None, **params)

Convenient shortcut to list iter results.

Please note that the list() method can use a lot of memory; for a large amount of items it’s recommended to iterate through them via the iter() method instead (all params and available filters are the same for both methods).

Parameters:
  • key – a string key or a list of keys to filter with.
  • prefix – a string prefix to filter items.
  • prefixcount – maximum number of values to return per prefix.
  • startts – UNIX timestamp at which to begin results.
  • endts – UNIX timestamp at which to end results.
  • requests_params – (optional) a dict with optional requests params.
  • **params – (optional) additional query params for the request.
Returns:

a list of items where each item is represented with a dict.

Return type:

list[dict]

set(value)

Set an item in the collection by key.

Parameters:value – a dict representing a collection item.

The method returns None (original method returns an empty generator).

truncate()

Remove the entire collection with a single API call.

The method returns None (original method returns an empty generator).

class scrapinghub.client.collections.Collections(cls, client, key)

Access to project collections.

Not a public constructor: use Project instance to get a Collections instance. See collections attribute.

Usage:

>>> collections = project.collections
>>> collections.list()
[{'name': 'Pages', 'type': 's'}]
>>> foo_store = collections.get_store('foo_store')
get(type_, name)

Base method to get a collection with a given type and name.

Parameters:
  • type_ – a collection type string.
  • name – a collection name string.
Returns:

a collection object.

Return type:

Collection

get_cached_store(name)

Method to get a cached-store collection by name.

Items in this collection type expire after a month.

Parameters:name – a collection name string.
Returns:a collection object.
Return type:Collection
get_store(name)

Method to get a store collection by name.

Parameters:name – a collection name string.
Returns:a collection object.
Return type:Collection
get_versioned_cached_store(name)

Method to get a versioned-cached-store collection by name.

Multiple copies are retained, and each one expires after a month.

Parameters:name – a collection name string.
Returns:a collection object.
Return type:Collection
get_versioned_store(name)

Method to get a versioned-store collection by name.

The collection type retains up to 3 copies of each item.

Parameters:name – a collection name string.
Returns:a collection object.
Return type:Collection
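
The four helpers above are shortcuts for get() with a fixed collection type string; a sketch, assuming the type codes reported by list() ('s', 'cs', 'vs', 'vcs'):

```python
>>> collections.get_store('foo')                   # same as collections.get('s', 'foo')
>>> collections.get_cached_store('foo')            # 'cs': items expire after a month
>>> collections.get_versioned_store('foo')         # 'vs': keeps up to 3 copies per item
>>> collections.get_versioned_cached_store('foo')  # 'vcs': versioned, expires monthly
```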
iter()

Iterate through collections of a project.

Returns:an iterator over collections list where each collection is represented by a dictionary with (‘name’,’type’) fields.
Return type:collections.Iterable[dict]
list()

List collections of a project.

Returns:a list of collections where each collection is represented by a dictionary with (‘name’,’type’) fields.
Return type:list[dict]

Exceptions

exception scrapinghub.client.exceptions.BadRequest(message=None, http_error=None)

Usually raised in case of 400 response from API.

exception scrapinghub.client.exceptions.DuplicateJobError(message=None, http_error=None)

Job for given spider with given arguments is already scheduled or running.

exception scrapinghub.client.exceptions.Forbidden(message=None, http_error=None)

You don’t have the permission to access the requested resource. It is either read-protected or not readable by the server.

exception scrapinghub.client.exceptions.NotFound(message=None, http_error=None)

Entity doesn’t exist (e.g. spider or project).

exception scrapinghub.client.exceptions.ScrapinghubAPIError(message=None, http_error=None)

Base exception class.

exception scrapinghub.client.exceptions.ServerError(message=None, http_error=None)

Indicates some server error: something unexpected has happened.

exception scrapinghub.client.exceptions.Unauthorized(message=None, http_error=None)

Request lacks valid authentication credentials for the target resource.

exception scrapinghub.client.exceptions.ValueTooLarge(message=None, http_error=None)

Value cannot be written because it exceeds size limits.
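
Since ScrapinghubAPIError is the base class, client code can catch specific exceptions first and fall back to the base (a sketch; the APIKEY and job key are illustrative):

```python
>>> from scrapinghub import ScrapinghubClient
>>> from scrapinghub.client.exceptions import NotFound, ScrapinghubAPIError
>>> client = ScrapinghubClient('APIKEY')
>>> try:
...     state = client.get_job('123/1/99999').metadata.get('state')
... except NotFound:
...     print('no such job')
... except ScrapinghubAPIError as exc:
...     print('API error: %s' % exc)
```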

Frontiers

class scrapinghub.client.frontiers.Frontier(client, frontiers, name)

Representation of a frontier object.

Not a public constructor: use Frontiers instance to get a Frontier instance. See Frontiers.get() method.

Usage:

  • get iterator with all slots:

    >>> frontier.iter()
    <list_iterator at 0x1030736d8>
    
  • list all slots:

    >>> frontier.list()
    ['example.com', 'example.com2']
    
  • get a slot by name:

    >>> frontier.get('example.com')
    <scrapinghub.client.frontiers.FrontierSlot at 0x1049d8978>
    
  • flush frontier data:

    >>> frontier.flush()
    
  • show amount of new requests added to frontier:

    >>> frontier.newcount
    3
    
flush()

Flush data for a whole frontier.

get(slot)

Get a slot by name.

Returns:a frontier slot instance.
Return type:FrontierSlot
iter()

Iterate through slots.

Returns:an iterator over frontier slots names.
Return type:collections.Iterable[str]
list()

List all slots.

Returns:a list of frontier slots names.
Return type:list[str]
newcount

Integer amount of new entries added to frontier.

class scrapinghub.client.frontiers.FrontierSlot(client, frontier, slot)

Representation of a frontier slot object.

Not a public constructor: use Frontier instance to get a FrontierSlot instance. See Frontier.get() method.

Usage:

  • add request to a queue:

    >>> data = [{'fp': 'page1.html', 'p': 1, 'qdata': {'depth': 1}}]
    >>> slot.q.add('example.com', data)
    
  • add fingerprints to a slot:

    >>> slot.f.add(['fp1', 'fp2'])
    
  • flush data for a slot:

    >>> slot.flush()
    
  • show amount of new requests added to a slot:

    >>> slot.newcount
    2
    
  • read requests from a slot:

    >>> slot.q.iter()
    <generator object jldecode at 0x1049aa9e8>
    >>> slot.q.list()
    [{'id': '0115a8579633600006',
      'requests': [['page1.html', {'depth': 1}]]}]
    
  • read fingerprints from a slot:

    >>> slot.f.iter()
    <generator object jldecode at 0x103de4938>
    >>> slot.f.list()
    ['page1.html']
    
  • delete a batch with requests from a slot:

    >>> slot.q.delete('0115a8579633600006')
    
  • delete a whole slot:

    >>> slot.delete()
    
delete()

Delete the slot.

f

Shortcut to have quick access to slot fingerprints.

Returns:fingerprints collection for the slot.
Return type:FrontierSlotFingerprints
flush()

Flush data for the slot.

newcount

Integer amount of new entries added to slot.

q

Shortcut to have quick access to a slot queue.

Returns:queue instance for the slot.
Return type:FrontierSlotQueue
class scrapinghub.client.frontiers.FrontierSlotFingerprints(slot)

Representation of request fingerprints collection stored in slot.

add(fps)

Add new fingerprints to slot.

Parameters:fps – a list of string fingerprints to add.
iter(**params)

Iterate through fingerprints in the slot.

Parameters:**params – (optional) additional query params for the request.
Returns:an iterator over fingerprints.
Return type:collections.Iterable[str]
list(**params)

List fingerprints in the slot.

Parameters:**params – (optional) additional query params for the request.
Returns:a list of fingerprints.
Return type:list[str]
class scrapinghub.client.frontiers.FrontierSlotQueue(slot)

Representation of request batches queue stored in slot.

add(fps)

Add requests to the queue.

delete(ids)

Delete request batches from the queue.

iter(mincount=None, **params)

Iterate through batches in the queue.

Parameters:
  • mincount – (optional) limit results with min amount of requests.
  • **params – (optional) additional query params for the request.
Returns:

an iterator over request batches in the queue where each batch is represented with a dict with (‘id’, ‘requests’) fields.

Return type:

collections.Iterable[dict]

list(mincount=None, **params)

List request batches in the queue.

Parameters:
  • mincount – (optional) limit results with min amount of requests.
  • **params – (optional) additional query params for the request.
Returns:

a list of request batches in the queue where each batch is represented with a dict with (‘id’, ‘requests’) fields.

Return type:

list[dict]

class scrapinghub.client.frontiers.Frontiers(*args, **kwargs)

Frontiers collection for a project.

Not a public constructor: use Project instance to get a Frontiers instance. See frontiers attribute.

Usage:

  • get all frontiers from a project:

    >>> project.frontiers.iter()
    <list_iterator at 0x103c93630>
    
  • list all frontiers:

    >>> project.frontiers.list()
    ['test', 'test1', 'test2']
    
  • get a frontier by name:

    >>> project.frontiers.get('test')
    <scrapinghub.client.frontiers.Frontier at 0x1048ae4a8>
    
  • flush data of all frontiers of a project:

    >>> project.frontiers.flush()
    
  • show amount of new requests added for all frontiers:

    >>> project.frontiers.newcount
    3
    
  • close batch writers of all frontiers of a project:

    >>> project.frontiers.close()
    
close()

Close frontier writer threads one-by-one.

flush()

Flush data in all frontiers writer threads.

get(name)

Get a frontier by name.

Parameters:name – a frontier name string.
Returns:a frontier instance.
Return type:Frontier
iter()

Iterate through frontiers.

Returns:an iterator over frontiers names.
Return type:collections.Iterable[str]
list()

List frontiers names.

Returns:a list of frontiers names.
Return type:list[str]
newcount

Integer amount of new entries added to all frontiers.

Items

class scrapinghub.client.items.Items(cls, client, key)

Representation of collection of job items.

Not a public constructor: use Job instance to get a Items instance. See items attribute.

Please note that the list() method can use a lot of memory; for a large number of items it’s recommended to iterate through them via the iter() method instead (all params and available filters are the same for both methods).

Usage:

  • retrieve all scraped items from a job:

    >>> job.items.iter()
    <generator object mpdecode at 0x10f5f3aa0>
    
  • iterate through first 100 items and print them:

    >>> for item in job.items.iter(count=100):
    ...     print(item)
    
  • retrieve items with timestamp greater or equal to given timestamp (item here is an arbitrary dictionary depending on your code):

    >>> job.items.list(startts=1447221694537)
    [{
        'name': ['Some custom item'],
        'url': 'http://some-url/item.html',
        'size': 100000,
    }]
    
  • retrieve 1 item with multiple filters:

    >>> filters = [("size", ">", [30000]), ("size", "<", [40000])]
    >>> job.items.list(count=1, filter=filters)
    [{
        'name': ['Some other item'],
        'url': 'http://some-url/other-item.html',
        'size': 35000,
    }]
    
close(block=True)

Close writers one-by-one.

flush()

Flush data from writer threads.

get(key, **params)

Get element from collection.

Parameters:key – element key.
Returns:a dictionary with element data.
Return type:dict
iter(_path=None, count=None, requests_params=None, **apiparams)

A general method to iterate through elements.

Parameters:count – limit amount of elements.
Returns:an iterator over elements list.
Return type:collections.Iterable
list(*args, **kwargs)

Convenient shortcut to list iter results.

Please note that the list() method can use a lot of memory; for a large amount of elements it’s recommended to iterate through them via the iter() method instead (all params and available filters are the same for both methods).

stats()

Get resource stats.

Returns:a dictionary with stats data.
Return type:dict
write(item)

Write new element to collection.

Parameters:item – element data dict to write.
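
Usage: a minimal sketch of writing and flushing items (normally done from within a running job; the item fields are illustrative):

```python
>>> job.items.write({'name': 'Some item', 'url': 'http://some-url/item.html'})
>>> job.items.flush()  # send buffered items now
>>> job.items.close()  # flush and stop writer threads
```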

Jobs

class scrapinghub.client.jobs.Job(client, job_key)

Class representing a job object.

Not a public constructor: use ScrapinghubClient instance or Jobs instance to get a Job instance. See scrapinghub.client.ScrapinghubClient.get_job() and Jobs.get() methods.

Variables:
  • project_id – integer project id.
  • key – a job key.
  • itemsItems resource object.
  • logsLogs resource object.
  • requestsRequests resource object.
  • samplesSamples resource object.
  • metadataJobMeta resource object.

Usage:

>>> job = project.jobs.get('123/1/2')
>>> job.key
'123/1/2'
>>> job.metadata.get('state')
'finished'
cancel()

Schedule a running job for cancellation.

Usage:

>>> job.cancel()
>>> job.metadata.get('cancelled_by')
'John'
close_writers()

Stop job batch writers threads gracefully.

Called on ScrapinghubClient.close() method.

delete(**params)

Mark finished job for deletion.

Parameters:**params – (optional) keyword meta parameters to update.
Returns:a previous string job state.
Return type:str

Usage:

>>> job.delete()
'finished'
finish(**params)

Move running job to finished state.

Parameters:**params – (optional) keyword meta parameters to update.
Returns:a previous string job state.
Return type:str

Usage:

>>> job.finish()
'running'
start(**params)

Move job to running state.

Parameters:**params – (optional) keyword meta parameters to update.
Returns:a previous string job state.
Return type:str

Usage:

>>> job.start()
'pending'
update(state, **params)

Update job state.

Parameters:
  • state – a new job state.
  • **params – (optional) keyword meta parameters to update.
Returns:

a previous string job state.

Return type:

str

Usage:

>>> job.update('finished')
'running'
update_tags(add=None, remove=None)

Partially update job tags.

It provides a convenient way to mark specific jobs (for better search, postprocessing, etc.).

Parameters:
  • add – (optional) list of tags to add.
  • remove – (optional) list of tags to remove.

Usage: to mark a job with tag consumed:

>>> job.update_tags(add=['consumed'])
class scrapinghub.client.jobs.JobMeta(cls, client, key)

Class representing job metadata.

Not a public constructor: use Job instance to get a JobMeta instance. See metadata attribute.

Usage:

  • get job metadata instance:

    >>> job.metadata
    <scrapinghub.client.jobs.JobMeta at 0x10494f198>
    
  • iterate through job metadata:

    >>> job.metadata.iter()
    <dict_itemiterator at 0x104adbd18>
    
  • list job metadata:

    >>> job.metadata.list()
    [('project', 123), ('units', 1), ('state', 'finished'), ...]
    
  • get meta field value by name:

    >>> job.metadata.get('version')
    'test'
    
  • update job meta field value (some meta fields are read-only):

    >>> job.metadata.set('my-meta', 'test')
    
  • update multiple meta fields at once

    >>> job.metadata.update({'my-meta1': 'test1', 'my-meta2': 'test2'})
    
  • delete meta field by name:

    >>> job.metadata.delete('my-meta')
    
delete(key)

Delete element by key.

Parameters:key – a string key
get(key)

Get element value by key.

Parameters:key – a string key
iter()

Iterate through key/value pairs.

Returns:an iterator over key/value pairs.
Return type:collections.Iterable
list(*args, **kwargs)

Convenient shortcut to list iter results.

Please note that the list() method can use a lot of memory; for a large amount of elements it’s recommended to iterate through them via the iter() method instead (all params and available filters are the same for both methods).

set(key, value)

Set element value.

Parameters:
  • key – a string key
  • value – new value to set for the key
update(values)

Update multiple elements at once.

The method provides convenient interface for partial updates.

Parameters:values – a dictionary with key/values to update.
class scrapinghub.client.jobs.Jobs(client, project_id, spider=None)

Class representing a collection of jobs for a project/spider.

Not a public constructor: use Project instance or Spider instance to get a Jobs instance. See scrapinghub.client.projects.Project.jobs and scrapinghub.client.spiders.Spider.jobs attributes.

Variables:
  • project_id – a string project id.
  • spiderSpider object if defined.

Usage:

>>> project.jobs
<scrapinghub.client.jobs.Jobs at 0x10477f0b8>
>>> spider = project.spiders.get('spider1')
>>> spider.jobs
<scrapinghub.client.jobs.Jobs at 0x104767e80>
cancel(keys=None, count=None, **params)

Cancel a list of jobs using the keys provided.

Parameters:
  • keys – (optional) a list of strings containing the job keys in the format: <project>/<spider>/<job_id>.
  • count – (optional) amount of jobs to bulk-cancel; requires admin access.
Returns:

a dict with the amount of jobs cancelled.

Return type:

dict

Usage:

  • cancel jobs 123 and 321 from project 111 and spiders 222 and 333:

    >>> project.jobs.cancel(['111/222/123', '111/333/321'])
    {'count': 2}
    
  • cancel 100 jobs asynchronously:

    >>> project.jobs.cancel(count=100)
    {'count': 100}
    
count(spider=None, state=None, has_tag=None, lacks_tag=None, startts=None, endts=None, **params)

Count jobs with a given set of filters.

Parameters:
  • spider – (optional) filter by spider name.
  • state – (optional) a job state, a string or a list of strings.
  • has_tag – (optional) filter results by existing tag(s), a string or a list of strings.
  • lacks_tag – (optional) filter results by missing tag(s), a string or a list of strings.
  • startts – (optional) UNIX timestamp at which to begin results, in milliseconds.
  • endts – (optional) UNIX timestamp at which to end results, in milliseconds.
  • **params – (optional) other filter params.
Returns:

jobs count.

Return type:

int

The endpoint used by the method counts only finished jobs by default; use the state parameter to count jobs in other states.

Usage:

>>> spider = project.spiders.get('spider1')
>>> spider.jobs.count()
5
>>> project.jobs.count(spider='spider2', state='finished')
2
get(job_key)

Get a Job with a given job_key.

Parameters:job_key – a string job key.

job_key’s project component should match the project used to get Jobs instance, and job_key’s spider component should match the spider (if Spider was used to get Jobs instance).

Returns:a job object.
Return type:Job

Usage:

>>> job = project.jobs.get('123/1/2')
>>> job.key
'123/1/2'
iter(count=None, start=None, spider=None, state=None, has_tag=None, lacks_tag=None, startts=None, endts=None, meta=None, **params)

Iterate over jobs collection for a given set of params.

Parameters:
  • count – (optional) limit amount of returned jobs.
  • start – (optional) number of jobs to skip in the beginning.
  • spider – (optional) filter by spider name.
  • state – (optional) a job state, a string or a list of strings.
  • has_tag – (optional) filter results by existing tag(s), a string or a list of strings.
  • lacks_tag – (optional) filter results by missing tag(s), a string or a list of strings.
  • startts – (optional) UNIX timestamp at which to begin results, in milliseconds.
  • endts – (optional) UNIX timestamp at which to end results, in milliseconds.
  • meta – (optional) request for additional fields, a single field name or a list of field names to return.
  • **params – (optional) other filter params.
Returns:

a generator object over a list of dictionaries of jobs summary for a given filter params.

Return type:

types.GeneratorType[dict]

The endpoint used by the method returns only finished jobs by default; use the state parameter to return jobs in other states.

Usage:

  • retrieve all jobs for a spider:

    >>> spider.jobs.iter()
    <generator object jldecode at 0x1049bd570>
    
  • get all job keys for a spider:

    >>> jobs_summary = spider.jobs.iter()
    >>> [job['key'] for job in jobs_summary]
    ['123/1/3', '123/1/2', '123/1/1']
    
  • the job summary fieldset is less detailed than JobMeta but contains a few extra fields as well. Additional fields can be requested using the meta parameter; if it is used, it’s up to the user to list all the required fields, since only a few default fields are added beyond the requested ones:

    >>> jobs_summary = project.jobs.iter(meta=['scheduled_by', ])
    
  • by default Jobs.iter() returns at most the last 1000 results. Pagination is available using the start parameter:

    >>> jobs_summary = spider.jobs.iter(start=1000)
    
  • get jobs filtered by tags (a list of tags is combined with OR):

    >>> jobs_summary = project.jobs.iter(
    ...     has_tag=['new', 'verified'], lacks_tag='obsolete')
    
  • get a certain number of the last finished jobs for a given spider:

    >>> jobs_summary = project.jobs.iter(
    ...     spider='spider2', state='finished', count=3)
    
iter_last(start=None, start_after=None, count=None, spider=None, **params)

Iterate through last jobs for each spider.

Parameters:
  • start – (optional)
  • start_after – (optional)
  • count – (optional)
  • spider – (optional) a spider name (not needed if instantiated with Spider).
  • **params – (optional) additional keyword args.
Returns:

a generator object over a list of dictionaries of jobs summary for a given filter params.

Return type:

types.GeneratorType[dict]

Usage:

  • get all last job summaries for a project:

    >>> project.jobs.iter_last()
    <generator object jldecode at 0x1048a95c8>
    
  • get last job summary for a spider:

    >>> list(spider.jobs.iter_last())
    [{'close_reason': 'success',
      'elapsed': 3062444,
      'errors': 1,
      'finished_time': 1482911633089,
      'key': '123/1/3',
      'logs': 8,
      'pending_time': 1482911596566,
      'running_time': 1482911598909,
      'spider': 'spider1',
      'state': 'finished',
      'ts': 1482911615830,
      'version': 'some-version'}]
    
list(count=None, start=None, spider=None, state=None, has_tag=None, lacks_tag=None, startts=None, endts=None, meta=None, **params)

Convenient shortcut to list iter results.

Parameters:
  • count – (optional) limit amount of returned jobs.
  • start – (optional) number of jobs to skip in the beginning.
  • spider – (optional) filter by spider name.
  • state – (optional) a job state, a string or a list of strings.
  • has_tag – (optional) filter results by existing tag(s), a string or a list of strings.
  • lacks_tag – (optional) filter results by missing tag(s), a string or a list of strings.
  • startts – (optional) UNIX timestamp at which to begin results, in milliseconds.
  • endts – (optional) UNIX timestamp at which to end results, in milliseconds.
  • meta – (optional) request for additional fields, a single field name or a list of field names to return.
  • **params – (optional) other filter params.
Returns:

list of dictionaries of jobs summary for a given filter params.

Return type:

list[dict]

The endpoint used by the method returns only finished jobs by default; use the state parameter to return jobs in other states.

Please note that list() can use a lot of memory; for a large amount of jobs it’s recommended to iterate through them via the iter() method instead (all params and available filters are the same for both methods).

run(spider=None, units=None, priority=None, meta=None, add_tag=None, job_args=None, job_settings=None, cmd_args=None, environment=None, **params)

Schedule a new job and return a Job instance for it.

Parameters:
  • spider – a spider name string (not needed if job is scheduled via Spider.jobs).
  • units – (optional) amount of units for the job.
  • priority – (optional) integer priority value.
  • meta – (optional) a dictionary with metadata.
  • add_tag – (optional) a string tag or a list of tags to add.
  • job_args – (optional) a dictionary with job arguments.
  • job_settings – (optional) a dictionary with job settings.
  • cmd_args – (optional) a string with script command args.
  • environment – (optional) a dictionary with custom environment variables.
  • **params – (optional) additional keyword args.
Returns:

a job instance, representing the scheduled job.

Return type:

Job

Usage:

>>> job = project.jobs.run('spider1', job_args={'arg1': 'val1'})
>>> job
<scrapinghub.client.jobs.Job at 0x7fcb7c01df60>
>>> job.key
'123/1/1'
summary(state=None, spider=None, **params)

Get jobs summary (optionally by state).

Parameters:
  • state – (optional) a string state to filter jobs.
  • spider – (optional) a spider name (not needed if instantiated with Spider).
  • **params – (optional) additional keyword args.
Returns:

a list of dictionaries of jobs summary for a given filter params grouped by job state.

Return type:

list[dict]

Usage:

>>> spider.jobs.summary()
[{'count': 0, 'name': 'pending', 'summary': []},
 {'count': 0, 'name': 'running', 'summary': []},
 {'count': 5, 'name': 'finished', 'summary': [...]}]

>>> project.jobs.summary('pending')
{'count': 0, 'name': 'pending', 'summary': []}
update_tags(add=None, remove=None, spider=None)

Update tags for all existing spider jobs.

Parameters:
  • add – (optional) list of tags to add to selected jobs.
  • remove – (optional) list of tags to remove from selected jobs.
  • spider – (optional) spider name; required if used with Project.jobs.

It’s not allowed to update tags for all project jobs, so spider must be specified (it’s done implicitly when using Spider.jobs, or you have to specify spider param when using Project.jobs).

Returns:number of jobs that were updated.
Return type:int

Usage:

  • mark all spider jobs with tag consumed:

    >>> spider = project.spiders.get('spider1')
    >>> spider.jobs.update_tags(add=['consumed'])
    5
    
  • remove the tag existing from all spider jobs:

    >>> project.jobs.update_tags(
    ...     remove=['existing'], spider='spider2')
    2
    

Logs

class scrapinghub.client.logs.Logs(cls, client, key)

Representation of collection of job logs.

Not a public constructor: use Job instance to get a Logs instance. See logs attribute.

Please note that list() method can use a lot of memory and for a large amount of logs it’s recommended to iterate through it via iter() method (all params and available filters are same for both methods).

Usage:

  • retrieve all logs from a job:

    >>> job.logs.iter()
    <generator object mpdecode at 0x10f5f3aa0>
    
  • iterate through first 100 log entries and print them:

    >>> for log in job.logs.iter(count=100):
    ...     print(log)
    
  • retrieve a single log entry from a job:

    >>> job.logs.list(count=1)
    [{
        'level': 20,
        'message': '[scrapy.core.engine] Closing spider (finished)',
        'time': 1482233733976,
    }]
    
  • retrieve logs with a given log level, filtered by a word:

    >>> filters = [("message", "contains", ["mymessage"])]
    >>> job.logs.list(level='WARNING', filter=filters)
    [{
        'level': 30,
        'message': 'Some warning: mymessage',
        'time': 1486375511188,
    }]
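Since each log entry is a plain dict with level, message, and time keys, filtering can also be done client-side while iterating; a hypothetical helper, sketched over the sample entries above:

```python
def filter_by_level(entries, min_level=30):
    """Keep log entries at or above a numeric level (30 == WARNING)."""
    return [entry for entry in entries if entry['level'] >= min_level]

entries = [
    {'level': 20, 'message': '[scrapy.core.engine] Closing spider (finished)',
     'time': 1482233733976},
    {'level': 30, 'message': 'Some warning: mymessage',
     'time': 1486375511188},
]
print(filter_by_level(entries))  # only the level-30 warning survives
```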
    
batch_write_start()

Override to set a start parameter when commencing writing.

close(block=True)

Close writers one-by-one.

debug(message, **other)

Log a message with DEBUG level.

error(message, **other)

Log a message with ERROR level.

flush()

Flush data from writer threads.

get(key, **params)

Get element from collection.

Parameters:key – element key.
Returns:a dictionary with element data.
Return type:dict
info(message, **other)

Log a message with INFO level.

iter(_path=None, count=None, requests_params=None, **apiparams)

A general method to iterate through elements.

Parameters:count – limit amount of elements.
Returns:an iterator over elements list.
Return type:collections.Iterable
list(*args, **kwargs)

Convenient shortcut to list iter results.

Please note that list() method can use a lot of memory and for a large amount of elements it’s recommended to iterate through it via iter() method (all params and available filters are same for both methods).

log(message, level=20, ts=None, **other)

Base method to write a log entry.

Parameters:
  • message – a string message.
  • level – (optional) logging level, default to INFO.
  • ts – (optional) UNIX timestamp in milliseconds.
  • **other – other optional kwargs.
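The numeric level values shown in the examples above (20 for INFO entries, 30 for warnings) line up with Python's stdlib logging constants, so — assuming level is an ordinary integer parameter — one can pass logging.WARNING instead of a bare 30:

```python
import logging

# stdlib numeric levels matching the entries shown above
print(logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR)
# → 10 20 30 40
```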
stats()

Get resource stats.

Returns:a dictionary with stats data.
Return type:dict
warn(message, **other)

Log a message with WARN level.

warning(message, **other)

Log a message with WARN level.

write(item)

Write new element to collection.

Parameters:item – element data dict to write.

Projects

class scrapinghub.client.projects.Project(client, project_id)

Class representing a project object and its resources.

Not a public constructor: use ScrapinghubClient instance or Projects instance to get a Project instance. See scrapinghub.client.ScrapinghubClient.get_project() or Projects.get() methods.

Variables:
  • key – string project id.
  • activityActivity resource object.
  • collectionsCollections resource object.
  • frontiersFrontiers resource object.
  • jobsJobs resource object.
  • settingsSettings resource object.
  • spidersSpiders resource object.

Usage:

>>> project = client.get_project(123)
>>> project
<scrapinghub.client.projects.Project at 0x106cdd6a0>
>>> project.key
'123'
class scrapinghub.client.projects.Projects(client)

Collection of projects available to current user.

Not a public constructor: use a ScrapinghubClient instance to get a Projects instance. See the scrapinghub.client.ScrapinghubClient.projects attribute.

Usage:

>>> client.projects
<scrapinghub.client.projects.Projects at 0x1047ada58>
get(project_id)

Get project for a given project id.

Parameters:project_id – integer or string numeric project id.
Returns:a project object.
Return type:Project

Usage:

>>> project = client.projects.get(123)
>>> project
<scrapinghub.client.projects.Project at 0x106cdd6a0>
iter()

Iterate through list of projects available to current user.

Provided for the sake of API consistency.

Returns:an iterator over project ids list.
Return type:collections.Iterable[int]
list()

Get list of projects available to current user.

Returns:a list of project ids.
Return type:list[int]

Usage:

>>> client.projects.list()
[123, 456]
summary(state=None, **params)

Get short summaries for all available user projects.

Parameters:state – a string state or a list of states.
Returns:a list of dictionaries: each dictionary represents a project summary (amount of pending/running/finished jobs and a flag if it has a capacity to run new jobs).
Return type:list[dict]

Usage:

>>> client.projects.summary()
[{'finished': 674,
  'has_capacity': True,
  'pending': 0,
  'project': 123,
  'running': 1},
 {'finished': 33079,
  'has_capacity': True,
  'pending': 0,
  'project': 456,
  'running': 2}]
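Each project summary carries a has_capacity flag, so finding projects that can run new jobs is a one-liner; a hypothetical helper, sketched over the sample output above:

```python
def projects_with_capacity(summaries):
    """Return ids of projects that have capacity to run new jobs."""
    return [s['project'] for s in summaries if s['has_capacity']]

summaries = [
    {'finished': 674, 'has_capacity': True, 'pending': 0,
     'project': 123, 'running': 1},
    {'finished': 33079, 'has_capacity': True, 'pending': 0,
     'project': 456, 'running': 2},
]
print(projects_with_capacity(summaries))  # → [123, 456]
```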
class scrapinghub.client.projects.Settings(cls, client, key)

Class representing project settings.

Not a public constructor: use Project instance to get a Settings instance. See Project.settings attribute.

Usage:

  • get project settings instance:

    >>> project.settings
    <scrapinghub.client.projects.Settings at 0x10ecf1250>
    
  • iterate through project settings:

    >>> project.settings.iter()
    <dictionary-itemiterator at 0x10ed11578>
    
  • list project settings:

    >>> project.settings.list()
    [(u'default_job_units', 2), (u'job_runtime_limit', 20)]
    
  • get setting value by name:

    >>> project.settings.get('default_job_units')
    2
    
  • update setting value (some settings are read-only):

    >>> project.settings.set('default_job_units', 2)
    
  • update multiple settings at once:

    >>> project.settings.update({'default_job_units': 1,
    ...                          'job_runtime_limit': 20})
    
  • delete project setting by name:

    >>> project.settings.delete('job_runtime_limit')
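Because list() returns the settings as (key, value) pairs, they can be loaded into a plain dict in one step; a sketch over the sample output above:

```python
# pairs as returned by project.settings.list()
pairs = [('default_job_units', 2), ('job_runtime_limit', 20)]
settings = dict(pairs)
print(settings['default_job_units'])  # → 2
```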
    
delete(key)

Delete element by key.

Parameters:key – a string key
get(key)

Get element value by key.

Parameters:key – a string key
iter()

Iterate through key/value pairs.

Returns:an iterator over key/value pairs.
Return type:collections.Iterable
list(*args, **kwargs)

Convenient shortcut to list iter results.

Please note that list() method can use a lot of memory and for a large amount of elements it’s recommended to iterate through it via iter() method (all params and available filters are same for both methods).

set(key, value)

Update project setting value by key.

Parameters:
  • key – a string setting key.
  • value – new setting value.
update(values)

Update multiple elements at once.

The method provides a convenient interface for partial updates.

Parameters:values – a dictionary with key/values to update.

Requests

class scrapinghub.client.requests.Requests(cls, client, key)

Representation of collection of job requests.

Not a public constructor: use Job instance to get a Requests instance. See requests attribute.

Please note that the list() method can use a lot of memory, so for a large amount of requests it’s recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).

Usage:

  • retrieve all requests from a job:

    >>> job.requests.iter()
    <generator object mpdecode at 0x10f5f3aa0>
    
  • iterate through the requests:

    >>> for reqitem in job.requests.iter(count=1):
    ...     print(reqitem['time'])
    1482233733870
    
  • retrieve a single request from a job:

    >>> job.requests.list(count=1)
    [{
        'duration': 354,
        'fp': '6d748741a927b10454c83ac285b002cd239964ea',
        'method': 'GET',
        'rs': 1270,
        'status': 200,
        'time': 1482233733870,
        'url': 'https://example.com'
    }]
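Each request entry is a plain dict, so aggregate stats can be computed client-side while iterating; a hypothetical helper summing response sizes, sketched over the sample entry above:

```python
def total_response_bytes(requests):
    """Sum the rs (response body length) field across request entries."""
    return sum(req['rs'] for req in requests)

requests = [
    {'duration': 354, 'fp': '6d748741a927b10454c83ac285b002cd239964ea',
     'method': 'GET', 'rs': 1270, 'status': 200,
     'time': 1482233733870, 'url': 'https://example.com'},
]
print(total_response_bytes(requests))  # → 1270
```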
    
add(url, status, method, rs, duration, ts, parent=None, fp=None)

Add a new request.

Parameters:
  • url – string url for the request.
  • status – HTTP status of the request.
  • method – stringified request method.
  • rs – response body length.
  • duration – request duration in milliseconds.
  • ts – UNIX timestamp in milliseconds.
  • parent – (optional) parent request id.
  • fp – (optional) string fingerprint for the request.
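The ts parameter is a UNIX timestamp in milliseconds and fp an arbitrary string fingerprint; one way such values might be prepared before calling add() (the sha1-based fingerprint and the add() call below are illustrative assumptions, not the client's own scheme):

```python
import hashlib
import time

url = 'https://example.com'
ts = int(time.time() * 1000)                  # current time, milliseconds
fp = hashlib.sha1(url.encode()).hexdigest()   # one possible fingerprint

# Inside a running job one might then record the request, e.g.:
# job.requests.add(url=url, status=200, method='GET', rs=1270,
#                  duration=354, ts=ts, fp=fp)
```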
close(block=True)

Close writers one-by-one.

flush()

Flush data from writer threads.

get(key, **params)

Get element from collection.

Parameters:key – element key.
Returns:a dictionary with element data.
Return type:dict
iter(_path=None, count=None, requests_params=None, **apiparams)

A general method to iterate through elements.

Parameters:count – limit amount of elements.
Returns:an iterator over elements list.
Return type:collections.Iterable
list(*args, **kwargs)

Convenient shortcut to list iter results.

Please note that list() method can use a lot of memory and for a large amount of elements it’s recommended to iterate through it via iter() method (all params and available filters are same for both methods).

stats()

Get resource stats.

Returns:a dictionary with stats data.
Return type:dict
write(item)

Write new element to collection.

Parameters:item – element data dict to write.

Samples

class scrapinghub.client.samples.Samples(cls, client, key)

Representation of collection of job samples.

Not a public constructor: use Job instance to get a Samples instance. See samples attribute.

Please note that the list() method can use a lot of memory, so for a large amount of samples it’s recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).

Usage:

  • retrieve all samples from a job:

    >>> job.samples.iter()
    <generator object mpdecode at 0x10f5f3aa0>
    
  • retrieve samples with a timestamp greater than or equal to a given timestamp:

    >>> job.samples.list(startts=1484570043851)
    [[1484570043851, 554, 576, 1777, 821, 0],
     [1484570046673, 561, 583, 1782, 821, 0]]
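Each sample row is a plain list whose first element is the timestamp, so the time axis can be pulled out directly; a sketch over the sample output above:

```python
# rows as returned by job.samples.list(); index 0 holds the timestamp
samples = [
    [1484570043851, 554, 576, 1777, 821, 0],
    [1484570046673, 561, 583, 1782, 821, 0],
]
timestamps = [row[0] for row in samples]
print(timestamps)  # → [1484570043851, 1484570046673]
```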
    
close(block=True)

Close writers one-by-one.

flush()

Flush data from writer threads.

get(key, **params)

Get element from collection.

Parameters:key – element key.
Returns:a dictionary with element data.
Return type:dict
iter(_key=None, count=None, **params)

Iterate over elements in collection.

Parameters:count – limit amount of elements.
Returns:a generator object over a list of element dictionaries.
Return type:types.GeneratorType[dict]
list(*args, **kwargs)

Convenient shortcut to list iter results.

Please note that list() method can use a lot of memory and for a large amount of elements it’s recommended to iterate through it via iter() method (all params and available filters are same for both methods).

stats()

Get resource stats.

Returns:a dictionary with stats data.
Return type:dict
write(item)

Write new element to collection.

Parameters:item – element data dict to write.

Spiders

class scrapinghub.client.spiders.Spider(client, project_id, spider_id, spider)

Class representing a Spider object.

Not a public constructor: use Spiders instance to get a Spider instance. See Spiders.get() method.

Variables:
  • project_id – a string project id.
  • key – a string key in format ‘project_id/spider_id’.
  • name – a spider name string.
  • jobs – a collection of jobs, Jobs object.

Usage:

>>> spider = project.spiders.get('spider1')
>>> spider.key
'123/1'
>>> spider.name
'spider1'
list_tags(**kwargs)

List spider tags.

Returns:a list of spider tags.
Return type:list[str]
update_tags(**kwargs)

Update tags for the spider.

Parameters:
  • add – (optional) a list of string tags to add.
  • remove – (optional) a list of string tags to remove.
class scrapinghub.client.spiders.Spiders(client, project_id)

Class to work with a collection of project spiders.

Not a public constructor: use Project instance to get a Spiders instance. See spiders attribute.

Variables:project_id – string project id.

Usage:

>>> project.spiders
<scrapinghub.client.spiders.Spiders at 0x1049ca630>
get(spider, **params)

Get a spider object for a given spider name.

The method gets/sets the spider id (and checks that the spider exists).

Parameters:spider – a string spider name.
Returns:a spider object.
Return type:scrapinghub.client.spiders.Spider

Usage:

>>> project.spiders.get('spider2')
<scrapinghub.client.spiders.Spider at 0x106ee3748>
>>> project.spiders.get('non-existing')
NotFound: Spider non-existing doesn't exist.
iter()

Iterate through a list of spiders for a project.

Provided for the sake of API consistency.

Returns:an iterator over the spiders list, where each spider is represented as a dict containing its metadata.
Return type:collections.Iterable[dict]

list()

Get a list of spiders for a project.

Returns:a list of dictionaries with spiders metadata.
Return type:list[dict]

Usage:

>>> project.spiders.list()
[{'id': 'spider1', 'tags': [], 'type': 'manual', 'version': '123'},
 {'id': 'spider2', 'tags': [], 'type': 'manual', 'version': '123'}]
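Since list() returns metadata dicts, pulling out just the spider names (to feed into Spiders.get(), for instance) is straightforward; a hypothetical helper, sketched over the sample output above:

```python
def spider_names(spiders):
    """Extract spider ids from the metadata dicts returned by list()."""
    return [meta['id'] for meta in spiders]

spiders = [
    {'id': 'spider1', 'tags': [], 'type': 'manual', 'version': '123'},
    {'id': 'spider2', 'tags': [], 'type': 'manual', 'version': '123'},
]
print(spider_names(spiders))  # → ['spider1', 'spider2']
```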