scrapelib overview

scrapelib is configured by instantiating a Scraper with the desired options and paths.

Scraper object

class scrapelib.Scraper(raise_errors=True, requests_per_minute=60, retry_attempts=0, retry_wait_seconds=5, header_func=None)

Scraper is the most important class provided by scrapelib (and generally the only one to be instantiated directly). It provides a large number of options allowing for customization.

Typical usage is to create an instance with the desired options and then call the urlopen() and urlretrieve() methods of that instance.

Parameters:
  • raise_errors – set to True to raise an HTTPError on a 4xx or 5xx response
  • requests_per_minute – maximum requests per minute (0 for unlimited, defaults to 60)
  • retry_attempts – number of times to retry if a timeout occurs or the page returns a (non-404) error
  • retry_wait_seconds – number of seconds to wait before the first retry; each subsequent retry doubles this wait
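
The throttling and retry parameters above imply some simple arithmetic. The sketch below is not scrapelib's implementation; min_request_interval and retry_waits are hypothetical helper names, and it assumes that the wait before the first retry equals retry_wait_seconds and doubles on each later retry, as the parameter description suggests.

```python
def min_request_interval(requests_per_minute):
    """Seconds to leave between requests; 0 (or less) disables throttling here."""
    if requests_per_minute <= 0:
        return 0.0
    return 60.0 / requests_per_minute


def retry_waits(retry_attempts, retry_wait_seconds):
    """Wait before each retry: retry_wait_seconds first, then doubling."""
    return [retry_wait_seconds * 2 ** attempt for attempt in range(retry_attempts)]


print(min_request_interval(60))  # 1.0
print(retry_waits(3, 5))         # [5, 10, 20]
```

Under these assumptions, retry_attempts=3 with the default retry_wait_seconds=5 could add up to 35 seconds of waiting to a single failing request.
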
urlretrieve(url, filename=None, method='GET', body=None, dir=None, **kwargs)

Save result of a request to a file, similarly to urllib.urlretrieve().

If an error is encountered, any of the scrapelib exceptions may be raised.

A filename may be provided or urlretrieve() will safely create a temporary file. If a directory is provided, a file will be given a random name within the specified directory. Either way, it is the responsibility of the caller to ensure that the temporary file is deleted when it is no longer needed.
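
Because cleanup is left to the caller, one possible pattern is to wrap the call in try/finally. The sketch below is only an illustration: retrieve_text is a hypothetical helper, and scraper stands for any object whose urlretrieve() behaves as documented here (for example a Scraper instance from a scrapelib version matching this page).

```python
import os


def retrieve_text(scraper, url, dir=None, encoding="utf-8"):
    """Download url via scraper.urlretrieve(), return its text, then clean up.

    `scraper` is any object exposing urlretrieve(url, dir=...) that returns
    the documented (filename, response) tuple.
    """
    filename, response = scraper.urlretrieve(url, dir=dir)
    try:
        with open(filename, "rb") as f:
            return f.read().decode(encoding)
    finally:
        # No filename was supplied, so the file is temporary and ours to delete.
        os.remove(filename)
```

The helper deliberately passes no filename, so the file it deletes is always the temporary one created by urlretrieve().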

Parameters:
  • url – URL for request
  • filename – optional name for file
  • method – any valid HTTP method, but generally GET or POST
  • body – optional body for request, to turn parameters into an appropriate string use urllib.urlencode()
  • dir – optional directory to place file in
Returns:
  • filename, response – tuple with the filename of the saved response (the same as the given filename if one was provided, otherwise a temporary file, placed in dir if given or else in the OS temp directory) and a Response object that can be used to inspect the response headers
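
As a small illustration of the body parameter, the snippet below turns a dictionary of form parameters into a request body string. It assumes Python 3, where the function lives at urllib.parse.urlencode (on Python 2 it is urllib.urlencode, as referenced above); the parameter names are made up for the example.

```python
from urllib.parse import urlencode  # Python 2: from urllib import urlencode

# Hypothetical form parameters, used only for illustration.
params = {"q": "open data", "page": 2}
body = urlencode(params)
print(body)  # q=open+data&page=2
```

The resulting string is the kind of value that would be passed as body, typically together with method='POST'.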

Response objects

Exceptions

All scrapelib exceptions are subclasses of ScrapeError.

class scrapelib.HTTPMethodUnavailableError(message, method)

Raised when the supplied HTTP method is invalid or not supported by the HTTP backend.

class scrapelib.HTTPError(response, body=None)

Raised when urlopen encounters a 4xx or 5xx error code and the raise_errors option is true.
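
Since every exception derives from ScrapeError, catching the base class is enough to handle any of them. The sketch below uses stand-in classes that only mirror the documented names and the HTTPError(response, body=None) signature, so it runs without scrapelib installed; check_status is a hypothetical function that imitates the raise_errors behaviour, and it passes a bare status code where the real library would pass a response object.

```python
class ScrapeError(Exception):
    """Stand-in for scrapelib.ScrapeError (not the real class)."""


class HTTPError(ScrapeError):
    """Stand-in mirroring the documented HTTPError(response, body=None)."""

    def __init__(self, response, body=None):
        super().__init__("HTTP error: %s" % response)
        self.response = response
        self.body = body


def check_status(status_code, raise_errors=True):
    """Raise HTTPError for 4xx/5xx codes when raise_errors is true."""
    if raise_errors and 400 <= status_code < 600:
        raise HTTPError(status_code)
    return status_code


try:
    check_status(503)
except ScrapeError as exc:
    # Catching the base class also catches HTTPError.
    caught = type(exc).__name__

print(caught)                                 # HTTPError
print(check_status(503, raise_errors=False))  # 503
```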