storage

This module contains the abstract backend class as well as the built-in local backend LevelDBBackend.


class polymr.storage.AbstractBackend

A backend is a data access object, abstracting away details for getting data from places like LevelDB and Redis.

close()

Close the backend. Clean up any temporary files. Close any connections.

delete_record(idx)

Drop a record from the backend.

Parameters:idx (int) – The index of the record to drop
drop_records_from_token(name, bad_record_ids)

Remove record ids from the list of ids associated with a token.

Parameters:
  • name (bytes) – The token
  • bad_record_ids (list of int) – The record ids to remove
find_least_frequent_tokens(toks, r, k=None)

Filter a list of tokens by token frequency. More frequent (i.e. common) tokens are dropped before less frequent ones. The result must respect both the r and k limits, whichever is more restrictive: for example, if r=100 alone would return three tokens but k=2, return only the two least frequent tokens.

Parameters:
  • toks (list of bytes) – The list of tokens to filter
  • r (int) – The maximum number of record ids taken before dropping tokens.
  • k (int) – The maximum number of tokens to keep.
Returns:

A perhaps smaller list of tokens

Return type:

list of bytes
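The filtering rule above can be sketched as follows. This is a minimal illustration, not the implementation of any polymr backend: the frequency table is passed in as a plain dict (the `freqs` argument is an assumption made for the sketch; the real method takes only toks, r and k), and unknown tokens are simply ignored.

```python
def find_least_frequent_tokens(freqs, toks, r, k=None):
    """Keep the rarest tokens whose combined record count stays within r.

    freqs is a plain dict {token: record count}, a hypothetical stand-in
    for whatever frequency table a real backend stores.
    """
    # Ignore tokens the index has never seen, then sort rarest-first.
    known = sorted((t for t in toks if t in freqs), key=lambda t: freqs[t])
    kept, total = [], 0
    for tok in known:
        if k is not None and len(kept) >= k:
            break  # token cap (k) reached
        if total + freqs[tok] > r:
            break  # record id budget (r) exhausted
        total += freqs[tok]
        kept.append(tok)
    return kept


freqs = {b"the": 90, b"smi": 5, b"ith": 7, b"joh": 40}
toks = [b"the", b"smi", b"ith", b"joh"]
by_budget = find_least_frequent_tokens(freqs, toks, r=100)    # three tokens
by_cap = find_least_frequent_tokens(freqs, toks, r=100, k=2)  # two tokens
```

With these made-up frequencies, r=100 alone admits the three rarest tokens, and adding k=2 trims that to the two rarest.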

classmethod from_urlparsed(parsed)

Receives the parsed URL from polymr.storage.parse_url() and returns the backend object.

Parameters:parsed (6-tuple of str) – The standard library urllib.parse.urlparse 6-tuple
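For reference, this is what the standard library hands to from_urlparsed for a URL like the ones used in this module. Only urllib.parse is used here; how a given backend interprets each field is up to that backend.

```python
from urllib.parse import urlparse

parsed = urlparse("leveldb://localhost/tmp/my_db.polymr")

# ParseResult is a named 6-tuple:
# (scheme, netloc, path, params, query, fragment)
scheme, netloc, path, params, query, fragment = parsed
```

Here scheme is 'leveldb', netloc is 'localhost', path is '/tmp/my_db.polymr', and the remaining three fields are empty strings.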
get_freqs()

Get the feature frequency dict.

Returns:dict consisting of tokens and the number of records containing that token
Return type:dict {bytes: int}
get_record(idx)

Get a record by record id.

Parameters:idx (int) – The id of the record to retrieve
Return type:polymr.record.Record
get_records(idxs)

Get records by record id

Parameters:idxs (list of int) – The ids of the records to retrieve
Return type:Iterator of polymr.record.Record
get_rowcount()

Get the number of records indexed

Return type:int
get_token(name)

Get the list of records containing the named token

Parameters:name (bytes) – The token to get
Returns:The list of records containing that token
Return type:list
increment_rowcount(n)

Increase the rowcount

Parameters:n (int) – The number by which to increase the rowcount
save_freqs(d)

Save the feature frequency dict.

Parameters:d (dict {bytes: int}) – The dict consisting of tokens and the number of records containing that token
save_record(rec)

Save a record.

Parameters:rec (polymr.record.Record) – The record to save
Returns:The ID of the newly created record
Return type:int
save_records(idx_recs)

Save records.

Parameters:idx_recs (iterable of (int, record) pairs) – The (record id, record) pairs to save
Returns:The number of rows saved
Return type:int
save_rowcount(cnt)

Save the number of records indexed

Parameters:cnt (int) – The row count to save
save_token(name, record_ids)

Save the list of records containing a named token. Overwrites existing record id list.

Parameters:
  • name (bytes) – The token
  • record_ids (iterable of int) – The list of record ids containing the token
save_tokens(names_ids)

Save many tokens in bulk. See polymr.storage.AbstractBackend.save_token().

Parameters:names_ids (iterable of (bytes, list-of-int) tuples) – An iterable of two-part tuples: the token, and the ids corresponding to the token
update_freqs(toks_cnts)

Update the feature frequency dict.

update_record(rec)

Update a record. Some backends simply alias update_record to save_record.

Parameters:rec (polymr.record.Record) – The record to save
Returns:The ID of the newly created record
Return type:int
update_token(name, record_ids)

Update the list of record ids corresponding to a token. The new list of record ids corresponding to this token will be a set union of record_ids and the record ids currently in the backend.

Parameters:
  • name (bytes) – The token
  • record_ids (list of int) – The list of record ids containing the token
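The difference between save_token, update_token and drop_records_from_token is easiest to see on a toy store. The class below is a hypothetical in-memory stand-in written for this illustration; it is not a polymr backend, and keeping the id lists sorted is an assumption of the sketch, since the interface above does not specify an order.

```python
class DictTokenStore:
    """Toy in-memory stand-in for the token methods of a backend."""

    def __init__(self):
        self._tokens = {}  # {token bytes: list of record ids}

    def get_token(self, name):
        return self._tokens[name]

    def save_token(self, name, record_ids):
        # Overwrites whatever list was stored before.
        self._tokens[name] = list(record_ids)

    def update_token(self, name, record_ids):
        # Set union of the stored ids and the new ids.
        merged = set(self._tokens.get(name, [])) | set(record_ids)
        self._tokens[name] = sorted(merged)

    def drop_records_from_token(self, name, bad_record_ids):
        bad = set(bad_record_ids)
        self._tokens[name] = [i for i in self._tokens[name] if i not in bad]


store = DictTokenStore()
store.save_token(b"smi", [1, 2, 3])
store.update_token(b"smi", [3, 4])             # union, no duplicate 3
after_update = store.get_token(b"smi")
store.drop_records_from_token(b"smi", [1, 4])  # remove two ids
after_drop = store.get_token(b"smi")
store.save_token(b"smi", [9])                  # replaces the list outright
after_save = store.get_token(b"smi")
```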
polymr.storage.copy(backend_from, backend_to, droptop=None, skip_copy_records=False, skip_copy_featurizer=False, skip_copy_freqs=False, skip_copy_tokens=False, threads=None)

Copy the contents of one backend object into another backend object. This function should be backend-agnostic; use it to convert a LevelDBBackend into a RedisBackend.

This function may take a while to execute depending mainly on the size of the index to be copied. Use the logging module to monitor the copy progress:

>>> import logging
>>> logging.basicConfig(
...    format="%(asctime)s        %(name)s        %(levelname)s:  %(message)s",
...    level=logging.DEBUG
...)

If the copy halts partway (I’m looking at you, AWS Dynamo), use the various skip keywords to skip to where the copy left off.

Parameters:
  • backend_from (Subclass of polymr.storage.AbstractBackend) – The backend from which to retrieve data.
  • backend_to (Subclass of polymr.storage.AbstractBackend) – The backend to which to send data.
  • droptop (float or None) – Do not copy the most common features. droptop is a ratio between zero and one giving the fraction of the total number of features to skip, starting from the most common. For example, to drop the 15% most common features, set droptop=.15
  • skip_copy_records (bool) – If true, do not copy records into backend_to.
  • skip_copy_featurizer (bool) – If true, do not copy the featurizer name into backend_to.
  • skip_copy_freqs (bool) – If true, do not copy the token frequencies into backend_to.
  • skip_copy_tokens (bool) – If true, do not copy feature-to-record mappings into backend_to.
  • threads (int or None) – Use multiple threads to copy data. Defaults to single-threaded.
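To make the droptop ratio concrete, the sketch below works out which tokens a given ratio would leave out of a copy. The helper tokens_to_skip is hypothetical and written only for this illustration; in particular, rounding the cutoff down with int() is an assumption, and copy() itself may round differently.

```python
def tokens_to_skip(freqs, droptop=None):
    """Return the set of most common tokens that a ratio of droptop covers."""
    if droptop is None:
        return set()
    # Number of features to skip: a fraction of all features, rounded down.
    n_skip = int(len(freqs) * droptop)
    most_common_first = sorted(freqs, key=freqs.get, reverse=True)
    return set(most_common_first[:n_skip])


freqs = {b"the": 90, b"joh": 40, b"ith": 7, b"smi": 5}
skipped_quarter = tokens_to_skip(freqs, droptop=0.25)  # 1 of 4 features
skipped_half = tokens_to_skip(freqs, droptop=0.5)      # 2 of 4 features
```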
polymr.storage.dumps(obj)

Serialize a python object into a bytestring. This method abstracts away the particular serializer used under the hood.

polymr.storage.loads(bs)

Deserialize a bytestring back into a python object. This method abstracts away the particular deserializer used under the hood.
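The contract of the pair is a round trip: loads(dumps(obj)) gives back an equal object, and callers never need to know which serializer is in use. The sketch below demonstrates that contract with the standard library pickle module as a stand-in; this is an assumption made for illustration, and polymr may well use a different serializer.

```python
import pickle


def dumps(obj):
    # Stand-in serializer: any object -> bytes.
    return pickle.dumps(obj)


def loads(bs):
    # Stand-in deserializer: bytes -> object.
    return pickle.loads(bs)


record_ids = [1, 2, 3]
blob = dumps(record_ids)
restored = loads(blob)
```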

polymr.storage.parse_url(u, **kwargs)

Instantiate a backend by way of a URL. The type of backend is inferred from the scheme component of the URL. Extra keyword arguments are passed directly to the backend’s constructor.

>>> be = polymr.storage.parse_url('leveldb://localhost/tmp/my_db.polymr')
>>> be.path
'/tmp/my_db.polymr'
>>> be = polymr.storage.parse_url('leveldb://localhost/tmp/my_db.polymr',
...                               featurizer_name='k4')
>>> be.featurizer_name
'k4'
Parameters:u (str) – The URL to parse
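The scheme-based dispatch described above can be sketched roughly as follows. Everything here except urllib.parse.urlparse is hypothetical: ToyBackend, the BACKENDS registry and the 'toy' scheme exist only for this illustration, and forwarding the keyword arguments through from_urlparsed is one possible design, not necessarily how polymr does it.

```python
from urllib.parse import urlparse


class ToyBackend:
    """Hypothetical backend; stands in for a real AbstractBackend subclass."""

    def __init__(self, path, featurizer_name=None):
        self.path = path
        self.featurizer_name = featurizer_name

    @classmethod
    def from_urlparsed(cls, parsed, **kwargs):
        # Build the backend from the URL path; extra keywords go to __init__.
        return cls(parsed.path, **kwargs)


# Hypothetical registry mapping URL schemes to backend classes.
BACKENDS = {"toy": ToyBackend}


def parse_url(u, **kwargs):
    parsed = urlparse(u)
    try:
        backend_cls = BACKENDS[parsed.scheme]
    except KeyError:
        raise ValueError("no backend registered for scheme %r" % parsed.scheme)
    return backend_cls.from_urlparsed(parsed, **kwargs)


be = parse_url("toy://localhost/tmp/my_db.polymr", featurizer_name="k4")
```

A registry keyed by scheme keeps parse_url itself backend-agnostic: supporting a new storage engine only means adding one entry.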