storage
This module contains the abstract backend class as well as the built-in local backend, LevelDBBackend.
-
class polymr.storage.AbstractBackend
A backend is a data access object that abstracts away the details of getting data from places like LevelDB and Redis.
-
close()
Close the backend. Clean up any temporary files. Close any connections.
-
delete_record(idx)
Drop a record from the backend.
Parameters: idx (int) – The index of the record to drop
-
drop_records_from_token(name, bad_record_ids)
Remove record ids from the list of ids associated with a token.
Parameters: - name (bytes) – The token
- bad_record_ids (list of int) – The record ids to remove
-
find_least_frequent_tokens(toks, r, k=None)
Filter a list of tokens by token frequency. More frequent (that is, more common) tokens are dropped before less frequent tokens. This method should return whichever result is smaller under the r and k parameters: for example, if r=100 would return 3 tokens but k=2, return the two least frequent tokens.
Parameters: - toks (list of bytes) – The list of tokens to filter
- r (int) – The maximum number of record ids taken before dropping tokens.
- k (int) – The maximum number of tokens to keep.
Returns: A possibly smaller list of tokens
Return type: list of bytes
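The interaction between r and k can be illustrated with a minimal sketch. It is not the library's implementation: the standalone function, its explicit freqs argument, the choice to treat unknown tokens as frequency 0, and the stop-when-the-running-total-exceeds-r rule are all assumptions made for illustration.

```python
def find_least_frequent_tokens(freqs, toks, r, k=None):
    """Keep the rarest tokens while their combined record count stays within r,
    and keep at most k of them when k is given (illustrative sketch only)."""
    # Rank rarest first; tokens missing from freqs count as 0 (assumption).
    ranked = sorted(toks, key=lambda t: freqs.get(t, 0))
    kept, total = [], 0
    for tok in ranked:
        if k is not None and len(kept) >= k:
            break  # the k limit is the tighter bound
        total += freqs.get(tok, 0)
        if total > r:
            break  # adding this token would pull in more than r record ids
        kept.append(tok)
    return kept


# Hypothetical frequency dict: token -> number of records containing it.
freqs = {b"abc": 10, b"bcd": 40, b"cde": 45, b"def": 500}
toks = [b"def", b"cde", b"abc", b"bcd"]

by_r = find_least_frequent_tokens(freqs, toks, r=100)
by_k = find_least_frequent_tokens(freqs, toks, r=100, k=2)
```

Here r=100 alone keeps three tokens, while adding k=2 trims the result to the two least frequent, which mirrors the example in the description.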
-
classmethod from_urlparsed(parsed)
Receives the parsed URL from polymr.storage.parse_url() and returns the backend object.
Parameters: parsed (6-tuple of str) – The standard library urllib.parse.urlparse 6-tuple
-
get_freqs()
Get the feature frequency dict.
Returns: dict consisting of tokens and the number of records containing that token
Return type: dict {bytes: int}
-
get_record(idx)
Get a record by record id.
Parameters: idx (int) – The id of the record to retrieve
Return type: polymr.record.Record
-
get_records(idxs)
Get records by record id.
Parameters: idxs (list of int) – The ids of the records to retrieve
Return type: Iterator of polymr.record.Record
-
get_rowcount()
Get the number of records indexed.
Return type: int
-
get_token(name)
Get the list of records containing the named token.
Parameters: name (bytes) – The token to get
Returns: The list of records containing that token
Return type: list
-
increment_rowcount(n)
Increase the rowcount.
Parameters: n (int) – The number by which to increase the rowcount
-
save_freqs(d)
Save the feature frequency dict.
Parameters: d (dict {bytes: int}) – The dict consisting of tokens and the number of records containing that token
-
save_record(rec)
Save a record.
Parameters: rec (polymr.record.Record) – The record to save
Returns: The ID of the newly created record
Return type: int
-
save_records(idx_recs)
Save records.
Parameters: idx_recs (iterable of (int, record) pairs) – The (record id, record) pairs to save
Returns: The number of rows saved
Return type: int
-
save_rowcount(cnt)
Save the number of records indexed.
Parameters: cnt (int) – The row count to save
-
save_token(name, record_ids)
Save the list of records containing a named token. Overwrites any existing record id list for that token.
Parameters: - name (bytes) – The token
- record_ids (iterable of int) – The list of record ids containing the token
-
save_tokens(names_ids)
Save many tokens in bulk. See polymr.storage.AbstractBackend.save_token().
Parameters: names_ids (iterable of (bytes, list-of-int) tuples) – An iterable of two-part tuples: the token, and the ids corresponding to the token
-
update_freqs(toks_cnts)
Update the feature frequency dict.
-
update_record(rec)
Update a record. Some backends simply alias update_record to save_record.
Parameters: rec (polymr.record.Record) – The record to save
Returns: The ID of the newly created record
Return type: int
-
update_token(name, record_ids)
Update the list of record ids corresponding to a token. The new list of record ids corresponding to this token will be a set union of record_ids and the record ids currently in the backend.
Parameters: - name (bytes) – The token
- record_ids (list of int) – The list of record ids containing the token
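The difference between save_token (overwrite), update_token (set union), and drop_records_from_token (removal) may be easier to see in a toy example. The class below is a hypothetical dict-backed stand-in, not a polymr backend; keeping the id lists sorted is an assumption made only so the results are predictable.

```python
class InMemoryTokens:
    """Hypothetical in-memory stand-in for the token methods of a backend."""

    def __init__(self):
        self._tokens = {}  # token (bytes) -> list of record ids

    def save_token(self, name, record_ids):
        # Overwrite whatever id list was stored for this token.
        self._tokens[name] = sorted(set(record_ids))

    def get_token(self, name):
        return self._tokens[name]

    def update_token(self, name, record_ids):
        # Set union of the stored ids and the new ids.
        current = self._tokens.get(name, [])
        self._tokens[name] = sorted(set(current) | set(record_ids))

    def drop_records_from_token(self, name, bad_record_ids):
        # Remove the listed ids, leaving the rest untouched.
        bad = set(bad_record_ids)
        self._tokens[name] = [i for i in self._tokens[name] if i not in bad]


store = InMemoryTokens()
store.save_token(b"abc", [1, 2, 3])
store.update_token(b"abc", [3, 4])
after_update = list(store.get_token(b"abc"))

store.drop_records_from_token(b"abc", [2])
after_drop = list(store.get_token(b"abc"))

store.save_token(b"abc", [9])
after_save = list(store.get_token(b"abc"))
```

After the update the token maps to the union of both id lists; after the final save only the newly saved ids remain.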
-
-
polymr.storage.copy(backend_from, backend_to, droptop=None, skip_copy_records=False, skip_copy_featurizer=False, skip_copy_freqs=False, skip_copy_tokens=False, threads=None)
Copy the contents of one backend object into another backend object. This function should be backend-agnostic; use it, for example, to convert a LevelDBBackend into a RedisBackend.
This function may take a while to execute, depending mainly on the size of the index to be copied. Use the logging module to monitor the copy progress:
>>> import logging
>>> logging.basicConfig(
...     format="%(asctime)s %(name)s %(levelname)s: %(message)s",
...     level=logging.DEBUG
... )
If the copy halts partway (I’m looking at you, AWS Dynamo), use the various skip keywords to skip to where the copy left off.
Parameters: - backend_from (Subclass of polymr.storage.AbstractBackend) – The backend from which to retrieve data.
- backend_to – The backend to which to send data.
- droptop (float or None) – Do not copy the most common features. droptop is a ratio between zero and one giving the share of the total number of features to drop. For example, to drop the 15% most common features, set droptop=.15
- skip_copy_records (bool) – Don’t copy records into backend_to?
- skip_copy_featurizer – Don’t copy the featurizer name into backend_to?
- skip_copy_freqs – Don’t copy the token frequencies into backend_to?
- skip_copy_tokens – Don’t copy feature-to-record mappings into backend_to?
- threads (int or None) – Use multiple threads to copy data. Defaults to single-threaded.
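To make the droptop ratio concrete, here is a rough sketch of how a fraction of the most common features could be excluded from a frequency dict. The helper name drop_most_common and the truncating int() rounding are assumptions for illustration; the source does not say how copy rounds the count or selects the features internally.

```python
def drop_most_common(freqs, droptop=None):
    """Return a copy of freqs without the most common share of tokens.

    Hypothetical helper, not part of polymr.storage.
    """
    if droptop is None:
        return dict(freqs)
    if not 0 <= droptop <= 1:
        raise ValueError("droptop must be a ratio between zero and one")
    # Number of features to drop; truncation is an assumed rounding rule.
    n_drop = int(len(freqs) * droptop)
    # Most common tokens first.
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    dropped = set(ranked[:n_drop])
    return {tok: cnt for tok, cnt in freqs.items() if tok not in dropped}


# Hypothetical frequencies: two very common tokens and two rare ones.
freqs = {b"the": 900, b"and": 700, b"zyx": 3, b"qrs": 1}
kept = drop_most_common(freqs, droptop=0.5)
```

With four features and droptop=0.5, the two most common tokens are left out and the two rare ones are kept.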
-
polymr.storage.dumps(obj)
Serialize a Python object into a bytestring. This function abstracts away the particular serializer used under the hood.
-
polymr.storage.loads(bs)
Deserialize a bytestring back into a Python object. This function abstracts away the particular deserializer used under the hood.
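The round trip that dumps and loads provide can be sketched with a stand-in serializer. JSON is used below purely as an assumption for illustration; the source does not name the serializer polymr uses, and callers are only meant to rely on "object in, bytes out" and back.

```python
import json


def dumps(obj):
    """Stand-in serializer: Python object -> bytestring (JSON is an assumption)."""
    return json.dumps(obj).encode("utf-8")


def loads(bs):
    """Stand-in deserializer: bytestring -> Python object."""
    return json.loads(bs.decode("utf-8"))


record_ids = [1, 5, 42]
blob = dumps(record_ids)   # opaque bytes, suitable as a value in a key-value store
restored = loads(blob)     # equal to the original object
```

Because callers only see bytes, the underlying serializer could be swapped without touching code that stores or retrieves values.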
-
polymr.storage.parse_url(u, **kwargs)
Instantiate a backend by way of a URL. The type of backend is inferred from the scheme component of the URL. Extra keyword arguments are passed directly to the backend’s constructor.
>>> be = polymr.storage.parse_url('leveldb://localhost/tmp/my_db.polymr')
>>> be.path
'/tmp/my_db.polymr'
>>> be = polymr.storage.parse_url('leveldb://localhost/tmp/my_db.polymr',
...                               featurizer_name='k4')
>>> be.featurizer_name
'k4'
Parameters: u (str) – The URL to parse
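Scheme-based dispatch of this kind can be sketched with the standard library alone. Everything below is hypothetical: FakeLevelDBBackend, the BACKENDS registry, and the way keyword arguments are forwarded through from_urlparsed are assumptions for illustration, not polymr's actual code.

```python
from urllib.parse import urlparse


class FakeLevelDBBackend:
    """Hypothetical stand-in for a backend class; it stores nothing."""

    def __init__(self, path, featurizer_name=None):
        self.path = path
        self.featurizer_name = featurizer_name

    @classmethod
    def from_urlparsed(cls, parsed, **kwargs):
        # Build the backend from the parsed URL; forwarding kwargs here
        # is an assumption about how they reach the constructor.
        return cls(parsed.path, **kwargs)


# Assumed registry mapping URL schemes to backend classes.
BACKENDS = {"leveldb": FakeLevelDBBackend}


def parse_url(u, **kwargs):
    parsed = urlparse(u)
    try:
        backend_cls = BACKENDS[parsed.scheme]
    except KeyError:
        raise ValueError("Unrecognized scheme: %r" % parsed.scheme) from None
    return backend_cls.from_urlparsed(parsed, **kwargs)


be = parse_url("leveldb://localhost/tmp/my_db.polymr", featurizer_name="k4")
```

Supporting another store would then amount to adding a scheme entry to the registry and a class with its own from_urlparsed.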