index

This module contains the functions that make up batch indexing. Batch indexing is generally the fastest way to get data into a storage backend, but it ties up the storage backend for as long as the data is loading.

polymr.index.cat()

chain.from_iterable(iterable) --> chain object

Alternate chain() constructor taking a single iterable argument that evaluates lazily.
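The docstring above is the one for itertools.chain.from_iterable, which suggests that cat is simply that constructor under another name. The sketch below uses the standard-library function directly rather than importing polymr, so it only illustrates the flattening and lazy-evaluation behaviour; the variable and function names are made up for the example.

```python
from itertools import chain

# Flatten an iterable of iterables into a single stream of items.
batches = [["a", "b"], ["c"], []]
flat = list(chain.from_iterable(batches))


def numbered_batches():
    """Hypothetical generator that yields one small batch at a time."""
    for start in (0, 3, 6):
        yield range(start, start + 3)


# The outer iterable is consumed lazily: later batches are only
# requested from the generator once earlier ones are exhausted.
lazy = chain.from_iterable(numbered_batches())
first_four = [next(lazy) for _ in range(4)]
```

Because nothing is materialized up front, the same pattern works when the outer iterable is too large to hold in memory.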

polymr.index.create(input_records, nproc, chunksize, backend, tmpdir='/tmp', featurizer_name='default')

Create a search index in a storage backend. This function does everything necessary to turn a collection of records into a populated storage backend, which can then be used by polymr.query.Index.

Parameters:

input_records (Iterable of polymr.record.Record) – The records to index.
nproc (int) – The number of subprocesses to spawn. It is usually best not to exceed the number of CPUs on the system.
chunksize (int) – The number of records to process in memory at once. Use this as a rudimentary way to control memory usage. A larger chunksize is faster and uses less CPU, but uses more memory.
backend (Subclass of polymr.storage.AbstractBackend) – The storage backend to populate.
tmpdir (str) – Where to store temporary files. Make sure that directory has enough free space to hold all of the input_records.
featurizer_name (str) – Which featurizer to use; see polymr.featurizers.
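As a rough illustration of the parameters above, the following sketch caps nproc at the CPU count reported by Python and then calls create with the documented signature. The helper names suggested_nproc and build_index, and the chunksize default of 10000, are assumptions made for this example and are not part of polymr; records and backend are expected to be supplied by the caller.

```python
import os


def suggested_nproc(requested=None):
    """Cap a requested worker count at the number of CPUs Python reports."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    if requested is None:
        return available
    return max(1, min(requested, available))


def build_index(records, backend, chunksize=10000, tmpdir="/tmp"):
    """Populate `backend` from `records` via the documented create() call.

    The chunksize default is an arbitrary placeholder; tune it to the
    memory available on the machine.
    """
    # Imported here so suggested_nproc() stays usable without polymr installed.
    import polymr.index

    polymr.index.create(
        records,
        suggested_nproc(),
        chunksize,
        backend,
        tmpdir=tmpdir,
        featurizer_name="default",
    )
```

Lowering chunksize trades speed for a smaller memory footprint, so it is the first value to adjust if the indexing run exhausts memory.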

polymr.index.records(input_records, backend)

Save records into a storage backend.
