index

This module contains the functions that make up batch indexing. Batch indexing is generally the fastest way to load data into a storage backend, but it monopolizes the backend for the duration of the load.


polymr.index.cat()

chain.from_iterable(iterable) --> chain object

Alternate chain() constructor taking a single iterable argument that evaluates lazily.
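
The docstring above is the one from itertools.chain.from_iterable, so cat presumably behaves the same way. Assuming that, the following minimal sketch uses the standard-library function directly to show the lazy flattening; lazy_batches is a hypothetical generator introduced only for illustration.

```python
from itertools import chain


def lazy_batches():
    # Hypothetical source of batches; nothing is produced until
    # something iterates over the generator.
    yield [1, 2]
    yield [3]
    yield [4, 5]


# chain.from_iterable takes ONE iterable of iterables and flattens it,
# pulling each inner batch only when the previous one is exhausted.
flattened = chain.from_iterable(lazy_batches())
result = list(flattened)
```

Because the outer iterable is consumed lazily, this form also works when the batches come from a generator that would be too large to materialize up front.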

polymr.index.create(input_records, nproc, chunksize, backend, tmpdir='/tmp', featurizer_name='default')

Create a search index in a storage backend. This function does everything necessary to turn a collection of records into a populated storage backend, which can then be used by polymr.query.Index.

Parameters:
  • input_records (Iterable of polymr.record.Record) – The records to index.
  • nproc (int) – The number of subprocesses to spawn. It is probably best not to exceed the number of CPUs on the system.
  • chunksize (int) – The number of records to process in memory at once. Use this as a rudimentary way to control memory usage. A larger chunksize is faster and uses less CPU, but uses more memory.
  • backend (Subclass of polymr.storage.AbstractBackend) – The storage backend to populate.
  • tmpdir (str) – Where to store temporary files. Be sure to have enough space in that directory to store all the input_records.
  • featurizer_name (str) – Which featurizer to use; see polymr.featurizers.
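
To make the memory trade-off behind chunksize concrete, here is a minimal sketch of consuming an iterable in fixed-size chunks using only the standard library. It is not taken from polymr's source and may differ from how create actually batches records; iter_chunks is a hypothetical helper named for this example.

```python
from itertools import islice


def iter_chunks(records, chunksize):
    """Yield lists of at most `chunksize` items from any iterable.

    Hypothetical helper for illustration; not part of polymr.
    """
    iterator = iter(records)
    while True:
        # Only one chunk is held in memory at a time.
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# Seven items in chunks of three: two full chunks and one partial chunk.
chunks = list(iter_chunks(range(7), 3))
```

With this pattern, peak memory grows with chunksize rather than with the total number of records, which is why a smaller value lowers memory use at the cost of more passes.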

polymr.index.records(input_records, backend)

Save records into a storage backend.