index
This module contains the functions that make up batch indexing. Batch indexing is generally the fastest way to get data into a storage backend, but it monopolizes the storage backend while the data is loading.
-
polymr.index.cat()
chain.from_iterable(iterable) --> chain object
Alternate chain() constructor taking a single iterable argument that evaluates lazily.
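The docstring above matches that of the standard library's itertools.chain.from_iterable, so the sketch below assumes cat behaves the same way and uses the standard-library function directly. The names batches and produced are hypothetical and exist only to show that batches are pulled one at a time rather than all up front.

```python
from itertools import chain

# Assumption: polymr.index.cat behaves like itertools.chain.from_iterable.
cat = chain.from_iterable

produced = []  # records which batches have actually been generated


def batches():
    """Yield three small batches, noting when each one is produced."""
    for start in (0, 3, 6):
        produced.append(start)
        yield range(start, start + 3)


flat = cat(batches())          # nothing is generated yet
first = next(flat)             # pulls only the first batch
produced_after_first = list(produced)
rest = list(flat)              # drains the remaining items and batches
```

Because evaluation is lazy, a large stream of record batches can be flattened without holding every batch in memory at once.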
-
polymr.index.create(input_records, nproc, chunksize, backend, tmpdir='/tmp', featurizer_name='default')
Create a search index in a storage backend. This function does everything necessary to turn a collection of records into a populated storage backend, which can then be used by polymr.query.Index.
Parameters:
- input_records (Iterable of polymr.record.Record) – The records to index.
- nproc (int) – The number of subprocesses to spawn. It is probably best not to exceed the number of CPUs on the system.
- chunksize (int) – The number of records to process in memory at once. Use this as a rudimentary way to control memory usage. A larger chunksize is faster and uses less CPU, but uses more memory.
- backend (Subclass of polymr.storage.AbstractBackend) – The storage backend to populate.
- tmpdir (str) – Where to store temporary files. Be sure to have enough space in that directory to store all the input_records.
- featurizer_name (str) – Which featurizer to use; cf. polymr.featurizers.
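To make the nproc and chunksize trade-offs concrete, here is a minimal sketch of how the arguments might be chosen. It is not polymr's implementation. The helpers chunked, default_nproc, and index_kwargs are hypothetical names introduced for illustration, and the chunk size of 10,000 is an arbitrary placeholder.

```python
import os
from itertools import islice


def chunked(records, chunksize):
    """Yield lists of at most `chunksize` items from any iterable.

    Illustrates what "records processed in memory at once" means;
    polymr's own chunking may differ.
    """
    it = iter(records)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


def default_nproc():
    """Return the CPU count, falling back to 1 if it cannot be determined."""
    return os.cpu_count() or 1


def index_kwargs(chunksize=10_000, tmpdir="/tmp", featurizer_name="default"):
    """Collect keyword arguments matching the create() signature above."""
    return {
        "nproc": default_nproc(),
        "chunksize": chunksize,
        "tmpdir": tmpdir,
        "featurizer_name": featurizer_name,
    }


# Hypothetical usage, assuming `records` and `backend` already exist:
# polymr.index.create(records, backend=backend, **index_kwargs())
```

Lowering chunksize in index_kwargs reduces peak memory at the cost of speed, as described in the parameter list.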
-
polymr.index.records(input_records, backend)
Save records into a storage backend.