Faster DataFrame Serialization

Read and write DataFrames up to ten times faster than Parquet with StaticFrame NPZ


The Apache Parquet format provides an efficient binary representation of columnar table data, as seen with widespread use in Apache Hadoop and Spark, AWS Athena and Glue, and Pandas DataFrame serialization. While Parquet offers broad interoperability with performance superior to text formats (such as CSV or JSON), it is as much as ten times slower than NPZ, an alternative DataFrame serialization format introduced in StaticFrame.

StaticFrame (an open-source DataFrame library of which I am an author) builds upon NumPy NPY and NPZ formats to encode DataFrames. The NPY format (a binary encoding of array data) and the NPZ format (zipped bundles of NPY files) are defined in a NumPy Enhancement Proposal from 2007. By extending the NPZ format with specialized JSON metadata, StaticFrame provides a complete DataFrame serialization format that supports all NumPy dtypes.
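For example, a round trip through NPZ preserves index and column labels, name attributes, and dtypes. The following is a minimal sketch of this interface; the labels, values, and file path are arbitrary:

>>> import numpy as np
>>> import static_frame as sf
>>> # a small Frame with a named index and an unsigned integer column
>>> f1 = sf.Frame.from_fields(
...         (np.array([1, 2, 3], dtype=np.uint8), ('a', 'b', 'c')),
...         columns=('num', 'char'),
...         index=sf.Index(('x', 'y', 'z'), name='rows'),
...         name='example')
>>> f1.to_npz('/tmp/example.npz')
>>> f2 = sf.Frame.from_npz('/tmp/example.npz')
>>> f1.equals(f2, compare_name=True, compare_dtype=True, compare_class=True)
True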

This article extends work first presented at PyCon USA 2022 with further performance optimizations and broader benchmarking.

The Challenge of Serializing DataFrames

DataFrames are not just collections of columnar data with string column labels, such as found in relational databases. In addition to columnar data, DataFrames have labeled rows and columns, and those row and column labels can be of any type or (with hierarchical labels) many types. Further, it is common to store metadata with a name attribute, either on the DataFrame or on the axis labels.

As Parquet was originally designed just to store collections of columnar data, the full range of DataFrame characteristics is not directly supported. Pandas supplies this additional information by adding JSON metadata into the Parquet file.

Further, Parquet supports a minimal selection of types; the full range of NumPy dtypes is not directly supported. For example, Parquet does not natively support unsigned integers or any date types.
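A quick way to see how a given dtype fares is to round-trip it through Parquet and compare dtypes before and after. This sketch assumes pandas with a Parquet engine such as pyarrow installed; outputs are omitted here, as they vary by pandas and Arrow version:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': np.array([1, 2, 3], dtype=np.uint8),
...                    'b': np.array([1, 2, 3], dtype=np.uint64)})
>>> df.to_parquet('/tmp/check.parquet')
>>> df.dtypes  # dtypes as created
>>> pd.read_parquet('/tmp/check.parquet').dtypes  # dtypes after the round trip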

While Python pickles are capable of efficiently serializing DataFrames and NumPy arrays, they are only suitable for short-term caches from trusted sources. While pickles are fast, they can become invalid due to code changes and are insecure to load from untrusted sources.

Another alternative to Parquet, originating in the Arrow project, is Feather. While Feather supports all Arrow types and succeeds in being faster than Parquet, it is still at least two times slower reading DataFrames than NPZ.

Parquet and Feather support compression to reduce file size. Parquet defaults to using "snappy" compression, while Feather defaults to "lz4". As the NPZ format prioritizes performance, it does not yet support compression. As will be shown below, NPZ outperforms both compressed and uncompressed Parquet files by significant factors.

DataFrame Serialization Performance Comparisons

Numerous publications offer DataFrame benchmarks by testing just one or two datasets. McKinney and Richardson (2020) is an example, where two datasets, Fannie Mae Loan Performance and NYC Yellow Taxi Trip data, are used to generalize about performance. Such idiosyncratic datasets are insufficient, as both the shape of the DataFrame and the degree of columnar type heterogeneity can significantly differentiate performance.

To avoid this deficiency, I compare performance with a panel of nine synthetic datasets. These datasets vary along two dimensions: shape (tall, square, and wide) and columnar heterogeneity (columnar, mixed, and uniform). Shape variations alter the distribution of elements between tall (e.g., 10,000 rows and 100 columns), square (e.g., 1,000 rows and columns), and wide (e.g., 100 rows and 10,000 columns) geometries. Columnar heterogeneity variations alter the diversity of types between columnar (no adjacent columns have the same type), mixed (some adjacent columns have the same type), and uniform (all columns have the same type).

The frame-fixtures library defines a domain-specific language to create deterministic, randomly-generated DataFrames for testing; the nine datasets are generated with this tool.
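As a brief illustration of that DSL, the following sketch (assuming the frame-fixtures parse entry point, and using a much smaller shape than those benchmarked) builds a square Frame with columnar type variety:

>>> import frame_fixtures as ff
>>> f = ff.parse('s(4,4)|v(int,bool,float,str)')  # shape spec, then value type spec
>>> f.shape
(4, 4)
>>> f.dtypes  # adjacent columns cycle through the listed types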

To demonstrate some of the StaticFrame and Pandas interfaces evaluated, the following IPython session performs basic performance tests using %time. As shown below, a square, uniformly-typed DataFrame can be written and read with NPZ many times faster than uncompressed Parquet.

>>> import numpy as np
>>> import static_frame as sf
>>> import pandas as pd

>>> # a square, uniform float array
>>> array = np.random.random_sample((10_000, 10_000))

>>> # write performance
>>> f1 = sf.Frame(array)
>>> %time f1.to_npz('/tmp/frame.npz')
CPU times: user 710 ms, sys: 396 ms, total: 1.11 s
Wall time: 1.11 s

>>> df1 = pd.DataFrame(array)
>>> %time df1.to_parquet('/tmp/df.parquet', compression=None)
CPU times: user 6.82 s, sys: 900 ms, total: 7.72 s
Wall time: 7.74 s

>>> # read performance
>>> %time f2 = sf.Frame.from_npz('/tmp/frame.npz')
CPU times: user 2.77 ms, sys: 163 ms, total: 166 ms
Wall time: 165 ms

>>> %time df2 = pd.read_parquet('/tmp/df.parquet')
CPU times: user 2.55 s, sys: 1.2 s, total: 3.75 s
Wall time: 866 ms

Performance tests presented below extend this basic approach by using frame-fixtures for systematic variation of shape and type heterogeneity, and average results over ten iterations. While hardware configuration will affect performance, relative characteristics are retained across diverse machines and operating systems. For all interfaces the default parameters are used, except for disabling compression as needed. The code used to perform these tests is available at GitHub.
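The measurement approach reduces to a simple pattern: build a fixture, time repeated writes and reads, and average. The following is a simplified sketch of that pattern, not the benchmark code itself; names and paths are illustrative:

import timeit

import frame_fixtures as ff
import pandas as pd
import static_frame as sf

def mean_seconds(func, iterations=10):
    # average wall-clock seconds over repeated runs
    return timeit.timeit(func, number=iterations) / iterations

f = ff.parse('s(1000,1000)|v(float)')  # a square, uniform fixture
df = f.to_pandas()

print('npz write:', mean_seconds(lambda: f.to_npz('/tmp/bench.npz')))
print('parquet write:', mean_seconds(lambda: df.to_parquet('/tmp/bench.parquet', compression=None)))
print('npz read:', mean_seconds(lambda: sf.Frame.from_npz('/tmp/bench.npz')))
print('parquet read:', mean_seconds(lambda: pd.read_parquet('/tmp/bench.parquet')))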

Read Performance

As data is generally read more often than it is written, read performance is a priority. As shown for all nine DataFrames of one million (1e+06) elements, NPZ significantly outperforms Parquet and Feather with every fixture. NPZ read performance is over ten times faster than compressed Parquet. For example, with the Uniform Tall fixture, compressed Parquet reading takes 21 ms compared to 1.5 ms with NPZ.

The chart below shows processing time, where lower bars correspond to faster performance.

[Chart: read performance, nine fixtures of 1e+06 elements]

This impressive NPZ performance is retained with scale. Moving to 100 million (1e+08) elements, NPZ continues to perform at least twice as fast as Parquet and Feather, regardless of whether compression is used.

[Chart: read performance, nine fixtures of 1e+08 elements]

Write Performance

In writing DataFrames to disk, NPZ outperforms Parquet (both compressed and uncompressed) in all scenarios. For example, with the Uniform Square fixture, compressed Parquet writing takes 200 ms compared to 18.3 ms with NPZ. NPZ write performance is generally comparable to uncompressed Feather: in some scenarios NPZ is faster; in others, Feather is faster.

As with read performance, NPZ write performance is retained with scale. Moving to 100 million (1e+08) elements, NPZ continues to be at least twice as fast as Parquet, regardless of whether compression is used.

[Chart: write performance comparisons]

Idiosyncratic Performance

As an additional reference, we will also benchmark the same NYC Yellow Taxi Trip data (from January 2010) used in McKinney and Richardson (2020). This dataset contains almost 300 million (3e+08) elements in a tall, heterogeneously typed DataFrame of 14,863,778 rows and 19 columns.

NPZ read performance is shown to be around four times faster than Parquet and Feather (with or without compression). While NPZ write performance is faster than Parquet, Feather writing here is fastest.

[Chart: NYC Yellow Taxi Trip data, read and write performance]

File Size

As shown below for one million (1e+06) element and 100 million (1e+08) element DataFrames, uncompressed NPZ is generally equal in size on disk to uncompressed Feather and always smaller than uncompressed Parquet (sometimes smaller than compressed Parquet too). As compression provides only modest file-size reductions for Parquet and Feather, the benefit of uncompressed NPZ in speed might easily outweigh the cost of greater size.

[Charts: file sizes for 1e+06 and 1e+08 element fixtures]
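File sizes are simple to verify directly. Reusing the files written in the IPython session above (exact sizes will vary with content):

>>> import os
>>> os.path.getsize('/tmp/frame.npz'), os.path.getsize('/tmp/df.parquet')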

Serializing DataFrames

StaticFrame stores data as a collection of 1D and 2D NumPy arrays. Arrays represent columnar values, as well as variable-depth index and column labels. In addition to NumPy arrays, information about component types (i.e., the Python class used for the index and columns), as well as the component name attributes, is needed to fully reconstruct a Frame. Completely serializing a DataFrame requires writing and reading these components to a file.

DataFrame components can be represented by the following diagram, which isolates arrays, array types, component types, and component names. This diagram will be used to demonstrate how an NPZ encodes a DataFrame.

[Diagram: DataFrame components: arrays, array types, component types, and component names]

The components of that diagram map to components of a Frame string representation in Python. For example, given a Frame of integers and Booleans with hierarchical labels on both the index and columns (downloadable via GitHub with StaticFrame's WWW interface), StaticFrame provides the following string representation:

>>> frame = sf.Frame.from_npz(sf.WWW.from_file('https://github.com/static-frame/static-frame/raw/master/doc/source/articles/serialize/frame.npz', encoding=None))
>>> frame
<Frame: p>
<IndexHierarchy: q>       data    data    data    valid  <<U5>
                          A       B       C       *      <<U1>
<IndexHierarchy: r>
2012-03         x         5       4       7       False
2012-03         y         9       1       8       True
2012-04         x         3       6       2       True
<datetime64[M]> <<U1>     <int64> <int64> <int64> <bool>

The components of the string representation can be mapped to the DataFrame diagram by color:

[Diagram: string representation components mapped to the DataFrame diagram by color]

Encoding an Array in NPY

An NPY stores a NumPy array as a binary file with six components: (1) a "magic" prefix, (2) a version number, (3) a header length and (4) a header (where the header is a string representation of a Python dictionary), and (5) padding followed by (6) raw array byte data. These components are shown below for a three-element binary array stored in a file named "__blocks_1__.npy".

Given an NPZ file named "frame.npz", we can extract the binary data by reading the NPY file from the NPZ with the standard library's ZipFile:

>>> from zipfile import ZipFile
>>> frame.to_npz('/tmp/frame.npz')  # save the Frame locally so its contents can be inspected
>>> with ZipFile('/tmp/frame.npz') as zf: print(zf.open('__blocks_1__.npy').read())
b'\x93NUMPY\x01\x006\x00{"descr":"|b1","fortran_order":True,"shape":(3,)}    \n\x00\x01\x01'
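Those six components can be unpacked by hand. The following sketch follows the NPY version 1.0 layout, where the header length is a little-endian, unsigned 16-bit integer:

>>> import struct
>>> with ZipFile('/tmp/frame.npz') as zf: data = zf.open('__blocks_1__.npy').read()
>>> data[:6]                            # (1) the "magic" prefix
b'\x93NUMPY'
>>> data[6], data[7]                    # (2) major and minor version numbers
(1, 0)
>>> struct.unpack('<H', data[8:10])[0]  # (3) the header length
54
>>> data[10:64]                         # (4) the header and (5) padding
b'{"descr":"|b1","fortran_order":True,"shape":(3,)}    \n'
>>> data[64:]                           # (6) raw array byte data
b'\x00\x01\x01'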

As NPY is well supported in NumPy, the np.load() function can be used to convert this file to a NumPy array. This means that underlying array data in a StaticFrame NPZ is easily extractable by alternative readers.

>>> with ZipFile('/tmp/frame.npz') as zf: print(repr(np.load(zf.open('__blocks_1__.npy'))))
array([False,  True,  True])

As an NPY file can encode any array, large two-dimensional arrays can be loaded from contiguous byte data, providing excellent performance in StaticFrame when multiple contiguous columns are represented by a single array.
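For example, the two-dimensional block consolidating the three integer columns of the Frame above can presumably be loaded the same way (assuming that block is stored as "__blocks_0__.npy"):

>>> with ZipFile('/tmp/frame.npz') as zf: print(repr(np.load(zf.open('__blocks_0__.npy'))))
array([[5, 4, 7],
       [9, 1, 8],
       [3, 6, 2]])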

Building an NPZ File

A StaticFrame NPZ is a standard uncompressed ZIP file that contains array data in NPY files and metadata (containing component types and names) in a JSON file.

Given the NPZ file for the Frame above, we can list its contents with ZipFile. The archive contains six NPY files and one JSON file.

>>> with ZipFile('/tmp/frame.npz') as zf: print(zf.namelist())
['__values_index_0__.npy', '__values_index_1__.npy', '__values_columns_0__.npy', '__values_columns_1__.npy', '__blocks_0__.npy', '__blocks_1__.npy', '__meta__.json']

Each of these files maps to a component of the DataFrame diagram.

StaticFrame extends the NPZ format to include metadata in a JSON file. This file defines name attributes, component types, and depth counts.

>>> with ZipFile('/tmp/frame.npz') as zf: print(zf.open('__meta__.json').read())
b'{"__names__": ["p", "r", "q"], "__types__": ["IndexHierarchy", "IndexHierarchy"], "__types_index__": ["IndexYearMonth", "Index"], "__types_columns__": ["Index", "Index"], "__depths__": [2, 2, 2]}'

In the illustration below, components of the __meta__.json file are mapped to components of the DataFrame diagram.

[Diagram: __meta__.json components mapped to the DataFrame diagram]

As a simple ZIP file, tools to extract the contents of a StaticFrame NPZ are ubiquitous. On the other hand, the ZIP format, given its history and broad feature set, incurs performance overhead. StaticFrame implements a custom ZIP reader optimized for NPZ usage, which contributes to the excellent read performance of NPZ.

Conclusion

The performance of DataFrame serialization is critical to many applications. While Parquet has widespread support, its generality compromises type specificity and performance. StaticFrame NPZ can read and write DataFrames up to ten times faster than Parquet with or without compression, with comparable (or only modestly larger) file sizes. While Feather is an attractive alternative, NPZ read performance is still generally twice as fast as Feather. If data I/O is a bottleneck (and it often is), StaticFrame NPZ offers a solution.



Faster DataFrame Serialization was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.



