diff options
Diffstat (limited to 'vendor/github.com/klauspost/compress/s2/README.md')
-rw-r--r-- | vendor/github.com/klauspost/compress/s2/README.md | 256 |
1 files changed, 238 insertions, 18 deletions
diff --git a/vendor/github.com/klauspost/compress/s2/README.md b/vendor/github.com/klauspost/compress/s2/README.md index 81fad652..e6716aea 100644 --- a/vendor/github.com/klauspost/compress/s2/README.md +++ b/vendor/github.com/klauspost/compress/s2/README.md @@ -20,6 +20,7 @@ This is important, so you don't have to worry about spending CPU cycles on alrea * Concurrent stream compression * Faster decompression, even for Snappy compatible content * Ability to quickly skip forward in compressed stream +* Random seeking with indexes * Compatible with reading Snappy compressed content * Smaller block size overhead on incompressible blocks * Block concatenation @@ -29,8 +30,8 @@ This is important, so you don't have to worry about spending CPU cycles on alrea ## Drawbacks over Snappy -* Not optimized for 32 bit systems. -* Streams use slightly more memory due to larger blocks and concurrency (configurable). +* Not optimized for 32 bit systems +* Streams use slightly more memory due to larger blocks and concurrency (configurable) # Usage @@ -141,7 +142,7 @@ Binaries can be downloaded on the [Releases Page](https://github.com/klauspost/c Installing then requires Go to be installed. To install them, use: -`go install github.com/klauspost/compress/s2/cmd/s2c && go install github.com/klauspost/compress/s2/cmd/s2d` +`go install github.com/klauspost/compress/s2/cmd/s2c@latest && go install github.com/klauspost/compress/s2/cmd/s2d@latest` To build binaries to the current folder use: @@ -176,6 +177,8 @@ Options: Compress faster, but with a minor compression loss -help Display help + -index + Add seek index (default true) -o string Write output to another file. Single input file only -pad string @@ -217,11 +220,15 @@ Options: Display help -o string Write output to another file. Single input file only - -q Don't write any output to terminal, except errors + -offset string + Start at offset. Examples: 92, 64K, 256K, 1M, 4M. Requires Index + -q Don't write any output to terminal, except errors -rm - Delete source file(s) after successful decompression + Delete source file(s) after successful decompression -safe - Do not overwrite output files + Do not overwrite output files + -tail string + Return last of compressed file. Examples: 92, 64K, 256K, 1M, 4M. Requires Index -verify Verify files, but do not write output ``` @@ -633,12 +640,12 @@ Compression and speed is typically a bit better `MaxEncodedLen` is also smaller Comparison of [`webdevdata.org-2015-01-07-subset`](https://files.klauspost.com/compress/webdevdata.org-2015-01-07-4GB-subset.7z), 53927 files, total input size: 4,014,735,833 bytes. amd64, single goroutine used: -| Encoder | Size | MB/s | Reduction | -|-----------------------|------------|--------|------------ -| snappy.Encode | 1128706759 | 725.59 | 71.89% | -| s2.EncodeSnappy | 1093823291 | 899.16 | 72.75% | -| s2.EncodeSnappyBetter | 1001158548 | 578.49 | 75.06% | -| s2.EncodeSnappyBest | 944507998 | 66.00 | 76.47% | +| Encoder | Size | MB/s | Reduction | +|-----------------------|------------|------------|------------ +| snappy.Encode | 1128706759 | 725.59 | 71.89% | +| s2.EncodeSnappy | 1093823291 | **899.16** | 72.75% | +| s2.EncodeSnappyBetter | 1001158548 | 578.49 | 75.06% | +| s2.EncodeSnappyBest | 944507998 | 66.00 | **76.47%**| ## Streams @@ -649,11 +656,11 @@ Comparison of different streams, AMD Ryzen 3950x, 16 cores. Size and throughput: | File | snappy.NewWriter | S2 Snappy | S2 Snappy, Better | S2 Snappy, Best | |-----------------------------|--------------------------|---------------------------|--------------------------|-------------------------| -| nyc-taxi-data-10M.csv | 1316042016 - 517.54MB/s | 1307003093 - 8406.29MB/s | 1174534014 - 4984.35MB/s | 1115904679 - 177.81MB/s | -| enwik10 | 5088294643 - 433.45MB/s | 5175840939 - 8454.52MB/s | 4560784526 - 4403.10MB/s | 4340299103 - 159.71MB/s | -| 10gb.tar | 6056946612 - 703.25MB/s | 6208571995 - 9035.75MB/s | 5741646126 - 2402.08MB/s | 5548973895 - 171.17MB/s | -| github-june-2days-2019.json | 1525176492 - 908.11MB/s | 1476519054 - 12625.93MB/s | 1400547532 - 6163.61MB/s | 1321887137 - 200.71MB/s | -| consensus.db.10gb | 5412897703 - 1054.38MB/s | 5354073487 - 12634.82MB/s | 5335069899 - 2472.23MB/s | 5201000954 - 166.32MB/s | +| nyc-taxi-data-10M.csv | 1316042016 - 539.47MB/s | 1307003093 - 10132.73MB/s | 1174534014 - 5002.44MB/s | 1115904679 - 177.97MB/s | +| enwik10 (xml) | 5088294643 - 451.13MB/s | 5175840939 - 9440.69MB/s | 4560784526 - 4487.21MB/s | 4340299103 - 158.92MB/s | +| 10gb.tar (mixed) | 6056946612 - 729.73MB/s | 6208571995 - 9978.05MB/s | 5741646126 - 4919.98MB/s | 5548973895 - 180.44MB/s | +| github-june-2days-2019.json | 1525176492 - 933.00MB/s | 1476519054 - 13150.12MB/s | 1400547532 - 5803.40MB/s | 1321887137 - 204.29MB/s | +| consensus.db.10gb (db) | 5412897703 - 1102.14MB/s | 5354073487 - 13562.91MB/s | 5335069899 - 5294.73MB/s | 5201000954 - 175.72MB/s | # Decompression @@ -679,7 +686,220 @@ The 10 byte 'stream identifier' of the second stream can optionally be stripped, Blocks can be concatenated using the `ConcatBlocks` function. -Snappy blocks/streams can safely be concatenated with S2 blocks and streams. +Snappy blocks/streams can safely be concatenated with S2 blocks and streams. +Streams with indexes (see below) will currently not work on concatenated streams. + +# Stream Seek Index + +S2 and Snappy streams can have indexes. These indexes will allow random seeking within the compressed data. + +The index can either be appended to the stream as a skippable block or returned for separate storage. + +When the index is appended to a stream it will be skipped by regular decoders, +so the output remains compatible with other decoders. + +## Creating an Index + +To automatically add an index to a stream, add `WriterAddIndex()` option to your writer. +Then the index will be added to the stream when `Close()` is called. + +``` + // Add Index to stream... + enc := s2.NewWriter(w, s2.WriterAddIndex()) + io.Copy(enc, r) + enc.Close() +``` + +If you want to store the index separately, you can use `CloseIndex()` instead of the regular `Close()`. +This will return the index. Note that `CloseIndex()` should only be called once, and you shouldn't call `Close()`. + +``` + // Get index for separate storage... + enc := s2.NewWriter(w) + io.Copy(enc, r) + index, err := enc.CloseIndex() +``` + +The `index` can then be used needing to read from the stream. +This means the index can be used without needing to seek to the end of the stream +or for manually forwarding streams. See below. + +Finally, an existing S2/Snappy stream can be indexed using the `s2.IndexStream(r io.Reader)` function. + +## Using Indexes + +To use indexes there is a `ReadSeeker(random bool, index []byte) (*ReadSeeker, error)` function available. + +Calling ReadSeeker will return an [io.ReadSeeker](https://pkg.go.dev/io#ReadSeeker) compatible version of the reader. + +If 'random' is specified the returned io.Seeker can be used for random seeking, otherwise only forward seeking is supported. +Enabling random seeking requires the original input to support the [io.Seeker](https://pkg.go.dev/io#Seeker) interface. + +``` + dec := s2.NewReader(r) + rs, err := dec.ReadSeeker(false, nil) + rs.Seek(wantOffset, io.SeekStart) +``` + +Get a seeker to seek forward. Since no index is provided, the index is read from the stream. +This requires that an index was added and that `r` supports the [io.Seeker](https://pkg.go.dev/io#Seeker) interface. + +A custom index can be specified which will be used if supplied. +When using a custom index, it will not be read from the input stream. + +``` + dec := s2.NewReader(r) + rs, err := dec.ReadSeeker(false, index) + rs.Seek(wantOffset, io.SeekStart) +``` + +This will read the index from `index`. Since we specify non-random (forward only) seeking `r` does not have to be an io.Seeker + +``` + dec := s2.NewReader(r) + rs, err := dec.ReadSeeker(true, index) + rs.Seek(wantOffset, io.SeekStart) +``` + +Finally, since we specify that we want to do random seeking `r` must be an io.Seeker. + +The returned [ReadSeeker](https://pkg.go.dev/github.com/klauspost/compress/s2#ReadSeeker) contains a shallow reference to the existing Reader, +meaning changes performed to one is reflected in the other. + +To check if a stream contains an index at the end, the `(*Index).LoadStream(rs io.ReadSeeker) error` can be used. + +## Manually Forwarding Streams + +Indexes can also be read outside the decoder using the [Index](https://pkg.go.dev/github.com/klauspost/compress/s2#Index) type. +This can be used for parsing indexes, either separate or in streams. + +In some cases it may not be possible to serve a seekable stream. +This can for instance be an HTTP stream, where the Range request +is sent at the start of the stream. + +With a little bit of extra code it is still possible to use indexes +to forward to specific offset with a single forward skip. + +It is possible to load the index manually like this: +``` + var index s2.Index + _, err = index.Load(idxBytes) +``` + +This can be used to figure out how much to offset the compressed stream: + +``` + compressedOffset, uncompressedOffset, err := index.Find(wantOffset) +``` + +The `compressedOffset` is the number of bytes that should be skipped +from the beginning of the compressed file. + +The `uncompressedOffset` will then be offset of the uncompressed bytes returned +when decoding from that position. This will always be <= wantOffset. + +When creating a decoder it must be specified that it should *not* expect a stream identifier +at the beginning of the stream. Assuming the io.Reader `r` has been forwarded to `compressedOffset` +we create the decoder like this: + +``` + dec := s2.NewReader(r, s2.ReaderIgnoreStreamIdentifier()) +``` + +We are not completely done. We still need to forward the stream the uncompressed bytes we didn't want. +This is done using the regular "Skip" function: + +``` + err = dec.Skip(wantOffset - uncompressedOffset) +``` + +This will ensure that we are at exactly the offset we want, and reading from `dec` will start at the requested offset. + +## Index Format: + +Each block is structured as a snappy skippable block, with the chunk ID 0x99. + +The block can be read from the front, but contains information so it can be read from the back as well. + +Numbers are stored as fixed size little endian values or [zigzag encoded](https://developers.google.com/protocol-buffers/docs/encoding#signed_integers) [base 128 varints](https://developers.google.com/protocol-buffers/docs/encoding), +with un-encoded value length of 64 bits, unless other limits are specified. + +| Content | Format | +|---------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------| +| ID, `[1]byte` | Always 0x99. | +| Data Length, `[3]byte` | 3 byte little-endian length of the chunk in bytes, following this. | +| Header `[6]byte` | Header, must be `[115, 50, 105, 100, 120, 0]` or in text: "s2idx\x00". | +| UncompressedSize, Varint | Total Uncompressed size. | +| CompressedSize, Varint | Total Compressed size if known. Should be -1 if unknown. | +| EstBlockSize, Varint | Block Size, used for guessing uncompressed offsets. Must be >= 0. | +| Entries, Varint | Number of Entries in index, must be < 65536 and >=0. | +| HasUncompressedOffsets `byte` | 0 if no uncompressed offsets are present, 1 if present. Other values are invalid. | +| UncompressedOffsets, [Entries]VarInt | Uncompressed offsets. See below how to decode. | +| CompressedOffsets, [Entries]VarInt | Compressed offsets. See below how to decode. | +| Block Size, `[4]byte` | Little Endian total encoded size (including header and trailer). Can be used for searching backwards to start of block. | +| Trailer `[6]byte` | Trailer, must be `[0, 120, 100, 105, 50, 115]` or in text: "\x00xdi2s". Can be used for identifying block from end of stream. | + +For regular streams the uncompressed offsets are fully predictable, +so `HasUncompressedOffsets` allows to specify that compressed blocks all have +exactly `EstBlockSize` bytes of uncompressed content. + +Entries *must* be in order, starting with the lowest offset, +and there *must* be no uncompressed offset duplicates. +Entries *may* point to the start of a skippable block, +but it is then not allowed to also have an entry for the next block since +that would give an uncompressed offset duplicate. + +There is no requirement for all blocks to be represented in the index. +In fact there is a maximum of 65536 block entries in an index. + +The writer can use any method to reduce the number of entries. +An implicit block start at 0,0 can be assumed. + +### Decoding entries: + +``` +// Read Uncompressed entries. +// Each assumes EstBlockSize delta from previous. +for each entry { + uOff = 0 + if HasUncompressedOffsets == 1 { + uOff = ReadVarInt // Read value from stream + } + + // Except for the first entry, use previous values. + if entryNum == 0 { + entry[entryNum].UncompressedOffset = uOff + continue + } + + // Uncompressed uses previous offset and adds EstBlockSize + entry[entryNum].UncompressedOffset = entry[entryNum-1].UncompressedOffset + EstBlockSize +} + + +// Guess that the first block will be 50% of uncompressed size. +// Integer truncating division must be used. +CompressGuess := EstBlockSize / 2 + +// Read Compressed entries. +// Each assumes CompressGuess delta from previous. +// CompressGuess is adjusted for each value. +for each entry { + cOff = ReadVarInt // Read value from stream + + // Except for the first entry, use previous values. + if entryNum == 0 { + entry[entryNum].CompressedOffset = cOff + continue + } + + // Compressed uses previous and our estimate. + entry[entryNum].CompressedOffset = entry[entryNum-1].CompressedOffset + CompressGuess + + // Adjust compressed offset for next loop, integer truncating division must be used. + CompressGuess += cOff/2 +} +``` # Format Extensions |