Commit 8ab2fc60 authored by Mateusz Kozlowski, committed by Jim Harris

doc/ftl: Update the cache device VSS requirements

parent 866c093c
+31 −43
@@ -3,7 +3,8 @@
The Flash Translation Layer library provides efficient 4K block device access on top of devices
with >4K write unit size (e.g. raid5f bdev) or devices with large indirection units (some
capacity-focused NAND drives), which don't handle 4K writes well. It handles the logical to
physical address mapping and manages the garbage collection process.
physical address mapping and manages the garbage collection process. It is the core component of
[CSAL](https://www.solidigm.com/products/technology/cloud-storage-acceleration-layer-write-shaping-csal.html) - Cloud Storage Acceleration Layer.

## Terminology {#ftl_terminology}

@@ -14,8 +15,7 @@ physical address mapping and manages the garbage collection process.
Contains the mapping of the logical addresses (LBA) to their on-disk physical location. The LBAs
are contiguous and in range from 0 to the number of surfaced blocks (the number of spare blocks
is calculated during device formation and is subtracted from the available address space). The
spare blocks account for zones going offline throughout the lifespan of the device as well as
provide necessary buffer for data [garbage collection](#ftl_reloc).
spare blocks provide the necessary buffer for data during [garbage collection](#ftl_reloc).

Since the L2P would occupy a significant amount of DRAM (4B/LBA for drives smaller than 16TiB,
8B/LBA for bigger drives), FTL will, by default, store only the 2GiB of most recently used L2P
@@ -24,47 +24,32 @@ as necessary.
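
The entry sizes quoted above (4B/LBA below 16TiB, 8B/LBA otherwise) can be turned into a rough estimate of the full L2P footprint. The sketch below is illustrative only: the function names are hypothetical, it assumes 4KiB logical blocks, and it ignores the spare blocks that FTL subtracts from the address space. The 16TiB boundary matches the point where 2^32 blocks of 4KiB no longer fit in a 4-byte entry.

```python
BLOCK_SIZE = 4096  # FTL exposes 4K logical blocks
GIB = 1 << 30
TIB = 1 << 40


def l2p_entry_size(capacity_bytes):
    """Bytes per L2P entry: 4 below 16TiB, 8 for bigger drives."""
    return 4 if capacity_bytes < 16 * TIB else 8


def l2p_size_bytes(capacity_bytes):
    """Approximate size of the whole L2P table (spare blocks ignored)."""
    num_lbas = capacity_bytes // BLOCK_SIZE
    return num_lbas * l2p_entry_size(capacity_bytes)


if __name__ == "__main__":
    for capacity in (4 * TIB, 32 * TIB):
        size_gib = l2p_size_bytes(capacity) // GIB
        print(f"{capacity // TIB}TiB drive -> about {size_gib}GiB of L2P")
```

Under these assumptions even a 4TiB drive needs more L2P than the default 2GiB resident window, which is why only the most recently used part is kept in DRAM.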

### Band {#ftl_band}

A band describes a collection of zones, each belonging to a different parallel unit. All writes to
a band follow the same pattern - a batch of logical blocks is written to one zone, another batch
to the next one and so on. This ensures the parallelism of the write operations, as they can be
executed independently on different zones. Each band keeps track of the LBAs it consists of, as
well as their validity, as some of the data will be invalidated by subsequent writes to the same
logical address. The L2P mapping can be restored from the SSD by reading this information in order
from the oldest band to the youngest.
A band is a logical division of the underlying base device, by default 1GiB. All writes to
a band follow the same pattern - a batch of logical blocks (by default 1MiB, the write unit size)
is written to the base device, another batch to the next offset and so on. This ensures the
parallelism of the write operations, increasing the overall bandwidth.
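
As a rough illustration of the write pattern described above, the following sketch enumerates the byte offsets of consecutive write units inside one band, using the default sizes from the text (1GiB band, 1MiB write unit). The helper name is hypothetical, and it assumes bands are laid out back to back from offset 0 of the base device, which is a simplification rather than the actual on-disk layout.

```python
GIB = 1 << 30
MIB = 1 << 20

BAND_SIZE = 1 * GIB        # default band size from the text
WRITE_UNIT_SIZE = 1 * MIB  # default write unit size from the text
BLOCK_SIZE = 4096          # 4K logical blocks


def write_unit_offsets(band_index):
    """Yield the byte offset of every write unit in the given band.

    Assumes bands start at offset 0 and follow each other directly.
    """
    band_start = band_index * BAND_SIZE
    for unit in range(BAND_SIZE // WRITE_UNIT_SIZE):
        yield band_start + unit * WRITE_UNIT_SIZE


if __name__ == "__main__":
    offsets = list(write_unit_offsets(1))
    print(len(offsets), "write units per band")
    print(WRITE_UNIT_SIZE // BLOCK_SIZE, "blocks per write unit")
```

With the defaults this gives 1024 write units per band and 256 blocks per write unit; part of that space is taken by the tail metadata rather than user data.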

```text
             +--------------+        +--------------+                        +--------------+
    band 1   |   zone 1     +--------+    zone 1    +---- --- --- --- --- ---+     zone 1   |
             +--------------+        +--------------+                        +--------------+
    band 2   |   zone 2     +--------+     zone 2   +---- --- --- --- --- ---+     zone 2   |
             +--------------+        +--------------+                        +--------------+
    band 3   |   zone 3     +--------+     zone 3   +---- --- --- --- --- ---+     zone 3   |
             +--------------+        +--------------+                        +--------------+
             |     ...      |        |     ...      |                        |     ...      |
             +--------------+        +--------------+                        +--------------+
    band m   |   zone m     +--------+     zone m   +---- --- --- --- --- ---+     zone m   |
             +--------------+        +--------------+                        +--------------+
             |     ...      |        |     ...      |                        |     ...      |
             +--------------+        +--------------+                        +--------------+

              parallel unit 1              pu 2                                    pu n
```
Each band keeps track of the LBAs it consists of, its sequence ID (to detect the relative age
of the data), as well as their validity - as some of the data will be invalidated by subsequent
writes to the same logical address. The L2P mapping can be restored from the SSD by reading this
information in order from all bands based on the sequence IDs.

The address map (`P2L`) is saved as a part of the band's metadata, at the end of each band:

```text
                        band's data                        tail metadata
    +-------------------+-------------------------------+------------------------+
    |zone 1 |...|zone n |...|...|zone 1 |...|           | ... |zone  m-1 |zone  m|
    |block 1|   |block 1|   |   |block x|   |           |     |block y   |block y|
    +-------------------+-------------+-----------------+------------------------+
    +----------------+------------------------------------------------------------+
    |       ||       |...|...|       |...|         | ... |seq 1 |seq 2 |...|seq x |
    |block 1||block 2|   |   |block x|   |         |     |LBA 1 |LBA 2 |   |LBA x |
    +----------------+------------------------------------------------------------+
```

Bands are written sequentially (in a way that was described earlier). Before a band can be written
to, all of its zones need to be erased. During that time, the band is considered to be in a `PREP`
state. Then the band moves to the `OPEN` state and actual user data can be written to the
band. Once the whole available space is filled, tail metadata is written and the band transitions to
`CLOSING` state. When that finishes the band becomes `CLOSED`.
to, it needs to be in a `FREE` state, i.e. without user data. This happens either with a fresh FTL
(no data has been written to the bdev), or after [garbage collection](#ftl_reloc).
The band moves to the `OPEN` state when FTL requires space for writing data. After the state transition, actual user data
can be written to the band. Once the whole available space is filled, tail metadata is written and the band transitions
to the `CLOSING` state. When that finishes, the band becomes `CLOSED`.
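
The band life cycle described above can be summarised as a small state machine. This is a minimal sketch, not the SPDK implementation: the state names come from the text, while the `BandState` enum, the `ALLOWED` table and the `transition` helper are hypothetical. The `CLOSED` to `FREE` edge stands for a band being emptied by garbage collection.

```python
from enum import Enum


class BandState(Enum):
    FREE = "FREE"        # no user data; the band can be opened
    OPEN = "OPEN"        # user data is being written
    CLOSING = "CLOSING"  # band is full, tail metadata is being written
    CLOSED = "CLOSED"    # tail metadata persisted


# Allowed transitions, following the order given in the text.
ALLOWED = {
    BandState.FREE: {BandState.OPEN},
    BandState.OPEN: {BandState.CLOSING},
    BandState.CLOSING: {BandState.CLOSED},
    BandState.CLOSED: {BandState.FREE},  # after garbage collection
}


def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"invalid transition {current.name} -> {target.name}")
    return target


if __name__ == "__main__":
    state = BandState.FREE
    for nxt in (BandState.OPEN, BandState.CLOSING, BandState.CLOSED, BandState.FREE):
        state = transition(state, nxt)
        print(state.name)
```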

### Non volatile cache {#ftl_nvcache}

@@ -97,10 +82,10 @@ is moved to base_bdev. This process is called chunk compaction.
- Shorthand: gc, reloc

Since a write to the same LBA invalidates its previous physical location, some of the blocks on a
band might contain old data that basically wastes space. As there is no way to overwrite an already
written block for a ZNS drive, this data will stay there until the whole zone is reset. This might create a
situation in which all of the bands contain some valid data and no band can be erased, so no writes
can be executed anymore. Therefore a mechanism is needed to move valid data and invalidate whole
band might contain old data that basically wastes space. Since writing to random locations on the base device
may incur additional write amplification, FTL strives to issue sequential writes in large blocks within
each band. This might create a situation in which all of the bands contain some valid data and no band can be
freed, so no writes can be executed anymore. Therefore a mechanism is needed to move valid data and invalidate whole
bands, so that they can be reused.

```text
@@ -150,8 +135,9 @@ the mapping itself, but also a sequence id (`seq_id`), which describes the relat
(multiple writes to the same logical block would produce the same number of P2L entries, only the last one having the current data).

FTL will therefore rebuild the whole L2P by reading the P2L of all closed bands and chunks. For open bands, the P2L is stored on
the cache device, in a separate metadata region (see [the P2L section](#ftl_metadata)). Open chunks can be restored thanks to storing
the mapping in the VSS DIX metadata, which the cache device must be formatted with.
the cache device, in a separate metadata region (see [the P2L section](#ftl_metadata)). If the cache device is formatted
with VSS DIX metadata, open chunks can be restored from the mapping stored in that metadata. For cache devices without
DIX, an additional log structure is kept to preserve data consistency after a power failure.
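
The idea of rebuilding the L2P from P2L entries can be sketched as follows. This is a simplified model, not the SPDK recovery code: `P2LEntry` and `rebuild_l2p` are hypothetical names, and the sketch assumes every entry carries its own `seq_id`, with later entries in iteration order winning on ties.

```python
from typing import Dict, Iterable, NamedTuple


class P2LEntry(NamedTuple):
    lba: int     # logical block address
    addr: int    # physical location of the data
    seq_id: int  # relative age of the write (higher means newer)


def rebuild_l2p(entries: Iterable[P2LEntry]) -> Dict[int, int]:
    """Keep, for every LBA, the physical address with the newest seq_id."""
    l2p: Dict[int, int] = {}
    newest: Dict[int, int] = {}
    for entry in entries:
        # ">=" lets a later entry with the same seq_id override an earlier one
        if entry.lba not in newest or entry.seq_id >= newest[entry.lba]:
            newest[entry.lba] = entry.seq_id
            l2p[entry.lba] = entry.addr
    return l2p


if __name__ == "__main__":
    p2l = [
        P2LEntry(lba=7, addr=100, seq_id=1),
        P2LEntry(lba=7, addr=300, seq_id=3),
        P2LEntry(lba=7, addr=200, seq_id=2),
        P2LEntry(lba=9, addr=400, seq_id=2),
    ]
    print(rebuild_l2p(p2l))
```

Because the newest `seq_id` wins regardless of the order in which entries are read, bands and chunks can be scanned in any order.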

### Shared memory recovery {#ftl_shm_recovery}

@@ -169,7 +155,9 @@ currently only allows for trims (unmaps) aligned to 4MiB (alignment concerns bot

### Prerequisites {#ftl_prereq}

In order to use the FTL module, a cache device formatted with VSS DIX metadata is required.
In order to use the FTL module, a cache device with at least 5GiB capacity is required. The base device requires at
least 20GiB capacity. Currently the base device write unit size (LBA alignment and number of blocks that must be
issued during a write) must be a power of 2 (a write unit of 1 block, i.e. no write unit size requirement, counts as a power of 2).
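
The requirements above can be expressed as a simple pre-flight check. The sketch below only encodes the three conditions stated in this section; `check_ftl_prerequisites` is a hypothetical helper and not part of the SPDK API.

```python
GIB = 1 << 30
MIN_CACHE_BYTES = 5 * GIB   # minimum cache device capacity
MIN_BASE_BYTES = 20 * GIB   # minimum base device capacity


def is_power_of_two(value):
    """True for 1, 2, 4, 8, ... (1 means no write unit requirement)."""
    return value > 0 and (value & (value - 1)) == 0


def check_ftl_prerequisites(cache_bytes, base_bytes, write_unit_blocks):
    """Return a list of human-readable problems; empty if all checks pass."""
    problems = []
    if cache_bytes < MIN_CACHE_BYTES:
        problems.append("cache device is smaller than 5GiB")
    if base_bytes < MIN_BASE_BYTES:
        problems.append("base device is smaller than 20GiB")
    if not is_power_of_two(write_unit_blocks):
        problems.append("base device write unit size is not a power of 2")
    return problems


if __name__ == "__main__":
    print(check_ftl_prerequisites(8 * GIB, 64 * GIB, 256))
    print(check_ftl_prerequisites(4 * GIB, 64 * GIB, 48))
```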

### FTL bdev creation {#ftl_create}

@@ -179,7 +167,7 @@ Both interfaces require the same arguments which are described by the `--help` o

- bdev's name
- base bdev's name
- cache bdev's name (cache bdev must support VSS DIX mode)
- cache bdev's name
- UUID of the FTL device (if the FTL is to be restored from the SSD)

## FTL bdev stack {#ftl_bdev_stack}