Commit e9a236d2 authored by Wojciech Malikowski, committed by Ben Walker

ftl: Initial headers



This patch introduces core structures required for implementing FTL on
top of Open Channel drives. The Open Channel specification describes raw
access to the media on the SSD. The FTL consumes that API and exposes a
block device interface.

The implementation is based on the revision 2.0 of the Open Channel SSD
specification.

Change-Id: Ie306cdfb7920df3b02233fcb60896745f3184cdc
Signed-off-by: Wojciech Malikowski <wojciech.malikowski@intel.com>
Signed-off-by: Konrad Sztyber <konrad.sztyber@intel.com>
Reviewed-on: https://review.gerrithub.io/c/431321


Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
parent 3dc3f416
+1 −0
@@ -811,6 +811,7 @@ INPUT += \
                         concurrency.md \
                         directory_structure.md \
                         event.md \
                         ftl.md \
                         getting_started.md \
                         ioat.md \
                         iscsi.md \

doc/ftl.md

0 → 100644
+134 −0
# Flash Translation Layer {#ftl}

The Flash Translation Layer library provides block device access on top of non-block SSDs
implementing the Open Channel interface. It handles the logical-to-physical address mapping,
responds to asynchronous media management events, and manages the defragmentation process.

# Terminology {#ftl_terminology}

## Logical to physical address map

 * Shorthand: L2P

Contains the mapping of logical block addresses (LBAs) to their on-disk physical locations (PPAs).
The LBAs are contiguous and range from 0 to the number of surfaced blocks. The number of spare
blocks is calculated during device creation and subtracted from the available address space. The
spare blocks account for chunks going offline throughout the lifespan of the device and also
provide the necessary buffer for data [defragmentation](#ftl_reloc).
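
As a rough illustration of the idea, the sketch below models the L2P as a flat array with one PPA
slot per surfaced LBA. It is not taken from the library: `struct l2p_table`, `L2P_INVALID_PPA` and
the `l2p_*` helpers are hypothetical names, and a PPA is simplified to a plain 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical marker for an LBA that has never been written. */
#define L2P_INVALID_PPA UINT64_MAX

/* Hypothetical flat L2P table: one PPA slot per surfaced LBA. */
struct l2p_table {
	uint64_t	num_lbas;
	uint64_t	*ppa;
};

/* Surfaced address space = all blocks minus the spare blocks. */
int
l2p_init(struct l2p_table *l2p, uint64_t total_blocks, uint64_t spare_blocks)
{
	if (spare_blocks >= total_blocks) {
		return -1;
	}

	l2p->num_lbas = total_blocks - spare_blocks;
	l2p->ppa = malloc(l2p->num_lbas * sizeof(*l2p->ppa));
	if (l2p->ppa == NULL) {
		return -1;
	}

	for (uint64_t i = 0; i < l2p->num_lbas; i++) {
		l2p->ppa[i] = L2P_INVALID_PPA;
	}

	return 0;
}

/* Point an LBA at a new physical location (the old one becomes invalid). */
void
l2p_set(struct l2p_table *l2p, uint64_t lba, uint64_t ppa)
{
	assert(lba < l2p->num_lbas);
	l2p->ppa[lba] = ppa;
}

/* Look up the current physical location of an LBA. */
uint64_t
l2p_get(const struct l2p_table *l2p, uint64_t lba)
{
	assert(lba < l2p->num_lbas);
	return l2p->ppa[lba];
}

void
l2p_free(struct l2p_table *l2p)
{
	free(l2p->ppa);
	l2p->ppa = NULL;
	l2p->num_lbas = 0;
}
```

With 100 blocks of which 20 are spare, such a table would expose LBAs 0 through 79.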

## Band {#ftl_band}

A band is a collection of chunks, each belonging to a different parallel unit. All writes to
a band follow the same pattern: a batch of logical blocks is written to one chunk, another batch
to the next one, and so on. This ensures the parallelism of the write operations, as they can be
executed independently on different chunks. Each band keeps track of the LBAs it contains, as
well as their validity, since some of the data will be invalidated by subsequent writes to the same
logical address. The L2P mapping can be restored from the SSD by reading this information in order
from the oldest band to the youngest.

             +--------------+        +--------------+                        +--------------+
    band 1   |   chunk 1    +--------+     chk 1    +---- --- --- --- --- ---+     chk 1    |
             +--------------+        +--------------+                        +--------------+
    band 2   |   chunk 2    +--------+     chk 2    +---- --- --- --- --- ---+     chk 2    |
             +--------------+        +--------------+                        +--------------+
    band 3   |   chunk 3    +--------+     chk 3    +---- --- --- --- --- ---+     chk 3    |
             +--------------+        +--------------+                        +--------------+
             |     ...      |        |     ...      |                        |     ...      |
             +--------------+        +--------------+                        +--------------+
    band m   |   chunk m    +--------+     chk m    +---- --- --- --- --- ---+     chk m    |
             +--------------+        +--------------+                        +--------------+
             |     ...      |        |     ...      |                        |     ...      |
             +--------------+        +--------------+                        +--------------+

              parallel unit 1              pu 2                                    pu n
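
The write pattern of a band (one batch per chunk, then on to the next chunk) can be sketched as a
mapping from a block's offset within the band to a chunk and a position inside that chunk. The
structure and function names below are hypothetical, and the sketch assumes every batch holds the
same number of blocks (`xfer_size`) and that head and tail metadata are ignored.

```c
#include <stddef.h>

/* Hypothetical band geometry; the field names are illustrative only. */
struct band_geometry {
	size_t	num_chunks;	/* one chunk per parallel unit */
	size_t	xfer_size;	/* logical blocks written to a chunk in one batch */
};

/* Index of the chunk that receives the block at `offset` within the band. */
size_t
band_chunk_index(const struct band_geometry *geo, size_t offset)
{
	return (offset / geo->xfer_size) % geo->num_chunks;
}

/* Position of that block inside its chunk. */
size_t
band_chunk_offset(const struct band_geometry *geo, size_t offset)
{
	/* Number of full passes over all chunks completed before this block */
	size_t stripe = offset / (geo->xfer_size * geo->num_chunks);

	return stripe * geo->xfer_size + offset % geo->xfer_size;
}
```

For example, with 4 chunks and 16 blocks per batch, blocks 0-15 land on the first chunk, blocks
16-31 on the second, and block 64 wraps around to the first chunk again at position 16.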

The address map and valid map are, along with several other things (e.g. UUID of the device it's
part of, number of surfaced LBAs, band's sequence number, etc.), parts of the band's metadata. The
metadata is split in two parts:
 * the head part, containing information already known when opening the band (device's UUID, band's
   sequence number, etc.), located at the beginning blocks of the band,
 * the tail part, containing the address map and the valid map, located at the end of the band.


       head metadata               band's data               tail metadata
    +-------------------+-------------------------------+----------------------+
    |chk 1|...|chk n|...|...|chk 1|...|                 | ... |chk  m-1 |chk  m|
    |lbk 1|   |lbk 1|   |   |lbk x|   |                 |     |lbk y    |lbk y |
    +-------------------+-------------+-----------------+----------------------+


Bands are written sequentially (in the way described earlier). Before a band can be written to,
all of its chunks need to be erased. During that time, the band is considered to be in the `PREP`
state. Once the erase is done, the band transitions to the `OPENING` state, in which the head
metadata is written. Then the band moves to the `OPEN` state and actual user data can be written to
the band. Once the whole available space is filled, the tail metadata is written and the band
transitions to the `CLOSING` state. When that finishes, the band becomes `CLOSED`.
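
The life cycle above can be summarized as a small state machine. The sketch below is illustrative
only: the enum and function names are hypothetical, and the `CLOSED` to `PREP` transition is an
assumption about how a band is reused once it has been emptied and is erased again.

```c
#include <stdbool.h>

/* Hypothetical names for the band states described in the text. */
enum band_state {
	BAND_STATE_PREP,	/* chunks are being erased */
	BAND_STATE_OPENING,	/* head metadata is being written */
	BAND_STATE_OPEN,	/* user data can be written */
	BAND_STATE_CLOSING,	/* tail metadata is being written */
	BAND_STATE_CLOSED	/* band is fully written */
};

/* Check whether a band may move directly from one state to another. */
bool
band_transition_valid(enum band_state from, enum band_state to)
{
	switch (from) {
	case BAND_STATE_PREP:
		return to == BAND_STATE_OPENING;
	case BAND_STATE_OPENING:
		return to == BAND_STATE_OPEN;
	case BAND_STATE_OPEN:
		return to == BAND_STATE_CLOSING;
	case BAND_STATE_CLOSING:
		return to == BAND_STATE_CLOSED;
	case BAND_STATE_CLOSED:
		/* Assumption: a closed band is erased again before being reused */
		return to == BAND_STATE_PREP;
	}

	return false;
}
```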

## Ring write buffer {#ftl_rwb}

 * Shorthand: RWB

Because the smallest write size the SSD supports may be a multiple of the block size, data needs
to be buffered in order to support writes of a single block. The write buffer solves this problem.
It consists of a number of pre-allocated buffers called batches, each sized for a single transfer
to the SSD. A single batch is divided into block-sized buffer entries.

                 write buffer
    +-----------------------------------+
    |batch 1                            |
    |   +-----------------------------+ |
    |   |rwb    |rwb    | ... |rwb    | |
    |   |entry 1|entry 2|     |entry n| |
    |   +-----------------------------+ |
    +-----------------------------------+
    | ...                               |
    +-----------------------------------+
    |batch m                            |
    |   +-----------------------------+ |
    |   |rwb    |rwb    | ... |rwb    | |
    |   |entry 1|entry 2|     |entry n| |
    |   +-----------------------------+ |
    +-----------------------------------+

When a write is scheduled, it needs to acquire an entry for each of its blocks and copy the data
into this buffer. Once all blocks are copied, the write can be signalled as completed to the user.
In the meantime, the `rwb` is polled for filled batches and, if one is found, it's sent to the SSD.
After that operation is completed, the whole batch can be freed. As long as the data is in
the `rwb`, the L2P points at the buffer entry instead of a location on the SSD. This allows for
servicing read requests from the buffer.
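
The acquire, poll and free steps can be sketched as follows. This is a deliberately simplified
model, not the library's implementation: all names and sizes are hypothetical, entries are handed
out from the first batch that still has room rather than in true ring order, and the L2P update
and the actual transfer to the SSD are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes chosen for illustration only. */
#define BLOCK_SIZE		4096
#define ENTRIES_PER_BATCH	4
#define NUM_BATCHES		2

struct rwb_entry {
	uint64_t		lba;
	uint8_t			data[BLOCK_SIZE];
};

struct rwb_batch {
	size_t			num_filled;
	struct rwb_entry	entries[ENTRIES_PER_BATCH];
};

struct rwb {
	struct rwb_batch	batches[NUM_BATCHES];
};

void
rwb_init(struct rwb *rwb)
{
	memset(rwb, 0, sizeof(*rwb));
}

/* Copy one block into the buffer; returns NULL when every batch is full. */
struct rwb_entry *
rwb_acquire(struct rwb *rwb, uint64_t lba, const void *data)
{
	for (size_t i = 0; i < NUM_BATCHES; i++) {
		struct rwb_batch *batch = &rwb->batches[i];

		if (batch->num_filled < ENTRIES_PER_BATCH) {
			struct rwb_entry *entry = &batch->entries[batch->num_filled++];

			entry->lba = lba;
			memcpy(entry->data, data, BLOCK_SIZE);
			return entry;
		}
	}

	return NULL;
}

/* Poll for a batch that is ready to be sent to the SSD; NULL if none is full. */
struct rwb_batch *
rwb_poll_full(struct rwb *rwb)
{
	for (size_t i = 0; i < NUM_BATCHES; i++) {
		if (rwb->batches[i].num_filled == ENTRIES_PER_BATCH) {
			return &rwb->batches[i];
		}
	}

	return NULL;
}

/* Free the whole batch once its transfer to the SSD has completed. */
void
rwb_release(struct rwb_batch *batch)
{
	assert(batch->num_filled == ENTRIES_PER_BATCH);
	batch->num_filled = 0;
}
```

A caller would invoke `rwb_acquire()` once per block of a write, while a poller repeatedly calls
`rwb_poll_full()` and, after submitting the batch, `rwb_release()`.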

## Defragmentation and relocation {#ftl_reloc}

 * Shorthand: defrag, reloc

Since a write to the same LBA invalidates its previous physical location, some of the blocks on a
band might contain old data that only wastes space. As there is no way to overwrite an already
written block, this data will stay there until the whole chunk is reset. This might create a
situation in which all of the bands contain some valid data and no band can be erased, so no more
writes can be executed. Therefore, a mechanism is needed to move valid data and invalidate whole
bands, so that they can be reused.

                    band                                             band
    +-----------------------------------+            +-----------------------------------+
    | ** *    * ***      *    *** * *   |            |                                   |
    |**  *       *    *    * *     *   *|   +---->   |                                   |
    |*     ***  *      *            *   |            |                                   |
    +-----------------------------------+            +-----------------------------------+

Valid blocks are marked with an asterisk '\*'.

Another reason for data relocation might be an event from the SSD telling us that the data might
become corrupt if it's not relocated. This might happen due to its age (if it was written a
long time ago) or due to read disturb (a media characteristic that causes corruption of
neighbouring blocks during a read operation).

The module responsible for data relocation is called `reloc`. When a band is chosen for
defragmentation or an ANM (asynchronous NAND management) event is received, the appropriate blocks
are marked as required to be moved. The `reloc` module takes a band that has some such blocks
marked, checks their validity and, if they're still valid, copies them.

Choosing a band for defragmentation depends on several factors: its valid ratio (1) (proportion of
valid blocks to all user blocks), its age (2) (when it was written) and its write count / wear level
index of its chunks (3) (how many times the band was written to). The lower the ratio (1), the
higher its age (2) and the lower its write count (3), the higher the chance the band will be chosen
for defrag.
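
The text only states the direction in which each factor influences the choice, not how the factors
are combined. The sketch below therefore uses a made-up merit formula that merely respects those
directions; `struct band_stats`, its fields and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-band statistics; the field names are illustrative only. */
struct band_stats {
	uint64_t	num_valid;	/* valid user blocks */
	uint64_t	num_user;	/* all user blocks */
	uint64_t	age;		/* larger means written longer ago */
	uint64_t	write_count;	/* how many times the band was written to */
};

/* (1) Proportion of valid blocks to all user blocks. */
double
band_valid_ratio(const struct band_stats *band)
{
	assert(band->num_user > 0);
	return (double)band->num_valid / (double)band->num_user;
}

/*
 * Made-up merit: grows with the share of invalid data (1) and with age (2),
 * and shrinks as the write count (3) grows.
 */
double
band_defrag_merit(const struct band_stats *band)
{
	double invalid = 1.0 - band_valid_ratio(band);

	return invalid * (double)(band->age + 1) / (double)(band->write_count + 1);
}

/* Return the index of the band with the highest merit. */
size_t
band_pick_for_defrag(const struct band_stats *bands, size_t num_bands)
{
	size_t best = 0;

	assert(num_bands > 0);
	for (size_t i = 1; i < num_bands; i++) {
		if (band_defrag_merit(&bands[i]) > band_defrag_merit(&bands[best])) {
			best = i;
		}
	}

	return best;
}
```

Under this formula a band that contains only valid data always has a merit of zero, since
relocating it would free no space.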
+1 −0
@@ -5,3 +5,4 @@
- @subpage bdev_pg
- @subpage bdev_module
- @subpage nvmf_tgt_pg
- @subpage ftl

include/spdk/ftl.h

0 → 100644
+267 −0
/*-
 *   BSD LICENSE
 *
 *   Copyright (c) Intel Corporation.
 *   All rights reserved.
 *
 *   Redistribution and use in source and binary forms, with or without
 *   modification, are permitted provided that the following conditions
 *   are met:
 *
 *     * Redistributions of source code must retain the above copyright
 *       notice, this list of conditions and the following disclaimer.
 *     * Redistributions in binary form must reproduce the above copyright
 *       notice, this list of conditions and the following disclaimer in
 *       the documentation and/or other materials provided with the
 *       distribution.
 *     * Neither the name of Intel Corporation nor the names of its
 *       contributors may be used to endorse or promote products derived
 *       from this software without specific prior written permission.
 *
 *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

#ifndef SPDK_FTL_H
#define SPDK_FTL_H

#include <spdk/stdinc.h>
#include <spdk/nvme.h>
#include <spdk/nvme_ocssd.h>
#include <spdk/uuid.h>
#include <spdk/thread.h>

struct spdk_ftl_dev;

/* Limit thresholds */
enum {
	SPDK_FTL_LIMIT_CRIT,
	SPDK_FTL_LIMIT_HIGH,
	SPDK_FTL_LIMIT_LOW,
	SPDK_FTL_LIMIT_START,
	SPDK_FTL_LIMIT_MAX
};

struct spdk_ftl_limit {
	/* Threshold from which the limiting starts */
	size_t					thld;

	/* Limit percentage */
	size_t					limit;
};

struct spdk_ftl_conf {
	/* Number of reserved addresses not exposed to the user */
	size_t					lba_rsvd;

	/* Write buffer size */
	size_t					rwb_size;

	/* Threshold for opening new band */
	size_t					band_thld;

	/* Trace enabled flag */
	int					trace;

	/* Trace file name */
	const char				*trace_path;

	/* Maximum IO depth per band relocate */
	size_t					max_reloc_qdepth;

	/* Maximum active band relocates */
	size_t					max_active_relocs;

	/* IO pool size per user thread */
	size_t					user_io_pool_size;

	struct {
		/* Lowest percentage of invalid lbks for a band to be defragged */
		size_t				invalid_thld;

		/* User writes limits */
		struct spdk_ftl_limit		limits[SPDK_FTL_LIMIT_MAX];
	} defrag;
};

/* Range of parallel units (inclusive) */
struct spdk_ftl_punit_range {
	unsigned int				begin;
	unsigned int				end;
};

enum spdk_ftl_mode {
	/* Create new device */
	SPDK_FTL_MODE_CREATE = (1 << 0),
};

struct spdk_ftl_dev_init_opts {
	/* NVMe controller */
	struct spdk_nvme_ctrlr			*ctrlr;
	/* Controller's transport ID */
	struct spdk_nvme_transport_id		trid;

	/* Thread responsible for core tasks execution */
	struct spdk_thread			*core_thread;
	/* Thread responsible for read requests */
	struct spdk_thread			*read_thread;

	/* Device's config */
	struct spdk_ftl_conf			*conf;
	/* Device's name */
	const char				*name;
	/* Parallel unit range */
	struct spdk_ftl_punit_range		range;
	/* Mode flags */
	unsigned int				mode;
	/* Device UUID (valid when restoring device from disk) */
	struct spdk_uuid			uuid;
};

struct spdk_ftl_attrs {
	/* Device's UUID */
	struct spdk_uuid			uuid;
	/* Parallel unit range */
	struct spdk_ftl_punit_range		range;
	/* Number of logical blocks */
	uint64_t				lbk_cnt;
	/* Logical block size */
	size_t					lbk_size;
};

struct ftl_module_init_opts {
	/* Thread on which to poll for ANM events */
	struct spdk_thread			*anm_thread;
};

typedef void (*spdk_ftl_fn)(void *, int);
typedef void (*spdk_ftl_init_fn)(struct spdk_ftl_dev *, void *, int);

/**
 * Initialize the FTL module.
 *
 * \param opts module configuration
 * \param cb callback function to call when the module is initialized
 * \param cb_arg callback's argument
 *
 * \return 0 if successfully started initialization, negative values if
 * resources could not be allocated.
 */
int spdk_ftl_module_init(const struct ftl_module_init_opts *opts, spdk_ftl_fn cb, void *cb_arg);

/**
 * Deinitialize the FTL module. All FTL devices have to be unregistered prior to
 * calling this function.
 *
 * \param cb callback function to call when the deinitialization is completed
 * \param cb_arg callback's argument
 *
 * \return 0 if successfully scheduled deinitialization, negative errno
 * otherwise.
 */
int spdk_ftl_module_fini(spdk_ftl_fn cb, void *cb_arg);

/**
 * Initialize the FTL on given NVMe device and parallel unit range.
 *
 * Covers the following:
 * - initialize and register NVMe ctrlr,
 * - retrieve geometry and check if the device has proper configuration,
 * - allocate buffers and resources,
 * - initialize internal structures,
 * - initialize internal thread(s),
 * - restore or create L2P table.
 *
 * \param opts configuration for new device
 * \param cb callback function to call when the device is created
 * \param cb_arg callback's argument
 *
 * \return 0 if initialization was started successfully, negative errno otherwise.
 */
int spdk_ftl_dev_init(const struct spdk_ftl_dev_init_opts *opts, spdk_ftl_init_fn cb, void *cb_arg);

/**
 * Deinitialize and free given device.
 *
 * \param dev device
 * \param cb callback function to call when the device is freed
 * \param cb_arg callback's argument
 *
 * \return 0 if successfully scheduled free, negative errno otherwise.
 */
int spdk_ftl_dev_free(struct spdk_ftl_dev *dev, spdk_ftl_fn cb, void *cb_arg);

/**
 * Initialize FTL configuration structure with default values.
 *
 * \param conf FTL configuration to initialize
 */
void spdk_ftl_conf_init_defaults(struct spdk_ftl_conf *conf);

/**
 * Retrieve device’s attributes.
 *
 * \param dev device
 * \param attr Attribute structure to fill
 *
 * \return 0 if successfully initialized, negated EINVAL otherwise.
 */
int spdk_ftl_dev_get_attrs(const struct spdk_ftl_dev *dev, struct spdk_ftl_attrs *attr);

/**
 * Submits a read to the specified device.
 *
 * \param dev Device
 * \param ch I/O channel
 * \param lba Starting LBA to read the data
 * \param lba_cnt Number of sectors to read
 * \param iov Single IO vector or pointer to IO vector table
 * \param iov_cnt Number of IO vectors
 * \param cb_fn Callback function to invoke when the I/O is completed
 * \param cb_arg Argument to pass to the callback function
 *
 * \return 0 if successfully submitted, negated EINVAL otherwise.
 */
int spdk_ftl_read(struct spdk_ftl_dev *dev, struct spdk_io_channel *ch, uint64_t lba,
		  size_t lba_cnt,
		  struct iovec *iov, size_t iov_cnt, spdk_ftl_fn cb_fn, void *cb_arg);

/**
 * Submits a write to the specified device.
 *
 * \param dev Device
 * \param ch I/O channel
 * \param lba Starting LBA to write the data
 * \param lba_cnt Number of sectors to write
 * \param iov Single IO vector or pointer to IO vector table
 * \param iov_cnt Number of IO vectors
 * \param cb_fn Callback function to invoke when the I/O is completed
 * \param cb_arg Argument to pass to the callback function
 *
 * \return 0 if successfully submitted, negative values otherwise.
 */
int spdk_ftl_write(struct spdk_ftl_dev *dev, struct spdk_io_channel *ch, uint64_t lba,
		   size_t lba_cnt,
		   struct iovec *iov, size_t iov_cnt, spdk_ftl_fn cb_fn, void *cb_arg);

/**
 * Submits a flush request to the specified device.
 *
 * \param dev device
 * \param cb_fn Callback function to invoke when all prior IOs have been completed
 * \param cb_arg Argument to pass to the callback function
 *
 * \return 0 if successfully submitted, negated EINVAL or ENOMEM otherwise.
 */
int spdk_ftl_flush(struct spdk_ftl_dev *dev, spdk_ftl_fn cb_fn, void *cb_arg);

#endif /* SPDK_FTL_H */

lib/ftl/ftl_core.h

0 → 100644
+434 −0

File added.

Preview size limit exceeded, changes collapsed.
