EdgeFS Architecture Overview
Table of Contents
- Technology Overview
- Process Architecture
- Data Architecture
EdgeFS is a software-defined, Kubernetes-managed, decentralized data-fabric solution built upon a blockchain-like architecture that we also call Cloud Copy-On-Write (CCOW). Fundamentally, CCOW enforces full immutability of all data and metadata in a segmented global namespace: metadata and data are equally immutable and can never be modified in place.
In CCOW, objects are split into metadata chunks that reference payload chunks. The chunks are immutable, and their physical locations are derived from their hash values.
CCOW objects support two types of chunk payload: arrays of bytes and arrays of key/value records.
In CCOW, any higher-level logical protocol abstraction (an NFS/SMB file or directory, an AWS S3 object, an EdgeX NoSQL database, an iSCSI LUN, etc.) is represented as a CCOW object construct.
CCOW segments are essentially scale-out clusters that can grow in performance and capacity by adding more nodes. Any I/O within a CCOW segment is always local and, as a result, very fast: local I/O never needs to cross an inter-segment boundary.
Location independence of data chunks is an important property of the CCOW architecture and the basis for the Inter-Segment Gateway (ISGW) design. In practice, it means that EdgeFS segments can initiate peer-to-peer communication to reconstruct an object without necessarily knowing in advance which segment peers will participate in the transfer.
The ISGW operates on metadata and payload data chunks, utilizing their hash values to ensure decentralized consistency across all connected segments.
Enterprises generally have well-defined requirements related to storage: it frequently needs to be both reliable and available from multiple locations. The unique process architecture of EdgeFS delivers on both of these constraints.
The process architecture comprises a set of processes that manage storage services (S3, NFS, SMB, etc.). Protocol gateways, as they are known, are the processes that handle client storage requests (commands) for specific protocols; NFS, SMB/CIFS, iSCSI, and S3 are the common ones.
Logical subsets of processes, typically associated to a geographic region, are named segments. These segments may be peered in order to provide increased resiliency (non-stop business), or to provide local low-latency access to data. However, rather than have every process connect to every other process, thus forming a full-mesh network (which would increase overall network overhead), each segment has a single dedicated process that is responsible for all inter-segment messaging. Inter-segment gateways, as they are known, are responsible for ensuring all segments remain synchronized with respect to each other through a process of inter-segment replication.
In this way, there are two types of processes: protocol gateways, and inter-segment gateways. Each are discussed below in greater detail.
It was said above that segments are typically associated with a geographic region. Though this is generally the case, it is not a hard and fast rule. A segment may be associated with a sub-partition of a region, an availability zone (data center), or even, at its most minimal, a single rack of hardware.
In terms of networks, two additional points: (a) within a segment all communication is strictly local, and (b) the storage network can run separately from the client network to minimize the impact of replication on client observable latency.
The user's perspective of state is that of objects, files, and directories. For instance, using a storage-specific file protocol, commands are issued to the network storage substrate (server) to, e.g., open, close, read, and write files. From the client perspective, all reads and writes occur in a streaming-I/O fashion, interacting with sequences of bytes. Behind the scenes, at the protocol level, this occurs using byte buffers, which, when they arrive at the protocol gateways, are written as blocks to the target devices.
There exists one type of protocol gateway for each endpoint protocol supported (e.g. NFS, SMB, S3, etc.). The gateways handle actual client traffic; clients are configured to connect to the IP address and port number of the protocol gateway.
For each protocol the actual process counts may vary because some protocols are stateful: their session information would not permit requests to be load-balanced to other processes that are not aware of the session. Examples of stateful protocols are NFS, SMB/CIFS, and iSCSI. In such configurations the protocol gateway exposes a single port to which clients connect within that region. For stateless protocols, the port is optionally associated with an application load balancer, which performs round-robin routing.
During the startup procedure, the protocol gateway pulls a FlexHash table from the nearest target daemon; by default, the one on the same host. If the FlexHash table changes, the protocol gateway notices the change and automatically fetches the latest version. Each I/O to a target also includes the FlexHash table generation id; if the target's generation differs, an error is raised, causing the protocol gateway to refetch the FlexHash table.
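The generation-id check described above can be sketched as follows. This is illustrative only: the class and field names (`Target`, `ProtocolGateway`, `StaleFlexHashError`, `genid`) are hypothetical, not EdgeFS's actual API.

```python
# Hypothetical sketch of FlexHash generation-id validation on each I/O.

class StaleFlexHashError(Exception):
    """Raised when an I/O carries an outdated FlexHash generation id."""

class Target:
    def __init__(self, genid):
        self.genid = genid  # generation of the FlexHash table this target holds

    def handle_io(self, request_genid, payload):
        # Every I/O carries the gateway's FlexHash generation id; a mismatch
        # forces the gateway to refetch the latest table before retrying.
        if request_genid != self.genid:
            raise StaleFlexHashError(f"expected gen {self.genid}, got {request_genid}")
        return f"stored {len(payload)} bytes"

class ProtocolGateway:
    def __init__(self, target):
        self.target = target
        self.genid = target.genid  # pulled from the nearest target at startup

    def write(self, payload):
        try:
            return self.target.handle_io(self.genid, payload)
        except StaleFlexHashError:
            # Refetch the latest FlexHash table and retry.
            self.genid = self.target.genid
            return self.target.handle_io(self.genid, payload)
```

In this sketch a stale gateway transparently recovers: the failed I/O triggers a refetch of the table, after which the request is retried.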
As mentioned previously, inter-segment gateways are responsible for ensuring all segments remain synchronized with respect to each other through a process of inter-segment replication. Thus, all connectivity between segments is between inter-segment gateways only. The gateway also provides some caching capabilities, as well as WAN-level de-duplication to reduce cross-segment network bandwidth.
The protocol gateway talks to the target daemon via the Replicast protocol to read or write blocks. The daemon is responsible for [[virtual device (VDEV) management|VDEV Architecture]], which is to say, responsible for local reads and writes to key-value abstractions, which could be backed by S3 or EBS. See the [[VDEV documentation|VDEV Architecture]] for more details.
The target daemon listens to corosync events, and is responsible for building the FlexHash table. It then distributes it through the corosync daemons.
The corosync daemon is responsible for distributing FlexHash table changes only. It uses a RAFT-like mechanism to provide a reliable communication channel between corosync daemons and to ensure the FlexHash tables are consistent across target daemons. Note that corosync daemons are deployed pair-wise with target daemons (in the same pod in the case of Kubernetes). The corosync daemon is not in the data path; it only distributes the table.
The above section details at a high level the processes, and how they relate to one another. This section details the persistent states involved, and the means by which they are kept consistent, or how they are read or written.
Files are stored in one of two ways. Small files, that is, those less than 256 kB in size, are stored on a local disk (in AWS, on EBS). Larger files are chunked along 1 MB boundaries. In order to reconstruct files, a table is necessary to map chunks to files: enter the Tree of Hashes.
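The small-file versus chunked split might be sketched like this. The 256 kB and 1 MB thresholds come from the text above; the function itself and the use of SHA3-256 digests as chunk identifiers are illustrative assumptions:

```python
import hashlib

SMALL_FILE_LIMIT = 256 * 1024   # files below this are stored whole on local disk
CHUNK_SIZE = 1024 * 1024        # larger files are chunked along 1 MB boundaries

def chunk_file(data: bytes):
    """Return a list of (digest, chunk) pairs for an object's payload."""
    if len(data) < SMALL_FILE_LIMIT:
        # Small file: stored as a single piece.
        return [(hashlib.sha3_256(data).hexdigest(), data)]
    # Large file: split along fixed 1 MB boundaries.
    return [
        (hashlib.sha3_256(data[off:off + CHUNK_SIZE]).hexdigest(),
         data[off:off + CHUNK_SIZE])
        for off in range(0, len(data), CHUNK_SIZE)
    ]
```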
Tree of Hashes
Both metadata and file data are equally stored immutably as a data structure known as a merkle tree.
“A [merkle] hash tree is a tree of hashes in which the leaves are hashes of data blocks in, for instance, a file or set of files. Nodes further up in the tree are the hashes of their respective children. For example, in the picture hash 0 is the result of hashing the concatenation of hash 0-0 and hash 0-1. That is, hash 0 = hash( hash(0-0) + hash(0-1) ) where + denotes concatenation.”
“In the top of a hash tree there is a top hash (or root hash or master hash). Before replicating a file on a network, in most cases the top hash is acquired from a trusted source. When the top hash is available, the hash tree can be received from a peer in the network, then the received hash tree's top hash is checked against the trusted top hash, and if the top hash is different, then the received tree is known to represent a file whose content differs (it changed, is damaged, or is fake).”
The data structure looks like this:
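As a minimal, illustrative sketch (not EdgeFS's actual on-disk layout), a tree of hashes over a list of chunks can be built bottom-up; SHA3-256 stands in here for whichever SHA-3 variant is configured:

```python
import hashlib

def h(data: bytes) -> bytes:
    # EdgeFS uses SHA-3 variants; SHA3-256 is used here for illustration.
    return hashlib.sha3_256(data).digest()

def merkle_root(chunks):
    """Compute the top hash of a Merkle tree over the given chunks."""
    level = [h(c) for c in chunks]          # leaves: hashes of data blocks
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last hash on odd levels
            level.append(level[-1])
        # parent = hash(left + right), where + denotes concatenation
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]                         # the top (root) hash
```

Comparing two top hashes detects whether two versions differ at all; comparing interior nodes level by level localizes exactly which blocks changed.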
A merkle tree has a few fundamental purposes that are useful:
- the top hashes may be compared to determine if the file is different
- the nodes further up the tree can be compared to determine which blocks have changed (level-order traversals)
- replication can be limited to only the affected blocks, reducing network overhead
- multiple merkle trees can be stored to represent different versions of files
- files can be reconstructed by performing a depth-first traversal
The merkle tree provides for an architecture with immutable self-validating location-independent metadata referencing self-validating location-independent data chunks.
File versioning is intrinsically supported by simply maintaining one Merkle tree per file version and reference-counting the leaf nodes. Removing a version simply removes its Merkle tree and decrements the leaf nodes' reference counts; when a data block is no longer referenced by any version, it is automatically pruned.
Two lists are maintained in the system for each file: the active versions list, and the deleted versions list (quarantine index). The quarantine index is designed to hold deleted versions of objects (NFS/SMB Files, iSCSI LUNs, S3 objects, S3X NoSQL database, etc).
Deletions do not occur immediately; a configurable data-retention duration ensures deleted versions are kept for a certain amount of time.
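The per-version reference counting described above might look like this sketch; the `VersionStore` class and its method names are hypothetical, and the retention delay is omitted for brevity:

```python
class VersionStore:
    """Sketch of file versioning via per-version Merkle leaves with refcounts."""

    def __init__(self):
        self.chunks = {}     # digest -> payload (immutable data blocks)
        self.refcount = {}   # digest -> number of versions referencing the block
        self.versions = {}   # version id -> ordered list of leaf digests

    def put_version(self, vid, leaves):
        """Record a version; leaves is a list of (digest, payload) pairs."""
        self.versions[vid] = [d for d, _ in leaves]
        for digest, payload in leaves:
            self.chunks[digest] = payload
            self.refcount[digest] = self.refcount.get(digest, 0) + 1

    def delete_version(self, vid):
        """Drop a version; prune blocks no longer referenced by any version."""
        for digest in self.versions.pop(vid):
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:
                del self.refcount[digest]
                del self.chunks[digest]
```

Note how a block shared by two versions survives the deletion of one of them, which is exactly what makes version storage cheap.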
Each EdgeFS segment is a scale-out, distributed, clustered system of locally connected servers (nodes). Logically, nodes can have a designated data-storage role, a designated gateway role, or a combination of both (mixed). Nodes are connected via an Ethernet networking fabric using a low-latency, high-performance UDP-based protocol called Replicast.
Previously we discussed target daemons and corosync daemons and indicated that they handle all traffic related to the storage of file chunks onto target storage. The daemons themselves are distributed across physical servers and racks in accordance with node selection criteria. However, what has not yet been shown is how file chunks are organized in terms of their physical placement, taking into account constraints such as redundancy. The solution: FlexHash.
Distribution of chunks is controlled through a hashing table called FlexHash. FlexHash dynamically organizes and assigns all VDEVs into appropriate so-called “Negotiating Groups”. A Negotiating Group is a group of VDEVs distributed across physical servers and racks in accordance with the selected policy (server or zone).
FlexHash is a type of hash table whose directory stores hash prefixes, with each prefix referring to an identified set of daemons. These daemon sets are termed a negotiating group. Thus, the hash of a chunk identifies which negotiating group it is a member of.
It solves the dynamic data-chunk placement challenges of consistent hashing: it places data based on utilization and rebalances resources automatically, rather than according to a fixed, predetermined layout.
The name of an object is used to generate a digest using a SHA-3 256- or 512-bit cryptographic hash function. This digest is the starting point for so-called “named objects”; for data chunks, a content digest is used. The digest is used to look up a Negotiating Group of VDEV devices. Real-time negotiation with the members of the Negotiating Group, using the Replicast protocol, identifies the best set of replica target VDEVs to use to store or read the object's data chunk, based on utilization, latency, or both. The identified group of target VDEVs is called the Rendezvous Group.
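A toy sketch of the prefix-directory lookup follows. The two-bit prefix, the group sizes, and the VDEV names are illustrative assumptions, not the real FlexHash layout:

```python
import hashlib

# Hypothetical FlexHash directory: a hash prefix maps to a negotiating group,
# i.e. a set of VDEVs spread across servers/racks per the placement policy.
PREFIX_BITS = 2
DIRECTORY = {
    0b00: ["vdev-a1", "vdev-b1", "vdev-c1"],
    0b01: ["vdev-a2", "vdev-b2", "vdev-c2"],
    0b10: ["vdev-a3", "vdev-b3", "vdev-c3"],
    0b11: ["vdev-a4", "vdev-b4", "vdev-c4"],
}

def negotiating_group(chunk: bytes):
    """Map a chunk to its negotiating group via its SHA-3 digest prefix."""
    digest = hashlib.sha3_256(chunk).digest()
    prefix = digest[0] >> (8 - PREFIX_BITS)  # top PREFIX_BITS of the digest
    return DIRECTORY[prefix]
```

Because the prefix is derived from the content hash, any node holding the same FlexHash table resolves the same chunk to the same negotiating group, with no central lookup service.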
Protocol gateways, often called “Initiators”, talk to target daemons via the Replicast transport protocol.
The Replicast transport protocol is responsible for low-latency, high-throughput communications and supports the following modes of operation:
- UDP/IP Unicast or Multicast communications with a Negotiating Group
- UDP/IP Unicast or Multicast communications with a Rendezvous Transfer Group (the group of selected target VDEV devices)
- TCP/IP communications within a Rendezvous Transfer Group (Experimental)
- Accelerated PF_RING Rendezvous Transfer (aka DPDK-like), a low-CPU-overhead option (Experimental)
By default, an EdgeFS segment is configured with the UDP/IP Unicast transport.
Globally unique and ordered object modifications eventually get merge-sorted and reconciled at all connected segments, thus ensuring decentralized data consistency and security.
The end result is that every stored object is self-identified and self-validated with a strong cryptographic hash (SHA-3 variant) and that the cryptographic hash can be used to recall a chunk for retrieval from any connected segment.
Any write I/O in EdgeFS is recorded in a so-called Transaction Log. The transaction log keeps an ordered sequence of all local segment modifications and, as such, represents a recoverable log that ensures global modification consistency.
Every time the protocol gateway receives a file update, the top hash changes (immutable version), and the state change is recorded into a transaction log. Rather than synchronously replicating the state change to all peers, which would incur significant latency overhead, the change is only recorded locally in the transaction log, and multiple log entries are batched.
Target daemons write new versions into the transaction log. Later, after a configurable period of time (the batching time), which defaults to an interval of 10s, the inter-segment gateway reads these batches and processes them in accordance with the replication rules (de-duplication) with other segments: namely, it compares the top hashes of each Merkle tree and performs only delta replication.
There are two configurable parameters for the transaction log:
- how many days the segment keeps transaction log history (trlogKeepDays); the default is 3d
- how many seconds the segment batches object modifications (trlogProcessingInterval); the default is 10s
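The batching behavior can be sketched as follows; the parameter name and 10s default come from the text above, while the entry format and function are illustrative:

```python
# Sketch of transaction-log batching by trlogProcessingInterval.
# Entries are (timestamp_seconds, object_id, top_hash) tuples.

TRLOG_PROCESSING_INTERVAL = 10  # seconds (default from the text)

def batch_trlog(entries, interval=TRLOG_PROCESSING_INTERVAL):
    """Group ordered trlog entries into batches of `interval` seconds each."""
    batches = {}
    for ts, obj, top_hash in entries:
        batches.setdefault(ts // interval, []).append((obj, top_hash))
    # The inter-segment gateway would read each batch and replicate only
    # objects whose top hash differs on the remote segment (delta replication).
    return [batches[k] for k in sorted(batches)]
```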
EdgeFS implements erasure coding. Erasure coding is an advanced technique that provides data protection by splitting data into blocks and storing additional parity information computed from them. The data and parity blocks are stored on separate devices to protect against single (or even multiple) points of failure. The scheme provides for fast reconstruction of data despite device failure using relatively simple linear algebra: matrix multiplication with identity and inverted matrices.
EdgeFS supports multiple Reed-Solomon protection schemes, all denoted using a colon-separated (":") syntax indicating the ratio of data blocks to parity blocks; for example, “4:2:rs” indicates that for every four (4) data blocks (also known as “chunks”) there will be two (2) parity blocks. EdgeFS applies erasure coding once an object has come to rest.
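Real Reed-Solomon coding requires Galois-field arithmetic; as a much-simplified illustration of the idea, here is a single-parity (XOR) scheme, conceptually an "N:1" code rather than the N:M Reed-Solomon codes EdgeFS actually uses:

```python
def xor_parity(blocks):
    """Compute one parity block as the bytewise XOR of equal-sized data blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def reconstruct(blocks, parity, lost_index):
    """Recover a single lost block by XOR-ing the survivors with the parity."""
    survivors = [b for i, b in enumerate(blocks)
                 if i != lost_index and b is not None]
    return xor_parity(survivors + [parity])
```

With Reed-Solomon, the same principle generalizes: M parity blocks computed via matrix multiplication allow any M lost blocks (data or parity) to be recovered by inverting the surviving rows of the coding matrix.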
The Backblaze YouTube video (link below) provides a fairly detailed explanation of how erasure codes and Reed-Solomon work. For those inclined to investigate further, I recommend starting with that video before digging into the math behind it all.
Erasure coding is optimal for full writes and appends. One implication is that it is NOT recommended for workloads that incur lots of overwrites, as these will result in poor performance.
A key benefit of erasure coding is that, relative to traditional mirroring techniques, it has a lower total storage overhead, reducing cost. The reduction of storage overhead can be largely attributed to erasure coding, but EdgeFS also uses another technique to reduce storage overhead: De-Duplication.
- A Tale of Two Erasure Codes in HDFS, Facebook
- Reed Solomon Tutorial: Backblaze Reed Solomon Encoding Example Case
- Erasure Coding, Wikipedia
The EdgeFS/S3 implementation performs deduplication on the server side to avoid transmission of duplicate payload chunks, also known as inline data deduplication. An EdgeFS client (via the CCOW “Cloud Copy-On-Write” gateway library API) cryptographically hashes each chunk before requesting that it be put. This hash is used to determine whether the chunk already exists and, if so, to avoid re-transmission.
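A minimal sketch of this inline-dedup handshake follows; `ChunkStore`, `client_put`, and the byte counter are hypothetical names for illustration, not the CCOW API:

```python
import hashlib

class ChunkStore:
    """Sketch of inline deduplication: the client sends a chunk's hash first,
    and transmits the payload only if the server does not already have it."""

    def __init__(self):
        self.chunks = {}          # digest -> payload
        self.bytes_received = 0   # payload bytes actually transmitted

    def has(self, digest):
        return digest in self.chunks

    def put(self, digest, payload):
        self.bytes_received += len(payload)
        self.chunks[digest] = payload

def client_put(store, payload):
    """Hash the chunk first; skip re-transmission of duplicates."""
    digest = hashlib.sha3_256(payload).hexdigest()
    if not store.has(digest):
        store.put(digest, payload)
    return digest
```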
Leadership election occurs at the mount level and uses the RAFT protocol. RAFT logs are stored directly in the EdgeFS fabric. The implementation requires an odd count of nodes in the membership to avoid split-brain scenarios, so there must be at least one RAFT-protocol process per data center and an odd number of data centers. If there is an even number of data centers, one additional tie-breaker for RAFT must exist.
EdgeFS uses a third-party C library for its RAFT implementation. The RAFT state for EdgeFS appears to be stored in a simple SQL data store, loosely based upon SQLite, whose replication is implemented using RAFT log replication.
EdgeFS DQlite uses the SQLite Write-Ahead-Logging (WAL) feature to avoid in-place updates, as these would both impact write performance (since it stores its state in EdgeFS itself, it is subject to the same performance implications that update-in-place has with respect to NFS) and increase write latency (it could optimistically send out metadata updates in advance of a commit if it implemented total ordering).
EdgeFS supports CIFS and NFS. These are network filesystem protocols used to provide shared access to files between machines on a network. A client application can read, write, edit, and even remove files on the remote server. A minimal understanding of NFS will set the stage for the further discussion below.
To begin: being network-based, NFS involves communication between the client and the server using Remote Procedure Calls (RPCs). For example, in order to access the network filesystem, it must first be mounted. Using the MOUNT procedure, a mount is created, and an initial query for filesystem information (the FSINFO procedure) is performed. Once completed, a local file handle is created (typically for a directory object).
The fundamental procedures supported by NFS are summarized in the following table. The various use-cases and workflows involving ordinary users are internally implemented using combinations of the constituent procedures listed in its Internet RFC. Important Note: Compound requests are NEVER atomic.
Reading from a file requires an offset parameter, indicating where to start reading from, and a count of bytes to be read. Writing to a file requires the offset where the data should be written, and the opaque data parameter, which is inclusive of the count of bytes to write.
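These offset/count semantics can be sketched as a toy in-memory file; the `ToyFile` class is purely illustrative and models only the byte-addressing, not RPC framing or file handles:

```python
class ToyFile:
    """Toy model of NFS READ/WRITE semantics: a flat byte sequence addressed
    purely by offset and count, with no notion of blocks at the protocol level."""

    def __init__(self):
        self.data = bytearray()

    def write(self, offset, data):
        # Grow the file if the write extends past the current end.
        if offset + len(data) > len(self.data):
            self.data.extend(b"\x00" * (offset + len(data) - len(self.data)))
        self.data[offset:offset + len(data)] = data
        return len(data)   # count of bytes written

    def read(self, offset, count):
        return bytes(self.data[offset:offset + count])
```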
The notion of data blocks, discussed below, does not enter into the core NFS protocol itself; with respect to reading and writing, it is purely an offset-plus-count, data-based protocol. All notions of blocks, erasure coding, and de-duplication occur server-side. However, client and server can negotiate the buffer sizes used in transfers, and the client optionally uses a cache.
NFS caching does suffer from stale-cache issues. Cache inconsistency can cause multiple clients to see stale file versions if the file metadata is not retrieved periodically to detect changes; and in terms of update visibility, updates are only seen when the file is closed.
NFS supports strong write semantics, where modifications to on-disk state require the modified data to be flushed to disk before the call returns; this impacts performance, particularly for modify-in-place writing schemes (as opposed to append-only schemes).
- What are the optimal NFS configuration parameters, for common use-cases related to architectural data?
- What options exist to monitor NFS servers and clients, to verify that the right configurations are chosen and that they are performing optimally?
Currently only host-based authentication is supported. For NFS you have to use host:user:group ids; right now only NFS v3 is supported. For SMB, Windows groups are supported. Kerberos and Windows Active Directory are not currently supported; there are plans to add Kerberos and Active Directory support in the future.
For host-based authentication, client ids must be consistent across all hosts for any given user. If a user has a uid of 101 on one host, then it must be uid 101 on all hosts. Whatever uid a file was saved with is the uid used on all hosts.
Out of the box, existing Linux LDAP/Kerberos integrations and Windows Groups integrations handle the file uid automatically for you.
Question: How are uids generated and distributed out across the organization?
EdgeFS uses Prometheus to store time-series data, which is rendered in Datadog or Grafana. Prometheus currently runs in each region, and there are plans to roll the data up cross-region and dashboard it with Grafana.