November 2, 2018

Network Data Deduplication Using Pseudo-Time

  Data Deduplication, Networking, Pseudo-Time, Technology


In the last article, on the History of Pseudo-Time, I mentioned that we use pseudo-time to accelerate network traffic. Network data deduplication is one of those applications. In this article we're going to cover how network data deduplication works, why traditional CDNs don't typically use it, and how we use pseudo-time to provide a more effective form of deduplication.

Network Data Deduplication

Network data deduplication is a WAN optimization technique that reduces the number of bytes sent between two nodes. In the typical use case, you would use WAN accelerator appliances to connect two remote offices.

When data is sent through the accelerator to the remote office, the accelerator keeps a buffer of the data sent to and received from the remote accelerator. This buffer is used as a compression dictionary. Before new data is sent to the remote accelerator, the dictionary is checked for a match (meaning the same segment of data was previously sent to the remote accelerator). If a match is found, that segment of data is replaced with a reference to the segment in the dictionary. When the remote accelerator receives the data, it scans it for segment references and replaces each one with the corresponding data from its copy of the dictionary, thereby recreating the original data.
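The exchange described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular WAN accelerator is implemented: the fixed segment size, the use of SHA-256 digests as segment references, and the names encode and decode are all assumptions made for the example. Real appliances typically use variable-size segments and bounded dictionaries.

```python
import hashlib

SEGMENT_SIZE = 8  # hypothetical; kept tiny so the example is easy to follow


def _segments(data: bytes):
    """Split data into fixed-size segments (the last one may be shorter)."""
    for i in range(0, len(data), SEGMENT_SIZE):
        yield data[i:i + SEGMENT_SIZE]


def encode(data: bytes, dictionary: dict) -> list:
    """Sender side: replace previously seen segments with references."""
    tokens = []
    for seg in _segments(data):
        key = hashlib.sha256(seg).digest()
        if key in dictionary:
            tokens.append(("ref", key))   # segment already known to both sides
        else:
            dictionary[key] = seg         # remember it for future transfers
            tokens.append(("raw", seg))
    return tokens


def decode(tokens: list, dictionary: dict) -> bytes:
    """Receiver side: expand references using its own copy of the dictionary."""
    parts = []
    for kind, value in tokens:
        if kind == "raw":
            dictionary[hashlib.sha256(value).digest()] = value
            parts.append(value)
        else:
            parts.append(dictionary[value])
    return b"".join(parts)


# Each accelerator keeps its own dictionary; they stay in sync because
# both sides see the same stream of segments in the same order.
sender_dict, receiver_dict = {}, {}
message = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"

first = encode(message, sender_dict)
assert decode(first, receiver_dict) == message

second = encode(message, sender_dict)   # same data again: references only
assert decode(second, receiver_dict) == message
```

The second transfer carries only references, which is where the byte savings come from. It also shows why both ends are required: a reference is meaningless to a receiver that does not hold the matching dictionary.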

CDN Issues

While this is a useful application of deduplication, the traditional CDN model introduces some complications.

The first issue is the structure of a traditional CDN. Recall from our article on Web Caching History that CDNs route connections directly from the edge node to the origin server.

Deduplication requires a node on each side of the connection to reassemble the original data. Without a remote node, deduplication cannot be applied. While this network architecture is still generally in use today, there are a few CDNs that have more sophisticated routing.

Assuming the CDN has improved its routing, the next issue is the variability of the data. CDNs are shared resources, transmitting a variety of data from various customers through the same servers. In order to deduplicate data, they must store the compression dictionary in memory. Since memory is a limited resource, the size of the dictionary is dwarfed by the amount of data transmitted from the node. And since that data has high variability, the result is that only a small percentage of traffic can be deduplicated.

To work around this issue, WAN accelerators frequently store data to disk instead of memory. If a CDN used a shared compression dictionary (i.e., a single compression dictionary applied to all customers), the dictionary could be larger but would still only cover a small portion of overall traffic. If the CDN used a dictionary for each customer, each dictionary would be significantly smaller and, as a result, would also cover only a small portion of overall traffic. Additionally, only a portion of the entire dictionary could fit in memory at any one time, resulting in a disk read each time the node needed to access another portion of the dictionary. This switching would introduce additional latency.
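To make the switching cost concrete, here is a small sketch that counts how often a node would have to go to disk when only part of the dictionary fits in memory. It assumes a simple least-recently-used eviction policy and treats each portion of the dictionary (for example, one customer's dictionary) as a single unit; the class name DictionaryCache and the capacity value are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class DictionaryCache:
    """Tracks which dictionary portions are resident in memory (LRU)."""

    def __init__(self, capacity: int):
        self.capacity = capacity          # portions that fit in memory
        self.resident = OrderedDict()     # portion id -> placeholder
        self.disk_loads = 0               # reads that had to go to disk

    def access(self, portion: str) -> None:
        if portion in self.resident:
            self.resident.move_to_end(portion)   # mark as recently used
            return
        self.disk_loads += 1                     # not in memory: disk read
        self.resident[portion] = True
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)    # evict least recently used


# Interleaved traffic from three customers with room for only two portions:
cache = DictionaryCache(capacity=2)
for customer in ["a", "b", "c", "a", "b", "c"]:
    cache.access(customer)
assert cache.disk_loads == 6   # every request pays for a disk read
```

With interleaved traffic from more customers than fit in memory, every request in this toy model ends up waiting on disk, which is the latency problem described above.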

Given these issues, how do we solve this on our CDN to take advantage of deduplication? Enter pseudo-time.

Deduplication Using Pseudo-Time

Observe that long-distance connections through NuevoCloud are routed using at least two edge PoPs.

All of the edge PoPs in our network share the same pseudo-time filesystem. When a request is received, the PoPs handling it create a pseudo-temporal environment (PTE) that they share. Within the PTE, the only files that exist are those of the zone associated with the request. A pseudo-time filesystem is capable of representing the data at every point in time (the current time is simply another point in time). This capability allows the filesystem to represent a consistent state, even though the filesystem is in constant change as files are created and destroyed at each node. The PTE allows the two (or more) PoPs to work on the data in a consistent environment, even though they may be thousands of miles apart (prohibiting any form of real-time communication).

When the response is read from the origin server, the response is compared against the data in the zone's cache inside the PTE. From this a delta is generated, and sent to the remote PoP. Once there, the remote PoP applies the delta to the data read from the same PTE, recreating the original data. This response--identical to the response from the server--is then sent to the browser.

Using this approach, the entire zone cache becomes the compression dictionary.
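The delta step can be illustrated with a short Python sketch. This is not NuevoCloud's actual delta algorithm, which the article does not describe; it uses the standard library's difflib.SequenceMatcher as a stand-in, and the function names make_delta and apply_delta are hypothetical. The consistency that the PTE provides is modeled simply by handing both functions the same cached bytes.

```python
from difflib import SequenceMatcher


def make_delta(cached: bytes, response: bytes) -> list:
    """Origin-side PoP: describe the response in terms of the cached copy."""
    ops = []
    matcher = SequenceMatcher(None, cached, response, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))            # reuse bytes from the cache
        elif tag in ("replace", "insert"):
            ops.append(("insert", response[j1:j2])) # new bytes must be sent
        # "delete": cached bytes absent from the response, nothing to send
    return ops


def apply_delta(cached: bytes, ops: list) -> bytes:
    """Remote PoP: rebuild the response from the same cached copy."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += cached[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


# Both PoPs are assumed to read identical cached data from the shared PTE.
cached = b"<html><body>Hello, Alice</body></html>"
response = b"<html><body>Hello, Bob</body></html>"

delta = make_delta(cached, response)
assert apply_delta(cached, delta) == response

# Only the bytes that differ from the cache travel between the PoPs.
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
assert literal_bytes < len(response)
```

The sketch only works because both sides apply the delta to exactly the same cached bytes. If the remote PoP held a different version of the cached file, the copy ranges would point at the wrong data, which is why a consistent view of the zone's cache matters here.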

Through the use of pseudo-time, we've applied deduplication to more data than was previously possible, optimizing the data of every customer, no matter how frequently that data is transmitted or how large the customer's website is.

This is one application of our pseudo-time filesystem. In future articles, we'll be describing more networking advances we've made and additional uses of pseudo-time in our network.