Managing the Cloud Storage Costs of Big-Data Applications | by Chaim Rand | Jun, 2023


Tips for Reducing the Expense of Using Cloud-Based Storage

Photo by Joshua Coleman on Unsplash

With the growing reliance on ever-increasing amounts of data, modern companies are more dependent than ever on high-capacity and highly scalable data-storage solutions. For many companies this solution comes in the form of a cloud-based storage service, such as Amazon S3, Google Cloud Storage, and Azure Blob Storage, each of which comes with a rich set of APIs and features (e.g., multi-tier storage) supporting a wide variety of data storage designs. Of course, cloud storage services also have an associated cost. This cost is typically comprised of a number of components, including the overall size of the storage space you use, as well as activities such as transferring data into, out of, or within cloud storage. The price of Amazon S3, for example, includes (as of the time of this writing) six cost components, each of which needs to be taken into account. It is easy to see how managing the cost of cloud storage can get complicated, and designated calculators (e.g., here) have been developed to assist with this.

In a recent post, we expanded on the importance of designing your data and your data usage so as to reduce the costs associated with data storage. Our focus there was on using data compression as a way to reduce the overall size of your data. In this post we focus on a commonly overlooked cost component of cloud storage: the cost of API requests made against your cloud storage buckets and data objects. We will demonstrate, by example, why this component is often underestimated and how it can become a significant portion of the cost of your big data application if not managed properly. We will then discuss a couple of simple ways to keep this cost under control.


Although our demonstrations will use Amazon S3, the contents of this post are just as applicable to any other cloud storage service. Please do not interpret our choice of Amazon S3, or of any other tool, service, or library we might mention, as an endorsement of their use. The best option for you will depend on the unique details of your own project. Furthermore, please keep in mind that any design choice regarding how you store and use your data will have its pros and cons, which should be weighed carefully based on the details of your own project.

This post will include a number of experiments that were run on an Amazon EC2 c5.4xlarge instance (with 16 vCPUs and "up to 10 Gbps" of network bandwidth). We will share their outputs as examples of the comparative results you might see. Keep in mind that the outputs may vary greatly based on the environment in which the experiments are run. Please do not rely on the results presented here for your own design decisions. We strongly encourage you to run these as well as additional experiments before deciding what is best for your own projects.

Suppose you have a data transformation application that acts on 1 MB data samples from S3 and produces 1 MB data outputs that are uploaded to S3. Suppose that you are tasked with transforming 1 billion data samples by running your application on an appropriate Amazon EC2 instance (in the same region as your S3 bucket in order to avoid data transfer costs). Now let's assume that Amazon S3 charges $0.0004 for every 1,000 GET operations and $0.005 for every 1,000 PUT operations (as at the time of this writing). At first glance, these costs might seem so low as to be negligible compared to the other costs of the data transformation. However, a simple calculation shows that our Amazon S3 API calls alone will run up a bill of $5,400!! This can easily be the most dominant cost factor of your project, even more than the cost of the compute instance. We will return to this thought experiment at the end of the post.
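The arithmetic behind that figure is straightforward. As a quick sketch, using the per-request prices quoted above (one GET per input sample and one PUT per output):

```python
# Thought-experiment numbers from the text above
num_samples = 1_000_000_000        # 1 billion samples, one GET and one PUT each
get_cost_per_1000 = 0.0004         # USD per 1,000 GET requests
put_cost_per_1000 = 0.005          # USD per 1,000 PUT requests

get_total = num_samples / 1000 * get_cost_per_1000
put_total = num_samples / 1000 * put_cost_per_1000
print(f"GET: ${get_total:,.0f}  PUT: ${put_total:,.0f}  "
      f"total: ${get_total + put_total:,.0f}")
# prints: GET: $400  PUT: $5,000  total: $5,400
```

Note that the PUT requests, not the GETs, dominate the bill, since a PUT is more than ten times as expensive as a GET.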

The obvious way to reduce the cost of the API calls is to group samples together into files of a larger size and run the transformation on batches of samples. Denoting our batch size by N, this strategy could potentially reduce our cost by a factor of N (assuming that multi-part file transfer is not used; see below). This technique would save money not just on the PUT and GET calls but on all of the cost components of Amazon S3 that depend on the number of object files rather than the overall size of the data (e.g., lifecycle transition requests).
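To get a feel for the savings, here is a rough sketch of the same request bill for a few candidate values of N, under the stated assumptions (one GET and one PUT per grouped file, no multi-part transfer):

```python
num_samples = 1_000_000_000
per_sample_bill = 5_400.0   # request cost at one GET + one PUT per sample (USD)

for n in (10, 100, 1000):   # candidate batch sizes
    files = num_samples // n
    # one GET and one PUT per file: the bill shrinks by a factor of N
    print(f"N={n:>4}: {files:,} files, ~${per_sample_bill / n:,.2f} in requests")
# prints:
# N=  10: 100,000,000 files, ~$540.00 in requests
# N= 100: 10,000,000 files, ~$54.00 in requests
# N=1000: 1,000,000 files, ~$5.40 in requests
```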

There are a number of disadvantages to grouping samples together. For example, when you store samples individually, you can freely access any one of them at will. This becomes more difficult when samples are grouped together. (See this post for more on the pros and cons of batching samples into large files.) If you do opt for grouping samples together, the big question is how to choose the size N. A larger N could reduce storage costs but might introduce latency, increase the compute time, and, by extension, increase the compute costs. Finding the optimal number may require some experimentation that takes these and additional considerations into account.

But let's not kid ourselves. Making this kind of change will not be easy. Your data may have many consumers (both human and artificial), each with their own particular set of demands and constraints. Storing your samples in separate files can make it easier to keep everyone happy. Finding a batching strategy that satisfies everyone will be difficult.

A Possible Compromise: Batched Puts, Individual Gets

A compromise you might consider is to upload large files with grouped samples while enabling access to individual samples. One way to do this is to maintain an index file with the locations of each sample (the file in which it is grouped, the start-offset, and the end-offset) and expose a thin API layer to each consumer that would enable them to freely download individual samples. The API would be implemented using the index file and an S3 API that enables extracting specific ranges from object files (e.g., Boto3's get_object function). While this kind of solution would not save any money on GET calls (since we are still pulling the same number of individual samples), the more expensive PUT calls would be reduced, since we would be uploading a smaller number of larger files. Note that this kind of solution poses some limitations on the library we use to interact with S3, since it depends on an API that allows for extracting partial chunks of the large file objects. In previous posts (e.g., here) we have discussed the different ways of interfacing with S3, many of which do not support this feature.
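To make this concrete, here is a minimal sketch of such an index-based lookup. The index format, key names, and helper functions are all hypothetical; the only real API used is Boto3's get_object with its Range parameter, which performs an HTTP ranged GET:

```python
# Hypothetical index: sample id -> (file key, start offset, end offset), e.g.
# {"7": ["data/grouped_0.bin", 7340032, 8388607], ...}

def locate(index, sample_id):
    """Resolve a sample id to its file key and an HTTP Range header value."""
    file_key, start, end = index[str(sample_id)]
    return file_key, f"bytes={start}-{end}"

def get_sample(s3_client, bucket, index, sample_id):
    """Download a single sample from its grouped file with a ranged GET."""
    file_key, byte_range = locate(index, sample_id)
    response = s3_client.get_object(Bucket=bucket, Key=file_key,
                                    Range=byte_range)
    return response["Body"].read()

# In real use, s3_client would be boto3.client("s3") and the index itself
# would typically be loaded from S3 (e.g., json.loads of an "index.json" object).
```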

The code block below demonstrates how to implement a simple PyTorch dataset (with PyTorch version 1.13) that uses the Boto3 get_object API to extract individual 1 MB samples from large files of grouped samples. We compare the speed of iterating over the data in this manner to that of iterating over samples that are stored in individual files.

import os, boto3, time, numpy as np
import torch
from torch.utils.data import Dataset
from statistics import mean, variance

KB = 1024
MB = KB * KB
GB = KB ** 3

sample_size = MB
num_samples = 100000

# modify to vary the size of the files
samples_per_file = 2000  # for 2 GB files
num_files = num_samples // samples_per_file
bucket = '<s3 bucket>'
single_sample_path = '<path in s3>'
large_file_path = '<path in s3>'


class SingleSampleDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.bucket = bucket
        self.path = single_sample_path
        self.client = boto3.client("s3")

    def __len__(self):
        return num_samples

    def get_bytes(self, key):
        response = self.client.get_object(
            Bucket=self.bucket,
            Key=key
        )
        return response['Body'].read()

    def __getitem__(self, index: int):
        key = f'{self.path}/{index}.image'
        image = np.frombuffer(self.get_bytes(key), np.uint8)
        return {"image": image}


class LargeFileDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.bucket = bucket
        self.path = large_file_path
        self.client = boto3.client("s3")

    def __len__(self):
        return num_samples

    def get_bytes(self, file_index, sample_index):
        # pull a single 1 MB sample out of the large file with a ranged GET
        # (the '.bin' key naming is an assumption; adapt to your own layout)
        response = self.client.get_object(
            Bucket=self.bucket,
            Key=f'{self.path}/{file_index}.bin',
            Range=f'bytes={sample_index * MB}-{(sample_index + 1) * MB - 1}'
        )
        return response['Body'].read()

    def __getitem__(self, index: int):
        file_index = index // samples_per_file
        sample_index = index % samples_per_file
        image = np.frombuffer(self.get_bytes(file_index, sample_index),
                              np.uint8)
        return {"image": image}


# toggle between single sample files and large files
use_grouped_samples = True

if use_grouped_samples:
    dataset = LargeFileDataset()
else:
    dataset = SingleSampleDataset()

# set the number of parallel workers according to the number of vCPUs
dl = torch.utils.data.DataLoader(dataset, shuffle=True,
                                 batch_size=4, num_workers=16)

# record the time taken for every 100 batches
stats_lst = []
t0 = time.perf_counter()
for batch_idx, batch in enumerate(dl, start=1):
    if batch_idx % 100 == 0:
        t = time.perf_counter() - t0
        stats_lst.append(t)
        t0 = time.perf_counter()

print(f'mean time per 100 batches: {mean(stats_lst)}, '
      f'variance: {variance(stats_lst)}')
