
Amazon S3 was launched on 14 March 2006, Pi Day, as the first generally available AWS service. (Source: Amazon Web Services, “Amazon S3 Launch,” aws.amazon.com/blogs/aws/amazon_s3, March 2006)
Twenty years later, S3 stores hundreds of exabytes of data and more than 500 trillion objects. (Source: Amazon Web Services, “Amazon S3 at 20,” aws.amazon.com/blogs/aws/amazon-s3-20th-anniversary, 2026)
More than one million data lakes run on AWS, with S3 as the storage foundation. (Source: Amazon Web Services, “AWS re:Invent 2024 Keynote,” aws.amazon.com/events/reinvent) S3 has become the de facto standard for cloud object storage, and competing storage platforms replicate the S3 API as a compatibility interface.
If you work in data engineering, software development, cloud infrastructure, or analytics, you will interact with S3. Understanding what it is, how it is structured, and how to use it effectively matters regardless of your role. This guide covers what Amazon S3 is, what a bucket is, how objects are organized, the storage classes available, and the use cases where S3 is the right choice.
What Is Amazon S3?
Amazon S3, or Simple Storage Service, is an object storage service. It stores data as objects rather than as files in a file system hierarchy or as blocks in block storage.
An object is a unit of data plus metadata plus a unique identifier. The data can be a file, a database dump, a log file, a machine learning model, a video, a parquet file, or anything else. Objects are stored in S3 buckets. There is no limit to the number of objects a bucket can store, and individual objects can be up to 5TB in size. The defining characteristics of S3 are durability, scalability, and access via API.
S3 is designed for 99.999999999 percent (eleven nines) durability. Most storage classes automatically replicate data across at least three separate Availability Zones in the AWS Region you choose; the One Zone classes keep data in a single zone. S3 is fully elastic: it scales automatically without any provisioning, and you pay only for what you use.
Access is through several interfaces:
- The S3 API.
- The AWS CLI.
- The AWS management console.
- AWS SDKs for Python (boto3), Java, Go, Node, and others.
Every operation is available programmatically.
That includes creating buckets, uploading objects, setting permissions, and configuring lifecycle rules.
What Is an S3 Bucket?
A bucket is the top-level container for storing objects in S3.
Every object stored in S3 lives inside a bucket. The bucket defines the AWS Region where the data is physically stored. Data stored in a region does not leave that region unless you explicitly configure replication or transfer it. Bucket names must be globally unique across all AWS accounts and all regions.
Once you choose a bucket name, no other AWS customer in the world can create a bucket with that name until yours is deleted. This global namespace requirement is a common source of confusion for new S3 users: a name that is unused in your own account may already be taken by another account, in which case bucket creation fails.
Bucket names follow specific rules:
- 3 to 63 characters.
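- Lowercase letters, numbers, and hyphens. Periods are also allowed but are best avoided because they interfere with virtual-hosted-style HTTPS requests.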
- No underscores.
- Must start and end with a letter or number.
Bucket names become part of the object URL, so they are visible in requests and should reflect their purpose clearly.
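The naming rules above can be checked locally before you attempt to create a bucket. The following is a minimal sketch, not an official validator: the function name is hypothetical, and it accepts only lowercase letters, digits, and hyphens (S3 also permits periods in general purpose bucket names, which this check deliberately rejects). It cannot tell you whether a name is already taken by another account.

```python
import re

# 3 to 63 characters; lowercase letters, digits, and hyphens only;
# must start and end with a letter or digit. Periods are rejected on purpose.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if name satisfies the naming rules listed above."""
    return _BUCKET_NAME.fullmatch(name) is not None


print(is_valid_bucket_name("sales-data-2026"))  # True
print(is_valid_bucket_name("Sales_Data"))       # False
```

A name that passes this check can still be rejected by S3 if another account already owns it.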
Types of S3 Buckets
AWS now offers three distinct bucket types, each optimized for a different use case.
General purpose buckets are the original and most common bucket type. They support all S3 storage classes except Express One Zone. They are the right choice for most workloads, including backup, data lakes, application storage, and log archives.
Directory buckets are designed for high-performance, low-latency workloads. They only support the S3 Express One Zone storage class and support up to 2 million transactions per second per bucket. They are used for AI training workloads, real-time analytics, and other latency-sensitive applications where maximum throughput matters more than multi-AZ redundancy.
Table buckets are purpose-built for storing tabular data using the Apache Iceberg format. They automate compaction, snapshot management, and file cleanup, which is the operational overhead data teams typically manage manually when building Iceberg tables on standard S3. They are designed for use cases like daily transaction data, streaming sensor data, or any structured dataset that benefits from the Iceberg table format.
How S3 Organises Objects
S3 is a flat namespace. There is no true directory hierarchy.
Every object is stored with a key (its name within the bucket), and the key can contain forward slashes, which tools display as a folder structure. For example, an object with the key data/2026/04/sales.parquet appears in the S3 console as if it lives in a data/2026/04/ folder. But there is no folder. Just an object with a key that contains slashes.
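To see how a folder view can be derived from flat keys, here is a small local simulation. It is a sketch only: the function name is hypothetical, nothing here calls S3, and it simply mimics the prefix-and-delimiter grouping that listing tools present.

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Group flat keys the way a delimiter-based listing presents them.

    Returns (objects, common_prefixes): keys directly under prefix, and the
    "folder" names inferred from keys that contain a further delimiter.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # A further delimiter exists: report only the inferred "folder".
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


keys = [
    "data/2026/04/sales.parquet",
    "data/2026/04/returns.parquet",
    "data/2026/05/sales.parquet",
    "readme.txt",
]
print(list_with_delimiter(keys))
# (['readme.txt'], ['data/'])
print(list_with_delimiter(keys, prefix="data/2026/"))
# ([], ['data/2026/04/', 'data/2026/05/'])
```

In boto3, the equivalent request is `list_objects_v2` with the `Prefix` and `Delimiter` parameters, which returns matching objects under `Contents` and the inferred folders under `CommonPrefixes`.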
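This distinction matters for performance. S3 prefixes (the part of the key before the last slash) affect how requests are distributed. S3 supports at least 3,500 PUT, COPY, POST, or DELETE requests and 5,500 GET or HEAD requests per second per prefix. Sustained rates above that on a single prefix can be throttled with HTTP 503 responses.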
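As a rough planning aid, you can estimate how many prefixes a workload should spread across. This sketch assumes the per-prefix rates published in the AWS performance guidelines (at least 5,500 GET/HEAD and 3,500 PUT/COPY/POST/DELETE requests per second); the function name is hypothetical, and real behaviour depends on how S3 partitions your keys.

```python
import math

# Per-prefix request rates from the AWS S3 performance guidelines.
GET_RPS_PER_PREFIX = 5500
PUT_RPS_PER_PREFIX = 3500


def prefixes_needed(target_rps: int, per_prefix_rps: int) -> int:
    """Estimate the number of prefixes needed to sustain target_rps."""
    return max(1, math.ceil(target_rps / per_prefix_rps))


print(prefixes_needed(20_000, GET_RPS_PER_PREFIX))  # 4
print(prefixes_needed(20_000, PUT_RPS_PER_PREFIX))  # 6
```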
Each object is uniquely identified by three values:
- The bucket name.
- The object key.
- The version ID, if versioning is enabled.
Object metadata comes in two types.
System metadata is generated by S3 and includes values like object size, last modified date, content type, and storage class. User-defined metadata is key-value pairs you attach to objects for your own classification, search, and management purposes.
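User-defined metadata is supplied at upload time. The sketch below assembles the arguments for boto3's `put_object` call; the helper names, bucket name, and metadata keys are illustrative assumptions, and `upload` only works with boto3 installed and AWS credentials configured.

```python
def build_put_request(bucket, key, body, metadata=None, content_type=None):
    """Assemble keyword arguments for boto3's S3 put_object call."""
    request = {"Bucket": bucket, "Key": key, "Body": body}
    if metadata:
        # User-defined metadata; S3 stores these as x-amz-meta-* headers.
        request["Metadata"] = {str(k): str(v) for k, v in metadata.items()}
    if content_type:
        request["ContentType"] = content_type
    return request


def upload(request):
    """Send the request. Needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder above runs without it

    return boto3.client("s3").put_object(**request)


request = build_put_request(
    "example-bucket",  # hypothetical bucket name
    "data/2026/04/sales.parquet",
    b"example bytes",
    metadata={"owner": "analytics", "contains-pii": False},
)
print(request["Metadata"])
```

Metadata values are converted to strings because S3 stores user-defined metadata as text.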
S3 Storage Classes: Which One to Use
S3 provides multiple storage classes optimized for different combinations of access frequency, retrieval speed, and cost.
Choosing the right class for each data asset can reduce storage costs significantly without affecting application behavior.
| Storage Class | Access Speed | Min Duration | Relative Cost | Best For |
| --- | --- | --- | --- | --- |
| S3 Standard | Milliseconds | None | Highest | Frequently accessed data; active production workloads; data lakes |
| S3 Express One Zone | Single-digit ms | None | High (request costs lower) | Latency-sensitive; AI training; real-time analytics; high TPS |
| S3 Intelligent-Tiering | Milliseconds | None | Medium plus monitoring fee | Unknown or unpredictable access patterns; automatic cost optimization |
| S3 Standard-IA | Milliseconds | 30 days | Medium | Backup data; disaster recovery copies; older datasets accessed occasionally |
| S3 One Zone-IA | Milliseconds | 30 days | Lower | Re-creatable secondary copies; cross-region replication targets |
| Glacier Instant Retrieval | Milliseconds | 90 days | Low | Archive data needing millisecond access; medical images; media assets |
| Glacier Flexible Retrieval | 1 to 5 min (expedited) or 3 to 5 hrs (standard) | 90 days | Very low | Backup archives; disaster recovery with same-day retrieval SLA |
| Glacier Deep Archive | 12 to 48 hours | 180 days | Lowest | Compliance archives; long-term retention; rarely accessed regulatory data |
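The Min Duration column matters for short-lived data: an object deleted or moved before the minimum is still billed for the full minimum period. A minimal sketch of that rule, using the durations from the table (the function and dictionary names are hypothetical):

```python
# Minimum storage duration in days, taken from the table above.
MIN_DURATION_DAYS = {
    "S3 Standard": 0,
    "S3 Express One Zone": 0,
    "S3 Intelligent-Tiering": 0,
    "S3 Standard-IA": 30,
    "S3 One Zone-IA": 30,
    "Glacier Instant Retrieval": 90,
    "Glacier Flexible Retrieval": 90,
    "Glacier Deep Archive": 180,
}


def billable_days(storage_class: str, days_stored: int) -> int:
    """Days of storage billed for an object kept for days_stored days."""
    return max(days_stored, MIN_DURATION_DAYS[storage_class])


print(billable_days("S3 Standard-IA", 10))         # 30
print(billable_days("Glacier Deep Archive", 200))  # 200
```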
S3 Intelligent-Tiering is worth highlighting specifically. It monitors access patterns for each object and automatically moves it between tiers with no retrieval fee and no performance impact.
The tiering flow is:
- Frequent Access tier for objects accessed recently.
- Infrequent Access tier for objects not accessed for 30 consecutive days.
- Archive Instant Access tier for objects not accessed for 90 consecutive days.
The only additional cost is a small per-object monitoring charge.
For data assets with unpredictable access patterns, Intelligent-Tiering removes most of the cost management burden.
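The tiering flow can be expressed as a simple threshold function. This is an illustrative sketch that uses the access-tier names from the Intelligent-Tiering documentation; the function name is hypothetical, and S3 performs this evaluation for you.

```python
def intelligent_tiering_tier(days_since_last_access: int) -> str:
    """Return the access tier for an object based on days without access."""
    if days_since_last_access >= 90:
        return "Archive Instant Access"
    if days_since_last_access >= 30:
        return "Infrequent Access"
    return "Frequent Access"


for days in (0, 30, 90):
    print(days, intelligent_tiering_tier(days))
```

Accessing an object resets the count, which moves it back to the Frequent Access tier.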
S3 Lifecycle Policies
Lifecycle policies automate the transition of objects between storage classes and the deletion of objects that have reached the end of their retention period. They do this without any application changes.
A common lifecycle pattern for log data:
- Store logs in S3 Standard for the first 30 days (when they are actively queried for troubleshooting).
- Transition to Standard-IA for days 31 to 90 (occasional retrospective analysis).
- Transition to a Glacier class such as Glacier Flexible Retrieval for days 91 to 365 (retained for compliance but rarely accessed).
- Delete after 365 days.
Without a lifecycle policy, this would require manual management. With a lifecycle policy, it happens automatically. Lifecycle policies apply to entire buckets, to objects matching specific key prefixes, or to objects with specific tags.
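The log pattern above could be written as a lifecycle configuration along these lines. Treat it as a sketch under stated assumptions: the `logs/` prefix, rule ID, and helper names are hypothetical, and `GLACIER` is the API name for Glacier Flexible Retrieval. Applying it requires boto3 and AWS credentials.

```python
def log_lifecycle_configuration(prefix="logs/"):
    """Build a lifecycle configuration for the log retention pattern above."""
    return {
        "Rules": [
            {
                "ID": "log-retention",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    }


def apply_lifecycle(bucket, configuration):
    """Attach the configuration to a bucket. Needs boto3 and credentials."""
    import boto3  # imported lazily so the builder above runs without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )


config = log_lifecycle_configuration()
print(config["Rules"][0]["Transitions"])
```

Changing the `Filter` narrows or widens the set of objects the rule applies to.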
S3 Security and Access Control
Private by Default
All S3 buckets and objects are private by default.
A newly created bucket is accessible only to the AWS account that created it. No public access, no cross-account access, no application access until you explicitly grant it. The Block Public Access setting is enabled by default at both the account and bucket level. It prevents any bucket policy or object ACL from granting public access unless you explicitly disable it.
This default has prevented a large class of accidental data exposure that affected early S3 users who inadvertently made buckets public.
Access Control Mechanisms
S3 access is controlled through three mechanisms that apply at different levels.
IAM policies are attached to users, roles, and groups. They define what S3 operations those principals can perform. They are the primary mechanism for granting application and service access to S3.
Bucket policies are JSON documents attached to individual buckets. They control which principals (including principals from other AWS accounts) can access the bucket and what operations they can perform. Bucket policies are the right mechanism for cross-account access and for public-read static website hosting.
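A cross-account bucket policy might look like the following sketch, which grants another account read-only access. The function name and bucket name are hypothetical, and 111122223333 is a placeholder account ID. The other account's own IAM policies must also allow these actions for its users and roles.

```python
import json


def cross_account_read_policy(bucket: str, account_id: str) -> dict:
    """Build a bucket policy granting read-only access to another account."""
    principal = {"AWS": f"arn:aws:iam::{account_id}:root"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Sid": "AllowListBucket",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading is granted on the objects inside the bucket.
                "Sid": "AllowGetObject",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


read_policy = cross_account_read_policy("example-bucket", "111122223333")
print(json.dumps(read_policy, indent=2))
```

The resulting JSON string is what boto3's `put_bucket_policy` expects in its `Policy` parameter.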
S3 Object Lock prevents objects from being deleted or overwritten for a specified retention period.
It is used for compliance workloads that require write-once-read-many (WORM) storage. Examples include financial records, regulatory data, and audit logs that must not be modified.
Encryption
All new S3 buckets use server-side encryption by default (SSE-S3), applying AES-256 encryption to every stored object. Encryption at rest is automatic and does not require application changes.
For workloads requiring more control, SSE-KMS allows you to use AWS Key Management Service to manage encryption keys. This provides audit logs of key usage and the ability to revoke key access. For the highest level of key control, SSE-C allows customers to provide and manage their own encryption keys.
Encryption in transit is not forced by default, because S3 endpoints accept both HTTP and HTTPS. To require HTTPS for all S3 API requests, add a bucket policy that denies any request where the aws:SecureTransport condition is false.
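A bucket policy that rejects plain HTTP requests can be sketched as follows. The function and bucket names are hypothetical; the `aws:SecureTransport` condition key is the standard way to test whether a request used TLS.

```python
import json


def deny_insecure_transport_policy(bucket: str) -> dict:
    """Build a bucket policy that denies any request not sent over HTTPS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket and every object inside it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


tls_policy = deny_insecure_transport_policy("example-bucket")
print(json.dumps(tls_policy, indent=2))
```

Because an explicit Deny overrides any Allow, this statement can be combined with other statements in the same bucket policy.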
S3 for Data Workloads
S3’s role in data engineering and analytics has expanded significantly in 2026.
Beyond raw storage, S3 is now the foundation for several data-specific capabilities.
Data Lake Foundation
S3 is the standard storage layer for data lakes on AWS. Structured, semi-structured, and unstructured data can coexist in S3 without schema constraints.
Query engines like Amazon Athena, Apache Spark on EMR, Presto, and Trino query S3 data directly using SQL without moving data into a database.
S3 supports open table formats including Apache Iceberg, Delta Lake, and Apache Hudi. These allow data lake tables to have ACID transaction semantics, schema evolution, and time travel capabilities on top of S3’s flat object storage.
S3 Tables for Structured Data
S3 Tables (generally available from late 2024) are purpose-built for tabular datasets using Apache Iceberg. (Source: Amazon Web Services, “Announcing Amazon S3 Tables: General Availability,” aws.amazon.com/blogs/aws, December 2024)
They automate the maintenance operations (compaction, snapshot expiry, file cleanup) that data engineers typically manage manually.
For teams building analytics tables, event streams, or transactional datasets on S3, S3 Tables reduce operational overhead compared to managing Iceberg tables on standard general purpose buckets.
S3 Vectors for AI Workloads
S3 Vectors provides native vector storage and query capability within S3. It stores vector embeddings (the numerical representations produced by machine learning models) and supports approximate nearest-neighbour search directly.
This is relevant for teams building retrieval-augmented generation (RAG) pipelines, semantic search applications, or recommendation systems. These teams need to store and query large volumes of embeddings without a separate vector database infrastructure.
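To make the idea of nearest-neighbour search concrete, here is a tiny exact (brute-force) version in plain Python. It is not the S3 Vectors API and not an approximate index; the document IDs and two-dimensional embeddings are made up for illustration, and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Return the k keys whose embeddings are most similar to query."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


embeddings = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.8, 0.6],
    "doc-c": [0.0, 1.0],
}
print(nearest([1.0, 0.1], embeddings))  # ['doc-a', 'doc-b']
```

A brute-force scan compares the query against every stored vector, which is why large collections rely on approximate indexes instead.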
Common S3 Use Cases
- Data lake storage: Raw, processed, and curated data layers live in S3, with query engines running on top. S3’s scalability eliminates the storage provisioning bottleneck of traditional data warehouse architectures.
- Backup and disaster recovery: Eleven-nines durability and cross-region replication make S3 the standard destination for database backups, application state snapshots, and DR copies. Point-in-time recovery is supported through versioning.
- Log storage and analysis: Application logs, access logs, and infrastructure metrics live in S3 cost-effectively at scale. Athena or Spark query directly against S3 log data without loading into a database first.
- Static website hosting: S3 can serve static websites (HTML, CSS, JavaScript, images) directly, with CloudFront as a CDN layer. No web server required.
- Software and artifact distribution: Build artifacts, software packages, ML model checkpoints, and dataset releases are commonly distributed from S3. S3 with CloudFront is the standard pattern for high-throughput content distribution.
- Compliance archiving: Glacier Deep Archive provides the lowest-cost long-term storage for regulated data, including financial records, HIPAA healthcare data, and audit trails, at a price competitive with physical tape storage.
S3 Cost Management: The Practical Basics
S3 cost has three components:
- Storage cost: Per GB per month, varying by storage class.
- Request cost: Per API call, varying by operation type.
- Data transfer cost: Egress from S3 to the internet or to other AWS regions.
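The three components add up as in the sketch below. The function name is hypothetical and the rates in the example are placeholders, not AWS prices; substitute the current rates for your region and storage class.

```python
def monthly_cost(
    storage_gb,
    requests,
    egress_gb,
    *,
    price_per_gb,
    price_per_1000_requests,
    price_per_gb_egress,
):
    """Break a monthly S3 bill into its three components."""
    storage = storage_gb * price_per_gb
    request = requests / 1000 * price_per_1000_requests
    transfer = egress_gb * price_per_gb_egress
    return {
        "storage": storage,
        "requests": request,
        "transfer": transfer,
        "total": storage + request + transfer,
    }


# Placeholder rates for illustration only; these are not AWS prices.
estimate = monthly_cost(
    storage_gb=1000,
    requests=2_000_000,
    egress_gb=50,
    price_per_gb=0.01,
    price_per_1000_requests=0.01,
    price_per_gb_egress=0.1,
)
print(estimate)
```

Separating the components makes it easier to see whether storage, requests, or transfer dominates a given workload.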
The most common cost optimization levers are:
- Storage class selection: Use Standard only for data you access regularly. Use Intelligent-Tiering or IA classes for data accessed less than once a month.
- Lifecycle policies: Automate transitions to cheaper classes and deletion of expired data.
- S3 Storage Lens: The built-in analytics tool that surfaces usage patterns across all buckets in an account, identifying unused storage and access pattern anomalies.
Data transfer costs are frequently underestimated.
Transferring data from S3 to EC2 instances in the same region is free. Transferring to other regions, to the internet, or to on-premises incurs per-GB charges. Large-scale data movement between services should be architected to minimise cross-region transfers.
Final Thoughts
Amazon S3 is the foundational storage service of the AWS ecosystem, and increasingly of the broader cloud and data ecosystem. Understanding buckets, objects, storage classes, and access control is foundational knowledge for any team building on AWS.
Understanding the newer capabilities is increasingly relevant for data engineering teams building modern analytics and AI infrastructure. Those capabilities include S3 Tables for structured data, S3 Vectors for AI workloads, and lifecycle automation for cost management.
For data teams building data lakes, analytics pipelines, or AI-ready data platforms on AWS, getting the S3 architecture right has material impact on both cost and performance. That means storage class selection, lifecycle policies, access control, and table format choices.
If you are designing or reviewing a data platform architecture on AWS, Data Pilot’s data strategy and engineering consulting helps teams make the storage, infrastructure, and data architecture decisions that determine whether the platform performs reliably and cost-effectively at scale.