Amazon EMR
Easily Run and Scale Apache Spark, Hadoop, HBase, Presto, Hive and other Big Data Frameworks.
Benefits
EASY TO USE
EMR launches clusters quickly, so you do not have to handle node provisioning, infrastructure setup, Hadoop configuration, or cluster tuning yourself. With EMR Notebooks, analysts, data engineers, and data scientists can launch a serverless Jupyter notebook right away and collaborate on data exploration, processing, and visualization.
LOW COST
EMR pricing is straightforward: you pay a per-instance rate for every second used, with a one-minute minimum charge. A 10-node cluster running Apache Spark and Apache Hive can cost as little as $0.15 per hour. Because EMR natively supports Amazon EC2 Spot and Reserved Instances, you can also save 50-80% on the cost of the underlying instances.
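The per-second billing rule with a one-minute minimum can be sketched as a small calculation. This is an illustration only: the function names are hypothetical, and the $0.015 per-instance hourly rate is an assumption obtained by dividing the $0.15 cluster figure evenly across 10 nodes. Actual rates depend on instance type and region.

```python
def emr_cost(rate_per_instance_hour: float, seconds_used: int,
             instances: int = 1, minimum_seconds: int = 60) -> float:
    """Estimate a charge under per-second billing with a minimum billable duration."""
    # Anything shorter than the minimum is billed as the minimum (one minute).
    billable_seconds = max(seconds_used, minimum_seconds)
    return instances * rate_per_instance_hour * billable_seconds / 3600


def apply_savings(cost: float, savings_fraction: float) -> float:
    """Reduce a cost by a fractional saving, e.g. 0.5 for a 50% saving."""
    if not 0.0 <= savings_fraction <= 1.0:
        raise ValueError("savings_fraction must be between 0 and 1")
    return cost * (1.0 - savings_fraction)


# Assumed rate: $0.15 per hour for 10 nodes, split evenly per instance.
ASSUMED_RATE = 0.015

short_job = emr_cost(ASSUMED_RATE, seconds_used=30, instances=10)    # billed as 60 s
long_job = emr_cost(ASSUMED_RATE, seconds_used=5400, instances=10)   # 90 minutes
discounted = apply_savings(long_job, 0.5)                            # low end of the 50-80% range
```

Under these assumptions a 30-second run is billed as a full minute, and the savings range quoted above applies only to the underlying instance cost, which this sketch does not separate from any other charges.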
ELASTIC
EMR lets you provision anywhere from one to thousands of compute instances, so you can process data at any scale. Use Auto Scaling to add or remove instances dynamically based on utilization, so you pay only for what you use. Unlike on-premises clusters, EMR decouples compute from storage, so each can scale independently.
RELIABLE
EMR is optimized for the cloud: it monitors your clusters, retries failed tasks, and replaces poorly performing instances. It provides up-to-date open-source software, so you do not have to manage updates and bug fixes yourself, which reduces maintenance. Multiple master nodes keep clusters highly available, and EMR fails over automatically if a master node fails.
SECURE
EMR configures EC2 firewall settings and launches clusters in an Amazon Virtual Private Cloud (VPC) for network isolation. Through EMRFS, it supports server-side and client-side encryption of S3 objects, using either AWS Key Management Service or your own custom keys. EMR also supports encryption in transit and at rest, and strong authentication with Kerberos.
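As a rough sketch of how at-rest S3 encryption through EMRFS might be expressed, the snippet below builds a security configuration document as JSON. The field names follow the EMR security configuration format as best recalled here and should be checked against the current AWS documentation. The helper name and the key ARN are hypothetical placeholders.

```python
import json


def build_security_configuration(kms_key_arn: str) -> str:
    """Return a JSON document enabling SSE-KMS encryption for S3 data via EMRFS."""
    config = {
        "EncryptionConfiguration": {
            # In-transit encryption is left off here; enabling it also
            # requires certificate settings that this sketch omits.
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                }
            },
        }
    }
    return json.dumps(config, indent=2)


# Placeholder ARN for illustration only; substitute a real KMS key.
EXAMPLE_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
security_json = build_security_configuration(EXAMPLE_KEY_ARN)
```

The resulting string could then be registered with the `create_security_configuration` call of the boto3 EMR client (passing it as `SecurityConfiguration` alongside a `Name`), which is not executed here because it needs AWS credentials.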
FLEXIBLE
You have full control over your cluster, including root access to every instance, so you can install additional applications easily. You can customize clusters with Bootstrap Actions, launch EMR clusters with custom Amazon Linux AMIs, and reconfigure running clusters without relaunching them.
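A minimal sketch of how a Bootstrap Action and an optional custom AMI might appear in a cluster launch request is shown below. It only builds the request dictionary, using parameter names from the boto3 `run_job_flow` call. The helper name, release label, instance types, role names, and S3 path are illustrative assumptions, not recommendations.

```python
from typing import Optional


def build_cluster_request(name: str, bootstrap_script_path: str,
                          node_count: int = 3,
                          custom_ami_id: Optional[str] = None) -> dict:
    """Build keyword arguments for an EMR cluster launch with one bootstrap action."""
    if node_count < 2:
        raise ValueError("node_count must be at least 2 (one master plus core nodes)")
    request = {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick one that suits you
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": node_count - 1},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # The script runs on every node before applications start.
        "BootstrapActions": [
            {"Name": "install-extra-packages",
             "ScriptBootstrapAction": {"Path": bootstrap_script_path, "Args": []}},
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }
    if custom_ami_id is not None:
        request["CustomAmiId"] = custom_ami_id
    return request


# Hypothetical bucket and script name, for illustration only.
request = build_cluster_request("demo-cluster", "s3://example-bucket/bootstrap.sh")
```

With credentials configured, the dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`. Keeping the request construction separate from the call makes it easy to inspect or test without launching anything.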
Use cases
MACHINE LEARNING
Use EMR's built-in machine learning tools, including Apache Spark MLlib, TensorFlow, and Apache MXNet, to run scalable machine learning algorithms. Use custom AMIs and Bootstrap Actions to add your preferred libraries and tools and build your own predictive analytics toolset.
EXTRACT TRANSFORM LOAD (ETL)
EMR can run data transformation (ETL) workloads such as sorting, aggregating, and joining on massive datasets quickly and cost-effectively.
CLICKSTREAM ANALYSIS
Analyze clickstream data from Amazon S3 using Apache Spark and Apache Hive to segment users, understand their preferences, and deliver more effective ads.
REAL-TIME STREAMING
Use Apache Spark Streaming on Amazon EMR to analyze events from Apache Kafka, Amazon Kinesis, or other streaming sources in real time. Build long-running, fault-tolerant streaming data pipelines, store the transformed data in Amazon S3 or HDFS, and load insights into Amazon Elasticsearch Service for further analysis.
INTERACTIVE ANALYTICS
EMR Notebooks offer a managed analytical platform built on open-source Jupyter, enabling data scientists, analysts, and developers to prepare, visualize, and collaborate on data, as well as build applications and conduct interactive analysis.
GENOMICS
EMR can process large volumes of genomic data and other large scientific datasets quickly and efficiently. Researchers can access genomic data hosted on AWS at no cost and analyze it with EMR.
Amazon Elastic MapReduce (EMR)
Amazon EMR is the industry-leading cloud-native big data platform, allowing teams to process vast amounts of data quickly and cost-effectively at scale. Using open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, and Presto, coupled with the dynamic scalability of Amazon EC2 and the scalable storage of Amazon S3, EMR gives analytical teams the engines and elasticity to run petabyte-scale analysis for a fraction of the cost of traditional on-premises clusters. Developers and analysts can use Jupyter-based EMR Notebooks for iterative development, collaboration, and access to data stored across AWS data products such as Amazon S3, Amazon DynamoDB, and Amazon Redshift, which reduces time to insight and helps operationalize analytics quickly.