소규모 스파크 사용을 위한 AWS EMR 클러스터 생성하기
AWS에는 EMR이라고 불리는 빅데이터 플랫폼이 있습니다. EMR을 통해 스파크, 하이프, HBASE, 플링크 등 다양한 오픈소스 도구 셋트를 생성할 수 있습니다. 온프로미스로 직접 구축하는 것에 비해 매우 빠른 속도로 구축할 수 있으며 AWS에 따르면 온프로미스에서 구축하는 비용에 50%로 사용할 수 있다고 합니다. 오늘은 S3에 저장된 데이터를 스파크로 처리하기 위해 소규모 스파크로만 구성된 EMR클러스터를 생성해보겠습니다.
준비물
- AWS계정
S3 저장소 생성
S3 저장소를 생성하는 것은 쉽습니다. S3 콘솔화면으로 가셔서 버킷을 신규로 생성하면 됩니다.
EMR 클러스터 생성
emr클러스터는 aws의 emr콘솔 화면에서 생성할 수 있습니다. 콘솔로 가셔서 아래와 같은 단계를 따라 하시면 됩니다.
클러스터 생성시 빠른 옵션을 사용하게 되면 스파크 말고도 추가적인 프로젝트가 구성됩니다. 스파크만 사용하는 EMR클러스터를 사용하기 위해서 고급 옵션으로 가도록 합니다.
인스턴스 유형은 다양하게 선택할 수 있습니다. c4.large를 사용하면 가장 저렴한 가격에 3instance로 구성된 emr클러스터를 구성할 수 있습니다.
3개 인스턴스로 구성된 EMR이 설치가 완료되려면 시간이 조금 걸립니다. 짧게는 1분 길게는 10분가량 이상도 걸릴 수 있으므로 인내심을 가지고 기다리는 것을 추천드립니다.
스파크 실행
스파크를 실행하기 위해 master node로 접속하도록 하겠습니다. emr클러스터를 생성할때 사용했던 genesis-test.pem으로 마스터 노드로 ssh접속을 할 수 있습니다. 접속하면 아래와 같은 화면이 출력됩니다.
$ ssh -i genesis-test.pem hadoop@ec2-13-125-193-129.ap-northeast-2.compute.amazonaws.com 255 ↵
Last login: Thu Jul 9 04:45:54 2020
__| __|_ )
_| ( / Amazon Linux 2 AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-2/
EEEEEEEEEEEEEEEEEEEE MMMMMMMM MMMMMMMM RRRRRRRRRRRRRRR
E::::::::::::::::::E M:::::::M M:::::::M R::::::::::::::R
EE:::::EEEEEEEEE:::E M::::::::M M::::::::M R:::::RRRRRR:::::R
E::::E EEEEE M:::::::::M M:::::::::M RR::::R R::::R
E::::E M::::::M:::M M:::M::::::M R:::R R::::R
E:::::EEEEEEEEEE M:::::M M:::M M:::M M:::::M R:::RRRRRR:::::R
E::::::::::::::E M:::::M M:::M:::M M:::::M R:::::::::::RR
E:::::EEEEEEEEEE M:::::M M:::::M M:::::M R:::RRRRRR::::R
E::::E M:::::M M:::M M:::::M R:::R R::::R
E::::E EEEEE M:::::M MMM M:::::M R:::R R::::R
EE:::::EEEEEEEE::::E M:::::M M:::::M R:::R R::::R
E::::::::::::::::::E M:::::M M:::::M RR::::R R::::R
EEEEEEEEEEEEEEEEEEEE MMMMMMM MMMMMMM RRRRRRR RRRRRR
[hadoop@ip-10-180-77-80 ~]$
위 마스터 노드는 이미 spark와 java가 깔려있기 때문에 추가로 설치할 소프트웨어는 없습니다. 아래와 같이 간단한 스파크 파일을 생성하여 실행해 보겠습니다.
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
sc.stop()
위 파일은 main.py 로 생성하였습니다. 이제 실행해 보겠습니다. 실행은 spark-submit 명령어와 main.py를 옵션으로 주면 됩니다.
$ spark-submit main.py
실행 로그
20/07/09 05:22:04 INFO SparkContext: Running Spark version 2.4.5-amzn-0
20/07/09 05:22:04 INFO SparkContext: Submitted application: main.py
20/07/09 05:22:05 INFO SecurityManager: Changing view acls to: hadoop
20/07/09 05:22:05 INFO SecurityManager: Changing modify acls to: hadoop
20/07/09 05:22:05 INFO SecurityManager: Changing view acls groups to:
20/07/09 05:22:05 INFO SecurityManager: Changing modify acls groups to:
20/07/09 05:22:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
20/07/09 05:22:05 INFO Utils: Successfully started service 'sparkDriver' on port 44467.
20/07/09 05:22:05 INFO SparkEnv: Registering MapOutputTracker
20/07/09 05:22:05 INFO SparkEnv: Registering BlockManagerMaster
20/07/09 05:22:05 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/09 05:22:05 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/09 05:22:05 INFO DiskBlockManager: Created local directory at /mnt/tmp/blockmgr-23e435cb-22bb-4c01-b66d-8f94c2700c00
20/07/09 05:22:05 INFO MemoryStore: MemoryStore started with capacity 706.4 MB
20/07/09 05:22:05 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/09 05:22:06 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/07/09 05:22:06 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://ip-10-180-77-130.ap-northeast-2.compute.internal:4040
20/07/09 05:22:06 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
20/07/09 05:22:07 INFO RMProxy: Connecting to ResourceManager at ip-10-180-77-130.ap-northeast-2.compute.internal/10.180.77.130:8032
20/07/09 05:22:07 INFO Client: Requesting a new application from cluster with 1 NodeManagers
20/07/09 05:22:07 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (1792 MB per container)
20/07/09 05:22:07 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
20/07/09 05:22:07 INFO Client: Setting up container launch context for our AM
20/07/09 05:22:07 INFO Client: Setting up the launch environment for our AM container
20/07/09 05:22:07 INFO Client: Preparing resources for our AM container
20/07/09 05:22:07 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/07/09 05:22:10 INFO Client: Uploading resource file:/mnt/tmp/spark-73d59188-616a-4448-88a6-a3a329cb00f3/__spark_libs__929161307550230739.zip -> hdfs://ip-10-180-77-130.ap-northeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1594271975987_0001/__spark_libs__929161307550230739.zip
20/07/09 05:22:12 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-10-180-77-130.ap-northeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1594271975987_0001/pyspark.zip
20/07/09 05:22:13 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://ip-10-180-77-130.ap-northeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1594271975987_0001/py4j-0.10.7-src.zip
20/07/09 05:22:14 INFO Client: Uploading resource file:/mnt/tmp/spark-73d59188-616a-4448-88a6-a3a329cb00f3/__spark_conf__3686593065718482714.zip -> hdfs://ip-10-180-77-130.ap-northeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1594271975987_0001/__spark_conf__.zip
20/07/09 05:22:14 INFO SecurityManager: Changing view acls to: hadoop
20/07/09 05:22:14 INFO SecurityManager: Changing modify acls to: hadoop
20/07/09 05:22:14 INFO SecurityManager: Changing view acls groups to:
20/07/09 05:22:14 INFO SecurityManager: Changing modify acls groups to:
20/07/09 05:22:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
20/07/09 05:22:15 INFO Client: Submitting application application_1594271975987_0001 to ResourceManager
20/07/09 05:22:16 INFO YarnClientImpl: Submitted application application_1594271975987_0001
20/07/09 05:22:16 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1594271975987_0001 and attemptId None
20/07/09 05:22:17 INFO Client: Application report for application_1594271975987_0001 (state: ACCEPTED)
20/07/09 05:22:17 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1594272135843
final status: UNDEFINED
tracking URL: http://ip-10-180-77-130.ap-northeast-2.compute.internal:20888/proxy/application_1594271975987_0001/
user: hadoop
20/07/09 05:22:18 INFO Client: Application report for application_1594271975987_0001 (state: ACCEPTED)
20/07/09 05:22:19 INFO Client: Application report for application_1594271975987_0001 (state: ACCEPTED)
20/07/09 05:22:20 INFO Client: Application report for application_1594271975987_0001 (state: ACCEPTED)
20/07/09 05:22:21 INFO Client: Application report for application_1594271975987_0001 (state: ACCEPTED)
20/07/09 05:22:22 INFO Client: Application report for application_1594271975987_0001 (state: ACCEPTED)
20/07/09 05:22:23 INFO Client: Application report for application_1594271975987_0001 (state: ACCEPTED)
20/07/09 05:22:24 INFO Client: Application report for application_1594271975987_0001 (state: ACCEPTED)
20/07/09 05:22:25 INFO Client: Application report for application_1594271975987_0001 (state: ACCEPTED)
20/07/09 05:22:26 INFO Client: Application report for application_1594271975987_0001 (state: ACCEPTED)
20/07/09 05:22:27 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-180-77-130.ap-northeast-2.compute.internal, PROXY_URI_BASES -> http://ip-10-180-77-130.ap-northeast-2.compute.internal:20888/proxy/application_1594271975987_0001), /proxy/application_1594271975987_0001
20/07/09 05:22:27 INFO Client: Application report for application_1594271975987_0001 (state: RUNNING)
20/07/09 05:22:27 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.180.77.218
ApplicationMaster RPC port: -1
queue: default
start time: 1594272135843
final status: UNDEFINED
tracking URL: http://ip-10-180-77-130.ap-northeast-2.compute.internal:20888/proxy/application_1594271975987_0001/
user: hadoop
20/07/09 05:22:27 INFO YarnClientSchedulerBackend: Application application_1594271975987_0001 has started running.
20/07/09 05:22:27 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33173.
20/07/09 05:22:27 INFO NettyBlockTransferService: Server created on ip-10-180-77-130.ap-northeast-2.compute.internal:33173
20/07/09 05:22:27 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/09 05:22:27 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ip-10-180-77-130.ap-northeast-2.compute.internal, 33173, None)
20/07/09 05:22:27 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-180-77-130.ap-northeast-2.compute.internal:33173 with 706.4 MB RAM, BlockManagerId(driver, ip-10-180-77-130.ap-northeast-2.compute.internal, 33173, None)
20/07/09 05:22:27 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ip-10-180-77-130.ap-northeast-2.compute.internal, 33173, None)
20/07/09 05:22:27 INFO BlockManager: external shuffle service port = 7337
20/07/09 05:22:27 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ip-10-180-77-130.ap-northeast-2.compute.internal, 33173, None)
20/07/09 05:22:27 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/07/09 05:22:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/07/09 05:22:28 INFO EventLoggingListener: Logging events to hdfs:/var/log/spark/apps/application_1594271975987_0001
20/07/09 05:22:28 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
20/07/09 05:22:28 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/07/09 05:22:28 INFO SparkUI: Stopped Spark web UI at http://ip-10-180-77-130.ap-northeast-2.compute.internal:4040
20/07/09 05:22:28 INFO YarnClientSchedulerBackend: Interrupting monitor thread
20/07/09 05:22:28 INFO YarnClientSchedulerBackend: Shutting down all executors
20/07/09 05:22:28 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
20/07/09 05:22:28 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
20/07/09 05:22:28 INFO YarnClientSchedulerBackend: Stopped
20/07/09 05:22:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/09 05:22:28 INFO MemoryStore: MemoryStore cleared
20/07/09 05:22:28 INFO BlockManager: BlockManager stopped
20/07/09 05:22:28 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/09 05:22:28 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/09 05:22:28 INFO SparkContext: Successfully stopped SparkContext
20/07/09 05:22:28 INFO ShutdownHookManager: Shutdown hook called
20/07/09 05:22:28 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-73d59188-616a-4448-88a6-a3a329cb00f3
20/07/09 05:22:28 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-b2ec8321-47a6-4431-bde3-eac1f20a68c6
20/07/09 05:22:28 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-73d59188-616a-4448-88a6-a3a329cb00f3/pyspark-ecf9c230-b48e-40d1-87ca-89450c0b7eaf
정상적으로 실행된 것을 볼 수 있습니다. EMR 클러스터를 생성할 때 사용했던 spark 2.4.5가 사용된 것도 로그에서 확인할 수 있습니다. 스파크를 돌리고 나면 스파크에 대한 히스토리도 AWS EMR 클러스터 콘솔에서 확인할 수 있습니다.
마치며
EMR을 사용하면 AWS에서 빅데이터 관련 생태계를 구축하는데 매우 빠르게 사용할 수 있습니다. 또한 필요한 소프트웨어만 설치해서 설치에 대한 노하우가 필요없더라도 AWS에서 설정한 최적의 값으로 설치합니다. 추가 상세 설정이 필요하다면 EMR을 생성할 때 설정을 하거나 설치된 이후에 각 인스턴스에 설정할 수도 있습니다. 빅데이터 플랫폼 구축에 관심은 있지만 관련 노하우가 없고 설치에 대한 압박감이 심한 개인이나 스타트업에서는 EMR을 사용한다면 더 빠르게 빅데이터를 체험할 수 있을 것 같습니다.