In today's data-driven world, companies face growing volumes of data, making efficient processing critical for gaining insights, optimizing business processes, and maintaining a competitive edge.
The aim of this master's thesis is to present, test, and evaluate the impact of various optimizations on the performance of extract, transform, load (ETL) processes using Apache Spark on the Databricks platform, with the goal of reducing costs and improving execution speed.
The first part of the thesis provides the theoretical background, reviewing the basic concepts of data engineering. It outlines ETL processes, detailing the structure, purpose, and creation of each stage. This is followed by an in-depth presentation of Apache Spark, focusing on its core concepts, including MapReduce, Resilient Distributed Datasets (RDDs), and technologies used for query optimization. Various data storage architectures, such as databases, data warehouses, and data lakes, are also compared. A key feature of Spark is its ability to break tasks into smaller units of work and distribute them across multiple servers, with everything orchestrated by a cluster manager. Spark processes can run either on self-hosted infrastructure or on cloud platforms such as Databricks.
In the practical part, four ETL processes of varying complexity, data volumes, execution times, and operation types are analysed. Multiple optimizations are tested on these processes in two main categories: infrastructure-level optimizations and task-specific improvements.
The results demonstrate significant improvements in both cost efficiency and execution speed. The most impactful optimization was the use of spot and fleet instance types, which, along with other strategies, reduced costs by up to 80% and execution time by up to 40%. Further optimizations, such as task reordering and the use of explicit data schemas and user-defined functions, enhanced the reliability of the processes and simplified their creation and maintenance. Additional storage optimizations reduced storage costs by up to 90% while improving data access performance.