When working with large amounts of data, we often encounter distributed storage systems (e.g. Apache Hadoop) that require extensive configuration and administration.
In this work, we examine a way to establish a personal data lake for data analysis that requires little configuration and administration. The deployed data lake is easy to use and can be arbitrarily extended with additional storage capacity and computational resources.
We set up the data lake with the MinIO object store and compared the pandas, Dask and Apache Spark analytical tools for data analysis.
It turned out that MinIO is fairly easy to set up and that the selected tools can easily communicate with it via the S3 protocol. The pandas library had problems analyzing large amounts of data, while Dask and Apache Spark could perform the same queries, and even more data-intensive ones, with the same amount of memory. Dask and Apache Spark are similarly efficient at running time- and space-intensive queries. Because the test data was also suitable for a relational database, we compared the query times with PostgreSQL and found that our approach of analyzing the data with MinIO and Dask or Apache Spark was much more time efficient.
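As an illustration, the following is a minimal sketch of how pandas and Dask might read data from a MinIO bucket over the S3 protocol (via s3fs); the endpoint, credentials, bucket and column names are placeholders, not values from this work.

```python
import pandas as pd
import dask.dataframe as dd

# Placeholder MinIO endpoint and credentials -- adjust to your own deployment.
storage_options = {
    "key": "minioadmin",
    "secret": "minioadmin",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},
}

# pandas reads a single Parquet object fully into memory via the S3 API.
df = pd.read_parquet("s3://datalake/measurements.parquet",
                     storage_options=storage_options)

# Dask reads the same data lazily and can evaluate queries out of core,
# which is what allows it to handle more data-intensive workloads.
ddf = dd.read_parquet("s3://datalake/measurements.parquet",
                      storage_options=storage_options)
print(ddf.groupby("sensor_id")["value"].mean().compute())
```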