Implementing machine learning algorithms in a distributed environment offers several advantages, such as the ability to process large datasets and to achieve linear speedup with additional processing units. We describe the MapReduce paradigm, which enables distributed computing, and the Disco framework, which implements it. We present the summation form, a condition for the efficient implementation of algorithms in the MapReduce paradigm, and describe the implementations of the selected algorithms. We propose novel distributed random forest algorithms that build models on subsets of the dataset. We compare the running time and accuracy of our algorithms with well-recognized data analytics tools. We conclude the thesis by describing the integration of the implemented algorithms into the ClowdFlows platform, a web platform for the construction, execution, and sharing of interactive data mining workflows. With this integration, we enable the processing of large batch data through visual programming.
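A minimal sketch of the summation form idea, not taken from the thesis or the Disco API: a statistic expressible as a sum of quantities computed independently on data chunks can be split into a map step that emits partial results and a reduce step that combines them. The mean is the simplest such example; the chunk list and function names below are illustrative assumptions.

```python
from functools import reduce

def map_phase(chunk):
    """Emit a partial (sum, count) pair for one data chunk."""
    return (sum(chunk), len(chunk))

def reduce_phase(partials):
    """Combine partial sums and counts into the global mean."""
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
    return total / count

if __name__ == "__main__":
    # Chunks stand in for dataset splits handled by separate workers.
    chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    partials = [map_phase(c) for c in chunks]   # would run in parallel in MapReduce
    print(reduce_phase(partials))               # 3.5, identical to the serial mean
```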