RapidMiner Radoop: Big Data Analytics Made Easy


As data grows larger, so does the amount of memory and processing power needed to store and analyze it. A single computer can be upgraded, but only up to a limit, and upgrades require downtime, which can be costly and inconvenient. When a single computer is no longer enough, distributed computing can help. By combining the power of multiple computers, memory and processing power can be scaled up and down as needed, and fault tolerance, redundancy, and computational efficiency are increased. However, figuring out how to set up and manage the software required to make everything run smoothly can be time-consuming and confusing. That’s where RapidMiner Radoop can help!

What Is RapidMiner Radoop?

Radoop is a RapidMiner product that enables code-free distributed machine learning with Apache Hadoop, Spark, and Hive.
Apache Hadoop is an open-source project for reliable, scalable, distributed computing. Within Radoop, it is used for its distributed file system (HDFS) and its job scheduling and cluster resource management layer (YARN).
Apache Spark is a unified analytics engine for large-scale data processing. Within Radoop, its machine learning library (MLlib) is used for modeling.
Apache Hive is data warehouse software that uses SQL for reading, writing, and managing large datasets residing in distributed storage. Within Radoop, it is used for the same purpose.
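
To make the idea behind these tools more concrete, here is a minimal, single-machine sketch in Python of the split, process, and combine pattern that distributed engines such as Hadoop and Spark rely on. It is only an illustration, not Radoop, Hadoop, or Spark code: the function names and the "partitions" are hypothetical stand-ins for data blocks that would normally live on different cluster nodes.

```python
from functools import reduce


def split_into_partitions(values, n_partitions):
    """Split a list into roughly equal chunks, one per (hypothetical) worker node."""
    if not values:
        raise ValueError("values must not be empty")
    size = -(-len(values) // n_partitions)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_partition(partition):
    """Work done independently on each node: a partial (sum, count)."""
    return (sum(partition), len(partition))


def combine(a, b):
    """Merge two partial results into one."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(values, n_partitions=4):
    partitions = split_into_partitions(values, n_partitions)
    # On a real cluster these calls would run in parallel on separate machines.
    partials = [map_partition(p) for p in partitions]
    total, count = reduce(combine, partials)
    return total / count


readings = list(range(1, 101))
print(distributed_mean(readings))  # 50.5
```

Because each partition is processed independently and only the small partial results are combined, adding more machines lets the same computation handle more data. Radoop's visual operators aim to hide this kind of plumbing from the user.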

Why Should I Use It?

– Ease of Use: eliminates the complexity of data science on Hadoop, Spark, and Hive through a code-free visual programming environment.

– Rich Features: includes 70+ native Hadoop, Spark, and Hive operators with access to all standard Spark MLlib algorithms.

– Flexibility: allows you to reuse existing SparkR, PySpark, Pig, and HiveQL code (or write new code) with the Hive Script, Pig Script, and Spark Script operators.

– Security: supports Kerberos authentication, data access authorization (Apache Sentry and Apache Ranger), HDFS encryption, and Hadoop impersonation.

In addition, the Enterprise version includes the ability to run all 1500+ RapidMiner operators inside Hadoop with the Single Process Pushdown and SparkRM operators!

Big data analytics is easy with RapidMiner Radoop, and you can get started by downloading it for free from the RapidMiner Marketplace. If you need enterprise support and features, KSK Analytics is here to help, so please feel free to contact us.



Operators Summary

Data Access

– Read from, store to, and append to Hive tables

– Read from and write to CSV files (on HDFS, Azure Blob Storage or Data Lake, Amazon S3, or the local filesystem)

– Read from and write to databases (MySQL, PostgreSQL, Sybase, Oracle, HSQLDB, Ingres, Microsoft SQL Server, or any other database using an ODBC bridge)


Blending

– Attribute selection, generation, aggregation, etc.

– Example sampling, filtering, sorting, etc.

– Pivot and Join ExampleSets

Cleansing

– Replace missing values, remove duplicates, normalize, PCA, etc.

Modeling

– K-Means clustering, PCA, Correlation and Covariance Matrix, Naive Bayes, Logistic Regression, Decision Tree, etc.

Scoring

– Apply Model

Validation

– Performance and Split Validation

Utility

– Loops, Scripting, Subprocesses, Single Process Pushdown, SparkRM, etc.
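
For readers unfamiliar with the Split Validation operator listed above, the following Python sketch shows the general idea: hold out part of the data, train on the rest, and measure performance on the held-out part. It is a simplified illustration under stated assumptions, not Radoop's implementation. The function names, the 70/30 split ratio, and the majority-class "model" are all hypothetical choices made for brevity.

```python
import random
from collections import Counter


def train_majority(rows, labels):
    """Toy 'model': always predict the most common training label."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda row: majority


def split_validation(rows, labels, train, train_ratio=0.7, seed=42):
    """Shuffle, split into train/test parts, train, and return test accuracy."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_ratio)
    train_idx, test_idx = indices[:cut], indices[cut:]

    # Fit only on the training part.
    predict = train([rows[i] for i in train_idx],
                    [labels[i] for i in train_idx])

    # Score only on the held-out part.
    correct = sum(1 for i in test_idx if predict(rows[i]) == labels[i])
    return correct / len(test_idx)


rows = list(range(20))
labels = ["yes" if r % 4 else "no" for r in rows]
accuracy = split_validation(rows, labels, train_majority)
print(f"held-out accuracy: {accuracy:.2f}")
```

In Radoop, the equivalent steps are assembled visually, with the training and scoring carried out on the cluster rather than in local Python.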
