
## Big Data Analytics Made Easy

As data grows, so do the memory and processing power needed to store and analyze it. A single computer can be upgraded, but only up to a point, and upgrades require downtime, which can be costly and inconvenient. This is where distributed computing helps: by combining the power of multiple computers, memory and processing capacity can be scaled up and down as needed, while fault tolerance, redundancy, and computational efficiency are all improved. However, figuring out how to set up and manage the software required to make everything run smoothly can be time-consuming and confusing. That's where RapidMiner Radoop can help!
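The scale-out idea can be sketched with Python's standard library: split a dataset into chunks, process each chunk in a separate worker process, and merge the partial results. This is the same map-and-combine pattern that Hadoop and Spark apply across whole machines; the toy example below runs on a single computer and is only an illustration, not how Radoop itself operates.

```python
from multiprocessing import Pool

def count_words(chunk):
    """Map step: count word occurrences within one chunk."""
    counts = {}
    for word in chunk:
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge_counts(partials):
    """Reduce step: combine the per-chunk counts into one result."""
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    words = ["spark", "hive", "spark", "hadoop", "hive", "spark"]
    chunks = [words[:3], words[3:]]      # split the data into partitions
    with Pool(processes=2) as pool:      # one worker process per partition
        partials = pool.map(count_words, chunks)
    print(merge_counts(partials))        # {'spark': 3, 'hive': 2, 'hadoop': 1}
```

Adding capacity then means adding workers (or, in a real cluster, machines) rather than upgrading one box.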

Radoop is a RapidMiner product that enables code-free distributed machine learning with Apache Hadoop, Spark, and Hive.
Apache Hadoop is an open-source project for reliable, scalable, distributed computing. Within Radoop, it provides the distributed file system (HDFS) and the job scheduling and cluster resource management layer (YARN).
Apache Spark is a unified analytics engine for large-scale data processing. Within Radoop, its machine learning library (MLlib) is used for modeling.
Apache Hive is data warehouse software that uses SQL to read, write, and manage large datasets residing in distributed storage. Within Radoop, it serves exactly that purpose.
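Hive's appeal is that those large, distributed datasets are queried with ordinary SQL. As a rough single-machine illustration (using Python's built-in sqlite3 rather than Hive, and standard SQL rather than HiveQL, whose dialect differs in places), a read-and-aggregate query over a table looks like this:

```python
import sqlite3

# In-memory table standing in for a Hive table (illustration only;
# Hive would run a query like this over files stored in HDFS).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# The kind of read/aggregate statement a data warehouse query looks like.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```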

### Why Should I Use It?

– Ease of Use: eliminates the complexity of data science on Hadoop, Spark, and Hive through a code-free visual programming environment.

– Rich Features: includes 70+ Native Hadoop, Spark and Hive Operators with access to all standard Spark MLlib algorithms.

– Flexibility: allows you to re-use existing SparkR, PySpark, Pig, and HiveQL code (or create new code) with the Hive Script, Pig Script, and Spark Script operators.

– Security: support for computer-network authentication protocols (Kerberos), data access authorization (Apache Sentry & Apache Ranger), HDFS encryption, and Hadoop impersonation.

In addition, the Enterprise version includes the ability to run all 1500+ RapidMiner operators inside Hadoop with the Single Process Pushdown and SparkRM operators!

## Operators Summary

Data Access

– Read from, store to, and append to Hive tables

– Read from and write to CSV files (from HDFS, Azure Blob or Datalake, Amazon S3 or local filesystem)

– Read from and write to databases (MySQL, PostgreSQL, Sybase, Oracle, HSQLDB, Ingres, Microsoft SQL Server, or any other database via an ODBC bridge)

Blending

– Attribute selection, generation, aggregation, etc.

– Example sampling, filtering, sorting etc.

– Pivot and Join ExampleSets

Cleansing

– Replace missing values, remove duplicates, normalize, PCA, etc.

Modeling

– K-Means clustering, PCA, Correlation and Covariance Matrix, Naive Bayes, Logistic Regression, Decision Tree, etc.

Scoring

– Apply Model

Validation

– Performance and Split Validation

Utility

– Loops, Scripting, Subprocesses, Single Process Pushdown, SparkRM, etc.
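To give a feel for what the modeling operators do, here is a minimal pure-Python sketch of the K-Means algorithm listed above: repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. In Radoop this work is actually delegated to Spark MLlib and distributed across the cluster; the single-machine, one-dimensional version below only illustrates the iteration.

```python
def kmeans(points, centroids, iterations=10):
    """Minimal 1-D K-Means: assign points to the nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    for _ in range(iterations):
        # Assignment step: bucket each point by its nearest centroid.
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move centroids to the mean of their clusters
        # (empty clusters keep their previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in clusters.items()]
    return centroids

points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]
print(kmeans(points, centroids=[0.0, 5.0]))  # [1.5, 11.0]
```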
