We have already split the data into training and test frames using dplyr. Alternatively, we can use the h2o.
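The idea behind a training/test split can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the dplyr or H2O implementation; the helper name `split_rows`, the 70/30 ratio, and the seed are assumptions made for the example.

```python
import random


def split_rows(rows, train_ratio=0.7, seed=42):
    """Shuffle row indices and cut them into a training and a test part."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(round(train_ratio * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Stand-in for a small table: each "row" is just its row number here.
rows = list(range(32))
train, test = split_rows(rows, train_ratio=0.7)
print(len(train), len(test))
```

Every row ends up in exactly one of the two parts, which is the property that matters when the test part is later used to judge the model.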

For linear regression models produced by H2O, we can use either print or summary to learn more about the quality of our fit. The summary method returns extra information about scoring history and variable importance. The output suggests that our model is a fairly good fit, and that both a car's weight and the number of cylinders in its engine are powerful predictors of its average fuel consumption. The model suggests that, on average, heavier cars consume more fuel.
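To make "quality of fit" concrete, here is a minimal sketch of a one-predictor least-squares fit and its R-squared, written in plain Python. It is not H2O's output or algorithm. The numbers are made up for illustration and are not taken from the data set, and the response is assumed to be fuel economy in miles per gallon, so a negative slope corresponds to heavier cars consuming more fuel.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in y that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical values: weight in 1000 lbs, fuel economy in miles per gallon.
weights = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [33.0, 30.0, 26.0, 22.0, 19.0, 15.0]

intercept, slope = fit_line(weights, mpg)
r2 = r_squared(weights, mpg, intercept, slope)
print(f"slope={slope:.2f}, R^2={r2:.3f}")
```

An R-squared close to 1 is one common reading of "a fairly good fit"; the summary of a real model reports further metrics beyond this one.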

A model is often fit not on a dataset as-is, but on some transformation of that dataset. Transformers can be applied to Spark DataFrames, and the final training set can then be sent to the H2O cluster for machine learning. We will use the iris data set to examine a handful of learning algorithms and transformers. The iris data set records measurements of flowers from three species of iris.

K-means clustering partitions points into k groups such that the sum of squared distances from each point to its assigned cluster center is minimized.
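The K-means objective described above can be sketched with Lloyd's algorithm in plain Python. This is an illustration under simplifying assumptions, not H2O's implementation: the centers are seeded from the first k points for reproducibility (real implementations use smarter seeding), and the six points are made up rather than taken from iris.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate assignment and center update steps."""
    centers = list(points[:k])  # simplistic seeding, assumed for this sketch
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    assignments = [nearest_center(p, centers) for p in points]
    # Within-cluster sum of squares: the quantity K-means tries to minimize.
    wss = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignments))
    return centers, assignments, wss


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
centers, assignments, wss = kmeans(points, k=2)
print(assignments, round(wss, 3))
```

The returned `wss` is the sum of squares that the text refers to; a lower value for the same k means tighter clusters.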

To look at particular metrics of the K-means model, we can use h2o.

PCA is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible.

We will continue to use the iris dataset as an example for this problem. As usual, we define the response and predictor variables using the x and y arguments.
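The PCA description above, finding the direction of largest variance, can be illustrated with power iteration on a covariance matrix. This is a sketch of the first principal component only, in plain Python, and is not how H2O computes PCA; the data points are made up and the function names are chosen for the example.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly apply cov and renormalize the vector."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data along v (the Rayleigh quotient for a unit vector).
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance


# Hypothetical 2-D data lying roughly along the diagonal.
rows = [(2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1), (6.0, 6.0)]
cov = covariance_matrix(rows)
direction, variance = first_principal_component(cov)
print([round(x, 3) for x in direction], round(variance, 3))
```

The variance along the returned direction is at least as large as the variance of either original coordinate, which is what "largest variance possible" means for the first component.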

Since we passed a validation frame, the validation metrics will be calculated. We can retrieve individual metrics using functions such as h2o. The confusion matrix can be printed using the following: To view the variable importance computed from an H2O model, you can use either the h2o. Since this is a multi-class problem, we may be interested in inspecting the confusion matrix on a hold-out set. Grid search in R provides the following capabilities:
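Returning to the multi-class confusion matrix mentioned above, the following plain-Python sketch shows what such a matrix contains. It does not call H2O; the predictions are hypothetical values invented for the example, and only the three iris species names come from the data set itself.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Correct predictions sit on the diagonal of the matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


labels = ["setosa", "versicolor", "virginica"]

# Hypothetical hold-out labels and model predictions.
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "versicolor"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:<12}", row)
print("accuracy:", round(accuracy(matrix), 3))
```

Off-diagonal cells show which classes are confused with each other, which a single accuracy number hides.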

Badr Chentouf, H2O. Driverless AI fully automates the data science workflow, including some of the most challenging tasks in applied data science, such as feature engineering, model tuning, model optimization, and model deployment.

Driverless AI turns Kaggle Grandmaster recipes into a fully functioning platform that delivers "an expert data scientist in a box" from training to deployment. With this new capability, Driverless AI can now address a whole new set of problems involving textual data, such as automatic document classification, sentiment analysis, and emotion detection. Join the webinar to learn more.

Vinod Iyengar, H2O. AutoML platforms and solutions are quickly becoming the dominant way for enterprises to implement and scale their ML and AI projects.

As Forrester pointed out, these tools are trying to automate the end-to-end life cycle of developing and deploying predictive models — from data prep through feature engineering, model training, validation and deployment. This often involves evaluating numerous platforms and identifying the best fit for their organization. The decision process is based on multiple considerations, including accuracy, ease-of-use, performance, integration with existing tools, economics, competitive differentiation, solution maturity, risk tolerance, regulatory compliance considerations and more.

Keeping pace with new technologies for data science, machine learning, and deep learning can be overwhelming. It can also be challenging for data science teams to deploy and manage these tools, including H2O and many others, in large-scale distributed environments. Most organizations have begun thinking about how AI can be incorporated into their business strategy, but the exponential growth of AI resources and offerings is making it difficult to find the right fit for one's organization.