Hadoop on a shared infrastructure with Isilon and VMware – a report from a recent POC in Switzerland

Johannes Geissler

Senior Specialized Solutions Engineer at EMC Computer Systems AG

Traditionally, customers deploy Hadoop on dedicated servers with a dedicated Hadoop file system on top of direct-attached storage. Loading data into this Hadoop silo can be challenging and time-consuming. It is an outdated approach; dedicated hardware is no longer needed.

With VMware you already have a server infrastructure to run Hadoop on, and Isilon allows you to serve HDFS directly from your scale-out NAS.

I had the chance to test this setup at the University of Fribourg and to run a couple of Hadoop jobs on a small 4-node ESX cluster (VSPEX Blue) pointed at a 3-node Isilon system for shared HDFS.

To get started you need to invest $0. Just download the software:

  • VMware Big Data Extension 2.3
  • existing Isilon NAS or IsilonSD (software-only Isilon for ESX)
  • Hortonworks, Cloudera or PivotalHD
  • EMC Isilon Hadoop Starter Kit (documentation and scripts)

Preparation

VMware Big Data Extension

VMware Big Data Extension helps to quickly roll out Hadoop clusters. BDE is a virtual appliance based on Serengeti and integrates as a plug-in to vCenter. https://www.vmware.com/support/pubs/vsphere-big-data-extensions-pubs.html

The vApp deploys a management VM and a template from which the Hadoop VMs are cloned:

[Image: BDE_VMs_template]

One can manage BDE through the CLI or in vCenter. For consistency it's recommended to stick to either CLI or GUI management. From the new management-server VM you can start the Serengeti CLI and connect it to vCenter. Add the networks and datastores to be used by BDE and you're ready to roll out the first cluster:

[Image: cluster_create]

This command creates a cluster with 1 master node and 16 workers. A NameNode VM is not needed: the NameNode runs as a clustered service directly on all Isilon nodes. The vspex-bde.json file contains the configuration for the cluster nodes: how many nodes, how much memory and how many vCPUs per node, master/worker roles, and so on.
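For reference, a BDE/Serengeti spec file is plain JSON describing node groups. A minimal sketch of roughly what a vspex-bde.json could look like (the group names, roles and sizes here are illustrative assumptions, not the actual file used in this POC):

```json
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_client"],
      "instanceNum": 1,
      "cpuNum": 4,
      "memCapacityMB": 16384
    },
    {
      "name": "worker",
      "roles": ["hadoop_nodemanager"],
      "instanceNum": 16,
      "cpuNum": 2,
      "memCapacityMB": 16384
    }
  ]
}
```

Note there is no HDFS node group at all, since the NameNode and DataNode services live on the Isilon side.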

[Image: VMs_created]

In vCenter the cluster is named accordingly and the VMs are visible:

[Image: vsphere]

Now you have essentially 17 bare VMs for Hadoop, without any distribution installed. This is important to understand: BDE only automates the deployment of the VMs and makes it easy to adjust vCPU and memory counts and to add or remove worker VMs.

In the next step you can install your preferred Hadoop distribution. I tested Hortonworks and Cloudera, using an existing 3-node Isilon cluster as the HDFS store.

HDFS on Isilon scale-out NAS

For HDFS we use an Isilon, which is a multiprotocol NAS platform. This means data can be stored through any protocol, such as NFS or CIFS, and analyzed directly by the Hadoop nodes through HDFS as a protocol. If you don't have an Isilon cluster, you can download the software-only version for free use: www.emc.com/getisilon

With HDFS on Isilon one can benefit from all enterprise NAS features such as replication, snapshots and backup integration. In addition, the usable capacity on Isilon is about 80% of raw, compared to less than 33% with HDFS on DAS.
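The capacity gap is simple arithmetic; here is a back-of-the-envelope sketch in shell (the 100 TB raw figure and the default 3x HDFS replication factor are illustrative assumptions):

```shell
RAW_TB=100                            # example raw capacity

# HDFS on DAS keeps 3 copies of every block -> ~33% usable
HDFS_USABLE=$((RAW_TB / 3))

# Isilon protects with forward error correction -> ~80% usable
ISILON_USABLE=$((RAW_TB * 80 / 100))

echo "HDFS on DAS: ${HDFS_USABLE} TB usable of ${RAW_TB} TB raw"
echo "Isilon FEC:  ${ISILON_USABLE} TB usable of ${RAW_TB} TB raw"
```

On 100 TB raw that is 33 TB versus 80 TB usable, so the same dataset needs far fewer spindles on the NAS side.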

Quote from Cloudera: "In the decade and a half since Google originally designed that architecture, though, the industry has made some progress. A few years back we worked with EMC to integrate the Isilon storage system with Hadoop, and we have customers running Cloudera today against large-scale Isilon stores. Separating the storage grid from the compute grid turns out to have some administrative and cost advantages for enterprise deployment at scale."

 

EMC Hadoop Starter Kit

Before installing your preferred Hadoop distribution, have a look at the EMC Hadoop Starter Kit. It includes scripts to prepare your VMs for Hadoop deployment, such as scripts to enable SSH, mount a common NFS share as a repository and install prerequisite packages on all VMs:

https://github.com/claudiofahey/hsk-docs

The starter kit is available for Pivotal, Cloudera and Hortonworks.

Hadoop Distribution Setup

Hortonworks

In preparation for Hortonworks we need to enable the Ambari agent on Isilon and create the required users. During HDP setup, when registering your hosts, just add the name of the Isilon cluster as one of the hosts, and check to use Isilon as the DataNode only:

[Image: hortonworks_install]

Cloudera Setup

The setup with Cloudera Manager is even more straightforward. In the list of services to install, one can simply choose Isilon as the HDFS layer:

[Image: Cloudera_install_small]

Performance Tests

With the Hadoop cluster ready it’s finally time for some performance tests.

The compute side consists of four nodes, each with an E5-2620, all in one 2U chassis. I've deployed 16 VMs as Hadoop worker nodes; each worker VM has 2 vCPUs and 16 GByte of memory.

The HDFS layer runs on a small Isilon cluster with 3 physical nodes.

Teragen

Teragen is not a compute-intensive workload; the bottleneck is typically the storage layer. I generated 100 GByte of random data and wrote it to HDFS on Isilon.

30 tasks, about 2 per VM, turned out to give decent performance.

[root@cloudera-master-0 ~]# time hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 1000000000 /hadoop/teragen/cloudera/30task-100GB
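Note that teragen's first argument is a row count, not a byte count: each generated row is 100 bytes, so 1000000000 rows yields the 100 GByte used here. A quick sanity check in shell:

```shell
ROWS=1000000000      # first teragen argument: number of rows to generate
ROW_BYTES=100        # teragen writes fixed 100-byte rows

TOTAL_GB=$((ROWS * ROW_BYTES / 1000 / 1000 / 1000))
echo "teragen output: ${TOTAL_GB} GB"
```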

The Isilon GUI shows inbound throughput jumping to 15-19 Gbit/s. That's a pretty decent number for writes on 3 x 23 disks protected with FEC on a distributed file system. Each Isilon node ingests about 700-800 MByte/s.

[Image: ISILON_GUI]

The teragen job of 100GByte completed in 1 minute and 11 seconds.
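As a plausibility check: 100 GByte in 71 seconds works out to roughly 1.4 GByte/s of payload, or about 11 Gbit/s, which fits the 15-19 Gbit/s seen on the Isilon front end once protection overhead is added on top of the payload:

```shell
PAYLOAD_BYTES=$((100 * 1000 * 1000 * 1000))   # 100 GB of teragen output
ELAPSED_S=71                                  # measured job run time

MB_PER_S=$((PAYLOAD_BYTES / ELAPSED_S / 1000000))
GBIT_PER_S=$((PAYLOAD_BYTES * 8 / ELAPSED_S / 1000000000))

# roughly 1408 MByte/s and 11 Gbit/s of application payload
echo "~${MB_PER_S} MByte/s payload, ~${GBIT_PER_S} Gbit/s"
```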

Terasort

Next up is the terasort benchmark, which is more of a compute-bound job. For testing's sake I ran terasort against the 100 GB of random data generated in the previous test:

[root@cloudera-master-0 ~]# time hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /hadoop/teragen/cloudera/30task-100GB /hadoop/teragen/cloudera/30task-100GB-out

Terasort by default runs the mapping first and starts the reduce tasks once mapping is past 85%. After 7 minutes all data was mapped and reduce was at 25%. After 10 minutes the job completed.

Interestingly, the load on Isilon was very light during terasort and node utilization remained below 10%. Only 0-500 MByte/s were read, while the 3 Isilon nodes could easily deliver more than 3 GByte/s of reads. This is because the bottleneck was on the compute side, as witnessed in Cloudera Manager and in vCenter.

At the beginning of the terasort job, the cluster CPU utilization jumped to 100%;

[Image: Cloudera_Manager]

And in vCenter, vCPU usage alerts popped up, showing the 16 VMs together consuming around 50 GHz. That's about half of the compute power available to the 4 ESX hosts.
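The "about half" figure is consistent if the VSPEX Blue hosts are dual-socket E5-2620 v2 systems (6 cores at 2.1 GHz per socket); note the socket count and exact CPU SKU are my assumption, since the post only mentions an E5-2620 per node:

```shell
HOSTS=4
SOCKETS=2        # assumption: dual-socket nodes
CORES=6          # E5-2620 v2: 6 cores per socket (assumed SKU)
MHZ=2100         # 2.1 GHz base clock

TOTAL_GHZ=$((HOSTS * SOCKETS * CORES * MHZ / 1000))
echo "cluster compute: ~${TOTAL_GHZ} GHz"   # ~100 GHz, so 50 GHz is about half
```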

[Image: vsphere]

Summary

This POC and many other customer examples have shown that with today's virtualization, networking and storage technology, a Hadoop cluster too can run on shared infrastructure, leveraging enterprise-grade availability, efficiency and reliability. Isilon provides the unique capability to analyze data directly where it was created, on the multiprotocol scale-out NAS platform. Virtualizing Hadoop means increased efficiency: you can start very small and offer Hadoop as a service, and multi-tenancy can be implemented end-to-end.

A BIG THANK YOU to www.datastore.ch for providing the equipment and support for setting up this POC at the University of Fribourg.

What’s the next hot thing?

If you need more, much more, performance and plan to do real-time analytics, check out the new release of DSSD D5.

Cloudera is also excited about it; check out their view: https://vision.cloudera.com/when-andy-bechtolsheim-starts-a-company-you-pay-attention/

BREAKING NEWS – VIPR CONTROLLER GETS OPEN SOURCED

Sascha

Field CTO Office Switzerland at EMC Computer Systems AG

EMC plans to release an open source project based on EMC ViPR Controller, named Project CoprHD, into the open source community.
The project will open the code for ViPR Controller (the storage automation and control functionality) to community-driven development. The code will be available on GitHub in June 2015 and will be licensed under the Mozilla Public License 2.0 (MPL 2.0).
Project CoprHD APIs are engineered to provide developers a single, vendor-neutral control point for storage automation, with deep integration into cloud automation frameworks from VMware, Microsoft and OpenStack. Just as with EMC ViPR Controller, its commercial counterpart, Project CoprHD lets customers, partners, service providers and system integrators develop new service catalog offerings with automated workflows to meet their customers' specific needs. ViPR Controller is designed to provide simple storage management for EMC and third-party arrays, and we will continue selling ViPR Controller as a commercial offering. ViPR Controller and Project CoprHD share the same core features and functionality; however, ViPR Controller customers benefit from access to EMC's support and professional services.
(more…)

What a first Day at EMC World 2015!

Sascha

Field CTO Office Switzerland at EMC Computer Systems AG

[Image: EMCWUpdate]

EMC World 2015 starts with an impressive first general session. David Goulden, EMC II CEO, is on stage talking about the EMC strategy based on storage platforms, converged infrastructures and Federation solutions, all designed to improve IT agility and to drive down costs, so that companies can invest in new technologies and solutions supporting their digital business strategy.

The fastest way to improve IT agility and to drive down costs is to deploy a converged infrastructure solution.

EMC provided a technology outlook on how we will expand our CI solution portfolio with a next-generation hyper-converged rack-scale infrastructure solution called VxRack!

[Image: VXRACK]
(more…)

Cloud Infrastructure: How Cheap Does It Have to Be?

Stefan Zueger

Presales Manager at EMC Computer Systems AG

Even those familiar with the price erosion in IT infrastructure cannot help but gasp occasionally at the development of cloud service prices. And yet the integration of these new services into the IT landscape of Swiss customers is not happening as fast as the prices are falling. Why?

"Software is eating the world" was one of the oft-heard phrases at the X-Days, which drew 400 CEOs and CIOs to Interlaken in the last week of March. It refers to the IT industry trend of shifting system complexity from the "lower" layers of the IT infrastructure to the "higher" levels of the application environments. This brings cost advantages: a company's IT infrastructure can be provided with very inexpensive, highly standardized components if those components do not have to deal with the complexity of, say, multi-site operations. And this trend fits the strategy of many public cloud providers: offering infrastructure services with a guaranteed service level, but a reduced feature set, at an unbeatable price.

Software-as-a-Service (SaaS) offerings are always well received when cloud services can carve an isolated area out of the in-house IT landscape. At the end of 2014, a great many Swiss companies took the opportunity to delegate a large part of their end-user complexity to Microsoft in one stroke by adopting Office 365. But what does cloud adoption look like for Infrastructure-as-a-Service (IaaS)? (more…)

Welcome to Software Defined Data Protection, welcome RecoverPoint for Virtual Machines

Johannes Geissler

Senior Specialized Solutions Engineer at EMC Computer Systems AG

Toward the end of 2014 we launched a brand-new replication technology, based on our proven RecoverPoint engine and fully integrated with VMware.

I had a first chance to play with it. As a specialist in traditional storage replication, I was most impressed by how easy it is to start up and test the replicated virtual machine (VM).

First things first: to protect a virtual machine, simply select "Protect" from its properties:

[Image: hc_052]

Next you can choose your desired RPO, the target cluster and the host and storage resource for the replicated VM: (more…)

Products are dead, long live solutions!

Jean-Paul Nussbaumer

Sales Director Region West at EMC Computer Systems SA

Now that’s a pretty bold statement from a company that’s mainly known for its excellent products, isn’t it? EMC will always be a technology company, but we are definitely seeing a shift in the buying patterns of our customers. The challenges that companies face today are so daunting that a standalone product can never solve all of the issues. Products and related services need to be glued together in total solutions that fit in – and actually enable – a broad IT strategy.

What I have found while talking to customers since joining EMC last spring is that no company or industry can escape the changes affecting our entire society. Even the most traditional and conservative industries have to take into account that the new expectations of their clients are high. Very high. Whether you are in banking, insurance or manufacturing, or running a city or a government, and regardless of whether you serve consumers or other businesses, the people using your products or services expect you to be always online. They want to be able to consult their investment portfolio, place and track their orders at their fingertips, and adapt their subscriptions to fit their exact needs. And they demand the possibility of doing all that wherever they are, from whatever device they have at hand at that moment. Platform- and location-independent instant gratification: that is what makes your, and our, customers happy.

(more…)

Never a dull day in the storage market

Stefano Camuso

Sometimes people ask me why I am so passionate about data and about how data is stored and managed. People outside the storage industry consider our domain of expertise to be one of the dullest specializations in information technology. You will believe me when I say I don't agree; after all, I work for a vendor that has its origins in storage infrastructure. But I am convinced storage will always be one of the key parts of a datacenter.

(more…)

Just IT Agility and Control – EMC Enterprise Hybrid Cloud

Sascha

Field CTO Office Switzerland at EMC Computer Systems AG

The EMC Enterprise Hybrid Cloud (EHC) solution provides two important things: IT agility and control.

The EMC EHC is an integrated, automated, secure and scalable platform for your business applications.

The solution provides a modern, agile IaaS platform capable of dynamically allocating internal and/or external cloud resources based on business demand.

You will be able to deploy and manage internal (private cloud) and external (public cloud) resources through a central platform, gaining not only high agility but, even more importantly, control over the consumed resources at any point in time. This transforms traditional enterprise IT into a modern IT cloud service broker.

People within the company and/or the IT department can dynamically consume services through a central self-service portal. The solution provides resource and cost transparency at any point in time.

The EMC Enterprise Hybrid Cloud, based on VMware's SDDC architecture and running on our converged infrastructure, is pre-engineered and tested end-to-end, providing a single point of contact for any platform-related support questions.

Even more important, it can dramatically decrease the implementation time to less than three months, which improves the time-to-market for enterprise customers as well as for service providers.

(more…)

Backup Redefined – Data Protection-as-Service – for a Software-Defined World

Sascha

Field CTO Office Switzerland at EMC Computer Systems AG

[Image: DPAD-shield]

Requirements for data protection have changed over time. This is not only because more systems and users are creating data, and because enterprise applications are critical to nearly all business processes; it is even more because of the new, dynamic way we consume IT resources: the world of IT-as-a-Service. In this context, data protection becomes more challenging, with data being generated across lots of users, lots of devices and lots of next-generation applications that can be dynamically ordered and consumed via self-service portals. In this new, dynamic, on-demand, software-defined world, traditional approaches to backing up data won't work anymore.

But data protection redefined doesn’t mean throwing everything away. Customers will still have mission critical platform 2 applications that will need to be protected. But they will have a host of new platform 3 applications and devices that need data protection as well. Through EMC’s strategy, we can bridge the transition, offering consistent data protection for existing and future IT environments – whether data and applications are running on legacy infrastructure or new Hybrid Clouds.

The data protection continuum spans protection tiers, from continuous availability and replication (we just announced VPLEX MetroPoint, a new active-active data center solution with any-point-in-time recovery, optionally even in a third data center) to backup and archive. These tiers all need to work together in a complementary fashion, creating a continuum aligned to the most stringent requirements: zero downtime and no data loss, various point-in-time copies, and secure long-term retention with archiving.

(more…)

EMC ScaleIO SDS – tested in Switzerland and declared a "disruptive technology"

Sascha

Field CTO Office Switzerland at EMC Computer Systems AG

ScaleIO is a potential game-changer, disrupting the way companies of any size can build new, elastic, hyper-converged private cloud infrastructures and the way businesses will purchase storage systems in the future.

ScaleIO enables service providers to build a hyper-converged cloud that scales from 3 to thousands of nodes while providing high performance at low latency; on top of that, it is cloud-framework- and hypervisor-independent, supporting VMware (ESXi), OpenStack (KVM), Linux and, in the near future, Microsoft.

ScaleIO's innovative technology, called ECS (elastic converged storage), is a software-only solution built for elastic third-platform applications. It is software-defined storage: a virtual scale-out storage array built on commodity x86 hardware, using any combination of solid-state devices, PCIe flash cards and hard drives, that provides fully protected, shared block LUNs.

5 reasons why ScaleIO is a disruptive technology, based on a customer PoC in Switzerland: (more…)