20 Oct Microsoft tries to Spark relationship with cluster lusters: Promises 5-min big data bang on Azure
First apps on Windows, then Linuxes in Hyper-V and on Azure, now big data via Spark. In another effort to win over the open source crowd, Microsoft has made the speedy big data engine Apache Spark easier to set up and use on Azure, giving devs a dedicated tool to help provision clusters.
The open-source “Azure Distributed Data Engineering Toolkit” (AZTK), which runs Spark in Docker containers, lets devs provision on-demand Spark clusters and submit jobs to them from the command line.
Apache Spark boasts that it can “run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk”. The tricky part can be configuration; companies such as MemSQL have tackled this by offering a way to use Spark without writing code.
“Spinning up a Spark cluster, on-demand, can often be complicated and slow,” Microsoft program manager JS Tan wrote in an Azure blog post. “Spark developers often share [static] pre-existing clusters managed by their company’s IT team,” which means “you’re either out of capacity, or you’re burning dollars on idle nodes.”