Scala as the Language of Choice for Data Analytics

This blog post was motivated by two things:

I love Scala, and
I love data

So I thought, let’s see if I can use both together. It turns out that I can. Scala is, in fact, highly suited to doing analytics with wide and varied data sets. This post will explain why.

Intro

Scala is a programming language that is best known as a functional language, while also enabling object-oriented and imperative programming. For those coming from an object-oriented background and unfamiliar with functional languages, their most distinctive characteristic is probably immutability. Broadly speaking, this means that when we assign a value to a variable, this variable keeps that value. We never assign a new value to it. The result is that a function which works only with immutable values returns the same result for the same input every time it is called, and this has profound benefits for the scalability of programs. (As mentioned, Scala also allows object-oriented code, and therefore it is possible to create mutable variables. However, it is Scala as a functional language which interests us here.) Now, why is Scala particularly suited to the task of doing data analysis?
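To make the immutability point concrete, here is a minimal sketch. The object name, the sample values and the mean function are made up for illustration.

```scala
object ImmutabilityDemo {
  // `val` defines an immutable binding: reassigning it is a compile error.
  val readings: List[Double] = List(1.0, 2.0, 3.0)

  // Transformations return a new collection; `readings` itself never changes.
  val doubled: List[Double] = readings.map(_ * 2)

  // A pure function: its result depends only on its argument.
  def mean(xs: List[Double]): Double = xs.sum / xs.size

  def main(args: Array[String]): Unit = {
    println(readings)       // the original list, untouched by the map above
    println(doubled)
    println(mean(readings))
  }
}
```

Because nothing here can be modified after it is created, any number of callers can share readings without worrying that someone else will change it underneath them.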

Scala is just a great language

Firstly, Scala is simply a fantastic programming language that builds on many of the strengths of Java, while ironing out many of Java’s weaknesses. Scala allows for elegant and fluent functional programming, which often means code that is more concise and readable than the equivalent object-oriented code, e.g. Java code.

Scala is a general purpose language with a large and fast-growing user community and a large number of general purpose libraries. Why is general purpose good? Because it allows for more options and more freedom than a DSL. Scala’s libraries include excellent web frameworks (e.g. Lift and Play) and networking libraries. Because it is built on the JVM we can also use any Java libraries in our Scala applications.

Scala has a strong type system. It is statically typed, meaning the type of every variable and expression is known and checked at compile time rather than at run time. This compile-time type checking catches a whole class of errors before the program ever runs, which makes Scala code safe and stable. Static typing does not have to mean verbose typing, however. Scala uses type inference, which means that the developer is not always compelled to specify variable types in his/her code. Where the compiler can deduce the type of an expression from the values and operations in it, no type declaration is needed. This type inference further reduces the verbosity of Scala code.
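A small sketch of static typing and type inference working together. All of the names and values are invented for the example.

```scala
object TypeInferenceDemo {
  // No type annotations: the compiler infers Int, String and List[Int].
  val count = 42
  val label = "temperature"
  val samples = List(1, 2, 3)

  // The return type is inferred as Double from the expression body.
  def average(xs: List[Int]) = xs.sum.toDouble / xs.size

  // Inferred types are still enforced at compile time.
  // Uncommenting the next line gives a type mismatch error:
  // val wrong: Int = label
}
```

The inferred types are just as strict as written-out ones, so we get the safety of static typing without most of the typing.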

Scala has excellent tool support. SBT is considered an excellent build tool. There are several well-regarded testing frameworks such as ScalaTest. The Scala IDE is based on Eclipse and gives good Scala support, as does the IntelliJ community edition. Furthermore, as you probably have guessed, most Java tools can be used with Scala code.

Scala is good for data analytics, specifically

It is free, open-source and platform independent. These may seem like trivial points in this post-Microsoft world but some well-known solutions for statistical computing are not free. Scala is also quite a fast and efficient language, in a broad sense, being only marginally slower than Java for the average application. Data analytics often involves computationally intensive algorithms. Therefore speed and efficiency are important.

There are several maths and statistical libraries for Scala (e.g. ScalaLab, Breeze). Much more interesting, however, is that Scala has access to a well-designed library for scalable data analytics via Apache Spark. More on that later!

Scala has a REPL (Read-Eval-Print Loop). This, essentially, is an interactive console in which we can write Scala code and have it evaluated in front of our eyes, in real time. In other words, we don’t have to build a functioning program with a main method. Want to know what 2 + 2 is? Fire up the Scala REPL and type in “2 + 2”. For data work this means we can load a data set, try out a transformation, look at the result and refine it, all without a compile-and-run cycle.
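As a rough illustration of that workflow, these are the kind of lines one might type into the REPL one at a time, with each evaluated as soon as it is entered. The temperature values are made up.

```scala
// Enter these one at a time in the Scala REPL; each line is evaluated immediately.
val temps = List(12.5, 14.0, 9.5, 11.0)

// Average of the readings.
val mean = temps.sum / temps.size

// Keep only the readings above the average.
val aboveMean = temps.filter(_ > mean)
```

After each line the REPL prints the name, type and value of the result, so we can inspect every step before moving on to the next.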

Finally, Scala allows imperative programming (spit) for those rare occasions when it makes sense. Some data analytics problems will be more efficient if tackled in an imperative fashion (don’t ask me what they might be).

No language is perfect, not even Scala. It doesn’t have excellent data visualisation capability built in, nor does it have a huge number of statistical routines in the standard library, as R does, for example. These things are relatively easy to fix given a suitable language and platform, however. The community can contribute such features. On the other hand the community can do relatively little about a rubbish language and platform!

The Big One: Scala is designed for parallelism, concurrency and scalability

The above three attributes are of vital importance in data analytics. Huge and ever-changing datasets mean that we must be extremely concerned with scalability, and scalability is largely a question of parallelism and concurrency. Scala, as a functional language, handles parallelism and concurrency elegantly and efficiently. Immutability, immutable data structures and monadic design reduce the complexity of multi-threaded software. Each module of functionality is independent and doesn’t really care what the other modules are doing. Compare this to a Java object, for example, which must be painfully aware of what other modules are doing to its various variables.
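Here is a minimal sketch of why immutable data makes this easy, using Futures from the Scala standard library. The object, the parallelSum function and the chunk size are hypothetical, chosen only for the example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSumDemo {
  // Immutable input: safe to share between threads without any locking.
  val data: Vector[Int] = (1 to 1000).toVector

  // Sum each chunk in its own Future, then combine the partial results.
  def parallelSum(xs: Vector[Int], chunkSize: Int): Int = {
    val partials: List[Future[Int]] =
      xs.grouped(chunkSize).map(chunk => Future(chunk.sum)).toList
    // Blocking with Await keeps the sketch short; real code would
    // usually compose the Futures instead of waiting on them.
    Await.result(Future.sequence(partials), 10.seconds).sum
  }

  def main(args: Array[String]): Unit =
    println(parallelSum(data, 100))
}
```

No chunk can interfere with another because none of them can modify data, so the parallel result is always the same as the sequential data.sum.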

(Have you wondered where the name Scala comes from? Answers on a postcard.)

Scala works particularly well with Spark

Since Scala handles parallelism and concurrency so efficiently, and parallelism and concurrency are so important for data analytics, it should come as no surprise that Scala is the language in which Spark is written. Spark is a ‘general engine for large-scale data processing’.

Why is Spark great? It can process huge data sets far quicker than other data engines such as Hadoop MapReduce, largely because it can keep working data in memory, and it allows for real time, interactive querying of data sets. Other engines based on Google’s MapReduce programming model are designed for batch processing. In other words, the developer must process his or her data in a batch job before querying it.

Since Spark is written in Scala it comes as no surprise that using Scala with Spark is a relatively seamless experience. We can interactively query datasets using the Scala REPL, and easily manipulate Spark’s Resilient Distributed Datasets (whose API is modelled on Scala collections) almost as if they were local collections, even though the data is spread across a cluster.
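To give a flavour of that style without needing a Spark installation, here is a sketch that uses only a plain Scala list. The log lines are invented. Spark’s RDDs offer filter, map and reduce operations with the same names, so a similar chain could be written against an RDD, but the code below is ordinary local Scala and not Spark itself.

```scala
object CollectionPipelineDemo {
  // Hypothetical log lines standing in for a much larger data set.
  val lines: List[String] = List(
    "INFO start",
    "ERROR disk full",
    "INFO ok",
    "ERROR timeout"
  )

  // Keep only the error lines.
  val errors: List[String] = lines.filter(_.startsWith("ERROR"))

  // Total number of characters across the error lines:
  // a filter / map / reduce chain on a local collection.
  val errorChars: Int =
    lines.filter(_.startsWith("ERROR")).map(_.length).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    println(errors)
    println(errorChars)
  }
}
```

The appeal is that the mental model stays the same whether the collection lives in one JVM or, with Spark, across many machines.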

Finally

In conclusion, Scala is great in general and for data analytics specifically. If you are interested in coding in general you should take a look at Scala. If you are also interested in data analytics in particular you should definitely take a closer look at the language. And, while you’re at it, take a look at Spark also 🙂

Sources

http://www.eecs.berkeley.edu/~matei/talks/2012/spark_scala_days_2012.pdf

http://www.slideshare.net/knoldus/unicom-ppt-big-data-delhi

http://java.dzone.com/articles/apache-spark-fast-big-data

http://stackoverflow.com/questions/8760925/is-there-a-good-math-stats-library-for-scala

but mostly

http://darrenjw.wordpress.com/2013/12/23/scala-as-a-platform-for-statistical-computing-and-data-science/

I didn’t end up using much from the following link but it is a motherlode of valuable-looking Scala resources:

http://www.ibm.com/developerworks/library/os-spark/

P.S.

http://scala.com/digital-signage-software/advanced-analytics/

I was quite excited by this link, thinking it was some powerful Scala analytics library!