Pre-register for Streaming Algorithms Course

Date
13 Mar 2013
Author
Noel Welsh

I have given a number of talks on streaming algorithms and had requests for more depth on the material. I would like to expand my talks into a course, but in a true data-driven way I first want to gauge interest. Hence I’m asking interested people to pre-register now (no monetary commitment required) so I know if it’s worthwhile going ahead.

Go ahead and pre-register now or read on for more on the course.

Why?

I believe streaming algorithms are a wildly underappreciated technology for data analysis.

The defining characteristic of a streaming algorithm is that it only processes a data point once. You can run them on data stored on disk, but more commonly you just fire data at the algorithm as it arrives.

Streaming algorithms are real-time by definition. Real-time is a very nice property to have. Obviously, it lets you know right away what’s going on in your system, which can be important in certain situations. An underappreciated benefit is you don’t have any task switching. Just like long compile times lead to distracted developers, if queries take a long time to run you’ll end up reading your email for half an hour before you realise it.

Streaming algorithms also tend to be ridiculously scalable. For example, we’ll look at algorithms that use only 4K to count the number of distinct items in a set with 10^9 elements. Scalability is great even if you don’t have a tidal wave of data, because when things fit into a single machine they are so much easier to develop and maintain.

Finally, streaming algorithms are also easy to implement (and you can often find implementations online). They are the kind of thing you can knock up in an afternoon. Then, because of their awesome scalability, just wrap a HTTP front-end around it and you have yourself an analytics machine. You’ll spend two days instead two months building a system, which is really awesome.

Course Content

The course will run in two parts. The first part will cover methods for processing streams of numbers. These are mainly used in system monitoring scenarios. We’ll start by looking at various ways of calculating moving averages, useful for calculating hits per second over a time window and so on. Then we’ll move onto quantiles, which you’ll typically want to use for calculating 99% response time etc. Finally we will look a methods for constructing histograms from streaming data, useful if you want to get a closer look at your response time distribution, for example.

The second part of the course will focus on methods for sets and multi-sets. These are better suited to answering business questions. We’ll start with ways to find the most frequent items in a stream. You can use this to find who your most active users are, or to, say, filter out attackers by IP address. We will then look at methods to calculate the number of distinct items in a set. This is super useful for general analytics. Say you’re running a website. You can answer a lot of questions if you can count the number of people who visit different parts of your site and count the number of people who arrive from different sources. This is exactly what these methods do. Even better, they also support set algebra, so you can find how many users are in the intersection, union, and set difference of the various sets you’re counting. As these methods are super scalable you count thousands of different sets on a single box, which can get you a very powerful and flexible system.

The course will be very practical. Although we’ll be focused mostly on algorithms, and thus be programming language agnostic, we’ll talk implementation details and I’ll give example code (probably in Scala, Python, or Javascript). You can use whatever language you want. So long as you can perform bit manipulation you’ll be alright.

Course Options

I’d like to offer two options for the course: in person, and online. The former will run in London over a day and cost less than £500 per person. The online course will be self paced and cost less than £100.

Once again, if you’re interested tell me by signing up and I’ll let you know if and when the course goes ahead.

A Tribute to Phil Bagwell

Date
16 Oct 2012
Author
Noel Welsh

I’ve just read that Phil Bagwell has passed away. Although I only met Phil Bagwell once he has had a large influence on my life. His research into data structures is perhaps what he is best known for, and this work is appreciated daily by Scala, Haskell, Clojure, and other programmers.

I, however, owe Phil a bigger debt as his suggestion led directly to the formation of Underscore. At ScalaDays 2011 I was discussing the trials of constructing a consulting organisation around Scala. He suggested forming a consortium with other consultants, and gave me a few names to contact. The result if of course the organisation you see today. We’ve met up with Typesafe people numerous times since, and I always hoped Phil would be present at our meetings so I could say thanks. Alas it was not to be, and now of course that opportunity is gone. Phil will be missed by all of us at Underscore.

Misconceptions about Scala Developer Rates

Date
01 Aug 2012
Author
Richard Dallaway

When we introduce Scala into a organisation, at some point we find someone who makes one or both of these assertions:

  • Scala developers are “expensive”.
  • Scala is hard to understand.

We’ve tackled this a number of times and would like to share the results of our fact finding and experience.

In this posts, we’ll look at the misconception of Scala developers being expensive.

Java and Scala rates are about the same

Cost is relative, and the idea of Scala developers costing more is usually in the context of a comparison to Java. The idea, in the least-sophisticated form, is: there are lots of Java developers, so that must mean they are cheap; there are less Scala developers, so they must be expensive.

It’s an understandable starting point, as there are 9 million Java developers, but it is flawed by two facts:

  • Java developers and Scala developer rates are within two per cent of each other.
  • It is more productive to develop in Scala.

The data on rates

Looking the data from Indeed.com we see that this year, and further back, Java and Scala rates have oscillated above and below each other. The difference, given by Indeed.com is that the “average Scala Developer salaries for job postings nationwide are 2% higher than average Java Developer salaries for job postings nationwide.”

Scala and Java rate comparison from Indeed.com

Those are the market figures. We can also add that from the developer circles we move in we know some Java developers will take a reduced rate to use Scala; some now require a higher rate to consider a Java contract over a Scala one.

The conclusion we make is that it’s a huge stretch to call a two per cent difference “expensive”, especially given the experience on productivity.

The data on productivity

Scala was not just built with productivity in mind, it was the central drive. There are plenty of experience reports showing that gains are being delivered.

When The Guardian moved to Scala from Java the management reported projects not being materially effected during the transition, and then being more productive after about three months. The developer experience is of writing less code, producing higher quality code, and delivering faster by focussing on the problem not boilerplate.

In terms of research evidence there is less to drawn on. A summary by Phil Bagwell of recent findings, points to eye tracking experiments that show developers are faster to see what’s going on in Scala code compared to Java due to the more concise syntax.

Having said that, it’s not going to fit well for everyone or every team. Obviously. But for the right team, the gains are there to be taken.

Conclusions

Scala: more productive, not more expensive.

Another Small Example of the Type Class Pattern

Date
09 Jul 2012
Author
Channing Walton
Source
Casual Miracles

Developers at a client recently asked about the following code:

import scalaz._
import Scalaz._
val x: Option[Int] = whatever
val y = ~x // Huh!?

What is that ~x there?

It's a unary method call.

An example of a unary method is -, as in -1. - is actually defined in scala.Numeric like this:

def unary_-() = negate(lhs)

So -1 is actually 1.unary_-. The unary_ part can be omitted when used in prefix form. Note that the only identifiers that can be used as prefix operators are +, -, !, and ~.

So the answer to the question is: its a unary method defined … somewhere:

def unary_~()

What does ~ do to options?

In Scalaz its a method that returns the value contained in an Option, or a default Zero for the Option's type. For example, ~Some(1) returns 1, ~None (for an Option[Int]) will return 0.

But how does our unary method find an appropriate Zero instance?

The answer is type classes. To the code …

object Zeroes extends App {

/* The type class pattern starts with a trait defining the
 * behaviour we need. In this case, a Zero for some type T,
 * that can provide the zero for that type.
 */
trait Zero[T] { def zero: T }

/* The other half of the pattern is the trait's companion
 * object which contains a set of implicit instances
 * for the type class. Whenever an implicit instance
 * of a Zero is required, the companion object is one of
 * the last places searched.
 * 
 * In this case there is only an instance for Int but it 
 * could contain instances for all the basic types.
 */
object Zero {
  implicit object IntZero extends Zero[Int] { def zero = 0 }
}
 
/* So that's the type class, but scala's Option does not have
 * the method we need, so lets enrich the Option type with an
 * implicit conversion (view) to a new, anonymous type with 
 * the method we need.
 */
implicit def withTheZeroes[T: Zero](option: Option[T]) = new {
  /* Note that the T: Zero in the type parameter is called a context
   * bound. It says that this method wants an implicit instance
   * of Zero[T] which will be used below.
   */
 
  /* The method will use Option's getOrElse method to get the
   * value if the option is a Some, or else return the value
   * returned by the Zero[T] if the Option is a None.
   */
  def unary_~(): T = option.getOrElse( implicitly[Zero[T]].zero )
    
  /* The implicitly here is just a method in scala.Predef
   * 
   *   def implicitly[T](implicit e: T) = e    
   *   
   *   The comment says: 
   *     "for summoning implicit values from the nether world"   
   * 
   * In other words, it's a method that requires an implicit 
   * instance of the supplied type, in this case Zero[T] and
   * returns it. Since withTheZeroes requires an implicit 
   * instance of Zero[T] through the context bound discussed
   * above, implicitly[Zero[T]] will find it.
   */
  }

// ok lets try it out with a standard scala Option 
val something = Some(1)
val nada: Option[Int] = None
  
println(~something) // returns 1
println(~nada) // returns 0
  
/* But there is more. Type classes are open to extension. 
 * We can supply our own zero, either superseding those
 * supplied by the Zero companion object, or add new ones. 
 * 
 * Here is one for a String. 
 */
implicit object StringZero extends Zero[String] { def zero = "" }
val noString: Option[String] = None
println(~noString) // returns an empty string
}

So there it is, a unary method enriching an Option via an implicit view which requires an implicit Zero instance for the Option's enclosed type.

Scalaz provides all the machinery to do this so all you need to do is use it. If you need to modify the defaults then you can provide your own implicit Zero instances and import them. The implicit resolution rules will ensure that your instances will be found before the defaults if they are imported or placed in the companion object of your classes.

If you want to learn more about this kind of thing have a look at the other blogs in this series of Small Examples.

If you would like a course on Scala, look no further than Underscore.

Review of COLT and ICML

Date
06 Jul 2012
Author
Noel Welsh
Source
Untyping

I spent last week in the lovely city of Edinburgh attending COLT and ICML. I’m still digesting a lot of what I saw, but I thought it might be interesting to give a quick list of my highlights.

Modern Banditry

Bandit and reinforcement learning algorithms are of major interest, given my work on Myna, and both conferences featured interesting papers on these problems. Of particular note, Hierarchical Exploration for Accelerating Contextual Bandits is an algorithm I could see us putting into practice very soon. This solves the problem of training a (linear) classifier under bandit feedback by initially restricting the parameters to a lower-dimensional subspace, and then relaxing this restriction as more data arrives. This reduces the amount of time spent exploring, and hence the regret, so long as a good subspace can be identified.

The Best of Both Worlds: Stochastic and Adversarial Bandits is a very interesting paper, showing it is possible to switch between a stochastic and adversarial bandit and still bound the regret if either case is true. I think this work is very early, and hopefully we’ll see further developments in the future. In particular bandit algorithms for the non-stationary but not adversarial case are of interest but largely unexplored in the literature.

The Policy Regret and Local Regret papers also presented some interesting ideas rethinking how we define regret. Finally there are number of reinforcement learning papers, which I haven’t yet had time to read.

Topic Models

Topic models continue to be a hot topic as they have direct application in many content recommendation systems. Latent Dirichlet allocation is one basic topic model, and there has been much work scaling it to large document collections. Previous work includes online LDA and Yahoo’s Hadoop-based LDA. At ICML we were treated to Sparse stochastic inference for latent Dirichlet allocation. This uses a hybrid variational/sampling scheme to do online inference (that is, it processes a stream of data). The experiments are performed on a 33 billion word corpus, which demonstrates the algorithms scalability. Of course this is not the last word on scaling LDA. In the future we can expect spectral methods to provide an alternative learning paradigm. In this particular case the algorithm is based on the singular value decomposition, which is surprisingly easy to scale.

Speeding up LDA is not the only game in the topic model town. The other main direction is developing more expressive or domain specific models. ICML had plenty of these, including models for melodic sequences, labelled data, changing topics and more.

While gathering the links for this post I came across this paper, which looks very very interesting if topic modelling is your bag.

Other Things

There was plenty of work on optimisation and Bayesian inference. I use these techniques, but they’re not something I have a particular interest in so I don’t have much to say here.

Both ICML and COLT had many papers on recommendation systems (also known as collaborative filtering, or matrix prediction/completion). I haven’t got around to reading most of these, but I’ll definitely be going through them at some point.

Finally, three papers that don’t fit any of the above themes but I found interesting:

  1. Exact Soft Confidence-Weighted Learning is the latest in the series of online linear classifiers that began with the passive-aggressive perceptron. I really like these algorithms; they are simple to implement, give great performance, and being online you don’t need to keep your training data around. I’m looking forward to testing this one.

  2. Sequence prediction is a problem that has many applications, from compression to recommender systems. Learning the Experts for Online Sequence Prediction shows how to learn a suffix-tree variant and demonstrates good performance predicting browsing behaviour.

  3. Most developers know about k-d trees for storing spatial data. K-d trees aren’t efficient in high dimensions as they branch too often. Approximate Principal Direction Trees describes an algorithm that represents high dimensional data compactly and doesn’t take much time to compute. The key insight, for me, is to use only a few iterations of the power method to compute a vector that is with high probability in a direction of high variance.

If you read this you might also be interested in Alexandre Passos’ reading list.

A Small Example of Kleisli Arrows

Date
02 Jul 2012
Author
Channing Walton
Source
Casual Miracles

This is a simple example of use of a mysterious beast called a Kleisli arrow.

Again, I will be using scalaz for my incantations as it comes with the necessary potions.

What is the Problem?

You have functions that take a simple type and return higher kinded types like Options or Lists, and you need to compose those functions. You might think you need to get the first result, pattern match on the it, apply it to the second function, repeat. Sounds rather cumbersome.

Kleisli FTW

The intuition is this:

a Kleisli is function composition for Monads.

Hmm, lets try again:

if you have functions that return kinds of things, like Lists, Options, etc, then you can use a Kleisli to compose those functions.

The Example

object Kleisli extends App {
  import scalaz._
  import Scalaz._

  // Some methods that take simple types
  // and return higher-kinded types
  def str(x: Int): Option[String] = Some(x.toString)
  def toInt(x: String): Option[Int] = Some(x.toInt)
  def double(x: Int): Option[Double] = Some(x * 2)

  // Lets compose those functions Ye Olde Way
  def oldSchool(i: Int) =
    for (x <- str(i);
       y <- toInt(x);
       z <- double(y))
    yield z

  // Kleisli!
  val funky = kleisli(str _) >=> (toInt _) >=> (double _)

  println(oldSchool(1)) // Some(2.0)
  println(funky(1))     // Some(2.0)

  // Lets use symbols!
  val reallyFunky = ☆(str) >=> (toInt _) >=> (double _)
}

I was asked by bhericher below about composing functions returning different types. Here is a solution but I am not sure if its the best way:

def opt(x: Int): Option[String] = Some(x.toString)

def list(x: String) = List(x)

// a function to turn an option into a list
def optToList[T](o: Option[T]) = o.toList

// now we can compose opt and list using optToList
val multi = (kleisli(opt _) compose (optToList _)) >=> (list _)

Further reading

A Small Example of Phantom Types

Date
16 Jun 2012
Author
Channing Walton
Source
Casual Miracles

Here is an example of so-called phantom types and covariance in Scala to improve type safety and correctness. The example is a simplified version from a real project.

I present the example as source code so you can copy into an editor and play with it. References for further reading are given at the bottom.

object Phantoms {

  /* Here is a Button, some implementations and an ActionPanel
   * that uses two Buttons. */
  trait Button

  case object Accept extends Button
  case object Reject extends Button
  case object Disabled extends Button

  /* Here is a panel containing an accept and reject button.
   * The types of accept and reject are simply Button because
   * we want to construct the ActionPanel with a DisabledButton
   * too. */
  class ActionPanel(accept: Button, reject: Button)

  /* The problem is that nothing is enforcing the order of the
   * arguments, it would be easy to get them the wrong way around.
   *
   * We can fix this with so-called phantom types, types used at
   * compile-time to assist in correctness but are not required at
   * runtime.
   *
   * Some phantoms... */
  trait Acceptor
  trait Rejector

  // Now a new Button parameterised by T but makes no use of T
  trait Button2[T]

  // New buttons that use the phantom types.
  case object Accept2 extends Button2[Acceptor]
  case object Reject2 extends Button2[Rejector]

  /* and now an ActionPanel that requires Buttons
   * using types to ensure the correct ones are
   * supplied */
  class ActionPanel2(accept: Button2[Acceptor],
                     reject: Button2[Rejector])

  /* Getting the order wrong results in compilation errors.
   * This will not compile:
   *
   * class ActionPanel2(accept: Button2[Rejector],
   *                    reject: Button2[Acceptor])
   */

  /* But what about the DisabledButton? As things stand,
   * subclasses of each button type would be needed,
   * which is duplication.
   *
   * Introducing the Covariant Phantom */
  trait Button3[+T]

  // and the usual subclasses
  case object Reject3 extends Button3[Rejector]
  case object Accept3 extends Button3[Acceptor]

  // and panel
  case class ActionPanel3(accept: Button3[Acceptor],
                          reject: Button3[Rejector])

  // And now the DisabledButton
  case object Disabled3 extends Button3[Nothing]

  // and in use
  ActionPanel3(Disabled3, Reject3)

  /* What?! How did that work?
   *
   * The Nothing type extends everything, its the
   * bottom type. Since Button3 is covariant,
   * Button3[Nothing] is a subclass of all Button3's
   * and can be used in their place. */
}

In summary, using a covariant phantom type has enabled the ActionPanel to enforce the types of its arguments more strongly and supporting a Disabled (default) implementation.

This was exactly what I needed in my real project, I hope its useful to you.

Further Reading

  1. James Iry's Phantom Types In Haskell and Scala
  2. Phantom Types at the Foursquare Engineering Blog
  3. The Nothing type.

A Small Example of the Typeclass Pattern

Date
03 May 2012
Author
Channing Walton
Source
Casual Miracles

Typeclasses, (or type classes), are most famously a language feature of Haskell that has gained interest in the Scala community. Here I describe the basic pattern with references for further study.

What is a type class and why do I need it?

Type classes are useful to solve several fundamental challenges in software engineering and programming languages. In particular type classes support retroactive extension: the ability to extend existing software modules with new functionality without needing to touch or re-compile the original source. [1]

The Scala library includes a few typeclasses such as scala.math.Numeric and scala.math.Ordering, and Scalaz is all typeclasses.

Scala enables the typeclass pattern using traits and implicits, and whilst Scala's implementation is more verbose than Haskell's, it comes with greater flexibility. Haskell only allows a single typeclass instance globally, whereas Scala allows any number to be available.

Furthermore, Scala allows default implementations to be made available if no others can be found.

For a deeper understanding see the references at the bottom of the blog.

Show me the code!

(You can copy and paste this into your favourite IDE)

/* Here is the first piece of the Scala typeclass puzzle,
 * its just a trait. The trait defines a concept which in
 * this case is a transformer that transforms a type T
 * into an R. */
trait Transformer[T, R] {
  def transform(t: T): R
}

/* This is a companion object for the typeclass giving default
 * implementations for the typeclass. These implementations
 * are found after local implicits, so you can still override the
 * default behaviour. For more about the search order see [3]. */
object Transformer {
  implicit object IntToStringTransformer
          extends Transformer[Int, String] {
    def transform(t: Int) = t.toString
  }

  /* This one is interesting. It transforms a List[T]
   * into a String, but to do that it needs a transformer for
   * T to String. So it requires such a transformer implicitly. */
  implicit def ListToStringTransformer[T](
                 implicit tToString: Transformer[T, String]) =
    new Transformer[List[T], String] {
      def transform(t: List[T]) =
        t.map(tToString.transform(_)).mkString(",")
    }
}

// This is something that makes use of the typeclass
trait Transform {
  // The implicit Transformer, transformer, supplies an
  // appropriate transformer for the method
  def transform[T, R](t: T)(
        implicit transformer: Transformer[T, R]): R =
    transformer.transform(t)
}

// These examples will drop back to the default implementations
// in the Transformer's companion object
object ExampleWithDefaults extends App with Transform {
  println(transform(2)) // prints 2
  println(transform(List(1, 2, 3))) // prints 1,2,3
}

// I want my own special transformers!
object ExampleWithMyTransformers extends App with Transform {

  implicit object MyIntToXmlTransformer
          extends Transformer[Int, xml.Node] {
    def transform(t: Int) = { t.toString }
  }

  // As with the default List transformer, this one needs
  // a transformer of T to Node so that it can perform the
  // transform for lists.
  implicit def MyListToXmlTransformer[T](
                 implicit transformer: Transformer[T, xml.Node]) =
    new Transformer[List[T], xml.NodeSeq] {
      import xml._
      def transform(t: List[T]): NodeSeq =
        t.foldLeft(NodeSeq.Empty) { (accumulator, next) =>
          accumulator ++ transformer.transform(next)
        }
    }

  println(transform(2)) // prints 2
  println(transform(List(1, 2, 3))) // prints 123

  // Here is a transformer from Boolean to String
  implicit object BooleanToStringTransformer
          extends Transformer[Boolean, String] {
    def transform(b: Boolean) = b.toString
  }

  // This is the clever bit.
  // Now we can now transform List[Boolean] to String.
  // The List transformer defined in the default Transformer trait
  // will be used but with the new BooleanToStringTransformer. So
  // the default transformers and the new ones work together.
  println(transform(List(true, false))) // prints true,false

  // Finally, if there are no implicits available then the code
  // fails to compile.
  // transform("3")
}

Summary

So we defined a typeclass which has been extended with new instances. If the transformer typeclass were part of a library, we can retroactively extend the library with new behaviour supporting new types not originally conceived of in the library. Furthermore, the library supplied types and the new ones work together perfectly.

Further Study

  1. "Type Classes as Objects and Implicits"; Bruno C. d. S. Oliveira, Adriaan Moors, Martin Odersky
  2. "How to make ad-hoc polymorphism less ad-hoc"; Philip Wadler, Stephen Bolt, (POPL '89)
  3. Implicit resolution order
  4. Tutorial: Typeclass with Dan Rosen
  5. Scala/Haskell: A simple example of type classes by Mark Needham