Metrics and Hiring, some comments on diversity

Lukas Biewald, the CEO of CrowdFlower, posted an essay called Metrics and Hiring. In it, he does something I haven’t seen anyone else do, which is to dive into his own data looking for indicators of success of people he has hired. He found that referral quality and school quality were the two most predictive indicators. Lukas does not break down his data by gender or age or other demographic classes. I have to say, I very much appreciate Lukas’s humility in recognizing that he’d made disastrous hiring decisions. 

I responded briefly (of course) on Twitter, as did Tim Converse. Here’s our conversation — three white guys shooting the breeze. I think Tim and I both were trying to bring in a warning about the cost of limiting diversity, which was what Lukas seemed to be taking from his data (that is, he did better when he hired from his old boy networks). Both Tim and I have written essays about the cost of limiting diversity (Tim’s essay, my essay).

The lesson I take away from Lukas’s essay is the importance of really understanding how people get referred into an organization. How do you make the move from hiring just from within your own networks and hiring more broadly while still hiring strong candidates? One answer might be some kind of moneyball approach, based on what candidates have actually delivered in the past. Another answer is active and strong mentoring programs to both find diverse candidates and build culture together (for example, what Etsy did with their hacker school).

I suspect that diversity has to be built into a startup from its earliest days. If it is part of the culture to find and retain excellent people who are not like themselves, then the advantages of a diverse workplace will scale as the company grows. I’m not suggesting that this is easy, but I do believe it is both necessary and feasible.

A comment on “Computational linguistics and literary scholarship”

There is a controversy over “Computational linguistics and literary scholarship.” A paper published at the Association for Computational Linguistics David Bamman, Brendan O’Connor, & Noah A. Smith, “Learning Latent Personas of Film Characters“) has come under criticims by Hannah Alpert-Abrams with Dan Garrette, “Some thoughts on the relationship between computational linguistics and literary scholarship“. The primary criticism has been that the ACL paper fails to cite any literary scholarship past the middle of the last century.

Brendan O’Connor (a friend of mine, I should say) responded by saying the authors of the ACL paper were not trying to do literary theory, but “developing a computational linguistic research method of analyzing characters in stories.” O’Connor states that this is a paper about method, not literary content; and compares it to the early days of quantitative models in social science and economics (with which he is more familiar).

There is a slight irony in this controversy because a common trope on the Language Log blog (where Alpert-Adams and Garrette published their comments) is about physicists expounding on linguistics without consulting any linguists (for example, “Word String frequency distributions” and Joshua Goodman’s wonderful rant, “Language Trees and Zipping.”) This idea of poaching on other people’s grounds – or, better, trying to do other people’s work for them without a proper understanding of just what that work is – is something linguists, computational or otherwise, might have a sensitivity to.

However, I think this overlooks the real criticism of the latent personas work. To quote from Alpert-Abrams and Garrette:

When we look at Wikipedia entries about film, for example, we would not expect to find universal, latent character personas. This is because Wikipedia entries are not simply transcripts of films: they are written by a community that talks about film in a specific way. The authors are typically male, young, white, and educated; their descriptive language is informed by their cultural context. In fact, to generalize from the language of this community is to make the same mistake as Campbell and Jung by treating the worldview of an empowered elite as representative of the world at large.

I am no literary scholar, but it’s common knowledge that much of post-modern literary scholarship concerns itself with power relationships. One might even say that much of what it means to be post-modern is to study those relationships, especially to disillusion elites and recover disempowered voices. Alpbert-Abrams and Garrette’s criticism is, I believe, not so much about poaching or about ignoring recent advances in the field of literary criticism, but ignoring the realities of power in the Wikipedia data.

This affects the work of Bamman, O’Connor and Smith in at least two ways. First, it suggests that the data upon which they base their conclusions is skewed in a way they did not recognize. It is reasonable for O’Connor to say that they were more interested in method, and not criticism. But it is very important for any machine learning programme that one understands the data that models are being run on. To ignore the quality of the data will produce likely produce erroneous results (in fact, this is one of the points that Mark Liberman makes in his friendly criticism of the word string frequency paper mentioned above). Second, and this is related to the first issue, given the skewed nature of Wikipedia contributors towards young educated males, the results based on this data will likely be towards perpetuating models reflecting their views, and further disempowering women, elders, and non-educated people. The Wikipedia project states that the top quartile of Wikipedia participants starts at 30 years old, and half are under 22. It is easy to imagine how the comparative lack of data about films from people in their 30s, 40s, 50s, etc, can skew the results, and the same is easily said for women (12% of contributors) and educational level, co-varies with the young age of the average participant. Actually, the Wikipedia study did not provide data on race/ethnicity (only nationality), although it seems prima facie likely that contributors tend to be white, heterosexual, and cis-gendered.

As a single example of this, Bamman, O’Connor and Smith list the topics in a 50-topic model in their table 3. Two of the fifty topics are WOMAN (woman, friend wife, sister, husband) and WITCH (witch, villager, kid, boy, mom) – the label name is, as is typical for this research, the word with the highest PMI (pointwise mutual information) for the topic. There is no topic for MAN (because it is totally unmarked?) or ethnicity except for monsters – FAIRY, WEREWOLF, ALIEN. MAYA is a topic, but the other high PMI words in this topic are (monster, monk, goon, dragon). Granted that movies skew towards the interests of young males, these topics are not surprising, but it is likely that the makeup of contributors even further skews these results.

I am grateful for Alpert-Abrams and Garrette’s note. Wikipedia is increasingly being seen as a source of ground truth for any number of machine-learned models and systems. Its breadth and depth make it natural to use Wikipedia titles as a kind of list of things of interest (see, for example, the Wikifier project, which does entity extraction based on Wikipedia titles and content). It is a good reminder that the of the contributors to Wikipedia will bias results based on Wikipedia in certain systematic (and political) ways; understanding these is, as Alpert-Abrams and Garrette note, an important project in its own right.

A Fibonacci A Day: Using Streams and a Fast Memoization Method

Well, I didn’t think I’d write another Fibonacci A Day post — but there’s a great fibonacci function that’s defined right in the Scala documents, in their discussion of Streams.

stream is a standard Scala data structure which is very list like. Recall that a Scala list has two parts, a head and a tail; the head is a value, and the tail is either the special empty list (Nil), or another list. For example, the list created by List(“cat”, “mouse”, “cheese”) is a list whose head is “cat”, and whose tail is a list whose head is “mouse” and whose tail is a list whose head is “cheese” and whose tail is Nil (the cheese does not quite stand alone).

Elements in a stream (the values in their heads), on the other hand, are only evaluated when they are needed. For example, getting the nth value is a stream causes the steam to evaluate its first element, its second element, etc, up to n. Once evaluated, they are remembered (that is, memoized) so they the next time the element is requested, the value has already been computed (the lookup is still O(n), but no computation is done).

Here’s a definition of a fibonacci stream, pretty much as given in the documents

lazy val _fibs: Stream[BigInt] = 
  0 #:: 1 #:: (_fibs zip _fibs.tail map { case (x, y) ⇒ x + y })

This defines an “infinite” list of fibonacci numbers — infinite in the sense that the list can be as long as we have memory for. Note that Scala allows us to have recursive data structures as well as recursive functions. Here the tail of the _fibs stream (after the 2nd element, anyway) is defined in turns of itself — we zip the stream with the tail of the stream, and then each element is the sum of the two previous elements — just as we had defined.

We can access this list in a similar way to our previous functions to get positive and negative fibonacci numbers:

def stream_fibonacci(n: Int) = n match {
    case _ if (n >= 0)     ⇒ _fibs(n) // natural numbers 
    case _ if (n % 2 == 0) ⇒ -_fibs(-n) // even, negative numbers*/
    case _                 ⇒ _fibs(-n) // odd, negative numbers 
  }

How does compare in speed with our previous functions? Calling stream_fibonacci(1000) the first time results in zipping though the stream and calculating the sums. Somewhat surprisingly, this is not more expensive than our fastest version. Afterwards, though, we just “look up” the 1001st element, which requires traversing the stream’s first 1001 elements. But this is very fast, and is about 100x times faster on my machine (see the updated benchmarks).

Of course, this comes at the expense of memory storage for the 1001 fibonacci numbers in the stream. Depending on the application, this could be just fine. Of course, fibonacci numbers get big fast, so maintaining a large stream of them might not be good (the other versions only require two or so BigInts to be held in memory during the computation).

A Fibonacci A Day: Fibonacci(n) in O(lg(n)) steps

I am taking Martin Odensky’s Functional Programming in Scala class, and discovered (or rediscovered) that the Fibonacci function is discussed in SICP (Structure and Interpretation of Computer Programs, the book that Odensky seems to be basing much of his Functional Programming in Scala course on).

Even more interesting is that a Fibonacci function which runs in O(lg(n)) steps is presented — that is, it takes about lg(n) steps to compute the nth Fibonacci number, instead of n steps. It’s given as an exercise (Exercise 1.19). Bill the Lizard has been working through the SICP exercises, and his explanation of Exercise 1.19 is the clearest I’ve found. The code below might not make much sense without looking at the SICP chapter or his explanation, but here is the code in Scala, and put in the form that I’ve been using — this computes fibonacci(n) for any integer (including negative integers). Running the benchmark indicates that this version a lot faster than the straight tail recursive version.

  def fibonacci(n: Int): BigInt = {
    val limit = if (n < 0) -n else n
    /* the fibonacci function per se — on natural numbers */
    def fib(m: Int): BigInt = {
      @tailrec def recurse(a: BigInt, b: BigInt, p: BigInt, q: BigInt, count: Int): BigInt =
        if (count == 0) b
        else if (count % 2 == 0) {
          val qq = q * q
          recurse(a, b, (p * p + qq), (2 * p * q) + qq, count / 2)
        } else {
          val aq = a * q
          recurse(b * q + aq + a * p, b * p + aq, p, q, count - 1)
        }
      (m: @switch) match {
        case 0 ⇒ BigInt(0)
        case 1 ⇒ BigInt(1)
        case _ ⇒ recurse(1, 0, 0, 1, m)
      }
    }
    n match {
      case _ if (n >= 0)     ⇒ fib(n) // natural numbers 
      case _ if (n % 2 == 0) ⇒ -fib(-n) // even, negative numbers*/
      case _                 ⇒ fib(-n) // odd, negative numbers
    }
  }

 

A Fibonacci A Day

The goal of this series was to come up with a general, efficient fibonacci function that goes beyond the typical computer science introduction. The version we developed is fast, efficient with respect to memory, and general. I’d be glad to hear comments.

3200758169_e2701d1e17

The Series:

  1. A Fibonacci A Day — Introduction
  2. A Fibonacci A Day — Recursive Version
  3. A Fibonacci A Day — Special Pi Day Post
  4. A Fibonacci A Day — Iterative Version
  5. A Fibonacci A Day — Out of Bounds
  6. A Fibonacci A Day — Speedy Recursion
  7. A Fibonacci A Day — Golden Ration Version
  8. A Fibonacci A Day — Memoization
  9. A Fibonacci A Day — Benchmarking
  10. A Fibonacci A Day — Extending to Negative Numbers
  11. A Fibonacci A Day — Fibonacci in O(lg(n)) Steps
  12. A Fibonacci A Day — Using Streams and a Fast Memoization Method

 

A Fibonacci A Day — Extending to Negative Numbers

6040055534_e26371bc09We are now confident that the fast recursive version is the best of the algorithms we have explored — it is fast, and can handle large values of the Fibonacci function.

I haven’t been happy, though, with returning a negative -1 when the function is called with a negative number — this feels like a kind of silent error we were trying to avoid in going to BigInts. Fortunately, it turns out that there is a natural extension of the Fibonacci sequence to negative numbers. If the nth Fibonacci number is defined as F(n) = F(n-2) + F(n-1), then we can rearrange this to F(-n) = (-1)nF(n).

So, this is the final version of the Fibonacci function, that supports any integer, up to the the memory size of the computer, computed quite quickly. Much better than our simple recursive or iterative version, which were ripe for error and inefficiency.

 def fibonacci(n: Int): BigInt = {
    val limit = if (n < 0) -n else n
    /* the fibonacci function per se — on natural numbers */
    def fib(m: Int): BigInt = {
      @tailrec def recurse(i: Int, u: BigInt, v: BigInt): BigInt =
        if (i == limit) v
        else recurse(i + 1, v, u + v)
      (m: @switch) match {
        case 0 ⇒ BigInt(0)
        case 1 ⇒ BigInt(1)
        case _ ⇒ recurse(1, 0, 1)
      }
    }
    n match {
      case _ if (n >= 0)     ⇒ fib(n) // natural numbers 
      case _ if (n % 2 == 0) ⇒ -fib(-n) // even, negative numbers*/
      case _                 ⇒ fib(-n) // odd, negative numbers
    }
  }

I hope you’ve enjoyed these series.

 

A Fibonacci A Day — Benchmarking

Race
The following table describes the limits of the various Fibonacci algorithms I’ve created during this series. I ran them through a very simple benchmarking routine, and empirically tested them for the maximum size they can handle. Each benchmark was run 1000 times, except the first algorithm, which was run just once — we already know it runs too slow. These were run on a MacBook Air (Late 2010).

Log Recursive (BigInt)0.0082Unlimited (except for memory)

Algorithm Milliseconds per call for F(46) Max size
Simple Recursive 21,360.00 40 or so
Classic Iterative (Int) 0.0040 46
Classic Iterative (Long) 0.0040 164
Classic Iterative (Double) 0.0030 76
Classic Iterative (BigInt) 0.0080 unlimited (except memory)
Tail Recursive (BigInt) 0.0060 unlimited (except memory)
Using Phi (BigInt) 0.0030 70 or so
Memoization (BigInt) 0.0020 unpredictable
Memoization Monad (BigInt) 0.0940 unpredictable

Here is a table comparing the versions which allow really big fibonacci numbers, calculating F(20000) 100 times each.

Algorithm Milliseconds per call for F(20000) Max size
Classic Iterative (BigInt) 24.23 unlimited (except memory)
Tail Recursive (BigInt) 23.50 unlimited (except memory)
Log Recursive (BigInt) 1.00 unlimited (except memory)
Stream (BigInt), first time 3.19 unlimited (except memory)
Stream (BigInt), subsequent calls 0.22 unlimited (except memory)

F(20000) is a really big integer, and a lot of this time is creating and destoying the memory associated with creating the integers. The thing to notice is that the iterative and the tail recursive versions take approximately the same amount of time — the recursive version actually runs a little more quickly than the iterative version. And the Log recursive version of the same is two orders of magnitude faster — — the stream version is yet another order faster, but it may use too much memory. Note that even the first use is faster than the iterative and tail recursive versions!

Still, there’s really no reason not to use the BigInt, log recursive version of this function — it’s both fast and accurate.

(April 2, 2013: Updated to include log recursive version.

(April 13, 2013: Updated to include < a href="http://willwhim.wpengine.com/2013/04/13/a-fibonacci-a-day-using-streams-and-a-fast-memoization-method/">stream version; rerun benchmarks for F(20000); and update conclusions.

A Fibonacci A Day — Memoization

SONY DSC

Although the iterative recursive version is stateless, it looks different from the standard mathematical definition. The recursive version is a very close mapping, but it’s extremely inefficient. One way to overcome the inefficiencies is to dynamically record (or memoize) previous calculations of F(n). The idea is simple — if we already know the values of F(n – 1) and F(n – 2), we can simply add them without having to do the difficult steps in recreating them.

Here is a version that does just that —

  val memo = scala.collection.mutable.HashMap[Int, BigInt]()
  def fibonacci_8(n: Int): BigInt = {
    def fib(i: Int): BigInt =
      memo.getOrElseUpdate(i, fibonacci_8(i))
    n match {
      case 0 ⇒ BigInt(0)
      case 1 ⇒ BigInt(1)
      case m ⇒ fib(m - 1) + fib(m - 2)
    }
  }

Note that a memoization table is defined outside the scope of the function, and an internal function, fib looks for a memoized value in the table. If it finds it, it simply returns it; if it doesn’t, it makes a recursive call to the outer fibonacci function. Consider F(2). It looks up F(1), which is not yet in the table, so it calls the outer fibonacci function with 1. That returns BigInt(1), which is stored in the table and returned. It does a similar thing with F(0). It now knows enough to calculate the results, which it returns.

The main body of this function looks very similar to the mathematical definition. But there are two disadvantages to this version. One is that we still have mutable state (the memoization table). A clever, if somewhat convoluted way, to overcome this is to use a State monad —Tony Morris describes how to do this in his article,  Memoization WIth State Using Scala. Another problem is this is somewhat less predictable — calling F(2000) out of the gate is likely to fail (with too many stack calls) unless most of the intermediate values are already calculated. That is, fibonacci_8(2000) will probably fail, but for (i ← 1 to 2000) fibonacci_8(n) might not. I have more problems with the second issue than the second —

Side note: English already had a word for remembering things by rote, memorization, and a technical term, cacheing as well. Why did memoization seem necessary?

A Fibonacci A Day — Golden Ratio version

6202116053_165d2c6414

The Fibonacci sequence has a very strong connection to the Golden Ratio (φ, or Phi). φ is defined as (√5 + 1)/2, or approximately 1.618, and has lots of special properties. Ron Knott has a web page with over 300 Fibonacci and φ formulas. One of these is this direct formula: F(n) = round(φn/√5), where the rounding function does rounding to the nearest whole number.

This gives us a quite an efficient function:

def fibonacci_7(n: Int) : Long = {
    val sqrt5 = math.sqrt(5.0)
    val φ = (sqrt5 + 1) / 2.0
    math.round(math.pow(φ, n) / sqrt5)
  }

Because the math.round function produces a Long value, we can’t use BigInts as before. In fact, due to rounding error (in the calculations of φ,√5, and φn, this version is only exact to F(70); after that, there are low-end digits wrong (and, as we saw before, can only handle values up to F(176) even with inaccuracies.