Introduction to Sentiment Analysis: A Tale of Distinct Classes

Given some document (in plain text, preferably), how can we determine the emotional state or sentiment of the writer towards a particular entity at the time of production? Provided we have a sizable corpus of documents —  static or streaming — the answer to this question could be useful for several reasons. Ideally, a Sentiment Analysis tool should be able to estimate an answer to the following time-constrained, yet important questions: what’s the present favorability rating for Obama? How likable was the president just before the election? Knowing the answers to these questions could significantly sway political strategies of different organizations. Similarly, knowing answers to several other comparably phrased yet diverse questions could be very useful to a lot of organizations, brands, and people.

Let’s back up a little bit. We’ve been talking about determining the sentiment of a document in a very broad sense. What answer do we expect to get out of Sentiment Analysis? An extended description of the writer’s sentiments at the time of writing or a Yes/No/Neutral answer? It turns out that the latter is easier, and in some cases even preferable. The Yes/No/Neutral answer is much less opaque and much more suitable for analysis. There are probably other answering models but I’ll from hence concentrate on the Yes/No/Neutral answer.

You can consider the Yes/No/Neutral answer as a three-option, one-choice answer. Loosely speaking, according to this model, there are three classes: Yes, No, Neutral. But for reasons to be uncovered shortly, we’ll use only two classes: Yes, No. A document that has been determined to be positive-enough — with positivity above a certain threshold —  will be placed in the Yes class. On the other hand, documents negative-enough — with negativity above a certain threshold — will be placed in the No class. Documents not in the Yes/No class can be safely placed in the Neutral class. How do we determine the thresholds? Well, through training our model on pre-marked data. Binary classification is much easier to interpret and more amenable to several statistical models. At this point, we’ve been able to formulate the problem as a binary classification problem. The problem of statistical classification is well-known and well-researched.

Out of a cornucopia of classification methods, the best methods (judged by popularity, effectiveness, and ease-of-use) in my humble opinion are:

  1. Naive Bayes
  2. Maximum Entropy (Logistic Regression)
  3. Support Vector Machines

Using any (or a combination) of these methods, we can classify documents into the Yes-No(-Neutral) classes using a trained model. The method that should be used in production is the one with the highest accuracy on the test set.

Nick Jones and I, for our final project in our NLP class, built a twitter Sentiment Analysis tool. We tried two classifiers: Naive Bayes, and Maximum Entropy. Naive Bayes turned out to be more accurate (83.01% accuracy on the test set). Since our training set is a bit old (based on tweets made in 2006), we made a web application to classify more recent tweets in real-time via the twitter API. For more information on this, check out our github repo. There are several improvements we hope to make to this tool. But this is a start. The goal of this project was to experiment with some of the methods used in Sentiment Analysis. We sure did.

Fear of Starting

I’m not going to take my hands off my computer until I’m done with a draft of this blog post.

I want to learn how to be unafraid of  starting a job or project or task, moving towards a goal, initiating a transaction, and so on. Easier said than done. So I’m going to do this now. Ride with me. You might just be as afraid of starting as I am. So flow with me. I’m going to envelope all my fears (for now) somewhere in the foggy landscape of Utah (where I currently reside). And I hope that I don’t find it again, after the atmosphere is less fuzzy, in my head and in the troposphere.

As I write this post, I wander if I’m ever going to “finish” it. By “finish”, I mean complete writing a blog post with perfect flow, heavy but sensible content, and impeccable grammar. If I don’t even start, I’ve clearly failed. If I do complete drafting the blog post, then maybe I’ve succeeded at finding a coarse stepping stone to a better blog post in the future. But am I ever going to “finish” this blog post? Logically, I can’t because no one’s perfect, right. Or so they say. Therefore, the only reasonable line of action is to complete the draft and convince myself that my draft is “good enough.” What’s holding me back is perfectionism. My quest for perfection in writing a blog post is not invisible in other parts of my life. Perhaps, if and when I wrap up this blog post, I could be inspired to wrap up the myriad unfinished projects I have, the lingering tasks my head is convinced is still moldable to perfection. But then, Vince Lombardi once said, “Perfection is not attainable, but if we chase perfection, we catch excellence.” Why should I be afraid of the unattainable? Fear of mediocrity. Flush.

I use a task management software called WunderList that currently has more than 100 unaccomplished non-academic tasks created by me. These tasks aren’t heavy-weight. I know that I can finalize each in less than 3 hours if I put my mind to it. Almost all were created just before the start of Fall semester. During the school year, I inexplicably convinced myself that if I try to accomplish any of these non-academic tasks, I will not have enough time to read for the upcoming test in 1 week, to prepare my worksheets for the fast approaching prefect sessions, or to attend the birthday party I was invited to. Bullshit. Now that I think about it, this fear is completely unreasonable. Because I didn’t even try to start any of these tasks. Fear of the unknown. Flush.

Both fears, the fear of the unknown and the fear of mediocrity actually aren’t trivial. H.P. Lovecraft, the inflential author of several popular horror fictions, said that the “oldest emotion of mankind is fear, and the oldest and strongest kind of fear is fear of the unknown.” As a result, fighting these fears, especially the fear of the unknown, is hard. But I’m going to START fighting them by STARTING several tasks.

I’m done! I mean, I’m done writing a draft of this blog post. Is this draft perfect? Clearly not. But it’s a good start. I’ve moved one step closer towards perfection. I’ve buzzing towards infinity (and beyond). I’m more spirited. Most of all, I’ve started!

Cardinality Estimation in Linear Time using Sub-Linear Space


The cardinality of a collection A (which might be an ordered or unordered list, a set, or what not) is basically the number of unique values in A. For example, the collections [1,2,3,4] and [1,2,1,3,1,4,3] have the same cardinality of 4 (and also correspond to the same set).

Determing the Cardinality of a Collection: The Naive Approach

Consider a collection A=[1,2,1,3,1,4,3]. How can we systematically determine the cardinality of A? Well, here are two of many ways to do this:

  1. First sort A in ascending order. Then, we can perform a linear scan on A to remove duplicates. It’s pretty easy to see how this can be done. Finally, return the size of the possibly trickled-down collection (now a set) obtained. If the initial size of A is n. Then, the cardinality of A, using this method, can be determined in O(n\, log\, n) (if we use merge-sort) time and O(1) extra space.
  2. Use a hash table: Perform a linear scan of A, hashing the values of A. It’s easy to see that cardinality of A is the number of keys in the hash table obtained. This uses O(n) time but O(n) extra space also.

Notice that we can’t do any better (lower upper-bound) than O(n) because we have to look at the entire input (which is of size n). But can we determine the cardinality of A in O(n) time using sub-linear space (using strictly smaller space than n)?

That’s where probability comes in.

Linear Probabilistic Counting

This is a probabilistic algorithm for counting the number of unique values in a collection. It produces an estimation with an arbitrary accuracy that can be pre-specified by the user using only a small amount of space that can also be pre-specified. The accuracy of linear counting depends on the load factor (think hash tables) which is the number of unique values in the collection divided by the size of the collection. The larger the load factor, the less accurate the result of the linear probabilistic counter. Correspondingly, the smaller the load factor, the more accurate the result. Nevertheless, load factors much higher than 1 (e.g. 16) can be used while achieving high accuracy in cardinality estimation (e.g. <1% error).

Note: in simple hashing, the load factor cannot exceed 1 but in linear counting, the load factor can exceed 1.

Linear counting is a two-step process. In step 1, the algorithm allocates a bit map of a specific size in main memory. Let this size be some integer m. All entries in the bit map are initialized to “0”‘s. The algorithm then scans the collection and applies a hash function to each data value in the collection. The hash function generates a bit map address and the algorithm sets this addressed bit to “1”. In step 2, the algorithm first counts the number of empty bit map entries (equivalently, the number of entries set to “0”). It then estimates the cardinality of the collection by dividing this count by the bit map size m (thus obtaining the fraction of empty bit map entries. Call this V_n).

Plug in V_n and m into the equation r = m\, log_e\, V_n to obtain r which is the estimated cardinality of the collection. The derivation of this equation is detailed in this paper[1].

Here’s my simple implementation of the Linear Probabilistic Counting Algorithm.

Errors in Counting

Although the linear probabilistic counting method is faster than deterministic approaches, it might sometimes fail to be accurate as explained above. So this method should be used only when 100% accuracy is not needed. For example, in determining the number of unique visitors to a website.

Probabilistic Alternatives to Linear Probabilistic Counting

The HyperLogLog algorithm is such an alternative. It also runs in linear time (linear in the size of the initial collection). But the HyperLogLog counter usually gives a more accurate estimation of the cardinality count and also uses less space. See this paper for more details on the HyperLogLog counter.


[1] A Linear-time Probablistic Counting Algorithm for Database Applications