Behind the Scenes: Twitter, Part 2 - Lessons from email

This is the second in a three-part series about how we use Twitter as a support channel. Yesterday I wrote about the internal tool that we built to improve the way we handle tweets.

One of our criteria in finding or building a tool to manage Twitter was the ability to filter tweets based on content in order to find those that really need a support response. While we’re thrilled to see people sharing articles like this or quoting REWORK, from a support perspective our first goal is to find those people who are looking for immediate support so that we can get them answers as quickly as possible.
When we used Desk.com for Twitter, we cut down on the noise somewhat by using negative search terms in the query that was sent to Twitter: rather than searching just for “37signals”, we told it to search for something like “37signals -REWORK”. This was pretty effective at helping us to prioritize tweets, and worked especially well when there were sudden topical spikes (e.g., when Jason was interviewed in Fast Company, more than 5,000 tweets turned up in a generic ‘37signals’ search in the 72-hour period after it was published), but had its limitations: it was laborious to update the exclusion list, and there was a limit placed on how long the search string could be, so we never had great accuracy.
When we went to our own tool, our initial implementation took roughly the same approach—we pulled all mentions of 37signals from Twitter, and then prioritized based on known keywords: links to SvN posts and Jobs Board postings are less likely to need an immediate response, so we filtered accordingly.
Using these keywords, we were able to correctly prioritize about 60% of tweets, but that still left a big chunk of low-priority tweets mixed in with those that did need an immediate reply: for every tweet that needed an immediate reply, roughly three others that didn’t were still mixed into the stream to be handled.
I thought we could do better, so I spent a little while examining whether a simple machine learning algorithm could help.

Lessons from email

While extremely few tweets are truly spam, there are a lot of parallels between the sort of tweet prioritization we want to do and email spam identification. In both cases, you:

  • Have some information about the sender and the content.
  • Have some mechanism to correct classification mistakes.
  • Would rather err on the side of false negatives: it’s generally better to let spam end up in your inbox than to send that email from your boss into the spam folder.
Spam detection is an extremely well-studied problem, and there’s a large body of knowledge for us to draw on. While the state of the art in spam filtering has advanced, one of the earliest and simplest techniques generally performs well: Bayesian filtering.

Bayesian filtering: the theory

A disclaimer: I’m not a credentialed statistician or an expert on this topic. My apologies for any errors in explanation; they are inadvertent.
The idea behind Bayesian filtering is that you can estimate the probability that a given message is spam based on the presence of specific words or phrases.
If you have a set of messages that are spam and non-spam, you can easily compute the probability for a single word – take the number of messages that have the word and are spam and divide it by the total number of messages that have the word:
P(spam | word) = (number of spam messages containing the word) / (total number of messages containing the word)
In most cases, no single word is going to be a very effective predictor, and so the real value comes from combining the probabilities for a great many words. I’ll skip the mathematical explanation, but the bottom line is that by taking a mapping of words to emails that are known to be spam or not, you can compute the likelihood that a given new message is spam. If that probability is greater than a threshold, the email is flagged as spam.
This is all relatively simple, and for a reasonable set of words and messages you can do it by hand. There are refinements to deal with rare words, phrases, etc., but the basic theory is straightforward.
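
To make the combining step concrete, here is a minimal sketch in R of the classic naive Bayes combination for a message containing two words; the per-word probabilities are hypothetical numbers for illustration, not values from our data:

# Hypothetical P(spam | word) values for two words present in a message,
# each computed with the single-word formula above
p <- c(0.80, 0.75)

# Combine them under a naive independence assumption:
# P(spam | words) = prod(p) / (prod(p) + prod(1 - p))
combined <- prod(p) / (prod(p) + prod(1 - p))
combined  # ~0.92; above a 0.5 threshold, so the message would be flagged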

Building a classifier

Let’s take a look at the actual steps involved with building a classifier for our Twitter problem. Since it’s the toolchain I’m most familiar with, I’ll refer to steps taken using R, but you can do this in virtually any language.
Our starting point is a dataframe called “tweets” that contains the content of the tweet and whether or not it needed an immediate reply, which is the classification we’re trying to make. There are other attributes that might improve our classifier, but for now we’ll scope the problem down to the simplest form possible.
After some cleaning, we’re left with a sample of just over 6,500 tweets since we switched to our internally built tool, of which 12.3% received an immediate reply.

> str(tweets)
'data.frame': 6539 obs. of 2 variables:
 $ body   : chr "Some advice from Jeff Bezos http://buff.ly/RNue6l" "http://37signals.com/svn/posts/3289-some-advice-from-jeff-bezos" "Mutual Mobile: Interaction Designer http://jobs.37signals.com/jobs/11965?utm_source=twitterfeed&utm_medium=twitter" "via @37Signals: Hi my name is Sam Brown - I’m the artist behind Explodingdog. Jason invited me to do some draw... htt"| __truncated__ ...
 $ replied: logi FALSE FALSE FALSE FALSE FALSE FALSE ...

Even before building any models, we can poke at the data and find a few interesting things about the portion of tweets that needed an immediate reply given the presence of a given phrase:

Word           Portion requiring immediate reply
All tweets     12.3%
“svn”          0.2%
“job”          0.5%
“support”      17.3%
“highrise”     20.8%
“campfire”     26.9%
“help”         35.4%
“basecamp”     49.5%

This isn’t earth-shattering: it’s exactly what you’d expect, and it’s the basis for the rudimentary classification we initially used.
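
As a quick sketch, proportions like these can be computed directly from the dataframe; the helper below is illustrative, not code from the original analysis:

# Portion of all tweets that received an immediate reply
mean(tweets$replied)  # ~0.123

# Portion of tweets containing a given word that received an immediate reply
portion_replied <- function(word) {
  mean(tweets$replied[grepl(word, tweets$body, ignore.case = TRUE)])
}
portion_replied("basecamp")  # ~0.495 in this sample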
With our data loaded and cleaned, we’ll get started building a model. First, we’ll split our total sample in two to get a “training” set and a “test” set. We don’t want to include all of our data in the “training” of the model (computing the probabilities of a reply given the presence of a given word), because then we’d have no objective way to evaluate its performance.
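
One simple way to do that split in R; the even 50/50 split and the seed are assumptions for illustration:

set.seed(42)                       # hypothetical seed, for reproducibility
train_idx <- sample(nrow(tweets), floor(nrow(tweets) / 2))
training  <- tweets[train_idx, ]   # used to compute word probabilities
test      <- tweets[-train_idx, ]  # held out to evaluate the model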

The simplest model to start

I always like to start by building a very simple model – it helps to clarify the problem in your mind without worrying about the specifics of anything more advanced, and with no expectation of accuracy. In this case, one very simple model is to predict whether or not a tweet needs a reply based only on the overall probability that one does – in other words, randomly pick 12.3% of tweets as needing an immediate reply. If you do this, you end up with a matrix of predicted vs. actual that looks like:

            actual
predicted   FALSE   TRUE
    FALSE    2878    374
    TRUE      402     51

Here, we got the correct prediction in 2,929 cases and the wrong outcome in 776 cases; overall, we correctly classified the outcome about 79% of the time. If you run this a thousand times, you’ll get slightly different predictions, but they’ll be centered in this neighborhood.
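A minimal sketch of this random baseline, assuming the training/test split above:

# Flag a random ~12.3% of test tweets as needing an immediate reply,
# using the base rate observed in the training set
base_rate <- mean(training$replied)
predicted <- runif(nrow(test)) < base_rate
table(predicted, actual = test$replied)
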
This isn’t very good, for two reasons:

  1. As gross accuracy goes, it’s not that great: we could have built an even simpler model that always predicted that a message won’t require an immediate reply, because most (about 88%) don’t.
  2. We classified 374 messages that actually did need an immediate response as not needing one, which in practice means those people wouldn’t get a response as quickly as we’d like. That’s a pretty terrible hit rate: only 12% of tweets needing a reply were flagged immediately. This is the real accuracy measure we care about, and this model did pretty terribly in that regard (the arithmetic is sketched below).
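
From the matrix above, the two measures work out like this:

# Gross accuracy: all correct predictions over all predictions
(2878 + 51) / (2878 + 374 + 402 + 51)  # ~0.79

# Hit rate on tweets that actually needed a reply (the measure we care about)
51 / (51 + 374)                        # ~0.12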
Building a real model

To build our real model, we’ll start by cleaning and constructing a dataset that can be analyzed using the “tm” text mining package for R. We’ll construct a corpus of the tweet bodies and perform some light manipulations – stripping whitespace and converting everything to lower case, removing stop words, and stemming. Then, we’ll construct a “document term matrix”, which is a mapping of which documents have which words.

require(tm)
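
A sketch of the rest of that pipeline, picking up from the require(tm) line and assuming the “training” dataframe from the split above; identifiers like train_dtm are illustrative rather than taken from the original code, and stemDocument additionally needs the SnowballC package installed:

train_corpus <- Corpus(VectorSource(training$body))

# Light cleanup: strip whitespace, lowercase, drop English stop words, stem
train_corpus <- tm_map(train_corpus, stripWhitespace)
train_corpus <- tm_map(train_corpus, content_transformer(tolower))
train_corpus <- tm_map(train_corpus, removeWords, stopwords("english"))
train_corpus <- tm_map(train_corpus, stemDocument)

# The "document term matrix": which documents contain which terms
train_dtm <- DocumentTermMatrix(train_corpus)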