Digital Magpie

Ooh, ooh, look - shiny things!

What Is Text Normalization?

In my previous post I mentioned that some of the word counting approaches might be suitable if the input text has been normalized, but I didn’t really elaborate on what that means. According to Wikipedia:

Text normalization is a process by which text is transformed in some way to make it consistent in a way which it might not have been before.

The article also gives some examples of the kinds of transformation that are commonly performed. Of necessity, any normalization process is going to be application specific, but let’s assume for the sake of example that the word count is intended to be used in a writing application of some sort (a text editor or word processor). Given that, we probably don’t care about Unicode normalization, and definitely don’t care about anything which would change the words themselves, such as stemming or canonicalization. But maybe we could normalize all runs of whitespace into single spaces? Our original test string then changes from “Peter  piper  picked  a  peck  of  pickled  pepper . No — really — he did!” to “Peter piper picked a peck of pickled pepper . No — really — he did!”. The difference is probably hard to spot, but all of the doubled spaces in the first string have been replaced with single spaces, and the hair-spaces have been replaced with regular spaces.
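
A minimal sketch of that whitespace normalization might look something like this (my own illustration, not code from the gist, and the function name is made up); it uses NSRegularExpression to replace every run of whitespace with a single regular space, and the \s class should also cover Unicode space characters such as the hair-spaces:

NSString* normalizeWhitespace(NSString* string)
{
  // Collapse every run of whitespace (spaces, hair-spaces, tabs, newlines)
  // into a single regular space.
  NSRegularExpression* regex =
      [NSRegularExpression regularExpressionWithPattern:@"\\s+"
                                                 options:0
                                                   error:nil];
  return [regex stringByReplacingMatchesInString:string
                                         options:0
                                           range:NSMakeRange(0, [string length])
                                    withTemplate:@" "];
}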

How do the different word counting functions work now?

Method               Raw   Normalized
Original Scanner      15           15
Regular Expression    12           12
String Components     18           15
Char Components       22           15
Linguistic Tagger     12           12

Better, but it still leaves only the same two functions returning the correct result (assuming, of course, that you don’t want to count strings of punctuation; this may or may not be what you want in, say, a code editor).

I can’t speak for the inner workings of the linguistic tagger, but the reason that the regex based function works is that it bases its approach on a whitelist rather than a blacklist. The regex basically says “these are valid word characters, everything else can be ignored”, whereas all of the other functions take the stance “these are whitespace, everything else must be part of a word”. Anybody who has done web development, or input validation in general, will tell you that whitelists are almost always the correct approach to take. It’s just easier to enumerate all of the valid values for a given set than to try to list all of the exceptions.

Linguistic Tagger

There are quite a few more options available for analysing text here. Let’s start by counting sentences as well as words; this can be done by adding a count for sentences and keeping track of the current sentence based on its starting location. The interesting code is the pair of lines at the end of the block, which bump the sentence count whenever the sentence range’s location changes:

__block NSUInteger words = 0;
__block NSUInteger sentences = 0;
__block NSUInteger current_sentence = 0;
[tagger enumerateTagsInRange:NSMakeRange(0, [string length])
                      scheme:NSLinguisticTagSchemeTokenType
                     options:0
                  usingBlock:^(NSString* tag, NSRange token, NSRange sentence, BOOL *stop) {
  if ([tag isEqual:NSLinguisticTagWord]) ++words;
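  // a new sentence starts whenever the sentence range's location changes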
  if (!sentences || current_sentence != sentence.location) ++sentences;
  current_sentence = sentence.location;
}];

Updating the taggerWordCount function with this code tells us that we still have 12 words, and that they are spread over 2 sentences. Cool!

But what about that schemes parameter that we used to set up the tagger and run the enumeration? It allows the tagger to provide different types of information to the enumeration, and we can tell the tagger to tag as much as it can by initializing the schemes variable with all of the available schemes. The en-GB string, by the way, is a BCP-47 code. The list of available schemes for this language is shown as a comment:

NSArray* schemes = [NSLinguisticTagger availableTagSchemesForLanguage:@"en-GB"];
NSLog(@"%@", schemes);

// 2012-04-17 13:08:16.947 wordcounters[54440:707] (
//    TokenType,
//    Language,
//    Script
// )

According to Apple’s docs there are several different schemes available. One warning: if you use more specific BCP-47 codes (such as en-US or pt-BR) then you will just get the basic three schemes shown above; plain en gets the full list, and other languages have varying levels of support.
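
For example (my own quick check, with the output paraphrased as a comment), asking for the schemes for plain en should return the full set covered in the list below:

NSLog(@"%@", [NSLinguisticTagger availableTagSchemesForLanguage:@"en"]);
// expected to include: TokenType, LexicalClass, NameType,
// NameTypeOrLexicalClass, Lemma, Language, Script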

Let’s alter the test string and see what the different en schemes give us. For a new test string I’m going to use this little ditty:

NSString* coffee = @"What I want - is a proper cup ’o coffee,"
                   @" Made in a proper copper coffee pot."
                   @" Ik kan van mijn punt,"
                   @" Ach ba mhaith liom cupán caife o ó pota caife cuí."

The 3rd and 4th lines have been replaced with Dutch and Irish translations of the English words in order to test the language detection. Interesting to note here is the syntax used for multi-line strings in Objective-C, and also that I’ve indented the following lines so that there is a space after the punctuation at the end of the preceding line.

Let’s take a look at each scheme and what it gives us in this example.

  • Token Type We can tell the words apart from the whitespace and punctuation by the tag. I could see this being useful for implementing smart punctuation in a word processor (like SmartyPants).

  • Lexical Class Instead of just words this gives us nouns, adjectives, and so on; it also classifies some of the punctuation more precisely, for example OpenQuote. Possibly useful in a word processing application, or to provide input to a higher-level analyser.

  • Name Type This attempts to detect people and place names in the text. In this example it identified “Made” as a place name, so it’s probably guessing at this based on the word capitalization.

  • Name Type or Lexical Class As it suggests, a combination of the previous two schemes.

  • Lemma This scheme performs word stemming, returning the stemmed word in the tag block parameter (there’s a short usage sketch after this list).

  • Language This supposedly analyses each sentence to try to guess which language it is written in. I found that it worked fairly poorly when the languages used the same script, but did OK when the scripts were different. In the example above it guesses that all of the text is in English, but if you change the 3rd line to “Аз не мога да ми.” (the same in Bulgarian) then it guesses this correctly.

  • Script This is the script used in the token; for us it is always “Latn” for Latin, unless you make the substitution mentioned above, in which case it correctly picks up “Cyrl” for the Bulgarian Cyrillic script.
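
To give a flavour of how one of the other schemes is used, here’s a rough sketch (my own, not code from the gist) that enumerates the lemma scheme over the coffee string from above; for this scheme the tag block parameter is the stemmed word itself, and the options flags skip the whitespace and punctuation tokens:

NSArray* schemes = [NSLinguisticTagger availableTagSchemesForLanguage:@"en"];
NSLinguisticTagger* tagger = [[NSLinguisticTagger alloc] initWithTagSchemes:schemes
                                                                    options:0];
[tagger setString:coffee];
[tagger enumerateTagsInRange:NSMakeRange(0, [coffee length])
                      scheme:NSLinguisticTagSchemeLemma
                     options:NSLinguisticTaggerOmitWhitespace | NSLinguisticTaggerOmitPunctuation
                  usingBlock:^(NSString* tag, NSRange token, NSRange sentence, BOOL *stop) {
  // the tag here is the stem, e.g. "Made" should come back as "make"
  NSLog(@"%@ -> %@", [coffee substringWithRange:token], tag);
}];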

Conclusion

For a simple word count it seems that the regular expression wins out, but the linguistic tagger provides some interesting additional information. One downside to the tagger is that it doesn’t seem to be extensible in any way, so you’re limited to those schemes and tags that Apple ship with the OS. There is no way to, for example, use this mechanism to tag keywords and operators in a code editor, which may be useful.

The code used for this post can be found in this gist.

How Many Words Make a String?

A recent post on the iOS Developer Tips blog provided a handy way to get the word count for a string by using NSScanner, and asked for comments on alternative approaches. Pretty quickly there were a few different suggestions so I thought that I’d take a look at them to see how they compare. It turns out that the different approaches give pretty different results when run over the same test string! To be honest this isn’t much of a surprise, but what was surprising is just how different the results were.

I tested the original scanner based approach and also the first four alternatives from the comments. For the test string I used this:

NSString* string = @"Peter  piper  picked  a  peck  of  pickled  pepper . No — really — he did!";

there are a couple of things to note here: some of the spaces are doubled up, the period is spaced French-style (i.e. with a space before and after), and the em-dashes have hair-spaces on either side of them. It’s easier to see some of these features when you look at the same string in a proportional font: “Peter piper picked a peck of pickled pepper . No — really — he did!”

Anyway, the various approaches gave very different word counts for that example:

Method               Count
Original Scanner        15
Regular Expression      12
String Components       18
Char Components         22
Linguistic Tagger       12

Anywhere from 12 to 22 words! Let’s take a look at the different approaches in turn.

Original Scanner

This is the original NSScanner based version from John’s post; here’s the code for it:

NSUInteger scannerWordCount(NSString* string)
{
  NSScanner* scanner = [NSScanner scannerWithString:string];
  NSCharacterSet* ws = [NSCharacterSet whitespaceAndNewlineCharacterSet];
  NSUInteger words = 0;
  while ([scanner scanUpToCharactersFromSet:ws intoString:nil])
    ++words;
  return words;
}

This version correctly handles runs of whitespace, but it treats any non-space character as a valid word, so the French-spaced period gets counted, as do the two em-dashes. Note however that this version does correctly pick up the four hair-spaces.

Regular Expression

This is my contribution:

NSUInteger regexWordCount(NSString* string)
{
  NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern:@"\\w+" options:0 error:nil];
  return [regex numberOfMatchesInString:string options:0 range:NSMakeRange(0, [string length])];
}

Obviously this isn’t production code, as there is no error handling (or caching of the compiled regex, which may or may not make sense here). But I’d say that this version gives the correct result, both ignoring the French-spaced period and the em-dashes, and handling all of the spaces correctly.
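
For what it’s worth, a sketch of a more production-ready variant might cache the compiled regex and at least report a compilation error (the function name here is just for illustration):

NSUInteger cachedRegexWordCount(NSString* string)
{
  // compile the pattern once and reuse it for every call
  static NSRegularExpression* regex = nil;
  static dispatch_once_t once;
  dispatch_once(&once, ^{
    NSError* error = nil;
    regex = [NSRegularExpression regularExpressionWithPattern:@"\\w+"
                                                      options:0
                                                        error:&error];
    if (!regex) NSLog(@"failed to compile word count regex: %@", error);
  });
  return [regex numberOfMatchesInString:string
                                options:0
                                  range:NSMakeRange(0, [string length])];
}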

String Components

This is by far the simplest solution, provided by Frank in the comments:

NSUInteger componentsByStringWordCount(NSString* string)
{
  return [[string componentsSeparatedByString:@" "] count];
}

Unfortunately it doesn’t work at all for this string. Splitting on a literal space character means that each doubled space produces an extra, empty component, and the entire substring “No — really — he” (whose only separators are hair-spaces, not regular spaces) gets treated as a single word!

Note though, that this approach is really easy to understand, and would be good if the input text had already been heavily normalized.

Char Components

Almost the same as the previous version, except that this uses an NSCharacterSet instead of a string:

NSUInteger componentsByCharsWordCount(NSString* string)
{
  NSCharacterSet* ws = [NSCharacterSet whitespaceAndNewlineCharacterSet];
  return [[string componentsSeparatedByCharactersInSet:ws] count];
}

Compared to the previous version this one still produces an extra, empty component for each doubled space, but it correctly detects the hair-spaces surrounding the em-dashes. Useful, I guess, if your text has been partially normalized by collapsing runs of spaces.

Linguistic Tagger

This one was interesting as it’s an API that I haven’t seen before:

NSUInteger taggerWordCount(NSString* string)
{
  NSArray* schemes = [NSArray arrayWithObject:NSLinguisticTagSchemeTokenType];
  NSLinguisticTagger* tagger = [[NSLinguisticTagger alloc] initWithTagSchemes:schemes
                                                                      options:0];
  [tagger setString:string];
  __block NSUInteger words = 0;
  [tagger enumerateTagsInRange:NSMakeRange(0, [string length])
                        scheme:NSLinguisticTagSchemeTokenType
                       options:0
                    usingBlock:^(NSString* tag, NSRange token, NSRange sentence, BOOL *stop) {
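    // the block is also called for whitespace and punctuation, so check the tag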
    if ([tag isEqual:NSLinguisticTagWord]) ++words;
  }];
  return words;
}

This code returns the correct number of words, so we have another winner here, although the code is definitely more complicated than the regex based version above. Also, the originally posted code gave a result of 30 because the block is also called for whitespace and punctuation; you need to use the tag block parameter to disambiguate these.

The linguistic tagger provides a number of advanced features which may be useful if you need more than just a simple word count though. Note, for example, the sentence block parameter which could be used to give a sentence count as well as a word count.

Conclusion

For most text the simplest solution is to use a regular expression here. If your input text has already been normalized then the componentsSeparatedByString: based approach is probably the easiest to use. The linguistic tagger allows for more advanced analysis of the text.

Update: all of the code here, plus a main function to call it, is available as a gist.

Update: I talk a little more about normalization and linguistic tagging in this post.

Radial Menus

Some alternative menu / control styles with a radial theme, linked from here so that I don’t lose them:

(I should probably just use Pinboard or something like that…)

Developing for the BlackBerry

For a change from the day job I’ve been doing some mobile development, all iOS up to now, and I’ve got to say it’s a pretty nice development experience - especially with the new features (e.g. ARC, the new literals) that are being added to Objective-C. But then earlier this week I was asked to look into writing a BlackBerry app at work, which led me to look into the different options that are available for that platform. Here’s what I looked into:

After working with the iPhone SDK, all three of the options I tried (the BlackBerry Java SDK, Appcelerator Titanium, and PhoneGap / Cordova) left a lot to be desired! Herewith, a summary of their shortcomings…

BlackBerry Java SDK

First off let me say that there are too many development options for the BlackBerry platform, even an Android emulation layer if you’re targeting their tablet. It’s a bit of a joke really.

Given their enterprise strengths, RIM should concentrate on one good Java-based SDK and drop the Android layer. And rather than push their own WebWorks SDK they should provide good support for Appcelerator and PhoneGap, which would at least give them a growing stable of cross-platform apps written using those toolkits.

A final gripe: their simulator is killingly slow to launch; when running in debug mode (which is required to get full console output) it takes 5 minutes to launch on a reasonably modern Windows laptop.

Appcelerator Titanium

I like the idea behind Titanium: native components driven by a JavaScript (or CoffeeScript!) engine, but the current implementation didn’t inspire confidence. The installers for both Mac and Windows were buggy. I encountered several errors during installation and the Eclipse based IDE failed to install the BlackBerry components.

I expect that if you are just targeting iPhone and Android then Titanium is probably a viable option, although the problems that I had just getting it installed would give me pause before selecting it.

PhoneGap / Cordova

The installation process was much smoother, and on the Mac it works with Xcode rather than installing an Eclipse based IDE. That said, the BlackBerry support again seemed to be quite poor, and only available on Windows.

One advantage of PhoneGap is that it’s just HTML5, so you have access to the growing number of excellent frameworks for mobile development (e.g. jQuery Mobile or Spine.mobile) and there is also the opportunity to reuse some code between your mobile app and a web based version.

Conclusion

For the internal app that I’m working on I’m sticking with the BB Java SDK for now, although if I were going to be doing more than a single small app I would probably invest the time to get comfortable with PhoneGap and use that (of course, at the same time I’m trying to persuade the client that iOS is a better choice).

I’d definitely use PhoneGap if I needed to write a cross-platform app as it seems to be the more mature option.

Installing to the Local Maven Repo With Gradle

I’ve been playing around with different build tools for my Java projects recently, having never been very happy with Maven. Probably the best that I’ve found is Gradle: it has an easy to use build file format, and seems pretty flexible if you need to do something a little differently.

Unfortunately the documentation isn’t as comprehensive as it could be, and one of the areas where it’s not too great is its interaction with the Maven repository system. So, here’s the magic incantation that you have to add to your build file in order to have gradle install install things correctly to your local repository:

apply plugin: 'maven'
configure(install.repositories.mavenInstaller) {
    pom.project {
        groupId 'com.example'
        artifactId 'project-name'
        inceptionYear '2011'
        packaging 'jar'
        licenses {
            license {
                name 'Eclipse Public License (Version 1.0)'
                url 'http://www.eclipse.org/legal/epl-v10.html'
                distribution 'repo'
            }
        }
    }
}

this will install the project binaries; to also install sources and JavaDocs (which every project should really do) you’ll need to add:

task sourcesJar(type: Jar, dependsOn:classes) {
    classifier = 'sources'
    from sourceSets.main.allSource
}

task javadocJar(type: Jar, dependsOn:javadoc) {
    classifier = 'javadoc'
    from javadoc.destinationDir
}

artifacts {
    archives sourcesJar
    archives javadocJar
}

Update: here is a demo project with a complete build file and a ‘hello, world’ sample class; you should be able to just unzip this and then run gradle install to install it into your local repo (tested with Gradle 1.3).