A recent post on the iOS Developer Tips blog provided a handy way to get the the word count for a string by using NSScanner
, and asked for comments on alternative approaches. Pretty quickly there were a few different suggestions so I thought that I’d take a look at them to see how they compare. It turns out that the different approaches give pretty different results when run over the same test string! To be honest this isn’t much of a surprise, but what was surprising is just how different the results were.
I tested the original scanner based approach and also the first four alternatives from the comments. For the test string I used this:
1
|
|
there are a couple of things to note here: some of the spaces are doubled up, the period is spaced French-style (i.e. with a space before and after) and the em-dashes have hair-space at either side of them. It’s easier to see some of these features when you look at the same string in a proportional font: “Peter piper picked a peck of pickled pepper . No — really — he did!”
Anyway, the various approaches gave very different word counts for that example:
Method | Count |
---|---|
Original Scanner | 15 |
Regular Expression | 12 |
String Components | 18 |
Char Components | 22 |
Linguistic tagger | 12 |
Anywhere from 12 to 22 words! Let’s take a look at the different approaches in turn.
Original Scanner
This is the original NSScanner
based version from John’s post, here’s the code for it:
1 2 3 4 5 6 7 8 9 |
|
This version correctly handles runs of whitespace, but it treats any non-space character as a valid word, so the French-spaced period get’s counted, as do the two em-dashes. Note however that this version does correctly pick up the four hair-spaces.
Regular Expression
This is my contribution:
1 2 3 4 5 |
|
Obviously this isn’t production code as there is no error handling (or caching of the compiled regex, which may or may not make sense here). But I’d say that this version gives the correct result, both ignoring the French-stop and em-dashes, and handling all of the spaces correctly.
String Components
This is by far the simplest solution, provided by Frank in the comments:
1 2 3 4 |
|
Unfortunately it doesn’t work at all for this string. Just looking at an actual space character means that the double spaces get counted twice, and the entire substring “No — really — he” gets treated as a single word!
Note though, that this approach is really easy to understand, and would be good if the input text had already been heavily normalized.
Char Components
Almost the same as the previous version, except that this uses an NSCharacterSet
instead of a string:
1 2 3 4 5 |
|
Compared to the previous version this one still double counts the 2-space wide spaces, but it correctly detects the hair-spaces surrounding the em-dashes. Useful I guess if your text has been partially normalized by collapsing runs of spaces.
Linguistic Tagger
This one was interesting as it’s an API that I haven’t seen before:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
This code returns the correct number of words, so we have another winner here! Although the code is definitely more complicated than the regex based version above. Also, the originally posted code gave a result of 30, as it also calls the block for whitespace and punctuation, you need to use the tag
block parameter to disambiguate these.
The linguistic tagger provides a number of advanced features which may be useful if you need more than just a simple word count though. Note, for example, the sentence
block parameter which could be used to give a sentence count as well as a word count.
Conclusion
For most text the simplest solution is to use a regular expression here. If your input text has already been normalized then the componentsSeparatedByString:
based approach is probably the easiest to use. The linguistic tagger allows for more advanced analysis of the text.
Update: all of the code here, plus a main
function to call it, is available as a gist.
Update: I talk a little more about normalization and linguistic tagging in this post.