I am trying to tokenise a simple string:
Here is a smiling face: 😀!
My code is:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    from nltk.tokenize.casual import TweetTokenizer

    s = u"Here is a smiling face: 😀!"
    s1 = TweetTokenizer().tokenize(s)
    print(s1)
And here is what I get:
[u'Here', u'is', u'a', u'smiling', u'face', u':', u'\ud83d', u'\ude00', u'!']
Shouldn't the smiley face come back as ONE token? (Is this due to a "narrow build" of Python?)
And how do I combine the \ud83d and the \ude00 and print them, so I can see that the result contains a smiley face?
I am using Python 2.7 and NLTK 3.0.5 on OS X Yosemite 10.10.5.
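To make the "combine" part concrete, here is a minimal sketch of what rejoining the two halves could look like, assuming they really are a UTF-16 high/low surrogate pair that the tokenizer split apart. The names `combine_surrogates` and `merge_surrogate_tokens` are hypothetical helpers made up for illustration; they are not part of NLTK, and this is not necessarily how the tokenizer should be fixed.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 surrogate pair (ints)."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def is_high_surrogate(tok):
    # A single code unit in the range U+D800..U+DBFF
    return len(tok) == 1 and 0xD800 <= ord(tok) <= 0xDBFF


def is_low_surrogate(tok):
    # A single code unit in the range U+DC00..U+DFFF
    return len(tok) == 1 and 0xDC00 <= ord(tok) <= 0xDFFF


def merge_surrogate_tokens(tokens):
    """Hypothetical helper: rejoin adjacent high/low surrogate tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if (is_high_surrogate(tok) and i + 1 < len(tokens)
                and is_low_surrogate(tokens[i + 1])):
            cp = combine_surrogates(ord(tok), ord(tokens[i + 1]))
            # Build the character from a \UXXXXXXXX escape; on a narrow
            # Python 2 build this yields the two-unit pair as ONE string.
            escaped = "\\U%08x" % cp
            merged.append(escaped.encode("ascii").decode("unicode-escape"))
            i += 2
        else:
            merged.append(tok)
            i += 1
    return merged


tokens = [u"face", u":", u"\ud83d", u"\ude00", u"!"]
merged = merge_surrogate_tokens(tokens)
print(hex(combine_surrogates(0xD83D, 0xDE00)))  # 0x1f600
print(len(merged))  # 4
```

With this, the third element of `merged` is a single token for U+1F600, and printing it on a UTF-8 terminal should show the smiley face. Whether the split itself comes from a narrow build can be checked with `sys.maxunicode`, which is 65535 (0xFFFF) on narrow builds.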
Source: Stack Overflow