nltk.tokenize.casual and emojis

By Bender Rodriguez

I am trying to tokenise a simple string:

Here is a smiling face: 😀!

My code is:

#!/usr/bin/python
# -*- coding: utf-8 -*-
from nltk.tokenize.casual import TweetTokenizer

# The input contains one emoji, U+1F600, which lies outside the BMP.
s = u"Here is a smiling face: 😀!"
s1 = TweetTokenizer().tokenize(s)
print(s1)

And here is what I get:

[u'Here', u'is', u'a', u'smiling', u'face', u':', u'\ud83d', u'\ude00', u'!']

Shouldn't the smiley face come back as ONE token? Is this due to a 'narrow build' of Python?
And how can I combine \ud83d and \ude00 and print them, so I can see that the result contains a smiley face?
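
For the second part, here is a rough sketch of one way the two halves might be stitched back together after tokenising. It assumes the split really is a narrow-build effect, i.e. the two tokens are the high and low surrogate of a single character. The helper names (is_high_surrogate, is_low_surrogate, join_pair, merge_surrogate_pairs) are made up for illustration and are not part of NLTK. On a narrow build, simply concatenating the two tokens should already give the character back; the UTF-16 round trip with 'surrogatepass' is only there for interpreters (Python 3.4+) where that is not the case.

```python
import sys

# True on "narrow" Python 2 builds, where unicode strings store UTF-16 code units.
NARROW_BUILD = sys.maxunicode == 0xFFFF


def is_high_surrogate(token):
    # A lead (high) surrogate is a single unit in U+D800..U+DBFF.
    return len(token) == 1 and 0xD800 <= ord(token) <= 0xDBFF


def is_low_surrogate(token):
    # A trail (low) surrogate is a single unit in U+DC00..U+DFFF.
    return len(token) == 1 and 0xDC00 <= ord(token) <= 0xDFFF


def join_pair(high, low):
    """Fuse a high and a low surrogate into one character (hypothetical helper)."""
    pair = high + low
    if NARROW_BUILD:
        # On a narrow build the two units side by side already are the character.
        return pair
    # Elsewhere (Python 3.4+), round-trip through UTF-16 to fuse the halves.
    return pair.encode("utf-16", "surrogatepass").decode("utf-16")


def merge_surrogate_pairs(tokens):
    """Return a new token list with split surrogate pairs rejoined."""
    merged = []
    i = 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and is_high_surrogate(tokens[i])
                and is_low_surrogate(tokens[i + 1])):
            merged.append(join_pair(tokens[i], tokens[i + 1]))
            i += 2
        else:
            # Ordinary tokens and unpaired surrogates pass through unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# The token list reported in the question.
tokens = [u"Here", u"is", u"a", u"smiling", u"face", u":",
          u"\ud83d", u"\ude00", u"!"]
merged = merge_surrogate_pairs(tokens)
```

Printing the list itself will still show escape sequences in Python 2, because that prints each token's repr. Something like print(u" ".join(merged)) on a UTF-8 terminal should show the actual smiley.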

I am using Python 2.7 and NLTK 3.0.5 on OS X Yosemite 10.10.5.

Thanks.

Source: Stack Overflow

    
