Removing Stop Words

I had a task to count words in some articles. The task was to get a list of IDs from Elasticsearch, fetch the content from ArangoDB, then do some text processing to clean the content and count the word frequencies.

After implementing it in Scala, Go, and Python, I found that the Python version was very slow. The Scala and Go versions took only around 3-4 seconds to process 12,563 articles, but the Python version took around 15-18 seconds. After some profiling, I finally found that the slow part was removing stopwords from a large number of articles. I was using a common Python approach to remove the stopwords.

from collections import Counter
# Load the stopwords, one per line, into a list
stopwords = open('stopwords.txt').read().splitlines()
word_count = Counter([i for i in all_words.split() if len(i) > 4 and i not in stopwords])

The stopword removal and counting step with Counter takes around 14 seconds, out of a total run of about 16.5 seconds.

$ time python --start 2018-09-01 --end 2018-09-30
'get_es_data'  802.65 ms
'get_arango_data'  449.94 ms
'clean_texts'  1286.84 ms
'word_count'  13980.26 ms
Total articles: 12563
Top words: ['jokowi', 'presiden', 'indonesia', 'ketua', 'partai', 'prabowo', 'widodo', 'jakarta', 'negara', 'games']
Total time execution (in seconds): 16.5261652469635

real    0m16.647s
user    0m15.668s
sys     0m0.288s
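
Per-step timings like the ones above can be collected with a small decorator. This is only a minimal sketch: the decorator name `timed` and the stand-in `word_count` function are assumptions for illustration, not the code that produced the output above.

```python
import time
from functools import wraps

def timed(func):
    """Print how long each call to func takes, in milliseconds."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{func.__name__!r}  {elapsed_ms:.2f} ms")
        return result
    return wrapper

@timed
def word_count(text):
    # Hypothetical stand-in for the real counting step.
    return len(text.split())

total = word_count("jokowi presiden indonesia")
```

Wrapping each step this way makes it easy to see which one dominates the total run time.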

Then, after reading about how other people do text processing, I found this good article. It says that Python dictionaries use hash tables, which means that a lookup operation (e.g., if x in y) is O(1) on average. A lookup in a list has to iterate over the list until it finds a match, and over the entire list when there is no match, which is O(n) for a list of length n.

So I tried converting the stopwords list to a dictionary.
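
The difference is easy to see with a small benchmark. This is a rough sketch, assuming a synthetic list of 1,000 stopwords instead of the real stopwords.txt; the exact numbers will differ from machine to machine.

```python
import timeit

# Synthetic stopword list; the real one would come from stopwords.txt.
stop_list = [f"word{i}" for i in range(1000)]
stop_dict = dict.fromkeys(stop_list, True)

# A word that is not a stopword is the worst case for the list:
# every element is compared before the lookup fails.
probe = "jokowi"

list_time = timeit.timeit(lambda: probe in stop_list, number=10_000)
dict_time = timeit.timeit(lambda: probe in stop_dict, number=10_000)

print(f"list: {list_time:.4f} s, dict: {dict_time:.4f} s")
```

Most words in an article are not stopwords, so most lookups hit this worst case when the stopwords are kept in a list.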

stopwords = open('stopwords.txt').read().splitlines()
# Use the stopwords as dict keys so that `in` is a hash lookup
stop_dicts = dict.fromkeys(stopwords, True)
word_count = Counter([i for i in all_words.split() if len(i) > 4 and i not in stop_dicts])

This gives a much faster result than the previous version with the list.

$ time python --start 2018-09-01 --end 2018-09-30
'get_es_data'  787.50 ms
'get_arango_data'  461.97 ms
'clean_texts'  1311.56 ms
'word_count'  785.08 ms
Total articles: 12563
Top words: ['jokowi', 'presiden', 'indonesia', 'ketua', 'partai', 'prabowo', 'widodo', 'jakarta', 'negara', 'games']
Total time execution (in seconds): 3.3524017333984375

real    0m3.503s
user    0m2.560s
sys     0m0.220s

The whole process now takes only about 3.5 seconds, and the word_count step dropped from about 14 seconds to under 1 second. It is very fast, isn’t it? 😀
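
A set is also backed by a hash table, so it should work just as well as a dictionary here when only membership matters. Below is a minimal self-contained sketch of that variant. The function name `count_words`, the `min_length` parameter, and the sample text and stopwords are assumptions for illustration; I did not time this version on the real data.

```python
from collections import Counter

def count_words(text, stopwords, min_length=5):
    """Count words of at least min_length characters that are not stopwords."""
    stop_set = set(stopwords)  # hash-based, like the dict version
    return Counter(
        word for word in text.split()
        if len(word) >= min_length and word not in stop_set
    )

# Illustrative sample text and stopwords, not the real data.
sample = "presiden jokowi dan presiden prabowo yang hadir di jakarta"
counts = count_words(sample, ["dan", "yang", "hadir"])
print(counts.most_common(1))
```

Here `min_length=5` matches the `len(i) > 4` filter used in the snippets above.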
