Removing Stop Words

I had a task to do word counting for some articles. In detail, the task was: get the list of IDs from Elasticsearch, fetch the content from ArangoDB, then do some text processing to clean the content and count the word frequency.

After implementing it in Scala, Go, and Python, I found out that the Python version is very slow. The Scala and Go versions only take around 3-4 seconds to process 12,563 articles, but the Python version takes around 15-18 seconds. After some profiling, I finally found out that removing the stopwords from such a large number of articles is the slow part. I was using the common Python approach to remove the stopwords.

from collections import Counter

stopwords = open('stopwords.txt').read().splitlines()
word_count = Counter([i for i in all_words.split() if len(i) > 4 if i not in stopwords])

It takes around 16 seconds to remove the stopwords and count them with the Counter function.

$ time python counting.py --start 2018-09-01 --end 2018-09-30
'get_es_data'  802.65 ms
'get_arango_data'  449.94 ms
'clean_texts'  1286.84 ms
'word_count'  13980.26 ms
Total articles: 12563
Top words: ['jokowi', 'presiden', 'indonesia', 'ketua', 'partai', 'prabowo', 'widodo', 'jakarta', 'negara', 'games']
Total time execution (in seconds): 16.5261652469635

real    0m16.647s
user    0m15.668s
sys     0m0.288s

Then, after reading about how other people do text processing, I found this good article. It says that Python dictionaries use hash tables, which means a lookup operation (e.g., if x in y) is O(1), while a lookup in a list means the entire list may need to be iterated, resulting in O(n) for a list of length n.
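Just to illustrate the difference in isolation, here is a small sketch (purely illustrative; the repetition count and the lookup word 'jokowi' are my own choices, not from the original script) that times a membership test against the stopword list and against a dict built from it:

from timeit import timeit

stopwords_list = open('stopwords.txt').read().splitlines()
stop_dicts = dict.fromkeys(stopwords_list, True)

# 'jokowi' appears in the top-words output above, so it is presumably not a stopword;
# a miss is the worst case for the list, because every element has to be compared.
print(timeit("'jokowi' in stopwords_list", globals=globals(), number=100000))  # O(n) per lookup
print(timeit("'jokowi' in stop_dicts", globals=globals(), number=100000))      # O(1) per lookup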

So I tried converting the stopwords list to a dictionary.

stopwords = open('stopwords.txt').read().splitlines()
# build a dict so that membership tests are O(1) hash lookups
stop_dicts = dict.fromkeys(stopwords, True)
word_count = Counter([i for i in all_words.split() if len(i) > 4 if i not in stop_dicts])

This gives a much faster result than the previous version that used a list.

$ time python counting.py --start 2018-09-01 --end 2018-09-30
'get_es_data'  787.50 ms
'get_arango_data'  461.97 ms
'clean_texts'  1311.56 ms
'word_count'  785.08 ms
Total articles: 12563
Top words: ['jokowi', 'presiden', 'indonesia', 'ketua', 'partai', 'prabowo', 'widodo', 'jakarta', 'negara', 'games']
Total time execution (in seconds): 3.3524017333984375

real    0m3.503s
user    0m2.560s
sys     0m0.220s

It only takes around 3 seconds to do the whole process. Very fast, isn’t it? 😀
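As a side note, a plain set gives the same O(1) membership behaviour and is arguably the more natural container for a pure lookup table; a small sketch under the same assumptions as the snippets above:

stop_set = set(open('stopwords.txt').read().splitlines())
word_count = Counter([i for i in all_words.split() if len(i) > 4 if i not in stop_set])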


Tensorflow GPU Memory Usage (Using Keras)

I am using Keras with the TensorFlow backend for my project. In the beginning, I was a bit surprised when I saw the memory usage. Although my model is not more than 10 MB, it was still using all of my GPU memory.

After reading the TensorFlow documentation, I found out that, by default, TensorFlow maps nearly all of the GPU memory. The purpose is to reduce memory fragmentation. The documentation also describes two different approaches that can be used to handle this situation.

The first method is limiting the memory usage by percentage. For example, you can limit the application to use only 20% of your GPU memory; with 8 GB of GPU memory, the application will use about 1.6 GB. (I am using Keras, so the example is written the Keras way.)

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
# allow the process to allocate at most 20% of the total GPU memory
config.gpu_options.per_process_gpu_memory_fraction = 0.2
set_session(tf.Session(config=config))

After setting the configuration, we could see the difference.

It only uses 1.6 GB of memory. But with this configuration we only limit by percentage, and the application still uses 100% of that limit, so we cannot know exactly how much memory the application really needs. That is why we will try the second method.

The second method is using allow_growth. This option makes the application allocate GPU memory based on runtime allocations, so it will only use as much GPU memory as it actually needs.

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
# allocate GPU memory on demand instead of mapping it all up front
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
set_session(sess)

And this is the result.

So, after trying these two methods, which one will you use for your application? 😀
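As an aside, if you are on TensorFlow 2.x with tf.keras (not the Keras + TF 1.x setup used in this post), the same allow_growth behaviour is exposed through tf.config; a minimal sketch, assuming a TensorFlow 2.x installation:

import tensorflow as tf

# Allocate GPU memory on demand instead of mapping nearly all of it up front.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)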


Gin Memcached Middleware

If you are using gin to build a web service and you want to use memcached to store your data, you will probably search for articles about how to create a middleware for that. After reading some articles, I only found this middleware, but I had no luck with it: I could not get a value for a key on one endpoint after setting it on another endpoint. So I skipped gin sessions and created my own middleware with gomemcache.

package main

import (
    "github.com/gin-gonic/gin"
    "github.com/bradfitz/gomemcache/memcache"
)

// MCMiddleware puts the memcache client into the request context so that
// every handler can retrieve it later with c.MustGet("mem").
func MCMiddleware(mc *memcache.Client) gin.HandlerFunc {
    return func(c *gin.Context) {
        c.Set("mem", mc)
        c.Next()
    }
}

func main() {
    r := gin.Default()

    mc := memcache.New("127.0.0.1:11211")

    r.Use(MCMiddleware(mc))

    r.GET("/set", func(c *gin.Context) {
        mem, _ := c.MustGet("mem").(*memcache.Client)
        mem.Set(&memcache.Item{Key: "somekey", Value: []byte("somevalue")})
        c.String(200, "ok\n")
    })

    r.GET("/get", func(c *gin.Context) {
        mem, _ := c.MustGet("mem").(*memcache.Client)
        data, _ := mem.Get("somekey")
        c.String(200, string([]byte(data.Value))+"\n")
    })

    r.Run(":8000")
}

Now you can set and get a value for a key from any endpoint.

linx@crawler ~ $ curl localhost:8000/set
ok
linx@crawler ~ $ curl localhost:8000/get
somevalue

Gogstash – Logstash Alternative

I am now working on some projects that need an application to monitor the web service. After some chit-chat, we decided to use ELK (Elasticsearch, Logstash, and Kibana). If you want to know what ELK is, just search on Google; there are plenty of articles about it.

If you have already read some articles about ELK, you will know that ELK is a stack for monitoring and analyzing all kinds of logs:

  • Elasticsearch: indexing the data.
  • Logstash: log processing / parsing.
  • Kibana: visualize the data.

But after trying to configure and run ELK, I found out that Logstash is too heavy to run on a server with a small specification. For this reason, I looked for Logstash alternatives, and I finally found Gogstash, a Logstash-like tool written in Go.

While reading the documentation, I found out that there are some differences between Gogstash and Logstash when it comes to filters (I use the grok filter in Logstash). I tried to apply the same pattern in Gogstash but it didn’t work, so I decided to use another filter: the gonx filter.

Although grok patterns and gonx patterns are different, it is not that difficult to write the configuration for the gonx filter, and after some modification Gogstash ran smoothly. For your information, I am using Flask to build the web service, and this is an example line from the application log.

192.168.100.57 - - [05/Dec/2017 16:27:27] "GET / HTTP/1.1" 200 -

Gogstash supports two configuration formats, JSON and YAML. Here is my YAML configuration.

input:
  - type: file
    path: '/home/linggar/webapp/nohup.out'

filter:
  - type: gonx
    format: '$clientip - - [$date $time] "$full_request" $status -'
    source: message
  - type: gonx
    format: '$verb $request HTTP/$httpversion'
    source: full_request

output:
  - type: elastic
    url: 'http://127.0.0.1:9200'
    index: gogstash_log
    document_type: testtype

And that’s it. Although Gogstash is not as powerful as Logstash, it is very lightweight and one of the many Logstash alternatives that you could try.
