SutoriaL – Page 3 – My Personal Website

Delete Index With PyArango

December 12, 2018 No Comments

I am using Python to create some background services and web services at the office. Our main database is ArangoDB, so we use pyArango to connect to it.

At one point I needed to delete an index from a collection, but I could not find any example of index deletion. So after reading the code and the API documentation, I am sharing an example here.

Connect to database

from pyArango.connection import Connection

conn = Connection(arangoURL='http://<host>:<port>', username='user', password='pass')
db = conn['dbname']

collname = db['collname']

To get the list of indexes, we can use getIndexes()

print(collname.getIndexes())

We get the indexes grouped by type, with each index ID mapped to its index object

{'primary': {'collname/0': <pyArango.index.Index object at 0x7f44404b85c0>}, 'hash': {'collname/7490651': <pyArango.index.Index object at 0x7f44404b8860>, 'collname/7490654': <pyArango.index.Index object at 0x7f44404b8828>}, 'skiplist': {}, 'geo': {}, 'fulltext': {}}

Before we continue, we can look at the details of the Index class on the GitHub page https://github.com/tariqdaouda/pyArango/blob/master/pyArango/index.py. Then we list the attributes of an index object.

print(dir(collname.getIndexes()['hash']['collname/7490654']))

Index attributes:

['URL', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_create', 'collection', 'connection', 'delete', 'indexesURL', 'infos']

We can see the index information with collname.getIndexes()['hash']['collname/7490654'].infos

{'deduplicate': True, 'fields': ['keyTime'], 'id': 'collname/7490654', 'selectivityEstimate': 0.007465129984938773, 'sparse': False, 'type': 'hash', 'unique': False}

If we want to delete the index, just call the delete method.

collname.getIndexes()['hash']['collname/7490654'].delete()

We will get a success response after the index is deleted.
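When the index ID is not known in advance, the nested dictionary returned by getIndexes() can be searched by field name. The following is only a minimal sketch: find_index_ids is a hypothetical helper (it is not part of pyArango), and the stand-in objects, including the 'title' field, are made up so that the code runs without a database. It only assumes the structure shown above, where each index object has an infos dictionary with a 'fields' list.

```python
from types import SimpleNamespace


def find_index_ids(indexes, field):
    """Return the IDs of all indexes that cover `field`.

    `indexes` is assumed to have the shape printed by getIndexes():
    {index_type: {index_id: index_object}}, where each index object
    exposes an `infos` dict containing a 'fields' list.
    """
    matches = []
    for by_id in indexes.values():
        for index_id, index in by_id.items():
            if field in index.infos.get('fields', []):
                matches.append(index_id)
    return matches


# Stand-in objects (hypothetical data) so the sketch runs without a database.
fake_indexes = {
    'primary': {'collname/0': SimpleNamespace(infos={'fields': ['_key'], 'type': 'primary'})},
    'hash': {
        'collname/7490651': SimpleNamespace(infos={'fields': ['title'], 'type': 'hash'}),
        'collname/7490654': SimpleNamespace(infos={'fields': ['keyTime'], 'type': 'hash'}),
    },
    'skiplist': {}, 'geo': {}, 'fulltext': {},
}

print(find_index_ids(fake_indexes, 'keyTime'))  # ['collname/7490654']
```

With a real collection, the same helper could be called as find_index_ids(collname.getIndexes(), 'keyTime'), and the matching index object could then be deleted with its delete() method as shown above.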

SutoriaL

Removing Stop Words

October 20, 2018 2 Comments

I had a task to count words in some articles. The task was to get a list of IDs from Elasticsearch, fetch the content from ArangoDB, then do some text processing to clean the content and count the word frequencies.

After doing it in Scala, Go, and Python, I found that the Python version was very slow. Scala and Go took only around 3-4 seconds to process 12,563 articles, but Python took around 15-18 seconds. After some profiling, I finally found that the slow part was removing stopwords from a large number of articles. I was using a common Python method to remove the stopwords.

from collections import Counter

stopwords = open('stopwords.txt').read().splitlines()
word_count = Counter([i for i in all_words.split() if len(i) > 4 and i not in stopwords])

The stopword removal and counting with Counter take around 14 seconds of a total run time of about 16.5 seconds.

$ time python counting.py --start 2018-09-01 --end 2018-09-30
'get_es_data'  802.65 ms
'get_arango_data'  449.94 ms
'clean_texts'  1286.84 ms
'word_count'  13980.26 ms
Total articles: 12563
Top words: ['jokowi', 'presiden', 'indonesia', 'ketua', 'partai', 'prabowo', 'widodo', 'jakarta', 'negara', 'games']
Total time execution (in seconds): 16.5261652469635

real    0m16.647s
user    0m15.668s
sys     0m0.288s

Then, after reading about how other people do text processing, I found this good article. It says that Python dictionaries use hash tables, which means that a lookup operation (e.g., if x in y) is O(1) on average. A lookup in a list may have to iterate over the entire list, resulting in O(n) for a list of length n.

So I tried converting the stopwords list to a dictionary.

stopwords = open('stopwords.txt').read().splitlines()
stop_dicts = dict.fromkeys(stopwords, True)
word_count = Counter([i for i in all_words.split() if len(i) > 4 and i not in stop_dicts])

Then we get a much faster result than the previous one with the list type.

$ time python counting.py --start 2018-09-01 --end 2018-09-30
'get_es_data'  787.50 ms
'get_arango_data'  461.97 ms
'clean_texts'  1311.56 ms
'word_count'  785.08 ms
Total articles: 12563
Top words: ['jokowi', 'presiden', 'indonesia', 'ketua', 'partai', 'prabowo', 'widodo', 'jakarta', 'negara', 'games']
Total time execution (in seconds): 3.3524017333984375

real    0m3.503s
user    0m2.560s
sys     0m0.220s

The whole process now takes only about 3.5 seconds, and the word_count step drops from around 14 seconds to under 1 second. It is very fast, isn’t it? 😀
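The difference between list and hash-based lookups can be reproduced with a small self-contained sketch. The data below is made up (it does not come from stopwords.txt or the real articles), and count_words is a hypothetical helper name. A set is included as well, since it is also hash-based and is a common alternative to a dictionary when only membership matters.

```python
from collections import Counter
import timeit

# Hypothetical sample data standing in for stopwords.txt and the article text.
stopwords = ['word%05d' % n for n in range(2000)]
all_words = ' '.join(['presiden', 'indonesia', 'word00042', 'jakarta', 'word01999'] * 500)


def count_words(text, stop_container):
    """Count words longer than 4 characters that are not in stop_container."""
    return Counter(w for w in text.split() if len(w) > 4 and w not in stop_container)


stop_list = stopwords                       # O(n) membership test
stop_dict = dict.fromkeys(stopwords, True)  # O(1) average membership test
stop_set = set(stopwords)                   # O(1) average membership test

for name, container in [('list', stop_list), ('dict', stop_dict), ('set', stop_set)]:
    seconds = timeit.timeit(lambda: count_words(all_words, container), number=3)
    print('%-4s %.4f s' % (name, seconds))
```

All three containers produce the same counts; only the cost of each `not in` check changes. The exact timings depend on the machine and on the size of the stopword list.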

SutoriaL

Tensorflow GPU Memory Usage (Using Keras)

July 13, 2018 No Comments

I am using Keras with the TensorFlow backend for my project. At first, I was a bit surprised by the memory usage. Although my model is no larger than 10 MB, it still uses all of my GPU memory.

After reading the TensorFlow documentation, I found out that, by default, TensorFlow maps nearly all of the GPU memory. The purpose is to reduce memory fragmentation. The documentation also describes two different approaches that can be used to handle this situation.

The first method is limiting the memory usage by percentage. For example, you can limit the application to only 20% of your GPU memory. If your GPU has 8 GB of memory, the application will use 1.6 GB. (I am using Keras, so the example is done the Keras way.)

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2
set_session(tf.Session(config=config))
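The arithmetic behind the fraction is simple, and it can also be turned around to pick a fraction for a desired cap. This is just an illustrative sketch: memory_cap_gb and fraction_for_cap are hypothetical helper names and are not part of TensorFlow or Keras.

```python
def memory_cap_gb(total_gb, fraction):
    """Upper bound on GPU memory (in GB) for a given per-process fraction."""
    return total_gb * fraction


def fraction_for_cap(total_gb, cap_gb):
    """Fraction to configure in order to cap usage at roughly cap_gb."""
    return cap_gb / total_gb


print(round(memory_cap_gb(8, 0.2), 2))     # 1.6
print(round(fraction_for_cap(8, 2.0), 2))  # 0.25
```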

After setting the configuration, we could see the difference.

It only uses 1.6 GB of memory. But with this configuration we only set a limit by percentage, and the application still reserves 100% of that limit. We cannot tell exactly how much memory the application actually needs. So we will try the second method.

The second method is using allow_growth. This option makes the application allocate only as much GPU memory as its runtime allocations require, so the GPU memory usage grows as needed.

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
set_session(sess)

And this is the result.

So, after trying both of these methods, which one will you use for your application? 😀
