Fast queries on a large MongoDB collection from Python -


I'm working on a project that involves visualizing an ever-growing database of around 100M entries (tweets), and I'm running into bottlenecks in Python that I'm not sure how to tackle.

Some details:

  1. The database is indexed on the fields I'm querying on, including time and text.

  2. Each entry in the collection contains a large, complex structure with around 100 nested fields.

  3. I'm projecting only a small number of columns, as the visualization requires just a fraction of the stored data.

  4. The fields being queried are of types string, float32/64, date, and id.

When querying data for a given date range from within the mongo shell, processing times are more than acceptable; however, the same queries made from within Python take orders of magnitude longer. While I think I have a decent understanding of why this happens, I don't have enough knowledge on the matter to find a solution.

I have used both PyMongo and Monary, both with disappointing results.

Are there any obvious solutions for bringing the processing time within Python closer to the time in the mongo shell? Ideas I have thought of include having Mongo save the query results to a separate collection before transferring them to Python, and trying a JavaScript-based solution instead of Python/pandas.
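The "save query results to a separate collection" idea can be done entirely server-side with an aggregation pipeline ending in `$out`, so the large documents never cross the wire into Python. A minimal sketch, assuming the collection and field names from the question (the output collection name `tweet_slice` is made up for illustration):

```python
import datetime


def slice_pipeline(start_time, end_time, out_collection='tweet_slice'):
    """Build an aggregation pipeline that filters a date range,
    keeps only the projected field, and writes the result to a
    separate collection server-side via $out."""
    return [
        {'$match': {'created_at': {'$gt': start_time, '$lte': end_time}}},
        {'$project': {'created_at': 1, '_id': 0}},
        {'$out': out_collection},  # materialize results inside MongoDB
    ]


# With a live connection (hypothetical names) this would run as:
#   db['streamingtwitterdb'].aggregate(slice_pipeline(start, end))
# after which Python only has to read the small 'tweet_slice' collection.
```

Because `$out` runs on the server, Python only ever sees the already-projected, already-filtered documents.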

This query (over a period of 10 seconds) using Monary returns 2878 rows and takes 76 seconds:

    start_time = datetime.datetime.strptime(
        '2017-09-09 00:00:00', '%Y-%m-%d %H:%M:%S').replace(
        tzinfo=timezone).astimezone(tz.tzutc())
    end_time = datetime.datetime.strptime(
        '2017-09-09 00:10:00', '%Y-%m-%d %H:%M:%S').replace(
        tzinfo=timezone).astimezone(tz.tzutc())

    columns = ['created_at']
    types = ['date']

    arrays = mon.query(
        'streamingtwitterdb',
        'streamingtwitterdb',
        {'created_at': {'$gt': start_time, '$lte': end_time}},
        columns,
        types
    )

    df = numpy.matrix(arrays).transpose()
    df = pd.DataFrame(df, columns=columns)

In the mongo shell, I can query over an hour of data almost instantaneously.
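For comparison, the shell-side query being timed looks roughly like this (database and field names taken from the question; the `ISODate` bounds are illustrative):

```
db.streamingtwitterdb.find(
    { created_at: { $gt:  ISODate("2017-09-09T00:00:00Z"),
                    $lte: ISODate("2017-09-09T00:10:00Z") } },
    { created_at: 1 }
)
```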

Try our prototype bson-numpy library. It avoids the overhead of PyMongo (which must translate documents into dicts before translating them into rows of a NumPy array), and the overhead of Monary (which slows down on large documents due to an n-squared algorithm that matches field names to NumPy columns). If you have issues, please let me know on the bson-numpy issue tracker.
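A rough sketch of how such a query might look with bson-numpy — the helper name is hypothetical, the `datetime64[ms]` mapping for BSON dates is an assumption of this sketch, and the prototype's API may change:

```python
import numpy as np

# dtype for the single projected column; mapping BSON dates to
# NumPy datetime64[ms] is an assumption of this sketch.
dtype = np.dtype([('created_at', 'datetime64[ms]')])


def query_to_ndarray(collection, start_time, end_time, n_docs):
    """Hypothetical helper around the prototype bson-numpy library.

    Assumes ``collection`` is a PyMongo Collection configured to return
    raw BSON, e.g.:

        from bson.codec_options import CodecOptions
        from bson.raw_bson import RawBSONDocument
        coll = client['streamingtwitterdb'].get_collection(
            'streamingtwitterdb',
            codec_options=CodecOptions(document_class=RawBSONDocument))
    """
    import bsonnumpy  # prototype library; API may differ between versions
    cursor = collection.find(
        {'created_at': {'$gt': start_time, '$lte': end_time}},
        {'created_at': True, '_id': False})
    # Decode raw BSON bytes straight into a structured NumPy array,
    # skipping the intermediate dict-per-document step entirely.
    return bsonnumpy.sequence_to_ndarray(
        (doc.raw for doc in cursor), dtype, n_docs)
```

The resulting structured array can be handed to `pd.DataFrame` without the `numpy.matrix(...).transpose()` detour in the question.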

