Publications by Data Science notes
Locality Sensitive Hashing in R
Introduction: In the next series of posts I will try to explain the basic concepts of the Locality Sensitive Hashing technique. Note that I will try to follow a general functional programming style, so I will use R’s higher-order functions instead of the traditional *apply family of functions (I suppose this will make the post more readable for non-R users). Also I...
6057 sym R (3180 sym/20 pcs) 3 tbl
Analyzing texts with text2vec package
Over the last few weeks I have been actively working on text2vec (formerly tmlite), an R package which provides tools for fast text vectorization and state-of-the-art word embeddings. This project is an experiment for me: what can a single person do in a particular area? After these hard weeks, I believe they can do a lot. There are a lot of changes from ...
8523 sym R (5698 sym/49 pcs) 8 img
GloVe vs word2vec revisited.
Today I will start publishing a series of posts about experiments on English Wikipedia. As I said before, text2vec is inspired by gensim, a well-designed and quite efficient Python library for topic modeling and related NLP tasks. I also found Radim’s posts very useful, where he tried to evaluate some algorithms on an English Wikipedia dump. This d...
12522 sym R (4425 sym/12 pcs) 12 img
text2vec GloVe implementation details
Before reading this post, I strongly recommend reading the original GloVe paper and Jon Gauthier’s post, which provides a detailed explanation of a Python implementation. That post helped me a lot with the C++ implementation. Word embedding: After Tomas Mikolov et al. released the word2vec tool, there was a boom of articles about word vector representations. One of t...
7998 sym R (5074 sym/8 pcs)
text2vec 0.3
Today I’m pleased to announce a preview of the new version of text2vec. It is located in the 0.3 development branch, but very soon (probably in about a week) it will be merged into master. To reproduce the examples below, please install text2vec@0.3 from GitHub: devtools::install_github('dselivanov/text2vec@0.3') Also I’m waiting for fe...
6687 sym R (5017 sym/32 pcs)
text2vec 0.4
Introducing text2vec 0.4. Today I’m pleased to announce a new major release of text2vec: text2vec 0.4, which is already on CRAN. For readers who are not familiar with text2vec, it is an R package which provides an efficient framework with a concise API for text analysis and natural language processing. With this release I also launched p...
11931 sym R (7015 sym/52 pcs) 8 img