Publications by Alexej's blog
Salaries by alma mater – an interactive visualization with R and plotly
Based on an interesting dataset from the Wall Street Journal, I made the above visualization of the median starting salary for US college graduates from different undergraduate institutions (I have also looked at the mid-career salaries and the salary increase, but more on that later). However, I thought that it would be a lot more informative, i...
1347 sym R (112 sym/1 pcs) 2 img
5 ways to measure running time of R code
A reviewer asked me to report detailed running times for all (so many :scream:) performed computations in one of my papers, and so I spent a Saturday morning figuring out my favorite way to benchmark R code. This is a quick summary of the options I found to be available. A quick online search revealed at least three R packages for benchmarking R ...
5317 sym R (3029 sym/10 pcs) 2 img
Freedman’s paradox
Recently I came across the classical 1983 paper "A note on screening regression equations" by David Freedman. Freedman shows in an impressive way the dangers of data reuse in statistical analyses. The potentially dangerous scenarios include those where the results of one statistical procedure performed on the data are fed into another procedure per...
4263 sym R (2189 sym/5 pcs) 2 img
Mining USPTO full text patent data – Analysis of machine learning and AI related patents granted in 2017 so far – Part 1
The United States Patent and Trademark Office (USPTO) provides immense amounts of data (the data I used are in the form of XML files). After coming across these datasets, I thought that it would be a good idea to explore where and how my areas of interest fall into the intellectual property space; my areas of interest being machine learning (ML),...
11316 sym Python (1573 sym/2 pcs) 16 img
Probabilistic interpretation of AUC
Unfortunately this was not taught in any of my statistics or data analysis classes at university (wtf it so needs to be :scream_cat:). So it took me some time until I learned that the AUC has a nice probabilistic meaning. What’s AUC anyway? Consider: A dataset $(x_1, y_1), \dots, (x_n, y_n)$, where $x_i$ is a vector of features collected for the $i$th subject, $y_i$ is the $i$th subject’s...
4483 sym R (1237 sym/4 pcs) 74 img
Probabilistic interpretation of AUC
Unfortunately this was not taught in any of my statistics or data analysis classes at university (wtf it so needs to be :scream_cat:). So it took me some time until I learned that the AUC has a nice probabilistic meaning. What’s AUC anyway? AUC is the area under the ROC curve. The ROC curve is the receiver operating characteristic curve. AUC is...
5876 sym R (1431 sym/5 pcs) 86 img