Publications by msuzen

Pitfalls in pseudo-random number sampling at scale with Apache Spark

15.06.2017

In many data science applications and in academic research, techniques involving Bayesian inference are now commonly used. One of the basic operations in Bayesian inference techniques is drawing instances from a given statistical distribution. This is, of course, the well-known pseudo-random number sampling. Most commonly used methods first genera...

3992 sym R (3517 sym/6 pcs) 2 img 4 tbl

Post-statistics: Lies, damned lies and data science patents

05.08.2017

US Patent (Wikipedia). Statistics is such an important field in our daily lives nowadays: the emerging, 50-year-old field of data science, now applied to almost every human activity, or post-statistics, a kind of post-rock, fusing operations research, data mining, software and performance engineering and of course a multitude of fields of statisti...

3548 sym 2 img 1 tbl

Understanding overfitting: an inaccurate meme in supervised learning

16.08.2017

Preamble. There is a lot of confusion among practitioners regarding the concept of overfitting. It seems that a kind of urban legend or meme, a piece of folklore, is circulating in data science and allied fields with the following statement: Applying cross-validation prevents overfitting and a good out-of-sample performance, low generalisation erro...

12075 sym R (4235 sym/12 pcs) 10 img 11 tbl

Teaching to machines: What is learning in machine learning entails?

16.11.2017

Preamble. Figure 1: The oldest learning institution in the world, the University of Bologna (Source: Wikipedia). Machine Learning (ML) is now a de facto skill for every quantitative job, and almost every industry has embraced it, even though the fundamentals of the field are not new at all. However, what does it mean to teach a machine? Unfortunately,...

3691 sym 4 img 2 tbl

Collaborative data science: High level guidance for ethical scientific peer reviews

12.05.2020

Preamble. Catalan Castellers are collaborating (Wikipedia). The availability of distributed code tracking tools and associated collaborative tools makes life much easier in building collaborative scientific tools and products. This is now especially important in data science, as it is applied in many different industries as a de facto standard. E...

2392 sym 2 img 1 tbl
