Publications by Bogumił Kamiński
Plotting gain chart
A gain chart is a popular way to visually inspect model performance in binary prediction. It presents the percentage of captured positive responses as a function of the selected percentage of a sample. It is easy to obtain with the ROCR package by plotting “tpr” against “rpp”. However, it is worth noting that gain chart can be...
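As a rough illustration of what such a chart contains, here is a minimal base-R sketch that computes the two quantities directly instead of calling ROCR. The simulated `score` and `y` vectors are assumptions made up for this example and are not data from the post.

```r
# Hypothetical data: a score in [0, 1] and a binary response drawn from it.
set.seed(1)
n <- 200
score <- runif(n)
y <- rbinom(n, 1, score)

# Order observations from highest to lowest score.
ord <- order(score, decreasing = TRUE)

# rpp: share of the sample selected; tpr: share of positives captured so far.
rpp <- seq_len(n) / n
tpr <- cumsum(y[ord]) / sum(y)

plot(rpp, tpr, type = "l",
     xlab = "Rate of positive predictions",
     ylab = "True positive rate",
     main = "Gain chart")
abline(0, 1, lty = 2)  # reference line for random selection
```

The dashed diagonal corresponds to selecting observations at random, so the further the curve lies above it, the better the score ranks the positives.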
Factor to class-membership matrix
Recently on R-bloggers I found a post from the chem-bla-ics blog concerning the conversion of factors to integer vectors. At the end it posed the problem of converting a factor variable to a class-membership matrix. Several nice solutions were provided in the comments. Notable among them, the function classvec2classmat from the kohonen package does the tr...
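For readers who want to see the target structure, the following is a minimal base-R sketch of a class-membership (indicator) matrix. It does not use the kohonen package and is not necessarily one of the solutions from the comments; the factor `f` is a made-up example.

```r
# Hypothetical factor with three levels.
f <- factor(c("a", "b", "a", "c"))

# Compare every observation with every level; multiplying by 1
# turns the logical matrix into a 0/1 numeric matrix.
m <- 1 * outer(as.character(f), levels(f), "==")
colnames(m) <- levels(f)

m  # one row per observation, one column per level
```

Each row contains exactly one 1, in the column of the level that the observation belongs to.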
Applying multiple functions to data frame
A very typical task in data analysis is calculating summary statistics for each variable in a data frame. The standard lapply or sapply functions work very nicely for this but apply only a single function. The problem is that I often want to calculate several different statistics of the data. For example assume that we want to calcu...
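One possible way to do this in base R is to nest two sapply calls, as sketched below. The choice of statistics and the use of the built-in iris data are assumptions for illustration, since the excerpt does not say which ones the post uses.

```r
# Hypothetical set of statistics, stored as a named list of functions.
stats <- list(mean = mean, sd = sd, median = median)

# Outer sapply loops over columns, inner sapply over the functions.
res <- sapply(iris[1:4], function(col) {
  sapply(stats, function(f) f(col))
})

res  # rows: statistics, columns: variables
```

Because the list is named, the rows of the result are labelled with the statistic names, and adding another statistic only requires extending the list.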
Plotting randu dataset
Recently I stumbled on the help page for the randu data from the datasets package. It contains pseudorandom numbers that are flawed. The help says that “In three dimensional displays it is evident that the triples fall on 15 parallel planes in 3-space.” So I decided to generate a plot that would show this. If you simply plot the...
randu dataset, part 2
In my last post I plotted the randu dataset to show that all its points lie on 15 parallel planes. But I was not fully satisfied with the solution and decided to show this numerically. It can be done in four steps: identifying four points lying on the same plane and finding its equation (we know that we have 15 planes so it is enough ...
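A shorter numerical check is sketched below. It is not the four-step procedure described in the post. It relies instead on an assumption taken from the known definition of the RANDU generator, whose recurrence implies that 9x - 6y + z is an integer for every triple of successive values. The tolerance allows for the six-decimal rounding of the stored data.

```r
# randu ships with the datasets package (columns x, y, z).
v <- with(randu, 9 * x - 6 * y + z)

# Distance of each value from the nearest integer; it should be tiny,
# limited only by the rounding of the published data.
max(abs(v - round(v)))

# Each distinct integer corresponds to one plane 9x - 6y + z = k.
table(round(v))
```

Since x, y and z all lie between 0 and 1, the expression can only take integer values from -5 to 9, which is consistent with the 15 planes mentioned in the help page.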
Working with isTRUE
This week I was running computations that transform some input files into output files. The problem was that it was a repeated process. If new input files were generated or old ones were updated, I needed to calculate new output files. The transformation was time consuming, so I wanted to run the calculations only when required. My initial...
Comparing model selection methods
The standard textbook analysis of different model selection methods, like cross-validation or a validation sample, focuses on their ability to estimate in-sample, conditional or expected test error. However, another interesting question is how they compare in their ability to select the true model. To test this I have thought to generate ...
Stability of classification trees
Classification trees are known to be unstable with respect to training data. Recently I read an article on the stability of classification trees by Briand et al. (2009). They propose a quantitative similarity measure between two trees. The method is interesting and it inspired me to prepare a simple example based on test data showing in...
Optimal regularization for smoothing splines
In the smooth.spline procedure one can use the df or spar parameter to control the smoothing level. Usually they are not set manually, but recently I was asked which of them is a better measure of the regularization level. Hastie et al. (2009) discuss the advantages of df but I thought of a simple graphical illustration of th...
Programming traps when using "sample"
The standard sample function works differently when it gets a single-element integer vector than when it gets a longer vector. This can lead to unexpected bugs in R code. Several times I had a problem with code similar to the one given here:
for (i in 1:4) {
  x <- i:4
  print(sample(x))
}
#[1] 4 1 2 3
#[1] 3 2 4
#[1] 4 3
#[1] 3 2 1 4
When l...
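A common way to avoid this trap is to sample positions rather than values, as in the sketch below. The helper name `safe_sample` is hypothetical and is introduced only for this illustration; the post may propose a different fix.

```r
# Hypothetical helper: permute the elements of x by sampling indices,
# so a length-one numeric vector is never expanded to 1:x.
safe_sample <- function(x) x[sample.int(length(x))]

x <- 4L
length(sample(x))       # sample() treats 4 as 1:4, so four values come back
length(safe_sample(x))  # the helper returns the single element

safe_sample(2:4)        # longer vectors are permuted as usual
```

The help page for sample documents this behaviour and suggests a similar index-based approach.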