Among the various versions of Occam's razor, perhaps the most useful from a scientific point of view is the statement "among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected". Owing to the objective formulation of the statement, it is not surprising that there are mathematical and probabilistic interpretations (see the Wikipedia page on this). Here, I want to talk about a justification from a machine learning perspective. If we wish to train a classifier and the training data looks almost linearly separable, we wouldn't try to train a non-linear classifier, would we? The reason for this is actually more than a subjective appeal to Occam's razor. Any reader who has dealt with building classifiers will immediately know what I am talking about: it's generalizability. The more "complex" the classifier, the higher the probability that its prediction error on unseen data will be higher, even though its classification rate on the training data may actually be better than that of a linear classifier. Similarly, in statistical regression analysis, we want to select a model which uses fewer explanatory factors without compromising the explanation itself. This concept is elegantly quantified by formulations such as the Vapnik-Chervonenkis (VC) dimension and Rademacher complexity: the higher the VC dimension, the more "complex" the classifier is, and the less generalizable it is. Now that I am at the end of writing this, I cannot help but feel that the relationship between Occam's razor and the VC dimension is, obviously, obvious.
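To make this concrete, here is a minimal sketch (assuming scikit-learn and numpy are available; the synthetic dataset, the degree-10 polynomial features, and the 50/50 split are illustrative choices, not anything canonical). On nearly linearly separable data with a little label noise, the more complex model typically matches or beats the linear one on the training split, while its test accuracy is usually no better and often worse.

```python
# Sketch: on almost linearly separable data, a "complex" classifier can achieve
# a better training classification rate yet generalize no better (often worse)
# than a simple linear one.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Almost linearly separable data: two informative features plus a bit of label noise.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.5, flip_y=0.05,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

# Simple hypothesis class: a linear decision boundary.
linear_clf = LogisticRegression()
# "Complex" hypothesis class: degree-10 polynomial features (much higher VC dimension).
complex_clf = make_pipeline(PolynomialFeatures(degree=10),
                            LogisticRegression(max_iter=5000))

for name, clf in [("linear", linear_clf), ("degree-10 polynomial", complex_clf)]:
    clf.fit(X_train, y_train)
    print(f"{name:22s} train acc = {clf.score(X_train, y_train):.3f}, "
          f"test acc = {clf.score(X_test, y_test):.3f}")
```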
I don't think reviewers should be asked to recommend accepting or rejecting a paper. That decision should rest entirely with the editor, based on the reviewers' comments.
Let \(f\) denote the density function of a crowd in a given space \(X\), and let \(\nabla f\) denote its gradient vector field. If I am at a given point \(x\in X\), compute the integral curve \(c_{\nabla f}(t)\) such that \(c_{\nabla f}(0)=x\) and \( c_{\nabla f}'(t) = -\nabla f(c_{\nabla f}(t)) \). You have just computed the trajectory along which I will move.

When I am explaining persistence to engineers, I start from dendrograms for hierarchical clustering, since most of them are familiar with that concept, and then present persistence diagrams and barcodes as analogues for "higher order topological properties". I just realized that I had the wrong impression (because I had never thought about it) that 0-persistence, as a special case, has some property which makes dendrograms possible, as opposed to the "coarser" summaries, barcodes and persistence diagrams, used for higher-dimensional persistence. But this is only because we implicitly assume a canonical basis for \( C_0 \) in single-linkage clustering, i.e. 0-persistence. A natural canonical basis (one which makes some sort of sense) does not always exist for higher-dimensional chain spaces. However, in the case of the Rips filtration on point clouds with the Euclidean metric, especially in \(\mathbb{R}^2\) and \(\mathbb{R}^3\), we can use Alexander duality to define a canonical basis, and thus actually obtain a dendrogram. More generally, whenever we have a meaningful canonical basis for our chain spaces, we can have dendrograms for persistence.
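Here is a small sketch of the 0-dimensional case only (assuming numpy and scipy; the two-cluster point cloud is just an illustrative example, and the Alexander-duality construction for higher dimensions is not attempted). Single-linkage clustering is exactly 0-persistence of the Rips filtration, so the merge heights of the single-linkage dendrogram are the death times of the \(H_0\) classes, all of which are born at scale 0.

```python
# Sketch: read off the H0 barcode of a Rips filtration from a single-linkage dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two well-separated clusters in the plane (illustrative point cloud).
points = np.vstack([rng.normal(loc=(0, 0), scale=0.1, size=(20, 2)),
                    rng.normal(loc=(3, 0), scale=0.1, size=(20, 2))])

# Single-linkage clustering == 0-persistence of the Rips filtration.
Z = linkage(pdist(points), method="single")
merge_heights = Z[:, 2]          # scales at which connected components merge

# H0 barcode: one bar [0, h] per merge, plus one essential bar [0, inf).
h0_bars = [(0.0, h) for h in merge_heights] + [(0.0, np.inf)]
print("longest finite H0 bar dies at", merge_heights.max())
```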
An unfalsifiable hypothesis is one designed so that, no matter what observation we make, the hypothesis cannot be rejected. For example: "whatever happens, I made that happen". Go ahead, prove it wrong. Equivalently, such hypotheses are true by definition; they can only function as axioms. Impossible hypotheses are usually those which are logically inconsistent, in other words self-contradictory. Untestable hypotheses are those for which it is impossible to determine how they might affect any particular outcome or experiment. Needless to say, these are also useless, unless we have underestimated the size of \(\mathcal{O}\).