Among the various versions of Occam's razor, perhaps the most useful from a scientific point of view is the statement "among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected". Owing to the objective formulation of the statement, it is not surprising that there are certain mathematical and probabilistic interpretations (see the Wikipedia page on this).
Here, I want to talk about a justification from a machine learning perspective. If we wish to train a classifier where the training data looks almost linearly separable, we wouldn't try to train a non-linear classifier, would we? The reason for this is actually more than a subjective appeal to Occam's razor. Any reader who has dealt with building classifiers will immediately know what I am talking about: it's generalizability. The more "complex" the classifier, the higher the probability that its prediction error on unseen data will be higher (even though its classification rate on the training data may be better than that of a linear classifier). Likewise, in statistical regression analysis, we want to select a model which uses fewer explanatory factors without compromising the explanation itself.
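This point is easy to demonstrate numerically. Below is a minimal sketch (my own synthetic example, not from the original discussion) of the regression version of the argument: data generated by a noisy linear rule is fit with both a linear model and a degree-10 polynomial. The flexible model can only fit the training data better, yet it typically pays for that on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: a linear trend plus a little noise, i.e. "almost linear".
n = 30
x_train = np.sort(rng.uniform(0.0, 1.0, n))
y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, 0.1, n)

# Held-out data drawn from the same distribution.
x_test = np.sort(rng.uniform(0.0, 1.0, 200))
y_test = 2.0 * x_test + 1.0 + rng.normal(0.0, 0.1, 200)

def fit_and_score(degree):
    """Least-squares polynomial fit; return (train MSE, test MSE)."""
    p = np.poly1d(np.polyfit(x_train, y_train, degree))
    train_mse = np.mean((p(x_train) - y_train) ** 2)
    test_mse = np.mean((p(x_test) - y_test) ** 2)
    return train_mse, test_mse

train_lin, test_lin = fit_and_score(1)    # the "simple" hypothesis
train_poly, test_poly = fit_and_score(10) # the "complex" hypothesis

# The degree-10 model nests the linear one, so its training error is
# guaranteed to be at least as low -- but its test error usually is not.
print(f"degree 1:  train {train_lin:.4f}, test {test_lin:.4f}")
print(f"degree 10: train {train_poly:.4f}, test {test_poly:.4f}")
```

Note the asymmetry: the lower training error of the complex model is a mathematical certainty (its hypothesis space contains the linear one), while its behavior on new data is what Occam's razor warns us about.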
The above concept is elegantly quantified by formulations such as the Vapnik-Chervonenkis (VC) dimension and Rademacher complexity. In short, the higher the VC dimension, the more "complex" the classifier is, and the less generalizable it becomes. As I reach the end of writing this, I cannot help but feel that the relationship between Occam's razor and the VC dimension is, obviously, obvious.
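For the curious, the standard form of Vapnik's generalization bound makes the trade-off explicit (here $h$ denotes the VC dimension of the hypothesis class, $n$ the number of training samples, $R(f)$ the true risk, and $\hat{R}(f)$ the empirical training risk): with probability at least $1-\delta$,

$$R(f) \;\le\; \hat{R}(f) + \sqrt{\frac{h\left(\ln(2n/h) + 1\right) + \ln(4/\delta)}{n}}.$$

The penalty term grows with $h$, so a classifier with a higher VC dimension needs proportionally more data before its training performance says much about its true performance, which is the razor in quantitative form.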