Understanding optimal distances

Finding your horse's optimal distance can be a painful and expensive endeavour. This post looks under the surface at how we can make this process easier for owners in Photo Finish Live.

Petrocker

7/7/2024, 9 min read

Sometimes it's only after retiring a horse that you realise it was being run at the wrong distance. Sometimes horses are given up on too quickly, before optimal conditions are found. This post is an attempt to understand what dictates the optimal distance. There are lots of myths and rumours circulating about distance, so today we are going to keep it simple by just looking at what the numbers say.

To do this, the performance of retired horses has been analysed. This is important because retired horses are the ones whose attributes we can actually see. The focus is on finding the optimal distance - not how good a horse is overall. That's another story for another day. The cynical amongst you might ask what the point is in understanding the relationship between attributes and optimal distance, when you can't tell a horse's attributes until it's too late. Well, there is truth in this, but I think we are all better off if we can establish some baseline understanding of what dictates optimal distance.

Let's apply some basic stats to try to tell a story. The underlying data includes all retired horses and all their runs. Those runs have been normalised to account for surface and conditions, with performance measured on a scale of 0-100, where 0 is the best and 100 is the worst. We take the median (less prone to outlier bias) for each distance and compare. So a horse that has a median score of 14 for 4F, 12 for 5F, 10 for 6F, 12 for 7F and 14 for 8F will be deemed to run best at 6F in this analysis (where 10 is the lowest). We are comparing these normalised time scores with a horse's grades. These grades have been converted to integers so we can do some calculations with them (D-=1, D=2, D+=3, all the way up to SSS+=21). I have conducted the analysis off screen so we can focus here on the results.
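For anyone who wants to try the scoring step at home, here is a minimal Python sketch of the idea. It is not the code behind this post: the function names, the (distance, score) layout of the runs and the full grade ladder between D- and SSS+ are my assumptions (only D-=1, D=2, D+=3 and SSS+=21 are stated above), and the example runs are made up to reproduce the 14/12/10/12/14 medians.

```python
from collections import defaultdict
from statistics import median

# Assumed grade ladder: three steps (-, plain, +) per letter tier,
# which is consistent with D-=1, D=2, D+=3 and SSS+=21.
TIERS = ["D", "C", "B", "A", "S", "SS", "SSS"]
GRADE_TO_INT = {
    tier + suffix: 3 * i + j + 1
    for i, tier in enumerate(TIERS)
    for j, suffix in enumerate(["-", "", "+"])
}


def median_by_distance(runs):
    """Group (distance, score) pairs and take the median score per distance."""
    scores = defaultdict(list)
    for distance, score in runs:
        scores[distance].append(score)
    return {d: median(s) for d, s in scores.items()}


def best_distance(runs):
    """Return the distance with the lowest median score (0 is best, 100 is worst)."""
    medians = median_by_distance(runs)
    return min(medians, key=medians.get)


# Hypothetical horse whose medians are 14, 12, 10, 12 and 14 for 4F to 8F.
example_runs = [
    ("4F", 13), ("4F", 14), ("4F", 20),
    ("5F", 12), ("5F", 12),
    ("6F", 9), ("6F", 10), ("6F", 30),
    ("7F", 11), ("7F", 13),
    ("8F", 14),
]
print(best_distance(example_runs))  # 6F
```

Note how the one bad 6F run (a score of 30) does not move the median, which is the reason for using the median rather than the mean.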

We will use three different statistical approaches to understand the relationships: correlations, least squares regression and feature importance.

1. Correlations

Never trust a bore online who quotes correlations to you. They are one of the most misused stats in arguments, because humans are really good at seeing causation in correlations. Correlations measure a linear relationship between two elements. They don't explain the directionality (what causes what), they don't capture complex (non-linear) relationships and they are generally not transitive (if A&B are correlated and B&C are correlated, you can't be sure A&C are correlated). The range of correlations is from -1 to 1, where scores near zero suggest there is no linear relationship, and scores closer to -1 or 1 suggest a strong negative or positive correlation.

All that being said, it's a simple test to see if there is a relationship between two features.
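As a rough illustration of what a single cell in a correlation table represents, here is the standard Pearson correlation written out in plain Python. The six horses and their numbers are invented for the example; in the real analysis the inputs would be one attribute (as an integer grade) and the median score at one distance across all retired horses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up data: start grade (as an integer) and median 4F score for six horses.
start_grade = [6, 9, 11, 12, 14, 17]
score_4f = [42, 35, 30, 31, 22, 15]

r = pearson(start_grade, score_4f)
print(round(r, 2))  # strongly negative: higher start, lower (better) score
```

A negative value here is the good direction, because lower scores mean faster runs.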

This table shows the relationship between each attribute and each distance. The numbers are negative which tells us, as expected, there is a negative relationship between attributes and finish time. The higher the attribute, the faster a horse goes. However, the pattern emerging here is that start has a stronger relationship with finish times over shorter distances and a much weaker relationship over longer distances. This backs up what we are told and what is common knowledge at this point.

There is still a significant (I would ignore anything between -0.3 and 0.3) relationship at long distances, but this could be because better horses naturally have higher start than weaker horses and therefore run faster. The opposite is true for finish, where the relationship is strongest at long distances. Start and finish mirror each other - and so do speed and stamina - but with less extreme values. At 8F, the middle distance, it looks like all attributes have the same relationship with the overall time performance - suggesting a flat attribute distribution would work best at this distance.

correlation graph

For Heart and Temper, there is also a significant relationship at all distances, peaking at 8F and sloping off on either side. In conclusion I would say that all attributes improve performance, but by different amounts at each distance. S+ grade horses have higher start stats than most B+ horses - so it might be that this is what is causing start to look more important than it actually is at longer distances. This analysis doesn't consider interaction effects (say speed x start), which might be important to monitor too.

2. Least Squares Regression

Least squares regression (LSR) is a useful technique when you are trying to estimate the input parameters of an equation. Performance by distance can be thought of as an equation - so it is well suited to our task here. Again, it works best for simple linear relationships. The values generated are parameter estimates - or guesses of what the weighting is for each input feature.
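To make the idea concrete, here is a small sketch of a least squares fit for a single distance using NumPy's `np.linalg.lstsq`. Everything about the data is an assumption for illustration: the 200 horses are randomly generated, the "true" weights are invented, and the function name `fit_weights` is mine. The point is only to show that the fit recovers one weighting per attribute.

```python
import numpy as np

ATTRIBUTES = ["start", "speed", "stamina", "finish", "heart", "temper"]


def fit_weights(grades, scores):
    """Ordinary least squares: score = intercept + sum(weight_i * grade_i)."""
    grades = np.asarray(grades, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    X = np.column_stack([np.ones(len(grades)), grades])
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return coef[0], dict(zip(ATTRIBUTES, coef[1:]))


# Synthetic data: 200 made-up horses with integer grades from 1 to 21.
rng = np.random.default_rng(0)
grades = rng.integers(1, 22, size=(200, len(ATTRIBUTES)))

# Invented weights for a sprint: start matters most, finish not at all.
true_w = np.array([-1.5, -1.0, -0.3, 0.0, -0.3, -0.3])
scores = 95 + grades @ true_w + rng.normal(0, 1.0, size=200)

intercept, weights = fit_weights(grades, scores)
for name, w in weights.items():
    print(f"{name:8s}{w:+.2f}")
```

In a real run you would fit one such model per distance and lay the weights side by side, which is what a table of parameter estimates by distance amounts to.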

lsr table

Squinting at this table, the colours look the same. However, there is one key difference here: the estimates have positive and negative numbers. This would suggest that a higher finish attribute has a negative effect on 4F times, and a high start attribute has a negative effect on 12F times. I think this is unlikely, and probably more due to good horses having good attributes across the board. It might also suggest that a lower grade horse with stats in the right places could run better times than a higher grade horse without the right attributes in the right places. So, I would be cautious when suggesting that high finish is actually bad - it probably just means that horse should be running at a different distance. Again, the impact of heart and temper looks fairly similar across all distances. Also, 8F stands out as the distance where all weightings are very similar.

3. Feature Importance

One of the challenges of machine learning is the notion of the "black box algorithm" - it works but it's hard to explain exactly how it works. There are several solutions to this problem that are far too technical for this blog post, but one of the side products of using machine learning algorithms is being able to quickly assess feature importance. This is a common technique where you have thousands of features and want to home in on what is important. So here we build a quick random forest regression model, throw away the model, but take a look at the feature importance - because that's all we are really after here - to understand what is important at each distance. Yes, we could have spent a bunch of time fine tuning a model to get more and more accurate weightings, but we just want a quick and dirty appraisal of our few features to see what is important.
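For the curious, the "build it, throw it away, keep the importances" step looks roughly like this with scikit-learn's `RandomForestRegressor` and its `feature_importances_` attribute. This is a sketch under assumptions, not the model used for this post: the data is randomly generated, the relationship between grades and score is invented, and the settings (200 trees, fixed seed) are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

ATTRIBUTES = ["start", "speed", "stamina", "finish", "heart", "temper"]


def feature_importance(grades, scores, seed=0):
    """Fit a throwaway random forest and return (name, importance), largest first."""
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(grades, scores)
    return sorted(
        zip(ATTRIBUTES, model.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Synthetic data: 300 made-up horses with integer grades from 1 to 21.
rng = np.random.default_rng(1)
grades = rng.integers(1, 22, size=(300, len(ATTRIBUTES)))

# Invented relationship: start dominates, speed helps, the rest is noise.
scores = 95 - 2.5 * grades[:, 0] - 1.0 * grades[:, 1] + rng.normal(0, 2.0, size=300)

ranked = feature_importance(grades, scores)
for name, importance in ranked:
    print(f"{name:8s}{importance:.3f}")
```

The importances sum to 1 within a model, so they are only comparable within one distance, not across distances.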

feature importance

The way to read this table is to look at one column at a time. There is one model per distance, because we are treating them as discrete events, and each value is the relative importance of that feature to the model, where higher numbers mean more important. These models give a lot of importance to start and finish. Too much in my opinion. I think they underestimate speed and stamina. (Maybe I should have spent more time improving the models!) I would rank order each feature for each distance, so for 4F start is most important, then speed, then everything else is basically equal. That holds true from 5F to 7F. At the other end, finish and then stamina are most important - then everything else. So the findings are similar to what we saw earlier, but maybe overestimating the importance of start and finish. The impact of start and finish is more pronounced but probably not more important overall than speed and stamina. I wouldn't read too much into individual cell numbers - the data is too noisy to draw more than broad brush conclusions.

So three different techniques, three somewhat similar findings to what we all knew all along. Thanks. For. That.

I am sure those with the motivation might want to build out more precise or advanced approaches to get a lot more specific and have a more robust solution to predicting performance by distance. My attempt here was just to share the basics and provide some more robustness to the rumours that get passed around.

What is the best distance for my horse?

Let's put this knowledge to some sort of practical use. What is the best distance for my horse? Based on the parents or your breeding strategy you should know which attributes are most likely to be the highest. Maybe you are trying to develop a sprinter, or add stamina to a high speed horse so that it can better compete across the endurance sprints of 6-7F. The following table shows the % of horses whose best times are recorded at each distance, based on their highest attribute. Sometimes there are joint highest attributes, so I have tried to capture those in the first column. Some combinations don't have a big enough sample, so I have only included those combinations that have a robust sample size.

attributes by distance

The way to read this: for horses who have finish as their highest attribute, historically 1.5% of those horses have recorded their best time at 4F and 26.5% at 12F, with a further 23% at 11F (this is the penultimate row of the table). For those horses with start and finish being their highest attributes, the most likely best distance is 8F (but they will probably be terrible horses because that's a bad combination). This table doesn't say how good a horse will be - just what is most likely to be their best distance. But as we all know, we don't know the attributes before retirement!
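If you want to build this kind of table from your own retired horses, the counting can be sketched as below. The function names, the `min_sample` cut-off and the four example horses are all hypothetical, and I have assumed the "highest attribute" is picked from start, speed, stamina and finish, with joint-highest attributes joined by a slash.

```python
from collections import Counter, defaultdict


def top_attributes(grades):
    """Return the highest attribute name, or joint highest as e.g. 'start/speed'."""
    best = max(grades.values())
    return "/".join(name for name, grade in grades.items() if grade == best)


def best_distance_share(horses, min_sample=1):
    """Percentage of horses per best distance, grouped by highest attribute(s).

    Groups with fewer than min_sample horses are dropped as too small to trust.
    """
    counts = defaultdict(Counter)
    for grades, best_dist in horses:
        counts[top_attributes(grades)][best_dist] += 1

    shares = {}
    for group, by_distance in counts.items():
        total = sum(by_distance.values())
        if total >= min_sample:
            shares[group] = {d: 100 * n / total for d, n in by_distance.items()}
    return shares


# Hypothetical retired horses: (integer grades, best distance).
horses = [
    ({"start": 15, "speed": 12, "stamina": 8, "finish": 6}, "4F"),
    ({"start": 14, "speed": 10, "stamina": 9, "finish": 7}, "5F"),
    ({"start": 13, "speed": 13, "stamina": 9, "finish": 5}, "5F"),
    ({"start": 6, "speed": 8, "stamina": 11, "finish": 16}, "12F"),
]
shares = best_distance_share(horses)
print(shares["start"])  # {'4F': 50.0, '5F': 50.0}
```

Raising `min_sample` is one simple way to hide combinations that don't have enough horses behind them.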

I hope this short stats (lesson) post is a launch pad for your own analysis, and has either firmed up what you instinctively knew to be true, or has helped you form your own angle. All of these numbers are based on the population as a whole. There will always be horses that don't align with what has been discussed here.

Take for example the strange case of Olivia Dunne, who has recorded the fastest 4F time (in yielding conditions) with a speed attribute of A+. There will always be examples of horses that don't conform to the norm. That's great - because it means there is a great deal of randomness in the game - which keeps it exciting.

Overall I would say there are definitely 4F and 12F horses that are fairly abject at any other distance - this is due to the dominance of start or finish in their profile. Well balanced horses should excel at 8F (which is maybe a 3rd specialist distance). In between these points, the balance of start/speed, speed/stamina or stamina/finish presents a broader range of acceptable distances, where there could be multiple distances that a horse can comfortably win at. Bad combinations of high attributes such as start/finish seem to generate bad horses. So when breeding, yes of course try to figure out what gives you the best opportunity for high attributes - but pay specific attention to trying to land combinations of high attributes (start/speed for sprinters) together. There are lots of studs that have a high specific attribute - but that may not relate to racing success unless you can get your combinations working together.

Fewer stats next time, I promise.

Join the fun and put these insights into practice at PhotoFinish.Live and if you are considering starting your own stable please consider using my referral code: PADDOCK or just click on this link: https://signup.photofinish.live/?referralCode=PADDOCK

Please remember this is a web3 game where you spend your own money. Nothing I write about should be considered financial or investment advice.
