Let's use the Horizons Web Interface to search for asteroids with the following constraints:

- H mag defined
- diameter defined
- albedo defined
- B-V mag defined
- U-B mag defined

It would be nice to look for I-R mag values but these data are not available.

Let's also extract the asteroid designations and their spectral type (SMASSII) when available.

We get a list of 813 asteroids.

I will use a data mining package called Weka to visualize and analyze the data.

To start simple: let's look at a well known relation, the one linking H and diameter.

__Diameter versus H mag__

Here is the graph (H mag on the X-axis, diameter on the Y-axis):

Two distinct but similar curves appear beyond a certain H threshold (very roughly H = 12 to 12.5).

Diameter and H are related via albedo, so we can try to use Weka to cluster the asteroids using for example H and albedo.
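As a reminder of why albedo matters here, the standard relation between diameter D (in km), absolute magnitude H and geometric albedo p is D = 1329 / sqrt(p) * 10^(-H/5). Below is a minimal Python sketch of it. The function name and the two albedo values (0.05 and 0.25) are illustrative assumptions, not values taken from the Horizons list:

```python
import math

def diameter_km(h_mag, albedo):
    """Diameter in km from absolute magnitude H and geometric albedo."""
    return 1329.0 / math.sqrt(albedo) * 10 ** (-h_mag / 5.0)

# Two hypothetical asteroids with the same H but different (assumed) albedos.
for albedo in (0.05, 0.25):
    print(f"H = 12.0, albedo = {albedo}: D = {diameter_km(12.0, albedo):.1f} km")
```

At the same H, the darker object comes out larger by a factor of sqrt(0.25 / 0.05), about 2.2, so two populations with different typical albedos could plausibly show up as two separate branches in the plot.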

__Cluster results__

Let's run the K-Means clustering algorithm with K=2 using H and albedo.

The following two clusters are identified (Weka returns the two cluster centroids and the number of instances in each):

Let's look at the same plot as before using the blue color for the smaller cluster and the red color for the larger one to see if the clusters can be visually associated to the two distinct branches:

The relation between the two branches of the plot and the clusters defined based on H and albedo is not "perfect" (some of the red instances are displayed in proximity of the blue ones) but, at least roughly, it seems to be well confirmed.
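For readers who do not have Weka at hand, the K-Means step used above can be sketched in plain Python. This is only a sketch under stated assumptions: the (H, albedo) pairs are invented for illustration, the min-max scaling and the deterministic choice of starting centroids are choices made for this example, and none of it is meant to reproduce Weka's own implementation or its settings:

```python
import math

def min_max_scale(points):
    """Rescale every column to [0, 1] so that H does not dominate albedo."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]

def kmeans(points, k=2, iterations=50):
    """Plain Lloyd's algorithm with evenly spaced points as starting centroids."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Invented (H, albedo) pairs: four dark objects followed by four bright ones.
data = [(8.0, 0.05), (9.5, 0.06), (10.2, 0.04), (11.0, 0.05),
        (9.0, 0.22), (10.5, 0.25), (11.5, 0.20), (12.0, 0.28)]
centroids, labels = kmeans(min_max_scale(data), k=2)
print(labels)
```

On this toy data the two clusters separate the low-albedo objects from the high-albedo ones. On real data the outcome depends on the scaling and on the starting centroids, so it is worth trying more than one setting.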

Let's now cluster based on B-V and U-B.

We get this result (never mind that this time the smaller cluster is called cluster 1; we will continue to display the smaller cluster in blue):

Let's also try another software package, QtiPlot, to estimate a polynomial fit for the two Diameter vs H curves.

__Clusters versus spectral type__

First of all, we display the two clusters on the B-V and U-B plane:

We want to analyze the two clusters in more detail, to see whether they are related to the spectral type.

Before doing this, let's look at the overall spectral type distribution:

__Cluster 1__

First we look at the spectral type distribution of the 322 asteroids that belong to this cluster.

What is interesting in this cluster is this: almost all S-type asteroids (94 out of 96) belong to it, and they make up a little more than half of the cluster members with a known spectral type.

Let's use Weka's ZeroR algorithm to put a lower bound on the performance of any classification model.

The ZeroR model always predicts the most populous class: here, 94 S-type asteroids out of 182 asteroids with a known spectral type (94 / 182 = 51.6%).
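The ZeroR baseline is simple enough to reproduce by hand. Here is a small Python sketch, assuming only the two counts quoted above. The 88 non-S asteroids are lumped under a placeholder label "other" because their breakdown is not given here; that does not change the result, since no single class among them can exceed 94:

```python
from collections import Counter

def zero_r(labels):
    """Return the majority class and the accuracy of always predicting it."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)

# 94 S-type asteroids plus the remaining 88 with a known spectral type.
labels = ["S"] * 94 + ["other"] * 88
majority, accuracy = zero_r(labels)
print(majority, f"{accuracy:.1%}")  # prints: S 51.6%
```

Any classifier worth keeping should beat this number on data it has not seen.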

In fact, this is the Weka output:

Let's see if it is possible to use H, diameter, albedo, B-V and U-B to discriminate the S-type asteroids from the other asteroid types.

We try the logistic model (10-fold cross-validation). The result is this:

I would say that the logistic model has failed to make any significant prediction.

Other classification models apparently give a better result, but I think this is just overfitting: the gap between the 10-fold cross-validation performance and the performance on the whole training set is too large.

__Cluster 0__

Again, first we look at the spectral type distribution of the 491 asteroids that belong to this cluster.

What is interesting in this cluster is this: it contains most of the carbonaceous and metallic asteroids and it is quite heterogeneous.

Let's use Weka's ZeroR algorithm to put a lower bound on the performance of any classification model.

The ZeroR model always predicts the most populous class: here, 60 Ch asteroids out of 228 asteroids with a known spectral type (60 / 228 = 26.3%).

In fact, this is the Weka output:

Let's see if it is possible to use H, diameter, albedo, B-V and U-B to discriminate among the various asteroid types.

We try the logistic model (10-fold cross-validation). The result is this:

This time the logistic model seems to have moderate success (44% accuracy against the 26% ZeroR baseline), so the asteroid physical parameters seem to have some predictive power in this cluster.

However, the accuracy jumps to 57% when the model is run on the whole set of asteroids belonging to this cluster, which suggests overfitting.

Better not to believe this model!
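The gap between cross-validated accuracy and accuracy on the whole training set can be made concrete with a toy example. The sketch below is not the logistic model used in Weka: it swaps in a 1-nearest-neighbour classifier, which memorizes its training data, and it uses simple interleaved folds instead of Weka's randomized ones. The data are invented and deliberately contrived so that the effect is as extreme as possible:

```python
import math

def nearest_neighbor_predict(train_x, train_y, query):
    """1-nearest-neighbour: return the label of the closest training point."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]

def training_accuracy(xs, ys):
    """Accuracy when the model is tested on the same points it was trained on."""
    hits = sum(nearest_neighbor_predict(xs, ys, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)

def cross_validated_accuracy(xs, ys, folds=10):
    """Accuracy when every point is predicted by a model that never saw it."""
    hits = 0
    for fold in range(folds):
        train_idx = [i for i in range(len(xs)) if i % folds != fold]
        test_idx = [i for i in range(len(xs)) if i % folds == fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        hits += sum(nearest_neighbor_predict(train_x, train_y, xs[i]) == ys[i]
                    for i in test_idx)
    return hits / len(xs)

# Contrived data: labels alternate along a line, so neighbours always disagree.
xs = [(float(i),) for i in range(20)]
ys = [i % 2 for i in range(20)]
print("training set accuracy:   ", training_accuracy(xs, ys))
print("cross-validated accuracy:", cross_validated_accuracy(xs, ys, folds=10))
```

Here the training set accuracy is perfect while the cross-validated accuracy collapses to zero, because the model has memorized the points without learning any rule that carries over to unseen ones. The 57% versus 44% gap above is a much milder version of the same symptom.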

Before giving up, we can see that the situation improves a little bit if we reduce the number of variables, focusing only on the B-V and U-B parameters.

In fact, this is what happens with the logistic model that uses only B-V and U-B (10-fold cross-validation):

... and this is what happens when running the model on the whole cluster:

I am not sure about this: the relatively small increase in the number of correctly classified instances may indicate that the overfitting effect is limited, so we may hope that the model has, at least in part, been able to generalize correctly.

In that case, as you see in the section "Detailed Accuracy by Class", we may have some success with the prediction of asteroids belonging to these spectral types:

- B (average precision 0.647)
- Ch (average precision 0.535)
- X (average precision 0.38)
- C (average precision 0.368)

Note that:

- column spec_B: this is the already known spectral type as downloaded from the Horizons database. When no spectral type is available in the Horizons database, you see a question mark.
- column classification: this is the prediction of the model. Of course, if the previous column is known and the predicted type is different, this is an error. If the previous column is a question mark, this is a genuine prediction.
- all other columns contain the probability distribution: the spectral class with the greatest probability is the predicted class.
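The rule described in the last point, where the predicted class is the one with the greatest probability, amounts to a one-line lookup. A minimal Python sketch follows; the probability values are invented for illustration and do not come from the Weka output:

```python
def predict_class(probabilities):
    """Return the class with the highest probability, and that probability."""
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]

# Hypothetical probability distribution for one asteroid.
row = {"B": 0.15, "C": 0.55, "X": 0.30}
print(predict_class(row))  # prints: ('C', 0.55)
```

Keeping the winning probability next to the predicted class is useful because a prediction made at 0.55 deserves less trust than one made at 0.95.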

The fact that Weka returns the probability distribution is very nice because you can plot other graphs, for example: probability distribution of a given spectral type versus B-V.

Look for example at these plots:

type B distribution versus (B-V)

type C distribution versus (B-V)

... but before going to these details, it would be better to be more confident about the validity of the model!

Kind Regards,

Alessandro Odasso