
Showing posts with label qtiplot. Show all posts

Monday, April 20, 2015

Asteroid Physical Parameters

Let's use the Horizons Web Interface to search for asteroids with the following constraints:
  • H mag defined
  • diameter defined
  • albedo defined
  • B-V mag defined
  • U-B mag defined
It would be nice to look for I-R mag values too, but these data are not available.
Let's also extract the asteroid designations and their spectral type (SMASSII) when available.


We get a list of 813 asteroids.

I will try a data mining package called Weka to visualize and analyze the data.
To start simple, let's look at a well-known relation: the one linking H and diameter.

Diameter versus H mag

Here is the graph (H mag is on the X-axis, diameter on the Y-axis):

Two distinct but similar curves appear beyond a certain H threshold (very roughly H = 12 to 12.5).

Diameter and H are related via albedo, so we can try to use Weka to cluster the asteroids using for example H and albedo.
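
For reference, the standard relation is D = 1329 km / sqrt(albedo) × 10^(-H/5). Here is a minimal Python sketch of it; the function name and the sample albedos (0.05 and 0.25) are my own choices for illustration, not values taken from the dataset.

```python
import math

def diameter_km(h_mag, albedo):
    """Diameter in km from absolute magnitude H and geometric albedo."""
    return 1329.0 / math.sqrt(albedo) * 10 ** (-0.2 * h_mag)

# Illustrative albedos (assumed, not from the dataset): a dark and a bright surface.
for h in (10.0, 12.0, 14.0):
    dark = diameter_km(h, 0.05)
    bright = diameter_km(h, 0.25)
    print(f"H={h:4.1f}  D(p=0.05)={dark:6.2f} km  D(p=0.25)={bright:6.2f} km")
```

At any fixed H the darker object comes out sqrt(5) ≈ 2.24 times larger than the brighter one, so two groups of asteroids with different typical albedos could plausibly show up as two roughly parallel curves.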


Cluster results
Let's run the K-Means clustering algorithm with K=2 using H and albedo.
The following two clusters are identified (Weka returns the two cluster centroids and the number of instances in each cluster):

Let's look at the same plot as before using the blue color for the smaller cluster and the red color for the larger one to see if the clusters can be visually associated to the two distinct branches:



The match between the two branches of the plot and the clusters defined from H and albedo is not perfect (some red instances sit close to the blue ones), but it holds at least roughly.
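
For readers without Weka, the K-Means step can be sketched in plain Python. This is only a rough illustration of the algorithm (Lloyd's iteration), not Weka's implementation: the min-max scaling is my assumption about how to make H and albedo comparable, and the six (H, albedo) rows are made up.

```python
import random

def min_max_scale(rows):
    """Rescale every column to [0, 1] so H and albedo weigh equally."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

def k_means(points, k=2, iterations=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels

# Made-up (H, albedo) rows for illustration only; not the 813-asteroid table.
rows = [(9.5, 0.05), (10.2, 0.06), (11.0, 0.04), (9.8, 0.22), (10.5, 0.25), (11.3, 0.20)]
centroids, labels = k_means(min_max_scale(rows), k=2)
print(labels)
```

K-Means depends on the initial centroids, so different seeds can give different partitions; this is one reason not to expect the clusters to follow the two branches exactly.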

Let's now cluster based on B-V and U-B.
We get this result (never mind that this time the smaller cluster is called cluster 1; we will keep displaying the smaller cluster in blue):

This is interesting: we get almost the same result as before, which suggests that the two clusters based on B-V and U-B almost coincide with those based on H and albedo.
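
The overlap between two clusterings can also be put into a number instead of judged by eye. A small sketch, assuming each clustering is a list of 0/1 labels in the same asteroid order; the function name and the sample labels are hypothetical.

```python
def cluster_agreement(labels_a, labels_b):
    """Fraction of items on which two 2-cluster labelings agree,
    ignoring which cluster happens to be numbered 0 or 1."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    flipped = len(labels_a) - same  # agreement if one labeling's 0/1 are swapped
    return max(same, flipped) / len(labels_a)

# Hypothetical labels for six asteroids under the two clusterings.
by_h_albedo = [0, 0, 1, 1, 1, 0]
by_colours = [1, 1, 0, 0, 1, 1]
print(cluster_agreement(by_h_albedo, by_colours))
```

Taking the maximum of the two counts handles the case where Weka swaps the cluster numbers between runs. The trick only works for exactly two clusters labelled 0 and 1.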

Let's try another software package, QtiPlot, to estimate a polynomial fit for the two Diameter vs H curves.
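
As a cross-check on any fit, note that a single-albedo branch becomes a straight line with slope -0.2 when log10(diameter) is plotted against H. The sketch below fits that line by least squares. It is my substitution for illustration, not the polynomial fit done in QtiPlot, and the data are synthetic, generated with an assumed albedo of 0.06.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic branch generated with an assumed albedo of 0.06 (not real data).
h_values = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
diameters = [1329.0 / math.sqrt(0.06) * 10 ** (-0.2 * h) for h in h_values]

slope, intercept = fit_line(h_values, [math.log10(d) for d in diameters])
# The intercept is log10(1329 / sqrt(albedo)), so it implies an albedo.
implied_albedo = (1329.0 / 10 ** intercept) ** 2
print(round(slope, 3), round(implied_albedo, 3))
```

On real data, fitting each branch separately this way would give a typical albedo per branch, which could then be compared with the cluster centroids.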




Clusters versus spectral type

First of all, we display the two clusters on the B-V and U-B plane:

We want to analyze the two clusters in more detail, to see whether they are related to the spectral type.

Before doing this, let's look at the overall spectral type distribution:


Cluster 1

First we look at the spectral type distribution of the 322 asteroids that belong to this cluster.


What is interesting in this cluster: almost all S-type asteroids (94 out of 96) belong to it, and they make up a little more than half of the cluster members that have a known spectral type (94 out of 182).

Let's use Weka's ZeroR algorithm to put a lower bound on the performance of any classification model.
The ZeroR model always predicts the most frequent class: here that is S, with 94 asteroids out of the 182 that have a known spectral type (94 / 182 = 51.6%).
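
The ZeroR baseline is simple enough to reproduce by hand. A minimal sketch: the function name is mine, and the label "other" lumps all non-S types together, which is a simplification that does not change the result as long as S stays the largest class.

```python
from collections import Counter

def zero_r(labels):
    """Return the majority class and the accuracy of always predicting it."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)

# Counts from the text: 94 S-type among 182 classified asteroids.
# "other" lumps all non-S types together, a simplification for this sketch.
labels = ["S"] * 94 + ["other"] * 88
majority, accuracy = zero_r(labels)
print(majority, f"{accuracy:.1%}")  # S 51.6%
```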
In fact, this is the Weka output:

Let's see if it is possible to use H, diameter, albedo, B-V and U-B to discriminate the S-type asteroids from the other asteroid types.

We try with the logistic model (cross validation N=10). The result is this:

I would say that the logistic model has failed to make any significant prediction.
Other classification models apparently give a better result, but I think this is just overfitting: the gap between the 10-fold cross-validation performance and the performance on the whole training set is too large.
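
To see why a large gap between training-set accuracy and cross-validation accuracy is a warning sign, here is a small sketch on pure noise. It uses a 1-nearest-neighbour classifier (my choice, because it memorises the training set by construction) and plain, non-stratified folds, which is simpler than what Weka does. All names and data are hypothetical.

```python
import random

def nearest_label(train, point):
    """1-nearest-neighbour prediction; train is a list of (features, label)."""
    return min(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)))[1]

def accuracy(train, test):
    """Fraction of test items whose label is predicted correctly."""
    return sum(nearest_label(train, x) == y for x, y in test) / len(test)

def cross_val_accuracy(data, folds=10):
    """Pooled held-out accuracy over `folds` disjoint test folds."""
    correct = 0
    for i in range(folds):
        test = data[i::folds]  # every folds-th item, starting at i
        train = [item for j, item in enumerate(data) if j % folds != i]
        correct += sum(nearest_label(train, x) == y for x, y in test)
    return correct / len(data)

# Pure-noise data: the features carry no information about the label.
rng = random.Random(42)
data = [((rng.random(), rng.random()), rng.choice("SC")) for _ in range(100)]

train_acc = accuracy(data, data)   # the model has memorised every point
cv_acc = cross_val_accuracy(data)  # estimate on held-out folds
print(train_acc, cv_acc)
```

The training accuracy is perfect even though there is nothing to learn, while the cross-validated figure stays well below it. A model whose two figures differ this much should not be trusted on new asteroids.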

Cluster 0

Again, first we look at the spectral type distribution of the 491 asteroids that belong to this cluster.
What is interesting in this cluster: it contains most of the carbonaceous and metallic asteroids, and it is quite heterogeneous.
Let's use Weka's ZeroR algorithm to put a lower bound on the performance of any classification model.
The ZeroR model always predicts the most frequent class: here that is Ch, with 60 asteroids out of the 228 that have a known spectral type (60 / 228 = 26.3%).
In fact, this is the Weka output:

Let's see if it is possible to use H, diameter, albedo, B-V and U-B to discriminate among the various asteroid types.
We try with the logistic model (cross validation N=10). The result is this:

This time the logistic model seems to have moderate success (44% accuracy against a 26% baseline), so the asteroid physical parameters seem to have some predictive power in this cluster.
However, the accuracy jumps to 57% when the model is evaluated on the whole set of asteroids in this cluster, which suggests overfitting.
Better not to believe this model!

Before giving up, we can see that the situation improves a little bit if we reduce the number of variables just focusing on the B-V and U-B parameters.
In fact, this is what happens with the logistic model that just uses B-V and U-B (cross validation N=10):

... and this is what happens when running the model on the whole cluster:


I am not sure about this, but the relatively small increase in the number of correctly classified instances may indicate that the overfitting is not so severe, so we may hope that the model has at least partly managed to generalize.

In that case, as you can see in the section "Detailed Accuracy by Class", we may have some success predicting asteroids of these spectral types:
  • B (average precision 0.647)
  • Ch (average precision 0.535)
  • X (average precision 0.38)
  • C (average precision 0.368)
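
For clarity, the precision of a class is the fraction of instances predicted as that class that really belong to it. A minimal sketch of the arithmetic; the labels below are invented and have nothing to do with the Weka run.

```python
def precision_by_class(actual, predicted):
    """Precision per class: correct predictions of a class / all predictions of it."""
    result = {}
    for cls in set(predicted):
        hits = sum(1 for a, p in zip(actual, predicted) if p == cls and a == cls)
        total = sum(1 for p in predicted if p == cls)
        result[cls] = hits / total
    return result

# Hypothetical labels, only to show the arithmetic.
actual = ["B", "B", "Ch", "C", "X", "Ch"]
predicted = ["B", "Ch", "Ch", "Ch", "X", "C"]
print(precision_by_class(actual, predicted))
```

In this toy example Ch is predicted three times but is right only once, so its precision is 1/3. A class that is never predicted gets no entry at all.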
Just for curiosity ... here is the list of predictions.

Note that:
  • column spec_B: the already known spectral type, as downloaded from the Horizons database. When no spectral type is available in the Horizons database, you see a question mark.
  • column classification: the prediction of the model. If the previous column is known and the predicted type is different, this is an error. If the previous column is a question mark, this is a genuine prediction.
  • all other columns contain the probability distribution: the spectral class with the greatest probability is the predicted class.
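
The way a row of this table is read can be sketched as follows. The function names and the probabilities are hypothetical; only the logic (take the most probable class, then compare it with the known type if there is one) comes from the description above.

```python
def predict(distribution):
    """Return (predicted class, its probability) from a class -> probability map."""
    best = max(distribution, key=distribution.get)
    return best, distribution[best]

def describe(known_type, distribution):
    """Mimic the table: '?' means no known type, so the row is a new prediction."""
    predicted, prob = predict(distribution)
    if known_type == "?":
        status = "prediction"
    elif known_type == predicted:
        status = "correct"
    else:
        status = "error"
    return predicted, prob, status

# Hypothetical row; the probabilities are invented for illustration.
row = {"B": 0.10, "C": 0.25, "Ch": 0.55, "X": 0.10}
print(describe("?", row))
```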
In spite of the fact that the average precision is not high, there are a few cases where it is very high.

The fact that Weka returns the probability distribution is very nice because you can plot other graphs, for example: probability distribution of a given spectral type versus B-V.

Look for example at these plots:

type B distribution versus (B-V)


type C distribution versus (B-V)


... but before going to these details, it would be better to be more confident about the validity of the model!

Kind Regards,
Alessandro Odasso