Blog Archive

Showing posts with label Weka. Show all posts

Friday, May 1, 2015

Asteroid Spectral Type Distribution up to 1st Kirkwood gap

I would like to analyze the relation between asteroid spectral types and photometric data.

MPC has made available an interesting web service to download data from their databases.

Data Acquisition
The web service can be accessed by running a Python script that returns a lot of physical and orbital parameters for each asteroid (more than 100 columns!).

The first Python query that I used to extract the list of asteroids looked like this:

python mpc-fetch.py order_by semimajor_axis taxonomy_class_min A semimajor_axis_min 0 > part1.xml

This syntax returns all asteroids belonging to taxonomy class A, B, C, etc., without having to bother with those that are not yet classified.
The web service limits the output to 16384 asteroids per query, so I looked at the last XML section, read the last semimajor_axis value, and submitted a second query starting from that value:

python mpc-fetch.py order_by semimajor_axis taxonomy_class_min A semimajor_axis_min 2.2179251 >> part2.xml

Then I repeated the process and ran:

python mpc-fetch.py order_by semimajor_axis taxonomy_class_min A semimajor_axis_min 2.2718021 >> part3.xml

... and so on, until I reached a semimajor_axis of about 2.5 au, where I stopped. There is no deep reason for this value: I chose it just to limit the number of queries (although the 2.5 au threshold is also convenient because it is the first big Kirkwood gap).
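The paging described above could be automated along these lines. This is only a sketch: fetch_page is a hypothetical stand-in for whatever runs mpc-fetch.py and parses its XML, and the designation and semimajor_axis keys are assumed field names. Because each new query starts at the last value already seen, the boundary asteroid may come back twice, so the sketch drops repeats.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def fetch_all(fetch_page: Callable[[float], List[Row]],
              page_limit: int, a_max: float) -> List[Row]:
    """Collect rows sorted by semimajor_axis from a service that caps each reply."""
    rows: List[Row] = []
    seen = set()
    a_min = 0.0
    while a_min < a_max:
        # Hypothetical call: returns rows with semimajor_axis >= a_min, sorted.
        page = fetch_page(a_min)
        for row in page:
            if row["designation"] not in seen:  # skip the repeated boundary row
                seen.add(row["designation"])
                rows.append(row)
        if len(page) < page_limit:
            break  # short page: the service has nothing more to return
        last_a = float(page[-1]["semimajor_axis"])
        if last_a <= a_min:
            break  # no progress; stop instead of looping forever
        a_min = last_a
    return rows
```

As in the manual procedure, the last page may run slightly past a_max; trimming it afterwards is left out of the sketch.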

After that, I had to convert the XML output (quite big: about 131000 asteroids) to CSV.

I used another Python utility called xml2csv as follows (once for every file):

xml2csv --input part1.xml --output part1.csv --tag property

Finally, I concatenated all CSV files together.
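The concatenation step can be done in a few lines of Python. The sketch below assumes that every CSV part starts with the same header row (I have not checked that this is always what xml2csv produces), and the function name is only illustrative. It writes the header once and refuses to merge a part whose header differs.

```python
import csv
from pathlib import Path
from typing import Iterable


def concat_csv(parts: Iterable[Path], out_path: Path) -> int:
    """Merge CSV parts into out_path, keeping one header; return the data-row count."""
    header = None
    n_rows = 0
    with out_path.open("w", newline="") as out:
        writer = csv.writer(out)
        for part in parts:
            with part.open(newline="") as src:
                reader = csv.reader(src)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # empty file: nothing to copy
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"{part} has a different header")
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows
```

For example, concat_csv([Path("part1.csv"), Path("part2.csv")], Path("all.csv")) would merge the first two parts.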

Data Analysis
I used a data mining package called Weka developed by the University of Waikato in New Zealand.

The distribution of asteroids is like this:

Within a set of 131072 asteroids, we can easily see the top three groups:
  • Nr. of S type - 111565
  • Nr. of C type - 13207 
  • Nr. of E type - 5671

Let's see how we can recognize these three groups based on some physical parameters chosen among color indexes.
 
First of all I used the "Select Attribute" tool to rank the list of the most important parameters (among color indexes) that can be used to predict the taxonomy class.
This is the result:

Panstarrs parameters are on top of the list, almost all with the same average merit.
It is nice to visually show why.

Panstarrs parameter distribution
These graphs, made with the ggplot2 package for R, confirm that the panstarrs parameters make it possible to discriminate between S-type, C-type and E-type asteroids.
In fact, every single distribution consists mostly of asteroids belonging to the same taxonomy class.

We can also visually display the covariance matrix as a "heatmap".
I found a very interesting link that explains how to do this:
ggplot2 : Quick correlation matrix heatmap - R software and data visualization

This is the result:



Finally, let's go back to Weka and perform:
  • cluster analysis
  • logistic model

Cluster Analysis
I ran a K-means clustering algorithm with K=3:
  • The S-type asteroids were associated with cluster 0
  • The C-type asteroids were associated with cluster 1
  • The E-type asteroids were associated with cluster 2
  • All other types were mainly attributed to cluster 0, with the exception of the V-type asteroids, which were grouped in the same cluster as the E-type asteroids. Cluster 1 consists entirely of C-type asteroids.
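One simple way to score such a cluster-to-type assignment is to label each cluster with its most common spectral type and count the asteroids that disagree with that label. The sketch below is a simplified majority-vote version written for illustration; Weka's own evaluation may assign types to clusters differently, and the tiny input used in the usage note is made up.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def cluster_error_rate(clusters: Sequence[Hashable],
                       types: Sequence[str]) -> Tuple[Dict[Hashable, str], float]:
    """Label each cluster with its majority type; return labels and error fraction."""
    by_cluster = defaultdict(Counter)
    for cluster, spectral_type in zip(clusters, types):
        by_cluster[cluster][spectral_type] += 1
    # Majority spectral type of every cluster.
    majority = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
    # An asteroid is "incorrectly clustered" if its type differs from that label.
    wrong = sum(1 for c, t in zip(clusters, types) if majority[c] != t)
    return majority, wrong / len(types)
```

With the made-up input clusters = [0, 0, 0, 0, 1, 1, 2, 2, 2] and types = ["S", "S", "S", "A", "C", "C", "E", "E", "V"], the A-type and V-type objects are the only ones counted as errors.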
The clustering scheme works very well in this case: only 0.48% of the asteroids were incorrectly clustered:


The three cluster centroids are as follows:


The Weka Logistic model
First of all, we must establish a baseline for the performance we expect (ZeroR model).
There are 111565 S-type asteroids in a set of 131072 asteroids, so ZeroR, which always predicts the most numerous class, is already right about 85% of the time: the accuracy of any "true" model must be much better than 85%.
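The ZeroR baseline is simple arithmetic, and it can be checked from the class counts given earlier. In the sketch below the function name is mine, and the "other" bucket is simply what remains of the 131072 asteroids after subtracting the S, C and E counts.

```python
from typing import Dict, Tuple


def zero_r_accuracy(class_counts: Dict[str, int]) -> Tuple[str, float]:
    """Return the majority class and the accuracy of always predicting it."""
    majority_class = max(class_counts, key=class_counts.get)
    total = sum(class_counts.values())
    return majority_class, class_counts[majority_class] / total


counts = {"S": 111565, "C": 13207, "E": 5671}
counts["other"] = 131072 - sum(counts.values())  # everything that is not S, C or E

label, accuracy = zero_r_accuracy(counts)
print(label, round(100 * accuracy, 1))  # prints: S 85.1
```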

After running the logistic model with 10-fold cross-validation, I got these results:


As expected, performance is very good not only for S-type asteroids but also for C-type asteroids (precision and recall = 1) and E-type asteroids (precision = 0.928, recall = 1), but the model fails to predict the other, less numerous types.


Kind Regards,
Alessandro Odasso

Monday, April 20, 2015

Asteroid Physical Parameters

Let's use the Horizons Web Interface to search for asteroids with the following constraints:
  • H mag defined
  • diameter defined
  • albedo defined
  • B-V mag defined
  • U-B mag defined
It would be nice to look for I-R mag values but these data are not available. 
Let's also extract the asteroid designations and their spectral type (SMASSII) when available.


We get a list of 813 asteroids.

I tried using a data mining package called Weka to visualize and analyze the data.
To start simple, let's look at a well-known relation, the one linking H and diameter.

Diameter versus H mag

We can see this graph (H mag is on the X-axis, diameter on the Y-axis):

Two distinct but similar curves exist beyond a certain H threshold (very roughly H = 12 - 12.5).

Diameter and H are related via albedo, so we can try to use Weka to cluster the asteroids using for example H and albedo.
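For reference, the standard relation between diameter D, absolute magnitude H and geometric albedo p is D = 1329 km / sqrt(p) * 10^(-H/5). A small sketch (the function name is just illustrative) makes the role of albedo explicit: at the same H, a darker asteroid must be larger.

```python
import math


def diameter_km(h_mag: float, albedo: float) -> float:
    """Diameter in km from absolute magnitude H and geometric albedo."""
    return 1329.0 / math.sqrt(albedo) * 10 ** (-0.2 * h_mag)


# Same H, two different albedos: the darker body comes out larger.
dark = diameter_km(12.0, 0.05)
bright = diameter_km(12.0, 0.25)
print(dark > bright)  # prints: True
```

At fixed H the two diameters differ by the square root of the albedo ratio, so two populations with different typical albedos would trace two roughly parallel curves. Whether this fully explains the two branches in the plot is my assumption, not something the data alone proves.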


Cluster results
Let's run the K-Means clustering algorithm with K=2 using H and albedo.
The following two clusters are identified (Weka returns the two cluster centroids and their sizes):

Let's look at the same plot as before, using blue for the smaller cluster and red for the larger one, to see if the clusters can be visually associated with the two distinct branches:



The relation between the two branches of the plot and the clusters defined from H and albedo is not "perfect" (some of the red instances are displayed close to the blue ones) but, at least roughly, it seems to be confirmed.

Let's now cluster based on B-V and U-B.
We get this result (never mind that this time the smaller cluster is called cluster 1: we will continue to display the smaller cluster in blue):

This is interesting because we get almost the same result as before, suggesting that the two clusters based on B-V and U-B almost coincide with those based on H and albedo.

Let's try to use another software package called QTIPlot to estimate a polynomial fit for the two Diameter vs H curves.




Clusters versus spectral type

First of all, we display the two clusters on the B-V and U-B plane:

We want to analyze the two clusters in more detail to see whether they are related to the spectral type.

Before doing this, let's look at the overall spectral type distribution:


Cluster 1

First we look at the spectral type distribution of the 322 asteroids that belong to this cluster.


What is interesting in this cluster is this: almost all S-type asteroids (94 out of 96) belong to this cluster and they constitute a little more than half of their cluster population. 

Let's use Weka's ZeroR algorithm to put a lower bound on the performance of any classification model.
The ZeroR model always predicts the most numerous class: 94 S-type asteroids out of 182 asteroids with a known spectral type (94 / 182 = 51.6%).
In fact, this is the Weka output:

Let's see if it is possible to use H, diameter, albedo, B-V and U-B to discriminate the S-type asteroids from the other asteroid types.

We try with the logistic model (cross validation N=10). The result is this:

I would say that the logistic model has failed to make any significant prediction.
Other classification models apparently give a better result, but I think that this is just overfitting, because the difference between the 10-fold cross-validation performance and the performance on the whole training set is too big.

Cluster 0

Again, first we look at the spectral type distribution of the 491 asteroids that belong to this cluster.
What is interesting in this cluster is this: it contains most of the carbonaceous and metallic asteroids and it is quite heterogeneous.
Let's use Weka's ZeroR algorithm to put a lower bound on the performance of any classification model.
The ZeroR model always predicts the most numerous class: 60 Ch asteroids out of 228 asteroids with a known spectral type (60 / 228 = 26.3%).
In fact, this is the Weka output:

Let's see if it is possible to use H, diameter, albedo, B-V and U-B to discriminate among the various asteroid types.
We try with the logistic model (cross validation N=10). The result is this:

This time it seems that the logistic model has moderate success (44% accuracy compared to the 26% baseline), so the asteroid physical parameters seem to have some predictive power in this cluster.
However, the fact that the accuracy jumps to 57% when you run the model on the whole set of asteroids belonging to this cluster looks like an overfitting effect.
Better not to believe this model!

Before giving up, we can see that the situation improves a little bit if we reduce the number of variables just focusing on the B-V and U-B parameters.
In fact, this is what happens with the logistic model that just uses B-V and U-B (cross validation N=10):

... and this is what happens when running the model on the whole cluster:


I am not sure about this: the relatively small increase in the number of correctly classified instances may indicate that the overfitting effect is not so big, so we may hope that the model has, at least in part, been able to generalize correctly.

In that case, as you can see in the section "Detailed Accuracy by Class", we may have some success in predicting asteroids that belong to these spectral types:
  • B (average precision 0.647)
  • CH (average precision 0.535)
  • X (average precision 0.38)
  • C (average precision 0.368)
Just for curiosity ... here is the list of predictions.

Note that:
  • column spec_B: this is the already known spectral type as downloaded from the Horizons database. When no spectral type is available in the Horizons database, you can see a question mark.
  • column classification: this is the prediction of the model. Of course, if the previous column is known and the predicted type is different ... this is an error. If the previous column is a question mark ... this is the prediction.
  • all other columns contain the probability distribution: the spectral class with the greatest probability is the predicted class.
In spite of the fact that the average precision is not high, there are a few cases where it is very high.

The fact that Weka returns the probability distribution is very nice because you can plot other graphs, for example: probability distribution of a given spectral type versus B-V.

Look for example at these plots:

type B distribution versus (B-V)


type C distribution versus (B-V)


... but before going to these details, it would be better to be more confident about the validity of the model!

Kind Regards,
Alessandro Odasso

Monday, October 20, 2014

Mars-crossing Asteroids - Absolute Magnitude Model

The H mag median for the Mars-crossing asteroids is about 18.

I tried to use data mining software (Weka) to find a classification model that builds a decision tree based on orbital parameters (a, e, i) to estimate whether a Mars-crossing asteroid has H <= 18.

After many trials, I found a model that is far from being perfect but that might have some interest.

The model seems to be able to correctly identify the H mag level of the Mars-crossing asteroids in 67% of the cases (a performance much better than the 50% probability of success that it could have just by chance).

The data mining program processed 12477 asteroids using the J48 algorithm (66% of the asteroids were used for training, the remainder for testing). When it finished, the following report was displayed:



In the above report, we see the overall performance of the model (67% of correctly identified instances) plus a detailed accuracy summary for each class showing the rate of True Positives, the rate of False Positives and the Precision.

At the bottom, we can see the so called "Confusion Matrix" or contingency table showing the two classes of asteroids magnitude:
  • class a: bright asteroid (H <= 18.0)
  • class b: dim asteroid (H > 18.0)

In order to understand it better, let's explain it looking for example at class b, i.e., the class of dim asteroids:

  • TP Rate: we see that the dim asteroids were correctly predicted with a rate of 72.8% (1554 / (1554+580))
  • FP Rate: we see that 796 bright asteroids were mistakenly classified as dim asteroids, thus the proportion of bright asteroids wrongly classified as dim is 37.8% (796 / (1312 + 796))
  • Precision: any asteroid classified as dim  was truly dim in about 66% of the cases (1554 / (1554+796)).
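The three figures above can be recomputed from the four counts of the confusion matrix. The sketch below is only a check of the arithmetic, not Weka's code; the dictionary layout and the function names are my own choice, while the counts are the ones quoted in the list.

```python
from typing import Dict, Tuple

# Keys are (actual class, predicted class); counts are taken from the report above.
confusion: Dict[Tuple[str, str], int] = {
    ("bright", "bright"): 1312,
    ("bright", "dim"): 796,
    ("dim", "bright"): 580,
    ("dim", "dim"): 1554,
}


def class_metrics(cm: Dict[Tuple[str, str], int], positive: str) -> Dict[str, float]:
    """TP rate, FP rate and precision for one class treated as 'positive'."""
    tp = sum(n for (a, p), n in cm.items() if a == positive and p == positive)
    fn = sum(n for (a, p), n in cm.items() if a == positive and p != positive)
    fp = sum(n for (a, p), n in cm.items() if a != positive and p == positive)
    tn = sum(n for (a, p), n in cm.items() if a != positive and p != positive)
    return {
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


def overall_accuracy(cm: Dict[Tuple[str, str], int]) -> float:
    """Fraction of instances on the diagonal of the confusion matrix."""
    correct = sum(n for (a, p), n in cm.items() if a == p)
    return correct / sum(cm.values())


dim = class_metrics(confusion, "dim")
print({name: round(value, 3) for name, value in dim.items()})
```

Calling class_metrics(confusion, "bright") gives the same three figures for class a, and overall_accuracy(confusion) recomputes the overall rate of correctly classified instances from the same counts.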

In the following section, you can see the model itself (as a technique called bagging was used, the output contains 10 decision trees that, taken together, produced the overall result):