How to find appropriate clusters

1. One possible way to find clusters and compare them with the usage pattern is (a) to fix an arbitrary value of s(i) (the dissimilarity of measure/user i).

Once this value is defined, it is possible to discard all the users whose dissimilarity coefficient is less than s(i), so as to restrict the proximity of the clusters.

Other ways to define this value of s(i) may be: (b) use the program AGNES to suggest an acceptable value; (c) possibly use the distribution of the dissimilarity values to define an acceptable confidence level for filtering out wrong data.

2. Once acceptable clusters are defined, we can compare the clusters with each question in the set to get an understanding of the profile. To do this we have to carry out a one-way test of each question against the cluster division, to determine how much each question influences the definition of the clusters.

[So far, I could not find out whether the method can tell us which factor most influenced a certain dissimilarity distribution.]

3. The last step of this analysis is to compare the obtained profile with the social competences section and usage pattern style.

4. We can compare the results with an EXTERNAL VALIDATION.
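To make option (c) of step 1 more concrete, here is a minimal Python sketch that takes the step literally: every user has one coefficient, a cutoff is read off the sorted distribution of those coefficients, and users below the cutoff are dropped. The function names, the toy coefficients and the choice of the 20th percentile are all assumptions made for illustration; none of them come from DAISY or AGNES.

```python
def cutoff_from_distribution(values, percent):
    """Return the value sitting at roughly the given percentile of the values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    # Integer arithmetic avoids floating-point surprises in the index.
    index = min(len(ordered) * percent // 100, len(ordered) - 1)
    return ordered[index]


def keep_users(coefficients, cutoff):
    """Keep only the users whose coefficient is at least the cutoff."""
    return {user: value for user, value in coefficients.items() if value >= cutoff}


# Hypothetical per-user coefficients, invented for this example.
coefficients = {"u1": 0.10, "u2": 0.35, "u3": 0.40, "u4": 0.55, "u5": 0.80}
cutoff = cutoff_from_distribution(coefficients.values(), 20)
kept = keep_users(coefficients, cutoff)
print(cutoff, sorted(kept))
```

With these toy numbers the cutoff is 0.35 and only u1 is discarded. Whether "less than" or "greater than" is the right direction depends on what s(i) really measures, so the comparison in keep_users may need to be flipped.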
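The one-way test in step 2 can be sketched as follows, assuming the answers to each question are numeric (or at least can be treated as scores) and that every respondent already has a cluster label. The sketch computes the one-way ANOVA F statistic for each question in plain Python; the question names, the answers and the helper names are made up for the example, and turning F into a p-value would need a statistics library that is not shown here.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic; groups is a list of lists of numeric answers."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups and n > k")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation between the cluster means and variation inside each cluster.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("no variation inside the groups; F is undefined")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def split_by_cluster(values, labels):
    """Group one question's answers by the cluster label of each respondent."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return list(groups.values())


# Hypothetical answers of six respondents to two questions, plus their clusters.
answers = {"q1": [1, 2, 1, 5, 4, 5], "q2": [3, 3, 4, 3, 4, 3]}
clusters = [1, 1, 1, 2, 2, 2]

f_values = {q: one_way_f(split_by_cluster(v, clusters)) for q, v in answers.items()}
for q in sorted(f_values, key=f_values.get, reverse=True):
    print(q, round(f_values[q], 2))
```

In this toy data q1 separates the two clusters sharply (F is about 50) while q2 does not separate them at all (F is about 0), so q1 would be the question to look at first when describing the profile. For purely nominal answers a contingency-table test would probably be more appropriate than ANOVA.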

————————————-
Questions:

1. Which kind of “Matching Method” are we using in DAISY? [This is the method for measuring the similarity or dissimilarity of two measures. The most common one is the SIMPLE MATCHING approach: s(i,j) = n/p and d(i,j) = (p-n)/p, where n is the number of matches, in other words the number of variables for which i and j happen to be in the same state, and p is the total number of variables.]

2. Do we use the nominal parameter in DAISY? [I tried to use the parameter “symm” but it returned some errors. In addition I was using “as.ordered” to say that the variable was an interval scaled one, but the syntax I used also returned errors. In the end it was not clear to me whether I had to declare a binary variable inside the DAISY call (using the symm parameter) or by converting it with as.factor.]

3. What are the medoids of the questionnaire? [In the end I could answer this question using the “$medoids” component of the PAM output. I discovered that the medoids of the questionnaire change if we subset the responses.]

4. Is there a difference between PAM and FANNY output?

5. How can we have the system suggest the best number of clusters k? [There should be a CLUSFIND method contained in the CLUSTAN package that should allow this. www.clustan.com – apparently it is not free]

6. Can we use an agglomerative method for defining the best k? [see above]
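As a small worked example for question 1, the simple matching formulas can be written out in a few lines of Python. Here n is the number of variables on which two respondents give the same answer and p is the total number of variables. This is only an illustration of the formula with made-up answers, not the code DAISY runs internally.

```python
def simple_matching(a, b):
    """Return (similarity, dissimilarity) of two equally long answer vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty vectors of equal length")
    p = len(a)                                  # total number of variables
    n = sum(1 for x, y in zip(a, b) if x == y)  # variables in the same state
    return n / p, (p - n) / p


# Hypothetical answers of two respondents to four nominal questions.
user_i = ["yes", "no", "daily", "home"]
user_j = ["yes", "yes", "daily", "work"]
s, d = simple_matching(user_i, user_j)
print(s, d)  # 0.5 0.5
```

The two respondents agree on two of the four questions, so both the similarity and the dissimilarity come out as 0.5, and the two values always add up to 1.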
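One possible answer to questions 5 and 6, sketched here under my own assumptions rather than taken from CLUSFIND or CLUSTAN, is to cut an agglomerative clustering at several values of k and keep the k with the highest average silhouette width. The Python below uses average linkage as a stand-in for AGNES and a toy dissimilarity matrix built from six invented numbers; the function names are hypothetical.

```python
def average_linkage_labels(dist, k):
    """Merge clusters by average linkage until k remain; return one label per object."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def average_silhouette(dist, labels):
    """Mean silhouette width; objects alone in their cluster count as 0."""
    n = len(dist)
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


# Toy data: six invented values forming three obvious groups.
points = [1.0, 1.2, 5.0, 5.3, 9.0, 9.1]
dist = [[abs(x - y) for y in points] for x in points]

scores = {}
for k in range(2, len(points)):
    scores[k] = average_silhouette(dist, average_linkage_labels(dist, k))
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

On this toy matrix the average silhouette width peaks at k = 3, which matches the three visible groups. On real questionnaire data the peak is usually much less sharp, so the suggested k is a hint to check against the profiles rather than a final answer.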
