lee
Forum Replies Created
- AuthorPosts
- leeKeymaster
MAXD tries to find the k most different objects as a subset of N objects forming the complete data set. It does this by searching a standard (lower symmetric) PATN association (dissimilarity) matrix. MAXD also provides for reading a set of sequence numbers of objects in a subset of the complete N objects, and an association matrix, and then calculating the overall maximal difference according to the selected criteria.
The ‘maximal difference’ idea has a number of applications. For example, it could be the driving force behind the question “If a subset (only) of taxa can be reserved, which taxa should they be?” The argument goes that the subset should be those that have in some sense maximal (potentially genetic?) differences. As another example, phase-one of the ALOC algorithm uses a simple form of maximal difference in an attempt to ensure that the multivariate space is adequately sampled by ‘seed’ objects.
The algorithms in MAXD could also be used as robust alternatives to the ‘minimal set’ algorithm proposed by Margules, Nicholls and Pressey (1988; see the MSET module). The ‘robustness’ in this instance may be justified by making the reasonable assumption that using a matrix of association measures has integrated species information into a more realistic estimate of site differences. Species are usually equally weighted in most pattern analysis. However, retrospective estimation of species ‘weights’ after classification or ordination (from the association matrix) will reveal that species contribute differentially; some strongly reveal underlying gradients while others are either noisy or seemingly ‘marching to a different drum’.
Option 1
This option is due to me and is very simple! It first finds the two objects separated by the greatest dissimilarity value. It next tests all candidates, looking for the object that has the maximal total distance to all of the already-selected objects. This object is then entered and the remaining candidates are tested in the same way against all previously selected objects.
Option 2
The second option is even simpler. It looks for the candidate that has the largest distance to its closest already-selected object. That is, it finds the candidate object with the maximum distance to the closest of the objects already selected.
Option 3
This option is due to the work of Dan Faith (Australian Museum, Sydney) and relates to a cladistics environment. The first step is the same as option 1. The second and subsequent steps look for that object ‘j’ which has a MAXIMUM of its MINIMAL distance to all other PAIRS of objects. Using ‘k’ and ‘l’ as a typical pair, the distance is calculated as
Distance = (Djk + Djl - Dkl) / 2
This corresponds to a taxonomic path length (number of character differences) of the added branch. The concept here is to find the taxa with maximal character difference to those already selected.
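As a rough illustration only (this is not MAXD's own code), the three options above can be sketched in Python. The sketch assumes the lower symmetric matrix has already been expanded into a full square list of lists `d`, and it assumes option 2 starts from the same farthest pair as options 1 and 3, which the description above does not state. All function names are hypothetical.

```python
from itertools import combinations


def farthest_pair(d):
    """Return the index pair (i, j) with the greatest dissimilarity."""
    return max(combinations(range(len(d)), 2), key=lambda p: d[p[0]][p[1]])


def score_total(d, j, chosen):
    """Option 1: total distance from candidate j to all selected objects."""
    return sum(d[j][s] for s in chosen)


def score_nearest(d, j, chosen):
    """Option 2: distance from candidate j to its closest selected object."""
    return min(d[j][s] for s in chosen)


def score_branch(d, j, chosen):
    """Option 3: minimum over selected pairs (k, l) of (Djk + Djl - Dkl) / 2."""
    return min((d[j][k] + d[j][l] - d[k][l]) / 2
               for k, l in combinations(chosen, 2))


def select(d, k, score):
    """Greedily pick k objects, starting from the farthest pair (assumed start)."""
    chosen = list(farthest_pair(d))
    while len(chosen) < k:
        candidates = [j for j in range(len(d)) if j not in chosen]
        # Enter the candidate that maximises the chosen criterion.
        chosen.append(max(candidates, key=lambda j: score(d, j, chosen)))
    return chosen


# Made-up 4-object dissimilarity matrix, for illustration only.
d = [
    [0, 10, 9, 5],
    [10, 0, 2, 5],
    [9, 2, 0, 4],
    [5, 5, 4, 0],
]
print(select(d, 3, score_total))    # [0, 1, 2]
print(select(d, 3, score_nearest))  # [0, 1, 3]
print(select(d, 3, score_branch))   # [0, 1, 2]
```

With only two objects selected there is a single pair, so option 3 ranks candidates the same way as option 1; the two can diverge once three or more objects have been entered.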
Cheers
Lee
- leeKeymaster
Hi Marc
I was hoping some of the users would chime in on this. I’d be interested to see what steps they would take. By the way, I have another case study (eco data) to be released with v3.04.
The most important outcome from PATN is to name the groups. Box and Whisker plots are most helpful for this. I’d consider using two-step association on your species (and name the resulting groups).
I’d then use a two-way table to help confirm your suspicions. Enough or too many groups?
I usually find that I have to iterate through sites and more likely species to reduce noise. B&W and species classification is basic for this.
What about environment data and correlates? What’s driving the variation between the groups? This is basic.
These are just a few ideas for you to kick around.
Lee
- leeKeymaster
Hi Tracy
Easy.
At the bottom of the association TAB of the Analysis Windows there is a check box “Generate lower symmetric matrices of associations”. Check it, then choose what analysis options you want.
Lee
- leeKeymaster
Hi Tiff
I’ve just returned from some surgery in hospital (and slowly recovering) so sorry for the delay in responding to your post.
I ran 1.04 million objects * 4 variables through the non-hierarchical routine into 14 groups ok with random data and reproduced your error in the box and whisker plots. We had improved the precision in some areas in 3.04 but this one has slipped by. So, we will need to work on it!
In the meantime (sorry), export the data table (it has the data and the group identifiers) to any stats package (e.g. Statistica).
Lee
- leeKeymaster
Yep, that will do it!
You can also select multiple objects in the Data Table (using CTRL or SHIFT left mouse click).
This is all in the Help.
Lee
- leeKeymaster
Hi Greg
With limited resources, I had to concentrate on PATN’s analysis abilities and screen display. The emf file format output from PATN is however very flexible, vector based and therefore scale independent. EMFs are therefore very good for manipulating.
Everyone these days should have a fair image editor. Adobe Photoshop is the top end, and I’d not recommend it unless you were using it all the time. It is for professionals. I use Paintshop Pro for serious stuff, but find ACDSee excellent for simple tasks. ACDSee is excellent for browsing and with Optimizer, awesome for image sizing and compressing. Good alternatives are Adobe Photoshop Elements, Microsoft Digital Image Pro, and maybe Ulead PhotoImpact.
There is a good review at
http://www.consumersearch.com/www/software/photo_editing_software/fullstory.html
Hope that helps,
Lee
- leeKeymaster
Hi Greg
I thought I’d see if anyone else had any ideas. I’m sure they do, but it appears they are sleeping.
Anyway, there is no problem in editing the images to suit embedding in any applications such as Word. I saved a 23 species by 6 groups box and whisker plot as an emf (enhanced meta file). The emf format is high quality so a good place to start.
I then loaded the image into a simple editor (I used ACDSee but Paintshop, Photoshop or a host of others will do fine) and cropped it to a series of full A4 pages in JPG format. I could have easily cropped it to suit any smaller format.
Lee
- leeKeymaster
Thanks Michelle and Ross for a useful discussion. I have sat on my thoughts for a few weeks about this issue.
There is truth in all that has been said. I have certainly had situations where the 4th dimension in an ordination provided evidence for a genuine process that was generating variation in the data. Ross is right however in that caution is well justified. It is often too easy to interpret patterns that may be suspect, particularly in higher dimensions.
I have put the request for more than 3d on the ‘TO DO’ list, but not as priority-1 (e.g. I’d like to add a nearest-neighbour list function first, as a number of users have requested it). I figure additional dimensions in SSH need not be selected, but can be there if desired.
Ross, on stress I do tend to use ~0.15 as a cut-off myself. If stress is higher than 0.15, I seek ways to reduce it closer to 0.1. I look back at the coding, transformations and standardisations, and eliminate outlier objects and noisy variables. I would like to add the ability to determine which objects were the most difficult to fit in SSH. Another wishlist item!
Lee
- leeKeymaster
Hi Guys
I have been pondering what Derek said about groups. I don’t think I understand the issue. The groups are defined either by the dendrogram ordering or by non-hierarchical classification. In the case of the dendrogram, the group labelling follows a simple algorithm – starting from the top of the dendrogram (the right side in PATN), the group containing the lowest sequenced object (the highest row or leftmost column in the Data Table) is ‘rotated’ to the top of the dendrogram. The process is repeated down the dendrogram. Group ‘1’ will therefore always contain the object in row 1 or the variable in column 1.
Variations in classification strategy (changing the association measure or beta value, or adding or removing a few objects, for example) should produce groups that are similar in definition, but there is less guarantee the more radical the change.
Non-hierarchical classification will produce groups that are also generally in the order of the sequence of the rows. Object 1 is likely (but not guaranteed) to be in group ‘1’.
Once groups are defined, PATN maintains their definition. In SSH (and all other post-classification options), the group numbers displayed will be those as defined by classification.
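The dendrogram rotation rule described above can be sketched roughly as follows. This is only an illustration of the idea, not PATN's code: it assumes the dendrogram is held as nested tuples whose leaves are 1-based object sequence numbers, and the function names are hypothetical.

```python
def rotate(node):
    """Return (lowest_sequence_number, reordered_node) for a nested-tuple tree.

    At every fusion the branch containing the lowest-sequenced object is
    moved to the front (the 'top' of the dendrogram).
    """
    if isinstance(node, int):
        return node, node
    rotated = sorted((rotate(child) for child in node), key=lambda pair: pair[0])
    return rotated[0][0], tuple(subtree for _, subtree in rotated)


def leaf_order(node):
    """List the objects in the order they appear down the dendrogram."""
    if isinstance(node, int):
        return [node]
    return [leaf for child in node for leaf in leaf_order(child)]


# Made-up dendrogram over six objects.
tree = ((4, (3, 5)), (2, (6, 1)))
_, ordered = rotate(tree)
print(ordered)              # (((1, 6), 2), ((3, 5), 4))
print(leaf_order(ordered))  # [1, 6, 2, 3, 5, 4]
```

Cutting the rotated tree at the top fusion and numbering the branches from the top gives group 1 = {1, 6, 2} and group 2 = {3, 5, 4}, so group ‘1’ contains object 1, as described.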
Lee
- leeKeymaster
Hi Michelle
A fair request. It was pragmatic that we implemented only up to three dimensions in PATN (for Windows) V1. The significance of achieving our 3d display environment probably blinded us from thinking further, at least at that time. There is however no logic in assuming that there could be only three factors controlling the variation in a dataset.
What do you do if stress is greater than 0.15? At the moment, probably reduce the noise or the complexity of the data in one of a number of ways. I would not publish an ordination result greater than 0.15 myself.
If there is user support to go to, say, 5d, my strategy would be to allow any 3 of the 5 axes to be selected for display in the 3d plot. The PCC strategy follows easily enough. Listing and output of the coordinates represents no problem.
Over to others to discuss-
Lee
- leeKeymaster
Hi Jeremy
Sorry for the delay. I’ve been up in the Tasmanian mountains all week.
Yes, I would agree that ‘congruence’ between an SSH result and a classification would be comforting. This is not necessarily easy to accomplish though. ‘Visual’ checks may be OK on very small datasets. Even so, classifying a (Euclidean) ultrametric matrix from SSH and comparing it with the original classification has its complications. For example, you would need to check how well each point has been handled by SSH. In comparing classification with SSH, I’d normally expect the classification to be more robust, unless the stress was VERY low (<~0.05).
At the moment, PATN doesn’t produce ultrametrics or an individual stress breakdown, but we could.
Congruence between beta=-0.1 and beta=-0.25 would be nice!
There is no doubt that situations will occur where higher negative beta values will produce a ‘better’ result, even if you know what truth is.
I will take another look at beta using simulation and see what I come up with. Any other user feedback on this issue would be warmly welcomed.
Lee
- leeKeymaster
Hi Jeremy
When Godfrey Lance and Bill Williams developed their combinatorial algorithm, they applied their beta value to what they called flexible WPGMA (Weighted Pair-Group Method using Arithmetic averaging). This clustering strategy weights GROUPS equally. This implies that objects are weighted differently during the fusion process. Using this approach, they tended not to like what they called ‘chaining’, the situation where a joins b, then c joins a and b, and then d joins a, b and c, and so on. It just made the dendrogram difficult to interpret. They liked the ‘tidier’ groups when beta was set to -0.25 (as did many others!). This value effectively dilates the data space, making groups separate from each other as fusion progresses (like what is happening now with stars).
I prefer ‘reality’.
Flexible UPGMA (developed by myself, Dan Faith and Glenn Milligan) is the unweighted pair-group counterpart to flexible WPGMA: it weights objects equally through the fusion process (so group weighting changes). The beta values in the two approaches are not totally equivalent. Simulation studies (with seriously complex, but known data that there is no space to elaborate on) suggested that a beta value of -0.1 best recovers known groupings.
Decreasing the beta value will tend to make the groups more equal in size. While this makes a dendrogram easier to interpret, the cost is a greater probability of misclassifications. Using a beta value of -0.1 is conservative. I would not, however, default to values as low as -0.25.
The negative beta value does have the effect of neatly countering any underestimation of association between distant objects. See my SSH algorithm for more on this issue.
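To make the difference between the two flexible strategies concrete, here is a minimal sketch of the Lance-Williams style distance update after groups i and j fuse, written from the commonly published form of the coefficients. It is an assumption that PATN uses exactly this form; the function name and parameters are hypothetical.

```python
def flexible_update(d_ki, d_kj, d_ij, n_i, n_j, beta=-0.1, weighted=False):
    """Distance from group k to the new group formed by fusing i and j.

    weighted=True  -> flexible WPGMA: groups i and j count equally.
    weighted=False -> flexible UPGMA: each object counts equally, so the
                      larger group gets the larger coefficient.
    """
    if weighted:
        alpha_i = alpha_j = (1 - beta) / 2
    else:
        total = n_i + n_j
        alpha_i = (1 - beta) * n_i / total
        alpha_j = (1 - beta) * n_j / total
    return alpha_i * d_ki + alpha_j * d_kj + beta * d_ij


# Made-up values: group i has 3 objects, group j has 1.
print(flexible_update(4.0, 8.0, 2.0, n_i=3, n_j=1, beta=-0.1))
print(flexible_update(4.0, 8.0, 2.0, n_i=3, n_j=1, beta=-0.25, weighted=True))
```

With beta = 0 the two variants reduce to ordinary UPGMA and WPGMA averaging. A negative beta raises the alpha coefficients above their plain averaging values, which is the space-dilating effect described above.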
Does that help?
Lee
- leeKeymaster
Hi Fiona
Sorry for the delay in responding. I’ve been in the backblocks of South Africa. No Internet.
If you can do a 2 way table with the demo dataset and not with another dataset, it is very odd.
All you need to do is make sure that you have a hierarchical classification of BOTH the rows (objects) and the columns (variables) of your dataset. Then you should be able to press the two-way table button and it should go fine.
I’m in JoBurg en route home. Back in 24 hours, so let me know how you got on.
Lee
- leeKeymaster
Hi David
I’m not sure what you mean. An ordination (SSH in PATN) doesn’t produce groups, only one of the classification methods (hierarchical or non-hierarchical). You could also import a set of groups (called a-priori in PATN).
If groups have been defined, you can view them in various forms in the ordination display-
1. Group centroids can be displayed (with appropriate colour) by pressing ‘c’ on the keyboard
2. Group centroid colours can be applied to all objects by pressing ‘g’
3. Group centroids can be displayed if you click the left mouse button off an object (and inter-centroid differences displayed by dragging between two centroids)
The legend in the ordination display will display the object (row) label and its colour. If group centroids are displayed, only those will be listed in the legend.
Has a classification been run or imported? Are the object (row) labels being displayed correctly on clicking the objects in the ordination plot, or in the legend (which doesn’t display group labels)?
In version 3.02 undefined groups are usually labelled as ‘-1’ (version 3.03 we are now using ‘0’). If you click the groups tab on the right hand side of the data table, are there meaningful group numbers displayed?
Lee
- leeKeymaster
Hi Kristen
It is easy – just right mouse click on the dendrogram in PATN, select “Save Image…” and provide a filename. I suggest sticking with the ‘Enhanced meta file’ format, as this provides excellent image quality for embedding in applications such as Word.
There is no such thing as ‘saving’ a dendrogram in PATN, except as an image. It is only the PATN project file (in its entirety) that needs to be saved. Each time you click the dendrogram button on the PATN toolbar, the dendrogram is created afresh. Once a hierarchical classification is run, the dendrogram button should be available (not greyed out).
The only way that the button would not be available would be if a new non-hierarchical classification was run, or that the PATN project file was not saved from the analysis, so when you re-opened the file, no analysis would be available to view or evaluate. Did you click an ‘available’ dendrogram button and nothing at all happened?
Lee