March: Comments and Questions

Page history last edited by bob pruzek 8 years, 11 months ago

Please post your comments and questions in March here.


Comments

Samira said

at 12:49 pm on Apr 2, 2011

I'm having some trouble with the function corresp() in the MASS library. I have a contingency table which is of class "table" as opposed to "data.frame" or some other form which corresp() will accept. I am having no luck trying to convert the table to a matrix or data.frame and then running corresp(). I tried b=as.data.frame(x) and then corresp(b) and received Error in x%%1 : non-numeric argument to binary operator. (Along with a host of other trial-and-error methods.) I don't know what non-numeric argument it is trying to convert? Maybe the row titles? It looks like a contingency table (Just like the caith data.)! Any suggestions?

bob pruzek said

at 3:49 pm on Apr 2, 2011

Samira, Try this: First, convert the table to a matrix: e.g. xxm=matrix(xx,ncol=p) for xx the table; then xxdf=data.frame(xxm).
It works for me here. (I often find that data.frame is better than as.data.frame, am not sure why.) Let me know how this works. b
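
A minimal sketch of the conversion described above, using the caith data from MASS as a stand-in for Samira's table (her actual data are not shown in the thread). One likely reason for the error is that as.data.frame() applied to an object of class "table" returns a long-format data frame (two factor columns plus Freq), which corresp() cannot treat as a two-way layout of counts; going through matrix() first keeps the two-way layout. The object names follow bob's reply; everything else is illustrative.

```r
library(MASS)  # provides corresp() and the caith data

# Stand-in for the poster's table: caith stored as an object of class "table"
xx <- as.table(as.matrix(caith))

# as.data.frame() on a table gives one row per cell (two factors plus Freq),
# not a two-way layout of counts
long <- as.data.frame(xx)
dim(long)

# Convert to a plain matrix first, then to a data frame, as suggested above
xxm <- matrix(xx, ncol = ncol(xx), dimnames = dimnames(xx))
xxdf <- data.frame(xxm)

# corresp() accepts the two-way data frame of counts
fit <- corresp(xxdf, nf = 1)
fit
```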

Samira said

at 2:34 pm on Apr 3, 2011

This works. Thank you. Very happy now.

Andrey Avakov said

at 4:11 pm on Apr 3, 2011

Professor, to use the function corresp(), does the data need to be in a table format? Is it possible to analyze a 4-dimensional array using the corresp() function? Thank you.

bob pruzek said

at 8:59 pm on Apr 3, 2011

Andrey, Did you check the help file? It says, about the input, "x, formula The function is generic, accepting various forms of the principal argument for specifying a two-way frequency table. Currently accepted forms are matrices, data frames (coerced to frequency tables), objects of class "xtabs" and formulae of the form ~ F1 + F2, where F1 and F2 are factors." So it is clearly intended ONLY for 2-way arrays, usually called cross-tabs or contingency tables. Did you try it? b
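
Since corresp() expects a two-way table, one option with a higher-way array is to collapse it to a two-way margin first. The sketch below is only an illustration of that idea, not something proposed in the thread: it uses R's built-in 4-way Titanic table, and the choice of the Class and Survived dimensions is arbitrary.

```r
library(MASS)

# Titanic is a built-in 4-way table: Class x Sex x Age x Survived
dim(Titanic)

# Collapse to a two-way table by summing over the other two dimensions
two.way <- apply(Titanic, c(1, 4), sum)   # Class x Survived
two.way

# The collapsed matrix is an acceptable input for corresp()
fit <- corresp(two.way, nf = 1)
fit
```

Collapsing discards whatever association involves the dimensions that were summed over, so different pairs of dimensions can tell different stories.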

Richard said

at 3:25 pm on Apr 9, 2011

I am having some trouble understanding the 'absolute standardized covariate effect sizes w/ and w/out PS adjustment' graph. How is the red line being 'adjusted'?
What does an individual effect size from one variable mean, is it like a regression value where an increase in that variable increases the outcome by that much?
Finally, how do similar effect sizes indicate balance?
Any help would be appreciated.

Matthew Swahn said

at 5:09 pm on Apr 9, 2011

The individual effect size is the effect size of that variable between two groups: (sample mean x1 - sample mean x2)/(pooled standard deviation). In the context of binomial covariates, I think the average is the proportion of people with the characteristic? A small effect size (in magnitude) means that the covariates are relatively

Samira said

at 9:46 am on Apr 10, 2011

I was not able to be in class on Thursday. Is there an assignment that is due tomorrow?

bob pruzek said

at 11:12 pm on Apr 10, 2011

Richard, Matthew's response is ok, but should have ended w/ 'relatively well balanced'. The denominator for the ES is the same for the unadjusted and adjusted ES's, which means the differences between the two entail comparing simple mean differences w/ averages (across strata) of mean differences -- in both cases scaled using the relevant covariate's (pooled) s.d.
The assignment involves starting to review for Thursday's test, and reading (most of) the long pdf on Missing data that I am posting now. b
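
The unadjusted and adjusted effect sizes described above can be sketched as follows. This is an illustration on simulated data, not the code behind the class graph: the data, the logistic propensity model, the five quintile strata, and the use of stratum shares as weights are all assumptions made for the example.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)                                  # one covariate
treat <- rbinom(n, 1, plogis(0.8 * x))         # treatment depends on x

# Estimated propensity scores and five strata (quintiles of the score)
ps <- fitted(glm(treat ~ x, family = binomial))
strata <- cut(ps, breaks = quantile(ps, probs = seq(0, 1, 0.2)),
              include.lowest = TRUE, labels = FALSE)

# Pooled s.d. of the covariate: the common denominator for both effect sizes
n1 <- sum(treat == 1); n0 <- sum(treat == 0)
sd.pooled <- sqrt(((n1 - 1) * var(x[treat == 1]) +
                   (n0 - 1) * var(x[treat == 0])) / (n1 + n0 - 2))

# Unadjusted: simple difference in covariate means
es.unadj <- (mean(x[treat == 1]) - mean(x[treat == 0])) / sd.pooled

# Adjusted: mean difference within each stratum, averaged across strata
diffs <- sapply(1:5, function(s) {
  in.s <- strata == s
  mean(x[in.s & treat == 1]) - mean(x[in.s & treat == 0])
})
w <- tabulate(strata, nbins = 5) / n           # assumed weights: stratum shares
es.adj <- sum(w * diffs) / sd.pooled

abs(c(unadjusted = es.unadj, adjusted = es.adj))
```

A much smaller adjusted value indicates that, within strata, the two groups have similar means on this covariate, which is the sense of 'balance' in the graph.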

Manchun Chang said

at 9:05 am on Apr 13, 2011

Dr. Pruzek,

Just wondering if you received my e-mail about the syllabus question. I re-sent it yesterday afternoon but still haven't gotten your feedback.

bob pruzek said

at 9:12 am on Apr 13, 2011

No, Manchun, I did not get it. Please send to rmpruzek@yahoo.com. same for anyone else who did not get a response.... bob

Manchun Chang said

at 10:45 am on Apr 13, 2011

I just sent the document from another e-mail....hope this time it can get through.

Samira said

at 10:01 am on Apr 13, 2011

Will we be able to use our notes during the test on Thursday?

Yi Lu said

at 10:53 am on Apr 13, 2011

Dr. Pruzek,

I didn't get any feedback about my syllabus questions that I sent last Saturday. I already tried three times. Please let me know if you get it. Thank you so much!

bob pruzek said

at 1:37 pm on Apr 13, 2011

For your information, I DID NOT GET your emails before today w/ your answers to questions (for Manchun and Yi). Now that I have, I have
edited them and sent those edits back to you. Let me know if you do not receive them.
Samira, and everyone. Yes, as I said several times in class, all my tests are open book and open notes. But do not bring a computer to class as tests are not 'open computer'. bob

Jamie Kammer said

at 4:31 pm on Apr 13, 2011

Is missing data going to be on the exam?

Marisa Reuber said

at 5:47 pm on Apr 13, 2011

We are unclear about how to interpret the vector(s) of weights that are converted into quantitative composite variables.

Samira said

at 6:45 pm on Apr 13, 2011

I know that ATE is Average Treatment Effect and ATT is Average Effect of the Treatment on the Treated. How do you compute these? Is ATT the effect for those in the treatment group and ATE the effect on all individuals? I read the Elisabeth Stuart paper and she mentions weighting subclass estimates, but I still don't understand.

bob pruzek said

at 8:56 pm on Apr 13, 2011

Please provide a context for all questions. I will presume that you mean the vectors of weights obtained in a correspondence analysis (but tell me if that is wrong).
Suppose there are four categories (as for caith data, rows); then think about the indicator matrix w/ four columns for this categorical variable; then form a composite
of the columns using the four corresp-derived weights for these four categories. That is, if the weights are a, b, c and d, generally some negative, some positive, you
would take the columns, x1....x4 and multiply each column by these weights: a*x1 + b*x2 + c*x3 + d*x4 to get a composite, say x.comp. If the same were done for
the columns of the indicator matrix for the COLUMN categories of the initial cross-tabs matrix, you might call the result y.comp. The correlation between x.comp and y.comp
would now, having used these particular (canonical) weights, be as large as it could possibly be for one pair of composite variables. The categorical variables would have
led in this case to two quantitative composites. bob
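
The construction above can be checked numerically. The sketch below assumes the weights are the first-dimension row and column scores returned by corresp() from MASS; it expands the caith counts to one row per person, builds the two indicator matrices, and forms the two composites. The names x.comp and y.comp follow the comment; the other names are illustrative.

```r
library(MASS)

fit <- corresp(caith, nf = 1)          # first pair of canonical weights
counts <- as.matrix(caith)             # 4 eye-colour rows x 5 hair-colour columns

# One entry per person: which row category and which column category
row.id <- rep(as.vector(row(counts)), times = as.vector(counts))
col.id <- rep(as.vector(col(counts)), times = as.vector(counts))

# Indicator (dummy) matrices for the two categorical variables
X <- diag(nrow(counts))[row.id, ]      # n x 4
Y <- diag(ncol(counts))[col.id, ]      # n x 5

# Composites: weighted sums of the indicator columns
x.comp <- drop(X %*% fit$rscore)
y.comp <- drop(Y %*% fit$cscore)

cor(x.comp, y.comp)                    # correlation between the two composites
fit$cor                                # first canonical correlation from corresp()
```

The two printed values agree, which is the sense in which these weights make the correlation between one pair of composites as large as possible.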

bob pruzek said

at 8:58 pm on Apr 13, 2011

Samira, Let's forget about ATT for this test, or until it could be discussed in class. ATE is what you learned about when the loess.psa or circ.psa functions were run. I went over
their computation in detail in the handouts for the birthwt example, so the computation is fully illustrated there. Ask further if you don't understand what you see there. bob
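
For readers without the birthwt handout, here is a rough sketch of a stratified ATE estimate on simulated data. It is not the loess.psa or circ.psa computation: the data, the quintile strata, and the weights are assumptions for the example. The ATT line is included only to show the weighting of subclass estimates by the number of treated units that Samira's question refers to.

```r
set.seed(2)
n <- 2000
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(0.8 * x))
y <- 2 * treat + x + rnorm(n)          # simulated outcome, true effect = 2

# Five strata from the estimated propensity score
ps <- fitted(glm(treat ~ x, family = binomial))
strata <- cut(ps, breaks = quantile(ps, probs = seq(0, 1, 0.2)),
              include.lowest = TRUE, labels = FALSE)

# Treated-minus-control difference in mean outcome within each stratum
diffs <- sapply(1:5, function(s) {
  in.s <- strata == s
  mean(y[in.s & treat == 1]) - mean(y[in.s & treat == 0])
})

# ATE: weight strata by their share of all units
w.ate <- tabulate(strata, nbins = 5) / n
# ATT: weight strata by their share of treated units
w.att <- tabulate(strata[treat == 1], nbins = 5) / sum(treat == 1)

naive <- mean(y[treat == 1]) - mean(y[treat == 0])
ate <- sum(w.ate * diffs)
att <- sum(w.att * diffs)
c(naive = naive, ATE = ate, ATT = att)
```

Because the simulated effect is the same for everyone, ATE and ATT estimate the same quantity here; they differ in general when the effect varies across strata.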

bob pruzek said

at 9:19 pm on Apr 13, 2011

One more thing: Jamie, I don't know where I answered this question before, but just in case anyone is in doubt, missing data will not be on the exam tomorrow. bp

Richard said

at 3:14 pm on Apr 29, 2011

Will this exam be cumulative?
If so, which topics should we focus on?

bob pruzek said

at 10:20 pm on Apr 29, 2011

The exam will be cumulative, but the emphasis will be on topics we've covered near the end. See the syllabus for the topics we've covered and for examples of questions
you should be ready to address. Write answers to some of those questions HERE if you want, and I will publicly edit them if you get them in by say 4 pm on Tuesday. In
order not to be overwhelmed, however, I'll limit this to the first five or six questions you, the class, post. This puts a premium on posting sooner than later. The rest
we can of course cover in the review class on Tuesday. b

Richard said

at 7:16 pm on May 2, 2011

Q: Identify two key challenges that often arise in dealing with missing data. Elaborate
Do you mean things such as, methods to deal with missing data often assume MAR? Or things such as, many statistical methods were not developed to handle missing data? I don't understand this question exactly.

bob pruzek said

at 9:26 pm on May 2, 2011

I mean this: one challenge has to do w/ deleting cases where 'some' values are missing -- which cases to delete, and which not? The other concerns how
missing values (those that remain) are to be estimated: what different methods might be used, what assumptions do these different methods make, and
what 'seems most reasonable' in particular contexts. You are not asked to answer such questions, as you see, you are asked to elaborate on what these
challenges are, and how they might be expected to play out in data analysis practice. We'll speak about it more in class if you ask. bp
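
A small illustration of the two challenges, using R's built-in airquality data (chosen only because it contains missing values; it is not from the course materials). Mean imputation stands in for 'some estimation method' and is not being recommended.

```r
aq <- airquality                       # built-in data with NAs in Ozone and Solar.R

# Challenge 1: which cases to delete?
n.all <- nrow(aq)
n.complete <- sum(complete.cases(aq))  # drop a case if ANY value is missing
n.ozone <- sum(!is.na(aq$Ozone))       # drop a case only if Ozone is missing
c(all = n.all, listwise = n.complete, ozone.only = n.ozone)

# Challenge 2: how to estimate the missing values that remain?
# Mean imputation is the simplest choice; it assumes the missing values
# look like the observed average, and it shrinks the variability
ozone.imp <- ifelse(is.na(aq$Ozone), mean(aq$Ozone, na.rm = TRUE), aq$Ozone)
c(sd.observed = sd(aq$Ozone, na.rm = TRUE), sd.mean.imputed = sd(ozone.imp))
```

The deletion rule changes how many cases survive, and the imputation rule changes the apparent spread of the variable, so both choices affect any later analysis.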

Matthew Swahn said

at 6:21 pm on May 5, 2011

A few of us are studying, and discussing how the standardized effect size relates to confidence intervals (a question you mentioned in class). Here's what we have so far: If we take the standardized effect size between two means, and multiply it by sqrt(n) (n being the total sample size), we get a t statistic. From here, we can construct a confidence interval. As an example, P( -2 < (effect size) * sqrt(n) < 2) ~= 95%. Is this a satisfactory answer? We are curious about what relationship you refer to between the standardized ES and confidence intervals.

bob pruzek said

at 10:38 pm on May 5, 2011

To the few of you, and those who look on,
The problem is not so much the algebra, it is the understanding of this relationship. To wit: what is the denominator of the t (for a two independent sample comparison)? Write the algebra. Once you
have that, you have the standard error (estimate) of the sampling distribution of the difference in the two independent means; then ask: what is the denominator of the ES in this case? How does
the standard error relate algebraically to the latter denominator? The answer, with a bit of elaboration, gives the connection between the st. ES and the t. Discuss these two ideas. Use examples (in R?)
to make the ideas explicit. I'd like students to be able to reason about such matters and this means especially to interpret relationships of the kind we are talking about here. HTH, bp

Chuck Yang said

at 11:46 am on May 6, 2011

The algebra of the denominator of the t is (sqrt(1/n1+1/n2))*S.D.(pooled); the standard error (SE) algebra is sqrt(s1^2/n1+s2^2/n2). When n1 and n2 are large enough, the two expressions can be seen to be approximately the same. Using the standard error, we can calculate approximate confidence intervals for the mean. For instance, if we want the 95% C.I., we use the following algebra to get the upper bound and lower bound:

Upper 95% Limit = Xbar+1.96*SE
Lower 95% Limit = Xbar-1.96*SE

On the other hand, the denominator of the ES is just S.D.(pooled), therefore by multiplying st. ES by 1/sqrt(1/n1+1/n2), we get the t and can subsequently get C.I.

Am I on the right track here?
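
The relationship discussed in the last few comments can be checked with a short example in R. The scores are the two groups of 15 posted elsewhere in this thread and are used purely for illustration; the variable names are made up for the sketch.

```r
g1 <- c(26, 21, 22, 26, 19, 22, 26, 25, 24, 21, 23, 23, 18, 29, 22)
g2 <- c(18, 23, 21, 20, 20, 29, 20, 16, 20, 26, 21, 25, 17, 18, 19)
n1 <- length(g1); n2 <- length(g2)

# Pooled s.d.: the denominator of the standardized effect size
sd.pooled <- sqrt(((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2))
es <- (mean(g1) - mean(g2)) / sd.pooled

# Standard error of the mean difference: the denominator of t
se <- sd.pooled * sqrt(1 / n1 + 1 / n2)

# So t is the ES divided by sqrt(1/n1 + 1/n2)
t.from.es <- es / sqrt(1 / n1 + 1 / n2)
t.builtin <- unname(t.test(g1, g2, var.equal = TRUE)$statistic)
c(from.es = t.from.es, t.test = t.builtin)

# 95% CI for the difference in means, and the same interval in ES units
ci <- (mean(g1) - mean(g2)) + c(-1, 1) * qt(0.975, n1 + n2 - 2) * se
ci
ci / sd.pooled

# With equal group sizes the unpooled SE coincides with the pooled one
se.unpooled <- sqrt(var(g1) / n1 + var(g2) / n2)
c(pooled = se, unpooled = se.unpooled)
```

With equal group sizes n1 = n2 = n/2, the multiplier 1/sqrt(1/n1 + 1/n2) equals sqrt(n)/2, not sqrt(n).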

Chuck Yang said

at 11:13 am on May 6, 2011

I was reviewing the "MoreComparingGroupsBotht&Anova.pdf" document when I got puzzled how the non-parametric transformation was applied to the original data. It is my understanding that in NP we get all the scores together, rank them, keep track of their groups, and then use parametric methods of analysis on the ranked data. I tried to replicate this practice, but I can't get the exact same dataset as shown right in the middle of the second page of the document. How exactly did we get from the original data frame:
[1,] 26 21 22 26 19 22 26 25 24 21 23 23 18 29 22
[2,] 18 23 21 20 20 29 20 16 20 26 21 25 17 18 19

to the ranked data as below?
[1,] 25.0 21.3 22.0 25.7 18.7 21.7 25.3 24.3 23.7 20.7 22.7 23.0 17.7 26.0 22.3
[2,] 18.0 23.3 21.0 20.0 19.3 26.3 19.0 16.7 19.7 24.7 20.3 24.0 17.0 17.3 18.3

all help is appreciated.

bob pruzek said

at 3:50 pm on May 6, 2011

Chuck, That any careful reader might be puzzled is understandable. First, I jittered the scores (since there are tied values); then I transformed the ranks to a scale w/ the same median and spread as the
initial scale. The latter step helps keep the scores in the same metric while still being linearly related to the ranks themselves. If you look at the t or F statistic for comparing two groups you should
get the same thing, approximately (due to jittering), as I got w/ my version. (We did not have enough class time for me to discuss the details or my special function that does what I've described here, so
this is why your question helps me address that. Thanks.) BP ps. Let me know what your t and F comparisons are....

Chuck Yang said

at 4:32 pm on May 7, 2011

I obtained the F statistics for comparing two groups in the conventional way and, not surprisingly, I got the same results as shown in the handout. In terms of the NP method, I now understand what you did for the transformation, but on the operational level I found it difficult to replicate the second step. I had no problem using the jitter function, but what does it mean to "transform the ranks to a scale w/ the same median and spread as the initial scale"?
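
One possible reading of that second step is sketched below. It is a guess, not the function bob mentions (which is not shown in the thread): the ranks of the jittered scores are shifted and rescaled linearly so that their median and their MAD match those of the original scores. Using mad() as the measure of spread is an assumption.

```r
g1 <- c(26, 21, 22, 26, 19, 22, 26, 25, 24, 21, 23, 23, 18, 29, 22)
g2 <- c(18, 23, 21, 20, 20, 29, 20, 16, 20, 26, 21, 25, 17, 18, 19)
x <- c(g1, g2)

set.seed(3)
r <- rank(jitter(x))      # jittering breaks the ties, so the ranks are 1..30

# Linear map of the ranks: same median and same MAD as the original scores
x.np <- median(x) + (r - median(r)) * mad(x) / mad(r)

# Show the result in the same two-row layout as the original data
rbind(round(x.np[1:15], 1), round(x.np[16:30], 1))
```

Which of several tied scores receives which rank depends on the random jitter, so individual entries can differ from the handout, but the set of 30 transformed values does not. For these scores the map spaces adjacent values one third of a point apart, which appears consistent with the values quoted above.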


Chuck Yang said

at 4:41 pm on May 7, 2011

Another question, when reviewing the "Basic.Matrix.Ops.correl.regrsn11.pdf" document, I am not sure if I fully understand the role of matrix "D.smc". According to the notes in the document, the diagonal of D.smc is the squared multiple correlations which can be used to predict each column of Z (or X) from all other columns. Since we already have the product-moment correlation calculated from t(Z) %*% Z, what is the benefit of getting D.smc?

Many thanks,

bob pruzek said

at 5:22 pm on May 7, 2011

As for doing what I did in the rank transform way (and this goes for anyone interested), I will email you or put on the wiki, the function that does the job. But only after the final exam.
Q2. Your q. puzzles me. The correlation matrix shows how each pair of variables relate to one another. The squared multiple correlation does what you describe, a wholly different matter.
What am I missing here? bp

Chuck Yang said

at 11:34 pm on May 16, 2011

Hi Bob,

I am not sure if you are still monitoring the wiki pages, but I am wondering if you could please share the function you mentioned that would perform the rank transform. I am still very intrigued to find out how it is done in R.

Also, have you considered releasing an answer sheet for the final exam? Not only am I interested in how I did on the final, but I also hate being blind to what I have possibly done wrong.

Thank you and thanks for the great semester.


Chuck Yang said

at 7:54 pm on May 7, 2011

I think I confused the two concepts before. I think now I understand. t(Z) %*% Z or cor(X) gets the Pearson product-moment correlation coefficient, the "r" and D.smc gets the squared multiple correlations, the "R^2". I guess the reason I was confused before is that the example only has two columns (variables), so D.smc doesn't really provide additional information that we don't already know from cor(X), numerically speaking. But once we get more than 2 columns (variables), D.smc gives the squared multiple correlation for predicting each column from all other columns.

For example, using the tree data (3 variables), we have
# Assuming L(5) is the 5 x 5 centering matrix from the notes
Dn=diag(1/sqrt(diag(t(tree5) %*% L(5) %*% tree5))) # Diagonal matrix of 1/sqrt(sum of squared deviations)
Z=L(5) %*% tree5 %*% Dn # Centered columns, each scaled to unit length
R<-t(Z) %*% Z # Get Pearson product-moment correlations
R
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.7757376 0.9758850
[2,] 0.7757376 1.0000000 0.8940005
[3,] 0.9758850 0.8940005 1.0000000

R.inverse<-solve(R)
S<-diag(R.inverse) # Diagonal elements of the inverse correlation matrix
S.sqrd<-diag(1/S)
D.smc=diag(3)-S.sqrd # Gets squared multiple correlations.
D.smc
          [,1]     [,2]      [,3]
[1,] 0.9989322 0.000000 0.0000000
[2,] 0.0000000 0.995501 0.0000000
[3,] 0.0000000 0.000000 0.9994617

Hope this helps.
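
The interpretation of D.smc can be checked against ordinary regressions. The sketch below uses R's full built-in trees data (31 rows) rather than the five-row tree5 matrix from the class example, so the numbers differ from those above. It compares 1 - 1/diag(solve(R)) with the R^2 from regressing each variable on the other two.

```r
R <- cor(trees)                       # Girth, Height, Volume

# Squared multiple correlations from the inverse correlation matrix
smc <- 1 - 1 / diag(solve(R))

# R^2 from regressing each variable on the other two
r2.lm <- c(
  Girth  = summary(lm(Girth  ~ Height + Volume, data = trees))$r.squared,
  Height = summary(lm(Height ~ Girth  + Volume, data = trees))$r.squared,
  Volume = summary(lm(Volume ~ Girth  + Height, data = trees))$r.squared
)

rbind(smc, r2.lm)                     # the two rows agree
```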

bob pruzek said

at 5:40 pm on May 8, 2011

Chuck, When you say "t(Z) %*% Z or cor(X) gets the Pearson product-moment correlation coefficient, the "r" " , your language needs editing.
Say instead: t(Z) %*% Z or cor(X) gets the MATRIX OF Pearson product-moment correlation coefficientS, the matrix of "r" values for ALL PAIRS OF VARIABLES. bp

Richard said

at 2:09 pm on May 10, 2011

We are still unsure about how to answer the question: What is principal component analysis? Be able to describe using matrix multiplications.
We've reached the understanding that PCA involves making linear combinations of the original variables that are uncorrelated with each other to reduce the number of variables necessary, and to view the structure of the original matrix.
How are these new vectors that are created by PCA used to analyze the data? What do these new vectors say about the relationships between the original variables? We are still unsure about how to use the principal components after you have found them.

bob pruzek said

at 5:49 pm on May 10, 2011

This question really is too basic to be asked the day before the final exam! To answer it, I will depend on the 'tutorial on PCA' that I
have just uploaded to the Files page (so you must go to the last page of this wiki to get it). I suggest that, to cut the reading down to the
relevant minimum vis-a-vis your question, you go to the section on the SVD, and read that carefully. Ask questions about that here by
say 11 pm this evening and I will answer them here. Best, BP

bob pruzek said

at 5:57 pm on May 10, 2011

Addendum: The section I'm referring to is VI, and you should substitute our Z (where t(Z) %*% Z = R ) for the author's X. Also substitute D.lambda, the diagonal matrix of singular
values, for the author's Sigma. Then note that Z = U D V' (or U %*% D %*% t(V) in R) is consistent w/ the author's U SIGMA V' (V' = V transpose, i.e. V^T).
Finally, an approximation to Z can be based on some limited number of components, say Z-hat(m) = U(m) D(m) V(m)', where the first m columns of U and V (and the first m singular values) are
designated on the right side. Hope this helps. bp
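
The addendum can be followed step by step in R. This is a sketch on the built-in trees data, which is an illustrative choice and not the example in the tutorial. Z is scaled so that t(Z) %*% Z equals the correlation matrix, as in the class notation.

```r
X <- as.matrix(trees)
n <- nrow(X)

# Standardize and rescale so that t(Z) %*% Z equals the correlation matrix R
Z <- scale(X) / sqrt(n - 1)
R <- t(Z) %*% Z

# Singular value decomposition: Z = U D V'
sv <- svd(Z)
U <- sv$u
D <- diag(sv$d)
V <- sv$v
Z.full <- U %*% D %*% t(V)            # reproduces Z

# Squared singular values are the eigenvalues of R
rbind(sv$d^2, eigen(R)$values)

# Approximation based on the first m components
m <- 2
Z.hat <- U[, 1:m] %*% D[1:m, 1:m] %*% t(V[, 1:m])

# What the approximation misses is the discarded squared singular value
c(sum((Z - Z.hat)^2), sv$d[3]^2)
```

The columns of U %*% D are the principal component scores, and the columns of V hold the weights that define each component as a linear combination of the standardized variables.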
