FrontPage


Welcome to the Wiki for HSTA 559, Spring 2011: Modern Data Analysis II (largely multivariate methods)

   Instructor: Bob Pruzek, Professor, SPH, SUNY Albany.   email:  rmpruzek@yahoo.com  (Office hours before and after class or by appointment.)

 

Note that I have added a new heading, Most Recent at the bottom of this page. Here you will find listings of most recent files added to the documents (Upload files section) of this wiki, those intended to be especially relevant to current topics under discussion. In addition, I will be creating new pages for Comments and Questions for each month of the course, starting later this week for February. Start to use that page after Tuesday's class. Note that the Syllabus is now available, but be sure you have the one loaded about 11:15 p.m. on 2 February; see sidebar.  Syllabus 

 

For March, use this page: March: Comments and Questions

 

My handouts (usually pdf's) will be posted in the Files section. Your homework (usually a pdf) should be uploaded as a File no later than the evening before each Tuesday class; that is, I will assign homework each Thursday, so you have four days to complete it before the following Monday evening (say no later than about 9 pm).
 
Several files have been made available for download to help you understand my thinking about the course, and to facilitate your learning; they include Guidelines for Homework, to be referenced when you prepare your homework: Guidelines4Homewk.STA559.pdf ; also several Suggestions2AidLearning.pdf ;  Please read these three documents carefully, and if you think of things later that you forgot in class use this wiki (comment below) to let me know, or drop me an email.

 

Most importantly, be sure to adhere to what you see in the Guidelines when you prepare your homework.

 

With respect to learning R (with which several of you claim not to be very experienced), there are multiple sources, starting from free manuals available at the download site http://r-project.org , and continuing. Here are several additional sites:

 

Several students are (rightly) wondering what packages they 'need' for 559. I'll put it this way: if you run the following command this will install (but not load) a long list of packages that will prove especially useful here.

install.packages(c('MASS','Hmisc','BHH2','TeachingDemos','rgl','granova','PSAgraphics', 'YaleToolkit', 'ggplot2','ISwR','sos', 'psych','foreign','nutshell','vcd','car','ellipse','cluster','plyr', 'rpart', 'doBy', 'corrgram', 'UsingR', 'RSiteSearch','mi','session', 'bootstrap', 'boot', 'reshape2'), repos="http://cran.r-project.org", dependencies=TRUE)
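If you have already installed some of these, a small check can show which ones are still missing before you install anything. This is only a sketch: the short vector wanted is an illustrative subset of the list above, and the install.packages line is left commented out so nothing is downloaded until you choose to run it.

```r
# Illustrative subset of the course package list (extend as needed)
wanted <- c("MASS", "Hmisc", "granova", "ggplot2", "vcd")

# Names of all packages currently installed in your R libraries
have <- rownames(installed.packages())

# Packages from 'wanted' that are not yet installed
missing_pkgs <- setdiff(wanted, have)

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install only the missing ones:
  # install.packages(missing_pkgs, repos = "http://cran.r-project.org",
  #                  dependencies = TRUE)
} else {
  message("All listed packages are already installed.")
}
```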

To 'load' any package, run library(package), e.g., library(MASS). I usually prefer to create an object (using the R editor) called .First. Then, with the wrapper 'try' incorporated, I load many/all packages at once by running .First( ). E.g., in my case, diving right into it (pun intended), the first few lines are:

head(.First)
function ()
{
    try(library(MASS))
    try(library(Hmisc))
    try(library(ISwR))
    try(library(bootstrap))
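A complete version of such a function might look like the sketch below. It is not my actual .First, only a short illustration using the four packages shown above; the returned status vector is an addition for illustration. Because each library call is wrapped in try, a package that is not installed does not stop the others from loading. (R also runs a function named .First automatically at startup if it finds one in your workspace or profile.)

```r
# Sketch of a .First-style loader; the package list here is illustrative
.First <- function() {
  pkgs <- c("MASS", "Hmisc", "ISwR", "bootstrap")
  for (p in pkgs) {
    # character.only = TRUE lets library() take the name from a variable;
    # try() keeps going even if a package is missing
    try(library(p, character.only = TRUE), silent = TRUE)
  }
  # Report which of the requested packages are now attached
  loaded <- pkgs %in% .packages()
  invisible(setNames(loaded, pkgs))
}

status <- .First()
print(status)
```

Any package showing FALSE in status still needs to be installed.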

Try these things and ask questions as needed.

 

You can expect to see more files with several readings for the next classes within about 2 days.   

But in the meantime, go to the Upload files page and from that I'd like you to download the two pdfs that concern Comparing Two Independent Groups. Best to read before class on Tuesday!! And one more thing: Only six students gave me useful information that I need to complete the course syllabus. Please download, print, and complete the bottom question of that form to give me your preferences about topics. I have many things on my list (anova, design, variable transformations, non-linear methods, propensity score analysis, cluster analysis, latent variable and tree methods; also categorical and missing data, and a couple more) but I want to try to be sure I cover every topic you really want to see be part of this course -- and I want the syllabus to reflect this. BP

One paper I'd like you to read as soon as you complete the Comparing Groups pdfs (this wiki) is this: http://www.amstat.org/publications/jse/v17n1/helmreich.html I shall assume you will have read this carefully (at least twice) by Tuesday next; the second time, while replicating results (data in Appendix) with R open. Another paper that deals with all of the granova functions will be provided soon.

Assignments: For Feb 8, your assignment can be found in Feb8.HomewrkAssignment559.pdf  You will also see that there is a new version of the blood lead pdf: BloodLead.handout-2ways.pdf

ADDED: Because you all need at least one comprehensive source on TWO-way ANOVA, I give you: Chapter09.schwartz.simfrazu.pdf which I would like you all to read by Thursday of this week.

Added2: The article on confidence intervals is here: cumming_finch_2005.pdf ; in addition I decided to put a pdf on factorial design ANOVA here too: 07_Factorial1.pdf More details in class...

 

For details about One-way ANOVA and contrasts, see ONE WAY ANOVA+introContrasts11.pdf

NEW: re: Feb15 Homework. I have posted your assignment Feb15Hmwrk559.pdf (in three parts, so start it soon), as well as two pdfs that concern bootstrapping: BootstrppngIntro11.pdf and Bootstrap methodsChapt18HesterbergPBS18.pdf  See the Questions pdf for today's class! sta559QuestionsFeb15.pdf  Here too is the Dudek pdf on comparing distributions: bDudek.RelationsAmngDistrbs.pdf I strongly recommend that you read this carefully. Aim before class if possible to read: Essence of bootstrapping.pdf

For class on March 1, I want you to read ALL of the document here, ANOVAplannedContrsts&More.pdf ; then try to use the granova.contr function using what you learn. We shall also begin to review for the exam one week later, March 8. NOTE: The key for the brief quiz on Feb 17 is here: KeyHSTA559 Quiz.17Feb11.pdf Please examine it carefully before class; I will return your papers and grades (although they are only for your information, since I did not record them!). Note the latest URL I've posted (on data science) in 'Most Recent' below. Here is the page of Review Topics for Exam 1 next Tuesday: Review topics for Exam 1 HSTA 559.pdf

For March 15, please begin reading this first: IntroLogisticRegressionPengEducResearch.pdf and encycl.biostat.logist.regr.pdf ; then by Thursday, read Rubinpsaexposit.pdf and PropensityScoreAnalysisNotesRP11#.pdf

You should also begin going through this, but read only after finishing the preceding: PSAintroBP.Mar11.ppt.pdf

The Key for Exam 1 has now been posted; try to review carefully before class... we shall go over it after I hand back your tests (that you can keep until Thurs ONLY). For March 24, Thursday, study this: IllustrationMatchingbwtdata.doc

During the Easter break you should of course read ALL missing data documents; then begin to read, and study, these three: Basic.Matrix.Ops.correl.regrsn11.pdf , Details4L2X-AlsoEigen+SVD.11.pdf and Notes.basic matrix analysis11.pdf

After we review highlights of the missing data pdf's (know what MAR, MCAR, and MNAR are; also what simple and multiple imputation are), we shall begin with these new documents on matrices (for the last third of Tuesday's class when we next meet).

Do your best, before Tuesday's class, to read through the (rather comprehensive) answers you will find on the KEY for Exam 2 here: HSTA 559.exam2.11KEY.pdf

  

NOTE: It would be helpful for nearly everyone I think to do the following asap:

Reread the assignment (as well as what preceded it) in the ComparingGroups pdf. Then reread my Guidelines4Homework pdf. Then begin to read through the submitted homework of all students (now in folder Feb1Hmwk) and make some assessments. I think you will find that for many, purposes are not clear, datasets are not wholly reasonable for the purposes (that can be inferred), and final interpretations tend often not to refer back to purposes, nor to be comprehensive. My goal in citing these matters is to draw attention to the importance of paying close attention to details, and to learning as much as possible in your homework. More after the class 'meeting' today. PS: Take note of the Syllabus; check the Sidebar.

 

Feb1 class info: Rcode for Effect Size (assuming two groups of equal size, in matrix x2: x2 = cbind(col1,col2) )

EffectSize <- diff(apply(x2, 2, mean)) / sqrt(mean(apply(x2, 2, var)))
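To see what the effect-size expression above does, here is a small worked sketch. The data are simulated (the means, SDs, group size and seed are made up purely for illustration); the point is that the one-liner equals the difference in group means divided by the square root of the average of the two group variances, which is the pooled SD when the groups are of equal size.

```r
set.seed(559)  # arbitrary seed so the illustration is reproducible

# Two simulated groups of equal size (illustrative values only)
col1 <- rnorm(20, mean = 50, sd = 10)
col2 <- rnorm(20, mean = 55, sd = 10)
x2 <- cbind(col1, col2)

# The one-line version: column means differenced, over pooled SD
EffectSize <- diff(apply(x2, 2, mean)) / sqrt(mean(apply(x2, 2, var)))

# The same quantity written out step by step
mean_diff <- mean(col2) - mean(col1)
pooled_sd <- sqrt((var(col1) + var(col2)) / 2)
manual <- mean_diff / pooled_sd

print(c(one_liner = unname(EffectSize), step_by_step = manual))
```

Note that diff takes the second column minus the first, so the sign of the result depends on the column order in x2.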

The remainder of the Class for Feb1, done over the phone, can be seen here: Docs4Course STA558ClassFeb1.pdf  Please review before next class.

 

 

 

 

 

Please, before our March 31 class, begin to study the Sarkar pdf, including some trials w/ his code in: Sarkar08latticeLab.pdf  Also see this intro to Correspondence Analysis http://www.statmethods.net/advstats/ca.html. I will discuss the technical side of CA in class, and then post an exercise. See the data caith in the MASS library.
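As a first look at the caith data, a sketch like the following may help. It uses the corresp function from MASS rather than the package used on the statmethods page; that choice is mine, made only because MASS comes with R. Asking for two dimensions (nf = 2) allows a biplot of row and column scores.

```r
library(MASS)

# caith: hair colour by eye colour for people in Caithness, Scotland
data(caith)
print(caith)

# Simple correspondence analysis, keeping two dimensions
ca2 <- corresp(caith, nf = 2)
print(ca2)

# Joint display of row (eye colour) and column (hair colour) scores
biplot(ca2)
```

The canonical correlations printed first indicate how much of the association each dimension carries.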

An outstanding reference on Matching is the first article on this webpage for Elizabeth Stuart, Professor at Johns Hopkins University: http://www.biostat.jhsph.edu/~estuart/papers.html

Do not miss these: Indicator matrices.pdf and IllustrationsCAw-comments.pdf which I discussed on April 5.

 

See the new HW6 information (as of 4/2):

Then see these intros to Correspondence Analysis, and Mosaics 

 

 

Mosaic Plots (emphasis on package 'vcd' and 'vcdExtra'): 
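For a quick first mosaic plot, here is a minimal sketch. It assumes the built-in HairEyeColor table (collapsed over Sex) as example data, which is my choice for illustration only. If vcd is installed it uses vcd's mosaic function with residual shading; otherwise it falls back on mosaicplot from base R so the code still runs.

```r
# Two-way table of hair colour by eye colour, summed over Sex
hair_eye <- margin.table(HairEyeColor, c(1, 2))
print(hair_eye)

if (requireNamespace("vcd", quietly = TRUE)) {
  # vcd version: tile areas show cell counts, shading shows residuals
  vcd::mosaic(hair_eye, shade = TRUE, legend = TRUE)
} else {
  # Base R fallback when vcd is not installed
  mosaicplot(hair_eye, shade = TRUE, main = "Hair by eye colour")
}
```

Shaded tiles mark cells whose counts depart noticeably from what independence of hair and eye colour would predict.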

 

 

 

RE: MISSING DATA (4/12). HERE IS A FINE ARTICLE THAT YOU SHOULD READ SOON: schafer_graham_MissingDataPsychMethods02.pdf  PLEASE READ THIS AS WELL AS AT LEAST SOME OF THIS PRIMER ON MISSING DATA: http://circoutcomes.ahajournals.org/content/3/1/98.full

Also see Prof.RecaiYucel.MissingDataSlides2011.pdf  and Missing DataOverview+Sources.pdf .

 

Most Recent: You can see several new files now, all pertaining either to the comparison of two dependent samples, or to granova functions more generally. One paper about which I'd most like to have your feedback (after careful reading) is one I recently sent off for publication, and that will be revised before publication: ElementalGraphics4ANOVA.RP+JH.pdf . In addition, I put up two files that concern basic (1 way) ANOVA: AnovaDocumentation-1w.pdf and chpt28IntroStats-anova.pdf (this is a late chapter from the book by De Veaux, Velleman and Bock, Intro Stats (3rd edition) -- an especially good basic book). Another pdf of relevance to comparing two dependent samples is: Dep.SampleDiscExampleDiet.pdf . You should begin to read all of these before Thursday's class.

Several of you may be interested in the following, concerning data science (in the future), especially if you are unsure of your vocational direction: http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/