Data Analysis
Course Description
 This course focuses on statistics.
 It introduces several models and explains the situations in which each applies,
 such as regression, ANOVA, ANCOVA, and so on.
Key points before the test
Chapter 2
1. Question 1 of Assignment 1; the slides for Chapters 1-2.

Look at the question:

Goal: Investigate the relationship between electricity consumption and GDP.

Data: A file named electricity.csv, containing Electricity, GDP, and Country name

Tasks

T1: Comment on the initial plot of the data

word notes:

To comment on the initial plot of the data, we first need to produce it.
elec.df <- read.csv("electricity.csv")   # read the data from electricity.csv
plot(Electricity ~ GDP, data = elec.df)  # draw the initial scatter plot

The outcome is shown here.
In the plot, when GDP is less than 5000, there is clearly a relationship between GDP and Electricity:
electricity consumption increases as GDP grows.

T2: We fitted an initial linear model, but there were issues with the two countries with the highest GDP. Identify these two countries, replot the data eliminating these two countries, comment on this plot, and then refit the simple linear regression model (including assumption checks) without these two countries.

word notes:

First, fit the model to get the residuals and fitted values.
elecfit1 = lm(Electricity ~ GDP, data = elec.df)  # lm fits the linear model, giving the residuals and fitted values
plot(elecfit1, which = 1)                         # show the residuals vs fitted values plot
word notes: {calculate: 计算, residuals: 残差, fitted value: 拟合值}

The outcome is here

Then, run the normality check on this model.
library(s20x)        # provides normcheck() and cooks20x()
normcheck(elecfit1)

The outcome is here

Then, get the Cook's distance plot
cooks20x(elecfit1)

The outcome is here

In the Cook's distance plot, two observations lie far from the others: the 4th and the 27th. They have a disproportionate impact on the model. From the residuals vs fitted values plot, we can infer that these are the two countries with GDP above 5000.
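
The influential observations can also be picked out numerically instead of being read off the plot. The following is only a sketch under stated assumptions: it uses simulated data in place of electricity.csv (the data frame sim.df, its Country column, and all values are hypothetical), base R's cooks.distance() rather than cooks20x(), and the 0.4 threshold quoted in these notes.

```r
# Hypothetical simulated data standing in for electricity.csv:
# 26 ordinary countries plus two with very high GDP.
set.seed(1)
n <- 28
sim.df <- data.frame(
  Country = paste0("Country", seq_len(n)),
  GDP = c(runif(n - 2, 100, 3000), 7000, 9000)
)
sim.df$Electricity <- 0.2 * sim.df$GDP + rnorm(n, sd = 50)

# Push the two high-GDP countries away from the overall trend.
sim.df$Electricity[n - 1] <- sim.df$Electricity[n - 1] + 800
sim.df$Electricity[n] <- sim.df$Electricity[n] - 1000

fit <- lm(Electricity ~ GDP, data = sim.df)

# One Cook's distance per observation, in row order.
cd <- cooks.distance(fit)

# Rows whose Cook's distance exceeds the 0.4 threshold used in these notes.
flagged <- sim.df[cd > 0.4, c("Country", "GDP")]
print(flagged)
```

With the real data, the same indexing on elec.df and elecfit1 would return the rows to report, assuming the country column is available in elec.df.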

Now, delete those observations and replot the data.
data = elec.df[elec.df$GDP < 5000, ]   # drop the countries with GDP over 5000
plot(Electricity ~ GDP, data = data)   # and show the plot

Now only the last part remains: refit the model and check the assumptions again.
elecfit2 = lm(Electricity ~ GDP, data = elec.df[elec.df$GDP < 6000, ])  # refit without the two countries
plot(elecfit2, which = 1)  # residuals vs fitted values plot
normcheck(elecfit2)        # normality check
cooks20x(elecfit2)         # Cook's distance plot



T3: Create a scatter plot with the fitted line from the new fitted model superimposed over it.

word notes:
plot(Electricity ~ GDP, data = elec.df[elec.df$GDP < 6000, ])              # scatter plot without the two countries
elecgdp.fit = lm(Electricity ~ GDP, data = elec.df[elec.df$GDP < 6000, ])  # fit the model on the same subset
abline(elecgdp.fit, lty = 2, col = "red")                                  # add the fitted line (dashed, red)


T4: Write an appropriate Executive Summary.
 word notes:
 We wish to investigate the relationship between electricity consumption and the gross domestic product (GDP) for countries of the world. First, we read in and inspect the data and produce an initial plot. Then, we use lm() to fit an appropriate linear model and use normcheck() and cooks20x() to check the model. To make the estimates credible, we eliminate the two countries with GDP greater than 6000, because their Cook’s distances are greater than 0.4. Then, we replot the data and fit a more appropriate linear model in the same way. Finally, we create a scatter plot with the fitted line from our model superimposed over it.
 word notes:
 We want to find the relationship between GDP and electricity consumption. First, we read the data from the file and produce an initial plot. Then, we use the function lm to fit an appropriate linear model and use two functions, normcheck and cooks20x, to check the model. We find that two countries stand out from the rest, because their Cook's distances are greater than 0.4. So we eliminate them, produce the initial plot again, and check the model again. Finally, we create a scatter plot with the fitted line from the model superimposed over it.



2. Be able to comment on a scatter plot of x and y. (Slides, Chapter 1, page 20)

Step one: a fixed opening description
Looking at this plot, it is clear that there is some relationship.

Step two: information specific to this plot
but there is also a lot of variability in exam score amongst students with the same test score, especially in the middle of the data.
3. Be able to comment on a plot in which different models are fitted to the same data. (Slides, Chapter 2, page 31)

First, identify the type of each model (e.g. linear), and describe the direction (positive/negative) and the slope of each.

Next, the prediction results: where are the lines close, where are they far apart, and why? As can be seen in the figure, over which range of x are the two lines close to each other, and over which range are they far apart?

Possible reasons for this: the existence of unusual points; the middle of the plot has more data, while the two sides have less. Where there is a lot of data the models agree; where there is little, different models have different resistance to outliers.

Like:
 They are two positive linear models. At first, the slope of the red one is greater than that of the blue one, and at the end, the situation is reversed.
 The two lines are close in the middle and far apart at the two sides. I guess the reason is that most of the points are concentrated in the middle, with only a few at the sides.
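
To make the "close in the middle, far at the sides" comment concrete, here is a small sketch. Everything in it is an assumption for illustration: the data are simulated (sim.df is hypothetical), and the two models compared are simply a straight line fitted to all points and a straight line fitted with one unusual point removed, which may differ from the pair of models shown in the slides.

```r
# Hypothetical simulated data: most points sit in the middle of the x range,
# plus one unusual point at the right-hand edge.
set.seed(4)
sim.df <- data.frame(x = c(rnorm(30, mean = 50, sd = 8), 95))
sim.df$y <- 10 + 0.6 * sim.df$x + rnorm(31, sd = 4)
sim.df$y[31] <- sim.df$y[31] + 25   # make the last point an outlier

fit.all  <- lm(y ~ x, data = sim.df)          # model 1: all points
fit.trim <- lm(y ~ x, data = sim.df[-31, ])   # model 2: outlier removed

# Compare the predictions of the two models across the x range.
grid <- data.frame(x = seq(20, 100, by = 10))
grid$pred.all  <- predict(fit.all, newdata = grid)
grid$pred.trim <- predict(fit.trim, newdata = grid)
grid$gap <- abs(grid$pred.all - grid$pred.trim)
print(grid)

# Draw both fitted lines over the scatter plot.
plot(y ~ x, data = sim.df)
abline(fit.all, col = "red")
abline(fit.trim, col = "blue", lty = 2)
```

The gap column shows where the two sets of predictions agree and where they diverge, which is the information the written comment should describe.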
4. Know the meaning of each number in the output of the summary function. (Slides, Chapter 2, pages 40-46)
 This part is only about translation.
 word notes:
 Estimate 估计值
 Std. Error 标准误差
 Intercept 截距
 Slope 斜率
 Residual standard error 残差标准误差
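
The numbers in the summary output can also be pulled out one by one, which helps when checking what each of them means. This is a minimal sketch on simulated data (x, y, and fit are hypothetical, not the course data), using only base R.

```r
# Hypothetical simulated data with a known intercept (5) and slope (2).
set.seed(2)
x <- 1:30
y <- 5 + 2 * x + rnorm(30, sd = 3)
fit <- lm(y ~ x)

s <- summary(fit)
coefs <- coef(s)   # matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
print(coefs)

est  <- coefs["x", "Estimate"]     # estimated slope
se   <- coefs["x", "Std. Error"]   # standard error of the slope estimate
tval <- coefs["x", "t value"]      # t value = Estimate / Std. Error
pval <- coefs["x", "Pr(>|t|)"]     # p-value for the test that the slope is 0

rse <- s$sigma       # residual standard error
df  <- s$df[2]       # residual degrees of freedom (n minus number of coefficients)
r2  <- s$r.squared   # multiple R-squared
```

The row "(Intercept)" of coefs holds the same four quantities for the intercept.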
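5. Be able to describe, with reference to the code, the specific meaning of the confidence interval (CI) for E(y) and the prediction interval (PI) for y. (Slides, Chapter 2, pages 48-50)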
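
As a reminder of the difference: the CI for E(y) is a range for the mean response at a given x, while the PI for y is a range for a single new observation at that x, so the PI is always wider. A minimal sketch, assuming simulated data (sim.df and new.df are hypothetical) and base R's predict() with its default 95% level; the course code may use a different helper.

```r
# Hypothetical simulated data.
set.seed(3)
sim.df <- data.frame(x = runif(40, 0, 10))
sim.df$y <- 1 + 0.5 * sim.df$x + rnorm(40)
fit <- lm(y ~ x, data = sim.df)

# x values at which we want intervals.
new.df <- data.frame(x = c(2, 5, 8))

# Confidence interval for the mean response E(y) at each x.
ci.mat <- predict(fit, newdata = new.df, interval = "confidence")

# Prediction interval for one new observation y at each x.
pi.mat <- predict(fit, newdata = new.df, interval = "prediction")

print(ci.mat)   # columns: fit, lwr, upr
print(pi.mat)   # same fit column, wider lwr/upr
```

Both matrices share the same fit column; only the lower and upper limits differ.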
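6. How to write the Methods and Assumption Checks section. (See Question 1 of Assignment 1)
Chapter 6
1. Chapter 6 slides; Chapter 6 case study.
2. The three principles for fitting a multiplicative model to data. (Class notes L061)
3. The form of the multiplicative model. (Chapter 6 case study)
4. How to write the Executive Summary. (Chapter 6 case study)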