Data Analysis

Course Description

  • This course is about statistics.
  • It introduces several statistical models and explains the situations in which each is used.
    • Examples include regression, ANOVA, and ANCOVA.

Key points before the test

Chapter 2
1. Question 1 of Assignment 1; slides for Chapters 1-2.

  • Look at the question:

    • Goal: Investigate the relationship between electricity consumption and GDP.

    • Data: A file named electricity.csv, containing Electricity, GDP, and Country name

    • Tasks

      • T1: Comment on the initial plot of the data

        • word notes:

        • To comment on the initial plot of the data, we first need to produce it.

          # Read the data from electricity.csv
          elec.df <- read.csv("electricity.csv")
          # Initial scatter plot of Electricity against GDP
          plot(Electricity ~ GDP, data = elec.df)
          
        • The result:

          [Figure: initial scatter plot of Electricity against GDP]

          In the plot, for countries with GDP below 5000 there is a clear relationship between GDP and Electricity: electricity consumption increases as GDP increases.

        • T2: We fitted an initial linear model, but there were issues with the two countries with the highest GDP. Identify these two countries, replot the data eliminating these two countries, comment on this plot, and then refit the simple linear regression model (including assumption checks) without these two countries.

          • word notes:

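          • First, fit the linear model and plot the residuals against the fitted values.

            # lm() fits the linear model; the residuals and fitted values come from this fit
            elecfit1 <- lm(Electricity ~ GDP, data = elec.df)
            # which = 1 draws the residuals vs fitted values plot
            plot(elecfit1, which = 1)
            # word notes: residual = observed value minus fitted value; fitted value = value predicted by the model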
            
          • The result:

            [Figure: residuals vs fitted values plot for elecfit1]

            • Then, check the normality of the residuals.

              # normcheck() and cooks20x() come from the s20x package
              library(s20x)
              normcheck(elecfit1)
              
            • The result:

              [Figure: normcheck() output for elecfit1]

            • Then, draw the Cook's distance plot.

              cooks20x(elecfit1)
              
            • The result:

              [Figure: Cook's distance plot for elecfit1]

            • In the Cook's distance plot, two observations stand far apart from the rest: the 4th and the 27th. They have a disproportionate influence on the model. From the residuals vs fitted values plot, we can infer that these are the two countries with GDP above 6000.

            • Now remove those two observations and replot the data.

              # Keep only the countries with GDP below 6000
              elec2.df <- elec.df[elec.df$GDP < 6000, ]
              # Replot the reduced data
              plot(Electricity ~ GDP, data = elec2.df)
              
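            • Finally, refit the model without the two countries and repeat the assumption checks.

              # Refit the simple linear regression without the two countries
              elecfit2 <- lm(Electricity ~ GDP, data = elec.df[elec.df$GDP < 6000, ])
              # Residuals vs fitted values plot
              plot(elecfit2, which = 1)

              # Normality check of the residuals
              normcheck(elecfit2)

              # Cook's distance plot
              cooks20x(elecfit2)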
              
        • T3 Create a scatter plot with the fitted line from the new fitted model superimposed over it.

          • word notes:

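            # Scatter plot of the data without the two countries
            plot(Electricity ~ GDP, data = elec.df[elec.df$GDP < 6000, ])
            # Fit the model and superimpose the fitted line (dashed, red)
            elecgdp.fit <- lm(Electricity ~ GDP, data = elec.df[elec.df$GDP < 6000, ])
            abline(elecgdp.fit, lty = 2, col = "red")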
            
        • T4 Write an appropriate Executive Summary.

          • word notes:
          • We wish to investigate the relationship between electricity consumption and gross domestic product (GDP) for countries of the world. First, we read in and inspect the data and produce an initial plot. Then we use lm() to fit a linear model and use normcheck() and cooks20x() to check the model. To make the estimates credible, we eliminate the two countries with GDP greater than 6000, because their Cook's distances are greater than 0.4. We then replot the data and fit a more appropriate linear model in the same way. Finally, we create a scatter plot with the fitted line from the model superimposed over it.
          • word notes:
          • We want to find the relationship between GDP and electricity consumption. First, we read the data from the file and produce an initial plot. Then we use the function lm to fit a linear model and use the two functions normcheck and cooks20x to check the model. Two countries stand out from the rest, with Cook's distances greater than 0.4, so we eliminate them, produce the plot again, and repeat the model checks. Finally, we create a scatter plot with the fitted line from the model superimposed over it.
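
For T2, the two influential countries can also be picked out numerically instead of being read off the plot. The sketch below is a minimal illustration using base R's cooks.distance(). The data frame demo.df and all of its values are made up for this example (they are not the contents of electricity.csv); the 0.4 cut-off is the one quoted in the summary above.

```r
# Made-up data standing in for electricity.csv (illustrative values only).
demo.df <- data.frame(
  Country     = paste0("C", 1:12),
  GDP         = c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 7000, 10000),
  Electricity = c(210, 390, 620, 790, 1010, 1180, 1420, 1590, 1810, 1990, 9000, 30000)
)

demo.fit <- lm(Electricity ~ GDP, data = demo.df)

# Cook's distance for every observation (base R, stats package).
cd <- cooks.distance(demo.fit)

# Row numbers of the two observations with the largest Cook's distance.
top2 <- order(cd, decreasing = TRUE)[1:2]
print(demo.df[top2, c("Country", "GDP")])

# Row numbers of any observations above the 0.4 cut-off.
flagged <- which(cd > 0.4)
```

With the real data, replacing demo.df by elec.df would list the country names directly, which is what T2 asks for.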

2. Be able to comment on an x-y scatter plot. (Chapter 1 slides, page 20)

[Figure: scatter plot of exam score against test score]

  • Step one: a standard opening description.

    Looking at this plot, it is clear that there is some relationship.

  • Step two: details specific to this plot.

    However, there is also a lot of variability in exam score among students with the same test score, especially in the middle of the data.

3. Be able to comment on a plot of different models fitted to the same data. (Chapter 2 slides, page 31)

[Figure: two fitted lines, one red and one blue, over the same data]

  • First, identify the type of each model (here, linear), and describe the direction (positive or negative) and the slope of each.

  • Next, compare the predictions: over which range of x are the two lines close together, over which range are they far apart, and why?

  • Possible reasons for the difference: the presence of outlying points, and the fact that the middle of the plot has more data while the two ends have less. Different models also differ in how resistant they are to outliers.

  • For example:

    • Both are positive linear models. At first the slope of the red line is greater than that of the blue line; toward the end the situation is reversed.
    • The two lines are close together in the middle and far apart at the two ends. My guess is that most of the points are concentrated in the middle, with only a few at the ends.
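
One way to check where two fitted lines agree and where they diverge is to compare their predictions over a grid of x values. The notes do not say which two models the figure shows, so this sketch assumes the simplest case: the same straight-line model fitted with and without one outlying point. All data values and object names are made up for illustration.

```r
# Made-up data: a straight-line trend with a small deterministic wiggle.
x <- 1:20
y <- 2 * x + sin(x)
y[20] <- y[20] + 15          # turn the last point into an outlier

sim.df <- data.frame(x = x, y = y)

fit.all  <- lm(y ~ x, data = sim.df)          # uses every point
fit.trim <- lm(y ~ x, data = sim.df[-20, ])   # drops the outlier

# Predictions from both models on the same grid of x values.
grid <- data.frame(x = 1:20)
gap  <- abs(predict(fit.all, grid) - predict(fit.trim, grid))

# Where are the two lines closest, and where are they furthest apart?
closest  <- grid$x[which.min(gap)]
furthest <- grid$x[which.max(gap)]

# Draw both lines over the data.
plot(y ~ x, data = sim.df)
abline(fit.all,  col = "red")
abline(fit.trim, col = "blue", lty = 2)
```

In this made-up data the gap is largest at the end of the x range where the outlier sits, which matches the kind of comment suggested above: the lines agree where the data support both fits and separate where a few points pull one model more than the other.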

4. Know the meaning of every number in the output of the summary function. (Chapter 2 slides, pages 40-46)

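  • This part is only about terminology.
  • word notes:
    • Estimate: the estimated value of a coefficient
    • Std. Error: the standard error of that estimate
    • Intercept: the fitted value of y when x is 0
    • Slope: the change in the fitted value of y for a one-unit increase in x
    • Residual standard error: the estimated standard deviation of the residuals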
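
To connect these terms to actual output, the following sketch pulls the individual numbers out of a summary() object. The data are made up for illustration; the accessors used (coef() on the summary, and the sigma and r.squared components) are standard base R.

```r
# Made-up data for illustration only.
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 19.9)

fit <- lm(y ~ x)
s   <- summary(fit)

# Coefficient table: one row per term, with columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)".
coef.table <- coef(s)

intercept.est <- coef.table["(Intercept)", "Estimate"]   # estimated intercept
slope.est     <- coef.table["x", "Estimate"]             # estimated slope
slope.se      <- coef.table["x", "Std. Error"]           # standard error of the slope

# Residual standard error: estimated standard deviation of the residuals.
rse <- s$sigma

# Multiple R-squared: proportion of the variation in y explained by the model.
r2 <- s$r.squared
```

Printing s shows the same numbers laid out as in the slides; the named objects above just make it explicit which number is which.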

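5. Be able to describe, with reference to the code, the specific meaning of the confidence interval (CI) for E(y) and the prediction interval (PI) for y. (Chapter 2 slides, pages 48-50)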
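
As a minimal sketch of the difference, base R's predict() returns either interval depending on its interval argument. The data and the chosen x value of 5.5 are made up for illustration and are not taken from the slides.

```r
# Made-up data for illustration only.
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 19.9)
fit <- lm(y ~ x)

new.df <- data.frame(x = 5.5)

# 95% confidence interval for E(y) at x = 5.5:
# a range for the MEAN response of all units with this x value.
conf.int <- predict(fit, new.df, interval = "confidence", level = 0.95)

# 95% prediction interval for y at x = 5.5:
# a range for ONE NEW observation with this x value.
pred.int <- predict(fit, new.df, interval = "prediction", level = 0.95)

# Both are one-row matrices with columns "fit", "lwr" and "upr".
ci.width <- conf.int[1, "upr"] - conf.int[1, "lwr"]
pi.width <- pred.int[1, "upr"] - pred.int[1, "lwr"]
```

Both intervals are centred on the same fitted value, but the prediction interval is wider because it also accounts for the scatter of an individual observation around the mean.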

6. How to write the Methods and Assumption Checks section. (See Question 1 of Assignment 1)

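Chapter 6
1. Chapter 6 slides and the Chapter 6 case study.
2. The three principles for fitting a multiplicative model to data. (Class notes L06-1)
3. The form of the multiplicative model. (Chapter 6 case study)
4. How to write the Executive Summary. (Chapter 6 case study)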
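
These notes do not spell out the form of the multiplicative model; the case study is the authority. A common formulation, assumed here, fits a straight line to log(y), so that on the original scale each one-unit increase in x multiplies y by a constant factor. The sketch below uses made-up data and is only an illustration of that assumed form.

```r
# Assumed form: log(y) = b0 + b1 * x, i.e. y = exp(b0) * exp(b1)^x (times an error term).
# Made-up data that grows by roughly 50% per unit of x.
x <- 1:8
y <- c(1.5, 2.3, 3.3, 5.1, 7.5, 11.5, 17.0, 25.7)

log.fit <- lm(log(y) ~ x)

# Back-transform the slope: multiplicative change in y per one-unit increase in x.
mult.effect <- exp(coef(log.fit)["x"])

# Back-transformed 95% confidence interval for that multiplicative effect.
mult.ci <- exp(confint(log.fit)["x", ])
```

Under this assumption, effects are reported as multipliers or percentage changes rather than as additive differences, which is the wording an Executive Summary for such a model would normally use.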
