Data Analysis

Course Description

  • This course focuses on statistics.
  • It introduces several models and explains when each one applies.
    • Examples: regression, ANOVA, ANCOVA, and so on.

Key points before the test



  • Look at the question:

    • Goal: Investigate the relationship between electricity consumption and GDP.

    • Data: A file named electricity.csv, covering Electricity, GDP, and Country name.

    • Tasks

      • T1: Comment on the initial plot of the data

        • word notes:

        • To comment on the initial plot of the data, we first need to produce that plot.

          # Read the data from electricity.csv
          elec.df <- read.csv("electricity.csv")
          # Scatter plot of Electricity against GDP
          plot(Electricity ~ GDP, data = elec.df)
        • The outcome is here:


          In the plot, for GDP below 5000 there is clearly a relationship between GDP and Electricity.

          Electricity increases as GDP grows.

        • T2: We fitted an initial linear model, but there were issues with the two countries with the highest GDP. Identify these two countries, replot the data eliminating these two countries, comment on this plot, and then refit the simple linear regression model (including assumption checks) without these two countries.

          • word notes:

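          • First, fit the model and calculate the residuals and fitted values.

            # lm() fits the model; the residuals and fitted values come from the fit.
            # Show the residual plot.
            word notes: {residual: observed value minus fitted value; fitted value: the value the model predicts for an observation}
          • The outcome is here: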


            • Then, run the normality check on the fitted model.

            • The outcome is here:


            • Then, get the Cook's distance plot.

            • The outcome is here:


            • In the Cook's distance plot, two observations lie far from the others: the 4th and the 27th. They have a disproportionate impact on the model. From the residuals and fitted values, we can infer that these are the two countries with GDP above 5000.

            • Now, delete those observations and replot the data.

              # delete the observations with GDP over 5000
              # and show the plot
            • Okay, now only the last part remains: check again.

              # fit the reduced data and show the residual plot as before
              # normality check
              # Cook's distance
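The T2 steps above can be sketched end to end as follows. This is only a sketch under stated assumptions: the data frame is made up (only the column names Country, Electricity and GDP follow the notes), and base R functions (`qqnorm()`, `cooks.distance()`) stand in for `normcheck()` and `cooks20x()` so that it runs without any extra package. The 0.4 cut-off is the one quoted in the notes.

```r
# Hypothetical stand-in for electricity.csv (made-up values).
gdp.low <- seq(200, 2000, by = 200)
noise   <- c(30, -20, 10, -40, 25, -15, 35, -30, 20, -10)
elec.df <- data.frame(
  Country     = paste0("Country", 1:12),
  GDP         = c(gdp.low, 6500, 9000),
  Electricity = c(0.8 * gdp.low + noise, 2500, 8500)
)

# Initial simple linear regression using every country
elec.fit <- lm(Electricity ~ GDP, data = elec.df)

# Residuals against fitted values
plot(fitted(elec.fit), residuals(elec.fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality check on the residuals (base R stand-in for normcheck())
qqnorm(residuals(elec.fit))
qqline(residuals(elec.fit))

# Cook's distance for each observation (base R stand-in for cooks20x())
cd <- cooks.distance(elec.fit)
plot(cd, type = "h", xlab = "Observation number", ylab = "Cook's distance")

# Identify the observations above the 0.4 cut-off used in the notes
keep <- cd <= 0.4
elec.df[!keep, c("Country", "GDP")]

# Remove them, replot, refit, and repeat the checks
elec2.df  <- elec.df[keep, ]
plot(Electricity ~ GDP, data = elec2.df)
elec2.fit <- lm(Electricity ~ GDP, data = elec2.df)
plot(fitted(elec2.fit), residuals(elec2.fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(residuals(elec2.fit))
qqline(residuals(elec2.fit))
cd2 <- cooks.distance(elec2.fit)
plot(cd2, type = "h", xlab = "Observation number", ylab = "Cook's distance")
```

Filtering with the logical vector `keep` rather than negative row numbers avoids an empty result when no observation exceeds the cut-off.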
        • T3 Create a scatter plot with the fitted line from the new fitted model superimposed over it.

          • word notes:

            # scatter plot of the reduced data with the fitted line drawn on top
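A minimal sketch of T3, assuming the two highest-GDP countries have already been removed. The data values and the names `elec2.df` and `elec2.fit` are hypothetical; `abline()` accepts a simple linear regression fit directly and draws its line over the existing scatter plot.

```r
# Hypothetical reduced data set (made-up values).
elec2.df <- data.frame(
  GDP         = seq(200, 2000, by = 200),
  Electricity = c(190, 300, 490, 600, 825, 945, 1155, 1250, 1460, 1590)
)

# Refit the simple linear regression on the reduced data
elec2.fit <- lm(Electricity ~ GDP, data = elec2.df)

# Scatter plot first, then superimpose the fitted line
plot(Electricity ~ GDP, data = elec2.df,
     main = "Electricity versus GDP with fitted line")
abline(elec2.fit, col = "red", lwd = 2)
```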
        • T4 Write an appropriate Executive Summary.

          • word notes:
          • We wish to investigate the relationship between electricity consumption and the gross domestic product (GDP) for countries of the world. First, we read in and inspect the data and get an initial plot. Then, we use lm() to fit an appropriate linear model and use normcheck() and cooks20x() to do model checks. To make the estimates credible, we eliminate the two countries with GDP greater than 6000, because their Cook's distances are greater than 0.4. Then, we replot the data and fit a more appropriate linear model in the same way. Finally, we create a scatter plot with the fitted line from the model superimposed over it.
          • word notes:
          • We want to find the relationship between GDP and electricity consumption. First, we read the data from the file and get an initial plot. Then, we use the function lm to fit an appropriate linear model and the two functions normcheck and cooks20x to do model checks. We find that two countries behave differently from the rest, because their Cook's distances are more than 0.4. So we eliminate them, produce the initial plot again, and repeat the model checks. Finally, we create a scatter plot with the fitted line from the model superimposed over it.



  • Step one: a stock opening sentence.

    Looking at this plot, it is clear that there is some relationship.

  • Step two: describe what this particular plot shows.

    but there is also a lot of variability in exam score amongst students with the same test score, especially in the middle of the data.



  • First, identify the type of the two models (linear), then describe the direction (positive or negative) and the slope of each.

  • Next, discuss the predictions: where are they close, where are they far apart, and why? From the figure, state the range of x in which the two lines are close and the range in which they diverge.

  • Possible reasons for this: outlying points are present; the middle of the range has more data while the two ends have less; and different models resist outliers to different degrees.

  • Like:

    • Both are positive linear models. At the start the red line lies above the blue one, and by the end the situation is reversed.
    • The two lines are close in the middle and far apart at the two sides. I guess the reason is that most points are concentrated in the middle, with only a few at the sides.
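One hedged way to back up a "close in the middle, far apart at the sides" comment is to evaluate both fitted lines over a grid of x values and look at the gap between them. The notes do not say which two models the plot compares, so everything here is an assumption for illustration: the data are made up, the red line is a least-squares fit that leaves out one extreme point, and the blue line uses every point.

```r
# Hypothetical data: a linear trend plus one extreme point at the right edge.
dat <- data.frame(
  x = 1:11,
  y = c(2 * (1:10) + rep(c(0.5, -0.5), 5), 40)
)

fit.red  <- lm(y ~ x, data = dat[-11, ])  # leaves out the extreme point
fit.blue <- lm(y ~ x, data = dat)         # uses every point

# Predictions from both lines over a grid covering the range of x
grid <- data.frame(x = seq(1, 11, by = 0.5))
gap  <- abs(predict(fit.red, grid) - predict(fit.blue, grid))

# x value at which the two lines are closest, and the gap at each end
x.close   <- grid$x[which.min(gap)]
gap.left  <- gap[1]
gap.right <- gap[length(gap)]

plot(y ~ x, data = dat)
abline(fit.red,  col = "red")
abline(fit.blue, col = "blue")
```

Comparing `x.close` with `gap.left` and `gap.right` gives the ranges to quote in the written comment.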


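  • This part is only about translating terms.
  • word notes:
    • Estimate: the estimated value of a coefficient
    • Std. Error: the standard error of that estimate (one term in the output, not "standard deviation" plus "error")
    • Intercept: the fitted value when the predictor is zero
    • Slope: the change in the response per unit change in the predictor
    • Residual standard error: the estimated standard deviation of the residuals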
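To see where these terms appear in R output, the sketch below fits a small model and pulls each quantity out of the summary. The data and the names `x`, `y` and `fit` are made up for illustration.

```r
# Hypothetical data for a simple linear regression.
x <- 1:8
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
fit <- lm(y ~ x)

# Coefficient table: one row per coefficient, one column per quantity
coef.table <- coef(summary(fit))
print(coef.table)

intercept.est <- coef.table["(Intercept)", "Estimate"]  # Intercept
slope.est     <- coef.table["x", "Estimate"]            # Slope
std.errors    <- coef.table[, "Std. Error"]             # Std. Error column

# Residual standard error reported at the bottom of summary(fit)
rse <- sigma(fit)
```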


6. How to write the Methods and Assumption Checks section (see Question 1 of Assignment 1).

4. How to write the Executive Summary (see the Chapter 6 case study).

