The Department of Finance (DOF) is required by New York State law to value condominiums as if they were residential rental apartment buildings. DOF uses income information from rental properties that are similar to the condominiums in physical features and location, and applies this income data to each condominium to determine its value in the same way it values rental apartment buildings. I chose this data because, as an engineer, I have always wanted to understand how a civil engineering company decides what kind of building to construct. I understand that many factors go into this decision, and this data helped me understand some of them. I found the data on the NYC Open Data website, published by the Department of Finance (DOF).
This data contains a number of important features of condominiums in NYC. Some of these features are the estimated expenses of the condominium, its area in square feet, the number of units (apartments) in the condominium, the year the condominium was built, and the full market value, which is the amount of rent paid to the condominium from the year it was built to the year it was reported.

Statistical methods

As mentioned earlier, this data includes important information about the condominiums, but some of the features I needed were not included, so I had to do some data wrangling and feature engineering. The first step was to drop all unnecessary columns, such as the address. Next, I derived the missing features from the existing ones. “Unit SqFt” is the area of one unit in the condominium. Assuming all units in a condominium are the same size, it is found by dividing “Gross SqFt” by the number of units in the condominium. “Estimated Expenses per Unit” is how much one unit costs the condominium, calculated by assuming the estimated expenses are equal for every unit. “Estimated Gross Income per Unit” is the amount of rent one unit earned from the year the building was built to the report year. “Market Value per Unit” is how much one unit would sell for, calculated from “Market Value per SqFt” and “Unit SqFt”.

Another important feature is “Profit”. The profit of any project is the gross income minus the expenses. Finally, “Profit Category” indicates whether a condominium has high or low profit: high if its profit is greater than the mean profit, and low if it is less than the mean profit.
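The derived features described above can be sketched with pandas. This is only a minimal illustration under assumptions: the column names and the three toy rows are hypothetical and are not the dataset's actual headers or values.

```python
import pandas as pd

# Hypothetical toy rows; the column names are assumed for illustration only.
df = pd.DataFrame({
    "gross_sqft": [10000, 24000, 8000],
    "total_units": [10, 20, 8],
    "estimated_expenses": [100000, 300000, 64000],
    "estimated_gross_income": [250000, 500000, 120000],
    "market_value_per_sqft": [150.0, 200.0, 100.0],
})

# Per-unit features, assuming every unit in a building is the same size
# and carries an equal share of expenses and income.
df["unit_sqft"] = df["gross_sqft"] / df["total_units"]
df["expenses_per_unit"] = df["estimated_expenses"] / df["total_units"]
df["gross_income_per_unit"] = df["estimated_gross_income"] / df["total_units"]
df["market_value_per_unit"] = df["market_value_per_sqft"] * df["unit_sqft"]

# Profit is gross income minus expenses.
df["profit"] = df["estimated_gross_income"] - df["estimated_expenses"]

# Label each building "high" or "low" relative to the mean profit.
mean_profit = df["profit"].mean()
df["profit_category"] = (df["profit"] > mean_profit).map({True: "high", False: "low"})

print(df[["unit_sqft", "market_value_per_unit", "profit", "profit_category"]])
```

Note that a building whose profit equals the mean exactly would be labeled "low" by this comparison; the text does not say how such ties are handled.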

After cleaning and organizing the data and adding all the features I needed, I ran statistical tests to understand the data better and to support my prediction. I formulated several hypotheses and tested each one statistically. For all of my hypotheses, the dependent variable (the target) was “Profit”. The first hypothesis asks whether the mean of “Full Market Value” is equal to the mean of “Profit”. My null hypothesis was that the two population means are equal, and my alternative hypothesis was that they are not equal.

H_0: the Full Market Value mean is equal to the Profit mean

H_a: the Full Market Value mean is not equal to the Profit mean

H_0: μ_FullMarketValue = μ_Profit

H_a: μ_FullMarketValue ≠ μ_Profit

My second hypothesis asks whether there is a relationship between profit and building classification: does the type of building give the owner more profit, or do all building types give the owner the same profit? I framed this as a test between two categorical variables, the building type and whether the building has high or low profit. My null hypothesis is that there is no relationship between “Building Classification” and “Profit Category”. My alternative hypothesis is that there is a relationship.

H_0: There is no relationship between the Building Classification and the profit

H_a: There is a relationship between the Building Classification and the profit

The third hypothesis was used to find the linear equation between the dependent and independent variables. The first independent variable I tested for a relationship with profit was “Estimated Expenses”. My null hypothesis was that there is no relationship between the two features, and my alternative hypothesis was that there is a relationship.

H_0: There is no relationship between the Estimated Expenses and the profit

H_a: There is a relationship between the Estimated Expenses and the profit

H_0: β1 = 0

H_a: β1 ≠ 0

After checking whether there is a linear relationship and writing the linear equation, I tried adding more features to the equation to improve it as much as I could.

With the null hypotheses defined, I ran the statistical tests to decide whether to reject each one. For the first hypothesis I used the two-sample t-test, which tests whether two population means are equal. For the second hypothesis I used the chi-square test, which tests whether two categorical variables are related. For both of these hypotheses, we reject the null hypothesis if the p-value is less than the significance level of 0.05. For the third hypothesis I used a linear regression model, in which a slope coefficient different from zero means there is a relationship between the two variables. I checked this by computing the Pearson correlation coefficient to see whether there is a linear relationship. A correlation coefficient clearly different from zero implies a nonzero slope and therefore a relationship, in which case we reject the null hypothesis.
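The decision rule for the first hypothesis can be sketched with scipy.stats.ttest_ind(). The two small samples below are made-up numbers for illustration, not values from the DOF data, and the variable names are assumptions.

```python
from scipy import stats

# Hypothetical samples (in dollars); not taken from the actual dataset.
full_market_value = [5.0e6, 6.2e6, 4.8e6, 7.1e6, 5.5e6]
profit = [1.1e6, 0.9e6, 1.4e6, 1.2e6, 1.0e6]

# Two-sample t-test of H_0: the two population means are equal.
# By default ttest_ind assumes equal variances; passing equal_var=False
# would run Welch's t-test instead.
t_stat, p_value = stats.ttest_ind(full_market_value, profit)

alpha = 0.05  # significance level used in this report
reject_null = p_value < alpha

print(f"t = {t_stat:.3f}, p = {p_value:.6f}, reject H_0: {reject_null}")
```

The same `p_value < alpha` comparison applies to the chi-square test used for the second hypothesis.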

Results

For the results I used the Python statistics library scipy.stats. I implemented the two-sample t-test with the function ttest_ind(), which calculates the t-statistic and the p-value. For this data it returned a p-value of 0.00048, which is less than the significance level. Therefore, the null hypothesis was rejected, and we conclude that the “Full Market Value” population mean does not equal the “Profit” population mean.

For the second hypothesis, I used the chi-square test to look for a relationship between the two categorical variables, using the function chi2_contingency() to find the p-value. This function returns four values: the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies. To use chi2_contingency() I first needed to tabulate the profit categories for each building class using the crosstab() function. As seen in Figure 1, crosstab() was also used to plot the percentage of low and high profit for each building class. Figure 1 shows that the two building classes with the highest profit are R9-CONDOPS and R9-CONDOMINIUM. Passing the crosstab() output to chi2_contingency() gives a p-value of 0.0000474, which is less than the significance level. Therefore, we reject the null hypothesis and conclude that there is a relationship between building classification and profit category.

Figure 1: Building Classification Vs Profit
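The crosstab() and chi2_contingency() steps can be sketched as follows. The class labels and rows are hypothetical. One point worth noting is that chi2_contingency() expects a table of observed counts, so this sketch passes the unnormalized crosstab and computes percentages separately, for plotting only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical rows; "class_A" and "class_B" are placeholder labels.
df = pd.DataFrame({
    "building_class": ["class_A"] * 6 + ["class_B"] * 6,
    "profit_category": ["low"] * 5 + ["high"] + ["high"] * 5 + ["low"],
})

# Contingency table of observed counts (rows: class, columns: category).
observed = pd.crosstab(df["building_class"], df["profit_category"])

# Row percentages, useful for a plot like Figure 1 but not for the test.
percentages = pd.crosstab(
    df["building_class"], df["profit_category"], normalize="index"
) * 100

# Returns the statistic, p-value, degrees of freedom, expected frequencies.
# For a 2x2 table scipy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
```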

For the third hypothesis, I needed the correlation to know whether there is a slope between the two variables. First, I made a scatterplot of the two variables, as shown in Figure 2. Figure 2 shows a linear trend: the relationship is not perfectly linear, but it is strong. Then I used the pearsonr() function with Estimated Expenses and Profit to get the correlation coefficient, which is 0.87. Therefore, there is a strong correlation between the two variables, so we conclude that there is a relationship and reject the null hypothesis.

Figure 2: Linear correlation between Estimated expense and profit
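A minimal sketch of the pearsonr() call, using made-up numbers rather than the real columns. Besides the correlation coefficient, pearsonr() also returns a p-value for the null hypothesis of zero correlation, which in a simple linear regression corresponds to testing β1 = 0, so it can be compared against the 0.05 significance level directly.

```python
from scipy.stats import pearsonr

# Hypothetical values; not taken from the actual dataset.
estimated_expenses = [100, 200, 300, 400, 500]
profit = [260, 370, 540, 610, 790]

# r is the Pearson correlation coefficient; p_value tests H_0: no correlation.
r, p_value = pearsonr(estimated_expenses, profit)

print(f"r = {r:.3f}, p = {p_value:.4f}")
```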

Knowing that there is a linear relationship between the two variables, we need to write the linear equation. To do so, we use the ols model from the statsmodels.formula.api library. This gives us the intercept and the slope of the linear equation, along with other information about the linear regression model, as seen in Figure 3. Among the most important outputs are the R-squared and the adjusted R-squared. These are not needed yet, but they will help when we try to improve the linear equation. Using the model output, we can write the linear equation as:

Y = 1297746.21 + 13.42 * (Estimated Expenses)

The intercept and the slope shown in the equation can be read from Figure 3 under the coef column.

Figure 3: ols model
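Fitting the model with statsmodels.formula.api can be sketched as below. The data frame is a toy example with assumed column names, so the coefficients it prints are unrelated to the values reported above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data; profit is roughly 50 + 2 * estimated_expenses.
df = pd.DataFrame({
    "estimated_expenses": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
    "profit": [255.0, 445.0, 660.0, 850.0, 1045.0, 1255.0],
})

# Ordinary least squares: profit = intercept + slope * estimated_expenses.
model = smf.ols("profit ~ estimated_expenses", data=df).fit()

intercept = model.params["Intercept"]
slope = model.params["estimated_expenses"]

print(f"Y = {intercept:.2f} + {slope:.2f} * (Estimated Expenses)")
print(f"R-squared = {model.rsquared:.4f}, adj. R-squared = {model.rsquared_adj:.4f}")
```

Calling `model.summary()` prints the full table of the kind shown in Figure 3, including the coef column.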

To improve the linear equation, we added another feature, Total Units. After refitting the ols model with Total Units included, the summary shows that the adjusted R-squared increased, which indicates that adding this feature improves the equation. The new linear equation is:

Y = 1330840.56 + 14.36 * (Estimated Expenses) - 12785.4 * (Total Units)
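Adjusted R-squared is the criterion used here to decide whether a feature is worth keeping. It is computed from R-squared, the number of observations n, and the number of predictors p, and it penalizes each added predictor. The sketch below uses the standard formula with hypothetical R-squared values to show that a feature which does not raise R-squared lowers the adjusted value slightly.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical numbers: the same R-squared with one vs. two predictors.
with_one = adjusted_r2(0.75, n=101, p=1)
with_two = adjusted_r2(0.75, n=101, p=2)

print(f"one predictor:  {with_one:.4f}")
print(f"two predictors: {with_two:.4f}")
```

A new feature only raises the adjusted value if it increases R-squared by more than the penalty for the extra predictor.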

To go further, we added another feature, Building Classification. The adjusted R-squared did not change, which means adding it would not improve the equation. We then added Year Built, but the adjusted R-squared did not change either. We therefore conclude that the best linear equation I found is:

Y = 1330840.56 + 14.36 * (Estimated Expenses) - 12785.4 * (Total Units)
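As a worked example, the final equation can be evaluated for a single building. This assumes Y denotes the predicted profit, as in the hypotheses above. The input values are hypothetical and the function name is illustrative.

```python
def predict_profit(estimated_expenses: float, total_units: float) -> float:
    """Evaluate the fitted equation reported above (Y assumed to be profit)."""
    return 1330840.56 + 14.36 * estimated_expenses - 12785.4 * total_units


# Hypothetical building: $100,000 in estimated expenses and 50 units.
prediction = predict_profit(estimated_expenses=100_000, total_units=50)

print(f"Predicted Y: {prediction:,.2f}")
```

The negative coefficient on Total Units means that, with expenses held fixed, each additional unit lowers the predicted value by about 12,785.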

Conclusion

In conclusion, my hypothesis tests show that there is a relationship between each feature I tested and the dependent (target) feature, Profit. They also suggest that the relationship is not perfectly linear, so a linear regression model may not be the right model for this data. The key features for my prediction are Estimated Expenses and Total Units: fitting the ols model with these features gives the highest adjusted R-squared. The limitation I faced in the analysis came from adding the Report Year feature to the linear model, as the scatterplot in Figure 4 shows. This feature separates the data points into several groups, each with its own linear correlation. It was hard to understand why, if some groups have a higher Pearson correlation, the adjusted R-squared of the model did not increase. What is the best model to fit this data, and which features should we use to get a better-fitting model?

Figure 4