I want you to use at least three of the tools we have covered this term on this project. In the end, you will turn in four things:
1. Excel file showing the application of your first selected tool (or set of tools).
2. Excel file showing the application of your second tool (can be the same workbook with different worksheets).
3. Excel file showing the application of your third tool (can be the same workbook with different worksheets).
4. A MS Word document explaining what you learned about the data set from your analysis along with relevant recommendations for your supervisor. Your explanation should include
1) the tools used,
2) what answer the tools provided and
3) how that answer relates to the questions your supervisor asked you. Hint: 1 page is the minimum length needed to explain your tools, analysis and recommendations properly to your supervisor.
Here is the scenario. Imagine I am a property developer in the Boston, MA area. I am looking for neighborhoods where there is opportunity for development or there are opportunities for home repair / reconstruction. I was given a list of 506 neighborhoods. Each of the neighborhoods is described in my data set through 14 fields which are described in the data dictionary below.
Now Imagine that I am enlisting your assistance. Given that I have 14 figures for each neighborhood, I need you to mine through the data and see what you can tell me about it. Which neighborhoods are the best for me to consider seeking construction or remodel opportunities?
Which ones are the more affluent?
Which ones may have problems that I want to avoid?
Which ones will be most appealing to home buyers?
What else can you tell me about the data?
Here is the list of 14 fields that you have on each of the 506 neighborhoods in the data set:
CRIM – Crime Rate
ZN – Percentage of Residential land zoned for lots over 25,000 squared feet
INDUS – Percentage of land occupied by non-retail business
CHAS – Does tract bound Charles River? (=1 if tract bounds river, = 0 otherwise)
NOX – Nitric oxide concentration (parts per 10 million)
RM – Average number of rooms per dwelling
AGE – Percentage of owner-occupied units built prior to 1940
DIS – Weighted distances to five Boston employment centers
RAD – Index of accessibility to radial highways
TAX – Full-value property tax rate per $10,000
PTRATIO -Pupil-to-teacher ratio by town
LSTAT – Percentage of lower status of the population
MED – Median value of owner-occupied homes in $10,000s
CAT.MEDV – Is median value of owner-occupied homes in tract above $300,000 (CAT.MEDV = 1 or not (CAT.MEDV = 0)?
Remember, conduct analysis and find me three things about the data set that I might find useful. Then tell me why those three things are interesting.
Here is an example (This is kind of a giveaway but you can still use it):
One question I have is whether there are statistically significant differences in the average median values of the homes in the neighborhoods close to the Charles River and homes in those that are not. Sounds like it would be good to divide the set into two groups. So I would sort the whole data set by CHAS. That gives me a lot of CHAS = 0 records and a smaller number of CHAS = 1 records. Now, I can run t-test for independent samples with similar variances. My first group is all of the MEDV records where CHAS = 0. Those are in the first part of column M. My second group is all of the MEDV records where CHAS = 1. Those are in the later part of column M. Run your t-test to compare these two groups of MEDV values and see your result.
Is there a significant difference?
The other two are up to you… And keep in mind that the example I have shown here is pretty complex. They don’t all need to be that complex. Mean, median, mode, range, and standard deviation of average number of rooms could also be useful. BUT… keep in mind that 20% of the grade on this project will be the level of complexity of your analysis.