San Francisco, once known for the notorious Alcatraz prison, has transformed into a hub of technological innovation. Despite this progress, the city faces significant social challenges: rising wealth inequality, severe housing shortages, and ubiquitous digital devices all contribute to urban crime. This analysis explores a dataset of nearly 12 years of crime reports across San Francisco's diverse neighborhoods to predict crime categories based on temporal and spatial data.
- Cleanse the dataset to remove inconsistencies or errors.
- Understand variables and develop insights for analysis.
- Create additional variables to enhance predictive power or interpretability.
- Standardize or transform the data for machine learning algorithms.
- Partition data into training and testing sets to evaluate model performance.
- Develop a predictive model to estimate crime types based on location and date.
Training set fields:
- Dates: Timestamp of crime occurrence.
- Category: Type of crime (target variable).
- Descript: Detailed description of the crime.
- DayOfWeek: Day of the week when the crime occurred.
- PdDistrict: Police district of the crime.
- Resolution: Outcome of the crime.
- Address: Location of the crime.
- X: Longitude coordinate.
- Y: Latitude coordinate.
Test set fields:
- Id: Unique identifier.
- Dates: Timestamp of crime occurrence.
- DayOfWeek: Day of the week.
- PdDistrict: Police district.
- Address: Location of the crime.
- X: Longitude coordinate.
- Y: Latitude coordinate.
Each of the two sets contains 878,049 entries.
To facilitate analysis and modeling:
- Convert categorical variables to numerical values.
- Use count encoding for categorical variables with many unique values.
- Apply mapping for ordinal variables.
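The two encodings listed above can be sketched with pandas as follows. This is a minimal illustration on made-up rows; the choice of Address for count encoding and DayOfWeek for ordinal mapping is an assumption, since the source does not say which columns received which treatment.

```python
import pandas as pd

# Toy frame: column names follow the dataset, the rows are invented.
df = pd.DataFrame({
    "Address": ["800 Block of BRYANT ST", "OAK ST / LAGUNA ST",
                "800 Block of BRYANT ST", "800 Block of BRYANT ST"],
    "DayOfWeek": ["Monday", "Sunday", "Wednesday", "Monday"],
})

# Count encoding: replace each value with the number of times it occurs.
# Suited to high-cardinality columns (assumed here: Address).
address_counts = df["Address"].value_counts()
df["Address_count"] = df["Address"].map(address_counts)

# Ordinal mapping: days have a natural order, so map them to 0..6.
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
day_to_int = {day: i for i, day in enumerate(day_order)}
df["DayOfWeek_num"] = df["DayOfWeek"].map(day_to_int)

print(df[["Address_count", "DayOfWeek_num"]])
```

When the same encoding is applied to a held-out set, the counts and the mapping should come from the training data only, so that both sets are encoded consistently.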
Weekdays are hypothesized to have a higher incidence of crimes compared to weekends.
- Average total crimes per weekday (Monday to Friday): 126,906.40
- Average total crimes per weekend day (Saturday and Sunday): 121,758.50
This supports the hypothesis: an average weekday accounts for approximately 4% more crimes than an average weekend day.
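The percentage can be checked directly from the two averages. The short sketch below also shows that five times the weekday average plus twice the weekend average gives 878,049, which suggests (as an interpretation, not something the source states) that each average is a total count per day of the week over the whole dataset rather than a count per calendar day.

```python
# Figures quoted in the text above.
weekday_avg = 126_906.40   # average over Monday..Friday
weekend_avg = 121_758.50   # average over Saturday and Sunday

# Relative excess of weekday crimes over weekend crimes.
relative_diff = (weekday_avg - weekend_avg) / weekend_avg

# Reconstruct the total number of records implied by the two averages.
implied_total = 5 * weekday_avg + 2 * weekend_avg

print(f"weekday excess: {relative_diff:.1%}")
print(f"implied total records: {implied_total:,.0f}")
```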
- Daily Crime Distribution: Higher on weekdays.
- Crime Categories:
  - 26 categories show higher frequency on weekdays.
  - 13 categories show higher frequency on weekends.
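A per-category comparison of this kind could be computed as sketched below. The rows are invented, and normalizing by the number of days in each group (5 and 2) is an assumption about how the comparison was made.

```python
import pandas as pd

# Invented records: ASSAULT leans toward weekends, LARCENY/THEFT toward weekdays.
df = pd.DataFrame({
    "Category": ["ASSAULT"] * 3 + ["LARCENY/THEFT"] * 6,
    "DayOfWeek": ["Saturday", "Sunday", "Monday",
                  "Monday", "Tuesday", "Wednesday",
                  "Thursday", "Friday", "Saturday"],
})

is_weekend = df["DayOfWeek"].isin(["Saturday", "Sunday"])

# Average count per day within each group (5 weekdays, 2 weekend days).
weekday_rate = df[~is_weekend].groupby("Category").size() / 5
weekend_rate = df[is_weekend].groupby("Category").size() / 2

rates = pd.DataFrame({"weekday": weekday_rate, "weekend": weekend_rate}).fillna(0)
rates["higher_on_weekdays"] = rates["weekday"] > rates["weekend"]

print(rates)
print("categories higher on weekdays:", int(rates["higher_on_weekdays"].sum()))
```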
Late night and early morning hours are hypothesized to have higher crime incidence compared to other times of the day.
- Contrary to the hypothesis, crime peaks during:
  - Evening (6 PM - 10 PM)
  - Noon (12 PM)
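An hourly distribution like the one summarized above can be obtained by extracting the hour from the timestamp and counting records per hour. The sketch uses a handful of invented timestamps, so its output says nothing about the real data.

```python
import pandas as pd

# Invented timestamps in the same format as the Dates column.
df = pd.DataFrame({"Dates": pd.to_datetime([
    "2015-05-13 23:53:00", "2015-05-13 18:05:00", "2015-05-13 12:00:00",
    "2015-05-12 18:40:00", "2015-05-12 12:30:00", "2015-05-12 18:10:00",
])})

# Hour of day (0..23) for each record.
df["Hour"] = df["Dates"].dt.hour

# Number of records per hour, ordered by hour.
crimes_per_hour = df["Hour"].value_counts().sort_index()
peak_hour = crimes_per_hour.idxmax()

print(crimes_per_hour)
print("peak hour:", peak_hour)
```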
Certain crime types are hypothesized to be more concentrated in specific neighborhoods or regions.
- Specific types of crimes are significantly more concentrated in certain police districts:
  - Larceny/Theft: Southern district
  - Drug/Narcotic incidents: Tenderloin district
  - Vehicle Theft: Ingleside and Bayview districts
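One way to look for this kind of concentration is a cross-tabulation of category against police district, normalized per category. The rows below are invented and only demonstrate the mechanics.

```python
import pandas as pd

# Invented records; category and district labels mimic the dataset's style.
df = pd.DataFrame({
    "Category": ["LARCENY/THEFT", "LARCENY/THEFT", "LARCENY/THEFT",
                 "DRUG/NARCOTIC", "DRUG/NARCOTIC", "VEHICLE THEFT"],
    "PdDistrict": ["SOUTHERN", "SOUTHERN", "NORTHERN",
                   "TENDERLOIN", "TENDERLOIN", "INGLESIDE"],
})

# Share of each category's incidents that fall in each district (rows sum to 1).
shares = pd.crosstab(df["Category"], df["PdDistrict"], normalize="index")

# District holding the largest share for each category.
top_district = shares.idxmax(axis=1)

print(shares.round(2))
print(top_district)
```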
Convert the Dates column into a numeric format (timestamps).
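A minimal sketch of this conversion, assuming the target format is seconds since the Unix epoch (the source does not specify the unit):

```python
import pandas as pd

# Two invented values in the same string format as the Dates column.
df = pd.DataFrame({"Dates": ["2015-05-13 23:53:00", "2003-01-06 00:01:00"]})

# Parse the strings into datetime64 values.
df["Dates"] = pd.to_datetime(df["Dates"])

# Seconds since 1970-01-01, stored as integers (assumed unit).
df["Timestamp"] = (df["Dates"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

print(df)
```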
Encode categorical features and add new features based on temporal and spatial data.
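The source does not list the engineered features, so the following is only an illustration of the kind of temporal and spatial features that could be derived. The names Year, Month, Hour, IsWeekend, and IsIntersection are hypothetical.

```python
import pandas as pd

# Invented rows; 2015-05-13 is a Wednesday and 2015-05-10 is a Sunday.
df = pd.DataFrame({
    "Dates": pd.to_datetime(["2015-05-13 23:53:00", "2015-05-10 08:15:00"]),
    "Address": ["OAK ST / LAGUNA ST", "1500 Block of LOMBARD ST"],
})

# Temporal features extracted from the timestamp.
df["Year"] = df["Dates"].dt.year
df["Month"] = df["Dates"].dt.month
df["Hour"] = df["Dates"].dt.hour
df["IsWeekend"] = df["Dates"].dt.dayofweek >= 5  # Monday=0 ... Sunday=6

# Spatial feature: addresses written as "A / B" denote street intersections.
df["IsIntersection"] = df["Address"].str.contains("/", regex=False)

print(df.drop(columns=["Dates", "Address"]))
```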
Use ColumnTransformer and Pipeline from scikit-learn for consistent transformations.
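A sketch of how ColumnTransformer and Pipeline fit together is shown below. The specific transformers (OneHotEncoder, StandardScaler), the feature columns, and the classifier settings are assumptions chosen for illustration, not the configuration used in the original analysis.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented feature rows and labels.
features = pd.DataFrame({
    "PdDistrict": ["SOUTHERN", "TENDERLOIN", "SOUTHERN", "INGLESIDE"],
    "Hour": [23, 14, 18, 9],
    "X": [-122.40, -122.41, -122.39, -122.44],
    "Y": [37.78, 37.78, 37.77, 37.72],
})
labels = ["LARCENY/THEFT", "DRUG/NARCOTIC", "LARCENY/THEFT", "VEHICLE THEFT"]

# Route each group of columns to its own transformer.
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["PdDistrict"]),
    ("numeric", StandardScaler(), ["Hour", "X", "Y"]),
])

# Chain preprocessing and the model so both are fitted and applied together.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

model.fit(features, labels)
predictions = model.predict(features)
print(predictions)
```

Because the preprocessing lives inside the pipeline, calling `model.predict` on new rows applies exactly the transformations learned during `fit`, which is the consistency the text refers to.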
Retain the original category names of the Category column so that numeric predictions can be mapped back to readable labels.
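One common way to keep the label names available is scikit-learn's LabelEncoder, which stores the original classes and can invert the encoding. Whether the original analysis used this class is not stated, so treat it as an assumed approach.

```python
from sklearn.preprocessing import LabelEncoder

# Invented target values in the style of the Category column.
categories = ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "VEHICLE THEFT"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(categories)   # integer codes, classes sorted alphabetically

# Map integer codes (for example, model predictions) back to category names.
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))
print(list(encoded))
print(list(decoded))
```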
Separate the target variable (Category) from the feature set and divide the data into training and testing subsets.
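The split can be sketched with train_test_split. The 80/20 ratio, the fixed random seed, and stratification by category are assumptions; the source does not give these settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: ten rows, two categories with five rows each.
df = pd.DataFrame({
    "Hour": list(range(10)),
    "PdDistrict": ["SOUTHERN", "TENDERLOIN"] * 5,
    "Category": ["LARCENY/THEFT", "DRUG/NARCOTIC"] * 5,
})

# Separate the target from the features.
X = df.drop(columns=["Category"])
y = df["Category"]

# Hold out 20% of the rows; stratify keeps category proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```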
- Achieved an accuracy of 26.35% on the test set.
- Improved the accuracy to 27.40%.
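For context on these figures, accuracy is the fraction of test records whose predicted category matches the true one. The sketch below computes it on invented labels and compares it with a uniform-guess baseline, assuming 39 categories (the 26 + 13 counted in the weekday/weekend comparison above).

```python
from sklearn.metrics import accuracy_score

# Invented true and predicted labels; two of the four predictions are correct.
y_true = ["LARCENY/THEFT", "ASSAULT", "DRUG/NARCOTIC", "LARCENY/THEFT"]
y_pred = ["LARCENY/THEFT", "LARCENY/THEFT", "DRUG/NARCOTIC", "ASSAULT"]

accuracy = accuracy_score(y_true, y_pred)

# Baseline: guessing uniformly at random among the assumed 39 categories.
n_categories = 26 + 13
uniform_baseline = 1 / n_categories

print(f"accuracy: {accuracy:.2%}")
print(f"uniform-guess baseline: {uniform_baseline:.2%}")
```

A majority-class baseline (always predicting the most frequent category) is usually a stricter reference point than uniform guessing when classes are imbalanced.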
- Crime rates are higher on weekdays and in specific districts.
- Certain crime types cluster in particular areas.
- Both Random Forest and XGBoost models showed room for improvement.
- Further feature engineering and advanced modeling techniques could enhance accuracy.
- Incorporate Resolution Outcomes: Including data on whether crimes were solved or not could provide additional context for predictive modeling.
- Temporal Aggregation: Creating aggregated features such as rolling averages or crime counts over different time windows (e.g., last week, last month) could capture trends and periodicity in crime incidents.
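As an illustration of the temporal aggregation idea, the sketch below builds a 7-day rolling crime count from daily totals. The timestamps are invented, and the window length is only one of the options mentioned above.

```python
import pandas as pd

# Invented incident timestamps.
df = pd.DataFrame({"Dates": pd.to_datetime([
    "2015-05-01 10:00", "2015-05-03 12:00",
    "2015-05-07 09:00", "2015-05-20 22:00",
])})

# Number of incidents per calendar day (days without incidents count as 0).
daily_counts = df.set_index("Dates").resample("D").size()

# Rolling sum over the current day and the six preceding days.
rolling_7d = daily_counts.rolling(window=7, min_periods=1).sum()

print(rolling_7d.loc[[pd.Timestamp("2015-05-07"), pd.Timestamp("2015-05-20")]])
```

When such a feature is used to predict the category of an incident, the window should cover only days before that incident (for example by applying `.shift(1)` to the rolling series) so that the feature does not leak information from the record being predicted.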