Using Machine Learning and NLP to automate Crime Survey for England and Wales (CSEW) offence coding

Alessandra Sozzi and Shannan Greaney, ONS

Contents
- What is the CSEW?
- The current offence coding process
- The purpose of the case study
- OffenceCoder model
- Results

What is the Crime Survey for England and Wales (CSEW)?

- The CSEW aims to measure the extent of various crimes experienced by the public.
- It asks respondents whether they have experienced crime in the last 12 months.
- If YES, respondents are asked a series of detailed questions, known as the Victim Form (VF). One respondent can complete up to 6 VFs.
- Hundreds of closed questions and a free text field (which summarises the crime) are used by the coder to assign an Offence code.
- These are finally used to produce analysis and outputs.

Current process

- The Crime Statistics team at ONS dual codes 10% of VFs (approx. 2,000 VFs per year) to check that the external company that manages the CSEW is coding correctly.
- It amounts to a job for 1 part-time person for 1 year (in reality it is spread across 2 EOs, 7 ROs and 7 SROs).
- On average it takes 10-15 mins per VF.
- Ambiguous cases require the agreement of multiple people in the team (a decision could take days and a sign-off by a G7).
- Coders have to choose one of the 50+ offence codes.

Ambiguous VFs might be:
- Multiple crimes: the VF features more than one crime (e.g. a burglar breaks into someone's house, beats up the occupants, steals the car and breaks some valuable belongings). A priority order is used.
- Duplicates: using the example above, the respondent (or interviewer) could record each of those crimes as separate VFs, but because they belong to the same incident, one VF should have been completed and one offence code should be applied.

Example of current coding process

- Coders read through the free text and the closed responses.
- They have to follow written guidance and flow charts (8 in total) to reach an Offence code.
- Please note, this is not a real VF.

[Flow chart: CRIMINAL DAMAGE. Decision questions: C2 Force or violence; C3 Serious injury; C4 Intentional; C6 Enter resp. home; C7 Enter outhouse; C8 Right to do so; C9 Anything stolen; C9a Is assault more serious than damage; C10 Deliberate Damage or Accident; C10a Attempt to damage; C11 Level of damage; C12 What was damaged; C13 Belong to hh; C14 Cost of damage; C15 Cost of damage; C16 Belong to Resp?; C17 Cost of damage. Outcomes: SEXUAL OFFENCE, CODE 11, BURGLARY, ROBBERY / BURGLARY, ASSAULT, OTHER CRIME and CODES 81 to 89. Branch labels include Vehicle / Home / Other, Attempt, Accident, Nuisance, and cost of damage 20 or less / over 20.]

The purpose of the case study

The purpose of the case study is to assess the feasibility of doing this coding automatically, using Natural Language Processing (NLP) and classification techniques.

- Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. We use 10 years of historic manually classified VFs to build a model that can predict the correct offence code for new unseen VFs.
- NLP is a field of computer science that deals with applying linguistic and statistical algorithms to text in order to extract meaning and make its information accessible to computer applications. We use NLP to convert text into new numeric features that the model can use to learn more about the incident.

OffenceCoder model

End-to-end process

- It's composed of three parts: the Cleaner pipeline, the Model pipeline (Questions and Text) and the Thresholding System.
- It's built entirely in Python and scikit-learn.

What is a pipeline?

A pipeline is simply a chain of steps. It allows you to perform a sequence of different transformations or steps (find a set of features, generate new features, select only some good features) on a raw dataset.

The cleaner pipeline

[Diagram: (~15000, ~900) -> Cleaner pipeline -> (~150000, ~130)]

- Responsible for taking in the raw .csv files and standardising them across the years.
- Each file enters the pipeline individually, and at the end they are joined together in a single big file.

Examples of processing steps include:
- Renaming columns that have changed over the years with a new common name
- Feature selection based on expert knowledge
- Feature combination
- Filtering out invalid forms

Example cleaner pipeline

- Easy to change the number/order of the steps
- We can combine common steps built into scikit-learn with our own custom-built steps
- Easier to maintain the code
- Each step has a similar structure
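
A cleaner pipeline of the kind described above might look roughly like the sketch below. It is not the OffenceCoder code: the step classes, the column names (`force`, `violence`, `descr`, `valid`, `interviewer_id`) and the two toy yearly files are all hypothetical, chosen only to show custom steps that share a similar structure and are chained with scikit-learn's `Pipeline`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class RenameColumns(TransformerMixin, BaseEstimator):
    """Map column names that changed over the years onto one common name."""

    def __init__(self, mapping):
        self.mapping = mapping

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # Names in the mapping that are absent from X are simply ignored.
        return X.rename(columns=self.mapping)


class DropInvalidForms(TransformerMixin, BaseEstimator):
    """Keep only the rows whose (hypothetical) validity flag equals 1."""

    def __init__(self, flag_column):
        self.flag_column = flag_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[X[self.flag_column] == 1].reset_index(drop=True)


class SelectColumns(TransformerMixin, BaseEstimator):
    """Keep only the columns chosen in advance (e.g. by expert knowledge)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


def make_cleaner():
    # The steps are a plain list, so changing their number or order is easy.
    return Pipeline([
        ("rename", RenameColumns({"force": "force_used",
                                  "violence": "force_used",
                                  "descr": "description"})),
        ("drop_invalid", DropInvalidForms(flag_column="valid")),
        ("select", SelectColumns(["force_used", "description"])),
    ])


# Two toy "yearly files" with different column names (made-up data).
year_a = pd.DataFrame({"force": ["Yes", "No"],
                       "descr": ["window smashed", "garden gate damaged"],
                       "valid": [1, 0]})
year_b = pd.DataFrame({"violence": ["No", "Yes"],
                       "description": ["car scratched", "pushed and phone broken"],
                       "valid": [1, 1],
                       "interviewer_id": [7, 9]})

# Each file goes through the pipeline individually, then all are joined.
cleaned = [make_cleaner().fit_transform(df) for df in (year_a, year_b)]
combined = pd.concat(cleaned, ignore_index=True)
```

Because every step exposes the same `fit`/`transform` interface, built-in scikit-learn transformers could be dropped into the same list alongside the custom ones.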

The model pipeline

Data is not quite ready yet for modelling. The closed questions and the text description go through additional but separate processing steps.

Text
- Each VF description is converted into a vectorised format.
- Term frequency-inverse document frequency (TF-IDF) measures the importance of each word in a document by comparing its frequency there with its frequency across a large set of documents.
- For each VF, each word is scored based on its importance within that VF and with respect to the whole set of VFs.

Questions
- Questions are further processed after the basic processing performed in the cleaning phase.
- The main task of this pipeline is to convert responses like Yes/No into integers (e.g. 1/0). This is called one-hot encoding.
- Some questions have more complex levels, so new dummy variables are created.
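
The two separate processing routes can be illustrated with a small sketch, assuming scikit-learn's `ColumnTransformer`, `OneHotEncoder` and `TfidfVectorizer` are used. Whether OffenceCoder uses exactly these classes is an assumption, and the three forms and their column names (`force_used`, `what_damaged`, `description`) are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Three made-up victim forms: two closed questions plus a free-text field.
forms = pd.DataFrame({
    "force_used": ["Yes", "No", "Yes"],
    "what_damaged": ["Vehicle", "Home", "Other"],
    "description": [
        "someone smashed the car window",
        "front door was kicked in and damaged",
        "offender pushed me and broke my phone",
    ],
})

features = ColumnTransformer([
    # Closed questions: one 0/1 dummy column per answer level.
    ("questions", OneHotEncoder(handle_unknown="ignore"),
     ["force_used", "what_damaged"]),
    # Free text: one TF-IDF score per word in the vocabulary.
    # The column is given as a string so the vectoriser receives 1-D text.
    ("text", TfidfVectorizer(), "description"),
])

# One row per form; the columns are the dummies followed by the word scores.
X = features.fit_transform(forms)
```

Levels that add noise (for example a "Don't Know" answer) could be excluded before encoding, so that no dummy column is created for them.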
For the closed questions, levels such as Don't Know/Refused are dropped to remove noise.

- We keep 9 years of data for training the model and test the results in multiple batches of the latest year (2017).
- We run a multinomial logistic regression, and for each VF the model predicts a probability for each of the 50+ offence codes, i.e. for each of the possible outcomes.
- The predicted Offence Code is the one with the highest probability.
- Overall, the model achieves a robust 86% accuracy (correctly classified cases). However, this is far from the desired accuracy of 97%.
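
The prediction step can be sketched as follows. This is only an illustration on synthetic data from `make_classification`, with 4 classes standing in for the 50+ offence codes and the last 100 rows standing in for the latest year; the settings are assumptions, not the OffenceCoder configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded victim forms (not CSEW data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# Earlier rows play the role of the training years,
# the last 100 rows the role of the held-out latest year.
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

# With the default lbfgs solver, scikit-learn fits a multinomial model
# when there are more than two classes.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# One probability per class for every form; each row sums to 1.
proba = model.predict_proba(X_test)

# The predicted code is the class with the highest probability.
predicted = model.classes_[proba.argmax(axis=1)]
accuracy = (predicted == y_test).mean()
```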

The Thresholding System

- There is a large variance in the model performance between different offence codes.
- As a solution, we select only some of the most successfully (and robustly) predicted classes and apply a class-informed threshold to each one of them.
- Predictions are considered valid only where the probability of a specific class meets that class's threshold.
- Results are exported as a csv file.

Results

- After running the model on the test set, the thresholding component separates out the predictions for the selected offence codes which meet the threshold. On average, this tends to constitute circa 40% of all VFs.
- On this 40% subset, the model correctly predicts the offence code for 97% of the VFs.
- We aim now to trial it in production.

From this:
- The coding burden on analysts can be reduced
- Saves time and money!
- The process of building the model allowed for improvements to be made in the coding manual and guidelines for interviewers

Thank you! Questions?

[email protected] [email protected]
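
As a closing illustration of the thresholding system and of how the two headline figures (share of VFs accepted, accuracy on the accepted subset) can be computed, here is a minimal sketch. The function name `apply_thresholds`, the threshold values, the probabilities and the true codes are all hypothetical. Only the idea of a per-class threshold on selected classes comes from the slides.

```python
import numpy as np


def apply_thresholds(proba, classes, thresholds):
    """Return the top predicted code per row and whether it is accepted.

    thresholds maps a selected offence code to its minimum probability.
    Codes missing from the mapping are never accepted automatically.
    """
    best = proba.argmax(axis=1)
    best_code = classes[best]
    best_prob = proba[np.arange(len(proba)), best]
    required = np.array([thresholds.get(int(code), np.inf)
                         for code in best_code])
    return best_code, best_prob >= required


classes = np.array([81, 85, 88])      # three codes used as examples
thresholds = {81: 0.90, 85: 0.80}     # hypothetical; 88 is not selected

proba = np.array([
    [0.95, 0.03, 0.02],   # code 81, meets its threshold: accepted
    [0.70, 0.20, 0.10],   # code 81, below its threshold: manual coding
    [0.10, 0.85, 0.05],   # code 85, meets its threshold: accepted
    [0.05, 0.05, 0.90],   # code 88, not a selected class: manual coding
])
true_codes = np.array([81, 85, 85, 88])   # made-up manual codes

codes, accepted = apply_thresholds(proba, classes, thresholds)

# Share of forms coded automatically, and accuracy on that share only.
coverage = accepted.mean()
accuracy_on_accepted = (codes[accepted] == true_codes[accepted]).mean()
```

Forms that are not accepted would stay with the manual coders, which is how a model with lower overall accuracy can still reach a high accuracy on the subset it is allowed to code.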