Machine Learning: Broken Data

Hobby Projects 2 April 2010

So I’ve been trying to normalise and format the data set for consumption and processing. Some key points:

Random corruptions
Some data is outside the specified range
The module code column has to be ignored, as there is no way to convert it into an input expressed as a float
Many entries are missing as much as 50% of their inputs
Extremely corrupt entries make up 20% of the data set
That 20% contains 90% of the negative results
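To put numbers like these on a data set, a rough profiling pass helps. The sketch below is only an illustration: it assumes each entry is a list of floats with `None` marking a missing input, that the valid range is `[lo, hi]`, and that "heavily corrupt" means at least half the inputs are bad. The function names and the 0.5 threshold are my own placeholders, not anything from the real data.

```python
import math

def is_valid(value, lo, hi):
    """True if value is present, not NaN, and inside [lo, hi]."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return lo <= value <= hi

def bad_fraction(row, lo, hi):
    """Fraction of inputs in a row that are missing or out of range."""
    bad = sum(1 for v in row if not is_valid(v, lo, hi))
    return bad / len(row)

def split_by_corruption(rows, lo, hi, threshold=0.5):
    """Split rows into (usable, heavily_corrupt).

    threshold is an assumed cut-off: rows with at least this
    fraction of bad inputs count as heavily corrupt.
    """
    usable, corrupt = [], []
    for row in rows:
        if bad_fraction(row, lo, hi) >= threshold:
            corrupt.append(row)
        else:
            usable.append(row)
    return usable, corrupt

# Made-up example rows, range assumed to be 0.0 to 1.0
rows = [
    [0.2, 0.9, 0.5, 0.1],
    [None, 0.4, 7.3, None],   # two missing, one out of range
    [0.6, None, 0.3, 0.8],    # one missing
]
usable, corrupt = split_by_corruption(rows, lo=0.0, hi=1.0)
print(len(usable), "usable,", len(corrupt), "heavily corrupt")
```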

My approach could go down one of several paths:

Missing data is replaced by averaged data, in the hope that it will not skew the results.
Missing inputs are not “fired” into the equation.
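Here is a minimal sketch of both ideas, assuming missing inputs are marked with `None` and that "the equation" is a simple weighted sum of inputs. The helper names (`column_means`, `impute`, `weighted_sum_skipping_missing`) are hypothetical, and the weights are invented for the example.

```python
def column_means(rows):
    """Mean of each column, ignoring missing (None) entries."""
    means = []
    for c in range(len(rows[0])):
        present = [row[c] for row in rows if row[c] is not None]
        # Assumption: fall back to 0.0 if a column has no values at all
        means.append(sum(present) / len(present) if present else 0.0)
    return means

def impute(row, means):
    """Replace each missing input with its column mean."""
    return [m if v is None else v for v, m in zip(row, means)]

def weighted_sum_skipping_missing(row, weights, bias=0.0):
    """Missing inputs are simply not fired: they add nothing to the sum."""
    return bias + sum(w * v for v, w in zip(row, weights) if v is not None)

rows = [
    [1.0, 2.0],
    [3.0, None],
    [None, 4.0],
]
weights = [0.5, 2.0]  # made-up weights

means = column_means(rows)
filled = impute(rows[1], means)

# Option 1: average in the missing value, then fire everything
with_average = sum(w * v for v, w in zip(filled, weights))
# Option 2: leave the missing input out of the sum
without_missing = weighted_sum_skipping_missing(rows[1], weights)

print(means, filled, with_average, without_missing)
```

In a plain weighted sum like this one, not firing an input gives the same result as feeding in zero, so the two options only agree when the inputs are centred on zero.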

Both of these approaches assume that the missing data is neutral, and that its absence is not in some way significant. They also require you to correctly identify the neutral position.
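One rough way to check the neutrality assumption is to compare how often the negative result turns up in entries with missing inputs versus complete ones. This is only a sketch: it assumes labels are encoded as `0` for negative and `1` for positive, that missing inputs are `None`, and the example rows are invented.

```python
def negative_rate(labels):
    """Fraction of labels that are negative (assumed to be encoded as 0)."""
    if not labels:
        return 0.0
    return sum(1 for y in labels if y == 0) / len(labels)

def compare_negative_rates(rows, labels):
    """Return (rate among rows with missing inputs, rate among complete rows)."""
    with_missing = [y for row, y in zip(rows, labels)
                    if any(v is None for v in row)]
    complete = [y for row, y in zip(rows, labels)
                if all(v is not None for v in row)]
    return negative_rate(with_missing), negative_rate(complete)

# Made-up example data
rows = [
    [0.1, 0.2],
    [None, 0.3],
    [0.4, None],
    [0.5, 0.6],
    [None, None],
]
labels = [1, 0, 0, 1, 1]

rate_missing, rate_complete = compare_negative_rates(rows, labels)
print(rate_missing, rate_complete)
```

If the two rates are far apart, the fact that data is missing carries information of its own, and treating it as neutral throws that away.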

I don’t know what the best solution is, so I have cloned the data down two paths:

Heavily corrupt data is removed from the training set; the test set uses averages established by the training set for missing inputs.
The inputs which tend to be missing or corrupt are removed, reducing the number of inputs.
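The two paths could be sketched roughly like this. It assumes missing inputs are `None`; the names `path_a` and `path_b` and both thresholds are placeholders I picked for the example, and filling the remaining gaps in the kept training rows with the same averages is my assumption.

```python
def missing_fraction(row):
    """Fraction of inputs in a row that are missing."""
    return sum(1 for v in row if v is None) / len(row)

def column_means(rows):
    """Mean of each column, ignoring missing entries."""
    means = []
    for c in range(len(rows[0])):
        present = [r[c] for r in rows if r[c] is not None]
        means.append(sum(present) / len(present) if present else 0.0)
    return means

def path_a(train, test, max_row_missing=0.5):
    """Drop heavily corrupt training rows, then fill gaps in both sets
    with averages taken from the cleaned training set only."""
    kept = [r for r in train if missing_fraction(r) < max_row_missing]
    means = column_means(kept)

    def fill(row):
        return [m if v is None else v for v, m in zip(row, means)]

    return [fill(r) for r in kept], [fill(r) for r in test]

def path_b(train, test, max_col_missing=0.3):
    """Drop the inputs (columns) that are missing too often in training."""
    n_cols = len(train[0])
    rates = [sum(1 for r in train if r[c] is None) / len(train)
             for c in range(n_cols)]
    keep = [c for c in range(n_cols) if rates[c] <= max_col_missing]

    def project(row):
        return [row[c] for c in keep]

    return [project(r) for r in train], [project(r) for r in test], keep

# Made-up example data
train = [
    [1.0, 10.0, 0.25],
    [3.0, None, 0.75],
    [None, None, None],   # heavily corrupt
    [5.0, 30.0, 0.5],
]
test = [[None, 20.0, 0.1]]

train_a, test_a = path_a(train, test)
train_b, test_b, kept_columns = path_b(train, test)
print(test_a, kept_columns, test_b)
```

Note that the second path can still leave gaps in the columns that survive, so it does not remove the need to decide what a missing value means.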

So which way is best? What do you do with missing or corrupt data?

Forest of Fun

Claire's Personal Ramblings & Experiments
