by Shawn Syms
When I was a young activist, I read a critical feminist sexuality text called Pleasure and Danger, by American anthropologist Carole S. Vance. I was always struck by its effective title and its succinct way of encapsulating the issues at stake. Now as a burgeoning data scientist faced with some of the rigors of my trade, such as data cleaning, the phrase comes in handy.
Well, yes. I know that nearly 60 percent of data scientists surveyed described data cleaning as their most hated task, but I actually enjoy it. Really? Sure. As an introvert, I gain comfort from order and predictability, and as a published author and longtime journalist, I know the satisfaction of putting things in their proper order for maximum effectiveness. But I also know there is such a thing as too much of a good thing.
In my earliest moments in data science, I was drawn to data cleaning. What metaphor can I use for it? Well, it's sort of like taking a shower. Except it's not you, it's the data. Well, I mean, yes — I do shower before cleaning data. Daily, I swear. OK, scratch that. Cleaning data is like being a barber. Or a hairstylist, whatever you prefer. You take something that is in a raw and untamed state and make it pleasant, likeable and predictable, like the haircut of some person your parents would approve of, with whom you might feel comfortable going on a dinner date. That's all very well and good, but if you have ever had a bad haircut, you know how things can go wrong. If cleaning data is like being a potter, what if you're handed a lump of clay that roughly looks like an obelisk, but by the time you are done with it, it looks more like an Easter basket? I'll stop there. You're welcome. I think you get the point!
This post has two purposes. First, to look at some means of cleaning data and second, to discuss some of the risks and perils of doing so. The ideas in this post borrow from teachings by Matt Brems and Musfiqur Rahman at General Assembly.
Several basic data-cleaning techniques are well-known and perhaps even widespread. Delete observations with missing data. Delete features with missing data! A dataset is mostly complete, but a bunch of info is missing from features related to some respondents' body weight or annual income? Well, let's pound through those puppies and average the heck out of them by replacing the missing data with mean values!
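Each of those blunt instruments takes only a line or two of pandas. Here is a minimal sketch on a toy frame (the column names and values are made up for illustration, not taken from any real survey):

```python
import numpy as np
import pandas as pd

# A toy survey with missing body-weight and income values
df = pd.DataFrame({
    'respondent': [1, 2, 3, 4],
    'weight_kg':  [70.0, np.nan, 82.0, 64.0],
    'income':     [40000.0, 55000.0, np.nan, np.nan],
})

# Option 1: delete observations (rows) with any missing data
rows_dropped = df.dropna()

# Option 2: delete features (columns) with any missing data
cols_dropped = df.dropna(axis=1)

# Option 3: "average the heck out of them" -- mean imputation
mean_imputed = df.fillna(df.mean(numeric_only=True))
```

Easy, repeatable, automatable — and, as we'll see, not always wise.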
Sounds easy? Repeatable? Something you can write a function to automate? Sure you could. You could also put red hot chili peppers in ice cream. Just because you can do something doesn't mean that you should.
Yes, sometimes we may make these well-known compromises. But we need to remember something. Changing or deleting data is, de facto, a distortion. Surveys can suffer from distortion as it is if, for example, they ask questions people might be uncomfortable answering, or ask them in a context where respondents don't feel safe and secure doing so. As data scientists, we don't need to make it worse. Bad data means bad analysis, which can lead to bad decisions that affect actual people. We are talking about actual people, not just linear regressions and machine learning models. So let's take care to consider the problem we are trying to solve, apply domain knowledge to inform our decisions when we have it, and seek out that knowledge when we don't.
In the example below, I will explore a dataset and, in the process, uncover a few data cleaning challenges that don't always have an easy answer. This case study is based upon the Canadian Community Health Survey, 2012: Mental Health Component, which is available through ODESI (the Ontario Data Documentation, Extraction Service and Infrastructure), a project of the Ontario Council of University Libraries. (The original dataset was pristine, but it has been mucked up strictly for illustration purposes.)
Let's do some imports to start.
# The usual suspects
import numpy as np
import pandas as pd
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Presentation
from IPython.display import HTML, Markdown, display
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 600)
Next, let's read in the dataset using the pandas read_csv function.
mental_health = pd.read_csv('../data/cchs-82M0013-E-2012-mental-health_F1.csv')
Let's have a look at the shape of things!
mental_health.shape
So, 25,113 observations and 586 features. Sounds about right, since this is a dataset generated from a very elaborate questionnaire on mental health in Canada that asks many questions and also has some calculated features. The survey asked about mental health and wellness as well as associated diagnoses. In addition, it asks a wide range of questions about such matters as workplace stress, suicidal thoughts, physical health and exposure to certain stressors in childhood.
It's a rich dataset from a data-scientific perspective. Subsets of features could be used for binary classification, looking to see if particular medical or mental health conditions can be predicted based on other personal data. There are some opportunities for regression as well, such as predicting income based on health or other factors. This dataset is also crying out for dimensionality reduction. As a rule of thumb, the number of features should be less than the square root of the number of observations — but the square root of 25,113 is only about 158, and we have 586 features!
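To make that rule of thumb concrete, and to sketch what one remedy could look like, here is a minimal PCA via numpy's SVD. The data below is a synthetic stand-in, not the actual survey:

```python
import numpy as np

n_observations, n_features = 25113, 586

# Rule of thumb: the feature count should stay below sqrt(observations)
max_features = int(np.sqrt(n_observations))   # about 158, far below our 586

# A minimal PCA sketch via SVD, on toy data rather than the CCHS survey
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 30))                # 200 rows, 30 toy features
X_centered = X - X.mean(axis=0)               # PCA requires centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:10].T            # keep the first 10 components
```

On the real dataset, the same idea would project the 586 features down to a much smaller number of components — but only after the data is clean.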
But back to the issue at hand. The data needs to be in proper shape first. Let's have a look.
HTML(mental_health.head().to_html())
Those of you with eagle eyes may be able to see problems with the data just by looking at the first five observations.
Issue one: It looks like, in some columns intended to contain numerical values, a lower-case 'l' sits in place of the number one. Maybe 'l' stands for something, though? How would we know? Often we would have no idea. Luckily, however, this dataset has a very robust data dictionary, so we can have a look. Now, look to the left a little bit and a couple more issues arise.
Issue two: In the DHH_SEX column, numerical values have been spelled out. What were they smoking in the data-entry department that day? Just another reminder that everyone makes mistakes — and mistakes compromise data.
Issue three: The DHHGHSZ and DHHGDWE columns have some null values — not good. Let's take a look at these issues one at a time.
Looking at the first and last observations, we see the letter l scattered all over the place. When we consult the data dictionary, we see that all of the columns' intended datatypes are either int64 or float64: numbers, not letters.
So, we can resolve this erroneous data through a method known as deductive imputation. We can reasonably assume that the ls have accidentally been put in place of the 1s: the columns are meant to be numeric, and a lower-case l is an easy typo for a 1. We don't know if someone typed the letter l when they actually meant the number 2; but then again, we would not know if they had mistyped the number 1 either. Our role here, at minimum, is to introduce no new errors. (Even the dataset's exhaustive data dictionary features typos; no one is perfect.) So, logically, it is fine to make this change. We don't know how widespread the error is; let's assume it could be present anywhere in the dataset.
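Before fixing anything, it can help to measure how widespread the stray letter actually is. Here is a sketch of one way to scan every string column for it, using a toy frame (with made-up values) in place of the real survey:

```python
import pandas as pd

# Toy frame imitating the problem: a stray letter 'l' where 1s belong
df = pd.DataFrame({
    'ADM_N09': ['l', '2', 'l', '1'],
    'DHH_SEX': ['one', 'two', 'one', 'one'],
    'GEO_OK':  [1, 2, 3, 4],
})

# Scan every object (string) column for cells equal to 'l'
object_cols = df.select_dtypes(include='object')
l_counts = object_cols.apply(lambda col: (col == 'l').sum())
affected = l_counts[l_counts > 0]   # only columns containing the error
```

Running this against the full dataset would tell us exactly which columns carry the error and how often.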
Let's look at one of the columns.
mental_health['ADM_N09'].value_counts()
Let's use pandas .replace to correct this error anywhere that it appears in the dataset.
number_dict = {'l' : '1'}
mental_health.replace(number_dict, inplace=True)
And confirm the change:
mental_health.head()
We can also use the same technique to address the written-out numbers. First, let's see what values are present.
mental_health['DHH_SEX'].value_counts()
I look in the data dictionary to see this column refers to gender. No trans or intersex people surveyed out of almost 25,000? Or just no option for other genders? Well, the survey is a bit old. Today, one almost never sees a survey with only two options for gender. The words vs numbers problem should be an easy enough fix, though.
number_dict = {'one' : 1, 'two' : 2} # Set up a dictionary to pass to .replace
mental_health.replace({'DHH_SEX' : number_dict}, inplace=True)
mental_health['DHH_SEX'].value_counts() # Confirm the change
Given all that weirdness, let's have a look at the datatypes:
# I'm creating a df just for ease of scrolling
dtypes = pd.DataFrame(mental_health.dtypes)
dtypes
We can see that four features that should have the datatype int64 are set to be strings (no surprise, since they have unexpected string content in them): GEOGCMAl, ADM_N09, ADM_Nl0 and DHH_SEX. Now that the offending values have been expunged, let's fix that!
wrong_dtypes = ['GEOGCMAl', 'ADM_N09', 'ADM_Nl0', 'DHH_SEX']
for col in wrong_dtypes:
    mental_health[col] = pd.to_numeric(mental_health[col])
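One nice property of pd.to_numeric is that, by default (errors='raise'), it throws a ValueError if any stray letters survived the cleanup — so a silent partial fix can't slip through. A small illustration on toy series:

```python
import pandas as pd

# Strings after the 'l' -> '1' fix convert cleanly to integers
s = pd.Series(['1', '2', '1', '3'])
converted = pd.to_numeric(s)

# Any remaining non-numeric value raises instead of passing through silently
try:
    pd.to_numeric(pd.Series(['1', 'l']))
    leftover_caught = False
except ValueError:
    leftover_caught = True
```

If we had wanted to coerce unparseable values to NaN instead, errors='coerce' would do that — but here a loud failure is exactly what we want.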
So, we saw some nulls. What to do with them? First off, how many?
mental_health.isna().sum()
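With 586 columns, the full output of isna().sum() is a lot to scroll through. One way to trim it down to just the columns that actually contain nulls, shown here on a toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame: two columns with missing values, one complete
df = pd.DataFrame({
    'DHHGHSZ': [1, 2, np.nan, 2],
    'DHHGDWE': [np.nan, 3, 3, np.nan],
    'DHH_SEX': [1, 2, 1, 2],
})

null_counts = df.isna().sum()
cols_with_nulls = null_counts[null_counts > 0]   # drop the all-zero rows
```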
We see there are two columns that each contain twenty null values. These represent missing data. If we delete them, we are losing potentially valuable data for our model.
We could impute the values. Looking in the data dictionary, we see that NaN does not represent a refusal to respond. Respondents were interviewed and the interviewer captured their answers. The responses are captured as categorical variables where 8 means "REFUSAL" and 9 means "NOT STATED."
These NaN could mean anything, most likely a data entry error (or rather, a system lacking rules to prevent such an error). Given they are categorical values, it's not as if you could impute the mean anyway. But filling NaNs with mean values is never a good idea regardless. Besides the fact that an imputed mean simply isn't the true value, it distorts the distribution of your feature by inflating the mean bin beyond its actual size; the more fake means, the worse the skew.
You could look at the distribution of the categorical values in the rest of the feature and distribute the NaNs amongst them at a rate equivalent to that distribution, along these lines:
mental_health['DHHGHSZ'].value_counts(normalize=True)
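Here is a sketch of that proportional approach on a toy categorical column — an illustration of the idea, not code from the case study itself:

```python
import numpy as np
import pandas as pd

# Toy categorical column with two missing values
s = pd.Series([1, 1, 1, 2, 2, 3, np.nan, np.nan], dtype='float')

# Observed category proportions among the non-null values
proportions = s.value_counts(normalize=True)

# Draw replacements for the NaNs in proportion to that distribution
rng = np.random.default_rng(0)
mask = s.isna()
fill = rng.choice(proportions.index, size=mask.sum(), p=proportions.values)

s_imputed = s.copy()
s_imputed[mask] = fill
```

This preserves the feature's observed distribution on average, though it still invents answers the respondents never gave — which is why it should be a discussion, not a default.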
Or you could raise the issue for discussion. You may end up just deleting the offending observations. After all, we are only talking about 20 to 40 out of more than 25,000, and it is unlikely they contain anything (either normal values or significant outliers) that won't be present somewhere else in the dataset. But deleting data should never be the default decision!