Sunday 23 November 2014

Interactive visualization of average number of visits to physicians in different age groups and years using mpld3

Saturday 31 May 2014

Subtracting one column from another in Pandas created memory problems ... and a solution

I had two datasets with about 17 million observations each, covering different variables. One was an event file (admissions to hospitals: when, what and so on). The other was a person-level file describing the characteristics of the individual who was admitted (gender, birth year and so on). All I wanted to do was to add some of the individual characteristics to the event dataframe so that I could calculate the relative frequency of different events in different age categories. Sounds easy. And it is, if you ignore memory problems. Using pandas dataframes in Python, I first tried:
dfe["male"]=dfp["male"]
dfe["birth_year"] = dfp["birth_year"]
dfe["age"]  = dfe["out_year"] - dfe["birth_year"]
But the last line caused memory errors.

It may be an issue internal to Pandas or the garbage collector in Python.

I guess one might try generators and iterators, but I found something simple that also worked: just pull the two columns out as arrays, subtract one from the other, and put the result back into the dataframe. In short, doing the calculation outside of the dataframe seemed to solve the problem. Like this:

# pull the columns out as numpy arrays and do the arithmetic outside the dataframe
test1 = dfe["out_year"].values
test2 = dfe["birth_year"].values
test3 = test1 - test2
# then put the result back into the dataframe
dfe["age"] = test3


Wednesday 21 August 2013

Shoud "inplace" be standard to reduce clutter, keystrokes and errors

Every time I do something, like replacing missing values in a pandas dataframe, I have to either write the name of the object twice or explicitly specify "inplace". Now, I know there are good reasons for this, but I just wonder whether it might be practical to do it the other way around: make "inplace" the standard when using methods on objects, and require it to be explicitly specified if you want to avoid changing the object.

Why? To save keystrokes (important in itself), but also to reduce the probability of silly errors like typing the name of the object wrong in one place.
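
To make this concrete, here are the two spellings that exist today for replacing missing values, using a generic df as a stand-in:

df = df.fillna(0)           # repeat the name of the object
df.fillna(0, inplace=True)  # or state explicitly that it should happen in place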

Example: I wanted to drop a row because it had been added to the original dataset by mistake. I wrote:

cost_in_age = cost_in_age.drop(cost_in_age.index[105])

A shorter way, which does not work now, but is equally (or more) intuitive, would be:

cost_in_age.drop(cost_in_age.index[105])

Sometimes it is possible to do the latter, but then one has to specify that it should be done "inplace". My suggestion is that when we do something with an object, we should not have to specify this.
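
For the drop example above, the in-place spelling looks like this:

cost_in_age.drop(cost_in_age.index[105], inplace=True)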

OK, I know there are probably good reasons why it is like this. Methods and functions return things, and unless we explicitly assign the returned output to an object, it is simply returned and discarded.

But still, it does not look technically impossible to do it the other way around. Inplace as standard? Benefits would be a lot of saved typing and reduced errors. Cost?





Tuesday 9 July 2013

Scraping: Don't forget regex!

I was using Python to scrape some data on running pace and split times. Using lxml.html worked fine for importing web pages with tables to collect data and links (see Wes' book Python for Data Analysis, p. 166-, for a brief introduction). But there was one problem. An important link to the pages with historical split times for each individual was hidden within a table, and it used JavaScript (a "window.open" pop-up). No matter how hard I tried, I could not make lxml catch the URL of this link. Then, after a few hours, the solution occurred to me: why not simply use regular expressions!

The key problem is that when you parse a document using lxml, it creates a structured document which makes it easy to scrape some things (like tables), but sometimes more difficult to find other things, like a single piece of information hidden within an element contained in another element. In such cases, searching the raw HTML with standard regular expressions may be easier - and faster - than using lxml or other tools.

In fact, this worked so fast and well that it may serve as a useful reminder to others: before investing a lot of time in lxml, BeautifulSoup or Scrapy, consider whether regex might be sufficient!

Here is the Python code for scraping standard links (starting with http):

import re
from urllib2 import urlopen  # Python 2; in Python 3, use urllib.request and decode the bytes

url = "..."  # address of the web page you want to scrape
html = urlopen(url).read()
# capture everything inside href="..." or href='...'
links = re.findall(r'href=[\'"]?([^\'" >]+)', html)
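
For the pop-up links described above, the same trick can be pointed at the JavaScript call instead. The exact pattern depends on how the page writes its onclick attribute, so treat this as a sketch:

# hypothetical pattern: grab the first argument of window.open('...') calls
popup_links = re.findall(r"window\.open\('([^']+)'", html)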

Thursday 20 June 2013

Standard errors in MEPS - Quick comparison of dirty, BRR and proper

MEPS (the Medical Expenditure Panel Survey) has a complicated survey design which involves repeated interviews, oversampling, and random recruitment of new households from randomly selected sampling units. In short, when estimating the standard errors of, say, mean expenditures, you cannot assume that the simple sample mean and its standard error generalize to the whole country. There are several methods to deal with this, and the AHRQ provides both the variables and the code needed. Unfortunately it takes some time to use all this, so I thought it could be useful to give a very brief summary of my experience.

First: Does it really matter? Theoretically one might be accused of "doing it wrong" if one simply uses the sample mean without weighting and adjusting, but if the error is small, the extra time might not be worth the trouble.

Short and boring answer: it depends (on the size of the subgroups you examine). In my experience the mean is not too bad even for small sub-groups (say 70 individuals in a sample of more than 200 000), but the standard error can be way off. For instance, when examining total health spending among those without insurance who also died (between 1996 and 2010), the sample mean and the weighted mean were both in the neighborhood of 9 000 to 11 000 USD.

The standard error, however, was way off. The quick and dirty method - treating the data as a simple random sample and using the ordinary standard error of the sample mean - gave a standard error of around 60 USD. This looked too good to be true, so I went ahead and checked it using BRR (balanced repeated replication). Why BRR? I had used Pandas and Python to merge and wrangle the data, and there was no standard module for estimating standard errors from a complex survey in Python/Pandas. Given that I had to write it myself, BRR seemed like a good option: the AHRQ provides a file with the half-samples (h036brr), it does not require the whole dataset to be in memory (only the sub-sample), and the method is not too difficult to implement (basically just calculating 128 weighted means, one per half-sample, then taking the average of the squared deviations between each of the 128 half-sample means and the overall weighted sample mean; the square root of this average is the standard error).
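
A minimal sketch of that calculation, assuming the half-sample indicators from h036brr have already been merged into the dataframe (all column names here are placeholders), could look like this:

import numpy as np

def brr_standard_error(df, value_col, weight_col, half_sample_cols):
    # overall weighted mean of the variable of interest
    overall_mean = np.average(df[value_col], weights=df[weight_col])
    # weighted mean within each of the 128 half-samples
    replicate_means = []
    for col in half_sample_cols:
        half = df[df[col] == 1]
        replicate_means.append(np.average(half[value_col], weights=half[weight_col]))
    replicate_means = np.array(replicate_means)
    # average squared deviation from the overall mean, then take the square root
    return np.sqrt(np.mean((replicate_means - overall_mean) ** 2))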

Having done this, I was a bit shocked to see that the result was a standard error of about 5 000 USD. Obviously the dirty method gave a standard error which was way too small. However, I also started to doubt the BRR result. It looked too big. I could not find anything wrong with the algorithm, but the method itself is a bit conservative: it converges on the right result, but it may do so more slowly than other methods.

So I had to bite the bullet and go outside Python to try a third approach: importing the data into Stata and declaring to Stata that this was a complex sample with various weights and strata. Stata then takes care of all the issues and calculates the standard error (I think Stata uses Taylor series methods). The result - a standard error of about 2 000 USD.

In sum: yes, for small sub-groups the choice of method may make a big difference to the estimated standard errors, but the mean itself seems a lot more robust.

Saturday 8 June 2013

Reducing memory problems with large files in Pandas: Eliminate nan's and use dtypes

I have had some problems working with a large dataset in pandas (17 million observations). I am repeating myself a bit (see previous post), but the problem is not solved. Some solutions so far have been:

1. Eliminate or replace missing data before using pd.read_csv. Make the missing values -1 or 0 or something else that Pandas does not interpret as "nan" (not a number, missing). Why? Because if the column contains a nan, Pandas automatically gives it a datatype (float) that takes more memory.
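
A minimal sketch of such a preprocessing step, using the standard csv module and made-up file names, could look like this:

import csv

# rewrite the raw file with empty fields replaced by -1 so pandas never sees a nan
with open("personfile_raw.csv") as src, open("personfile_clean.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter=";")
    writer = csv.writer(dst, delimiter=";")
    for row in reader:
        writer.writerow([field if field != "" else "-1" for field in row])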

2. Even this is often not enough, because pd.read_csv often assigns datatypes that take more memory than necessary, for instance using float64 when float16 is enough, and so on. To reduce this problem, you can explicitly declare the datatypes (and import only the variables needed):

dtypes = {"id" : np.int32 , "birth_year" : np.int16 , "male" : np.int8 , "dead_year" : np.int16 , "dead_month" : np.int8}
vars = dtypes.keys()
dfp = pd.read_csv(path + "eolnprreducedv2personfile.csv", usecols = vars, dtype = dtypes, sep = ";", index_col="id")

3. Finally, it is a good idea to save and later import the file using the pickle (or HDF5) format, since pd.read_csv often runs out of memory even when the file itself is not very large. So having several large files in memory and adding another using pd.read_csv will lead to memory problems, but importing the same data as a whole dataframe using pickle will be OK.
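
As a sketch, saving and reloading the dataframe from point 2 above with pickle could look like this (to_pickle and read_pickle are the relevant pandas functions; the file name is just an example):

# write the dataframe to disk once, then reload it later without going through pd.read_csv
dfp.to_pickle(path + "personfile.pkl")
dfp = pd.read_pickle(path + "personfile.pkl")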