Good!
The Datasets Package — statsmodels 0.4.0 documentation:
'via Blog this'
Friday, 10 May 2013
Large files in Pandas
I have had some problems loading and analyzing large files in Pandas. Python complained about running out of memory, even when I did not expect it to. After some trial and error I found some partial solutions which might be helpful for others.
Sometimes the problem is in the process of reading the file itself: "pd.read_csv" seems to use a lot of memory. Anyway, here is what I did:
1. The obvious: Use the "usecols" option to load only the variables needed.
2. I tried using the "chunksize" option and gluing the chunks together, but for me this did not solve the problem.
3. I tried reading one column at a time, using "squeeze" and gluing the columns together. This worked better, but eventually I ran into memory problems here too.
4. The final, and so far the best, solution: specify the data types. How? Create a dictionary that maps the variable names to the dtypes you want, and pass it to the "dtype" option of "pd.read_csv". Why does it work? Well, it seems that Pandas assigned all my variables float64, which takes a lot of memory. Since my variables usually fit in float16 or smaller types, telling Pandas this dramatically reduced the memory requirement.
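Points 1 and 4 can be combined in a single call. Here is a minimal sketch; the column names and the tiny in-memory CSV are made up for illustration, and I use float32 rather than float16 because I am not sure every "pd.read_csv" parser accepts float16 directly:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data standing in for a large CSV file on disk.
csv_text = "id,age,income,flag,notes\n" + "\n".join(
    f"{i},{20 + i % 50},{1000.5 + i},{i % 2},text{i}" for i in range(1000)
)

# Default load: pandas picks wide types (64-bit numbers, strings) on its own.
default = pd.read_csv(io.StringIO(csv_text))

# Load only the columns that are needed and give each one a compact dtype.
dtypes = {"id": np.int32, "age": np.int8, "income": np.float32, "flag": np.int8}
compact = pd.read_csv(io.StringIO(csv_text), usecols=list(dtypes), dtype=dtypes)

# Compare the memory footprint of the two DataFrames (in bytes).
default_bytes = default.memory_usage(deep=True).sum()
compact_bytes = compact.memory_usage(deep=True).sum()
print(default_bytes, compact_bytes)
```

The small integer types only work here because every value fits in their range and no column has missing values.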
The last solution, however, does not completely solve the problem. First of all, the most memory-efficient approach (specifying many of the variables as short integers and booleans) does not work when you have missing values. Second, eventually there will be memory problems if you have enough variables/observations.
Still, a combination of the methods above seems to dramatically reduce the memory problem when loading the data.
Hopefully some day it will be possible to use pandas on datasets without having to load the whole dataset into memory. SAS does this and making it possible in Pandas would be very useful.
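In the meantime, for calculations that can be done piece by piece, the "chunksize" option can be used without gluing the chunks back together. This is only a rough sketch under that assumption (it computes a mean from running totals); the column name and the in-memory CSV are hypothetical, and it will not help for analyses that need all rows at once:

```python
import io

import pandas as pd

# Hypothetical data standing in for a file too large to load in one go.
csv_text = "value\n" + "\n".join(str(i) for i in range(1000))

total = 0.0
rows = 0

# With chunksize set, read_csv yields DataFrames of at most 250 rows each,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["value"], chunksize=250):
    total += chunk["value"].sum()
    rows += len(chunk)

mean_value = total / rows
print(mean_value)
```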
Added:
5. Making sure that there are no missing values (by deleting or changing them) is also helpful, because pandas converts integer columns to float format, which is much more memory intensive, if they contain missing values.
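A small illustration of point 5, assuming a made-up column called "count" and using -1 as a stand-in value for missing entries (whether such a sentinel is acceptable depends on your data):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical file with one empty cell in an otherwise integer column.
csv_text = "id,count\n1,10\n2,\n3,30\n"

raw = pd.read_csv(io.StringIO(csv_text))
# The empty cell becomes NaN, so the whole column is stored as float64.
print(raw["count"].dtype)

# Replace the missing value with a sentinel (-1 is an assumption, not a rule)
# and then store the column as a 16-bit integer.
filled = raw["count"].fillna(-1).astype(np.int16)
print(filled.dtype)
```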
Tuesday, 7 May 2013
wbopendata: Stata module to access World Bank databases
Perfect for quick downloading of large World Bank datasets.
wbopendata: Stata module to access World Bank databases | Data: "db wbopendata"
'via Blog this'