
Python Pandas read_stata() Method



The read_stata() method in Python's Pandas library loads data from a Stata dataset (.dta) file into a Pandas DataFrame, where it can be manipulated and analyzed in Python. Stata is a statistical analysis package developed by StataCorp, and its .dta files are a common format for storing structured data.

The read_stata() method supports automatic handling of Stata-specific data types, optional column selection, and chunk-based reading for large datasets. It can also convert categorical variables, handle missing values, and preserve data types.

Syntax

Below is the syntax of the Python Pandas read_stata() method −

pandas.read_stata(filepath_or_buffer, *, convert_dates=True, convert_categoricals=True, index_col=None, convert_missing=False, preserve_dtypes=True, columns=None, order_categoricals=True, chunksize=None, iterator=False, compression='infer', storage_options=None)

Parameters

The Python Pandas read_stata() method accepts the below parameters −

  • filepath_or_buffer: A string, path object, or file-like object representing the location of the Stata dataset file to read.

  • convert_dates: A boolean indicating whether to convert date variables to Pandas datetime values. By default it is set to True.

  • convert_categoricals: A boolean indicating whether to read value labels and convert columns to Categorical/Factor variables. By default it is set to True.

  • index_col: Specifies the column to use as the DataFrame index. If None, no column is used as the index.

  • convert_missing: A boolean indicating whether to keep missing values in their Stata representations. If set to True, columns containing missing values are returned with object data types and missing values are represented by StataMissingValue objects. If set to False (the default), missing values are replaced with NaN.

  • preserve_dtypes: If True (the default), preserves the original data types of variables in the Stata file. If False, numeric data are upcast to the pandas default types for foreign data (float64 or int64).

  • columns: Specifies a subset of columns to include in the output. By default, it includes all columns.

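  • order_categoricals: A boolean indicating whether the converted categorical data are ordered. By default it is set to True.

  • chunksize: An integer giving the number of rows per chunk. When set, the method returns a StataReader object that yields DataFrames of that many rows instead of reading the whole file at once.

  • iterator: A boolean indicating whether to return a StataReader object instead of a DataFrame. By default it is set to False.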

  • compression: Specifies the compression method to use. If set to 'infer', the method will automatically detect the compression type based on the file extension (e.g., .gz, .bz2, .zip, .xz, .zst, .tar, .tar.gz, or .tar.bz2).

  • storage_options: Additional options for connecting to certain storage back-ends (e.g., AWS S3, Google Cloud Storage).
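The type-related parameters above can be illustrated with a minimal sketch. The sample data, the column names grade and score, and the file name typed_file.dta are assumptions made up for this illustration; the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: an ordered categorical column and an integer column
df = pd.DataFrame({
    "grade": pd.Categorical(["low", "high", "low"],
                            categories=["low", "high"], ordered=True),
    "score": [1, 2, 3],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "typed_file.dta")

    # Categorical columns are stored in the file as integer codes plus value labels
    df.to_stata(path, write_index=False)

    # Default: value labels are applied and the column comes back as categorical
    labeled = pd.read_stata(path)

    # Value labels are ignored and the underlying integer codes are returned
    raw = pd.read_stata(path, convert_categoricals=False)

    # Numeric columns are upcast to the pandas defaults (int64 or float64)
    widened = pd.read_stata(path, preserve_dtypes=False)

print("With value labels:", labeled["grade"].dtype)
print("Without value labels:", raw["grade"].dtype)
print("Upcast numeric column:", widened["score"].dtype)
```

Comparing the three dtypes shows how each flag changes what the DataFrame contains without changing the file on disk.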

Return Value

The Pandas read_stata() method returns a DataFrame containing the data read from the specified Stata file. If iterator=True or chunksize is specified, it returns a pandas.api.typing.StataReader object instead.
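As a rough sketch of the StataReader case, the snippet below reads a file in chunks. The ten-row sample data, the chunk size of 4, and the file name chunked_file.dta are assumptions chosen for this illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data with ten rows
df = pd.DataFrame({"Col_1": range(10), "Col_2": list("abcdefghij")})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chunked_file.dta")
    df.to_stata(path, write_index=False)

    # With chunksize set, read_stata() returns a StataReader;
    # using it as a context manager closes the file when done
    with pd.read_stata(path, chunksize=4) as reader:
        # Each iteration yields a DataFrame with at most 4 rows
        chunk_lengths = [len(chunk) for chunk in reader]

print("Rows per chunk:", chunk_lengths)
```

Processing each chunk inside the loop, rather than collecting them all, keeps memory use bounded by the chunk size.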

Example: Basic Reading of a Stata Dataset File

Here is a basic example demonstrating reading a Stata dataset file into a Pandas DataFrame using the read_stata() method.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

# Save the DataFrame to a Stata file
df.to_stata("stata_file.dta")

# Read a Stata file  
result = pd.read_stata("stata_file.dta")  

print("DataFrame read from Stata file:")  
print(result)

When we run the above program, it produces the following result −

DataFrame read from Stata file:
   index  Col_1 Col_2
0      0      0     a
1      1      1     b
2      2      2     c
3      3      3     d
4      4      4     e

If you visit the folder where the Stata dataset files are saved, you can observe the generated .dta file.
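The extra index column in the output above appears because to_stata() writes the DataFrame index to the file as a regular column by default. One way to avoid it, sketched here with a temporary file whose name is an assumption for this illustration, is to pass write_index=False when saving.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "no_index_file.dta")

    # Skip writing the DataFrame index as a column in the Stata file
    df.to_stata(path, write_index=False)

    result = pd.read_stata(path)

print("Columns read back:", list(result.columns))
```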

Example: Reading Specific Columns from a Stata file

The following example demonstrates how to read specific columns from a Stata file using the read_stata() method with the columns parameter.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

# Save the DataFrame to a Stata file
df.to_stata("stata_file.dta")

# Read specific columns from a Stata file  
df = pd.read_stata("stata_file.dta", columns=["Col_2"])  

print("Selected columns read from Stata file:")  
print(df)  

Executing the above code produces the following output −

Selected columns read from Stata file:
  Col_2
0     a
1     b
2     c
3     d
4     e

Example: Setting a Custom Index Column While Reading a Stata File

The following example demonstrates how to use the read_stata() method for setting a custom index from the Stata file column data using the index_col parameter.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

# Save the DataFrame to a Stata file
df.to_stata("stata_file.dta")

# Read a Stata file by specifying the column to set it as DataFrame Index
df = pd.read_stata("stata_file.dta", index_col="Col_2")  

print("DataFrame read from Stata file with custom index:")  
print(df)  

Following is the output of the above code −

DataFrame read from Stata file with custom index:
       index  Col_1
Col_2
a          0      0
b          1      1
c          2      2
d          3      3
e          4      4

Example: Reading a Compressed Stata File

The read_stata() method can also read a compressed Stata file. The following example writes a gzip-compressed file and reads it back using the compression parameter.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

# Save the DataFrame to Stata with custom gzip compression
df.to_stata("compressed_file.dta.gz", compression={'method': 'gzip', 'compresslevel': 2})

# Read a compressed Stata file  
df = pd.read_stata("compressed_file.dta.gz", compression="gzip")  

print("DataFrame read from compressed Stata file:")  
print(df)  

Following is the output of the above code −

DataFrame read from compressed Stata file:
   index  Col_1 Col_2
0      0      0     a
1      1      1     b
2      2      2     c
3      3      3     d
4      4      4     e
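Because compression defaults to 'infer', the explicit compression arguments in the example above are optional when the file name carries a recognized extension. The following minimal sketch assumes a hypothetical file name ending in .dta.gz and relies only on the extension for both writing and reading.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "inferred_file.dta.gz")

    # compression='infer' is the default, so the .gz extension selects gzip
    df.to_stata(path, write_index=False)

    # Read the first two bytes to check for the gzip signature
    with open(path, "rb") as handle:
        signature = handle.read(2)

    # No compression argument is needed when reading either
    result = pd.read_stata(path)

print("File starts with gzip signature:", signature == b"\x1f\x8b")
print(result)
```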