
Python Pandas read_stata() Method
The read_stata() method in Python's Pandas library reads data from a Stata dataset (.dta) file into a Pandas DataFrame, making the data available for manipulation and analysis in Python. Stata is a statistical analysis package developed by StataCorp, and its .dta files are a common format for storing structured data.
The read_stata() method supports automatic handling of Stata-specific data types, optional column selection, and chunk-based reading for large datasets. It can also convert labeled variables to categoricals, handle missing values, and preserve data types.
Syntax
Below is the syntax of the Python Pandas read_stata() method −
pandas.read_stata(filepath_or_buffer, *, convert_dates=True, convert_categoricals=True, index_col=None, convert_missing=False, preserve_dtypes=True, columns=None, order_categoricals=True, chunksize=None, iterator=False, compression='infer', storage_options=None)
Parameters
The Python Pandas read_stata() method accepts the below parameters −
filepath_or_buffer: A string, path object, or file-like object representing the location of the Stata dataset file to read.
convert_dates: A boolean indicating whether to convert date variables to Pandas datetime values. By default it is set to True.
convert_categoricals: A boolean indicating whether to read value labels and convert columns to Categorical/Factor variables. By default it is set to True.
index_col: Specifies the column to use as the DataFrame index. If None, no column is used as the index.
convert_missing: A boolean indicating whether to convert missing values to their Stata representations. If set to True, columns containing missing values are returned with object data types and missing values are represented by StataMissingValue objects. If set to False, missing values are replaced with nan.
preserve_dtypes: A boolean. If True, preserves the original data types of variables in the Stata file. If False, numeric data are upcast to the pandas default types for foreign data (float64 or int64). By default it is set to True.
columns: Specifies a subset of columns to include in the output. By default, it includes all columns.
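order_categoricals: A boolean indicating whether the converted categorical data are ordered. By default it is set to True.
chunksize: An integer. If given, read_stata() returns a StataReader object that yields DataFrames of chunksize rows each when iterated, instead of reading the whole file at once. By default it is None.
iterator: A boolean. If True, read_stata() returns a StataReader object instead of a DataFrame. By default it is set to False.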
compression: Specifies the compression method to use. If set to 'infer', the method will automatically detect the compression type based on the file extension (e.g., .gz, .bz2, .zip, .xz, .zst, .tar, .tar.gz, or .tar.bz2).
storage_options: Additional options for connecting to certain storage back-ends (e.g., AWS S3, Google Cloud Storage).
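The effect of convert_categoricals can be seen with a small round trip. The following is a minimal sketch, not a definitive recipe: the column names score and grade, the file name labels.dta, and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data with one categorical column.
df = pd.DataFrame({
    "score": [1.5, 2.5, 3.5],
    "grade": pd.Categorical(["low", "high", "low"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labels.dta")
    # to_stata() stores categorical columns as integer codes plus value labels.
    df.to_stata(path, write_index=False)

    # Default (convert_categoricals=True): value labels are applied.
    labeled = pd.read_stata(path)
    # Value labels are ignored: the stored integer codes are returned.
    raw = pd.read_stata(path, convert_categoricals=False)

print(labeled["grade"].dtype)
print(raw["grade"].tolist())
```

With the default setting the grade column comes back as a categorical; with convert_categoricals=False it contains the underlying integer codes.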
Return Value
The Pandas read_stata() method returns a DataFrame containing the data read from the specified Stata file. If iterator=True or chunksize is specified, it returns a pandas.api.typing.StataReader object instead.
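The difference between the two return types can be sketched as follows. This example assumes a hypothetical file name (chunked_file.dta) written to a temporary directory, and imports StataReader from pandas.io.stata so that it also runs on pandas versions that predate pandas.api.typing.

```python
import os
import tempfile

import pandas as pd
from pandas.io.stata import StataReader

df = pd.DataFrame({"Col_1": range(10), "Col_2": list("abcdefghij")})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chunked_file.dta")
    df.to_stata(path, write_index=False)

    chunk_sizes = []
    # With chunksize set, read_stata() returns a StataReader, not a DataFrame.
    with pd.read_stata(path, chunksize=4) as reader:
        is_reader = isinstance(reader, StataReader)
        # Each iteration yields a DataFrame with at most 4 rows.
        for chunk in reader:
            chunk_sizes.append(len(chunk))

print(is_reader)
print(chunk_sizes)
```

Using the reader as a context manager closes the underlying file once iteration is finished, which matters when processing files too large to load in one call.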
Example: Basic Reading of a Stata Dataset File
Here is a basic example demonstrating reading a Stata dataset file into a Pandas DataFrame using the read_stata() method.
    import pandas as pd

    # Create a DataFrame
    df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

    # Save the DataFrame to a Stata file
    df.to_stata("stata_file.dta")

    # Read a Stata file
    result = pd.read_stata("stata_file.dta")

    print("DataFrame read from Stata file:")
    print(result)
When we run the above program, it produces the following result −
DataFrame read from Stata file:
| | index | Col_1 | Col_2 |
|---|---|---|---|
| 0 | 0 | 0 | a |
| 1 | 1 | 1 | b |
| 2 | 2 | 2 | c |
| 3 | 3 | 3 | d |
| 4 | 4 | 4 | e |
The generated stata_file.dta file is saved in the current working directory.
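The index column in the output above comes from to_stata(), which writes the DataFrame index to the file by default. The following minimal sketch, which assumes the same sample data and a temporary directory for the file, shows how passing write_index=False to to_stata() keeps that column out of the file.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stata_file.dta")
    # Do not store the DataFrame index as a separate column.
    df.to_stata(path, write_index=False)
    result = pd.read_stata(path)

print(list(result.columns))
```

Only Col_1 and Col_2 are read back, because the file no longer contains an index column.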
Example: Reading Specific Columns from a Stata file
The following example demonstrates how to read specific columns from a Stata file using the read_stata() method with the columns parameter.
    import pandas as pd

    # Create a DataFrame
    df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

    # Save the DataFrame to a Stata file
    df.to_stata("stata_file.dta")

    # Read specific columns from a Stata file
    df = pd.read_stata("stata_file.dta", columns=["Col_2"])

    print("Selected columns read from Stata file:")
    print(df)
Executing the above code produces the following output −
Selected columns read from Stata file:
| | Col_2 |
|---|---|
| 0 | a |
| 1 | b |
| 2 | c |
| 3 | d |
| 4 | e |
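Requesting a column that does not exist in the file raises a ValueError. The sketch below assumes a hypothetical missing column named Col_3 and shows one way to catch the error; it is an illustration, not the only way to validate column names.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

error_message = None
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stata_file.dta")
    df.to_stata(path, write_index=False)

    try:
        # "Col_3" is a hypothetical name that is not present in the file.
        pd.read_stata(path, columns=["Col_2", "Col_3"])
    except ValueError as err:
        error_message = str(err)

print(error_message)
```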
Example: Setting a Custom Index Column While Reading a Stata File
The following example demonstrates how to use the read_stata() method for setting a custom index from the Stata file column data using the index_col parameter.
    import pandas as pd

    # Create a DataFrame
    df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

    # Save the DataFrame to a Stata file
    df.to_stata("stata_file.dta")

    # Read the Stata file, using the Col_2 column as the DataFrame index
    df = pd.read_stata("stata_file.dta", index_col="Col_2")

    print("DataFrame read from Stata file with custom index:")
    print(df)
Following is the output of the above code −
DataFrame read from Stata file with custom index:
| Col_2 | index | Col_1 |
|---|---|---|
| a | 0 | 0 |
| b | 1 | 1 |
| c | 2 | 2 |
| d | 3 | 3 |
| e | 4 | 4 |
Example: Reading a Compressed Stata File
The read_stata() method can also read a compressed Stata file. The following example writes a gzip-compressed file with to_stata() and reads it back by passing compression="gzip".
    import pandas as pd

    # Create a DataFrame
    df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

    # Save the DataFrame to a Stata file with custom gzip compression
    df.to_stata("compressed_file.dta.gz", compression={'method': 'gzip', 'compresslevel': 2})

    # Read a compressed Stata file
    df = pd.read_stata("compressed_file.dta.gz", compression="gzip")

    print("DataFrame read from compressed Stata file:")
    print(df)
Following is the output of the above code −
DataFrame read from compressed Stata file:
| | index | Col_1 | Col_2 |
|---|---|---|---|
| 0 | 0 | 0 | a |
| 1 | 1 | 1 | b |
| 2 | 2 | 2 | c |
| 3 | 3 | 3 | d |
| 4 | 4 | 4 | e |
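Because compression defaults to 'infer' in both to_stata() and read_stata(), the compression argument can usually be omitted when the file name carries a recognized extension. The following is a minimal sketch under that assumption; the file name inferred_file.dta.gz and the temporary directory are chosen for illustration only.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Col_1": range(5), "Col_2": ['a', 'b', 'c', 'd', 'e']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "inferred_file.dta.gz")
    # No compression argument: gzip is inferred from the .gz extension.
    df.to_stata(path, write_index=False)

    # Check the gzip magic bytes to confirm the file really is compressed.
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"

    # Reading also infers the compression from the extension.
    result = pd.read_stata(path)

print(is_gzip)
print(result)
```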