
Networks-Learning/stackexchange-dump-to-postgres

StackOverflow data to postgres

This is a quick script to move the StackOverflow data from the StackExchange data dump (Sept '14) to a PostgreSQL database.

Schema hints are taken from a post on Meta.StackExchange and from StackExchange Data Explorer.
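Each table in the dump is a single XML file whose `<row>` elements carry one record each, with column values stored as attributes; loading mostly means mapping those attributes to table columns. A minimal sketch using Python's standard library (the sample data and column names here are illustrative, not taken from the script):

```python
import xml.etree.ElementTree as ET

# A StackExchange dump file looks like:
#   <badges><row Id="1" UserId="2" Name="Autobiographer" .../></badges>
# Every attribute of a <row> maps to one column of the target table.
sample = """<badges>
  <row Id="1" UserId="2" Name="Autobiographer" Date="2010-07-20T19:04:57.910"/>
  <row Id="2" UserId="3" Name="Student" Date="2010-07-20T19:04:58.380"/>
</badges>"""

# Each dict is ready to be inserted as one table row.
rows = [dict(row.attrib) for row in ET.fromstring(sample).iter("row")]
print(rows[0]["Name"])  # -> Autobiographer
```

For the full-size dumps you would stream the file with `ET.iterparse` instead of loading it whole, but the attribute-to-column mapping is the same.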

Dependencies

Usage

  • Create the database stackoverflow in your database: CREATE DATABASE stackoverflow;
    • You can use a custom database name as well. Make sure to pass it explicitly when running the script later.
  • Move the following files to the folder from which the program is executed: Badges.xml, Votes.xml, Posts.xml, Users.xml, Tags.xml.
    • In some old dumps, the filename capitalization differs.
  • Execute in the current folder (in parallel, if desired):
    • python load_into_pg.py -t Badges
    • python load_into_pg.py -t Posts
    • python load_into_pg.py -t Tags (not present in earliest dumps)
    • python load_into_pg.py -t Users
    • python load_into_pg.py -t Votes
    • python load_into_pg.py -t PostLinks
    • python load_into_pg.py -t PostHistory
    • python load_into_pg.py -t Comments
  • Finally, after all the initial tables have been created:
    • psql stackoverflow < ./sql/final_post.sql
    • If you used a different database name, make sure to use that instead of stackoverflow while executing this step.
  • For some additional indexes and tables, you can also execute the following:
    • psql stackoverflow < ./sql/optional_post.sql
    • Again, remember to use the correct database name here, if it is not stackoverflow.
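The per-table steps above can be driven from a single loop. A sketch assuming bash, the default stackoverflow database, and the XML files in the current directory:

```shell
# Load all tables in parallel, then build the final tables/views.
for table in Badges Posts Tags Users Votes PostLinks PostHistory Comments; do
    python load_into_pg.py -t "$table" &
done
wait
psql stackoverflow < ./sql/final_post.sql
```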

Loading a complete stackexchange project

You can use the script to download a given stackexchange compressed file from archive.org and load all the tables at once, using the -s switch.

You will need the urllib and libarchive modules.

If you give a schema name using the -n switch, all the tables will be placed in that schema. The schema will be created by the script.

To load the dba.stackexchange.com project in the dba schema, you would execute: ./load_into_pg.py -s dba -n dba

The final scripts sql/final_post.sql and sql/optional_post.sql do not schema-qualify table names. Before running them, set the search_path to your schema name: SET search_path TO <myschema>;
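Putting the dba example together (assuming the default stackoverflow database; the heredoc keeps the SET and the script in one psql session so the search_path applies):

```shell
./load_into_pg.py -s dba -n dba
psql stackoverflow <<'SQL'
SET search_path TO dba;
\i sql/final_post.sql
SQL
```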

Caveats and TODOs

  • It prepares some indexes and views which may not be necessary for your analysis.
  • The Body field in the Posts table is NOT populated by default. You have to use the --with-post-body argument to include it.
  • The EmailHash field in the Users table is NOT populated.

Sept 2011 data dump

  • Tags.xml is missing from the data dump, so the PostTag and UserTagQA tables will be empty after final_post.sql.
  • The ViewCount in Posts is sometimes empty. It is replaced by NULL in those cases.
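The empty-value handling can be sketched as a small conversion helper (the function name is hypothetical; the script's internals may differ):

```python
def parse_view_count(raw):
    """Convert a ViewCount attribute to an int, mapping the empty
    values seen in the Sept 2011 dump to None (i.e. SQL NULL)."""
    return None if raw is None or raw == "" else int(raw)

print(parse_view_count("1297"))  # -> 1297
print(parse_view_count(""))      # -> None
```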

Acknowledgement

@madtibo made significant contributions by adding jsonb and foreign key support.
