
Networks-Learning/stackexchange-dump-to-postgres

StackOverflow data to postgres

This is a quick script to move the StackOverflow data from the StackExchange data dump (Sept '14) to a PostgreSQL database.

Schema hints are taken from a post on Meta.StackExchange and from StackExchange Data Explorer.
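Each table in the dump is a single XML file whose `<row>` elements carry one record each, with column values stored as attributes; loading mostly means mapping those attributes to table columns. A minimal sketch using Python's standard library (the sample data and column names here are illustrative, not taken from the script):

```python
import xml.etree.ElementTree as ET

# A StackExchange dump file looks like:
#   <badges><row Id="1" UserId="2" Name="Autobiographer" .../></badges>
# Every attribute of a <row> maps to one column of the target table.
sample = """<badges>
  <row Id="1" UserId="2" Name="Autobiographer" Date="2010-07-20T19:04:57.910"/>
  <row Id="2" UserId="3" Name="Student" Date="2010-07-20T19:04:58.380"/>
</badges>"""

# Each dict is ready to be inserted as one table row.
rows = [dict(row.attrib) for row in ET.fromstring(sample).iter("row")]
print(rows[0]["Name"])  # -> Autobiographer
```

For the full-size dumps you would stream the file with `ET.iterparse` instead of loading it whole, but the attribute-to-column mapping is the same.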

Dependencies

Usage

  • Create the database stackoverflow in your database: CREATE DATABASE stackoverflow;
    • You can use a custom database name as well. Make sure to pass it explicitly when running the script later.
  • Move the following files to the folder from which the program is executed: Badges.xml, Votes.xml, Posts.xml, Users.xml, Tags.xml.
    • In some old dumps, the filename capitalization differs.
  • Execute in the current folder (in parallel, if desired):
    • python load_into_pg.py -t Badges
    • python load_into_pg.py -t Posts
    • python load_into_pg.py -t Tags (not present in earliest dumps)
    • python load_into_pg.py -t Users
    • python load_into_pg.py -t Votes
    • python load_into_pg.py -t PostLinks
    • python load_into_pg.py -t PostHistory
    • python load_into_pg.py -t Comments
  • Finally, after all the initial tables have been created:
    • psql stackoverflow < ./sql/final_post.sql
    • If you used a different database name, make sure to use that instead of stackoverflow while executing this step.
  • For some additional indexes and tables, you can also execute the following:
    • psql stackoverflow < ./sql/optional_post.sql
    • Again, remember to use the correct database name here, if it is not stackoverflow.
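The per-table steps above can be driven from a single loop. A sketch assuming bash, the default stackoverflow database, and the XML files in the current directory:

```shell
# Load all tables in parallel, then build the final tables/views.
for table in Badges Posts Tags Users Votes PostLinks PostHistory Comments; do
    python load_into_pg.py -t "$table" &
done
wait
psql stackoverflow < ./sql/final_post.sql
```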

Loading a complete stackexchange project

You can use the script to download a given stackexchange compressed file from archive.org and load all the tables at once, using the -s switch.

You will need the urllib and libarchive modules.

If you give a schema name using the -n switch, all the tables will be placed in that schema. The schema will be created by the script.

To load the dba.stackexchange.com project in the dba schema, you would execute: ./load_into_pg.py -s dba -n dba

The final scripts sql/final_post.sql and sql/optional_post.sql do not schema-qualify table names. Before running them, set the search_path to your schema name: SET search_path TO <myschema>;
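Putting the dba example together (assuming the default stackoverflow database; the heredoc keeps the SET and the script in one psql session so the search_path applies):

```shell
./load_into_pg.py -s dba -n dba
psql stackoverflow <<'SQL'
SET search_path TO dba;
\i sql/final_post.sql
SQL
```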

Caveats and TODOs

  • It prepares some indexes and views which may not be necessary for your analysis.
  • The Body field in the Posts table is NOT populated by default. You have to use the --with-post-body argument to include it.
  • The EmailHash field in the Users table is NOT populated.

Sept 2011 data dump

  • Tags.xml is missing from the data dump, so the PostTag and UserTagQA tables will be empty after final_post.sql.
  • The ViewCount in Posts is sometimes empty. It is replaced by NULL in those cases.
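The empty-value handling can be sketched as a small conversion helper (the function name is hypothetical; the script's internals may differ):

```python
def parse_view_count(raw):
    """Convert a ViewCount attribute to an int, mapping the empty
    values seen in the Sept 2011 dump to None (i.e. SQL NULL)."""
    return None if raw is None or raw == "" else int(raw)

print(parse_view_count("1297"))  # -> 1297
print(parse_view_count(""))      # -> None
```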

Acknowledgement

@madtibo made significant contributions by adding jsonb and foreign key support.
