As a graduate student, you read a lot of journal articles… a lot. With the material in the articles being as difficult as it is, I didn’t want to worry about organizing everything as well. That’s why I wrote this script to help (I may have also been procrastinating from studying for my qualifiers). This was one of my earliest little projects, so I’m not claiming that this is the best way to do anything.
My goal was to have a central repository of papers that was organized by an author’s last name. Under each author’s name would go all of their papers I had read or planned to read.
I needed it to be portable so that I could access any paper from my computer or iPad, so Dropbox was a necessity. I also needed to organize the papers by subject. I wanted to easily get to all the papers on Asset Pricing, without having to go through each of the authors separately.
Symbolic links were a natural solution to my problem.
A canonical copy of each paper would be stored under /Drobox/Papers/<author name>
, and I could refer that paper from /Macro/Asset Pricing/
with a symbolic link. Symbolic links avoid the problem of having multiple copies of the same paper. Any highlighting or notes I make on a paper is automatically spread to anywhere that paper is linked from.
import os
import re
import sys
import subprocess
import pathlib
class Parser(object):
def __init__(self, path,
repo=pathlib.PosixPath('/Users/tom/Economics/Papers')):
self.repo = repo
self.path = self.path_parse(path)
self.exists = self.check_existance(self.path)
self.is_full = self.check_full(path)
self.check_type(self.path)
self.added = []
def path_parse(self, path):
"""Ensures a common point of entry to the functions.
Returns a pathlib.PosixPath object
"""
if not isinstance(path, pathlib.PosixPath):
path = pathlib.PosixPath(path)
return path
else:
return path
def check_existance(self, path):
if not path.exists():
raise OSError('The supplied path does not exist.')
else:
return True
def check_type(self, path):
if path.is_dir():
self.is_dir = True
self.is_file = False
else:
self.is_file = True
self.is_dir = False
def check_full(self, path):
if path.parent().as_posix() in path.as_posix():
return True
def parser(self, f):
"""The parsing logic to find authors and paper name from a file.
f is a full path.
"""
try:
file_name = f.parts[-1]
self.file_name = file_name
r = re.compile(r' \([\d-]{0,4}\)')
sep_authors = re.compile(r' & |, | and')
all_authors, paper = re.split(r, file_name)
paper = paper.lstrip(' - ')
authors = re.split(sep_authors, all_authors)
authors = [author.strip('& ' or 'and ') for author in authors]
self.authors, self.paper = authors, paper
return (authors, paper)
except:
print('Missed on {}'.format(file_name))
def make_dir(self, authors):
repo = self.repo
for author in authors:
try:
os.mkdir(repo[author].as_posix())
except OSError:
pass
def copy_and_link(self, authors, f, replace=True):
repo = self.repo
file_name = f.parts[-1]
for author in authors:
if author == authors[0]:
try:
subprocess.call(["cp", f.as_posix(),
repo[author].as_posix()])
success = True
except:
success = False
else:
subprocess.call(["ln", "-s",
repo[authors[0]][file_name].as_posix(),
repo[author].as_posix()])
success = True
if replace and author == authors[0] and success:
try:
f.unlink()
subprocess.call(["ln", "-s",
repo[authors[0]][file_name].as_posix(),
f.parts[:-1].as_posix()])
except:
raise OSError
def main(self, f):
authors, paper = self.parser(f)
self.make_dir(authors)
self.copy_and_link(authors, f)
def run(self):
if self.exists and self.is_full:
if self.is_dir:
for f in self.path:
if f.parts[-1][0] == '.' or f.is_symlink():
pass
else:
try:
self.main(f)
self.added.append(f)
except:
print('Failed on %s' % str(f))
else:
self.main(self.path)
self.added.append(self.path)
for item in self.added:
print(item.parts[-1])
if __name__ == "__main__":
p = pathlib.PosixPath(sys.argv[1])
try:
repo = pathlib.PosixPath(sys.argv[2])
except:
repo = pathlib.PosixPath('/Users/tom/Economics/Papers')
print(p)
obj = Parser(p, repo)
obj.run()
The script takes two arguments, the folder to work on and the folder to store the results (defaults to /Users/tom/Economics/Papers
). Already a could things jump out that I should update. If I ever wanted to add more sophisticated command line arguments I would want to switch to something like argparse
. I also shouldn’t have something like /Users/tom
anywhere. This kills portability since it’s specific to my computer (use os.path.expanduser('~')
instead).
I create a Parser
which finds every paper in the directory given by the first argument. I had to settle on a standard naming for my papers. I chose Author1, Author2, ... and AuthorN (YYYY) - Paper Title
. Whenever Parser
find that pattern, it splits off the Authors from the title of the paper, and stores the location of the file.
After doing this for each paper in the directory, it’s time to copy and link the files.
for author in authors:
if author == authors[0]:
try:
subprocess.call(["cp", f.as_posix(),
repo[author].as_posix()])
success = True
except:
success = False
else:
subprocess.call(["ln", "-s",
repo[authors[0]][file_name].as_posix(),
repo[author].as_posix()])
success = True
Since I just one one actual copy of the paper on file, I only copy the paper to the first author’s sub-folder. Thats the if author == authors[0]
. Every other author just links to the copy stored in the first author’s folder. The wiser me of today would use something like shutil
to copy the files instead of subprocess
, but I was still new to python.
The biggest drawback is that I can’t differentiate multiple authors with the same last name that well. I need to edit the original names to include the first initials (C. Romer and D. Romer (2010)
). But overall I’m pleased with the results.