Coreference Resolution
Identifying references to the same thing
Including non-experts in data creation and system functions
Playing the board game Diplomacy
Generating code that represents the meaning of text
Work on making and using vector representations of text
Blog posts from my old website
Various topics
I’m trying a new approach to reading literature. I try to read enough of one paper each work day to get the key idea and then add some content here if I want to remember it. My hope is that this helps me get more out of my reading by forcing me to identify what mattered most to me in the paper and to link it to other things I have read.
Note, this is not a literature review. I am not aiming to be comprehensive. The papers I read and write about reflect my interests, biases, and opinions. I also don’t completely summarise the work, but rather focus on the aspects that I want to remember. The pages are also in various states. Some are fairly detailed, others are quite sparse or contain just a list of papers I plan to read / reread and write about.
Advice from elsewhere:
To help me identify the papers I want to read, I have been using the following method (in Chrome on macOS):
1. Bookmark all open tabs (one per paper) with Shift+Command+D or Bookmarks -> Bookmark All Tabs.
2. Go to chrome://bookmarks, select the new folder, then use the menu on the far right of the blue bar to select Export Bookmarks.
3. Run the script below with the exported bookmarks_DATE.html on standard input (note, requires PyPDF2). This produces a pdf with only the introduction of each paper (approximately).

# Get the paper URLs
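import sys
papers = {}
for line in sys.stdin:
    if 'aclanthology.org' in line:
        content = line.strip()
        # The second whitespace-separated token is HREF="...": take the URL,
        # drop its trailing slash and point at the PDF instead
        url = content.split()[1].split('"')[1][:-1] + ".pdf"
        # The link text is the paper title followed by " - ACL Anthology"
        name = content.split(" - ACL Anthology")[0].split(">")[-1]
        papers[name] = url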
# Download the papers
import io, requests
PDFs = {}
for name, url in papers.items():
    # ACL Anthology PDFs are publicly accessible, so no credentials are needed
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    PDFs[name] = io.BytesIO(r.content)
# Get the Introductions
# Current PyPDF2 names; the older PdfFileReader / PdfFileWriter were removed in PyPDF2 3.0
from PyPDF2 import PdfReader, PdfWriter
import string
pdf_writer = PdfWriter()
for name, raw_pdf in PDFs.items():
    pdf = PdfReader(raw_pdf)
    page0 = pdf.pages[0]
    pdf_writer.add_page(page0)
    text = page0.extract_text().split('\n')
    done = False
    for part in text:
        # Try to find the start of section 2. The check assumes the extracted
        # heading has no space after the number, e.g. "2Related Work"
        if part.startswith('2') and len(part) > 1:
            if part[1] in string.ascii_letters:
                done = True
    # Only look at the second page if section 2 was not found on the first
    if not done and len(pdf.pages) > 1:
        page1 = pdf.pages[1]
        start = page1.extract_text().split('\n')[0]
        # Try to find the start of section 2 at the very top of the page
        if start.startswith('2') and len(start) > 1:
            if start[1] in string.ascii_letters:
                done = True
        if not done:
            pdf_writer.add_page(page1)
with open('example.pdf', 'wb') as out:
    pdf_writer.write(out)
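
The URL step above depends on the exact whitespace and quoting of each line in the exported file. As a rough sketch of a less brittle alternative, the same titles and PDF URLs could be pulled out with the standard library's html.parser. The names ACLBookmarkParser and extract_acl_pdf_urls are hypothetical, the sample entry is made up, and I am assuming (as the script above does) that the PDF lives at the paper's page URL without the trailing slash plus ".pdf".

```python
from html.parser import HTMLParser


class ACLBookmarkParser(HTMLParser):
    """Collect {title: pdf_url} for ACL Anthology links in a bookmarks export.

    Hypothetical helper, not part of the original script.
    """

    def __init__(self):
        super().__init__()
        self.papers = {}
        self._href = None  # href of the ACL link currently open, if any
        self._title = []   # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> works
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "aclanthology.org" in href:
                self._href = href
                self._title = []

    def handle_data(self, data):
        if self._href is not None:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._title).strip()
            suffix = " - ACL Anthology"
            if title.endswith(suffix):
                title = title[: -len(suffix)]
            # Assumption: PDF URL is the page URL without its trailing slash, plus ".pdf"
            self.papers[title] = self._href.rstrip("/") + ".pdf"
            self._href = None


def extract_acl_pdf_urls(bookmarks_html):
    """Return a dict mapping paper title to PDF URL."""
    parser = ACLBookmarkParser()
    parser.feed(bookmarks_html)
    parser.close()
    return parser.papers


if __name__ == "__main__":
    # Made-up bookmark entry in the shape Chrome exports
    sample = (
        '<DT><A HREF="https://aclanthology.org/X00-0000/" ADD_DATE="0">'
        "An Example Paper - ACL Anthology</A>\n"
    )
    print(extract_acl_pdf_urls(sample))
```

The resulting dict has the same shape as papers in the script above, so it could be swapped in for the first step by passing it sys.stdin.read().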