arxiv-sanity lite: tag arxiv papers of interest and get recommendations of similar papers in a nice UI, using SVMs over tfidf feature vectors computed from paper abstracts.

arxiv-sanity-lite

A much lighter-weight arxiv-sanity from-scratch re-write. Periodically polls arxiv API for new papers. Then allows users to tag papers of interest, and recommends new papers for each tag based on SVMs over tfidf features of paper abstracts. Allows one to search, rank, sort, slice and dice these results in a pretty web UI. Lastly, arxiv-sanity-lite can send you daily emails with recommendations of new papers based on your tags. Curate your tags, track recent papers in your area, and don't miss out!
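To make the recommendation idea concrete, here is a minimal, dependency-free sketch of "SVM over tfidf features". It is not the code in this repo: the whitespace tokenizer, the smoothed idf formula, the plain subgradient-descent trainer, the toy abstracts, and the choice to treat every untagged paper as a negative example are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenised for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        # smoothed idf (an assumption; other variants work too)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors

def dot(w, vec):
    return sum(w.get(t, 0.0) * x for t, x in vec.items())

def train_linear_svm(vectors, labels, C=0.01, epochs=200, lr=0.1):
    """Minimise 0.5*|w|^2 + C*sum(hinge loss) with full-batch subgradient descent."""
    w, b = {}, 0.0
    for _ in range(epochs):
        grad_w = dict(w)  # gradient of the regulariser 0.5*|w|^2 is w itself
        grad_b = 0.0
        for vec, y in zip(vectors, labels):
            if y * (dot(w, vec) + b) < 1:  # margin violated: hinge term is active
                for t, x in vec.items():
                    grad_w[t] = grad_w.get(t, 0.0) - C * y * x
                grad_b -= C * y
        for t, g in grad_w.items():
            w[t] = w.get(t, 0.0) - lr * g
        b -= lr * grad_b
    return w, b

# Hypothetical toy library: paper id -> abstract.
abstracts = {
    "p1": "diffusion models for image generation",
    "p2": "score based diffusion image synthesis",
    "p3": "graph neural networks molecules",
    "p4": "reinforcement learning robot control",
    "p5": "sparse attention language transformers",
}
tagged = {"p1"}  # papers the user put under one tag

ids = list(abstracts)
vectors = tfidf_vectors([abstracts[i] for i in ids])
labels = [1 if i in tagged else -1 for i in ids]  # untagged = negative (assumption)
w, b = train_linear_svm(vectors, labels)

scores = {i: dot(w, v) + b for i, v in zip(ids, vectors)}
ranked = sorted(ids, key=scores.get, reverse=True)
print(ranked[:2])
```

On this toy data the tagged paper p1 ranks first and p2, which shares the words "diffusion" and "image" with it, ranks second. The parameter C controls how strongly the training examples pull on the weights relative to the regulariser, so it is the main knob to tune when the rankings look off.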

I am running a live version of this code on arxiv-sanity-lite.com.

Screenshot

To run

To run this locally I usually run the following script to update the database with any new papers. I typically schedule this via a periodic cron job:

#!/bin/bash

python3 arxiv_daemon.py --num 2000

if [ $? -eq 0 ]; then
    echo "New papers detected! Running compute.py"
    python3 compute.py
else
    echo "No new papers were added, skipping feature computation"
fi
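The same control flow can be sketched in Python, which may be handy if you prefer to drive the update from a single script. This is only an illustration under assumptions: the function name update_database and its injectable run parameter are hypothetical, and it relies on the convention used by the script above that arxiv_daemon.py exits with 0 only when new papers made it into the database.

```python
import subprocess
import sys

def update_database(run=subprocess.run, num=2000):
    """Fetch new papers, then recompute features only if the fetch exited with 0."""
    fetch = run([sys.executable, "arxiv_daemon.py", "--num", str(num)])
    if fetch.returncode != 0:
        print("No new papers were added, skipping feature computation")
        return False
    print("New papers detected! Running compute.py")
    # check=True raises CalledProcessError if feature computation fails
    run([sys.executable, "compute.py"], check=True)
    return True
```

Calling update_database() from the repo root mirrors the bash script. The run argument exists only so the logic can be exercised without touching the network.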

You can see that updating the database is a matter of first downloading the new papers via the arxiv API using arxiv_daemon.py, and then running compute.py to compute the tfidf features of the papers. To serve the flask server locally we'd then run something like:

export FLASK_APP=serve.py; flask run

All of the database files are stored inside the data directory. If you'd like to run your own instance on the interwebs, I recommend simply running the above on a Linode. For example, I currently run this code on the smallest "Nanode 1 GB" instance, indexing about 30K papers, which costs $5/month.

(Optional) Finally, if you'd like to send periodic emails to users about new papers, see the send_emails.py script. You'll also have to pip install sendgrid. I run this script in a daily cron job.

Requirements

Install via requirements:

pip install -r requirements.txt

Todos

  • Make website mobile friendly with media queries in css etc
  • The metas table should not be a sqlitedict but a proper sqlite table, for efficiency
  • Build a reverse index to support faster search; right now we iterate through the entire database
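As a rough idea of what the reverse index in the last item could look like, here is a minimal in-memory sketch. The function names, the whitespace tokenizer, the AND semantics, and the toy papers are all assumptions; nothing like this exists in the repo yet.

```python
from collections import defaultdict

def build_index(papers):
    """Map each lowercased word to the set of paper ids whose text contains it."""
    index = defaultdict(set)
    for pid, text in papers.items():
        for word in text.lower().split():
            index[word].add(pid)
    return index

def search(index, query):
    """Return ids of papers that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)

# Hypothetical toy data: paper id -> searchable text.
papers = {
    "p1": "diffusion models for image generation",
    "p2": "score based diffusion image synthesis",
    "p4": "reinforcement learning robot control",
}
index = build_index(papers)
print(search(index, "diffusion image"))
```

A lookup then touches only the posting sets of the query words instead of scanning every paper. A real version would also need to keep the index in sync as new papers arrive.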

License

MIT