Project / Data Integration Resources
- Data:
- Aspen System's
- Enron $15K ~160GB mega corpus disk --
Status: Data is on-line (Cliplab access) but arrived too late for TJR integration.
(Received ~ February 2006, finally available on Cliplab ~ May 2006).
- Data copied to: UMIACS/Cliplab "/fs/cliplab/aspen"
- Original data transmittal / description document (MS DOC, PDF)
- Note: Original disk contains major path problems --
Partial directory listing: (9MB GZipped).
Bogus path listing: listing (source eMail).
- Enron
Employee list archive --
Status: JIKDdB integrated ~ March 2006
(Suggested by Erin ~ November 2005, rescanned by Dave's crew ~ January 2006).
- employee data (sorted 5 ways in 149 TIF page images)
merged --> JIKDdB.KD_People++ (~1200 tuples).
- UC
- Berkeley Enron eMail category data --
Status: JIKDdB integrated ~ September 2005.
(at Doug's suggestion, derived from BAILANDO
Enron EMail Analysis project ~ August 2005)
- Enron corpus subset (1702 tuples) --> JIKDdB UCB_Categories, UCB_EM_CatReferences,
and UCB_EM_References tables.
- USC
- ICT Enron eMail annotations database -- Status: Partially JIKDdB integrated
(from Anton Leuski ~ August 2005).
- MySQL.USC_E_Annotations RDB on-line @asp.cs.umd.edu ~October 2005,
also @zaphod.mindlab.umd.edu ~June 2006.
- employeelist merged --> JIKDdB.KD_People ID (172 tuples) -- ?? 2005
- annotation, annotation_item, annotation_label, annotation_set,
body, msg, paragraph, recipient, and reference
tables remain to be picked-thru -- 05 April 2006.
- ISI Enron eMail database -- Status: Employee data JIKDdB integrated
(suggested by Tamer ~ July 2005).
- Published schema and statistics report
(PDF)
- MySQL.USC_Enron RDB on-line @neoborg.cs.umd.edu ~August 2005,
also @zaphod.mindlab.umd.edu ~June 2006.
- employeelist imported --> JIKDdB.KD_People ID (151 tuples) -- ?? 2005
- message, recipientinfo, and referenceinfo tables remain to be picked-thru -- 05 April 2006.
- UMD
- Enron Corpus Lucene index -- Status: Proxy JSP wrapped & on-line
(Tamer's ~ January 2006 "Black-box" JAR provides Lucene
"key-word" and "like" search function API).
- JSP Lucene index "Key-word" (1a, 1b, Perl), "ID Like" (2a, 2b, Perl),
and Message Body (3a, 3b, Perl) retrieval examples.
- Enron Corpus eMail proxy -- Status: JSP wrapped & on-line
(TJ's eMail Proxy Java modules provide API for underlying JIKDdB eMail data query functions).
- JSP eMail message body query examples (1a, 1b, Perl)
- Yejun's eMail "mentioned names" data extraction --
Status: JIKDdB integrated ~2005
- Jen's original Enron eMail corpus "Connectence" data --
Status: JIKDdB integrated ~ 2006
(received ~ January? 2006).
- Andres' Enron Telephone header, wav & xml file data --
Status: Partially JIKDdB integrated
(received ~ March 2006)
- JIKDdB.PC_EMailIDs, .PC_InCallNames, .PC_KeyWords,
.PC_MessageRefs, & .PC_Subjects rendered ~ April 2006.
- Minor resource corrections pending...
- Enron corpus Telephone proxy -- Status: Stubbed, waiting for testing
with new telephone header data.
(HCIL doesn't need it)
- Yejun's Enron corpus harshness scores data --
Status: JIKDdB integrated ~June 2006 (Importing ATT)
(received ~March 2006).
- Tamer's Enron eMail corpus thread ID data --
Status: JIKDdB integrated ~ April 2006
(received ~ April 2006).
- JIKDdB.KD_EM_ThreadNodes and .KD_EM_ThreadLinks rendered 22 April 2006.
- Tamer's Enron eMail corpus eMail address / Name / Nickname data --
Status: JIKDdB integrated ~June 2006
(received ~ April 2006).
- JIKDdB.KD_EM_Entities table rendered 02 June 2006
(From Zaphod's "/Users/Shared/JIKD/Code/VS_LINKS/JIKD/Data/Threads/entities.txt" file).
- Jen's improved Enron eMail corpus "Connectence" data --
Status: JIKDdB integrated ~ April 2006.
(received ~ April 2006).
- Enron eMail corpus Summary proxy --
Status: Waiting for code and/or data.
- Sandeep's phone conversation data --
Status: Wav files on-line,
cross-reference data awaiting JIKDdB integration ~June 2006
(received ~June 2006).
- ENRON1_JUN_01_06 disk copied to Zaphod's
"/Users/Shared/JIKD/Data/SpeechGrp/Enron1" directory;
Web Wav file access is available via "http://zaphod.mindlab.umd.edu:16080/JIKD/Data/SpeechGrpWavs" URL root
(e.g. SNO-351_conv2.wav) -- 02 June.
- What did I miss?
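The Web Wav file access noted above for Sandeep's phone conversation data amounts to appending a file name to the published URL root. A minimal sketch follows, assuming the Wav files sit directly under that root with no subdirectories (the notes do not confirm this); `wav_url` is a hypothetical helper name, not part of the JIKD code.

```python
from urllib.parse import quote

# URL root quoted in the notes above.
WAV_URL_ROOT = "http://zaphod.mindlab.umd.edu:16080/JIKD/Data/SpeechGrpWavs"


def wav_url(filename, root=WAV_URL_ROOT):
    """Build the URL for one Wav file (hypothetical helper).

    Assumption: files live directly under the root, so the URL is
    simply root + "/" + percent-encoded file name.
    """
    return root.rstrip("/") + "/" + quote(filename)


# Example file name taken from the notes above.
print(wav_url("SNO-351_conv2.wav"))
```

Percent-encoding the file name is only a precaution here; the one example name in the notes contains no characters that need escaping.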
- Documents:
- Architecture slides (PPT, PDF).
- JIKD (MySQL) database access information.
- JIKD (MySQL) database schema slides (PPT, PDF -- updated 03 May 2006;
now includes connectence, threads, and phone call data).
- Server information:
- Zaphod (zaphod.mindlab.umd.edu) is an Apple OSX server residing in
Jim Hendler's mindlab. It was installed to serve as the primary JIKD resource
repository and server (contact Jim Hendler or
Ron Alford for access).
There are three (jikd) shared Zaphod subdirectories as follows:
- JIKD: "/Users/Shared/JIKD" -- The root data file, server, and code resource subdirectory
(see ReadMe.txt for details).
- JIKD_CGI: "/Library/WebServer/CGI-Executables/JIKD" -- The root web CGI script subdirectory.
- JIKD_Web: "/Library/WebServer/Documents/JIKD" -- The root web server subdirectory.
- Neoborg (neoborg.cs.umd.edu) is a Linux workstation residing in
V.S. Subrahmanian's 4th floor A.V. Williams lab (maintained by CS staff under
the Londo cluster). This machine hosted the original JIKDdB MySQL
RDB instance (since replaced by Zaphod's daemon).
- Asp (asp.cs.umd.edu) is a Linux workstation once residing in A.V. Williams Rm 3247
(part of the Londo cluster maintained by CS staff). It was used to develop many of
the JIKD Java resources and currently hosts a Tomcat server for several JSP pages.
Note: The JSP files need to move to another Tomcat instance elsewhere.
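Scripts that run on Zaphod may find it convenient to keep the three shared subdirectories listed above in one lookup table. This is only a sketch: the three root paths come from the notes, while `ZAPHOD_ROOTS` and `zaphod_path` are hypothetical names introduced for illustration.

```python
from pathlib import PurePosixPath

# The three shared Zaphod roots, as listed in the server notes above.
ZAPHOD_ROOTS = {
    "JIKD": PurePosixPath("/Users/Shared/JIKD"),
    "JIKD_CGI": PurePosixPath("/Library/WebServer/CGI-Executables/JIKD"),
    "JIKD_Web": PurePosixPath("/Library/WebServer/Documents/JIKD"),
}


def zaphod_path(root, *parts):
    """Join path components under one of the named roots (hypothetical helper).

    Raises KeyError if the root name is not one of the three above.
    """
    return ZAPHOD_ROOTS[root].joinpath(*parts)


# The speech group disk copy location mentioned in the Data section.
print(zaphod_path("JIKD", "Data", "SpeechGrp", "Enron1"))
```

`PurePosixPath` only manipulates path strings, so the sketch can be run on any machine without access to Zaphod.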
- Source code: