AG2PI Workshop #25 - April 10-12, 2024
Scientific Computing & Data Analytics: A Comprehensive Toolkit for Research
April 10-12, 2024
11:00 AM - 1:00 PM (US Central Time)
Purpose
Hands-on training in efficient scientific computing techniques for both small and large data sets.
Registration
Register for the virtual event by clicking the link below. Upon registration, you will receive a confirmation email with information about joining the meeting.
Workshop Registration
Workshop Resources
Click the buttons below to access the resources for this workshop.
- Workshop Materials (Slides and Code)
- Day 1: Emily's Google Colab Notebook
- Day 2: Bella's Google Colab Notebook
- Day 2: Aditya's Google Colab Notebook
- Day 3: Brenda's Google Colab Notebook
- Day 3: Emmanuel's Google Colab Notebook
Chat Questions
Questions directed to the speakers in the chat can be viewed by clicking the button below.
See Chat Questions

To fully harness the potential of extensive data sets, it is essential to process the raw data using methods like information extraction, data mining, and knowledge discovery. This workshop series aims to equip you with a comprehensive computational toolkit for your research. This toolkit will enable you to effectively manage and analyze the increasing volumes of data in your field, thereby enhancing your research output.
- Fundamental concepts of Python, Jupyter notebooks, and GitHub
- Overview of machine learning and interactive visualization
- Real-world applications in the field of plant phenotyping
- Understand the fundamentals of scientific computing, including data preprocessing, statistical analysis, machine learning, and data visualization
- Incorporate a variety of scientific computing techniques into your research workflow
- Utilize established software to identify patterns within data sets
- Perform analyses with common machine learning tools, including neural networks and dimensionality reduction methods
About Presenters
Emmanuel Gonzalez is a PhD candidate in Dr. Duke Pauli's lab at the University of Arizona whose work focuses on leveraging plant phenomics, data science, and machine learning to investigate how crops respond to both abiotic and biotic stress.
Jeffrey Demieville is an interdisciplinary R&D Engineer at the University of Arizona whose work focuses on applying biological, agricultural, and systems engineering practices to the field of phenomics.
Brenda Huppenthal is a computer science graduate student at the University of Arizona whose work focuses on using computer vision and specifically deep learning approaches to obtain underlying structural information from point clouds.
Emily Cawley is an undergraduate Computer Science student at the University of Arizona. As an undergraduate researcher in the Pauli lab, she works on predicting transformations of 3D data with neural networks.
Aditya Kumar is an undergraduate researcher in the Pauli lab. He specializes in software development and data visualization and is currently focusing on refining object detection models for Sorghum Panicle Detection.
Bella Salter is an undergraduate researcher in the Pauli lab whose work focuses on predicting late-season lettuce growth from early metrics of success using machine learning techniques such as long short-term memory (LSTM) networks.
Chat Questions
Session 1
That is correct, data transfer can be time consuming and a bottleneck. Google Colab is limited by the amount of storage space on your Google Drive. This usually means that you will need to go beyond Google Colab notebooks at a certain point when you outgrow the storage space. Often people change to cloud computers or high performance computers with larger storage space. You can read more about Google's quotas and limits here.
(Answer #1) No, you can run it using a free account. Just make sure you are logged in to your Google account to run the code.
(Answer #2) No, not when you are running code that only requires a CPU. If your code requires a GPU, Google does provide limited access to GPUs for free, but it depends on the load on the system at the time.
\n is a control character that signifies a new line.
bq_stripped is a variable that holds a string. Under the 'Processing' heading, you can see that we set bq_stripped = block_quote.strip(), so it contains the original block quote with leading and trailing whitespace stripped. You can read more about the strip function in the Python documentation for str.strip().
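As a quick illustration of both answers, here is a minimal sketch; only the variable names mirror the notebook, and the block_quote text itself is invented:

```python
# A string ending in the "\n" control character (a new line).
block_quote = "  To be, or not to be\n"

# strip() removes leading and trailing whitespace, including "\n".
bq_stripped = block_quote.strip()
print(repr(bq_stripped))  # 'To be, or not to be'
```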
Session 2
You can control whether scikit-learn fits an intercept with the fit_intercept parameter; after fitting, the estimated intercept is stored in the model's intercept_ attribute.
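A minimal sketch with scikit-learn's LinearRegression, showing the fit_intercept parameter and the fitted intercept_ attribute (the toy data is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 5, so the fitted intercept
# should come out very close to 5.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 5.0

# fit_intercept (default True) controls whether an intercept is estimated;
# the fitted value is then available in the intercept_ attribute.
model = LinearRegression(fit_intercept=True).fit(X, y)
print(model.intercept_)  # ~5.0
print(model.coef_)       # ~[2.0]
```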
For 3D processing, point clouds are downsampled using voxel downsampling. This allows us to more easily visualize/analyze the data. You can read more about voxel downsampling here.
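The notebooks likely rely on a point-cloud library such as Open3D (whose PointCloud.voxel_down_sample(voxel_size=...) method does this), but the idea itself can be sketched in plain NumPy on a synthetic cloud: bin points into cubic voxels and replace each voxel's points by their centroid.

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """Replace all points falling in the same voxel by their centroid."""
    # Assign each point an integer voxel index per coordinate.
    voxel_ids = np.floor(points / voxel_size).astype(np.int64)
    # Group points that share a voxel and average them.
    _, inverse = np.unique(voxel_ids, axis=0, return_inverse=True)
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, points.shape[1]))
    counts = np.zeros(n_voxels)
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]

cloud = np.random.rand(10_000, 3)               # dense synthetic point cloud
down = voxel_downsample(cloud, voxel_size=0.1)  # at most 10^3 voxels remain
print(down.shape)
```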
The Mapbox API key is free. However, there are some limitations on the number of data points that can be visualized. You can read more about Mapbox and Plotly Express here.
(Follow-up Question) Adding to it: Can we use ArcGIS satellite mapping?
(Answer Follow-up #1) I am not sure if ArcGIS satellite mapping is supported. The maps used in Mapbox are provided by OpenStreetMap. You can read more about Mapbox here.
(Answer Follow-up #2) Here is additional documentation from Plotly; the relevant links within it illustrate when you need an access token: https://plotly.com/python/scattermapbox/.
Session 3
We can use Euler characteristic curves to assess the shape signature along an axis, or use persistence diagram-based features such as amplitude and persistence entropy. These features provide some insights into a plant's shape in addition to traditional traits such as height, volume, etc.
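Persistence entropy in particular has a compact definition: normalize the lifetimes (death minus birth) of a diagram's features into probabilities and take their Shannon entropy. A minimal sketch, assuming a diagram given as (birth, death) rows:

```python
import numpy as np

def persistence_entropy(diagram: np.ndarray) -> float:
    """Persistence entropy of a diagram given as (birth, death) rows.

    Lifetimes l_i = death - birth are normalized to probabilities
    p_i = l_i / sum(l); the Shannon entropy -sum(p_i * log(p_i))
    is returned as a single shape-summary feature.
    """
    lifetimes = diagram[:, 1] - diagram[:, 0]
    p = lifetimes / lifetimes.sum()
    return float(-(p * np.log(p)).sum())

# Two features with equal lifetimes give the maximal entropy log(2).
diagram = np.array([[0.0, 1.0], [0.5, 1.5]])
print(persistence_entropy(diagram))  # log(2) ≈ 0.693
```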
The code currently being explained splits the data into train, validation, and test sets. This is done randomly using the Python package scikit-learn (sklearn). Once split, you can use model architectures like the EfficientNetV2 that you mentioned.
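A common pattern for a random three-way split with scikit-learn is two successive calls to train_test_split; a minimal sketch (the split fractions and toy data are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.arange(100)

# First carve off 20% as the held-out test set, then split the
# remainder into train (0.75 * 80 = 60) and validation (20).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```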
(Follow-up Question) Can 'merge_geotiff' do image stitching or orthomosaicking?
(Answer Follow-up) Here is the documentation for the rasterio.merge module used in Brenda's notebook.